NLP Final Project
This repository contains the final project for the Natural Language Processing (NLP) course. The project focuses on building a multi-layer mini BERT model and pretraining it with contrastive learning for the Quora Question Pairs task.
GitHub repository: MinBert
Artifacts are available on Kaggle.
Project Overview
The main objective of this project is to develop a lightweight BERT architecture that efficiently captures semantic similarity between question pairs. We apply contrastive learning to sharpen the model's ability to distinguish semantically similar questions from dissimilar ones; a sketch of such an objective follows.
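As a concrete illustration, below is a minimal sketch of a pair-based contrastive objective for labeled question pairs. The function name, the margin value, and the use of cosine similarity are illustrative assumptions, not the exact loss used in the project.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, labels, margin=0.5):
    """Pair-based contrastive loss (illustrative sketch).

    emb_a, emb_b: (batch, dim) embeddings of the two questions in a pair.
    labels:       (batch,) 1 for duplicate pairs, 0 for non-duplicates.
    margin:       assumed similarity margin for non-duplicate pairs.
    """
    labels = labels.float()
    # Cosine similarity between the paired question embeddings.
    sim = F.cosine_similarity(emb_a, emb_b, dim=-1)
    # Duplicates are pulled toward similarity 1; non-duplicates are
    # penalized only when their similarity exceeds the margin.
    pos = labels * (1.0 - sim)
    neg = (1.0 - labels) * torch.clamp(sim - margin, min=0.0)
    return (pos + neg).mean()
```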
Dataset
We train the model primarily on the Quora Question Pairs (QQP) dataset, which consists of question pairs labeled as duplicate or non-duplicate, along with additional datasets. The data is preprocessed with the following steps (a sketch follows the list):
- Tokenization using a pre-defined vocabulary.
- Removing stop words and special characters.
- Applying padding and truncation to standardize input length.
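A minimal sketch of this preprocessing pipeline is shown below. The toy vocabulary, stop-word list, and `max_len` value are illustrative assumptions standing in for the project's pre-defined resources.

```python
import re

# Illustrative stand-ins for the project's pre-defined vocabulary and stop-word list.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "what": 2, "nlp": 3}
STOP_WORDS = {"the", "a", "an", "is"}

def preprocess(text, max_len=32):
    # Lowercase and strip special characters.
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    # Tokenize on whitespace and remove stop words.
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    # Map tokens to ids, falling back to [UNK] for out-of-vocabulary words.
    ids = [VOCAB.get(t, VOCAB["[UNK]"]) for t in tokens]
    # Truncate, then pad to the fixed input length.
    ids = ids[:max_len]
    ids += [VOCAB["[PAD]"]] * (max_len - len(ids))
    return ids

print(preprocess("What is NLP?"))  # [2, 3, 0, 0, ...]
```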
Model Architecture
Our mini BERT model consists of the following components (a model sketch follows the list):
- Embeddings Layer: Converts tokenized inputs into dense vector representations.
- Multiple Transformer Layers: Stacked self-attention layers that capture contextual relationships between tokens.
- Feedforward Neural Networks: Additional fully connected layers that map the encoded representation to a classification decision.
- Contrastive Learning Objective: Optimized to enhance the model’s ability to differentiate between semantically similar and dissimilar questions.
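Putting these components together, here is a minimal PyTorch sketch of such an architecture. The hyperparameters (hidden size, number of layers and heads, maximum length) are illustrative assumptions, not the project's exact configuration; the pooled embedding would feed the contrastive objective sketched earlier, while the linear head supports duplicate / non-duplicate classification.

```python
import torch
import torch.nn as nn

class MiniBert(nn.Module):
    """Illustrative mini BERT encoder with a classification head."""

    def __init__(self, vocab_size, max_len=32, hidden=128,
                 n_layers=4, n_heads=4, n_classes=2):
        super().__init__()
        # Embeddings layer: token embeddings plus learned position embeddings.
        self.tok_emb = nn.Embedding(vocab_size, hidden, padding_idx=0)
        self.pos_emb = nn.Embedding(max_len, hidden)
        # Stack of transformer encoder layers (self-attention + feedforward).
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=n_heads, dim_feedforward=4 * hidden,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Feedforward head for duplicate / non-duplicate classification.
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, input_ids):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.tok_emb(input_ids) + self.pos_emb(positions)
        pad_mask = input_ids.eq(0)  # ignore [PAD] tokens in attention
        x = self.encoder(x, src_key_padding_mask=pad_mask)
        # Mean-pool non-padding tokens into a sentence embedding.
        keep = (~pad_mask).unsqueeze(-1)
        lens = keep.sum(dim=1).clamp(min=1)
        pooled = (x * keep).sum(dim=1) / lens
        return pooled, self.classifier(pooled)
```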
Contributors
- Trần Văn Quang
- Nguyễn Việt Cường