NLP Final Project
This repository contains the final project for the Natural Language Processing (NLP) course. The project focuses on building a multi-layer mini BERT model and pretraining it with contrastive learning for the Quora Question Pairs task.
GitHub repository: MinBert
Artifacts are available on Kaggle.
Project Overview
The main objective of this project is to develop a lightweight BERT architecture that efficiently captures semantic similarity between question pairs. We apply contrastive learning to sharpen the model's ability to distinguish semantically similar questions from dissimilar ones; a sketch of such an objective follows.
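As a concrete illustration, below is a minimal sketch of a pair-based contrastive objective for labeled question pairs. The function name, the margin value, and the use of cosine similarity are illustrative assumptions, not the exact loss used in the project.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, labels, margin=0.5):
    """Pair-based contrastive loss (illustrative sketch).

    emb_a, emb_b: (batch, dim) embeddings of the two questions in a pair.
    labels:       (batch,) 1 for duplicate pairs, 0 for non-duplicates.
    margin:       assumed similarity margin for non-duplicate pairs.
    """
    labels = labels.float()
    # Cosine similarity between the paired question embeddings.
    sim = F.cosine_similarity(emb_a, emb_b, dim=-1)
    # Duplicates are pulled toward similarity 1; non-duplicates are
    # penalized only when their similarity exceeds the margin.
    pos = labels * (1.0 - sim)
    neg = (1.0 - labels) * torch.clamp(sim - margin, min=0.0)
    return (pos + neg).mean()
```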
Dataset
We train the model primarily on the Quora Question Pairs (QQP) dataset, which consists of question pairs labeled as duplicate or non-duplicate, along with additional datasets. The data is preprocessed with the following steps (a sketch follows the list):
- Tokenization using a pre-defined vocabulary.
- Removing stop words and special characters.
- Applying padding and truncation to standardize input length.
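A minimal sketch of this preprocessing pipeline is shown below. The toy vocabulary, stop-word list, and `max_len` value are illustrative assumptions standing in for the project's pre-defined resources.

```python
import re

# Illustrative stand-ins for the project's pre-defined vocabulary and stop-word list.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "what": 2, "nlp": 3}
STOP_WORDS = {"the", "a", "an", "is"}

def preprocess(text, max_len=32):
    # Lowercase and strip special characters.
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    # Tokenize on whitespace and remove stop words.
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    # Map tokens to ids, falling back to [UNK] for out-of-vocabulary words.
    ids = [VOCAB.get(t, VOCAB["[UNK]"]) for t in tokens]
    # Truncate, then pad to the fixed input length.
    ids = ids[:max_len]
    ids += [VOCAB["[PAD]"]] * (max_len - len(ids))
    return ids

print(preprocess("What is NLP?"))  # [2, 3, 0, 0, ...]
```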
Model Architecture
Our mini BERT model consists of the following components (a model sketch follows the list):
- Embeddings Layer: Converts tokenized inputs into dense vector representations.
- Multiple Transformer Layers: Stacked self-attention layers that capture contextual relationships between tokens.
- Feedforward Neural Networks: Additional fully connected layers that map the encoded representation to a classification decision.
- Contrastive Learning Objective: Optimized to enhance the model’s ability to differentiate between semantically similar and dissimilar questions.
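Putting these components together, here is a minimal PyTorch sketch of such an architecture. The hyperparameters (hidden size, number of layers and heads, maximum length) are illustrative assumptions, not the project's exact configuration; the pooled embedding would feed the contrastive objective sketched earlier, while the linear head supports duplicate / non-duplicate classification.

```python
import torch
import torch.nn as nn

class MiniBert(nn.Module):
    """Illustrative mini BERT encoder with a classification head."""

    def __init__(self, vocab_size, max_len=32, hidden=128,
                 n_layers=4, n_heads=4, n_classes=2):
        super().__init__()
        # Embeddings layer: token embeddings plus learned position embeddings.
        self.tok_emb = nn.Embedding(vocab_size, hidden, padding_idx=0)
        self.pos_emb = nn.Embedding(max_len, hidden)
        # Stack of transformer encoder layers (self-attention + feedforward).
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=n_heads, dim_feedforward=4 * hidden,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Feedforward head for duplicate / non-duplicate classification.
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, input_ids):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.tok_emb(input_ids) + self.pos_emb(positions)
        pad_mask = input_ids.eq(0)  # ignore [PAD] tokens in attention
        x = self.encoder(x, src_key_padding_mask=pad_mask)
        # Mean-pool non-padding tokens into a sentence embedding.
        keep = (~pad_mask).unsqueeze(-1)
        lens = keep.sum(dim=1).clamp(min=1)
        pooled = (x * keep).sum(dim=1) / lens
        return pooled, self.classifier(pooled)
```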
Contributors
- Trần Văn Quang
- Nguyễn Việt Cường