This project implements an abstractive Question Answering (Q&A) system using Generative AI. It retrieves relevant documents based on a natural language question and generates human-readable answers using a generator model.
- Input: Ask a question in natural language.
- Retriever:
- Each context passage is converted into an embedding and stored in a vector database (e.g., Pinecone). This embedding captures the semantic and syntactic meaning of the text.
- Convert the input question into a query vector and compare it with the stored vectors to find the most relevant segments.
- Generator:
- Convert the relevant vectors back to text.
- Combine the original question with the retrieved text and pass it to a generator model (e.g., GPT-3 or BART) to produce a human-readable answer.
- Wiki Snippets Dataset:
- Contains over 17 million passages from Wikipedia.
- For simplicity, we use 50,000 passages that include "History" in the `section_title` column.
- Streaming Mode:
- The dataset is loaded iteratively to avoid downloading the entire 9GB file upfront.
- Extracted fields: `article_title`, `section_title`, and `passage_text`.
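A minimal sketch of this loading step, assuming the Hugging Face `datasets` streaming API and a Wikipedia-snippets dataset id (`vblagoje/wikipedia_snippets_streamed` is used here as a stand-in; swap in the actual source):

```python
from datasets import load_dataset

# Stream the dataset so the full ~9GB download is never materialized locally.
wiki = load_dataset("vblagoje/wikipedia_snippets_streamed", split="train", streaming=True)

# Keep only passages whose section title mentions "History", up to 50,000 of them.
history = wiki.filter(lambda d: d["section_title"] and "History" in d["section_title"])

passages = []
for doc in history:
    passages.append({
        "article_title": doc["article_title"],
        "section_title": doc["section_title"],
        "passage_text": doc["passage_text"],
    })
    if len(passages) >= 50_000:
        break
```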
- Model: `flax-sentence-embeddings/all_datasets_v3_mpnet-base`, a sentence-transformer built on Microsoft's MPNet base model.
- Embedding:
- Encodes sentences into a 768-dimensional vector.
- Stores embeddings in Pinecone with `dimension=768` and `metric=cosine`.
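A hedged sketch of the retriever and index setup (the index name and credentials are placeholders, and the calls follow the older `pinecone-client` v2 style; newer client versions expose a `Pinecone` object instead):

```python
from sentence_transformers import SentenceTransformer
import pinecone

# The retriever maps sentences to 768-dimensional vectors.
retriever = SentenceTransformer("flax-sentence-embeddings/all_datasets_v3_mpnet-base")

# Connect to Pinecone and create an index matching the embedding size and metric.
pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENV")
if "abstractive-qa" not in pinecone.list_indexes():
    pinecone.create_index("abstractive-qa", dimension=768, metric="cosine")
index = pinecone.Index("abstractive-qa")
```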
- Processing:
- Passages are encoded in batches of 64.
- Metadata (`article_title`, `section_title`, `passage_text`) is attached to each embedding.
- Data is indexed and upserted into the Pinecone database.
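A sketch of the batch encode-and-upsert loop, continuing from the `passages`, `retriever`, and `index` objects assumed above:

```python
from tqdm.auto import tqdm

batch_size = 64
for i in tqdm(range(0, len(passages), batch_size)):
    batch = passages[i:i + batch_size]
    # Encode the passage text of the current batch into 768-d vectors.
    embeddings = retriever.encode([p["passage_text"] for p in batch]).tolist()
    ids = [str(i + j) for j in range(len(batch))]
    # Attach metadata so the generator can later recover the original text.
    metadata = [
        {"article_title": p["article_title"],
         "section_title": p["section_title"],
         "passage_text": p["passage_text"]}
        for p in batch
    ]
    index.upsert(vectors=zip(ids, embeddings, metadata))
```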
- Model: `bart_lfqa`, trained on the ELI5 dataset.
- Input Format:
- A single string combining the query and relevant documents, separated by a special `<P>` token:
`question: What is a sonic boom? context: <P> A sonic boom is a sound associated with shock waves created when an object travels through the air faster than the speed of sound. <P> Sonic booms generate enormous amounts of sound energy, sounding similar to an explosion or a thunderclap to the human ear. <P> Sonic booms due to large supersonic aircraft can be particularly loud and startling, tend to awaken people, and may cause minor damage to some structures.`
- Processing:
- Retrieves the most relevant context vectors from the query vector (`xq`).
- Extracts metadata (`passage_text`) and concatenates all context passages.
- Adds the original query to the concatenated context.
- Output:
- The model tokenizes the final input, generates answers in token IDs, and decodes them to human-readable text.
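Putting the retriever and generator together, here is a hedged end-to-end sketch (the `vblagoje/bart_lfqa` hub id, generation parameters, and the exact `index.query` signature are assumptions; the query call varies slightly across pinecone-client versions):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Generator: a BART model fine-tuned for long-form QA on ELI5.
tokenizer = AutoTokenizer.from_pretrained("vblagoje/bart_lfqa")
generator = AutoModelForSeq2SeqLM.from_pretrained("vblagoje/bart_lfqa")

def answer(question: str, top_k: int = 5) -> str:
    # 1) Embed the question and retrieve the most similar passages from Pinecone.
    xq = retriever.encode(question).tolist()
    result = index.query(vector=xq, top_k=top_k, include_metadata=True)
    # 2) Concatenate the retrieved passages with the <P> separator expected by bart_lfqa.
    context = " ".join(f"<P> {m['metadata']['passage_text']}" for m in result["matches"])
    model_input = f"question: {question} context: {context}"
    # 3) Tokenize, generate answer token IDs, and decode them to readable text.
    inputs = tokenizer(model_input, truncation=True, max_length=1024, return_tensors="pt")
    output_ids = generator.generate(**inputs, min_length=32, max_length=256, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(answer("What is a sonic boom?"))
```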
This project demonstrates the fine-tuning of the pre-trained open-source BERT model for a binary classification task.
- Duplicate and Null Values: Removed any duplicate rows or rows containing null values.
- Class Imbalance: Checked for class imbalance. Addressed it by assigning higher penalties for misclassifying the minority class during training, ensuring the model focuses adequately on minority class instances.
- The dataset was split into three sets:
- Training Set (70%)
- Validation Set (15%)
- Test Set (15%)
- Model: Used the pre-trained `BERT-base` (uncased) model.
- Tokenizer: Loaded the `BERT-base` tokenizer for tokenizing the input text.
- Since the BERT model accepts inputs of a maximum length of 512 tokens:
- Padding: Smaller sentences were padded to match the batch length.
- Truncation: Longer sentences were truncated to retain relevant information.
- Chose an optimal max input length of 25 tokens based on dataset characteristics.
- Converted input IDs, attention masks, and labels into PyTorch tensors.
- Used `DataLoader` and samplers for batching and shuffling data at every epoch.
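A minimal sketch of the tokenization and batching step (`train_texts`, `train_labels`, and the batch size of 32 are assumptions):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Pad/truncate every sentence to the chosen max length of 25 tokens.
encodings = tokenizer(
    train_texts,                  # list of training sentences (assumed to exist)
    max_length=25,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)

train_data = TensorDataset(
    encodings["input_ids"],
    encodings["attention_mask"],
    torch.tensor(train_labels),   # corresponding labels (assumed to exist)
)
# RandomSampler reshuffles the data at every epoch.
train_loader = DataLoader(train_data, sampler=RandomSampler(train_data), batch_size=32)
```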
We explored the following methods of fine-tuning:
- Train all weights: All layers are trained.
- Freeze a few layers: Only the unfrozen layer weights are trained.
- Freeze all layers: Added new layers on top and trained only the new layers.
For this project, we opted to train all weights.
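For reference, the freezing options can be toggled with `requires_grad`; this sketch uses `BertForSequenceClassification` as a stand-in for the project's actual classification head:

```python
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Option 1 (used here): train all weights - every parameter stays trainable by default.

# Option 2: freeze a few layers, e.g. the embeddings and the first 8 encoder layers.
for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:8]:
    for param in layer.parameters():
        param.requires_grad = False

# Option 3: freeze the whole BERT backbone and train only the new layers on top.
# for param in model.bert.parameters():
#     param.requires_grad = False
```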
- Optimizer: `AdamW` with a learning rate of `1e-5`.
- Loss Function: Cross-entropy loss, weighted to handle the class imbalance.
- Training Epochs: Set to 10.
- A learning rate scheduler was considered but omitted due to the small dataset size.
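A sketch of the training setup under these settings, continuing from the `model` and `train_loader` assumed above (class weights are computed with scikit-learn):

```python
import numpy as np
import torch
from torch.optim import AdamW
from sklearn.utils.class_weight import compute_class_weight

# Higher penalties for misclassifying the minority class, as described above.
weights = compute_class_weight("balanced", classes=np.unique(train_labels), y=train_labels)
criterion = torch.nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float))

optimizer = AdamW(model.parameters(), lr=1e-5)

for epoch in range(10):
    model.train()
    for input_ids, attention_mask, labels in train_loader:
        optimizer.zero_grad()
        logits = model(input_ids, attention_mask=attention_mask).logits
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
```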
This project focuses on text language translation from German to English using a neural network-based approach. The model takes a German sentence as input and outputs its English translation.
- Data Cleaning:
- Removed duplicates and null values.
- Converted all text to lowercase and stripped punctuation.
- Feature Engineering:
- Max Length: Set to 8 for both English and German sentences.
- Tokenization:
- Separate tokenizers for German and English.
- Vocabulary size: 6098 (English) and 10071 (German).
- Padding & Truncation: Performed to ensure uniform input sizes within a batch.
- Converted tokenized sentences to tensors for model training.
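A minimal sketch of the tokenization and padding step with Keras (`german_sentences` and `english_sentences` are assumed to be the cleaned sentence lists):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 8  # both German and English sentences are capped at 8 tokens

# Separate tokenizers for the source (German) and target (English) languages.
de_tokenizer = Tokenizer()
de_tokenizer.fit_on_texts(german_sentences)
en_tokenizer = Tokenizer()
en_tokenizer.fit_on_texts(english_sentences)

# Convert text to integer sequences, then pad/truncate to a uniform length.
X = pad_sequences(de_tokenizer.texts_to_sequences(german_sentences),
                  maxlen=MAX_LEN, padding="post", truncating="post")
y = pad_sequences(en_tokenizer.texts_to_sequences(english_sentences),
                  maxlen=MAX_LEN, padding="post", truncating="post")

de_vocab = len(de_tokenizer.word_index) + 1   # ~10071 in this project
en_vocab = len(en_tokenizer.word_index) + 1   # ~6098 in this project
```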
- Embedding Layer:
- Converts input integers to dense vectors of fixed size.
- Dropout applied with a probability of 0.2.
- Bidirectional LSTM:
- Three stacked layers, with the first two returning sequences.
- Each layer is followed by Layer Normalization and Dropout (0.2).
- RepeatVector:
- Repeats encoder output to match decoder time steps.
- LSTM Layers:
- Two stacked layers, each with twice the number of units used in the encoder.
- Includes Layer Normalization and Dropout (0.2).
- Output Layer:
- Dense layer with a softmax activation to generate probabilities for the target vocabulary.
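A hedged sketch of this architecture in Keras (the base unit count of 128 and the embedding size of 256 are assumptions, not values stated in the write-up):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Input, Embedding, Dropout, Bidirectional, LSTM,
                                     LayerNormalization, RepeatVector, Dense)

UNITS = 128  # base LSTM size (assumed)

model = Sequential([
    Input(shape=(MAX_LEN,)),
    # Encoder: embed German tokens and summarize the sentence into a single vector.
    Embedding(de_vocab, 256),
    Dropout(0.2),
    Bidirectional(LSTM(UNITS, return_sequences=True)),
    LayerNormalization(), Dropout(0.2),
    Bidirectional(LSTM(UNITS, return_sequences=True)),
    LayerNormalization(), Dropout(0.2),
    Bidirectional(LSTM(UNITS)),          # last encoder layer returns a single vector
    LayerNormalization(), Dropout(0.2),
    # Decoder: repeat the encoder output once per target time step.
    RepeatVector(MAX_LEN),
    LSTM(UNITS * 2, return_sequences=True),
    LayerNormalization(), Dropout(0.2),
    LSTM(UNITS * 2, return_sequences=True),
    LayerNormalization(), Dropout(0.2),
    # Per-time-step probabilities over the English vocabulary.
    Dense(en_vocab, activation="softmax"),
])
```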
- Optimizer: Adam with learning rate 0.001.
- Loss Function: Sparse categorical crossentropy.
- Metric: Accuracy.
- Early Stopping:
- Monitored validation loss with a patience of 5 epochs.
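The corresponding compile-and-fit step, with the data splits, batch size, and epoch cap as placeholders:

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

model.compile(optimizer=Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Stop training once validation loss has not improved for 5 consecutive epochs.
early_stop = EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          batch_size=64, epochs=50, callbacks=[early_stop])
```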
The aim of this project is to identify pairs of questions that have the same intent, even if they are phrased differently due to variations in wording or grammar. This task is inspired by a Kaggle competition hosted by Quora, with the goal of improving user experience by reducing fragmented answers across duplicate questions.
Key transformations performed on the data include:
- Converted text to lowercase.
- Replaced emojis with their meanings.
- Expanded contractions (e.g., `'ve` to `have`).
- Replaced special characters with descriptive names.
- Shortened large numbers (e.g., `1,000,000` to `1m`).
- Removed HTML tags.
- Expanded common abbreviations (e.g., `GM` to `Good Morning`).
- Applied stemming to reduce words to their root forms.
Important Notes:
- Stop Words: Retained to create new features for classification.
- Spelling Correction: Omitted due to computational constraints with the large dataset.
- Length of both questions (q1, q2).
- Number of words in both questions.
- Common words between q1 and q2.
- Total words in q1 and q2.
- Ratio of common words to total words.
- Ratio of common words to minimum and maximum lengths of both questions.
- Ratio of common stop words to minimum and maximum lengths.
- Ratio of common tokens to minimum and maximum lengths.
- First and last word match status between both questions.
- Absolute difference in the number of tokens.
- Average number of tokens.
- Ratio of longest common substring to the minimum length.
- Fuzzy Ratio
- Fuzzy Partial Ratio
- Token Sort Ratio
- Token Set Ratio
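A sketch of how a few of these features can be computed, using `fuzzywuzzy` (or its successor `thefuzz`) for the four fuzzy-matching scores:

```python
from fuzzywuzzy import fuzz

def basic_features(q1: str, q2: str) -> dict:
    # Word-overlap style features similar to those listed above.
    w1, w2 = set(q1.split()), set(q2.split())
    common = len(w1 & w2)
    total = len(w1) + len(w2)
    return {
        "common_words": common,
        "word_share": common / total if total else 0.0,
        "first_word_match": int(q1.split()[0] == q2.split()[0]) if q1 and q2 else 0,
    }

def fuzzy_features(q1: str, q2: str) -> dict:
    return {
        "fuzz_ratio": fuzz.ratio(q1, q2),
        "fuzz_partial_ratio": fuzz.partial_ratio(q1, q2),
        "token_sort_ratio": fuzz.token_sort_ratio(q1, q2),
        "token_set_ratio": fuzz.token_set_ratio(q1, q2),
    }

print(fuzzy_features("how do i learn python quickly",
                     "what is the fastest way to learn python"))
```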
- A vocabulary of the top 3000 most frequent words was created for both questions (q1, q2).
- Each sentence was represented as a vector using this vocabulary.
- Combined these 6000 dimensions with the 23 features created earlier to form a total of 6023 features.
Challenges with Bag of Words:
- Sparse matrix representation.
- Lack of semantic context.
- High-dimensional data.
- Out-of-vocabulary (OOV) issues.
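A hedged sketch of the Bag of Words construction and feature combination (`df`, the Quora column names, and `handcrafted_cols` holding the 23 engineered features are assumptions):

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# One shared 3000-word vocabulary, applied to each question column (3000 + 3000 = 6000 dims).
cv = CountVectorizer(max_features=3000)
cv.fit(pd.concat([df["question1"], df["question2"]]))
q1_bow = cv.transform(df["question1"])
q2_bow = cv.transform(df["question2"])

# 6000 BoW dimensions + 23 handcrafted features = 6023 features in total.
X = hstack([q1_bow, q2_bow, df[handcrafted_cols].values]).tocsr()
y = df["is_duplicate"].values
```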
- Dataset Split:
- 80% for training and 20% for testing.
- Classifier: Random Forest.
- Evaluation:
- Achieved an accuracy of approximately 0.7863 on the test dataset.
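A sketch of the split, training, and evaluation step, continuing from the feature matrix above (the forest size and random seed are placeholders):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
clf.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))  # ~0.786 reported above
```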
This project focuses on text summarization. The input is a long review of Amazon Fine Food products, and the output is a concise summary of the review.
- Converted all text to lowercase.
- Replaced emojis with their meanings.
- Removed pre-encoded emojis that could not be demojized.
- Expanded contractions (e.g., `'ve` to `have`).
- Removed HTML tags.
- Expanded common abbreviations (e.g., `GM` to "Good Morning").
- Removed stop words.
- Stemming: Avoided because root forms may not produce proper English words.
- Lemmatization: Skipped due to high computation time.
- Spelling Correction: Omitted because of the dataset's size and associated time costs.
- Input: `Bought several vitality canned dog food products, found good quality.`
- Output: `good quality dog food`
- Input Length Optimization:
- Text max length: 80 tokens.
- Summary max length: 7 tokens.
- Padding and Truncation:
- Smaller sentences are padded.
- Longer sentences are truncated.
- Tokenization:
- Used BERT tokenizer (uncased) to convert text into token IDs.
- Pre-existing vocabulary ensures better linguistic coverage.
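A minimal sketch of this step (`reviews` and `summaries` are assumed lists of cleaned texts; whether a separate start/end marker is appended to summaries is left out here):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Reviews are capped at 80 tokens, summaries at 7.
encoder_input = tokenizer(reviews, max_length=80, padding="max_length",
                          truncation=True, return_tensors="np")["input_ids"]
decoder_input = tokenizer(summaries, max_length=7, padding="max_length",
                          truncation=True, return_tensors="np")["input_ids"]
```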
- Embedding layer converts tokens into 256-dimensional embeddings.
- Three LSTM layers process input, propagating context across timesteps.
- Final LSTM outputs:
- Hidden states at each timestep.
- Final hidden and cell states (`state_h`, `state_c`).
- Inputs:
- Final encoder states (`state_h`, `state_c`).
- Summary tokens embedded similarly.
- Processes sequence with attention applied to encoder outputs.
- Keys/Values: Encoder outputs.
- Queries: Decoder's current hidden states.
- Produces context vectors focusing on relevant parts of the input.
- Combines context vectors with decoder outputs.
- Applies softmax to predict the next token probabilities.
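A hedged functional-API sketch of this encoder-decoder with attention (a shared BERT vocabulary, 256-unit LSTMs, and Keras's built-in Luong-style `Attention` layer are assumptions):

```python
from tensorflow.keras.layers import (Input, Embedding, LSTM, Attention,
                                     Concatenate, Dense, TimeDistributed)
from tensorflow.keras.models import Model

LATENT = 256                    # LSTM units (assumed)
VOCAB = tokenizer.vocab_size    # shared vocabulary from the BERT tokenizer above

# --- Encoder: three stacked LSTMs over the embedded review ---
enc_inputs = Input(shape=(80,))
enc_emb = Embedding(VOCAB, 256)(enc_inputs)
enc_out = LSTM(LATENT, return_sequences=True)(enc_emb)
enc_out = LSTM(LATENT, return_sequences=True)(enc_out)
enc_out, state_h, state_c = LSTM(LATENT, return_sequences=True, return_state=True)(enc_out)

# --- Decoder: embedded summary tokens, initialized with the encoder's final states ---
dec_inputs = Input(shape=(None,))
dec_emb = Embedding(VOCAB, 256)(dec_inputs)
dec_out, _, _ = LSTM(LATENT, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])

# --- Attention: queries are decoder states, keys/values are encoder outputs ---
context = Attention()([dec_out, enc_out])
dec_concat = Concatenate(axis=-1)([dec_out, context])

# --- Output: per-step probabilities over the target vocabulary ---
outputs = TimeDistributed(Dense(VOCAB, activation="softmax"))(dec_concat)

model = Model([enc_inputs, dec_inputs], outputs)
# Training setup as described below: RMSprop + sparse categorical crossentropy.
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```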
- Parameters:
- Optimizer: RMSprop.
- Loss function: Sparse categorical crossentropy.
- Metric: Accuracy.
- Regularization:
- Early stopping: Halt training after 5 epochs of no improvement.
- Inputs: Review text.
- Outputs:
- Encoder hidden states.
- Final states (`state_h`, `state_c`).
- Inputs:
- Current token.
- Previous hidden and cell states (`state_h`, `state_c`).
- Encoder outputs for attention.
- Outputs:
- Updated states.
- Token probability distribution.
- Purpose: Generates a summary from the input review.
- Process:
- Encode input sequence using the encoder.
- Initialize with the `start` token.
- Iteratively predict tokens until:
- Reaching the `end` token.
- Hitting the maximum summary length.
- Update states and target sequence at each step.
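A sketch of this greedy decoding loop, assuming inference `encoder_model`/`decoder_model` wired with the inputs and outputs listed above and the usual forward/reverse target vocabularies:

```python
import numpy as np

def decode_sequence(input_seq, max_summary_len=7):
    # Encode the review once; reuse its outputs and final states for every decoding step.
    enc_out, h, c = encoder_model.predict(input_seq)

    # Start decoding from the `start` token.
    target_seq = np.array([[target_word_index["start"]]])
    decoded = []
    while True:
        token_probs, h, c = decoder_model.predict([target_seq, enc_out, h, c])
        token_id = int(np.argmax(token_probs[0, -1, :]))
        word = reverse_target_word_index.get(token_id, "")

        if word == "end" or len(decoded) >= max_summary_len:
            break
        if word:
            decoded.append(word)

        # Feed the predicted token back in and carry the updated states forward.
        target_seq = np.array([[token_id]])
    return " ".join(decoded)
```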
- Purpose: Converts the sequence into a readable summary.
- Process:
- Remove padding (`0`) and special tokens (`start`, `end`).
- Convert indices into words using the reverse target vocabulary.
- Purpose: Converts the input sequence into readable text.
- Process:
- Remove padding (`0`).
- Convert indices into words using the reverse source vocabulary.
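A minimal sketch of these two helpers, using the reverse vocabularies assumed above:

```python
def seq2summary(seq):
    # Drop padding (0) and the start/end markers, then map indices back to words.
    words = [reverse_target_word_index.get(i, "") for i in seq if i != 0]
    return " ".join(w for w in words if w not in ("start", "end", ""))

def seq2text(seq):
    # Drop padding (0) and map indices back to source-vocabulary words.
    return " ".join(reverse_source_word_index.get(i, "") for i in seq if i != 0)
```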
This project addresses a Kaggle competition hosted by Twitter. The goal is to identify tweets containing racist or sexist content, enabling measures to block such tweets and reduce online bullying and negativity.
- Converted all text to lowercase.
- Replaced emojis with their meanings.
- Removed pre-encoded emojis that could not be demojized.
- Expanded contractions (e.g., `'ve` → `have`).
- Replaced special characters with their names, except for `#` (useful for identifying trends).
- Replaced usernames (`@tags`) with `user` and subsequently dropped them to maintain privacy.
- Removed HTML tags.
- Expanded common abbreviations (e.g., `GM` → "Good Morning").
- Removed stop words.
- Applied stemming to reduce words to their root forms.
- Spelling correction: Skipped due to the large dataset size and the computational cost.
- Hashtag Extraction:
- Analyzed hashtags (`#`) to evaluate their association with racist or non-racist tweets.
- Corpus Creation:
- Combined training and testing datasets to build a corpus of words.
- Selected the top 1000 most frequent words as vector dimensions to represent each tweet.
- Comparison:
- Evaluated Bag of Words (BOW) and TF-IDF methods for vector representation.
- Feature Combination:
- Created 20 additional features based on 10 common labels from each category (racist and non-racist).
- Final dataset comprised 1020 features (1000 from BOW/TF-IDF + 20 custom features).
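A hedged sketch of this feature-engineering pipeline (`all_tweets` is the combined corpus and `custom_features` the 20-column array of label-association features; both names are placeholders):

```python
import re
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Pull hashtags out of each tweet so their association with the labels can be analysed.
def extract_hashtags(tweet: str) -> list[str]:
    return re.findall(r"#(\w+)", tweet)

# Top-1000-word representations: one run with BoW, one with TF-IDF, for comparison.
bow = CountVectorizer(max_features=1000)
tfidf = TfidfVectorizer(max_features=1000)
X_bow = bow.fit_transform(all_tweets)
X_tfidf = tfidf.fit_transform(all_tweets)

# Append the 20 custom features to reach 1020 columns per tweet.
X_bow_full = hstack([X_bow, custom_features]).tocsr()
X_tfidf_full = hstack([X_tfidf, custom_features]).tocsr()
```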
- Used a Logistic Regression model for classification.
- Bag of Words (BOW):
- Achieved an F1 score of 0.544.
- TF-IDF:
- Achieved an improved F1 score of 0.559.
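A sketch of the classification and evaluation step, reusing the feature matrices above; `X_bow_labeled`, `X_tfidf_labeled`, and `labels` stand in for the labelled (training-set) portion of the data:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def evaluate(features, labels):
    X_tr, X_val, y_tr, y_val = train_test_split(features, labels,
                                                test_size=0.2, random_state=42)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_tr, y_tr)
    return f1_score(y_val, clf.predict(X_val))

print("BoW F1:   ", evaluate(X_bow_labeled, labels))     # ~0.544 reported above
print("TF-IDF F1:", evaluate(X_tfidf_labeled, labels))   # ~0.559 reported above
```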