Commit
Added initial draft of main notebook
Showing 1 changed file with 361 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,361 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "55bc471f", | ||
"metadata": {}, | ||
"source": [ | ||
"# Sentiment Analysis Mini-Challenge" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "23f4b2dd", | ||
"metadata": {}, | ||
"source": [ | ||
"## 1. Introduction\n", | ||
"Sentiment analysis is a crucial task in Natural Language Processing (NLP) that involves determining the sentiment or tone of a given text. It has numerous applications, such as understanding customer feedback, monitoring social media sentiment, and analyzing product reviews. However, manually labeling large datasets for sentiment analysis can be time-consuming and costly. Semi-supervised learning techniques, such as weak supervision, can help alleviate this challenge by leveraging a small amount of labeled data along with a larger set of unlabeled data to improve model performance." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "75667560", | ||
"metadata": {}, | ||
"source": [ | ||
" ## 2. Dataset Selection and Exploratory Data Analysis\n", | ||
"The dataset used for this mini-challenge is the Amazon Polarity dataset, which consists of product reviews from Amazon labeled as either positive or negative. The dataset is loaded using the Hugging Face Datasets library. Exploratory data analysis is performed to gain insights into the distribution of labels, length of reviews, and other relevant characteristics.\n", | ||
"\n", | ||
"As the dataset contains 4 million reviews we cut this down into a subset of 6666 reviews for the purpose of this mini-challenge. 666 of the reviews are used for validation, the remaining 6000 are split into 1000 labeled samples and 5000 artificially unlabeled samples.\n", | ||
"\n", | ||
"Each subset has a 50/50 split of positive and negative reviews. " | ||
] | ||
}, | ||
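{ | ||
"metadata": {}, | ||
"cell_type": "markdown", | ||
"source": "Before loading the prepared partitions below, here is a sketch of how such a class-balanced subset *could* be drawn from the full Amazon Polarity dataset with the Hugging Face Datasets library. The actual partitioning lives in `src/data_loader.py`; the sizes and column names in the sketch are assumptions rather than a copy of that code.", | ||
"id": "a3c1d5e7f9b20461" | ||
}, | ||
{ | ||
"metadata": {}, | ||
"cell_type": "code", | ||
"source": [ | ||
"# Illustrative sketch only: drawing a small, class-balanced subset of the\n", | ||
"# Amazon Polarity dataset. The real partitions are created by src/data_loader.py.\n", | ||
"from datasets import concatenate_datasets, load_dataset\n", | ||
"\n", | ||
"\n", | ||
"def sample_balanced_subset(n_total=6666, seed=42):\n", | ||
"    per_class = n_total // 2\n", | ||
"    ds = load_dataset(\"amazon_polarity\", split=\"train\").shuffle(seed=seed)\n", | ||
"    # In the Hugging Face version of the dataset, label 1 is positive and 0 is negative.\n", | ||
"    pos = ds.filter(lambda ex: ex[\"label\"] == 1).select(range(per_class))\n", | ||
"    neg = ds.filter(lambda ex: ex[\"label\"] == 0).select(range(per_class))\n", | ||
"    return concatenate_datasets([pos, neg]).shuffle(seed=seed)" | ||
], | ||
"id": "f1e2d3c4b5a69788", | ||
"outputs": [], | ||
"execution_count": null | ||
}, | ||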
{ | ||
"metadata": {}, | ||
"cell_type": "code", | ||
"source": [ | ||
"import seaborn as sns\n", | ||
"\n", | ||
"from src.data_loader import load_datasets\n", | ||
"\n", | ||
"train_df, unlabeled, validation = load_datasets(\"../data/partitions\")" | ||
], | ||
"id": "b7d121e24f626f5a", | ||
"outputs": [], | ||
"execution_count": null | ||
}, | ||
{ | ||
"metadata": {}, | ||
"cell_type": "code", | ||
"source": [ | ||
"print(f\"Labeled Dataset Length: {len(train_df)}\")\n", | ||
"print(f\"Unlabeled Dataset Length: {len(unlabeled)}\")\n", | ||
"print(f\"Validation Dataset Length: {len(validation)}\")" | ||
], | ||
"id": "306d3359e046c56f", | ||
"outputs": [], | ||
"execution_count": null | ||
}, | ||
{ | ||
"metadata": {}, | ||
"cell_type": "markdown", | ||
"source": "To get a better idea of the whole dataset, we will merge the data again and perform some exploratory data analysis.", | ||
"id": "e2a464f170d2351f" | ||
}, | ||
{ | ||
"metadata": {}, | ||
"cell_type": "code", | ||
"source": [ | ||
"import pandas as pd\n", | ||
"from src.data_loader import LABEL_MAP\n", | ||
"\n", | ||
"# for eda purposes\n", | ||
"\n", | ||
"unlabeled.rename(columns={'ground_truth': 'label'}, inplace=True)\n", | ||
"\n", | ||
"\n", | ||
"eda_df = pd.concat([train_df, unlabeled, validation])\n", | ||
"\n", | ||
"eda_df['label'] = eda_df['label'].map(LABEL_MAP)" | ||
], | ||
"id": "bd69ceb3469b9745", | ||
"outputs": [], | ||
"execution_count": null | ||
}, | ||
{ | ||
"metadata": {}, | ||
"cell_type": "code", | ||
"source": [ | ||
"from matplotlib import pyplot as plt\n", | ||
"\n", | ||
"eda_df['review_length'] = eda_df['content'].apply(len)\n", | ||
"\n", | ||
"plt.figure(figsize=(10, 6))\n", | ||
"sns.histplot(eda_df['review_length'], bins=50, kde=True)\n", | ||
"plt.title('Distribution of Review Lengths in Training Data')\n", | ||
"plt.xlabel('Review Length')\n", | ||
"plt.ylabel('Frequency')\n", | ||
"plt.show()\n", | ||
"\n" | ||
], | ||
"id": "5b0b9ca062123f99", | ||
"outputs": [], | ||
"execution_count": null | ||
}, | ||
{ | ||
"metadata": {}, | ||
"cell_type": "code", | ||
"source": "eda_df", | ||
"id": "6bc35bfd1ef89908", | ||
"outputs": [], | ||
"execution_count": null | ||
}, | ||
{ | ||
"metadata": {}, | ||
"cell_type": "code", | ||
"source": [ | ||
"from wordcloud import WordCloud, STOPWORDS\n", | ||
"from sklearn.feature_extraction.text import CountVectorizer\n", | ||
"\n", | ||
"\n", | ||
"def plot_most_common_words(df, top_n=20):\n", | ||
" df['label'] = df['label'].map({v: k for k, v in LABEL_MAP.items()})\n", | ||
" pos_reviews = df[df['label'] == 1]['content']\n", | ||
" neg_reviews = df[df['label'] == 0]['content']\n", | ||
"\n", | ||
" vectorizer_pos = CountVectorizer(stop_words='english')\n", | ||
" vectorizer_neg = CountVectorizer(stop_words='english')\n", | ||
"\n", | ||
" pos_word_count = vectorizer_pos.fit_transform(pos_reviews)\n", | ||
" neg_word_count = vectorizer_neg.fit_transform(neg_reviews)\n", | ||
"\n", | ||
" pos_sum_words = pos_word_count.sum(axis=0)\n", | ||
" neg_sum_words = neg_word_count.sum(axis=0)\n", | ||
"\n", | ||
" pos_words_freq = [(word, pos_sum_words[0, idx]) for word, idx in\n", | ||
" zip(vectorizer_pos.get_feature_names_out(), range(pos_sum_words.shape[1]))]\n", | ||
" neg_words_freq = [(word, neg_sum_words[0, idx]) for word, idx in\n", | ||
" zip(vectorizer_neg.get_feature_names_out(), range(neg_sum_words.shape[1]))]\n", | ||
"\n", | ||
" pos_words_freq = sorted(pos_words_freq, key=lambda x: x[1], reverse=True)\n", | ||
" neg_words_freq = sorted(neg_words_freq, key=lambda x: x[1], reverse=True)\n", | ||
"\n", | ||
" words, freq = zip(*pos_words_freq[:top_n])\n", | ||
" plt.figure(figsize=(10, 5))\n", | ||
" plt.bar(words, freq)\n", | ||
" plt.title('Most common words in positive reviews')\n", | ||
" plt.xticks(rotation=90)\n", | ||
" plt.show()\n", | ||
"\n", | ||
" words, freq = zip(*neg_words_freq[:top_n])\n", | ||
" plt.figure(figsize=(10, 5))\n", | ||
" plt.bar(words, freq)\n", | ||
" plt.title('Most common words in negative reviews')\n", | ||
" plt.xticks(rotation=90)\n", | ||
" plt.show()\n", | ||
"\n", | ||
"plot_most_common_words(eda_df, top_n=20)\n" | ||
], | ||
"id": "8b944f080f06f4b9", | ||
"outputs": [], | ||
"execution_count": null | ||
}, | ||
{ | ||
"metadata": {}, | ||
"cell_type": "code", | ||
"source": [ | ||
"def generate_word_cloud(text, title):\n", | ||
" wordcloud = WordCloud(width=800, height=800,\n", | ||
" background_color='white',\n", | ||
" stopwords=set(STOPWORDS),\n", | ||
" min_font_size=10).generate(text)\n", | ||
"\n", | ||
" plt.figure(figsize=(8, 8), facecolor=None)\n", | ||
" plt.imshow(wordcloud)\n", | ||
" plt.axis(\"off\")\n", | ||
" plt.tight_layout(pad=0)\n", | ||
" plt.title(title)\n", | ||
" plt.show()\n", | ||
"\n" | ||
], | ||
"id": "eda21d9741f19da5", | ||
"outputs": [], | ||
"execution_count": null | ||
}, | ||
{ | ||
"metadata": {}, | ||
"cell_type": "code", | ||
"source": [ | ||
"pos_reviews_text = \" \".join(eda_df[eda_df['label'] == \"positive\"]['content'])\n", | ||
"generate_word_cloud(pos_reviews_text, \"Word Cloud for Positive Reviews\")" | ||
], | ||
"id": "b6ab0019c556ed8f", | ||
"outputs": [], | ||
"execution_count": null | ||
}, | ||
{ | ||
"metadata": {}, | ||
"cell_type": "code", | ||
"source": [ | ||
"neg_reviews_text = \" \".join(eda_df[eda_df['label'] == \"negative\"]['content'])\n", | ||
"generate_word_cloud(neg_reviews_text, \"Word Cloud for Negative Reviews\")\n" | ||
], | ||
"id": "3aeefd9b49d6905", | ||
"outputs": [], | ||
"execution_count": null | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "6b901248", | ||
"metadata": {}, | ||
"source": [ | ||
"## 3. Data Splitting Strategy\n", | ||
"The dataset is split into development, validation, labeled, and unlabeled sets using a nested split approach. The development set is a fraction of the full dataset, the validation set is a fraction of the test dataset, and the labeled set is a fraction of the development set. The remaining samples in the development set are considered unlabeled. The nested split always adds in 25% increments (25, 50, 75), and a 1/6 split is used, resulting in 1000 labeled and 5000 weakly labeled samples in total.\n", | ||
"\n", | ||
"Given the focus of the MC on the impact of weak labelling and its impact, we introduce a nested split which further divides our training data into splits.\n", | ||
"\n", | ||
"The goal is to identify the optimal amount of additional data that can be used to improve model performance without the need for manual annotation." | ||
] | ||
}, | ||
{ | ||
"metadata": {}, | ||
"cell_type": "code", | ||
"source": "", | ||
"id": "73e761cf1be82629", | ||
"outputs": [], | ||
"execution_count": null | ||
}, | ||
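{ | ||
"metadata": {}, | ||
"cell_type": "markdown", | ||
"source": "As a rough illustration of the nested split idea, the cell below sketches how a fixed labeled pool plus growing fractions of the unlabeled pool could be built. The actual splits come from `src/data_loader.py`, so the function is an assumption about the scheme rather than the project's implementation.", | ||
"id": "c7b9a1d3e5f70824" | ||
}, | ||
{ | ||
"metadata": {}, | ||
"cell_type": "code", | ||
"source": [ | ||
"# Illustrative sketch of a nested split: the labeled pool stays fixed while the\n", | ||
"# amount of (to-be weakly labeled) data added on top grows in 25% increments.\n", | ||
"from sklearn.model_selection import train_test_split\n", | ||
"\n", | ||
"\n", | ||
"def nested_splits(df, n_labeled=1000, increments=(0.25, 0.5, 0.75), seed=42):\n", | ||
"    labeled, unlabeled_pool = train_test_split(\n", | ||
"        df, train_size=n_labeled, stratify=df['label'], random_state=seed\n", | ||
"    )\n", | ||
"    # One (labeled, extra unlabeled) pair per increment of the unlabeled pool.\n", | ||
"    return {frac: (labeled, unlabeled_pool.sample(frac=frac, random_state=seed))\n", | ||
"            for frac in increments}" | ||
], | ||
"id": "9d8c7b6a5f4e3021", | ||
"outputs": [], | ||
"execution_count": null | ||
}, | ||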
{ | ||
"cell_type": "markdown", | ||
"id": "47939c9e", | ||
"metadata": {}, | ||
"source": [ | ||
"## 4. Baseline Model Performance\n", | ||
"A pretrained language model, specifically `sentence-transformers/all-MiniLM-L6-v2`, is used as the baseline model for sentiment classification without training. The baseline model's performance is evaluated and the implications of the results are discussed." | ||
] | ||
}, | ||
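{ | ||
"metadata": {}, | ||
"cell_type": "markdown", | ||
"source": "The cell below sketches one way such a no-training baseline could look: embed each review with `sentence-transformers/all-MiniLM-L6-v2` and compare it to embeddings of simple label descriptions via cosine similarity. The label prompts and the overall setup are assumptions for illustration; the baseline actually evaluated here may differ.", | ||
"id": "b2c4d6e8f0a13579" | ||
}, | ||
{ | ||
"metadata": {}, | ||
"cell_type": "code", | ||
"source": [ | ||
"# Sketch of a zero-shot baseline: no task-specific training, just cosine\n", | ||
"# similarity between review embeddings and embeddings of label descriptions.\n", | ||
"import numpy as np\n", | ||
"from sentence_transformers import SentenceTransformer\n", | ||
"from sklearn.metrics.pairwise import cosine_similarity\n", | ||
"\n", | ||
"\n", | ||
"def zero_shot_sentiment(texts, model_name=\"sentence-transformers/all-MiniLM-L6-v2\"):\n", | ||
"    model = SentenceTransformer(model_name)\n", | ||
"    label_prompts = [\"a negative product review\", \"a positive product review\"]\n", | ||
"    label_emb = model.encode(label_prompts)\n", | ||
"    text_emb = model.encode(list(texts))\n", | ||
"    # Predicted class = index of the most similar label description (0 = negative, 1 = positive).\n", | ||
"    return np.argmax(cosine_similarity(text_emb, label_emb), axis=1)" | ||
], | ||
"id": "d4e6f8a0b2c13570", | ||
"outputs": [], | ||
"execution_count": null | ||
}, | ||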
{ | ||
"cell_type": "markdown", | ||
"id": "a8ef2289", | ||
"metadata": {}, | ||
"source": [ | ||
"## 5. Supervised Learning Performance\n", | ||
"Supervised learning techniques, including transfer learning and fine-tuning, are applied to the sentiment classification task using different amounts of labeled samples. The impact of the number of labeled samples on model performance is analyzed and presented." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "9317a134", | ||
"metadata": {}, | ||
"source": [ | ||
"### 5.1 Transfer Learning\n", | ||
"The performance of transfer learning using different amounts of labeled samples is presented and discussed." | ||
] | ||
}, | ||
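{ | ||
"metadata": {}, | ||
"cell_type": "markdown", | ||
"source": "As an illustration of the transfer-learning setup, the following cell sketches training a lightweight classifier on top of frozen MiniLM embeddings; the actual classifier and hyperparameters used in the experiments may differ, so treat this as a sketch under assumptions.", | ||
"id": "e5f7a9b1c3d24680" | ||
}, | ||
{ | ||
"metadata": {}, | ||
"cell_type": "code", | ||
"source": [ | ||
"# Transfer-learning sketch: keep the sentence encoder frozen and train only a\n", | ||
"# simple classifier (here logistic regression) on its embeddings.\n", | ||
"from sentence_transformers import SentenceTransformer\n", | ||
"from sklearn.linear_model import LogisticRegression\n", | ||
"from sklearn.metrics import accuracy_score\n", | ||
"\n", | ||
"\n", | ||
"def transfer_learning_accuracy(train_texts, train_labels, val_texts, val_labels):\n", | ||
"    encoder = SentenceTransformer(\"sentence-transformers/all-MiniLM-L6-v2\")\n", | ||
"    X_train = encoder.encode(list(train_texts), show_progress_bar=False)\n", | ||
"    X_val = encoder.encode(list(val_texts), show_progress_bar=False)\n", | ||
"    clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)\n", | ||
"    return accuracy_score(val_labels, clf.predict(X_val))" | ||
], | ||
"id": "f6a8b0c2d4e35791", | ||
"outputs": [], | ||
"execution_count": null | ||
}, | ||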
{ | ||
"cell_type": "markdown", | ||
"id": "23d7a8e1", | ||
"metadata": {}, | ||
"source": [ | ||
"### 5.2 Fine-tuning\n", | ||
"The performance of fine-tuning using different amounts of labeled samples is presented and discussed." | ||
] | ||
}, | ||
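{ | ||
"metadata": {}, | ||
"cell_type": "markdown", | ||
"source": "The cell below sketches how fine-tuning the full transformer on the labeled samples could be set up with the Hugging Face `transformers` Trainer; the model name, tokenization settings, and hyperparameters are assumptions for illustration, not the configuration behind the reported results.", | ||
"id": "a7b9c1d3e5f46802" | ||
}, | ||
{ | ||
"metadata": {}, | ||
"cell_type": "code", | ||
"source": [ | ||
"# Fine-tuning sketch: update all transformer weights with a classification head\n", | ||
"# on top, using the Hugging Face Trainer API. Assumes numeric 0/1 labels.\n", | ||
"from datasets import Dataset\n", | ||
"from transformers import (AutoModelForSequenceClassification, AutoTokenizer,\n", | ||
"                          Trainer, TrainingArguments)\n", | ||
"\n", | ||
"\n", | ||
"def finetune(train_data, val_data, model_name=\"sentence-transformers/all-MiniLM-L6-v2\"):\n", | ||
"    tokenizer = AutoTokenizer.from_pretrained(model_name)\n", | ||
"    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)\n", | ||
"\n", | ||
"    def tokenize(batch):\n", | ||
"        return tokenizer(batch['content'], truncation=True, padding='max_length', max_length=256)\n", | ||
"\n", | ||
"    train_ds = Dataset.from_pandas(train_data[['content', 'label']]).map(tokenize, batched=True)\n", | ||
"    val_ds = Dataset.from_pandas(val_data[['content', 'label']]).map(tokenize, batched=True)\n", | ||
"\n", | ||
"    args = TrainingArguments(output_dir='finetune-out', num_train_epochs=3,\n", | ||
"                             per_device_train_batch_size=16)\n", | ||
"    trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds)\n", | ||
"    trainer.train()\n", | ||
"    return trainer.evaluate()" | ||
], | ||
"id": "b8c0d2e4f6a57913", | ||
"outputs": [], | ||
"execution_count": null | ||
}, | ||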
{ | ||
"cell_type": "markdown", | ||
"id": "22cfcf6e", | ||
"metadata": {}, | ||
"source": [ | ||
"## 6. Semi-Supervised Learning Performance\n", | ||
"Semi-supervised learning techniques, specifically K-Nearest Neighbors (KNN) and Logistic Regression (LogReg), are employed to generate weak labels for the unlabeled samples. The impact of the number of labeled samples and weak labeling strategies on model performance is analyzed and presented." | ||
] | ||
}, | ||
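{ | ||
"metadata": {}, | ||
"cell_type": "markdown", | ||
"source": "To make the weak-labeling step concrete, the following cell sketches how KNN and logistic regression labelers could be fitted on embeddings of the labeled pool and then used to annotate the unlabeled pool. It is a sketch under assumptions (embedding model, hyperparameters), not the exact pipeline behind the results in this section.", | ||
"id": "c9d1e3f5a7b68024" | ||
}, | ||
{ | ||
"metadata": {}, | ||
"cell_type": "code", | ||
"source": [ | ||
"# Weak-labeling sketch: fit simple classifiers on embeddings of the labeled\n", | ||
"# pool and use their predictions as weak labels for the unlabeled pool.\n", | ||
"from sentence_transformers import SentenceTransformer\n", | ||
"from sklearn.linear_model import LogisticRegression\n", | ||
"from sklearn.neighbors import KNeighborsClassifier\n", | ||
"\n", | ||
"\n", | ||
"def weak_label(labeled_texts, labeled_y, unlabeled_texts):\n", | ||
"    encoder = SentenceTransformer(\"sentence-transformers/all-MiniLM-L6-v2\")\n", | ||
"    X_lab = encoder.encode(list(labeled_texts), show_progress_bar=False)\n", | ||
"    X_unlab = encoder.encode(list(unlabeled_texts), show_progress_bar=False)\n", | ||
"    labelers = {'knn': KNeighborsClassifier(n_neighbors=5),\n", | ||
"                'logreg': LogisticRegression(max_iter=1000)}\n", | ||
"    # One array of weak labels per labeling strategy.\n", | ||
"    return {name: clf.fit(X_lab, labeled_y).predict(X_unlab)\n", | ||
"            for name, clf in labelers.items()}" | ||
], | ||
"id": "d0e2f4a6b8c79135", | ||
"outputs": [], | ||
"execution_count": null | ||
}, | ||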
{ | ||
"cell_type": "markdown", | ||
"id": "f0c55c17", | ||
"metadata": {}, | ||
"source": [ | ||
"### 6.1 K-Nearest Neighbors (KNN)\n", | ||
"The performance of KNN-based weak labeling using different amounts of labeled samples is presented and discussed." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "1c53e734", | ||
"metadata": {}, | ||
"source": [ | ||
"### 6.2 Logistic Regression (LogReg)\n", | ||
"The performance of LogReg-based weak labeling using different amounts of labeled samples is presented and discussed." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "e3d33785", | ||
"metadata": {}, | ||
"source": [ | ||
"## 7. Learning Curve Analysis\n", | ||
"The learning curve, plotting the model performance against varying numbers of labeled samples for each technique (supervised and semi-supervised), is presented and analyzed. The focus is on the range with few labeled samples, and the practical implications of the results are discussed." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"id": "eabcf731", | ||
"metadata": {}, | ||
"source": [ | ||
"# Code for generating the learning curve plot" | ||
], | ||
"outputs": [], | ||
"execution_count": null | ||
}, | ||
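{ | ||
"metadata": {}, | ||
"cell_type": "markdown", | ||
"source": "Once accuracy scores for each technique and labeled-sample budget are collected, a learning curve can be drawn as sketched below. The `results` structure (technique name mapped to {number of labeled samples: accuracy}) is an assumed format, not a variable produced earlier in this notebook.", | ||
"id": "e1f3a5b7c9d80246" | ||
}, | ||
{ | ||
"metadata": {}, | ||
"cell_type": "code", | ||
"source": [ | ||
"# Sketch of a learning-curve plot; `results` is assumed to map each technique\n", | ||
"# to a dict of {number of labeled samples: validation accuracy}.\n", | ||
"def plot_learning_curve(results):\n", | ||
"    plt.figure(figsize=(10, 6))\n", | ||
"    for technique, scores in results.items():\n", | ||
"        n_samples = sorted(scores)\n", | ||
"        plt.plot(n_samples, [scores[n] for n in n_samples], marker='o', label=technique)\n", | ||
"    plt.xlabel('Number of labeled samples')\n", | ||
"    plt.ylabel('Validation accuracy')\n", | ||
"    plt.title('Learning Curves')\n", | ||
"    plt.legend()\n", | ||
"    plt.show()" | ||
], | ||
"id": "f2a4b6c8d0e91357", | ||
"outputs": [], | ||
"execution_count": null | ||
}, | ||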
{ | ||
"cell_type": "markdown", | ||
"id": "7ce2752d", | ||
"metadata": {}, | ||
"source": [ | ||
"## 8. Model Comparison and Analysis\n", | ||
"A thorough analysis of the results is conducted, comparing the baseline model, supervised learning techniques, and semi-supervised learning techniques. The impact of different weak labeling strategies and training data sizes on model performance is evaluated. The best approach for the chosen dataset is determined, emphasizing the models that achieve acceptable performance with few manually annotated samples." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "45316041", | ||
"metadata": {}, | ||
"source": [ | ||
"## 9. Time Savings Factor and Implications\n", | ||
"The time savings factor, quantifying the reduction in manually labeled data required to achieve acceptable performance levels using weak labeling approaches, is calculated. The implications of the findings are discussed." | ||
] | ||
}, | ||
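{ | ||
"metadata": {}, | ||
"cell_type": "markdown", | ||
"source": "One simple way to quantify the time savings factor is sketched below: the ratio between the number of manual labels a purely supervised model needs to reach a target score and the number the weakly supervised approach needs. The target threshold and the step-wise lookup over the learning curve are assumptions for illustration.", | ||
"id": "a9b1c3d5e7f02468" | ||
}, | ||
{ | ||
"metadata": {}, | ||
"cell_type": "code", | ||
"source": [ | ||
"# Sketch: time savings factor = manual labels needed without weak supervision\n", | ||
"# divided by manual labels needed with it, at the same target accuracy.\n", | ||
"def labels_needed(scores, target):\n", | ||
"    # `scores` maps number of labeled samples -> accuracy; return the smallest\n", | ||
"    # budget reaching the target, or None if it is never reached.\n", | ||
"    for n in sorted(scores):\n", | ||
"        if scores[n] >= target:\n", | ||
"            return n\n", | ||
"    return None\n", | ||
"\n", | ||
"\n", | ||
"def time_savings_factor(supervised_scores, weak_scores, target=0.85):\n", | ||
"    n_supervised = labels_needed(supervised_scores, target)\n", | ||
"    n_weak = labels_needed(weak_scores, target)\n", | ||
"    return None if not (n_supervised and n_weak) else n_supervised / n_weak" | ||
], | ||
"id": "b0c2d4e6f8a13580", | ||
"outputs": [], | ||
"execution_count": null | ||
}, | ||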
{ | ||
"cell_type": "markdown", | ||
"id": "d4fd172b", | ||
"metadata": {}, | ||
"source": [ | ||
"## 10. Conclusion and Future Directions\n", | ||
"The key findings, insights, and potential implications of the sentiment analysis mini-challenge are summarized. The effectiveness of weak supervision techniques in reducing the need for manual annotation while maintaining acceptable model performance is discussed. Future directions for research and improvements are outlined." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "3595d590", | ||
"metadata": {}, | ||
"source": [ | ||
"## 11. AI Tool Usage Assessment\n", | ||
"The use of ChatGPT or other AI tools throughout the mini-challenge is documented and assessed. The tasks for which they were used, the prompting strategies employed, and their contribution to solving the problem and acquiring new skills are specified." | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"jupytext": { | ||
"cell_metadata_filter": "-all", | ||
"main_language": "python", | ||
"notebook_metadata_filter": "-all" | ||
}, | ||
"kernelspec": { | ||
"name": "python3", | ||
"language": "python", | ||
"display_name": "Python 3 (ipykernel)" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |