This repository has been archived by the owner on Jun 28, 2024. It is now read-only.

Commit

Merge remote-tracking branch 'origin/main'
noaahh committed Jun 5, 2024
2 parents ed74894 + 5d45182 commit 218f3b6
Showing 3 changed files with 152 additions and 58 deletions.
25 changes: 25 additions & 0 deletions USE-OF-AI.md
@@ -0,0 +1,25 @@
## Use of AI Tools

Throughout this mini-challenge, our team leveraged AI tools, specifically ChatGPT and GitHub Copilot, to assist with the development and optimization of our sentiment analysis system. These tools were instrumental in managing coding and debugging tasks, allowing us to focus on the core challenge of implementing and evaluating weak labeling strategies to improve model performance with limited labeled data. Here's a detailed look at how we utilized these AI tools and the strategies that proved most effective.

### ChatGPT

ChatGPT served as a vital support tool for our software engineering tasks. Although it couldn't directly influence conceptual decisions due to its training data limitations, it was incredibly useful for generating well-structured code and solving programming issues.

**How We Used ChatGPT:**

- **Specific Technical Questions**: We frequently asked ChatGPT detailed technical questions to help generate efficient and clean code. For instance, asking "How can I create a modular class structure for processing sentiment data?" resulted in well-organized code suggestions that improved our project’s architecture (a sketch of the kind of structure such a prompt yields follows this list).
- **Debugging Assistance**: ChatGPT was invaluable for troubleshooting. When encountering bugs, we described the problems in detail, such as "ChatGPT, why is this data preprocessing function failing with certain input formats?" This approach helped us quickly identify and fix issues, ensuring smoother progress.
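
For illustration, here is a minimal sketch of the kind of modular structure such a prompt tends to produce. The class and method names are hypothetical and are not taken from our actual codebase:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Review:
    """A single review with its raw text and an optional sentiment label."""
    text: str
    label: Optional[str] = None


class SentimentPreprocessor:
    """Cleans raw review text before it is embedded or weakly labeled."""

    def __init__(self, lowercase: bool = True):
        self.lowercase = lowercase

    def transform(self, reviews: List[Review]) -> List[Review]:
        cleaned = []
        for review in reviews:
            text = review.text.strip()
            if self.lowercase:
                text = text.lower()
            cleaned.append(Review(text=text, label=review.label))
        return cleaned
```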

### GitHub Copilot

GitHub Copilot significantly accelerated our development process by handling boilerplate and utility code directly within our IDE. This integration allowed us to concentrate on the more complex aspects of the sentiment analysis system, particularly the weak labeling techniques.

**How We Used GitHub Copilot:**

- **Inline Code Generation**: We relied on Copilot to generate code snippets as we typed. For example, while writing a function to generate text embeddings, we would start with the function signature and let Copilot complete the body (see the sketch after this list). This approach saved time and reduced manual coding effort.
- **Code Refactoring**: Copilot was also effective for optimizing existing code. By adding comments like `# Refactor this code for efficiency`, we received improved code suggestions that enhanced readability and performance.
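
As an illustration, the function below is the sort of thing we would start by typing the signature for and then let Copilot fill in. It is a hedged sketch assuming the `sentence-transformers` package, not the exact code from our repository:

```python
import numpy as np
from sentence_transformers import SentenceTransformer


def embed_texts(texts, model_name: str = "all-MiniLM-L6-v2") -> np.ndarray:
    """Encode a list of texts into an (n_texts, embedding_dim) matrix."""
    model = SentenceTransformer(model_name)
    # convert_to_numpy returns a dense float32 array, one row per input text
    return model.encode(texts, convert_to_numpy=True)
```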

### Evaluation and Impact

The integration of ChatGPT and GitHub Copilot greatly improved our development efficiency. These tools allowed us to automate routine coding tasks and focus our efforts on the strategic implementation of weak labeling techniques. As a result, we could efficiently explore and refine our approaches, ultimately boosting the performance of our sentiment classification model despite the initial scarcity of labeled data. The use of AI tools not only sped up development but also ensured that our codebase remained clean, maintainable, and well-documented, contributing significantly to the overall success of the project.
171 changes: 120 additions & 51 deletions notebooks/embedding_analysis.ipynb
@@ -11,14 +11,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook at hand aims to dive into the possible patterns dimensionality reduction techniques can show within the proposed weak labeling models and the embedding models used."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading Weak Labeled data"
"The notebook at hand aims to dive into the possible patterns dimensionality reduction techniques can show within the proposed embedding models.\n",
"\n",
"To analyze the embedding spaces we use Arize's Phoenix app which decomposes the high dimensionality into a 3-dimensional space using UMAP. Additionally we will also look at each embedding space in a PCA-decomposed representation to see how much impact the decomposition algorithm may have.\n",
"\n",
"We looked at the performance of the following two embedding models:\n",
"- [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)\n",
"- [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)"
]
},
{
@@ -33,6 +32,7 @@
"import pandas as pd\n",
"import plotly\n",
"import plotly.express as px\n",
"\n",
"from dotenv import load_dotenv\n",
"\n",
"current_dir = os.getcwd()\n",
@@ -44,6 +44,16 @@
"load_dotenv()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Loading Embeddings\n",
"The first step is to load the persisted embeddings from each embedding model. The embedding vectors (i.e. embedding vector matrix) was saved in the `data/embeddings/*` folder for each split in the initial train, test and validation split. \n",
"\n",
"The following block loads these split up matrices into memory."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -71,6 +81,14 @@
" embeddings[dir] = curr_embeddings"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Loading Data Partitions\n",
"To get a full glance at the embedding space's attributes we may want to look at the content of a review and how it relates to other reviews in the space so within this next code block we gather the nominal attributes of each review and load the three split parquets (train, test and validation or in our case unlabelled, labelled and validation) into the memory."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -94,7 +112,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Merging content, title and label to embedding vectors"
"#### Merging content, title and label to embedding vectors\n",
"The goal now is to merge the matrix representations with the dataframes. This will later on allow us to pass a dataset into Phoenix that contains the reviews content and its embedding vector."
]
},
{
@@ -103,46 +122,39 @@
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"merged_partitions = {}\n",
"\n",
"for embedding_model in embeddings:\n",
" merged_partitions[embedding_model] = {}\n",
" print(f'For {embedding_model}:')\n",
" for partition in partitions:\n",
" curr_partition_name = partition.split('_')[0]\n",
" embeddings_keys = embeddings[embedding_model].keys()\n",
" print(f'Merging partitions for model {embedding_model}')\n",
" merged_list = []\n",
" \n",
" for partition_key, partition_df in partitions.items():\n",
" curr_partition_name = partition_key.split('_')[0]\n",
" matched = False\n",
" \n",
" for embedding_key in embeddings_keys:\n",
" for embedding_key, embedding_array in embeddings[embedding_model].items():\n",
" if curr_partition_name == embedding_key.split('_')[0]:\n",
" partition_data = partitions[partition]\n",
" embedding_data = embeddings[embedding_model][embedding_key]\n",
" partition_data['embedding'] = embedding_data.tolist()\n",
" \n",
" merged_partitions[embedding_model][partition] = partition_data\n",
" if isinstance(embedding_array, (list, pd.Series)):\n",
" embedding_array = np.array(embedding_array)\n",
" \n",
" print(f\"- Merged {embedding_key} with {partition}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# concatenate all partitions for each embedding model\n",
"for embedding_model in merged_partitions:\n",
" print(f\"Concatenating partitions for {embedding_model}\")\n",
" partitions_data = pd.concat(merged_partitions[embedding_model].values(), ignore_index=True)\n",
" merged_partitions[embedding_model] = partitions_data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"merged_partitions['mini_lm']"
" if len(partition_df) == embedding_array.shape[0]:\n",
" partition_df = partition_df.copy()\n",
" partition_df['embedding'] = embedding_array.tolist()\n",
" merged_list.append(partition_df)\n",
" matched = True\n",
" print(f\" - Merged {embedding_key} with {partition_key}\")\n",
" else:\n",
" print(f\" - Number of rows do not match for {embedding_key} and {partition_key}\")\n",
" \n",
" if not matched:\n",
" print(f\" - No matching embedding found for {partition_key}\")\n",
" \n",
" if merged_list:\n",
" merged_partitions[embedding_model] = pd.concat(merged_list, ignore_index=True)\n",
" else:\n",
" print(f\"No partitions were merged for model {embedding_model}\")"
]
},
{
@@ -161,8 +173,6 @@
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.decomposition import PCA\n",
"\n",
"def break_content(text, length=50):\n",
@@ -204,7 +214,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### MiniLM Embedding Space"
"### MiniLM Embedding Space\n",
"First we will take a look at how the embedding space of the `mini-lm` embedding model looks.\n",
"\n",
"Note, to see the projected space in the Phoenix app, make sure to click the \"text_embedding\" link inside the app, this will load the 3-dimensional UMAP projection. Another thing to note is that UMAP in uses stochastic algorithms to speed up calculation so the representation you see may not look the same as we noted down so **this decomposition approach is non-deterministic**."
]
},
{
@@ -215,8 +228,6 @@
"source": [
"from src.px_utils import create_dataset, launch_px\n",
"\n",
"knn_key = 'mlp_weak_labeling_weaklabels.parquet'\n",
"\n",
"mini_lm_ds = create_dataset('mini_lm', merged_partitions['mini_lm'], merged_partitions['mini_lm']['embedding'], content=merged_partitions['mini_lm']['content'])\n",
"\n",
"px_session = launch_px(mini_lm_ds, None)\n",
@@ -239,7 +250,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Since Phoenix doesn't allow for a different dimension reduction technique we implement a PCA strategy ourselves. The UMAP technique differs vastly from PCA so looking at another technique could yield more interesting observations in the embedding space."
"#### PCA of MiniLM\n",
"Since Phoenix doesn't allow for a different dimension reduction technique we implement a PCA strategy ourselves. The UMAP technique differs vastly from PCA so looking at another technique could yield more interesting observations in the embedding space. PCA on the other hand is deterministic so the observations made may make more sense. "
]
},
{
@@ -255,11 +267,68 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Compared to the UMAP representation the PCA reduction doesn't seem to show much more separation in labels or semantics. We can still roughly see the following four clusters:\n",
"Compared to the UMAP representation the PCA reduction shows a triangular shaped embedding space, at each corner a cluster emerges. We can still roughly see the following four clusters:\n",
"- Music albums\n",
"- Books\n",
"- Movies\n",
"- Tech Gadgets"
"- Tech Gadgets\n",
"\n",
"So this visualization again support the claims made in the above analysis; The `all-MiniLM-L6-v2` clearly succeeds in embedding and clustering the reviews according to their semantic relatedness."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### mpnet-base Embedding Space\n",
"Now we look at the embedding space of the `all-mpnet-base-v2` model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mpnet_base_ds = create_dataset('mpnet_base', merged_partitions['mpnet_base'], merged_partitions['mpnet_base']['embedding'], content=merged_partitions['mpnet_base']['content'])\n",
"\n",
"px_session = launch_px(mpnet_base_ds, None)\n",
"px_session.view()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The UMAP projection of `mpnet_base` as seen in Phoenix also shows roughly the same clusters as the UMAP projection of the `mini_lm` embedding space.\n",
"The main and most obvious sights stay the same as already noted in the previous exploration on the `mini_lm`'s decomposition:\n",
"- One cluster that separates itself from the other points in the space is the music-related cluster\n",
"- On the other side of the space much data points seem to be about books"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### PCA of mpnet-base"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plot_pca(merged_partitions, 'mpnet_base')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This `mpnet_base` PCA projection shows a decomposed space similar to the PCA of the `mini_lm` embedding model. This observation makes sense because both embedding models were trained with similar BERT-style objectives focused on mapping and clustering the semantic meanings of sentences. Consequently, the decomposed spaces map similar variances onto the principal components.\n",
"\n",
"A confusion that might arise is the fact that when comparing both 3D-PCA plots the principal component #2 seems to be flipped. This does not change the components meaning since the principal components derived from PCA are unique up to a sign flip. This is because the eigenvectors of a covariance matrix (which define the principal components) can point in either direction along the axis they define. Both directions represent the same principal component, just with inverted signs."
]
}
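,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustrative sketch (not part of the original analysis), the following cell checks the sign-flip claim numerically: projecting data onto a principal component and onto its negation gives mirrored coordinates with identical variance."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.decomposition import PCA\n",
"\n",
"rng = np.random.default_rng(0)\n",
"X = rng.normal(size=(200, 5))\n",
"Xc = X - X.mean(axis=0)  # PCA operates on mean-centered data\n",
"\n",
"pca = PCA(n_components=2).fit(X)\n",
"pc2 = pca.components_[1]  # second principal component\n",
"\n",
"proj = Xc @ pc2             # projection onto PC2\n",
"proj_flipped = Xc @ (-pc2)  # projection onto the sign-flipped PC2\n",
"\n",
"print(np.allclose(proj, -proj_flipped))            # True: coordinates are mirrored\n",
"print(np.isclose(proj.var(), proj_flipped.var()))  # True: variance is unchanged"
]
}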
],
14 changes: 7 additions & 7 deletions src/px_utils.py
@@ -8,13 +8,13 @@
load_dotenv()

DEFAULT_SCHEMA = px.Schema(
actual_label_column_name="label",
embedding_feature_column_names={
"text_embedding": px.EmbeddingColumnNames(
vector_column_name="content_vector",
raw_data_column_name="content",
),
}
actual_label_column_name="label",
embedding_feature_column_names={
"text_embedding": px.EmbeddingColumnNames(
vector_column_name="embedding",
raw_data_column_name="content",
),
}
)


