Added first draft of main
dmnkf committed Jun 7, 2024
1 parent b202fe6 commit e1ace7d
Showing 1 changed file with 90 additions and 82 deletions: notebooks/main.ipynb
@@ -225,11 +225,11 @@
"metadata": {},
"source": [
"## 3. Data Splitting Strategy\n",
"The dataset is split into development, validation, labeled, and unlabeled sets using a nested split approach. The development set is a fraction of the full dataset, the validation set is a fraction of the test dataset, and the labeled set is a fraction of the development set. The remaining samples in the development set are considered unlabeled. The nested split always adds in 25% increments (25, 50, 75), and a 1/6 split between labelled and unlabelled data is used, resulting in 1000 labeled and 5000 weakly labeled samples in total.\n",
"The dataset is split into development, validation, labeled, and unlabeled sets using a nested split approach. The development set is a fraction of the full dataset, the validation set is a fraction of the test dataset, and the labeled set is a fraction of the development set. The remaining samples in the development set are considered unlabeled. The nested split always adds in 25% increments (25, 50, 75), and a 1/6 split between labeled and unlabeled data is used, resulting in 1000 labeled and 5000 weakly labeled samples in total.\n",
"\n",
"All the pre-split datasets are stored in the `data/partitions` directory as `.parquet` files.\n",
"\n",
"Given the focus of the MC on the impact of weak labelling and its impact, we introduce a nested split which further divides our training data into splits. Here is a brief overview of the nested split algorithm we use:\n",
"Given the focus of the MC on the impact of weak labeling and its impact, we introduce a nested split which further divides our training data into splits. Here is a brief overview of the nested split algorithm we use:\n",
"\n",
"1. **Validate the Fractions**: We start by ensuring that the proportions we want to use for our subsets are reasonable—each should be a fraction of the whole dataset.\n",
"2. **Shuffle the Data**: To make sure our subsets are representative and unbiased, we randomly shuffle the entire dataset. This ensures that each subset is a good mix of the data.\n",
@@ -355,8 +355,7 @@
"metadata": {},
"cell_type": "code",
"source": [
"from sklearn.metrics import roc_curve\n",
"import json\n",
"from sklearn.metrics import roc_curve, precision_recall_curve, auc\n",
"import matplotlib.pyplot as plt\n",
"\n",
"import os\n",
@@ -365,9 +364,49 @@
"\n",
"MODEL_DIR = os.getenv(\"MODELS_DIR\")\n",
"\n",
"def plot_model_performance(results_data, model_names, baseline_data=None, metrics=None, baseline_name='Baseline'):\n",
" if metrics is None:\n",
" metrics = ['eval_accuracy', 'eval_f1_macro', 'eval_f1_weighted']\n",
"def plot_additional_metrics(results_data, model_names):\n",
" fig = plt.figure(figsize=(16, 13))\n",
" gs = fig.add_gridspec(nrows=3, ncols=2, height_ratios=[3, 1.5, 1.5], wspace=0.3, hspace=0.6)\n",
"\n",
" # Precision-Recall curve\n",
" ax_pr = fig.add_subplot(gs[0, 0])\n",
" for i, model_results in enumerate(results_data):\n",
" precision, recall, _ = precision_recall_curve(model_results['eval_true_labels'], model_results['eval_pred_probs'])\n",
" auprc = auc(recall, precision)\n",
" ax_pr.plot(recall, precision, label=f'{model_names[i]} (AUPRC = {auprc:.2f})')\n",
" ax_pr.set_title('Precision-Recall Curve', fontsize=14, pad=15)\n",
" ax_pr.set_xlabel('Recall', fontsize=12, labelpad=10)\n",
" ax_pr.set_ylabel('Precision', fontsize=12, labelpad=10)\n",
" ax_pr.legend(loc='lower left', fontsize=11, bbox_to_anchor=(0.0, 0.0), borderaxespad=0.5)\n",
" ax_pr.grid(True)\n",
"\n",
" # ROC curve and AUC\n",
" ax_roc = fig.add_subplot(gs[0, 1])\n",
" for i, model_results in enumerate(results_data):\n",
" fpr, tpr, _ = roc_curve(model_results['eval_true_labels'], model_results['eval_pred_probs'])\n",
" roc_auc = auc(fpr, tpr)\n",
" ax_roc.plot(fpr, tpr, label=f'{model_names[i]} (AUC = {roc_auc:.2f})')\n",
" ax_roc.plot([0, 1], [0, 1], 'k--')\n",
" ax_roc.set_title('ROC Curve', fontsize=14, pad=15)\n",
" ax_roc.set_xlabel('False Positive Rate', fontsize=12, labelpad=10)\n",
" ax_roc.set_ylabel('True Positive Rate', fontsize=12, labelpad=10)\n",
" ax_roc.legend(loc='lower right', fontsize=11, bbox_to_anchor=(1.0, 0.0), borderaxespad=0.5)\n",
" ax_roc.grid(True)\n",
"\n",
" # Confusion matrices\n",
" for i, model_results in enumerate(results_data):\n",
" ax_cm = fig.add_subplot(gs[1 + i // 2, i % 2])\n",
" cm = model_results['eval_confusion_matrix']\n",
" sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax_cm, xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'], cbar=False, annot_kws={\"size\": 14})\n",
" ax_cm.set_title(f'Confusion Matrix - {model_names[i]}', fontsize=13, pad=15)\n",
" ax_cm.set_xlabel('Predicted', fontsize=11, labelpad=10)\n",
" ax_cm.set_ylabel('True', fontsize=11, labelpad=10)\n",
" ax_cm.tick_params(axis='both', labelsize=11)\n",
"\n",
" plt.tight_layout()\n",
" plt.show() \n",
"\n",
"def plot_model_performance(results_data, model_names, baseline_data=None, baseline_name='Baseline'):\n",
"\n",
" results = results_data\n",
"\n",
@@ -379,11 +418,15 @@
"\n",
" if baseline_data:\n",
" baseline_accuracy = baseline_data['eval_accuracy']\n",
" baseline_f1 = baseline_data['eval_f1_weighted']\n",
" ax.axhline(y=baseline_accuracy, color='r', linestyle='--', label=baseline_name)\n",
" ax.axhline(y=baseline_f1, color='g', linestyle='--', label=f'{baseline_name} (F1-Score Weighted)')\n",
"\n",
" for i, model_results in enumerate(results):\n",
" values = [model_results[fraction]['eval_accuracy'] for fraction in fractions]\n",
" values_f1 = [model_results[fraction]['eval_f1_weighted'] for fraction in fractions]\n",
" ax.plot(x, values, marker='o', label=model_names[i])\n",
" ax.plot(x, values_f1, marker='x', label=f'{model_names[i]} (F1-Score Weighted)')\n",
"\n",
" ax.set_xticks(x)\n",
" ax.set_xticklabels(fractions)\n",
@@ -395,35 +438,6 @@
" ax.grid(True)\n",
"\n",
" plt.tight_layout()\n",
" plt.show()\n",
"\n",
"def plot_model_auroc(results_data, model_names, baseline_data=None, baseline_name='Baseline'):\n",
" results = results_data\n",
"\n",
" fractions = sorted(results[0].keys(), key=float)\n",
" num_fractions = len(fractions)\n",
"\n",
" fig, ax = plt.subplots(figsize=(10, 6))\n",
" x = range(num_fractions)\n",
"\n",
" if baseline_data:\n",
" baseline_auroc = baseline_data['eval_auroc']\n",
" ax.axhline(y=baseline_auroc, color='r', linestyle='--', label=baseline_name)\n",
"\n",
" for i, model_results in enumerate(results):\n",
" values = [model_results[fraction]['eval_auroc'] for fraction in fractions]\n",
" ax.plot(x, values, marker='o', label=model_names[i])\n",
"\n",
" ax.set_xticks(x)\n",
" ax.set_xticklabels(fractions)\n",
" ax.set_xlabel('Fraction of Labeled Samples')\n",
" ax.set_ylabel('AUROC')\n",
" ax.set_title('Model AUROC Comparison')\n",
" ax.set_ylim(0, 1)\n",
" ax.legend(loc='lower right')\n",
" ax.grid(True)\n",
"\n",
" plt.tight_layout()\n",
" plt.show()\n"
],
"id": "51c8c9ffc108cafe",
@@ -449,7 +463,7 @@
"source": [
"import json\n",
"\n",
"relevant_metrics = ['eval_accuracy', 'eval_f1_macro', 'eval_f1_weighted']\n",
"relevant_metrics = ['eval_accuracy', 'eval_precision', 'eval_recall', 'eval_f1_weighted']\n",
"\n",
"with open(f'{MODEL_DIR}/eval/eval_results.json') as file:\n",
" baseline_data = json.load(file)\n",
@@ -490,7 +504,7 @@
"source": [
"## 5. Supervised Learning Performance\n",
"\n",
"Before we dive deeper into the chosen weak labelling technique and its impact on the model performance, we will first decide whether we will train our model via transfer learning or fine-tuning.\n",
"Before we dive deeper into the chosen weak labeling technique and its impact on the model performance, we will first decide whether we will train our model via transfer learning or fine-tuning.\n",
"\n",
"For this we will train the model using the nested splits on both techniques and compare the results. The results are stored in the `data/eval` directory as `.json` files."
]
@@ -511,7 +525,7 @@
"with open(f'{MODEL_DIR}/supervised/transfer_nested/eval_results.json') as file:\n",
" transfer_nested_data = json.load(file)\n",
"\n",
"plot_model_performance([transfer_nested_data], ['Transfer Learning'], baseline_data, metrics=relevant_metrics)\n",
"plot_model_performance([transfer_nested_data], ['Transfer Learning'], baseline_data)\n",
"\n"
],
"id": "3bd5576049fc6c83",
@@ -524,6 +538,14 @@
"source": "The results show that the transfer learning model barely outperforms the baseline model. This indicates that the pretrained model's knowledge is not sufficient to achieve high performance on the sentiment analysis task. ",
"id": "5d0cb8c4c5b514f2"
},
{
"metadata": {},
"cell_type": "code",
"source": "plot_additional_metrics([results for k, results in transfer_nested_data.items()], [f'Transfer Learning {percentage}' for percentage in transfer_nested_data.keys()])",
"id": "59ef6c6b3924b5d0",
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"id": "23d7a8e1",
@@ -542,7 +564,7 @@
"with open(f'{MODEL_DIR}/supervised/finetune_nested/eval_results.json') as file:\n",
" finetune_nested_data = json.load(file)\n",
"\n",
"plot_model_performance([finetune_nested_data], ['Fine-tuning'], baseline_data, metrics=relevant_metrics)\n"
"plot_model_performance([finetune_nested_data], ['Fine-tuning'], baseline_data)\n"
],
"id": "1871656e5dc97ef9",
"outputs": [],
@@ -551,16 +573,24 @@
{
"metadata": {},
"cell_type": "markdown",
"source": "The results show that fine-tuning outperforms both the baseline model and the transfer learning model. We can also see that after using 75% of the labelled data (750 labelled samples) the model stagnates in its performance. This indicates that the model has reached its capacity to learn from the data and adding more data does not substantially improve the performance.\n",
"source": "The results show that fine-tuning outperforms both the baseline model and the transfer learning model. We can also see that after using 75% of the labeled data (750 labeled samples) the model stagnates in its performance. This indicates that the model has reached its capacity to learn from the data and adding more data does not substantially improve the performance.\n",
"id": "d760fdb577139928"
},
{
"metadata": {},
"cell_type": "code",
"source": "plot_additional_metrics([results for k, results in finetune_nested_data.items()], [f'Fine-Tuning {percentage}' for percentage in finetune_nested_data.keys()])",
"id": "9f7855d88095d18b",
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"id": "22cfcf6e",
"metadata": {},
"source": [
"## 6. Semi-Supervised Learning Performance\n",
"After we established that fine-tuning is the best approach for training the model, we will now evaluate the performance of the semi-supervised learning techniques. We will compare the performance of the fine-tuned model with weak labels generated using different weak labelling strategies. \n",
"After we established that fine-tuning is the best approach for training the model, we will now evaluate the performance of the semi-supervised learning techniques. We will compare the performance of the fine-tuned model with weak labels generated using different weak labeling strategies. \n",
"\n",
"The nested split logic above is used, with the small difference that each split contains the fully labeled data. This means that the nested split is applied to the weak labels and then concatenated with the fully labeled data. "
]
@@ -569,7 +599,7 @@
"cell_type": "markdown",
"id": "1c53e734",
"metadata": {},
"source": "### 6.1 Logistic Regression (LogReg) Weak Labelling"
"source": "### 6.1 Logistic Regression (LogReg) Weak labeling"
},
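{
"metadata": {},
"cell_type": "markdown",
"source": [
"The weak labels themselves are produced outside this notebook. As a rough sketch of the idea only (every name below is an assumption, not the project's actual pipeline), a LogReg weak labeler can be as simple as:\n",
"\n",
"```python\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.pipeline import make_pipeline\n",
"\n",
"# Stand-in data; in practice these come from the labeled partition.\n",
"labeled_texts = ['great movie', 'terrible plot']\n",
"labeled_targets = [1, 0]\n",
"unlabeled_texts = ['what a film']\n",
"\n",
"# Fit a cheap model on the small labeled set ...\n",
"weak_labeler = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))\n",
"weak_labeler.fit(labeled_texts, labeled_targets)\n",
"\n",
"# ... and pseudo-label the unlabeled pool with its predictions.\n",
"weak_labels = weak_labeler.predict(unlabeled_texts)\n",
"```"
],
"id": "logreg-weak-labeling-sketch"
},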
{
"metadata": {},
@@ -578,7 +608,7 @@
"with open(f'{MODEL_DIR}/semi-supervised/finetune_nested/eval_results.json') as file:\n",
" logreg_nested_data = json.load(file)\n",
" \n",
"plot_model_performance([logreg_nested_data], ['LogReg Weak-Labelling'], finetune_nested_data[\"1.0\"], metrics=relevant_metrics, baseline_name='Fine-tuning 100% (Fully Labeled)')\n"
"plot_model_performance([logreg_nested_data], ['Weak-Labeling Fine-Tuning '], finetune_nested_data[\"1.0\"], baseline_name='Fine-Tuning 100% (Fully Labeled)')\n"
],
"id": "6d5220fe8adfcba9",
"outputs": [],
@@ -587,32 +617,21 @@
{
"metadata": {},
"cell_type": "markdown",
"source": "Adding weak labels to the dataset has a significant impact on the model performance. With only an addition 25% ",
"source": [
"Adding weak labels to the dataset has a significant impact on the model performance. With only an additional 25% of weakly labeled (total 812, 250 labeled and 562 weakly labeled) data, the model achieves an accuracy of around 88%. This is a substantial improvement compared to the fine-tuning model, which only used the fully labeled data. \n",
"\n",
"However, we can also see an interesting pattern, the difference between 25% and 100% additional weak labels has little to no impact. This indicates that the model has reached its capacity to learn from the data and adding more data does not substantially improve the performance. We even spot a slight decrease in performance when at 75% weak labels.\n"
],
"id": "79dc94ed9017120e"
},
{
"cell_type": "markdown",
"id": "e3d33785",
"metadata": {},
"source": "## 7. Learning Curve Analysis"
},
{
"cell_type": "code",
"id": "eabcf731",
"metadata": {},
"source": [
"# Plot all results\n",
"plot_model_performance([transfer_nested_data, finetune_nested_data, logreg_nested_data], ['Transfer Learning', 'Fine-tuning', 'LogReg Weak-Labelling'], baseline_data, metrics=relevant_metrics)"
],
"source": "plot_additional_metrics([results for k, results in logreg_nested_data.items()], [f'Weak-Labeling Fine-Tuning {percentage}' for percentage in logreg_nested_data.keys()])",
"id": "c48ceb0ceeba0a0d",
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"id": "7ce2752d",
@@ -626,30 +645,19 @@
"metadata": {},
"cell_type": "code",
"source": [
"# Plot all results\n",
"plot_model_auroc([transfer_nested_data, finetune_nested_data, logreg_nested_data], ['Transfer Learning', 'Fine-tuning', 'LogReg Weak-Labelling'], baseline_data)"
"with open(f'{MODEL_DIR}/semi-supervised/finetune_nested/eval_results.json') as file:\n",
" logreg_nested_data = json.load(file)\n",
"\n",
"best_transfer = transfer_nested_data[\"1.0\"]\n",
"best_finetune = finetune_nested_data[\"1.0\"]\n",
"best_weak_labeling = logreg_nested_data[\"0.25\"]\n",
"\n",
"\n",
"plot_additional_metrics([best_transfer, best_finetune, best_weak_labeling], ['Transfer Learning', 'Fine-Tuning', 'Weak-Labeling Fine-Tuning'])"
],
"id": "76592bb8196721ae",
"outputs": [],
"execution_count": null
}
],
"metadata": {
