Notebooks for ML Tasks w/ Scikit and LLMs using Cohere, HuggingFace, LangChain, and OpenAI

turinglayer/notebooks

Notebooks for ML Tasks w/ Scikit and LLMs

Jupyter notebooks to apply and experiment with ML and Large Language Models (LLMs) provided by industry leaders such as Cohere, HuggingFace, LangChain, and OpenAI.

01. Binary Classification w/ SVM and Transformer-based Embeddings

[Notebook] [Open in Colab]

Tags: [binary-classification] [embeddings] [svm] [cohere] [openai] [tfidfvectorizer]

This notebook illustrates how to perform binary text classification with just a few hundred samples. It trains a basic Support Vector Machine on a collection of labeled financial sentences (400 training samples) and compares the accuracy obtained with three text representations:

  • transformer-based embeddings using Cohere.
  • transformer-based embeddings using OpenAI.
  • frequency-based embeddings using TfidfVectorizer.

SVM Binary-Text Classification Accuracy (550 samples):
------------------------------------------------------
w/ Cohere 'embed-english-v3.0': 94.93%
w/ OpenAI 'text-embedding-ada-002': 89.13%
w/ TfidfVectorizer: 65.22%
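
To make the frequency-based baseline concrete, here is a minimal, dependency-free sketch of smoothed TF-IDF weighting with L2 normalization. It follows the default weighting documented for scikit-learn's TfidfVectorizer (idf = ln((1 + n) / (1 + df)) + 1), but the whitespace tokenizer, the helper name `tfidf_vectors`, and the sample sentences are assumptions made for illustration and are not taken from the notebook.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using smoothed IDF and L2-normalized rows."""
    # Naive lowercase whitespace tokenizer (an assumption of this sketch).
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: number of documents in which each term appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        vectors.append([x / norm for x in row])
    return vocab, vectors


# Made-up sentences, only to exercise the function.
docs = ["profits rose sharply", "profits fell", "the weather was mild"]
vocab, vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    weight, term = max(zip(vec, vocab))
    print(f"{doc!r}: highest-weighted term = {term!r}")
```

Each row is a sparse, vocabulary-sized vector that a classifier such as an SVM can consume directly. Terms shared across documents ("profits") receive a lower weight than terms unique to one document ("fell").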

02. Multiclass Classification w/ Random Forest and Transformer-based Embeddings

[Notebook] [Open in Colab]

Tags: [multiclass-classification] [embeddings] [hyperparameter-tuning] [random-forest] [cohere] [countvectorizer]

This notebook illustrates how to train a random-forest model with hyperparameter tuning for multiclass classification. It assesses the performance of combining the random forest with transformer-based embeddings.

It achieves 88.80% accuracy with approximately 200 training samples per class.

Accuracy: 88.80%

              precision    recall  f1-score   support

    Business       0.85      0.82      0.83        55
    Sci/Tech       0.89      0.85      0.87        65
      Sports       0.90      0.93      0.91        69
       World       0.91      0.95      0.93        61

    accuracy                           0.89       250
   macro avg       0.89      0.89      0.89       250
weighted avg       0.89      0.89      0.89       250
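
The summary rows of the report follow from the per-class rows. As a worked check, the sketch below recomputes them from the values printed above: the macro average is the unweighted mean over classes, and the support-weighted recall coincides with overall accuracy. Because the inputs are already rounded to two decimals, small differences from the reported figures are expected.

```python
# Per-class (precision, recall, support), copied from the report above.
report = {
    "Business": (0.85, 0.82, 55),
    "Sci/Tech": (0.89, 0.85, 65),
    "Sports": (0.90, 0.93, 69),
    "World": (0.91, 0.95, 61),
}

total = sum(support for _, _, support in report.values())

# Macro average: every class counts equally, regardless of its support.
macro_precision = sum(p for p, _, _ in report.values()) / len(report)
macro_recall = sum(r for _, r, _ in report.values()) / len(report)

# Weighted average: each class contributes in proportion to its support.
# For recall this equals accuracy, since recall * support is the number of
# correctly classified samples in that class.
weighted_recall = sum(r * s for _, r, s in report.values()) / total

print(f"total support:   {total}")
print(f"macro precision: {macro_precision:.4f}")
print(f"macro recall:    {macro_recall:.4f}")
print(f"weighted recall: {weighted_recall:.4f}")
```

The weighted recall comes out at roughly 0.89, in line with the reported 88.80% accuracy; the residual gap is due to the two-decimal rounding of the per-class recalls.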

03. Multiclass Classification w/ Cohere-Classify

[Notebook] [Open in Colab]

Tags: [multiclass-classification] [cohere]

This notebook illustrates how to use Cohere Classify for multiclass classification. It achieves 94.74% accuracy with approximately 200 training samples per class.

Accuracy: 94.74%

              precision    recall  f1-score   support

    Business       0.90      0.90      0.90        20
    Sci/Tech       0.96      0.92      0.94        24
      Sports       1.00      0.96      0.98        28
       World       0.92      1.00      0.96        23

    accuracy                           0.95        95
   macro avg       0.94      0.95      0.94        95
weighted avg       0.95      0.95      0.95        95

04. OpenAI Functions w/ LangChain and Pydantic

[Notebook] [Open in Colab]

Tags: [openai] [langchain] [pydantic] [function-calling] [function-creation]

This notebook demonstrates how to combine LangChain and Pydantic as an abstraction layer to facilitate the process of creating OpenAI functions and handling JSON formatting.
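
As a rough illustration of what such an abstraction layer produces, the sketch below builds an OpenAI-style function definition (a name, a description, and JSON Schema parameters) from a typed Python class, then parses a JSON arguments string back into that class. It uses only the standard library: the dataclass stands in for a Pydantic model, and `GetStockPrice`, `to_function_definition`, and the sample arguments are hypothetical names that do not come from the notebook.

```python
import json
from dataclasses import MISSING, dataclass, fields

# Map Python annotations to JSON Schema types (only what this sketch needs).
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


@dataclass
class GetStockPrice:
    """Look up the latest price for a stock ticker."""

    ticker: str
    currency: str = "USD"


def to_function_definition(cls, name):
    """Build a function definition: name, description, JSON Schema parameters."""
    properties, required = {}, []
    for f in fields(cls):
        properties[f.name] = {"type": JSON_TYPES[f.type]}
        # Fields without a default must be supplied by the caller.
        if f.default is MISSING and f.default_factory is MISSING:
            required.append(f.name)
    return {
        "name": name,
        "description": cls.__doc__,
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


function_def = to_function_definition(GetStockPrice, "get_stock_price")
print(json.dumps(function_def, indent=2))

# A model returns function arguments as a JSON string; this one is made up.
raw_arguments = '{"ticker": "AAPL"}'
call = GetStockPrice(**json.loads(raw_arguments))
print(call)
```

Unlike Pydantic, this stand-in performs no type validation or coercion on the parsed arguments, which is a large part of what the real library contributes.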

05. Named Entity Recognition to Enrich Text

[Notebook] [Open in Colab]

Tags: [openai] [named-entity-recognition] [function-calling] [function-creation] [wikipedia]

Named Entity Recognition (NER) is a Natural Language Processing task that identifies and classifies named entities (NE) into predefined semantic categories (such as persons, organizations, locations, events, time expressions, and quantities). By converting raw text into structured information, NER makes data more actionable, facilitating tasks like information extraction, data aggregation, analytics, and social media monitoring.

This notebook demonstrates how to carry out NER with OpenAI Chat Completion and function calling to enrich a block of text with links to a knowledge base such as Wikipedia.

This notebook is also available at openai/openai-cookbook/examples/Named_Entity_Recognition_to_enrich_text.ipynb
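
The enrichment step itself can be sketched without any API call. Assuming the model has already returned a mapping from entity label to mentions (the `entities` dictionary below is hand-written and hypothetical), each mention is wrapped in a Markdown link to a Wikipedia page. The function name `enrich_with_links` and the URL rule (spaces replaced by underscores) are illustrative assumptions; real mentions may need disambiguation before linking.

```python
import re
from urllib.parse import quote


def enrich_with_links(text, entities):
    """Wrap each entity mention in a Markdown link to a Wikipedia page."""
    mentions = {m for group in entities.values() for m in group}
    if not mentions:
        return text

    # Longest mentions first, so a longer name wins over a shorter prefix.
    ordered = sorted(mentions, key=len, reverse=True)
    pattern = "|".join(re.escape(m) for m in ordered)

    def link(match):
        mention = match.group(0)
        url = "https://en.wikipedia.org/wiki/" + quote(mention.replace(" ", "_"))
        return f"[{mention}]({url})"

    # A single substitution pass avoids re-matching inside inserted links.
    return re.sub(pattern, link, text)


# Hypothetical model output: entity label -> list of mentions.
text = "Ada Lovelace worked with Charles Babbage in London."
entities = {"person": ["Ada Lovelace", "Charles Babbage"], "gpe": ["London"]}
print(enrich_with_links(text, entities))
```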

06. Clustering and Topic Modeling of arXiv dataset (10k) w/ Cohere Embedv3 | Pydantic | OpenAI | LangChain

[Notebook] [Open in Colab]

Tags: [clustering] [cohere] [embeddings] [HDBSCAN] [langchain] [pydantic] [topic-modeling] [openai]

We combine the Cohere and GPT-4 Large Language Models with HDBSCAN, Pydantic, and LangChain for clustering and topic modeling. Our playground is a dataset of 10,000 arXiv research documents from Computational Linguistics (Natural Language Processing) published between 2019 and 2023, enriched with title and abstract embeddings generated with Cohere Embedv3 for the clustering task. To assess the clustering and topic modeling effectiveness, we visualize the outcomes after applying UMAP dimensionality reduction.

07. Transformers Self-Attention

[Notebook] [Open in Colab]

Tags: [bertviz] [transformers] [tokenizer] [self-attention]

Transformers have revolutionized the way we approach tasks in NLP. At their core lies self-attention, a mechanism that allows a model to weigh the importance of each element of a sequence (token embeddings) relative to the others. This introductory notebook explores self-attention through bertviz visualizations at the model, head, and neuron levels.
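
For reference, the weighting that attention visualizations depict is scaled dot-product attention: each token's output is a mix of value vectors, with weights given by softmax(QK^T / sqrt(d_k)). The pure-Python sketch below uses a single head and, as a simplifying assumption, sets Q = K = V to the token embeddings (no learned projections). The toy embeddings are invented and the function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product self-attention with Q = K = V = x."""
    d_k = len(x[0])
    weights = []
    for q in x:
        # Similarity of this token's query with every token's key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in x]
        weights.append(softmax(scores))
    # Each output is the attention-weighted sum of the value vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, x)) for j in range(d_k)]
        for row in weights
    ]
    return weights, outputs


tokens = ["the", "cat", "sat"]
embeddings = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
]
weights, outputs = self_attention(embeddings)
for tok, row in zip(tokens, weights):
    print(tok, [round(w, 3) for w in row])
```

Each printed row is one token's attention distribution over the sequence and sums to 1.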