GitHub - maliozer/The-NLP-Pandect: A comprehensive reference for all topics related to Natural Language Processing

This pandect (πανδέκτης is Ancient Greek for encyclopedia) was created to help you find almost anything related to Natural Language Processing that is available online.

Compendiums and awesome lists on the topic of NLP:

The NLP Index by Quantum Stat / NLP Cypher
Awesome NLP by keon [GitHub, 12821 stars]
Speech and Natural Language Processing Awesome List by elaboshira [GitHub, 2065 stars]
Awesome Deep Learning for Natural Language Processing (NLP) [GitHub, 1010 stars]
Text Mining and Natural Language Processing Resources by stepthom [GitHub, 454 stars]
Made with ML List by madewithml.com
Brainsources for #NLP enthusiasts by Philip Vollet
Awesome AI/ML/DL - NLP Section [GitHub, 1010 stars]
Resources on various machine learning topics by Backprop

NLP Conferences, Paper Summaries and Paper Compendiums:

Non-English resources and compendiums

NLP Resources for Bahasa Indonesia [GitHub, 238 stars]
Indic NLP Catalog [GitHub, 327 stars]
Pre-trained language models for Vietnamese [GitHub, 421 stars]
Natural Language Toolkit for Indic Languages (iNLTK) [GitHub, 748 stars]
Indic NLP Library [GitHub, 405 stars]
AI4Bharat-IndicNLP Portal
ARBML - Implementation of many Arabic NLP and ML projects [GitHub, 211 stars]
zemberek-nlp - NLP tools for Turkish [GitHub, 970 stars]
KLUE - Korean Language Understanding Evaluation [GitHub, 398 stars]
Persian NLP Benchmark - benchmark for evaluation and comparison of various NLP tasks in the Persian language [GitHub, 61 stars]

Pre-trained NLP models

List of pre-trained NLP models [GitHub, 150 stars]
Pretrained language models developed by Huawei Noah's Ark Lab [GitHub, 2104 stars]
Spanish Language Models and resources [GitHub, 180 stars]

NLP History

General

History of Natural Language Processing
A Review of the Neural History of Natural Language Processing [Blog, October 2018]

2020 Year in Review

Natural Language Processing in 2020: The Year In Review [Blog, December 2020]
ML and NLP Research Highlights of 2020 [Blog, January 2021]

NLP-only podcasts

NLP Highlights [Years: 2017 - now, Status: active]
The NLP Zone Episodes [Years: 2021 - now, Status: active]

Many NLP episodes

TWIML AI [Years: 2016 - now, Status: active]
Practical AI [Years: 2018 - now, Status: active]
The Data Exchange [Years: 2019 - now, Status: active]
Gradient Dissent [Years: 2020 - now, Status: active]
Machine Learning Street Talk [Years: 2020 - now, Status: active]
DataFramed - latest trends and insights on how to scale the impact of data science in organizations [Years: 2019 - now, Status: active]

Some NLP episodes

The Super Data Science Podcast [Years: 2016 - now, Status: active]
Data Hack Radio [Years: 2018 - now, Status: active]
AI Game Changers [Years: 2020 - now, Status: active]
The Analytics Show [Years: 2019 - now, Status: active]

General NLU

GLUE - General Language Understanding Evaluation (GLUE) benchmark
SuperGLUE - benchmark styled after GLUE with a new set of more difficult language understanding tasks
decaNLP - The Natural Language Decathlon (decaNLP) for studying general NLP models
RACE - ReAding Comprehension dataset collected from English Examinations
dialoglue - DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue
DynaBench - Dynabench is a research platform for dynamic data collection and benchmarking

Summarization

WikiAsp - WikiAsp: Multi-document aspect-based summarization Dataset
WikiLingua - A Multilingual Abstractive Summarization Dataset

Question Answering

SQuAD - Stanford Question Answering Dataset (SQuAD)
XQuAD - Cross-lingual Question Answering Dataset (XQuAD)
GrailQA - Strongly Generalizable Question Answering (GrailQA)
CSQA - Complex Sequential Question Answering

Multilingual and Non-English Benchmarks

XTREME - Massively Multilingual Multi-task Benchmark
GLUECoS - A benchmark for code-switched NLP
IndoNLU Benchmark - collection of resources for training, evaluating, and analyzing NLP for Bahasa Indonesia
IndicGLUE - Natural Language Understanding Benchmark for Indic Languages
LinCE - Linguistic Code-Switching Evaluation Benchmark
Russian SuperGLUE - Russian SuperGLUE Benchmark

Bio, Law, and other scientific domains

BLURB - Biomedical Language Understanding and Reasoning Benchmark
BLUE - Biomedical Language Understanding Evaluation benchmark
LexGLUE - A Benchmark Dataset for Legal Language Understanding in English

Transformer Efficiency

Long-Range Arena - Long Range Arena for Benchmarking Efficient Transformers (Pre-print) [GitHub, 393 stars]

Speech Processing

SUPERB - Speech processing Universal PERformance Benchmark

Other

CodeXGLUE - A benchmark dataset for code intelligence
CrossNER - CrossNER: Evaluating Cross-Domain Named Entity Recognition
MultiNLI - Multi-Genre Natural Language Inference corpus
iSarcasm: A Dataset of Intended Sarcasm - iSarcasm is a dataset of tweets, each labelled as either sarcastic or non_sarcastic

General

A Recipe for Training Neural Networks by Andrej Karpathy [Keywords: research, training, 2019]

Embeddings

Repositories

Pre-trained ELMo Representations for Many Languages [GitHub, 1374 stars]
sense2vec - Contextually-keyed word vectors [GitHub, 1340 stars]
wikipedia2vec [GitHub, 751 stars]
StarSpace [GitHub, 3719 stars]
fastText [GitHub, 23354 stars]

Blogs

Language Models and Contextualised Word Embeddings by David S. Batista [Blog, 2018]
An Essential Guide to Pretrained Word Embeddings for NLP Practitioners by AnalyticsVidhya [Blog, 2020]
Polyglot Word Embeddings Discover Language Clusters [Blog, 2020]
The Illustrated Word2vec by Jay Alammar [Blog, 2019]

Cross-lingual Word and Sentence Embeddings

vecmap - VecMap (cross-lingual word embedding mappings) [GitHub, 573 stars]
sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 7021 stars]

Byte Pair Encoding

bpemb - Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) [GitHub, 1031 stars]
subword-nmt - Unsupervised Word Segmentation for Neural Machine Translation and Text Generation [GitHub, 1840 stars]
python-bpe - Byte Pair Encoding for Python [GitHub, 163 stars]

Transformer-based Architectures

General

The Transformer Family by Lilian Weng [Blog, 2020]
Keeping up with the BERTs: a review of the main NLP benchmarks by Manuel Tonneau [Blog, 2020]
Playing the lottery with rewards and multiple languages - about the effect of random initialization [ICLR 2020 Paper]
Attention? Attention! by Lilian Weng [Blog, 2018]
the transformer … “explained”? [Blog, 2019]
Attention is all you need; Attentional Neural Network Models by Łukasz Kaiser [Talk, 2017]
Understanding and Applying Self-Attention for NLP [Talk, 2018]
The NLP Cookbook: Modern Recipes for Transformer based Deep Learning Architectures [Paper, April 2021]
Pre-Trained Models: Past, Present and Future [Paper, June 2021]
A Survey of Transformers [Paper, June 2021]

Transformer

The Annotated Transformer by Harvard NLP [Blog, 2018]
The Illustrated Transformer by Jay Alammar [Blog, 2018]
Illustrated Guide to Transformers by Hong Jing [Blog, 2020]
Sequential Transformer with Adaptive Attention Span by Facebook [Blog, 2019]
Evolution of Representations in the Transformer by Lena Voita [Blog, 2019]
Reformer: The Efficient Transformer [Blog, 2020]
Longformer — The Long-Document Transformer by Viktor Karlsson [Blog, 2020]
TRANSFORMERS FROM SCRATCH [Blog, 2019]
Universal Transformers by Mostafa Dehghani [Blog, 2019]
Transformers in Natural Language Processing — A Brief Survey by George Ho [Blog, May 2020]
Lite Transformer - Lite Transformer with Long-Short Range Attention [GitHub, 506 stars]
Transformers from Scratch [Blog, Oct 2021]

BERT

A Visual Guide to Using BERT for the First Time by Jay Alammar [Blog, 2019]
The Dark Secrets of BERT by Anna Rogers [Blog, 2020]
Understanding searches better than ever before [Blog, 2019]
Demystifying BERT: A Comprehensive Guide to the Groundbreaking NLP Framework [Blog, 2019]
SemBERT - Semantics-aware BERT for Language Understanding [GitHub, 251 stars]
BERTweet - BERTweet: A pre-trained language model for English Tweets [GitHub, 412 stars]
Optimal Subarchitecture Extraction for BERT [GitHub, 443 stars]
CharacterBERT: Reconciling ELMo and BERT [GitHub, 132 stars]
When BERT Plays The Lottery, All Tickets Are Winning [Blog, Dec 2020]
BERT-related Papers - a list of BERT-related papers [GitHub, 1784 stars]

Other Transformer Variants

T5

T5 Understanding Transformer-Based Self-Supervised Architectures [Blog, August 2020]
T5: the Text-To-Text Transfer Transformer [Blog, 2020]
multilingual-t5 - Multilingual T5 (mT5) is a massively multilingual pretrained text-to-text transformer model [GitHub, 806 stars]

BigBird

Big Bird: Transformers for Longer Sequences original paper by Google Research [Paper, July 2020]

Reformer / Linformer / Longformer / Performers

Reformer: The Efficient Transformer - [Paper, February 2020] [Video, October 2020]
Longformer: The Long-Document Transformer - [Paper, April 2020] [Video, April 2020]
Linformer: Self-Attention with Linear Complexity - [Paper, June 2020] [Video, June 2020]
Rethinking Attention with Performers - [Paper, September 2020] [Video, September 2020]
performer-pytorch - An implementation of Performer, a linear attention-based transformer, in PyTorch [GitHub, 781 stars]

Switch Transformer

Switch Transformers: Scaling to Trillion Parameter Models original paper by Google Research [Paper, January 2021]

GPT-family

General

The Illustrated GPT-2 by Jay Alammar [Blog, 2019]
The Annotated GPT-2 by Aman Arora
OpenAI’s GPT-2: the model, the hype, and the controversy by Ryan Lowe [Blog, 2019]
How to generate text by Patrick von Platen [Blog, 2020]

GPT-3

Learning Resources

Zero Shot Learning for Text Classification by Amit Chaudhary [Blog, 2020]
GPT-3 A Brief Summary by Leo Gao [Blog, 2020]
GPT-3, a Giant Step for Deep Learning And NLP by Yoel Zeldes [Blog, June 2020]
GPT-3 Language Model: A Technical Overview by Chuan Li [Blog, June 2020]
Is it possible for language models to achieve language understanding? by Christopher Potts

Applications

Awesome GPT-3 - list of all resources related to GPT-3 [GitHub, 3414 stars]
GPT-3 Projects - a map of all GPT-3 start-ups and commercial projects
GPT-3 Demo Showcase - GPT-3 Demo Showcase, 180+ Apps, Examples, & Resources
OpenAI API - API Demo to use GPT-3 for commercial applications

Open-source Efforts

GPT-Neo - in-progress GPT-3 open source replication HuggingFace Hub
GPT-J - A 6 billion parameter, autoregressive text generation model trained on The Pile
Effectively using GPT-J with few-shot learning [Blog, July 2021]

Other

What is Two-Stream Self-Attention in XLNet by Xu LIANG [Blog, 2019]
Visual Paper Summary: ALBERT (A Lite BERT) by Amit Chaudhary [Blog, 2020]
Turing NLG by Microsoft
Multi-Label Text Classification with XLNet by Josh Xin Jie Lee [Blog, 2019]
ELECTRA [GitHub, 1965 stars]
Performer - implementation of Performer, a linear attention-based transformer, in PyTorch [GitHub, 781 stars]

Distillation, Pruning and Quantization

Reading Material

Distilling knowledge from Neural Networks to build smaller and faster models by FloydHub [Blog, 2019]
David over Goliath: towards smaller models for cheaper, faster, and greener NLP by Manuel Tonneau [Blog, 2020]
Compression of Deep Learning Models for Text: A Survey (+Overview of Approaches) [Paper, April 2021]

Tools

Bert-squeeze - code to reduce the size of Transformer-based models or decrease their latency at inference time [GitHub, 54 stars]
XtremeDistil - XtremeDistilTransformers for Distilling Massive Multilingual Neural Networks [GitHub, 93 stars]

Automated Summarization

PEGASUS: A State-of-the-Art Model for Abstractive Text Summarization by Google AI [Blog, June 2020]
CTRLsum - CTRLsum: Towards Generic Controllable Text Summarization [GitHub, 95 stars]
XL-Sum - XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages [GitHub, 136 stars]
SummerTime - an open-source text summarization toolkit for non-experts [GitHub, 162 stars]
PRIMER - PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization [GitHub, 42 stars]
summarus - Models for automatic abstractive summarization [GitHub, 133 stars]

Knowledge Graphs and NLP

Fusing Knowledge into Language Model [Presentation, Oct 2021]

Best Practices for NLP

In Search of Best Practices for NLP Projects [Slides, Dec. 2020]
EMNLP 2020: High Performance Natural Language Processing by Google Research [Slides, Recording, Nov. 2020]
Practical Natural Language Processing - A Comprehensive Guide to Building Real-World NLP Systems [Book, June 2020]
How to Structure and Manage NLP Projects [Blog, May 2021]
Applied NLP Thinking - Applied NLP Thinking: How to Translate Problems into Solutions [Blog, June 2021]
Introduction to NLP for Industry Use - DataTalksClub presentation on Introduction to NLP for Industry Use [Recording, December 2021]

MLOps for NLP

MLOps, especially when applied to NLP, is a set of best practices around automating various parts of the workflow when building and deploying NLP pipelines.

In general, MLOps for NLP includes having the following processes in place:

Data Versioning - make sure your training, annotation and other types of data are versioned and tracked
Experiment Tracking - make sure that all of your experiments are automatically tracked and saved where they can be easily replicated or retraced
Model Registry - make sure any neural models you train are versioned and tracked and it is easy to roll back to any of them
Automated Testing and Behavioral Testing - besides regular unit and integration tests, you want to have behavioral tests that check for bias or potential adversarial attacks
Model Deployment and Serving - automate model deployment, ideally with zero-downtime strategies such as blue/green or canary deploys
Data and Model Observability - track data drift, model accuracy drift etc.
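
As a rough illustration of the first two items, data versioning and experiment tracking can be approximated with a content hash and an append-only run log. This is a minimal standard-library sketch, not how DVC, mlflow or any other tool listed below works internally. The helper names (`dataset_fingerprint`, `log_experiment`, `load_experiments`) and the JSON-lines log format are assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def dataset_fingerprint(path: Path) -> str:
    """Return the SHA-256 hex digest of a data file (hypothetical helper)."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large corpora do not need to fit in memory.
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def log_experiment(log_path: Path, data_path: Path, params: dict, metrics: dict) -> dict:
    """Append one run record (data version, params, metrics) as a JSON line."""
    record = {
        "data_version": dataset_fingerprint(data_path),
        "params": params,
        "metrics": metrics,
    }
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def load_experiments(log_path: Path) -> list:
    """Read back every logged run, oldest first."""
    with log_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Because each run stores the hash of the exact data it was trained on, two runs with different `data_version` values were trained on different data, even when the file name is unchanged.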

Additionally, there are two more components that are not as prevalent for NLP and are mostly used for Computer Vision and other sub-fields of AI:

Feature Store - centralized storage of all features developed for ML models that can be easily reused by any other ML project
Metadata Management - storage for all information related to the usage of ML models, mainly for reproducing behavior of deployed ML models, artifact tracking etc.

Reading Material

MLOps: What It Is, Why it Matters, and How To Implement It by Neptune AI [Blog, July 2021]
Best MLOps Tools You Need to Know as a Data Scientist by Neptune AI [Blog, July 2021]
Robust MLOps - Robust MLOps with Open-Source: ModelDB, Docker, Jenkins and Prometheus [Blog, May 2021]
State of MLOps 2021 by Valohai [Blog, August 2021]
The MLOps Stack by Valohai [Blog, October 2020]
Data Version Control for Machine Learning Applications by Megagon AI [Blog, July 2021]
The Rapid Evolution of the Canonical Stack for Machine Learning [Blog, July 2021]
MLOps: Comprehensive Beginner’s Guide [Blog, March 2021]
What I’ve learned about MLOps from speaking with 100+ ML practitioners [Blog, May 2021]
DataRobot Challenger Models - MLOps Champion/Challenger Models
State of MLOps Blog by Dr. Ori Cohen

Learning Material

MLOps course by Made With ML
GitHub MLOps - collection of resources on how to facilitate Machine Learning Ops with GitHub

MLOps Communities

The MLOps Community - blogs, slack group, newsletter and more all about MLOps

Data Versioning

DVC - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] Link to GitHub
Weights & Biases - tools for experiment tracking and dataset versioning [Paid Service]
Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]

Experiment Tracking

mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
Weights & Biases - tools for experiment tracking and dataset versioning [Paid Service]
Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
SigOpt - automate training & tuning, visualize & compare runs [Paid Service]
Optuna - hyperparameter optimization framework [GitHub, 5924 stars]
Clear ML - experiment, orchestrate, deploy, and build data stores, all in one place [Free and Open Source] Link to GitHub
Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 5285 stars]

Model Registry

DVC - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] Link to GitHub
mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
ModelDB - open-source system for Machine Learning model versioning, metadata, and experiment management [GitHub, 1387 stars]
Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
Valohai - End-to-end ML pipelines [Paid Service]
Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
polyaxon - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]
Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]

Automated Testing and Behavioral Testing

CheckList - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 1583 stars]
TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 1839 stars]
WildNLP - Corrupt an input text to test NLP models' robustness [GitHub, 70 stars]
Great Expectations - Write tests for your data [GitHub, 6058 stars]
Deepchecks - Python package for comprehensively validating your machine learning models and data [GitHub, 1035 stars]
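
To make the idea of a behavioral test concrete, here is a minimal sketch of an invariance check: small, meaning-preserving perturbations (here, a single typo) should not change a model's prediction. It does not use the API of CheckList, TextAttack or any other tool above. `toy_sentiment` is a stand-in keyword classifier, and all function names are hypothetical.

```python
import random


def toy_sentiment(text: str) -> str:
    """Stand-in keyword classifier; a real test would call your model instead."""
    positive = {"good", "great", "excellent"}
    negative = {"bad", "awful", "terrible"}
    tokens = text.lower().split()
    score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    return "positive" if score >= 0 else "negative"


def swap_adjacent_chars(text: str, rng: random.Random) -> str:
    """Introduce one typo by swapping two adjacent inner characters of a word."""
    words = text.split()
    candidates = [i for i, word in enumerate(words) if len(word) > 3]
    if not candidates:
        return text
    i = rng.choice(candidates)
    word = words[i]
    pos = rng.randrange(1, len(word) - 2)
    words[i] = word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]
    return " ".join(words)


def invariance_failures(predict, texts, perturb, n_variants=5, seed=0):
    """Return (original, perturbed) pairs for which the prediction changed."""
    rng = random.Random(seed)
    failures = []
    for text in texts:
        expected = predict(text)
        for _ in range(n_variants):
            variant = perturb(text, rng)
            if predict(variant) != expected:
                failures.append((text, variant))
    return failures
```

A call such as `invariance_failures(toy_sentiment, ["The food was terrible"], swap_adjacent_chars)` returns the perturbed inputs that flipped the label. An empty list means the model was invariant to the sampled typos.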

Model Deployment and Serving

mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
Amazon SageMaker [Paid Service]
Valohai - End-to-end ML pipelines [Paid Service]
NLP Cloud - Production-ready NLP API [Paid Service]
Saturn Cloud [Paid Service]
SELDON - machine learning deployment for enterprise [Paid Service]
Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
polyaxon - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]
TorchServe - flexible and easy to use tool for serving PyTorch models [GitHub, 2407 stars]
Kubeflow - The Machine Learning Toolkit for Kubernetes [GitHub, 10600 stars]
KFServing - Serverless Inferencing on Kubernetes [GitHub, 1327 stars]
TFX - TensorFlow Extended (TFX) is an end-to-end platform for deploying production ML pipelines [Free and Open Source]
Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
Cortex - containers as a service on AWS [Paid Service]
Azure Machine Learning - end-to-end machine learning lifecycle [Paid Service]
End2End Serverless Transformers On AWS Lambda [GitHub, 96 stars]
NLP-Service - sample demo of NLP as a service platform built using FastAPI and Hugging Face [GitHub, 13 stars]
Dagster - data orchestrator for machine learning [Free and Open Source]
Verta - AI and machine learning deployment and operations [Paid Service]
Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 5285 stars]
flyte - workflow automation platform for complex, mission-critical data and ML processes at scale [GitHub, 1902 stars]
MLRun - Machine Learning automation and tracking [GitHub, 565 stars]
DataRobot MLOps - DataRobot MLOps provides a center of excellence for your production AI

Model Debugging

imodels - package for concise, transparent, and accurate predictive modeling [GitHub, 490 stars]
Cockpit - A Practical Debugging Tool for Training Deep Neural Networks [GitHub, 380 stars]

Model Accuracy Prediction

WeightWatcher - WeightWatcher tool for predicting the accuracy of Deep Neural Networks [GitHub, 633 stars]

Data and Model Observability

General

whylogs - open source standard for data and ML logging [GitHub, 750 stars]
Rubrix - open-source tool for exploring and iterating on data for artificial intelligence projects [GitHub, 842 stars]
MLRun - Machine Learning automation and tracking [GitHub, 565 stars]
DataRobot MLOps - DataRobot MLOps provides a center of excellence for your production AI
Cortex - containers as a service on AWS [Paid Service]

Model Centric

Algorithmia - minimize risk with advanced reporting and enterprise-grade security and governance across all data, models, and infrastructure [Paid Service]
Dataiku - dataiku is for teams who want to deliver advanced analytics using the latest techniques at big data scale [Paid Service]
Evidently AI - tools to analyze and monitor machine learning models [Free and Open Source] Link to GitHub
Fiddler - ML Model Performance Management Tool [Paid Service]
Hydrosphere - open-source platform for managing ML models [Paid Service]
Verta - AI and machine learning deployment and operations [Paid Service]
Domino Model Ops - Deploy and Manage Models to Drive Business Impact [Paid Service]
iguazio - deployment and management of your AI applications with MLOps and end-to-end automation of machine learning pipelines [Paid Service]

Data Centric

Datafold - data quality through diffs, profiling, and anomaly detection [Paid Service]
acceldata - improve reliability, accelerate scale, and reduce costs across all data pipelines [Paid Service]
Bigeye - monitoring and alerting to your datasets in minutes [Paid Service]
datakin - end-to-end, real-time data lineage solution [Paid Service]
Monte Carlo - data integrity, drifts, schema, lineage [Paid Service]
SODA - data monitoring, testing and validation [Paid Service]
whatify - data quality and action recommendation on it [Paid Service]
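
For text data, one simple drift signal is the share of incoming tokens that never appeared in the reference (training) data. The sketch below assumes whitespace tokenization and an arbitrary alert threshold of 0.2. Both are illustrative choices, the function names are hypothetical, and none of the services above is claimed to work this way.

```python
from collections import Counter


def build_vocabulary(texts, min_count=1):
    """Collect lowercased whitespace tokens seen at least min_count times."""
    counts = Counter(token for text in texts for token in text.lower().split())
    return {token for token, count in counts.items() if count >= min_count}


def oov_rate(texts, vocabulary):
    """Fraction of tokens in texts that are missing from the vocabulary."""
    tokens = [token for text in texts for token in text.lower().split()]
    if not tokens:
        return 0.0
    unseen = sum(token not in vocabulary for token in tokens)
    return unseen / len(tokens)


def drift_alert(reference_texts, live_texts, threshold=0.2):
    """Return (out-of-vocabulary rate, whether it exceeds the assumed threshold)."""
    vocabulary = build_vocabulary(reference_texts)
    rate = oov_rate(live_texts, vocabulary)
    return rate, rate > threshold
```

In practice the rate would be computed per time window and compared against a baseline measured on held-out reference data rather than a fixed constant.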

Feature Stores

Tecton - enterprise feature store for machine learning [Paid Service]
FEAST - open source feature store for machine learning [GitHub, 2841 stars]
Hopsworks Feature Store - data management system for managing machine learning features [Paid Service]

Metadata Management

ML Metadata - a library for recording and retrieving metadata associated with ML developer and data scientist workflows [GitHub, 431 stars]
Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]

MLOps Frameworks

Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 5285 stars]
kedro - Python framework for creating reproducible, maintainable and modular data science code [GitHub, 6453 stars]
Seldon Core - MLOps framework to package, deploy, monitor and manage thousands of production machine learning models [GitHub, 2912 stars]
ZenML - MLOps framework to create reproducible ML pipelines for production machine learning [GitHub, 1646 stars]
Google Vertex AI - build, deploy, and scale ML models faster, with pre-trained and custom tooling within a unified AI platform [Paid Service]
Diffgram - Complete training data platform for machine learning delivered as a single application [GitHub, 682 stars]

Transformer-based Architectures

General

Why BERT Fails in Commercial Environments by Intel AI [Blog, 2020]
Fine Tuning BERT for Text Classification with FARM by Sebastian Guggisberg [Blog, 2020]
Pretrain Transformers Models in PyTorch using Hugging Face Transformers [GitHub, 150 stars]
Practical NLP for the Real World [Presentation, 2019]
From Paper to Product – How we implemented BERT by Christoph Henkelmann [Talk, 2020]

Multi-GPU Transformers

Parallelformers: An Efficient Model Parallelization Toolkit for Deployment [GitHub, 436 stars]

Training Transformers Effectively

Training BERT with Compute/Time (Academic) Budget [GitHub, 179 stars]

Embeddings as a Service

embedding-as-service [GitHub, 165 stars]
Bert-as-service [GitHub, 9870 stars]

NLP Recipes Industrial Applications:

NLP Recipes by microsoft [GitHub, 5814 stars]
NLP with Python by susanli2016 [GitHub, 2232 stars]
Basic Utilities for PyTorch NLP by PetrochukM [GitHub, 2018 stars]

NLP Applications in Bio, Finance, Legal and other industries

Blackstone - A spaCy pipeline and model for NLP on unstructured legal text [GitHub, 527 stars]
Sci spaCy - spaCy pipeline and models for scientific/biomedical documents [GitHub, 1089 stars]
FinBERT: Pre-Trained on SEC Filings for Financial NLP Tasks [GitHub, 160 stars]
LexNLP - Information retrieval and extraction for real, unstructured legal text [GitHub, 502 stars]
NerDL and NerCRF - Tutorial on Named Entity Recognition for Healthcare with SparkNLP
Legal Text Analytics - A list of selected resources dedicated to Legal Text Analytics [GitHub, 346 stars]
BioIE - A curated list of resources relevant to doing Biomedical Information Extraction [GitHub, 181 stars]

General Speech Recognition

wav2letter - Automatic Speech Recognition Toolkit [GitHub, 5931 stars]
DeepSpeech - Baidu's DeepSpeech architecture [GitHub, 19002 stars]
Acoustic Word Embeddings by Maria Obedkova [Blog, 2020]
kaldi - Kaldi is a toolkit for speech recognition [GitHub, 11306 stars]
awesome-kaldi - resources for using Kaldi [GitHub, 479 stars]
ESPnet - End-to-End Speech Processing Toolkit [GitHub, 4674 stars]
HuBERT - Self-supervised representation learning for speech recognition, generation, and compression [Blog, June 2021]

Text to Speech

FastSpeech - The Implementation of FastSpeech based on PyTorch [GitHub, 675 stars]
TTS - a deep learning toolkit for Text-to-Speech [GitHub, 3915 stars]

Datasets

VoxPopuli - large-scale multilingual speech corpus for representation learning [GitHub, 332 stars]

Blogs

Topic Modelling with PySpark and Spark NLP by Maria Obedkova [Spark, Blog, 2020]
A Unique Approach to Short Text Clustering (Algorithmic Theory) by Brittany Bowers [Blog, 2020]

Frameworks for Topic Modeling

gensim - framework for topic modeling [GitHub, 12896 stars]
Spark NLP [GitHub, 2609 stars]

Repositories

Top2Vec [GitHub, 1537 stars]
Anchored Correlation Explanation Topic Modeling [GitHub, 274 stars]
Topic Modeling in Embedding Spaces [GitHub, 410 stars] Paper
TopicNet - A high-level interface for BigARTM library [GitHub, 118 stars]
BERTopic - Leveraging BERT and a class-based TF-IDF to create easily interpretable topics [GitHub, 1938 stars]
OCTIS - A python package to optimize and evaluate topic models [GitHub, 289 stars]
Contextualized Topic Models [GitHub, 780 stars]
GSDMM - GSDMM: Short text clustering [GitHub, 249 stars]

Text Rank

PyTextRank - PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension [GitHub, 1738 stars]
textrank - TextRank implementation for Python 3 [GitHub, 1076 stars]

RAKE - Rapid Automatic Keyword Extraction

rake-nltk - Rapid Automatic Keyword Extraction algorithm using NLTK [GitHub, 879 stars]
yake - Single-document unsupervised keyword extraction [GitHub, 932 stars]
RAKE-tutorial - A Python implementation of the Rapid Automatic Keyword Extraction [GitHub, 361 stars]

Other Approaches

flashtext - Extract Keywords from sentence or Replace keywords in sentences [GitHub, 5048 stars]
BERT-Keyword-Extractor - Deep Keyphrase Extraction using BERT [GitHub, 198 stars]
keyBERT - Minimal keyword extraction with BERT [GitHub, 1152 stars]

NLP and ML Interpretability

NLP-centric

Explainability for Natural Language Processing - KDD'2021 Tutorial Slides [Presentation, August 2021]
ecco - Tools to visualize and explore NLP language models [GitHub, 1316 stars]
NLP Profiler - A simple NLP library that allows profiling datasets with text columns [GitHub, 214 stars]
transformers-interpret - Model explainability that works seamlessly with transformers [GitHub, 551 stars]
Awesome-explainable-AI - collection of research materials on explainable AI/ML [GitHub, 535 stars]
LAMA - LAMA is a probe for analyzing the factual and commonsense knowledge contained in pretrained language models [GitHub, 869 stars]

General

Language Interpretability Tool (LIT) [GitHub, 2811 stars]
WhatLies - Toolkit to help visualise what lies in word embeddings [GitHub, 334 stars]
Interpret-Text - Interpretability techniques and visualization dashboards for NLP models [GitHub, 300 stars]
InterpretML - Fit interpretable models. Explain blackbox machine learning [GitHub, 4539 stars]
thermostat - Collection of NLP model explanations and accompanying analysis tools [GitHub, 87 stars]
Dodrio - Exploring attention weights in transformer-based models with linguistic knowledge [GitHub, 204 stars]
imodels - package for concise, transparent, and accurate predictive modeling [GitHub, 490 stars]

Ethics, Bias, and Equality in NLP

Bias in Natural Language Processing @EMNLP 2020 [Blog, Nov 2020]
Machine Learning as a Software Engineering Enterprise - NeurIPS 2020 Keynote [Presentation, Dec 2020]
Computational Ethics for NLP - course resources from Carnegie Mellon University [Lecture Notes, Spring 2020]
Ethics in NLP - resources from ACL's Ethics in NLP track
The Institute for Ethical AI & Machine Learning
Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models [Paper, Feb 2021]
Fairness-in-AI - this package is used to detect and mitigate biases in NLP tasks [GitHub, 12 stars]
nlg-bias - dataset + classifier tools to study social perception biases in natural language generation [GitHub, 35 stars]

Adversarial Attacks for NLP

Privacy Considerations in Large Language Models [Blog, Dec 2020]
DeepWordBug - Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers [GitHub, 51 stars]
Adversarial-Misspellings - Combating Adversarial Misspellings with Robust Word Recognition [GitHub, 47 stars]

Hate Speech Analysis

HateXplain - BERT for detecting abusive language [GitHub, 94 stars]

General Purpose

spaCy by Explosion AI [GitHub, 22381 stars]
flair by Zalando [GitHub, 11232 stars]
AllenNLP by AI2 [GitHub, 10794 stars]
stanza (formerly Stanford NLP) [GitHub, 5984 stars]
spaCy stanza [GitHub, 612 stars]
nltk [GitHub, 10434 stars]
gensim - framework for topic modeling [GitHub, 12896 stars]
pororo - Platform of neural models for natural language processing [GitHub, 1081 stars]
NLP Architect - A Deep Learning NLP/NLU library by Intel® AI Lab [GitHub, 2797 stars]
FARM [GitHub, 1466 stars]
gobbli by RTI International [GitHub, 263 stars]
headliner - training and deployment of seq2seq models [GitHub, 229 stars]
SyferText - A privacy preserving NLP framework [GitHub, 185 stars]
DeText - Text Understanding Framework for Ranking and Classification Tasks [GitHub, 1202 stars]
TextHero - Text preprocessing, representation and visualization [GitHub, 2431 stars]
textblob - TextBlob: Simplified Text Processing [GitHub, 8033 stars]
AdaptNLP - A high level framework and library for NLP [GitHub, 386 stars]
textacy - NLP, before and after spaCy [GitHub, 1875 stars]
texar - Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow [GitHub, 2243 stars]
jiant - jiant is an NLP toolkit [GitHub, 1371 stars]

Data Augmentation

WildNLP - Text manipulation library to test NLP models [GitHub, 70 stars]
snorkel - Framework to generate training data [GitHub, 5016 stars]
NLPAug - Data augmentation for NLP [GitHub, 2915 stars]
SentAugment - Data augmentation by retrieving similar sentences from larger datasets [GitHub, 350 stars]
faker - Python package that generates fake data for you [GitHub, 13729 stars]
textflint - Unified Multilingual Robustness Evaluation Toolkit for NLP [GitHub, 537 stars]
Parrot - Practical and feature-rich paraphrasing framework [GitHub, 382 stars]
AugLy - data augmentations library for audio, image, text, and video [GitHub, 4290 stars]
TextAugment - Python 3 library for augmenting text for natural language processing applications [GitHub, 210 stars]

Adversarial NLP Attacks & Behavioral Testing

TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 1839 stars]
CleverHans - adversarial example library for constructing NLP attacks and building defenses [GitHub, 5399 stars]
CheckList - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 1583 stars]

Transformer-oriented

transformers by HuggingFace [GitHub, 57928 stars]
Adapter Hub and its documentation - Adapter modules for Transformers [GitHub, 661 stars]
haystack - Transformers at scale for question answering & neural search. [GitHub, 4137 stars]

Dialog Systems and Speech

DeepPavlov by MIPT [GitHub, 5592 stars]
ParlAI by FAIR [GitHub, 8608 stars]
rasa - Framework for Conversational Agents [GitHub, 13537 stars]
wav2letter - Automatic Speech Recognition Toolkit [GitHub, 5931 stars]
ChatterBot - conversational dialog engine for creating chat bots [GitHub, 11982 stars]
SpeechBrain - open-source and all-in-one speech toolkit based on PyTorch [GitHub, 3698 stars]

Word/Sentence-embeddings oriented

MUSE - A library for Multilingual Unsupervised or Supervised word Embeddings [GitHub, 2915 stars]
vecmap - A framework to learn cross-lingual word embedding mappings [GitHub, 573 stars]
sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 7021 stars]

Social Media Oriented

Ekphrasis - text processing tool, geared towards text from social networks [GitHub, 530 stars]

Phonetics

DeepPhonemizer - grapheme to phoneme conversion with deep learning [GitHub, 125 stars]

Morphology

LemmInflect - python module for English lemmatization and inflection [GitHub, 149 stars]
Inflect - generate plurals, ordinals, indefinite articles [GitHub, 612 stars]
simplemma - simple multilingual lemmatizer for Python [GitHub, 612 stars]

Multi-lingual tools

polyglot - Multi-lingual NLP Framework [GitHub, 1954 stars]
trankit - Light-Weight Transformer-based Python Toolkit for Multilingual NLP [GitHub, 572 stars]

Distributed NLP / Multi-GPU NLP

Spark NLP [GitHub, 2609 stars]
Parallelformers: An Efficient Model Parallelization Toolkit for Deployment [GitHub, 436 stars]

Machine Translation

COMET - A Neural Framework for MT Evaluation [GitHub, 121 stars]
marian-nmt - Fast Neural Machine Translation in C++ [GitHub, 861 stars]
argos-translate - Open source neural machine translation in Python [GitHub, 1023 stars]
Opus-MT - Open neural machine translation models and web services [GitHub, 184 stars]
dl-translate - A deep learning-based translation library built on Huggingface transformers [GitHub, 190 stars]

Entity and String Matching

PolyFuzz - Fuzzy string matching, grouping, and evaluation [GitHub, 423 stars]
pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 682 stars]
fuzzywuzzy - Fuzzy String Matching in Python [GitHub, 8603 stars]
jellyfish - approximate and phonetic matching of strings [GitHub, 1610 stars]
textdistance - Compute distance between sequences [GitHub, 2596 stars]
DeepMatcher - Python package for performing entity and text matching using deep learning [GitHub, 401 stars]
RE2 - Simple and Effective Text Matching with Richer Alignment Features [GitHub, 317 stars]
Machamp - A Generalized Entity Matching Benchmark [GitHub, 3 stars]

Discourse Analysis

ConvoKit - Cornell Conversational Analysis Toolkit [GitHub, 331 stars]

PII scrubbing

scrubadub - Clean personally identifiable information from dirty dirty text [GitHub, 264 stars]

Hashtag Segmentation

hashformers - automatically inserting the missing spaces between the words in a hashtag [GitHub, 22 stars]

Books Analysis / Literary Analysis

booknlp - a natural language processing pipeline that scales to books and other long documents (in English) [GitHub, 472 stars]
bookworm - ingests novels, builds an implicit character network and a deeply analysable graph [GitHub, 70 stars]

Non-English oriented

Japanese

fugashi - Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis [GitHub, 190 stars]
SudachiPy - SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer [GitHub, 275 stars]
Konoha - easy-to-use Japanese Text Processing tool, which makes it possible to switch tokenizers with small changes of code [GitHub, 171 stars]
jProcessing - Japanese Natural Language Processing Libraries [GitHub, 142 stars]
Ginza - Japanese NLP Library using spaCy as framework based on Universal Dependencies [GitHub, 548 stars]
kuromoji - self-contained and very easy to use Japanese morphological analyzer designed for search [GitHub, 790 stars]
nagisa - Japanese tokenizer based on recurrent neural networks [GitHub, 290 stars]
KyTea - Kyoto Text Analysis Toolkit for word segmentation and pronunciation estimation [GitHub, 184 stars]
Jigg - Pipeline framework for easy natural language processing [GitHub, 68 stars]
Juman++ - Juman++ (a Morphological Analyzer Toolkit) [GitHub, 277 stars]
RakutenMA - morphological analyzer (word segmentor + PoS Tagger) for Chinese and Japanese written purely in JavaScript [GitHub, 443 stars]
toiro - a comparison tool of Japanese tokenizers [GitHub, 100 stars]

Other

textblob-de - TextBlob: Simplified Text Processing for German [GitHub, 90 stars]
Kashgari - Transfer Learning with focus on Chinese [GitHub, 2256 stars]
Underthesea - Vietnamese NLP Toolkit [GitHub, 930 stars]
PTT5 - Pretraining and validating the T5 model on Brazilian Portuguese data [GitHub, 53 stars]

Text Data Labelling

Small-Text - Active Learning for Text Classification in Python [GitHub, 172 stars]
Doccano - open source annotation tool for machine learning practitioners [GitHub, 5761 stars]
Prodigy - annotation tool powered by active learning [Paid Service]

General

Learn NLP the practical way [Blog, Nov. 2019]
Learn NLP the Stanford way (+Part 2) [Blog, Nov 2020]
Choosing the right course for a Practical NLP Engineer
12 Best Natural Language Processing Courses & Tutorials to Learn Online
Treasure of Transformers - Natural Language processing papers, videos, blogs, official repos along with colab Notebooks [GitHub, 256 stars]

Courses

NLP Course | For You - Great and interactive course on NLP
OpenClass NLP - Natural language processing (NLP) assignments
Advanced NLP with spaCy - how to use spaCy to build advanced natural language understanding systems
Transformer models for NLP by HuggingFace
Stanford NLP Seminar - slides from the Stanford NLP course
Applied Language Technology - Natural Language Processing for Linguists

Books

Natural Language Processing with Transformers - [Book, February 2022]
Applied Natural Language Processing in the Enterprise - [Book, May 2021]
Practical Natural Language Processing - [Book, June 2020]
Dive into Deep Learning - An interactive deep learning book with code, math, and discussions
Natural Language Processing and Computational Linguistics - Speech, Morphology and Syntax (Cognitive Science)
Top NLP Books to Read 2020 - Blog post by Raymond Cheng [Blog, Sep 2020]

Tutorials

nlp-tutorial - A list of NLP (Natural Language Processing) tutorials built on PyTorch [GitHub, 1284 stars]
nlp-tutorial - Natural Language Processing Tutorial for Deep Learning Researchers [GitHub, 10105 stars]
Hands-On NLTK Tutorial [GitHub, 469 stars]
Modern Practical Natural Language Processing [GitHub, 259 stars]
Transformers-Tutorials - demos with the Transformers library by HuggingFace [GitHub, 1089 stars]

r/LanguageTechnology - NLP Reddit forum

Tokenization

tokenizers - Fast State-of-the-Art Tokenizers optimized for Research and Production [GitHub, 5225 stars]
SentencePiece - Unsupervised text tokenizer for Neural Network-based text generation [GitHub, 5652 stars]
SoMaJo - A tokenizer and sentence splitter for German and English web and social media texts [GitHub, 100 stars]

Data Augmentation and Weak Supervision

Libraries and Frameworks

WildNLP - Text manipulation library to test NLP models [GitHub, 70 stars]
NLPAug - Data augmentation for NLP [GitHub, 2915 stars]
SentAugment - Data augmentation by retrieving similar sentences from larger datasets [GitHub, 350 stars]
TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 1839 stars]
skweak - software toolkit for weak supervision applied to NLP tasks [GitHub, 645 stars]
NL-Augmenter - Collaborative Repository of Natural Language Transformations [GitHub, 553 stars]
EDA - Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks [GitHub, 1177 stars]
snorkel - Framework to generate training data [GitHub, 5016 stars]

Reading Material and Tutorials

A Survey of Data Augmentation Approaches for NLP [Paper, May 2021] GitHub Link
A Visual Survey of Data Augmentation in NLP [Blog, 2020]
Weak Supervision: A New Programming Paradigm for Machine Learning [Blog, March 2019]

Named Entity Recognition (NER)

Datasets for Entity Recognition [GitHub, 1099 stars]
Datasets to train supervised classifiers for Named-Entity Recognition [GitHub, 273 stars]
Bootleg - Self-Supervision for Named Entity Disambiguation at the Tail [GitHub, 149 stars]
Few-NERD - Large-scale, fine-grained manually annotated named entity recognition dataset [GitHub, 230 stars]

Relation Extraction

tacred-relation - TACRED: position-aware attention model for relation extraction [GitHub, 310 stars]
tacrev - TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [GitHub, 49 stars]
tac-self-attention - Relation extraction with position-aware self-attention [GitHub, 61 stars]
Re-TACRED - Addressing Shortcomings of the TACRED Dataset [GitHub, 31 stars]

Coreference Resolution

NeuralCoref 4.0: Coreference Resolution in spaCy with Neural Networks by HuggingFace [GitHub, 2478 stars]
coref - BERT and SpanBERT for Coreference Resolution [GitHub, 356 stars]

Domain Adaptation

Neural Adaptation in Natural Language Processing - curated list [GitHub, 202 stars]

Low Resource NLP

CMU LTI Low Resource NLP Bootcamp 2020 - CMU Language Technologies Institute low resource NLP bootcamp 2020 [GitHub, 529 stars]

Spell Correction

Gramformer - framework for detecting, highlighting and correcting grammatical errors [GitHub, 911 stars]
NeuSpell - A Neural Spelling Correction Toolkit [GitHub, 412 stars]
SymSpellPy - Python port of SymSpell [GitHub, 529 stars]
Speller100 by Microsoft [Blog, Feb 2021]
JamSpell - spell checking library - accurate, fast, multi-language [GitHub, 472 stars]

Style Transfer for NLP

Styleformer - Neural Language Style Transfer framework [GitHub, 322 stars]
StylePTB - A Compositional Benchmark for Fine-grained Controllable Text Style Transfer [GitHub, 33 stars]

Automata Theory for NLP

pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 682 stars]

Obscene words detection

LDNOOBW - List of Dirty, Naughty, Obscene, and Otherwise Bad Words [GitHub, 1679 stars]

Reddit Analysis

Subreddit Analyzer - comprehensive Data and Text Mining workflow for submissions and comments from any given public subreddit [GitHub, 475 stars]

Skill Detection

SkillNER - rule based NLP module to extract job skills from text [GitHub, 33 stars]

Reinforcement Learning for NLP

nlp-gym - NLPGym - A toolkit to develop RL agents to solve NLP tasks [GitHub, 109 stars]

AutoML / AutoNLP

AutoNLP - Faster and easier training and deployments of SOTA NLP models [GitHub, 679 stars]
TPOT - Python Automated Machine Learning tool [GitHub, 8434 stars]
Auto-PyTorch - Automatic architecture search and hyperparameter optimization for PyTorch [GitHub, 1535 stars]
HungaBunga - Brute-Force all sklearn models with all parameters using .fit .predict [GitHub, 656 stars]
AutoML Natural Language - Google's paid AutoML NLP service
Optuna - hyperparameter optimization framework [GitHub, 5924 stars]
FLAML - fast and lightweight AutoML library [GitHub, 1741 stars]
Gradsflow - open-source AutoML & PyTorch Model Training Library [GitHub, 267 stars]

Text Generation

keytotext - a model which will take keywords as inputs and generate sentences as outputs [GitHub, 218 stars]
Controllable Neural Text Generation [Blog, Jan 2021]
BARTScore - Evaluating Generated Text as Text Generation [GitHub, 96 stars]

Title / Headlines Generation

TitleStylist - Learning to Generate Headlines with Controlled Styles [GitHub, 56 stars]

License CC0

Attributions

Resources

All linked resources belong to original authors

Icons

Akropolis by parkjisun from the Noun Project
Book of Ester by Gilad Sotil from the Noun Project
quill by Juan Pablo Bravo from the Noun Project
acting by Flatart from the Noun Project
olympic by supalerk laipawat from the Noun Project
aristocracy by Eucalyp from the Noun Project
Horn by Eucalyp from the Noun Project
temple by Eucalyp from the Noun Project
constellation by Eucalyp from the Noun Project
ancient greek round pattern by Olena Panasovska from the Noun Project
Harp by Vectors Point from the Noun Project
Atlas by parkjisun from the Noun Project
Parthenon by Eucalyp from the Noun Project
papyrus by IconMark from the Noun Project
papyrus by Smalllike from the Noun Project
pegasus by Saeful Muslim from the Noun Project

Fonts

Dalek Font
