A comprehensive reference for all topics related to Natural Language Processing

maliozer/The-NLP-Pandect

The-NLP-Pandect

This pandect (πανδέκτης is Ancient Greek for encyclopedia) was created to help you find almost anything related to Natural Language Processing that is available online.

The-NLP-Resources

Compendiums and awesome lists on the topic of NLP:

NLP Conferences, Paper Summaries and Paper Compendiums:

Papers and Paper Summaries
Conferences

NLP Progress and NLP Tasks:

NLP Datasets:

Word and Sentence embeddings:

Notebooks, Scripts and Repositories

Non-English resources and compendiums

Pre-trained NLP models

NLP History

General
2020 Year in Review

The-NLP-Podcasts

NLP-only podcasts

Many NLP episodes

Some NLP episodes

The-NLP-Newsletter

The-NLP-Meetups

The-NLP-Youtube

The-NLP-Benchmarks

General NLU

  • GLUE - General Language Understanding Evaluation (GLUE) benchmark
  • SuperGLUE - benchmark styled after GLUE with a new set of more difficult language understanding tasks
  • decaNLP - The Natural Language Decathlon (decaNLP) for studying general NLP models
  • RACE - ReAding Comprehension dataset collected from English Examinations
  • dialoglue - DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue
  • DynaBench - Dynabench is a research platform for dynamic data collection and benchmarking

Summarization

  • WikiAsp - WikiAsp: Multi-document aspect-based summarization Dataset
  • WikiLingua - A Multilingual Abstractive Summarization Dataset

Question Answering

  • SQuAD - Stanford Question Answering Dataset (SQuAD)
  • XQuad - XQuAD (Cross-lingual Question Answering Dataset) for cross-lingual question answering
  • GrailQA - Strongly Generalizable Question Answering (GrailQA)
  • CSQA - Complex Sequential Question Answering

Multilingual and Non-English Benchmarks

  • XTREME - Massively Multilingual Multi-task Benchmark
  • GLUECoS - A benchmark for code-switched NLP
  • IndoNLU Benchmark - collection of resources for training, evaluating, and analyzing NLP for Bahasa Indonesia
  • IndicGLUE - Natural Language Understanding Benchmark for Indic Languages
  • LinCE - Linguistic Code-Switching Evaluation Benchmark
  • Russian SuperGLUE - Russian SuperGLUE Benchmark

Bio, Law, and other scientific domains

  • BLURB - Biomedical Language Understanding and Reasoning Benchmark
  • BLUE - Biomedical Language Understanding Evaluation benchmark
  • LexGLUE - A Benchmark Dataset for Legal Language Understanding in English

Transformer Efficiency

Speech Processing

  • SUPERB - Speech processing Universal PERformance Benchmark

Other

The-NLP-Research

General

Embeddings

Repositories

Blogs

Cross-lingual Word and Sentence Embeddings

  • vecmap - VecMap (cross-lingual word embedding mappings) [GitHub, 573 stars]
  • sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 7021 stars]

Byte Pair Encoding

  • bpemb - Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) [GitHub, 1031 stars]
  • subword-nmt - Unsupervised Word Segmentation for Neural Machine Translation and Text Generation [GitHub, 1840 stars]
  • python-bpe - Byte Pair Encoding for Python [GitHub, 163 stars]
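To make the idea behind these libraries concrete, here is a minimal sketch of how Byte Pair Encoding merge rules can be learned from a toy word list. It is an illustration only: the function name `learn_bpe`, the `</w>` end-of-word marker and the tie-breaking rule are assumptions of this sketch, not the API of bpemb, subword-nmt or python-bpe.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn BPE merge rules from a list of words (toy corpus, hypothetical helper)."""
    # Represent each word as a tuple of symbols, ending with an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken alphabetically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
```

Each learned merge becomes one subword unit, so frequent character sequences end up as single tokens while rare words stay split into smaller pieces.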

Transformer-based Architectures

General

Transformer

BERT

Other Transformer Variants

T5
BigBird
Reformer / Linformer / Longformer / Performers
Switch Transformer

GPT-family

General
GPT-3
Learning Resources
Applications
  • Awesome GPT-3 - list of all resources related to GPT-3 [GitHub, 3414 stars]
  • GPT-3 Projects - a map of all GPT-3 start-ups and commercial projects
  • GPT-3 Demo Showcase - GPT-3 Demo Showcase, 180+ Apps, Examples, & Resources
  • OpenAI API - API Demo to use GPT-3 for commercial applications
Open-source Efforts

Other

Distillation, Pruning and Quantization

Reading Material
Tools
  • Bert-squeeze - code to reduce the size of Transformer-based models or decrease their latency at inference time [GitHub, 54 stars]
  • XtremeDistil - XtremeDistilTransformers for Distilling Massive Multilingual Neural Networks [GitHub, 93 stars]

Automated Summarization

  • PEGASUS: A State-of-the-Art Model for Abstractive Text Summarization by Google AI [Blog, June 2020]
  • CTRLsum - CTRLsum: Towards Generic Controllable Text Summarization [GitHub, 95 stars]
  • XL-Sum - XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages [GitHub, 136 stars]
  • SummerTime - an open-source text summarization toolkit for non-experts [GitHub, 162 stars]
  • PRIMER - PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization [GitHub, 42 stars]
  • summarus - Models for automatic abstractive summarization [GitHub, 133 stars]

Knowledge Graphs and NLP

The-NLP-Industry

Best Practices for NLP

MLOps for NLP

MLOps, especially when applied to NLP, is a set of best practices around automating various parts of the workflow when building and deploying NLP pipelines.

In general, MLOps for NLP includes having the following processes in place:

  • Data Versioning - make sure your training, annotation and other types of data are versioned and tracked
  • Experiment Tracking - make sure that all of your experiments are automatically tracked and saved where they can be easily replicated or retraced
  • Model Registry - make sure any neural models you train are versioned and tracked and it is easy to roll back to any of them
  • Automated Testing and Behavioral Testing - besides regular unit and integration tests, you want to have behavioral tests that check for bias or potential adversarial attacks
  • Model Deployment and Serving - automate model deployment, ideally with zero-downtime strategies such as Blue/Green or Canary deploys
  • Data and Model Observability - track data drift, model accuracy drift etc.

Additionally, there are two more components that are not as prevalent for NLP and are mostly used for Computer Vision and other sub-fields of AI:

  • Feature Store - centralized storage of all features developed for ML models that can be easily reused by any other ML project
  • Metadata Management - storage for all information related to the usage of ML models, mainly for reproducing behavior of deployed ML models, artifact tracking etc.
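As a rough illustration of the data versioning and experiment tracking practices listed above, the sketch below fingerprints a dataset by content hash and stores that fingerprint next to a run's parameters and metrics. The helper names `dataset_version` and `experiment_record`, the record layout and all values are hypothetical placeholders; real projects would more likely rely on one of the tools listed in the sections that follow.

```python
import hashlib
import json


def dataset_version(records):
    """Return a short content hash that changes whenever the data changes."""
    digest = hashlib.sha256()
    for record in records:
        # sort_keys makes the serialization independent of dict key order.
        line = json.dumps(record, sort_keys=True, ensure_ascii=False)
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()[:12]


def experiment_record(params, metrics, records):
    """Bundle what is needed to retrace a run into one JSON-serializable dict."""
    return {
        "data_version": dataset_version(records),
        "params": params,
        "metrics": metrics,
    }


# Placeholder data and values, for illustration only.
train = [
    {"text": "great movie", "label": "pos"},
    {"text": "dull plot", "label": "neg"},
]
run = experiment_record({"lr": 3e-5, "epochs": 2}, {"accuracy": 0.5}, train)
print(json.dumps(run, sort_keys=True))
```

Because the hash depends only on the content, two runs that report the same `data_version` were trained on identical records, which is the property that makes an experiment replicable.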

Reading Material

Learning Material

  • MLOps course by Made With ML
  • GitHub MLOps - collection of resources on how to facilitate Machine Learning Ops with GitHub

MLOps Communities

Data Versioning

  • DVC - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] Link to GitHub
  • Weights & Biases - tools for experiment tracking and dataset versioning [Paid Service]
  • Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]

Experiment Tracking

  • mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
  • Weights & Biases - tools for experiment tracking and dataset versioning [Paid Service]
  • Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
  • Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
  • SigOpt - automate training & tuning, visualize & compare runs [Paid Service]
  • Optuna - hyperparameter optimization framework [GitHub, 5924 stars]
  • Clear ML - experiment, orchestrate, deploy, and build data stores, all in one place [Free and Open Source] Link to GitHub
  • Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 5285 stars]

Model Registry

  • DVC - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] Link to GitHub
  • mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
  • ModelDB - open-source system for Machine Learning model versioning, metadata, and experiment management [GitHub, 1387 stars]
  • Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
  • Valohai - End-to-end ML pipelines [Paid Service]
  • Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
  • polyaxon - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]
  • Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]

Automated Testing and Behavioral Testing

  • CheckList - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 1583 stars]
  • TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 1839 stars]
  • WildNLP - Corrupt an input text to test NLP models' robustness [GitHub, 70 stars]
  • Great Expectations - Write tests for your data [GitHub, 6058 stars]
  • Deepchecks - Python package for comprehensively validating your machine learning models and data [GitHub, 1035 stars]
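A behavioral test differs from a unit test in that it checks a property of the model's predictions rather than a fixed output. The sketch below shows one such property, invariance to a person's name, against a toy keyword classifier. Both `toy_sentiment` and `invariance_failures` are hypothetical stand-ins written for this example; they do not reproduce the API of CheckList or any other tool listed here.

```python
import re

POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "dull", "terrible"}


def toy_sentiment(text):
    """Hypothetical stand-in for a real model: counts sentiment keywords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "pos" if score > 0 else "neg" if score < 0 else "neutral"


def invariance_failures(predict, template, fillers):
    """Fill the template with each filler; return fillers that change the prediction."""
    baseline = predict(template.format(name=fillers[0]))
    return [
        name
        for name in fillers[1:]
        if predict(template.format(name=name)) != baseline
    ]


failures = invariance_failures(
    toy_sentiment, "{name} gave a great talk.", ["Anna", "Mehmet", "Wei"]
)
print(failures)
```

An empty list means the prediction did not depend on the name. A non-empty list points at the inputs for which the model behaves differently, which is a signal worth investigating for bias.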

Model Deployment and Serving

  • mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
  • Amazon SageMaker [Paid Service]
  • Valohai - End-to-end ML pipelines [Paid Service]
  • NLP Cloud - Production-ready NLP API [Paid Service]
  • Saturn Cloud [Paid Service]
  • SELDON - machine learning deployment for enterprise [Paid Service]
  • Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
  • polyaxon - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]
  • TorchServe - flexible and easy to use tool for serving PyTorch models [GitHub, 2407 stars]
  • Kubeflow - The Machine Learning Toolkit for Kubernetes [GitHub, 10600 stars]
  • KFServing - Serverless Inferencing on Kubernetes [GitHub, 1327 stars]
  • TFX - TensorFlow Extended (TFX) is an end-to-end platform for deploying production ML pipelines [Free and Open Source]
  • Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
  • Cortex - containers as a service on AWS [Paid Service]
  • Azure Machine Learning - end-to-end machine learning lifecycle [Paid Service]
  • End2End Serverless Transformers On AWS Lambda [GitHub, 96 stars]
  • NLP-Service - sample demo of NLP as a service platform built using FastAPI and Hugging Face [GitHub, 13 stars]
  • Dagster - data orchestrator for machine learning [Free and Open Source]
  • Verta - AI and machine learning deployment and operations [Paid Service]
  • Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 5285 stars]
  • flyte - workflow automation platform for complex, mission-critical data and ML processes at scale [GitHub, 1902 stars]
  • MLRun - Machine Learning automation and tracking [GitHub, 565 stars]
  • DataRobot MLOps - DataRobot MLOps provides a center of excellence for your production AI

Model Debugging

  • imodels - package for concise, transparent, and accurate predictive modeling [GitHub, 490 stars]
  • Cockpit - A Practical Debugging Tool for Training Deep Neural Networks [GitHub, 380 stars]

Model Accuracy Prediction

  • WeightWatcher - WeightWatcher tool for predicting the accuracy of Deep Neural Networks [GitHub, 633 stars]

Data and Model Observability

General
  • whylogs - open source standard for data and ML logging [GitHub, 750 stars]
  • Rubrix - open-source tool for exploring and iterating on data for artificial intelligence projects [GitHub, 842 stars]
  • MLRun - Machine Learning automation and tracking [GitHub, 565 stars]
  • DataRobot MLOps - DataRobot MLOps provides a center of excellence for your production AI
  • Cortex - containers as a service on AWS [Paid Service]
Model Centric
  • Algorithmia - minimize risk with advanced reporting and enterprise-grade security and governance across all data, models, and infrastructure [Paid Service]
  • Dataiku - dataiku is for teams who want to deliver advanced analytics using the latest techniques at big data scale [Paid Service]
  • Evidently AI - tools to analyze and monitor machine learning models [Free and Open Source] Link to GitHub
  • Fiddler - ML Model Performance Management Tool [Paid Service]
  • Hydrosphere - open-source platform for managing ML models [Paid Service]
  • Verta - AI and machine learning deployment and operations [Paid Service]
  • Domino Model Ops - Deploy and Manage Models to Drive Business Impact [Paid Service]
  • iguazio - deployment and management of your AI applications with MLOps and end-to-end automation of machine learning pipelines [Paid Service]
Data Centric
  • Datafold - data quality through diffs, profiling, and anomaly detection [Paid Service]
  • acceldata - improve reliability, accelerate scale, and reduce costs across all data pipelines [Paid Service]
  • Bigeye - monitoring and alerting to your datasets in minutes [Paid Service]
  • datakin - end-to-end, real-time data lineage solution [Paid Service]
  • Monte Carlo - data integrity, drifts, schema, lineage [Paid Service]
  • SODA - data monitoring, testing and validation [Paid Service]
  • whatify - data quality and action recommendation on it [Paid Service]

Feature Stores

  • Tecton - enterprise feature store for machine learning [Paid Service]
  • FEAST - open source feature store for machine learning Website [GitHub, 2841 stars]
  • Hopsworks Feature Store - data management system for managing machine learning features [Paid Service]

Metadata Management

  • ML Metadata - a library for recording and retrieving metadata associated with ML developer and data scientist workflows [GitHub, 431 stars]
  • Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]

MLOps Frameworks

  • Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 5285 stars]
  • kedro - Python framework for creating reproducible, maintainable and modular data science code [GitHub, 6453 stars]
  • Seldon Core - MLOps framework to package, deploy, monitor and manage thousands of production machine learning models [GitHub, 2912 stars]
  • ZenML - MLOps framework to create reproducible ML pipelines for production machine learning [GitHub, 1646 stars]
  • Google Vertex AI - build, deploy, and scale ML models faster, with pre-trained and custom tooling within a unified AI platform [Paid Service]
  • Diffgram - Complete training data platform for machine learning delivered as a single application [GitHub, 682 stars]

Transformer-based Architectures

General

Multi-GPU Transformers
Training Transformers Effectively

Embeddings as a Service

NLP Recipes Industrial Applications:

NLP Applications in Bio, Finance, Legal and other industries

  • Blackstone - A spaCy pipeline and model for NLP on unstructured legal text [GitHub, 527 stars]
  • Sci spaCy - spaCy pipeline and models for scientific/biomedical documents [GitHub, 1089 stars]
  • FinBERT: Pre-Trained on SEC Filings for Financial NLP Tasks [GitHub, 160 stars]
  • LexNLP - Information retrieval and extraction for real, unstructured legal text [GitHub, 502 stars]
  • NerDL and NerCRF - Tutorial on Named Entity Recognition for Healthcare with SparkNLP
  • Legal Text Analytics - A list of selected resources dedicated to Legal Text Analytics [GitHub, 346 stars]
  • BioIE - A curated list of resources relevant to doing Biomedical Information Extraction [GitHub, 181 stars]

The-NLP-Speech

General Speech Recognition

  • wav2letter - Automatic Speech Recognition Toolkit [GitHub, 5931 stars]
  • DeepSpeech - Baidu's DeepSpeech architecture [GitHub, 19002 stars]
  • Acoustic Word Embeddings by Maria Obedkova [Blog, 2020]
  • kaldi - Kaldi is a toolkit for speech recognition [GitHub, 11306 stars]
  • awesome-kaldi - resources for using Kaldi [GitHub, 479 stars]
  • ESPnet - End-to-End Speech Processing Toolkit [GitHub, 4674 stars]
  • HuBERT - Self-supervised representation learning for speech recognition, generation, and compression [Blog, June 2021]

Text to Speech

  • FastSpeech - The Implementation of FastSpeech based on pytorch [GitHub, 675 stars]
  • TTS - a deep learning toolkit for Text-to-Speech [GitHub, 3915 stars]

Datasets

  • VoxPopuli - large-scale multilingual speech corpus for representation learning [GitHub, 332 stars]

The-NLP-Topics

Blogs

Frameworks for Topic Modeling

  • gensim - framework for topic modeling [GitHub, 12896 stars]
  • Spark NLP [GitHub, 2609 stars]

Repositories

Keyword-Extraction

Text Rank

  • PyTextRank - PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension [GitHub, 1738 stars]
  • textrank - TextRank implementation for Python 3 [GitHub, 1076 stars]
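For readers new to the technique, here is a simplified sketch of the core of TextRank: build a co-occurrence graph over tokens and rank the nodes with a PageRank-style iteration. The function `textrank_keywords` and its parameters are assumptions of this sketch. It omits the part-of-speech filtering and phrase merging that fuller implementations usually apply, and it is not the API of PyTextRank or textrank.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank tokens by a PageRank-style score over an undirected co-occurrence graph."""
    # Connect tokens that appear within `window` positions of each other.
    neighbors = defaultdict(set)
    for i, token in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != token:
                neighbors[token].add(other)
                neighbors[other].add(token)
    scores = {token: 1.0 for token in neighbors}
    for _ in range(iterations):
        # Each node receives an equal share of every neighbor's current score.
        scores = {
            token: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[token])
            for token in neighbors
        }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


tokens = "keyword extraction ranks each keyword by graph centrality".split()
print(textrank_keywords(tokens))
```

Tokens that co-occur with many other well-connected tokens accumulate score, so the ranking favors words that are central to the text rather than merely frequent.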

RAKE - Rapid Automatic Keyword Extraction

  • rake-nltk - Rapid Automatic Keyword Extraction algorithm using NLTK [GitHub, 879 stars]
  • yake - Single-document unsupervised keyword extraction [GitHub, 932 stars]
  • RAKE-tutorial - A python implementation of the Rapid Automatic Keyword Extraction [GitHub, 361 stars]

Other Approaches

  • flashtext - Extract Keywords from sentence or Replace keywords in sentences [GitHub, 5048 stars]
  • BERT-Keyword-Extractor - Deep Keyphrase Extraction using BERT [GitHub, 198 stars]
  • keyBERT - Minimal keyword extraction with BERT [GitHub, 1152 stars]

Further Reading

Responsible-NLP

NLP and ML Interpretability

NLP-centric

General

  • Language Interpretability Tool (LIT) [GitHub, 2811 stars]
  • WhatLies - Toolkit to help visualise - what lies in word embeddings [GitHub, 334 stars]
  • Interpret-Text - Interpretability techniques and visualization dashboards for NLP models [GitHub, 300 stars]
  • InterpretML - Fit interpretable models. Explain blackbox machine learning [GitHub, 4539 stars]
  • thermostat - Collection of NLP model explanations and accompanying analysis tools [GitHub, 87 stars]
  • Dodrio - Exploring attention weights in transformer-based models with linguistic knowledge [GitHub, 204 stars]
  • imodels - package for concise, transparent, and accurate predictive modeling [GitHub, 490 stars]

Ethics, Bias, and Equality in NLP

Adversarial Attacks for NLP

Hate Speech Analysis

  • HateXplain - BERT for detecting abusive language [GitHub, 94 stars]

The-NLP-Frameworks

General Purpose

  • spaCy by Explosion AI [GitHub, 22381 stars]
  • flair by Zalando [GitHub, 11232 stars]
  • AllenNLP by AI2 [GitHub, 10794 stars]
  • stanza (formerly Stanford NLP) [GitHub, 5984 stars]
  • spaCy stanza [GitHub, 612 stars]
  • nltk [GitHub, 10434 stars]
  • gensim - framework for topic modeling [GitHub, 12896 stars]
  • pororo - Platform of neural models for natural language processing [GitHub, 1081 stars]
  • NLP Architect - A Deep Learning NLP/NLU library by Intel® AI Lab [GitHub, 2797 stars]
  • FARM [GitHub, 1466 stars]
  • gobbli by RTI International [GitHub, 263 stars]
  • headliner - training and deployment of seq2seq models [GitHub, 229 stars]
  • SyferText - A privacy preserving NLP framework [GitHub, 185 stars]
  • DeText - Text Understanding Framework for Ranking and Classification Tasks [GitHub, 1202 stars]
  • TextHero - Text preprocessing, representation and visualization [GitHub, 2431 stars]
  • textblob - TextBlob: Simplified Text Processing [GitHub, 8033 stars]
  • AdaptNLP - A high level framework and library for NLP [GitHub, 386 stars]
  • textacy - NLP, before and after spaCy [GitHub, 1875 stars]
  • texar - Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow [GitHub, 2243 stars]
  • jiant - jiant is an NLP toolkit [GitHub, 1371 stars]

Data Augmentation

  • WildNLP - Text manipulation library to test NLP models [GitHub, 70 stars]
  • snorkel - Framework to generate training data [GitHub, 5016 stars]
  • NLPAug - Data augmentation for NLP [GitHub, 2915 stars]
  • SentAugment - Data augmentation by retrieving similar sentences from larger datasets [GitHub, 350 stars]
  • faker - Python package that generates fake data for you [GitHub, 13729 stars]
  • textflint - Unified Multilingual Robustness Evaluation Toolkit for NLP [GitHub, 537 stars]
  • Parrot - Practical and feature-rich paraphrasing framework [GitHub, 382 stars]
  • AugLy - data augmentations library for audio, image, text, and video [GitHub, 4290 stars]
  • TextAugment - Python 3 library for augmenting text for natural language processing applications [GitHub, 210 stars]

Adversarial NLP Attacks & Behavioral Testing

  • TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 1839 stars]
  • CleverHans - adversarial example library for constructing NLP attacks and building defenses [GitHub, 5399 stars]
  • CheckList - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 1583 stars]

Transformer-oriented

  • transformers by HuggingFace [GitHub, 57928 stars]
  • Adapter Hub and its documentation - Adapter modules for Transformers [GitHub, 661 stars]
  • haystack - Transformers at scale for question answering & neural search. [GitHub, 4137 stars]

Dialog Systems and Speech

  • DeepPavlov by MIPT [GitHub, 5592 stars]
  • ParlAI by FAIR [GitHub, 8608 stars]
  • rasa - Framework for Conversational Agents [GitHub, 13537 stars]
  • wav2letter - Automatic Speech Recognition Toolkit [GitHub, 5931 stars]
  • ChatterBot - conversational dialog engine for creating chat bots [GitHub, 11982 stars]
  • SpeechBrain - open-source and all-in-one speech toolkit based on PyTorch [GitHub, 3698 stars]

Word/Sentence-embeddings oriented

  • MUSE - A library for Multilingual Unsupervised or Supervised word Embeddings [GitHub, 2915 stars]
  • vecmap - A framework to learn cross-lingual word embedding mappings [GitHub, 573 stars]
  • sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 7021 stars]

Social Media Oriented

  • Ekphrasis - text processing tool, geared towards text from social networks [GitHub, 530 stars]

Phonetics

  • DeepPhonemizer - grapheme to phoneme conversion with deep learning [GitHub, 125 stars]

Morphology

  • LemmInflect - python module for English lemmatization and inflection [GitHub, 149 stars]
  • Inflect - generate plurals, ordinals, indefinite articles [GitHub, 612 stars]
  • simplemma - simple multilingual lemmatizer for Python [GitHub, 612 stars]

Multi-lingual tools

  • polyglot - Multi-lingual NLP Framework [GitHub, 1954 stars]
  • trankit - Light-Weight Transformer-based Python Toolkit for Multilingual NLP [GitHub, 572 stars]

Distributed NLP / Multi-GPU NLP

Machine Translation

  • COMET - A Neural Framework for MT Evaluation [GitHub, 121 stars]
  • marian-nmt - Fast Neural Machine Translation in C++ [GitHub, 861 stars]
  • argos-translate - Open source neural machine translation in Python [GitHub, 1023 stars]
  • Opus-MT - Open neural machine translation models and web services [GitHub, 184 stars]
  • dl-translate - A deep learning-based translation library built on Huggingface transformers [GitHub, 190 stars]

Entity and String Matching

  • PolyFuzz - Fuzzy string matching, grouping, and evaluation [GitHub, 423 stars]
  • pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 682 stars]
  • fuzzywuzzy - Fuzzy String Matching in Python [GitHub, 8603 stars]
  • jellyfish - approximate and phonetic matching of strings [GitHub, 1610 stars]
  • textdistance - Compute distance between sequences [GitHub, 2596 stars]
  • DeepMatcher - Python package for performing entity and text matching using deep learning [GitHub, 401 stars]
  • RE2 - Simple and Effective Text Matching with Richer Alignment Features [GitHub, 317 stars]
  • Machamp - Machamp: A Generalized Entity Matching Benchmark [GitHub, 3 stars]
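Many fuzzy matching tools build on edit distance. As a minimal sketch, the following computes the Levenshtein distance with the standard two-row dynamic programming scheme and uses it to pick the closest candidate string. The helper `best_match` and the lowercasing step are assumptions of this example, not the interface of any library listed above.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def best_match(query, candidates):
    """Return the candidate with the smallest edit distance to the query."""
    return min(candidates, key=lambda c: levenshtein(query.lower(), c.lower()))


print(levenshtein("kitten", "sitting"))
print(best_match("Stanfrd NLP", ["spaCy", "Stanford NLP", "NLTK"]))
```

Keeping only two rows of the table brings memory use down to the length of the shorter dimension, which matters when matching against long strings.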

Discourse Analysis

  • ConvoKit - Cornell Conversational Analysis Toolkit [GitHub, 331 stars]

PII scrubbing

  • scrubadub - Clean personally identifiable information from dirty dirty text [GitHub, 264 stars]

Hashtag Segmentation

  • hashformers - automatically inserting the missing spaces between the words in a hashtag [GitHub, 22 stars]

Books Analysis / Literary Analysis

  • booknlp - a natural language processing pipeline that scales to books and other long documents (in English) [GitHub, 472 stars]
  • bookworm - ingests novels, builds an implicit character network and a deeply analysable graph [GitHub, 70 stars]

Non-English oriented

Japanese

  • fugashi - Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis [GitHub, 190 stars]
  • SudachiPy - SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer [GitHub, 275 stars]
  • Konoha - easy-to-use Japanese Text Processing tool, which makes it possible to switch tokenizers with small changes of code [GitHub, 171 stars]
  • jProcessing - Japanese Natural Language Processing Libraries [GitHub, 142 stars]
  • Ginza - Japanese NLP Library using spaCy as framework based on Universal Dependencies [GitHub, 548 stars]
  • kuromoji - self-contained and very easy to use Japanese morphological analyzer designed for search [GitHub, 790 stars]
  • nagisa - Japanese tokenizer based on recurrent neural networks [GitHub, 290 stars]
  • KyTea - Kyoto Text Analysis Toolkit for word segmentation and pronunciation estimation [GitHub, 184 stars]
  • Jigg - Pipeline framework for easy natural language processing [GitHub, 68 stars]
  • Juman++ - Juman++ (a Morphological Analyzer Toolkit) [GitHub, 277 stars]
  • RakutenMA - morphological analyzer (word segmentor + PoS Tagger) for Chinese and Japanese written purely in JavaScript [GitHub, 443 stars]
  • toiro - a comparison tool of Japanese tokenizers [GitHub, 100 stars]

Other

  • textblob-de - TextBlob: Simplified Text Processing for German [GitHub, 90 stars]
  • Kashgari - Transfer Learning with focus on Chinese [GitHub, 2256 stars]
  • Underthesea - Vietnamese NLP Toolkit [GitHub, 930 stars]
  • PTT5 - Pretraining and validating the T5 model on Brazilian Portuguese data [GitHub, 53 stars]

Text Data Labelling

  • Small-Text - Active Learning for Text Classification in Python [GitHub, 172 stars]
  • Doccano - open source annotation tool for machine learning practitioners [GitHub, 5761 stars]
  • Prodigy - annotation tool powered by active learning [Paid Service]

The-NLP-Learning

General

Courses

Books

Tutorials

The-NLP-Communities

Other-NLP-Topics

Tokenization

  • tokenizers - Fast State-of-the-Art Tokenizers optimized for Research and Production [GitHub, 5225 stars]
  • SentencePiece - Unsupervised text tokenizer for Neural Network-based text generation [GitHub, 5652 stars]
  • SoMaJo - A tokenizer and sentence splitter for German and English web and social media texts [GitHub, 100 stars]

Data Augmentation and Weak Supervision

Libraries and Frameworks
  • WildNLP - Text manipulation library to test NLP models [GitHub, 70 stars]
  • NLPAug - Data augmentation for NLP [GitHub, 2915 stars]
  • SentAugment - Data augmentation by retrieving similar sentences from larger datasets [GitHub, 350 stars]
  • TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 1839 stars]
  • skweak - software toolkit for weak supervision applied to NLP tasks [GitHub, 645 stars]
  • NL-Augmenter - Collaborative Repository of Natural Language Transformations [GitHub, 553 stars]
  • EDA - Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks [GitHub, 1177 stars]
  • snorkel - Framework to generate training data [GitHub, 5016 stars]
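To give a feel for what token-level augmentation looks like, here is a small sketch of two operations in the spirit of easy data augmentation: random swap and random deletion. The function names, the default probability and the seeded `random.Random` instance are choices made for this example and are not taken from the EDA repository or any other library listed above.

```python
import random


def random_swap(tokens, rng):
    """Swap two randomly chosen positions (no-op for fewer than two tokens)."""
    tokens = list(tokens)
    if len(tokens) >= 2:
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, rng, p=0.2):
    """Drop each token with probability p, always keeping at least one token."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept or [rng.choice(tokens)]


rng = random.Random(0)  # fixed seed so augmented datasets are reproducible
sentence = "the model reads the text".split()
print(" ".join(random_swap(sentence, rng)))
print(" ".join(random_deletion(sentence, rng)))
```

Passing the random generator explicitly keeps the augmentation reproducible, which fits the data versioning practices described in the MLOps section.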
Reading Material and Tutorials

Named Entity Recognition (NER)

Relation Extraction

  • tacred-relation - TACRED: position-aware attention model for relation extraction [GitHub, 310 stars]
  • tacrev - TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [GitHub, 49 stars]
  • tac-self-attention - Relation extraction with position-aware self-attention [GitHub, 61 stars]
  • Re-TACRED - Re-TACRED: Addressing Shortcomings of the TACRED Dataset [GitHub, 31 stars]

Coreference Resolution

Domain Adaptation

Low Resource NLP

Spell Correction

  • Gramformer - framework for detecting, highlighting and correcting grammatical errors [GitHub, 911 stars]
  • NeuSpell - A Neural Spelling Correction Toolkit [GitHub, 412 stars]
  • SymSpellPy - Python port of SymSpell [GitHub, 529 stars]
  • Speller100 by Microsoft [Blog, Feb 2021]
  • JamSpell - spell checking library - accurate, fast, multi-language [GitHub, 472 stars]

Style Transfer for NLP

  • Styleformer - Neural Language Style Transfer framework [GitHub, 322 stars]
  • StylePTB - A Compositional Benchmark for Fine-grained Controllable Text Style Transfer [GitHub, 33 stars]

Automata Theory for NLP

  • pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 682 stars]

Obscene words detection

  • LDNOOBW - List of Dirty, Naughty, Obscene, and Otherwise Bad Words [GitHub, 1679 stars]

Reddit Analysis

  • Subreddit Analyzer - comprehensive Data and Text Mining workflow for submissions and comments from any given public subreddit [GitHub, 475 stars]

Skill Detection

  • SkillNER - rule based NLP module to extract job skills from text [GitHub, 33 stars]

Reinforcement Learning for NLP

  • nlp-gym - NLPGym - A toolkit to develop RL agents to solve NLP tasks [GitHub, 109 stars]

AutoML / AutoNLP

  • AutoNLP - Faster and easier training and deployments of SOTA NLP models [GitHub, 679 stars]
  • TPOT - Python Automated Machine Learning tool [GitHub, 8434 stars]
  • Auto-PyTorch - Automatic architecture search and hyperparameter optimization for PyTorch [GitHub, 1535 stars]
  • HungaBunga - Brute-Force all sklearn models with all parameters using .fit .predict [GitHub, 656 stars]
  • AutoML Natural Language - Google's paid AutoML NLP service
  • Optuna - hyperparameter optimization framework [GitHub, 5924 stars]
  • FLAML - fast and lightweight AutoML library [GitHub, 1741 stars]
  • Gradsflow - open-source AutoML & PyTorch Model Training Library [GitHub, 267 stars]

Text Generation

Title / Headlines Generation

  • TitleStylist - Learning to Generate Headlines with Controlled Styles [GitHub, 56 stars]

License CC0

Attributions

Resources

  • All linked resources belong to original authors

Icons

Fonts


The Pandect Series also includes

     
