Persian Natural Language Processing Tools

Part-of-Speech Tagger

  • farsiNLPTools - Open-source dependency parser, part-of-speech tagger, and text normalizer for Farsi (Persian).
  • HAZM - Python library for digesting Persian text.
  • Persian Language Model for HunPoS - HunPoS (Halácsy et al., 2007) is an open-source reimplementation of the statistical part-of-speech tagger Trigrams'n'Tags, also called TnT (Brants, 2000), that lets the user tune the tagger with different feature settings.
  • Maryam Tavafi POS Tagger - This software includes implementation of a Persian part of speech tagger based on Structured Support Vector Machines.
  • Perstem - Perstem is a Persian (Farsi) stemmer, morphological analyzer, transliterator, and partial part-of-speech tagger. Inflexional morphemes are separated or removed from their stems. Perstem can also tokenize and transliterate between various character set encodings and romanizations.
  • Persianp Toolbox - Multi-purpose Persian NLP toolbox.
  • UM-wtlab pos tagger - This software is a C# implementation of the Viterbi and Brill part-of-speech taggers.
  • RDRPOSTagger - Provides a pre-trained part-of-speech (POS) tagging model for Persian. This POS tagging toolkit is implemented in both Python and Java.
  • jPTDP - Provides a pre-trained model for joint POS tagging and dependency parsing for Persian.
  • Parsivar - A Language Processing Toolkit for Persian
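
Several of the taggers above are statistical; the UM-wtlab entry names Viterbi decoding explicitly. The following is a minimal sketch of Viterbi decoding over a toy hidden Markov model, not the implementation of any listed tool. The tag set (`N`, `V`), the probabilities, and the function name `viterbi` are assumptions made up for illustration.

```python
def viterbi(words, states, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most probable tag sequence for `words` under a toy HMM.

    `unk` is an assumed fallback probability for words a tag never emitted.
    """
    if not words:
        return []
    # Each column maps state -> (best path probability, previous state).
    best = [{s: (start_p[s] * emit_p[s].get(words[0], unk), None) for s in states}]
    for word in words[1:]:
        column = {}
        for s in states:
            # Pick the predecessor that maximizes the path probability into s.
            prev, score = max(
                ((p, best[-1][p][0] * trans_p[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            column[s] = (score * emit_p[s].get(word, unk), prev)
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Hypothetical toy model: two tags and three words.
STATES = ["N", "V"]
START = {"N": 0.7, "V": 0.3}
TRANS = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.7, "V": 0.3}}
EMIT = {
    "N": {"من": 0.5, "کتاب": 0.4, "خواندم": 0.1},
    "V": {"من": 0.05, "کتاب": 0.05, "خواندم": 0.9},
}

tags = viterbi(["من", "کتاب", "خواندم"], STATES, START, TRANS, EMIT)
print(tags)  # ['N', 'N', 'V']
```

For long sentences, a practical tagger would sum log-probabilities instead of multiplying raw probabilities, to avoid numeric underflow.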

Language Detection

Tokenization & Segmentation

  • HAZM - Python library for digesting Persian text.
  • polyglot - Natural language pipeline that supports massive multilingual applications (like tokenization (165 languages), language detection (196 languages), named entity recognition (40 languages), part of speech tagging (16 languages), sentiment analysis (136 languages), word embeddings (137 languages), morphological analysis (135 languages), transliteration (69 languages)).
  • tok-tok - Tok-tok is a fast, simple, multilingual tokenizer (a single .pl file).
  • segmental - Trains a text segmentation model on a plain-text corpus using a deep learning platform.
  • Persian Sentence Segmenter and Tokenizer: SeTPer - Regex-based sentence segmenter.
  • Farsi-Verb-Tokenizer - Tokenizes Farsi verbs.
  • Parsivar - A Language Processing Toolkit for Persian
  • ParsiAnalyzer - Persian Analyzer For Elasticsearch.
  • ParsiNorm - Persian text pre-processing tool.
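
As a rough illustration of the regex-based approach mentioned for SeTPer, here is a minimal sketch of sentence segmentation and tokenization using only Python's standard library. It is not SeTPer's rule set; the function names `split_sentences` and `tokenize` and the punctuation list are assumptions.

```python
import re

# Sentence-final punctuation: '.', '!', '?' and the Arabic-script question mark (U+061F).
_SENTENCE_END = re.compile(r"(?<=[.!?\u061F])\s+")


def split_sentences(text: str) -> list[str]:
    """Split after sentence-final punctuation that is followed by whitespace."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def tokenize(sentence: str) -> list[str]:
    """Separate words from punctuation marks.

    The zero-width non-joiner (U+200C, the Persian "half-space") is kept
    inside words so that forms such as می‌روم remain a single token.
    """
    return re.findall(r"[\w\u200c]+|[^\w\s]", sentence)


for sentence in split_sentences("سلام. حال شما چطور است؟ خوبم!"):
    print(tokenize(sentence))
```

A pattern this simple also splits after abbreviations, and it does not split when punctuation is not followed by whitespace; dedicated tools handle such cases with additional rules.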

Normalizer And Text Cleaner

  • HAZM - Python library for digesting Persian text.
  • Persian Pre-processor: PrePer - Another single .pl tool that normalizes Persian text.
  • virastar - Cleans up Persian text: replaces double dashes with en dashes and triple dashes with em dashes, replaces English numbers with their Persian equivalents, corrects spacing around :;,.?! (one space after and no space before), replaces the English percent sign with its Persian equivalent, and applies many other normalizations. Virastar is written in Ruby and has a Python port.
  • Virastyar - A collection of C# libraries for Persian text processing (Spell Checking, Purification, Punctuation Correction, Persian Character Standardization, Pinglish Conversion & ...)
  • Parsivar - A Language Processing Toolkit for Persian (Has Half-Space Normalizer and Pinglish Conversion)
  • ParsiAnalyzer - Persian Analyzer For Elasticsearch.
  • ParsiNorm - Persian text pre-processing tool.
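
The normalization rules listed for virastar (dash replacement, digit conversion, punctuation spacing, percent sign) can be sketched in a few lines of standard-library Python. This is an illustrative approximation, not virastar's code or API; the function name `normalize` is an assumption, and the rule that keeps decimal points intact is an added assumption as well.

```python
import re

# Map ASCII digits 0-9 to Persian digits U+06F0..U+06F9.
_DIGITS = {ord(str(d)): chr(0x06F0 + d) for d in range(10)}


def normalize(text: str) -> str:
    # Triple dash first, so the double-dash rule does not consume part of it.
    text = text.replace("---", "\u2014")  # em dash
    text = text.replace("--", "\u2013")   # en dash
    text = text.translate(_DIGITS)
    text = text.replace("%", "\u066A")    # Arabic-script percent sign
    # No space before punctuation.
    text = re.sub(r"\s+([:;,.?!])", r"\1", text)
    # One space after punctuation, unless a digit or more punctuation follows
    # (assumption: this keeps decimals and runs such as "?!" intact).
    text = re.sub(r"([:;,.?!])\s*(?=[^\s\d:;,.?!])", r"\1 ", text)
    return text


print(normalize("a ,b -- c --- d 12%"))
```

Only the ASCII punctuation marks named in the virastar description are handled here; Persian punctuation such as ، and ؛ would need the same treatment in a fuller version.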

Translator

  • SPL - Semantic Parser Localizer toolkit can be used to translate text between any language pairs for which an NMT model exists. We currently support Marian models and Google Translate. In general, for translations to or from Persian, Google Translate has higher quality.

Transliterator

  • Perstem - Perstem is a Persian (Farsi) stemmer, morphological analyzer, transliterator, and partial part-of-speech tagger. Inflexional morphemes are separated or removed from their stems. Perstem can also tokenize and transliterate between various character set encodings and romanizations.

Morphological Analysis

  • polyglot - Natural language pipeline that supports massive multilingual applications (like tokenization (165 languages), language detection (196 languages), named entity recognition (40 languages), part of speech tagging (16 languages), sentiment analysis (136 languages), word embeddings (137 languages), morphological analysis (135 languages), transliteration (69 languages)).

Stemmer

  • PersianStemmer - (Java, Delphi, C#, and Python) - PersianStemmer is a longest-match stemming algorithm based on pattern matching. It uses a knowledge base that consists of a collection of rules called "patterns". The exceptions and problem cases in Persian morphology have also been studied, and a solution is presented for each of them. According to its authors, the stemmer was evaluated and produced much better results than previous stemmers.
  • Perstem - Perstem is a Persian (Farsi) stemmer, morphological analyzer, transliterator, and partial part-of-speech tagger. Inflexional morphemes are separated or removed from their stems. Perstem can also tokenize and transliterate between various character set encodings and romanizations.
  • polyglot - Natural language pipeline that supports massive multilingual applications (like tokenization (165 languages), language detection (196 languages), named entity recognition (40 languages), part of speech tagging (16 languages), sentiment analysis (136 languages), word embeddings (137 languages), morphological analysis (135 languages), transliteration (69 languages)).
  • Parsivar - A Language Processing Toolkit for Persian
  • ParsiAnalyzer - Persian Analyzer For Elasticsearch.
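
To make the longest-match idea from the PersianStemmer entry concrete, here is a minimal sketch of suffix stripping with an exception list. The suffix list, the exception set, the minimum stem length, and the function name `stem` are all hypothetical and far smaller than a real knowledge base of patterns.

```python
# Hypothetical, deliberately tiny suffix list; real stemmers use many more patterns.
SUFFIXES = ["ترین", "های", "جات", "ها", "تر", "ان", "ات"]
# Words whose endings only look like suffixes (assumed examples).
EXCEPTIONS = {"تهران", "ایران"}
MIN_STEM_LEN = 2


def stem(word: str) -> str:
    """Strip the longest matching suffix, leaving at least MIN_STEM_LEN characters."""
    if word in EXCEPTIONS:
        return word
    # Try longer suffixes first so that the longest match wins.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            # Drop a trailing zero-width non-joiner left behind by the suffix.
            return word[: -len(suffix)].rstrip("\u200c")
    return word


for w in ["کتابها", "بزرگترین", "سبزیجات", "ایران"]:
    print(w, "->", stem(w))
```

The word سبزیجات shows why match length matters: both ات and جات match its ending, and taking the longer one yields سبزی rather than a stem with a stray ج.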

Sentiment Analysis

  • polyglot (polarity) - Natural language pipeline that supports massive multilingual applications (like tokenization (165 languages), language detection (196 languages), named entity recognition (40 languages), part of speech tagging (16 languages), sentiment analysis (136 languages), word embeddings (137 languages), morphological analysis (135 languages), transliteration (69 languages)).

Spell Checking

  • async_faspell - Persian spellchecker. An algorithm that suggests corrections for misspelled words.

Dependency Parser

  • HAZM - Python library for digesting Persian text.

Shallow Parser

  • HAZM - Python library for digesting Persian text.
  • Parsivar - A Language Processing Toolkit for Persian

Information Extraction

  • Baaz - Open information extraction from the Persian web.

Text To Speech Preprocessing

  • ParsiNorm - Persian text pre-processing tool.

Text To Speech

  • AlisterTA TTS - A convolutional sequence-to-sequence model for Persian text to speech, based on Tachibana et al. with a few modifications.

MISC

  • petit - Converts numbers written out in words to digit form.
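
The kind of conversion petit performs can be illustrated with a small sketch that turns Persian number words into an integer. This is not petit's implementation or API; the function name `words_to_number` is an assumption, and the vocabulary below is a partial, hand-picked subset.

```python
# Partial vocabulary for illustration only (assumed subset).
VALUES = {
    "صفر": 0, "یک": 1, "دو": 2, "سه": 3, "چهار": 4, "پنج": 5,
    "شش": 6, "هفت": 7, "هشت": 8, "نه": 9, "ده": 10,
    "بیست": 20, "سی": 30, "چهل": 40, "پنجاه": 50,
    "شصت": 60, "هفتاد": 70, "هشتاد": 80, "نود": 90,
    "صد": 100, "دویست": 200, "سیصد": 300,
}
SCALES = {"هزار": 1_000, "میلیون": 1_000_000}


def words_to_number(text: str) -> int:
    """Sum number words, applying scale words (thousand, million) to the current group."""
    total = 0
    group = 0
    for token in text.split():
        if token == "و":  # the conjunction "and" only joins parts
            continue
        if token in VALUES:
            group += VALUES[token]
        elif token in SCALES:
            # A bare scale word such as "هزار" counts as one thousand.
            total += (group or 1) * SCALES[token]
            group = 0
        else:
            raise ValueError(f"unknown number word: {token}")
    return total + group


print(words_to_number("دو هزار و سیصد و بیست و پنج"))  # 2325
```

The sketch assumes whitespace-separated tokens and does not validate word order, so malformed input may still yield a number.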

Keyphrase Extractor

  • Perke - Perke is a Python keyphrase extraction package for the Persian language. It provides an end-to-end keyphrase extraction pipeline in which each component can be easily modified or extended to develop new models.

Speech Recognition

  • Vosk - Vosk is an offline open source speech recognition toolkit. It enables speech recognition for 20+ languages and dialects. Supports Persian.
  • m3hrdadfi/wav2vec - Persian speech recognition model based on XLS-R.