
Curated List of Persian Natural Language Processing and Information Retrieval Tools and Resources


QormeSabz/awesome-persian-nlp-ir

 
 


Awesome Persian NLP/IR, Tools And Resources

This list is a curation of the best, not of everything. Please participate in its development. Thanks to ACL WEB.

Contents

Tools

Part-of-Speech Tagger

  • farsiNLPTools - Open-source dependency parser, part-of-speech tagger, and text normalizer for Farsi (Persian).
  • HAZM - Python library for digesting Persian text.
  • Persian Language Model for HunPoS - HunPoS (Halacsy et al, 2007) is an open source reimplementation of the statistical part-of-speech tagger Trigrams'n Tags, also called TnT (Brants, 2000) allowing the user to tune the tagger by using different feature settings.
  • Maryam Tavafi POS Tagger - This software includes implementation of a Persian part of speech tagger based on Structured Support Vector Machines.
  • Perstem - Perstem is a Persian (Farsi) stemmer, morphological analyzer, transliterator, and partial part-of-speech tagger. Inflexional morphemes are separated or removed from their stems. Perstem can also tokenize and transliterate between various character set encodings and romanizations.
  • Persianp Toolbox - Multi-purpose Persian NLP toolbox.
  • UM-wtlab pos tagger - This software is a C# implementation of the Viterbi and Brill part-of-speech taggers.
  • RDRPOSTagger provides a pre-trained part-of-speech (POS) tagging model for Persian. This POS tagging toolkit is implemented in both Python and Java.
  • jPTDP provides a pre-trained model for joint POS tagging and dependency parsing for Persian.
  • Parsivar - A Language Processing Toolkit for Persian

Language Detection

Tokenization & Segmentation

  • HAZM - Python library for digesting Persian text.
  • polyglot - Natural language pipeline that supports massive multilingual applications (like tokenization (165 languages), language detection (196 languages), named entity recognition (40 languages), part of speech tagging (16 languages), sentiment analysis (136 languages), word embeddings (137 languages), morphological analysis (135 languages), transliteration (69 languages)).
  • tok-tok - Tok-tok is a fast, simple, multilingual tokenizer (single .pl file).
  • segmental - Train your own text segmentation model on a plain-text corpus using a deep learning platform.
  • Persian Sentence Segmenter and Tokenizer: SeTPer - Regex based sentence segmenter.
  • Farsi-Verb-Tokenizer - Tokenizes Farsi Verbs.
  • Parsivar - A Language Processing Toolkit for Persian

Normalizer And Text Cleaner

  • HAZM - Python library for digesting Persian text.
  • Persian Pre-processor: PrePer - Another single-file .pl tool that normalizes Persian text.
  • virastar - Cleans up Persian text: replaces a double dash with an en dash and a triple dash with an em dash, replaces English digits with their Persian equivalents, corrects spacing around :;,.?! (one space after and no space before), replaces the English percent sign with its Persian equivalent, and applies many other normalizations. Virastar is written in Ruby and has a Python port.
  • Virastyar - A collection of C# libraries for Persian text processing (Spell Checking, Purification, Punctuation Correction, Persian Character Standardization, Pinglish Conversion & ...)
  • Parsivar - A Language Processing Toolkit for Persian (Has Half-Space Normalizer and Pinglish Conversion)
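
The kind of rules listed for virastar above can be illustrated with a minimal Python sketch. This is not virastar's actual code or API: the function name `normalize`, the use of U+066A as the Persian percent sign, and the choice to leave punctuation between digits untouched are all assumptions made for illustration.

```python
import re

# Map ASCII digits 0-9 to Persian digits U+06F0..U+06F9.
DIGIT_MAP = str.maketrans("0123456789", "۰۱۲۳۴۵۶۷۸۹")


def normalize(text: str) -> str:
    """Apply a few virastar-style cleanup rules (illustrative only)."""
    # Handle the triple dash first so the double-dash rule does not consume it.
    text = text.replace("---", "\u2014")  # em dash
    text = text.replace("--", "\u2013")   # en dash
    text = text.translate(DIGIT_MAP)
    # Assumption: U+066A (ARABIC PERCENT SIGN) is the Persian equivalent.
    text = text.replace("%", "\u066A")
    # No space before punctuation.
    text = re.sub(r"\s+([:;,.?!])", r"\1", text)
    # Exactly one space after a run of punctuation when more text follows.
    # Digits are excluded from the lookahead so decimals such as 3.5 survive.
    text = re.sub(r"([:;,.?!]+)\s*(?=[^\s:;,.?!\d])", r"\1 ", text)
    return text
```

For example, `normalize("Hi ,there")` returns `"Hi, there"`. A real normalizer would also need to handle Persian punctuation, half-spaces, and character standardization, which this sketch leaves out.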

Transliterator

  • Perstem - Perstem is a Persian (Farsi) stemmer, morphological analyzer, transliterator, and partial part-of-speech tagger. Inflexional morphemes are separated or removed from their stems. Perstem can also tokenize and transliterate between various character set encodings and romanizations.

Named Entity Recognition

  • ParsBERT-NER - A model based on ParsBERT (a monolingual Persian language model), fine-tuned on the PEYMA, ARMAN, and combined PEYMA+ARMAN datasets. It is available from HuggingFace for use with both TensorFlow 2.0 and PyTorch.

Embeddings

Morphological Analysis

  • polyglot - Natural language pipeline that supports massive multilingual applications (like tokenization (165 languages), language detection (196 languages), named entity recognition (40 languages), part of speech tagging (16 languages), sentiment analysis (136 languages), word embeddings (137 languages), morphological analysis (135 languages), transliteration (69 languages)).

Stemmer

  • PersianStemmer - (Java, Delphi, C# and Python) - PersianStemmer is a longest-match stemming algorithm based on pattern matching. It uses a knowledge base consisting of a collection of rules named "patterns". The exceptions and problems in Persian morphology have been studied, and a solution is presented for each of them. In the authors' evaluation, its results were much better than those of previous stemmers.

  • Perstem - Perstem is a Persian (Farsi) stemmer, morphological analyzer, transliterator, and partial part-of-speech tagger. Inflexional morphemes are separated or removed from their stems. Perstem can also tokenize and transliterate between various character set encodings and romanizations.

  • polyglot - Natural language pipeline that supports massive multilingual applications (like tokenization (165 languages), language detection (196 languages), named entity recognition (40 languages), part of speech tagging (16 languages), sentiment analysis (136 languages), word embeddings (137 languages), morphological analysis (135 languages), transliteration (69 languages)).

  • Parsivar - A Language Processing Toolkit for Persian
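
The longest-match idea mentioned for PersianStemmer can be sketched in a few lines of Python. This is a toy illustration, not PersianStemmer's rule base or API: the suffix list, the exception set, and the `min_stem_len` parameter are hypothetical and cover only a handful of cases.

```python
# Illustrative suffix patterns only; a real rule base is far larger.
SUFFIXES = ["ها", "های", "ی", "تر", "ترین"]

# Hypothetical exception list: words that merely end in a suffix-like string.
# "دختر" (girl) ends in "تر" but is not a comparative form.
EXCEPTIONS = {"دختر"}


def stem(word: str, min_stem_len: int = 2) -> str:
    """Strip the longest matching suffix, keeping at least min_stem_len letters."""
    if word in EXCEPTIONS:
        return word
    # Try longer suffixes first so the longest match wins.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word
```

With this ordering, `stem("کتابهای")` strips the three-letter suffix "های" and returns "کتاب", whereas trying the one-letter suffix "ی" first would have stopped at "کتابها".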

Sentiment Analysis

  • polyglot (polarity) - Natural language pipeline that supports massive multilingual applications (like tokenization (165 languages), language detection (196 languages), named entity recognition (40 languages), part of speech tagging (16 languages), sentiment analysis (136 languages), word embeddings (137 languages), morphological analysis (135 languages), transliteration (69 languages)).

Language Model

  • ParsBERT: Transformer-based Model for Persian Language Understanding - A monolingual language model for Persian based on Google’s BERT architecture. The model is pre-trained on a large Persian corpus of more than 2M documents covering various writing styles and numerous subjects (e.g., scientific, novels, news). A large subset of this corpus was crawled manually.

Spell Checking

  • async_faspell - Persian spellchecker. An algorithm that suggests corrections for misspelled words.

Dependency Parser

  • HAZM - Python library for digesting Persian text.

Shallow Parser

  • HAZM - Python library for digesting Persian text.
  • Parsivar - A Language Processing Toolkit for Persian

Information Extraction

  • baaz - Open information extraction from Persian web.

Text To Speech

  • AlisterTA TTS - A convolutional sequence to sequence model for Persian text to speech based on Tachibana et al with a few modifications.

Summarizer

Data

Part-of-Speech Tagger

  • Bijankhan Corpus - Bijankhan corpus is a tagged corpus that is suitable for natural language processing research on the Persian (Farsi) language. The collection is gathered from daily news and common texts. All documents in the collection are categorized into different subjects such as political, cultural and so on. In total, there are 4300 different subjects. The Bijankhan collection contains about 2.6 million manually tagged words with a tag set that contains 40 Persian POS tags.

  • Mojgan Seraji Corpus - Uppsala Persian Corpus (UPC) is a large, freely available Persian corpus. The corpus is a modified version of the Bijankhan corpus with additional sentence segmentation and consistent tokenization containing 2,704,028 tokens and annotated with 31 part-of-speech tags. The part-of-speech tags are listed with explanations in this table.

  • Large-Scale Colloquial Persian - Large Scale Colloquial Persian Dataset (LSCP) is hierarchically organized in a semantic taxonomy that focuses on multi-task informal Persian language understanding as a comprehensive problem. LSCP includes 120M sentences from 27M casual Persian tweets with their dependency relations in syntactic annotation, part-of-speech tags, sentiment polarity and automatic translations of the original Persian sentences into English (EN), German (DE), Czech (CS), Italian (IT) and Hindi (HI). Learn more about this project at LSCP webpage.

Named Entity Recognition

  • ArmanPersoNERCorpus - The dataset includes 250,015 tokens and 7,682 Persian sentences in total. It is available in 3 folds to be used in turn as training and test sets. Each file contains one token, along with its manually annotated named-entity tag, per line. Each sentence is separated with a newline. The NER tags are in IOB format.
  • FarsiYar PersianNER - The dataset includes about 25,000,000 tokens and about 1,000,000 Persian sentences in total based on Persian Wikipedia Corpus. The NER tags are in IOB format. More than 1000 volunteers contributed tag improvements to this dataset via web panel or android app. They release updated tags every two weeks.
  • Workshop on NLP Solutions for Under Resourced Languages (NSURL) 2019 - Task 7 dataset - Contains a medium-sized NER corpus with 7 classes of named entities (person, location, organization, money, percent, date, and time). The corpus contains more than 700 news documents.
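
The file layout described for ArmanPersoNERCorpus (one token and its tag per line, a blank line between sentences, IOB tags) can be read with a short Python sketch. It assumes the token and tag are separated by whitespace and uses made-up labels such as `B-PER`; check the actual corpus files for their separator and tag names. The helper names `read_iob` and `extract_entities` are hypothetical.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_iob(lines: Iterable[str]) -> List[Sentence]:
    """Group 'token tag' lines into sentences split on blank lines."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        # Assumption: the tag is the last whitespace-separated field.
        token, tag = line.rsplit(maxsplit=1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence: Sentence) -> List[Tuple[str, str]]:
    """Merge B-/I- tagged tokens into (entity text, label) pairs."""
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label = None
    for token, tag in sentence:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new:
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-"):
            tokens.append(token)
        else:  # an "O" tag closes any open entity
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [], None
    if tokens:
        entities.append((" ".join(tokens), label))
    return entities
```

A typical use would be `read_iob(open(path, encoding="utf-8"))` followed by `extract_entities` on each sentence. A line without a tag field raises a `ValueError` in this sketch, which is usually the desired behavior when validating a corpus file.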

Dependency Parsing

  • Persian Syntactic Dependency Treebank - This treebank is supplied for free noncommercial use. For commercial uses feel free to contact us. It contains 29,982 annotated sentences, including samples from almost all verbs of the Persian valency lexicon.
  • Uppsala Persian Dependency Treebank: UPDT - Dependency-based syntactically annotated corpus.
  • Pretrained model
  • Universal Dependencies 1.3 - Multilingual corpus that holds IOB gold data for dependency parsing.
  • HamleDT 3.0 - HArmonized Multi-LanguagE Dependency Treebank is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that they all conform to the same annotation style. This version uses Universal Dependencies as the common annotation style.
  • Large-Scale Colloquial Persian - Large Scale Colloquial Persian Dataset (LSCP) is hierarchically organized in a semantic taxonomy that focuses on multi-task informal Persian language understanding as a comprehensive problem. LSCP includes 120M sentences from 27M casual Persian tweets with their dependency relations in syntactic annotation, part-of-speech tags, sentiment polarity and automatic translations of the original Persian sentences into English (EN), German (DE), Czech (CS), Italian (IT) and Hindi (HI). Learn more about this project at LSCP webpage.

Text Categorization and Classification

  • Hamshahri - Hamshahri collection is a standard reliable Persian text collection that was used at Cross Language Evaluation Forum (CLEF) during years 2008 and 2009 for evaluation of Persian information retrieval systems.
  • Bijankhan Corpus - Bijankhan corpus is a tagged corpus that is suitable for natural language processing research on the Persian (Farsi) language. The collection is gathered from daily news and common texts. All documents in the collection are categorized into different subjects such as political, cultural and so on. In total, there are 4300 different subjects. The Bijankhan collection contains about 2.6 million manually tagged words with a tag set that contains 40 Persian POS tags.

Spell Checking

  • FAspell - FASpell dataset was developed for the evaluation of spell checking algorithms. It contains a set of pairs of misspelled Persian words and their corresponding corrected forms similar to the ASpell dataset used for English.
  • Persian-Spell-checker - We're collecting a dictionary of Persian words (verbs, nouns, etc.) for a Persian spell checker.

Persian Poems And Classic Texts

  • Farsi Poem Corpus - This corpus consists of text documents for 48 Persian poets. The corpus comes in three formats: original, normalized (only the 32 main letters of the Farsi alphabet), and with stop words removed. The corpus consists of 1,216,286 mesras of Farsi poems and 8,102,119 words, of which 148,588 are unique.

Sentiment Analysis

  • NRC-Persian-Lexicon - NRC Word-Emotion Association Lexicon, useful for Persian sentiment analysis.
  • Digikala Sentiment Analysis data - Reviews scraped from the Digikala website. The labels are the star ratings that users assigned to each product.
  • Pars-ABSA - Manually annotated Persian dataset, verified by 3 native Persian speakers. The dataset consists of 5,114 positive, 3,061 negative and 1,827 neutral data samples from 5,602 unique reviews.
  • PerSent - This dataset presents real-valued polarity labels, in the range from -1 to 1, for thousands of Persian words and expressions.
  • SentiPers - Documents in SentiPers are manually annotated at different levels.
  • LexiPers - An ontology based sentiment lexicon for Persian.

Machine Translation

Parallel Corpus

  • TEP: Tehran English-Persian Parallel Corpus - First free English-Persian corpus.
  • OPUS: the open parallel corpus - OPUS is a growing collection of translated texts from the web. In the OPUS project we try to convert and align free online data, to add linguistic annotation, and to provide the community with a publicly available parallel corpus. OPUS is based on open source products and the corpus is also delivered as an open content package. We used several tools to compile the current collection. All pre-processing is done automatically. No manual corrections have been carried out.
  • Large-Scale Colloquial Persian - Large Scale Colloquial Persian Dataset (LSCP) is hierarchically organized in a semantic taxonomy that focuses on multi-task informal Persian language understanding as a comprehensive problem. LSCP includes 120M sentences from 27M casual Persian tweets with their dependency relations in syntactic annotation, part-of-speech tags, sentiment polarity and automatic translations of the original Persian sentences into English (EN), German (DE), Czech (CS), Italian (IT) and Hindi (HI). Learn more about this project at LSCP webpage.

Monolingual Corpus

  • TMC: Tehran Monolingual Corpus - The Tehran Monolingual Corpus (TMC) is a large-scale Persian monolingual corpus. TMC is suited for Language Modeling and relevant research areas in Natural Language Processing. The corpus is extracted from Hamshahri Corpus and ISNA news agency website. The quality of Hamshahri corpus is improved for language modeling purpose by a series of tokenization and spell-checking steps.
  • VOA Persian Corpus - A medium-sized corpus of 7.9 million words, 2003-2008. The corpus is in the public domain, so no copyright restrictions.
  • MirasText: Automatically Extracted Text Persian Corpus (about 12GB).
  • A large collection of Persian raw text - About 80GB Persian raw text, collected from a variety of sources, particularly CommonCrawl.

Comparable Corpus

Web Collected

  • W2C – Web to Corpus – Corpora - A set of corpora for 120 languages automatically collected from Wikipedia and the web.
  • dotIR Collection - dotIR is a standard Persian test collection suitable for evaluating web information retrieval algorithms on the Iranian web. dotIR contains many Persian web pages, including their text, links, metadata, etc., stored in XML format. It is prepared to be a good representative of the Iranian web and is a good test bed for evaluating link-based information retrieval algorithms. It includes enough queries and relevance judgments for a valid evaluation. It is not very large, so it does not require high processing resources.

IR Ranking Evaluation

  • Hamshahri - Hamshahri collection is a standard reliable Persian text collection that was used at Cross Language Evaluation Forum (CLEF) during years 2008 and 2009 for evaluation of Persian information retrieval systems.

IR Crawling And Linking Evaluation

  • dotIR Collection - dotIR is a standard Persian test collection suitable for evaluating web information retrieval algorithms on the Iranian web. dotIR contains many Persian web pages, including their text, links, metadata, etc., stored in XML format. It is prepared to be a good representative of the Iranian web and is a good test bed for evaluating link-based information retrieval algorithms. It includes enough queries and relevance judgments for a valid evaluation. It is not very large, so it does not require high processing resources.

Pretrained Models

  • Farsi Poem word2vec model - This is a word2vec model developed from a corpus of 48 Persian poets. The corpus consists of 1,216,286 mesras of Farsi poems and 8,102,119 words, of which 148,588 are unique.

Stop Word Lists

MISC

  • PersianStemmingDataset - PersianStemmingDataset consists of two manually stemmed Persian corpora and an evaluation tool for computing stemming evaluation metrics.
  • PersPred - PersPred is the first online multilingual syntactic and semantic database of Persian compound verbs (complex predicates), developed by the members of the research unit Mondes iranien et indien (CNRS, Sorbonne Nouvelle, Inalco, EPHE) within the ANR-DFG project PERGRAM (2008-2012) and the LR4.1 work package of the Strand 6 of the Labex Empirical Foundations of Linguistics (EFL).
  • ACL-Wiki Resources for Persian - Another list of resources for Persian computing.
  • petit - Convert alphabet-written numbers to digit-form

Tutorials

Sentiment Analysis

  • Persian Sentiment Analysis - Persian sentiment analysis ( آناکاوی سهش های فارسی | تحلیل احساسات فارسی ) is a simple, ready-to-use project that uses Python to create the model. It also includes a very good IPython tutorial.

Papers

Contribute

Contributions welcome! Read the contribution guidelines first.

License

CC0
