diff --git a/docs/src/_static/images/gensim_logo_positive_complete_tb.png b/docs/src/_static/images/gensim_logo_positive_complete_tb.png new file mode 100644 index 0000000000..f02d0530ff Binary files /dev/null and b/docs/src/_static/images/gensim_logo_positive_complete_tb.png differ diff --git a/docs/src/about.rst b/docs/src/about.rst deleted file mode 100644 index 017596e00b..0000000000 --- a/docs/src/about.rst +++ /dev/null @@ -1,82 +0,0 @@ -:orphan: - -.. _about: - -===== -About -===== - -Gensim = "Generate Similar" ---------------------------- - -Gensim started off as a collection of various Python scripts for the Czech Digital Mathematics Library `dml.cz `_ in 2008, where it served to generate a short list of the most similar articles to a given article. - -I also wanted to try these fancy "Latent Semantic Methods", but the libraries that realized the necessary computation were `not much fun to work with `_. - -Naturally, I set out to reinvent the wheel. Our `2010 LREC publication `_ describes the initial design decisions behind Gensim: clarity, efficiency and scalability. It is fairly representative of how Gensim works even today. - -Later versions of gensim improved this efficiency and scalability tremendously. In fact, I made algorithmic scalability of distributional semantics the topic of my `PhD thesis `_. - -By now, Gensim is---to my knowledge---the most robust, efficient and hassle-free piece -of software to realize unsupervised semantic modelling from plain text. It stands -in contrast to brittle homework-assignment-implementations that do not scale on one hand, -and robust java-esque projects that take forever just to run "hello world". - -In 2011, I started using `Github `_ for source code hosting -and the Gensim website moved to its present domain. In 2013, Gensim got its current logo and website design. - - -Licensing ----------- - -Gensim is licensed under the OSI-approved `GNU LGPLv2.1 license `_. 
-This means that it's free for both personal and commercial use, but if you make any -modification to Gensim that you distribute to other people, you have to disclose -the source code of these modifications. - -Apart from that, you are free to redistribute Gensim in any way you like, though you're -not allowed to modify its license (doh!). - -My intent here is to **get more help and community involvement** with the development of Gensim. -The legalese is therefore less important to me than your input and contributions. - -`Contact me `_ if LGPL doesn't fit your bill and you'd like the open source restrictions lifted. - -.. seealso:: - - We also built a commercial product for automatic data discovery for privacy compliance, https://pii-tools.com. - - Reach out at info@pii-tools.com if you need industry-grade PII / PCI / PHI software for compliance or breach management. - - -Contributors ------------- - -Credit goes to the many people who contributed to Gensim, be it in `discussions `_, -ideas, `code contributions `_ or `bug reports `_. - -It's really useful and motivating to get feedback, in any shape or form, so big thanks to you all! - -Some honorable mentions are included in the `CHANGELOG.txt `_. - -Academic citing ---------------- - -Gensim has been used in `over two thousand research papers and student theses `_. 
- -When citing Gensim, please use `this BibTeX entry `_:: - - @inproceedings{rehurek_lrec, - title = {{Software Framework for Topic Modelling with Large Corpora}}, - author = {Radim {\v R}eh{\r u}{\v r}ek and Petr Sojka}, - booktitle = {{Proceedings of the LREC 2010 Workshop on New - Challenges for NLP Frameworks}}, - pages = {45--50}, - year = 2010, - month = May, - day = 22, - publisher = {ELRA}, - address = {Valletta, Malta}, - note={\url{http://is.muni.cz/publication/884893/en}}, - language={English} - } diff --git a/docs/src/conf.py b/docs/src/conf.py index ca2b020825..75fa293f14 100644 --- a/docs/src/conf.py +++ b/docs/src/conf.py @@ -54,7 +54,7 @@ # General information about the project. project = u'gensim' -copyright = u'2009-now Radim Řehůřek, https://radimrehurek.com.' +copyright = u'2009-now' # The version info for the project you're documenting, acts as replacement for # |version| and |release|, also used in various other places throughout the diff --git a/docs/src/distributed.rst b/docs/src/distributed.rst index 8b27fedc4f..b79e097e08 100644 --- a/docs/src/distributed.rst +++ b/docs/src/distributed.rst @@ -1,3 +1,5 @@ +:orphan: + .. _distributed: Distributed Computing diff --git a/docs/src/indextoc.rst b/docs/src/indextoc.rst index c121bf659d..ec981338e6 100644 --- a/docs/src/indextoc.rst +++ b/docs/src/indextoc.rst @@ -3,8 +3,6 @@ :maxdepth: 1 intro - distributed auto_examples/index - support - wiki apiref + support diff --git a/docs/src/intro.rst b/docs/src/intro.rst index 2b9d564600..275c74e43f 100644 --- a/docs/src/intro.rst +++ b/docs/src/intro.rst @@ -1,17 +1,22 @@ .. _intro: -============ -Introduction -============ +=============== +What is Gensim? +=============== -Gensim is a :ref:`free ` Python library designed to automatically extract semantic -topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible. 
+Gensim is a free open-source Python library for representing +documents as semantic vectors, as efficiently (computer-wise) and painlessly (human-wise) as possible. -Gensim is designed to process raw, unstructured digital texts ("*plain text*"). +.. image:: _static/images/gensim_logo_positive_complete_tb.png + :width: 600 + :alt: Gensim logo + +Gensim is designed to process raw, unstructured digital texts ("*plain text*") using unsupervised machine learning algorithms. The algorithms in Gensim, such as :class:`~gensim.models.word2vec.Word2Vec`, :class:`~gensim.models.fasttext.FastText`, -Latent Semantic Analysis (LSI, LSA, see :class:`~gensim.models.lsimodel.LsiModel`), Latent Dirichlet -Allocation (LDA, see :class:`~gensim.models.ldamodel.LdaModel`) etc, automatically discover the semantic structure of documents by examining statistical +Latent Semantic Indexing (LSI, LSA, :class:`~gensim.models.lsimodel.LsiModel`), Latent Dirichlet +Allocation (LDA, :class:`~gensim.models.ldamodel.LdaModel`) etc, automatically discover the semantic +structure of documents by examining statistical co-occurrence patterns within a corpus of training documents. These algorithms are **unsupervised**, which means no human input is necessary -- you only need a corpus of plain text documents. @@ -24,42 +29,88 @@ Once these statistical patterns are found, any plain text documents (sentence, p .. _design: -Features --------- +Design principles +----------------- + +We built Gensim from scratch for: +* **Practicality** -- as industry experts, we focus on proven, battle-hardened algorithms to solve real industry problems. More focus on engineering, less on academia. * **Memory independence** -- there is no need for the whole training corpus to - reside fully in RAM at any one time (can process large, web-scale corpora). -* **Memory sharing** -- trained models can be persisted to disk and loaded back via `mmap `_. Multiple processes can share the same data, cutting down RAM footprint. 
-* Efficient implementations for several popular vector space algorithms, - including :class:`~gensim.models.word2vec.Word2Vec`, :class:`~gensim.models.doc2vec.Doc2Vec`, :class:`~gensim.models.fasttext.FastText`, - TF-IDF, Latent Semantic Analysis (LSI, LSA, see :class:`~gensim.models.lsimodel.LsiModel`), - Latent Dirichlet Allocation (LDA, see :class:`~gensim.models.ldamodel.LdaModel`) or Random Projection (see :class:`~gensim.models.rpmodel.RpModel`). -* I/O wrappers and readers from several popular data formats. -* Fast similarity queries for documents in their semantic representation. + reside fully in RAM at any one time. Can process large, web-scale corpora using data streaming. +* **Performance** – highly optimized implementations of popular vector space algorithms using C, BLAS and memory-mapping. -The **principal design objectives** behind Gensim are: -1. Straightforward interfaces and low API learning curve for developers. Good for prototyping. -2. Memory independence with respect to the size of the input corpus; all intermediate - steps and algorithms operate in a streaming fashion, accessing one document - at a time. +Installation +------------ -.. seealso:: +Gensim is a Python library, so you need `Python `_. Gensim supports all Python versions that haven't reached their `end-of-life `_. - We also built a high performance commercial server for NLP, document analysis, indexing, search and clustering: https://scaletext.ai. ScaleText is available both on-prem and as SaaS. +If you need to work with an older Python (such as Python 2.7), you must install an older version of Gensim (such as `Gensim 3.8.3 `_). - Reach out at info@scaletext.com if you need an industry-grade NLP tool with professional support. +To install gensim, simply run:: -.. _availability: + pip install --upgrade gensim -Availability ------------- +Alternatively, you can download the source code from `Github `__ +or the `Python Package Index `_. 
+ +After installation, learn how to use Gensim from its :ref:`sphx_glr_auto_examples_core_run_core_concepts.py` tutorials. + + +.. _Licensing: + +Licensing +---------- + +Gensim is licensed under the OSI-approved `GNU LGPLv2.1 license `_. +This means that it's free for both personal and commercial use, but if you make any +modification to Gensim that you distribute to other people, you have to disclose +the source code of these modifications. + +Apart from that, you are free to redistribute Gensim in any way you like, though you're +not allowed to modify its license (doh!). + +If LGPL doesn't fit your bill, you can ask for :ref:`Commercial support`. + +.. _Academic citing: + +Academic citing +--------------- + +Gensim has been used in `over two thousand research papers and student theses `_. + +When citing Gensim, please use `this BibTeX entry `_:: + + @inproceedings{rehurek_lrec, + title = {{Software Framework for Topic Modelling with Large Corpora}}, + author = {Radim {\v R}eh{\r u}{\v r}ek and Petr Sojka}, + booktitle = {{Proceedings of the LREC 2010 Workshop on New + Challenges for NLP Frameworks}}, + pages = {45--50}, + year = 2010, + month = May, + day = 22, + publisher = {ELRA}, + address = {Valletta, Malta}, + note={\url{http://is.muni.cz/publication/884893/en}}, + language={English} + } + +Gensim = "Generate Similar" +--------------------------- + +Historically, Gensim started off as a collection of Python scripts for the Czech Digital Mathematics Library `dml.cz `_ project, back in 2008. The scripts served to generate a short list of the most similar math articles to a given article. + +I (Radim) also wanted to try these fancy "Latent Semantic Methods", but the libraries that realized the necessary computation were `not much fun to work with `_. -Gensim is licensed under the OSI-approved `GNU LGPLv2.1 license `_ and can be downloaded either from its `Github repository `_ -or from the `Python Package Index `_. +Naturally, I set out to reinvent the wheel. 
Our `2010 LREC publication `_ describes the initial design decisions behind Gensim: **clarity, efficiency and scalability**. It is fairly representative of how Gensim works even today. +Later versions of Gensim improved this efficiency and scalability tremendously. In fact, I made algorithmic scalability of distributional semantics the topic of my `PhD thesis `_. -Core concepts -------------- +By now, Gensim is---to my knowledge---the most robust, efficient and hassle-free piece +of software to realize unsupervised semantic modelling from plain text. It stands +in contrast to brittle homework-assignment-implementations that do not scale on one hand, +and robust java-esque projects that take forever just to run "hello world". -See the :ref:`sphx_glr_auto_examples_core_run_core_concepts.py` tutorial. +In 2011, I moved Gensim's source code to `Github `__ +and created the Gensim website. In 2013 Gensim got its current logo, and in 2020 a website redesign. diff --git a/docs/src/sphinx_rtd_theme/advertisement.html b/docs/src/sphinx_rtd_theme/advertisement.html index 530eb89b7e..a7c93b60d0 100644 --- a/docs/src/sphinx_rtd_theme/advertisement.html +++ b/docs/src/sphinx_rtd_theme/advertisement.html @@ -2,4 +2,5 @@

Get Expert Help From The Gensim Authors

Consulting in Machine Learning & NLP

Corporate trainings in Data Science, NLP and Deep Learning

+

PII Tools automated discovery of personal and sensitive data

diff --git a/docs/src/sphinx_rtd_theme/layout.html b/docs/src/sphinx_rtd_theme/layout.html index 2633a168e7..6738fdcd5b 100644 --- a/docs/src/sphinx_rtd_theme/layout.html +++ b/docs/src/sphinx_rtd_theme/layout.html @@ -234,10 +234,10 @@ {%- if hasdoc('copyright') %} {% set path = pathto('copyright') %} {% set copyright = copyright|e %} - © {% trans %}Copyright{% endtrans %} {{ copyright }} + © {% trans %}Copyright{% endtrans %} {{ copyright }}, Radim Řehůřek. {%- else %} {% set copyright = copyright|e %} - © {% trans %}Copyright{% endtrans %} {{ copyright }} + © {% trans %}Copyright{% endtrans %} {{ copyright }}, Radim Řehůřek. {%- endif %} {%- endif %} @@ -257,11 +257,6 @@ {% trans last_updated=last_updated|e %}Last updated on {{ last_updated }}.{% endtrans %} {%- endif %} -
- Radim Řehůřek – Machine learing and data - mining expert -
- Created by edgy.digital diff --git a/docs/src/sphinx_rtd_theme/layouthome.html b/docs/src/sphinx_rtd_theme/layouthome.html index 45dbe0c02f..d64aad6738 100644 --- a/docs/src/sphinx_rtd_theme/layouthome.html +++ b/docs/src/sphinx_rtd_theme/layouthome.html @@ -100,7 +100,7 @@
Do you have a question?

Gensim Support


-

See the Gensim support page for how to get open source and commercial support.

+

See the Gensim support page for how to ask for open source and commercial support.

@@ -123,10 +123,10 @@

See the {% trans %}Copyright{% endtrans %} {{ copyright }} + © {% trans %}Copyright{% endtrans %} {{ copyright }}, Radim Řehůřek. {%- else %} {% set copyright = copyright|e %} - © {% trans %}Copyright{% endtrans %} {{ copyright }} + © {% trans %}Copyright{% endtrans %} {{ copyright }}, Radim Řehůřek. {%- endif %} {%- endif %} diff --git a/docs/src/sphinx_rtd_theme/topbar.html b/docs/src/sphinx_rtd_theme/topbar.html index d3e6091e1a..aa75d7498c 100644 --- a/docs/src/sphinx_rtd_theme/topbar.html +++ b/docs/src/sphinx_rtd_theme/topbar.html @@ -29,16 +29,16 @@ Home - Documentation + Documentation - Support + Support - API + API - - About + + About diff --git a/docs/src/support.rst b/docs/src/support.rst index ad2214f66b..f55a29163a 100644 --- a/docs/src/support.rst +++ b/docs/src/support.rst @@ -9,30 +9,32 @@ Open source support The main communication channel is the `Gensim mailing list `_. -Additional channels are `twitter @gensim_py `_ and `Gitter RARE-Technologies/gensim `_. +This is the preferred way to ask for help, report problems and share insights with the community. Newbie questions are perfectly fine, as long as you've read the :ref:`tutorials `. -This is the preferred way to **ask for help**, **report problems** and **share insights** with the community. Newbie questions are perfectly fine, just make sure you've read the :ref:`tutorials `. +**⚠️ Please don't send me private emails unless you have a substantial budget for commercial support (see below).** -I discourage sending private emails, because the mailing list serves as a knowledge base for all Gensim users, cutting maintenance efforts needed for support. If you feel your problem is too special, data too sensitive, technical scope too demanding, **see the "business" section below**. +FAQ and some useful snippets of code are maintained on GitHub: https://github.com/RARE-Technologies/gensim/wiki/Recipes-&-FAQ. 
-When posting on the mailing list, try to include all relevant information, such as what it is you are trying to achieve, what went wrong, relevant Gensim logs, package versions etc. +We're on `Twitter @gensim_py `_. You can also try asking on StackOverflow, using the `gensim tag `_, but the mailing list above will give you more authoritative answers, faster. -**FAQ** and some useful **snippets of code** are maintained on GitHub: https://github.com/RARE-Technologies/gensim/wiki/Recipes-&-FAQ. -You can also try asking on StackOverflow, using the `gensim tag `_. +.. _Commercial support: +Commercial support +------------------ -Business support ----------------- +I run a consulting R&D company focused on data mining and unstructured text processing, https://rare-technologies.com. -We run a consulting R&D company focused on data mining and unstructured text processing, https://rare-technologies.com. +If you need commercial support for Gensim or a corporate training in machine learning, `get in touch `_ for a quote. -If you need commercial support, design validation, technical training or custom system development, `get in touch `_ for a quote. +We're not interested in any sort of equity arrangements. -Developer support ------------------- +For developers +-------------- + +Developers who want to contribute to Gensim are welcome – Gensim is an open source project. -Developers who `tweak Gensim internals `_ are encouraged to report issues at the `GitHub issue tracker `_. +First propose your feature / fix on the `Gensim mailing list `_ and if there is consensus for accepting your contribution, read the `Developer page `_ and implement it. Thanks! -Note that Github is not a medium for discussions or asking open-ended questions; please use the `mailing list `_ for that. +Note that Github is not a medium for asking open-ended questions. Please use the `Gensim mailing list `_ for that. 
diff --git a/docs/src/wiki.rst b/docs/src/wiki.rst index aa8110e4fa..40e7c6343f 100644 --- a/docs/src/wiki.rst +++ b/docs/src/wiki.rst @@ -1,3 +1,5 @@ +:orphan: + .. _wiki: Experiments on the English Wikipedia diff --git a/gensim/models/fasttext.py b/gensim/models/fasttext.py index 3476f7c5dc..64a21aafa7 100644 --- a/gensim/models/fasttext.py +++ b/gensim/models/fasttext.py @@ -20,8 +20,8 @@ This module supports loading models trained with Facebook's fastText implementation. It also supports continuing training from such models. -For a tutorial see `this notebook -`_. +For a tutorial see :ref:`sphx_glr_auto_examples_tutorials_run_fasttext.py`. + Usage examples -------------- @@ -250,29 +250,6 @@ >>> analogies_result = model.wv.evaluate_word_analogies(datapath('questions-words.txt')) -Implementation Notes --------------------- - -These notes may help developers navigate our fastText implementation. -The implementation is split across several submodules: - -- :mod:`gensim.models.fasttext`: This module. Contains FastText-specific functionality only. -- :mod:`gensim.models.keyedvectors`: Implements generic functionality. -- :mod:`gensim.models.word2vec`: Provides much of the basic scan & train framework. -- :mod:`gensim.utils`: Implements model I/O (loading and saving). - -Our implementation relies heavily on inheritance. -It consists of several important classes: - -- :class:`~gensim.models.word2vec.Word2VecVocab`: the vocabulary. - Keeps track of all the unique words, sometimes discarding the extremely rare ones. - This is sometimes called the Dictionary within Gensim. -- :class:`~gensim.models.fasttext.FastTextKeyedVectors`: the vectors. - Once training is complete, this class is sufficient for calculating embeddings. -- :class:`~gensim.models.fasttext.FastTextTrainables`: the underlying neural network. - The implementation uses this class to *learn* the word embeddings. -- :class:`~gensim.models.fasttext.FastText`: ties everything together. - """ import logging