Refactor API reference gensim.topic_coherence. Fix #1669 (#1714)
* Refactored aggregation

* Micro-Fix for aggregation.py, partially refactored direct_confirmation.py

* Partially refactored indirect_confirmation_measure

* Some additions

* Math attempts

* add math extension for sphinx

* Minor refactoring

* Some refactoring for probability_estimation

* Beta-strings

* Different additions

* Minor changes

* text_analysis left

* Added example for ContextVectorComputer class

* probability_estimation 0.9

* beta_version

* Added some examples for text_analysis

* text_analysis: corrected example for class UsesDictionary

* Final additions for text_analysis.py

* fix cross-reference problem

* fix pep8

* fix aggregation

* fix direct_confirmation_measure

* fix types in direct_confirmation_measure

* partial fix indirect_confirmation_measure

* HotFix for probability_estimation and segmentation

* Refactoring for probability_estimation

* Changes for indirect_confirmation_measure

* Fixed segmentation, partly fixed text_analysis

* Add Notes for text_analysis

* fix di/ind

* fix doc examples in probability_estimation

* fix probability_estimation

* fix segmentation

* fix docstring in probability_estimation

* partial fix test_analysis

* add latex stuff for docs build

* doc fix[1]

* doc fix[2]

* remove apt install from travis (now doc build in circle)
CLearERR authored and menshikh-iv committed Jan 10, 2018
1 parent 4644606 commit 0a4419f
Showing 8 changed files with 748 additions and 261 deletions.
1 change: 1 addition & 0 deletions docs/src/topic_coherence/text_analysis.rst
@@ -7,3 +7,4 @@
     :inherited-members:
     :undoc-members:
     :show-inheritance:
+    :special-members: __getitem__
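Note: the reason `__getitem__` deserves documentation here is that the accumulator classes support dict-style indexing for occurrence and co-occurrence counts, which Sphinx autodoc hides by default for special members. A minimal sketch of that behaviour, reusing the toy accumulator from the doctests later in this commit (the exact indexing semantics are an assumption inferred from those doctests, not stated in this .rst change):

>>> from collections import namedtuple
>>> from gensim.topic_coherence import text_analysis
>>>
>>> id2token = {1: 'test', 2: 'doc'}
>>> token2id = {v: k for k, v in id2token.items()}
>>> dictionary = namedtuple('Dictionary', 'token2id, id2token')(token2id, id2token)
>>> accumulator = text_analysis.InvertedIndexAccumulator({1, 2}, dictionary)
>>> accumulator._inverted_index = {0: {2, 3, 4}, 1: {3, 5}}
>>> accumulator._num_docs = 5
>>>
>>> accumulator[1]  # occurrence count for token id 1 (docs {2, 3, 4}), assuming __getitem__ dispatches to get_occurrences
3
>>> accumulator[1, 2]  # co-occurrence count for the pair (doc {3}), assuming a tuple dispatches to get_co_occurrences
1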
4 changes: 2 additions & 2 deletions gensim/models/atmodel.py
@@ -560,10 +560,10 @@ def update(self, corpus=None, author2doc=None, doc2author=None, chunksize=None,
         Args:
             corpus (gensim corpus): The corpus with which the author-topic model should be updated.
-            author2doc (dictionary): author to document mapping corresponding to indexes in input
+            author2doc (dict): author to document mapping corresponding to indexes in input
                 corpus.
-            doc2author (dictionary): document to author mapping corresponding to indexes in input
+            doc2author (dict): document to author mapping corresponding to indexes in input
                 corpus.
             chunks_as_numpy (bool): Whether each chunk passed to `.inference` should be a np
26 changes: 17 additions & 9 deletions gensim/topic_coherence/aggregation.py
@@ -4,10 +4,7 @@
 # Copyright (C) 2013 Radim Rehurek <radimrehurek@seznam.cz>
 # Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

-"""
-This module contains functions to perform aggregation on a list of values
-obtained from the confirmation measure.
-"""
+"""This module contains functions to perform aggregation on a list of values obtained from the confirmation measure."""

 import logging
 import numpy as np
@@ -17,13 +14,24 @@
 def arithmetic_mean(confirmed_measures):
     """
-    This functoin performs the arithmetic mean aggregation on the output obtained from
+    Perform the arithmetic mean aggregation on the output obtained from
     the confirmation measure module.

-    Args:
-        confirmed_measures : list of calculated confirmation measure on each set in the segmented topics.
-
-    Returns:
-        mean : Arithmetic mean of all the values contained in confirmation measures.
+    Parameters
+    ----------
+    confirmed_measures : list of float
+        List of calculated confirmation measure on each set in the segmented topics.
+
+    Returns
+    -------
+    `numpy.float`
+        Arithmetic mean of all the values contained in confirmation measures.
+
+    Examples
+    --------
+    >>> from gensim.topic_coherence.aggregation import arithmetic_mean
+    >>> arithmetic_mean([1.1, 2.2, 3.3, 4.4])
+    2.75
     """
     return np.mean(confirmed_measures)
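Note: since `arithmetic_mean` simply delegates to `np.mean`, the new doctest is easy to sanity-check; a minimal sketch using only what the diff above shows:

>>> import numpy as np
>>> from gensim.topic_coherence.aggregation import arithmetic_mean
>>>
>>> values = [1.1, 2.2, 3.3, 4.4]
>>> arithmetic_mean(values) == np.mean(values)  # thin wrapper, so the two always agree
True
>>> type(arithmetic_mean(values)).__name__  # a numpy scalar, per the Returns section above
'float64'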
179 changes: 124 additions & 55 deletions gensim/topic_coherence/direct_confirmation_measure.py
@@ -4,37 +4,61 @@
 # Copyright (C) 2013 Radim Rehurek <radimrehurek@seznam.cz>
 # Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

-"""
-This module contains functions to compute direct confirmation on a pair of words or word subsets.
-"""
+"""This module contains functions to compute direct confirmation on a pair of words or word subsets."""

 import logging

 import numpy as np

 logger = logging.getLogger(__name__)

-EPSILON = 1e-12  # Should be small. Value as suggested in paper.
+# Should be small. Value as suggested in paper http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf
+EPSILON = 1e-12


 def log_conditional_probability(segmented_topics, accumulator, with_std=False, with_support=False):
-    """
-    This function calculates the log-conditional-probability measure
-    which is used by coherence measures such as U_mass.
-
-    Args:
-        segmented_topics (list): Output from the segmentation module of the segmented
-            topics. Is a list of list of tuples.
-        accumulator: word occurrence accumulator from probability_estimation.
-        with_std (bool): True to also include standard deviation across topic segment
-            sets in addition to the mean coherence for each topic; default is False.
-        with_support (bool): True to also include support across topic segments. The
-            support is defined as the number of pairwise similarity comparisons were
-            used to compute the overall topic coherence.
-
-    Returns:
-        list : of log conditional probability measure for each topic.
+    """Calculate the log-conditional-probability measure which is used by coherence measures such as `U_mass`.
+    This is defined as :math:`m_{lc}(S_i) = log \\frac{P(W', W^{*}) + \\epsilon}{P(W^{*})}`.
+
+    Parameters
+    ----------
+    segmented_topics : list of lists of (int, int)
+        Output from the segmentation functions, e.g. :func:`~gensim.topic_coherence.segmentation.s_one_pre` or
+        :func:`~gensim.topic_coherence.segmentation.s_one_one`.
+    accumulator : :class:`~gensim.topic_coherence.text_analysis.InvertedIndexAccumulator`
+        Word occurrence accumulator from :mod:`gensim.topic_coherence.probability_estimation`.
+    with_std : bool, optional
+        True to also include standard deviation across topic segment sets, in addition to the mean coherence
+        for each topic.
+    with_support : bool, optional
+        True to also include support across topic segments. The support is defined as the number of pairwise
+        similarity comparisons that were used to compute the overall topic coherence.
+
+    Returns
+    -------
+    list of float
+        Log conditional probability measurements for each topic.
+
+    Examples
+    --------
+    >>> from gensim.topic_coherence import direct_confirmation_measure, text_analysis
+    >>> from collections import namedtuple
+    >>>
+    >>> # Create dictionary
+    >>> id2token = {1: 'test', 2: 'doc'}
+    >>> token2id = {v: k for k, v in id2token.items()}
+    >>> dictionary = namedtuple('Dictionary', 'token2id, id2token')(token2id, id2token)
+    >>>
+    >>> # Initialize segmented topics and accumulator
+    >>> segmentation = [[(1, 2)]]
+    >>>
+    >>> accumulator = text_analysis.InvertedIndexAccumulator({1, 2}, dictionary)
+    >>> accumulator._inverted_index = {0: {2, 3, 4}, 1: {3, 5}}
+    >>> accumulator._num_docs = 5
+    >>>
+    >>> # result should be ~ ln(1 / 2) = -0.693147181
+    >>> result = direct_confirmation_measure.log_conditional_probability(segmentation, accumulator)[0]
     """
     topic_coherences = []
     num_docs = float(accumulator.num_docs)
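Note: to follow the doctest arithmetic above: with 5 documents, word id 1 occurs in docs {2, 3, 4} and word id 2 in docs {3, 5} (the keys 0 and 1 of `_inverted_index` are assumed to be the accumulator's contiguous internal ids for token ids 1 and 2), so the pair co-occurs only in document 3. Plugging those counts into the docstring's formula by hand:

>>> import numpy as np
>>>
>>> EPSILON = 1e-12  # same value as the module constant above
>>> num_docs = 5.0
>>> co_occur_count = 1.0  # docs containing both words: {3}
>>> w_star_count = 2.0  # docs containing W*: {3, 5}
>>>
>>> # m_lc(S_i) = log[(P(W', W*) + eps) / P(W*)]
>>> m_lc = np.log((co_occur_count / num_docs + EPSILON) / (w_star_count / num_docs))
>>> round(float(m_lc), 9)  # ~ ln(1 / 2), matching the doctest comment
-0.693147181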
@@ -56,17 +80,33 @@ def aggregate_segment_sims(segment_sims, with_std, with_support):


 def aggregate_segment_sims(segment_sims, with_std, with_support):
-    """Compute various statistics from the segment similarities generated via
-    set pairwise comparisons of top-N word lists for a single topic.
-
-    Args:
-        segment_sims (iterable): floating point similarity values to aggregate.
-        with_std (bool): Set to True to include standard deviation.
-        with_support (bool): Set to True to include number of elements in `segment_sims`
-            as a statistic in the results returned.
-
-    Returns:
-        tuple: with (mean[, std[, support]])
+    """Compute various statistics from the segment similarities generated via set pairwise comparisons
+    of top-N word lists for a single topic.
+
+    Parameters
+    ----------
+    segment_sims : iterable of float
+        Similarity values to aggregate.
+    with_std : bool
+        Set to True to include standard deviation.
+    with_support : bool
+        Set to True to include the number of elements in `segment_sims` as a statistic in the results returned.
+
+    Returns
+    -------
+    (float[, float[, int]])
+        Tuple with (mean[, std[, support]]).
+
+    Examples
+    --------
+    >>> from gensim.topic_coherence import direct_confirmation_measure
+    >>>
+    >>> segment_sims = [0.2, 0.5, 1., 0.05]
+    >>> direct_confirmation_measure.aggregate_segment_sims(segment_sims, True, True)
+    (0.4375, 0.36293077852394939, 4)
+    >>> direct_confirmation_measure.aggregate_segment_sims(segment_sims, False, False)
+    0.4375
     """
     mean = np.mean(segment_sims)
     stats = [mean]
@@ -78,32 +118,61 @@ def aggregate_segment_sims(segment_sims, with_std, with_support):
     return stats[0] if len(stats) == 1 else tuple(stats)


-def log_ratio_measure(
-        segmented_topics, accumulator, normalize=False, with_std=False, with_support=False):
-    """
-    If normalize=False:
-        Popularly known as PMI.
-        This function calculates the log-ratio-measure which is used by
-        coherence measures such as c_v.
-        This is defined as: m_lr(S_i) = log[(P(W', W*) + e) / (P(W') * P(W*))]
-
-    If normalize=True:
-        This function calculates the normalized-log-ratio-measure, popularly knowns as
-        NPMI which is used by coherence measures such as c_v.
-        This is defined as: m_nlr(S_i) = m_lr(S_i) / -log[P(W', W*) + e]
-
-    Args:
-        segmented_topics (list): Output from the segmentation module of the segmented
-            topics. Is a list of list of tuples.
-        accumulator: word occurrence accumulator from probability_estimation.
-        with_std (bool): True to also include standard deviation across topic segment
-            sets in addition to the mean coherence for each topic; default is False.
-        with_support (bool): True to also include support across topic segments. The
-            support is defined as the number of pairwise similarity comparisons were
-            used to compute the overall topic coherence.
-
-    Returns:
-        list : of log ratio measure for each topic.
+def log_ratio_measure(segmented_topics, accumulator, normalize=False, with_std=False, with_support=False):
+    """Compute the log ratio measure for `segmented_topics`.
+
+    Parameters
+    ----------
+    segmented_topics : list of lists of (int, int)
+        Output from the segmentation functions, e.g. :func:`~gensim.topic_coherence.segmentation.s_one_pre` or
+        :func:`~gensim.topic_coherence.segmentation.s_one_one`.
+    accumulator : :class:`~gensim.topic_coherence.text_analysis.InvertedIndexAccumulator`
+        Word occurrence accumulator from :mod:`gensim.topic_coherence.probability_estimation`.
+    normalize : bool, optional
+        Details in the "Notes" section.
+    with_std : bool, optional
+        True to also include standard deviation across topic segment sets, in addition to the mean coherence
+        for each topic.
+    with_support : bool, optional
+        True to also include support across topic segments. The support is defined as the number of pairwise
+        similarity comparisons that were used to compute the overall topic coherence.
+
+    Notes
+    -----
+    If `normalize=False`:
+        Calculate the log-ratio-measure, popularly known as **PMI**, which is used by coherence measures
+        such as `c_v`.
+        This is defined as :math:`m_{lr}(S_i) = log \\frac{P(W', W^{*}) + \\epsilon}{P(W') * P(W^{*})}`.
+
+    If `normalize=True`:
+        Calculate the normalized-log-ratio-measure, popularly known as **NPMI**,
+        which is used by coherence measures such as `c_v`.
+        This is defined as :math:`m_{nlr}(S_i) = \\frac{m_{lr}(S_i)}{-log(P(W', W^{*}) + \\epsilon)}`.
+
+    Returns
+    -------
+    list of float
+        Log ratio measurements for each topic.
+
+    Examples
+    --------
+    >>> from gensim.topic_coherence import direct_confirmation_measure, text_analysis
+    >>> from collections import namedtuple
+    >>>
+    >>> # Create dictionary
+    >>> id2token = {1: 'test', 2: 'doc'}
+    >>> token2id = {v: k for k, v in id2token.items()}
+    >>> dictionary = namedtuple('Dictionary', 'token2id, id2token')(token2id, id2token)
+    >>>
+    >>> # Initialize segmented topics and accumulator
+    >>> segmentation = [[(1, 2)]]
+    >>>
+    >>> accumulator = text_analysis.InvertedIndexAccumulator({1, 2}, dictionary)
+    >>> accumulator._inverted_index = {0: {2, 3, 4}, 1: {3, 5}}
+    >>> accumulator._num_docs = 5
+    >>>
+    >>> # result should be ~ ln{(1 / 5) / [(3 / 5) * (2 / 5)]} = -0.182321557
+    >>> result = direct_confirmation_measure.log_ratio_measure(segmentation, accumulator)[0]
     """
     topic_coherences = []
     num_docs = float(accumulator.num_docs)
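Note: the same toy counts verify the PMI comment in the last doctest, and the NPMI branch follows from the "Notes" section; a hand-check sketch using only the formulas stated above:

>>> import numpy as np
>>>
>>> EPSILON = 1e-12
>>> p_joint = 1.0 / 5.0  # P(W', W*): the words co-occur only in doc 3
>>> p_w_prime = 3.0 / 5.0  # P(W'): docs {2, 3, 4}
>>> p_w_star = 2.0 / 5.0  # P(W*): docs {3, 5}
>>>
>>> # normalize=False: m_lr = log[(P(W', W*) + eps) / (P(W') * P(W*))]
>>> m_lr = np.log((p_joint + EPSILON) / (p_w_prime * p_w_star))
>>> round(float(m_lr), 9)  # matches the doctest comment
-0.182321557
>>> # normalize=True: m_nlr = m_lr / -log(P(W', W*) + eps)
>>> round(float(m_lr / -np.log(p_joint + EPSILON)), 6)
-0.113283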