Refactor API reference gensim.topic_coherence. Fix #1669 (#1714)
* Refactored aggregation

* Micro-Fix for aggregation.py, partially refactored direct_confirmation.py

* Partially refactored indirect_confirmation_measure

* Some additions

* Math attempts

* add math extension for sphinx

* Minor refactoring

* Some refactoring for probability_estimation

* Beta-strings

* Different additions

* Minor changes

* text_analysis left

* Added example for ContextVectorComputer class

* probability_estimation 0.9

* beta_version

* Added some examples for text_analysis

* text_analysis: corrected example for class UsesDictionary

* Final additions for text_analysis.py

* fix cross-reference problem

* fix pep8

* fix aggregation

* fix direct_confirmation_measure

* fix types in direct_confirmation_measure

* partial fix indirect_confirmation_measure

* HotFix for probability_estimation and segmentation

* Refactoring for probability_estimation

* Changes for indirect_confirmation_measure

* Fixed segmentation, partly fixed text_analysis

* Add Notes for text_analysis

* fix di/ind

* fix doc examples in probability_estimation

* fix probability_estimation

* fix segmentation

* fix docstring in probability_estimation

* partial fix test_analysis

* add latex stuff for docs build

* doc fix[1]

* doc fix[2]

* remove apt install from travis (now doc build in circle)
CLearERR authored and menshikh-iv committed Jan 10, 2018
1 parent 4644606 commit 0a4419f
Showing 8 changed files with 748 additions and 261 deletions.
1 change: 1 addition & 0 deletions docs/src/topic_coherence/text_analysis.rst
@@ -7,3 +7,4 @@
     :inherited-members:
     :undoc-members:
     :show-inheritance:
+    :special-members: __getitem__
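Note: the reason `__getitem__` deserves documentation here is that the accumulator classes support dict-style indexing for occurrence and co-occurrence counts, which Sphinx autodoc hides by default for special members. A minimal sketch of that behaviour, reusing the toy accumulator from the doctests later in this commit (the exact indexing semantics are an assumption inferred from those doctests, not stated in this .rst change):

>>> from collections import namedtuple
>>> from gensim.topic_coherence import text_analysis
>>>
>>> id2token = {1: 'test', 2: 'doc'}
>>> token2id = {v: k for k, v in id2token.items()}
>>> dictionary = namedtuple('Dictionary', 'token2id, id2token')(token2id, id2token)
>>> accumulator = text_analysis.InvertedIndexAccumulator({1, 2}, dictionary)
>>> accumulator._inverted_index = {0: {2, 3, 4}, 1: {3, 5}}
>>> accumulator._num_docs = 5
>>>
>>> accumulator[1]  # occurrence count for token id 1 (docs {2, 3, 4}), assuming __getitem__ dispatches to get_occurrences
3
>>> accumulator[1, 2]  # co-occurrence count for the pair (doc {3}), assuming a tuple dispatches to get_co_occurrences
1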
4 changes: 2 additions & 2 deletions gensim/models/atmodel.py
@@ -560,10 +560,10 @@ def update(self, corpus=None, author2doc=None, doc2author=None, chunksize=None,
         Args:
             corpus (gensim corpus): The corpus with which the author-topic model should be updated.
-            author2doc (dictionary): author to document mapping corresponding to indexes in input
+            author2doc (dict): author to document mapping corresponding to indexes in input
                 corpus.
-            doc2author (dictionary): document to author mapping corresponding to indexes in input
+            doc2author (dict): document to author mapping corresponding to indexes in input
                 corpus.
             chunks_as_numpy (bool): Whether each chunk passed to `.inference` should be a np
26 changes: 17 additions & 9 deletions gensim/topic_coherence/aggregation.py
@@ -4,10 +4,7 @@
 # Copyright (C) 2013 Radim Rehurek <radimrehurek@seznam.cz>
 # Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

-"""
-This module contains functions to perform aggregation on a list of values
-obtained from the confirmation measure.
-"""
+"""This module contains functions to perform aggregation on a list of values obtained from the confirmation measure."""

 import logging
 import numpy as np
@@ -17,13 +14,24 @@
 def arithmetic_mean(confirmed_measures):
     """
-    This functoin performs the arithmetic mean aggregation on the output obtained from
+    Perform the arithmetic mean aggregation on the output obtained from
     the confirmation measure module.

-    Args:
-        confirmed_measures : list of calculated confirmation measure on each set in the segmented topics.
-
-    Returns:
-        mean : Arithmetic mean of all the values contained in confirmation measures.
+    Parameters
+    ----------
+    confirmed_measures : list of float
+        List of calculated confirmation measure on each set in the segmented topics.
+
+    Returns
+    -------
+    `numpy.float`
+        Arithmetic mean of all the values contained in confirmation measures.
+
+    Examples
+    --------
+    >>> from gensim.topic_coherence.aggregation import arithmetic_mean
+    >>> arithmetic_mean([1.1, 2.2, 3.3, 4.4])
+    2.75
     """
     return np.mean(confirmed_measures)
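Note: since `arithmetic_mean` simply delegates to `np.mean`, the new doctest is easy to sanity-check; a minimal sketch using only what the diff above shows:

>>> import numpy as np
>>> from gensim.topic_coherence.aggregation import arithmetic_mean
>>>
>>> values = [1.1, 2.2, 3.3, 4.4]
>>> arithmetic_mean(values) == np.mean(values)  # thin wrapper, so the two always agree
True
>>> type(arithmetic_mean(values)).__name__  # a numpy scalar, per the Returns section above
'float64'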
179 changes: 124 additions & 55 deletions gensim/topic_coherence/direct_confirmation_measure.py
@@ -4,37 +4,61 @@
 # Copyright (C) 2013 Radim Rehurek <radimrehurek@seznam.cz>
 # Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

-"""
-This module contains functions to compute direct confirmation on a pair of words or word subsets.
-"""
+"""This module contains functions to compute direct confirmation on a pair of words or word subsets."""

 import logging

 import numpy as np

 logger = logging.getLogger(__name__)

-EPSILON = 1e-12  # Should be small. Value as suggested in paper.
+# Should be small. Value as suggested in paper http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf
+EPSILON = 1e-12


 def log_conditional_probability(segmented_topics, accumulator, with_std=False, with_support=False):
-    """
-    This function calculates the log-conditional-probability measure
-    which is used by coherence measures such as U_mass.
-
-    Args:
-        segmented_topics (list): Output from the segmentation module of the segmented
-            topics. Is a list of list of tuples.
-        accumulator: word occurrence accumulator from probability_estimation.
-        with_std (bool): True to also include standard deviation across topic segment
-            sets in addition to the mean coherence for each topic; default is False.
-        with_support (bool): True to also include support across topic segments. The
-            support is defined as the number of pairwise similarity comparisons were
-            used to compute the overall topic coherence.
-
-    Returns:
-        list : of log conditional probability measure for each topic.
+    """Calculate the log-conditional-probability measure which is used by coherence measures such as `U_mass`.
+    This is defined as :math:`m_{lc}(S_i) = log \\frac{P(W', W^{*}) + \\epsilon}{P(W^{*})}`.
+
+    Parameters
+    ----------
+    segmented_topics : list of lists of (int, int)
+        Output from the segmentation functions, e.g. :func:`~gensim.topic_coherence.segmentation.s_one_pre` or
+        :func:`~gensim.topic_coherence.segmentation.s_one_one`.
+    accumulator : :class:`~gensim.topic_coherence.text_analysis.InvertedIndexAccumulator`
+        Word occurrence accumulator from :mod:`gensim.topic_coherence.probability_estimation`.
+    with_std : bool, optional
+        True to also include standard deviation across topic segment sets, in addition to the mean coherence
+        for each topic.
+    with_support : bool, optional
+        True to also include support across topic segments. The support is defined as the number of pairwise
+        similarity comparisons that were used to compute the overall topic coherence.
+
+    Returns
+    -------
+    list of float
+        Log conditional probability measurements for each topic.
+
+    Examples
+    --------
+    >>> from gensim.topic_coherence import direct_confirmation_measure, text_analysis
+    >>> from collections import namedtuple
+    >>>
+    >>> # Create dictionary
+    >>> id2token = {1: 'test', 2: 'doc'}
+    >>> token2id = {v: k for k, v in id2token.items()}
+    >>> dictionary = namedtuple('Dictionary', 'token2id, id2token')(token2id, id2token)
+    >>>
+    >>> # Initialize segmented topics and accumulator
+    >>> segmentation = [[(1, 2)]]
+    >>>
+    >>> accumulator = text_analysis.InvertedIndexAccumulator({1, 2}, dictionary)
+    >>> accumulator._inverted_index = {0: {2, 3, 4}, 1: {3, 5}}
+    >>> accumulator._num_docs = 5
+    >>>
+    >>> # result should be ~ ln(1 / 2) = -0.693147181
+    >>> result = direct_confirmation_measure.log_conditional_probability(segmentation, accumulator)[0]
     """
     topic_coherences = []
     num_docs = float(accumulator.num_docs)
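Note: to follow the doctest arithmetic above: with 5 documents, word id 1 occurs in docs {2, 3, 4} and word id 2 in docs {3, 5} (the keys 0 and 1 of `_inverted_index` are assumed to be the accumulator's contiguous internal ids for token ids 1 and 2), so the pair co-occurs only in document 3. Plugging those counts into the docstring's formula by hand:

>>> import numpy as np
>>>
>>> EPSILON = 1e-12  # same value as the module constant above
>>> num_docs = 5.0
>>> co_occur_count = 1.0  # docs containing both words: {3}
>>> w_star_count = 2.0  # docs containing W*: {3, 5}
>>>
>>> # m_lc(S_i) = log[(P(W', W*) + eps) / P(W*)]
>>> m_lc = np.log((co_occur_count / num_docs + EPSILON) / (w_star_count / num_docs))
>>> round(float(m_lc), 9)  # ~ ln(1 / 2), matching the doctest comment
-0.693147181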
@@ -56,17 +80,33 @@ def aggregate_segment_sims(segment_sims, with_std, with_support):


 def aggregate_segment_sims(segment_sims, with_std, with_support):
-    """Compute various statistics from the segment similarities generated via
-    set pairwise comparisons of top-N word lists for a single topic.
-
-    Args:
-        segment_sims (iterable): floating point similarity values to aggregate.
-        with_std (bool): Set to True to include standard deviation.
-        with_support (bool): Set to True to include number of elements in `segment_sims`
-            as a statistic in the results returned.
-
-    Returns:
-        tuple: with (mean[, std[, support]])
+    """Compute various statistics from the segment similarities generated via set pairwise comparisons
+    of top-N word lists for a single topic.
+
+    Parameters
+    ----------
+    segment_sims : iterable of float
+        Similarity values to aggregate.
+    with_std : bool
+        Set to True to include standard deviation.
+    with_support : bool
+        Set to True to include the number of elements in `segment_sims` as a statistic in the results returned.
+
+    Returns
+    -------
+    (float[, float[, int]])
+        Tuple with (mean[, std[, support]]).
+
+    Examples
+    --------
+    >>> from gensim.topic_coherence import direct_confirmation_measure
+    >>>
+    >>> segment_sims = [0.2, 0.5, 1., 0.05]
+    >>> direct_confirmation_measure.aggregate_segment_sims(segment_sims, True, True)
+    (0.4375, 0.36293077852394939, 4)
+    >>> direct_confirmation_measure.aggregate_segment_sims(segment_sims, False, False)
+    0.4375
     """
     mean = np.mean(segment_sims)
     stats = [mean]
@@ -78,32 +118,61 @@ def aggregate_segment_sims(segment_sims, with_std, with_support):
     return stats[0] if len(stats) == 1 else tuple(stats)


-def log_ratio_measure(
-        segmented_topics, accumulator, normalize=False, with_std=False, with_support=False):
-    """
-    If normalize=False:
-        Popularly known as PMI.
-        This function calculates the log-ratio-measure which is used by
-        coherence measures such as c_v.
-        This is defined as: m_lr(S_i) = log[(P(W', W*) + e) / (P(W') * P(W*))]
-
-    If normalize=True:
-        This function calculates the normalized-log-ratio-measure, popularly knowns as
-        NPMI which is used by coherence measures such as c_v.
-        This is defined as: m_nlr(S_i) = m_lr(S_i) / -log[P(W', W*) + e]
-
-    Args:
-        segmented_topics (list): Output from the segmentation module of the segmented
-            topics. Is a list of list of tuples.
-        accumulator: word occurrence accumulator from probability_estimation.
-        with_std (bool): True to also include standard deviation across topic segment
-            sets in addition to the mean coherence for each topic; default is False.
-        with_support (bool): True to also include support across topic segments. The
-            support is defined as the number of pairwise similarity comparisons were
-            used to compute the overall topic coherence.
-
-    Returns:
-        list : of log ratio measure for each topic.
+def log_ratio_measure(segmented_topics, accumulator, normalize=False, with_std=False, with_support=False):
+    """Compute the log ratio measure for `segmented_topics`.
+
+    Parameters
+    ----------
+    segmented_topics : list of lists of (int, int)
+        Output from the segmentation functions, e.g. :func:`~gensim.topic_coherence.segmentation.s_one_pre` or
+        :func:`~gensim.topic_coherence.segmentation.s_one_one`.
+    accumulator : :class:`~gensim.topic_coherence.text_analysis.InvertedIndexAccumulator`
+        Word occurrence accumulator from :mod:`gensim.topic_coherence.probability_estimation`.
+    normalize : bool, optional
+        Details in the "Notes" section.
+    with_std : bool, optional
+        True to also include standard deviation across topic segment sets, in addition to the mean coherence
+        for each topic.
+    with_support : bool, optional
+        True to also include support across topic segments. The support is defined as the number of pairwise
+        similarity comparisons that were used to compute the overall topic coherence.
+
+    Notes
+    -----
+    If `normalize=False`:
+        Calculate the log-ratio-measure, popularly known as **PMI**, which is used by coherence measures
+        such as `c_v`.
+        This is defined as :math:`m_{lr}(S_i) = log \\frac{P(W', W^{*}) + \\epsilon}{P(W') * P(W^{*})}`.
+
+    If `normalize=True`:
+        Calculate the normalized-log-ratio-measure, popularly known as **NPMI**,
+        which is used by coherence measures such as `c_v`.
+        This is defined as :math:`m_{nlr}(S_i) = \\frac{m_{lr}(S_i)}{-log(P(W', W^{*}) + \\epsilon)}`.
+
+    Returns
+    -------
+    list of float
+        Log ratio measurements for each topic.
+
+    Examples
+    --------
+    >>> from gensim.topic_coherence import direct_confirmation_measure, text_analysis
+    >>> from collections import namedtuple
+    >>>
+    >>> # Create dictionary
+    >>> id2token = {1: 'test', 2: 'doc'}
+    >>> token2id = {v: k for k, v in id2token.items()}
+    >>> dictionary = namedtuple('Dictionary', 'token2id, id2token')(token2id, id2token)
+    >>>
+    >>> # Initialize segmented topics and accumulator
+    >>> segmentation = [[(1, 2)]]
+    >>>
+    >>> accumulator = text_analysis.InvertedIndexAccumulator({1, 2}, dictionary)
+    >>> accumulator._inverted_index = {0: {2, 3, 4}, 1: {3, 5}}
+    >>> accumulator._num_docs = 5
+    >>>
+    >>> # result should be ~ ln{(1 / 5) / [(3 / 5) * (2 / 5)]} = -0.182321557
+    >>> result = direct_confirmation_measure.log_ratio_measure(segmentation, accumulator)[0]
     """
     topic_coherences = []
     num_docs = float(accumulator.num_docs)
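Note: the same toy counts verify the PMI comment in the last doctest, and the NPMI branch follows from the "Notes" section; a hand-check sketch using only the formulas stated above:

>>> import numpy as np
>>>
>>> EPSILON = 1e-12
>>> p_joint = 1.0 / 5.0  # P(W', W*): the words co-occur only in doc 3
>>> p_w_prime = 3.0 / 5.0  # P(W'): docs {2, 3, 4}
>>> p_w_star = 2.0 / 5.0  # P(W*): docs {3, 5}
>>>
>>> # normalize=False: m_lr = log[(P(W', W*) + eps) / (P(W') * P(W*))]
>>> m_lr = np.log((p_joint + EPSILON) / (p_w_prime * p_w_star))
>>> round(float(m_lr), 9)  # matches the doctest comment
-0.182321557
>>> # normalize=True: m_nlr = m_lr / -log(P(W', W*) + eps)
>>> round(float(m_lr / -np.log(p_joint + EPSILON)), 6)
-0.113283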