Fast object counting + Phrases #1446

Closed · wants to merge 7 commits
150 changes: 150 additions & 0 deletions gensim/models/fast_counter.py
@@ -0,0 +1,150 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (C) 2017 Radim Rehurek <me@radimrehurek.com>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

"""
Fast & memory efficient counting of things (and n-grams of things).

This module is designed to count item frequencies over large, streamed corpora (lazy iteration).

Such counts are useful in various other modules, such as Dictionary, TfIdf, Phrases etc.

"""

import sys
import os
from collections import defaultdict
import logging

from gensim import utils
from gensim.models.fast_counter_cython import FastCounterCython, FastCounterPreshed

logger = logging.getLogger(__name__)


def iter_ngrams(document, ngrams):
    assert ngrams[0] <= ngrams[1]

    for n in range(ngrams[0], ngrams[1] + 1):
        for ngram in zip(*[document[i:] for i in range(n)]):
            logger.debug("yielding ngram %r", ngram)
            yield ngram

def iter_gram1(document):
    return iter_ngrams(document, (1, 1))

def iter_gram2(document):
    return iter_ngrams(document, (2, 2))

def iter_gram12(document):
    return iter_ngrams(document, (1, 2))


class FastCounter(object):
    """
    Fast counting of item frequencies across large, streamed iterables.
    """

    def __init__(self, doc2items=iter_gram1, max_size=None):
        self.doc2items = doc2items
        self.max_size = max_size
        self.min_reduce = 0
        self.hash2cnt = defaultdict(int)

    def hash(self, item):
        return hash(item)
Contributor:
Maybe use mmh3 for this purpose?
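For illustration, a minimal sketch of what that swap might look like (assumes the mmh3 package is installed and that `item` is a tuple of string tokens, as produced by iter_ngrams; joining on a space is an arbitrary choice):

import mmh3

    def hash(self, item):
        # mmh3.hash64 returns a pair of 64-bit ints; keeping the first one is enough here
        return mmh3.hash64(u' '.join(item))[0]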


    def update(self, documents):
        """
        Update the relevant ngram counters from the iterable `documents`.

        If the memory structures get too large, clip them (then the internal counts may be only approximate).
        """
        for document in documents:
            for item in self.doc2items(document):
                self.hash2cnt[self.hash(item)] += 1
Collaborator:
The fact that this implementation ignores hash collisions is an important detail to document.
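A tiny illustration of what that means in practice (hypothetical, using a deliberately weakened hash so collisions are easy to force): distinct items whose hashes collide silently share one counter, so get() can over-report rare items.

from collections import defaultdict

def weak_hash(item):
    return hash(item) % 10  # absurdly small range, just to provoke collisions

hash2cnt = defaultdict(int)
for item in ['alpha', 'beta', 'gamma', 'delta', 'epsilon', 'zeta', 'eta', 'theta', 'iota', 'kappa', 'lambda', 'mu']:
    hash2cnt[weak_hash(item)] += 1
# len(hash2cnt) < 12 by the pigeonhole principle: some buckets hold counts for
# more than one distinct item, and the counter cannot tell them apart afterwards.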

            self.prune_items()

        return self  # for easier chaining

    def prune_items(self):
        """Trim data structures to fit in memory, if too large."""
        # XXX: Or use a fixed-size data structure to start with (hyperloglog?)
        while self.max_size and len(self) > self.max_size:
            self.min_reduce += 1
            utils.prune_vocab(self.hash2cnt, self.min_reduce)
Contributor:
This can potentially be very slow, as every utils.prune_vocab call is going to iterate over the entire vocabulary. It could be worthwhile to store a min_count property. Depending on the data structures used, this might not be trivial though.
Anyway, the discussion right now is about API/design and not optimization, I suppose.

Owner Author:
What would this min_count property do?

Collaborator:
I'd want to see profiling that indicates this survey is a major time cost before spending much time optimizing. If it needed to be optimized, it might be possible to retain a min_reduce/min_count threshold and, during tallying, track when a key's tally exceeds this count, perhaps removing it from a 'most-eligible-for-pruning' list.

With the existing algorithm, where on each prune everything under the count floor reached last time is automatically discarded, this hypothetical "most eligible" list could be discarded first – without even looking at the existing counts again. Maybe that'd be enough... though if keeping this 'escalating min_reduce' you'd still need to iterate over all the remaining keys. (It seems a lot like GC, with a 'tenured' set that might be able to skip some collections.)

But as I've mentioned in previous discussions around Phrases: I suspect this 'escalating floor prune' is a somewhat undesirable algorithm, injecting more imprecision/bias into the final counts than necessary, and (maybe) losing a lot of keys that would survive a truly unbounded counting method. (TL;DR: because of typical frequency distributions plus the always-escalating floor, every prune may wind up eliminating 90%+ of all existing words, starving some words of the chance to ever survive a prune, or in some cases inefficiently leaving the final dict at a small fraction of the desired max_size.) Maybe it's necessary to prevent too-frequent prunes, but I'd prefer an option to be more precise via less-frequent prunes – especially if I'd hypothetically minimized the full time cost of Phrases analysis via other process improvements or effective parallelism.
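For comparison, a minimal sketch of the "less frequent but more precise" alternative described above: instead of an ever-escalating count floor, keep exactly the top max_size keys at each prune (a hypothetical variant of FastCounter.prune_items, not code from this PR):

import heapq

    def prune_items(self):
        # hypothetical: keep only the `max_size` highest-count keys, drop the rest
        if self.max_size and len(self.hash2cnt) > self.max_size:
            survivors = heapq.nlargest(self.max_size, self.hash2cnt.items(), key=lambda kv: kv[1])
            self.hash2cnt = defaultdict(int, survivors)

Each prune then costs roughly O(n log max_size), but the result holds exactly max_size keys, chosen by current count rather than by whatever happens to survive an escalating floor.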

Owner Author (@piskvorky), Jun 26, 2017:
It looks like we're on to optimizations and choosing algorithms now, so let me tick off the "API design" checkbox.

API conclusion: mimic the Counter API, except add a parameter that allows restricting its maximum size, ideally in bytes. Make it clear approximations were necessary to achieve this, and what the trade-offs are (Counter methods we cannot support, perf-memory balance implications).

@jayantj @menshikh-iv do you agree?
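A rough sketch of what that could look like (the class name, parameter names and the bytes-per-entry constant are illustrative assumptions, not a settled design):

class BoundedCounter(object):
    """Counter-like API with an approximate cap on memory use, given in bytes (sketch)."""

    BYTES_PER_ENTRY = 10  # assumed average per-key footprint; a real value needs measurement

    def __init__(self, max_size=None):
        self.max_items = max_size // self.BYTES_PER_ENTRY if max_size else None
        self.min_reduce = 0
        self.counts = {}

    def update(self, items):
        for item in items:
            self.counts[item] = self.counts.get(item, 0) + 1
        while self.max_items and len(self.counts) > self.max_items:
            # approximation kicks in here: low-count keys are discarded,
            # so later lookups may under-report
            self.min_reduce += 1
            self.counts = {k: v for k, v in self.counts.items() if v > self.min_reduce}

    def __getitem__(self, item):
        return self.counts.get(item, 0)

    def most_common(self, n=None):
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked if n is None else ranked[:n]

The key deviation from collections.Counter is that once pruning kicks in, lookups become lower bounds rather than exact counts, and methods such as elements() are hard to support faithfully.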

Contributor:
@piskvorky I agree with you.
Also, maybe we need to add a static method for size estimation (for example, if we store X tokens, how many megabytes of RAM we need for that). It can be useful when choosing the max_size parameter.
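Something like the following, perhaps (a hedged sketch; the 10-bytes-per-token figure is a placeholder assumption that would need calibrating against the real backing structure):

    @staticmethod
    def estimate_memory(num_tokens, bytes_per_token=10):
        """Rough RAM estimate, in megabytes, for counting `num_tokens` distinct tokens."""
        return num_tokens * bytes_per_token / (1024.0 * 1024.0)

    # e.g. estimate_memory(100000000) ~= 954 MB, i.e. roughly the
    # "1GB ~= 100,000,000 items" rule of thumb mentioned in the reply below.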

Owner Author (@piskvorky), Jun 30, 2017:
Yes, though the best place for that is probably the docstring. Something like max_size==1GB gives you room for approximately 100,000,000 items, and the number scales linearly with more/less RAM (or whatever the case may be).


    def get(self, item, default=None):
Contributor:
__getitem__ would be good too.
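E.g. a minimal sketch of that addition (hypothetical; returns 0 for unseen items, mirroring collections.Counter rather than raising KeyError):

    def __getitem__(self, item):
        # Counter-style lookup: missing items count as 0
        return self.hash2cnt.get(self.hash(item), 0)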

"""Return the item frequency of `item` (or `default` if item not present)."""
return self.hash2cnt.get(self.hash(item), default)

    def merge(self, other):
        """
        Merge counts from another FastCounter into self, in-place.
        """
        self.hash2cnt.update(other.hash2cnt)
        self.min_reduce = max(self.min_reduce, other.min_reduce)
        self.prune_items()

    def __len__(self):
        return len(self.hash2cnt)

    def __str__(self):
        return "%s<%i items>" % (self.__class__.__name__, len(self))


class Phrases(object):
    def __init__(self, min_count=5, threshold=10.0, max_vocab_size=40000000):
        self.threshold = threshold
        self.min_count = min_count
        self.max_vocab_size = max_vocab_size
        # self.counter = FastCounter(iter_gram12, max_size=max_vocab_size)
        self.counter = FastCounterCython()
        # self.counter = FastCounterPreshed()

    def add_documents(self, documents):
        self.counter.update(documents)

        return self  # for easier chaining

    def export_phrases(self, document):
        """
        Yield all collocations (pairs of adjacent closely related tokens) from the
        input `document`, as 2-tuples `(score, bigram)`.
        """
        norm = 1.0 * len(self.counter)
        for bigram in iter_gram2(document):
            pa, pb, pab = self.counter.get((bigram[0],)), self.counter.get((bigram[1],)), self.counter.get(bigram, 0)
            if pa and pb:
                score = norm / pa / pb * (pab - self.min_count)
                if score > self.threshold:
                    yield score, bigram


if __name__ == '__main__':
    logging.basicConfig(format='%(asctime)s : %(threadName)s : %(levelname)s : %(message)s', level=logging.INFO)
    logger.info("running %s", " ".join(sys.argv))

    # check and process cmdline input
    program = os.path.basename(sys.argv[0])
    if len(sys.argv) < 2:
        print(globals()['__doc__'] % locals())
        sys.exit(1)
    infile = sys.argv[1]

    from gensim.models.word2vec import Text8Corpus
    documents = Text8Corpus(infile)

    logger.info("training phrases")
    bigram = Phrases(min_count=5, threshold=100).add_documents(documents)
    logger.info("finished training phrases")
    print(bigram.counter)
    # for doc in documents:
    #     s = u' '.join(doc)
    #     for _, bigram in bigram.export_phrases(doc):
    #         s = s.replace(u' '.join(bigram), u'_'.join(bigram))
    #     print(utils.to_utf8(s))

    logger.info("finished running %s", " ".join(sys.argv))
137 changes: 137 additions & 0 deletions gensim/models/fast_counter_cython.pyx
@@ -0,0 +1,137 @@
#!/usr/bin/env cython
# cython: boundscheck=False
# cython: wraparound=False
# cython: cdivision=True
# coding: utf-8
#
# Copyright (C) 2017 Radim Rehurek <me@radimrehurek.com>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

from collections import defaultdict

from libc.stdint cimport int64_t, uint64_t

cimport preshed.counter


cdef uint64_t chash(obj):
    # TODO use something faster, can assume string
    return <uint64_t>hash(obj)


class FastCounterCython(object):
    """
    Fast counting of item frequencies across large, streamed iterables.
    """

    def __init__(self, doc2items=None, max_size=None):
        self.doc2items = doc2items
        self.max_size = max_size
        self.min_reduce = 0
        self.hash2cnt = defaultdict(int)

    def update(self, documents):
        """
        Update the relevant ngram counters from the iterable `documents`.

        If the memory structures get too large, clip them (then the internal counts may be only approximate).
        """
        cdef Py_ssize_t idx, l
        cdef uint64_t h1, h2
        hash2cnt = self.hash2cnt
        for document in documents:
            l = len(document)
            if l:
                h1 = chash(document[0])
                hash2cnt[h1] += 1
                for idx in range(1, l):
                    h2 = chash(document[idx])
                    hash2cnt[h2] += 1
                    hash2cnt[h1 + h2] += 1
Collaborator:
Simple addition to determine the hash of a bigram gives (A, B) and (B, A) the same hash value – problematic for phrase promotion.

Owner Author (@piskvorky), Jun 24, 2017:
Yeah, I had XOR here before, but then A ^ A == 0 == B ^ B, which is also not good. The hashing requires a little more thought, too hackish.

Contributor:
We can use h1 ** h2, but this will lead to overflow

Owner Author (@piskvorky), Jun 26, 2017:
That's not a native operation (instruction), so probably not a good idea.

Google shows me this for boost::hash_combine

size_t hash_combine( size_t lhs, size_t rhs ) {
  lhs^= rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2);
  return lhs;
}

That looks fancy, but probably anything that breaks the symmetry would be enough, such as 3*lhs + rhs.
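For instance, a minimal Cython sketch of an order-sensitive combine, following the boost recipe above (illustrative only, not code from this PR):

cdef uint64_t hash_combine(uint64_t lhs, uint64_t rhs):
    # asymmetric mix: hash_combine(a, b) != hash_combine(b, a) in general
    lhs ^= rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2)
    return lhs

# inside update(), the bigram count would then become:
#     hash2cnt[hash_combine(h1, h2)] += 1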

Owner Author:
Here's a repo with a bunch of useful hashing algorithms & their comparison: https://github.com/rurban/smhasher

Owner Author:
One more interesting resource on hashing (in the context of Bloom filters / min-sketch counting in #508 ): https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf
CC @isamaru
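For context, a toy pure-Python count-min sketch along the lines of that paper (an illustrative sketch of the fixed-memory idea, not code proposed here): memory is bounded by width × depth counters, and queries return an upper bound on the true count.

import random

class CountMinSketch(object):
    def __init__(self, width=2 ** 20, depth=4, seed=42):
        rnd = random.Random(seed)
        self.width, self.depth = width, depth
        self.seeds = [rnd.getrandbits(64) for _ in range(depth)]
        self.tables = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        return [hash((seed, item)) % self.width for seed in self.seeds]

    def add(self, item, count=1):
        for row, bucket in zip(self.tables, self._buckets(item)):
            row[bucket] += count

    def get(self, item):
        # the minimum over rows bounds the true count from above
        return min(row[bucket] for row, bucket in zip(self.tables, self._buckets(item)))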

                    h1 = h2

        # FIXME: add optimized prune

        return self  # for easier chaining

    def prune_items(self):
        """Trim data structures to fit in memory, if too large."""
        # XXX: Or use a fixed-size data structure to start with (hyperloglog?)
        pass

    def get(self, item, default=None):
        """Return the item frequency of `item` (or `default` if item not present)."""
        return self.hash2cnt.get(chash(item), default)

    def merge(self, other):
        """
        Merge counts from another FastCounter into self, in-place.
        """
        self.hash2cnt.update(other.hash2cnt)
        self.min_reduce = max(self.min_reduce, other.min_reduce)
        self.prune_items()

    def __len__(self):
        return len(self.hash2cnt)

    def __str__(self):
        return "%s<%i items>" % (self.__class__.__name__, len(self))


class FastCounterPreshed(object):
    """
    Fast counting of item frequencies across large, streamed iterables.
    """

    def __init__(self, doc2items=None, max_size=None):
        self.doc2items = doc2items
        self.max_size = max_size
        self.min_reduce = 0
        self.hash2cnt = preshed.counter.PreshCounter()  # TODO replace by some GIL-free low-level struct

    def update(self, documents):
        """
        Update the relevant ngram counters from the iterable `documents`.

        If the memory structures get too large, clip them (then the internal counts may be only approximate).
        """
        cdef Py_ssize_t idx, l
        cdef uint64_t h1, h2
        cdef preshed.counter.PreshCounter hash2cnt = self.hash2cnt
        for document in documents:
            l = len(document)
            if l:
                h1 = chash(document[0])
                hash2cnt.inc(h1, 1)
                for idx in range(1, l):
                    h2 = chash(document[idx])
                    hash2cnt.inc(h2, 1)
                    hash2cnt.inc(h1 + h2, 1)
                    h1 = h2

        # FIXME: add optimized prune

        return self  # for easier chaining

    def prune_items(self):
        """Trim data structures to fit in memory, if too large."""
        # XXX: Or use a fixed-size data structure to start with (hyperloglog?)
        pass

    def get(self, item, default=None):
        """Return the item frequency of `item` (or `default` if item not present)."""
        return self.hash2cnt.get(chash(item), default)

    def merge(self, other):
        """
        Merge counts from another FastCounter into self, in-place.
        """
        self.hash2cnt.update(other.hash2cnt)
        self.min_reduce = max(self.min_reduce, other.min_reduce)
        self.prune_items()

    def __len__(self):
        return len(self.hash2cnt)

    def __str__(self):
        return "%s<%i items>" % (self.__class__.__name__, len(self))
20 changes: 10 additions & 10 deletions gensim/models/phrases.py
@@ -134,12 +134,6 @@ def __init__(self, sentences=None, min_count=5, threshold=10.0,
        should be a byte string (e.g. b'_').

        """
        if min_count <= 0:
            raise ValueError("min_count should be at least 1")

        if threshold <= 0:
            raise ValueError("threshold should be positive")

        self.min_count = min_count
        self.threshold = threshold
        self.max_vocab_size = max_vocab_size
@@ -169,7 +163,7 @@ def learn_vocab(sentences, max_vocab_size, delimiter=b'_', progress_per=10000):
            if sentence_no % progress_per == 0:
                logger.info("PROGRESS: at sentence #%i, processed %i words and %i word types" %
                            (sentence_no, total_words, len(vocab)))
            sentence = [utils.any2utf8(w) for w in sentence]
            # sentence = [utils.any2utf8(w) for w in sentence]
            for bigram in zip(sentence, sentence[1:]):
                vocab[bigram[0]] += 1
                vocab[delimiter.join(bigram)] += 1
@@ -394,7 +388,7 @@ def __getitem__(self, sentence):

if __name__ == '__main__':
    logging.basicConfig(format='%(asctime)s : %(threadName)s : %(levelname)s : %(message)s', level=logging.INFO)
    logging.info("running %s" % " ".join(sys.argv))
    logger.info("running %s", " ".join(sys.argv))

    # check and process cmdline input
    program = os.path.basename(sys.argv[0])
@@ -408,6 +402,12 @@ def __getitem__(self, sentence):
    sentences = Text8Corpus(infile)

    # test_doc = LineSentence('test/test_data/testcorpus.txt')
    logger.info("training phrases")
    bigram = Phrases(sentences, min_count=5, threshold=100)
    for s in bigram[sentences]:
        print(utils.to_utf8(u' '.join(s)))
    print(bigram)
    logger.info("finished training phrases")

    # for s in bigram[sentences]:
    #     print(utils.to_utf8(u' '.join(s)))

    logger.info("finished running %s", " ".join(sys.argv))
5 changes: 3 additions & 2 deletions gensim/utils.py
@@ -1121,8 +1121,9 @@ def prune_vocab(vocab, min_reduce, trim_rule=None):
        if not keep_vocab_item(w, vocab[w], min_reduce, trim_rule):  # vocab[w] <= min_reduce:
            result += vocab[w]
            del vocab[w]
    logger.info("pruned out %i tokens with count <=%i (before %i, after %i)",
                old_len - len(vocab), min_reduce, old_len, len(vocab))
    logger.info(
        "pruned out %i tokens with count <=%i (before %i, after %i)",
        old_len - len(vocab), min_reduce, old_len, len(vocab))
    return result


5 changes: 4 additions & 1 deletion setup.py
@@ -249,7 +249,10 @@ def finalize_options(self):
                include_dirs=[model_dir]),
            Extension('gensim.models.doc2vec_inner',
                sources=['./gensim/models/doc2vec_inner.c'],
                include_dirs=[model_dir])
                include_dirs=[model_dir]),
            Extension('gensim.models.fast_counter_cython',
                sources=['./gensim/models/fast_counter_cython.c'],
                include_dirs=[model_dir]),
        ],
        cmdclass=cmdclass,
        packages=find_packages(),