Fast object counting + Phrases #1446
Changes from all commits: 392c672, 6a98f86, c601cbf, a4fdfdb, 24f5b63, f87722a, df84033
@@ -0,0 +1,150 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (C) 2017 Radim Rehurek <me@radimrehurek.com>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

"""
Fast & memory efficient counting of things (and n-grams of things).

This module is designed to count item frequencies over large, streamed corpora (lazy iteration).

Such counts are useful in various other modules, such as Dictionary, TfIdf, Phrases etc.

"""

import sys
import os
from collections import defaultdict
import logging

from gensim import utils
from gensim.models.fast_counter_cython import FastCounterCython, FastCounterPreshed

logger = logging.getLogger(__name__)


def iter_ngrams(document, ngrams):
    assert ngrams[0] <= ngrams[1]

    for n in range(ngrams[0], ngrams[1] + 1):
        for ngram in zip(*[document[i:] for i in range(n)]):
            logger.debug("yielding ngram %r", ngram)
            yield ngram


def iter_gram1(document):
    return iter_ngrams(document, (1, 1))


def iter_gram2(document):
    return iter_ngrams(document, (2, 2))


def iter_gram12(document):
    return iter_ngrams(document, (1, 2))


class FastCounter(object):
    """
    Fast counting of item frequencies across large, streamed iterables.
    """

    def __init__(self, doc2items=iter_gram1, max_size=None):
        self.doc2items = doc2items
        self.max_size = max_size
        self.min_reduce = 0
        self.hash2cnt = defaultdict(int)

    def hash(self, item):
        return hash(item)

    def update(self, documents):
        """
        Update the relevant ngram counters from the iterable `documents`.

        If the memory structures get too large, clip them (then the internal counts may be only approximate).
        """
        for document in documents:
            for item in self.doc2items(document):
                self.hash2cnt[self.hash(item)] += 1
Review comment: That this implementation ignores hash collisions is an important detail to document.
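For illustration, a minimal sketch using the FastCounter defined above (not code from this PR): because only hash(item) is stored, any two distinct items whose hashes collide share a single counter, so their reported frequencies are merged.

    counter = FastCounter().update([["spam", "eggs", "spam"]])
    print(counter.get(("spam",)))  # -> 2; any other item hashing to the same
                                   #    value would report this same merged count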
        self.prune_items()

        return self  # for easier chaining

    def prune_items(self):
        """Trim data structures to fit in memory, if too large."""
        # XXX: Or use a fixed-size data structure to start with (hyperloglog?)
        while self.max_size and len(self) > self.max_size:
            self.min_reduce += 1
            utils.prune_vocab(self.hash2cnt, self.min_reduce)
Review comment: This can potentially be very slow, as very …

Reply: What would this …

Reply: I'd want to see profiling that indicates this survey is a major time-cost before spending much time optimizing. If it needed to be optimized, it might be possible to retain a … With this existing algorithm, where at each prune everything under the count reached last time is automatically discarded, this hypothetical "most eligible" list could be discarded 1st – without even looking at the existing counts again. Maybe that'd be enough... though if keeping this 'escalating min_reduce' you'd still need to iterate over all the remaining keys. (It seems a lot like GC, with a 'tenured' set that might be able to skip some collections.) But as I've mentioned in previous discussions around Phrases: I suspect this 'escalating floor prune' is a somewhat undesirable algorithm, injecting more imprecision/bias into the final counts than necessary, and (maybe) losing a lot of keys that would survive a truly unbounded counting method. (TL;DR: because of typical frequency distributions plus the always-escalating floor, every prune may wind up eliminating 90%+ of all existing words, starving some words of the chance to ever survive a prune, or in some cases inefficiently leaving the final dict at a small fraction of the desired …)

Reply: It looks like we're on to optimizations and choosing algorithms now, so let me tick off the "API design" checkbox. API conclusion: mimic the Counter API, except add a parameter that allows restricting its maximum size, ideally in bytes. Make it clear approximations were necessary to achieve this, and what the trade-offs are (Counter methods we cannot support, perf-memory balance implications). @jayantj @menshikh-iv do you agree?

Reply: @piskvorky I agree with you.

Reply: Yes, though the best place for that is probably the docstring. Something like …
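As a minimal usage sketch of that agreed-upon Counter-like interface, reusing the FastCounter and iter_gram12 defined in this file (the max_size value here is arbitrary):

    documents = [["new", "york", "city"], ["new", "york", "times"]]
    counter = FastCounter(doc2items=iter_gram12, max_size=2000000).update(documents)
    print(counter.get(("new", "york")))  # -> 2 (bigram count)
    print(counter.get(("city",)))        # -> 1 (unigram count)
    # counts become approximate once max_size forces prune_items() to clip the table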

    def get(self, item, default=None):
"""Return the item frequency of `item` (or `default` if item not present).""" | ||
return self.hash2cnt.get(self.hash(item), default) | ||
|
||
def merge(self, other): | ||
""" | ||
Merge counts from another FastCounter into self, in-place. | ||
""" | ||
self.hash2cnt.update(other.hash2cnt) | ||
self.min_reduce = max(self.min_reduce, other.min_reduce) | ||
self.prune_items() | ||
|
||
def __len__(self): | ||
return len(self.hash2cnt) | ||
|
||
def __str__(self): | ||
return "%s<%i items>" % (self.__class__.__name__, len(self)) | ||
|
||
|
||
class Phrases(object): | ||
def __init__(self, min_count=5, threshold=10.0, max_vocab_size=40000000): | ||
self.threshold = threshold | ||
self.min_count = min_count | ||
self.max_vocab_size = max_vocab_size | ||
# self.counter = FastCounter(iter_gram12, max_size=max_vocab_size) | ||
self.counter = FastCounterCython() | ||
# self.counter = FastCounterPreshed() | ||
|
||
def add_documents(self, documents): | ||
self.counter.update(documents) | ||
|
||
return self # for easier chaining | ||
|
||
def export_phrases(self, document): | ||
""" | ||
Yield all collocations (pairs of adjacent closely related tokens) from the | ||
input `document`, as 2-tuples `(score, bigram)`. | ||
""" | ||
norm = 1.0 * len(self.counter) | ||
for bigram in iter_gram2(document): | ||
pa, pb, pab = self.counter.get((bigram[0],)), self.counter.get((bigram[1],)), self.counter.get(bigram, 0) | ||
if pa and pb: | ||
score = norm / pa / pb * (pab - self.min_count) | ||
if score > self.threshold: | ||
yield score, bigram | ||
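A rough worked example of the scoring used in export_phrases above, with purely illustrative numbers (not from any real corpus):

    # Hypothetical counts: norm = number of distinct keys in the counter,
    # pa/pb = unigram counts, pab = bigram count.
    norm, pa, pb, pab, min_count = 1000000.0, 1000, 2000, 805, 5
    score = norm / pa / pb * (pab - min_count)  # -> 400.0
    # 400.0 > threshold=100, so this bigram would be yielded as a phrase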

if __name__ == '__main__':
    logging.basicConfig(format='%(asctime)s : %(threadName)s : %(levelname)s : %(message)s', level=logging.INFO)
    logger.info("running %s", " ".join(sys.argv))

    # check and process cmdline input
    program = os.path.basename(sys.argv[0])
    if len(sys.argv) < 2:
        print(globals()['__doc__'] % locals())
        sys.exit(1)
    infile = sys.argv[1]

    from gensim.models.word2vec import Text8Corpus
    documents = Text8Corpus(infile)

    logger.info("training phrases")
    bigram = Phrases(min_count=5, threshold=100).add_documents(documents)
    logger.info("finished training phrases")
    print(bigram.counter)
    # for doc in documents:
    #     s = u' '.join(doc)
    #     for _, bigram in bigram.export_phrases(doc):
    #         s = s.replace(u' '.join(bigram), u'_'.join(bigram))
    #     print(utils.to_utf8(s))

    logger.info("finished running %s", " ".join(sys.argv))
@@ -0,0 +1,137 @@
#!/usr/bin/env cython
# cython: boundscheck=False
# cython: wraparound=False
# cython: cdivision=True
# coding: utf-8
#
# Copyright (C) 2017 Radim Rehurek <me@radimrehurek.com>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

from collections import defaultdict

from libc.stdint cimport int64_t, uint64_t

cimport preshed.counter


cdef uint64_t chash(obj):
    # TODO use something faster, can assume string
    return <uint64_t>hash(obj)


class FastCounterCython(object):
    """
    Fast counting of item frequencies across large, streamed iterables.
    """

    def __init__(self, doc2items=None, max_size=None):
        self.doc2items = doc2items
        self.max_size = max_size
        self.min_reduce = 0
        self.hash2cnt = defaultdict(int)

    def update(self, documents):
        """
        Update the relevant ngram counters from the iterable `documents`.

        If the memory structures get too large, clip them (then the internal counts may be only approximate).
        """
        cdef Py_ssize_t idx, l
        cdef uint64_t h1, h2
        hash2cnt = self.hash2cnt
        for document in documents:
            l = len(document)
            if l:
                h1 = chash(document[0])
                hash2cnt[h1] += 1
                for idx in range(1, l):
                    h2 = chash(document[idx])
                    hash2cnt[h2] += 1
                    hash2cnt[h1 + h2] += 1
Review comment: Simple addition to determine the hash of a bigram gives (A, B) and (B, A) the same hash value – problematic for Phrase-promotion.

Reply: Yeah, I had XOR here before, but then A ^ A == 0 == B ^ B, which is also not good. The hashing requires a little more thought, too hackish.

Reply: We can use …

Reply: That's not a native operation (instruction), so probably not a good idea. Google shows me this for hash_combine:

    size_t hash_combine(size_t lhs, size_t rhs) {
        lhs ^= rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2);
        return lhs;
    }

That looks fancy, but probably anything that breaks the symmetry would be enough, such as …

Reply: Here's a repo with a bunch of useful hashing algorithms & their comparison: https://github.com/rurban/smhasher

Reply: One more interesting resource on hashing (in the context of Bloom filters / min-sketch counting in #508): https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf
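A minimal sketch of that idea (an assumption, not part of this PR), written in plain Python with 64-bit masking to mimic the uint64_t arithmetic of the Cython code:

    # Hypothetical helper: order-sensitive combination of two 64-bit token
    # hashes, adapted from the hash_combine snippet quoted above.
    def combine_hashes(h1, h2):
        # mask to 64 bits to mimic uint64_t overflow semantics
        return (h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2))) & 0xFFFFFFFFFFFFFFFF

    # In the update loop, `hash2cnt[h1 + h2] += 1` would then become
    # `hash2cnt[combine_hashes(h1, h2)] += 1`, so (A, B) and (B, A) no longer
    # collide by construction.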
                    h1 = h2

        # FIXME: add optimized prune

        return self  # for easier chaining

    def prune_items(self):
        """Trim data structures to fit in memory, if too large."""
        # XXX: Or use a fixed-size data structure to start with (hyperloglog?)
        pass

    def get(self, item, default=None):
        """Return the item frequency of `item` (or `default` if item not present)."""
        return self.hash2cnt.get(chash(item), default)

    def merge(self, other):
        """
        Merge counts from another FastCounter into self, in-place.
        """
        self.hash2cnt.update(other.hash2cnt)
        self.min_reduce = max(self.min_reduce, other.min_reduce)
        self.prune_items()

    def __len__(self):
        return len(self.hash2cnt)

    def __str__(self):
        return "%s<%i items>" % (self.__class__.__name__, len(self))


class FastCounterPreshed(object):
    """
    Fast counting of item frequencies across large, streamed iterables.
    """

    def __init__(self, doc2items=None, max_size=None):
        self.doc2items = doc2items
        self.max_size = max_size
        self.min_reduce = 0
        self.hash2cnt = preshed.counter.PreshCounter()  # TODO replace by some GIL-free low-level struct

    def update(self, documents):
        """
        Update the relevant ngram counters from the iterable `documents`.

        If the memory structures get too large, clip them (then the internal counts may be only approximate).
        """
        cdef Py_ssize_t idx, l
        cdef uint64_t h1, h2
        cdef preshed.counter.PreshCounter hash2cnt = self.hash2cnt
        for document in documents:
            l = len(document)
            if l:
                h1 = chash(document[0])
                hash2cnt.inc(h1, 1)
                for idx in range(1, l):
                    h2 = chash(document[idx])
                    hash2cnt.inc(h2, 1)
                    hash2cnt.inc(h1 + h2, 1)
                    h1 = h2

        # FIXME: add optimized prune

        return self  # for easier chaining

    def prune_items(self):
        """Trim data structures to fit in memory, if too large."""
        # XXX: Or use a fixed-size data structure to start with (hyperloglog?)
        pass

    def get(self, item, default=None):
        """Return the item frequency of `item` (or `default` if item not present)."""
        return self.hash2cnt.get(chash(item), default)

    def merge(self, other):
        """
        Merge counts from another FastCounter into self, in-place.
        """
        self.hash2cnt.update(other.hash2cnt)
        self.min_reduce = max(self.min_reduce, other.min_reduce)
        self.prune_items()

    def __len__(self):
        return len(self.hash2cnt)

    def __str__(self):
        return "%s<%i items>" % (self.__class__.__name__, len(self))
Review comment: Maybe use mmh3 for this purpose?
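A minimal sketch of that suggestion (an assumption, not part of this PR), using the mmh3 package's 64-bit MurmurHash3 in place of the builtin hash() inside chash():

    import mmh3

    def chash_mmh3(token):
        # mmh3.hash64 returns a pair of signed 64-bit ints; mask the first one
        # to get an unsigned value comparable to the uint64_t used above.
        return mmh3.hash64(token)[0] & 0xFFFFFFFFFFFFFFFF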