Make token2id mapping reproducible #1715
Conversation
Hi @formi23, thanks for the PR, but I don't think this is a bug. WDYT @piskvorky @gojomo?
Hi @menshikh-iv,
@formi23 but it does not matter how the words are numbered (for any algorithm); all differences will be in the final indexes. If you have two otherwise identical models with differently-numbered dictionaries, the models will produce exactly the same results (up to the numbering of words). Also, if you save your model in Python 2 and load it in Python 3 (or vice versa), you'll get exactly the same indexes in the dictionary.
The cross-Python reproducibility is not critical, but nice to have. What price do we pay for it, @formi23? What is the impact of this PR on performance / memory?
Since memory and performance are affected by dataset size and Python version (e.g. the implementation of …)
I just want to clarify that the intent of this PR is not strictly to introduce cross-version reproducibility, but to introduce determinism. @menshikh-iv, the difference is between algorithmically guaranteeing the same results and depending on circumstance.
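To make the nondeterminism concrete, here is an illustrative snippet (not from the PR itself): under CPython 3.3–3.5, where hash randomisation is on by default, dict and set iteration order can change between interpreter invocations.

```python
# Illustration only (not part of the PR). Run this script twice under
# Python 3.5: the printed order may differ between runs, because string
# hashing is seeded randomly per process (PYTHONHASHSEED).
words = ["human", "interface", "computer", "survey", "user"]
freqs = {w: 1 for w in words}
print(list(freqs))  # iteration order depends on the per-run hash seed
```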
@formi23 nice description 💣! Can you make a small benchmark for dictionary construction/filtering/`doc2bow`?
@menshikh-iv do you have an example of such benchmarks?
gensim/corpora/dictionary.py (outdated)

```diff
@@ -149,6 +149,7 @@ def doc2bow(self, document, allow_update=False, return_missing=False):
         token2id = self.token2id
         if allow_update or return_missing:
             missing = {w: freq for w, freq in iteritems(counter) if w not in token2id}
+            missing = OrderedDict(sorted(missing.items(), key=lambda x: (x[1], x[0])))
```
Is it really necessary to sort the items explicitly (everywhere)? If yes, please use `iteritems` instead of `.items()`.
This is the essential step, because it introduces determinism in the assignment of ids and therefore consistency in the word-id mapping between executions (for a given dataset). However, the `OrderedDict` can be omitted; see my comment regarding benchmarks for the implications of using only `sorted` and `dict`.
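A schematic of the mechanism being discussed (illustrative, not gensim's actual implementation): ids are handed out in whatever order the missing tokens are iterated, so sorting them first pins the mapping down.

```python
# Illustrative sketch, not gensim's actual code.
token2id = {}
missing = {"user": 2, "computer": 1, "time": 2}  # plain dict: iteration order
                                                 # may vary between runs
for w in sorted(missing):  # sorting fixes the order, hence the ids
    token2id[w] = len(token2id)
# token2id is now {'computer': 0, 'time': 1, 'user': 2} on every run
```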
@formi23 no, but it's very simple: measure the time/memory needed to fill, filter, and apply the `Dictionary` with the old and new code versions, and post the results here.
I'm familiar with benchmarking; I just don't see the value in comparing the performance before and after the changes on an arbitrary dataset, since the algorithmic-complexity implications of the changes are clear from the implementation. If you insist: the following scatter plot shows the execution time for three configurations under Python 3.5, for four methods of `Dictionary` (labelled `i`, `f`, `c`, and `d` in the plot).

[scatter plot of execution times per method and configuration; image not preserved]

So, for example, the `OrderedDict` variant roughly doubles memory use compared to the plain `dict`.
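For reference, a minimal version of the fill/filter/apply timing loop @menshikh-iv asked for might look like this (a sketch with a placeholder synthetic corpus, not the configuration actually benchmarked above):

```python
import time
from gensim.corpora import Dictionary

# Synthetic corpus: 5,000 twenty-token documents over 1,000 distinct tokens.
docs = [["token%d" % (i % 1000) for i in range(j, j + 20)] for j in range(5000)]

start = time.perf_counter()
d = Dictionary(docs)                      # fill
d.filter_extremes(no_below=2)             # filter
bows = [d.doc2bow(doc) for doc in docs]   # apply
print("elapsed: %.3fs" % (time.perf_counter() - start))
```

Running this under each code version (and each Python version) gives the per-method timings; memory can be compared with a profiler of choice.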
Thanks for the clarification @formi23. @menshikh-iv is still struggling with English; please excuse him if he sounded rude or bossy, it wasn't intentional. Doubling the memory sounds like a really high price for a nice-to-have feature. Where exactly did you hit the non-reproducibility problem yourself, and how severe was it? Is there a more frugal way to achieve the same effect?
@piskvorky, the memory concerns are understandable. At the same time, I consider full reproducibility essential for the nature of the work I do. If the memory cost of `OrderedDict` is prohibitive, the `sorted`-only variant avoids it. It's possible that faster or more memory-efficient approaches exist to achieve consistent word-id mappings, but that's outside the scope of this PR. In principle, the problem is that ids are assigned to tokens in the order of dictionary iteration, which changes between executions on Python versions 3.3–3.6. I posted the results of LDA topic modeling here; you can see the difference between two executions with Python 3.5. This is a toy example, but the effect is more pronounced in the dataset where I first encountered the problem, which I unfortunately don't have permission to share.
@formi23 thanks for the detailed benchmark. Double memory is unacceptable for us, so please try to use only `sorted` (without `OrderedDict`).
Force-pushed from 7dabc31 to ba36a1b.
@menshikh-iv, I made the changes. Regarding …
Let's check again that this works as expected; LGTM from my side.
gensim/corpora/dictionary.py (outdated)

```diff
@@ -148,9 +148,9 @@ def doc2bow(self, document, allow_update=False, return_missing=False):
         token2id = self.token2id
         if allow_update or return_missing:
-            missing = {w: freq for w, freq in iteritems(counter) if w not in token2id}
+            missing = sorted((x for x in iteritems(counter) if x[0] not in token2id), key=lambda x: (x[1], x[0]))
```
You're sorting by `(freq, w)`; are you sure that's correct?
@menshikh-iv, that's a good point. Either key works and ensures reproducibility. One might assume that sorting integers is faster than sorting strings, but the tie-breaking on the second element for words with the same frequency seems to introduce overhead (see the figure, where `freq` indicates frequency-then-word sorting, `string` indicates word-then-frequency sorting, and the `i`, `f`, `c`, and `d` methods are as in the previous example).

Since the words in `token2id` are guaranteed to be unique while the frequencies are not (and will likely collide given the patterns of natural language), sorting by token then frequency is probably the better choice, though the difference will vary on a dataset-by-dataset basis. What's your preference?
IMO `(w, freq)` is better than `(freq, w)`, simply because the tokens are unique. For a bigger corpus the situation gets even worse for `(freq, w)`: frequencies obey Zipf's law, so there are more and more duplicate frequencies in the tail of the distribution.
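A small sketch of the two candidate keys (illustrative values, not from the PR): both orders are deterministic, but the `(freq, w)` key must fall back to comparing the word whenever frequencies tie, which Zipf's law makes common.

```python
counts = [("user", 2), ("computer", 1), ("time", 2)]  # (word, freq) pairs

# Frequency first: the tie at freq == 2 is broken by comparing the words.
by_freq_then_word = sorted(counts, key=lambda x: (x[1], x[0]))
# -> [('computer', 1), ('time', 2), ('user', 2)]

# Word first: words are unique, so no tie-breaking is ever needed.
by_word = sorted(counts, key=lambda x: x[0])
# -> [('computer', 1), ('time', 2), ('user', 2)]
```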
Made the changes.
I checked manually; the mapping now looks reproducible across different Python versions. Nice work @formi23 👍 🔥
gensim/corpora/dictionary.py (outdated)

```diff
@@ -148,7 +148,7 @@ def doc2bow(self, document, allow_update=False, return_missing=False):
         token2id = self.token2id
         if allow_update or return_missing:
-            missing = sorted((x for x in iteritems(counter) if x[0] not in token2id), key=lambda x: (x[1], x[0]))
+            missing = sorted((x for x in iteritems(counter) if x[0] not in token2id), key=lambda x: x[0])
```
`sort` works lexicographically by default (tuples compare element by element, so `(w, freq)` pairs with unique words order by the word anyway); no need for the `key` parameter at all.
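For instance, an illustrative check (not from the PR):

```python
pairs = [("user", 2), ("computer", 1), ("time", 2)]
# Default tuple comparison sorts by word first, then frequency; with unique
# words this is identical to keying on the word alone.
assert sorted(pairs) == sorted(pairs, key=lambda x: x[0])
```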
Congrats on your first contribution @formi23 🔥
Impressive indeed. Incubator or not, we'd like to keep you around @formi23 :) Do you have an appetite for another open source project with gensim?
Merged commit: Make token2id mapping reproducible in Dictionary (piskvorky#1715)

* Make token2id mapping reproducible
* Use iteritems instead of items
* Use sorted for deterministic token2id mapping
* Replace frequency-token sort with token sort
* Remove key from sort
@piskvorky Thanks, happy I got to contribute to the library I'm using. I'll keep an eye out for PR opportunities as they become apparent in my work.
This change modifies the attributes `token2id` and `dfs` of the `gensim.corpora.Dictionary` class to be `OrderedDict`s rather than `dict`s. This ensures a deterministic order of elements across consecutive executions, regardless of Python version.

Between Python versions 3.3 and 3.6, the order of iteration for sets and dictionaries is randomised before each execution by default (see link). In prior versions, hash randomisation is disabled by default (see link). As of 3.6, the order of set and dictionary iteration reflects the order in which elements were added, but this is an implementation detail not to be relied upon (see link). The changes in this PR resolve the issue by ensuring that the mapping behaviour of `gensim.corpora.Dictionary` is consistent across all modern versions of Python and between executions.

Note that unit testing this functionality is not possible because of the quirks of the various Python versions: the problem only manifests between executions, which undermines the reproducibility of experiments using `gensim.corpora.Dictionary`. A minimal working example that demonstrates the problem follows (compare the output between runs on a version of Python that uses hash seed randomisation, e.g. 3.5):