Gensim Doc2Vec model Segmentation Faulting for Large Corpus #2894
Comments
To pin down the exact number above which it always segmentation-faults and below which it always works, I am trying some other experiments as well. For now, it works fine for 5M documents and gives the expected results.
Thanks for the effort to create a reproducible example with random data! But, what is the significance of …
I tried my best to reproduce the exact scenario that I had. The …
If it's a simple list of int document lengths, a file with one number per line should work as well. (And, if simply making every doc the same length works to reproduce, that'd be just as good.) This data, even with the …
I'll convert the pickle to a text file with one number per line.
Yes, it creates the segmentation fault.
Will check this as well.
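A minimal sketch of that conversion, assuming the pickle holds a flat list of integer document lengths (the file names here are hypothetical):

```python
import pickle

# Hypothetical file names; the pickle is assumed to contain a flat list of ints,
# one per document, as discussed above.
with open('doc_lengths.pkl', 'rb') as f:
    lengths = pickle.load(f)

with open('doc_lengths.txt', 'w') as f:
    for n in lengths:
        f.write(f'{n}\n')
```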
Another fairly quick test worth running: for a …
Updated the repository: the pickle has been converted to a text file with one number per line.
Thanks! These are word counts for the docs, right? I see that of 10,000,000 docs, about 781K are over 10,000 words. While this should be accepted OK by gensim (and certainly shouldn't cause a crash), just FYI: there is an internal implementation limit where words past the 10,000th of a text are silently ignored. In order for longer docs to be considered by …
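The sentence above is cut off, but a common workaround for that 10,000-token limit (offered here as an assumption, not necessarily what the comment went on to suggest) is to split over-long documents into chunks that share a single tag, so the tag's vector is still trained on all of the document's words:

```python
from gensim.models.doc2vec import TaggedDocument

MAX_TOKENS = 10000  # gensim's per-text limit mentioned above

def chunked_tagged_docs(tokens, tag):
    """Yield TaggedDocuments of at most MAX_TOKENS words, all sharing one tag."""
    for start in range(0, len(tokens), MAX_TOKENS):
        yield TaggedDocument(words=tokens[start:start + MAX_TOKENS], tags=[tag])
```

Note this only applies to the iterable-of-TaggedDocument training path; the corpus_file path assigns tags implicitly by line number.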
Thanks for letting me know about some of the internal details of …
It shouldn't cause a crash; it will mean only the 1st 10K tokens of those docs will be used for training. (It might be a factor involved in the crash, I'm not sure.)
When you run this code, how big is the resulting corpus file? When you report that a 5M-line variant works, is that with half your real data, or half the synthetic data?
Also, note: if you can … With such a quick-reproduce recipe, we would then want to try:

(1) the non-corpus_file path: instead of …, try … (a hedged sketch of the two paths follows below). If this starts training without the same instant fault in the …;

(2) getting a core-dump of the crash, & opening it in gdb.
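The inline snippets in the comment above were lost in extraction; as a hedged illustration of the two code paths presumably being contrasted (file name and parameters are hypothetical):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# corpus_file path (the optimized path used in this report):
model = Doc2Vec(corpus_file='corpus.txt', vector_size=300, workers=8)

# non-corpus_file path: stream TaggedDocument objects from the same file instead.
class ReadCorpus:
    """Restartable iterable over a LineSentence-style file, yielding TaggedDocuments."""
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path, encoding='utf-8') as f:
            for i, line in enumerate(f):
                yield TaggedDocument(words=line.split(), tags=[i])

model = Doc2Vec(documents=ReadCorpus('corpus.txt'), vector_size=300, workers=8)
```

A plain generator would not be enough here, because gensim iterates over `documents` more than once (vocabulary scan plus each training epoch), hence the small restartable-iterable class.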
It's 315GB in size.
It is half of my real data. The exact size of the corpus file is 157GB.
Thanks for that! I am currently working on finding the exact number above which we always get a segmentation fault and below which training always succeeds. I am pretty close to the number (using binary search). For other options …
I have finally found the exact number above which the doc2vec model always gives a segmentation fault, and at or below which it always starts training (although I did not let the training process complete). 7158293 is the exact number at and below which the doc2vec model starts training successfully, whereas increasing the number by even one gives the segmentation fault. I used the synthetic dataset, with documents of length 100 tokens only, to speed up the process.
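A hedged sketch of how such a synthetic corpus_file could be generated (file names and the placeholder token are hypothetical; the reporter's actual generator script lives in the linked repository):

```python
def write_synthetic_corpus(path, n_docs, tokens_per_doc=100):
    """Write a LineSentence-format corpus_file: one document per line,
    each consisting of `tokens_per_doc` whitespace-separated tokens."""
    line = ' '.join(['token'] * tokens_per_doc) + '\n'
    with open(path, 'w') as f:
        for _ in range(n_docs):
            f.write(line)

write_synthetic_corpus('synthetic_ok.txt', 7158293)   # reported to train
write_synthetic_corpus('synthetic_bad.txt', 7158294)  # reported to segfault
```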
Can you post the full log (at least INFO level) from that run?
Good to hear of your progress, & that's a major clue, as 7158293 * 300 dimensions = 2,147,487,900, suspiciously close to 2^31 (2,147,483,648). That's strongly suggestive that the problem is some misuse of a signed 32-bit int where a wider int type should be used, and indexing overflow is causing the crash. Have you been able to verify my theory that training would get past that quick crash if the …? (Another too-narrow-type problem, though one that only caused missed training & not a segfault, is #2679.) All the Cython code should get a scan for potential use of signed/unsigned 32-bit ints where 64 bits would be required for the very-large, >2GB/4GB array-indexing that's increasingly common in *2Vec models.
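The arithmetic behind that suspicion, as a quick check:

```python
# Crash threshold from the previous comment, times the vector dimensionality:
print(7158293 * 300)            # 2147487900
# Signed 32-bit overflow boundary:
print(2 ** 31)                  # 2147483648
# The two differ by only a few thousand array slots:
print(7158293 * 300 - 2 ** 31)  # 4252
```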
Thanks. I wanted to eyeball the log in order to spot any suspicious numbers (signs of overflow), but @gojomo's observation above is already a good smoking gun. If you want this resolved quickly, the best option might be to check for potential int32-vs-int64 variable problems yourself. It shouldn't be too hard; the file is here: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/doc2vec_corpusfile.pyx (look for …)
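As a hypothetical illustration of the failure mode being described (this is not the actual gensim code): when a document/vector offset is computed in a signed 32-bit integer, the product wraps to a negative value, and indexing an array with it can fault.

```python
vector_size = 300
doc_index = 7158294  # first document count reported to crash

# The intended offset, computed with Python's arbitrary-precision (or a 64-bit) int:
offset_64 = doc_index * vector_size                            # 2147488200

# The same product squeezed into a signed 32-bit int (wraparound simulated here):
offset_32 = (doc_index * vector_size + 2**31) % 2**32 - 2**31  # -2147479096

print(offset_64, offset_32)
```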
Should I make a PR after updating the code?
Of course :)
I have failed the Circle CI checks; could you take a look at my PR here and let me know what I'm doing wrong? I wasn't able to install the tests and wasn't able to run the …
Using …
But note: …
If you're able to do the …
The textual output of the gdb command …
The full logging file for …
Just updated my comment above.
When will this issue be resolved in gensim?
Re: the backtrace(s): I think the one thread backtrace you've highlighted should be the one where the segfault occurred, though I believe I've occasionally seen cases where the 'current thread' in a core is something else. (And often, the thread/frame that "steps on" some misguided data isn't the one that caused the problem, via some more subtle error arbitrarily earlier.) Having symbols & line-numbers in the trace would make it more useful, but I'm not sure what (probably minor) steps you'd have to take to get those. (It might be enabled via installing some gdb extra, or using a ….)

However, looking at just the filenames, it seems the segfault actually occurs inside ….

Re: when fixed? You've provided a clear map to reproducing & what's likely involved (a signed 32-bit int overflow), and I suspect the recipe to trigger can be made even smaller/faster. (EG: instead of 7.2M 300D vectors w/ 100-word training docs, 400K 6000D vectors w/ 1-word training docs is likely to trigger the same overflow - so no 300GB+ test file or long slow vocab-scan necessary.) That will make it easier for myself or others to investigate further. But I'm not sure when there'll be time for that, or whether it will succeed in finding a fix, or when an official release with the fix will happen. In the meantime, workarounds could include: (1) using the non-corpus_file path, …
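Not mentioned in the thread, but another low-effort way to get some location information for a crash like this is Python's standard-library faulthandler module, which dumps the Python-level traceback of each thread when a fatal signal such as SIGSEGV arrives (it won't show the Cython line, but it confirms the Python call path). A sketch, with a hypothetical corpus file name:

```python
import faulthandler
faulthandler.enable()  # dump Python tracebacks of all threads on SIGSEGV, SIGABRT, etc.

from gensim.models.doc2vec import Doc2Vec

# Hypothetical reproduction call, following the recipe discussed above.
model = Doc2Vec(corpus_file='synthetic_bad.txt', vector_size=300, min_count=1)
```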
I wrote:

> (EG: instead of 7.2M 300D vectors w/ 100-word training docs, 400K 6000D vectors w/ 1-word training docs is likely to trigger the same overflow - so no 300GB+ test file or long slow vocab-scan necessary.)
Confirmed that with a test file created by...

```python
with open('400klines', 'w') as f:
    for _ in range(400000):
        f.write('a\n')
```

...the following is enough to trigger a fault...

```python
model = Doc2Vec(corpus_file='400klines', min_count=1, vector_size=6000)
```

Further, it may be sufficient to change the 3 lines in …

...to...

…

That avoids the crash in the tiny test case above.
Problem description
What are you trying to achieve?
I was trying to train a doc2vec model on a corpus of 10M (10 million) documents from my dataset, each roughly ~5,000 words long on average. The idea was to generate a semantic search index over these documents using the doc2vec model.
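A minimal sketch of the kind of setup described (the file name and most parameters here are hypothetical; 300-dimensional vectors are assumed based on the dimensionality discussed in the comments above):

```python
from gensim.models.doc2vec import Doc2Vec

# corpus_10M.txt: hypothetical LineSentence-format file, one whitespace-tokenized
# document per line, ~10M lines averaging ~5,000 words each.
model = Doc2Vec(
    corpus_file='corpus_10M.txt',
    vector_size=300,
    min_count=5,
    workers=16,
    epochs=10,
)
model.save('doc2vec_10M.model')
```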
What is the expected result?
I was expecting it to complete successfully, as I had tested it on a smaller dataset. On a smaller dataset of 100K documents, it worked fine and I was able to do basic benchmarking of the search index, which passed the criteria.
What are you seeing instead?
When I started training on the 10M dataset, after building the vocabulary, the training of the doc2vec model stopped and resulted in a segmentation fault.
Steps/code/corpus to reproduce
Include full tracebacks, logs, and datasets if necessary. Please keep the examples minimal ("minimal reproducible example").
Here is the link to the example to reproduce it. It uses the following libraries, in addition to those mentioned below (unfortunately, I could not make a virtual env due to some issues):
RandomWords
Attached is the logging file.
logging_progress.log
Versions
Please provide the output of:
Here is the Google Groups thread for a detailed discussion.