Parallel scan_vocab #406
Conversation
…ded parallel testing to test_word2vec.py
…port non-indexable types.
Apparently doc2vec had its own scan_vocab method but no build_vocab method, which made it use word2vec's scan_vocab. I guess this wasn't the intended behavior, so I copied the build_vocab method over to doc2vec, making it use its own scan_vocab.
Apparently the test passes most of the time, but occasionally it fails. I'm currently investigating a possible race condition that prevents the vocabulary from fully building. Also, it seems Counter support in Python 2.6 is somehow different? That is not what the docs seemed to suggest, but I'll also investigate that.
…ar() to wipe some unprocessed data.
The intermittently failing test has been fixed; everything now passes 100% of the time on 2.7, 3.3 and 3.4! The only remaining problem is Counter support in Python 2.6, which I just realized doesn't exist. I can either revert to more complex logic with a standard dict, backport Counter into gensim itself, or consider dropping support for 2.6. I don't know how much this compatibility is needed...
""" | ||
Build vocabulary from a sequence of sentences (can be a once-only generator stream). | ||
Each sentence must be a list of unicode strings. | ||
|
||
""" | ||
self.scan_vocab(sentences) # initial survey | ||
if isinstance(sentences, tuple): |
What is this for? Is it necessary?
With the new logic in place, if sentences is a generator (which it shouldn't be), it won't get detected and the code will spin indefinitely. This is the best way I found to prevent that while making sure the code still passes the unit test for the generator case.
If someone has a better way to deal with that, I'll be more than happy to apply it!
P.S. I also tried inspect.isgeneratorobject, but that will trigger an exception if sentences is non-indexable, just like in the unit test case. I guess I could catch and ignore it, but I thought this was cleaner...
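For reference, one alternative to the isinstance(sentences, tuple) check (not what this PR does, just a sketch with a hypothetical helper name) would be to test for generators directly with the standard-library types module, which never touches the object and so cannot raise on non-indexable inputs:

import types

def looks_restartable(sentences):
    # A plain generator can be consumed only once, so a vocabulary scan
    # followed by training would find no data on the second pass.
    # isinstance() only inspects the type, so non-indexable iterables
    # (like the one in the unit test) are handled without exceptions.
    return not isinstance(sentences, types.GeneratorType)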
Thanks @fortiema! The changes around doc2vec will need @gojomo's sharp-eyed review. Regarding Counter: I'm +1 on using the plain dict.
Sure thing for Counter; I'll revert to dict as soon as I have some time tomorrow morning.
What sort of a speedup are you seeing, after these changes, on a large-corpus scan?
@gojomo This is also what I'm wondering; I didn't have a big enough corpus to benchmark it properly today. I plan to run it on a 1B+ corpus tomorrow and get some real numbers, and I'll be sure to share them here.
The 2.6 build is also passing now. Will post training results later today.
@gojomo Right now, there doesn't seem to be any significant speedup. I also tried using multiprocessing. I need to think more about this and see why it doesn't scale. Any idea why?
In the threading scenario, the GIL is probably the main issue. I don't see any big blocks of work where the GIL would be released, and thus multi-threading possible. (Threads might offer some benefit if the 'upstream' source of the token-lists were the result of some costly parsing process that for some stages could release the GIL, or maybe if the text was coming from multiple laggy IO sources, so one thread could proceed while another was awaiting input. But there the multithreading would have to be elsewhere, supporting the iterator.)

In the multiprocessing case, there's bigger communication overhead moving the data read (in the 'master' process) to workers, then back again. That serialize/deserialize 'out' (then serialize/deserialize of the totals back 'in') may be outweighing any benefit from having multiple cores tallying into separate dicts at once.

If the data source were easily/equally split among N files (or file-ranges), then spawning N independent processes to scan each part, then report back the totals, might get a noticeable speedup. It'd eliminate the larger 'out' side of the serialize/deserialize mentioned above, and also presumably move any other decompress/tokenization from the one (bottlenecked) master process to N processes (independent until final tally reporting). (Or, it might just reveal the source disk to be the next/real bottleneck.)

Still, that parallelization would require bigger assumptions about the real format of data sources, compared to the current API of "just feed us an iterator [of token lists]". So my hunch is it'd be most appropriate for such format-specific parallelism to live outside the Word2Vec class, as one or more utility classes that are specialized for the data's raw format(s): compressed/uncompressed, one-file-or-many, etc. There could be support in Word2Vec for receiving (and merging) the summary-tallies of these external utilities.

What's your usual raw corpus format and source disk(s)?
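To make that concrete, here is a minimal sketch (not part of this PR) of the split-into-N-files idea: each shard is scanned in its own process and only the final tallies travel back to the master. The shard paths and whitespace tokenization are placeholder assumptions.

from collections import Counter
from multiprocessing import Pool

def count_shard(path):
    # Tally one shard independently; only the resulting Counter is
    # serialized back to the master process.
    counts = Counter()
    with open(path) as fin:
        for line in fin:
            counts.update(line.split())
    return counts

def scan_shards(paths, processes=4):
    pool = Pool(processes)
    try:
        total = Counter()
        for partial in pool.imap_unordered(count_shard, paths):
            total.update(partial)  # key-wise addition of per-shard counts
        return total
    finally:
        pool.close()
        pool.join()

# Hypothetical usage, assuming the corpus was pre-split into four shards:
# raw_vocab = scan_shards(["corpus.00", "corpus.01", "corpus.02", "corpus.03"])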
Thanks for the clarification about threads/processes and the GIL; parallel Python is new for me, so I'm still trying to get a good understanding of how everything works!

My raw corpora are single files, tens of GB in size, sitting on normal HDDs. I agree with you on dealing with that problem outside word2vec (or even gensim), as it doesn't make much sense to implement that kind of input-splitting logic in here.

I am also working with Hadoop, and am starting to consider using that instead to build the vocabulary, then feed it back to Word2Vec by putting together a utility class.
Ha, finally a practical use case for Hadoop's WordCount example? :)

And yes, threading won't help because of the GIL (as discussed in the original issue #400).

A good first step could be profiling the current (serial) code, to see where the bottlenecks are. There's a line profiler package that can help with that. It's quite possible that for such trivial tasks, (de)serialization is too slow. Then the solution would be to go compiled (Cython), which sounds pretty straightforward too. That may be even easier (and faster) than parallelization.
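For instance, a coarse serial profile of the vocabulary scan can be taken with the standard-library cProfile (a simpler alternative to a line profiler); the corpus path below is a placeholder and the snippet assumes a gensim version where build_vocab is a separate step:

import cProfile
import pstats

from gensim.models import word2vec

sentences = word2vec.LineSentence("corpus.txt")  # placeholder corpus file
model = word2vec.Word2Vec()  # empty model, no training yet

# Profile only the vocabulary scan and show the most expensive calls.
cProfile.run("model.build_vocab(sentences)", "scan_vocab.prof")
pstats.Stats("scan_vocab.prof").sort_stats("cumulative").print_stats(15)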
@piskvorky Thanks for the pointers, I did manage to profile scan_vocab. Here's the result for LeeCorpus:
I also ran it on one of my small 5 MB gzip corpora for the sake of comparison:
Is my understanding correct in saying the bottleneck is coming from I/O (the enumerate over sentences)? The other thing is of course the wordcount for-loop itself; is this something that is easy to cythonize?
@gojomo What about ...?
Yes, my interpretation of that profiling data is that most of the time is spent in the iteration over the corpus itself.

Even if the for-loop (looking up words and incrementing their counts) took zero time, the speedup would be only 30%. There's probably some potential win there from the compactness and lesser indirection of C/Cython, but I can't guess how much. (If going down that road, you'd likely want to do the whole process, starting with the IO/encoding/splitting, in C/Cython, so there's never Python string/list/object overhead in the mix. The needed raw result from the scan is so simple, a list of words and their frequencies, that it could be a separate tool.)

If just generally wanting a 10s-of-GB scan to go faster, my main thoughts would be: (1) SSD; (2) split to N (= number of cores) parts, ideally spread over different disks, run all parts in parallel, merge final results.
@gojomo You are right about the iteration being the main cost. I will investigate further and try to come up with a simple and clean way to speed up vocab building. As for the wordcount loop optimization, even a 20-30% speedup on a multi-day training time really doesn't sound too bad.

I'll be doing some testing on my server SSDs later, to see how much performance gain better IO can provide. Not so straightforward after all :-p
Yes, the loop is fairly easy to cythonize (unless you're on Windows, where compilation is generally weird). It's essentially the exact same Python code, just compiled to C auto-magically using Cython. I mean, if you've never used Cython before, there's some learning curve in "what file goes where, how do I write it, how do I load it", but it's really simple in this case. And even if the Cython code is without all the robustness bells and whistles (= not ready to merge), it'd be an interesting comparison point against the pure Python version.

Since the bottleneck seems to be file iteration itself, it may also be worth cythonizing your iterator (not just the word count for-loop discussed here).

The lower bound here is obviously your disk speed: if your disk can read 50MB/s, and you have 5GB of disk data, then 100 seconds is the best you can do, no matter the language :)
Okay, I just took quick measurements with basic cython code (no crazy optimization):
Not the most rigorous test there is, but we can see a ~10% speedup with both corpora.
Thanks a lot for investigating! Yeah, 10% doesn't sound worth the extra complexity. Can you share your code so we can check as well?
I guess I could open up a new PR since I started from scratch on 'develop', but let me just paste the code here verbatim.

Added to word2vec_inner.pyx:

def scan_vocab(model, sentences):
    logger.info("using cythonized wordcount")
    logger.info("collecting all words and their counts")
    cdef int sentence_no = -1
    cdef int total_words = 0
    cdef int min_reduce = 1
    vocab = dict()
    for sentence_no, sentence in enumerate(sentences):
        if sentence_no % 10000 == 0:
            logger.info("PROGRESS: at sentence #%i, encountered %i unique words", sentence_no, len(vocab))
        for word in sentence:
            if word not in vocab:
                vocab[word] = ONE  # ONE: int constant (1) defined elsewhere in word2vec_inner.pyx
            else:
                vocab[word] += ONE
        if model.max_vocab_size and len(vocab) > model.max_vocab_size:
            total_words += utils.prune_vocab(vocab, min_reduce)
            min_reduce += 1
    logger.info("PROGRESS: at sentence #%i, encountered %i unique words", sentence_no, len(vocab))
    model.corpus_count = sentence_no + 1
    model.raw_vocab = vocab

Beginning of word2vec.py now looks like this:

try:
    from gensim.models.word2vec_inner import scan_vocab, train_sentence_sg, train_sentence_cbow, FAST_VERSION
except ImportError:
    # failed... fall back to plain numpy (20-80x slower training than the above)
    FAST_VERSION = -1

    def scan_vocab(model, sentences, progress_per=10000):
        """Do an initial scan of all words appearing in sentences."""
        if FAST_VERSION < 0:
            import warnings
            warnings.warn("C extension not loaded for Word2Vec, wordcount will be slow. "
                          "Install a C compiler and reinstall gensim for fast wordcount.")
            print("PYTHON VERSION")  # debug output
        logger.info("collecting all words and their counts")
        sentence_no = -1
        total_words = 0
        min_reduce = 1
        vocab = defaultdict(int)
        for sentence_no, sentence in enumerate(sentences):
            if sentence_no % progress_per == 0:
                logger.info("PROGRESS: at sentence #%i, processed %i words, keeping %i word types",
                            sentence_no, sum(itervalues(vocab)) + total_words, len(vocab))
            for word in sentence:
                vocab[word] += 1
            if model.max_vocab_size and len(vocab) > model.max_vocab_size:
                total_words += utils.prune_vocab(vocab, min_reduce)
                min_reduce += 1
        total_words += sum(itervalues(vocab))
        logger.info("collected %i word types from a corpus of %i raw words and %i sentences",
                    len(vocab), total_words, sentence_no + 1)
        model.corpus_count = sentence_no + 1
        model.raw_vocab = vocab
...
Thanks again @fortiema. How large (in MB) are these three testing corpora? I'll try to profile too when I get some time, because Python's IO is actually pretty fast, so with cythonization added, I'd expect the performance to be closer to HW limits.

Assuming your "1M sentences" is 100 MB of data, and it takes you 11s to process, that's 9MB/s. For an SSD, I think that's too slow. Even for a spinning disk this is slow, considering we're reading the data off disk sequentially = the most predictable pattern imaginable.

What iterator do you use to iterate over the sentences? Is it LineSentence? Can you confirm that just iterating over the sentences (no scanning, no training) takes this long?
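One way to check that last point is to time the bare iteration with no counting at all; a rough sketch (the file path is a placeholder):

import time

from gensim.models.word2vec import LineSentence

start = time.time()
n_sentences, n_words = 0, 0
for sentence in LineSentence("corpus.txt"):  # placeholder corpus file
    n_sentences += 1
    n_words += len(sentence)
elapsed = time.time() - start

# If this alone is nearly as slow as scan_vocab, the bottleneck is
# IO/decompression/tokenization rather than the counting loop.
print("%i sentences, %i words in %.1fs" % (n_sentences, n_words, elapsed))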
I have actually started playing with the dask library for another project, and am starting to think it would be a good fit to solve this particular problem here. I'm going to put up a quick prototype as soon as I have some spare time this week. Any prior objections to integrating this into gensim if it were to work?
Not at all, I've had my eyes on dask / blaze for a while. The "streaming" abstraction in gensim is very powerful, but the guys at Continuum are doing lots of great work, so why not tap into it. Similarly with a Spark bridge.
@fortiema, see my code snippet in #400: I think the big problem is actually that the counts are stored in a Python dictionary, which gets slow when it gets very large.

I'm working on proper benchmarks, over large jobs. So far my process is at 3.5bn words counted after 30 minutes. I've only just written this, so there could be bugs. But this matches my experience with doing large word counts in the past.

I'd also suggest that instead of line profiling, you could just comment out the access to the vocab, and run the test on a large job. Since we're doing something small billions of times here, and the hash table is growing very large, I tend not to trust the profiler so much. At least, not when there's an easy way to collect another data point.
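The "comment out the vocab access" experiment boils down to timing two loops that differ only in whether the growing dictionary is touched; a sketch, assuming sentences is a restartable iterable (e.g. LineSentence) and each variant is timed on its own pass:

# Variant A: the real thing, counting every word into a growing dict.
vocab = {}
for sentence in sentences:
    for word in sentence:
        vocab[word] = vocab.get(word, 0) + 1

# Variant B: identical iteration, but no hash-table access at all.
total = 0
for sentence in sentences:
    for word in sentence:
        total += 1

# The wall-clock difference between A and B on a large corpus isolates the
# cost of dictionary updates from the cost of iteration/tokenization itself.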
Ping @fortiema, what is the status of this PR? Will you finish it soon?
Connected with #1446
Related to Issue #400
In Word2Vec.build_vocab, scan_vocab now supports multi-threaded workers, in the same fashion as train.
The compromise is that I had to remove prune_vocab, as its logic is much harder to parallelize. I think this feature is more important, but I guess that is highly debatable.
Each worker's temporary vocab is a Counter instead of a dict, which eases merging of word counts (see the sketch after this description).
Also modified tests to add multiple workers and different batch sizes.
Currently passing all unit tests, but so far only tested on Linux as other OSes are not easily accessible to me. Maybe someone can do this quicker than me? That would be appreciated.
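As a small illustration of why Counter makes the merge step trivial (the worker tallies below are made up):

from collections import Counter

# Hypothetical per-worker tallies, each built from a separate batch of sentences.
worker_counts = [
    Counter({"the": 10, "cat": 2}),
    Counter({"the": 7, "dog": 3}),
]

merged = Counter()
for counts in worker_counts:
    merged.update(counts)  # adds counts key-wise; a plain dict would need an explicit loop

# merged is now Counter({'the': 17, 'dog': 3, 'cat': 2})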