distributed out-of-core computation with chunks #22
Conversation
New parallelism mode: ParChunks. Only the inputs are read from chunks as of now; the model still needs to be in shared memory mode. So this can be used in cases where the inputs are too large to fit into the memory of a single machine, though the model can. The movielens example has been modified to demonstrate how this mode works. Logging has also been improved: `logmsg` is now a macro and also logs the pid and thread id.
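For illustration, here is a minimal sketch of what a macro along these lines can look like on recent Julia; the actual `logmsg` in this PR may format messages differently and may obtain the pid another way:

```julia
# Hedged sketch only; not the exact logmsg implementation in this PR.
using Distributed, Dates

macro logmsg(msg)
    # Prefix each message with a timestamp, the worker id and the thread id.
    quote
        println(stderr, Dates.now(), " [", myid(), ":", Threads.threadid(), "] ", $(esc(msg)))
    end
end

@logmsg "computing residuals"   # e.g. 2024-01-01T00:00:00 [1:1] computing residuals
```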
Some performance comparisons of the chunks mode against the shared memory parallelism mode, on two types of machines:
Data sets used:
All tests were done with:
The laptop's memory was maxed out with the smallest dataset in shared memory mode, and the RMSE validations could not complete. With chunks, however, it ran to completion, even with the larger datasets.
Will take up HDFS and Parquet interfaces as separate PRs.
@amitmurthy - Huge gap between pmap and
cc @simonbyrne I also think some of these notes should go into the README.
This implements a distributed out-of-core model for ALS.
The key component used here is a ChunkedFile. It is a file split into chunks, with metadata describing the range of keys held in each chunk. Chunks can be loaded independently. Loaded data is weakly referenced, so it gets garbage collected under memory pressure. Based on available memory, some references are kept in an LRU cache with a configurable limit.
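To make the idea concrete, here is a simplified sketch of such a structure; the type names, fields, and the `getchunk`/`loader` helpers below are illustrative only, not the actual definitions in this PR:

```julia
# Illustrative sketch, not the PR's ChunkedFile: each chunk records the key range
# it covers and a weak reference to its loaded data; a small LRU list keeps strong
# references to the most recently used chunks so everything else can be collected
# under memory pressure.
struct Chunk{K,V}
    path::String            # file backing this chunk
    keyrange::UnitRange{K}  # range of keys held in the chunk
    data::WeakRef           # WeakRef(nothing) when unloaded or collected
end

mutable struct ChunkedFile{K,V}
    chunks::Vector{Chunk{K,V}}
    lru::Vector{V}          # strong references to recently used chunk data
    lrusize::Int            # configurable cache limit
end

# Return the data of the chunk whose key range contains `key`, loading it with
# `loader(path)` (e.g. an mmap-backed read) if it is not currently in memory.
function getchunk(cf::ChunkedFile{K,V}, key, loader) where {K,V}
    idx = findfirst(c -> key in c.keyrange, cf.chunks)
    idx === nothing && error("no chunk holds key $key")
    c = cf.chunks[idx]
    data = c.data.value
    if data === nothing
        data = loader(c.path)::V
        cf.chunks[idx] = Chunk{K,V}(c.path, c.keyrange, WeakRef(data))
    end
    # Refresh the LRU: keep at most `lrusize` strong references alive.
    filter!(x -> x !== data, cf.lru)
    push!(cf.lru, data)
    length(cf.lru) > cf.lrusize && popfirst!(cf.lru)
    return data
end
```

In this sketch the cached values would be the memory mapped chunk data described next.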
The current implementation stores two types of data structures as chunks, both of which are memory mapped:

- Inputs (the sparse ratings matrix). `mmap_csc_save` and `mmap_csc_load` handle the file format (a rough sketch of such a format is shown below). Since input data is relatively constant, and it's impractical to do the same ETL process on large inputs every time, they should be precomputed and stored as chunks. A transpose of the ratings matrix is also pre-computed. The data must be clean: empty items and users removed, and user/item id mappings handled separately.
- Model matrices (`U` and `P`). These are created at run time. Chunk size is calculated automatically to fit one chunk into a 128MB file.

Concrete types for `Model` and `Inputs` for the distributed memory mode implement the abstractions introduced by #21:

- `DistModel` uses memory mapped dense matrix chunks.
- `DistInputs` uses memory mapped sparse matrix chunks.

I have tried this with the movielens and last.fm datasets. Both seem to be running correctly, with much less memory pressure. With this, I am able to complete a run on the last.fm dataset on my laptop, which otherwise runs out of memory. As expected, this mode is much slower than shared memory mode.
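Coming back to the chunk file format above, here is a rough sketch of one way a CSC sparse chunk can be written out and mapped back lazily. The helper names `csc_save`/`csc_load` are made up for the sketch; the actual on-disk format of `mmap_csc_save`/`mmap_csc_load` may differ.

```julia
# Rough sketch: write the three CSC arrays behind a small header, then map them
# back with Mmap so loading a chunk does not eagerly copy the data.
using Mmap, SparseArrays

function csc_save(path::AbstractString, S::SparseMatrixCSC{Float64,Int64})
    open(path, "w") do io
        # header: rows, cols, number of stored values
        write(io, Int64(S.m), Int64(S.n), Int64(length(S.nzval)))
        write(io, S.colptr); write(io, S.rowval); write(io, S.nzval)
    end
end

function csc_load(path::AbstractString)
    io = open(path, "r")
    m, n, nnzv = read(io, Int64), read(io, Int64), read(io, Int64)
    off = 3 * sizeof(Int64)
    colptr = Mmap.mmap(io, Vector{Int64}, (Int(n) + 1,), off)
    off += sizeof(Int64) * (Int(n) + 1)
    rowval = Mmap.mmap(io, Vector{Int64}, (Int(nnzv),), off)
    off += sizeof(Int64) * Int(nnzv)
    nzval  = Mmap.mmap(io, Vector{Float64}, (Int(nnzv),), off)
    close(io)  # the mappings stay valid after the stream is closed
    return SparseMatrixCSC(Int(m), Int(n), colptr, rowval, nzval)
end
```

Storing the ratings chunks (and their transpose) this way lets a worker map only the chunk it needs into memory instead of holding the whole dataset.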
Way forward: