
distributed out-of-core computation with chunks #22

Merged: 2 commits into abhijithch:master on Feb 4, 2016

Conversation

tanmaykm

This implements a distributed out-of-core model for ALS.

The key component used here is a ChunkedFile: a file split into chunks, with metadata describing the range of keys held in each chunk. Chunks can be loaded independently. Loaded data is weakly referenced, so it can be garbage collected under memory pressure. In addition, depending on available memory, strong references to some chunks are kept in an LRU cache with a configurable limit.
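As a rough illustration of that caching scheme, here is a minimal Julia sketch; all names and the loader hook are hypothetical, not the PR's actual API:

```julia
# Hypothetical sketch of the ChunkedFile caching scheme, not the PR's API.
mutable struct Chunk
    keyrange::UnitRange{Int}  # range of keys held in this chunk
    path::String              # backing file for the chunk
    ref::WeakRef              # weak reference: payload may be GC'd
end

mutable struct ChunkedFile
    chunks::Vector{Chunk}
    lru::Vector{Any}          # strong refs keep recent chunks alive
    lrusize::Int              # configurable cache limit
    loader::Function          # maps a chunk's path to its in-memory data
end

function getchunk(cf::ChunkedFile, key::Int)
    chunk = cf.chunks[findfirst(c -> key in c.keyrange, cf.chunks)]
    data = chunk.ref.value
    if data === nothing                # collected (or never loaded): reload
        data = cf.loader(chunk.path)   # e.g. an mmap-based load
        chunk.ref = WeakRef(data)
    end
    # refresh the LRU cache, keeping at most lrusize strong references
    filter!(d -> d !== data, cf.lru)
    pushfirst!(cf.lru, data)
    length(cf.lru) > cf.lrusize && pop!(cf.lru)
    return data
end
```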

The current implementation stores two types of data structures as chunks, both of which are memory-mapped:

  • SparseMatChunks: used to represent the input ratings; the methods mmap_csc_save and mmap_csc_load handle the file format. Since the input data is relatively constant and it is impractical to repeat the same ETL process on large inputs every time, the chunks should be precomputed and stored (see the sketch after this list). A transpose of the ratings matrix is also precomputed. The data must be clean, with empty items and users removed, and user/item id mappings must be handled separately.
  • DenseMatChunks: used to represent the model (U and P). These are created at run time, and the chunk size is calculated automatically so that one chunk fits into a 128 MB file.
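For the sparse chunks, the general shape of an mmap-able CSC file could look like the sketch below; this is a hypothetical layout, and the PR's mmap_csc_save/mmap_csc_load define the real format. For the dense chunks, the 128 MB budget implies roughly div(128 * 2^20, 8 * ncols) Float64 rows per chunk.

```julia
using Mmap, SparseArrays

# Hypothetical mmap-friendly CSC layout: a small header followed by the
# three arrays backing a SparseMatrixCSC, so that loading can mmap the
# arrays in place instead of reading them into RAM.
function csc_save(path::String, S::SparseMatrixCSC{Float64,Int64})
    open(path, "w") do io
        write(io, Int64(size(S, 1)), Int64(size(S, 2)), Int64(nnz(S)))
        write(io, S.colptr, S.rowval, S.nzval)
    end
end

function csc_mmap_load(path::String)
    io = open(path, "r")
    m, n, nz = read(io, Int64), read(io, Int64), read(io, Int64)
    pos = position(io)
    colptr = Mmap.mmap(io, Vector{Int64}, n + 1, pos); pos += 8 * (n + 1)
    rowval = Mmap.mmap(io, Vector{Int64}, nz, pos);    pos += 8 * nz
    nzval  = Mmap.mmap(io, Vector{Float64}, nz, pos)
    A = SparseMatrixCSC(m, n, colptr, rowval, nzval)
    close(io)   # the mappings remain valid after the stream is closed
    return A
end
```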

Concrete Model and Inputs types for the distributed-memory mode implement the abstractions introduced by #21:

  • DistModel uses memory-mapped dense matrix chunks
  • DistInputs uses memory-mapped sparse matrix chunks
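The #21 abstractions themselves are not part of this diff; purely as a hypothetical picture of how the concrete types slot in (the abstract type names and fields are assumptions, reusing the ChunkedFile sketch above):

```julia
# Hypothetical shape only; the actual abstractions come from #21.
abstract type Model end
abstract type Inputs end

struct DistModel <: Model
    U::ChunkedFile    # user factors, backed by DenseMatChunks
    P::ChunkedFile    # item factors, backed by DenseMatChunks
end

struct DistInputs <: Inputs
    R::ChunkedFile    # ratings, backed by SparseMatChunks
    Rt::ChunkedFile   # precomputed transpose of the ratings
end
```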

I have tried this with the MovieLens and last.fm datasets. Both run correctly, with much less memory pressure. With this, I can complete a run on the last.fm dataset on my laptop, which otherwise runs out of memory. As expected, though, this mode is much slower than the shared-memory mode.

Way forward:

  • test this with larger datasets
  • performance comparisons
  • use the Parquet format instead of a custom format
  • add support for working with HDFS

This adds a new parallelism mode: ParChunks. As of now, only the inputs are read from chunks; the model still needs to be in shared memory. This mode can therefore be used when the inputs are too large to fit into the memory of a single machine but the model is not.

The MovieLens example has been modified to demonstrate this mode.

Logging is also improved: `logmsg` is now a macro, and it also logs the pid and threadid.
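A minimal sketch of what such a macro could look like (hypothetical; the PR's actual `logmsg` may format things differently):

```julia
using Distributed

# Hypothetical pid/threadid-stamping log macro.
macro logmsg(msg)
    quote
        println("[p", myid(), " t", Threads.threadid(), "] ", $(esc(msg)))
    end
end

# @logmsg "loading chunk 3"    # prints e.g. "[p1 t1] loading chunk 3"
```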
tanmaykm (Collaborator, Author) commented on Feb 4, 2016:

Some performance comparisons of the chunks mode against the shared-memory parallelism mode, on two types of machines:

  • laptop, with 16 GB RAM, 4 cores (x2 hyperthreaded)
  • julia.mit.edu machine, with 1 TB RAM, 40 cores (x2 hyperthreaded, slower than the laptop cores)

Data sets used:

  • D: the last.fm dataset (148,111 users, 1,568,126 items, 24,296,858 observations, ~1 GB mmapped size, split across 20 chunks)
  • 4D: 4 × the last.fm dataset (generated by duplicating rows and columns of D)
  • 25D: 25 × the last.fm dataset

All tests were done with:

  • 8 workers
  • 20 iterations
  • chunk cache: 10 chunks per file

In shared-memory mode, the laptop's memory was maxed out even with the smallest dataset, and the RMSE validations could not complete. With chunks, however, the run completed even on the larger datasets.

| machine       | dataset | mode              | time (sec) |
|---------------|---------|-------------------|-----------:|
| julia.mit.edu | 25D     | chunks            |      37658 |
| julia.mit.edu | D       | chunks            |       1646 |
| julia.mit.edu | D       | shmem (pmap)      |       5770 |
| julia.mit.edu | D       | shmem (@parallel) |        629 |
| laptop        | 4D      | chunks            |       2951 |
| laptop        | D       | chunks            |        875 |
| laptop        | D       | shmem (pmap)      |       2220 |
| laptop        | D       | shmem (@parallel) |        320 |

tanmaykm changed the title from "WIP: distributed out-of-core computation" to "distributed out-of-core computation with chunks" on Feb 4, 2016.

tanmaykm added a commit referencing this pull request on Feb 4, 2016: "distributed out-of-core computation with chunks".

tanmaykm merged commit 078a7d0 into abhijithch:master on Feb 4, 2016.
tanmaykm (Collaborator, Author) commented on Feb 4, 2016:

Will take up HDFS and Parquet interfaces as separate PRs.

ViralBShah (Contributor) commented:
@amitmurthy - there is a huge gap between pmap and @parallel. We know why, but we need to find a good solution.
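For context, the two styles being compared look like this (illustrative sketch only; `process_chunk` is a stand-in for the per-chunk ALS work, and since Julia 1.0 `@parallel` is spelled `@distributed`). `pmap` schedules items one at a time, paying messaging overhead per call, while `@parallel` statically splits the range across workers up front:

```julia
using Distributed

# Stand-in for the per-chunk ALS work; with real workers (julia -p N)
# this would need to be defined @everywhere.
process_chunk(i) = sum(rand(100))

pmap(process_chunk, 1:20)         # dynamic: one remote call per item

@sync @distributed for i in 1:20  # static: range partitioned up front
    process_chunk(i)
end
```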

ViralBShah (Contributor) commented:
Cc @andreasnoack

ViralBShah (Contributor) commented:
cc @simonbyrne

I also think some of these notes should go into the README.
