distributed out-of-core computation with chunks #22
Conversation
New parallelism mode: ParChunks. Only the inputs are read from chunks as of now; the model still needs to be in shared memory mode. So this can be used in cases where the inputs are too large to fit into the memory of a single machine, though the model can. The movielens example has been modified to demonstrate how this mode works. Logging has also been improved: `logmsg` is now a macro and also logs the pid and thread id.
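For illustration, here is a minimal sketch of what a macro along these lines can look like on recent Julia; the actual `logmsg` in this PR may format messages differently and may obtain the pid another way:

```julia
# Hedged sketch only; not the exact logmsg implementation in this PR.
using Distributed, Dates

macro logmsg(msg)
    # Prefix each message with a timestamp, the worker id and the thread id.
    quote
        println(stderr, Dates.now(), " [", myid(), ":", Threads.threadid(), "] ", $(esc(msg)))
    end
end

@logmsg "computing residuals"   # e.g. 2024-01-01T00:00:00 [1:1] computing residuals
```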
Some performance comparisons of the chunks mode against the shared memory parallelism mode, on two types of machines:
Data sets used:
All tests were done with:
The laptop's memory was maxed out with the smallest dataset in shared memory mode, and the RMSE validations could not complete. With chunks, however, it ran to completion, even with the larger datasets.
Will take up HDFS and Parquet interfaces as separate PRs.
@amitmurthy - Huge gap between pmap and
cc @simonbyrne I also think some of these notes should go into the README.
This implements a distributed out-of-core model for ALS.
The key component used here is a ChunkedFile. It is a file split into chunks, with metadata describing the range of keys held in each chunk. Chunks can be loaded independently. Loaded data is weakly referenced, so it gets garbage collected under memory pressure. Based on available memory, some references are kept in an LRU cache with a configurable limit.
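To make the idea concrete, here is a simplified sketch of such a structure; the type names, fields, and the `getchunk`/`loader` helpers below are illustrative only, not the actual definitions in this PR:

```julia
# Illustrative sketch, not the PR's ChunkedFile: each chunk records the key range
# it covers and a weak reference to its loaded data; a small LRU list keeps strong
# references to the most recently used chunks so everything else can be collected
# under memory pressure.
struct Chunk{K,V}
    path::String            # file backing this chunk
    keyrange::UnitRange{K}  # range of keys held in the chunk
    data::WeakRef           # WeakRef(nothing) when unloaded or collected
end

mutable struct ChunkedFile{K,V}
    chunks::Vector{Chunk{K,V}}
    lru::Vector{V}          # strong references to recently used chunk data
    lrusize::Int            # configurable cache limit
end

# Return the data of the chunk whose key range contains `key`, loading it with
# `loader(path)` (e.g. an mmap-backed read) if it is not currently in memory.
function getchunk(cf::ChunkedFile{K,V}, key, loader) where {K,V}
    idx = findfirst(c -> key in c.keyrange, cf.chunks)
    idx === nothing && error("no chunk holds key $key")
    c = cf.chunks[idx]
    data = c.data.value
    if data === nothing
        data = loader(c.path)::V
        cf.chunks[idx] = Chunk{K,V}(c.path, c.keyrange, WeakRef(data))
    end
    # Refresh the LRU: keep at most `lrusize` strong references alive.
    filter!(x -> x !== data, cf.lru)
    push!(cf.lru, data)
    length(cf.lru) > cf.lrusize && popfirst!(cf.lru)
    return data
end
```

In this sketch the cached values would be the memory mapped chunk data described next.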
The current implementation stores two types of data structures as chunks, both of which are memory mapped:

- Inputs (the sparse ratings matrix). `mmap_csc_save` and `mmap_csc_load` handle the file format (a rough sketch of such a format is shown below). Since input data is relatively constant, and it's impractical to do the same ETL process on large inputs every time, they should be precomputed and stored as chunks. A transpose of the ratings matrix is also pre-computed. The data must be clean: empty items and users removed, and user/item id mappings handled separately.
- Model matrices (`U` and `P`). These are created at run time. Chunk size is calculated automatically to fit one chunk into a 128MB file.

Concrete types for `Model` and `Inputs` for the distributed memory mode implement the abstractions introduced by #21:

- `DistModel` uses memory mapped dense matrix chunks.
- `DistInputs` uses memory mapped sparse matrix chunks.

I have tried this with the movielens and last.fm datasets. Both seem to be running correctly, with much less memory pressure. With this, I am able to complete a run on the last.fm dataset on my laptop, which otherwise runs out of memory. As expected, this mode is much slower than shared memory mode.
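Coming back to the chunk file format above, here is a rough sketch of one way a CSC sparse chunk can be written out and mapped back lazily. The helper names `csc_save`/`csc_load` are made up for the sketch; the actual on-disk format of `mmap_csc_save`/`mmap_csc_load` may differ.

```julia
# Rough sketch: write the three CSC arrays behind a small header, then map them
# back with Mmap so loading a chunk does not eagerly copy the data.
using Mmap, SparseArrays

function csc_save(path::AbstractString, S::SparseMatrixCSC{Float64,Int64})
    open(path, "w") do io
        # header: rows, cols, number of stored values
        write(io, Int64(S.m), Int64(S.n), Int64(length(S.nzval)))
        write(io, S.colptr); write(io, S.rowval); write(io, S.nzval)
    end
end

function csc_load(path::AbstractString)
    io = open(path, "r")
    m, n, nnzv = read(io, Int64), read(io, Int64), read(io, Int64)
    off = 3 * sizeof(Int64)
    colptr = Mmap.mmap(io, Vector{Int64}, (Int(n) + 1,), off)
    off += sizeof(Int64) * (Int(n) + 1)
    rowval = Mmap.mmap(io, Vector{Int64}, (Int(nnzv),), off)
    off += sizeof(Int64) * Int(nnzv)
    nzval  = Mmap.mmap(io, Vector{Float64}, (Int(nnzv),), off)
    close(io)  # the mappings stay valid after the stream is closed
    return SparseMatrixCSC(Int(m), Int(n), colptr, rowval, nzval)
end
```

Storing the ratings chunks (and their transpose) this way lets a worker map only the chunk it needs into memory instead of holding the whole dataset.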
Way forward: