Dask provides multi-core execution on larger-than-memory datasets using blocked algorithms and task scheduling. It maps high-level NumPy, Pandas, and list operations on large datasets on to many operations on small in-memory datasets. It then executes these graphs in parallel on a single machine. Dask lets us use traditional NumPy, Pandas, and list programming while operating on inconveniently large data in a small amount of space.
dask
is a specification to describe task dependency graphs.dask.array
is a drop-in NumPy replacement (for a subset of NumPy) that encodes blocked algorithms indask
dependency graphs.dask.bag
encodes blocked algorithms on Python lists of arbitrary Python objects.dask.dataframe
encodes blocked algorithms on Pandas DataFrames.dask.async
is a shared-memory asynchronous scheduler efficiently executedask
dependency graphs on multiple cores.
See full documentation at http://dask.pydata.org or read developer-focused blogposts about dask's development.
Dask.array implements a numpy clone on larger-than-memory datasets using multiple cores.
>>> import dask.array as da
>>> x = da.random.normal(10, 0.1, size=(100000, 100000), chunks=(1000, 1000))
>>> x.mean(axis=0)[:3].compute()
array([ 10.00026926, 10.0000592 , 10.00038236])
Dask.dataframe implements a Pandas clone on larger-than-memory datasets using multiple cores.
>>> import dask.dataframe as dd
>>> df = dd.read_csv('nyc-taxi-*.csv.gz')
>>> g = df.groupby('medallion')
>>> g.trip_time_in_secs.mean().head(5)
medallion
0531373C01FD1416769E34F5525B54C8 795.875026
867D18559D9D2941173AD7A0F3B33E77 924.187954
BD34A40EDD5DC5368B0501F704E952E7 717.966875
5A47679B2C90EA16E47F772B9823CE51 763.005149
89CE71B8514E7674F1C662296809DDF6 869.274052
Name: trip_time_in_secs, dtype: float64
Dask.bag implements a large collection of Python objects and mimicing the toolz interface
>>> import dask.bag as db
>>> import json
>>> b = db.from_filenames('2014-*.json.gz')
... .map(json.loads)
>>> alices = b.filter(lambda d: d['name'] == 'Alice')
>>> alices.take(3)
({'name': 'Alice', 'city': 'LA', 'balance': 100},
{'name': 'Alice', 'city': 'LA', 'balance': 200},
{'name': 'Alice', 'city': 'NYC', 'balance': 300},
>>> dict(alices.pluck('city').frequencies())
{'LA': 10000, 'NYC': 20000, ...}
Dask.array, dask.dataframe, and dask.bag are thin layers on top of dask graphs, which represent computational task graphs of regular Python functions on regular Python objects.
As an example consider the following simple program:
def inc(i):
return i + 1
def add(a, b):
return a + b
x = 1
y = inc(x)
z = add(y, 10)
We encode this computation as a dask graph in the following way:
d = {'x': 1,
'y': (inc, 'x'),
'z': (add, 'y', 10)}
A dask graph is just a dictionary of tuples where the first element of the tuple is a function and the rest are the arguments for that function. While this representation of the computation above may be less aesthetically pleasing, it may now be analyzed, optimized, and computed by other Python code, not just the Python interpreter.
Dask is easily installable through your favorite Python package manager:
conda install dask or pip install dask[array] or pip install dask[bag] or pip install dask[dataframe] or pip install dask[complete]
dask.core
supports Python 2.6+ and Python 3.3+ with a common codebase. It
is pure Python and requires no dependencies beyond the standard library. It is
a light weight dependency.
dask.array
depends on numpy
.
dask.bag
depends on toolz
and dill
.
Dask examples are available in the following repository: https://github.com/blaze/dask-examples.
You can also find them in Anaconda.org: https://notebooks.anaconda.org/dask/.
New BSD. See License File.
One might ask why we didn't use one of these other fine libraries:
- Luigi
- Joblib
- mrjob
- Any of the fine schedulers in numeric analysis (Plasma, PaRSEC, ...)
- Any of the fine high-throughput schedulers (Condor, Pegasus, Swiftlang, ...)
The answer is because we wanted all of the following:
- Fine-ish grained parallelism (latencies around 1ms)
- In-memory communication of intermediate results
- Dependency structures more complex than
map
- Good support for numeric data
- First class Python support
- Trivial installation
Most task schedulers in the Python ecosystem target long-running batch jobs, often for processing large amounts of text and aren't appropriate for executing multi-core numerics.
There are many "Big NumPy Array" or general distributed array solutions all with fine characteristics. Some projects in the Python ecosystem include the following:
There is a rich history of distributed array computing. An incomplete sampling includes the following projects: