
Performance issues with adlfs mappers #161

Closed
jhamman opened this issue Jan 7, 2021 · 9 comments


jhamman commented Jan 7, 2021

What happened:

I'm noticing that the write performance of adlfs, specifically when using xarray/dask to write to zarr, is much slower than that of ABSStore (built into Zarr). The gap widens with larger datasets (more chunks). I have a hunch that this is an async issue, but I'm not sure how to test that theory.

What you expected to happen:

I expected adlfs performance to be on par with or better than the Zarr implementation.

Minimal Complete Verifiable Example:

from adlfs import AzureBlobFileSystem
from zarr.storage import ABSStore
import xarray as xr

# sample data
ds = xr.tutorial.load_dataset('rasm').chunk({'x': 140, 'y': 105, 'time': 18})

# zarr mapper
store1 = ABSStore(container='carbonplan-scratch', prefix='test/store1', account_name='carbonplan', account_key=...)

# adlfs mapper
fs = AzureBlobFileSystem(account_name='carbonplan', account_key=...)
store2 = fs.get_mapper('carbonplan-scratch/test/store2')

%%timeit -n 5 -r 5
ds.to_zarr(store1, mode='w')
# 1.02 s ± 79.6 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

%%timeit -n 5 -r 5
ds.to_zarr(store2, mode='w')
# 9.1 s ± 1.98 s per loop (mean ± std. dev. of 5 runs, 5 loops each)

Anything else we need to know?:

The example above was tested on adlfs@master and zarr-developers/zarr-python#620.

Environment:

  • Dask version: 2.30.0
  • adlfs version: v0.5.9
  • xarray version: 0.16.2
  • zarr version: 2.2.0a2.dev650
  • Python version: 3.7.9
  • Operating System: Linux
  • Install method (conda, pip, source): conda/pip

cc @TomAugspurger


hayesgb commented Jan 7, 2021 via email


jhamman commented Jan 8, 2021

I ran the same write tests with adlfs==0.2.4, azure-storage-blob==2.1.0, and zarr==2.6.1:

%%timeit -n 5 -r 5
ds.to_zarr(store1, mode='w')
# 1.09 s ± 32.3 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

%%timeit -n 5 -r 5
ds.to_zarr(store2, mode='w')
# 2.96 s ± 116 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

Zarr's store performs basically the same. adlfs was faster in 0.2.4, but still about 3x slower than Zarr's ABSStore.


hayesgb commented Jan 8, 2021

I'm trying to reproduce the observed issue. It looks like zarr is pinned to azure-storage-blob==2.0.1. Is this correct? I'm looking at zarr==2.6.1.


jhamman commented Jan 8, 2021

Yes, the current version of zarr pins azure-storage-blob==2.0.1 (link). My first set of benchmarks used azure-storage-blob==12.5.0 (zarr-developers/zarr-python#620).


hayesgb commented Jan 9, 2021

@jhamman -- I'm still looking at this, but while investigating I found what I believe is a bug in zarr-python's storage.py, located here. I've opened PR #688 for it. This allows the filesystem to perform the rmdirs.
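
To make the idea concrete, here's a minimal sketch of what delegating directory removal to the filesystem can look like for an fsspec-backed store. This is an illustration under assumed attribute names (self.fs, self.path), not the actual zarr-python patch:

# Hypothetical sketch: hand recursive deletes to the underlying fsspec filesystem
# instead of removing each key individually through the store interface.
def rmdir(self, path=None):
    store_path = "/".join(filter(None, [self.path, path]))
    if self.fs.isdir(store_path):
        self.fs.rm(store_path, recursive=True)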


hayesgb commented Jan 11, 2021

@jhamman -- I have a new branch that implements asynchronous read and write. Timing it, I get the following:

%%timeit -n 5 -r 5

with fs.open("data/root/a1/file.txt", "w") as f:
    f.write("0123456789")

with fs.open("data/root/a1/file.txt") as f:
    d = f.read()
assert d == b"0123456789"

# 519 ms ± 30 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
# 479 ms ± 10.5 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
# 505 ms ± 16.8 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

vs adlfs==0.5.9

# 913 ms ± 34.6 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
# 918 ms ± 14.7 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
# 915 ms ± 11 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

I've modified pipe_file (which is called by the fsspec mapper) to upload directly to Azure asynchronously. I'd be curious what you're seeing with this branch.
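
For anyone following along, here is a rough sketch of the direct-async-upload idea; the function name, argument layout, and client handling are illustrative assumptions, not the actual adlfs implementation:

import asyncio
from azure.storage.blob.aio import BlobServiceClient

async def async_pipe_file(account_url, credential, container, blob, data: bytes):
    # Write one key's bytes straight to Blob Storage, bypassing buffered file objects.
    async with BlobServiceClient(account_url=account_url, credential=credential) as service:
        blob_client = service.get_container_client(container).get_blob_client(blob)
        await blob_client.upload_blob(data, overwrite=True)

# Hypothetical usage:
# asyncio.run(async_pipe_file("https://carbonplan.blob.core.windows.net", account_key,
#                             "carbonplan-scratch", "test/store2/Tair/0.0.0", b"...chunk bytes..."))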


jhamman commented Jan 11, 2021

@hayesgb - this is huge. On my test case, this just resulted in a ~10x speedup.

The current timings are:

  • zarr-python#620: 848 ms ± 40.1 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
  • adlfs@master: 11.5 s ± 4.61 s per loop (mean ± std. dev. of 5 runs, 5 loops each)
  • adlfs@speedup: 1.21 s ± 79.3 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

It is still a bit slower than the Zarr implementation but this is super promising. Do we think the difference is overhead from fsspec? Or are there more pieces that could be tuned?


hayesgb commented Jan 14, 2021

Thanks for the feedback @jhamman. As you saw, I've merged this in as #164. Tagging @mgsnuno based on feedback in #57.

I've tried some more tuning and am not seeing much additional benefit. I even went so far as to try creating an ADLSMap object and consolidating the mkdir and pipe_file methods into a single asynchronous function. There may be an opportunity to consolidate session objects, as described here, or otherwise improve the use of client objects; see the sketch below.
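
As a hedged illustration of consolidating sessions, one could hold a single async ContainerClient for the life of the store and fan out uploads over it, rather than constructing a new client per call. The names here are assumptions for the sketch, not adlfs internals:

import asyncio
from azure.storage.blob.aio import ContainerClient

async def upload_chunks(account_url, credential, container, items):
    # Reuse one client (and its underlying HTTP session) for every (blob_name, data) pair.
    async with ContainerClient(account_url, container, credential=credential) as client:
        await asyncio.gather(*(
            client.upload_blob(name, data, overwrite=True)
            for name, data in items
        ))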

I plan to update to address #134, and then release 0.6 over the next few days.

What are your thoughts on closing this for now?


jhamman commented Jan 14, 2021

@hayesgb - happy to see this closed now. I agree there may be a few bits of further tuning to do but let's get this released and address them down the road. Thanks again for your help on this.

jhamman closed this as completed Jan 14, 2021