
Performance issues with adlfs mappers #161

Closed
jhamman opened this issue Jan 7, 2021 · 9 comments


jhamman commented Jan 7, 2021

What happened:

I'm noticing that the write performance of adlfs, specifically when using xarray/dask to write to zarr, is much slower than that of ABSStore (built into Zarr). The gap widens with larger datasets (more chunks). I have a hunch that this is an async issue, but I'm not sure how to test that theory.

What you expected to happen:

I expected adlfs performance to be on par with or better than the Zarr implementation.

Minimal Complete Verifiable Example:

from adlfs import AzureBlobFileSystem
from zarr.storage import ABSStore
import xarray as xr

# sample data
ds = xr.tutorial.load_dataset('rasm').chunk({'x': 140, 'y': 105, 'time': 18})

# zarr mapper
store1 = ABSStore(container='carbonplan-scratch', prefix='test/store1', account_name='carbonplan', account_key=...)

# adlfs mapper
fs = AzureBlobFileSystem(account_name='carbonplan', account_key=...)
store2 = fs.get_mapper('carbonplan-scratch/test/store2')

%%timeit -n 5 -r 5
ds.to_zarr(store1, mode='w')
# 1.02 s ± 79.6 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

%%timeit -n 5 -r 5
ds.to_zarr(store2, mode='w')
# 9.1 s ± 1.98 s per loop (mean ± std. dev. of 5 runs, 5 loops each)

Anything else we need to know?:

The example above was tested on adlfs@master and zarr-developers/zarr-python#620.

Environment:

  • Dask version: 2.30.0
  • adlfs version: v0.5.9
  • xarray version: 0.16.2
  • zarr version: 2.2.0a2.dev650
  • Python version: 3.7.9
  • Operating System: Linux
  • Install method (conda, pip, source): conda/pip

cc @TomAugspurger


hayesgb commented Jan 7, 2021 via email


jhamman commented Jan 8, 2021

I ran the same write tests with adlfs==0.2.4, azure-storage-blob==2.1.0, and zarr==2.6.1:

%%timeit -n 5 -r 5
ds.to_zarr(store1, mode='w')
# 1.09 s ± 32.3 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

%%timeit -n 5 -r 5
ds.to_zarr(store2, mode='w')
# 2.96 s ± 116 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

Zarr's store performs basically the same. adlfs was faster in 0.2.4, but still about 3x slower than Zarr's ABSStore.


hayesgb commented Jan 8, 2021

I'm trying to reproduce the observed issue. It looks like zarr is pinned to azure-storage-blob==2.0.1. Is this correct? I'm looking at zarr==2.6.1.


jhamman commented Jan 8, 2021

Yes, the current version of zarr pins azure-storage-blob==2.0.1 (link). My first set of benchmarks used azure-storage-blob==12.5.0 (zarr-developers/zarr-python#620).


hayesgb commented Jan 9, 2021

@jhamman -- I'm still looking at this, but while investigating I found what I believe is a bug in zarr-python's storage.py, located here. I've opened PR #688 for it. This allows the filesystem to perform the rmdirs.
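
To make the idea concrete, here's a minimal sketch of what delegating directory removal to the filesystem can look like for an fsspec-backed store. This is an illustration under assumed attribute names (self.fs, self.path), not the actual zarr-python patch:

# Hypothetical sketch: hand recursive deletes to the underlying fsspec filesystem
# instead of removing each key individually through the store interface.
def rmdir(self, path=None):
    store_path = "/".join(filter(None, [self.path, path]))
    if self.fs.isdir(store_path):
        self.fs.rm(store_path, recursive=True)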


hayesgb commented Jan 11, 2021

@jhamman -- I have a new branch that implements asynchronous read and write. Timing it, I get the following:

%%timeit -n 5 -r 5

with fs.open("data/root/a1/file.txt", "w") as f:
    f.write("0123456789")

with fs.open("data/root/a1/file.txt") as f:
    d = f.read()
assert d == b"0123456789"

# 519 ms ± 30 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
# 479 ms ± 10.5 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
# 505 ms ± 16.8 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

vs adlfs==0.5.9

# 913 ms ± 34.6 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
# 918 ms ± 14.7 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
# 915 ms ± 11 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

I've modified pipe_file (which is called by the fsspec mapper) to upload directly to Azure asynchronously. I'd be curious what you're seeing with this branch.
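
For anyone following along, here is a rough sketch of the direct-async-upload idea; the function name, argument layout, and client handling are illustrative assumptions, not the actual adlfs implementation:

import asyncio
from azure.storage.blob.aio import BlobServiceClient

async def async_pipe_file(account_url, credential, container, blob, data: bytes):
    # Write one key's bytes straight to Blob Storage, bypassing buffered file objects.
    async with BlobServiceClient(account_url=account_url, credential=credential) as service:
        blob_client = service.get_container_client(container).get_blob_client(blob)
        await blob_client.upload_blob(data, overwrite=True)

# Hypothetical usage:
# asyncio.run(async_pipe_file("https://carbonplan.blob.core.windows.net", account_key,
#                             "carbonplan-scratch", "test/store2/Tair/0.0.0", b"...chunk bytes..."))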


jhamman commented Jan 11, 2021

@hayesgb - this is huge. On my test case, this just resulted in a ~10x speedup.

The current timings are:

  • zarr-python#620: 848 ms ± 40.1 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
  • adlfs@master: 11.5 s ± 4.61 s per loop (mean ± std. dev. of 5 runs, 5 loops each)
  • adlfs@speedup: 1.21 s ± 79.3 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

It is still a bit slower than the Zarr implementation but this is super promising. Do we think the difference is overhead from fsspec? Or are there more pieces that could be tuned?


hayesgb commented Jan 14, 2021

Thanks for the feedback @jhamman. As you saw, I've merged this in as #164. Tagging @mgsnuno based on feedback in #57.

I've tried some more tuning and am not seeing much additional benefit. I even went so far as to try creating an ADLSMap object and consolidating the mkdir and pipe_file methods into a single asynchronous function. There may be an opportunity to consolidate session objects, as described here, or otherwise improve the use of client objects; see the sketch below.
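
As a hedged illustration of consolidating sessions, one could hold a single async ContainerClient for the life of the store and fan out uploads over it, rather than constructing a new client per call. The names here are assumptions for the sketch, not adlfs internals:

import asyncio
from azure.storage.blob.aio import ContainerClient

async def upload_chunks(account_url, credential, container, items):
    # Reuse one client (and its underlying HTTP session) for every (blob_name, data) pair.
    async with ContainerClient(account_url, container, credential=credential) as client:
        await asyncio.gather(*(
            client.upload_blob(name, data, overwrite=True)
            for name, data in items
        ))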

I plan to update to address #134, and then release 0.6 over the next few days.

What are your thoughts on closing this for now?


jhamman commented Jan 14, 2021

@hayesgb - happy to see this closed now. I agree there may be a few bits of further tuning to do but let's get this released and address them down the road. Thanks again for your help on this.

jhamman closed this as completed Jan 14, 2021