Performance issues with adlfs mappers #161
jhamman opened this issue on Jan 7, 2021:
What happened:
I'm noticing that the write performance of adlfs, specifically when using xarray/dask to write to zarr, is much slower than that of ABSStore (built into Zarr). The gap widens with larger datasets (more chunks). I have a hunch that this is an async issue, but I'm not sure how to test that theory.
What you expected to happen:
I expected adlfs performance to be on par with, or better than, the Zarr implementation.
Minimal Complete Verifiable Example:
from adlfs import AzureBlobFileSystem
from zarr.storage import ABSStore
import xarray as xr
# sample data
ds = xr.tutorial.load_dataset('rasm').chunk({'x': 140, 'y': 105, 'time': 18})
# zarr mapper
store1 = ABSStore(container='carbonplan-scratch', prefix='test/store1', account_name='carbonplan', account_key=...)
# adlfs mapper
fs = AzureBlobFileSystem(account_name='carbonplan', account_key=...)
store2 = fs.get_mapper('carbonplan-scratch/test/store2')
%%timeit -n 5 -r 5
ds.to_zarr(store1, mode='w')
# 1.02 s ± 79.6 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
%%timeit -n 5 -r 5
ds.to_zarr(store2, mode='w')
# 9.1 s ± 1.98 s per loop (mean ± std. dev. of 5 runs, 5 loops each)
Anything else we need to know?:
The example above was tested on adlfs@master and zarr-developers/zarr-python#620.
Environment:
Dask version: 2.30.0
adlfs version: v0.5.9
xarray version: 0.16.2
zarr version: 2.2.0a2.dev650
Python version: 3.7.9
Operating System: Linux
Install method (conda, pip, source): conda/pip
cc @TomAugspurger
hayesgb commented:
Thanks, Joe. Currently the AzureBlobFile methods run synchronously, but they are built on Azure Storage Python v12. Do you know how this compares to abfs == 0.2.4? That version used Azure Storage Python 2.1.
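For context, here is a minimal sketch, with illustrative names (write_chunks_sync and the account URL are placeholders drawn from the example above, not adlfs internals), of what a synchronous per-chunk write path looks like against the v12 SDK: each zarr chunk is one blocking round trip.

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://carbonplan.blob.core.windows.net",
    credential="<account_key>",  # placeholder credential
)
container_client = service.get_container_client("carbonplan-scratch")

def write_chunks_sync(chunks):
    # One blocking upload per zarr chunk; nothing overlaps.
    for key, data in chunks.items():
        container_client.upload_blob(name=key, data=data, overwrite=True)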
jhamman commented:
I ran the same write tests with adlfs==0.2.4. Zarr's store timings are basically the same; adlfs was faster under 0.2.4, but still about 3x slower than Zarr.
hayesgb commented:
I'm trying to reproduce the observed issue. It looks like zarr is pinned to azure-storage-blob==2.0.1. Is this correct? I'm looking at zarr==2.6.1.
jhamman commented:
Yes, the current version of zarr pins azure-storage-blob.
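For anyone reproducing this, a quick way to check which SDK version an environment actually resolved (plain pkg_resources, nothing adlfs-specific):

import pkg_resources

# zarr's ABSStore path expects the legacy 2.x SDK; current adlfs uses v12
# (per the discussion above).
print(pkg_resources.get_distribution("azure-storage-blob").version)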
hayesgb commented:
@jhamman -- I have a new branch that implements asynchronous read and write; timing it shows a marked improvement. I've modified pipe_file (which is called by the fsspec mapper) to upload directly to Azure asynchronously. I'd be curious what you're seeing with this branch.
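A rough sketch of that idea, with hypothetical names (pipe_chunks_async is illustrative, not the branch's actual code): dispatch all chunk uploads concurrently through the aio client instead of issuing one blocking call per blob.

import asyncio
from azure.storage.blob.aio import BlobServiceClient

async def pipe_chunks_async(account_url, credential, container, chunks):
    # One client and session for the whole batch; uploads run concurrently.
    async with BlobServiceClient(account_url=account_url, credential=credential) as service:
        cc = service.get_container_client(container)
        await asyncio.gather(
            *(cc.upload_blob(name=key, data=data, overwrite=True)
              for key, data in chunks.items())
        )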
jhamman commented:
@hayesgb - this is huge. On my test case, this just resulted in a ~10x speedup. It is still a bit slower than the Zarr implementation, but this is super promising. Do we think the difference is overhead from fsspec? Or are there more pieces that could be tuned?
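One way to probe the fsspec question, sketched against fs and store2 from the example above (the key names here are arbitrary): time a direct pipe_file call against the same payload written through the mapper.

import time

payload = b"x" * 2**20  # 1 MiB dummy chunk

t0 = time.perf_counter()
fs.pipe_file("carbonplan-scratch/test/direct-blob", payload)
t1 = time.perf_counter()

store2["overhead-test"] = payload  # routed through FSMap.__setitem__
t2 = time.perf_counter()

print(f"pipe_file: {t1 - t0:.3f}s  mapper: {t2 - t1:.3f}s")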
hayesgb commented:
Thanks for the feedback @jhamman. As you saw, I've merged this in as #164. Tagging @mgsnuno based on feedback in #57. I've tried doing some more tuning and am not seeing much additional benefit. I even went so far as to try creating an ADLSMap object and consolidating the mkdir and pipe_file methods into a single asynchronous function. There may be an opportunity to consolidate session objects, as described here, or otherwise to improve how client objects are used. I plan to address #134 and then release 0.6 over the next few days. What are your thoughts on closing this for now?
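The session-consolidation idea could look roughly like this sketch, where one aio client lives for the store's lifetime and every operation reuses its connection pool (CachedClientStore is a hypothetical name, not a planned adlfs class):

from azure.storage.blob.aio import BlobServiceClient

class CachedClientStore:
    """Reuse a single service client instead of creating one per call."""

    def __init__(self, account_url, credential, container):
        self._service = BlobServiceClient(account_url=account_url, credential=credential)
        self._container = self._service.get_container_client(container)

    async def put(self, key, data):
        # Every upload shares the one underlying HTTP session.
        await self._container.upload_blob(name=key, data=data, overwrite=True)

    async def close(self):
        await self._service.close()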
jhamman commented:
@hayesgb - happy to see this closed now. I agree there may be a few more bits of tuning to do, but let's get this released and address them down the road. Thanks again for your help on this.