We need a fast path for open_mfdataset #1823

rabernat · 2018-01-12T17:01:49Z

It would be great to have a "fast path" option for open_mfdataset, in which all alignment / coordinate checking is bypassed. This would be used in cases where the user knows that many netCDF files all share the same coordinates (e.g. model output, satellite records from the same product, etc.). The coordinates would just be taken from the first file, and only the data variables would be read from all subsequent files. The only checking would be that the data variables have the correct shape.

Implementing this would require some refactoring. @jbusecke mentioned that he had developed a solution for this (related to #1704), so maybe he could be the one to add this feature to xarray.

This is also related to #1385.

jhamman · 2018-01-12T19:46:12Z

@rabernat - Depending on the structure of the dataset, another possibility that would speed up some open_mfdataset tasks substantially is to open each file and get its metadata in parallel (dask/joblib/etc.), returning either just the dataset schema or a picklable version of the dataset itself. I think this will only work with autoclose=True, but it could be quite useful when working with many files.
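A rough sketch of the "open every file in parallel" idea follows. It swaps in the standard library's thread pool instead of dask or joblib, so it says nothing about pickling or autoclose, and whether threads actually help depends on how the backend locks file access. The helper name `open_many` and its `opener` parameter are made up for illustration and are not part of xarray.

```python
from concurrent.futures import ThreadPoolExecutor

import xarray as xr


def open_many(paths, opener=xr.open_dataset, max_workers=8, **open_kwargs):
    """Open every path concurrently and return the datasets in input order.

    `opener` is a hypothetical hook (defaults to xr.open_dataset) so that the
    helper can be exercised without real files.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of `paths`,
        # regardless of which file finishes opening first.
        return list(pool.map(lambda p: opener(p, **open_kwargs), paths))
```

The returned list could then be handed to `xr.concat` or `xr.merge`; only the per-file open step is parallelized here.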

jbusecke · 2018-01-19T19:45:00Z

I did not really find an elegant solution. What I did was just specify all dims and coords as drop_variables and then update those from a master file with

ds.update(ds_master)

Perhaps this could be generalized in a sense, by reading all coords and dims just from the first file.
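The generalization suggested here could look roughly like the sketch below: read the first ("master") file in full, pass every coordinate that does not vary along the concat dimension as `drop_variables` when opening the remaining files, concatenate, and then copy the coordinates back from the master. The function name `open_with_master_coords` and the `opener` parameter are hypothetical; only `xr.open_dataset(..., drop_variables=...)`, `xr.concat`, `drop_vars` and `assign_coords` are real xarray calls. It blindly trusts that all files share the master's coordinates.

```python
import xarray as xr


def open_with_master_coords(paths, concat_dim, opener=xr.open_dataset):
    """Concatenate files along `concat_dim`, reading shared coords only once.

    Assumption: every file has the same coordinates as the first one,
    apart from the coordinate of `concat_dim` itself. Nothing checks this.
    """
    master = opener(paths[0])
    # Coordinates that do not depend on the concat dimension are "shared".
    shared = [
        name for name in master.coords
        if concat_dim not in master[name].dims
    ]
    # Never read the shared coordinates from the remaining files.
    rest = [opener(p, drop_variables=shared) for p in paths[1:]]
    combined = xr.concat(
        [master.drop_vars(shared)] + rest,
        dim=concat_dim,
        data_vars="minimal",
        coords="minimal",
    )
    # Restore the shared coordinates from the master file.
    return combined.assign_coords({name: master[name] for name in shared})
```

The only implicit check left is that the data variables have matching shapes; a mismatch surfaces when the concatenated result is aligned with the master's coordinates.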

jbusecke · 2018-03-13T23:40:54Z

Would these two options be necessarily mutually exclusive?

I think parallelizing the read in sounds amazing.

But isn't there some merit in skipping some of the checks altogether, if the user is sure about the structure of the data contained in the many files?

I am often working with the aforementioned type of data (many files either contain a new timestep or a different variable, but most of the dimensions/coordinates are the same).

In some cases I am finding that reading the data "lazily" consumes a significant amount of the time in my workflow. I am unsure how hard this would be to achieve, and perhaps it is not worth it after all.

Just putting out a few ideas, while I wait for my xr.open_mfdataset to finish :-)

jhamman · 2018-03-14T00:13:34Z

@jbusecke - No. These options are not mutually exclusive. The parallel open is, in my opinion, the lowest hanging fruit so that's why I started there. There are other improvements that we can tackle incrementally.

jbusecke · 2018-03-14T18:16:38Z

Awesome, thanks for the clarification.
I just looked at #1981 and it seems indeed very elegant (in fact I just now used this approach to parallelize printing of movie frames!) Thanks for that!

dcherian · 2019-05-01T21:42:01Z

I am currently motivated to fix this.

Over in concat prealigned objects #1413 (comment) @rabernat mentioned

allowing the user to pass join='exact' via open_mfdataset. A related optimization would be to allow the user to pass coords='minimal' (or other concat coords options) via open_mfdataset.

@shoyer suggested calling decode_cf later here, though perhaps this won't help too much: slow performance with open_mfdataset #1385 (comment)

Is this all that we can do on the xarray side?

TomNicholas · 2019-05-03T09:25:00Z

@dcherian I'm sorry, I'm very interested in this but after reading the issues I'm still not clear on what's being proposed:

What exactly is the bottleneck? Is it reading the coords from all the files? Is it loading the coord values into memory? Is it performing the alignment checks on those coords once they're in memory? Is it performing alignment checks on the dimensions? Is this suggestion relevant to datasets that don't have any coords?

Which of these steps would a join='exact' option omit?

A related optimization would be to allow the user to pass coords='minimal' (or other concat coords options) via open_mfdataset.

But this is already an option to open_mfdataset?

j08lue · 2019-05-03T11:26:06Z

The original issue of this thread is that you sometimes might want to disable alignment checks for coordinates other than the concat_dim and only check for same dimensions and dimension shapes.

When you xr.merge with join='exact', it still checks for alignment (see #1330 (comment)), but does not join the coordinates if they are not aligned. This behavior (not joining) is also included in what @rabernat envisioned here, but his suggestion goes beyond that: you don't even load coordinate values from all but the first dataset and just blindly trust that they are aligned.

So xr.open_mfdataset(join='exact', coords='minimal') does not fix this issue here, I think.
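A small in-memory illustration of the distinction drawn above: `join="exact"` still compares the indexes and raises when they differ, so the coordinate values of every object must be available. For contrast, the sketch also uses `join="override"` (the option added by #3175, which is linked further down in this thread), which skips the comparison and takes the indexes of the first object. The datasets are synthetic.

```python
import numpy as np
import xarray as xr

a = xr.Dataset({"u": ("x", [1.0, 2.0, 3.0])}, coords={"x": [0.0, 1.0, 2.0]})
# Same length, but the last index value differs by a rounding-sized amount.
b = xr.Dataset({"v": ("x", [4.0, 5.0, 6.0])}, coords={"x": [0.0, 1.0, 2.000001]})

# join="exact": indexes are compared, and a mismatch is an error.
try:
    xr.merge([a, b], join="exact")
    exact_raised = False
except ValueError:
    exact_raised = True

# join="override": no comparison, the first object's index is reused.
overridden = xr.merge([a, b], join="override")
```

Neither option avoids reading the coordinate values from the files in the first place, which is the part of the original request that goes further.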

rabernat · 2019-05-03T13:47:12Z

So I think it is quite important to consider this issue together with #2697. An XML specification called NcML already exists which tells software how to put together multiple netCDF files into a single virtual netCDF dataset. We should leverage this existing spec as much as possible.

A realistic use case for me is that I have, say, 1000 files of high-res model output, each with large coordinate variables, all generated from the same model run. If we know a priori that certain coordinates (dimension coordinates or otherwise) are identical across those files, we could save a lot of disk reads (the slow part of open_mfdataset) by never reading those coordinates at all. Enabling this would require a pretty low-level change in xarray. For example, we couldn't even rely on open_dataset in its current form to open files, because open_dataset eagerly loads all dimension coordinates into indexes. One way forward might be to create a new Store class.

For a catalog of tricks I use to optimize opening these sorts of big, complex, multi-file datasets (e.g. CMIP), check out
https://github.com/pangeo-data/esgf2xarray/blob/master/esgf2zarr/aggregate.py

dcherian · 2019-05-03T15:29:14Z

One common use-case is files with large numbers of concat_dim-invariant non-dimensional co-ordinates. This is easy to speed up by dropping those variables from all but the first file.

e.g.
https://github.com/pangeo-data/esgf2xarray/blob/6a5e4df0d329c2f23b403cbfbb65f0f1dfa98d52/esgf2zarr/aggregate.py#L107-L110

    # keep only coordinates from first ensemble member to simplify merge
    first = member_dsets_aligned[0]
    rest = [mds.reset_coords(drop=True) for mds in member_dsets_aligned[1:]]
    objs_to_concat = [first] + rest
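For anyone who wants to try the trick quoted above without the surrounding esgf2xarray code, here is a self-contained sketch on synthetic data. The `member`, `lon` and `lat` names and the `make_member` helper are invented for the example. It assumes a reasonably recent xarray in which `xr.concat` accepts variables that are present in only some of the inputs.

```python
import numpy as np
import xarray as xr


def make_member(i):
    """Build one synthetic ensemble member with 2-D non-index coordinates."""
    return xr.Dataset(
        {"tas": (("member", "y", "x"), np.full((1, 2, 3), float(i)))},
        coords={
            "member": [i],
            "lon": (("y", "x"), np.arange(6.0).reshape(2, 3)),
            "lat": (("y", "x"), np.arange(6.0).reshape(2, 3)),
        },
    )


members = [make_member(i) for i in range(4)]

# Keep the non-index coordinates of the first member only.
first = members[0]
rest = [ds.reset_coords(drop=True) for ds in members[1:]]

# lon/lat now exist in a single input, so there is nothing to compare.
combined = xr.concat(
    [first] + rest, dim="member", data_vars="minimal", coords="minimal"
)
```

`reset_coords(drop=True)` only removes non-index coordinates, so the `member` index survives and is still concatenated.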

Similarly https://github.com/NCAR/intake-esm/blob/e86a8e8a80ce0fd4198665dbef3ba46af264b5ea/intake_esm/aggregate.py#L53-L57

def merge_vars_two_datasets(ds1, ds2):
    """
    Merge two datasets, dropping all variables from
    second dataset that already exist in the first dataset's coordinates.
    """

See also #2039 (second code block)

One way to do this might be to add a master_file kwarg to open_mfdataset. This would imply coords='minimal', join='exact' (I think; prealigned=True in some other proposals) and would drop non-dimensional coordinates from all but the first file and then call concat.

As a bonus, it would assign attributes from the master_file to the merged dataset (for which I think there are open issues): this functionality exists in netCDF4.MFDataset, so that's a plus.

EDIT: #2039 (third code block) is also a possibility. This might look like

xr.open_mfdataset('files*.nc', master_file='first', concat_dim='time')

in which case the first file is read; all coords that are not concat_dim become drop_variables for an open_dataset call that reads the remaining files. We then merge with the first dataset and assign attrs.

EDIT2: master_file combines two different functionalities here: specifying a "template file" and a file to choose attributes from. So maybe we need two kwargs: template_file and attrs_from?

rabernat · 2019-09-16T14:53:57Z

Is this issue really closed?!?

🎉🎂🏆🥇

dcherian · 2019-09-16T15:00:16Z

YES!
(well almost)

The PR lets you skip compatibility checks.
The magic spell is xr.open_mfdataset(..., data_vars="minimal", coords="minimal", compat="override")
You can skip index comparison by adding join="override".
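To see what these options do without any files on disk, the sketch below passes the same keywords to `xr.concat`, which accepts them directly; `open_mfdataset` forwards them to its combine step. The data are synthetic, and the second time step deliberately has slightly perturbed coordinates to show that nothing compares them.

```python
import numpy as np
import xarray as xr


def make_step(t, jitter):
    """One synthetic time step; `jitter` perturbs the coordinates."""
    return xr.Dataset(
        {"sst": (("time", "x"), np.full((1, 3), float(t)))},
        coords={
            "time": [t],
            "x": np.array([0.0, 1.0, 2.0]) + jitter,
            "area": ("x", np.array([1.0, 1.0, 1.0]) + jitter),
        },
    )


steps = [make_step(0, 0.0), make_step(1, 1e-9)]

combined = xr.concat(
    steps,
    dim="time",
    data_vars="minimal",  # only concatenate variables that have 'time'
    coords="minimal",     # same rule for coordinates
    compat="override",    # take non-concatenated variables from the first
    join="override",      # reuse the first object's indexes, no comparison
)
```

With the default `join="outer"` the two slightly different `x` indexes would be unioned into six points; with `join="override"` the result keeps the three points of the first dataset.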

What's left is extremely large indexes and lazy index / coordinate loading, but we have #2039 open for that. I will rename that issue.

If you have time, can you test it out?

TomNicholas · 2019-09-16T18:43:52Z

This is big if true!

But surely to close an issue raised by complaints about speed, we should really have some new asv speed tests?

dcherian · 2019-09-16T19:01:57Z

=) @TomNicholas PRs welcome!

dcherian · 2019-09-16T19:03:47Z

PS @rabernat

%%time
ds = xr.open_mfdataset("/glade/p/cesm/community/ASD-HIGH-RES-CESM1/hybrid_v5_rel04_BC5_ne120_t12_pop62/ocn/proc/tseries/monthly/*.nc", 
                        parallel=True, coords="minimal", data_vars="minimal", compat='override')

This completes in 40 seconds with 10 workers on cheyenne.

jbusecke · 2019-09-16T20:29:35Z

Wooooow. Thanks. I'll have to give this a whirl soon.

dcherian · 2021-01-27T17:50:09Z

Let's close this since there is an opt-in mostly-fast path. I've added an item to #4648 to cover adding an asv benchmark for mfdataset.

Hossein-Madadi · 2021-01-27T21:51:24Z

PS @rabernat

%%time
ds = xr.open_mfdataset("/glade/p/cesm/community/ASD-HIGH-RES-CESM1/hybrid_v5_rel04_BC5_ne120_t12_pop62/ocn/proc/tseries/monthly/*.nc", 
                        parallel=True, coords="minimal", data_vars="minimal", compat='override')

This completes in 40 seconds with 10 workers on cheyenne.

@dcherian, thanks for your solution. In my experience with 34013 NetCDF files, I could open 117 GiB in 13min 14s. Can I decrease this time?

dcherian · 2021-01-27T22:43:59Z

That's 34k 3MB files! I suggest combining to 1k 100MB files, that would work a lot better.
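The arithmetic behind this suggestion can be checked quickly. The sketch assumes the 117 GiB figure is binary gibibytes, takes 100 MB as 10^8 bytes, and treats all files as equally sized; `group_paths` and the `f{i}.nc` names are hypothetical.

```python
import math

total_bytes = 117 * 2**30   # 117 GiB reported above
n_files = 34013

avg_bytes = total_bytes / n_files          # roughly 3.5 MiB per file
target_bytes = 100 * 10**6                 # aim for about 100 MB per output

# How many input files to combine into each output file, and how many
# output files that leaves.
per_group = math.ceil(target_bytes / avg_bytes)
n_groups = math.ceil(n_files / per_group)


def group_paths(paths, size):
    """Split a list of paths into consecutive groups of at most `size`."""
    return [paths[i:i + size] for i in range(0, len(paths), size)]


groups = group_paths([f"f{i}.nc" for i in range(n_files)], per_group)
```

Each group could then be opened on its own and written back out as a single file, for example with `Dataset.to_netcdf`, so that later `open_mfdataset` calls touch on the order of a thousand files instead of 34k.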

jbusecke mentioned this issue Jan 19, 2018

speed up opening multiple files with changing data variables #1845

Closed

jhamman mentioned this issue Mar 11, 2018

use dask to open datasets in parallel #1981

Closed

rabernat mentioned this issue Apr 5, 2018

open_mfdataset: skip loading for indexes and coordinates from all but the first file #2039

Open

rabernat mentioned this issue Jun 6, 2018

tolerance for alignment #2217

Open

TomNicholas mentioned this issue Nov 2, 2018

Concatenate across multiple dimensions with open_mfdataset #2159

Closed

TomNicholas mentioned this issue Dec 8, 2018

Concatenate using global indexes boutproject/xBOUT#3

Open

dcherian mentioned this issue Jan 22, 2019

Error when using engine='scipy' reading CM2.6 ocean output #1704

Closed

dcherian closed this as completed May 1, 2019

dcherian reopened this May 1, 2019

TomNicholas mentioned this issue May 23, 2019

Reading single grid cells from a multi-file netcdf dataset? #2979

Open

dcherian mentioned this issue Aug 1, 2019

Add join='override' #3175

Merged

dcherian mentioned this issue Sep 7, 2019

Refactor concat to use merge for non-concatenated variables #3239

Merged

dcherian closed this as completed in #3239 Sep 16, 2019

dcherian reopened this Sep 16, 2019

angus-g mentioned this issue Sep 17, 2019

Reducing complexity of cc.querying.getvar COSIMA/cosima-cookbook#147

Open

TomNicholas mentioned this issue Jan 11, 2020

Allow kwargs to open_boutdataset boutproject/xBOUT#102

Merged

aaronspring mentioned this issue Sep 3, 2020

speedup with xr.open_mfdataset kwargs antarcticrainforest/esm_analysis#14

Closed

dcherian mentioned this issue Jan 27, 2021

Comprehensive benchmarking suite #4648

Open

dcherian closed this as completed Jan 27, 2021

aaronspring mentioned this issue May 10, 2021

Two dates reforecasts dask warnings and takes ages ecmwf-lab/climetlab-s2s-ai-challenge#9

Closed
