We need a fast path for open_mfdataset #1823
Comments
@rabernat - Depending on the structure of the dataset, another possibility that would speed up some |
I did not really find an elegant solution. What I did was just specify all dims and coords as
Perhaps this could be generalized in a sense, by reading all coords and dims just from the first file. |
Would these two options necessarily be mutually exclusive? I think parallelizing the read-in sounds amazing. But isn't there some merit in skipping some of the checks altogether, if the user is sure about the structure of the data contained in the many files? I often work with the aforementioned type of data (many files that each contain either a new timestep or a different variable, while most of the dimensions/coordinates are the same). In some cases I find that reading the data "lazily" consumes a significant share of the time in my workflow. I am unsure how hard this would be to achieve, and perhaps it is not worth it after all. Just putting out a few ideas, while I wait for my |
@jbusecke - No. These options are not mutually exclusive. The parallel open is, in my opinion, the lowest hanging fruit so that's why I started there. There are other improvements that we can tackle incrementally. |
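To make the "parallel open" idea concrete, here is a minimal sketch of opening many files concurrently while keeping them in input order. It only illustrates the idea and is not xarray's implementation (xarray's `parallel=True` option is documented as using `dask.delayed`). The names `open_all` and `fake_opener` are hypothetical; in real use the opener would be a per-file open call.

```python
from concurrent.futures import ThreadPoolExecutor


def open_all(paths, opener, max_workers=8):
    """Apply `opener` to every path concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of `paths`,
        # regardless of which task finishes first.
        return list(pool.map(opener, paths))


def fake_opener(path):
    """Hypothetical stand-in for a per-file open call."""
    return {"source": path}


opened = open_all(["a.nc", "b.nc", "c.nc"], fake_opener, max_workers=2)
```

Threads only help here if the per-file open is dominated by I/O; the alignment and compatibility checks that follow are a separate cost.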
Awesome, thanks for the clarification. |
I am currently motivated to fix this.
Is this all that we can do on the xarray side? |
@dcherian I'm sorry, I'm very interested in this but after reading the issues I'm still not clear on what's being proposed: What exactly is the bottleneck? Is it reading the coords from all the files? Is it loading the coord values into memory? Is it performing the alignment checks on those coords once they're in memory? Is it performing alignment checks on the dimensions? Is this suggestion relevant to datasets that don't have any coords? Which of these steps would a
But this is already an option to |
The original issue of this thread is that you sometimes might want to disable alignment checks for coordinates other than the When you So |
So I think it is quite important to consider this issue together with #2697. An XML specification called NcML already exists which tells software how to put together multiple netCDF files into a single virtual netCDF. We should leverage this existing spec as much as possible. A realistic use case for me is that I have, say, 1000 files of high-res model output, each with large coordinate variables, all generated from the same model run. If we want to for for which we know a priori that certain coordinates (dimension coordinates or otherwise) are identical, we could save a lot of disk reads (the slow part of For a catalog of tricks I use to optimize opening these sorts of big, complex, multi-file datasets (e.g. CMIP), check out |
One common use-case is files with large numbers of

    # keep only coordinates from first ensemble member to simplify merge
    first = member_dsets_aligned[0]
    rest = [mds.reset_coords(drop=True) for mds in member_dsets_aligned[1:]]
    objs_to_concat = [first] + rest

    def merge_vars_two_datasets(ds1, ds2):
        """
        Merge two datasets, dropping all variables from
        second dataset that already exist in the first dataset's coordinates.
        """

See also #2039 (second code block)

One way to do this might be to add a As bonus it would assign attributes from the

EDIT: #2039 (third code block) is also a possibility. This might look like

    xr.open_mfdataset('files*.nc', master_file='first', concat_dim='time')

in which case the first file is read; all coords that are not

EDIT2: |
Is this issue really closed?!? 🎉🎂🏆🥇 |
YES! The PR lets you skip compatibility checks. What's left is extremely large indexes and lazy index / coordinate loading but we have #2039 open for that. I will rename that issue. If you have time, can you test it out? |
This is big if true! But surely to close an issue raised by complaints about speed, we should really have some new asv speed tests? |
=) @TomNicholas PRs welcome! |
PS @rabernat
This completes in 40 seconds with 10 workers on cheyenne. |
Wooooow. Thanks. I'll have to give this a whirl soon. |
Let's close this since there is an opt-in mostly-fast path. I've added an item to #4648 to cover adding an asv benchmark for mfdataset. |
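For readers looking for what the opt-in path might look like in practice, here is a hedged sketch. It assumes the path refers to `open_mfdataset`'s `compat="override"` option combined with `coords="minimal"` and `data_vars="minimal"`; those are real keyword arguments, but this thread as captured here does not spell out the exact call. The helper names `fast_path_kwargs` and `open_fast` are made up for illustration, and `parallel=True` additionally assumes dask is installed.

```python
def fast_path_kwargs(concat_dim="time", parallel=True):
    """Keyword arguments that skip most per-file comparisons (assumed set)."""
    return dict(
        combine="nested",      # trust the given file order, no coordinate sorting
        concat_dim=concat_dim,
        data_vars="minimal",   # only concatenate variables that have concat_dim
        coords="minimal",      # same rule for coordinates
        compat="override",     # take non-concatenated variables from the first file
        parallel=parallel,     # open files through dask.delayed
    )


def open_fast(paths, **overrides):
    """Hypothetical wrapper around xarray.open_mfdataset."""
    import xarray as xr  # imported lazily so fast_path_kwargs has no dependencies

    return xr.open_mfdataset(paths, **{**fast_path_kwargs(), **overrides})
```

A call such as `open_fast(sorted_paths, concat_dim="time")` would then pick up shared coordinates from the first file without comparing them across files, so it is only safe when the files really do agree.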
@dcherian, thanks for your solution. In my experience with 34013 NetCDF files, I could open 117 GiB in 13 min 14 s. Can I decrease this time? |
That's 34k 3MB files! I suggest combining to 1k 100MB files, that would work a lot better. |
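As a rough sketch of how that regrouping could be planned, the snippet below splits a file list into consecutive batches, each of which would then be combined into one larger file. The function name `plan_batches` and the file-name pattern are hypothetical; the actual combining step (opening a batch and writing it back out) is left out.

```python
import math


def plan_batches(paths, target_count):
    """Split `paths` into at most `target_count` consecutive batches."""
    size = math.ceil(len(paths) / target_count)  # files per combined output
    return [paths[i:i + size] for i in range(0, len(paths), size)]


# Hypothetical file names standing in for the 34013 small files.
paths = [f"out_{i:05d}.nc" for i in range(34013)]
batches = plan_batches(paths, target_count=1000)
```

With these numbers each batch holds 35 files (28 in the last one), giving 972 combined files rather than exactly 1000, which is close enough for the purpose of cutting per-file overhead.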
It would be great to have a "fast path" option for open_mfdataset, in which all alignment / coordinate checking is bypassed. This would be used in cases where the user knows that many netCDF files all share the same coordinates (e.g. model output, satellite records from the same product, etc.). The coordinates would just be taken from the first file, and only the data variables would be read from all subsequent files. The only checking would be that the data variables have the correct shape.
Implementing this would require some refactoring. @jbusecke mentioned that he had developed a solution for this (related to #1704), so maybe he could be the one to add this feature to xarray.
This is also related to #1385.
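The proposed behaviour can be sketched in plain Python, independent of xarray: coordinates come from the first file only, later files contribute data variables, and the only check is on shape. `FakeDataset`, `row_width` and `fast_concat` are hypothetical names used purely for illustration; this is not how xarray implements anything.

```python
from dataclasses import dataclass


@dataclass
class FakeDataset:
    """Stand-in for one file: coordinates plus 2-D data variables."""
    coords: dict      # e.g. {"x": [0, 1, 2]}
    data_vars: dict   # name -> list of rows, one row per time step


def row_width(rows):
    """Return the common row length, or raise if rows are empty or ragged."""
    widths = {len(row) for row in rows}
    if len(widths) != 1:
        raise ValueError("variable is empty or has rows of differing length")
    return widths.pop()


def fast_concat(datasets):
    """Concatenate along time, trusting the first file's coordinates."""
    first = datasets[0]
    expected = {name: row_width(rows) for name, rows in first.data_vars.items()}
    combined = {name: list(rows) for name, rows in first.data_vars.items()}
    for ds in datasets[1:]:
        for name, width in expected.items():
            rows = ds.data_vars[name]          # coords of `ds` are never read
            if row_width(rows) != width:       # the only check: shape
                raise ValueError(f"shape mismatch for {name!r}")
            combined[name].extend(rows)
    return FakeDataset(coords=first.coords, data_vars=combined)


a = FakeDataset({"x": [0, 1, 2]}, {"temp": [[1, 2, 3]]})
b = FakeDataset({"x": [9, 9, 9]}, {"temp": [[4, 5, 6]]})  # coords deliberately differ
merged = fast_concat([a, b])
```

Note that `b`'s mismatched `x` goes unnoticed, which is exactly the trade-off of such a fast path: it is only correct when the user's assumption about shared coordinates holds.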