WIP: tutorial on merging datasets #3131

Open
wants to merge 1 commit into
base: main

Conversation

rabernat
Contributor

This is a start on a tutorial about merging / combining datasets.

@codecov

codecov bot commented Jul 15, 2019

Codecov Report

Merging #3131 into scipy19-docs will decrease coverage by 0.16%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           scipy19-docs    #3131      +/-   ##
================================================
- Coverage         96.18%   96.02%   -0.17%     
================================================
  Files                66       63       -3     
  Lines             13858    12799    -1059     
================================================
- Hits              13330    12290    -1040     
+ Misses              528      509      -19     

@TomNicholas TomNicholas self-assigned this Jul 16, 2019
Member

@TomNicholas TomNicholas left a comment

This is awesome, and sorely-needed, thanks for doing this!

I have a couple of comments, but I'm not sure it's a good idea to comment directly on the source of an IPython notebook, so I'll just write them here.

I assume the plan for the next bit is to go on to use combine_nested to accomplish the same thing as combine_by_coords? Then to start worrying about dirty data?
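
A minimal sketch of what such a comparison might look like, assuming two small in-memory datasets that tile the `x` dimension. The `make_piece` helper and the `temperature` variable are hypothetical and not taken from the notebook in this PR.

```python
import numpy as np
import xarray as xr


def make_piece(start):
    # Hypothetical helper: a 3-element chunk of a 1D dataset starting at `start`.
    x = np.arange(start, start + 3)
    return xr.Dataset({"temperature": ("x", x * 10.0)}, coords={"x": x})


first, second = make_piece(0), make_piece(3)

# combine_by_coords orders the pieces using their coordinate values,
# so the order of the input list does not matter here.
by_coords = xr.combine_by_coords([second, first])

# combine_nested trusts the order of the list, so the pieces
# must be passed in the intended order along concat_dim.
nested = xr.combine_nested([first, second], concat_dim="x")

# Both routes should give the same combined dataset.
same = by_coords.equals(nested)
```

The design difference is that `combine_by_coords` needs usable dimension coordinates, while `combine_nested` only needs the caller to know the correct ordering.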

The structure is really nice, but you could also explicitly separate your data creation section from the data loading sections with a subtitle, because then it's almost like "meat of tutorial starts here".

I also really like the graphs to show what happens to your data if you concatenate in the wrong order.
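
For readers without the notebook at hand, the wrong-order problem can be sketched roughly as follows. The pieces and the `make_piece` helper are hypothetical stand-ins for the tutorial data.

```python
import numpy as np
import xarray as xr


def make_piece(start):
    # Hypothetical helper: a 3-element chunk of a 1D dataset starting at `start`.
    x = np.arange(start, start + 3)
    return xr.Dataset({"temperature": ("x", x * 10.0)}, coords={"x": x})


first, second = make_piece(0), make_piece(3)

# Passing the pieces in the wrong order: combine_nested does not inspect
# the coordinate values, so it concatenates them exactly as given.
wrong = xr.combine_nested([second, first], concat_dim="x")

# The resulting x coordinate is no longer sorted, which is what
# produces the jump visible when the data are plotted against x.
is_sorted = wrong.indexes["x"].is_monotonic_increasing
```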

I don't know if you've read the recent discussion on the mailing list but there was a nice example of a real-world problem yesterday, where the user had a set of datasets which each had a different length along one dimension, and wanted to pad them with NaNs. Something similar might be good to include here? Maybe instead of padding we do trimming using the preprocess argument?
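
A rough in-memory sketch of both options, assuming two datasets of different length along a `time` dimension. The `make_run` and `trim` helpers, the `signal` variable, and the cut-off of 3 time steps are all hypothetical. Files are left out to keep the example self-contained, but a function like `trim` has the shape that `open_mfdataset` expects for its `preprocess` argument (it takes one dataset and returns one dataset).

```python
import numpy as np
import xarray as xr


def make_run(n_time):
    # Hypothetical helper: one dataset with `n_time` steps along "time".
    return xr.Dataset(
        {"signal": ("time", np.arange(n_time, dtype=float))},
        coords={"time": np.arange(n_time)},
    )


runs = [make_run(5), make_run(3)]

# Option 1, padding: an outer join on "time" keeps every time value and
# fills the entries missing from the shorter dataset with NaN.
padded = xr.concat(runs, dim="run", join="outer")


def trim(ds, n_keep=3):
    # Option 2, trimming: keep only the first `n_keep` time steps.
    return ds.isel(time=slice(0, n_keep))


# Applied per dataset before combining, the way a preprocess function would be.
trimmed = xr.concat([trim(ds) for ds in runs], dim="run")
```

Padding keeps all the data at the cost of missing values, while trimming gives a dense array but discards the tail of the longer datasets.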

Another thing - here you say the future default behaviour of open_mfdataset will be to use combine='by_coords', but I was under the impression it was going to be combine='nested'. I don't think this ambiguity is a problem, because in the error messages we haven't stated what the default will be, and we've just told people to be explicit in order to be future-compatible, which is fine. But we should be consistent about what the future default will be (@shoyer?).

@shoyer
Member

shoyer commented Jul 17, 2019

I think "by_coords" is probably the most user friendly default for open_mfdataset? But I'm not entirely sure...

@TomNicholas
Member

TomNicholas commented Jul 17, 2019 via email

@shoyer
Member

shoyer commented Jul 17, 2019

It is possible that we do actually need a third combine mode that works like the old auto_combine.

@rabernat
Contributor Author

I am still hoping to finish this one day. Any reason it needs to be closed?

@keewis
Collaborator

keewis commented May 18, 2020

err, sorry, no. That happened because I deleted the branch you tried to merge into. Let me try to fix that.

@keewis keewis reopened this May 18, 2020
@keewis keewis changed the base branch from scipy19-docs to master May 18, 2020 14:24
@keewis keewis changed the base branch from master to scipy19-docs May 18, 2020 14:24
@keewis
Collaborator

keewis commented May 18, 2020

We should rebase this and #3111 onto master so we don't depend on the old scipy19-docs branch. If we want to keep a separate development branch for documentation, I think we should use one that is kept in sync with current master.

@keewis keewis force-pushed the tutorial-on-merging branch from 8e84a4a to 211a2b3 Compare June 2, 2020 13:24
@keewis keewis changed the base branch from scipy19-docs to master June 2, 2020 13:24
@keewis
Collaborator

keewis commented Jun 2, 2020

@rabernat, I did the rebase for this and #3111, so when you eventually pick this up again, a simple merge should bring it up to date with master.

@dcherian
Contributor

dcherian commented Jun 2, 2020

Thanks @keewis !

@andersy005
Member

@rabernat, the gentlest of bumps on this :) How much work (content) is left to bring this to completion? I'm asking because I'd be happy to help if more work or a follow-up PR is needed.

Development

Successfully merging this pull request may close these issues.

Adding Example/Tutorial of importing data to Xarray (Merge/conact/etc)
6 participants