WIP: tutorial on merging datasets #3131
base: main
Codecov Report
@@ Coverage Diff @@
## scipy19-docs #3131 +/- ##
================================================
- Coverage 96.18% 96.02% -0.17%
================================================
Files 66 63 -3
Lines 13858 12799 -1059
================================================
- Hits 13330 12290 -1040
+ Misses 528 509 -19
This is awesome, and sorely needed; thanks for doing this!
I have a couple of comments, but I'm not sure it's a good idea to comment directly on the source code of an IPython notebook, so I'll just write them here.
I assume the plan for the next bit is to go on to use `combine_nested` to accomplish the same thing as `combine_by_coords`? Then to start worrying about dirty data?
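For concreteness, a minimal sketch of that equivalence with made-up toy data (the variable names and values are illustrative only):

```python
import numpy as np
import xarray as xr

# Two pieces of a dataset split along "x", each carrying coordinate values.
ds0 = xr.Dataset({"temp": ("x", np.random.rand(3))}, coords={"x": [0, 1, 2]})
ds1 = xr.Dataset({"temp": ("x", np.random.rand(3))}, coords={"x": [3, 4, 5]})

# combine_by_coords orders the pieces by inspecting the "x" coordinate,
by_coords = xr.combine_by_coords([ds0, ds1])

# while combine_nested trusts the order of the input list instead.
nested = xr.combine_nested([ds0, ds1], concat_dim="x")

assert by_coords.identical(nested)
```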
The structure is really nice, but you could also explicitly separate your data creation section from the data loading sections with a subtitle, because then it's almost like "meat of tutorial starts here".
I also really like the graphs showing what happens to your data if you concatenate in the wrong order.
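A toy illustration of that failure mode, with invented data, in case it helps readers of this thread:

```python
import numpy as np
import xarray as xr

ds0 = xr.Dataset({"temp": ("x", np.zeros(3))}, coords={"x": [0, 1, 2]})
ds1 = xr.Dataset({"temp": ("x", np.ones(3))}, coords={"x": [3, 4, 5]})

# Concatenating in the wrong order silently produces a non-monotonic coordinate,
wrong = xr.concat([ds1, ds0], dim="x")
print(wrong.x.values)  # [3 4 5 0 1 2]

# whereas combine_by_coords sorts the pieces by their coordinate values.
right = xr.combine_by_coords([ds1, ds0])
print(right.x.values)  # [0 1 2 3 4 5]
```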
I don't know if you've read the recent discussion on the mailing list, but there was a nice example of a real-world problem yesterday, where the user had a set of datasets which each had a different length along one dimension, and wanted to pad them with NaNs. Something similar might be good to include here? Maybe instead of padding we do trimming using the `preprocess` argument?
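Roughly what I have in mind, as a sketch only; the file glob, the dimension names, and the common length here are all hypothetical:

```python
import xarray as xr

# Hypothetical: each file is a separate model run whose "time" dimension has a
# different length; trim every dataset to a common length before combining.
def trim_to_shortest(ds):
    return ds.isel(time=slice(0, 100))  # 100 is a made-up common length

combined = xr.open_mfdataset(
    "runs/*.nc",                  # made-up path
    preprocess=trim_to_shortest,  # applied to each dataset as it is opened
    combine="nested",
    concat_dim="run",             # concatenate along a new "run" dimension
)
```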
Another thing: here you say the future default behaviour of `open_mfdataset` will be to use `combine='by_coords'`, but I was under the impression it was going to be `combine='nested'`. I don't think this ambiguity is a problem, because in the error messages we haven't stated what the default will be; we've just told people to be explicit in order to be future-compatible, which is fine. But we should be consistent about what the future default will be (@shoyer?).
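In other words, the advice in the error messages amounts to something like this (the globs and dimension name are made up):

```python
import xarray as xr

# Being explicit is future-proof either way:
ds = xr.open_mfdataset("data/*.nc", combine="by_coords")

# or, for files without coordinates whose on-disk order is meaningful:
ds = xr.open_mfdataset("data/*.nc", combine="nested", concat_dim="time")
```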
I think "by_coords" is probably the most user friendly default for open_mfdataset? But I'm not entirely sure... |
Nor am I. I originally thought it should be `'nested'` because the concat-in-order behaviour is more similar to the original `auto_combine`, but I don't know.
It seems to me like it depends on the quality of the user's data: if they have high-quality datasets which already have sensible coordinates for each dimension, then `'by_coords'` is best, but if their data is more primitive, without coordinates (like mine happens to be), then `'nested'` is the most natural default. So do we have a sense of what the largest number of users would find most natural? Or is this not a good way to think about it?
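To make the "primitive data" case concrete with an invented example: without coordinate values there is nothing for `'by_coords'` to sort by, so the list order has to carry that information:

```python
import numpy as np
import xarray as xr

# No coordinate values along "x", only dimension sizes.
ds0 = xr.Dataset({"temp": ("x", np.zeros(3))})
ds1 = xr.Dataset({"temp": ("x", np.ones(3))})

# The position in the list is the only ordering information available.
combined = xr.combine_nested([ds0, ds1], concat_dim="x")
print(combined.temp.values)  # [0. 0. 0. 1. 1. 1.]
```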
It is possible that we do actually need a third combine mode that works like the old `auto_combine`.
I am still hoping to finish this one day. Any reason it needs to be closed?
Err, sorry, no. That happened because I deleted the branch you tried to merge into. Let me try to fix that.
We should rebase this and #3111 onto `main`.
Thanks @keewis!
@rabernat, the gentlest of bumps on this :)... How much work (content) is left to bring this to completion? I'm asking because I'd be happy to help if there's still more work and/or a follow-up PR needed.
- [ ] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API

This is a start on a tutorial about merging / combining datasets.