-
-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add dataset->stacked dataarray/dataframe converters #25
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Naming is tough, but I think each method a user will understand what is happening when they call it as they are currently labelled.
src/arviz_base/converters.py
Outdated
__all__ = [ | ||
"convert_to_datatree", | ||
"convert_to_dataset", | ||
"extract", | ||
"to_labelled_stacked_da", | ||
"to_labelled_stacked_df", | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My opinion for naming is typically to use modules (or classes) and methods therein to write a sentence (or command) that someone can read as a sentence. For instance, the difference between
azb.converter.convert_to_datatree(...)
and
azb.convert.to_datatree(...)
is (written as sentences in my head as)
- ArviZ base converter convert to datatree (...).
- ArviZ base convert to datatree (...).
Semantically they both mean the same thing, but the second sentence makes more sense grammatically. Things like da
, ds
, df
and now dt
are shortened versions of objects people are familiar with, so the to_labelled_stacked_df
does make sense to me, and I like it.
If we are discussing an architecture for converting a dataset, I (as a user/reader of code) would think it is easier to read (for me at least) azb.convert.to_df(ds=ds, sample_dims=None, stacked=True, labeller=...)
than azb.converters.to_labelled_stacked_df(ds=ds, sample_dims=None, stacked=False, labeller=...)
since the former is closer to what I would read as this sentence;
ArviZ base convert to dataframe (using this dataset, with these sample dimensions, and ensure it is stacked, using this labeller).
This is purely an opinion though, and I think the way you have written it makes perfect sense.
I only thing I would suggest is to make things consistent, so maybe choosing how you want the methods to read. For example using all abbreviations or not (eg dataset
vs ds
or df
vs dataframe
).
original | abbreviated | long version |
---|---|---|
convert_to_datatree | to_dt | to_datatree |
convert_to_dataset | to_ds | to_dataset |
to_labelled_stacked_da | to_da | to_dataarray |
to_labelled_stacked_df | to_df | to_dataframe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My opinion for naming is typically to use modules (or classes) and methods therein to write a sentence (or command) that someone can read as a sentence.
I like this idea! In general arviz module names are ignored, at least for now. Maybe we could do a bit more of that for new things. You'll see they are available and documented as arviz_base.xyz
so far we have not expected any user to do arviz.stats.ess
or arviz.plot.plot_xyz
; modules were only internal file organization.
We should probably think about the functions and which do we really care about. As of now, the difference isn't only the output type but the accepted input types. convert_to_xyz
are (or attempt to) be catch all functions. Whatever you give to them, if arviz knows how to handle it you'll get an xyz
at the end. extract
only takes inferencedata/datatree and returns a dataset or dataarray. These new ones I added are even more restrictive, only valid input is dataset and they also have a single output.
I also thought about using to_xyz
directly for these new functions but I feared that would be quite confusing because xarray datasets already have methods .to_dataframe
, to_dataarray
and to_stacked_dataarray
. These are not that nor try to be, they are somewhat similar but extremely overloaded with ArviZ conventions and labeling; they won't work on datasets where there are no sample dimensions (aka a dimension that is present in all variables of the dataset), it ignores the fact that different variables in a dataset can have different dtypes... Maybe the fact that they are arviz.to_xyz
is already enough to convey that and it won't be confusing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the fact that they are arviz.to_xyz is already enough to convey that and it won't be confusing?
I would agree with this.
src/arviz_base/converters.py
Outdated
xr.set_options(display_expand_data=False) | ||
|
||
idata = load_arviz_data("centered_eight") | ||
to_labelled_stacked_da(idata.posterior.ds) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
love the example
src/arviz_base/converters.py
Outdated
for idx_name in da.xindexes | ||
if sample_dim in da[idx_name].dims | ||
} | ||
columns = pd.MultiIndex.from_arrays(list(idx_dict.values()), names=list(idx_dict.keys())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!
710f1a4
to
345933f
Compare
Co-authored-by: Andy Maloney <amaloney@mailbox.org>
Co-authored-by: Sandra Yojana Meneses <sandrayojana@gmail.com>
closes #15. I am very much not happy with these names, so please suggest alternatives.
Usable and documented but tests are still missing.
📚 Documentation preview 📚: https://arviz-base--25.org.readthedocs.build/en/25/