Add the ability to view/load single datasets in a data collection #554

nwlandry · 2024-06-24T15:08:44Z

From #540:
"xgi.load_xgi_data("hyperbard") loads a dict of datasets, so xgi.load_xgi_data("hyperbard")["coriolanus"] should load a single dataset if I understand correctly... I'm wondering if we don't want to be able to access them directly with something like xgi.load_xgi_data("hyperbard-coriolanus"), to be able to iterate over all datasets in XGI-data. They could also appear in the list we get with xgi.load_xgi_data() this way?"


maximelucas · 2024-08-05T16:59:24Z

Some more thoughts on this: I'm currently iterating over multiple datasets to run some analysis, and I ran into precisely the problem described above: I need one for loop over the "normal" datasets and another over all the datasets from the hyperbard collection. If we add another collection, I'd need one more loop. It's not very practical, and my code feels unnecessarily complicated.

Also, when we look at the datasets table, no stats appear for hyperbard (because it's a collection of course).

I'm thinking: what if each dataset had its own separate record in zenodo?
This would solve the two problems above.
Then, we could add metadata to our datasets, one item of which could be {"collection": "hyperbard"}. In addition to our load_xgi_data(dataset_name) function, we could have another one, load_xgi_collection(collection_name), that would return a dict containing all the datasets from that collection.
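A minimal sketch of how such a metadata-based lookup could work. The index below is a hypothetical in-memory stand-in for index.json, the URLs and the second dataset name are placeholders, and the loader returns the index entry instead of downloading anything, so none of this is XGI's actual implementation.

```python
# Hypothetical flat index: one entry per dataset, with optional "collection" metadata.
# URLs and "hyperbard-example-play" are placeholders, not real records.
INDEX = {
    "dataset1": {"url": "https://example.org/dataset1.json"},
    "hyperbard-coriolanus": {
        "url": "https://example.org/coriolanus.json",
        "collection": "hyperbard",
    },
    "hyperbard-example-play": {
        "url": "https://example.org/example-play.json",
        "collection": "hyperbard",
    },
}


def load_dataset(name, index=INDEX):
    """Stand-in for a real loader: returns the index entry rather than a hypergraph."""
    return index[name]


def load_xgi_collection(collection_name, index=INDEX):
    """Return a dict of all datasets whose metadata names the given collection."""
    return {
        name: load_dataset(name, index)
        for name, entry in index.items()
        if entry.get("collection") == collection_name
    }


# With a flat index, a single loop covers standalone and collection datasets alike.
all_names = list(INDEX)
hyperbard = load_xgi_collection("hyperbard")
```

Because every dataset has its own entry, an analysis script can iterate over `all_names` once instead of writing one loop per collection.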

Then, because we're starting to have many datasets, it would be useful to have a way of sorting/filtering them by stats/collection/category/... in the above table, so that people can quickly find what they're looking for.

What do you think?

nwlandry · 2024-08-06T13:15:17Z

This could be a great solution to this problem! I think that it could be quite tedious to add a lot of datasets to their own pages in Zenodo. So could we modify this to allow more than one dataset for each record? Then each dataset in the collection would have the same DOI, but we would still treat them as individual datasets like you're suggesting. In addition, how do you picture that we can efficiently access all the datasets in each collection? One thing that comes to mind is by specifying collections in index.json.

Regarding the table, I 100% agree. Something that I originally was thinking was to group the datasets in a collection in an expandable section of the table, but I like the idea of more flexibly searching the datasets.

maximelucas · 2024-08-06T16:00:52Z

Okay, thinking out loud about how we could make this work in practice.

Right now, the hyperbard collection has a single Zenodo record, which contains one .json file per dataset in the collection.
The above table is populated by reading entries in index.json.
Each entry in index.json has a url that links to the .json file in a given Zenodo record (not to the Zenodo record itself), so the table would be fine. Right now the hyperbard entry reads
"hyperbard": {"url": "https://zenodo.org/records/11211879/files/hyperbard_collection_information.json"}, which points to the collection information file rather than to a dataset file, and that is why the table doesn't display anything for it.
Also, load_xgi_data() reads from index.json so we would only need to add an entry for each of the datasets to the index.json for it to work, right?

This sounds like a minimal change that we can already make and see if we like it.

(The only small downside I see: if we need to correct a single dataset, we need a new version of the whole collection.)

So in practice we could now:

update index.json to add an entry for every single dataset in the hyperbard collection
rerun get_stats to populate those entries with stats
add a collection attribute to those entries, set to "hyperbard"
create a function load_data_collection(collection_name) that would return a dict of datasets. How could it filter the right datasets? So far, I guess, just by iterating over all dataset names with an if collection_name in dataset_name condition.
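The name-based filter from the last step can be sketched as follows. This is only an illustration under assumptions: the list of names and the `loader` callable are hypothetical stand-ins passed in as arguments, and "hyperbard-example-play" is a placeholder name.

```python
def load_data_collection(collection_name, dataset_names, loader):
    """Return {name: dataset} for every dataset whose name contains collection_name.

    `dataset_names` and `loader` are hypothetical arguments standing in for the
    names listed in index.json and for a single-dataset loading function.
    """
    return {
        name: loader(name)
        for name in dataset_names
        if collection_name in name  # plain substring match, as proposed above
    }


# Placeholder names and a dummy loader that just tags the name.
names = ["dataset1", "hyperbard-coriolanus", "hyperbard-example-play"]
collection = load_data_collection("hyperbard", names, loader=lambda n: f"<{n}>")
```

One caveat of this design: a substring match also picks up any unrelated dataset whose name happens to contain the collection name, so checking the explicit collection attribute would be the stricter test once it exists.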

Did I miss anything?

maximelucas · 2024-08-06T16:02:50Z

A second step would be to make the table more flexible for searching by adding more attributes to index.json, and maybe to add filtering capabilities to the load_xgi_data() function too, so we can ask "give me all datasets from biology" or "all datasets with fewer than 500 nodes".
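A rough sketch of what such attribute-based filtering could look like. The attribute keys ("category", "num-nodes"), the dataset names, and the numbers are all made up for illustration and are not taken from index.json.

```python
def filter_datasets(index, category=None, max_nodes=None):
    """Return the names of datasets matching every filter that is given.

    `category` keeps entries whose "category" equals the value;
    `max_nodes` keeps entries with strictly fewer nodes than the value.
    Entries missing an attribute fail the corresponding filter.
    """
    selected = []
    for name, entry in index.items():
        if category is not None and entry.get("category") != category:
            continue
        if max_nodes is not None and entry.get("num-nodes", float("inf")) >= max_nodes:
            continue
        selected.append(name)
    return selected


# Illustrative index with invented attributes and values.
example_index = {
    "a": {"category": "biology", "num-nodes": 120},
    "b": {"category": "biology", "num-nodes": 900},
    "c": {"category": "social", "num-nodes": 300},
}

biology = filter_datasets(example_index, category="biology")
small = filter_datasets(example_index, max_nodes=500)
small_biology = filter_datasets(example_index, category="biology", max_nodes=500)
```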

nwlandry · 2024-08-30T19:56:58Z

Okay, I've had time to think more about this. What about this: We make collections top-level items and add the datasets contained in the collection under this item with relative paths to the collection url. Then, we can define single datasets with the tuple ("collection", "dataset") and a collection of datasets with the collection name by itself. And when we run

xgi.load_xgi_data()

it will return

Datasets:
dataset1
dataset2
...

Collections:
collection1
("collection1", "dataset1")
("collection1", "dataset2")

collection2
("collection2", "dataset1")
("collection2", "dataset2")
...

I don't think that this is a perfect solution, but it would be nice to preserve the connection between the collection and its constituent datasets.
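To make the proposal concrete, here is a small sketch of how names and ("collection", "dataset") tuples could be resolved to URLs. The nested index layout, its key names ("datasets", "collections", "path"), and the example.org URLs are assumptions for illustration, not the actual structure of index.json.

```python
from urllib.parse import urljoin

# Hypothetical index: collections are top-level items, and their datasets
# carry paths relative to the collection url.
INDEX = {
    "datasets": {
        "dataset1": {"url": "https://example.org/dataset1.json"},
    },
    "collections": {
        "collection1": {
            "url": "https://example.org/collection1/",
            "datasets": {
                "dataset1": {"path": "dataset1.json"},
                "dataset2": {"path": "dataset2.json"},
            },
        },
    },
}


def resolve_url(key, index=INDEX):
    """Resolve a dataset name or a ("collection", "dataset") tuple to a URL."""
    if isinstance(key, tuple):
        collection, dataset = key
        entry = index["collections"][collection]
        # Join the relative dataset path onto the collection url.
        return urljoin(entry["url"], entry["datasets"][dataset]["path"])
    return index["datasets"][key]["url"]


def resolve_collection(name, index=INDEX):
    """Resolve a collection name to {("collection", "dataset"): url} for all members."""
    members = index["collections"][name]["datasets"]
    return {(name, d): resolve_url((name, d), index) for d in members}


single = resolve_url(("collection1", "dataset2"))
whole = resolve_collection("collection1")
```

Keying on tuples lets a standalone "dataset1" and a collection member with the same short name coexist without ambiguity.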

nwlandry added the "improve" label (Make an existing feature better) on Jun 24, 2024

nwlandry changed the title from "Add the ability to view/load singe datasets in a data collection" to "Add the ability to view/load single datasets in a data collection" on Aug 6, 2024
