Add the ability to view/load single datasets in a data collection #554

Open
nwlandry opened this issue Jun 24, 2024 · 5 comments
Labels
improve Make an existing feature better

Comments

@nwlandry
Collaborator

From #540:
"xgi.load_xgi_data("hyperbard") loads a dict of datasets, so xgi.load_xgi_data("hyperbard")["coriolanus"] should load a single dataset if I understand correctly... I'm wondering if we don't want to be able to access them directly with something like xgi.load_xgi_data("hyperbard-coriolanus"), to be able to iterate over all datasets in XGI-data. They could also appear in the list we get with xgi.load_xgi_data() this way?"

@nwlandry nwlandry added the improve Make an existing feature better label Jun 24, 2024
@maximelucas
Collaborator

maximelucas commented Aug 5, 2024

Some more thoughts on this: I'm currently iterating over multiple datasets to run some analysis, and I encountered precisely the problem described above: I need one for loop over the "normal" datasets and another over all the datasets from the hyperbard collection. If we add another collection, I'd need one more loop. It's not very practical and my code feels unnecessarily complicated.
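
In the meantime, the two-loop pattern can be collapsed with a small helper. This is only a sketch: `iter_datasets` is a hypothetical name, the strings stand in for loaded hypergraphs, and it assumes a collection comes back as a plain dict of datasets, as described in the quote from #540.

```python
def iter_datasets(loaded):
    """Yield (name, dataset) pairs, flattening collections that are dicts of datasets."""
    for name, value in loaded.items():
        if isinstance(value, dict):
            # A collection: yield each member under a "collection-member" name.
            for member, dataset in value.items():
                yield f"{name}-{member}", dataset
        else:
            # A single dataset: yield it unchanged.
            yield name, value


# Stand-in values; in practice these would be the objects returned by the loader.
loaded = {
    "dataset1": "H1",
    "hyperbard": {"coriolanus": "H2"},
}
flat = dict(iter_datasets(loaded))
```

With this, a single loop over `flat.items()` covers both standalone datasets and collection members, although every new collection still has to be loaded explicitly.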

Also, when we look at the datasets table, no stats appear for hyperbard (because it's a collection of course).

I'm thinking: what if each dataset had its own separate record in zenodo?
This would solve the two problems above.
Then, we could add metadata to our datasets, one item of which could be {"collection": "hyperbard"}. In addition to our load_xgi_data(dataset_name) function, we could have another one, load_xgi_collection(collection_name), that would return a dict containing all the datasets from that collection.
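
A minimal sketch of what such a function might look like, assuming each index entry carries the proposed "collection" key. The index contents and URLs below are placeholders, and `load_one` is a hypothetical stand-in for the single-dataset loader, not an existing XGI function.

```python
# Hypothetical index entries; URLs are placeholders and "collection" is the proposed metadata key.
index = {
    "dataset1": {"url": "https://example.org/dataset1.json"},
    "hyperbard-coriolanus": {
        "url": "https://example.org/hyperbard/coriolanus.json",
        "collection": "hyperbard",
    },
}


def load_xgi_collection(collection_name, index, load_one):
    """Return {dataset name: loaded dataset} for entries tagged with collection_name."""
    names = [
        name
        for name, meta in index.items()
        if meta.get("collection") == collection_name
    ]
    if not names:
        raise KeyError(f"No datasets found for collection {collection_name!r}")
    return {name: load_one(name) for name in names}


# load_one stands in for the single-dataset loader; here it just echoes the URL.
hyperbard = load_xgi_collection("hyperbard", index, lambda name: index[name]["url"])
```

Because every dataset keeps its own top-level entry, iterating over `index` visits collection members and standalone datasets alike.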

Then, because we're starting to have many datasets, it would be useful to have a way of sorting/filtering them by stats/collection/category/... in the above table, so that one can quickly find what they're looking for.

What do you think?

@nwlandry nwlandry changed the title Add the ability to view/load singe datasets in a data collection Add the ability to view/load single datasets in a data collection Aug 6, 2024
@nwlandry
Collaborator Author

nwlandry commented Aug 6, 2024

This could be a great solution to this problem! I think that it could be quite tedious to add a lot of datasets to their own pages in Zenodo. So could we modify this to allow more than one dataset for each record? Then each dataset in the collection would have the same DOI, but we would still treat them as individual datasets like you're suggesting. In addition, how do you picture that we can efficiently access all the datasets in each collection? One thing that comes to mind is by specifying collections in index.json.

Regarding the table, I 100% agree. Something that I originally was thinking was to group the datasets in a collection in an expandable section of the table, but I like the idea of more flexibly searching the datasets.

@maximelucas
Collaborator

Okay, thinking out loud about how we could make this work in practice.

Right now, the hyperbard collection has a single Zenodo record, that contains one .json file per dataset in the collection.
The above table is populated by reading entries in index.json.
Each entry in index.json has a url that links to the .json file in a given Zenodo record (but does not link to the Zenodo record itself), so the table would be fine. Actually, right now it reads "hyperbard": {"url": "https://zenodo.org/records/11211879/files/hyperbard_collection_information.json"}, which is why it doesn't display anything.
Also, load_xgi_data() reads from index.json so we would only need to add an entry for each of the datasets to the index.json for it to work, right?

This sounds like a minimal change that we can already make and see if we like it.

(Only small downside I see for this: if we need to make a correction to a single dataset, we need a new version for the whole collection.)

So in practice we could now:

  • update the index.json to add all single datasets in hyperbard collection
  • rerun get_stats to populate it with stats
  • add to it a collection attribute that would be "hyperbard" for those
  • create a function load_data_collection(collection_name) that would return a dict of datasets. How could it filter the right datasets? So far just with an if collection_name in dataset_name condition, iterating over all dataset names I guess.
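
The name-based filter from the last bullet could be sketched roughly as follows. The dataset names other than "hyperbard-coriolanus" are placeholders, and `load_one` is a hypothetical stand-in for whatever loads a single dataset.

```python
def load_data_collection(collection_name, dataset_names, load_one):
    """Name-based filter: keep datasets whose name contains collection_name."""
    return {
        name: load_one(name)
        for name in dataset_names
        if collection_name in name
    }


# Placeholder names; only "hyperbard-coriolanus" comes from the discussion above.
names = ["dataset1", "hyperbard-coriolanus", "hyperbard-placeholder-play"]
# str.upper stands in for the real loader.
collection = load_data_collection("hyperbard", names, str.upper)
```

One caveat with substring matching is that any unrelated dataset whose name happens to contain the collection name would also be picked up; comparing against the collection attribute from the third bullet would avoid that.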

Did I miss anything?

@maximelucas
Collaborator

Second step would be to make the table more flexible for searching, adding more attributes in index.json, and maybe add filtering capabilities to the load_xgi_data() function too so we can ask "give me all datasets from biology", or "all datasets with less than 500 nodes".
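
As a rough illustration of that kind of filtering, assuming index.json gained extra attributes: the keys "category" and "num-nodes", the values, and the function `filter_datasets` are all made up for this sketch.

```python
# Hypothetical index with made-up attribute names and values.
index = {
    "dataset1": {"category": "biology", "num-nodes": 120},
    "dataset2": {"category": "biology", "num-nodes": 4000},
    "dataset3": {"category": "social", "num-nodes": 300},
}


def filter_datasets(index, category=None, fewer_nodes_than=None):
    """Return names of datasets matching every condition that is not None."""
    selected = []
    for name, meta in index.items():
        if category is not None and meta.get("category") != category:
            continue
        # Entries without a node count never satisfy a size condition.
        num_nodes = meta.get("num-nodes", float("inf"))
        if fewer_nodes_than is not None and num_nodes >= fewer_nodes_than:
            continue
        selected.append(name)
    return selected


biology = filter_datasets(index, category="biology")
small = filter_datasets(index, fewer_nodes_than=500)
```

The same conditions could drive both the table and the loader, so that the two stay consistent.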

@nwlandry
Collaborator Author

Okay, I've had time to think more about this. What about this: We make collections top-level items and add the datasets contained in the collection under this item with relative paths to the collection url. Then, we can define single datasets with the tuple ("collection", "dataset") and a collection of datasets with the collection name by itself. And when we run

xgi.load_xgi_data()

it will return

Datasets:
dataset1
dataset2
...

Collections:
collection1
("collection1", "dataset1")
("collection1", "dataset2")

collection2
("collection2", "dataset1")
("collection2", "dataset2")
...

I don't think that this is a perfect solution, but it would be nice to preserve the connection between the collection and its constituent datasets.
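
One way this lookup could work is sketched below, under the assumption that the index is split into "datasets" and "collections" sections and that each collection lists its members as paths relative to the collection url. The layout, the URLs, and the function `resolve_urls` are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical nested index; URLs and file names are placeholders.
index = {
    "datasets": {
        "dataset1": {"url": "https://example.org/dataset1.json"},
    },
    "collections": {
        "collection1": {
            "url": "https://example.org/collection1/files/",
            "datasets": {"dataset1": "dataset1.json", "dataset2": "dataset2.json"},
        },
    },
}


def resolve_urls(key, index):
    """Map a key to {key: url}.

    A tuple names one dataset in a collection; a string names either a
    standalone dataset or a whole collection.
    """
    if isinstance(key, tuple):
        collection, dataset = key
        entry = index["collections"][collection]
        return {key: urljoin(entry["url"], entry["datasets"][dataset])}
    if key in index["datasets"]:
        return {key: index["datasets"][key]["url"]}
    entry = index["collections"][key]
    return {
        (key, dataset): urljoin(entry["url"], path)
        for dataset, path in entry["datasets"].items()
    }
```

Here `resolve_urls(("collection1", "dataset2"), index)` resolves one member, while `resolve_urls("collection1", index)` resolves every member keyed by its tuple. The tuple form also disambiguates a collection member from a standalone dataset with the same name.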

2 participants