This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Integration with Huggingface datasets #4962

Open
aleSuglia opened this issue Feb 5, 2021 · 5 comments · May be fixed by #5095

@aleSuglia
Contributor

Huggingface Datasets has steadily gained popularity over the last few months, and it has a very simple API for accessing the most common NLP datasets. It also has the potential to support multi-modal datasets (see related issue). At the moment, AllenNLP integrates a dataset by having the user download it manually and put the path to it in the configuration file. This works most of the time but doesn't guarantee complete transparency in the training process.

Based on this issue, I was wondering whether AllenNLP could support this library, so that it can also take advantage of its caching functionality. I'm aware that AllenNLP has its own way of handling tokenization and indexing, but I still believe a common entry point for dataset creation would be very handy, as well as elegant from a reproducibility point of view.

Any thoughts about this idea?

Thanks,
Alessandro

@epwalsh
Member

epwalsh commented Feb 5, 2021

It would be great to add support for Datasets. I was thinking about this a while ago and then it kind of fell off the map. I'm not sure yet how we'd integrate it, but I'm thinking it would either be through a new DatasetReader or DataLoader class that wraps it.

@dirkgr
Member

dirkgr commented Feb 12, 2021

Same for TensorFlow Datasets. TFDS datasets have a schema, so we could automatically read them into TextField, LabelField, and so on.
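
For illustration, the schema-driven idea could be sketched roughly as follows. This is a minimal, standard-library-only sketch, not code from AllenNLP, TFDS, or Huggingface Datasets: `StringFeature`, `ClassLabelFeature`, `field_type_for`, and `plan_fields` are hypothetical names, and the real feature types in those libraries look different. It only shows how a per-column schema could decide which field class each column becomes.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


# Hypothetical stand-ins for schema entries; the real TFDS and
# Huggingface Datasets feature classes have different names and APIs.
@dataclass
class StringFeature:
    pass


@dataclass
class ClassLabelFeature:
    names: List[str]


Feature = Union[StringFeature, ClassLabelFeature]


def field_type_for(feature: Feature) -> str:
    """Return the name of the AllenNLP field class a schema entry would map to."""
    if isinstance(feature, StringFeature):
        return "TextField"
    if isinstance(feature, ClassLabelFeature):
        return "LabelField"
    raise TypeError(f"no field mapping for {type(feature).__name__}")


def plan_fields(schema: Dict[str, Feature]) -> Dict[str, str]:
    """Map every column in the schema to a field class name."""
    return {name: field_type_for(feature) for name, feature in schema.items()}


# Example schema for a two-column sentiment dataset (made up for this sketch).
schema = {
    "sentence": StringFeature(),
    "label": ClassLabelFeature(names=["negative", "positive"]),
}
field_plan = plan_fields(schema)
print(field_plan)
```

Raising on unknown feature types, rather than guessing, keeps unsupported columns visible instead of silently dropping them.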

@dirkgr dirkgr modified the milestones: 1.4, 2.1 Feb 12, 2021
@dirkgr dirkgr modified the milestones: 2.1, 2.2 Feb 22, 2021
@divijbajaj

I'm trying to add a DatasetReader that can generically create instances from the Huggingface datasets interface.
It will have limitations and may not work for all datasets. In those cases, a subclass with selective overrides can be added to fill the gaps.
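
The "generic reader plus selective overrides" pattern can be sketched in plain Python. This is an illustration only and not the code from the linked PR: `GenericRowReader`, `QAReader`, and the `(kind, value)` tuples are hypothetical stand-ins for a real `DatasetReader` and AllenNLP fields, and the rows are made up.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple

Converted = Tuple[str, Any]


class GenericRowReader:
    """Hypothetical base reader: converts each row into {column: (kind, value)}."""

    def read(self, rows: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Converted]]:
        for row in rows:
            yield {name: self.convert(name, value) for name, value in row.items()}

    def convert(self, name: str, value: Any) -> Converted:
        # Generic rules based only on the Python type of the value.
        if isinstance(value, str):
            return ("text", value)
        if isinstance(value, int):
            return ("label", value)
        raise ValueError(f"column {name!r}: unsupported type {type(value).__name__}")


class QAReader(GenericRowReader):
    """Subclass that overrides only the column the generic rules cannot handle."""

    def convert(self, name: str, value: Any) -> Converted:
        if name == "answers":
            return ("list_of_text", list(value))
        return super().convert(name, value)


rows = [{"question": "Who?", "answers": ["A", "B"], "label": 1}]
instances = list(QAReader().read(rows))
print(instances)
```

With this shape, the base class fails loudly on columns it cannot map, and a dataset-specific subclass only needs to handle those columns.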

@epwalsh
Member

epwalsh commented Mar 31, 2021

@divijbajaj great! Looking forward to seeing what you come up with.

@ghost ghost linked a pull request Apr 4, 2021 that will close this issue
@ghost

ghost commented Apr 4, 2021

@epwalsh I've raised a draft PR with the slightly half-baked but functional code that @divijbajaj and I put together last week. It should give a rough direction. We'd appreciate a high-level review if time permits.
