-
Hi Luis!
We identified this issue already, and discussed a bit with the HF folks.
If my memory is correct, it's because there are two mechanisms to provide
metadata for HF datasets: Data cards, and an older legacy mechanism. We're
getting the descriptions from data cards in Croissant, while the longer,
nicer descriptions come from the legacy mechanism.
+Quentin Lhoest, +Sylvain Lesage: in case you want to chime in with possible
approaches to address this issue.
Cheers,
Omar
On Wed, Feb 28, 2024 at 3:04 PM Luis Oala wrote:
hi gang,
i have been playing a bit with the nice crawler that @marcenacp
<https://github.com/marcenacp> built
(https://github.com/mlcommons/croissant/blob/main/health/crawler/spiders/huggingface.py).
when analyzing the resulting dataframes i noticed that the "description"
field in hf croissant files is, let's say, very light, especially compared to
kaggle. anecdotally, this type of diet description seems to be widespread
in the hf croissant files from the crawl.
here is a side-by-side example for two datasets:
hf (dataset gui page:
https://huggingface.co/datasets/CohereForAI/aya_collection, croissant:
https://datasets-server.huggingface.co/croissant?dataset=CohereForAI/aya_collection&full=true)
"description":"CohereForAI/aya_collection dataset hosted on Hugging Face and contributed by the HF Datasets community"
kaggle (dataset gui page:
https://www.kaggle.com/datasets/kanchana1990/airbnb-las-vegas-listings,
croissant:
https://storage.googleapis.com/kaggle-data-sets/4482993/7683348/croissant/metadata.json?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20240228%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240228T140305Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=78fbb6537a3cf1a6f8ac9f85dbb46aaa244a66b9dfce03d37405c3dd2bb9f6c6d40a7897e516841d5f5bb516e806252852752905b6f12bb9f9a67a4becd2664df514e33dde9af3931576acc78bfac24cf3991bc6021a7f138850bb757429360f8bfa44e4fecc85d936ae217d9bc2062162beee7cf394c24d63333fbcd13f81cef6263a5399837bbd0f51bb9277f4be5237320ae459ad596a33048eb9a8109f5b99882b7554efb41edb32e39ff3109e88e964982ed106523ba5d9166a90ec9b0d06276e1795fe299f5190d48a899d56ee8fe8f1b3694a2ece023a1318149d79a2975f9cc0ee6400fbab80b63aa1ab16b28e5cfd1d4d9518f89ccc7ceff745d004
)
"description":"**Airbnb Las Vegas Listings \uD83C\uDFE0**\n\n**Overview:**\nWelcome to our cozy corner of data, featuring a curated selection of Airbnb listings from the vibrant city of Las Vegas! Dive into the unique stays Vegas has to offer, from luxurious condos to private rooms that promise an unforgettable stay.\n\n**Data Science Applications:**\nThis dataset is your playground for various data science projects. Whether you\u0027re predicting prices, analyzing guest preferences, or exploring the impact of locations on ratings, there\u0027s something here for everyone. It\u0027s perfect for those looking to practice their data wrangling, visualization, and machine learning skills in a real-world context. Price column is null here so , one may take that as a data cleaning activity also.\n\n**Column Descriptors:**\n- **\u0060roomType\u0060**: Discover the type of accommodation.\n- **\u0060stars\u0060**: Check out the guest ratings.\n- **\u0060address\u0060**: Know where you\u0027ll be staying.\n- **\u0060numberOfGuests\u0060**: Find out the guest capacity.\n- **\u0060primaryHost/smartName\u0060**: Get to know your host.\n- **\u0060price\u0060**: Peek at the listing prices. (Note: Some data may be missing here, so creativity in handling this could be a fun challenge!)\n- **\u0060firstReviewComments\u0060**: Read what the first guests had to say.\n- **\u0060firstReviewRating\u0060**: See how the first guests rated their stay.\n\n**Ethically Mined Data:**\nWe\u0027re committed to ethical data practices. This dataset has been carefully compiled, respecting privacy and data sharing norms. It\u0027s all about fostering learning and innovation, without stepping over any lines.\n\n**A Big Thank You:**\nWe extend our heartfelt gratitude to Airbnb and the platforms that share data openly, making projects like this possible. Their commitment to community and openness enriches the data science world.\n\nDive in, explore, and let the data spark your curiosity and creativity! 
Happy analyzing! \uD83C\uDF1F"
are the "description" values populated with correct metadata when hf
generates the croissant files? for example, if you check the actual dataset
page for the hf example above, it has rich info such as
https://huggingface.co/datasets/CohereForAI/aya_collection#dataset-summary
:
The Aya Collection is a massive multilingual collection consisting of 513
million instances of prompts and completions covering a wide range of
tasks. This collection incorporates instruction-style templates from fluent
speakers and applies them to a curated list of datasets, as well as
translations of instruction-style datasets into 101 languages. Aya Dataset,
a human-curated multilingual instruction and response dataset, is also part
of this collection. See our paper for more details regarding the collection.
-
Yes, the issue is with the specific endpoint (https://datasets-server.huggingface.co/croissant?dataset=CohereForAI/aya_collection&full=true). The embedded Croissant in https://huggingface.co/datasets/CohereForAI/aya_collection gives a better description:
"description": "\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Summary\n\t\n\nThe Aya Collection is a massive multilingual collection consisting of 513 million instances of prompts and completions covering a wide range of tasks.\nThis collection incorporates instruction-style templates from fluent speakers and applies them to a curated list of datasets, as well as translations of instruction-style datasets into 101 languages. Aya Dataset, a human-curated multilingual instruction and response dataset, is also part of this collection. See… See the full description on the dataset page: https://huggingface.co/datasets/CohereForAI/aya_collection.",
For now, it’s better to rely on the Croissant metadata embedded in the dataset pages, such as https://huggingface.co/datasets/CohereForAI/aya_collection.
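To check which datasets from a crawl are affected, one option is to flag descriptions that match the generic placeholder quoted above. The following is a minimal sketch, not part of the crawler: the function names are hypothetical, the placeholder wording is taken only from the aya_collection example in this thread, and the endpoint URL pattern is the one linked above (it may have changed since).

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Endpoint as linked in this thread; assumed to still follow this pattern.
ENDPOINT = "https://datasets-server.huggingface.co/croissant"


def croissant_endpoint_url(dataset):
    """Build the endpoint URL for a dataset id such as 'CohereForAI/aya_collection'."""
    return f"{ENDPOINT}?dataset={quote(dataset, safe='/')}&full=true"


def is_generic_description(description, dataset):
    """Return True if the description is the generic placeholder seen in this thread."""
    generic = (
        f"{dataset} dataset hosted on Hugging Face "
        "and contributed by the HF Datasets community"
    )
    return (description or "").strip() == generic


def fetch_endpoint_description(dataset, timeout=30):
    """Download the Croissant JSON-LD from the endpoint and return its description."""
    with urlopen(croissant_endpoint_url(dataset), timeout=timeout) as response:
        return json.load(response).get("description")


if __name__ == "__main__":
    name = "CohereForAI/aya_collection"
    description = fetch_endpoint_description(name)  # requires network access
    print(is_generic_description(description, name))
```

Applying `is_generic_description` to the "description" column of the crawl results would give a rough count of how widespread the light descriptions are.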
-
thx omar for context! where exactly is the embedded croissant located? when i traverse the dataset file directory for the aya example i find .parquet, .png and .md files. on the dataset page itself i can only find the croissant info that was also in the dataframe that came back from the crawl (see Screencast from 02-28-2024 04:29:09 PM.webm: https://github.com/mlcommons/croissant/assets/26168435/3060a491-6e76-46fc-8532-d0cebedebe88).
-
Hey Luis, the Croissant is included in the source of the HTML page, inside
a <script type="application/ld+json">.
The dataset metadata -- including the description -- is at the end of the
(very long) Croissant block.
Best,
Omar
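For anyone who wants to pull the embedded Croissant out of a dataset page programmatically, here is a rough sketch using only the Python standard library. It assumes the page HTML has already been downloaded into a string; the class and function names are made up for illustration, and nothing here is taken from the crawler itself.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._chunks = []

    def handle_data(self, data):
        # Script content may arrive in several pieces, so accumulate it.
        if self._in_json_ld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            self.blocks.append(json.loads("".join(self._chunks)))


def embedded_descriptions(html):
    """Return the "description" of each JSON-LD object found in the page."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return [b.get("description") for b in parser.blocks if isinstance(b, dict)]
```

Since the description sits at the end of a very long block, parsing the JSON and reading the key is likely easier than scrolling through the page source by hand.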