light croissant metadata content in "description" key: hf vs kaggle #574

luisoala · 2024-02-28T14:04:42Z

luisoala
Feb 28, 2024
Collaborator

hi gang,

i have been playing a bit with the nice crawler that @marcenacp built (https://github.com/mlcommons/croissant/blob/main/health/crawler/spiders/huggingface.py).

when analyzing the resulting dataframes i noticed that the "description" tag in hf croissant files is, let's say, very light, especially compared to kaggle. anecdotally, this type of diet description seems to be widespread in the hf croissant files from the crawl.

here is a side by side example for two datasets

hf (dataset gui page: https://huggingface.co/datasets/CohereForAI/aya_collection, croissant: https://datasets-server.huggingface.co/croissant?dataset=CohereForAI/aya_collection&full=true)

"description":"CohereForAI/aya_collection dataset hosted on Hugging Face and contributed by the HF Datasets community"

kaggle (dataset gui page: https://www.kaggle.com/datasets/kanchana1990/airbnb-las-vegas-listings, croissant: https://storage.googleapis.com/kaggle-data-sets/4482993/7683348/croissant/metadata.json?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20240228%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240228T140305Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=78fbb6537a3cf1a6f8ac9f85dbb46aaa244a66b9dfce03d37405c3dd2bb9f6c6d40a7897e516841d5f5bb516e806252852752905b6f12bb9f9a67a4becd2664df514e33dde9af3931576acc78bfac24cf3991bc6021a7f138850bb757429360f8bfa44e4fecc85d936ae217d9bc2062162beee7cf394c24d63333fbcd13f81cef6263a5399837bbd0f51bb9277f4be5237320ae459ad596a33048eb9a8109f5b99882b7554efb41edb32e39ff3109e88e964982ed106523ba5d9166a90ec9b0d06276e1795fe299f5190d48a899d56ee8fe8f1b3694a2ece023a1318149d79a2975f9cc0ee6400fbab80b63aa1ab16b28e5cfd1d4d9518f89ccc7ceff745d004)

"description":"**Airbnb Las Vegas Listings \uD83C\uDFE0**\n\n**Overview:**\nWelcome to our cozy corner of data, featuring a curated selection of Airbnb listings from the vibrant city of Las Vegas! Dive into the unique stays Vegas has to offer, from luxurious condos to private rooms that promise an unforgettable stay.\n\n**Data Science Applications:**\nThis dataset is your playground for various data science projects. Whether you\u0027re predicting prices, analyzing guest preferences, or exploring the impact of locations on ratings, there\u0027s something here for everyone. It\u0027s perfect for those looking to practice their data wrangling, visualization, and machine learning skills in a real-world context. Price column is null here so , one may take that as a data cleaning activity also.\n\n**Column Descriptors:**\n- **\u0060roomType\u0060**: Discover the type of accommodation.\n- **\u0060stars\u0060**: Check out the guest ratings.\n- **\u0060address\u0060**: Know where you\u0027ll be staying.\n- **\u0060numberOfGuests\u0060**: Find out the guest capacity.\n- **\u0060primaryHost/smartName\u0060**: Get to know your host.\n- **\u0060price\u0060**: Peek at the listing prices. (Note: Some data may be missing here, so creativity in handling this could be a fun challenge!)\n- **\u0060firstReviewComments\u0060**: Read what the first guests had to say.\n- **\u0060firstReviewRating\u0060**: See how the first guests rated their stay.\n\n**Ethically Mined Data:**\nWe\u0027re committed to ethical data practices. This dataset has been carefully compiled, respecting privacy and data sharing norms. It\u0027s all about fostering learning and innovation, without stepping over any lines.\n\n**A Big Thank You:**\nWe extend our heartfelt gratitude to Airbnb and the platforms that share data openly, making projects like this possible. Their commitment to community and openness enriches the data science world.\n\nDive in, explore, and let the data spark your curiosity and creativity! 
Happy analyzing! \uD83C\uDF1F"

are the "description" values populated with correct metadata when hf generates the croissant files? for example, if you check the actual dataset page for the hf example above, it has rich info such as https://huggingface.co/datasets/CohereForAI/aya_collection#dataset-summary:

The Aya Collection is a massive multilingual collection consisting of 513 million instances of prompts and completions covering a wide range of tasks. This collection incorporates instruction-style templates from fluent speakers and applies them to a curated list of datasets, as well as translations of instruction-style datasets into 101 languages. Aya Dataset, a human-curated multilingual instruction and response dataset, is also part of this collection. See our paper for more details regarding the collection.
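To put a number on the "anecdotally widespread" impression, here is a minimal sketch that flags descriptions matching the generic wording from the hf example above. The record shape (`dataset` and `description` keys) and the helper names are hypothetical, and it is an assumption that every thin description follows exactly this template.

```python
# Wording copied from the hf example above. It is an assumption (not confirmed
# in this thread) that all thin descriptions follow exactly this template.
PLACEHOLDER_SUFFIX = (
    " dataset hosted on Hugging Face and contributed by the HF Datasets community"
)


def is_placeholder(dataset_id, description):
    """Return True if the description is just the generic template text."""
    return description.strip() == dataset_id + PLACEHOLDER_SUFFIX


def placeholder_share(records):
    """Fraction of records whose description is the generic template."""
    if not records:
        return 0.0
    hits = sum(is_placeholder(r["dataset"], r["description"]) for r in records)
    return hits / len(records)


# Hypothetical stand-in for rows of the crawl dataframe.
records = [
    {
        "dataset": "CohereForAI/aya_collection",
        "description": "CohereForAI/aya_collection dataset hosted on Hugging Face "
        "and contributed by the HF Datasets community",
    },
    {
        "dataset": "example/made_up",
        "description": "A hand-written summary of a made-up dataset.",
    },
]

print(placeholder_share(records))  # 0.5
```

With a real crawl, the records would come from the dataframe rows instead of the two hand-written entries.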

benjelloun · 2024-02-28T14:29:58Z

benjelloun
Feb 28, 2024
Maintainer

Hi Luis!

We identified this issue already, and discussed a bit with the HF folks. If my memory is correct, it's because there are two mechanisms to provide metadata for HF datasets: Data cards, and an older legacy mechanism. We're getting the descriptions from data cards in Croissant, while the longer, nicer descriptions come from the legacy mechanism.

+Quentin Lhoest +Sylvain Lesage in case you want to chime in with possible approaches to address this issue.

Cheers,
Omar


benjelloun · 2024-02-28T15:25:38Z

benjelloun
Feb 28, 2024
Maintainer

Yes, the issue is with the specific endpoint (https://datasets-server.huggingface.co/croissant?dataset=CohereForAI/aya_collection&full=true). The embedded Croissant in https://huggingface.co/datasets/CohereForAI/aya_collection gives a better description:

"description": "\n\n\t\n\t\t\n\t\n\t\n\t\tDataset Summary\n\t\n\nThe Aya Collection is a massive multilingual collection consisting of 513 million instances of prompts and completions covering a wide range of tasks.\nThis collection incorporates instruction-style templates from fluent speakers and applies them to a curated list of datasets, as well as translations of instruction-style datasets into 101 languages. Aya Dataset, a human-curated multilingual instruction and response dataset, is also part of this collection. See… See the full description on the dataset page: https://huggingface.co/datasets/CohereForAI/aya_collection.",

For now, it’s better to rely on the Croissant schema embedded in the dataset pages such as https://huggingface.co/datasets/CohereForAI/aya_collection.
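For anyone who wants to reproduce the thin description from the endpoint, a rough sketch follows. The URL pattern is the one quoted in this thread. The helper names are hypothetical, and it is an assumption that the response is a JSON object with a top-level "description" key.

```python
import json
import urllib.parse
import urllib.request

# Base URL taken from the endpoint quoted in this thread.
ENDPOINT = "https://datasets-server.huggingface.co/croissant"


def croissant_endpoint_url(dataset_id, full=True):
    """Build the endpoint URL for a dataset id such as 'owner/name'."""
    params = {"dataset": dataset_id}
    if full:
        params["full"] = "true"
    # safe="/" keeps the slash in the dataset id unescaped, as in the thread.
    return ENDPOINT + "?" + urllib.parse.urlencode(params, safe="/")


def fetch_endpoint_description(dataset_id, timeout=30):
    """Fetch the Croissant JSON and return its description (needs network).

    Assumes the response body is a JSON object with a "description" key.
    """
    url = croissant_endpoint_url(dataset_id)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp).get("description")


print(croissant_endpoint_url("CohereForAI/aya_collection"))
```

Only the URL builder runs here; `fetch_endpoint_description` is left uncalled because it needs network access.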


luisoala · 2024-02-28T15:37:11Z

luisoala
Feb 28, 2024
Collaborator Author

thx omar for context! where exactly is the embedded croissant located? when i traverse the dataset file directory for the aya example i find .parquet .png and .md files. on the dataset page itself i can only find the croissant info that was also in the dataframe that came back from the crawl

Screencast.from.02-28-2024.04.29.09.PM.webm


benjelloun · 2024-03-04T09:13:37Z

benjelloun
Mar 4, 2024
Maintainer

Hey Luis, the Croissant is included in the source of the HTML page, inside a <script type="application/ld+json">. The dataset metadata -- including the description -- is at the end of the (very long) Croissant block.

Best,
Omar
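A minimal sketch of pulling that block out of a page's HTML, using only the Python standard library. The class and function names are hypothetical, the sample HTML is a made-up stand-in for a fetched dataset page, and it is an assumption that the first JSON-LD object carrying a "description" key is the Croissant one.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._chunks))
            self._in_jsonld = False


def embedded_croissant_description(html):
    """Return the description of the first JSON-LD object that has one."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    for raw in parser.blocks:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip script blocks that are not valid JSON
        if isinstance(doc, dict) and "description" in doc:
            return doc["description"]
    return None


# Made-up stand-in for the HTML source of a dataset page.
sample_html = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "sc:Dataset", "name": "demo", "description": "A longer summary."}'
    "</script></head><body></body></html>"
)

print(embedded_croissant_description(sample_html))  # A longer summary.
```

On a real page, the HTML would be the downloaded source of the dataset page rather than the inline sample.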


luisoala Mar 5, 2024
Collaborator Author

i c thx omar!
