Fix URLs in blog_authorship_corpus dataset #3106

Merged 3 commits on Oct 19, 2021

37 changes: 25 additions & 12 deletions datasets/blog_authorship_corpus/README.md
@@ -1,10 +1,27 @@
---
annotations_creators:
- no-annotation
language_creators:
- found
languages:
- en
licenses:
- unknown
multilinguality:
- monolingual
paperswithcode_id: blog-authorship-corpus
pretty_name: Blog Authorship Corpus
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- multi-class-classification
---

# Dataset Card for "blog_authorship_corpus"
# Dataset Card for Blog Authorship Corpus

## Table of Contents
- [Dataset Description](#dataset-description)
@@ -47,26 +64,23 @@ The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers ga
Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry and astrological sign. (All are labeled for gender and age but for many, industry and/or sign is marked as unknown.)
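
As a rough illustration of how the metadata carried in a file name could be recovered, here is a minimal sketch. It assumes a dot-separated layout of `<id>.<gender>.<age>.<industry>.<sign>.<extension>`; that layout, the example file name, and the names `BloggerInfo` and `parse_blog_filename` are assumptions made for illustration and are not confirmed by this diff.

```python
from typing import NamedTuple


class BloggerInfo(NamedTuple):
    blogger_id: str
    gender: str
    age: int
    industry: str
    sign: str


def parse_blog_filename(file_name: str) -> BloggerInfo:
    """Split a blog file name into the blogger's self-provided metadata."""
    # Assumed layout: <id>.<gender>.<age>.<industry>.<sign>.<extension>
    parts = file_name.split(".")
    if len(parts) != 6:
        raise ValueError(f"unexpected file name layout: {file_name!r}")
    blogger_id, gender, age, industry, sign, _extension = parts
    return BloggerInfo(blogger_id, gender, int(age), industry, sign)


# Hypothetical file name, used only to exercise the parser.
info = parse_blog_filename("1000331.female.37.indUnk.Leo.xml")
print(info.gender, info.age, info.sign)
```

Rejecting names that do not have exactly six parts keeps a layout mismatch visible instead of silently assigning fields to the wrong slots.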

All bloggers included in the corpus fall into one of three age groups:

· 8240 "10s" blogs (ages 13-17),

· 8086 "20s" blogs(ages 23-27)

· 2994 "30s" blogs (ages 33-47).
- 8240 "10s" blogs (ages 13-17),
- 8086 "20s" blogs (ages 23-27),
- 2994 "30s" blogs (ages 33-47).

For each age group there are an equal number of male and female bloggers.
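
The age groups listed above can be sketched as a small lookup table. The ranges and counts are taken from the list; the names `AGE_GROUPS` and `age_group` are hypothetical helpers and not part of the dataset script.

```python
from typing import Optional

# Group label -> (inclusive age range, number of blogs), as listed above.
AGE_GROUPS = {
    "10s": (range(13, 18), 8240),
    "20s": (range(23, 28), 8086),
    "30s": (range(33, 48), 2994),
}


def age_group(age: int) -> Optional[str]:
    """Return the age-group label for an age, or None if it falls in a gap."""
    for label, (ages, _count) in AGE_GROUPS.items():
        if age in ages:
            return label
    return None


total_blogs = sum(count for _ages, count in AGE_GROUPS.values())
print(total_blogs)  # 19320, the number of bloggers given in the description
print(age_group(25), age_group(20))  # 20s None
```

Every group count is even (4120, 4043, and 1497 per gender), which is consistent with the equal male/female split.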

Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions. Individual posts within a single blogger are separated by the date of the following post and links within a post are denoted by the label urllink.

The corpus may be freely used for non-commercial research purposes
The corpus may be freely used for non-commercial research purposes.

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
The language of the dataset is English (`en`).

## Dataset Structure

@@ -162,7 +176,7 @@ The data fields are the same among all splits.

### Licensing Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
The corpus may be freely used for non-commercial research purposes.

### Citation Information

@@ -178,7 +192,6 @@ The data fields are the same among all splits.

```


### Contributions

Thanks to [@thomwolf](https://github.com/thomwolf), [@lewtun](https://github.com/lewtun), [@patrickvonplaten](https://github.com/patrickvonplaten) for adding this dataset.
Thanks to [@thomwolf](https://github.com/thomwolf), [@lewtun](https://github.com/lewtun), [@patrickvonplaten](https://github.com/patrickvonplaten) for adding this dataset.
15 changes: 6 additions & 9 deletions datasets/blog_authorship_corpus/blog_authorship_corpus.py
@@ -24,21 +24,18 @@
Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry and astrological sign. (All are labeled for gender and age but for many, industry and/or sign is marked as unknown.)

All bloggers included in the corpus fall into one of three age groups:

· 8240 "10s" blogs (ages 13-17),

· 8086 "20s" blogs(ages 23-27)

· 2994 "30s" blogs (ages 33-47).
- 8240 "10s" blogs (ages 13-17),
- 8086 "20s" blogs (ages 23-27),
- 2994 "30s" blogs (ages 33-47).

For each age group there are an equal number of male and female bloggers.

Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions. Individual posts within a single blogger are separated by the date of the following post and links within a post are denoted by the label urllink.

The corpus may be freely used for non-commercial research purposes
The corpus may be freely used for non-commercial research purposes.
"""
_URL = "https://u.cs.biu.ac.il/~koppel/BlogCorpus.htm"
_DATA_URL = "http://www.cs.biu.ac.il/~koppel/blogs/blogs.zip"
_URL = "https://lingcog.blogspot.com/p/datasets.html"
_DATA_URL = "https://drive.google.com/u/0/uc?id=1cGy4RNDV87ZHEXbiozABr9gsSrZpPaPz&export=download"


class BlogAuthorshipCorpus(datasets.GeneratorBasedBuilder):
2 changes: 1 addition & 1 deletion datasets/blog_authorship_corpus/dataset_infos.json
@@ -1 +1 @@
{"blog_authorship_corpus": {"description": "The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person.\n\nEach blog is presented as a separate file, the name of which indicates a blogger id# and the blogger\u2019s self-provided gender, age, industry and astrological sign. (All are labeled for gender and age but for many, industry and/or sign is marked as unknown.)\n\nAll bloggers included in the corpus fall into one of three age groups:\n\n\u00b7 8240 \"10s\" blogs (ages 13-17),\n\n\u00b7 8086 \"20s\" blogs(ages 23-27)\n\n\u00b7 2994 \"30s\" blogs (ages 33-47).\n\nFor each age group there are an equal number of male and female bloggers.\n\nEach blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions. Individual posts within a single blogger are separated by the date of the following post and links within a post are denoted by the label urllink.\n\nThe corpus may be freely used for non-commercial research purposes\n", "citation": "@inproceedings{schler2006effects,\n title={Effects of age and gender on blogging.},\n author={Schler, Jonathan and Koppel, Moshe and Argamon, Shlomo and Pennebaker, James W},\n booktitle={AAAI spring symposium: Computational approaches to analyzing weblogs},\n volume={6},\n pages={199--205},\n year={2006}\n}\n", "homepage": "https://u.cs.biu.ac.il/~koppel/BlogCorpus.htm", "license": "", "features": {"text": {"dtype": "string", "id": null, "_type": "Value"}, "date": {"dtype": "string", "id": null, "_type": "Value"}, "gender": {"dtype": "string", "id": null, "_type": "Value"}, "age": {"dtype": "int32", "id": null, "_type": "Value"}, "horoscope": {"dtype": "string", "id": null, "_type": "Value"}, "job": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, 
"supervised_keys": null, "task_templates": null, "builder_name": "blog_authorship_corpus", "config_name": "blog_authorship_corpus", "version": {"version_str": "1.0.0", "description": null, "major": 1, "minor": 0, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 753833081, "num_examples": 689793, "dataset_name": "blog_authorship_corpus"}, "validation": {"name": "validation", "num_bytes": 41236028, "num_examples": 37919, "dataset_name": "blog_authorship_corpus"}}, "download_checksums": {"http://www.cs.biu.ac.il/~koppel/blogs/blogs.zip": {"num_bytes": 312949121, "checksum": "1dfa6996663515a4baf8c1b71713ce8fe9a314b13778701447e4663bbc64c983"}}, "download_size": 312949121, "post_processing_size": null, "dataset_size": 795069109, "size_in_bytes": 1108018230}}
{"blog_authorship_corpus": {"description": "The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person.\n\nEach blog is presented as a separate file, the name of which indicates a blogger id# and the blogger\u2019s self-provided gender, age, industry and astrological sign. (All are labeled for gender and age but for many, industry and/or sign is marked as unknown.)\n\nAll bloggers included in the corpus fall into one of three age groups:\n- 8240 \"10s\" blogs (ages 13-17),\n- 8086 \"20s\" blogs (ages 23-27)\n- 2994 \"30s\" blogs (ages 33-47).\n\nFor each age group there are an equal number of male and female bloggers.\n\nEach blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions. Individual posts within a single blogger are separated by the date of the following post and links within a post are denoted by the label urllink.\n\nThe corpus may be freely used for non-commercial research purposes.\n", "citation": "@inproceedings{schler2006effects,\n title={Effects of age and gender on blogging.},\n author={Schler, Jonathan and Koppel, Moshe and Argamon, Shlomo and Pennebaker, James W},\n booktitle={AAAI spring symposium: Computational approaches to analyzing weblogs},\n volume={6},\n pages={199--205},\n year={2006}\n}\n", "homepage": "https://lingcog.blogspot.com/p/datasets.html", "license": "", "features": {"text": {"dtype": "string", "id": null, "_type": "Value"}, "date": {"dtype": "string", "id": null, "_type": "Value"}, "gender": {"dtype": "string", "id": null, "_type": "Value"}, "age": {"dtype": "int32", "id": null, "_type": "Value"}, "horoscope": {"dtype": "string", "id": null, "_type": "Value"}, "job": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": 
null, "task_templates": null, "builder_name": "blog_authorship_corpus", "config_name": "blog_authorship_corpus", "version": {"version_str": "1.0.0", "description": null, "major": 1, "minor": 0, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 753833081, "num_examples": 689793, "dataset_name": "blog_authorship_corpus"}, "validation": {"name": "validation", "num_bytes": 41236028, "num_examples": 37919, "dataset_name": "blog_authorship_corpus"}}, "download_checksums": {"https://drive.google.com/u/0/uc?id=1cGy4RNDV87ZHEXbiozABr9gsSrZpPaPz&export=download": {"num_bytes": 632898892, "checksum": "e89941d841b1652f4405583cfc98c86767bf739cc8876c96037f2a177aef8ebe"}}, "download_size": 632898892, "post_processing_size": null, "dataset_size": 795069109, "size_in_bytes": 1427968001}}
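
To check a manually downloaded archive against the updated `download_checksums` entry, something like the following could be used. This is a sketch: it assumes the recorded `checksum` is a SHA-256 hex digest (its 64-character length is consistent with that), and the helper names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Tuple


def file_size_and_sha256(path: Path, chunk_size: int = 1 << 20) -> Tuple[int, str]:
    """Stream a file and return (size in bytes, hex SHA-256 digest)."""
    digest = hashlib.sha256()
    size = 0
    with path.open("rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


def matches_recorded_checksum(path: Path, num_bytes: int, checksum: str) -> bool:
    """Compare a local file against one `download_checksums` entry."""
    return file_size_and_sha256(path) == (num_bytes, checksum)


# Values copied from the updated dataset_infos.json entry.
EXPECTED_NUM_BYTES = 632898892
EXPECTED_SHA256 = "e89941d841b1652f4405583cfc98c86767bf739cc8876c96037f2a177aef8ebe"
DATASET_SIZE = 795069109

# size_in_bytes is the download size plus the generated dataset size.
assert EXPECTED_NUM_BYTES + DATASET_SIZE == 1427968001
```

For a local copy saved as, say, `blogs.zip` (a hypothetical path), the check would be `matches_recorded_checksum(Path("blogs.zip"), EXPECTED_NUM_BYTES, EXPECTED_SHA256)`. Reading in chunks avoids loading the roughly 600 MB archive into memory at once.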