Add The Pile subsets #3378

Merged: 5 commits, Dec 9, 2021
74 changes: 70 additions & 4 deletions datasets/the_pile/README.md
@@ -81,6 +81,15 @@ This dataset is in English (`EN`).
}
```

#### europarl
```
{
'text': 'Uvádění biocidních přípravků na trh - Nový návrh revize týkající se biocidních přípravků (rozprava) \nPředsedající\nDalším bodem je společná rozprava o následujících tématech:\nzpráva paní Sârbuové za Výbor pro životní prostředí, veřejné zdraví a bezpečnost potravin o návrhu...',
'meta': "{'language': 'cs'}",
}
```

#### free_law
```
{
@@ -89,6 +98,30 @@ This dataset is in English (`EN`).
}
```

#### hacker_news
```
{
'text': "\nChina Deserves Donald Trump - rm2889\nhttps://www.nytimes.com/2019/05/21/opinion/china-trump-trade.html\n======\nNotPaidToPost\n> so he’d be wise to curb his nationalistic “no-one-tells-China-what-to-do”\n> bluster\n\nThis comment highlights both ignorance of Chinese history and continuing\nAmerican arrogance.\n\nChina has been painfully dictated what to do during the last 200 years. This\nhas had a profound effect on the country and has led to the collapse of\nimperial rule and the drive to 'rejuvenate'...",
'meta': "{'id': '19979654'}",
}
```

#### nih_exporter
```
{
'text': "The National Domestic Violence Hotline (NDVH) and the National Dating Abuse Helpline (NDAH), which are supported by the Division of Family Violence Prevention and Services within the Family and Youth Services Bureau, serve as critical partners in the intervention, prevention, and resource assistance efforts of the network of family violence, domestic violence, and dating violence service providers. They provide crisis intervention and support services; information about resources on domestic...",
'meta': " {'APPLICATION_ID': 100065}",
}
```

#### pubmed
```
{
'meta': {'pmid': 11409574, 'language': 'eng'},
'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age. Systematic review of the published literature. Out-patient clinics, emergency departments and hospitalisation wards in 23 health centres from 10 countries. Cohort studies reporting the frequency of hypoxaemia in children under 5 years of age with ALRI, and the association between hypoxaemia and the risk of dying. Prevalence of hypoxaemia measured in children with ARI and relative risks for the association between the severity of illness and the frequency of hypoxaemia, and between hypoxaemia and the risk of dying. Seventeen published studies were found that included 4,021 children under 5 with acute respiratory infections (ARI) and reported the prevalence of hypoxaemia. Out-patient children and those with a clinical diagnosis of upper ARI had a low risk of hypoxaemia (pooled estimate of 6% to 9%). The prevalence increased to 31% and to 43% in patients in emergency departments and in cases with clinical pneumonia, respectively, and it was even higher among hospitalised children (47%) and in those with radiographically confirmed pneumonia (72%). The cumulated data also suggest that hypoxaemia is more frequent in children living at high altitude. Three papers reported an association between hypoxaemia and death, with relative risks varying between 1.4 and 4.6. Papers describing predictors of hypoxaemia have focused on clinical signs for detecting hypoxaemia rather than on identifying risk factors for developing this complication. 
Hypoxaemia is a common and potentially lethal complication of ALRI in children under 5, particularly among those with severe disease and those living at high altitude. Given the observed high prevalence of hypoxaemia and its likely association with increased mortality, efforts should be made to improve the detection of hypoxaemia and to provide oxygen earlier to more children with severe ALRI.'
}
```

#### pubmed_central
```
{
@@ -97,6 +130,14 @@ This dataset is in English (`EN`).
}
```

#### ubuntu_irc
```
{
'text': "#ubuntu 2004-07-05\n* Window 3\n* \tServer: [0] <None>\n* \tScreen: 0x817e90c\n* \tGeometry Info: [0 11 0 11 11 11] \n* \tCO, LI are [94 49] \n* \tCurrent channel: #ubuntu\n* \tQuery User: <None> \n*\tPrompt: <None>\n* \tSecond status line is OFF\n* \tSplit line is ON triple is OFF\n* \tLogging is ON\n* \tLogfile is irclogs/ubuntu.log\n* \tNotification is OFF\n* \tHold mode is OFF\n* \tWindow level is NONE\n* \tLastlog level is ALL\n* \tNotify level is ALL\n<mdz> lifeless: using tla effectively for all packages in Warty requ...",
'meta': "{'channel': 'ubuntu', 'month': 7}"
}
```

#### uspto
```
{
@@ -110,23 +151,48 @@ This dataset is in English (`EN`).
 #### all
 
 - `text` (str): Text.
-- `meta` (dict): Metadata of the data instance, with keys:
+- `meta` (dict): Metadata of the data instance with keys:
   - pile_set_name: Name of the subset.
 
+#### europarl
+
+- `text` (str): Text.
+- `meta` (str): Metadata of the data instance with: language.
+
 #### free_law
 
 - `text` (str): Text.
-- `meta` (str): Metadata of the data instance, with: case_ID, case_jurisdiction, date_created.
+- `meta` (str): Metadata of the data instance with: case_ID, case_jurisdiction, date_created.
 
+#### hacker_news
+
+- `text` (str): Text.
+- `meta` (str): Metadata of the data instance with: id.
+
+#### nih_exporter
+
+- `text` (str): Text.
+- `meta` (str): Metadata of the data instance with: APPLICATION_ID.
+
+#### pubmed
+
+- `text` (str): Text.
+- `meta` (str): Metadata of the data instance with: pmid, language.
+
 #### pubmed_central
 
 - `text` (str): Text.
-- `meta` (str): Metadata of the data instance, with: ID of the data instance.
+- `meta` (str): Metadata of the data instance with: ID of the data instance.
 
+#### ubuntu_irc
+
+- `text` (str): Text.
+- `meta` (str): Metadata of the data instance with: channel, month.
+
 #### uspto
 
 - `text` (str): Text.
-- `meta` (str): Metadata of the data instance, with: bibliographic_information, source_file, abstract, classifications,
+- `meta` (str): Metadata of the data instance with: bibliographic_information, source_file, abstract, classifications,
   inventors.
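
For the per-subset configurations, `meta` is documented as `str`, and the example instances show it as the string form of a Python dict (single-quoted keys, so not valid JSON). A minimal sketch of turning it back into a dict follows. `parse_meta` is a hypothetical helper, not part of this PR, and it assumes every `meta` string is a plain Python literal, as in the README examples.

```python
import ast


def parse_meta(meta):
    """Return the metadata as a dict, whether it arrives as a dict or a string.

    Hypothetical helper: assumes string values are Python literals such as
    "{'language': 'cs'}", as shown in the README examples.
    """
    if isinstance(meta, dict):
        # The "all" configuration (and the pubmed example) show a real dict.
        return meta
    # strip() guards against stray leading whitespace, as in the
    # nih_exporter example; literal_eval only evaluates literals, not code.
    return ast.literal_eval(meta.strip())


instance = {
    "text": "Uvádění biocidních přípravků na trh ...",
    "meta": "{'language': 'cs'}",
}
print(parse_meta(instance["meta"])["language"])  # prints: cs
```

`ast.literal_eval` is used here rather than `json.loads` because the strings use single quotes, and rather than `eval` because it refuses anything other than literals.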

### Data Splits
48 changes: 44 additions & 4 deletions datasets/the_pile/the_pile.py
@@ -38,9 +38,14 @@
 _HOMEPAGE = "https://pile.eleuther.ai/"
 
 _LICENSES = {
-    "all": "MIT License",
+    "all": "Multiple: see each subset license",
+    "europarl": "Unknown",
     "free_law": "Unknown",
-    "pubmed_central": "MIT License",
+    "hacker_news": "Unknown",
+    "nih_exporter": "Unknown",
+    "pubmed": "Unknown",
+    "pubmed_central": "Unknown",
+    "ubuntu_irc": "Unknown",
     "uspto": "Unknown",
 }

@@ -50,8 +55,13 @@
         "validation": ["https://the-eye.eu/public/AI/pile/val.jsonl.zst"],
         "test": ["https://the-eye.eu/public/AI/pile/test.jsonl.zst"],
     },
+    "europarl": "https://the-eye.eu/public/AI/pile_preliminary_components/EuroParliamentProceedings_1996_2011.jsonl.zst",
     "free_law": "https://the-eye.eu/public/AI/pile_preliminary_components/FreeLaw_Opinions.jsonl.zst",
+    "hacker_news": "https://the-eye.eu/public/AI/pile_preliminary_components/hn.tar.gz",
+    "nih_exporter": "https://the-eye.eu/public/AI/pile_preliminary_components/NIH_ExPORTER_awarded_grant_text.jsonl.zst",
+    "pubmed": "https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst",
     "pubmed_central": "https://the-eye.eu/public/AI/pile_preliminary_components/PMC_extracts.tar.gz",
+    "ubuntu_irc": "https://the-eye.eu/public/AI/pile_preliminary_components/ubuntu_irc_until_2020_9_1.jsonl.zst",
     "uspto": "https://the-eye.eu/public/AI/pile_preliminary_components/pile_uspto.tar",
 }
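
The file names in the URL table fall into two groups by extension, and the `.jsonl.zst` group matches the set of subsets that the `_generate_examples` change later in this diff reads line by line. The sketch below only illustrates that grouping. `SUBSET_FILES` and `archive_kind` are hypothetical names, not part of the PR, and how the loader handles `uspto` is not shown in this diff.

```python
# File names copied from the URL table above (host and directory omitted).
SUBSET_FILES = {
    "europarl": "EuroParliamentProceedings_1996_2011.jsonl.zst",
    "free_law": "FreeLaw_Opinions.jsonl.zst",
    "hacker_news": "hn.tar.gz",
    "nih_exporter": "NIH_ExPORTER_awarded_grant_text.jsonl.zst",
    "pubmed": "PUBMED_title_abstracts_2019_baseline.jsonl.zst",
    "pubmed_central": "PMC_extracts.tar.gz",
    "ubuntu_irc": "ubuntu_irc_until_2020_9_1.jsonl.zst",
    "uspto": "pile_uspto.tar",
}


def archive_kind(file_name):
    """Classify a download by extension (hypothetical helper for illustration)."""
    if file_name.endswith(".jsonl.zst"):
        return "jsonl.zst"
    if file_name.endswith((".tar.gz", ".tar")):
        return "tar"
    raise ValueError(f"unrecognized archive type: {file_name}")


# Group subset names by archive kind, preserving the table's order.
by_kind = {}
for subset, file_name in SUBSET_FILES.items():
    by_kind.setdefault(archive_kind(file_name), []).append(subset)

for kind, subsets in by_kind.items():
    print(kind, subsets)
```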

@@ -62,18 +72,48 @@
             "meta": {"pile_set_name": datasets.Value("string")},
         }
     ),
+    "europarl": datasets.Features(
+        {
+            "text": datasets.Value("string"),
+            "meta": datasets.Value("string"),
+        }
+    ),
     "free_law": datasets.Features(
         {
             "text": datasets.Value("string"),
             "meta": datasets.Value("string"),
         }
     ),
+    "hacker_news": datasets.Features(
+        {
+            "text": datasets.Value("string"),
+            "meta": datasets.Value("string"),
+        }
+    ),
+    "nih_exporter": datasets.Features(
+        {
+            "text": datasets.Value("string"),
+            "meta": datasets.Value("string"),
+        }
+    ),
+    "pubmed": datasets.Features(
+        {
+            "text": datasets.Value("string"),
+            "meta": datasets.Value("string"),
+        }
+    ),
     "pubmed_central": datasets.Features(
         {
             "text": datasets.Value("string"),
             "meta": datasets.Value("string"),
         }
     ),
+    "ubuntu_irc": datasets.Features(
+        {
+            "text": datasets.Value("string"),
+            "meta": datasets.Value("string"),
+        }
+    ),
     "uspto": datasets.Features(
         {
             "text": datasets.Value("string"),
@@ -173,15 +213,15 @@ def _generate_examples(self, files):
                     key += 1
         else:
             for subset in files:
-                if subset == "free_law":
+                if subset in {"europarl", "free_law", "nih_exporter", "pubmed", "ubuntu_irc"}:
                     import zstandard as zstd
 
                     with zstd.open(open(files[subset], "rb"), "rt", encoding="utf-8") as f:
                         for row in f:
                             data = json.loads(row)
                             yield key, data
                             key += 1
-                elif subset == "pubmed_central":
+                elif subset in {"hacker_news", "pubmed_central"}:
                     for path, file in files[subset]:
                         id_ = path.split("/")[-1].split(".")[0]
                         meta = {"id": id_}
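
The two branches in this hunk can be sketched in isolation as follows. This is a simplified, standard-library-only illustration, not the loader itself: `iter_jsonl_rows` and `id_from_path` are hypothetical names, an in-memory `io.StringIO` stands in for the text stream that `zstd.open(...)` returns in the PR, and the member path in the usage lines is made up.

```python
import io
import json


def iter_jsonl_rows(lines, key=0):
    """Yield (key, record) per JSON line, as the .jsonl.zst branch does.

    `lines` is any iterable of text lines; in the PR it is the stream
    returned by zstd.open(...). `key` is the running example index.
    """
    for row in lines:
        yield key, json.loads(row)
        key += 1


def id_from_path(path):
    """Derive a record id from an archive member path, as the tar branch does:
    take the last path component and drop everything after the first dot."""
    return path.split("/")[-1].split(".")[0]


# Stand-in for a decompressed .jsonl.zst stream (two made-up records).
fake_stream = io.StringIO(
    '{"text": "a", "meta": "{}"}\n'
    '{"text": "b", "meta": "{}"}\n'
)
rows = list(iter_jsonl_rows(fake_stream))
print(rows[1])  # prints: (1, {'text': 'b', 'meta': '{}'})

# Hypothetical member path; real archive layouts are not shown in this diff.
print(id_from_path("hn/19979654.txt"))  # prints: 19979654
```

Note that `id_from_path` keeps only the part before the first dot, so a file name with several dots would be truncated at the first one.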