Skip to content

Commit

Permalink
namuwikitext translation completed
Browse files Browse the repository at this point in the history
  • Loading branch information
hank110 committed Nov 6, 2020
1 parent 095c3fa commit fc72b78
Showing 1 changed file with 112 additions and 1 deletion.
113 changes: 112 additions & 1 deletion en-docs/corpuslist/namuwikitext.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,4 +4,115 @@ sort: 8

# NamuWikiText

TBD
NamuWikiText is a dataset released by lovit@github. It provides Namu Wikipedia in a text format.
Data specification is as follows.

- author: lovit@github
- repository: [https://github.com/lovit/namuwikitext](https://github.com/lovit/namuwikitext)
- size:
- train: 31,235,096 lines (500,104 docs, 4.6G)
- dev: 153,605 lines (2,525 docs, 23M)
- test: 160,233 lines (2,527 docs, 24M)

Data structure is as follows:

|Attributes|Property|
|---|---|
|text|a body of a section|
|pair|a title of a section|


## 1. Using in Python

You can download and load the corpus after executing your Python console.

### Downloading the corpus

You can download NamuWikiText corpus into your local directory with the following Python codes.

```python
from Korpora import Korpora
Korpora.fetch("namuwikitext")
```

```note
By default, the corpus is downloaded to a Korpora directory within the user's root directory (`~/Korpora`). If you wish to download the corpus to another directory,
add `root_dir=custom_path` argument to the fetch method.
```

```tip
When the fetch method is executed with `force_download=True` argument, it ignores the existing corpus in the local directory and re-downloads the corpus. The default value of `force_download` is `False`.
```


### Loading the corpus

You can load NamuWikiText corpus from your Python console with the following codes.
If the corpus does not exist in the local directory, it is also downloaded as well.

```python
from Korpora import Korpora
corpus = Korpora.load("namuwikitext")
```

You can also load the corpus as follows.
The output of these codes is identical to that of previous codes.

```python
from Korpora import NamuwikiTextKorpus
corpus = NamuwikiTextKorpus()
```

If you use either one of these previous examples, you can load the corpus into the variable `corpus`.
`train` refers to the training dataset of the corpus, and you can check its first training instance as follows.

```
>>> corpus.train[0]
SentencePair(text='상위 문서: 아스날 FC\n2009-10 시즌 2011-12 시즌\n2010 -11 시즌...', pair=' = 아스날 FC/2010-11 시즌 =')
>>> corpus.train[0].text
상위 문서: 아스날 FC\n2009-10 시즌 2011-12 시즌\n2010 -11 시즌...
>>> corpus.train[0].pair
= 아스날 FC/2010-11 시즌 =
```

`dev` and `test` refer to the validation and test datasets of the corpus, respectively. Each of their first instance can be accessed as follows.

```
>>> corpus.dev[0]
SentencePair(text='상위 항목: 축구 관련 인물, 외국인 선수/역대 프로축구\n...', pair=' = 소말리아(축구선수) =')
>>> corpus.test[0]
SentencePair(text='', pair=' = 덴덴타운 =')
```

By executing the `get_all_texts` method, you can access all texts (bodies of sections) within the corpus.

```
>>> corpus.get_all_texts()
['상위 문서: 아스날 FC\n2009-10 시즌 2011-12 시즌\n2010 -11 시즌...', ... ]
```

By executing the `get_all_pairs` method, you can access all pairs (titles of sections) within the corpus.

```
>>> corpus.get_all_pairs()
['= 아스날 FC/2010-11 시즌 =', ... ]
```


## 2. Using in a terminal

You can directly download the corpus without executing Python console.
To do so, use the following command.

```bash
korpora fetch --corpus namuwikitext
```

```note
By default, the corpus is downloaded to a Korpora directory within the user's root directory (`~/Korpora`). If you wish to download the corpus to another directory,
add `--root_dir custom_path` argument to the fetch command.
```

```tip
If you add `--force_download` argument when executing the fetch command in the terminal, it ignores the existing corpus in the local directory and re-downloads the corpus.
```

0 comments on commit fc72b78

Please sign in to comment.