Copy korean_chatbot_data.md (#139)
warnikchow committed Nov 16, 2020
1 parent afecdea commit 4ca90b6
Showing 1 changed file with 111 additions and 1 deletion.
112 changes: 111 additions & 1 deletion en-docs/corpuslist/korean_chatbot_data.md
@@ -4,4 +4,114 @@ sort: 1

# Korean Chatbot Data

Korean Chatbot Data is a chatbot question-answer dataset created by songys@github.
Details of the dataset are as follows.

- author: songys@github
- repository: https://github.com/songys/Chatbot_data
- size:
  - train: 11,876 examples

The data structure is as follows.

|Attribute|Description|
|---|---|
|text|Question|
|pair|Answer|
|label|Daily life 0, breakup (negative) 1, love (positive) 2|
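
To make the three attributes concrete, here is a minimal sketch of one record and a lookup from label id to label name. `ChatbotExample`, `LABEL_NAMES`, and `describe` are hypothetical names used only for illustration; they are not part of Korpora. The sample values are the first instance of the corpus as shown later on this page.

```python
from typing import NamedTuple

# Label ids as listed in the table above.
LABEL_NAMES = {0: "daily life", 1: "breakup (negative)", 2: "love (positive)"}


class ChatbotExample(NamedTuple):
    """Hypothetical stand-in for one record; not a Korpora class."""
    text: str   # question
    pair: str   # answer
    label: int  # 0, 1, or 2


def describe(example: ChatbotExample) -> str:
    """Render one record with its label name spelled out."""
    return f"[{LABEL_NAMES[example.label]}] Q: {example.text} / A: {example.pair}"


example = ChatbotExample(text="12시 땡!", pair="하루가 또 가네요.", label=0)
summary = describe(example)
print(summary)
```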


## 1. Using it in Python

You can download and load the corpus from a Python console.

### Downloading the corpus

The following Python example downloads Korean Chatbot Data to your local machine.

```python
from Korpora import Korpora
Korpora.fetch("korean_chatbot_data")
```

```note
By default, the corpus is downloaded to a directory named Korpora under the user's home directory (`~/Korpora`). To download the corpus to a different path,
pass the `root_dir=custom_path` argument when calling fetch.
```

```tip
If you pass `force_download=True` when calling fetch, the corpus is downloaded again even if it already exists locally. The default is `False`.
```
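
The two options above can be combined in a single call. The sketch below assumes that `Korpora.fetch` accepts `root_dir` and `force_download` as keyword arguments, as the note and tip describe. The helper names `build_fetch_kwargs` and `fetch_chatbot_data` are hypothetical, and expanding `~` before passing the path is a choice made here, not something the documentation requires.

```python
import os


def build_fetch_kwargs(root_dir=None, force_download=False):
    """Collect the optional fetch arguments described in the note and tip."""
    kwargs = {"force_download": force_download}
    if root_dir is not None:
        # Assumption: expand "~" ourselves so the path is unambiguous.
        kwargs["root_dir"] = os.path.expanduser(root_dir)
    return kwargs


def fetch_chatbot_data(root_dir=None, force_download=False):
    """Download the corpus; needs the Korpora package and network access."""
    from Korpora import Korpora  # imported lazily so the helper above runs without it
    Korpora.fetch("korean_chatbot_data", **build_fetch_kwargs(root_dir, force_download))


# Inspect the arguments without downloading anything.
print(build_fetch_kwargs(root_dir="~/my_corpora", force_download=True))
```

Calling `fetch_chatbot_data(root_dir="~/my_corpora", force_download=True)` would then download the corpus again into that directory.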


### Loading the corpus

The following example loads Korean Chatbot Data in a Python console.
If the corpus is not available locally, it is downloaded as well.

```python
from Korpora import Korpora
corpus = Korpora.load("korean_chatbot_data")
```

You can also load Korean Chatbot Data as follows.
The result is identical to that of the code above.

```python
from Korpora import KoreanChatbotKorpus
corpus = KoreanChatbotKorpus()
```

Running either of the two snippets above loads the corpus into the variable `corpus`.
`train` holds the train data of Korean Chatbot Data; the first instance can be inspected as follows.

```
>>> corpus.train[0]
LabeledSentencePair(text='12시 땡!', pair='하루가 또 가네요.', label=0)
>>> corpus.train[0].text
'12시 땡!'
>>> corpus.train[0].pair
'하루가 또 가네요.'
>>> corpus.train[0].label
0
```

Calling the `get_all_texts` method returns every text (question) in Korean Chatbot Data.

```
>>> corpus.get_all_texts()
['12시 땡!', '1지망 학교 떨어졌어', ... ]
```

Calling the `get_all_pairs` method returns every pair (answer) in Korean Chatbot Data.

```
>>> corpus.get_all_pairs()
['하루가 또 가네요.', '위로해 드립니다.', ... ]
```

Calling the `get_all_labels` method returns every label in Korean Chatbot Data.

```
>>> corpus.get_all_labels()
[0, 0, ... ]
```
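
The three lists can be used together, for example to count labels or to pick out the questions of one class. This is a minimal sketch that assumes the lists are index-aligned (the i-th text, pair, and label belong to the same example). It runs on a two-item stand-in copied from the outputs above rather than on the real corpus, and `label_distribution` and `questions_with_label` are hypothetical helper names.

```python
from collections import Counter


def label_distribution(labels):
    """Count how many examples carry each label id."""
    return dict(Counter(labels))


def questions_with_label(texts, labels, wanted):
    """Return the questions whose label equals `wanted`."""
    return [text for text, label in zip(texts, labels) if label == wanted]


# Stand-in for corpus.get_all_texts(), get_all_pairs(), and get_all_labels().
texts = ["12시 땡!", "1지망 학교 떨어졌어"]
pairs = ["하루가 또 가네요.", "위로해 드립니다."]
labels = [0, 0]

distribution = label_distribution(labels)
daily_questions = questions_with_label(texts, labels, 0)
print(distribution)
print(daily_questions)
```

With a loaded corpus, the stand-in lists would be replaced by the return values of the three methods.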

## 2. Using it in the terminal

You can download the corpus directly, without starting a Python console.
Run the following command.

```bash
korpora fetch --corpus korean_chatbot_data
```

```note
By default, the corpus is downloaded to a directory named Korpora under the user's home directory (`~/Korpora`). To download the corpus to a different path,
add the `--root_dir custom_path` argument when running fetch in the terminal.
```

```tip
If you pass the `--force_download` argument when running fetch in the terminal, the corpus is downloaded again even if it already exists locally.
```
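
If the terminal command needs to be launched from a script, the flags above can be assembled programmatically. The sketch below only uses the flags named in this section (`--corpus`, `--root_dir`, `--force_download`). `build_fetch_command` and `run_fetch` are hypothetical helpers, and `run_fetch` assumes the `korpora` executable is on `PATH`.

```python
import subprocess


def build_fetch_command(corpus="korean_chatbot_data", root_dir=None, force_download=False):
    """Assemble the argument list for the `korpora fetch` command."""
    command = ["korpora", "fetch", "--corpus", corpus]
    if root_dir is not None:
        command += ["--root_dir", root_dir]
    if force_download:
        command.append("--force_download")
    return command


def run_fetch(**options):
    """Run the command; needs the korpora CLI installed and network access."""
    return subprocess.run(build_fetch_command(**options), check=True)


# Show the command without executing it.
command = build_fetch_command(root_dir="custom_path", force_download=True)
print(" ".join(command))
```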
