Commit

Fix notebooks

polm committed Nov 10, 2021
1 parent 286126c commit d515fc5
Showing 4 changed files with 21 additions and 94 deletions.
37 changes: 6 additions & 31 deletions en/2.1-fugashi-fuseji.ipynb
@@ -395,7 +395,7 @@
"Ambiguous words are more difficult. Some examples of ambiguous words: \n",
"\n",
"\n",
"\n",

"\n",
"- 東: *higashi* or *azuma* (or *tou*)\n",
"- 中田: *nakada* or *nakata*\n",
@@ -405,17 +405,8 @@
"- 私: *watashi* or *watakushi*\n",
"- 日本: *nihon* or *nippon*\n",
"\n",
"d\n",
"\n",
"\n",
"- 東: ひがし、あずま、とう\n",
"- 中田: なかだ、なかた\n",
"- 仮名: かな、かめい\n",
"- 牧場: ぼくじょう、まきば\n",
"- 網代: あみしろ、あじろ\n",
"- 日本: にほん、にっぽん\n",
"\n",
"d\n",

"\n",
"Usually a reading will be clear from context, but many ambiguous words are proper nouns like the names of people and places, and without knowing which specific entity it's referring to there's no way to be sure of the correct reading. Even worse, there's no way to be sure if the word you're looking at is ambiguous or not just using the tokenizer output. \n",
"\n",
@@ -482,24 +473,16 @@
"And that makes our automatic fuseji program complete. It's not a lot of code, but in building this you learned how to: \n",
"\n",
"\n",
"\n",

"\n",
"1. iterate over the tokens in a text\n",
"2. identify parts of speech of interest with example sentences\n",
"3. use multiple levels of part of speech tags\n",
"4. check if a token is in the dictionary or an unk\n",
"5. convert words to their phonetic representation\n",
"\n",
"d\n",
"\n",
"\n",
"1. 文章の単語を一つずつ処理する方法\n",
"2. 例文を使って目的の品詞を特定する方法\n",
"3. 品詞の構造の扱い\n",
"4. 未知語の判別\n",
"5. 読み仮名変換\n",
"\n",
"d\n",

"\n",
"These are all basic building blocks you can use to build a wide variety of applications. \n",
"\n",
@@ -510,21 +493,13 @@
"To learn more about the tokenizer API, consider some ways you might want to extend this application and how you'd make the necessary changes. \n",
"\n",
"\n",
"\n",

"\n",
"- what if you wanted to remove all numbers from a contract, to hide dates or prices?\n",
"- what if you wanted to hide a specific list of words, perhaps obscenities, rather than certain parts of speech?\n",
"- how would you change the program to replace hard-to-read words with their phonetic versions?\n",
"\n",
"d\n",
"\n",
"\n",
"- 契約書から日付や金額などの数字を消す\n",
"- 品詞によってではなく、禁止語など特定の単語を伏せる\n",
"- 難読語を読み仮名に変換する\n",
"\n",
"d"
]
"\n"]
}
],
"metadata": {
28 changes: 5 additions & 23 deletions en/5.1-5.2-language-generation.ipynb
@@ -383,20 +383,14 @@
"Here we are going to use [the JAQKET dataset](https://www.nlp.ecei.tohoku.ac.jp/projects/jaqket/), which is an open-domain question answering dataset developed and distribted by Tohoku University. The dataset includes common sense questions and their answers, where answers and candidates are always drawn from Wikipedia article titles, such as: \n",
"\n",
"\n",
"\n",

"\n",
"* Question: Which city is called \"the navel of Hokkaido\" due to its location, and is also famous for its lavender fields?\n",
"* Answer: Furano\n",
"* Candidates: Furano, Nayoro, Mikasa, Makubetsu, Kitami, ...\n",
"\n",
"d\n",
"\n",
"\n",
"* 質問: 北海道の中心に位置することから「北海道のへそ」を名乗る、ラベンダーで有名な都市はどこ?\n",
"* 答え: 富良野市\n",
"* 候補: 富良野市, 名寄市, 三笠市, 幕別町, 北見市, ...\n",
"\n",
"d\n",

"\n",
"First, let's download, format, and read the datasets (both the train and dev1 portions) so that we can evaluate the language model's quiz answering peformance on them. \n",
"\n"]
@@ -837,29 +831,17 @@
"id": "976cdee1",
"metadata": {},
"source": [
"\n",

"\n",
"You can solve a much wider range of NLP tasks with language models, and it's fun to think how you'd make them solve certain tasks by designing prompts or even fine-tuning if necessary. How would do go about solving the following tasks, for example?\n",
"\n",
"* Translation. Can Rinna translate between, say, Japanese and English?\n",
"* Arithmetic. Can Rinna answer simple math questions such as 6+7=?\n",
"* Word analogy. Can Rinna answer analogy questions such as Japan is to Yen as USA is to...?\n",
"\n",
"If you need some inspration, [the GPT-3 paper](https://arxiv.org/abs/2005.14165) has many examples.\n",
"\n",
"d\n",
"\n",
"If you need some inspiration, [the GPT-3 paper](https://arxiv.org/abs/2005.14165) has many examples.\n",
"\n",
"言語モデルを使って、もっと様々な NLP タスクを解くことができます。プロンプトを設計したり、必要に応じて微調整したりして、どうやったらタスクを解くようにできるかを考えるのも面白いでしょう。例えば、以下のタスクを解くにはどうしたら良いでしょうか?\n",
"\n",
"* 翻訳。りんなを使って、例えば、日本語と英語の翻訳をすることはできるでしょうか?\n",
"* 演算。りんなは、6+7=? のような簡単な算数の問題に答えることができるでしょうか?\n",
"* 単語の類推。日本→円、アメリカ→? のような類推問題に答えることができるでしょうか?\n",
"\n",
"もしヒント等が必要であれば、[GPT-3 の論文](https://arxiv.org/abs/2005.14165) にこのような例がたくさん載っています。\n",
"\n",
"d"
]
"\n"]
}
],
"metadata": {
30 changes: 6 additions & 24 deletions ja/2.1-fugashi-fuseji.ipynb
@@ -406,17 +406,9 @@
"\n",
"同形異音語(形は同じでも読み方が曖昧な単語)は未知語よりも対応が難しいです。同形異音語の例には以下のようなものがあります。 \n",
"\n",

"\n",
"- 東: *higashi* or *azuma* (or *tou*)\n",
"- 中田: *nakada* or *nakata*\n",
"- 仮名: *kana* or *kamei*\n",
"- 網代: *amishiro* or *ajiro*\n",
"- 最中: *saichuu* or *monaka*\n",
"- 私: *watashi* or *watakushi*\n",
"- 日本: *nihon* or *nippon*\n",
"\n",
"\n",
"\n",

"\n",
"- 東: ひがし、あずま、とう\n",
"- 中田: なかだ、なかた\n",
@@ -491,15 +483,9 @@
"\n",
"これで今回の伏せ字プログラムは完成となります。行数は決して多くはありませんが、これを書く過程で、下記の機能の使い方を紹介しました。\n",
"\n",

"\n",
"1. iterate over the tokens in a text\n",
"2. identify parts of speech of interest with example sentences\n",
"3. use multiple levels of part of speech tags\n",
"4. check if a token is in the dictionary or an unk\n",
"5. convert words to their phonetic representation\n",
"\n",
"\n",
"\n",

"\n",
"1. 文章の単語を一つずつ処理する方法\n",
"2. 例文を使って目的の品詞を特定する方法\n",
@@ -517,13 +503,9 @@
"\n",
"MeCab の API を更に深く理解するために、下記の場合、どうやってこの伏せ字プログラムを変更するか考えてみましょう。 \n",
"\n",

"\n",
"- what if you wanted to remove all numbers from a contract, to hide dates or prices?\n",
"- what if you wanted to hide a specific list of words, perhaps obscenities, rather than certain parts of speech?\n",
"- how would you change the program to replace hard-to-read words with their phonetic versions?\n",
"\n",
"\n",
"\n",

"\n",
"- 契約書から日付や金額などの数字を消す\n",
"- 品詞によってではなく、禁止語など特定の単語を伏せる\n",
20 changes: 4 additions & 16 deletions ja/5.1-5.2-language-generation.ipynb
@@ -402,13 +402,9 @@
"\n",
"ここでは、東北大学によって開発・配布されているオープンドメインの質問応答データセットである [JAQKET データセット](https://www.nlp.ecei.tohoku.ac.jp/projects/jaqket/) を使います。このデータセットには、以下のように常識問題とその答えが含まれており、答えと候補は必ず Wikipedia 記事のタイトルに対応するようになっています: \n",
"\n",

"\n",
"* Question: Which city is called \"the navel of Hokkaido\" due to its location, and is also famous for its lavender fields?\n",
"* Answer: Furano\n",
"* Candidates: Furano, Nayoro, Mikasa, Makubetsu, Kitami, ...\n",
"\n",
"\n",
"\n",

"\n",
"* 質問: 北海道の中心に位置することから「北海道のへそ」を名乗る、ラベンダーで有名な都市はどこ?\n",
"* 答え: 富良野市\n",
@@ -866,17 +862,9 @@
"id": "976cdee1",
"metadata": {},
"source": [

"\n",
"You can solve a much wider range of NLP tasks with language models, and it's fun to think how you'd make them solve certain tasks by designing prompts or even fine-tuning if necessary. How would do go about solving the following tasks, for example?\n",
"\n",
"* Translation. Can Rinna translate between, say, Japanese and English?\n",
"* Arithmetic. Can Rinna answer simple math questions such as 6+7=?\n",
"* Word analogy. Can Rinna answer analogy questions such as Japan is to Yen as USA is to...?\n",
"\n",
"If you need some inspration, [the GPT-3 paper](https://arxiv.org/abs/2005.14165) has many examples.\n",
"\n",
"\n",
"\n",

"\n",
"言語モデルを使って、もっと様々な NLP タスクを解くことができます。プロンプトを設計したり、必要に応じて微調整したりして、どうやったらタスクを解くようにできるかを考えるのも面白いでしょう。例えば、以下のタスクを解くにはどうしたら良いでしょうか?\n",
"\n",