diff --git a/notebooks/quick_start/quick_start_train.ipynb b/notebooks/quick_start/quick_start_train.ipynb new file mode 100644 index 0000000..f66ef9e --- /dev/null +++ b/notebooks/quick_start/quick_start_train.ipynb @@ -0,0 +1,3762 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "quick_start_train.ipynb", + "provenance": [], + "collapsed_sections": [ + "0-UErUopqefu", + "fBIzbRQTjNvp", + "PnEOIvI_tlX9", + "L7LtlfL6toUI", + "tXZbHG7UtqGj", + "PHvQCZlPtsP4", + "vTQ9mvEvtvZ7", + "Z8WZfNOetw3H" + ], + "authorship_tag": "ABX9TyNkAIwICbGRZOB6TBLhKw6U", + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "accelerator": "GPU" + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "G5aBD7iyu-NW" + }, + "source": [ + "# Part 1: Quick Start\r\n", + "\r\n", + "Part 1 gives you a quick walk-through of main AllenNLP concepts and features. We’ll build a complete, working NLP model (text classifier) along the way." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0-UErUopqefu" + }, + "source": [ + "# Introduction" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "l_AJqSjwjEnc" + }, + "source": [ + "## 1. What is text classification?\r\n", + "\r\n", + "Text classification is one of the simplest NLP tasks, where the model, given some input text, predicts a label for the text. See the figure below for an illustration.\r\n", + "\r\n", + "![text-classification.svg](https://raw.githubusercontent.com/allenai/allennlp-guide/master/static/part1/introduction/text-classification.svg)\r\n", + "\r\n", + "There are a variety of applications of text classification, such as spam filtering, sentiment analysis, and topic detection. Some examples are shown in the table below.\r\n", + "\r\n", + "|Application| Description | Input | Output |\r\n", + "|---|---|---| ---|\r\n", + "| Spam filtering | Detect and filter spam emails | Email | Spam / Not spam |\r\n", + "| Sentiment analysis | Detect the polarity of text | Tweet, review | Positive / Negative |\r\n", + "|Topic detection | Detect the topic of text | News article, blog post | Business / Tech / Sports |" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bzN7oAsml4nP" + }, + "source": [ + "## 2. Defining input and output\r\n", + "\r\n", + "The first step for building an NLP model is to define its input and output. In AllenNLP, each training example is represented by an `Instance` object. An `Instance` consists of one or more `Fields`, where each `Field` represents one piece of data used by your model, either as an input or an output. `Fields` well get converted to tensors and fed to your model. The [Reading Data Chapter](https://guide.allennlp.org/reading-data) provides more details on using `Instances` and `Fields` to represent textual data.\r\n", + "\r\n", + "For text classification, the input and the output are very simple. The model takes a `TextField` that represents the input text and predicts its label, which is represented by a `LabelField`:\r\n", + "\r\n", + "```\r\n", + "# Input\r\n", + "text: TextField\r\n", + "\r\n", + "# Output\r\n", + "label: LabelField\r\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4IaBHsvjoZqB" + }, + "source": [ + "## 3. 
Reading data\r\n", + "\r\n", + "![dataset-reader.png](https://raw.githubusercontent.com/allenai/allennlp-guide/master/static/part1/your-first-model/dataset-reader.svg)\r\n", + "\r\n", + "The first step for building an NLP application is to read the dataset and represent it with some internal data structure.\r\n", + "\r\n", + "AllenNLP uses `DatasetReaders` to read the data, whose job is to transform raw data files into `Instances` that match the input / ouput spec. Our spec for text classification is:\r\n", + "\r\n", + "```\r\n", + "# Inputs\r\n", + "text: TextField\r\n", + "\r\n", + "# Outputs\r\n", + "label: LabelField\r\n", + "```\r\n", + "\r\n", + "We'll want one `Field` for the input and another for the output, and our model will use the inputs to predict the outputs.\r\n", + "\r\n", + "We assume the dataset has a simple data file format: \r\n", + "```\r\n", + "[text] [TAB] [label]\r\n", + "```\r\n", + "\r\n", + "for example:\r\n", + "\r\n", + "```\r\n", + "I like this movie a lot! [TAB] positive\r\n", + "This was a monstrous waste of time [TAB] negative\r\n", + "AllenNLP is amazing [TAB] positive\r\n", + "Why does this have to be so complicated? [TAB] negative\r\n", + "This sentence expresses no sentiment [TAB] neutral\r\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PwEIQbY4qlgj" + }, + "source": [ + "# Let's begin to code" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fBIzbRQTjNvp" + }, + "source": [ + "# Imports\r\n", + "\r\n", + "At first, we will import the required libraries." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7qI8gsjACM0j" + }, + "source": [ + "import tempfile\r\n", + "from typing import Dict, Iterable, List, Tuple" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "-eB_o0cysgoe", + "outputId": "1b3e5d23-0439-4c94-d600-bfde011d70fd" + }, + "source": [ + "!pip install allennlp" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Collecting allennlp\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/e7/bd/c75fa01e3deb9322b637fe0be45164b40d43747661aca9195b5fb334947c/allennlp-2.1.0-py3-none-any.whl (585kB)\n", + "\u001b[K |████████████████████████████████| 593kB 19.0MB/s \n", + "\u001b[?25hRequirement already satisfied: spacy<3.1,>=2.1.0 in /usr/local/lib/python3.7/dist-packages (from allennlp) (2.2.4)\n", + "Requirement already satisfied: scipy in /usr/local/lib/python3.7/dist-packages (from allennlp) (1.4.1)\n", + "Requirement already satisfied: scikit-learn in /usr/local/lib/python3.7/dist-packages (from allennlp) (0.22.2.post1)\n", + "Requirement already satisfied: pytest in /usr/local/lib/python3.7/dist-packages (from allennlp) (3.6.4)\n", + "Requirement already satisfied: more-itertools in /usr/local/lib/python3.7/dist-packages (from allennlp) (8.7.0)\n", + "Collecting jsonnet>=0.10.0; sys_platform != \"win32\"\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/42/40/6f16e5ac994b16fa71c24310f97174ce07d3a97b433275589265c6b94d2b/jsonnet-0.17.0.tar.gz (259kB)\n", + "\u001b[K |████████████████████████████████| 266kB 54.1MB/s \n", + "\u001b[?25hRequirement already satisfied: tqdm>=4.19 in /usr/local/lib/python3.7/dist-packages (from allennlp) (4.41.1)\n", + "Collecting boto3<2.0,>=1.14\n", + "\u001b[?25l Downloading 
https://files.pythonhosted.org/packages/bd/c8/b5aac643697038ef6eb8c11c73b9ee9c2dc8cb2bc95cda2d4ee656167644/boto3-1.17.17-py2.py3-none-any.whl (130kB)\n", + "\u001b[K |████████████████████████████████| 133kB 24.8MB/s \n", + "\u001b[?25hRequirement already satisfied: requests>=2.18 in /usr/local/lib/python3.7/dist-packages (from allennlp) (2.23.0)\n", + "Requirement already satisfied: filelock<3.1,>=3.0 in /usr/local/lib/python3.7/dist-packages (from allennlp) (3.0.12)\n", + "Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from allennlp) (1.19.5)\n", + "Requirement already satisfied: nltk in /usr/local/lib/python3.7/dist-packages (from allennlp) (3.2.5)\n", + "Collecting transformers<4.4,>=4.1\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/f9/54/5ca07ec9569d2f232f3166de5457b63943882f7950ddfcc887732fc7fb23/transformers-4.3.3-py3-none-any.whl (1.9MB)\n", + "\u001b[K |████████████████████████████████| 1.9MB 56.2MB/s \n", + "\u001b[?25hRequirement already satisfied: torchvision<0.9.0,>=0.8.1 in /usr/local/lib/python3.7/dist-packages (from allennlp) (0.8.2+cu101)\n", + "Collecting overrides==3.1.0\n", + " Downloading https://files.pythonhosted.org/packages/ff/b1/10f69c00947518e6676bbd43e739733048de64b8dd998e9c2d5a71f44c5d/overrides-3.1.0.tar.gz\n", + "Requirement already satisfied: lmdb in /usr/local/lib/python3.7/dist-packages (from allennlp) (0.99)\n", + "Requirement already satisfied: torch<1.8.0,>=1.6.0 in /usr/local/lib/python3.7/dist-packages (from allennlp) (1.7.1+cu101)\n", + "Collecting tensorboardX>=1.2\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/af/0c/4f41bcd45db376e6fe5c619c01100e9b7531c55791b7244815bac6eac32c/tensorboardX-2.1-py2.py3-none-any.whl (308kB)\n", + "\u001b[K |████████████████████████████████| 317kB 52.2MB/s \n", + "\u001b[?25hRequirement already satisfied: h5py in /usr/local/lib/python3.7/dist-packages (from allennlp) (2.10.0)\n", + "Collecting jsonpickle\n", + " Downloading https://files.pythonhosted.org/packages/bb/1a/f2db026d4d682303793559f1c2bb425ba3ec0d6fd7ac63397790443f2461/jsonpickle-2.0.0-py2.py3-none-any.whl\n", + "Collecting sentencepiece\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/f5/99/e0808cb947ba10f575839c43e8fafc9cc44e4a7a2c8f79c60db48220a577/sentencepiece-0.1.95-cp37-cp37m-manylinux2014_x86_64.whl (1.2MB)\n", + "\u001b[K |████████████████████████████████| 1.2MB 42.3MB/s \n", + "\u001b[?25hRequirement already satisfied: wasabi<1.1.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1,>=2.1.0->allennlp) (0.8.2)\n", + "Requirement already satisfied: thinc==7.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1,>=2.1.0->allennlp) (7.4.0)\n", + "Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from spacy<3.1,>=2.1.0->allennlp) (53.0.0)\n", + "Requirement already satisfied: catalogue<1.1.0,>=0.0.7 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1,>=2.1.0->allennlp) (1.0.0)\n", + "Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1,>=2.1.0->allennlp) (2.0.5)\n", + "Requirement already satisfied: plac<1.2.0,>=0.9.6 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1,>=2.1.0->allennlp) (1.1.3)\n", + "Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1,>=2.1.0->allennlp) (3.0.5)\n", + "Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in 
/usr/local/lib/python3.7/dist-packages (from spacy<3.1,>=2.1.0->allennlp) (1.0.5)\n", + "Requirement already satisfied: srsly<1.1.0,>=1.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1,>=2.1.0->allennlp) (1.0.5)\n", + "Requirement already satisfied: blis<0.5.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1,>=2.1.0->allennlp) (0.4.1)\n", + "Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn->allennlp) (1.0.1)\n", + "Requirement already satisfied: attrs>=17.4.0 in /usr/local/lib/python3.7/dist-packages (from pytest->allennlp) (20.3.0)\n", + "Requirement already satisfied: atomicwrites>=1.0 in /usr/local/lib/python3.7/dist-packages (from pytest->allennlp) (1.4.0)\n", + "Requirement already satisfied: six>=1.10.0 in /usr/local/lib/python3.7/dist-packages (from pytest->allennlp) (1.15.0)\n", + "Requirement already satisfied: pluggy<0.8,>=0.5 in /usr/local/lib/python3.7/dist-packages (from pytest->allennlp) (0.7.1)\n", + "Requirement already satisfied: py>=1.5.0 in /usr/local/lib/python3.7/dist-packages (from pytest->allennlp) (1.10.0)\n", + "Collecting s3transfer<0.4.0,>=0.3.0\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/ea/43/4b4a1b26eb03a429a4c37ca7fdf369d938bd60018fc194e94b8379b0c77c/s3transfer-0.3.4-py2.py3-none-any.whl (69kB)\n", + "\u001b[K |████████████████████████████████| 71kB 10.6MB/s \n", + "\u001b[?25hCollecting jmespath<1.0.0,>=0.7.1\n", + " Downloading https://files.pythonhosted.org/packages/07/cb/5f001272b6faeb23c1c9e0acc04d48eaaf5c862c17709d20e3469c6e0139/jmespath-0.10.0-py2.py3-none-any.whl\n", + "Collecting botocore<1.21.0,>=1.20.17\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/e6/fb/7ea265e28306dde068c74e6792affd4df43e51784384829c69142042ad56/botocore-1.20.17-py2.py3-none-any.whl (7.3MB)\n", + "\u001b[K |████████████████████████████████| 7.3MB 36.7MB/s \n", + "\u001b[?25hRequirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests>=2.18->allennlp) (2020.12.5)\n", + "Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests>=2.18->allennlp) (2.10)\n", + "Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests>=2.18->allennlp) (3.0.4)\n", + "Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests>=2.18->allennlp) (1.24.3)\n", + "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.7/dist-packages (from transformers<4.4,>=4.1->allennlp) (2019.12.20)\n", + "Collecting tokenizers<0.11,>=0.10.1\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/71/23/2ddc317b2121117bf34dd00f5b0de194158f2a44ee2bf5e47c7166878a97/tokenizers-0.10.1-cp37-cp37m-manylinux2010_x86_64.whl (3.2MB)\n", + "\u001b[K |████████████████████████████████| 3.2MB 45.0MB/s \n", + "\u001b[?25hCollecting sacremoses\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)\n", + "\u001b[K |████████████████████████████████| 890kB 48.0MB/s \n", + "\u001b[?25hRequirement already satisfied: packaging in /usr/local/lib/python3.7/dist-packages (from transformers<4.4,>=4.1->allennlp) (20.9)\n", + "Requirement already satisfied: importlib-metadata; python_version < \"3.8\" in /usr/local/lib/python3.7/dist-packages (from 
transformers<4.4,>=4.1->allennlp) (3.7.0)\n", + "Requirement already satisfied: pillow>=4.1.1 in /usr/local/lib/python3.7/dist-packages (from torchvision<0.9.0,>=0.8.1->allennlp) (7.0.0)\n", + "Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from torch<1.8.0,>=1.6.0->allennlp) (3.7.4.3)\n", + "Requirement already satisfied: protobuf>=3.8.0 in /usr/local/lib/python3.7/dist-packages (from tensorboardX>=1.2->allennlp) (3.12.4)\n", + "Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /usr/local/lib/python3.7/dist-packages (from botocore<1.21.0,>=1.20.17->boto3<2.0,>=1.14->allennlp) (2.8.1)\n", + "Requirement already satisfied: click in /usr/local/lib/python3.7/dist-packages (from sacremoses->transformers<4.4,>=4.1->allennlp) (7.1.2)\n", + "Requirement already satisfied: pyparsing>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging->transformers<4.4,>=4.1->allennlp) (2.4.7)\n", + "Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata; python_version < \"3.8\"->transformers<4.4,>=4.1->allennlp) (3.4.0)\n", + "Building wheels for collected packages: jsonnet, overrides, sacremoses\n", + " Building wheel for jsonnet (setup.py) ... \u001b[?25l\u001b[?25hdone\n", + " Created wheel for jsonnet: filename=jsonnet-0.17.0-cp37-cp37m-linux_x86_64.whl size=3388751 sha256=6073d1c844e65d56543bd252922dd1baab05914c73ada0274b1cd8c3cef3c1e3\n", + " Stored in directory: /root/.cache/pip/wheels/26/7a/37/7dbcc30a6b4efd17b91ad1f0128b7bbf84813bd4e1cfb8c1e3\n", + " Building wheel for overrides (setup.py) ... \u001b[?25l\u001b[?25hdone\n", + " Created wheel for overrides: filename=overrides-3.1.0-cp37-none-any.whl size=10174 sha256=6ef02ec8f7b45262da13d506c59a9242680e17855b876b6d04db1084dddeee65\n", + " Stored in directory: /root/.cache/pip/wheels/5c/24/13/6ef8600e6f147c95e595f1289a86a3cc82ed65df57582c65a9\n", + " Building wheel for sacremoses (setup.py) ... 
\u001b[?25l\u001b[?25hdone\n", + " Created wheel for sacremoses: filename=sacremoses-0.0.43-cp37-none-any.whl size=893262 sha256=7028d980b3196f17b1963dde3dcdd9bdcc8d3f21445ece73beab86eeeb572564\n", + " Stored in directory: /root/.cache/pip/wheels/29/3c/fd/7ce5c3f0666dab31a50123635e6fb5e19ceb42ce38d4e58f45\n", + "Successfully built jsonnet overrides sacremoses\n", + "\u001b[31mERROR: botocore 1.20.17 has requirement urllib3<1.27,>=1.25.4, but you'll have urllib3 1.24.3 which is incompatible.\u001b[0m\n", + "Installing collected packages: jsonnet, jmespath, botocore, s3transfer, boto3, tokenizers, sacremoses, transformers, overrides, tensorboardX, jsonpickle, sentencepiece, allennlp\n", + "Successfully installed allennlp-2.1.0 boto3-1.17.17 botocore-1.20.17 jmespath-0.10.0 jsonnet-0.17.0 jsonpickle-2.0.0 overrides-3.1.0 s3transfer-0.3.4 sacremoses-0.0.43 sentencepiece-0.1.95 tensorboardX-2.1 tokenizers-0.10.1 transformers-4.3.3\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "m9bkrj3asXE3" + }, + "source": [ + "import allennlp\r\n", + "import torch\r\n", + "from allennlp.data import (\r\n", + " DataLoader,\r\n", + " DatasetReader,\r\n", + " Instance,\r\n", + " Vocabulary,\r\n", + " TextFieldTensors,\r\n", + ")\r\n", + "from allennlp.data.data_loaders import SimpleDataLoader\r\n", + "from allennlp.data.fields import LabelField, TextField\r\n", + "from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer\r\n", + "from allennlp.data.tokenizers import Token, Tokenizer, WhitespaceTokenizer\r\n", + "from allennlp.models import Model\r\n", + "from allennlp.modules import TextFieldEmbedder, Seq2VecEncoder\r\n", + "from allennlp.modules.seq2vec_encoders import BagOfEmbeddingsEncoder\r\n", + "from allennlp.modules.token_embedders import Embedding\r\n", + "from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder\r\n", + "from allennlp.nn import util\r\n", + "from allennlp.training.trainer import GradientDescentTrainer, Trainer\r\n", + "from allennlp.training.optimizers import AdamOptimizer\r\n", + "from allennlp.training.metrics import CategoricalAccuracy" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5o2gOAXZnW_O" + }, + "source": [ + "# Making a DatasetReader" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5QkolYE0rAOs" + }, + "source": [ + "You can implement your own `DatasetReader` by inheriting from the `DatasetReader` class. 
At minimum, you need to override the `_read()` method, which reads the input and yields `Instances`." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sJeqKLlEseA7" + }, + "source": [ + "class ClassificationTsvReader(DatasetReader):\r\n", + " def __init__(\r\n", + " self,\r\n", + " tokenizer: Tokenizer = None,\r\n", + " token_indexers: Dict[str, TokenIndexer] = None,\r\n", + " max_tokens: int = None,\r\n", + " **kwargs\r\n", + " ):\r\n", + " super().__init__(**kwargs)\r\n", + " self.tokenizer = tokenizer or WhitespaceTokenizer()\r\n", + " self.token_indexers = token_indexers or {\"tokens\": SingleIdTokenIndexer()}\r\n", + " self.max_tokens = max_tokens\r\n", + "\r\n", + " def _read(self, file_path: str) -> Iterable[Instance]:\r\n", + " with open(file_path, \"r\") as lines:\r\n", + " for line in lines:\r\n", + " text, sentiment = line.strip().split(\"\\t\")\r\n", + " tokens = self.tokenizer.tokenize(text)\r\n", + " if self.max_tokens:\r\n", + " tokens = tokens[: self.max_tokens]\r\n", + " text_field = TextField(tokens, self.token_indexers)\r\n", + " label_field = LabelField(sentiment)\r\n", + " yield Instance({\"text\": text_field, \"label\": label_field})\r\n" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eUhKJOfMr1KG" + }, + "source": [ + "This is a minimal `DatasetReader` that will return a list of classification `Instances` when you call `reader.read(file)`. This reader will take each line in the input file, split the `text` into words using a tokenizer (the `WhitespaceTokenizer` shown here simply splits on whitespace; a `SpacyTokenizer` backed by [spaCy](https://spacy.io/) would also work), and represent those words as tensors using word ids from a vocabulary we construct for you.\r\n", + "\r\n", + "Pay special attention to the `text` and `label` keys that are used in the fields dictionary passed to the `Instance` - these keys will be used as parameter names when passing tensors into your `Model` later.\r\n", + "\r\n", + "Ideally, the output label would be optional when we create the `Instances`, so that we can use the same code to make predictions on unlabeled data (say, in a demo), but for the rest of this chapter we’ll keep things simple and ignore that.\r\n", + "\r\n", + "There are lots of places where this could be made better for a more flexible and fully-featured reader; see the section on [DatasetReaders](https://guide.allennlp.org/reading-data#2) for a deeper dive." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PnEOIvI_tlX9" + }, + "source": [ + "# Model \r\n", + "\r\n", + "![allennlp-model](https://raw.githubusercontent.com/allenai/allennlp-guide/master/static/part1/your-first-model/allennlp-model.svg)\r\n", + "\r\n", + "Now that we know what our model is going to do, we need to implement it. First, we’ll say a few words about how `Models` work in AllenNLP:\r\n", + "\r\n", + "- An AllenNLP `Model` is just a PyTorch `Module`\r\n", + "- It implements a `forward()` method, and requires the output to be a dictionary\r\n", + "- Its output contains a `loss` key during training, which is used to optimize the model.\r\n", + "\r\n", + "Our training loop takes a batch of `Instances`, passes it through `Model.forward()`, grabs the `loss` key from the resulting dictionary, and uses backprop to compute gradients and update the model’s parameters. You don’t have to implement the training loop—all this will be taken care of by AllenNLP (though you can if you want to).\r\n",
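+        "\r\n",
+        "To make that loop concrete, here is a rough sketch of what one epoch amounts to. This is only an illustration, not AllenNLP's actual trainer code: it assumes a `model`, a `data_loader`, and an `optimizer` like the ones this notebook builds further down, where the `GradientDescentTrainer` handles all of this for us:\r\n",
+        "\r\n",
+        "```python\r\n",
+        "# Illustrative sketch only; AllenNLP's trainer runs the equivalent of this for you.\r\n",
+        "for batch in data_loader:\r\n",
+        "    optimizer.zero_grad()\r\n",
+        "    output_dict = model(**batch)   # batch is a dict keyed by our field names ('text', 'label')\r\n",
+        "    loss = output_dict['loss']     # the 'loss' key described above\r\n",
+        "    loss.backward()                # backprop to compute gradients\r\n",
+        "    optimizer.step()               # update the model's parameters\r\n",
+        "```"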
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6LPb2OGx-_oI" + }, + "source": [ + "## Constructing the Model\r\n", + "\r\n", + "In the `Model` constructor, we need to instantiate all of the parameters that we will want to train. In AllenNLP, [we recommend](https://guide.allennlp.org/using-config-files#1) taking most of these parameters as constructor arguments, so that we can configure the behavior of our model without changing the model code itself, and so that we can think at a higher level about what our model is doing. The constructor for our text classification model looks like this:\r\n", + "\r\n", + "```python\r\n", + "@Model.register('simple_classifier')\r\n", + "class SimpleClassifier(Model):\r\n", + " def __init__(self,\r\n", + " vocab: Vocabulary,\r\n", + " embedder: TextFieldEmbedder,\r\n", + " encoder: Seq2VecEncoder):\r\n", + " super().__init__(vocab)\r\n", + " self.embedder = embedder\r\n", + " self.encoder = encoder\r\n", + " num_labels = vocab.get_vocab_size(\"labels\")\r\n", + " self.classifier = torch.nn.Linear(encoder.get_output_dim(), num_labels)\r\n", + "```\r\n", + "\r\n", + "You’ll notice that we use type annotations a lot in AllenNLP code - this is both for code readability (it’s way easier to understand what a method does if you know the types of its arguments, instead of just their names), and because we use these annotations to do some magic for you in some cases.\r\n", + "\r\n", + "One of those cases is constructor parameters, where we can automatically construct the embedder and encoder from a configuration file using these type annotations. See the chapter on [configuration files](https://guide.allennlp.org/using-config-files) for more information. That chapter will also tell you about the call to `@Model.register().`\r\n", + "\r\n", + "The upshot is that if you’re using the `allennlp train` command with a configuration file (which we show how to do below), you won’t ever have to call this constructor, it all gets taken care of for you." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WSZTinog_7TY" + }, + "source": [ + "### Passing the vocabulary\r\n", + "\r\n", + "
```python\r\n",
+        "@Model.register('simple_classifier')\r\n",
+        "class SimpleClassifier(Model):\r\n",
+        "    def __init__(self,\r\n",
+        "                 vocab: Vocabulary,\r\n",
+        "                 embedder: TextFieldEmbedder,\r\n",
+        "                 encoder: Seq2VecEncoder):\r\n",
+        "        super().__init__(vocab)\r\n",
+        "        self.embedder = embedder\r\n",
+        "        self.encoder = encoder\r\n",
+        "        num_labels = vocab.get_vocab_size(\"labels\")\r\n",
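+        "        # \"labels\" is the default namespace that LabelField puts labels into\r\n",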
+        "        self.classifier = torch.nn.Linear(encoder.get_output_dim(), num_labels)\r\n",
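+        "        # maps the encoded document vector to one score (logit) per label\r\n",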
+        "```
\r\n", + "\r\n", + "`Vocabulary` manages mappings between vocabulary items (such as words and labels) and their integer IDs. In our prebuilt training loop, the vocabulary gets created by AllenNLP after reading your training data, then passed to the `Model` when it gets constructed. We’ll find all tokens and labels that you use and assign them all integer IDs in separate namespaces. The way that this happens is fully configurable; see the [Vocabulary section of this guide](https://guide.allennlp.org/reading-data#3) for more information.\r\n", + "\r\n", + "What we did in the `DatasetReader` will put the labels in the default “labels” namespace, and we grab the number of labels from the vocabulary on line 10.\r\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "aRz02XVes8ss" + }, + "source": [ + "class SimpleClassifier(Model):\r\n", + " def __init__(\r\n", + " self, vocab: Vocabulary, embedder: TextFieldEmbedder, encoder: Seq2VecEncoder\r\n", + " ):\r\n", + " super().__init__(vocab)\r\n", + " self.embedder = embedder\r\n", + " self.encoder = encoder\r\n", + " num_labels = vocab.get_vocab_size(\"labels\")\r\n", + " self.classifier = torch.nn.Linear(encoder.get_output_dim(), num_labels)\r\n", + "\r\n", + " def forward(\r\n", + " self, text: TextFieldTensors, label: torch.Tensor\r\n", + " ) -> Dict[str, torch.Tensor]:\r\n", + " print(\"In model.forward(); printing here just because binder is so slow\")\r\n", + " # Shape: (batch_size, num_tokens, embedding_dim)\r\n", + " embedded_text = self.embedder(text)\r\n", + " # Shape: (batch_size, num_tokens)\r\n", + " mask = util.get_text_field_mask(text)\r\n", + " # Shape: (batch_size, encoding_dim)\r\n", + " encoded_text = self.encoder(embedded_text, mask)\r\n", + " # Shape: (batch_size, num_labels)\r\n", + " logits = self.classifier(encoded_text)\r\n", + " # Shape: (batch_size, num_labels)\r\n", + " probs = torch.nn.functional.softmax(logits, dim=-1)\r\n", + " # Shape: (1,)\r\n", + " loss = torch.nn.functional.cross_entropy(logits, label)\r\n", + " return {\"loss\": loss, \"probs\": probs}\r\n" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "L7LtlfL6toUI" + }, + "source": [ + "# TODO" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "66DJqZoItFYL" + }, + "source": [ + "def build_dataset_reader() -> DatasetReader:\r\n", + " return ClassificationTsvReader()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tXZbHG7UtqGj" + }, + "source": [ + "# TODO" + ] + }, + { + "cell_type": "code", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "yZxwglXGtIVj", + "outputId": "3f05b597-9a23-46da-b7ce-c94b8f3ad5a4" + }, + "source": [ + "!wget \"https://raw.githubusercontent.com/allenai/allennlp-guide/master/quick_start/data/movie_review/train.tsv\"\r\n", + "!wget \"https://raw.githubusercontent.com/allenai/allennlp-guide/master/quick_start/data/movie_review/dev.tsv\"\r\n", + "!wget \"https://raw.githubusercontent.com/allenai/allennlp-guide/master/quick_start/data/movie_review/test.tsv\"" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "--2021-02-27 13:40:37-- https://raw.githubusercontent.com/allenai/allennlp-guide/master/quick_start/data/movie_review/train.tsv\n", + "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 
185.199.111.133, 185.199.110.133, 185.199.108.133, ...\n", + "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.\n", + "HTTP request sent, awaiting response... 200 OK\n", + "Length: 6175540 (5.9M) [text/plain]\n", + "Saving to: ‘train.tsv’\n", + "\n", + "train.tsv 100%[===================>] 5.89M 24.6MB/s in 0.2s \n", + "\n", + "2021-02-27 13:40:38 (24.6 MB/s) - ‘train.tsv’ saved [6175540/6175540]\n", + "\n", + "--2021-02-27 13:40:38-- https://raw.githubusercontent.com/allenai/allennlp-guide/master/quick_start/data/movie_review/dev.tsv\n", + "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...\n", + "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.\n", + "HTTP request sent, awaiting response... 200 OK\n", + "Length: 744425 (727K) [text/plain]\n", + "Saving to: ‘dev.tsv’\n", + "\n", + "dev.tsv 100%[===================>] 726.98K --.-KB/s in 0.06s \n", + "\n", + "2021-02-27 13:40:38 (11.4 MB/s) - ‘dev.tsv’ saved [744425/744425]\n", + "\n", + "--2021-02-27 13:40:38-- https://raw.githubusercontent.com/allenai/allennlp-guide/master/quick_start/data/movie_review/test.tsv\n", + "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n", + "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n", + "HTTP request sent, awaiting response... 200 OK\n", + "Length: 809416 (790K) [text/plain]\n", + "Saving to: ‘test.tsv.1’\n", + "\n", + "test.tsv.1 100%[===================>] 790.45K 4.53MB/s in 0.2s \n", + "\n", + "2021-02-27 13:40:39 (4.53 MB/s) - ‘test.tsv.1’ saved [809416/809416]\n", + "\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PHvQCZlPtsP4" + }, + "source": [ + "# TODO" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FxpoFs5uuOLc" + }, + "source": [ + "def read_data(reader: DatasetReader) -> Tuple[List[Instance], List[Instance]]:\r\n", + " print(\"Reading data\")\r\n", + " training_data = list(reader.read(\"train.tsv\"))\r\n", + " validation_data = list(reader.read(\"dev.tsv\"))\r\n", + " return training_data, validation_data" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "X_qBH-KYtzHy" + }, + "source": [ + "def build_vocab(instances: Iterable[Instance]) -> Vocabulary:\r\n", + " print(\"Building the vocabulary\")\r\n", + " return Vocabulary.from_instances(instances)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "UkcW51vbt167" + }, + "source": [ + "def build_model(vocab: Vocabulary) -> Model:\r\n", + " print(\"Building the model\")\r\n", + " vocab_size = vocab.get_vocab_size(\"tokens\")\r\n", + " embedder = BasicTextFieldEmbedder(\r\n", + " {\"tokens\": Embedding(embedding_dim=10, num_embeddings=vocab_size)}\r\n", + " )\r\n", + " encoder = BagOfEmbeddingsEncoder(embedding_dim=10)\r\n", + " return SimpleClassifier(vocab, embedder, encoder)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vTQ9mvEvtvZ7" + }, + "source": [ + "# TODO" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "USxbc3zAuYV9" + }, + "source": [ + "def run_training_loop():\r\n", + " dataset_reader = build_dataset_reader()\r\n", + "\r\n", + " train_data, dev_data = read_data(dataset_reader)\r\n", 
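+        "\r\n",
+        "    # The vocabulary is built from the train and dev instances; the data loaders\r\n",
+        "    # are then indexed with it so that tokens and labels can become tensor ids.\r\n",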
+ "\r\n", + " vocab = build_vocab(train_data + dev_data)\r\n", + " model = build_model(vocab)\r\n", + "\r\n", + " train_loader, dev_loader = build_data_loaders(train_data, dev_data)\r\n", + " train_loader.index_with(vocab)\r\n", + " dev_loader.index_with(vocab)\r\n", + "\r\n", + " # You obviously won't want to create a temporary file for your training\r\n", + " # results, but for execution in binder for this guide, we need to do this.\r\n", + " with tempfile.TemporaryDirectory() as serialization_dir:\r\n", + " trainer = build_trainer(model, serialization_dir, train_loader, dev_loader)\r\n", + " print(\"Starting training\")\r\n", + " trainer.train()\r\n", + " print(\"Finished training\")\r\n", + "\r\n", + " return model, dataset_reader" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "BoI0JGPiudVD" + }, + "source": [ + "def build_data_loaders(\r\n", + " train_data: List[Instance],\r\n", + " dev_data: List[Instance],\r\n", + ") -> Tuple[DataLoader, DataLoader]:\r\n", + " train_loader = SimpleDataLoader(train_data, 8, shuffle=True)\r\n", + " dev_loader = SimpleDataLoader(dev_data, 8, shuffle=False)\r\n", + " return train_loader, dev_loader" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "AZZZXFVJufkb" + }, + "source": [ + "def build_trainer(\r\n", + " model: Model,\r\n", + " serialization_dir: str,\r\n", + " train_loader: DataLoader,\r\n", + " dev_loader: DataLoader,\r\n", + ") -> Trainer:\r\n", + " parameters = [(n, p) for n, p in model.named_parameters() if p.requires_grad]\r\n", + " optimizer = AdamOptimizer(parameters) # type: ignore\r\n", + " trainer = GradientDescentTrainer(\r\n", + " model=model,\r\n", + " serialization_dir=serialization_dir,\r\n", + " data_loader=train_loader,\r\n", + " validation_data_loader=dev_loader,\r\n", + " num_epochs=5,\r\n", + " optimizer=optimizer,\r\n", + " )\r\n", + " return trainer" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Z8WZfNOetw3H" + }, + "source": [ + "# TODO" + ] + }, + { + "cell_type": "code", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Uufus4Rkui67", + "outputId": "d3301a14-0198-4e44-f139-18ee2c235bb0" + }, + "source": [ + "run_training_loop()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Reading data\n" + ], + "name": "stdout" + }, + { + "output_type": "stream", + "text": [ + "building vocab: 12%|#1 | 208/1800 [00:00<00:00, 2073.55it/s]" + ], + "name": "stderr" + }, + { + "output_type": "stream", + "text": [ + "Building the vocabulary\n" + ], + "name": "stdout" + }, + { + "output_type": "stream", + "text": [ + "building vocab: 100%|##########| 1800/1800 [00:01<00:00, 1702.41it/s]\n" + ], + "name": "stderr" + }, + { + "output_type": "stream", + "text": [ + "Building the model\n" + ], + "name": "stdout" + }, + { + "output_type": "stream", + "text": [ + "You provided a validation dataset but patience was set to None, meaning that early stopping is disabled\n", + " 0%| | 0/200 [00:00)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 16 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "eMel_Vclukh6" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/notebooks/quick_start/your_first_model.ipynb b/notebooks/quick_start/your_first_model.ipynb new file mode 100644 index 0000000..65da293 
--- /dev/null +++ b/notebooks/quick_start/your_first_model.ipynb @@ -0,0 +1,907 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "your_first_model.ipynb", + "provenance": [], + "collapsed_sections": [ + "0-UErUopqefu", + "fBIzbRQTjNvp", + "PnEOIvI_tlX9", + "L7LtlfL6toUI", + "tXZbHG7UtqGj", + "PHvQCZlPtsP4", + "vTQ9mvEvtvZ7", + "Z8WZfNOetw3H" + ], + "authorship_tag": "ABX9TyOMvfA+Za1JEpBY0RxRqqIu", + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "accelerator": "GPU" + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "G5aBD7iyu-NW" + }, + "source": [ + "# Part 1: Quick Start\r\n", + "\r\n", + "Part 1 gives you a quick walk-through of main AllenNLP concepts and features. We’ll build a complete, working NLP model (text classifier) along the way." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0-UErUopqefu" + }, + "source": [ + "# Introduction" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "l_AJqSjwjEnc" + }, + "source": [ + "## 1. What is text classification?\r\n", + "\r\n", + "Text classification is one of the simplest NLP tasks, where the model, given some input text, predicts a label for the text. See the figure below for an illustration.\r\n", + "\r\n", + "![text-classification.svg](https://raw.githubusercontent.com/allenai/allennlp-guide/master/static/part1/introduction/text-classification.svg)\r\n", + "\r\n", + "There are a variety of applications of text classification, such as spam filtering, sentiment analysis, and topic detection. Some examples are shown in the table below.\r\n", + "\r\n", + "|Application| Description | Input | Output |\r\n", + "|---|---|---| ---|\r\n", + "| Spam filtering | Detect and filter spam emails | Email | Spam / Not spam |\r\n", + "| Sentiment analysis | Detect the polarity of text | Tweet, review | Positive / Negative |\r\n", + "|Topic detection | Detect the topic of text | News article, blog post | Business / Tech / Sports |" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bzN7oAsml4nP" + }, + "source": [ + "## 2. Defining input and output\r\n", + "\r\n", + "The first step for building an NLP model is to define its input and output. In AllenNLP, each training example is represented by an `Instance` object. An `Instance` consists of one or more `Fields`, where each `Field` represents one piece of data used by your model, either as an input or an output. `Fields` well get converted to tensors and fed to your model. The [Reading Data Chapter](https://guide.allennlp.org/reading-data) provides more details on using `Instances` and `Fields` to represent textual data.\r\n", + "\r\n", + "For text classification, the input and the output are very simple. The model takes a `TextField` that represents the input text and predicts its label, which is represented by a `LabelField`:\r\n", + "\r\n", + "```\r\n", + "# Input\r\n", + "text: TextField\r\n", + "\r\n", + "# Output\r\n", + "label: LabelField\r\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4IaBHsvjoZqB" + }, + "source": [ + "## 3. 
Reading data\r\n", + "\r\n", + "![dataset-reader.png](https://raw.githubusercontent.com/allenai/allennlp-guide/master/static/part1/your-first-model/dataset-reader.svg)\r\n", + "\r\n", + "The first step for building an NLP application is to read the dataset and represent it with some internal data structure.\r\n", + "\r\n", + "AllenNLP uses `DatasetReaders` to read the data, whose job is to transform raw data files into `Instances` that match the input / ouput spec. Our spec for text classification is:\r\n", + "\r\n", + "```\r\n", + "# Inputs\r\n", + "text: TextField\r\n", + "\r\n", + "# Outputs\r\n", + "label: LabelField\r\n", + "```\r\n", + "\r\n", + "We'll want one `Field` for the input and another for the output, and our model will use the inputs to predict the outputs.\r\n", + "\r\n", + "We assume the dataset has a simple data file format: \r\n", + "```\r\n", + "[text] [TAB] [label]\r\n", + "```\r\n", + "\r\n", + "for example:\r\n", + "\r\n", + "```\r\n", + "I like this movie a lot! [TAB] positive\r\n", + "This was a monstrous waste of time [TAB] negative\r\n", + "AllenNLP is amazing [TAB] positive\r\n", + "Why does this have to be so complicated? [TAB] negative\r\n", + "This sentence expresses no sentiment [TAB] neutral\r\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PwEIQbY4qlgj" + }, + "source": [ + "# Let's begin to code" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fBIzbRQTjNvp" + }, + "source": [ + "# Imports\r\n", + "\r\n", + "At first, we will import the required libraries." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7qI8gsjACM0j" + }, + "source": [ + "import tempfile\r\n", + "from typing import Dict, Iterable, List, Tuple" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "-eB_o0cysgoe", + "outputId": "6abc68e0-f9b6-45b9-89a9-1f46033ce460" + }, + "source": [ + "!pip install allennlp" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Collecting allennlp\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/e7/bd/c75fa01e3deb9322b637fe0be45164b40d43747661aca9195b5fb334947c/allennlp-2.1.0-py3-none-any.whl (585kB)\n", + "\u001b[K |████████████████████████████████| 593kB 8.5MB/s \n", + "\u001b[?25hCollecting boto3<2.0,>=1.14\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/48/84/7403268cd52f7d420fd0e2b3bdf524a440d8b2eda6097daeb0a5c55b3e49/boto3-1.17.22-py2.py3-none-any.whl (130kB)\n", + "\u001b[K |████████████████████████████████| 133kB 14.3MB/s \n", + "\u001b[?25hRequirement already satisfied: spacy<3.1,>=2.1.0 in /usr/local/lib/python3.7/dist-packages (from allennlp) (2.2.4)\n", + "Collecting sentencepiece\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/f5/99/e0808cb947ba10f575839c43e8fafc9cc44e4a7a2c8f79c60db48220a577/sentencepiece-0.1.95-cp37-cp37m-manylinux2014_x86_64.whl (1.2MB)\n", + "\u001b[K |████████████████████████████████| 1.2MB 15.0MB/s \n", + "\u001b[?25hRequirement already satisfied: lmdb in /usr/local/lib/python3.7/dist-packages (from allennlp) (0.99)\n", + "Requirement already satisfied: nltk in /usr/local/lib/python3.7/dist-packages (from allennlp) (3.2.5)\n", + "Requirement already satisfied: scikit-learn in /usr/local/lib/python3.7/dist-packages (from allennlp) (0.22.2.post1)\n", + "Collecting jsonnet>=0.10.0; sys_platform != \"win32\"\n", + "\u001b[?25l Downloading 
https://files.pythonhosted.org/packages/42/40/6f16e5ac994b16fa71c24310f97174ce07d3a97b433275589265c6b94d2b/jsonnet-0.17.0.tar.gz (259kB)\n", + "\u001b[K |████████████████████████████████| 266kB 30.5MB/s \n", + "\u001b[?25hRequirement already satisfied: filelock<3.1,>=3.0 in /usr/local/lib/python3.7/dist-packages (from allennlp) (3.0.12)\n", + "Collecting overrides==3.1.0\n", + " Downloading https://files.pythonhosted.org/packages/ff/b1/10f69c00947518e6676bbd43e739733048de64b8dd998e9c2d5a71f44c5d/overrides-3.1.0.tar.gz\n", + "Collecting jsonpickle\n", + " Downloading https://files.pythonhosted.org/packages/bb/1a/f2db026d4d682303793559f1c2bb425ba3ec0d6fd7ac63397790443f2461/jsonpickle-2.0.0-py2.py3-none-any.whl\n", + "Requirement already satisfied: pytest in /usr/local/lib/python3.7/dist-packages (from allennlp) (3.6.4)\n", + "Requirement already satisfied: torchvision<0.9.0,>=0.8.1 in /usr/local/lib/python3.7/dist-packages (from allennlp) (0.8.2+cu101)\n", + "Collecting transformers<4.4,>=4.1\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/f9/54/5ca07ec9569d2f232f3166de5457b63943882f7950ddfcc887732fc7fb23/transformers-4.3.3-py3-none-any.whl (1.9MB)\n", + "\u001b[K |████████████████████████████████| 1.9MB 25.9MB/s \n", + "\u001b[?25hCollecting tensorboardX>=1.2\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/af/0c/4f41bcd45db376e6fe5c619c01100e9b7531c55791b7244815bac6eac32c/tensorboardX-2.1-py2.py3-none-any.whl (308kB)\n", + "\u001b[K |████████████████████████████████| 317kB 52.3MB/s \n", + "\u001b[?25hRequirement already satisfied: torch<1.8.0,>=1.6.0 in /usr/local/lib/python3.7/dist-packages (from allennlp) (1.7.1+cu101)\n", + "Requirement already satisfied: tqdm>=4.19 in /usr/local/lib/python3.7/dist-packages (from allennlp) (4.41.1)\n", + "Requirement already satisfied: more-itertools in /usr/local/lib/python3.7/dist-packages (from allennlp) (8.7.0)\n", + "Requirement already satisfied: scipy in /usr/local/lib/python3.7/dist-packages (from allennlp) (1.4.1)\n", + "Requirement already satisfied: requests>=2.18 in /usr/local/lib/python3.7/dist-packages (from allennlp) (2.23.0)\n", + "Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from allennlp) (1.19.5)\n", + "Requirement already satisfied: h5py in /usr/local/lib/python3.7/dist-packages (from allennlp) (2.10.0)\n", + "Collecting botocore<1.21.0,>=1.20.22\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/8c/d8/0069415ca12180b94d368c60dcce9c0680dc5cfc1aed36882ac452fcf2bf/botocore-1.20.22-py2.py3-none-any.whl (7.3MB)\n", + "\u001b[K |████████████████████████████████| 7.3MB 46.8MB/s \n", + "\u001b[?25hCollecting jmespath<1.0.0,>=0.7.1\n", + " Downloading https://files.pythonhosted.org/packages/07/cb/5f001272b6faeb23c1c9e0acc04d48eaaf5c862c17709d20e3469c6e0139/jmespath-0.10.0-py2.py3-none-any.whl\n", + "Collecting s3transfer<0.4.0,>=0.3.0\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/ea/43/4b4a1b26eb03a429a4c37ca7fdf369d938bd60018fc194e94b8379b0c77c/s3transfer-0.3.4-py2.py3-none-any.whl (69kB)\n", + "\u001b[K |████████████████████████████████| 71kB 10.6MB/s \n", + "\u001b[?25hRequirement already satisfied: srsly<1.1.0,>=1.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1,>=2.1.0->allennlp) (1.0.5)\n", + "Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1,>=2.1.0->allennlp) (3.0.5)\n", + "Requirement already satisfied: wasabi<1.1.0,>=0.4.0 in 
/usr/local/lib/python3.7/dist-packages (from spacy<3.1,>=2.1.0->allennlp) (0.8.2)\n", + "Requirement already satisfied: plac<1.2.0,>=0.9.6 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1,>=2.1.0->allennlp) (1.1.3)\n", + "Requirement already satisfied: thinc==7.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1,>=2.1.0->allennlp) (7.4.0)\n", + "Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from spacy<3.1,>=2.1.0->allennlp) (54.0.0)\n", + "Requirement already satisfied: blis<0.5.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1,>=2.1.0->allennlp) (0.4.1)\n", + "Requirement already satisfied: catalogue<1.1.0,>=0.0.7 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1,>=2.1.0->allennlp) (1.0.0)\n", + "Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1,>=2.1.0->allennlp) (1.0.5)\n", + "Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<3.1,>=2.1.0->allennlp) (2.0.5)\n", + "Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from nltk->allennlp) (1.15.0)\n", + "Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn->allennlp) (1.0.1)\n", + "Requirement already satisfied: importlib-metadata; python_version < \"3.8\" in /usr/local/lib/python3.7/dist-packages (from jsonpickle->allennlp) (3.7.0)\n", + "Requirement already satisfied: pluggy<0.8,>=0.5 in /usr/local/lib/python3.7/dist-packages (from pytest->allennlp) (0.7.1)\n", + "Requirement already satisfied: atomicwrites>=1.0 in /usr/local/lib/python3.7/dist-packages (from pytest->allennlp) (1.4.0)\n", + "Requirement already satisfied: py>=1.5.0 in /usr/local/lib/python3.7/dist-packages (from pytest->allennlp) (1.10.0)\n", + "Requirement already satisfied: attrs>=17.4.0 in /usr/local/lib/python3.7/dist-packages (from pytest->allennlp) (20.3.0)\n", + "Requirement already satisfied: pillow>=4.1.1 in /usr/local/lib/python3.7/dist-packages (from torchvision<0.9.0,>=0.8.1->allennlp) (7.0.0)\n", + "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.7/dist-packages (from transformers<4.4,>=4.1->allennlp) (2019.12.20)\n", + "Collecting sacremoses\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)\n", + "\u001b[K |████████████████████████████████| 890kB 45.9MB/s \n", + "\u001b[?25hRequirement already satisfied: packaging in /usr/local/lib/python3.7/dist-packages (from transformers<4.4,>=4.1->allennlp) (20.9)\n", + "Collecting tokenizers<0.11,>=0.10.1\n", + "\u001b[?25l Downloading https://files.pythonhosted.org/packages/71/23/2ddc317b2121117bf34dd00f5b0de194158f2a44ee2bf5e47c7166878a97/tokenizers-0.10.1-cp37-cp37m-manylinux2010_x86_64.whl (3.2MB)\n", + "\u001b[K |████████████████████████████████| 3.2MB 45.8MB/s \n", + "\u001b[?25hRequirement already satisfied: protobuf>=3.8.0 in /usr/local/lib/python3.7/dist-packages (from tensorboardX>=1.2->allennlp) (3.12.4)\n", + "Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from torch<1.8.0,>=1.6.0->allennlp) (3.7.4.3)\n", + "Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests>=2.18->allennlp) (2.10)\n", + "Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in 
/usr/local/lib/python3.7/dist-packages (from requests>=2.18->allennlp) (1.24.3)\n", + "Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests>=2.18->allennlp) (3.0.4)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests>=2.18->allennlp) (2020.12.5)\n", + "Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /usr/local/lib/python3.7/dist-packages (from botocore<1.21.0,>=1.20.22->boto3<2.0,>=1.14->allennlp) (2.8.1)\n", + "Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata; python_version < \"3.8\"->jsonpickle->allennlp) (3.4.0)\n", + "Requirement already satisfied: click in /usr/local/lib/python3.7/dist-packages (from sacremoses->transformers<4.4,>=4.1->allennlp) (7.1.2)\n", + "Requirement already satisfied: pyparsing>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging->transformers<4.4,>=4.1->allennlp) (2.4.7)\n", + "Building wheels for collected packages: jsonnet, overrides, sacremoses\n", + " Building wheel for jsonnet (setup.py) ... \u001b[?25l\u001b[?25hdone\n", + " Created wheel for jsonnet: filename=jsonnet-0.17.0-cp37-cp37m-linux_x86_64.whl size=3388770 sha256=98e2c41cb2629a99b88c08483f6d0f6f98f51c0c1f0840533ba3699a3bee446e\n", + " Stored in directory: /root/.cache/pip/wheels/26/7a/37/7dbcc30a6b4efd17b91ad1f0128b7bbf84813bd4e1cfb8c1e3\n", + " Building wheel for overrides (setup.py) ... \u001b[?25l\u001b[?25hdone\n", + " Created wheel for overrides: filename=overrides-3.1.0-cp37-none-any.whl size=10174 sha256=69c092dbeab473cdfaa7cd4ef35f7d32a40b242e54b37603bcaf1e13823064b8\n", + " Stored in directory: /root/.cache/pip/wheels/5c/24/13/6ef8600e6f147c95e595f1289a86a3cc82ed65df57582c65a9\n", + " Building wheel for sacremoses (setup.py) ... 
\u001b[?25l\u001b[?25hdone\n", + " Created wheel for sacremoses: filename=sacremoses-0.0.43-cp37-none-any.whl size=893262 sha256=3f9a55e78014d726af4b1f7571e6c26e974ddc1a5c6882e3ff2f6ebab4b6ace9\n", + " Stored in directory: /root/.cache/pip/wheels/29/3c/fd/7ce5c3f0666dab31a50123635e6fb5e19ceb42ce38d4e58f45\n", + "Successfully built jsonnet overrides sacremoses\n", + "\u001b[31mERROR: botocore 1.20.22 has requirement urllib3<1.27,>=1.25.4, but you'll have urllib3 1.24.3 which is incompatible.\u001b[0m\n", + "Installing collected packages: jmespath, botocore, s3transfer, boto3, sentencepiece, jsonnet, overrides, jsonpickle, sacremoses, tokenizers, transformers, tensorboardX, allennlp\n", + "Successfully installed allennlp-2.1.0 boto3-1.17.22 botocore-1.20.22 jmespath-0.10.0 jsonnet-0.17.0 jsonpickle-2.0.0 overrides-3.1.0 s3transfer-0.3.4 sacremoses-0.0.43 sentencepiece-0.1.95 tensorboardX-2.1 tokenizers-0.10.1 transformers-4.3.3\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "m9bkrj3asXE3" + }, + "source": [ + "import allennlp\r\n", + "import torch\r\n", + "from allennlp.data import (\r\n", + " DataLoader,\r\n", + " DatasetReader,\r\n", + " Instance,\r\n", + " Vocabulary,\r\n", + " TextFieldTensors,\r\n", + ")\r\n", + "from allennlp.data.data_loaders import SimpleDataLoader\r\n", + "from allennlp.data.fields import LabelField, TextField\r\n", + "from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer\r\n", + "from allennlp.data.tokenizers import Token, Tokenizer, WhitespaceTokenizer\r\n", + "from allennlp.models import Model\r\n", + "from allennlp.modules import TextFieldEmbedder, Seq2VecEncoder\r\n", + "from allennlp.modules.seq2vec_encoders import BagOfEmbeddingsEncoder\r\n", + "from allennlp.modules.token_embedders import Embedding\r\n", + "from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder\r\n", + "from allennlp.nn import util\r\n", + "from allennlp.training.trainer import GradientDescentTrainer, Trainer\r\n", + "from allennlp.training.optimizers import AdamOptimizer\r\n", + "from allennlp.training.metrics import CategoricalAccuracy" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5o2gOAXZnW_O" + }, + "source": [ + "# Making a DatasetReader" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5QkolYE0rAOs" + }, + "source": [ + "You can implement your own `DatasetReader` by inheriting from the `DatasetReader` class. 
At minimum, you need to override the `_read()` method, which reads the input and yields `Instances`." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sJeqKLlEseA7" + }, + "source": [ + "class ClassificationTsvReader(DatasetReader):\r\n", + " def __init__(\r\n", + " self,\r\n", + " tokenizer: Tokenizer = None,\r\n", + " token_indexers: Dict[str, TokenIndexer] = None,\r\n", + " max_tokens: int = None,\r\n", + " **kwargs\r\n", + " ):\r\n", + " super().__init__(**kwargs)\r\n", + " self.tokenizer = tokenizer or WhitespaceTokenizer()\r\n", + " self.token_indexers = token_indexers or {\"tokens\": SingleIdTokenIndexer()}\r\n", + " self.max_tokens = max_tokens\r\n", + "\r\n", + " def _read(self, file_path: str) -> Iterable[Instance]:\r\n", + " with open(file_path, \"r\") as lines:\r\n", + " for line in lines:\r\n", + " text, sentiment = line.strip().split(\"\\t\")\r\n", + " tokens = self.tokenizer.tokenize(text)\r\n", + " if self.max_tokens:\r\n", + " tokens = tokens[: self.max_tokens]\r\n", + " text_field = TextField(tokens, self.token_indexers)\r\n", + " label_field = LabelField(sentiment)\r\n", + " yield Instance({\"text\": text_field, \"label\": label_field})\r\n" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eUhKJOfMr1KG" + }, + "source": [ + "This is a minimal `DatasetReader` that will return a list of classification `Instances` when you call `reader.read(file)`. This reader will take each line in the input file, split the `text` into words using a tokenizer (the `WhitespaceTokenizer` shown here simply splits on whitespace; a `SpacyTokenizer` backed by [spaCy](https://spacy.io/) would also work), and represent those words as tensors using word ids from a vocabulary we construct for you.\r\n", + "\r\n", + "Pay special attention to the `text` and `label` keys that are used in the fields dictionary passed to the `Instance` - these keys will be used as parameter names when passing tensors into your `Model` later.\r\n", + "\r\n", + "Ideally, the output label would be optional when we create the `Instances`, so that we can use the same code to make predictions on unlabeled data (say, in a demo), but for the rest of this chapter we’ll keep things simple and ignore that.\r\n", + "\r\n", + "There are lots of places where this could be made better for a more flexible and fully-featured reader; see the section on [DatasetReaders](https://guide.allennlp.org/reading-data#2) for a deeper dive." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "asXkbZtRD9wY" + }, + "source": [ + "# Building your model\r\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cbm6V-UWEBw2" + }, + "source": [ + "![designing-a-model.svg](https://raw.githubusercontent.com/allenai/allennlp-guide/master/static/part1/your-first-model/designing-a-model.svg)\r\n", + "\r\n", + "The next thing we need is a `Model` that will take a batch of `Instances`, predict the outputs from the inputs, and compute a loss.\r\n", + "\r\n", + "Remember that our `Instances` have this input/output spec:\r\n", + "\r\n", + "```\r\n", + "# Inputs\r\n", + "text: TextField\r\n", + "\r\n", + "# Outputs\r\n", + "label: LabelField\r\n", + "```\r\n", + "Also, remember that we used these names (`text` and `label`) for the fields in the `DatasetReader`. 
AllenNLP passes those fields by name to the model code, so we need to use the same names in our model.\r\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8AvSUIwTEZZl" + }, + "source": [ + "## What should our model do?\r\n", + "\r\n", + "![designing-a-model-1.svg](https://raw.githubusercontent.com/allenai/allennlp-guide/master/static/part1/your-first-model/designing-a-model-1.svg)\r\n", + "\r\n", + "Conceptually, a generic model for classifying text does the following:\r\n", + "\r\n", + "- Get some features corresponding to each word in your input\r\n", + "- Combine those word-level features into a document-level feature vector\r\n", + "- Classify that document-level feature vector into one of your labels.\r\n", + "\r\n", + "In AllenNLP, we make each of these conceptual steps into a generic abstraction that you can use in your code, so that you can have a very flexible model that can use different concrete components for each step." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1M2gofMGEkPp" + }, + "source": [ + "## Representing text with token IDs\r\n", + "\r\n", + "![designing-a-model-2.svg](https://raw.githubusercontent.com/allenai/allennlp-guide/master/static/part1/your-first-model/designing-a-model-2.svg)\r\n", + "\r\n", + "The first step is changing the strings in the input text into token ids. This is handled by the `SingleIdTokenIndexer` that we used previously, during part of our data processing pipeline that you don’t have to write code for." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QOvsbjFOGDLp" + }, + "source": [ + "## Embedding tokens\r\n", + "\r\n", + "![designing-a-model-3.svg](https://raw.githubusercontent.com/allenai/allennlp-guide/master/static/part1/your-first-model/designing-a-model-3.svg)\r\n", + "\r\n", + "The first thing our `Model` does is apply an `Embedding` function that converts each token ID that we got as input into a vector. This gives us a vector for each input token, so we have a large tensor here." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mAjvfaD2EuB6" + }, + "source": [ + "## Apply Seq2Vec encoder\r\n", + "\r\n", + "![designing-a-model-4.svg](https://raw.githubusercontent.com/allenai/allennlp-guide/master/static/part1/your-first-model/designing-a-model-4.svg)\r\n", + "\r\n", + "Next we apply some function that takes the sequence of vectors for each input token and squashes it into a single vector. Before the days of pretrained language models like BERT, this was typically an LSTM or convolutional encoder. With BERT we might just take the embedding of the `[CLS]` token (more on how to do that [later](https://guide.allennlp.org/next-steps)).\r\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SxmZy7E_E5PY" + }, + "source": [ + "## Computing distribution over labels\r\n", + "\r\n", + "![designing-a-model-5.svg](https://raw.githubusercontent.com/allenai/allennlp-guide/master/static/part1/your-first-model/designing-a-model-5.svg)\r\n", + "\r\n", + "Finally, we take that single feature vector (for each `Instance` in the batch), and classify it as a label, which will give us a categorical probability distribution over our label space." 
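+        "\r\n",
+        "Putting the steps above together, the forward computation we will implement looks roughly like the sketch below. This is only a sketch: `embedder`, `encoder`, and `classifier` stand for the attributes created in the model constructor shown in the next sections, and `text` is the dictionary of tensors produced from our `TextField`:\r\n",
+        "\r\n",
+        "```python\r\n",
+        "# Sketch of the forward computation (tensor shapes shown in comments).\r\n",
+        "embedded_text = embedder(text)               # (batch_size, num_tokens, embedding_dim)\r\n",
+        "mask = util.get_text_field_mask(text)        # (batch_size, num_tokens); 0 where text was padded\r\n",
+        "encoded_text = encoder(embedded_text, mask)  # (batch_size, encoding_dim)\r\n",
+        "logits = classifier(encoded_text)            # (batch_size, num_labels)\r\n",
+        "probs = torch.nn.functional.softmax(logits, dim=-1)  # distribution over labels\r\n",
+        "```"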
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dgodwsVeCjF_" + }, + "source": [ + "# Implementing the model - the constructor" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PnEOIvI_tlX9" + }, + "source": [ + "![allennlp-model](https://raw.githubusercontent.com/allenai/allennlp-guide/master/static/part1/your-first-model/allennlp-model.svg)\r\n", + "\r\n", + "Now that we know what our model is going to do, we need to implement it. First, we’ll say a few words about how `Models` work in AllenNLP:\r\n", + "\r\n", + "- An AllenNLP `Model` is just a PyTorch `Module`\r\n", + "- It implements a `forward()` method, and requires the output to be a dictionary\r\n", + "- Its output contains a `loss` key during training, which is used to optimize the model.\r\n", + "\r\n", + "Our training loop takes a batch of `Instances`, passes it through `Model.forward()`, grabs the `loss` key from the resulting dictionary, and uses backprop to compute gradients and update the model’s parameters. You don’t have to implement the training loop—all this will be taken care of by AllenNLP (though you can if you want to)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6LPb2OGx-_oI" + }, + "source": [ + "## Constructing the Model\r\n", + "\r\n", + "In the `Model` constructor, we need to instantiate all of the parameters that we will want to train. In AllenNLP, [we recommend](https://guide.allennlp.org/using-config-files#1) taking most of these parameters as constructor arguments, so that we can configure the behavior of our model without changing the model code itself, and so that we can think at a higher level about what our model is doing. The constructor for our text classification model looks like this:\r\n", + "\r\n", + "```python\r\n", + "@Model.register('simple_classifier')\r\n", + "class SimpleClassifier(Model):\r\n", + " def __init__(self,\r\n", + " vocab: Vocabulary,\r\n", + " embedder: TextFieldEmbedder,\r\n", + " encoder: Seq2VecEncoder):\r\n", + " super().__init__(vocab)\r\n", + " self.embedder = embedder\r\n", + " self.encoder = encoder\r\n", + " num_labels = vocab.get_vocab_size(\"labels\")\r\n", + " self.classifier = torch.nn.Linear(encoder.get_output_dim(), num_labels)\r\n", + "```\r\n", + "\r\n", + "You’ll notice that we use type annotations a lot in AllenNLP code - this is both for code readability (it’s way easier to understand what a method does if you know the types of its arguments, instead of just their names), and because we use these annotations to do some magic for you in some cases.\r\n", + "\r\n", + "One of those cases is constructor parameters, where we can automatically construct the embedder and encoder from a configuration file using these type annotations. See the chapter on [configuration files](https://guide.allennlp.org/using-config-files) for more information. That chapter will also tell you about the call to `@Model.register().`\r\n", + "\r\n", + "The upshot is that if you’re using the `allennlp train` command with a configuration file (which we show how to do below), you won’t ever have to call this constructor, it all gets taken care of for you." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WSZTinog_7TY" + }, + "source": [ + "### Passing the vocabulary\r\n", + "\r\n", + "
\r\n",
+        "@Model.register('simple_classifier')\r\n",
+        "class SimpleClassifier(Model):\r\n",
+        "    def __init__(self,\r\n",
+        "                 vocab: Vocabulary,\r\n",
+        "                 embedder: TextFieldEmbedder,\r\n",
+        "                 encoder: Seq2VecEncoder):\r\n",
+        "        super().__init__(vocab)\r\n",
+        "        self.embedder = embedder\r\n",
+        "        self.encoder = encoder\r\n",
+        "        num_labels = vocab.get_vocab_size(\"labels\")\r\n",
+        "        self.classifier = torch.nn.Linear(encoder.get_output_dim(), num_labels)\r\n",
+        "
\r\n", + "\r\n", + "`Vocabulary` manages mappings between vocabulary items (such as words and labels) and their integer IDs. In our prebuilt training loop, the vocabulary gets created by AllenNLP after reading your training data, then passed to the `Model` when it gets constructed. We’ll find all tokens and labels that you use and assign them all integer IDs in separate namespaces. The way that this happens is fully configurable; see the [Vocabulary section of this guide](https://guide.allennlp.org/reading-data#3) for more information.\r\n", + "\r\n", + "What we did in the `DatasetReader` will put the labels in the default “labels” namespace, and we grab the number of labels from the vocabulary on line 10.\r\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rjZ7lT_74YmA" + }, + "source": [ + "### Embedding words\r\n", + "\r\n", + "
\r\n",
+        "@Model.register('simple_classifier')\r\n",
+        "class SimpleClassifier(Model):\r\n",
+        "    def __init__(self,\r\n",
+        "                 vocab: Vocabulary,\r\n",
+        "                 embedder: TextFieldEmbedder,\r\n",
+        "                 encoder: Seq2VecEncoder):\r\n",
+        "        super().__init__(vocab)\r\n",
+        "        self.embedder = embedder\r\n",
+        "        self.encoder = encoder\r\n",
+        "        >num_labels = vocab.get_vocab_size(\"labels\")\r\n",
+        "        self.classifier = torch.nn.Linear(encoder.get_output_dim(), num_labels)\r\n",
+        "
\r\n", + "\r\n", + "To get an initial word embedding, we’ll use AllenNLP’s `TextFieldEmbedder`. This abstraction takes the tensors created by a `TextField` and embeds each one. This is our most complex abstraction, because there are a lot of ways to do this particular operation in NLP, and we want to be able to switch between these without changing our code. We won’t go into the details here; we have a whole [chapter of this guide](https://guide.allennlp.org/representing-text-as-features) dedicated to diving deep into how this abstraction works and how to use it. All you need to know for now is that you apply this to the `text` parameter you get in `forward()`, and you get out a tensor that has a single embedding vector for each input token, with shape `(batch_size, num_tokens, embedding_dim)`." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OgfbmLCm5LTk" + }, + "source": [ + "### Applying a Seq2VecEncoder\r\n", + "\r\n", + "
\r\n",
+        "@Model.register('simple_classifier')\r\n",
+        "class SimpleClassifier(Model):\r\n",
+        "    def __init__(self,\r\n",
+        "                 vocab: Vocabulary,\r\n",
+        "                 embedder: TextFieldEmbedder,\r\n",
+        "                 encoder: Seq2VecEncoder):\r\n",
+        "        super().__init__(vocab)\r\n",
+        "        self.embedder = embedder\r\n",
+        "        self.encoder = encoder\r\n",
+        "        >num_labels = vocab.get_vocab_size(\"labels\")\r\n",
+        "        self.classifier = torch.nn.Linear(encoder.get_output_dim(), num_labels)\r\n",
+        "
\r\n", + "\r\n", + "To squash our sequence of token vectors into a single vector, we use AllenNLP’s `Seq2VecEncoder` abstraction. As the name implies, this encapsulates an operation that takes a sequence of vectors and returns a single vector. Because all of our modules operate on batched input, this will take a tensor shaped like `(batch_size, num_tokens, embedding_dim)` and return a tensor shaped like `(batch_size, encoding_dim)`." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "aRz02XVes8ss" + }, + "source": [ + "class SimpleClassifier(Model):\r\n", + " def __init__(\r\n", + " self, \r\n", + " vocab: Vocabulary, \r\n", + " embedder: TextFieldEmbedder, \r\n", + " encoder: Seq2VecEncoder\r\n", + " ):\r\n", + " super().__init__(vocab)\r\n", + " self.embedder = embedder\r\n", + " self.encoder = encoder\r\n", + " num_labels = vocab.get_vocab_size(\"labels\")\r\n", + " self.classifier = torch.nn.Linear(encoder.get_output_dim(), num_labels)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "L7LtlfL6toUI" + }, + "source": [ + "# Implementing the model — the forward method" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "y0hXPxfSCdQ4" + }, + "source": [ + "Next, we need to implement the `forward()` method of your model, which takes the input, produces the prediction, and computes the loss. Remember, our constructor and input/output spec look like:\r\n", + "\r\n", + "```python\r\n", + "@Model.register('simple_classifier')\r\n", + "class SimpleClassifier(Model):\r\n", + " def __init__(self,\r\n", + " vocab: Vocabulary,\r\n", + " embedder: TextFieldEmbedder,\r\n", + " encoder: Seq2VecEncoder):\r\n", + " super().__init__(vocab)\r\n", + " self.embedder = embedder\r\n", + " self.encoder = encoder\r\n", + " num_labels = vocab.get_vocab_size(\"labels\")\r\n", + " self.classifier = torch.nn.Linear(encoder.get_output_dim(), num_labels)\r\n", + "```\r\n", + "\r\n", + "```\r\n", + "# Inputs:\r\n", + "text: TextField\r\n", + "\r\n", + "# Outputs:\r\n", + "label: LabelField\r\n", + "```\r\n", + "\r\n", + "Here we’ll show how to use these parameters inside of `Model.forward()`, which will get arguments that match our input/output spec (because that’s how we coded the [DatasetReader](https://colab.research.google.com/drive/1Fxl4PEW-U-x7MjIrLfPyqw2Sgs1Z2Fcw?authuser=1#scrollTo=5o2gOAXZnW_O&line=1&uniqifier=1))." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XvDoAv938kpg" + }, + "source": [ + "## Model.forward()\r\n", + "\r\n", + "In `forward`, we use the parameters that we created in our constructor to transform the inputs into outputs. 
After we’ve predicted the outputs, we compute some loss function based on how close we got to the true outputs, and then return that loss (along with whatever else we want) so that we can use it to train the parameters.\r\n", + "\r\n", + "```python\r\n", + "class SimpleClassifier(Model):\r\n", + " def forward(self,\r\n", + " text: TextFieldTensors,\r\n", + " label: torch.Tensor) -> Dict[str, torch.Tensor]:\r\n", + " # Shape: (batch_size, num_tokens, embedding_dim)\r\n", + " embedded_text = self.embedder(text)\r\n", + " # Shape: (batch_size, num_tokens)\r\n", + " mask = util.get_text_field_mask(text)\r\n", + " # Shape: (batch_size, encoding_dim)\r\n", + " encoded_text = self.encoder(embedded_text, mask)\r\n", + " # Shape: (batch_size, num_labels)\r\n", + " logits = self.classifier(encoded_text)\r\n", + " # Shape: (batch_size, num_labels)\r\n", + " probs = torch.nn.functional.softmax(logits)\r\n", + " # Shape: (1,)\r\n", + " loss = torch.nn.functional.cross_entropy(logits, label)\r\n", + " return {'loss': loss, 'probs': probs}\r\n", + "```\r\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VTdv9S5i9JYd" + }, + "source": [ + "### Inputs to forward()\r\n", + "\r\n", + "
\r\n",
+        "class SimpleClassifier(Model):\r\n",
+        "    def forward(self,\r\n",
+        "                text: TextFieldTensors,\r\n",
+        "                label: torch.Tensor) -> Dict[str, torch.Tensor]:\r\n",
+        "        # Shape: (batch_size, num_tokens, embedding_dim)\r\n",
+        "        embedded_text = self.embedder(text)\r\n",
+        "        # Shape: (batch_size, num_tokens)\r\n",
+        "        mask = util.get_text_field_mask(text)\r\n",
+        "        # Shape: (batch_size, encoding_dim)\r\n",
+        "        encoded_text = self.encoder(embedded_text, mask)\r\n",
+        "        # Shape: (batch_size, num_labels)\r\n",
+        "        logits = self.classifier(encoded_text)\r\n",
+        "        # Shape: (batch_size, num_labels)\r\n",
+        "        probs = torch.nn.functional.softmax(logits)\r\n",
+        "        # Shape: (1,)\r\n",
+        "        loss = torch.nn.functional.cross_entropy(logits, label)\r\n",
+        "        return {'loss': loss, 'probs': probs}\r\n",
+        "
\r\n", + "\r\n", + "The first thing to notice is the inputs to this function. The way the AllenNLP training loop works is that we will take the field names that you used in your `DatasetReader` and give you a batch of instances _with those same field names_ in `forward`. So, because we used `text` and `label` as our field names, we need to name our arguments to forward the same way.\r\n", + "\r\n", + "Second, notice the types of these arguments. Each type of `Field` knows how to convert itself into a `torch.Tensor`, then create a batched `torch.Tensor` from all of the `Fields` with the same name from a batch of `Instances`. The types you see for `text` and `label` are the tensors produced by `TextField` and `LabelField` (again, see our [chapter on using TextFields](https://guide.allennlp.org/representing-text-as-features) for more information about `TextFieldTensors`). The important part to know is that our `TextFieldEmbedder`, which we created in the constructor, expects this type of object as input and will return an embedded tensor as output." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F5fL1kON-qMM" + }, + "source": [ + "### Embedding the text\r\n", + "\r\n", + "
\r\n",
+        "class SimpleClassifier(Model):\r\n",
+        "    def forward(self,\r\n",
+        "                text: TextFieldTensors,\r\n",
+        "                label: torch.Tensor) -> Dict[str, torch.Tensor]:\r\n",
+        "        # Shape: (batch_size, num_tokens, embedding_dim)\r\n",
+        "        embedded_text = self.embedder(text)\r\n",
+        "        # Shape: (batch_size, num_tokens)\r\n",
+        "        mask = util.get_text_field_mask(text)\r\n",
+        "        # Shape: (batch_size, encoding_dim)\r\n",
+        "        encoded_text = self.encoder(embedded_text, mask)\r\n",
+        "        # Shape: (batch_size, num_labels)\r\n",
+        "        logits = self.classifier(encoded_text)\r\n",
+        "        # Shape: (batch_size, num_labels)\r\n",
+        "        probs = torch.nn.functional.softmax(logits)\r\n",
+        "        # Shape: (1,)\r\n",
+        "        loss = torch.nn.functional.cross_entropy(logits, label)\r\n",
+        "        return {'loss': loss, 'probs': probs}\r\n",
+        "
\r\n", + "\r\n", + "The first actual modeling operation that we do is embed the text, getting a vector for each input token. Notice here that we’re not specifying anything about how that operation is done, just that a `TextFieldEmbedder` that we got in our constructor is going to do it. This lets us be very flexible later, changing between various kinds of embedding methods or pretrained representations (including ELMo and BERT) without changing our model code." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eTqnTEbo8-Xq" + }, + "source": [ + "### Applying a Seq2VecEncoder\r\n", + "\r\n", + "
\r\n",
+        "class SimpleClassifier(Model):\r\n",
+        "    def forward(self,\r\n",
+        "                text: TextFieldTensors,\r\n",
+        "                label: torch.Tensor) -> Dict[str, torch.Tensor]:\r\n",
+        "        # Shape: (batch_size, num_tokens, embedding_dim)\r\n",
+        "        embedded_text = self.embedder(text)\r\n",
+        "        # Shape: (batch_size, num_tokens)\r\n",
+        "        mask = util.get_text_field_mask(text)\r\n",
+        "        # Shape: (batch_size, encoding_dim)\r\n",
+        "        encoded_text = self.encoder(embedded_text, mask)\r\n",
+        "        # Shape: (batch_size, num_labels)\r\n",
+        "        logits = self.classifier(encoded_text)\r\n",
+        "        # Shape: (batch_size, num_labels)\r\n",
+        "        probs = torch.nn.functional.softmax(logits)\r\n",
+        "        # Shape: (1,)\r\n",
+        "        loss = torch.nn.functional.cross_entropy(logits, label)\r\n",
+        "        return {'loss': loss, 'probs': probs}\r\n",
+        "
\r\n", + "\r\n", + "After we have embedded our text, we next have to squash the sequence of vectors (one per token) into a single vector for the whole text. We do that using the `Seq2VecEncoder` that we got as a constructor argument. In order to behave properly when we’re batching pieces of text together that could have different lengths, we need to mask elements in the `embedded_text` tensor that are only there due to padding. We use a utility function to get a mask from the `TextField` output, then pass that mask into the encoder.\r\n", + "\r\n", + "At the end of these lines, we have a single vector for each instance in the batch." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uUxYty0Y_7A-" + }, + "source": [ + "### Making predictions\r\n", + "\r\n", + "
\r\n",
+        "class SimpleClassifier(Model):\r\n",
+        "    def forward(self,\r\n",
+        "                text: TextFieldTensors,\r\n",
+        "                label: torch.Tensor) -> Dict[str, torch.Tensor]:\r\n",
+        "        # Shape: (batch_size, num_tokens, embedding_dim)\r\n",
+        "        embedded_text = self.embedder(text)\r\n",
+        "        # Shape: (batch_size, num_tokens)\r\n",
+        "        mask = util.get_text_field_mask(text)\r\n",
+        "        # Shape: (batch_size, encoding_dim)\r\n",
+        "        encoded_text = self.encoder(embedded_text, mask)\r\n",
+        "        # Shape: (batch_size, num_labels)\r\n",
+        "        logits = self.classifier(encoded_text)\r\n",
+        "        # Shape: (batch_size, num_labels)\r\n",
+        "        probs = torch.nn.functional.softmax(logits)\r\n",
+        "        # Shape: (1,)\r\n",
+        "        loss = torch.nn.functional.cross_entropy(logits, label)\r\n",
+        "        return {'loss': loss, 'probs': probs}\r\n",
+        "
\r\n", + "\r\n", + "The last step of our model is to take the vector for each instance in the batch and predict a label for it. Our `classifier` is a `torch.nn.Linear` layer that gives a score (commonly called a `logit`) for each possible label. We normalize those scores using a `softmax` operation to get a probability distribution over labels that we can return to a consumer of this model. For computing the loss, PyTorch has a built in function that computes the cross entropy between the logits that we predict and the true label distribution, and we use that as our loss function.\r\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DQVGPAu0A_iA" + }, + "source": [ + "# class SimpleClassifier(Model)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "BMuMrIDkAcQZ" + }, + "source": [ + "class SimpleClassifier(Model):\r\n", + " def __init__(\r\n", + " self, \r\n", + " vocab: Vocabulary, \r\n", + " embedder: TextFieldEmbedder, \r\n", + " encoder: Seq2VecEncoder\r\n", + " ):\r\n", + " super().__init__(vocab)\r\n", + " self.embedder = embedder\r\n", + " self.encoder = encoder\r\n", + " num_labels = vocab.get_vocab_size(\"labels\")\r\n", + " self.classifier = torch.nn.Linear(encoder.get_output_dim(), num_labels)\r\n", + "\r\n", + " def forward(\r\n", + " self, text: TextFieldTensors, label: torch.Tensor\r\n", + " ) -> Dict[str, torch.Tensor]:\r\n", + " print(\"In model.forward(); printing here just because binder is so slow\")\r\n", + " # Shape: (batch_size, num_tokens, embedding_dim)\r\n", + " embedded_text = self.embedder(text)\r\n", + " # Shape: (batch_size, num_tokens)\r\n", + " mask = util.get_text_field_mask(text)\r\n", + " # Shape: (batch_size, encoding_dim)\r\n", + " encoded_text = self.encoder(embedded_text, mask)\r\n", + " # Shape: (batch_size, num_labels)\r\n", + " logits = self.classifier(encoded_text)\r\n", + " # Shape: (batch_size, num_labels)\r\n", + " probs = torch.nn.functional.softmax(logits, dim=-1)\r\n", + " # Shape: (1,)\r\n", + " loss = torch.nn.functional.cross_entropy(logits, label)\r\n", + " return {\"loss\": loss, \"probs\": probs}\r\n", + " " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "y4OrXtARBEeH" + }, + "source": [ + "# Conclusion\r\n", + "\r\n", + "And that’s it! This is all you need for a simple classifier. After you’ve written a `DatasetReader` and `Model`, AllenNLP takes care of the rest: connecting your input files to the dataset reader, intelligently batching together your instances and feeding them to the model, and optimizing the model’s parameters by using backprop on the loss. We go over this part in the next chapter." + ] + } + ] +} \ No newline at end of file