This repository has been archived by the owner on Oct 9, 2023. It is now read-only.
Added Seq2Seq tasks (#37)
* Added Seq2Seq tasks

* Use rank 0 for model specific params

* Add licences

* Fix summarization scripts

* Fix comments, update from files API

* Add tests

* Add docs

* Fix doc header

* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Add typing

* Add imports, fix docs

* Add rouge score for metric

* fix imports

* fix imports and style

* Install sentencepiece for slow tokenizer conversion

* yapf

* Fixed underlines

* Fixed doc references

* Added min versions address formatting

* Update requirement

* Fix formatting issues

* add seq to seq finetuning callback

* docs: link blog

* resolve tests

* update

* Delete lock file

* remove download_model

* Revert some changes, update requirements.txt

* Move to mbart for now, even if it's a large model file

* Clean up finetuning module, fix tests plus add todo

* Cleanup

* Update flash/text/seq2seq/core/model.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Remove lock file, add typing

* Change to test code

* Swap to module available

* Revert testcode due to test error

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: tchaton <thomas@grid.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
5 people committed Feb 2, 2021
1 parent d5409bd commit 0446361
Showing 42 changed files with 2,130 additions and 481 deletions.
2 changes: 1 addition & 1 deletion docs/source/_templates/theme_variables.jinja
@@ -11,7 +11,7 @@
'home': 'https://pytorchlightning.github.io/lightning-flash/',
'get_started': 'https://pytorchlightning.github.io/lightning-flash/quickstart.html',
'features': 'https://pytorchlightning.github.io/lightning-flash/',
-'blog': 'https://pytorchlightning.github.io/lightning-flash/',
+'blog': 'https://www.pytorchlightning.ai/blog',
'resources': 'https://pytorchlightning.github.io/lightning-flash/',
'support': 'https://pytorchlightning.github.io/lightning-flash/',
}
2 changes: 2 additions & 0 deletions docs/source/index.rst
@@ -22,8 +22,10 @@ Lightning Flash
reference/task
reference/image_classification
reference/image_embedder
reference/summarization
reference/text_classification
reference/tabular_classification
reference/translation

.. toctree::
:maxdepth: 1
185 changes: 185 additions & 0 deletions docs/source/reference/summarization.rst
@@ -0,0 +1,185 @@
.. _summarization:

#############
Summarization
#############

********
The task
********

Summarization is the task of condensing a larger document or article into a short sentence or description. For example, taking a web article and describing its topic in a single sentence.
This task is a subset of sequence-to-sequence (Seq2Seq) tasks, which require the model to generate a variable-length sequence given an input sequence. In our case, the article is the input sequence and the short description is the output sequence generated by the model.
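To make the input/output relationship concrete, here is a toy, purely extractive baseline in plain Python (not part of Flash, and deliberately naive): it "summarizes" by returning the first sentence of the input. A real Seq2Seq model instead generates new text token by token.

```python
def first_sentence_summary(article: str) -> str:
    """A naive extractive 'summary': return the first sentence of the article."""
    # Split on the first sentence-ending period; a real model would generate
    # a new, abstractive sentence rather than copy one from the input.
    first, _, _ = article.strip().partition(". ")
    return first if first.endswith(".") else first + "."

article = (
    "The couple were surrounded by shoppers as they walked along Electric Avenue. "
    "They came to Brixton to see work which has started to revitalise the borough."
)
print(first_sentence_summary(article))
# → The couple were surrounded by shoppers as they walked along Electric Avenue.
```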

-----

*********
Inference
*********

The :class:`~flash.text.SummarizationTask` is already pre-trained on `XSUM <https://arxiv.org/abs/1808.08745>`_, a dataset of online British Broadcasting Corporation articles.

Use the :class:`~flash.text.SummarizationTask` pretrained model for inference on any string sequence using :func:`~flash.text.SummarizationTask.predict`:

.. code-block:: python

    # import our libraries
    from flash.text import SummarizationTask

    # 1. Load the model from a checkpoint
    model = SummarizationTask.load_from_checkpoint("https://flash-weights.s3.amazonaws.com/summarization_model_xsum.pt")

    # 2. Perform inference on a sequence
    predictions = model.predict([
        """
        Camilla bought a box of mangoes with a Brixton £10 note, introduced last year to try to keep the money of local
        people within the community. The couple were surrounded by shoppers as they walked along Electric Avenue.
        They came to Brixton to see work which has started to revitalise the borough.
        It was Charles' first visit to the area since 1996, when he was accompanied by the former
        South African president Nelson Mandela. Greengrocer Derek Chong, who has run a stall on Electric Avenue
        for 20 years, said Camilla had been "nice and pleasant" when she purchased the fruit.
        "She asked me what was nice, what would I recommend, and I said we've got some nice mangoes.
        She asked me were they ripe and I said yes - they're from the Dominican Republic."
        Mr Chong is one of 170 local retailers who accept the Brixton Pound.
        Customers exchange traditional pound coins for Brixton Pounds and then spend them at the market
        or in participating shops.
        During the visit, Prince Charles spent time talking to youth worker Marcus West, who works with children
        nearby on an estate off Coldharbour Lane. Mr West said:
        "He's on the level, really down-to-earth. They were very cheery. The prince is a lovely man."
        He added: "I told him I was working with young kids and he said, 'Keep up all the good work.'"
        Prince Charles also visited the Railway Hotel, at the invitation of his charity The Prince's Regeneration Trust.
        The trust hopes to restore and refurbish the building,
        where once Jimi Hendrix and The Clash played, as a new community and business centre.
        """
    ])
    print(predictions)

Or on a given dataset:

.. code-block:: python

    # import our libraries
    from pytorch_lightning import Trainer

    from flash.text import SummarizationData, SummarizationTask

    # 1. Load the model from a checkpoint
    model = SummarizationTask.load_from_checkpoint("https://flash-weights.s3.amazonaws.com/summarization_model_xsum.pt")

    # 2. Create a dataset from file
    datamodule = SummarizationData.from_file(
        predict_file="data/xsum/predict.csv",
        input="input",
    )

    # 3. Generate summaries
    predictions = Trainer().predict(model, datamodule=datamodule)
    print(predictions)

For more advanced inference options, see :ref:`predictions`.

-----

**********
Finetuning
**********

Say you want to finetune on your own summarization data. We use the XSUM dataset as an example, which contains a ``train.csv`` and a ``valid.csv`` structured like so:

.. code-block::

    input,target
    "The researchers have sequenced the genome of a strain of bacterium that causes the virulent infection...","A team of UK scientists hopes to shed light on the mysteries of bleeding canker, a disease that is threatening the nation's horse chestnut trees."
    "Knight was shot in the leg by an unknown gunman at Miami's Shore Club where West was holding a pre-MTV Awards...",Hip hop star Kanye West is being sued by Death Row Records founder Suge Knight over a shooting at a beach party in August 2005.
    ...

In the above, the ``input`` column holds the long articles/documents, and the ``target`` column holds the short summaries the model should learn to produce.
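As a sanity check before finetuning, you can parse a CSV in this shape with the standard library alone to confirm the ``input``/``target`` columns are read as you expect. This sketch uses made-up rows (independent of Flash and of the real XSUM files); note how quoted fields may contain commas:

```python
import csv
import io

# Two made-up rows in the same shape as XSUM's train.csv.
sample = io.StringIO(
    'input,target\n'
    '"A long article, with commas...","A short one-sentence summary."\n'
)

rows = list(csv.DictReader(sample))
print(rows[0]["input"])   # the long document
print(rows[0]["target"])  # the short summary
```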

All we need is three lines of code to train our model!

.. code-block:: python

    # import our libraries
    import flash
    from flash import download_data
    from flash.text import SummarizationData, SummarizationTask

    # 1. Download the data
    download_data("https://pl-flash-data.s3.amazonaws.com/xsum.zip", "data/")

    # 2. Organize the data
    datamodule = SummarizationData.from_files(
        train_file="data/xsum/train.csv",
        valid_file="data/xsum/valid.csv",
        test_file="data/xsum/test.csv",
        input="input",
        target="target",
    )

    # 3. Build the task
    model = SummarizationTask()

    # 4. Create the trainer
    trainer = flash.Trainer(max_epochs=1, gpus=1)

    # 5. Finetune the task
    trainer.finetune(model, datamodule=datamodule)

    # 6. Save the checkpoint
    trainer.save_checkpoint("summarization_model_xsum.pt")

----

To run the example:

.. code-block:: bash

    python flash_examples/finetuning/summarization.py

------

*********************
Changing the backbone
*********************
By default, we use the `t5 <https://arxiv.org/abs/1910.10683>`_ model for summarization. You can change the model by passing the ``backbone`` parameter.

.. note:: When changing the backbone, make sure you pass in the same backbone to the Task and the Data object! Since this is a Seq2Seq task, make sure you use a Seq2Seq model.

.. code-block:: python

    datamodule = SummarizationData.from_files(
        train_file="data/wmt_en_ro/train.csv",
        valid_file="data/wmt_en_ro/valid.csv",
        test_file="data/wmt_en_ro/test.csv",
        input="input",
        target="target",
        backbone="google/mt5-small",
    )

    model = SummarizationTask(backbone="google/mt5-small")
------

*************
API reference
*************

.. _summarization_task:

SummarizationTask
-----------------

.. autoclass:: flash.text.seq2seq.summarization.model.SummarizationTask
:members:
:exclude-members: forward

.. _summarization_data:

SummarizationData
-----------------

.. autoclass:: flash.text.seq2seq.summarization.data.SummarizationData

.. automethod:: flash.text.seq2seq.summarization.data.SummarizationData.from_files
12 changes: 6 additions & 6 deletions docs/source/reference/text_classification.rst
@@ -16,9 +16,9 @@ Text classification is the task of assigning a piece of text (word, sentence or
Inference
*********

-The :class:`~flash.text.TextClassificatier` is already pre-trained on [IMDB](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews), a dataset of highly polarized movie reviews, trained for binary classification- to predict if a given review has a positive or negative sentiment.
+The :class:`~flash.text.TextClassifier` is already pre-trained on [IMDB](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews), a dataset of highly polarized movie reviews, trained for binary classification- to predict if a given review has a positive or negative sentiment.

-Use the :class:`~flash.text.TextClassificatier` pretrained model for inference on any string sequence using :func:`~flash.text.TextClassifier.predict`:
+Use the :class:`~flash.text.TextClassifier` pretrained model for inference on any string sequence using :func:`~flash.text.TextClassifier.predict`:

.. code-block:: python
@@ -83,10 +83,10 @@ All we need is three lines of code to train our model!

.. code-block:: python
# import our libraries
import flash
from flash import download_data
from flash.text import TextClassificationData, TextClassifier
# 1. Download data
download_data("https://pl-flash-data.s3.amazonaws.com/imdb.zip", 'data/')