Merge branch 'master' of https://github.com/RasaHQ/rasa into replace-…

…os.path-pathlib
RasaHQ · Nov 10, 2020 · 4fdd821
2 parents 5370197 + df7a5b9
commit 4fdd821

Showing 31 changed files with 2,577 additions and 1,037 deletions.
diff --git a/.github/workflows/security-scans.yml b/.github/workflows/security-scans.yml
@@ -1,6 +1,8 @@
 name: Security Scans
 
-on: [push, pull_request]
+on:
+  pull_request:
+    types: [opened, synchronize, labeled]
 
 jobs:
   cleanup_runs:

diff --git a/changelog/6285.improvement.md b/changelog/6285.improvement.md
@@ -0,0 +1,5 @@
+Predictions of the [`FallbackClassifier`](components.mdx#fallbackclassifier) are
+ignored when 
+[evaluating the NLU model](testing-your-assistant.mdx#evaluating-an-nlu-model).
+Note that the `FallbackClassifier` predictions still apply to 
+[test stories](testing-your-assistant.mdx#writing-test-stories).
diff --git a/changelog/6973.bugfix.md b/changelog/6973.bugfix.md
@@ -0,0 +1 @@
+Ignore rules when validating stories
diff --git a/changelog/6973.doc.md b/changelog/6973.doc.md
@@ -0,0 +1 @@
+Correct data validation docs
diff --git a/changelog/7027.improvement.md b/changelog/7027.improvement.md
@@ -0,0 +1,6 @@
+Remove dependency between `ConveRTTokenizer` and `ConveRTFeaturizer`. The `ConveRTTokenizer` is now deprecated, and the 
+`ConveRTFeaturizer` can be used with any other `Tokenizer`.
+
+Remove dependency between `HFTransformersNLP`, `LanguageModelTokenizer`, and `LanguageModelFeaturizer`. Both 
+`HFTransformersNLP` and `LanguageModelTokenizer` are now deprecated. `LanguageModelFeaturizer` implements the behavior 
+of the stack and can be used with any other `Tokenizer`.
diff --git a/changelog/README.md b/changelog/README.md
@@ -18,7 +18,7 @@ Each file should be named like `<ISSUE>.<TYPE>.md`, where
 * `feature`: new user facing features, like new command-line options and new behavior.
 * `improvement`: improvement of existing functionality, usually without requiring user intervention.
 * `bugfix`: fixes a reported bug.
-* `doc`: documentation improvement, like rewording an entire session or adding missing docs.
+* `doc`: documentation improvement, like rewording an entire section or adding missing docs.
 * `removal`: feature deprecation or feature removal.
 * `misc`: fixing a small typo or internal change, will not be included in the changelog.
 

diff --git a/data/test_stories/rules_without_stories_and_wrong_names.md b/data/test_stories/rules_without_stories_and_wrong_names.md
@@ -0,0 +1,23 @@
+>> rule 1
+    - form{"name": "loop_q_form"}  <!-- condition that form is active-->
+    - slot{"requested_slot": "some_slot"}  <!-- some condition -->
+    - ...
+* some_intent_that_doesnt_exist{"some_slot":"bla"} <!-- can be ANY -->
+    - loop_q_form <!-- can be internal core action, can be anything -->
+
+>> rule 2
+    - form{"name": "loop_q_form"} <!-- condition that form is active-->
+    - slot{"requested_slot": "some_slot"}  <!-- some condition -->
+    - ...
+* explain                          <!-- can be anything -->
+    - utter_some_action_that_doesnt_exist
+    - loop_q_form
+    - form{"name": "loop_q_form"} <!-- condition that form is active-->
+
+>> rule 3
+    - form{"name": "loop_q_form"} <!-- condition that form is active-->
+    - ...
+    - loop_q_form <!-- condition that form is active -->
+    - form{"name": null}
+    - slot{"requested_slot": null}
+    - action_stop_q_form
diff --git a/data/test_stories/stories_with_rules_conflicting.md b/data/test_stories/stories_with_rules_conflicting.md
@@ -0,0 +1,9 @@
+>> rule 1
+* greet
+    - utter_noworries
+
+## ML story 1
+* greet
+    - utter_greet
+* thankyou
+    - utter_noworries
diff --git a/docs/docs/command-line-interface.mdx b/docs/docs/command-line-interface.mdx
@@ -313,31 +313,34 @@ rasa data convert nlg --help
 
 ## rasa data validate
 
-You can check your domain, NLU data, or conversation data for mistakes and inconsistencies. 
+You can check your domain, NLU data, or story data for mistakes and inconsistencies. 
 To validate your data, run this command:
 
 ```bash
 rasa data validate
 ```
 
-By default, the validator searches only for errors in the data, e.g. the same training
-example being listed as an example for two intents.
-To catch minor issues that don't prevent training a model but might indicate messy data
-(e.g. unused intents), use the `--fail-on-warnings` flag.
+The validator searches for errors in the data, e.g. two intents that have some
+identical training examples.
+The validator also checks if you have any stories where different assistant actions follow from the same 
+dialogue history. Conflicts between stories will prevent a model from learning the correct
+pattern for a dialogue. 
 
-You can also validate the story structure by running this command:
+If you pass a `max_history` value to one or more policies in your `config.yml` file, provide the 
+smallest of those values in the validator command using the `--max-history <max_history>` flag. 
+
+You can also validate only the story structure by running this command:
 
 ```bash
 rasa data validate stories
 ```
 
-This validator checks if you have any stories where different assistant actions follow from the same 
-dialogue history. Conflicts between stories will prevent a model from learning the correct
-pattern for a dialogue. 
+:::note
+Running `rasa data validate` does **not** test if your [rules](./rules.mdx) are consistent with your stories. 
+However, during training, the `RulePolicy` checks for conflicts between rules and stories. Any such conflict will abort training.
+:::
 
-If you have a [Memoization Policy](./policies.mdx#memoization-policy) in your 
-`config.yml` file, run the validator with the `--max-history` argument and provide the `max_history` 
-value set in `config.yml`. If you didn't set `max_history` in the config file, provide the default value of `5`.
+To interrupt validation even for minor issues such as unused intents or responses, use the `--fail-on-warnings` flag.
 
 :::caution check your story names
 The `rasa data validate stories` command assumes that all your story names are unique!

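Reviewer note: the revised text asks for the smallest `max_history` value across all policies to be passed via `--max-history`. The following is a minimal sketch of that selection, assuming `config.yml` has already been parsed into a plain dictionary. The helper names `smallest_max_history` and `validate_command` and the example policy list are illustrative and not part of Rasa.

```python
from typing import Any, Dict, List, Optional


def smallest_max_history(config: Dict[str, Any]) -> Optional[int]:
    """Return the smallest `max_history` set on any policy, or None if unset."""
    values = [
        policy["max_history"]
        for policy in config.get("policies") or []
        if isinstance(policy, dict) and policy.get("max_history") is not None
    ]
    return min(values) if values else None


def validate_command(config: Dict[str, Any]) -> List[str]:
    """Build the `rasa data validate` call, adding --max-history only if needed."""
    command = ["rasa", "data", "validate"]
    value = smallest_max_history(config)
    if value is not None:
        command += ["--max-history", str(value)]
    return command


# Hypothetical parsed config; only the `max_history` keys matter here.
example_config = {
    "policies": [
        {"name": "MemoizationPolicy", "max_history": 3},
        {"name": "TEDPolicy", "max_history": 5},
        {"name": "RulePolicy"},
    ]
}

print(" ".join(validate_command(example_config)))
```

When no policy sets `max_history`, the sketch leaves the flag out and falls back to the plain `rasa data validate` call shown in the docs.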
diff --git a/docs/docs/components.mdx b/docs/docs/components.mdx
@@ -139,6 +139,10 @@ word vectors in your pipeline.
 
 ### HFTransformersNLP
 
+:::caution Deprecated
+The `HFTransformersNLP` is deprecated and will be removed in a future release. The [LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer)
+now implements its behavior.
+:::
 
 * **Short**
 
@@ -406,6 +410,10 @@ word vectors in your pipeline.
 
   ### ConveRTTokenizer
 
+:::caution Deprecated
+The `ConveRTTokenizer` is deprecated and will be removed in a future release. The [ConveRTFeaturizer](./components.mdx#convertfeaturizer)
+now implements its behavior. Any [tokenizer](./components.mdx#tokenizers) can be used in its place.
+:::
 
   * **Short**
 
@@ -466,42 +474,46 @@ word vectors in your pipeline.
 
   ### LanguageModelTokenizer
 
+:::caution Deprecated
+The `LanguageModelTokenizer` is deprecated and will be removed in a future release. The [LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer)
+now implements its behavior. Any [tokenizer](./components.mdx#tokenizers) can be used in its place.
+:::
 
-  * **Short**
+* **Short**
 
-    Tokenizer from pre-trained language models
+Tokenizer from pre-trained language models
 
 
 
-  * **Outputs**
+* **Outputs**
 
-    `tokens` for user messages, responses (if present), and intents (if specified)
+`tokens` for user messages, responses (if present), and intents (if specified)
 
 
 
-  * **Requires**
+* **Requires**
 
-    [HFTransformersNLP](./components.mdx#hftransformersnlp)
+[HFTransformersNLP](./components.mdx#hftransformersnlp)
 
 
 
-  * **Description**
+* **Description**
 
-    Creates tokens using the pre-trained language model specified in upstream [HFTransformersNLP](./components.mdx#hftransformersnlp) component.
-    Must be used whenever the [LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer) is used.
+Creates tokens using the pre-trained language model specified in upstream [HFTransformersNLP](./components.mdx#hftransformersnlp) component.
+Must be used whenever the [LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer) is used.
 
 
 
-  * **Configuration**
+* **Configuration**
 
-    ```yaml-rasa
-    pipeline:
-    - name: "LanguageModelTokenizer"
-      # Flag to check whether to split intents
-      "intent_tokenization_flag": False
-      # Symbol on which intent should be split
-      "intent_split_symbol": "_"
-    ```
+```yaml-rasa
+pipeline:
+- name: "LanguageModelTokenizer"
+  # Flag to check whether to split intents
+  "intent_tokenization_flag": False
+  # Symbol on which intent should be split
+  "intent_split_symbol": "_"
+```
 
 
 ## Featurizers
@@ -644,7 +656,7 @@ Note: The `feature-dimension` for sequence and sentence features does not have t
 
 * **Requires**
 
-  [ConveRTTokenizer](./components.mdx#converttokenizer)
+  `tokens`
 
 
 
@@ -667,7 +679,7 @@ Note: The `feature-dimension` for sequence and sentence features does not have t
   :::
 
   :::note
-  To use `ConveRTTokenizer`, install Rasa Open Source with `pip3 install rasa[convert]`.
+  To use `ConveRTFeaturizer`, install Rasa Open Source with `pip3 install rasa[convert]`.
 
   :::
 
@@ -698,7 +710,7 @@ Note: The `feature-dimension` for sequence and sentence features does not have t
 
 * **Requires**
 
-  [HFTransformersNLP](./components.mdx#hftransformersnlp) and [LanguageModelTokenizer](./components.mdx#languagemodeltokenizer)
+  `tokens`.
 
 
 
@@ -711,8 +723,7 @@ Note: The `feature-dimension` for sequence and sentence features does not have t
 * **Description**
 
   Creates features for entity extraction, intent classification, and response selection.
-  Uses the pre-trained language model specified in upstream [HFTransformersNLP](./components.mdx#hftransformersnlp) component to compute vector
-  representations of input text.
+  Uses a pre-trained language model to compute vector representations of input text.
 
   :::note
   Please make sure that you use a language model which is pre-trained on the same language corpus as that of your
@@ -724,14 +735,49 @@ Note: The `feature-dimension` for sequence and sentence features does not have t
 
 * **Configuration**
 
-  Include [HFTransformersNLP](./components.mdx#hftransformersnlp) and [LanguageModelTokenizer](./components.mdx#languagemodeltokenizer) components before this component. Use
-  [LanguageModelTokenizer](./components.mdx#languagemodeltokenizer) to ensure tokens are correctly set for all components throughout the pipeline.
+  Include a [Tokenizer](./components.mdx#tokenizers) component before this component.
+
+  Specify which language model to load via the parameter `model_name`. See the table below for the
+  available language models.
+  You can also select the architecture variation of the chosen language model via the
+  parameter `model_weights`.
+  The full list of supported architectures can be found in the
+  [HuggingFace documentation](https://huggingface.co/transformers/pretrained_models.html).
+  If left empty, the component uses the default model architecture that the original Transformers library loads (see table below).
+
+  ```
+  +----------------+--------------+-------------------------+
+  | Language Model | Parameter    | Default value for       |
+  |                | "model_name" | "model_weights"         |
+  +----------------+--------------+-------------------------+
+  | BERT           | bert         | rasa/LaBSE              |
+  +----------------+--------------+-------------------------+
+  | GPT            | gpt          | openai-gpt              |
+  +----------------+--------------+-------------------------+
+  | GPT-2          | gpt2         | gpt2                    |
+  +----------------+--------------+-------------------------+
+  | XLNet          | xlnet        | xlnet-base-cased        |
+  +----------------+--------------+-------------------------+
+  | DistilBERT     | distilbert   | distilbert-base-uncased |
+  +----------------+--------------+-------------------------+
+  | RoBERTa        | roberta      | roberta-base            |
+  +----------------+--------------+-------------------------+
+  ```
+
+  The following configuration loads the language model BERT:
 
   ```yaml-rasa
   pipeline:
-  - name: "LanguageModelFeaturizer"
-  ```
+    - name: LanguageModelFeaturizer
+      # Name of the language model to use
+      model_name: "bert"
+      # Pre-Trained weights to be loaded
+      model_weights: "rasa/LaBSE"
 
+      # An optional path to a specific directory to download and cache the pre-trained model weights.
+      # The `default` cache_dir is the same as https://huggingface.co/transformers/serialization.html#cache-directory .
+      cache_dir: null
+  ```
 
 ### RegexFeaturizer
 

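Reviewer note: the table added to the `LanguageModelFeaturizer` docs pairs each `model_name` with a default `model_weights` value. As a rough illustration of how that fallback reads, the sketch below copies the table into a dictionary and builds a pipeline entry from it. The function `featurizer_entry` is hypothetical and only mirrors the documented table; it is not Rasa's internal lookup.

```python
from typing import Dict, Optional

# Values copied from the table in the docs change above.
DEFAULT_MODEL_WEIGHTS: Dict[str, str] = {
    "bert": "rasa/LaBSE",
    "gpt": "openai-gpt",
    "gpt2": "gpt2",
    "xlnet": "xlnet-base-cased",
    "distilbert": "distilbert-base-uncased",
    "roberta": "roberta-base",
}


def featurizer_entry(
    model_name: str, model_weights: Optional[str] = None
) -> Dict[str, str]:
    """Build a pipeline entry, falling back to the documented default weights."""
    if model_name not in DEFAULT_MODEL_WEIGHTS:
        raise ValueError(f"Unknown model_name: {model_name!r}")
    return {
        "name": "LanguageModelFeaturizer",
        "model_name": model_name,
        "model_weights": model_weights or DEFAULT_MODEL_WEIGHTS[model_name],
    }


print(featurizer_entry("bert"))
print(featurizer_entry("roberta", model_weights="roberta-large"))
```

Here `roberta-large` is only an example of an explicit override; any value passed as `model_weights` takes precedence over the table default.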
diff --git a/docs/docs/migration-guide.mdx b/docs/docs/migration-guide.mdx
@@ -10,6 +10,34 @@ description: |
 This page contains information about changes between major versions and
 how you can migrate from one version to another.
 
+## Rasa 2.0 to Rasa 2.1
+
+### Deprecations
+
+`ConveRTTokenizer` is now deprecated. [ConveRTFeaturizer](./components.mdx#convertfeaturizer) now implements
+its behaviour. To migrate, replace `ConveRTTokenizer` with any other tokenizer, for example:
+
+```yaml
+pipeline:
+    - name: WhitespaceTokenizer
+    - name: ConveRTFeaturizer
+      model_url: <Remote/Local path to model files>
+    ...
+```
+
+`HFTransformersNLP` and `LanguageModelTokenizer` components are now deprecated.
+[LanguageModelFeaturizer](./components.mdx#languagemodelfeaturizer) now implements their behaviour.
+To migrate, replace both of the above components with any tokenizer and specify the model architecture and model weights
+as part of `LanguageModelFeaturizer`, for example:
+
+```yaml
+pipeline:
+    - name: WhitespaceTokenizer
+    - name: LanguageModelFeaturizer
+      model_name: "bert"
+      model_weights: "rasa/LaBSE"
+    ...
+```
 
 ## Rasa 1.10 to Rasa 2.0
 

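Reviewer note: the migration steps for the deprecated components can be sketched as a small transformation over a pipeline that has been parsed into a list of dictionaries. This is an illustrative helper, not a tool shipped with Rasa. The name `migrate_pipeline` is hypothetical, and the sketch assumes that an existing `HFTransformersNLP` entry carries `model_name`, `model_weights` and `cache_dir` keys that should move to `LanguageModelFeaturizer`.

```python
from typing import Any, Dict, List

DEPRECATED_TOKENIZERS = {"ConveRTTokenizer", "LanguageModelTokenizer"}
MODEL_KEYS = ("model_name", "model_weights", "cache_dir")


def migrate_pipeline(
    pipeline: List[Dict[str, Any]], tokenizer: str = "WhitespaceTokenizer"
) -> List[Dict[str, Any]]:
    """Drop deprecated components and move model settings to the featurizer."""
    # Collect model settings from HFTransformersNLP, if present.
    carried: Dict[str, Any] = {}
    for component in pipeline:
        if component.get("name") == "HFTransformersNLP":
            carried = {key: component[key] for key in MODEL_KEYS if key in component}

    migrated: List[Dict[str, Any]] = []
    tokenizer_replaced = False
    for component in pipeline:
        name = component.get("name")
        if name == "HFTransformersNLP":
            continue  # its behaviour now lives in LanguageModelFeaturizer
        if name in DEPRECATED_TOKENIZERS:
            if not tokenizer_replaced:
                migrated.append({"name": tokenizer})
                tokenizer_replaced = True
            continue
        entry = dict(component)
        if name == "LanguageModelFeaturizer":
            # Settings already on the featurizer win over carried ones.
            for key, value in carried.items():
                entry.setdefault(key, value)
        migrated.append(entry)
    return migrated


old_pipeline = [
    {"name": "HFTransformersNLP", "model_name": "bert", "model_weights": "rasa/LaBSE"},
    {"name": "LanguageModelTokenizer"},
    {"name": "LanguageModelFeaturizer"},
    {"name": "DIETClassifier"},
]

for component in migrate_pipeline(old_pipeline):
    print(component)
```

The default replacement tokenizer follows the `WhitespaceTokenizer` used in the migration guide examples; any other tokenizer name can be passed instead.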
diff --git a/docs/docs/setting-up-ci-cd.mdx b/docs/docs/setting-up-ci-cd.mdx
@@ -38,20 +38,29 @@ you can make a test run only if the pull request has a certain label (e.g. “NL
 
 ### Validating Data and Stories
 
-Data validation verifies that there are no mistakes or major inconsistencies in your domain, NLU 
-data, or conversation data. To validate your data, have your CI run this command:
+Data validation verifies that no mistakes or major inconsistencies appear in your domain, NLU 
+data, or story data. To validate your data, have your CI run this command:
 
 ```bash
-rasa data validate --fail-on-warnings --max-history <max_history>
+rasa data validate
 ```
 
-If you pass a `max_history` value to a Memoization policy in your `config.yml` file, provide the 
-same value in the above validator command. Otherwise, provide the default value of `5`.
+If you pass a `max_history` value to one or more policies in your `config.yml` file, provide the 
+smallest of those values using the `--max-history` flag:
 
-If data validation results in errors, training a model will also fail, so it's
+```bash
+rasa data validate --max-history <max_history>
+```
+
+If data validation results in errors, training a model can also fail or yield bad performance, so it's
 always good to run this check before training a model. By including the
 `--fail-on-warnings` flag, this step will fail on warnings indicating more minor issues.
 
+:::note
+Running `rasa data validate` does **not** test if your [rules](./rules.mdx) are consistent with your stories. 
+However, during training, the `RulePolicy` checks for conflicts between rules and stories. Any such conflict will abort training.
+:::
+
 To read more about the validator and all of the available options, see [the documentation for 
 `rasa data validate`](./command-line-interface.mdx#rasa-data-validate).
 
@@ -95,8 +104,10 @@ as you make improvements to your assistant. A good rule of thumb to follow is th
 to be representative of the true distribution of real conversations.
 Rasa X makes it easy to [add test conversations based on real conversations](https://rasa.com/docs/rasa-x/user-guide/test-assistant/#how-to-create-tests).
 
-Note: Running test stories does **not** execute your action code. You will need to
+:::note
+Running test stories does **not** execute your action code. You will need to
 [test your action code](./setting-up-ci-cd.mdx#testing-action-code) in a separate step.
+:::
 
 ### Comparing NLU Performance