Support for writing back NLU intent/example metadata to YAML #7731

chdorner · 2021-01-15T09:14:21Z

Description of Problem:
Rasa 2.0 introduced support for metadata on NLU intents and examples (reference). So far, only the RasaYAMLReader supports parsing it; the RasaYAMLWriter is not able to write it back to YAML files.

This came out of https://github.com/RasaHQ/rasa-x/issues/4180.

Overview of the Solution:
Support for intent and example metadata needs to be added to RasaYAMLWriter.process_training_examples_by_key (src).

Considering this YAML structure:

version: "2.0"
nlu:
- intent: greet
  metadata:
    sentiment: neutral
  examples:
    - text: |
        hi
      metadata:
        capitalization: lazy
    - text: |
        Hi
      metadata:
        capitalization: correct

The parser returns:

# ...
[{'text': 'hi',
  'intent': 'greet',
  'metadata': {'intent': {'sentiment': 'neutral'},
               'example': {'capitalization': 'lazy'}}},
 {'text': 'Hi',
  'intent': 'greet',
  'metadata': {'intent': {'sentiment': 'neutral'},
               'example': {'capitalization': 'correct'}}}]
# ...

Rendering the example metadata (the dict with "capitalization") should be fairly straightforward, without many questions to settle up front.
The intent metadata (the dict with "sentiment"), however, is duplicated on each example, which raises a few questions.

Should the writer just pick the first/last (?) example and use its intent metadata?
Or should the writer collect the intent metadata of all examples and deep/shallow (?) merge it before writing?

Examples (if relevant):

>>> from rasa.shared.nlu.training_data.message import Message
>>> from rasa.shared.nlu.training_data.training_data import TrainingData
>>> a = Message.build(text="hello", intent="greet", example_metadata={"paraphrases": ["hey", "hi"]})
>>> a.data
{'text': 'hello', 'intent': 'greet', 'metadata': {'example': {'paraphrases': ['hey', 'hi']}}}
>>> a.as_dict()
{'text': 'hello', 'intent': 'greet', 'metadata': {'example': {'paraphrases': ['hey', 'hi']}}}
>>> td = TrainingData([a])
>>> td.nlu_as_yaml()
'version: "2.0"\nnlu:\n- intent: greet\n  examples: |\n    - hello\n'

(hat tip to @dakshvar22 for the code)

Blockers (if relevant):

Definition of Done:

Discuss and agree on how to handle intent metadata
Tests are added
Feature mentioned in the changelog


chdorner · 2021-01-18T16:11:04Z

Open Questions

Rendering intent metadata

We don't have a Python-object representation for the intent itself when parsing an NLU file, so we attach the intent-level metadata to each example of that intent:

from rasa.shared.nlu.training_data.formats.rasa_yaml import RasaYAMLReader

yaml_string = """version: "2.0"
nlu:
- intent: greet
  metadata:
    sentiment: neutral
  examples: |
    - hi
    - hello
"""

training_data = RasaYAMLReader().reads(yaml_string)

training_data.training_examples[0].as_dict()
# {'text': 'hi',
#  'intent': 'greet',
#  'metadata': {'intent': {'sentiment': 'neutral'}}}

training_data.training_examples[1].as_dict()
# {'text': 'hello',
#  'intent': 'greet',
#  'metadata': {'intent': {'sentiment': 'neutral'}}}

This raises the question of how the RasaYAMLWriter should collect the intent metadata from the examples and render it in YAML.

We can:
a) trust that the training example representation in Python objects follows the rules of the NLU file format and that for each example of a given intent the intent metadata is exactly the same, thus allowing us to just grab the intent metadata from one (first? last?) example.

b) be a bit more defensive and try to collect all intent metadata from the examples of a given intent and try to merge them together (shallow / deep merge?).

Update: The RasaYAMLWriter can assume that all intent metadata from the examples belonging to the same intent are identical, thus it's fine just to take the first one.
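
To illustrate the agreed approach, here is a minimal sketch that works on the plain dicts returned by `as_dict()` above. The helper name `first_intent_metadata` is hypothetical and not part of Rasa; it only shows what "take the first one" could look like.

```python
from typing import Any, Dict, List


def first_intent_metadata(examples: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Map each intent to the intent metadata found on its first example.

    Assumes examples shaped like the parser output shown above, i.e.
    {"text": ..., "intent": ..., "metadata": {"intent": ..., "example": ...}}.
    """
    result: Dict[str, Any] = {}
    for example in examples:
        intent = example["intent"]
        if intent in result:
            # Later examples of the same intent are assumed to be identical.
            continue
        metadata = example.get("metadata") or {}
        result[intent] = metadata.get("intent")  # None if there is none
    return result


examples = [
    {"text": "hi", "intent": "greet",
     "metadata": {"intent": {"sentiment": "neutral"}}},
    {"text": "hello", "intent": "greet",
     "metadata": {"intent": {"sentiment": "neutral"}}},
    {"text": "bye", "intent": "goodbye"},
]
by_intent = first_intent_metadata(examples)
```

Here `by_intent["greet"]` holds the "sentiment" dict once, and `by_intent["goodbye"]` is `None` because that example carries no metadata.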

Data type of metadata

The docs currently say that:

the metadata key can contain arbitrary key-value data.

There is however one test case in the code which has a list of strings as the value of the "metadata" key.

Which one of the two is the truth? Only allowing key-value objects (i.e. Python dicts) would simplify the implementation in the RasaYAMLWriter significantly.

Update: The metadata can be any data type that is supported by YAML including maps, lists, strings, numbers, etc.

Preserving the YAML structure for examples without metadata

The YAML structure differs depending on whether the examples have metadata. If at least one of the examples has metadata, the YAML structure looks like this (example 1):

version: "2.0"
nlu:
- intent: greet
  examples:
    - text: |
        hi
      metadata:
        sentiment: neutral
    - text: |
        hello
# ...

If we don't have any metadata on the examples, then we can use a less verbose YAML structure (example 2):

version: "2.0"
nlu:
- intent: greet
  examples: |
    - hi
    - hello

So far the RasaYAMLWriter only supports the less verbose YAML structure. Do we need to preserve this functionality, or can the writer from now on always write the verbose version?
In other words, given example 2 as the input, is it okay if the RasaYAMLWriter always writes it as:

version: "2.0"
nlu:
- intent: greet
  examples:
    - text: |
        hi
    - text: |
        hello

Update: The YAML output should be identical to the input.
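
One way to get there is to pick the structure per intent based on whether any of its examples has example metadata. The sketch below is only an assumption about how that decision could look; `render_intent` is a hypothetical helper, not the actual RasaYAMLWriter code, and it returns the plain Python structure that a YAML dumper would then serialize.

```python
from typing import Any, Dict, List


def render_intent(intent: str, examples: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build the structure for one intent, compact or verbose as needed."""
    has_example_metadata = any(
        "example" in (ex.get("metadata") or {}) for ex in examples
    )

    if not has_example_metadata:
        # Compact form (example 2): one block string with "- text" lines.
        compact = "".join(f"- {ex['text']}\n" for ex in examples)
        return {"intent": intent, "examples": compact}

    # Verbose form (example 1): a list of items with optional metadata.
    verbose = []
    for ex in examples:
        item: Dict[str, Any] = {"text": ex["text"]}
        metadata = ex.get("metadata") or {}
        if "example" in metadata:
            item["metadata"] = metadata["example"]
        verbose.append(item)
    return {"intent": intent, "examples": verbose}


plain = render_intent("greet", [
    {"text": "hi", "intent": "greet"},
    {"text": "hello", "intent": "greet"},
])
mixed = render_intent("greet", [
    {"text": "hi", "intent": "greet",
     "metadata": {"example": {"sentiment": "neutral"}}},
    {"text": "hello", "intent": "greet"},
])
```

With this choice, training data without example metadata keeps the compact `examples: |` block, and the verbose list is only used where it is needed.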

m-vdb · 2021-01-19T08:59:25Z

As I followed the initial implementation by @degiz, let me share a few thoughts:

Rendering intent metadata: since Rasa Open Source is responsible for loading + manipulating intent metadata + dumping, I think that a) is more sensible. I'd be a bit more defensive and include a warning in case the intent metadata is different on one or more examples (which maybe would be a bug in our code?). Implementing b) sounds a bit overkill (what's the use case here?)
Data type of metadata: I'd follow the public API and the doc. I think that the test was written at an early stage of implementation. While allowing any kind of metadata sounds appealing, I think it would reduce our ability to manipulate it / combine it, etc... hence reducing value for users.
Preserving the YAML structure for examples without metadata: I think we need to focus on the user experience here, and simplify the "look" of the training data as long as it's manageable on our end. Not all examples for all users will have metadata. So I'd respect what's in the documentation and go for both implementations.
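
The defensive variant of a) suggested in the first point could look roughly like the following. This is a sketch under assumptions: `intent_metadata_with_check` is a hypothetical name, it uses the standard library `warnings` module rather than any Rasa-specific logging helper, and it operates on the plain example dicts shown earlier in the thread.

```python
import warnings
from typing import Any, Dict, List


def intent_metadata_with_check(intent: str, examples: List[Dict[str, Any]]) -> Any:
    """Return the first example's intent metadata, warning on mismatches."""
    if not examples:
        return None

    def intent_meta(example: Dict[str, Any]) -> Any:
        return (example.get("metadata") or {}).get("intent")

    first = intent_meta(examples[0])
    if any(intent_meta(ex) != first for ex in examples[1:]):
        # A mismatch most likely points to a bug upstream of the writer.
        warnings.warn(
            f"Intent metadata for '{intent}' differs between examples; "
            "using the metadata of the first example."
        )
    return first
```

The writer would still take the first example's metadata, so the output stays predictable, and the warning surfaces inconsistent data instead of silently dropping it.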

(also cc'ing you @tmbo in case you miss reasoning about training data format 😅 )

chdorner · 2021-01-19T17:00:00Z

Summary from a call w/ @degiz today:

Rendering intent metadata: The RasaYAMLWriter can assume that all intent metadata from the examples belonging to the same intent are identical, thus it's fine just to take the first one.
Data type of metadata: The metadata can be any data type that is supported by YAML including maps, lists, strings, numbers, etc.
Preserving the YAML structure for examples without metadata: The YAML output should be identical to the input.

chdorner added the type:enhancement ✨ and area:rasa-oss 🎡 labels Jan 15, 2021

chdorner self-assigned this Jan 15, 2021

chdorner mentioned this issue Jan 20, 2021

Support for writing NLU intent/example metadata to YAML #7761

Merged

3 tasks

chdorner closed this as completed Jan 26, 2021
