Model hashes are not equal when training from multiple nlu.md files #5405

tommykoctur · 2020-03-11T08:18:38Z

Rasa version:
1.8.1
Python version:
Python 3.7.0
Operating system (windows, osx, ...):
Ubuntu 19.10
Issue:
NLU models trained from multiple .md files have different md5sums even though the random seed was set. When training on the same data from a single .md file, the models have the same hashes.
The fix in ad1dd2c#diff-b69ac23b9d0ad7d535a81c9f705d10e5 probably did not solve the issue.

Error (including full traceback):

Model 1 md5sums:
7898e02645b8bd479567355b7bd9dfcc  1/nlu/component_5_DIETClassifier.tf_model.data-00000-of-00001
d4c46fac513c3a4c63b693e17addfc2f  1/nlu/component_5_DIETClassifier.tf_model.index

Model 2 md5sums:
cd84319dddf5e3c4bcdcd4e89c49b2f3  2/nlu/component_5_DIETClassifier.tf_model.data-00000-of-00001
5297ccdca94857c71dc9a88a7b4b9744  2/nlu/component_5_DIETClassifier.tf_model.index

Command or request that led to error:

tar -xzf your_nlu_model.tar.gz
md5sum nlu/*
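
The same check can be scripted where md5sum is not available. This is a minimal sketch, assuming the archive has already been extracted; `hash_model_files` is a hypothetical helper written for this thread, not part of Rasa.

```python
import hashlib
from pathlib import Path


def hash_model_files(model_dir, algorithm="md5"):
    """Return {relative path: hex digest} for every file under model_dir."""
    root = Path(model_dir)
    digests = {}
    # Sort the paths so the report order does not depend on the filesystem.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        hasher = hashlib.new(algorithm)
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large weight files are not loaded at once.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                hasher.update(chunk)
        digests[path.relative_to(root).as_posix()] = hasher.hexdigest()
    return digests
```

Calling `hash_model_files("nlu")` from the extraction directory should list the same files as `md5sum nlu/*`, plus any files in subdirectories.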

Content of configuration file (config.yml) (if relevant):

# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
language: en
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
    random_seed: 999
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100

# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
    max_history: 5
    epochs: 100
  - name: MappingPolicy

sara-tagger · 2020-03-12T07:00:08Z

Thanks for the issue, @JustinaPetr will get back to you about it soon!

You may find help in the docs and the forum, too 🤗

akelad · 2020-03-17T09:43:31Z

Hey @tommykoctur just to clarify, what command do you use to train the model? Are you passing a directory or multiple files?

tommykoctur · 2020-03-17T13:49:07Z

> Hey @tommykoctur just to clarify, what command do you use to train the model? Are you passing a directory or multiple files?

Hi @akelad, I am using the command python -m rasa train nlu --config config.yml --nlu data/nlu
Yes, I am passing a directory with multiple .md files.
But reading a directory with multiple files should have been fixed since ad1dd2c#diff-b69ac23b9d0ad7d535a81c9f705d10e5

akelad · 2020-03-19T13:36:46Z

ok yeah i can definitely reproduce this. i'll look into what's going on

jbjulien · 2020-03-20T14:38:13Z

Thank you!! I had been digging through the Rasa code for a few hours to understand why rasa train was always retraining my NLU model. Your modifications work for me!

tommykoctur · 2020-04-17T08:19:36Z

Hi, I have tried it in Rasa 1.9.5 and 1.9.6, but it still produces different model hashes.

Please see:

On a clean Rasa install:
Folder one: rasa init, added random seed 999, then rasa train.
Folder two: rasa init, added random seed 999, split the training data into two files each (NLU and stories), then rasa train.

Model sha1 hashes are different:
Folder1:

528b366df341bf2c1571853329e2a0c5c2004a4e  core/policy_0_MemoizationPolicy/memorized_turns.json
5fd8404898ca23aba9f8a08015a5fb34c23b0eda  core/policy_1_TEDPolicy/ted_policy.data_example.pkl
fd0b2948d216a8ef8abd89aff8df9ff94cd9808c  core/policy_1_TEDPolicy/ted_policy.tf_model.data-00000-of-00001
2267b9e6c9f587285950c6e2479a1359c4abf7af  core/policy_1_TEDPolicy/ted_policy.tf_model.index
e4c9c3063b03458a23c62f69170e2b692dbfd4aa  nlu/component_5_DIETClassifier.data_example.pkl
9543d7305cca736097d47b8d24344fe2a838100f  nlu/component_5_DIETClassifier.tf_model.data-00000-of-00001
b36f12f11b71ca05702ba71076396d0525113ac0  nlu/component_5_DIETClassifier.tf_model.index

Folder2:

1b7643aa885af0bfa65b99c631aff29d486e2514  core/policy_0_MemoizationPolicy/memorized_turns.json
cde62795c81c1d11009d673c5c2a86935a4f7d44  core/policy_1_TEDPolicy/ted_policy.data_example.pkl
f3b68623d97b6bef3c5183b21d76705e7cf9fe75  core/policy_1_TEDPolicy/ted_policy.tf_model.data-00000-of-00001
a541750d117e3373ac5c3984d1b4664cf49f731d  core/policy_1_TEDPolicy/ted_policy.tf_model.index
169ce93056f993ae4377068688764d415639dbe1  nlu/component_5_DIETClassifier.data_example.pkl
fc3266a47bc4d6e61be525b0e5fbf64cb676b2b5  nlu/component_5_DIETClassifier.tf_model.data-00000-of-00001
9f533e5b1445368de6003a0b7d49c1429680ec93  nlu/component_5_DIETClassifier.tf_model.index
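
To see at a glance which files differ between two such listings, the digests can be compared by relative path. This is a small sketch; `diff_digests` is a hypothetical helper, and the two dictionaries hold only two of the entries listed above for brevity.

```python
def diff_digests(first, second):
    """Return sorted relative paths whose digests differ or that exist on one side only."""
    return sorted(
        name
        for name in set(first) | set(second)
        if first.get(name) != second.get(name)
    )


# Two entries per folder, copied from the sha1 listings above.
folder1 = {
    "core/policy_0_MemoizationPolicy/memorized_turns.json": "528b366df341bf2c1571853329e2a0c5c2004a4e",
    "nlu/component_5_DIETClassifier.tf_model.index": "b36f12f11b71ca05702ba71076396d0525113ac0",
}
folder2 = {
    "core/policy_0_MemoizationPolicy/memorized_turns.json": "1b7643aa885af0bfa65b99c631aff29d486e2514",
    "nlu/component_5_DIETClassifier.tf_model.index": "9f533e5b1445368de6003a0b7d49c1429680ec93",
}

for name in diff_digests(folder1, folder2):
    print("differs:", name)
```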

cat config.yml 
# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
language: en
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
    random_seed: 999
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100

# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
    max_history: 5
    epochs: 100
  - name: MappingPolicy

akelad · 2020-04-17T08:27:31Z

@tommykoctur are you getting consistent results though? Could you share the structure of your data directory?

tommykoctur · 2020-04-17T10:58:17Z

Hi,
No, the results are not consistent.
Here is my data folder1:
ls -l data/
total 8
-rw-r--r-- 1 tommy tommy 671 Apr 17 10:01 nlu.md
-rw-r--r-- 1 tommy tommy 407 Apr 17 10:01 stories.md

Here is my data folder2:
ls -l data/
total 16
-rw-r--r-- 1 tommy tommy 220 Apr 17 10:03 nlu1.md
-rw-r--r-- 1 tommy tommy 451 Apr 17 10:03 nlu.md
-rw-r--r-- 1 tommy tommy 217 Apr 17 10:03 stories.md
-rw-r--r-- 1 tommy tommy 189 Apr 17 10:03 story1.md

akelad · 2020-04-17T12:26:13Z

oh wait - are you saying that the results on the two separate folders aren't the same? That's to be expected, the data has a different structure and therefore may get read in a different order. Results should only be the same on two separate runs on the same folder.
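
A toy illustration of that point, under stated assumptions: this is plain file reading, not Rasa's actual loader, and `read_lines_in_sorted_order` is a hypothetical helper. Even when files are visited in sorted name order, splitting the same examples across files can change the order in which they are seen, and with a fixed seed a different input order can plausibly lead to different trained weights.

```python
import tempfile
from pathlib import Path


def read_lines_in_sorted_order(data_dir):
    """Concatenate the non-empty lines of all .md files, visiting files in sorted name order."""
    lines = []
    for path in sorted(Path(data_dir).glob("*.md")):
        text = path.read_text(encoding="utf-8")
        lines.extend(line for line in text.splitlines() if line.strip())
    return lines


with tempfile.TemporaryDirectory() as one, tempfile.TemporaryDirectory() as two:
    # Folder one: all examples in a single file.
    Path(one, "nlu.md").write_text(
        "## intent:greet\n- hi\n## intent:bye\n- bye\n", encoding="utf-8"
    )
    # Folder two: the same examples split across two files.
    Path(two, "nlu.md").write_text("## intent:bye\n- bye\n", encoding="utf-8")
    Path(two, "nlu1.md").write_text("## intent:greet\n- hi\n", encoding="utf-8")

    first = read_lines_in_sorted_order(one)
    second = read_lines_in_sorted_order(two)

# Same set of lines in both folders, but a different sequence.
same_content = sorted(first) == sorted(second)
same_order = first == second
print("same content:", same_content, "| same order:", same_order)
```

Two runs over the same folder always yield the same sequence here, which matches the guarantee described above: reproducibility per folder, not across differently structured folders.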

tommykoctur added the type:bug 🐛 label Mar 11, 2020

akelad mentioned this issue Mar 20, 2020

sort nlu/story files before training #5455

Merged

akelad closed this as completed Mar 31, 2020
