Model hashes are not equal when training from multiple nlu.md files #5405

Closed
tommykoctur opened this issue Mar 11, 2020 · 9 comments
Labels
type:bug 🐛 Inconsistencies or issues which will cause an issue or problem for users or implementors.

Comments

@tommykoctur

tommykoctur commented Mar 11, 2020

Rasa version:
1.8.1
Python version:
Python 3.7.0
Operating system (windows, osx, ...):
Ubuntu 19.10
Issue:
NLU models trained from multiple .md files have different md5sums even though the random seed was set. When training with the same data from a single .md file, the models have the same hashes.
The fix in ad1dd2c#diff-b69ac23b9d0ad7d535a81c9f705d10e5 probably did not solve the issue.

Error (including full traceback):

Model 1 md5sums:
7898e02645b8bd479567355b7bd9dfcc  1/nlu/component_5_DIETClassifier.tf_model.data-00000-of-00001
d4c46fac513c3a4c63b693e17addfc2f  1/nlu/component_5_DIETClassifier.tf_model.index

Model 2 md5sums:
cd84319dddf5e3c4bcdcd4e89c49b2f3  2/nlu/component_5_DIETClassifier.tf_model.data-00000-of-00001
5297ccdca94857c71dc9a88a7b4b9744  2/nlu/component_5_DIETClassifier.tf_model.index

Command or request that led to error:

tar -xzf your_nlu_model.tar.gz
md5sum nlu/*
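
The same comparison can be scripted so that only the files whose hashes differ are reported. The following is a minimal sketch, not part of Rasa: the helper names `file_digests` and `differing_files` are hypothetical, and it assumes both model archives have already been extracted into separate directories (for example `1` and `2`, as in the listing above).

```python
import hashlib
from pathlib import Path


def file_digests(model_dir, algorithm="md5"):
    """Map each file's path (relative to model_dir) to its hex digest."""
    root = Path(model_dir)
    digests = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        hasher = hashlib.new(algorithm)
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large checkpoint files are not loaded at once.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                hasher.update(chunk)
        digests[path.relative_to(root).as_posix()] = hasher.hexdigest()
    return digests


def differing_files(dir_a, dir_b, algorithm="md5"):
    """Return the relative paths that are missing on one side or hash differently."""
    a = file_digests(dir_a, algorithm)
    b = file_digests(dir_b, algorithm)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

Calling `differing_files("1", "2")` would list the relative paths, such as the DIETClassifier checkpoint files, whose contents are not byte-identical between the two extracted models. Passing `algorithm="sha1"` switches the digest.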

Content of configuration file (config.yml) (if relevant):

# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
language: en
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
    random_seed: 999
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100

# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
    max_history: 5
    epochs: 100
  - name: MappingPolicy

tommykoctur added the type:bug 🐛 label on Mar 11, 2020
@sara-tagger
Collaborator

Thanks for the issue, @JustinaPetr will get back to you about it soon!

You may find help in the docs and the forum, too 🤗

@akelad
Contributor

akelad commented Mar 17, 2020

Hey @tommykoctur just to clarify, what command do you use to train the model? Are you passing a directory or multiple files?

@tommykoctur
Author

Hey @tommykoctur just to clarify, what command do you use to train the model? Are you passing a directory or multiple files?

Hi @akelad, I am using the command python -m rasa train nlu --config config.yml --nlu data/nlu
Yes, I am passing a directory with multiple .md files.
But reading a directory with multiple files should have been fixed by ad1dd2c#diff-b69ac23b9d0ad7d535a81c9f705d10e5

@akelad
Contributor

akelad commented Mar 19, 2020

ok yeah i can definitely reproduce this. i'll look into what's going on

@jbjulien

Thank you!! I had been digging through the Rasa code for a few hours to understand why rasa train was always retraining my NLU model. Your modifications work for me!

@akelad akelad closed this as completed Mar 31, 2020
@tommykoctur
Author

tommykoctur commented Apr 17, 2020

Hi, I have tried it in Rasa 1.9.5 and 1.9.6, but it still produces different model hashes.

Please see:

Starting from a clean Rasa install:
Folder one: rasa init, then added random seed 999, then rasa train.
Folder two: rasa init, then added random seed 999 and split the training data into two files each (NLU into two files, stories into two files), then rasa train.

Model sha1 hashes are different:
Folder1:

528b366df341bf2c1571853329e2a0c5c2004a4e  core/policy_0_MemoizationPolicy/memorized_turns.json
5fd8404898ca23aba9f8a08015a5fb34c23b0eda  core/policy_1_TEDPolicy/ted_policy.data_example.pkl
fd0b2948d216a8ef8abd89aff8df9ff94cd9808c  core/policy_1_TEDPolicy/ted_policy.tf_model.data-00000-of-00001
2267b9e6c9f587285950c6e2479a1359c4abf7af  core/policy_1_TEDPolicy/ted_policy.tf_model.index
e4c9c3063b03458a23c62f69170e2b692dbfd4aa  nlu/component_5_DIETClassifier.data_example.pkl
9543d7305cca736097d47b8d24344fe2a838100f  nlu/component_5_DIETClassifier.tf_model.data-00000-of-00001
b36f12f11b71ca05702ba71076396d0525113ac0  nlu/component_5_DIETClassifier.tf_model.index

Folder2:

1b7643aa885af0bfa65b99c631aff29d486e2514  core/policy_0_MemoizationPolicy/memorized_turns.json
cde62795c81c1d11009d673c5c2a86935a4f7d44  core/policy_1_TEDPolicy/ted_policy.data_example.pkl
f3b68623d97b6bef3c5183b21d76705e7cf9fe75  core/policy_1_TEDPolicy/ted_policy.tf_model.data-00000-of-00001
a541750d117e3373ac5c3984d1b4664cf49f731d  core/policy_1_TEDPolicy/ted_policy.tf_model.index
169ce93056f993ae4377068688764d415639dbe1  nlu/component_5_DIETClassifier.data_example.pkl
fc3266a47bc4d6e61be525b0e5fbf64cb676b2b5  nlu/component_5_DIETClassifier.tf_model.data-00000-of-00001
9f533e5b1445368de6003a0b7d49c1429680ec93  nlu/component_5_DIETClassifier.tf_model.index
cat config.yml 
# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
language: en
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
    random_seed: 999
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100

# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
    max_history: 5
    epochs: 100
  - name: MappingPolicy

@akelad
Contributor

akelad commented Apr 17, 2020

@tommykoctur are you getting consistent results though? Could you share the structure of your data directory?

@tommykoctur
Author

Hi,
No, the results are not consistent.
Here is my data folder1:
ls -l data/
total 8
-rw-r--r-- 1 tommy tommy 671 Apr 17 10:01 nlu.md
-rw-r--r-- 1 tommy tommy 407 Apr 17 10:01 stories.md

Here is my data folder2:
ls -l data/
total 16
-rw-r--r-- 1 tommy tommy 220 Apr 17 10:03 nlu1.md
-rw-r--r-- 1 tommy tommy 451 Apr 17 10:03 nlu.md
-rw-r--r-- 1 tommy tommy 217 Apr 17 10:03 stories.md
-rw-r--r-- 1 tommy tommy 189 Apr 17 10:03 story1.md

@akelad
Contributor

akelad commented Apr 17, 2020

oh wait - are you saying that the results on the two separate folders aren't the same? That's to be expected, the data has a different structure and therefore may get read in a different order. Results should only be the same on two separate runs on the same folder.
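
To illustrate the point about read order, here is a rough sketch that is not Rasa's actual loader. It assumes a toy format in which every non-empty line of a .md file is one training example, and the helpers `read_examples` and `seeded_order` are hypothetical names. Files are visited in sorted order, so the same folder always yields the same sequence, while a different file split can yield the same examples in a different sequence. A seeded shuffle applies the same permutation of positions each time, so a different input sequence still produces a different output sequence.

```python
import random
from pathlib import Path


def read_examples(data_dir):
    """Collect non-empty lines from every .md file, visiting files in sorted order."""
    examples = []
    for path in sorted(Path(data_dir).glob("*.md")):
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                examples.append(line.strip())
    return examples


def seeded_order(examples, seed=999):
    """Shuffle a copy of the examples with a fixed seed (999 mirrors the config above)."""
    shuffled = list(examples)
    # The permutation depends only on the seed and the list length,
    # so a different input order gives a different shuffled order.
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

With one folder holding a single nlu.md and another holding the same lines split across nlu.md and nlu1.md, `sorted(read_examples(...))` would match for both folders, but `seeded_order(read_examples(...))` would generally not. Repeated calls on the same folder return identical sequences, which is the only case where identical results should be expected.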
