
How to train and evaluate policy models with unified dataset format? #191

JamesCao2048 opened this issue Mar 6, 2024 · 4 comments

JamesCao2048 commented Mar 6, 2024

Hi there, I noticed that there are APIs to load NLU, DST, Policy and NLG data in the unified data format (a minimal loading sketch is included after the questions below). I also found the training and evaluation guides for NLU/DST/NLG with unified data in $model/README.md or NLU/DST/NLG/evaluate_unified_datasets.py. However, I did not find a guide for how to train and evaluate policy models with the unified data format. Specifically, I have the following questions:

  1. Training: I did not find support for training with the unified data format in $policy_model/train.py, such as ppo/train.py and mle/train.py; it seems that they use MultiWozEvaluator by default.
  2. Evaluation: I did not find support for evaluation with the unified data format in policy/evaluate.py; it seems that it also uses MultiWozEvaluator by default.
  3. My Training Experiment: I trained a PPO policy with the config file base_pipeline_rule_user.json (initialized with MLE policy weights trained with the default config) and got this result: Best Complete Rate: 0.95, Best Success Rate: 0.5, Best Average Return: 4.5. It is a good start for me, but still worse than the BERTNLU | RuleDST | PPOPolicy | TemplateNLG evaluation in the ConvLab-2 README (75.5 completion rate and 71.7 success rate). Where does this gap come from?
  4. My Evaluation Experiment: I evaluated my previously trained PPO model with policy/evaluate.py, but got a much worse result: "Complete 500 0.372 Success 500 0.228 Success strict 500 0.174". During the evaluation, there were two warnings: "Value not found in standard value set: [dontcare] (slot: name domain: restaurant)" and "Value [none] invalid! (Lexicalisation Error) (slot: name domain: hotel)". These seem to point to a dataset format mismatch between training and evaluation, because I am not sure whether I used the original MultiWOZ format or the unified data format to train and evaluate my policy model.
  5. For the user simulator: I found that tus, emoUS and genTUS can be trained and evaluated with the unified data format. However, I did not find unified data format support in the rule-based user simulator. Does that mean that if I train my models (NLU/NLG or Policy) with the unified data format, I cannot evaluate them with the rule-based user simulator?
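For context, this is roughly what I mean by the unified data format loading APIs. It is only a minimal sketch; the helper names and arguments are what I gathered from convlab/util, so please correct me if I am using them wrong:

```python
# Minimal sketch: load policy-relevant data in the unified format.
# The helpers and their arguments are my assumption based on
# convlab/util/unified_datasets_util.py.
from convlab.util import load_dataset, load_policy_data

# Load a dataset in the unified format by name, e.g. "multiwoz21" or "sgd".
dataset = load_dataset('multiwoz21')

# Turn-level samples (utterance, state, dialogue acts, ...) for the system
# side, i.e. the data a dialogue policy would be trained on.
policy_data = load_policy_data(dataset, data_split='train', speaker='system')

print(len(policy_data['train']), 'training samples')
print(policy_data['train'][0].keys())
```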

Looking forward to your reply,
James Cao

JamesCao2048 (Author) commented:

Another question: I found that my trained PPO policy outputs tens of system acts in every turn. Is that expected?

zqwerty (Member) commented Mar 11, 2024

@ChrisGeishauser could you give some guidance?

ChrisGeishauser (Contributor) commented:

Hi @JamesCao2048, thanks a lot for all your questions! I hope I can answer them sufficiently for you:

  1. For MLE training, this is explained in the README: https://github.com/ConvLab/ConvLab-3/tree/master/convlab/policy/mle. So when you execute train.py, you just pass --dataset_name=sgd and it should work. For the DDPT model (in the folder vtrace_DPT), it is also explained how to specify the dataset: in the pipeline configuration, under "vectorizer_sys", you set "dataset_name" = "sgd" (a rough sketch of these config fragments follows below this list). For PPO, it should be the very same as for DDPT (even though I have not checked it yet). But as you found out, there is at the moment unfortunately only an evaluator for MultiWOZ, so RL training is currently only possible on MultiWOZ. We are working on an SGD evaluator and hope to finish it soon.
  2. You are right, there is only a MultiWOZ evaluator at the moment, unfortunately, but we are working on an SGD evaluator.
  3. If the policy is loaded correctly, there should be an output in the terminal at the beginning like "dialogue policy loaded from checkpoint ...". If you do not see that, it is not loaded correctly. You have to be a bit careful here: you should not set "load_path" to "save/best_ppo.pol.mdl" but to "save/best_ppo", because the policy tries to load both the policy and the critic (see the sketch below this list). Sorry for that confusion! Please check whether the model is loaded correctly, and otherwise contact me again. This hopefully closes the gap.
  4. This is definitely the performance of a randomly initialised policy. Please check whether the policy is loaded correctly (see point 3 above).
  5. That is correct; unfortunately, the rule-based simulator only supports MultiWOZ at the moment.
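To make points 1 and 3 a bit more concrete, here is a rough sketch of the relevant configuration pieces. The exact nesting of the keys may differ from the real config files (please check convlab/policy/vtrace_DPT and convlab/policy/ppo for the actual schemas), so treat it as an illustration rather than the exact format:

```python
# Illustration only: the real pipeline configs may nest these entries
# differently; the key names follow the points above.
pipeline_config_fragment = {
    # Point 1: select a unified-format dataset for the system vectorizer.
    "vectorizer_sys": {
        "dataset_name": "sgd",      # e.g. "sgd" or "multiwoz21"
    },
    # Point 3: load a trained checkpoint. Note that there is NO file
    # extension here, because both the policy weights ("...pol.mdl") and the
    # critic weights are loaded from this prefix, so "save/best_ppo.pol.mdl"
    # would not load correctly.
    "model": {
        "load_path": "save/best_ppo",
    },
}

# Point 1, supervised MLE training on a unified dataset (as described in the
# MLE README):
#   python convlab/policy/mle/train.py --dataset_name=sgd
```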

> Another question: I found that my trained PPO policy outputs tens of system acts in every turn. Is that expected?

This is an indicator that you used a random policy, and in that case the output is expected: the architecture of the policy has an output dimension equal to the number of "atomic actions" (e.g. hotel-inform-phone or restaurant-request-price). For every atomic action there is a binary decision whether to use it or not. With a random policy, each atomic action is taken with a probability of roughly 50%, which leads to a lot of actions per turn.
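As a quick back-of-the-envelope illustration of why this produces so many actions per turn (the action count below is made up; the real number depends on the ontology):

```python
import random

# Hypothetical number of atomic actions such as hotel-inform-phone or
# restaurant-request-price; the real count depends on the ontology.
num_atomic_actions = 200

# A random (untrained) policy effectively makes an independent ~50/50
# decision for every atomic action in each turn.
chosen = [a for a in range(num_atomic_actions) if random.random() < 0.5]

# Roughly num_atomic_actions / 2, i.e. around 100 actions in a single turn.
print(len(chosen))
```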

I hope I could help you with the answers! Let me know if something is unclear.


Ahmed-Mahmod-Salem commented Mar 28, 2024

@ChrisGeishauser sorry to bother you, but do you have any estimate of when the evaluator class will be ready?

Another thing: the vectorizers seem to work only on the MultiWOZ dataset.
