
How to train and evaluate policy models with unified dataset format? #191

JamesCao2048 opened this issue Mar 6, 2024 · 4 comments

JamesCao2048 commented Mar 6, 2024

Hi there, I noticed that there are APIs to load NLU, DST, Policy and NLG data in the unified data format (a minimal loading sketch is included after the questions below). I also found the training and evaluation guides for NLU/DST/NLG with unified data in $model/README.md or NLU/DST/NLG/evaluate_unified_datasets.py. However, I did not find a guide for how to train and evaluate policy models with the unified data format. Specifically, I have the following questions:

  1. Training: I did not find support for training with the unified data format in $policy_model/train.py, such as ppo/train.py and mle/train.py; it seems that they use MultiWozEvaluator by default.
  2. Evaluation: I did not find support for evaluation with the unified data format in policy/evaluate.py; it seems that it also uses MultiWozEvaluator by default.
  3. My Training Experiment: I trained a PPO policy with the config file base_pipeline_rule_user.json (initialized with MLE policy weights trained with the default config) and got this result: Best Complete Rate: 0.95, Best Success Rate: 0.5, Best Average Return: 4.5. It is a good start for me, but still worse than the BERTNLU | RuleDST | PPOPolicy | TemplateNLG evaluation in the ConvLab-2 README (75.5 completion rate and 71.7 success rate). Where does this gap come from?
  4. My Evaluation Experiment: I evaluated my previously trained PPO model with policy/evaluate.py, but got a much worse result: "Complete 500 0.372 Success 500 0.228 Success strict 500 0.174". During the evaluation, there were two warnings: "Value not found in standard value set: [dontcare] (slot: name domain: restaurant)" and "Value [none] invalid! (Lexicalisation Error) (slot: name domain: hotel)". These seem to point to a dataset format mismatch between training and evaluation, because I am not sure whether I used the original MultiWOZ format or the unified data format to train and evaluate my policy model.
  5. For the user simulator: I found that tus, emoUS and genTUS can be trained and evaluated with the unified data format. However, I did not find unified data format support in the rule-based user simulator. Does that mean that if I train my models (NLU/NLG or Policy) with the unified data format, I cannot evaluate them with the rule-based user simulator?
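For context, this is roughly what I mean by the unified data format loading APIs. It is only a minimal sketch; the helper names and arguments are what I gathered from convlab/util, so please correct me if I am using them wrong:

```python
# Minimal sketch: load policy-relevant data in the unified format.
# The helpers and their arguments are my assumption based on
# convlab/util/unified_datasets_util.py.
from convlab.util import load_dataset, load_policy_data

# Load a dataset in the unified format by name, e.g. "multiwoz21" or "sgd".
dataset = load_dataset('multiwoz21')

# Turn-level samples (utterance, state, dialogue acts, ...) for the system
# side, i.e. the data a dialogue policy would be trained on.
policy_data = load_policy_data(dataset, data_split='train', speaker='system')

print(len(policy_data['train']), 'training samples')
print(policy_data['train'][0].keys())
```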

Looking forward to your reply,
James Cao

JamesCao2048 (Author) commented:

Another question: I found that my trained PPO policy outputs tens of system acts in every turn. Is that expected?

zqwerty (Member) commented Mar 11, 2024

@ChrisGeishauser could you give some guidance?

ChrisGeishauser (Contributor) commented:

Hi @JamesCao2048, thanks a lot for all your questions! I hope I can answer them sufficiently for you:

  1. For MLE training, this is explained in the README: https://github.com/ConvLab/ConvLab-3/tree/master/convlab/policy/mle. So when you execute train.py, you just pass --dataset_name=sgd and it should work. For the DDPT model (in the folder vtrace_DPT), it is also explained how to specify the dataset: in the pipeline configuration, under "vectorizer_sys", you set "dataset_name" = "sgd" (a rough sketch of these config fragments follows below this list). For PPO, it should be the very same as for DDPT (even though I have not checked it yet). But as you found out, there is at the moment unfortunately only an evaluator for MultiWOZ, so RL training is currently only possible on MultiWOZ. We are working on an SGD evaluator and hope to finish it soon.
  2. You are right, there is only a MultiWOZ evaluator at the moment, unfortunately, but we are working on an SGD evaluator.
  3. If the policy is loaded correctly, there should be an output in the terminal at the beginning like "dialogue policy loaded from checkpoint ...". If you do not see that, it is not loaded correctly. You have to be a bit careful here: you should not set "load_path" to "save/best_ppo.pol.mdl" but to "save/best_ppo", because the policy tries to load both the policy and the critic (see the sketch below this list). Sorry for that confusion! Please check whether the model is loaded correctly, and otherwise contact me again. This hopefully closes the gap.
  4. This is definitely the performance of a randomly initialised policy. Please check whether the policy is loaded correctly (see point 3 above).
  5. That is correct; unfortunately, the rule-based simulator only supports MultiWOZ at the moment.
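To make points 1 and 3 a bit more concrete, here is a rough sketch of the relevant configuration pieces. The exact nesting of the keys may differ from the real config files (please check convlab/policy/vtrace_DPT and convlab/policy/ppo for the actual schemas), so treat it as an illustration rather than the exact format:

```python
# Illustration only: the real pipeline configs may nest these entries
# differently; the key names follow the points above.
pipeline_config_fragment = {
    # Point 1: select a unified-format dataset for the system vectorizer.
    "vectorizer_sys": {
        "dataset_name": "sgd",      # e.g. "sgd" or "multiwoz21"
    },
    # Point 3: load a trained checkpoint. Note that there is NO file
    # extension here, because both the policy weights ("...pol.mdl") and the
    # critic weights are loaded from this prefix, so "save/best_ppo.pol.mdl"
    # would not load correctly.
    "model": {
        "load_path": "save/best_ppo",
    },
}

# Point 1, supervised MLE training on a unified dataset (as described in the
# MLE README):
#   python convlab/policy/mle/train.py --dataset_name=sgd
```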

> Another question: I found that my trained PPO policy outputs tens of system acts in every turn. Is that expected?

This is an indicator that you used a random policy, and in that case the output is expected: the architecture of the policy has an output dimension equal to the number of "atomic actions" (e.g. hotel-inform-phone or restaurant-request-price). For every atomic action there is a binary decision whether to use it or not. With a random policy, each atomic action is taken with a probability of roughly 50%, which leads to a lot of actions per turn.
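As a quick back-of-the-envelope illustration of why this produces so many actions per turn (the action count below is made up; the real number depends on the ontology):

```python
import random

# Hypothetical number of atomic actions such as hotel-inform-phone or
# restaurant-request-price; the real count depends on the ontology.
num_atomic_actions = 200

# A random (untrained) policy effectively makes an independent ~50/50
# decision for every atomic action in each turn.
chosen = [a for a in range(num_atomic_actions) if random.random() < 0.5]

# Roughly num_atomic_actions / 2, i.e. around 100 actions in a single turn.
print(len(chosen))
```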

I hope I could help you with the answers! Let me know if something is unclear.


Ahmed-Mahmod-Salem commented Mar 28, 2024

@ChrisGeishauser sorry to bother you, but do you have any estimate of when the evaluator class will be ready?

Another thing: the vectorizers seem to work only on the MultiWOZ dataset.
