Meta Dialog Platform: a toolkit for few-shot learning in NLP, covering the following tasks:
- Text Classification
- Sequence Labeling
It also provides the baselines for:
- Track-1 of SMP2020: Few-shot dialog language understanding.
- Benchmark Paper: "FewJoint: A Few-shot Learning Benchmark for Joint Language Understanding"
Updates:
- 2021.3.8: Fixed the wrong default settings in the few-shot data generator scripts.
- 2020.9.17: The FewJoint benchmark (the dataset for SMP) is available: paper, data, reformatted data (for MetaDialog).
State-of-the-art solutions for Few-shot NLP:
- Support few-shot learning for sequence-labeling tasks with state-of-the-art methods: CDT (Hou et al., 2020).
- Support using the semantics of label names or label descriptions.
- Support various deep pre-trained embeddings compatible with huggingface/transformers, such as BERT and Electra.
- Support the pair-wise embedding mechanism (Hou et al., 2020; Gao et al., 2019).
Easy-to-start & flexible framework:
- Provide tools for easy training & testing.
- Support various few-shot models with unified and extendable interfaces, such as ProtoNet and TapNet.
- Support easy-to-switch similarity-metrics and logits-scaling methods.
- Provide tools for generating episode-style data for meta-learning.
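To make the idea behind prototype-based few-shot models and switchable similarity metrics concrete, here is a minimal sketch in plain Python. It is not the platform's implementation: the function names, the toy 2-d embeddings, and the `scale` argument are all assumptions made for illustration. Each label's prototype is the mean of its support embeddings, and a query is assigned to the label with the highest (scaled) similarity.

```python
from collections import defaultdict


def build_prototypes(embeddings, labels):
    """Average the support embeddings of each label into one prototype vector."""
    grouped = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        grouped[label].append(vec)
    return {
        label: [sum(dim) / len(vecs) for dim in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def neg_sq_euclidean(a, b):
    """Similarity as negative squared Euclidean distance (higher is closer)."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))


def dot_product(a, b):
    """Similarity as a plain dot product."""
    return sum(x * y for x, y in zip(a, b))


def predict(query, prototypes, similarity=neg_sq_euclidean, scale=1.0):
    """Score the query against every prototype and return the best label and all logits."""
    logits = {label: scale * similarity(query, proto)
              for label, proto in prototypes.items()}
    return max(logits, key=logits.get), logits


# Toy support set (hypothetical 2-d embeddings, two labels with two shots each).
support_embeddings = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
support_labels = ["statement", "statement", "query", "query"]

prototypes = build_prototypes(support_embeddings, support_labels)
best_label, logits = predict([0.9, 0.1], prototypes)
print(best_label)  # statement
```

Swapping the metric only requires passing a different function, for example `predict(q, prototypes, similarity=dot_product)`.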
Please cite code and data:
@article{hou2020fewjoint,
title={FewJoint: A Few-shot Learning Benchmark for Joint Language Understanding},
author={Yutai Hou and Jiafeng Mao and Yongkui Lai and Cheng Chen and Wanxiang Che and Zhigang Chen and Ting Liu},
journal={arXiv preprint},
year={2020}
}
python>=3.6
torch>=1.2.0
transformers>=2.9.0
numpy>=1.17.0
tqdm>=4.31.1
allennlp>=0.8.4
pytorch-nlp
Here, we take the few-shot slot tagging and NER tasks from (Hou et al., 2020) as quick-start examples.
- Download the PyTorch BERT model, or convert the TensorFlow parameters yourself with the conversion scripts.
- Set the BERT path in
./scripts/run_1_shot_slot_tagging.sh
to your setting:
bert_base_uncased=/your_dir/uncased_L-12_H-768_A-12/
bert_base_uncased_vocab=/your_dir/uncased_L-12_H-768_A-12/vocab.txt
- Download the compatible few-shot data here: download
- Set the train, dev, and test data file paths in
./scripts/run_1_shot_slot_tagging.sh
to your setting.
For simplicity, you only need to set the root path for the data as follows:
base_data_dir=/your_dir/ACL2020data/
- Create a folder to collect the running logs:
mkdir result
- Execute the cross-evaluation script with two parameters, the GPU id and the dataset name:
source ./scripts/run_1_shot_slot_tagging.sh 0 snips
source ./scripts/run_1_shot_slot_tagging.sh 0 ner
To run 5-shot experiments, use
./scripts/run_5_shot_slot_tagging.sh
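Before launching the scripts, it can help to confirm that the paths configured in the steps above actually exist. The following is a small sketch of such a pre-flight check; `check_quick_start_paths` is a hypothetical helper and not part of the platform, and the demo uses a temporary directory in place of `/your_dir/`.

```python
import os
import tempfile


def check_quick_start_paths(bert_dir, data_dir, result_dir="result"):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if not os.path.isdir(bert_dir):
        problems.append(f"BERT directory not found: {bert_dir}")
    elif not os.path.isfile(os.path.join(bert_dir, "vocab.txt")):
        problems.append(f"vocab.txt missing in: {bert_dir}")
    if not os.path.isdir(data_dir):
        problems.append(f"data directory not found: {data_dir}")
    # Like `mkdir result`, but does not fail if the folder already exists.
    os.makedirs(result_dir, exist_ok=True)
    return problems


# Demo on a throwaway directory tree that mimics the layout described above.
with tempfile.TemporaryDirectory() as root:
    bert_dir = os.path.join(root, "uncased_L-12_H-768_A-12")
    data_dir = os.path.join(root, "ACL2020data")
    result_dir = os.path.join(root, "result")
    os.makedirs(bert_dir)
    os.makedirs(data_dir)

    before = check_quick_start_paths(bert_dir, data_dir, result_dir)
    open(os.path.join(bert_dir, "vocab.txt"), "w").close()  # add the vocab file
    after = check_quick_start_paths(bert_dir, data_dir, result_dir)

print(len(before), len(after))  # 1 0
```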
You can experiment freely by passing parameters to main.py
to choose different model architectures, hyperparameters, etc.
To view the detailed options and their descriptions, run:
python main.py --h
We provide scripts for general few-shot classification and sequence labeling tasks, respectively:
- classification
run_electra_sc.sh
run_bert_sc.sh
- sequence labeling
run_electra_sl.sh
run_bert_sl.sh
These scripts are used in much the same way as described in Get Started.
- Get the reformatted FewJoint data here, or construct episode-style data yourself with our tool.
- Use the scripts
./scripts/run_smp_bert_sc.sh
and
./scripts/run_smp_bert_sl.sh
to perform few-shot intent detection or few-shot slot filling, respectively.
- Notice that:
  - Change the train/dev/test paths in the scripts before running.
  - Find the predicted results at the
trained_model_path
set in the running scripts.
We also provide a generation tool for converting normal data into few-shot/meta-episode style.
The tool is located at scripts/other_tool/meta_dataset_generator.py.
Run the following command to view the detailed interface:
python generate_meta_dataset.py --h
For simplicity, we provide an example script to help generate few-shot data: ./scripts/gen_meta_data.sh.
The following key parameters control the generation process:
- input_dir: the raw data path
- output_dir: the output data path
- episode_num: the number of episodes to generate
- support_shots_lst: the support shot size in each episode; multiple values can be specified to generate several settings at the same time
- query_shot: the query shot size in each episode
- seed_lst: the list of random seeds that control random generation
- use_fix_support: use a fixed support set in the dev dataset
- dataset_lst: the dataset types the tool can handle; the choices are stanford, SLU, TourSG, and SMP
If you want to handle other types of datasets, add your code for loading the raw data in meta_dataset_generator/raw_data_loader.py.
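To illustrate how parameters such as episode_num, the support shot size, query_shot, and a random seed interact, here is a deliberately simplified sketch of episode sampling for classification data. It is not the algorithm used by the tool: the function names are hypothetical, the raw examples are assumed to be (tokens, tags, label) triples, and both shot sizes are assumed to be counted per label.

```python
import random
from collections import defaultdict


def pack(examples):
    """Arrange (tokens, tags, label) triples into seq_ins / seq_outs / labels lists."""
    return {
        "seq_ins": [tokens for tokens, _, _ in examples],
        "seq_outs": [tags for _, tags, _ in examples],
        "labels": [[label] for _, _, label in examples],
    }


def sample_episode(examples, support_shots, query_shot, rng):
    """Draw disjoint support and query sets with a fixed number of shots per label."""
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[2]].append(example)
    support, query = [], []
    needed = support_shots + query_shot
    for label in sorted(by_label):  # sorted for a reproducible label order
        if len(by_label[label]) < needed:
            raise ValueError(f"label {label!r} has fewer than {needed} examples")
        picked = rng.sample(by_label[label], needed)
        support.extend(picked[:support_shots])
        query.extend(picked[support_shots:])
    return {"support": pack(support), "query": pack(query)}


def generate_episodes(examples, episode_num, support_shots, query_shot, seed):
    """Generate episode_num episodes; the same seed always gives the same episodes."""
    rng = random.Random(seed)
    return [sample_episode(examples, support_shots, query_shot, rng)
            for _ in range(episode_num)]


# Toy raw data: four two-token sentences for each of two labels.
raw = [([label, str(i)], ["O", "O"], label)
       for label in ("statement", "query") for i in range(4)]

episodes = generate_episodes(raw, episode_num=3, support_shots=1, query_shot=2, seed=0)
print(len(episodes), len(episodes[0]["support"]["seq_ins"]), len(episodes[0]["query"]["seq_ins"]))
# 3 2 4
```

Running the generator once per value in a list of shot sizes or seeds would mirror the role of support_shots_lst and seed_lst.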
{
"domain_name": [
{ // episode
"support": { // support set
"seq_ins": [["we", "are", "friends", "."], ["how", "are", "you", "?"]], // input sequence
"seq_outs": [["O", "O", "O", "O"], ["O", "O", "O", "O"]], // output sequence in sequence labeling task
"labels": [["statement"], ["query"]] // output labels in classification task
},
"query": { // query set
"seq_ins": [["we", "are", "friends", "."], ["how", "are", "you", "?"]],
"seq_outs": [["O", "O", "O", "O"], ["O", "O", "O", "O"]],
"labels": [["statement"], ["query"]]
}
},
...
],
...
}
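The episode format above can be read and sanity-checked with a few lines of Python. This is a minimal sketch rather than the platform's own loader: `check_part` and `summarize` are hypothetical helpers, and the embedded string is a comment-free copy of the example, since standard JSON does not allow `//` comments.

```python
import json

# Comment-free copy of the example episode shown above.
EXAMPLE_JSON = """
{
  "domain_name": [
    {
      "support": {
        "seq_ins": [["we", "are", "friends", "."], ["how", "are", "you", "?"]],
        "seq_outs": [["O", "O", "O", "O"], ["O", "O", "O", "O"]],
        "labels": [["statement"], ["query"]]
      },
      "query": {
        "seq_ins": [["we", "are", "friends", "."], ["how", "are", "you", "?"]],
        "seq_outs": [["O", "O", "O", "O"], ["O", "O", "O", "O"]],
        "labels": [["statement"], ["query"]]
      }
    }
  ]
}
"""


def check_part(part):
    """Check one support or query set and return its number of examples."""
    seq_ins, seq_outs, labels = part["seq_ins"], part["seq_outs"], part["labels"]
    if not (len(seq_ins) == len(seq_outs) == len(labels)):
        raise ValueError("seq_ins, seq_outs and labels must have the same length")
    for tokens, tags in zip(seq_ins, seq_outs):
        if len(tokens) != len(tags):
            raise ValueError("each token needs exactly one output tag")
    return len(seq_ins)


def summarize(data):
    """Map each domain to a list of (support size, query size), one pair per episode."""
    return {
        domain: [(check_part(ep["support"]), check_part(ep["query"]))
                 for ep in episodes]
        for domain, episodes in data.items()
    }


summary = summarize(json.loads(EXAMPLE_JSON))
print(summary)  # {'domain_name': [(2, 2)]}
```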
The platform is developed by HIT-SCIR. If you have any questions or suggestions, please contact us (Yutai Hou - ythou@ir.hit.edu.cn or Yongkui Lai - yklai@ir.hit.edu.cn).
Apache License 2.0