SEC

This repository contains the implementation of the IEEE AIKE 2022 paper "Few-shot Text Classification with Saliency-equivalent Concatenation" (SEC).

Introduction to SEC

SEC is an unsupervised data augmentation approach that generates additional key information for a given sentence.

In our experiments (paper link), SEC improves few-shot text classification across several meta-learning models, and it can be applied to any text classification task.

Run the SEC code

Install required packages

pip install -r requirements.txt
python -m spacy download en_core_web_sm

Prepare folders

mkdir experiments

Prepare raw data

Using T5

Provide a plain-text file with one sentence per line.
- An example file, huffpost_test.txt, is located in the SEC/data folder.
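If your sentences live in another format, a small script can produce the expected one-sentence-per-line file. The following is a minimal sketch, not part of the repository: the helper name `write_sentences`, the file name `my_sentences.txt`, and the sample sentences are all hypothetical.

```python
import tempfile
from pathlib import Path


def write_sentences(sentences, path):
    """Write one sentence per line; return the number of lines written.

    Internal whitespace (including newlines) is collapsed to single spaces
    so that each sentence occupies exactly one line. Empty entries are skipped.
    """
    cleaned = [" ".join(s.split()) for s in sentences]
    cleaned = [s for s in cleaned if s]
    Path(path).write_text("".join(s + "\n" for s in cleaned), encoding="utf-8")
    return len(cleaned)


# Demo in a temporary directory (hypothetical sentences and file name).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "my_sentences.txt"
    n_written = write_sentences(
        ["A headline split\nacross two lines", "   ", "Another headline"],
        demo_path,
    )
    demo_text = demo_path.read_text(encoding="utf-8")
```

The resulting file can then be passed to the script through `--data_path`.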

Choose the pre-trained T5 model

English: t5-large
Japanese (currently not stable):

Run the code

Set gen_model_name to the name of the desired pre-trained T5 model.
- E.g., gen_model_name=google/mt5-large

An example shell script, run_sec.sh, is located in the SEC/scripts folder.

Output tokens per line (specify `--output_tokens`)

gpu_id=0
gen_model_name=google/mt5-large
cls_model_name=roberta-large-mnli
mode=E_only  # assumed value, inferred from the default output filename; adjust as needed

python src/t5_summarize.py \
--gpu_id $gpu_id \
--data_path data/huffpost_test.txt \
--batch_size 4 \
--gen_model_name $gen_model_name \
--cls_model_name $cls_model_name \
--decoding_strategy 'top-k' \
--topk_value 40 \
--num_generate_per_sentence 10 \
--filter_mode $mode \
--output_length 128 \
--output_tokens

Alternatively, output one text string per line (omit `--output_tokens`)

gpu_id=0
gen_model_name=google/mt5-large
cls_model_name=roberta-large-mnli
mode=E_only  # assumed value, inferred from the default output filename; adjust as needed

python src/t5_summarize.py \
--gpu_id $gpu_id \
--data_path data/huffpost_test.txt \
--batch_size 4 \
--gen_model_name $gen_model_name \
--cls_model_name $cls_model_name \
--decoding_strategy 'top-k' \
--topk_value 40 \
--num_generate_per_sentence 10 \
--filter_mode $mode \
--output_length 128

Output files

Example input filename: data/huffpost_test.txt

Using T5

Generated sentences (.txt)
- Default: experiments/t5-large_huffpost_test_10N_topk_40_l128.txt
Concatenated sentences (.json)
- Default: data/t5-large_huffpost_test_10N_topk_40_l128_roberta-large-mnli_E_only.json
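The default names above appear to be assembled from the command-line parameters. The sketch below reproduces only the two example names shown here. The naming pattern is an assumption inferred from those examples, not taken from `src/t5_summarize.py`, and the helper `default_output_paths` is hypothetical.

```python
from pathlib import Path


def default_output_paths(data_path, gen_model_name, cls_model_name,
                         num_generate, decoding_strategy, value,
                         output_length, filter_mode):
    """Guess the default output paths from the parameters (assumed pattern)."""
    stem = Path(data_path).stem                 # "huffpost_test"
    tag = decoding_strategy.replace("-", "")    # "top-k" -> "topk"
    base = f"{gen_model_name}_{stem}_{num_generate}N_{tag}_{value}_l{output_length}"
    generated = Path("experiments") / f"{base}.txt"
    concatenated = Path("data") / f"{base}_{cls_model_name}_{filter_mode}.json"
    return generated, concatenated


generated, concatenated = default_output_paths(
    data_path="data/huffpost_test.txt",
    gen_model_name="t5-large",
    cls_model_name="roberta-large-mnli",
    num_generate=10,
    decoding_strategy="top-k",
    value=40,
    output_length=128,
    filter_mode="E_only",  # assumed to be the filter_mode value
)
print(generated.as_posix())
print(concatenated.as_posix())
```

Model names that contain a slash, such as google/mt5-large, may be handled differently by the actual script, so check the experiments and data folders after a run.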

Parameters

gpu_id: ID of the GPU to use
data_path: file path of data (one sentence per line)
batch_size: test batch size for model inference
gen_model_name: generation model
cls_model_name: model for sentence filtering
decoding_strategy: decoding technique during sentence generation
topk_value: value for top-k sampling (specify one of topk_value, topp_value, and beam_size)
topp_value: value for top-p sampling (specify one of topk_value, topp_value, and beam_size)
beam_size: value for beam search (specify one of topk_value, topp_value, and beam_size)
num_generate_per_sentence: number of synthetic sentences generated for each input sentence
filter_mode: mode for NLI (natural language inference) during sentence filtering
output_length: the maximum length of T5-output sequences
output_tokens: if specified, tokens (rather than text strings) are saved in the output file (.json)
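To illustrate the rule that exactly one of topk_value, topp_value, and beam_size should be given, here is a minimal sketch that validates the choice and maps it to keyword arguments in the style of Hugging Face `generate` (`do_sample`, `top_k`, `top_p`, `num_beams`, `num_return_sequences`, `max_length`). The helper `build_generation_kwargs` is hypothetical, and the strategy names 'top-p' and 'beam' are assumptions; only 'top-k' appears in the examples above. How `src/t5_summarize.py` handles these options internally may differ.

```python
def build_generation_kwargs(decoding_strategy, num_generate_per_sentence,
                            output_length, topk_value=None, topp_value=None,
                            beam_size=None):
    """Validate the decoding options and return generate()-style kwargs."""
    options = {"topk_value": topk_value, "topp_value": topp_value,
               "beam_size": beam_size}
    given = [name for name, value in options.items() if value is not None]
    if len(given) != 1:
        raise ValueError(
            "Specify exactly one of topk_value, topp_value, and beam_size; "
            f"got {given or 'none'}."
        )

    # Which option each strategy needs ('top-p' and 'beam' are assumed names).
    required = {"top-k": "topk_value", "top-p": "topp_value", "beam": "beam_size"}
    if decoding_strategy not in required:
        raise ValueError(f"Unknown decoding_strategy: {decoding_strategy!r}")
    if given[0] != required[decoding_strategy]:
        raise ValueError(
            f"decoding_strategy {decoding_strategy!r} needs "
            f"{required[decoding_strategy]}, but {given[0]} was given."
        )

    kwargs = {"max_length": output_length,
              "num_return_sequences": num_generate_per_sentence}
    if decoding_strategy == "top-k":
        kwargs.update(do_sample=True, top_k=topk_value)
    elif decoding_strategy == "top-p":
        # top_k=0 disables top-k filtering so that only top-p applies.
        kwargs.update(do_sample=True, top_p=topp_value, top_k=0)
    else:
        # Beam search cannot return more sequences than there are beams.
        if num_generate_per_sentence > beam_size:
            raise ValueError("beam_size must be >= num_generate_per_sentence.")
        kwargs.update(do_sample=False, num_beams=beam_size)
    return kwargs


topk_kwargs = build_generation_kwargs("top-k", 10, 128, topk_value=40)
print(topk_kwargs)
```

With the settings from the example commands (top-k sampling, value 40, 10 sentences, length 128), the helper returns sampling arguments only. Passing two of the three values, or none, raises a `ValueError`.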

Cite our paper

Please cite our paper if you use our code. Thank you!

@INPROCEEDINGS{9939269,
  author={Lin, Ying-Jia and Chang, Yu-Fang and Kao, Hung-Yu and Wang, Hsin-Yang and Liu, Mu},
  booktitle={2022 IEEE Fifth International Conference on Artificial Intelligence and Knowledge Engineering (AIKE)}, 
  title={Few-shot Text Classification with Saliency-equivalent Concatenation}, 
  year={2022},
  volume={},
  number={},
  pages={74-81},
  doi={10.1109/AIKE55402.2022.00019}}
