in-domain supervised evaluation #1

Closed
ShDdu opened this issue Nov 12, 2024 · 1 comment
ShDdu commented Nov 12, 2024

Hello, and thank you very much for open-sourcing such an excellent project. While reproducing the experiments in your paper, my out-of-domain evaluation results are close to the numbers reported in the paper, but the in-domain supervised results differ considerably: my average English F1 is only around 70, versus 83.85 in the paper. I used the B2NER_all data, the language model InternLM2-7b (internlm/internlm2-7b), and the training script train_lora_internlm2_bilingual_full.sh. For the data configuration, I replaced the datasets listed in test_tasks.json under configs/bilingual_full with the datasets from train_tasks.json, and changed "rand" to "full" in train_tasks.json. Could you please advise how to reproduce the in-domain supervised results from your paper? Thank you very much!

UmeanNever (Owner) commented Nov 14, 2024

Thank you for your interest! For the benefit of others with the same question, I will answer in English below.
Thanks for your interest and for replicating the out-of-domain evaluation! The following points may help in customizing the running script to replicate the in-domain supervised evaluation:

  1. We have uploaded an example task (data) configuration for in-domain supervised evaluation under configs/ml_configs/bilingual_idsupervised, which specifies the datasets used for training and testing under this setting. You can refer to it to update the TASK_CONFIG_DIR in your running script. Note that, following previous work, it differs slightly from simply copying the OOD evaluation configuration and updating the testing datasets.
  2. Use the B2NERD_all dataset instead of B2NERD for training. Set DATA_DIR to the path of your local B2NERD_all dataset (which you may have already done).
  3. Set the training argument max_num_instances_per_task to 10000 as specified in the paper. Consistent with previous work, we sample 10,000 examples from each training dataset under the supervised setting.
  4. Consider full-parameter fine-tuning with a learning rate of around 2e-5 instead of LoRA; this should be beneficial for supervised tasks. You can achieve this by setting use_lora to False or removing it.
  5. Consider disabling training regularization methods such as random label dropout. These regularization methods mainly improve OOD generalization rather than in-domain fitting. You can disable them by removing arguments like dynamic_range, droplabel_rate, and label_shuffle.
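
As a rough illustration of point 1: under the in-domain supervised setting, train_tasks.json and test_tasks.json refer to the same datasets (training and test splits of each), with full sampling on the training side, whereas the OOD configuration holds the test datasets out entirely. The snippet below is only a sketch of that idea — the field names are guesses based on the "rand"/"full" sampling values mentioned in the question, and the dataset names are placeholders; please check the actual files under configs/ml_configs/bilingual_idsupervised for the real schema:

```json
{
  "NER": [
    {"sampling strategy": "full", "dataset name": "example_dataset_en"},
    {"sampling strategy": "full", "dataset name": "example_dataset_zh"}
  ]
}
```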

Other configurations should remain the same as those in train_lora_internlm2_bilingual_full.sh. Combining all the points, you can refer to the settings below to configure the running script for in-domain supervised evaluation. Please note that hyperparameters such as batch size and learning rate may require adjustment, and the results could vary slightly depending on the computing environment.

   ...
   TASK_CONFIG_DIR="configs/ml_configs/bilingual_idsupervised"
   DATA_DIR=".../B2NERD_all"
   ...
      deepspeed --include localhost:0,1,2,3,4,5,6,7 --master_port $port src/run.py \
         --do_train \
         --do_predict \
         --predict_with_generate \
         --predict_each_epoch \
         --lang auto \
         --model_name_or_path $MODEL_NAME_OR_PATH \
         --data_dir $DATA_DIR \
         --task_config_dir $TASK_CONFIG_DIR \
         --instruction_file $INSTRUCTION_CONFIG \
         --instruction_strategy single \
         --output_dir $OUTPUT_DIR \
         --input_record_file $INPUT_RECORD \
         --bf16 True \
         --seed $SEED \
         --per_device_train_batch_size 1 \
         --per_device_eval_batch_size 2 \
         --gradient_accumulation_steps $GRAD_ACC \
         --gradient_checkpointing True \
         --learning_rate 2e-05 \
         --adam_beta1 0.9 \
         --adam_beta2 0.98 \
         --weight_decay 1e-4 \
         --warmup_ratio 0.02 \
         --lr_scheduler_type "cosine" \
         --adam_epsilon 1e-8 \
         --num_train_epochs 4 \
         --deepspeed $DS_CONFIG \
         --run_name $RUN_NAME \
         --max_source_length 2048 \
         --max_target_length 1024 \
         --generation_max_length 1024 \
         --max_num_instances_per_task 10000 \
         --max_num_instances_per_eval_task 500 \
         --add_task_name False \
         --add_dataset_name False \
         --num_examples 0 \
         --num_examples_test 0 \
         --train_0shot_prop 1 \
         --train_fewshot_prop 0 \
         --overwrite_output_dir \
         --overwrite_cache \
         --logging_strategy steps \
         --logging_steps 50 \
         --evaluation_strategy epoch \
         --eval_steps 2000 \
         --save_strategy epoch \
         --save_steps 2000 \
         --report_to "none" \
         --log_level info

Hope this information is helpful to you. We also plan to include the relevant details in the appendix of future versions of the paper.
