dataset_process_pipeline

Converts a source code repository into a post-pretraining dataset for CodeQwen.

To start the pipeline, run: sh pipeline.sh
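
The steps inside pipeline.sh are not described here, so the following is only an illustrative sketch of what a repository-to-dataset conversion could look like, not the actual implementation. It assumes the dataset is written as JSON Lines with one record per source file. The names iter_source_files, repo_to_jsonl, SOURCE_EXTENSIONS and MAX_FILE_BYTES, the record fields, and the filtering rules are all hypothetical.

```python
import json
import os
from pathlib import Path

# Hypothetical defaults: the real pipeline's file filters are not documented here.
SOURCE_EXTENSIONS = {".py", ".js", ".java", ".c", ".cpp", ".go", ".rs"}
MAX_FILE_BYTES = 1_000_000


def iter_source_files(repo_root):
    """Yield source files under repo_root in sorted order, skipping hidden directories."""
    for dirpath, dirnames, filenames in os.walk(Path(repo_root)):
        # Prune hidden directories (e.g. .git) in place and sort for a stable order.
        dirnames[:] = sorted(d for d in dirnames if not d.startswith("."))
        for name in sorted(filenames):
            path = Path(dirpath) / name
            if path.suffix in SOURCE_EXTENSIONS and path.stat().st_size <= MAX_FILE_BYTES:
                yield path


def repo_to_jsonl(repo_root, output_path):
    """Write one JSON record per source file and return the number of records written."""
    root = Path(repo_root)
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in iter_source_files(root):
            try:
                text = path.read_text(encoding="utf-8")
            except UnicodeDecodeError:
                continue  # Skip files that are not valid UTF-8 text.
            # Assumed record layout: repository-relative path plus raw file content.
            record = {"path": path.relative_to(root).as_posix(), "content": text}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Under these assumptions, calling repo_to_jsonl("path/to/repo", "dataset.jsonl") would produce one line per retained source file. A real post-pretraining pipeline would likely add further stages such as deduplication, license filtering and tokenization, none of which are shown here.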