Dataset

The data collection process is illustrated below:
We fed GPT-3.5 captions from 3K images together with descriptions of 22 visual tasks. This produced 66K instructions, each corresponding to a specific visual task and a visual foundation model (tool). We then removed duplicate instructions and retained 41K sound instructions. To teach the model to use tools in a predefined manner, we followed the prompt format used in Visual ChatGPT and converted these instructions into a conversational format. In parallel, we generated negative data without tool usage by randomly sampling 3K instructions from alpaca_gpt4_data and converting them to the same format. Using the resulting 71K instructions, we fine-tuned Vicuna with LoRA to obtain GPT4Tools, which can automatically decide on, control, and use distinct tools in a conversation.
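The deduplication and negative-sampling steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `dedupe_instructions` and `mix_negatives`, the duplicate criterion (exact match after lowercasing and collapsing whitespace), and the fixed random seed are all assumptions made for the example.

```python
import random


def dedupe_instructions(samples):
    """Drop samples whose normalized 'instruction' text was already seen.

    Assumption: duplicates are detected by exact match after lowercasing
    and collapsing whitespace; the real pipeline may use a different rule.
    """
    seen = set()
    unique = []
    for sample in samples:
        key = " ".join(sample["instruction"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


def mix_negatives(tool_samples, alpaca_samples, n_negative, seed=0):
    """Append up to n_negative randomly chosen tool-free samples."""
    rng = random.Random(seed)  # fixed seed (hypothetical) for repeatability
    n = min(n_negative, len(alpaca_samples))
    return tool_samples + rng.sample(alpaca_samples, n)
```

With real data, `tool_samples` would hold the tool-related instructions and `alpaca_samples` the entries loaded from alpaca_gpt4_data, with `n_negative` set to the number of negatives to draw.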

Each sample follows the format below:

{
    'instruction': xxx,
    'input': xxx,
    'output': xxx,
}
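As a quick sanity check on a downloaded file, the sketch below loads it and reports any samples that lack one of the three fields. It assumes the file is a single JSON array of objects in the format shown above; the helper names `load_samples` and `find_malformed` are hypothetical and not part of the repository.

```python
import json

# The three fields every sample is expected to carry.
REQUIRED_KEYS = ("instruction", "input", "output")


def load_samples(path):
    """Load a dataset file, assumed to be one JSON array of sample objects."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def find_malformed(samples):
    """Return indices of samples that are not dicts or miss a required key."""
    return [
        i
        for i, sample in enumerate(samples)
        if not isinstance(sample, dict)
        or any(key not in sample for key in REQUIRED_KEYS)
    ]
```

For example, `find_malformed(load_samples("gpt4tools_71k.json"))` would return an empty list if every sample has all three fields.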

Download

| Data file name | Size | OneDrive | Google Drive |
|---|---|---|---|
| gpt4tools_71k.json | 229 MB | link | link |
| gpt4tools_val_seen.json | -- | link | link |
| gpt4tools_test_unseen.json | -- | link | link |

  • gpt4tools_71k.json contains the 71K instruction-following samples used to fine-tune the GPT4Tools model.

  • gpt4tools_val_seen.json is the manually cleaned instruction data used for validation; it includes instructions related to tools that appear in gpt4tools_71k.json.

  • gpt4tools_test_unseen.json is the cleaned instruction data used for testing; it includes instructions related to some tools that are absent from gpt4tools_71k.json.

Generation

Generation uses GPT-3.5, so the OpenAI API key must be set in the environment variable OPENAI_API_KEY.
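To fail early when the key is missing, a small pre-flight check along these lines could be run before starting generation. The helper `require_openai_key` is hypothetical and not part of the repository; it only reads the OPENAI_API_KEY variable named above.

```python
import os


def require_openai_key(env=None):
    """Return the API key from the environment, or raise if it is unset.

    `env` defaults to os.environ; passing a plain dict is handy for testing.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running generation."
        )
    return key
```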

  • Raw Data Generation
python3 gpt4tools/data/get_instruction.py \
        --caption-path <your_caption_data_path> \
        --instruction-path <instruction_data_path>
  • Cleaning and Instructional Data Construction
python3 gpt4tools/data/generate_annoations.py \
        --input-path <instruction_data_path> \
        --output-path <annotations_path> \
        --caption-path <your_caption_data_path> \
        --alpaca-path <your_alpaca_instruction_path> \
        --filter \
        --complement \
        --insert-alpaca
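If you prefer to drive both steps from one script, the two commands above could be chained as in the sketch below. The script paths and flags are copied from the commands above; the wrapper functions `build_commands` and `run_pipeline`, and the use of `sys.executable` in place of `python3`, are assumptions made for this example.

```python
import subprocess
import sys


def build_commands(caption_path, instruction_path, annotations_path, alpaca_path):
    """Build the argument lists for raw generation and for cleaning."""
    raw_generation = [
        sys.executable, "gpt4tools/data/get_instruction.py",
        "--caption-path", caption_path,
        "--instruction-path", instruction_path,
    ]
    cleaning = [
        sys.executable, "gpt4tools/data/generate_annoations.py",
        "--input-path", instruction_path,
        "--output-path", annotations_path,
        "--caption-path", caption_path,
        "--alpaca-path", alpaca_path,
        "--filter",
        "--complement",
        "--insert-alpaca",
    ]
    return [raw_generation, cleaning]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Passing argument lists rather than one shell string avoids quoting problems when paths contain spaces, and `check=True` keeps the cleaning step from running on the output of a failed generation step.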