
support user defined prompters, pretokenized datasets in config, local parquet, local arrow files #348

Merged
winglian merged 7 commits into main from yml-prompter on Aug 20, 2023

Conversation

@winglian (Collaborator) commented Aug 7, 2023

For user-defined prompters:

datasets:
  - path: path/to/custom.jsonl
    type:
      system_prompt: "Below is a conversation between a user and a helpful assistant"
      field_instruction: question
      field_output: answer
      format: |-
        User: {instruction}
        Assistant:

The above defines a prompt strategy for a data file with question and answer features, mapping them to the instruction and output fields and rendering them according to the format. {output} is not needed in the format string, as it is assumed to be appended at the end.
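
For illustration, here is a minimal sketch (assumed behavior, not axolotl's actual code) of what this mapping produces for one record; the helper name and the single-space join before the output are assumptions:

def render_example(record, system_prompt, fmt,
                   field_instruction="question", field_output="answer"):
    """Render one JSONL record into a single training prompt string."""
    turn = fmt.format(instruction=record[field_instruction])
    # {output} is assumed to be appended after the formatted turn
    return f"{system_prompt}\n{turn} {record[field_output]}"

record = {"question": "What color is the sky?", "answer": "Blue."}
print(render_example(
    record,
    "Below is a conversation between a user and a helpful assistant",
    "User: {instruction}\nAssistant:",
))
# Below is a conversation between a user and a helpful assistant
# User: What color is the sky?
# Assistant: Blue.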

Support for pretokenized datasets in the config is automatic: it checks for the input_ids, labels, and attention_mask features on the dataset, which would only exist on a pretokenized dataset.
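
The detection could be approximated like this (a sketch of the idea, not the exact check in axolotl; the file path is a placeholder):

from datasets import load_dataset

ds = load_dataset("json", data_files="path/to/pretokenized.jsonl")["train"]
# treat the dataset as pretokenized if it already carries these columns
is_pretokenized = {"input_ids", "labels", "attention_mask"}.issubset(ds.column_names)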

Parquet and Arrow files are supported either by setting the ds_type option for a dataset, or automatically when a file under data_files has the suffix .arrow or .parquet.
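
A hypothetical config combining these options (the paths and the alpaca type are placeholders; ds_type could also be omitted and inferred from the .parquet suffix):

datasets:
  - path: path/to/local
    ds_type: parquet
    data_files:
      - path/to/local/train.parquet
    type: alpaca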

winglian marked this pull request as draft August 7, 2023 03:34
@enn-nafnlaus commented Aug 7, 2023

Can't wait for this one! :) To be clear: in your above example, it would prompt in the form:

f"Below is a conversation between a user and a helpful assistant{question}{answer}"

... right? Doesn't insert any other text in the mix?

@winglian (Collaborator, Author) commented Aug 7, 2023

Can't wait for this one! :) To be clear: in your above example, it would prompt in the form:

f"Below is a conversation between a user and a helpful assistant{question}{answer}"

... right? Doesn't insert any other text in the mix?

I updated my example to include the format key. You would want to simply have {instruction} as your format in that case.

@enn-nafnlaus

datasets:
  - path: path/to/custom.jsonl
    type:
      system_prompt: "Below is a conversation between a user and a helpful assistant"
      field_instruction: question
      field_output: answer
      format: |-
        User: {instruction}
        Assistant:

So with:

format: {instruction}

... then {instruction} would be the input, and {answer} would be used for training the output for that input? Because that's what I need :)

In your example, how does "question" fit into the picture?

@winglian (Collaborator, Author) commented Aug 7, 2023

@enn-nafnlaus question is the field name in your dataset for your instruction, e.g. {"question": "...", "answer": "..."}.
Or, if your dataset used instruction for the field name, it would simply be field_instruction: instruction.
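
Illustratively (example records, not taken from the PR):

# record: {"question": "...", "answer": "..."}
field_instruction: question
field_output: answer

# record: {"instruction": "...", "output": "..."}
field_instruction: instruction
field_output: output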

@enn-nafnlaus

@enn-nafnlaus question is the field name in your dataset for your instruction, e.g. {"question": "...", "answer": "..."}. Or, if your dataset used instruction for the field name, it would simply be field_instruction: instruction.

Okay, it's just that your example used all three: "question" was given as the instruction field, but then later you used "instruction" itself. Just wondering if that was an accident or not.

@enn-nafnlaus commented Aug 8, 2023

datasets:
  - path: /scratch/LLM_Training/summarize_training.json
    type:
      system_prompt: ""
      field_instruction: instruction
      field_output: output
      format: {instruction}

Yields:

Traceback (most recent call last):
File "/path/to/axolotl/scripts/finetune.py", line 444, in
fire.Fire(train)
File "/path/to/.local/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/path/to/.local/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/path/to/.local/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/path/to/axolotl/scripts/finetune.py", line 305, in train
train_dataset, eval_dataset = load_prepare_datasets(
File "/path/to/axolotl/src/axolotl/utils/data.py", line 402, in load_prepare_datasets
dataset = load_tokenized_prepared_datasets(
File "/path/to/axolotl/src/axolotl/utils/data.py", line 159, in load_tokenized_prepared_datasets
d_type_split = d_type.split(":")
TypeError: 'NoneType' object is not callable

Same thing happens with your example.

@winglian (Collaborator, Author) commented Aug 9, 2023

@enn-nafnlaus are you on the branch for this PR? From your stack trace, it seems you might not be. Line 159 (https://github.com/OpenAccess-AI-Collective/axolotl/blob/2d10911853fe6ecb21d3998a5ad3f940f2d9608a/src/axolotl/utils/data.py#L159) is different from your stack trace.

@enn-nafnlaus commented Aug 9, 2023

@enn-nafnlaus are you on the branch for this PR? From your stack trace, it seems you might not be. Line 159

https://github.com/OpenAccess-AI-Collective/axolotl/blob/2d10911853fe6ecb21d3998a5ad3f940f2d9608a/src/axolotl/utils/data.py#L159

is different from your stack trace.

Edit: Whoops, I wasn't. But I still get errors on yours.

[PID:871977] tokenizing, merging, and shuffling master dataset
Map (num_proc=12):   0%| | 0/124835 [00:00<?, ? examples/s]
Process ForkPoolWorker-1:
Traceback (most recent call last):
File "/path/to/.local/lib/python3.10/site-packages/multiprocess/process.py", line 314, in _bootstrap
self.run()
File "/path/to/.local/lib/python3.10/site-packages/multiprocess/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/path/to/.local/lib/python3.10/site-packages/multiprocess/pool.py", line 114, in worker
task = get()
File "/path/to/.local/lib/python3.10/site-packages/multiprocess/queues.py", line 370, in get
return _ForkingPickler.loads(res)
File "/path/to/.local/lib/python3.10/site-packages/dill/_dill.py", line 286, in loads
return load(file, ignore, **kwds)
File "/path/to/.local/lib/python3.10/site-packages/dill/_dill.py", line 272, in load
return Unpickler(file, ignore=ignore, **kwds).load()
File "/path/to/.local/lib/python3.10/site-packages/dill/_dill.py", line 419, in load
obj = StockUnpickler.load(self)
File "/path/to/.local/lib/python3.10/site-packages/addict/addict.py", line 34, in setitem
object.getattribute(self, '__frozen'))
AttributeError: 'DictDefault' object has no attribute '__frozen'

... same for every ForkPoolWorker, through 12.

]$ git branch
  main
* pr-348
  pr-353

@enn-nafnlaus commented Aug 9, 2023

Also, old-style prompter definitions don't seem to work anymore. This:

datasets:
  - path: /path/to/summarize_training_oasst.json
    type: oasst

... works fine on main. But when I switch to your branch:

Traceback (most recent call last):
File "/path/to/axolotl/scripts/finetune.py", line 444, in
fire.Fire(train)
File "/path/to/.local/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/path/to/.local/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/path/to/.local/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/path/to/axolotl/scripts/finetune.py", line 305, in train
train_dataset, eval_dataset = load_prepare_datasets(
File "/path/to/axolotl/src/axolotl/utils/data.py", line 423, in load_prepare_datasets
dataset = load_tokenized_prepared_datasets(
File "/path/to/axolotl/src/axolotl/utils/data.py", line 291, in load_tokenized_prepared_datasets
d_iter = iter(d)
File "/path/to/axolotl/src/axolotl/datasets.py", line 42, in iter
self.prompt_tokenizer.tokenize_prompt,
AttributeError: 'NoneType' object has no attribute 'tokenize_prompt'
Traceback (most recent call last):
File "/path/to/.local/bin/accelerate", line 8, in
sys.exit(main())
File "/path/to/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/path/to/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 979, in launch_command
simple_launcher(args)
File "/path/to/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'scripts/finetune.py', 'summarize.yaml']' returned non-zero exit status 1.

@enn-nafnlaus

@winglian Any followup on this? Really looking forward to using this :)

winglian marked this pull request as ready for review August 16, 2023 02:14
@winglian (Collaborator, Author)

@enn-nafnlaus should be working now:

  - path: teknium/GPT4-LLM-Cleaned
    type:
      system_prompt: "Below is a conversation between a user and a helpful assistant"
      field_instruction: instruction
      field_output: output
      format: |-
        User: {instruction}
        Assistant: 

@enn-nafnlaus commented Aug 16, 2023

@enn-nafnlaus should be working now:

  - path: teknium/GPT4-LLM-Cleaned
    type:
      system_prompt: "Below is a conversation between a user and a helpful assistant"
      field_instruction: instruction
      field_output: output
      format: |-
        User: {instruction}
        Assistant: 

I confirm that it runs now. :) A question, though: given the following config, meant to shut off all added prompt elements:

datasets:
  - path: summarize_training.json
    type:
      system_prompt: ""
      field_instruction: instruction
      field_output: output
      format: |-
        {instruction}

When I put the following debug statements into user_defined_strategies.py:

    print("============================")
    print([system_prompt])
    print([turn_format])
    print([turn_no_input_format])
    print([system_format])
    print("============================")

I get:

============================
['']
['{instruction}']
['{instruction} ']
['{system}']
============================

Should there be that extra space at the end of turn_no_input_format? There are no extra spaces in my YAML.

winglian merged commit d2e7f27 into main Aug 20, 2023
6 checks passed
winglian deleted the yml-prompter branch August 20, 2023 13:17
mkeoliya pushed a commit to mkeoliya/axolotl that referenced this pull request Dec 15, 2023
support user defined prompters, pretokenized datasets in config, local parquet, local arrow files (axolotl-ai-cloud#348)

* support user defined prompters, pretokenized datasets in config, local parquet, local arrow files

* fix user defined dataset types

* fix for system prompts

* fix tests

* fix checks for parquet and arrow

* aha moment that d.data_files isn't used

* add documentation for ds_type to add support for parquet and arrow