
support user defined prompters, pretokenized datasets in config, local parquet, local arrow files #348

Merged
winglian merged 7 commits into main from yml-prompter on Aug 20, 2023

Conversation

@winglian (Collaborator) commented Aug 7, 2023

For user-defined prompters:

datasets:
  - path: path/to/custom.jsonl
    type:
      system_prompt: "Below is a conversation between a user and a helpful assistant"
      field_instruction: question
      field_output: answer
      format: |-
        User: {instruction}
        Assistant:

The above defines a prompt strategy for a data file with question and answer features, mapping them to the instruction and output fields and rendering them according to the format. {output} is not needed in the format string, as it is assumed to be appended at the end.
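
For illustration, here is a minimal sketch (assumed behavior, not axolotl's actual code) of what this mapping produces for one record; the helper name and the single-space join before the output are assumptions:

def render_example(record, system_prompt, fmt,
                   field_instruction="question", field_output="answer"):
    """Render one JSONL record into a single training prompt string."""
    turn = fmt.format(instruction=record[field_instruction])
    # {output} is assumed to be appended after the formatted turn
    return f"{system_prompt}\n{turn} {record[field_output]}"

record = {"question": "What color is the sky?", "answer": "Blue."}
print(render_example(
    record,
    "Below is a conversation between a user and a helpful assistant",
    "User: {instruction}\nAssistant:",
))
# Below is a conversation between a user and a helpful assistant
# User: What color is the sky?
# Assistant: Blue.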

Support for pretokenized datasets in the config is automatic: it checks for the input_ids, labels, and attention_mask features on the dataset, which would only exist on a pretokenized dataset.
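
The detection could be approximated like this (a sketch of the idea, not the exact check in axolotl; the file path is a placeholder):

from datasets import load_dataset

ds = load_dataset("json", data_files="path/to/pretokenized.jsonl")["train"]
# treat the dataset as pretokenized if it already carries these columns
is_pretokenized = {"input_ids", "labels", "attention_mask"}.issubset(ds.column_names)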

Parquet and Arrow files are supported either by setting the ds_type option for a dataset, or automatically when a file under data_files has the suffix .arrow or .parquet.
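
A hypothetical config combining these options (the paths and the alpaca type are placeholders; ds_type could also be omitted and inferred from the .parquet suffix):

datasets:
  - path: path/to/local
    ds_type: parquet
    data_files:
      - path/to/local/train.parquet
    type: alpaca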

winglian marked this pull request as draft August 7, 2023 03:34
@enn-nafnlaus commented Aug 7, 2023

Can't wait for this one! :) To be clear: in your above example, it would prompt in the form:

f"Below is a conversation between a user and a helpful assistant{question}{answer}"

... right? Doesn't insert any other text in the mix?

@winglian (Collaborator, Author) commented Aug 7, 2023

Can't wait for this one! :) To be clear: in your above example, it would prompt in the form:

f"Below is a conversation between a user and a helpful assistant{question}{answer}"

... right? Doesn't insert any other text in the mix?

I updated my example to include the format key. You would want to simply have {instruction} as your format in that case.

@enn-nafnlaus

datasets:
  - path: path/to/custom.jsonl
    type:
      system_prompt: "Below is a conversation between a user and a helpful assistant"
      field_instruction: question
      field_output: answer
      format: |-
        User: {instruction}
        Assistant:

So with:

format: {instruction}

... then {instruction} would be the input, and {answer} would be used for training the output for that input? Because that's what I need :)

In your example, how does "question" fit into the picture?

@winglian (Collaborator, Author) commented Aug 7, 2023

@enn-nafnlaus question is the field name in your dataset for your instruction, e.g. {"question": "...", "answer": "..."}.
Or, if your dataset used instruction for the field name, it would simply be field_instruction: instruction.
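
Illustratively (example records, not taken from the PR):

# record: {"question": "...", "answer": "..."}
field_instruction: question
field_output: answer

# record: {"instruction": "...", "output": "..."}
field_instruction: instruction
field_output: output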

@enn-nafnlaus

@enn-nafnlaus question is the field name in your dataset for your instruction, e.g. {"question": "...", "answer": "..."}. Or, if your dataset used instruction for the field name, it would simply be field_instruction: instruction.

Okay, it's just that your example used all three: "question" was given as the instruction field, but then later you used "instruction" itself. Just wondering if that was an accident or not.

@enn-nafnlaus commented Aug 8, 2023

datasets:
  - path: /scratch/LLM_Training/summarize_training.json
    type:
      system_prompt: ""
      field_instruction: instruction
      field_output: output
      format: {instruction}

Yields:

Traceback (most recent call last):
File "/path/to/axolotl/scripts/finetune.py", line 444, in
fire.Fire(train)
File "/path/to/.local/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/path/to/.local/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/path/to/.local/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/path/to/axolotl/scripts/finetune.py", line 305, in train
train_dataset, eval_dataset = load_prepare_datasets(
File "/path/to/axolotl/src/axolotl/utils/data.py", line 402, in load_prepare_datasets
dataset = load_tokenized_prepared_datasets(
File "/path/to/axolotl/src/axolotl/utils/data.py", line 159, in load_tokenized_prepared_datasets
d_type_split = d_type.split(":")
TypeError: 'NoneType' object is not callable

Same thing happens with your example.

@winglian (Collaborator, Author) commented Aug 9, 2023

@enn-nafnlaus are you on the branch for this PR? From your stack trace, it seems you might not be. Line 159 (https://github.com/OpenAccess-AI-Collective/axolotl/blob/2d10911853fe6ecb21d3998a5ad3f940f2d9608a/src/axolotl/utils/data.py#L159) is different from your stack trace.

@enn-nafnlaus commented Aug 9, 2023

@enn-nafnlaus are you on the branch for this PR? From your stack trace, it seems you might not be. Line 159

https://github.com/OpenAccess-AI-Collective/axolotl/blob/2d10911853fe6ecb21d3998a5ad3f940f2d9608a/src/axolotl/utils/data.py#L159

is different from your stack trace.

Edit: Whoops, I wasn't. But I still get errors on yours.

[PID:871977] tokenizing, merging, and shuffling master dataset
Map (num_proc=12):   0%| | 0/124835 [00:00<?, ? examples/s]
Process ForkPoolWorker-1:
Traceback (most recent call last):
File "/path/to/.local/lib/python3.10/site-packages/multiprocess/process.py", line 314, in _bootstrap
self.run()
File "/path/to/.local/lib/python3.10/site-packages/multiprocess/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/path/to/.local/lib/python3.10/site-packages/multiprocess/pool.py", line 114, in worker
task = get()
File "/path/to/.local/lib/python3.10/site-packages/multiprocess/queues.py", line 370, in get
return _ForkingPickler.loads(res)
File "/path/to/.local/lib/python3.10/site-packages/dill/_dill.py", line 286, in loads
return load(file, ignore, **kwds)
File "/path/to/.local/lib/python3.10/site-packages/dill/_dill.py", line 272, in load
return Unpickler(file, ignore=ignore, **kwds).load()
File "/path/to/.local/lib/python3.10/site-packages/dill/_dill.py", line 419, in load
obj = StockUnpickler.load(self)
File "/path/to/.local/lib/python3.10/site-packages/addict/addict.py", line 34, in setitem
object.getattribute(self, '__frozen'))
AttributeError: 'DictDefault' object has no attribute '__frozen'

... same for every ForkPoolWorker, through 12.

]$ git branch
  main
* pr-348
  pr-353

@enn-nafnlaus commented Aug 9, 2023

Also, old-style prompter definitions don't seem to work anymore. This:

datasets:
  - path: /path/to/summarize_training_oasst.json
    type: oasst

... works fine on main. But when I switch to your branch:

Traceback (most recent call last):
File "/path/to/axolotl/scripts/finetune.py", line 444, in
fire.Fire(train)
File "/path/to/.local/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/path/to/.local/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/path/to/.local/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/path/to/axolotl/scripts/finetune.py", line 305, in train
train_dataset, eval_dataset = load_prepare_datasets(
File "/path/to/axolotl/src/axolotl/utils/data.py", line 423, in load_prepare_datasets
dataset = load_tokenized_prepared_datasets(
File "/path/to/axolotl/src/axolotl/utils/data.py", line 291, in load_tokenized_prepared_datasets
d_iter = iter(d)
File "/path/to/axolotl/src/axolotl/datasets.py", line 42, in iter
self.prompt_tokenizer.tokenize_prompt,
AttributeError: 'NoneType' object has no attribute 'tokenize_prompt'
Traceback (most recent call last):
File "/path/to/.local/bin/accelerate", line 8, in
sys.exit(main())
File "/path/to/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/path/to/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 979, in launch_command
simple_launcher(args)
File "/path/to/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'scripts/finetune.py', 'summarize.yaml']' returned non-zero exit status 1.

@enn-nafnlaus

@winglian Any followup on this? Really looking forward to using this :)

winglian marked this pull request as ready for review August 16, 2023 02:14
@winglian (Collaborator, Author)

@enn-nafnlaus should be working now:

  - path: teknium/GPT4-LLM-Cleaned
    type:
      system_prompt: "Below is a conversation between a user and a helpful assistant"
      field_instruction: instruction
      field_output: output
      format: |-
        User: {instruction}
        Assistant: 

@enn-nafnlaus commented Aug 16, 2023

@enn-nafnlaus should be working now:

  - path: teknium/GPT4-LLM-Cleaned
    type:
      system_prompt: "Below is a conversation between a user and a helpful assistant"
      field_instruction: instruction
      field_output: output
      format: |-
        User: {instruction}
        Assistant: 

I confirm that it runs now. :) A question, though: given the following config, meant to shut off all added prompt elements:

datasets:
  - path: summarize_training.json
    type:
      system_prompt: ""
      field_instruction: instruction
      field_output: output
      format: |-
        {instruction}

When I put the following debug statements into user_defined_strategies.py:

    print("============================")
    print([system_prompt])
    print([turn_format])
    print([turn_no_input_format])
    print([system_format])
    print("============================")

I get:

============================
['']
['{instruction}']
['{instruction} ']
['{system}']
============================

Should there be that extra space at the end of turn_no_input_format? There are no extra spaces in my YAML.

winglian merged commit d2e7f27 into main Aug 20, 2023
6 checks passed
winglian deleted the yml-prompter branch August 20, 2023 13:17
mkeoliya pushed a commit to mkeoliya/axolotl that referenced this pull request Dec 15, 2023
support user defined prompters, pretokenized datasets in config, local parquet, local arrow files (axolotl-ai-cloud#348)

* support user defined prompters, pretokenized datasets in config, local parquet, local arrow files

* fix user defined dataset types

* fix for system prompts

* fix tests

* fix checks for parquet and arrow

* aha moment that d.data_files isn't used

* add documentation for ds_type to add support for parquet and arrow