support user defined prompters, pretokenized datasets in config, local parquet, local arrow files #348
Can't wait for this one! :) To be clear: in your above example, it would prompt in the form: f"Below is a conversation between a user and a helpful assistant{question}{answer}" ... right? Doesn't insert any other text in the mix?
I updated my example to include the
So with: format:{instruction} ... then {instruction} would be the input, and {answer} would be used for training the output for that input? Because that's what I need :) In your example, how does "question" fit into the picture?
@enn-nafnlaus question is the field name in your dataset for your instruction. e.g.
Okay, it's just that your example used all three. "question" was mentioned as the instruction, but then later you use the instruction as "instruction". Was just wondering if that was an accident or what.
datasets:
Yields: Traceback (most recent call last):
Same thing happens with your example.
@enn-nafnlaus are you on the branch for this PR? From your stack trace, it seems you might not be. Line 159 (https://github.com/OpenAccess-AI-Collective/axolotl/blob/2d10911853fe6ecb21d3998a5ad3f940f2d9608a/src/axolotl/utils/data.py#L159) is different from the one in your stack trace.
Edit: Whoops, I wasn't. But I still get errors on yours.
[PID:871977] tokenizing, merging, and shuffling master dataset
... same for every ForkPoolWorker, through 12.
Also, old-style prompter definitions don't seem to work any more. This: datasets:
... works fine on main. But when I switch to your branch: Traceback (most recent call last):
@winglian Any followup on this? Really looking forward to using this :)
@enn-nafnlaus should be working now,
I confirm that it can run now. :) Question, though: given the following, to shut off all added prompt elements:
When I put the following debug statements into user_defined_strategies.py:
I get:
Should there be that extra space at the end of turn_no_input_format? There are no extra spaces in my yaml.
…l parquet, local arrow files (axolotl-ai-cloud#348)
* support user defined prompters, pretokenized datasets in config, local parquet, local arrow files
* fix user defined dataset types
* fix for system prompts
* fix tests
* fix checks for parquet and arrow
* aha moment that d.data_files isn't used
* add documentation for ds_type to add support for parquet and arrow
For user-defined prompters:
The above defines a prompt strategy for a data file with features question and answer, mapped to the instruction and output fields according to the format. `{output}` is not necessary in the format, as it is assumed to be appended at the end.

Support for pretokenized datasets in the config is automatic: it checks for the input_ids, labels, and attention mask features on the dataset, which would only exist on a pretokenized dataset.
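
The two behaviours described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `render_example`, `is_pretokenized`, and their parameters are hypothetical names, and the column name `attention_mask` is an assumption based on the usual tokenizer output.

```python
from typing import Any, Iterable, Mapping, Tuple

# Assumed column names for a pretokenized dataset (hypothetical constant).
PRETOKENIZED_FEATURES = ("input_ids", "labels", "attention_mask")


def render_example(
    row: Mapping[str, Any],
    fmt: str,
    field_instruction: str = "question",
    field_output: str = "answer",
) -> Tuple[str, str]:
    """Return (prompt, target) for one dataset row.

    The target is kept separate because the output is assumed to be
    appended after the formatted prompt, so any "{output}" placeholder
    in the format is simply dropped here.
    """
    prompt = fmt.replace("{output}", "").format(instruction=row[field_instruction])
    return prompt, row[field_output]


def is_pretokenized(feature_names: Iterable[str]) -> bool:
    """True if every column expected in a pretokenized dataset is present."""
    names = set(feature_names)
    return all(name in names for name in PRETOKENIZED_FEATURES)


prompt, target = render_example(
    {"question": "What is 2 + 2?", "answer": "4"},
    "User: {instruction}\nAssistant: ",
)
```

Here the row's `question` feature fills `{instruction}` and the `answer` feature becomes the training target, with no other text inserted.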
Parquet and Arrow files are supported either by setting the `ds_type` option for a dataset, or automatically when a file under `data_files` has the suffix `.arrow` or `.parquet`.
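
A minimal sketch of how the suffix-based detection might work, assuming an explicit `ds_type` takes precedence over the file suffix. The helper name `infer_ds_type` is hypothetical, and the returned strings `"parquet"` and `"arrow"` are assumed values, not confirmed by this PR.

```python
from pathlib import Path
from typing import Optional


def infer_ds_type(data_file: str, ds_type: Optional[str] = None) -> Optional[str]:
    """Pick a dataset type for one entry under data_files.

    An explicitly configured ds_type wins; otherwise the type is inferred
    from the file suffix. None means "no special handling detected".
    """
    if ds_type:
        return ds_type
    suffix = Path(data_file).suffix.lower()
    return {".parquet": "parquet", ".arrow": "arrow"}.get(suffix)
```

Under these assumptions, `infer_ds_type("data/train.parquet")` returns `"parquet"`, while a path with any other suffix returns `None` so the caller can fall back to its default loader.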