Add configuration option to allow users to specify a custom Dataset class #781

Open
wants to merge 4 commits into
base: main
from

@mbsabath mbsabath commented Jan 11, 2025

Solution to #780.

Adds an optional field to the data configuration that lets users specify an arbitrary class to use as their underlying dataset, which is then wrapped in the OLMo IterableDataset. Users can also specify their own collation function (assuming its output contains the fields that OLMo proper can work with), or specify mappings from fields in their dataset to the fields expected by OLMo's collation function.
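
For readers unfamiliar with this pattern, here is a minimal sketch of how a config-specified dataset class and a field mapping could be resolved at runtime. It is not the code in this PR: the helper names (`load_class`, `remap_fields`, `iterate_custom_dataset`), the dotted-path convention, and the example field names are all assumptions made for illustration, and the sketch uses only the Python standard library.

```python
import importlib
from typing import Any, Dict, Iterator, Mapping, Optional


def load_class(dotted_path: str) -> type:
    """Resolve a 'package.module.ClassName' string to the class object (hypothetical helper)."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def remap_fields(item: Mapping[str, Any], mapping: Mapping[str, str]) -> Dict[str, Any]:
    """Rename the keys of one dataset item; keys without a mapping pass through unchanged."""
    return {mapping.get(key, key): value for key, value in item.items()}


def iterate_custom_dataset(
    dotted_path: str,
    init_kwargs: Optional[Mapping[str, Any]] = None,
    field_mapping: Optional[Mapping[str, str]] = None,
) -> Iterator[Dict[str, Any]]:
    """Instantiate the user-specified class and yield its items with renamed fields.

    Assumes the class is iterable and yields dict-like items.
    """
    dataset = load_class(dotted_path)(**(init_kwargs or {}))
    for item in dataset:
        yield remap_fields(item, field_mapping or {})
```

With a hypothetical mapping such as `{"tokens": "input_ids"}`, an item `{"tokens": [1, 2]}` from the custom dataset would be passed on as `{"input_ids": [1, 2]}`, so the downstream collation function sees the field name it expects.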

Unit tests have been added. Integration tests confirm that no existing OLMo configuration needs to change, and that both standard OLMo dataset training runs and custom dataset runs complete and restart successfully from checkpoints. For example usage, please see the example config included in this PR.

Edit: Updated the PR to wrap the custom dataset in an IterableDataset, and updated the PR description to reflect this.
