Add configuration option to allow users to specify a custom Dataset class #781

mbsabath · 2025-01-11T14:27:55Z

Solution to #780.

Adds an optional field to the data configuration that lets users specify an arbitrary class as the underlying dataset, which is then wrapped in the OLMo IterableDataset. Users can also specify their own collation function (assuming it outputs fields that OLMo proper can work with), or specify mappings from fields in their dataset to the fields expected by OLMo's collation function.
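
To illustrate the two mechanisms described above (resolving a user-specified class and mapping dataset fields onto the fields the collator expects), here is a minimal sketch. It is not the code in this PR: `load_class` and `remap_fields` are hypothetical helper names, the dotted-path format for naming the class is an assumption, and `input_ids` as the target field name is also an assumption used only for illustration.

```python
import importlib
from typing import Any, Dict


def load_class(dotted_path: str) -> type:
    """Resolve a 'package.module.ClassName' string to the class object.

    Hypothetical helper: the dotted-path format is an assumption about how
    a custom dataset class might be named in a config file.
    """
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def remap_fields(item: Dict[str, Any], mapping: Dict[str, str]) -> Dict[str, Any]:
    """Rename the keys of one dataset item according to `mapping`.

    `mapping` goes from the custom dataset's field names to the names the
    collation function expects; keys not listed pass through unchanged.
    """
    return {mapping.get(key, key): value for key, value in item.items()}


# Usage: a custom dataset that yields {"tokens": ...} items, renamed so a
# collator expecting "input_ids" (assumed name) can consume them.
item = {"tokens": [1, 2, 3], "doc_id": 7}
renamed = remap_fields(item, {"tokens": "input_ids"})
```

Doing the renaming per item keeps the custom dataset class itself unaware of the trainer's field names, which is one reason a mapping can be preferable to requiring a custom collation function.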

Unit tests have been added. Integration tests confirm that no existing OLMo configurations need to change, and that both standard OLMo dataset training runs and custom dataset runs complete and restart successfully from checkpoints. For example usage, see the example config included in this PR.

Edit: Updated the PR to wrap the custom dataset in an IterableDataset, and updated the PR description to reflect this.

mbsabath added 4 commits January 10, 2025 15:46

custom dataset implementation

e3ae7f9

lint and add unit tests

4306270

use IterableDataset as wrapper on custom dataset

cdd9f70

return assert statements to train.py

8258c03
