
Allow arbitrary implementations of pytorch datasets to be used/specified in the configuration file #780

Open
mbsabath opened this issue Jan 11, 2025 · 0 comments
Labels
type/feature An issue or pull request that introduces a new feature

🚀 The feature, motivation and pitch

I'm currently working within the Kempner Institute at Harvard, building an internal dataset management system (and associated library) that will allow researchers to find and use data with multiple LLM frameworks on an on-prem HPC system through a single API. As part of this package, we provide a way to create a PyTorch Dataset object for our data that can be used for training. We would like to maintain our own implementation of the dataset to avoid external dependencies on multiple frameworks, to give researchers room to develop their own frameworks, and to avoid keeping slightly different copies of the same data for every framework researchers want to use.

OLMo as currently written only supports IterableDataset as the input format, which in turn only works with the MemmapDataset objects that can be specified as lists of file paths. I propose adding a configuration option that lets users optionally specify a custom dataset class and its arguments. I also propose additional configuration options letting users specify how to map the outputs of that dataset onto the fields expected by OLMo's collation function (or to provide their own collation function).
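To make the field-mapping idea concrete, here's a minimal sketch. The class name, the `"tokens"` key, and the `map_fields` helper are all illustrative assumptions, not part of OLMo's API; the only real constraint used here is that a PyTorch map-style dataset just needs `__len__` and `__getitem__`, and that the collator expects examples keyed by names like `input_ids`.

```python
# Hypothetical sketch: a custom map-style dataset plus a field-mapping
# step that renames its outputs to the keys a collator expects.
# Names here ("ParquetTokenDataset", "tokens") are assumptions for
# illustration, not OLMo identifiers.

class ParquetTokenDataset:
    """Map-style dataset: PyTorch only requires __len__ and __getitem__."""

    def __init__(self, rows):
        # e.g. pre-tokenized sequences loaded by the data-management library
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        # Returns fields under this dataset's own (arbitrary) names.
        return {"tokens": self.rows[idx]}


def map_fields(example, field_map):
    """Rename dataset output keys to the names the collation function expects."""
    return {dst: example[src] for src, dst in field_map.items()}


# Usage: a config-driven mapping from this dataset's "tokens" field to
# the collator's "input_ids" field.
ds = ParquetTokenDataset([[1, 2, 3], [4, 5, 6]])
item = map_fields(ds[0], {"tokens": "input_ids"})  # → {"input_ids": [1, 2, 3]}
```

The configuration option would carry the dotted path to the dataset class, its constructor arguments, and a `field_map`-style dictionary like the one above.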

I have a PR implementing this functionality. I've tested it by running a 7B model using FSDP and confirmed that existing OLMo configurations run successfully without changes, while training proceeds as expected when using a custom dataset. I've also added logging statements to make clear that custom datasets may not support the deterministic training that OLMo's IterableDataset offers.

Alternatives

An alternative approach would be to add functionality for wrapping arbitrary dataset objects in an IterableDataset. However, I believe this would add unneeded complexity to the IterableDataset class, rather than letting external users and developers handle the complexity of adapting their data to OLMo.
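For comparison, a naive version of that alternative is sketched below under the assumption that the wrapped object is any map-style dataset. It deliberately ignores the sharding, shuffling, and checkpoint/restart bookkeeping that OLMo's real IterableDataset carries, which is exactly the complexity the wrapper would have to absorb.

```python
# Hypothetical sketch of the alternative: wrapping an arbitrary map-style
# dataset so it can be consumed as an iterable. Omits the sharding,
# determinism, and resume logic OLMo's actual IterableDataset provides.

class IterableWrapper:
    def __init__(self, dataset):
        # Accepts anything with __len__ and __getitem__.
        self.dataset = dataset

    def __iter__(self):
        for idx in range(len(self.dataset)):
            yield self.dataset[idx]


wrapped = IterableWrapper([[1, 2], [3, 4]])
items = list(wrapped)  # → [[1, 2], [3, 4]]
```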

Additional context

No response
