🚀 The feature, motivation and pitch
I'm currently working at the Kempner Institute at Harvard on an internal dataset management system (and associated library) that will let researchers find and use data with multiple LLM frameworks on an on-prem HPC system through a single API. As part of this package, we provide a way to create a PyTorch Dataset object for our data that can be used for training. We would like to maintain our own dataset implementation for three reasons: to avoid external dependencies on multiple frameworks, to leave researchers room to develop their own frameworks, and to avoid making slightly different copies of the same data for every framework researchers want to use.
OLMo as currently written supports only `IterableDataset` as the input format, and that class currently works only with `MemmapDataset` objects, which are specified as lists of file paths. I propose adding a configuration option that lets users optionally specify a custom dataset class and its arguments. I also propose additional configuration options that let users map the outputs of that dataset to the fields OLMo's collation function expects (or supply their own collation function).
I have already written a PR implementing this functionality. I've tested it by training a 7B model with FSDP and confirmed that existing OLMo configurations run successfully without changes, and that training proceeds as expected when using the custom dataset. I've added logging statements to make clear that custom datasets may not support the deterministic training that OLMo's `IterableDataset` offers.
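To make the proposal concrete, here is a minimal sketch of how a config-driven custom dataset and field mapping could work. It is not the code from the PR, and it is not OLMo's actual config schema: `CustomDatasetConfig`, `field_map`, `build_dataset`, and `make_collate_fn` are hypothetical names chosen for illustration, and `input_ids` is only assumed to be a field the collator expects.

```python
import importlib
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class CustomDatasetConfig:
    """Hypothetical config section; not OLMo's actual schema."""

    # Dotted import path of the dataset class, e.g. "my_pkg.data.MyDataset".
    name: str
    # Keyword arguments passed to the dataset constructor.
    args: Dict[str, Any] = field(default_factory=dict)
    # Field name the collator expects -> key found in the dataset's items.
    field_map: Dict[str, str] = field(default_factory=dict)


def resolve_class(dotted_path: str) -> type:
    """Import and return the class named by a dotted path."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def build_dataset(cfg: CustomDatasetConfig) -> Any:
    """Instantiate the user's dataset class with the configured arguments."""
    return resolve_class(cfg.name)(**cfg.args)


def remap_item(item: Dict[str, Any], field_map: Dict[str, str]) -> Dict[str, Any]:
    """Rename the dataset's keys to the names the collator expects."""
    return {target: item[source] for target, source in field_map.items()}


def make_collate_fn(
    field_map: Dict[str, str],
    base_collate: Callable[[List[Dict[str, Any]]], Any],
) -> Callable[[List[Dict[str, Any]]], Any]:
    """Wrap an existing collate function so it sees remapped items."""

    def collate(items: List[Dict[str, Any]]) -> Any:
        return base_collate([remap_item(item, field_map) for item in items])

    return collate
```

With PyTorch, the wrapped function would be passed as the `collate_fn` argument of `torch.utils.data.DataLoader`. Keeping the mapping in the config means the dataset class itself never needs to know which field names OLMo uses.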
Alternatives
An alternative approach would be to support wrapping arbitrary dataset objects in an `IterableDataset`, but I believe that would add unneeded complexity to the `IterableDataset` class. It seems better to let external users and developers handle the complexity of adapting their data to OLMo.
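For comparison, the wrapping alternative might look roughly like the sketch below. It is an illustration only and does not reproduce OLMo's `IterableDataset`: the `IterableWrapper` class and its `seed`, `epoch`, `rank`, and `world_size` parameters are hypothetical, and a real version would subclass `torch.utils.data.IterableDataset` and also handle dataloader workers, resumption, and uneven shard sizes, which is where the extra complexity comes from.

```python
import random
from typing import Any, Iterator, Sequence


class IterableWrapper:
    """Iterate over any map-style dataset in a seeded, rank-sharded order.

    Hypothetical sketch: works with any object that supports len() and
    integer indexing.
    """

    def __init__(
        self,
        dataset: Sequence[Any],
        seed: int = 0,
        epoch: int = 0,
        rank: int = 0,
        world_size: int = 1,
    ) -> None:
        self.dataset = dataset
        self.seed = seed
        self.epoch = epoch
        self.rank = rank
        self.world_size = world_size

    def __iter__(self) -> Iterator[Any]:
        indices = list(range(len(self.dataset)))
        # Same seed and epoch give the same order on every rank and every run.
        random.Random(self.seed + self.epoch).shuffle(indices)
        # Each rank takes every world_size-th index, starting at its own rank.
        for idx in indices[self.rank :: self.world_size]:
            yield self.dataset[idx]
```

Even in this reduced form the wrapper has to own shuffling and sharding, which a custom dataset passed in through configuration can leave to the user.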
Additional context
No response