
Allow arbitrary implementations of pytorch datasets to be used/specified in the configuration file #780

Open
mbsabath opened this issue Jan 11, 2025 · 0 comments
Labels
type/feature An issue or pull request that introduces a new feature

🚀 The feature, motivation and pitch

I'm currently working within the Kempner Institute at Harvard, building an internal dataset management system (and associated library) that will allow researchers to find and use data with multiple LLM frameworks on an on-prem HPC system through a single API. As part of this package, we provide a way to create a PyTorch Dataset object for our data that can be used for training. We would like to maintain our own implementation of the dataset to avoid external dependencies on multiple frameworks, to give researchers room to develop their own frameworks, and to avoid keeping slightly different copies of the same data for every framework researchers want to use.

OLMo as currently written only supports IterableDataset as the input format, which in turn only works with the MemmapDataset objects that can be specified as lists of file paths. I propose adding a configuration option that lets users optionally specify a custom dataset class and its arguments. I also propose additional configuration options letting users specify how to map the outputs of that dataset onto the fields expected by OLMo's collation function (or to provide their own collation function).
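To make the field-mapping idea concrete, here's a minimal sketch. The class name, the `"tokens"` key, and the `map_fields` helper are all illustrative assumptions, not part of OLMo's API; the only real constraint used here is that a PyTorch map-style dataset just needs `__len__` and `__getitem__`, and that the collator expects examples keyed by names like `input_ids`.

```python
# Hypothetical sketch: a custom map-style dataset plus a field-mapping
# step that renames its outputs to the keys a collator expects.
# Names here ("ParquetTokenDataset", "tokens") are assumptions for
# illustration, not OLMo identifiers.

class ParquetTokenDataset:
    """Map-style dataset: PyTorch only requires __len__ and __getitem__."""

    def __init__(self, rows):
        # e.g. pre-tokenized sequences loaded by the data-management library
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        # Returns fields under this dataset's own (arbitrary) names.
        return {"tokens": self.rows[idx]}


def map_fields(example, field_map):
    """Rename dataset output keys to the names the collation function expects."""
    return {dst: example[src] for src, dst in field_map.items()}


# Usage: a config-driven mapping from this dataset's "tokens" field to
# the collator's "input_ids" field.
ds = ParquetTokenDataset([[1, 2, 3], [4, 5, 6]])
item = map_fields(ds[0], {"tokens": "input_ids"})  # → {"input_ids": [1, 2, 3]}
```

The configuration option would carry the dotted path to the dataset class, its constructor arguments, and a `field_map`-style dictionary like the one above.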

I have a PR implementing this functionality. I've tested it by running a 7B model using FSDP and confirmed that existing OLMo configurations run successfully without changes, while training proceeds as expected when using a custom dataset. I've also added logging statements to make clear that custom datasets may not support the deterministic training that OLMo's IterableDataset offers.

Alternatives

An alternative approach would be to add functionality for wrapping arbitrary dataset objects in an IterableDataset. However, I believe this would add unneeded complexity to the IterableDataset class, rather than letting external users and developers handle the complexity of adapting their data to OLMo.
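For comparison, a naive version of that alternative is sketched below under the assumption that the wrapped object is any map-style dataset. It deliberately ignores the sharding, shuffling, and checkpoint/restart bookkeeping that OLMo's real IterableDataset carries, which is exactly the complexity the wrapper would have to absorb.

```python
# Hypothetical sketch of the alternative: wrapping an arbitrary map-style
# dataset so it can be consumed as an iterable. Omits the sharding,
# determinism, and resume logic OLMo's actual IterableDataset provides.

class IterableWrapper:
    def __init__(self, dataset):
        # Accepts anything with __len__ and __getitem__.
        self.dataset = dataset

    def __iter__(self):
        for idx in range(len(self.dataset)):
            yield self.dataset[idx]


wrapped = IterableWrapper([[1, 2], [3, 4]])
items = list(wrapped)  # → [[1, 2], [3, 4]]
```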

Additional context

No response
