[DeepSpeed-Chat] Fix OOM issue in dataloader #841

Open

wants to merge 1 commit into master

Conversation

youkaichao commented Jan 1, 2024

Currently, DeepSpeed-Chat saves the tokenized tensors directly to disk, which consumes hundreds of GB of storage. Each string is converted into input_ids and attention_mask tensors of length max_seq_len, stored as int32 or int64.

At roughly 2–3 characters per token, each token's worth of raw text (a few bytes) ends up taking hundreds of bytes on disk, since every sample is padded to max_seq_len and both input_ids and attention_mask are stored as 4- or 8-byte integers. So when the prompt dataset grows larger (say 1GB), the on-disk dataset can reach hundreds of GB.
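
As a rough back-of-the-envelope check (a sketch; the average prompt length, int64 dtype, and two-tensors-per-sample figures below are assumptions, not measurements from DeepSpeed-Chat):

```python
# Illustrative estimate of the on-disk blowup when padded tensors are saved per sample.
max_seq_len = 512        # padding length
bytes_per_elem = 8       # int64 elements (4 for int32)
tensors_per_sample = 2   # input_ids + attention_mask
avg_prompt_chars = 200   # assumed average prompt length in characters

bytes_per_sample = max_seq_len * bytes_per_elem * tensors_per_sample  # 8192 bytes
blowup = bytes_per_sample / avg_prompt_chars                          # ~41x
print(f"{bytes_per_sample} bytes per sample, ~{blowup:.0f}x the raw text size")
# The factor grows linearly with max_seq_len and shrinks for longer prompts,
# so a 1GB prompt dataset can easily expand to tens or hundreds of GB on disk.
```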

What's worse, DeepSpeed-Chat then loads all of this data into memory, which can require hundreds of GB of RAM.

In my personal experience, a 1.1GB prompt dataset hits OOM on a 512GB machine, even with max_seq_len set to 512. Using 2048 as max_seq_len would need roughly four times as much memory, i.e. about 2TB :(

This PR saves only the strings and tokenizes them on the fly. The saved data is about the same size as the input dataset.
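
For reference, the shape of the change is roughly the following (a minimal sketch assuming a HuggingFace-style tokenizer; the class and argument names are illustrative, not the actual DeepSpeed-Chat code):

```python
from torch.utils.data import Dataset


class OnTheFlyPromptDataset(Dataset):
    """Keeps raw prompt strings and tokenizes per sample in __getitem__,
    instead of pre-tokenizing everything and saving padded tensors to disk."""

    def __init__(self, prompts, tokenizer, max_seq_len):
        self.prompts = prompts          # list of strings, about the size of the input text
        self.tokenizer = tokenizer
        self.max_seq_len = max_seq_len

    def __len__(self):
        return len(self.prompts)

    def __getitem__(self, idx):
        enc = self.tokenizer(
            self.prompts[idx],
            max_length=self.max_seq_len,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        return {
            "input_ids": enc["input_ids"].squeeze(0),
            "attention_mask": enc["attention_mask"].squeeze(0),
        }
```

With this, the padded tensors only exist for the samples currently being batched, so disk usage and peak host memory track the raw text size rather than max_seq_len times the number of samples.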

youkaichao (Author) commented

@microsoft-github-policy-service agree

youkaichao (Author) commented

Hi team, any feedback on this? 👀

loadams (Contributor) commented Jan 24, 2025

> Hi team, any feedback on this? 👀

Hi @youkaichao - sorry we didn't get to this until now. Would you want to fix the merge conflicts?

loadams self-assigned this Jan 24, 2025