Add a config not to shuffle merged dataset #1394

Merged
7 commits merged on Mar 19, 2024

@seungduk-yanolja seungduk-yanolja commented Mar 11, 2024

Add a config not to shuffle merged dataset

Description

Added a config option named not_shuffle_merged_datasets, which I have been using in my fork for a long time :)

Motivation and Context

When training a model to expand its vocab with non-English tokens, I usually start with parallel corpora and then continue training on web-crawled data or another corpus suitable for pre-training.
In any case, it is better to give users the option not to shuffle the merged datasets, so that this ordering is preserved.
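
The ordering concern above can be illustrated with a minimal sketch. It uses plain Python lists rather than axolotl's actual data pipeline, and `merge_datasets` and its `shuffle_merged` parameter are hypothetical names invented for this illustration, not the project's API.

```python
import random
from typing import List, Sequence


def merge_datasets(
    datasets: Sequence[Sequence[str]],
    shuffle_merged: bool = True,
    seed: int = 42,
) -> List[str]:
    """Concatenate datasets in the order given, optionally shuffling the result.

    Hypothetical helper for illustration only; not part of axolotl.
    """
    merged = [example for dataset in datasets for example in dataset]
    if shuffle_merged:
        # A seeded RNG keeps the shuffled order reproducible across runs.
        random.Random(seed).shuffle(merged)
    return merged


parallel_corpus = ["parallel-0", "parallel-1", "parallel-2"]
web_crawl = ["web-0", "web-1", "web-2"]

# With shuffling disabled, every parallel example precedes every web example.
curriculum = merge_datasets([parallel_corpus, web_crawl], shuffle_merged=False)

# With shuffling enabled, the two sources are interleaved at random.
mixed = merge_datasets([parallel_corpus, web_crawl], shuffle_merged=True)
```

With shuffling disabled, the model sees the first dataset in full before the second, which is the staged ordering described above.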

How has this been tested?

This config has been used in my fork for a long time, and I verified that it works by checking the loss graph.

Screenshots (if appropriate)

N/A

Types of changes

New feature (non-breaking change which adds functionality)

seungduk-yanolja commented Mar 11, 2024

Resolves #1393

if cfg.not_shuffle_merged_datasets:
    LOG.info("NOT shuffling merged pretraining datasets")
else:
    dataset = dataset.shuffle(seed=seed, buffer_size=buffer_size)
Contributor Author

I am unsure whether the pre-training dataset (it is a single dataset) is intended to be shuffled within the buffer size.
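
For readers unfamiliar with what shuffling "within the buffer size" means, here is a minimal sketch of buffer-based shuffling over a stream. It assumes `dataset` is a streamed dataset shuffled through a fixed-size buffer, as the `buffer_size` argument suggests. `buffered_shuffle` is a hypothetical stand-in written for this illustration, not the library's implementation.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(
    stream: Iterable[T], buffer_size: int, seed: int = 42
) -> Iterator[T]:
    """Approximately shuffle a stream using a fixed-size buffer.

    Hypothetical illustration: each emitted item is drawn at random from
    the buffer, and the freed slot is refilled with the next stream item.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and replace it.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Stream exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(100), buffer_size=10))
unchanged = list(buffered_shuffle(range(5), buffer_size=1))
```

In this sketch an item can only move within a window bounded by the buffer, so a small buffer leaves the coarse order of the stream largely intact, and a buffer of size 1 leaves it unchanged.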

Resolved review threads (outdated):
README.md
src/axolotl/utils/config/models/input/v0_4_1/__init__.py
src/axolotl/utils/data.py (3 threads)
@NanoCode012
Collaborator

I am a bit curious. This only disables shuffling when merging datasets. Is it still intended to shuffle within each dataset?
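
The distinction raised in this question can be made concrete with a small sketch. It does not claim to describe what axolotl does, since that is the open question here. `shuffle_within_then_concat` is a hypothetical helper that shuffles inside each dataset while keeping the datasets themselves in their original order.

```python
import random
from typing import List, Sequence


def shuffle_within_then_concat(
    datasets: Sequence[Sequence[str]], seed: int = 42
) -> List[str]:
    """Shuffle each dataset independently, then concatenate in the given order.

    Hypothetical illustration only; not part of axolotl.
    """
    rng = random.Random(seed)
    merged: List[str] = []
    for dataset in datasets:
        block = list(dataset)  # Copy so the caller's data is not mutated.
        rng.shuffle(block)     # Order changes only inside this dataset.
        merged.extend(block)   # Dataset boundaries stay where they were.
    return merged


first = ["a-0", "a-1", "a-2"]
second = ["b-0", "b-1", "b-2"]
result = shuffle_within_then_concat([first, second])
```

Here every example from the first dataset still precedes every example from the second, even though the order inside each one is randomized.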

@seungduk-yanolja
Contributor Author

Does this PR require some more changes to be merged? PTAL

@NanoCode012 NanoCode012 merged commit 43bdc5d into axolotl-ai-cloud:main Mar 19, 2024
7 checks passed
seungduk-yanolja added a commit to Y-IAB/axolotl that referenced this pull request Mar 19, 2024
…kip ci]

* Add a config not to shuffle merged dataset

* Update README.md

* Update src/axolotl/utils/config/models/input/v0_4_1/__init__.py

Co-authored-by: Wing Lian <wing.lian@gmail.com>

* invert the condition name

* update README

* info -> debug

---------

Co-authored-by: Wing Lian <wing.lian@gmail.com>