Add a config not to shuffle merged dataset #1394

seungduk-yanolja · 2024-03-11T17:09:23Z

Add a config not to shuffle merged dataset

Description

Added a config option named not_shuffle_merged_datasets, which I have been using in my fork for a long time :)

Motivation and Context

When training a model to expand its vocabulary with non-English tokens, I usually start with parallel corpora and then train on web-crawled data or something else suitable for pre-training.
Shuffling the merged datasets destroys that ordering, so it is better to give the user an option not to shuffle them.
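
To illustrate the motivation, here is a minimal sketch in plain Python, not axolotl's actual code. The names merge_datasets and shuffle_merged are hypothetical and exist only for this example. It shows that skipping the shuffle keeps the "parallel corpora first, web-crawled data second" order intact after merging.

```python
import random
from typing import List, Sequence


def merge_datasets(
    datasets: Sequence[List[str]],
    shuffle_merged: bool = True,  # hypothetical flag, for illustration only
    seed: int = 42,
) -> List[str]:
    """Concatenate datasets in the order given, optionally shuffling the result."""
    merged = [row for dataset in datasets for row in dataset]
    if shuffle_merged:
        # A seeded RNG keeps the shuffle reproducible across runs.
        random.Random(seed).shuffle(merged)
    return merged


parallel = ["parallel-0", "parallel-1", "parallel-2"]
web = ["web-0", "web-1", "web-2"]

# Without shuffling, every parallel-corpus row precedes every web-crawled row.
ordered = merge_datasets([parallel, web], shuffle_merged=False)

# With shuffling, the same rows are present but the curriculum order is lost.
mixed = merge_datasets([parallel, web], shuffle_merged=True)
```

The unshuffled result is simply the concatenation of the inputs, which is the behavior the new option is meant to make available.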

How has this been tested?

This config has been used in my fork for a long time and I verified that it works by seeing the loss graph.

Screenshots (if appropriate)

N/A

Types of changes

New feature (non-breaking change which adds functionality)

seungduk-yanolja · 2024-03-11T17:09:40Z

Resolves #1393

seungduk-yanolja · 2024-03-11T17:28:52Z

src/axolotl/utils/data.py

+    if cfg.not_shuffle_merged_datasets:
+        LOG.info("NOT shuffling merged pretraining datasets")
+    else:
+        dataset = dataset.shuffle(seed=seed, buffer_size=buffer_size)


I am unsure whether it is intended to shuffle the pre-training dataset (it is a single dataset) within the buffer size.
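
For context on the question above, here is a rough sketch of what a buffer-based shuffle does to a stream. It assumes the dataset in the diff is a streaming dataset whose shuffle(seed=..., buffer_size=...) draws samples from a fixed-size buffer. The functions buffered_shuffle and maybe_shuffle are hypothetical stand-ins written in plain Python, not axolotl or library code.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(stream: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    """Approximate shuffle: each output is drawn at random from a sliding buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered item and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # The stream is exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


def maybe_shuffle(
    stream: Iterable[T],
    not_shuffle: bool,  # mirrors the sense of the flag in the diff above
    buffer_size: int = 4,
    seed: int = 42,
) -> Iterator[T]:
    """Return the stream untouched when shuffling is disabled."""
    if not_shuffle:
        return iter(stream)
    return buffered_shuffle(stream, buffer_size, seed)
```

In this sketch a sample can only move forward by fewer than buffer_size positions, so even a single dataset is reordered locally when shuffling is left on, and its original order is preserved exactly when it is turned off.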

README.md

src/axolotl/utils/config/models/input/v0_4_1/__init__.py

src/axolotl/utils/data.py

Co-authored-by: Wing Lian <wing.lian@gmail.com>

NanoCode012 · 2024-03-13T09:39:00Z

I am a bit curious. This only disables shuffling when merging datasets. Is it still intended to shuffle within each dataset?

seungduk-yanolja · 2024-03-18T17:49:04Z

Does this PR require some more changes to be merged? PTAL

Approved via chat

…kip ci]
* Add a config not to shuffle merged dataset
* Update README.md
* Update src/axolotl/utils/config/models/input/v0_4_1/__init__.py
  Co-authored-by: Wing Lian <wing.lian@gmail.com>
* invert the condition name
* update README
* info -> debug
---------
Co-authored-by: Wing Lian <wing.lian@gmail.com>

Add a config not to shuffle merged dataset

53b0bf7

Update README.md

10d19d2


winglian previously requested changes Mar 12, 2024


seungduk-yanolja and others added 4 commits March 13, 2024 11:15

Update src/axolotl/utils/config/models/input/v0_4_1/__init__.py

4e05beb

Co-authored-by: Wing Lian <wing.lian@gmail.com>

invert the condition name

59ccd21

update README

75579fb

info -> debug

abfc7dd

Merge branch 'OpenAccess-AI-Collective:main' into not_shuffle

203a358

NanoCode012 approved these changes Mar 19, 2024


NanoCode012 merged commit 43bdc5d into axolotl-ai-cloud:main Mar 19, 2024
7 checks passed
