Commit
Update on "[BE][6/n] replace large c4_mini datasets by c4_test with the first 2K entries"

`c4_mini` is 100 MB and makes cloning the repo slow. Since we already have the original dataset `c4`, let's remove the redundancy and keep only a minimal dataset for testing (even offline). For loss convergence testing, we can use the full `c4`.

`c4_test` (2K entries, <5 MB) is now placed under `test/assets`, together with the test tokenizer. It can cover the first 10 iterations of the debug model without repetition.

After this PR lands, we should rewrite the git history to remove `c4_mini` entirely, to avoid repo clone overhead.

[ghstack-poisoned]
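For illustration only, a minimal sketch of how a "first 2K entries" subset could be produced from any iterable of records. The helper name `write_first_n`, the JSON Lines output format, and the output path are assumptions made for this example; the commit does not state how `c4_test` was actually generated or stored.

```python
import itertools
import json
from pathlib import Path
from typing import Iterable, Mapping


def write_first_n(records: Iterable[Mapping], out_path: Path, n: int = 2000) -> int:
    """Write the first n records to out_path as JSON Lines.

    Hypothetical helper: the record schema and file layout are assumptions,
    not the repository's actual tooling.
    Returns the number of records written (fewer than n if the input is short).
    """
    out_path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with out_path.open("w", encoding="utf-8") as f:
        # islice stops after n items, so a streamed dataset is never fully read.
        for record in itertools.islice(records, n):
            f.write(json.dumps(dict(record), ensure_ascii=False) + "\n")
            count += 1
    return count
```

Because the helper only consumes the first `n` items, it can be pointed at a streamed source of the full `c4` without downloading the whole dataset first.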