Refactor dataset #348

pan-x-c · 2024-07-11T05:24:57Z

Add DJDataset as the base interface for all datasets in data-juicer.
Add RayDataset as a subclass of DJDataset.
Simplify RayExecutor.
Update Ray to version 2.31.0.
Fix the Ray-mode dataset loading method, as reported in "Unable to process data with ray executor on Kubernetes Ray Cluster: #345".
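
To illustrate the idea of a shared base interface, here is a minimal sketch. It is not the code from this PR: the `process` signature, the `Operator` type, the `ListDataset` stand-in, and `run_pipeline` are all assumptions made for illustration, and the stand-in uses a plain Python list instead of Ray so it runs without a cluster.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Sequence

Sample = Dict[str, object]
# Hypothetical operator type: a function that maps one sample to one sample.
Operator = Callable[[Sample], Sample]


class DJDataset(ABC):
    """Sketch of a base interface (the method name and signature are assumed)."""

    @abstractmethod
    def process(self, operators: Sequence[Operator]) -> "DJDataset":
        """Apply the operators in order and return the resulting dataset."""


class ListDataset(DJDataset):
    """Hypothetical in-memory backend, standing in for a subclass such as RayDataset."""

    def __init__(self, samples: List[Sample]) -> None:
        # Copy each sample so the caller's dicts are never mutated.
        self.samples = [dict(sample) for sample in samples]

    def process(self, operators: Sequence[Operator]) -> "ListDataset":
        samples = self.samples
        for op in operators:
            # Each operator receives a copy, so earlier stages stay untouched.
            samples = [op(dict(sample)) for sample in samples]
        return ListDataset(samples)


def run_pipeline(dataset: DJDataset, operators: Sequence[Operator]) -> DJDataset:
    """An executor written against the interface needs no backend-specific code."""
    return dataset.process(operators)
```

Under these assumptions, an executor only ever calls `dataset.process(...)`, which is one way a change like "Simplify RayExecutor" could follow from moving backend details into the dataset subclass.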

data_juicer/core/ray_data.py


* Refactor OP & Dataset (#336)
* modelscope-sora news (#323)
* News/modelscope sora (#327)
* modelscope-sora news
* remove empower
* debug for gpu rank for analyser (#329)
* debug for gpu rank for analyser
* spec_numprocs -> num_proc
* Add more unittest (#304)
* add unittest env with gpu
* fix unittest yml
* add environment for unittest
* update workflow trigger
* update install step
* fix install command
* update working dir
* update container
* update working dir
* change working directory
* change working directory
* change working directory
* change working directory
* change unittest
* use test tag
* finish tag support
* support run op with different executor
* fix pre-commit
* add hf mirror
* add hf mirror
* run all test in standalone mode by default
* ignore image face ratio
* update tags
* add ray testcase
* add ray test in workflow
* update ray unittest workflow
* delete old unittest
---------
Co-authored-by: root <panxuchen>
* Add source tag (#317)
* add source tag for some mapper op
* fix no attribute 'current_tag' when executing local tests
* move op process logic from executor to base op
* fix typo
* move export outside op
* init refactor
* update analyser
* fix format
* clean up
* bring back batch mapper
* Improve fault tolerance & Fix Ray executor
* fix wrapper
* fix batched filter
* Remove use_actor as it is not compatible with the refactored OP class, unless the dataset class is refactored
* make wrappers work with unittests
* Compatible with unit tests and works with ray
* fix unittest
* fix wrappers with ray, map, filter
* unify unittests
* wrap deduplicators
* Compatible with non-batched calls
* Class-level wrappers
  - compatible with dataset.filter
  - bring back nested wrappers
* Instance-level wrappers
* Refined instance-level wrappers
  - Remove incomplete dataset.filter wrappers
  - Simplify code
  - Stack wrappers
* fix use_cuda
* Refactor dataset (#348)
* refactor dataset
* update unittest with DJDataset
* fix unittest
* update ray data load
* add test
* ray read json
* update docker image version
* actor is no longer supported
* Regress filter's stats export logic
---------
Co-authored-by: BeachWang <1400012807@pku.edu.cn>
Co-authored-by: Xuchen Pan <32844285+pan-x-c@users.noreply.github.com>
Co-authored-by: chenhesen <hesen.chs@alibaba-inc.com>
Co-authored-by: garyzhang99 <garyzhang99@163.com>
* minor fix
* fix num_proc default None
---------
Co-authored-by: Ce Ge (戈策) <gece@foxmail.com>
Co-authored-by: BeachWang <1400012807@pku.edu.cn>
Co-authored-by: Xuchen Pan <32844285+pan-x-c@users.noreply.github.com>
Co-authored-by: chenhesen <hesen.chs@alibaba-inc.com>
Co-authored-by: garyzhang99 <garyzhang99@163.com>
Co-authored-by: null <3213204+drcege@users.noreply.github.com>

refactor dataset

75ad1af

pan-x-c temporarily deployed to Testing July 11, 2024 05:25 — with GitHub Actions Inactive

update unittest with DJDataset

8971934

drcege reviewed Jul 11, 2024

data_juicer/core/ray_data.py (outdated, resolved)

fix unittest

04ff789

pan-x-c temporarily deployed to Testing July 11, 2024 07:06 — with GitHub Actions Inactive

fix conflict

a2080e7

pan-x-c had a problem deploying to Testing July 11, 2024 07:21 — with GitHub Actions Failure

drcege requested review from HYLcool, zhijianma and yxdyc July 11, 2024 07:27

update ray data load

239c2c5

pan-x-c temporarily deployed to Testing July 11, 2024 08:14 — with GitHub Actions Inactive

pan-x-c mentioned this pull request Jul 11, 2024

Unable to process data with ray executor on Kubernetes Ray Cluster: #345 (Closed, 3 tasks)

add test

d486244

pan-x-c had a problem deploying to Testing July 12, 2024 03:02 — with GitHub Actions Failure

ray read json

7ed8c49

pan-x-c had a problem deploying to Testing July 12, 2024 08:23 — with GitHub Actions Failure

update docker image version

eeca6cf

pan-x-c temporarily deployed to Testing July 12, 2024 08:29 — with GitHub Actions Inactive

drcege merged commit bc33d7c into refactor/OP Jul 13, 2024
3 checks passed
