Dev/rebuild sandbox #332

Merged
merged 19 commits into from
Jul 17, 2024

@BeachWang BeachWang commented Jun 20, 2024

  1. Sandbox parameters and data processing parameters are now separated.
  2. The pipeline is executed automatically, following the order of the list in the configuration file.
  3. Job parameters are now parsed immediately before each job executes, and refined recipes are supported for use within the current round.
  4. Ksigma used to store all init parameters, so its output could not be used as the DJ parameter file. Now only the parameters in the configuration file with a low value are stored.
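
Items 2 and 3 above can be illustrated with a minimal sketch. It is not the code from this PR: every name here (`JOB_TYPES`, `resolve`, `run_pipeline`, the `from_state` key, and the two toy jobs) is hypothetical and chosen only to show the idea of running jobs in list order while resolving each job's parameters just before it runs, so that a recipe refined by an earlier job is visible to a later one in the same round.

```python
from typing import Any, Callable, Dict, List

# Hypothetical toy jobs; these are NOT Data-Juicer APIs.
def refine_recipe(cfg: Dict[str, Any], state: Dict[str, Any]) -> str:
    # Pretend to refine a recipe and publish it for later jobs.
    state["recipe"] = {"text_length_min": cfg["min_len"] * 2}
    return "refined"

def process_data(cfg: Dict[str, Any], state: Dict[str, Any]) -> str:
    # Uses whatever recipe was resolved for this job at run time.
    return f"processed with {cfg['recipe']}"

JOB_TYPES: Dict[str, Callable[[Dict[str, Any], Dict[str, Any]], str]] = {
    "refine_recipe": refine_recipe,
    "process_data": process_data,
}

def resolve(spec: Dict[str, Any], state: Dict[str, Any]) -> Dict[str, Any]:
    """Parse one job's parameters right before that job executes."""
    cfg = dict(spec.get("params", {}))
    # Late binding: pull in values produced by earlier jobs in this round.
    for key in spec.get("from_state", []):
        cfg[key] = state[key]
    return cfg

def run_pipeline(jobs: List[Dict[str, Any]]) -> List[str]:
    """Run jobs strictly in the order they appear in the list."""
    state: Dict[str, Any] = {}
    results = []
    for spec in jobs:
        cfg = resolve(spec, state)  # parsed here, not up front
        results.append(JOB_TYPES[spec["type"]](cfg, state))
    return results

# Stand-in for the job list that would come from a configuration file.
pipeline = [
    {"type": "refine_recipe", "params": {"min_len": 10}},
    {"type": "process_data", "from_state": ["recipe"]},
]
results = run_pipeline(pipeline)
```

If all parameters were parsed before the first job ran, the second job could not see the recipe produced by the first; deferring the parse to just before execution is what makes the refined recipe usable in the same round.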

Minor:

  1. The sub-configuration file parsing function prepare_side_configs in data_juicer/config/config.py may duplicate gece's parameter reconfiguration.
  2. The HPO part of the sandbox has not been modified and still needs to be checked for correctness.

@BeachWang BeachWang added the documentation and enhancement labels Jun 20, 2024
@BeachWang BeachWang self-assigned this Jun 20, 2024
@drcege drcege marked this pull request as draft June 27, 2024 03:33
drcege commented Jun 27, 2024

TODO:

  • More discussions on major refactoring
  • Align with the concepts in the paper
  • Correct minor errors (e.g., hooker -> hook, analyzer/analyser)

@drcege drcege requested a review from zhijianma June 27, 2024 03:44
Resolved review threads:

  • data_juicer/core/sandbox/hooks.py (outdated)
  • data_juicer/core/sandbox/hooks.py (outdated)
  • configs/demo/sandbox/sandbox.yaml
  • demos/process_sci_data/data/arxiv.jsonl

@yxdyc yxdyc left a comment

LGTM

@yxdyc yxdyc merged commit 0fdb97a into main Jul 17, 2024
3 checks passed