[Feature] add auto mode for analyzer #512

Merged
merged 7 commits into main from feat/auto_analyze
Dec 20, 2024

@HYLcool HYLcool commented Dec 13, 2024

dj-analyze --auto --dataset_path xx.jsonl [--auto_num 1000]
  • add an auto mode for the Analyzer, so users can:
    • analyze a small part of the input dataset quickly, without writing a specific recipe
    • run the analysis with all Filters that produce stats
    • change the number of samples analyzed in auto mode with the --auto_num argument
  • --auto and --config are mutually exclusive: users either provide a specific recipe or analyze in auto mode.
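
As an illustration of the CLI contract described above, here is a minimal sketch using Python's standard `argparse`. It is not Data-Juicer's actual parser. Only the flag names come from this PR description. The default of 1000 for `--auto_num`, the requirement that one of the two modes be given, and the `head_samples` helper (which assumes auto mode simply takes the first N samples) are assumptions made for the example.

```python
import argparse
import itertools
import json


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: only the flag names are taken from the PR description.
    parser = argparse.ArgumentParser(prog="dj-analyze")
    # --auto and --config cannot be combined; argparse rejects the combination.
    # required=True is an assumption: exactly one mode must be chosen.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--config", help="path to a specific recipe")
    mode.add_argument("--auto", action="store_true", help="analyze in auto mode")
    parser.add_argument("--dataset_path", help="input dataset, e.g. a .jsonl file")
    # The default value 1000 is an assumption based on the usage line.
    parser.add_argument("--auto_num", type=int, default=1000,
                        help="number of samples to analyze in auto mode")
    return parser


def head_samples(lines, num):
    """Parse at most `num` non-empty JSON lines (hypothetical helper).

    Assumes the "small part" of the dataset is just its first `num` samples.
    """
    non_empty = (line for line in lines if line.strip())
    return [json.loads(line) for line in itertools.islice(non_empty, num)]
```

With this sketch, `build_parser().parse_args(["--auto", "--config", "recipe.yaml"])` exits with a usage error, which mirrors the mutual exclusivity stated in the description.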

Others:

  • support drawing word cloud images for string stats, in addition to histograms.
  • set a default mem_required for all model-based OPs.
  • limit the wandb version to "<=0.19.0", because the latest release (0.19.1) could cause a query exception.
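
To make the first item concrete, the following is a rough sketch of how a plot type might be chosen per stat: histograms for numeric values, word clouds for string values. It is not the Analyzer's code. `plot_kind` and `word_frequencies` are hypothetical names, and whitespace tokenization is an assumption.

```python
from collections import Counter
from numbers import Number


def plot_kind(values):
    """Return 'histogram' for numeric stats and 'wordcloud' for string stats."""
    # bool is a subclass of int, so it is excluded explicitly.
    if all(isinstance(v, Number) and not isinstance(v, bool) for v in values):
        return "histogram"
    if all(isinstance(v, str) for v in values):
        return "wordcloud"
    raise TypeError("mixed or unsupported stat values")


def word_frequencies(values):
    """Count whitespace-separated tokens, a typical input for a word cloud renderer."""
    return Counter(token for v in values for token in v.split())
```

A frequency table like the one returned by `word_frequencies` is what a word cloud library would typically consume. The rendering step is left out here.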

@HYLcool HYLcool added the enhancement (New feature or request) and dj:core (issues/PRs about the core functions of Data-Juicer) labels Dec 13, 2024
@HYLcool HYLcool requested review from BeachWang and yxdyc December 13, 2024 09:06
@HYLcool HYLcool self-assigned this Dec 13, 2024
@yxdyc yxdyc left a comment

LGTM

README.md (resolved)
@HYLcool HYLcool merged commit 2fdf484 into main Dec 20, 2024
3 checks passed
@HYLcool HYLcool deleted the feat/auto_analyze branch December 20, 2024 03:03