Commit 71aedab (1 parent: ad12e15)
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Showing 6 changed files with 154 additions and 7 deletions.

# NeMo Curator Python API examples

This directory contains multiple Python scripts with examples of how to use various NeMo Curator classes and functions.
The goal of these examples is to give you an overview of the many ways your text data can be curated.
These include:

- `blend_and_shuffle.py`: Combine multiple datasets into one, with different amounts of each dataset, then randomly permute the result.
- `classifier_filtering.py`: Train a fastText classifier, then use it to filter high- and low-quality data.
- `download_arxiv.py`: Download arXiv tar files and extract them.
- `download_common_crawl.py`: Download Common Crawl WARC snapshots and extract them.
- `download_wikipedia.py`: Download the latest Wikipedia dumps and extract them.
- `exact_deduplication.py`: Use the `ExactDuplicates` class to perform exact deduplication on text data (see the sketch after this list).
- `find_pii_and_deidentify.py`: Use the `PiiModifier` and `Modify` classes to remove personally identifiable information from text data.
- `fuzzy_deduplication.py`: Use the `FuzzyDuplicatesConfig` and `FuzzyDuplicates` classes to perform fuzzy deduplication on text data.
- `identify_languages_and_fix_unicode.py`: Use `FastTextLangId` to filter data by language, then fix the Unicode in it.
- `raw_download_common_crawl.py`: Download the raw compressed WARC files from Common Crawl without extracting them.
- `semdedup_example.py`: Use the `SemDedup` class to perform semantic deduplication on text data.
- `task_decontamination.py`: Remove segments of downstream evaluation tasks from a dataset.
- `translation_example.py`: Create and use an `IndicTranslation` model for language translation.
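
As a quick taste of the API, here is a minimal sketch of exact deduplication in Python. It is illustrative rather than authoritative (the file path and the `id`/`text` field names are assumptions); see `exact_deduplication.py` for the full example.

```python
import nemo_curator as nc
from nemo_curator.datasets import DocumentDataset

# Read a directory of JSONL files into a Dask-backed dataset.
# The "id" and "text" field names are assumptions; match them to your data.
dataset = DocumentDataset.read_json("/path/to/jsonl/files", backend="pandas")

# ExactDuplicates hashes every document and groups documents whose
# hashes collide, i.e. documents with identical text.
exact_duplicates = nc.ExactDuplicates(
    id_field="id",
    text_field="text",
    hash_method="md5",
)
duplicates = exact_duplicates(dataset)
```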

The `classifiers`, `k8s`, `nemo_run`, and `slurm` subdirectories contain even more examples of NeMo Curator's capabilities.

# Dask with Slurm

This directory provides an example Slurm script pipeline.
This pipeline has a script, `start-slurm.sh`, that exposes configuration options similar to those of `get_client`.
Every Slurm cluster is different, so make sure you understand how your Slurm cluster works so that the scripts can be easily adapted.
`start-slurm.sh` calls `container-entrypoint.sh`, which sets up a Dask scheduler and workers across the cluster.

Our Python examples are designed so that they can be run locally on their own or easily substituted into the `start-slurm.sh` script to run on multiple nodes.
You can adapt your own scripts in the same way by following the pattern of adding `get_client` with `add_distributed_args`, as in the sketch below.
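
A rough sketch of that pattern follows. Import paths and signatures can vary between NeMo Curator versions, so treat this as illustrative and defer to the example scripts themselves.

```python
import argparse

from nemo_curator.utils.distributed_utils import get_client
from nemo_curator.utils.script_utils import add_distributed_args


def main(args: argparse.Namespace) -> None:
    # get_client starts (or attaches to) a Dask cluster based on the
    # command-line arguments registered by add_distributed_args.
    client = get_client(args, args.device)

    # ... your curation pipeline goes here ...

    client.close()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Example curation script")
    parser = add_distributed_args(parser)
    main(parser.parse_args())
```

Because the script takes its cluster configuration from the command line, the same file runs unchanged on its own locally or inside `start-slurm.sh`.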

# NeMo Curator CLI Scripts

The following Python scripts are designed to be executed from the command line (terminal) only.

Here, we list all of the Python scripts and their terminal commands:

| Python Command                             | CLI Command                      |
|--------------------------------------------|----------------------------------|
| `python add_id.py`                         | `add_id`                         |
| `python blend_datasets.py`                 | `blend_datasets`                 |
| `python download_and_extract.py`           | `download_and_extract`           |
| `python filter_documents.py`               | `filter_documents`               |
| `python find_exact_duplicates.py`          | `gpu_exact_dups`                 |
| `python find_matching_ngrams.py`           | `find_matching_ngrams`           |
| `python find_pii_and_deidentify.py`        | `deidentify`                     |
| `python get_common_crawl_urls.py`          | `get_common_crawl_urls`          |
| `python get_wikipedia_urls.py`             | `get_wikipedia_urls`             |
| `python make_data_shards.py`               | `make_data_shards`               |
| `python prepare_fasttext_training_data.py` | `prepare_fasttext_training_data` |
| `python prepare_task_data.py`              | `prepare_task_data`              |
| `python remove_matching_ngrams.py`         | `remove_matching_ngrams`         |
| `python separate_by_metadata.py`           | `separate_by_metadata`           |
| `python text_cleaning.py`                  | `text_cleaning`                  |
| `python train_fasttext.py`                 | `train_fasttext`                 |
| `python verify_classification_results.py`  | `verify_classification_results`  |

For more information about the arguments needed for each script, you can use `add_id --help`, etc.

More scripts can be found in the `classifiers`, `fuzzy_deduplication`, and `semdedup` subdirectories.

## Text Classification

The Python scripts in this directory demonstrate how to run classification on your text data with each of these four classifiers:

- Domain Classifier
- Quality Classifier
- AEGIS Safety Models
- FineWeb Educational Content Classifier

For more information about these classifiers, please see NeMo Curator's [Distributed Data Classification documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/distributeddataclassification.html).

### Usage

#### Domain classifier inference

```bash
# same as `python domain_classifier_inference.py`
domain_classifier_inference \
    --input-data-dir /path/to/data/directory \
    --output-data-dir /path/to/output/directory \
    --input-file-type "jsonl" \
    --input-file-extension "jsonl" \
    --output-file-type "jsonl" \
    --input-text-field "text" \
    --batch-size 64 \
    --autocast \
    --max-chars 2000 \
    --device "gpu"
```

Additional arguments may be added for customizing a Dask cluster and client. Run `domain_classifier_inference --help` for more information.
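
If you prefer the Python API over the CLI, the call below is a minimal sketch of the same run. The `DomainClassifier` class and `DocumentDataset` I/O helpers live in `nemo_curator`, but the exact arguments shown here are assumptions; `domain_classifier_inference.py` is the authoritative reference.

```python
from nemo_curator.classifiers import DomainClassifier
from nemo_curator.datasets import DocumentDataset

# Read the input JSONL files onto the GPU (cuDF backend).
dataset = DocumentDataset.read_json("/path/to/data/directory", backend="cudf")

# Label each document's domain; batch_size mirrors the CLI flag above.
classifier = DomainClassifier(batch_size=64)
result = classifier(dataset)

# Write the labeled dataset back out as JSONL.
result.to_json("/path/to/output/directory")
```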

#### Quality classifier inference

```bash
# same as `python quality_classifier_inference.py`
quality_classifier_inference \
    --input-data-dir /path/to/data/directory \
    --output-data-dir /path/to/output/directory \
    --input-file-type "jsonl" \
    --input-file-extension "jsonl" \
    --output-file-type "jsonl" \
    --input-text-field "text" \
    --batch-size 64 \
    --autocast \
    --max-chars 2000 \
    --device "gpu"
```

Additional arguments may be added for customizing a Dask cluster and client. Run `quality_classifier_inference --help` for more information.

#### AEGIS classifier inference

```bash
# same as `python aegis_classifier_inference.py`
aegis_classifier_inference \
    --input-data-dir /path/to/data/directory \
    --output-data-dir /path/to/output/directory \
    --input-file-type "jsonl" \
    --input-file-extension "jsonl" \
    --output-file-type "jsonl" \
    --input-text-field "text" \
    --batch-size 64 \
    --max-chars 6000 \
    --device "gpu" \
    --aegis-variant "nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0" \
    --token "hf_1234"
```

- `--aegis-variant` can be `nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0`, `nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0`, or a path to your own PEFT of Llama Guard 2.
- `--token` is your Hugging Face token, which is used when downloading the base Llama Guard model.

Additional arguments may be added for customizing a Dask cluster and client. Run `aegis_classifier_inference --help` for more information.
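
The same run through the Python API might look roughly like this. The `AegisClassifier` class lives in `nemo_curator.classifiers`, but the keyword arguments below are inferred from the CLI flags above and should be checked against `aegis_classifier_inference.py`.

```python
from nemo_curator.classifiers import AegisClassifier
from nemo_curator.datasets import DocumentDataset

dataset = DocumentDataset.read_json("/path/to/data/directory", backend="cudf")

# The variant and token mirror the CLI flags above; "hf_1234" is a
# placeholder for your real Hugging Face token.
classifier = AegisClassifier(
    aegis_variant="nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0",
    token="hf_1234",
)
result = classifier(dataset)
result.to_json("/path/to/output/directory")
```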

#### FineWeb-Edu classifier inference

```bash
# same as `python fineweb_edu_classifier_inference.py`
fineweb_edu_classifier_inference \
    --input-data-dir /path/to/data/directory \
    --output-data-dir /path/to/output/directory \
    --input-file-type "jsonl" \
    --input-file-extension "jsonl" \
    --output-file-type "jsonl" \
    --input-text-field "text" \
    --batch-size 64 \
    --autocast \
    --max-chars 2000 \
    --device "gpu"
```

Additional arguments may be added for customizing a Dask cluster and client. Run `fineweb_edu_classifier_inference --help` for more information.