
Commit

update tutorials/pretraining-Vietnamese-data-curation according to Ryan Wolf's comments

Signed-off-by: hoangphu7122002 <hoangphu7122002ai@gmail.com>
hoangphu7122002 committed Oct 30, 2024
1 parent 9eb92eb commit f4c6a6f
Showing 1 changed file with 15 additions and 0 deletions.
@@ -16,6 +16,21 @@
"In this tutorial, we will use NeMo Curator to process high-quality [Vietnamese data](https://huggingface.co/datasets/VTSNLP/vietnamese_curated_dataset). We will guide you through the data curation pipeline used and share sample code for each stage."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of Contents\n",
"- **1. [Prerequisites and Environment setups](#prerequisites-and-environment-setups)**\n",
"- **2. [Data Collecting](#data-collecting)**\n",
"- **3. [Data Curation flow](#data-curation-flow)**\n",
" - a. [Unicode reformatting](#unicode-reformatting)\n",
" - b. [Adding Custom IDs to Documents](#adding-custom-ids-to-documents)\n",
" - c. [Exact deduplication](#exact-deduplication)\n",
" - d. [Heuristic Quality Filtering](#heuristic-quality-filtering)\n",
" - e. [Classifier-based Quality Filtering](#classifier-based-quality-filtering)"
]
},
{
"cell_type": "markdown",
"metadata": {},
