diff --git a/chapters/de/chapter1/3.mdx b/chapters/de/chapter1/3.mdx
index c58ba5d64..7db9cb48a 100644
--- a/chapters/de/chapter1/3.mdx
+++ b/chapters/de/chapter1/3.mdx
@@ -68,7 +68,7 @@ Wenn du einen Text an eine Pipeline übergibst, gibt es drei wichtige Schritte:
3. Die Vorhersagen des Modells werden so nachverarbeitet, sodass du sie nutzen kannst.
-Einige der derzeit [verfügbaren Pipelines](https://huggingface.co/transformers/main_classes/pipelines.html) sind:
+Einige der derzeit [verfügbaren Pipelines](https://huggingface.co/transformers/main_classes/pipelines) sind:
- `feature-extraction` (Vektordarstellung eines Textes erhalten)
- `fill-mask`
diff --git a/chapters/de/chapter1/5.mdx b/chapters/de/chapter1/5.mdx
index 502ab5d89..5f1b68f53 100644
--- a/chapters/de/chapter1/5.mdx
+++ b/chapters/de/chapter1/5.mdx
@@ -15,8 +15,8 @@ Rein Encoder-basierte Modelle eignen sich am besten für Aufgaben, die ein Verst
Zu dieser Modellfamilie gehören unter anderem:
-- [ALBERT](https://huggingface.co/transformers/model_doc/albert.html)
-- [BERT](https://huggingface.co/transformers/model_doc/bert.html)
-- [DistilBERT](https://huggingface.co/transformers/model_doc/distilbert.html)
-- [ELECTRA](https://huggingface.co/transformers/model_doc/electra.html)
-- [RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html)
+- [ALBERT](https://huggingface.co/transformers/model_doc/albert)
+- [BERT](https://huggingface.co/transformers/model_doc/bert)
+- [DistilBERT](https://huggingface.co/transformers/model_doc/distilbert)
+- [ELECTRA](https://huggingface.co/transformers/model_doc/electra)
+- [RoBERTa](https://huggingface.co/transformers/model_doc/roberta)
diff --git a/chapters/de/chapter1/6.mdx b/chapters/de/chapter1/6.mdx
index 3e0e5768a..948c010a2 100644
--- a/chapters/de/chapter1/6.mdx
+++ b/chapters/de/chapter1/6.mdx
@@ -15,7 +15,7 @@ Diese Modelle sind am besten für Aufgaben geeignet, bei denen es um die Generie
Zu dieser Modellfamilie gehören unter anderem:
-- [CTRL](https://huggingface.co/transformers/model_doc/ctrl.html)
+- [CTRL](https://huggingface.co/transformers/model_doc/ctrl)
- [GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)
-- [GPT-2](https://huggingface.co/transformers/model_doc/gpt2.html)
-- [Transformer XL](https://huggingface.co/transformers/model_doc/transformerxl.html)
+- [GPT-2](https://huggingface.co/transformers/model_doc/gpt2)
+- [Transformer XL](https://huggingface.co/transformers/model_doc/transfo-xl)
diff --git a/chapters/de/chapter1/7.mdx b/chapters/de/chapter1/7.mdx
index 0ce8561cf..4bf04585f 100644
--- a/chapters/de/chapter1/7.mdx
+++ b/chapters/de/chapter1/7.mdx
@@ -15,7 +15,7 @@ Sequence-to-Sequence-Modelle eignen sich am besten für Aufgaben, bei denen es d
Vertreter dieser Modellfamilie sind u. a.:
-- [BART](https://huggingface.co/transformers/model_doc/bart.html)
-- [mBART](https://huggingface.co/transformers/model_doc/mbart.html)
-- [Marian](https://huggingface.co/transformers/model_doc/marian.html)
-- [T5](https://huggingface.co/transformers/model_doc/t5.html)
+- [BART](https://huggingface.co/transformers/model_doc/bart)
+- [mBART](https://huggingface.co/transformers/model_doc/mbart)
+- [Marian](https://huggingface.co/transformers/model_doc/marian)
+- [T5](https://huggingface.co/transformers/model_doc/t5)
diff --git a/chapters/de/chapter3/2.mdx b/chapters/de/chapter3/2.mdx
index 50a183761..12c7f699a 100644
--- a/chapters/de/chapter3/2.mdx
+++ b/chapters/de/chapter3/2.mdx
@@ -235,7 +235,7 @@ tokenized_dataset = tokenizer(
Das funktioniert gut, hat aber den Nachteil, dass ein Dictionary zurückgegeben wird (mit unseren Schlüsselwörtern `input_ids`, `attention_mask` und `token_type_ids` und Werten aus Listen von Listen). Es funktioniert auch nur, wenn du genügend RAM hast, um den gesamten Datensatz während der Tokenisierung zu im RAM zwischen zu speichern (während die Datensätze aus der Bibliothek 🤗 Datasets [Apache Arrow](https://arrow.apache.org/) Dateien sind, die auf der Festplatte gespeichert sind, sodass nur die gewünschten Samples im RAM geladen sind).
-Um die Daten als Datensatz zu speichern, verwenden wir die Methode [`Dataset.map()`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map). Dies gewährt uns zusätzliche Flexibilität, wenn wir zusätzliche Vorverarbeitung als nur die Tokenisierung benötigen. Die `map()`-Methode funktioniert, indem sie eine Funktion auf jedes Element des Datensatzes anwendet, also definieren wir eine Funktion, die unsere Inputs tokenisiert:
+Um die Daten als Datensatz zu speichern, verwenden wir die Methode [`Dataset.map()`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.map). Dies gewährt uns zusätzliche Flexibilität, wenn wir mehr Vorverarbeitung als nur die Tokenisierung benötigen. Die `map()`-Methode funktioniert, indem sie eine Funktion auf jedes Element des Datensatzes anwendet, also definieren wir eine Funktion, die unsere Inputs tokenisiert:
```py
def tokenize_function(example):
diff --git a/chapters/de/chapter4/2.mdx b/chapters/de/chapter4/2.mdx
index 0b20fe1d9..a445d6f60 100644
--- a/chapters/de/chapter4/2.mdx
+++ b/chapters/de/chapter4/2.mdx
@@ -65,7 +65,7 @@ tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = CamembertForMaskedLM.from_pretrained("camembert-base")
```
-Dennoch empfehlen wir, dass man die [`Auto*` classes](https://huggingface.co/transformers/model_doc/auto.html?highlight=auto#auto-classes) stattdessen benutzt, da diese architekturunabhängig sind. Das vorherige Code-Beispiel gilt nur für Checkpoints, die in die CamemBERT Architektur zu laden sind, aber mit den `Auto*` Klassen kann man Checkpoints ziemlich einfach tauschen:
+Dennoch empfehlen wir, stattdessen die [`Auto*`-Klassen](https://huggingface.co/transformers/model_doc/auto?highlight=auto#auto-classes) zu benutzen, da diese architekturunabhängig sind. Das vorherige Code-Beispiel gilt nur für Checkpoints, die sich in die CamemBERT-Architektur laden lassen, aber mit den `Auto*`-Klassen kann man Checkpoints ziemlich einfach tauschen:
```py
from transformers import AutoTokenizer, AutoModelForMaskedLM
@@ -81,7 +81,7 @@ tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = TFCamembertForMaskedLM.from_pretrained("camembert-base")
```
-Hier empfehlen wir auch, dass man stattdessen die [`TFAuto*` classes](https://huggingface.co/transformers/model_doc/auto.html?highlight=auto#auto-classes) benutzt, da diese architekturunabhängig sind. Das vorherige Code-Beispiel gilt nur für Checkpoints, die in die CamemBERT Architektur zu laden sind, aber mit den `TFAuto*` Klassen kann man Checkpoints einfach tauschen:
+Auch hier empfehlen wir, stattdessen die [`TFAuto*`-Klassen](https://huggingface.co/transformers/model_doc/auto?highlight=auto#auto-classes) zu benutzen, da diese architekturunabhängig sind. Das vorherige Code-Beispiel gilt nur für Checkpoints, die sich in die CamemBERT-Architektur laden lassen, aber mit den `TFAuto*`-Klassen kann man Checkpoints einfach tauschen:
```py
from transformers import AutoTokenizer, TFAutoModelForMaskedLM
diff --git a/chapters/en/chapter1/3.mdx b/chapters/en/chapter1/3.mdx
index 4ff8b0d00..a31638e9e 100644
--- a/chapters/en/chapter1/3.mdx
+++ b/chapters/en/chapter1/3.mdx
@@ -68,7 +68,7 @@ There are three main steps involved when you pass some text to a pipeline:
3. The predictions of the model are post-processed, so you can make sense of them.
-Some of the currently [available pipelines](https://huggingface.co/transformers/main_classes/pipelines.html) are:
+Some of the currently [available pipelines](https://huggingface.co/transformers/main_classes/pipelines) are:
- `feature-extraction` (get the vector representation of a text)
- `fill-mask`
diff --git a/chapters/en/chapter1/6.mdx b/chapters/en/chapter1/6.mdx
index 396d79bd3..b0f4ba09c 100644
--- a/chapters/en/chapter1/6.mdx
+++ b/chapters/en/chapter1/6.mdx
@@ -15,7 +15,7 @@ These models are best suited for tasks involving text generation.
Representatives of this family of models include:
-- [CTRL](https://huggingface.co/transformers/model_doc/ctrl.html)
+- [CTRL](https://huggingface.co/transformers/model_doc/ctrl)
- [GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)
-- [GPT-2](https://huggingface.co/transformers/model_doc/gpt2.html)
-- [Transformer XL](https://huggingface.co/transformers/model_doc/transfo-xl.html)
+- [GPT-2](https://huggingface.co/transformers/model_doc/gpt2)
+- [Transformer XL](https://huggingface.co/transformers/model_doc/transfo-xl)
diff --git a/chapters/en/chapter1/7.mdx b/chapters/en/chapter1/7.mdx
index 23599e26f..e39c5ca8e 100644
--- a/chapters/en/chapter1/7.mdx
+++ b/chapters/en/chapter1/7.mdx
@@ -15,7 +15,7 @@ Sequence-to-sequence models are best suited for tasks revolving around generatin
Representatives of this family of models include:
-- [BART](https://huggingface.co/transformers/model_doc/bart.html)
-- [mBART](https://huggingface.co/transformers/model_doc/mbart.html)
-- [Marian](https://huggingface.co/transformers/model_doc/marian.html)
-- [T5](https://huggingface.co/transformers/model_doc/t5.html)
+- [BART](https://huggingface.co/transformers/model_doc/bart)
+- [mBART](https://huggingface.co/transformers/model_doc/mbart)
+- [Marian](https://huggingface.co/transformers/model_doc/marian)
+- [T5](https://huggingface.co/transformers/model_doc/t5)
diff --git a/chapters/en/chapter3/2.mdx b/chapters/en/chapter3/2.mdx
index f64747dac..ebf464fc3 100644
--- a/chapters/en/chapter3/2.mdx
+++ b/chapters/en/chapter3/2.mdx
@@ -235,7 +235,7 @@ tokenized_dataset = tokenizer(
This works well, but it has the disadvantage of returning a dictionary (with our keys, `input_ids`, `attention_mask`, and `token_type_ids`, and values that are lists of lists). It will also only work if you have enough RAM to store your whole dataset during the tokenization (whereas the datasets from the 🤗 Datasets library are [Apache Arrow](https://arrow.apache.org/) files stored on the disk, so you only keep the samples you ask for loaded in memory).
-To keep the data as a dataset, we will use the [`Dataset.map()`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) method. This also allows us some extra flexibility, if we need more preprocessing done than just tokenization. The `map()` method works by applying a function on each element of the dataset, so let's define a function that tokenizes our inputs:
+To keep the data as a dataset, we will use the [`Dataset.map()`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.map) method. This also allows us some extra flexibility, if we need more preprocessing done than just tokenization. The `map()` method works by applying a function on each element of the dataset, so let's define a function that tokenizes our inputs:
```py
def tokenize_function(example):
diff --git a/chapters/en/chapter4/2.mdx b/chapters/en/chapter4/2.mdx
index 8b110ec7b..0bd50669b 100644
--- a/chapters/en/chapter4/2.mdx
+++ b/chapters/en/chapter4/2.mdx
@@ -65,7 +65,7 @@ tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = CamembertForMaskedLM.from_pretrained("camembert-base")
```
-However, we recommend using the [`Auto*` classes](https://huggingface.co/transformers/model_doc/auto.html?highlight=auto#auto-classes) instead, as these are by design architecture-agnostic. While the previous code sample limits users to checkpoints loadable in the CamemBERT architecture, using the `Auto*` classes makes switching checkpoints simple:
+However, we recommend using the [`Auto*` classes](https://huggingface.co/transformers/model_doc/auto?highlight=auto#auto-classes) instead, as these are by design architecture-agnostic. While the previous code sample limits users to checkpoints loadable in the CamemBERT architecture, using the `Auto*` classes makes switching checkpoints simple:
```py
from transformers import AutoTokenizer, AutoModelForMaskedLM
@@ -81,7 +81,7 @@ tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = TFCamembertForMaskedLM.from_pretrained("camembert-base")
```
-However, we recommend using the [`TFAuto*` classes](https://huggingface.co/transformers/model_doc/auto.html?highlight=auto#auto-classes) instead, as these are by design architecture-agnostic. While the previous code sample limits users to checkpoints loadable in the CamemBERT architecture, using the `TFAuto*` classes makes switching checkpoints simple:
+However, we recommend using the [`TFAuto*` classes](https://huggingface.co/transformers/model_doc/auto?highlight=auto#auto-classes) instead, as these are by design architecture-agnostic. While the previous code sample limits users to checkpoints loadable in the CamemBERT architecture, using the `TFAuto*` classes makes switching checkpoints simple:
```py
from transformers import AutoTokenizer, TFAutoModelForMaskedLM
diff --git a/chapters/en/chapter4/3.mdx b/chapters/en/chapter4/3.mdx
index 6b27777f4..9de3fb1d8 100644
--- a/chapters/en/chapter4/3.mdx
+++ b/chapters/en/chapter4/3.mdx
@@ -178,7 +178,7 @@ Click on the "Files and versions" tab, and you should see the files visible in t
-As you've seen, the `push_to_hub()` method accepts several arguments, making it possible to upload to a specific repository or organization namespace, or to use a different API token. We recommend you take a look at the method specification available directly in the [🤗 Transformers documentation](https://huggingface.co/transformers/model_sharing.html) to get an idea of what is possible.
+As you've seen, the `push_to_hub()` method accepts several arguments, making it possible to upload to a specific repository or organization namespace, or to use a different API token. We recommend you take a look at the method specification available directly in the [🤗 Transformers documentation](https://huggingface.co/transformers/model_sharing) to get an idea of what is possible.
The `push_to_hub()` method is backed by the [`huggingface_hub`](https://github.com/huggingface/huggingface_hub) Python package, which offers a direct API to the Hugging Face Hub. It's integrated within 🤗 Transformers and several other machine learning libraries, like [`allenlp`](https://github.com/allenai/allennlp). Although we focus on the 🤗 Transformers integration in this chapter, integrating it into your own code or library is simple.
diff --git a/chapters/en/chapter5/2.mdx b/chapters/en/chapter5/2.mdx
index d3aea7b87..acf417bba 100644
--- a/chapters/en/chapter5/2.mdx
+++ b/chapters/en/chapter5/2.mdx
@@ -128,7 +128,7 @@ This is exactly what we wanted. Now, we can apply various preprocessing techniqu
-The `data_files` argument of the `load_dataset()` function is quite flexible and can be either a single file path, a list of file paths, or a dictionary that maps split names to file paths. You can also glob files that match a specified pattern according to the rules used by the Unix shell (e.g., you can glob all the JSON files in a directory as a single split by setting `data_files="*.json"`). See the 🤗 Datasets [documentation](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files) for more details.
+The `data_files` argument of the `load_dataset()` function is quite flexible and can be either a single file path, a list of file paths, or a dictionary that maps split names to file paths. You can also glob files that match a specified pattern according to the rules used by the Unix shell (e.g., you can glob all the JSON files in a directory as a single split by setting `data_files="*.json"`). See the 🤗 Datasets [documentation](https://huggingface.co/docs/datasets/loading#local-and-remote-files) for more details.
@@ -160,7 +160,7 @@ This returns the same `DatasetDict` object obtained above, but saves us the step
-✏️ **Try it out!** Pick another dataset hosted on GitHub or the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) and try loading it both locally and remotely using the techniques introduced above. For bonus points, try loading a dataset that’s stored in a CSV or text format (see the [documentation](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files) for more information on these formats).
+✏️ **Try it out!** Pick another dataset hosted on GitHub or the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) and try loading it both locally and remotely using the techniques introduced above. For bonus points, try loading a dataset that’s stored in a CSV or text format (see the [documentation](https://huggingface.co/docs/datasets/loading#local-and-remote-files) for more information on these formats).
diff --git a/chapters/en/chapter5/3.mdx b/chapters/en/chapter5/3.mdx
index 4a3ddc7b5..9e6e738bc 100644
--- a/chapters/en/chapter5/3.mdx
+++ b/chapters/en/chapter5/3.mdx
@@ -238,7 +238,7 @@ As you can see, this has removed around 15% of the reviews from our original tra
-✏️ **Try it out!** Use the `Dataset.sort()` function to inspect the reviews with the largest numbers of words. See the [documentation](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.sort) to see which argument you need to use sort the reviews by length in descending order.
+✏️ **Try it out!** Use the `Dataset.sort()` function to inspect the reviews with the largest numbers of words. See the [documentation](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.sort) to see which argument you need to use to sort the reviews by length in descending order.
@@ -385,7 +385,7 @@ tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)
ArrowInvalid: Column 1 named condition expected length 1463 but got length 1000
```
-Oh no! That didn't work! Why not? Looking at the error message will give us a clue: there is a mismatch in the lengths of one of the columns, one being of length 1,463 and the other of length 1,000. If you've looked at the `Dataset.map()` [documentation](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map), you may recall that it's the number of samples passed to the function that we are mapping; here those 1,000 examples gave 1,463 new features, resulting in a shape error.
+Oh no! That didn't work! Why not? Looking at the error message will give us a clue: there is a mismatch in the lengths of one of the columns, one being of length 1,463 and the other of length 1,000. If you've looked at the `Dataset.map()` [documentation](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.map), you may recall that it's the number of samples passed to the function that we are mapping; here those 1,000 examples gave 1,463 new features, resulting in a shape error.
The problem is that we're trying to mix two different datasets of different sizes: the `drug_dataset` columns will have a certain number of examples (the 1,000 in our error), but the `tokenized_dataset` we are building will have more (the 1,463 in the error message; it is more than 1,000 because we are tokenizing long reviews into more than one example by using `return_overflowing_tokens=True`). That doesn't work for a `Dataset`, so we need to either remove the columns from the old dataset or make them the same size as they are in the new dataset. We can do the former with the `remove_columns` argument:
diff --git a/chapters/en/chapter5/4.mdx b/chapters/en/chapter5/4.mdx
index 332e171c3..8f424d99f 100644
--- a/chapters/en/chapter5/4.mdx
+++ b/chapters/en/chapter5/4.mdx
@@ -46,7 +46,7 @@ We can see that there are 15,518,009 rows and 2 columns in our dataset -- that's
-✎ By default, 🤗 Datasets will decompress the files needed to load a dataset. If you want to preserve hard drive space, you can pass `DownloadConfig(delete_extracted=True)` to the `download_config` argument of `load_dataset()`. See the [documentation](https://huggingface.co/docs/datasets/package_reference/builder_classes.html?#datasets.utils.DownloadConfig) for more details.
+✎ By default, 🤗 Datasets will decompress the files needed to load a dataset. If you want to preserve hard drive space, you can pass `DownloadConfig(delete_extracted=True)` to the `download_config` argument of `load_dataset()`. See the [documentation](https://huggingface.co/docs/datasets/package_reference/builder_classes#datasets.DownloadConfig) for more details.
diff --git a/chapters/en/chapter5/5.mdx b/chapters/en/chapter5/5.mdx
index 10e43fe5e..5688ea04a 100644
--- a/chapters/en/chapter5/5.mdx
+++ b/chapters/en/chapter5/5.mdx
@@ -365,7 +365,7 @@ Cool, we've pushed our dataset to the Hub and it's available for others to use!
-💡 You can also upload a dataset to the Hugging Face Hub directly from the terminal by using `huggingface-cli` and a bit of Git magic. See the [🤗 Datasets guide](https://huggingface.co/docs/datasets/share.html#add-a-community-dataset) for details on how to do this.
+💡 You can also upload a dataset to the Hugging Face Hub directly from the terminal by using `huggingface-cli` and a bit of Git magic. See the [🤗 Datasets guide](https://huggingface.co/docs/datasets/share#share-a-dataset-using-the-cli) for details on how to do this.
diff --git a/chapters/en/chapter5/6.mdx b/chapters/en/chapter5/6.mdx
index 3da60cc96..418abbbb6 100644
--- a/chapters/en/chapter5/6.mdx
+++ b/chapters/en/chapter5/6.mdx
@@ -178,7 +178,7 @@ Okay, this has given us a few thousand comments to work with!
-✏️ **Try it out!** See if you can use `Dataset.map()` to explode the `comments` column of `issues_dataset` _without_ resorting to the use of Pandas. This is a little tricky; you might find the ["Batch mapping"](https://huggingface.co/docs/datasets/v1.12.1/about_map_batch.html?batch-mapping#batch-mapping) section of the 🤗 Datasets documentation useful for this task.
+✏️ **Try it out!** See if you can use `Dataset.map()` to explode the `comments` column of `issues_dataset` _without_ resorting to the use of Pandas. This is a little tricky; you might find the ["Batch mapping"](https://huggingface.co/docs/datasets/about_map_batch#batch-mapping) section of the 🤗 Datasets documentation useful for this task.
diff --git a/chapters/en/chapter6/8.mdx b/chapters/en/chapter6/8.mdx
index 7caee98ed..38086edcf 100644
--- a/chapters/en/chapter6/8.mdx
+++ b/chapters/en/chapter6/8.mdx
@@ -27,14 +27,14 @@ The 🤗 Tokenizers library has been built to provide several options for each o
More precisely, the library is built around a central `Tokenizer` class with the building blocks regrouped in submodules:
-- `normalizers` contains all the possible types of `Normalizer` you can use (complete list [here](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.normalizers)).
-- `pre_tokenizers` contains all the possible types of `PreTokenizer` you can use (complete list [here](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.pre_tokenizers)).
-- `models` contains the various types of `Model` you can use, like `BPE`, `WordPiece`, and `Unigram` (complete list [here](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.models)).
-- `trainers` contains all the different types of `Trainer` you can use to train your model on a corpus (one per type of model; complete list [here](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.trainers)).
-- `post_processors` contains the various types of `PostProcessor` you can use (complete list [here](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.processors)).
-- `decoders` contains the various types of `Decoder` you can use to decode the outputs of tokenization (complete list [here](https://huggingface.co/docs/tokenizers/python/latest/components.html#decoders)).
-
-You can find the whole list of building blocks [here](https://huggingface.co/docs/tokenizers/python/latest/components.html).
+- `normalizers` contains all the possible types of `Normalizer` you can use (complete list [here](https://huggingface.co/docs/tokenizers/api/normalizers)).
+- `pre_tokenizers` contains all the possible types of `PreTokenizer` you can use (complete list [here](https://huggingface.co/docs/tokenizers/api/pre-tokenizers)).
+- `models` contains the various types of `Model` you can use, like `BPE`, `WordPiece`, and `Unigram` (complete list [here](https://huggingface.co/docs/tokenizers/api/models)).
+- `trainers` contains all the different types of `Trainer` you can use to train your model on a corpus (one per type of model; complete list [here](https://huggingface.co/docs/tokenizers/api/trainers)).
+- `post_processors` contains the various types of `PostProcessor` you can use (complete list [here](https://huggingface.co/docs/tokenizers/api/post-processors)).
+- `decoders` contains the various types of `Decoder` you can use to decode the outputs of tokenization (complete list [here](https://huggingface.co/docs/tokenizers/components#decoders)).
+
+You can find the whole list of building blocks [here](https://huggingface.co/docs/tokenizers/components).
## Acquiring a corpus[[acquiring-a-corpus]]
diff --git a/chapters/es/chapter3/2.mdx b/chapters/es/chapter3/2.mdx
index 7aed993ea..624749727 100644
--- a/chapters/es/chapter3/2.mdx
+++ b/chapters/es/chapter3/2.mdx
@@ -240,7 +240,7 @@ tokenized_dataset = tokenizer(
Esto funciona bien, pero tiene la desventaja de que devuelve un diccionario (con nuestras llaves, `input_ids`, `attention_mask`, and `token_type_ids`, y valores que son listas de listas). Además va a trabajar solo si tienes suficiente memoria principal para almacenar todo el conjunto de datos durante la tokenización (mientras que los conjuntos de datos de la librería 🤗 Datasets son archivos [Apache Arrow](https://arrow.apache.org/) almacenados en disco, y así solo mantienes en memoria las muestras que necesitas).
-Para mantener los datos como un conjunto de datos, usaremos el método [`Dataset.map()`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map). Este también nos ofrece una flexibilidad adicional en caso de que necesitemos preprocesamiento mas allá de la tokenización. El método `map()` trabaja aplicando una función sobre cada elemento del conjunto de datos, así que definamos una función para tokenizar nuestras entradas:
+Para mantener los datos como un conjunto de datos, usaremos el método [`Dataset.map()`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.map). Este también nos ofrece una flexibilidad adicional en caso de que necesitemos preprocesamiento más allá de la tokenización. El método `map()` trabaja aplicando una función sobre cada elemento del conjunto de datos, así que definamos una función para tokenizar nuestras entradas:
```py
def tokenize_function(example):
diff --git a/chapters/es/chapter5/2.mdx b/chapters/es/chapter5/2.mdx
index a0de6264a..e68820743 100644
--- a/chapters/es/chapter5/2.mdx
+++ b/chapters/es/chapter5/2.mdx
@@ -129,7 +129,7 @@ Esto es exactamente lo que queríamos. Ahora podemos aplicar varias técnicas de
-El argumento `data_files` de la función `load_dataset()` es muy flexible. Puede ser una única ruta de archivo, una lista de rutas o un diccionario que mapee los nombres de los conjuntos a las rutas de archivo. También puedes buscar archivos que cumplan con cierto patrón específico de acuerdo con las reglas usadas por el shell de Unix (e.g., puedes buscar todos los archivos JSON en una carpeta al definir `datafiles="*.json"`). Revisa la [documentación](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files) para más detalles.
+El argumento `data_files` de la función `load_dataset()` es muy flexible. Puede ser una única ruta de archivo, una lista de rutas o un diccionario que mapee los nombres de los conjuntos a las rutas de archivo. También puedes buscar archivos que cumplan con cierto patrón específico de acuerdo con las reglas usadas por el shell de Unix (e.g., puedes buscar todos los archivos JSON en una carpeta al definir `data_files="*.json"`). Revisa la [documentación](https://huggingface.co/docs/datasets/loading#local-and-remote-files) para más detalles.
@@ -161,6 +161,6 @@ Esto devuelve el mismo objeto `DatasetDict` que obtuvimos antes, pero nos ahorra
-✏️ **¡Inténtalo!** Escoge otro dataset alojado en GitHub o en el [Repositorio de Machine Learning de UCI](https://archive.ics.uci.edu/ml/index.php) e intenta cargarlo local y remotamente usando las técnicas descritas con anterioridad. Para puntos extra, intenta cargar un dataset que esté guardado en un formato CSV o de texto (revisa la [documentación](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files) pata tener más información sobre estos formatos).
+✏️ **¡Inténtalo!** Escoge otro dataset alojado en GitHub o en el [Repositorio de Machine Learning de UCI](https://archive.ics.uci.edu/ml/index.php) e intenta cargarlo local y remotamente usando las técnicas descritas con anterioridad. Para puntos extra, intenta cargar un dataset que esté guardado en un formato CSV o de texto (revisa la [documentación](https://huggingface.co/docs/datasets/loading#local-and-remote-files) para tener más información sobre estos formatos).
diff --git a/chapters/es/chapter5/3.mdx b/chapters/es/chapter5/3.mdx
index f1cc5ab69..daae902ea 100644
--- a/chapters/es/chapter5/3.mdx
+++ b/chapters/es/chapter5/3.mdx
@@ -238,7 +238,7 @@ Como puedes ver, esto ha eliminado alrededor del 15% de las reseñas de nuestros
-✏️ **¡Inténtalo!** Usa la función `Dataset.sort()` para inspeccionar las reseñas con el mayor número de palabras. Revisa la [documentación](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.sort) para ver cuál argumento necesitas para ordenar las reseñas de mayor a menor.
+✏️ **¡Inténtalo!** Usa la función `Dataset.sort()` para inspeccionar las reseñas con el mayor número de palabras. Revisa la [documentación](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.sort) para ver cuál argumento necesitas para ordenar las reseñas de mayor a menor.
@@ -385,7 +385,7 @@ tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)
ArrowInvalid: Column 1 named condition expected length 1463 but got length 1000
```
-¿Por qué no funcionó? El mensaje de error nos da una pista: hay un desajuste en las longitudes de una de las columnas, siendo una de longitud 1.463 y otra de longitud 1.000. Si has revisado la [documentación de `Dataset.map()`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map), te habrás dado cuenta que estamos mapeando el número de muestras que le pasamos a la función: en este caso los 1.000 ejemplos nos devuelven 1.463 features, arrojando un error.
+¿Por qué no funcionó? El mensaje de error nos da una pista: hay un desajuste en las longitudes de una de las columnas, siendo una de longitud 1.463 y otra de longitud 1.000. Si has revisado la [documentación de `Dataset.map()`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.map), te habrás dado cuenta de que estamos mapeando el número de muestras que le pasamos a la función: en este caso los 1.000 ejemplos nos devuelven 1.463 features, arrojando un error.
El problema es que estamos tratando de mezclar dos datasets de tamaños diferentes: las columnas de `drug_dataset` tendrán un cierto número de ejemplos (los 1.000 en el error), pero el `tokenized_dataset` que estamos construyendo tendrá más (los 1.463 en el mensaje de error). Esto no funciona para un `Dataset`, así que tenemos que eliminar las columnas del anterior dataset o volverlas del mismo tamaño del nuevo. Podemos hacer la primera operación con el argumento `remove_columns`:
diff --git a/chapters/es/chapter5/4.mdx b/chapters/es/chapter5/4.mdx
index 38919187d..326a2a609 100644
--- a/chapters/es/chapter5/4.mdx
+++ b/chapters/es/chapter5/4.mdx
@@ -45,7 +45,7 @@ Como podemos ver, hay 15.518.009 filas y dos columnas en el dataset, ¡un montó
-✎ Por defecto, 🤗 Datasets va a descomprimir los archivos necesarios para cargar un dataset. Si quieres ahorrar espacio de almacenamiento, puedes usar `DownloadConfig(delete_extracted=True)` al argumento `download_config` de `load_dataset()`. Revisa la [documentación](https://huggingface.co/docs/datasets/package_reference/builder_classes.html?#datasets.utils.DownloadConfig) para más detalles.
+✎ Por defecto, 🤗 Datasets va a descomprimir los archivos necesarios para cargar un dataset. Si quieres ahorrar espacio de almacenamiento, puedes pasar `DownloadConfig(delete_extracted=True)` al argumento `download_config` de `load_dataset()`. Revisa la [documentación](https://huggingface.co/docs/datasets/package_reference/builder_classes#datasets.DownloadConfig) para más detalles.
diff --git a/chapters/es/chapter5/5.mdx b/chapters/es/chapter5/5.mdx
index e901c3717..21419d4e5 100644
--- a/chapters/es/chapter5/5.mdx
+++ b/chapters/es/chapter5/5.mdx
@@ -427,7 +427,7 @@ Dataset({
-💡 También puedes subir un dataset al Hub de Hugging Face directamente desde la terminal usando `huggingface-cli` y un poco de Git. Revisa la [guía de 🤗 Datasets](https://huggingface.co/docs/datasets/share.html#add-a-community-dataset) para más detalles sobre cómo hacerlo.
+💡 También puedes subir un dataset al Hub de Hugging Face directamente desde la terminal usando `huggingface-cli` y un poco de Git. Revisa la [guía de 🤗 Datasets](https://huggingface.co/docs/datasets/share#share-a-dataset-using-the-cli) para más detalles sobre cómo hacerlo.
diff --git a/chapters/es/chapter5/6.mdx b/chapters/es/chapter5/6.mdx
index 1e635315b..d92a1333c 100644
--- a/chapters/es/chapter5/6.mdx
+++ b/chapters/es/chapter5/6.mdx
@@ -189,7 +189,7 @@ Dataset({
-✏️ **¡Inténtalo!** Prueba si puedes usar la función `Dataset.map()` para "explotar" la columna `comments` en `issues_dataset` _sin_ necesidad de usar Pandas. Esto es un poco complejo; te recomendamos revisar la sección de ["Batch mapping"](https://huggingface.co/docs/datasets/v1.12.1/about_map_batch.html?batch-mapping#batch-mapping) de la documentación de 🤗 Datasets para completar esta tarea.
+✏️ **¡Inténtalo!** Prueba si puedes usar la función `Dataset.map()` para "explotar" la columna `comments` en `issues_dataset` _sin_ necesidad de usar Pandas. Esto es un poco complejo; te recomendamos revisar la sección de ["Batch mapping"](https://huggingface.co/docs/datasets/about_map_batch#batch-mapping) de la documentación de 🤗 Datasets para completar esta tarea.
diff --git a/chapters/es/chapter6/8.mdx b/chapters/es/chapter6/8.mdx
index 795be36aa..0754b82d0 100644
--- a/chapters/es/chapter6/8.mdx
+++ b/chapters/es/chapter6/8.mdx
@@ -27,14 +27,14 @@ La librería 🤗 Tokenizers ha sido construida para proveer varias opciones par
De manera más precisa, la librería está construida a partir de una clase central `Tokenizer` con las unidades más básica reagrupadas en susbmódulos:
-- `normalizers` contiene todos los posibles tipos de `Normalizer` que puedes usar (la lista completa [aquí](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.normalizers)).
-- `pre_tokenizers` contiene todos los posibles tipos de `PreTokenizer` que puedes usar (la lista completa [aquí](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.pre_tokenizers)).
-- `models` contiene los distintos tipos de `Model` que puedes usar, como `BPE`, `WordPiece`, and `Unigram` (la lista completa [aquí](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.models)).
-- `trainers` contiene todos los distintos tipos de `Trainer` que puedes usar para entrenar tu modelo en un corpus (uno por cada tipo de modelo; la lista completa [aquí](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.trainers)).
-- `post_processors` contiene varios tipos de `PostProcessor` que puedes usar (la lista completa [aquí](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.processors)).
-- `decoders` contiene varios tipos de `Decoder` que puedes usar para decodificar las salidas de la tokenización (la lista completa [aquí](https://huggingface.co/docs/tokenizers/python/latest/components.html#decoders)).
-
-Puedes encontrar la lista completas de las unidades más básicas [aquí](https://huggingface.co/docs/tokenizers/python/latest/components.html).
+- `normalizers` contiene todos los posibles tipos de `Normalizer` que puedes usar (la lista completa [aquí](https://huggingface.co/docs/tokenizers/api/normalizers)).
+- `pre_tokenizers` contiene todos los posibles tipos de `PreTokenizer` que puedes usar (la lista completa [aquí](https://huggingface.co/docs/tokenizers/api/pre-tokenizers)).
+- `models` contiene los distintos tipos de `Model` que puedes usar, como `BPE`, `WordPiece` y `Unigram` (la lista completa [aquí](https://huggingface.co/docs/tokenizers/api/models)).
+- `trainers` contiene todos los distintos tipos de `Trainer` que puedes usar para entrenar tu modelo en un corpus (uno por cada tipo de modelo; la lista completa [aquí](https://huggingface.co/docs/tokenizers/api/trainers)).
+- `post_processors` contiene varios tipos de `PostProcessor` que puedes usar (la lista completa [aquí](https://huggingface.co/docs/tokenizers/api/post-processors)).
+- `decoders` contiene varios tipos de `Decoder` que puedes usar para decodificar las salidas de la tokenización (la lista completa [aquí](https://huggingface.co/docs/tokenizers/components#decoders)).
+
+Puedes encontrar la lista completa de las unidades más básicas [aquí](https://huggingface.co/docs/tokenizers/components).
## Adquirir un corpus[[acquiring-a-corpus]]
diff --git a/chapters/fa/chapter3/2.mdx b/chapters/fa/chapter3/2.mdx
index 07d933cc6..696135a20 100644
--- a/chapters/fa/chapter3/2.mdx
+++ b/chapters/fa/chapter3/2.mdx
@@ -300,7 +300,7 @@ tokenized_dataset = tokenizer(
این روش به خوبی کار میکند، اما مشکلاش این است که دیکشنری (از کلیدهای ما شامل، `input_ids`, `attention_mask` و `token_type_ids` و مقادیر آنها که لیستهایی از لیستها هستند) برمیگرداند. همچنین این روش فقط زمانی کار میکند که حافظه موقت کافی جهت ذخیرهسازی کل دیتاسِت در حین توکِن کردن داشته باشید (در حالی که دیتاسِتهای موجود در کتابخانه `Datatasets` از هاگینگفِیس فایلهایی از نوع [Apache Arrow](https://arrow.apache.org/) هستند که روی دیسک ذخیره شدهاند، بنابراین شما فقط نمونههایی را که جهت ذخیره در حافظه درخواست کردهاید نگه میدارید).
-به منظور نگه داشتن داده به صورت یک دیتاسِت، از تابع [`Dataset.map()`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) استفاده میکنیم. چنانچه به پیشپردازشهای بیشتری علاوه بر توکِن کردن نیاز داشته باشیم این روش انعطافپذیری لازم را به ما میدهد. تابع `map()` با اعمال کردن یک عملیات روی هر عنصر دیتاسِت عمل میکند، بنابراین اجازه دهید تابعی تعریف کنیم که ورودیها را توکِن کند:
+به منظور نگه داشتن داده به صورت یک دیتاسِت، از تابع [`Dataset.map()`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.map) استفاده میکنیم. چنانچه به پیشپردازشهای بیشتری علاوه بر توکِن کردن نیاز داشته باشیم این روش انعطافپذیری لازم را به ما میدهد. تابع `map()` با اعمال کردن یک عملیات روی هر عنصر دیتاسِت عمل میکند، بنابراین اجازه دهید تابعی تعریف کنیم که ورودیها را توکِن کند:
diff --git a/chapters/fr/chapter3/2.mdx b/chapters/fr/chapter3/2.mdx
index 70b529f7b..0aa69be72 100644
--- a/chapters/fr/chapter3/2.mdx
+++ b/chapters/fr/chapter3/2.mdx
@@ -246,7 +246,7 @@ tokenized_dataset = tokenizer(
Cela fonctionne bien, mais a l'inconvénient de retourner un dictionnaire (avec nos clés, `input_ids`, `attention_mask`, et `token_type_ids`, et des valeurs qui sont des listes de listes). Cela ne fonctionnera également que si vous avez assez de RAM pour stocker l'ensemble de votre jeu de données pendant la tokenisation (alors que les jeux de données de la bibliothèque 🤗 *Datasets* sont des fichiers [Apache Arrow](https://arrow.apache.org/) stockés sur le disque, vous ne gardez donc en mémoire que les échantillons que vous demandez).
-Pour conserver les données sous forme de jeu de données, nous utiliserons la méthode [`Dataset.map()`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map). Cela nous permet également une certaine flexibilité, si nous avons besoin d'un prétraitement plus poussé que la simple tokenisation. La méthode `map()` fonctionne en appliquant une fonction sur chaque élément de l'ensemble de données, donc définissons une fonction qui tokenise nos entrées :
+Pour conserver les données sous forme de jeu de données, nous utiliserons la méthode [`Dataset.map()`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.map). Cela nous permet également une certaine flexibilité, si nous avons besoin d'un prétraitement plus poussé que la simple tokenisation. La méthode `map()` fonctionne en appliquant une fonction sur chaque élément de l'ensemble de données, donc définissons une fonction qui tokenise nos entrées :
```py
def tokenize_function(example):
diff --git a/chapters/fr/chapter5/2.mdx b/chapters/fr/chapter5/2.mdx
index 86a218a7c..58640da8b 100644
--- a/chapters/fr/chapter5/2.mdx
+++ b/chapters/fr/chapter5/2.mdx
@@ -132,7 +132,7 @@ C'est exactement ce que nous voulions. Désormais, nous pouvons appliquer divers
-L'argument `data_files` de la fonction `load_dataset()` est assez flexible et peut être soit un chemin de fichier unique, une liste de chemins de fichiers, ou un dictionnaire qui fait correspondre les noms des échantillons aux chemins de fichiers. Vous pouvez également regrouper les fichiers correspondant à un motif spécifié selon les règles utilisées par le shell Unix. Par exemple, vous pouvez regrouper tous les fichiers JSON d'un répertoire en une seule division en définissant `data_files="*.json"`. Voir la [documentation](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files) de 🤗 *Datasets* pour plus de détails.
+L'argument `data_files` de la fonction `load_dataset()` est assez flexible et peut être soit un chemin de fichier unique, une liste de chemins de fichiers, ou un dictionnaire qui fait correspondre les noms des échantillons aux chemins de fichiers. Vous pouvez également regrouper les fichiers correspondant à un motif spécifié selon les règles utilisées par le shell Unix. Par exemple, vous pouvez regrouper tous les fichiers JSON d'un répertoire en une seule division en définissant `data_files="*.json"`. Voir la [documentation](https://huggingface.co/docs/datasets/loading#local-and-remote-files) de 🤗 *Datasets* pour plus de détails.
@@ -164,6 +164,6 @@ Cela renvoie le même objet `DatasetDict` obtenu ci-dessus mais nous évite de t
-✏️ **Essayez !** Choisissez un autre jeu de données hébergé sur GitHub ou dans le [*UCI Machine Learning Repository*](https://archive.ics.uci.edu/ml/index.php) et essayez de le charger localement et à distance en utilisant les techniques présentées ci-dessus. Pour obtenir des points bonus, essayez de charger un jeu de données stocké au format CSV ou texte (voir la [documentation](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files) pour plus d'informations sur ces formats).
+✏️ **Essayez !** Choisissez un autre jeu de données hébergé sur GitHub ou dans le [*UCI Machine Learning Repository*](https://archive.ics.uci.edu/ml/index.php) et essayez de le charger localement et à distance en utilisant les techniques présentées ci-dessus. Pour obtenir des points bonus, essayez de charger un jeu de données stocké au format CSV ou texte (voir la [documentation](https://huggingface.co/docs/datasets/loading#local-and-remote-files) pour plus d'informations sur ces formats).
diff --git a/chapters/fr/chapter5/3.mdx b/chapters/fr/chapter5/3.mdx
index 271396fcb..1c38a5fff 100644
--- a/chapters/fr/chapter5/3.mdx
+++ b/chapters/fr/chapter5/3.mdx
@@ -249,7 +249,7 @@ Comme vous pouvez le constater, cela a supprimé environ 15 % des avis de nos je
-✏️ **Essayez !** Utilisez la fonction `Dataset.sort()` pour inspecter les avis avec le plus grand nombre de mots. Consultez la [documentation](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.sort) pour voir quel argument vous devez utiliser pour trier les avis par longueur dans l'ordre décroissant.
+✏️ **Essayez !** Utilisez la fonction `Dataset.sort()` pour inspecter les avis avec le plus grand nombre de mots. Consultez la [documentation](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.sort) pour voir quel argument vous devez utiliser pour trier les avis par longueur dans l'ordre décroissant.
@@ -396,7 +396,7 @@ tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)
ArrowInvalid: Column 1 named condition expected length 1463 but got length 1000
```
-Oh non ! Cela n'a pas fonctionné ! Pourquoi ? L'examen du message d'erreur nous donne un indice : il y a une incompatibilité dans les longueurs de l'une des colonnes. L'une étant de longueur 1 463 et l'autre de longueur 1 000. Si vous avez consulté la [documentation](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) de `Dataset.map()`, vous vous souvenez peut-être qu'il s'agit du nombre d'échantillons passés à la fonction que nous mappons. Ici, ces 1 000 exemples ont donné 1 463 nouvelles caractéristiques, entraînant une erreur de forme.
+Oh non ! Cela n'a pas fonctionné ! Pourquoi ? L'examen du message d'erreur nous donne un indice : il y a une incompatibilité dans les longueurs de l'une des colonnes. L'une étant de longueur 1 463 et l'autre de longueur 1 000. Si vous avez consulté la [documentation](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.map) de `Dataset.map()`, vous vous souvenez peut-être qu'il s'agit du nombre d'échantillons passés à la fonction que nous mappons. Ici, ces 1 000 exemples ont donné 1 463 nouvelles caractéristiques, entraînant une erreur de forme.
Le problème est que nous essayons de mélanger deux jeux de données différents de tailles différentes : les colonnes `drug_dataset` auront un certain nombre d'exemples (les 1 000 dans notre erreur), mais le `tokenized_dataset` que nous construisons en aura plus (le 1 463 dans le message d'erreur). Cela ne fonctionne pas pour un `Dataset`, nous devons donc soit supprimer les colonnes de l'ancien jeu de données, soit leur donner la même taille que dans le nouveau jeu de données. Nous pouvons faire la première option avec l'argument `remove_columns` :
diff --git a/chapters/fr/chapter5/4.mdx b/chapters/fr/chapter5/4.mdx
index 1fdc67418..2d3d62ba6 100644
--- a/chapters/fr/chapter5/4.mdx
+++ b/chapters/fr/chapter5/4.mdx
@@ -48,7 +48,7 @@ Nous pouvons voir qu'il y a 15 518 009 lignes et 2 colonnes dans notre jeu de do
-✎ Par défaut, 🤗 *Datasets* décompresse les fichiers nécessaires pour charger un jeu de données. Si vous souhaitez conserver de l'espace sur le disque dur, vous pouvez passer `DownloadConfig(delete_extracted=True)` à l'argument `download_config` de `load_dataset()`. Voir la [documentation](https://huggingface.co/docs/datasets/package_reference/builder_classes.html?#datasets.utils.DownloadConfig) pour plus de détails.
+✎ Par défaut, 🤗 *Datasets* décompresse les fichiers nécessaires pour charger un jeu de données. Si vous souhaitez conserver de l'espace sur le disque dur, vous pouvez passer `DownloadConfig(delete_extracted=True)` à l'argument `download_config` de `load_dataset()`. Voir la [documentation](https://huggingface.co/docs/datasets/package_reference/builder_classes#datasets.DownloadConfig) pour plus de détails.
diff --git a/chapters/fr/chapter5/5.mdx b/chapters/fr/chapter5/5.mdx
index af48c82b3..63a14bf36 100644
--- a/chapters/fr/chapter5/5.mdx
+++ b/chapters/fr/chapter5/5.mdx
@@ -431,7 +431,7 @@ Cool, nous avons poussé notre jeu de données vers le *Hub* et il est disponibl
-💡 Vous pouvez également télécharger un jeu de données sur le *Hub* directement depuis le terminal en utilisant `huggingface-cli` et un peu de magie Git. Consultez le [guide de 🤗 *Datasets*](https://huggingface.co/docs/datasets/share.html#add-a-community-dataset) pour savoir comment procéder.
+💡 Vous pouvez également télécharger un jeu de données sur le *Hub* directement depuis le terminal en utilisant `huggingface-cli` et un peu de magie Git. Consultez le [guide de 🤗 *Datasets*](https://huggingface.co/docs/datasets/share#share-a-dataset-using-the-cli) pour savoir comment procéder.
diff --git a/chapters/fr/chapter5/6.mdx b/chapters/fr/chapter5/6.mdx
index 610af1078..19d2e1f5d 100644
--- a/chapters/fr/chapter5/6.mdx
+++ b/chapters/fr/chapter5/6.mdx
@@ -193,7 +193,7 @@ D'accord, cela nous a donné quelques milliers de commentaires avec lesquels tra
-✏️ **Essayez !** Voyez si vous pouvez utiliser `Dataset.map()` pour exploser la colonne `comments` de `issues_dataset` _sans_ recourir à l'utilisation de Pandas. C'est un peu délicat. La section [« Batch mapping »](https://huggingface.co/docs/datasets/v1.12.1/about_map_batch.html?batch-mapping#batch-mapping) de la documentation 🤗 *Datasets* peut être utile pour cette tâche.
+✏️ **Essayez !** Voyez si vous pouvez utiliser `Dataset.map()` pour exploser la colonne `comments` de `issues_dataset` _sans_ recourir à l'utilisation de Pandas. C'est un peu délicat. La section [« Batch mapping »](https://huggingface.co/docs/datasets/about_map_batch#batch-mapping) de la documentation 🤗 *Datasets* peut être utile pour cette tâche.
diff --git a/chapters/fr/chapter6/8.mdx b/chapters/fr/chapter6/8.mdx
index f8a740502..5bed366b9 100644
--- a/chapters/fr/chapter6/8.mdx
+++ b/chapters/fr/chapter6/8.mdx
@@ -30,14 +30,14 @@ La bibliothèque 🤗 *Tokenizers* a été construite pour fournir plusieurs opt
Plus précisément, la bibliothèque est construite autour d'une classe centrale `Tokenizer` avec les blocs de construction regroupés en sous-modules :
-- `normalizers` contient tous les types de `Normalizer` que vous pouvez utiliser (liste complète [ici](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.normalizers)),
-- `pre_tokenizers` contient tous les types de `PreTokenizer` que vous pouvez utiliser (liste complète [ici](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.pre_tokenizers)),
-- `models` contient les différents types de `Model` que vous pouvez utiliser, comme `BPE`, `WordPiece`, et `Unigram` (liste complète [ici](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.models)),
-- `trainers` contient tous les différents types de `Trainer` que vous pouvez utiliser pour entraîner votre modèle sur un corpus (un par type de modèle ; liste complète [ici](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.trainers)),
-- `post_processors` contient les différents types de `PostProcessor` que vous pouvez utiliser (liste complète [ici](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.processors)),
-- `decoders` contient les différents types de `Decoder` que vous pouvez utiliser pour décoder les sorties de tokenization (liste complète [ici](https://huggingface.co/docs/tokenizers/python/latest/components.html#decoders)).
-
-Vous pouvez trouver la liste complète des blocs de construction [ici](https://huggingface.co/docs/tokenizers/python/latest/components.html).
+- `normalizers` contient tous les types de `Normalizer` que vous pouvez utiliser (liste complète [ici](https://huggingface.co/docs/tokenizers/api/normalizers)),
+- `pre_tokenizers` contient tous les types de `PreTokenizer` que vous pouvez utiliser (liste complète [ici](https://huggingface.co/docs/tokenizers/api/pre-tokenizers)),
+- `models` contient les différents types de `Model` que vous pouvez utiliser, comme `BPE`, `WordPiece`, et `Unigram` (liste complète [ici](https://huggingface.co/docs/tokenizers/api/models)),
+- `trainers` contient tous les différents types de `Trainer` que vous pouvez utiliser pour entraîner votre modèle sur un corpus (un par type de modèle ; liste complète [ici](https://huggingface.co/docs/tokenizers/api/trainers)),
+- `post_processors` contient les différents types de `PostProcessor` que vous pouvez utiliser (liste complète [ici](https://huggingface.co/docs/tokenizers/api/post-processors)),
+- `decoders` contient les différents types de `Decoder` que vous pouvez utiliser pour décoder les sorties de tokenization (liste complète [ici](https://huggingface.co/docs/tokenizers/components#decoders)).
+
+Vous pouvez trouver la liste complète des blocs de construction [ici](https://huggingface.co/docs/tokenizers/components).
## Acquisition d'un corpus
diff --git a/chapters/hi/chapter3/2.mdx b/chapters/hi/chapter3/2.mdx
index 481fc824f..066bfddfc 100644
--- a/chapters/hi/chapter3/2.mdx
+++ b/chapters/hi/chapter3/2.mdx
@@ -235,7 +235,7 @@ tokenized_dataset = tokenizer(
यह अच्छी तरह से काम करता है, लेकिन इसमें एक शब्दकोश (साथ में हमारी कुंजी, `input_ids`, `attention_mask`, और `token_type_ids`, और मान जो सूचियों की सूचियां हैं) के लौटने का नुकसान है। यह केवल तभी काम करेगा जब आपके पास पर्याप्त RAM हो अपने पूरे डेटासेट को टोकननाइजेशन के दौरान स्टोर करने के लिए (जबकि 🤗 डेटासेट लाइब्रेरी के डेटासेट [अपाचे एरो](https://arrow.apache.org/) फाइलें हैं जो डिस्क पर संग्रहीत है, तो आप केवल उन सैम्पल्स को रखते हैं जिन्हें आप मेमोरी मे लोड करना चाहतें है)।
-डेटा को डेटासेट के रूप में रखने के लिए, हम [`Dataset.map()`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) पद्धति का उपयोग करेंगे। अगर हमें सिर्फ टोकननाइजेशन की तुलना में अधिक पूर्व प्रसंस्करण की आवश्यकता होती है, तो यह हमें कुछ अधिक लचीलेपन की भी अनुमति देता है। `map()` विधि डेटासेट के प्रत्येक तत्व पर एक फ़ंक्शन लागू करके काम करती है, तो चलिए एक फ़ंक्शन को परिभाषित करते हैं जो हमारे इनपुट को टोकननाइज़ करेगा :
+डेटा को डेटासेट के रूप में रखने के लिए, हम [`Dataset.map()`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.map) पद्धति का उपयोग करेंगे। अगर हमें सिर्फ टोकननाइजेशन की तुलना में अधिक पूर्व प्रसंस्करण की आवश्यकता होती है, तो यह हमें कुछ अधिक लचीलेपन की भी अनुमति देता है। `map()` विधि डेटासेट के प्रत्येक तत्व पर एक फ़ंक्शन लागू करके काम करती है, तो चलिए एक फ़ंक्शन को परिभाषित करते हैं जो हमारे इनपुट को टोकननाइज़ करेगा :
```py
def tokenize_function(example):
diff --git a/chapters/it/chapter3/2.mdx b/chapters/it/chapter3/2.mdx
index 5ebea52a6..0b428d08d 100644
--- a/chapters/it/chapter3/2.mdx
+++ b/chapters/it/chapter3/2.mdx
@@ -236,7 +236,7 @@ tokenized_dataset = tokenizer(
Questo metodo funziona, ma ha lo svantaggio di restituire un dizionario (avente `input_ids`, `attention_mask`, e `token_type_ids` come chiavi, e delle liste di liste come valori). Oltretutto, questo metodo funziona solo se si ha a disposizione RAM sufficiente per contenere l'intero dataset durante la tokenizzazione (mentre i dataset dalla libreria 🤗 Datasets sono file [Apache Arrow](https://arrow.apache.org/) archiviati su disco, perciò in memoria vengono caricati solo i campioni richiesti).
-Per tenere i dati come dataset, utilizzare il metodo [`Dataset.map()`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map). Ciò permette anche della flessibilità extra, qualora fosse necessario del preprocessing aggiuntivo oltre alla tokenizzazione. Il metodo `map()` applica una funziona ad ogni elemento del dataset, perciò bisogna definire una funzione che tokenizzi gli input:
+Per tenere i dati come dataset, utilizzare il metodo [`Dataset.map()`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.map). Ciò permette anche della flessibilità extra, qualora fosse necessario del preprocessing aggiuntivo oltre alla tokenizzazione. Il metodo `map()` applica una funzione ad ogni elemento del dataset, perciò bisogna definire una funzione che tokenizzi gli input:
```py
def tokenize_function(example):
diff --git a/chapters/it/chapter5/2.mdx b/chapters/it/chapter5/2.mdx
index 738710236..c3ead7ad6 100644
--- a/chapters/it/chapter5/2.mdx
+++ b/chapters/it/chapter5/2.mdx
@@ -128,7 +128,7 @@ Questo è proprio ciò che volevamo. Ora possiamo applicare diverse tecniche di
-L'argomento `data_files` della funzione `load_dataset()` è molto flessibile, e può essere usato con un percorso file singolo, con una lista di percorsi file, o un dizionario che mappa i nomi delle sezioni ai percorsi file. È anche possibile usare comandi glob per recuperare tutti i file che soddisfano uno specifico pattern secondo le regole dello shell di Unix (ad esempio, è possibile recuperare tutti i file JSON presenti in una cartella usando il pattern `data_files="*.json"`). Consulta la [documentazione](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files) 🤗 Datasets per maggiori informazioni.
+L'argomento `data_files` della funzione `load_dataset()` è molto flessibile, e può essere usato con un percorso file singolo, con una lista di percorsi file, o un dizionario che mappa i nomi delle sezioni ai percorsi file. È anche possibile usare comandi glob per recuperare tutti i file che soddisfano uno specifico pattern secondo le regole dello shell di Unix (ad esempio, è possibile recuperare tutti i file JSON presenti in una cartella usando il pattern `data_files="*.json"`). Consulta la [documentazione](https://huggingface.co/docs/datasets/loading#local-and-remote-files) 🤗 Datasets per maggiori informazioni.
@@ -160,7 +160,7 @@ Questo codice restituisce lo stesso oggetto `DatasetDict` visto in precedenza, m
-✏️ **Prova tu!** Scegli un altro dataset presente su GitHub o sulla [Repository di Machine Learning UCI](https://archive.ics.uci.edu/ml/index.php) e cerca di caricare sia in locale che in remoto usando le tecniche introdotte in precedenza. Per punti extra, prova a caricare un dataset archiviato in formato CSV o testuale (vedi la [documentazione](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files) per ulteriori informazioni su questi formati).
+✏️ **Prova tu!** Scegli un altro dataset presente su GitHub o sulla [Repository di Machine Learning UCI](https://archive.ics.uci.edu/ml/index.php) e cerca di caricare sia in locale che in remoto usando le tecniche introdotte in precedenza. Per punti extra, prova a caricare un dataset archiviato in formato CSV o testuale (vedi la [documentazione](https://huggingface.co/docs/datasets/loading#local-and-remote-files) per ulteriori informazioni su questi formati).
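To make the `data_files` flexibility described in the hunks above concrete, here is a small sketch using the SQuAD-it files mentioned in these chapters (file names follow the surrounding text; the glob line is only illustrative):

```py
from datasets import load_dataset

# a single path, a list of paths, or a dict mapping split names to paths all work
data_files = {"train": "SQuAD_it-train.json", "test": "SQuAD_it-test.json"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

# glob patterns following Unix shell rules are also accepted
all_json = load_dataset("json", data_files="*.json")
```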
diff --git a/chapters/it/chapter5/3.mdx b/chapters/it/chapter5/3.mdx
index 6b8bc7f8f..3eafa185e 100644
--- a/chapters/it/chapter5/3.mdx
+++ b/chapters/it/chapter5/3.mdx
@@ -241,7 +241,7 @@ Come puoi vedere, questo ha rimosso circa il 15% delle recensioni nelle sezioni
-✏️ **Prova tu!** Usa la funzione `Dataset.sort()` per analizzare le revisioni con il maggior numero di parole. Controlla la [documentazione](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.sort) per vedere quali argomenti bisogna usare per ordinare le recensioni in ordine decrescente di lunghezza.
+✏️ **Prova tu!** Usa la funzione `Dataset.sort()` per analizzare le revisioni con il maggior numero di parole. Controlla la [documentazione](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.sort) per vedere quali argomenti bisogna usare per ordinare le recensioni in ordine decrescente di lunghezza.
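A possible solution to the exercise above, assuming a `review_length` column has already been added with `Dataset.map()` as the chapter does elsewhere:

```py
# sort the training split by review length, longest first
sorted_dataset = drug_dataset["train"].sort("review_length", reverse=True)
print(sorted_dataset["review_length"][:3])
```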
@@ -386,7 +386,7 @@ tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)
ArrowInvalid: Column 1 named condition expected length 1463 but got length 1000
```
-Oh no! Non ha funzionato! Perché? Il messaggio di errore ci dà un indizio: c'è una discordanza tra la lungheza di una delle colonne (una è lunga 1.463 e l'altra 1.000). Se hai guardato la [documentazione](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) di `Dataset.map()`, ricorderai che quello è il numero di campioni passati alla funzione map; qui quei 1.000 esempi danno 1.463 nuove feature, che risulta in un errore di shape.
+Oh no! Non ha funzionato! Perché? Il messaggio di errore ci dà un indizio: c'è una discordanza tra la lunghezza di una delle colonne (una è lunga 1.463 e l'altra 1.000). Se hai guardato la [documentazione](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.map) di `Dataset.map()`, ricorderai che quello è il numero di campioni passati alla funzione map; qui quei 1.000 esempi danno 1.463 nuove feature, che risulta in un errore di shape.
Il problema è che stiamo cercando di mescolare due dataset diversi di grandezze diverse: le colonne del `drug_dataset` avranno un certo numero di esempi (il 1.000 del nostro errore), ma il `tokenized_dataset` che stiamo costruendo ne avrà di più (il 1.463 nel nostro messaggio di errore). Non va bene per un `Dataset`, per cui abbiamo bisogno o di rimuovere le colonne dal dataset vecchio, o renderle della stessa dimensione del nuovo dataset. La prima opzione può essere effettuata utilizzando l'argomento `remove_columns`:
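As the context line above suggests, one way to resolve the mismatch is to drop the old columns while mapping; a hedged sketch using the names from this chapter:

```py
# remove_columns drops the original columns, so the output may have a different number of rows
tokenized_dataset = drug_dataset.map(
    tokenize_and_split,
    batched=True,
    remove_columns=drug_dataset["train"].column_names,
)
```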
diff --git a/chapters/it/chapter5/4.mdx b/chapters/it/chapter5/4.mdx
index 57f15060c..ad2f6e191 100644
--- a/chapters/it/chapter5/4.mdx
+++ b/chapters/it/chapter5/4.mdx
@@ -47,7 +47,7 @@ Possiamo vedere che ci sono 15.518.009 righe e 2 colonne nel nostro dataset -- u
-✎ Di base, 🤗 Datasets decomprimerà i file necessari a caricare un dataset. Se vuoi risparmiare sullo spazio dell'hard disk, puoi passare `DownloadConfig(delete_extracted_True)` all'argomento `download_config` di `load_dataset()`. Per maggiori dettagli leggi la [documentazione](https://huggingface.co/docs/datasets/package_reference/builder_classes.html?#datasets.utils.DownloadConfig).
+✎ Di base, 🤗 Datasets decomprimerà i file necessari a caricare un dataset. Se vuoi risparmiare sullo spazio dell'hard disk, puoi passare `DownloadConfig(delete_extracted=True)` all'argomento `download_config` di `load_dataset()`. Per maggiori dettagli leggi la [documentazione](https://huggingface.co/docs/datasets/package_reference/builder_classes#datasets.DownloadConfig).
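For reference, the call described in this tip could be sketched as follows (the dataset name is a placeholder, not the one used in the chapter):

```py
from datasets import DownloadConfig, load_dataset

# delete the extracted archives after loading to save disk space
dataset = load_dataset(
    "imdb",  # placeholder dataset name, used only for illustration
    download_config=DownloadConfig(delete_extracted=True),
)
```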
diff --git a/chapters/it/chapter5/5.mdx b/chapters/it/chapter5/5.mdx
index 3a9c0f19f..a9246a463 100644
--- a/chapters/it/chapter5/5.mdx
+++ b/chapters/it/chapter5/5.mdx
@@ -430,7 +430,7 @@ Bene, abbiamo caricato il nostro dataset sull'Hub, e può essere utilizzato da t
-💡 Puoi caricare un dataset nell'Hub di Hugging Face anche direttamente dal terminale usando `huggingface-cli` e un po' di magia Git. La [guida a 🤗 Datasets](https://huggingface.co/docs/datasets/share.html#add-a-community-dataset) spiega come farlo.
+💡 Puoi caricare un dataset nell'Hub di Hugging Face anche direttamente dal terminale usando `huggingface-cli` e un po' di magia Git. La [guida a 🤗 Datasets](https://huggingface.co/docs/datasets/share#share-a-dataset-using-the-cli) spiega come farlo.
diff --git a/chapters/it/chapter5/6.mdx b/chapters/it/chapter5/6.mdx
index b1296144f..1bd0dca34 100644
--- a/chapters/it/chapter5/6.mdx
+++ b/chapters/it/chapter5/6.mdx
@@ -190,7 +190,7 @@ Perfetto, ora abbiamo qualche migliaio di commenti con cui lavorare!
-✏️ **Prova tu!** Prova ad utilizzare `Dataset.map()` per far esplodere la colonna `commenti` di `issues_dataset` _senza_ utilizzare Pandas. È un po' difficile: potrebbe tornarti utile la sezione ["Batch mapping"](https://huggingface.co/docs/datasets/v1.12.1/about_map_batch.html?batch-mapping#batch-mapping) della documentazione di 🤗 Datasets.
+✏️ **Prova tu!** Prova ad utilizzare `Dataset.map()` per far esplodere la colonna `commenti` di `issues_dataset` _senza_ utilizzare Pandas. È un po' difficile: potrebbe tornarti utile la sezione ["Batch mapping"](https://huggingface.co/docs/datasets/about_map_batch#batch-mapping) della documentazione di 🤗 Datasets.
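A hedged sketch of the batch-mapping approach this exercise points to (the `comments` column name comes from the exercise text; how the remaining columns are carried over is an assumption):

```py
def explode_comments(batch):
    # build a new batch with one row per individual comment
    new_batch = {key: [] for key in batch}
    for i, comments in enumerate(batch["comments"]):
        for comment in comments:
            for key in batch:
                new_batch[key].append(comment if key == "comments" else batch[key][i])
    return new_batch


# batched=True allows the function to return more rows than it received
comments_dataset = issues_dataset.map(explode_comments, batched=True)
```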
diff --git a/chapters/ko/chapter5/2.mdx b/chapters/ko/chapter5/2.mdx
index 9f8c463f1..67ccff72a 100644
--- a/chapters/ko/chapter5/2.mdx
+++ b/chapters/ko/chapter5/2.mdx
@@ -128,7 +128,7 @@ DatasetDict({
-`load_dataset()` 함수의 `data_files` 인수는 매우 유연하여 하나의 파일 경로 (str), 파일 경로들의 리스트 (list) 또는 스플릿 이름을 각 파일 경로에 매핑하는 dictionary를 값으로 받을 수 있습니다. 또한 Unix 쉘에서 사용하는 규칙에 따라 지정된 패턴과 일치하는 파일들을 glob할 수도 있습니다. (예를 들어, `data_files="*.json"`와 같이 설정하면 한 폴더에 있는 모든 JSON 파일들을 하나의 스플릿으로 glob할 수 있습니다.) 자세한 내용은 🤗 Datasets [documentation](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files)에서 확인할 수 있습니다.
+`load_dataset()` 함수의 `data_files` 인수는 매우 유연하여 하나의 파일 경로 (str), 파일 경로들의 리스트 (list) 또는 스플릿 이름을 각 파일 경로에 매핑하는 dictionary를 값으로 받을 수 있습니다. 또한 Unix 쉘에서 사용하는 규칙에 따라 지정된 패턴과 일치하는 파일들을 glob할 수도 있습니다. (예를 들어, `data_files="*.json"`와 같이 설정하면 한 폴더에 있는 모든 JSON 파일들을 하나의 스플릿으로 glob할 수 있습니다.) 자세한 내용은 🤗 Datasets [documentation](https://huggingface.co/docs/datasets/loading#local-and-remote-files)에서 확인할 수 있습니다.
diff --git a/chapters/pt/chapter5/2.mdx b/chapters/pt/chapter5/2.mdx
index 053e35ae9..2d12ad4e0 100644
--- a/chapters/pt/chapter5/2.mdx
+++ b/chapters/pt/chapter5/2.mdx
@@ -130,7 +130,7 @@ Isto é exatamente o que queríamos. Agora, podemos aplicar várias técnicas de
-O argumento `data_files` da função `load_dataset()` é bastante flexível e pode ser um único caminho de arquivo ou uma lista de caminhos de arquivo, ou um dicionário que mapeia nomes divididos para caminhos de arquivo. Você também pode incluir arquivos que correspondam a um padrão especificado de acordo com as regras utilizadas pela Unix shell (por exemplo, você pode adicionar todos os arquivos JSON em um diretório como uma única divisão, definindo `data_files="*.json"`). Consulte a [documentação](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files) do 🤗 Datasets para obter mais detalhes.
+O argumento `data_files` da função `load_dataset()` é bastante flexível e pode ser um único caminho de arquivo ou uma lista de caminhos de arquivo, ou um dicionário que mapeia nomes divididos para caminhos de arquivo. Você também pode incluir arquivos que correspondam a um padrão especificado de acordo com as regras utilizadas pela Unix shell (por exemplo, você pode adicionar todos os arquivos JSON em um diretório como uma única divisão, definindo `data_files="*.json"`). Consulte a [documentação](https://huggingface.co/docs/datasets/loading#local-and-remote-files) do 🤗 Datasets para obter mais detalhes.
@@ -161,7 +161,7 @@ squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
Isto retorna o mesmo objeto `DatasetDict` obtido anteriormente, mas nos poupa o passo de baixar e descomprimir manualmente os arquivos _SQuAD_it-*.json.gz_. Isto envolve nas várias formas de carregar conjuntos de dados que não estão hospedados no Hugging Face Hub. Agora que temos um conjunto de dados para brincar, vamos sujar as mãos com várias técnicas de manipulação de dados!
-✏️ **Tente fazer isso!** Escolha outro conjunto de dados hospedado no GitHub ou no [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) e tente carregá-lo tanto local como remotamente usando as técnicas introduzidas acima. Para pontos bônus, tente carregar um conjunto de dados que esteja armazenado em formato CSV ou texto (veja a [documentação](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files) para mais informações sobre estes formatos).
+✏️ **Tente fazer isso!** Escolha outro conjunto de dados hospedado no GitHub ou no [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) e tente carregá-lo tanto local como remotamente usando as técnicas introduzidas acima. Para pontos bônus, tente carregar um conjunto de dados que esteja armazenado em formato CSV ou texto (veja a [documentação](https://huggingface.co/docs/datasets/loading#local-and-remote-files) para mais informações sobre estes formatos).
diff --git a/chapters/pt/chapter5/3.mdx b/chapters/pt/chapter5/3.mdx
index 7d48c6149..ba734d249 100644
--- a/chapters/pt/chapter5/3.mdx
+++ b/chapters/pt/chapter5/3.mdx
@@ -238,7 +238,7 @@ Como você pode ver, isso removeu cerca de 15% das avaliações de nossos conjun
-✏️ **Experimente!** Use a função `Dataset.sort()` para inspecionar as resenhas com o maior número de palavras. Consulte a [documentação](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.sort) para ver qual argumento você precisa usar para classificar as avaliações por tamanho em ordem decrescente.
+✏️ **Experimente!** Use a função `Dataset.sort()` para inspecionar as resenhas com o maior número de palavras. Consulte a [documentação](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.sort) para ver qual argumento você precisa usar para classificar as avaliações por tamanho em ordem decrescente.
@@ -384,7 +384,7 @@ tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)
ArrowInvalid: Column 1 named condition expected length 1463 but got length 1000
```
-Oh não! Isso não funcionou! Por que não? Observar a mensagem de erro nos dará uma pista: há uma incompatibilidade nos comprimentos de uma das colunas, sendo uma de comprimento 1.463 e a outra de comprimento 1.000. Se você consultou a [documentação] do `Dataset.map()` (https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map), você deve se lembrar de que é o número de amostras passadas para a função que estamos mapeando; aqui, esses 1.000 exemplos forneceram 1.463 novos recursos, resultando em um erro de forma.
+Oh não! Isso não funcionou! Por que não? Observar a mensagem de erro nos dará uma pista: há uma incompatibilidade nos comprimentos de uma das colunas, sendo uma de comprimento 1.463 e a outra de comprimento 1.000. Se você consultou a [documentação](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.map) do `Dataset.map()`, você deve se lembrar de que é o número de amostras passadas para a função que estamos mapeando; aqui, esses 1.000 exemplos forneceram 1.463 novos recursos, resultando em um erro de forma.
O problema é que estamos tentando misturar dois conjuntos de dados diferentes de tamanhos diferentes: as colunas `drug_dataset` terão um certo número de exemplos (os 1.000 em nosso erro), mas o `tokenized_dataset` que estamos construindo terá mais (o 1.463 na mensagem de erro). Isso não funciona para um `Dataset`, portanto, precisamos remover as colunas do conjunto de dados antigo ou torná-las do mesmo tamanho do novo conjunto de dados. Podemos fazer o primeiro com o argumento `remove_columns`:
diff --git a/chapters/pt/chapter5/4.mdx b/chapters/pt/chapter5/4.mdx
index 0018334e6..2a64e2471 100644
--- a/chapters/pt/chapter5/4.mdx
+++ b/chapters/pt/chapter5/4.mdx
@@ -46,7 +46,7 @@ Podemos ver que há 15.518.009 linhas e 2 colunas em nosso conjunto de dados - i
-✎ Por padrão, 🤗 Datasets descompactará os arquivos necessários para carregar um dataset. Se você quiser preservar espaço no disco rígido, você pode passar `DownloadConfig(delete_extracted=True)` para o argumento `download_config` de `load_dataset()`. Consulte a [documentação](https://huggingface.co/docs/datasets/package_reference/builder_classes.html?#datasets.utils.DownloadConfig) para obter mais detalhes.
+✎ Por padrão, 🤗 Datasets descompactará os arquivos necessários para carregar um dataset. Se você quiser preservar espaço no disco rígido, você pode passar `DownloadConfig(delete_extracted=True)` para o argumento `download_config` de `load_dataset()`. Consulte a [documentação](https://huggingface.co/docs/datasets/package_reference/builder_classes#datasets.DownloadConfig) para obter mais detalhes.
diff --git a/chapters/pt/chapter5/5.mdx b/chapters/pt/chapter5/5.mdx
index 9207fa010..23e3831dc 100644
--- a/chapters/pt/chapter5/5.mdx
+++ b/chapters/pt/chapter5/5.mdx
@@ -430,7 +430,7 @@ Legal, nós enviamos nosso conjunto de dados para o Hub e está disponível para
-💡 Você também pode enviar um conjunto de dados para o Hugging Face Hub diretamente do terminal usando `huggingface-cli` e um pouco de magia Git. Consulte o [guia do 🤗 Datasets](https://huggingface.co/docs/datasets/share.html#add-a-community-dataset) para obter detalhes sobre como fazer isso.
+💡 Você também pode enviar um conjunto de dados para o Hugging Face Hub diretamente do terminal usando `huggingface-cli` e um pouco de magia Git. Consulte o [guia do 🤗 Datasets](https://huggingface.co/docs/datasets/share#share-a-dataset-using-the-cli) para obter detalhes sobre como fazer isso.
diff --git a/chapters/pt/chapter5/6.mdx b/chapters/pt/chapter5/6.mdx
index c77091eef..f0a868c57 100644
--- a/chapters/pt/chapter5/6.mdx
+++ b/chapters/pt/chapter5/6.mdx
@@ -188,7 +188,7 @@ Ok, isso nos deu alguns milhares de comentários para trabalhar!
-✏️ **Experimente!** Veja se você pode usar `Dataset.map()` para explodir a coluna `comments` de `issues_dataset` _sem_ recorrer ao uso de Pandas. Isso é um pouco complicado; você pode achar útil para esta tarefa a seção ["Mapeamento em lote"](https://huggingface.co/docs/datasets/v1.12.1/about_map_batch.html?batch-mapping#batch-mapping) da documentação do 🤗 Dataset.
+✏️ **Experimente!** Veja se você pode usar `Dataset.map()` para explodir a coluna `comments` de `issues_dataset` _sem_ recorrer ao uso de Pandas. Isso é um pouco complicado; você pode achar útil para esta tarefa a seção ["Mapeamento em lote"](https://huggingface.co/docs/datasets/about_map_batch#batch-mapping) da documentação do 🤗 Datasets.
diff --git a/chapters/ru/chapter3/2.mdx b/chapters/ru/chapter3/2.mdx
index dc8c83f24..9a79dea25 100644
--- a/chapters/ru/chapter3/2.mdx
+++ b/chapters/ru/chapter3/2.mdx
@@ -235,7 +235,7 @@ tokenized_dataset = tokenizer(
Это хорошо работает, однако есть недостаток, который формирует токенизатор (с ключами, `input_ids`, `attention_mask`, и `token_type_ids`, и значениями в формате списка списков). Это будет работать только если у нас достаточно оперативной памяти (RAM) для хранения целого датасета во время токенизации (в то время как датасеты из библиотеки 🤗 Datasets являются [Apache Arrow](https://arrow.apache.org/) файлами, хранящимися на диске; они будут загружены только в тот момент, когда вы их будете запрашивать).
-Чтобы хранить данные в формате датасета, мы будем использовать методы [`Dataset.map()`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map). Это позволит нам сохранить высокую гибкость даже если нам нужно что-то большее, чем просто токенизация. Метод `map()` работает так: применяет некоторую функцию к каждому элементу датасета, давайте определим функцию, которая токенизирует наши входные данные:
+Чтобы хранить данные в формате датасета, мы будем использовать метод [`Dataset.map()`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.map). Это позволит нам сохранить высокую гибкость, даже если нам нужно что-то большее, чем просто токенизация. Метод `map()` работает так: применяет некоторую функцию к каждому элементу датасета, поэтому давайте определим функцию, которая токенизирует наши входные данные:
```py
def tokenize_function(example):
diff --git a/chapters/ru/chapter5/2.mdx b/chapters/ru/chapter5/2.mdx
index ee50ef207..b4a9c9ab5 100644
--- a/chapters/ru/chapter5/2.mdx
+++ b/chapters/ru/chapter5/2.mdx
@@ -123,7 +123,7 @@ DatasetDict({
-Аргумент `data_files` функции `load_dataset()` очень гибкий и может являться путем к файлу, списком путей файлов или словарем, в котором указаны названия сплитов (обучающего и тестового) и пути к соответствующим файлам. Вы также можете найти все подходящие файлы в директории с использованием маски по правилам Unix-консоли (т.е. указать путь к директории и указать `data_files="*.json"` для конкретного сплита). Более подробно это изложено в [документации](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files) 🤗 Datasets.
+Аргумент `data_files` функции `load_dataset()` очень гибкий и может являться путем к файлу, списком путей файлов или словарем, в котором указаны названия сплитов (обучающего и тестового) и пути к соответствующим файлам. Вы также можете найти все подходящие файлы в директории с использованием маски по правилам Unix-консоли (т.е. указать путь к директории и указать `data_files="*.json"` для конкретного сплита). Более подробно это изложено в [документации](https://huggingface.co/docs/datasets/loading#local-and-remote-files) 🤗 Datasets.
@@ -156,7 +156,7 @@ squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
-✏️ **Попробуйте!** Выберите другой датасет, расположенный на GitHub или в архиве [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) и попробуйте загрузить его с локальной машины и с удаленного сервера. В качестве бонуса попробуйте загрузить датасет в формате CSV или обычного тектового файла (см. детали по поддерживаемым форматам в [документации](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files)).
+✏️ **Попробуйте!** Выберите другой датасет, расположенный на GitHub или в архиве [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) и попробуйте загрузить его с локальной машины и с удаленного сервера. В качестве бонуса попробуйте загрузить датасет в формате CSV или обычного текстового файла (см. детали по поддерживаемым форматам в [документации](https://huggingface.co/docs/datasets/loading#local-and-remote-files)).
diff --git a/chapters/ru/chapter5/3.mdx b/chapters/ru/chapter5/3.mdx
index 378b05e24..654f0cd81 100644
--- a/chapters/ru/chapter5/3.mdx
+++ b/chapters/ru/chapter5/3.mdx
@@ -122,7 +122,7 @@ def filter_nones(x):
lambda
:
```
-где `lambda` - одно из [ключевых](https://docs.python.org/3/reference/lexical_analysis.html#keywords) слов Python, а `` - список или множество разделенных запятой значений, которые пойдут на вход функции, и `` задает операции, которые вы хотите применить к аргументам. Например, мы можем задать простую лямбда-функцию, которая возводит в квадрат числа:
+где `lambda` - одно из [ключевых](https://docs.python.org/3/reference/lexical_analysis#keywords) слов Python, а `` - список или множество разделенных запятой значений, которые пойдут на вход функции, и `` задает операции, которые вы хотите применить к аргументам. Например, мы можем задать простую лямбда-функцию, которая возводит в квадрат числа:
```
lambda x : x * x
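For instance, the `filter_nones` helper named in the hunk header above can be replaced by an inline lambda (a sketch assuming the `condition` column from the drug-review dataset used in this chapter):

```py
# equivalent to defining filter_nones(x) and passing it to Dataset.filter()
drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)
```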
@@ -238,7 +238,7 @@ print(drug_dataset.num_rows)
-✏️ **Попробуйте!** Используйте функцию `Dataset.sort()` для проверки наиболее длинных отзывов. Изучите [документацию](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.sort) чтобы понять, какой аргумент нужно передать в функцию, чтобы сортировка произошла в убывающем порядке.
+✏️ **Попробуйте!** Используйте функцию `Dataset.sort()` для проверки наиболее длинных отзывов. Изучите [документацию](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.sort) чтобы понять, какой аргумент нужно передать в функцию, чтобы сортировка произошла в убывающем порядке.
@@ -385,9 +385,9 @@ tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)
ArrowInvalid: Column 1 named condition expected length 1463 but got length 1000
```
-О, нет! Не сработало! Почему? Посмотрим на ошибку: несовпадение в длинах, один из которых длиной 1463, а другой – 1000. Если вы обратитесь в [документацию](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) `Dataset.map()`, вы можете увидеть, что одно из этих чисел – число объектов, поданных на вход функции, а другое –
+О, нет! Не сработало! Почему? Посмотрим на ошибку: несовпадение в длинах, одна из которых равна 1463, а другая – 1000. Если вы обратитесь в [документацию](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.map) `Dataset.map()`, вы можете увидеть, что одно из этих чисел – число объектов, поданных на вход функции, а другое – число новых признаков, полученных на выходе; здесь 1000 примеров дали 1463 новых признака, что и привело к ошибке формы.
-Oh no! That didn't work! Why not? Looking at the error message will give us a clue: there is a mismatch in the lengths of one of the columns, one being of length 1,463 and the other of length 1,000. If you've looked at the [documentation](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map), you may recall that it's the number of samples passed to the function that we are mapping; here those 1,000 examples gave 1,463 new features, resulting in a shape error.
+Oh no! That didn't work! Why not? Looking at the error message will give us a clue: there is a mismatch in the lengths of one of the columns, one being of length 1,463 and the other of length 1,000. If you've looked at the [documentation](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.map), you may recall that it's the number of samples passed to the function that we are mapping; here those 1,000 examples gave 1,463 new features, resulting in a shape error.
Проблема заключается в том, что мы пытаемся смешать два разных датасета разной размерности: число колонок датасета `drug_dataset` равняется 1000, а нужный нам `tokenized_dataset` имеет 1463 колонки. Чтобы избежать этой ошибки, необходимо удалить несколько столбцов из старого датасета и сделать оба датасета одинакового размера. Мы можем достичь этого с помощью аргумента `remove_columns`:
diff --git a/chapters/ru/chapter5/4.mdx b/chapters/ru/chapter5/4.mdx
index d96ca1b35..324170caa 100644
--- a/chapters/ru/chapter5/4.mdx
+++ b/chapters/ru/chapter5/4.mdx
@@ -45,7 +45,7 @@ Dataset({
-✎ По умолчанию 🤗 Datasets распаковывает файлы, необходимые для загрузки набора данных. Если вы хотите сохранить место на жестком диске, вы можете передать `DownloadConfig(delete_extracted=True)` в аргумент `download_config` функции `load_dataset()`. Дополнительные сведения см. в [документации](https://huggingface.co/docs/datasets/package_reference/builder_classes.html?#datasets.utils.DownloadConfig).
+✎ По умолчанию 🤗 Datasets распаковывает файлы, необходимые для загрузки набора данных. Если вы хотите сохранить место на жестком диске, вы можете передать `DownloadConfig(delete_extracted=True)` в аргумент `download_config` функции `load_dataset()`. Дополнительные сведения см. в [документации](https://huggingface.co/docs/datasets/package_reference/builder_classes#datasets.DownloadConfig).
diff --git a/chapters/ru/chapter5/6.mdx b/chapters/ru/chapter5/6.mdx
index b238ba8a9..70a52449d 100644
--- a/chapters/ru/chapter5/6.mdx
+++ b/chapters/ru/chapter5/6.mdx
@@ -190,7 +190,7 @@ Dataset({
-✏️ **Попробуйте!** Посмотрите, сможете ли вы использовать `Dataset.map()`, чтобы развернуть столбец `comments` столбца `issues_dataset` _без_ использования Pandas. Это немного сложно; вы можете найти раздел ["Batch mapping"](https://huggingface.co/docs/datasets/v1.12.1/about_map_batch.html?batch-mapping#batch-mapping) документации 🤗 Datasets, полезным для этой задачи.
+✏️ **Попробуйте!** Посмотрите, сможете ли вы использовать `Dataset.map()`, чтобы развернуть столбец `comments` столбца `issues_dataset` _без_ использования Pandas. Это немного сложно; вы можете найти раздел ["Batch mapping"](https://huggingface.co/docs/datasets/about_map_batch#batch-mapping) документации 🤗 Datasets, полезным для этой задачи.
diff --git a/chapters/ru/chapter6/8.mdx b/chapters/ru/chapter6/8.mdx
index 191e1d2f4..6403a9ef5 100644
--- a/chapters/ru/chapter6/8.mdx
+++ b/chapters/ru/chapter6/8.mdx
@@ -27,14 +27,14 @@
Точнее, библиотека построена вокруг центрального класса `Tokenizer`, а строительные блоки сгруппированы в подмодули:
-- `normalizers` содержит все возможные типы нормализаторов текста `Normalizer`, которые вы можете использовать (полный список [здесь](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.normalizers)).
-- `pre_tokenizers` содержит все возможные типы предварительных токенизаторов `PreTokenizer`, которые вы можете использовать (полный список [здесь](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.pre_tokenizers)).
-- `models` содержит различные типы моделей `Model`, которые вы можете использовать, такие как `BPE`, `WordPiece` и `Unigram` (полный список [здесь](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.models)).
-- `trainers` содержит все различные типы `Trainer`, которые вы можете использовать для обучения модели на корпусе (по одному на каждый тип модели; полный список [здесь](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.trainers)).
-- `post_processors` содержит различные типы постпроцессоров `PostProcessor`, которые вы можете использовать (полный список [здесь](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.processors)).
-- `decoders` содержит различные типы декодеров `Decoder`, которые вы можете использовать для декодирования результатов токенизации (полный список [здесь](https://huggingface.co/docs/tokenizers/python/latest/components.html#decoders)).
-
-Весь список блоков вы можете найти [здесь](https://huggingface.co/docs/tokenizers/python/latest/components.html).
+- `normalizers` содержит все возможные типы нормализаторов текста `Normalizer`, которые вы можете использовать (полный список [здесь](https://huggingface.co/docs/tokenizers/api/normalizers)).
+- `pre_tokenizers` содержит все возможные типы предварительных токенизаторов `PreTokenizer`, которые вы можете использовать (полный список [здесь](https://huggingface.co/docs/tokenizers/api/pre-tokenizers)).
+- `models` содержит различные типы моделей `Model`, которые вы можете использовать, такие как `BPE`, `WordPiece` и `Unigram` (полный список [здесь](https://huggingface.co/docs/tokenizers/api/models)).
+- `trainers` содержит все различные типы `Trainer`, которые вы можете использовать для обучения модели на корпусе (по одному на каждый тип модели; полный список [здесь](https://huggingface.co/docs/tokenizers/api/trainers)).
+- `post_processors` содержит различные типы постпроцессоров `PostProcessor`, которые вы можете использовать (полный список [здесь](https://huggingface.co/docs/tokenizers/api/post-processors)).
+- `decoders` содержит различные типы декодеров `Decoder`, которые вы можете использовать для декодирования результатов токенизации (полный список [здесь](https://huggingface.co/docs/tokenizers/components#decoders)).
+
+Весь список блоков вы можете найти [здесь](https://huggingface.co/docs/tokenizers/components).
## Получение корпуса текста[[acquiring-a-corpus]]
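To show how these submodules combine, here is a brief sketch of a WordPiece setup similar to the one built later in this chapter (the vocabulary size and special tokens are illustrative choices):

```py
from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers

# the central Tokenizer object wraps a Model and the other building blocks
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.decoder = decoders.WordPiece(prefix="##")

# one trainer type per model type
trainer = trainers.WordPieceTrainer(
    vocab_size=25000, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)
```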
diff --git a/chapters/th/chapter3/2.mdx b/chapters/th/chapter3/2.mdx
index 73f8ec8cf..7e33dd154 100644
--- a/chapters/th/chapter3/2.mdx
+++ b/chapters/th/chapter3/2.mdx
@@ -235,7 +235,7 @@ tokenized_dataset = tokenizer(
ซึ่งการเขียนคำสั่งแบบนี้ก็ได้ผลลัพธ์ที่ถูกต้อง แต่จะมีจุดด้อยคือการทำแบบนี้จะได้ผลลัพธ์ออกมาเป็น dictionary (โดยมี keys ต่าง ๆ คือ `input_ids`, `attention_mask` และ `token_type_ids` และมี values เป็น lists ของ lists) วิธีการนี้จะใช้การได้ก็ต่อเมื่อคอมพิวเตอร์ของคุณมี RAM เพียงพอที่จะเก็บข้อมูลของทั้งชุดข้อมูลในตอนที่ทำการ tokenize ชุดข้อมูลทั้งหมด (ในขณะที่ dataset ที่ได้จากไลบรารี่ 🤗 Datasets จะเป็นไฟล์ [Apache Arrow](https://arrow.apache.org/) ซึ่งจะเก็บข้อมูลไว้ใน disk คุณจึงสามารถโหลดเฉพาะข้อมูลที่ต้องการมาเก็บไว้ใน memory ได้)
-เพื่อให้ข้อมูลของเรายังเป็นข้อมูลชนิด dataset เราจะใช้เมธอด [`Dataset.map()`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) ซึ่งจะช่วยให้เราเลือกได้ว่าเราจะทำการประมวลผลอื่น ๆ นอกเหนือจากการ tokenize หรือไม่ โดยเมธอด `map()` ทำงานโดยการเรียกใช้ฟังก์ชั่นกับแต่ละ element ของชุดข้อมูล เรามาสร้างฟังก์ชั่นสำหรับ tokenize ข้อมูลของเรากันก่อน:
+เพื่อให้ข้อมูลของเรายังเป็นข้อมูลชนิด dataset เราจะใช้เมธอด [`Dataset.map()`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.map) ซึ่งจะช่วยให้เราเลือกได้ว่าเราจะทำการประมวลผลอื่น ๆ นอกเหนือจากการ tokenize หรือไม่ โดยเมธอด `map()` ทำงานโดยการเรียกใช้ฟังก์ชั่นกับแต่ละ element ของชุดข้อมูล เรามาสร้างฟังก์ชั่นสำหรับ tokenize ข้อมูลของเรากันก่อน:
```py
def tokenize_function(example):
diff --git a/chapters/th/chapter6/8.mdx b/chapters/th/chapter6/8.mdx
index 30369b8b3..5d7e90c33 100644
--- a/chapters/th/chapter6/8.mdx
+++ b/chapters/th/chapter6/8.mdx
@@ -29,14 +29,14 @@
library นี้ ประกอบด้วยส่วนหลักคือ `Tokenizer` class ที่มีส่วนประกอบย่อยสำคัญอื่นๆ แบ่งเป็นหลาย submodules
-- `normalizers` ประกอบไปด้วย `Normalizer` หลายประเภทที่คุณสามารถใช้ได้ (รายชื่อทั้งหมดดูได้[ที่นี่](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.normalizers)).
-- `pre_tokenizers` ประกอบไปด้วย `PreTokenizer` หลายประเภทที่คุณสามารถใช้ได้ (รายชื่อทั้งหมดดูได้[ที่นี่](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.pre_tokenizers)).
-- `models` ประกอบไปด้วย `Model` หลายประเภทที่คุณสามารถใช้ได้ เช่น `BPE`, `WordPiece`, และ `Unigram` (รายชื่อทั้งหมดดูได้[ที่นี่](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.models)).
-- `trainers` ประกอบไปด้วย `Trainer` หลายประเภทที่คุณสามารถใช้เพื่อเทรนโมเดลด้วย corpus ที่มีได้ (แต่ละโมเดลจะมีเทรนเนอร์อย่างละตัว; รายชื่อทั้งหมดดูได้[ที่นี่](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.trainers)).
-- `post_processors` ประกอบไปด้วย `PostProcessor` หลายประเภทที่คุณสามารถใช้ได้ (รายชื่อทั้งหมดดูได้[ที่นี่](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.processors))
-- `decoders` ประกอบไปด้วย `Decoder` หลายประเภทที่ใช้เพื่อ decode ผลลัพธ์จากการ tokenization (รายชื่อทั้งหมดดูได้[ที่นี่](https://huggingface.co/docs/tokenizers/python/latest/components.html#decoders))
-
-คุณสามารถดูรายชื่อของส่วนประกอบสำคัญๆต่างๆได้[ที่นี่](https://huggingface.co/docs/tokenizers/python/latest/components.html)
+- `normalizers` ประกอบไปด้วย `Normalizer` หลายประเภทที่คุณสามารถใช้ได้ (รายชื่อทั้งหมดดูได้[ที่นี่](https://huggingface.co/docs/tokenizers/api/normalizers)).
+- `pre_tokenizers` ประกอบไปด้วย `PreTokenizer` หลายประเภทที่คุณสามารถใช้ได้ (รายชื่อทั้งหมดดูได้[ที่นี่](https://huggingface.co/docs/tokenizers/api/pre-tokenizers)).
+- `models` ประกอบไปด้วย `Model` หลายประเภทที่คุณสามารถใช้ได้ เช่น `BPE`, `WordPiece`, และ `Unigram` (รายชื่อทั้งหมดดูได้[ที่นี่](https://huggingface.co/docs/tokenizers/api/models)).
+- `trainers` ประกอบไปด้วย `Trainer` หลายประเภทที่คุณสามารถใช้เพื่อเทรนโมเดลด้วย corpus ที่มีได้ (แต่ละโมเดลจะมีเทรนเนอร์อย่างละตัว; รายชื่อทั้งหมดดูได้[ที่นี่](https://huggingface.co/docs/tokenizers/api/trainers)).
+- `post_processors` ประกอบไปด้วย `PostProcessor` หลายประเภทที่คุณสามารถใช้ได้ (รายชื่อทั้งหมดดูได้[ที่นี่](https://huggingface.co/docs/tokenizers/api/post-processors))
+- `decoders` ประกอบไปด้วย `Decoder` หลายประเภทที่ใช้เพื่อ decode ผลลัพธ์จากการ tokenization (รายชื่อทั้งหมดดูได้[ที่นี่](https://huggingface.co/docs/tokenizers/components#decoders))
+
+คุณสามารถดูรายชื่อของส่วนประกอบสำคัญๆต่างๆได้[ที่นี่](https://huggingface.co/docs/tokenizers/components)
## การโหลด corpus
diff --git a/chapters/vi/chapter3/2.mdx b/chapters/vi/chapter3/2.mdx
index e04c61e6d..3d2b3af2d 100644
--- a/chapters/vi/chapter3/2.mdx
+++ b/chapters/vi/chapter3/2.mdx
@@ -235,7 +235,7 @@ tokenized_dataset = tokenizer(
Điều này hoạt động tốt, nhưng nó có nhược điểm là trả về từ điển (với các khóa của chúng tôi, `input_ids`, `attention_mask` và `token_type_ids`, và các giá trị là danh sách các danh sách). Nó cũng sẽ chỉ hoạt động nếu bạn có đủ RAM để lưu trữ toàn bộ tập dữ liệu của mình trong quá trình tokenize (trong khi các tập dữ liệu từ thư viện 🤗 Datasets là các tệp [Apache Arrow](https://arrow.apache.org/) được lưu trữ trên đĩa, vì vậy bạn chỉ giữ các mẫu bạn yêu cầu đã tải trong bộ nhớ).
-Để giữ dữ liệu dưới dạng tập dữ liệu, chúng ta sẽ sử dụng phương thức [`Dataset.map()`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map). Điều này cũng cho phép chúng ta linh hoạt hơn, nếu chúng ta cần thực hiện nhiều tiền xử lý hơn là chỉ tokenize. Phương thức `map()` hoạt động bằng cách áp dụng một hàm trên mỗi phần tử của tập dữ liệu, vì vậy hãy xác định một hàm tokenize các đầu vào của chúng ta:
+Để giữ dữ liệu dưới dạng tập dữ liệu, chúng ta sẽ sử dụng phương thức [`Dataset.map()`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.map). Điều này cũng cho phép chúng ta linh hoạt hơn, nếu chúng ta cần thực hiện nhiều tiền xử lý hơn là chỉ tokenize. Phương thức `map()` hoạt động bằng cách áp dụng một hàm trên mỗi phần tử của tập dữ liệu, vì vậy hãy xác định một hàm tokenize các đầu vào của chúng ta:
```py
def tokenize_function(example):
diff --git a/chapters/vi/chapter5/2.mdx b/chapters/vi/chapter5/2.mdx
index 25f910110..660739c0a 100644
--- a/chapters/vi/chapter5/2.mdx
+++ b/chapters/vi/chapter5/2.mdx
@@ -128,7 +128,7 @@ DatasetDict({
-Tham số `data_files` của hàm `load_dataset()` khá linh hoạt và có thể là một đường dẫn tệp duy nhất, danh sách các đường dẫn tệp hoặc từ điển ánh xạ các tên tách thành đường dẫn tệp. Bạn cũng có thể tập hợp các tệp phù hợp với một mẫu được chỉ định theo các quy tắc được sử dụng bởi Unix shell (ví dụ: bạn có thể tổng hợp tất cả các tệp JSON trong một thư mục dưới dạng một lần tách duy nhất bằng cách đặt `data_files="*.json"`). Xem [tài liệu](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files) 🤗 Datasets để biết thêm chi tiết.
+Tham số `data_files` của hàm `load_dataset()` khá linh hoạt và có thể là một đường dẫn tệp duy nhất, danh sách các đường dẫn tệp hoặc từ điển ánh xạ các tên tách thành đường dẫn tệp. Bạn cũng có thể tập hợp các tệp phù hợp với một mẫu được chỉ định theo các quy tắc được sử dụng bởi Unix shell (ví dụ: bạn có thể tổng hợp tất cả các tệp JSON trong một thư mục dưới dạng một lần tách duy nhất bằng cách đặt `data_files="*.json"`). Xem [tài liệu](https://huggingface.co/docs/datasets/loading#local-and-remote-files) 🤗 Datasets để biết thêm chi tiết.
@@ -160,6 +160,6 @@ squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
-✏️ **Thử nghiệm thôi!** Chọn một tập dữ liệu khác được lưu trữ trên GitHub hoặc [Kho lưu trữ Học Máy UCI](https://archive.ics.uci.edu/ml/index.php) và thử tải nó cả cục bộ và từ xa bằng cách sử dụng các kỹ thuật đã giới thiệu ở trên. Để có điểm thưởng, hãy thử tải tập dữ liệu được lưu trữ ở định dạng CSV hoặc dạng văn bản (xem [tài liệu](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files) để biết thêm thông tin trên các định dạng này).
+✏️ **Thử nghiệm thôi!** Chọn một tập dữ liệu khác được lưu trữ trên GitHub hoặc [Kho lưu trữ Học Máy UCI](https://archive.ics.uci.edu/ml/index.php) và thử tải nó cả cục bộ và từ xa bằng cách sử dụng các kỹ thuật đã giới thiệu ở trên. Để có điểm thưởng, hãy thử tải tập dữ liệu được lưu trữ ở định dạng CSV hoặc dạng văn bản (xem [tài liệu](https://huggingface.co/docs/datasets/loading#local-and-remote-files) để biết thêm thông tin trên các định dạng này).
diff --git a/chapters/vi/chapter5/3.mdx b/chapters/vi/chapter5/3.mdx
index 70eb79d19..1c0035536 100644
--- a/chapters/vi/chapter5/3.mdx
+++ b/chapters/vi/chapter5/3.mdx
@@ -237,7 +237,7 @@ Như bạn có thể thấy, điều này đã loại bỏ khoảng 15% bài đ
-✏️ **Thử nghiệm thôi!** Sử dụng hàm `Dataset.sort()` để kiểm tra các bài đánh giá có số lượng từ lớn nhất. Tham khảo [tài liệu](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.sort) để biết bạn cần sử dụng tham số nào để sắp xếp các bài đánh giá theo thứ tự giảm dần.
+✏️ **Thử nghiệm thôi!** Sử dụng hàm `Dataset.sort()` để kiểm tra các bài đánh giá có số lượng từ lớn nhất. Tham khảo [tài liệu](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.sort) để biết bạn cần sử dụng tham số nào để sắp xếp các bài đánh giá theo thứ tự giảm dần.
@@ -384,7 +384,7 @@ tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)
ArrowInvalid: Column 1 named condition expected length 1463 but got length 1000
```
-Ôi không! Nó đã không hoạt động! Tại sao không? Nhìn vào thông báo lỗi sẽ cho chúng ta manh mối: có sự không khớp về độ dài của một trong các cột, một cột có độ dài 1,463 và cột còn lại có độ dài 1,000. Nếu bạn đã xem [tài liệu](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) về `Dataset.map()`, bạn có thể nhớ rằng đó là số các mẫu được truyền vào hàm mà chúng ta đang ánh xạ; ở đây 1,000 mẫu đó đã cung cấp 1,463 đặc trưng mới, dẫn đến lỗi hình dạng.
+Ôi không! Nó đã không hoạt động! Tại sao không? Nhìn vào thông báo lỗi sẽ cho chúng ta manh mối: có sự không khớp về độ dài của một trong các cột, một cột có độ dài 1,463 và cột còn lại có độ dài 1,000. Nếu bạn đã xem [tài liệu](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.map) về `Dataset.map()`, bạn có thể nhớ rằng đó là số các mẫu được truyền vào hàm mà chúng ta đang ánh xạ; ở đây 1,000 mẫu đó đã cung cấp 1,463 đặc trưng mới, dẫn đến lỗi hình dạng.
Vấn đề là chúng ta đang cố gắng kết hợp hai tập dữ liệu khác nhau với các kích thước khác nhau: cột `drug_dataset` sẽ có một số mẫu nhất định (lỗi phía chúng ta là 1,000), nhưng `tokenized_dataset` mà chúng ta đang xây dựng sẽ có nhiều hơn (1,463 trong thông báo lỗi). Điều này không hoạt động đối với `Dataset`, vì vậy chúng ta cần xóa các cột khỏi tập dữ liệu cũ hoặc làm cho chúng có cùng kích thước với chúng trong tập dữ liệu mới. Chúng ta có thể thực hiện điều đầu thông qua tham số `remove_columns`:
diff --git a/chapters/vi/chapter5/4.mdx b/chapters/vi/chapter5/4.mdx
index 381224111..c105b0b28 100644
--- a/chapters/vi/chapter5/4.mdx
+++ b/chapters/vi/chapter5/4.mdx
@@ -54,7 +54,7 @@ Chúng ta có thể thấy rằng có 15,518,009 hàng và 2 cột trong tập d
-✎ Theo mặc định, 🤗 Datasets sẽ giải nén các tệp cần thiết để tải tập dữ liệu. Nếu bạn muốn bảo toàn dung lượng ổ cứng, bạn có thể truyền `DownloadConfig(delete_extracted=True)` vào tham số `download_config` của `load_dataset()`. Xem [tài liệu](https://huggingface.co/docs/datasets/package_reference/builder_classes.html?#datasets.utils.DownloadConfig) để biết thêm chi tiết.
+✎ Theo mặc định, 🤗 Datasets sẽ giải nén các tệp cần thiết để tải tập dữ liệu. Nếu bạn muốn bảo toàn dung lượng ổ cứng, bạn có thể truyền `DownloadConfig(delete_extracted=True)` vào tham số `download_config` của `load_dataset()`. Xem [tài liệu](https://huggingface.co/docs/datasets/package_reference/builder_classes#datasets.DownloadConfig) để biết thêm chi tiết.
diff --git a/chapters/vi/chapter5/5.mdx b/chapters/vi/chapter5/5.mdx
index d8faafb4f..9b05d75d6 100644
--- a/chapters/vi/chapter5/5.mdx
+++ b/chapters/vi/chapter5/5.mdx
@@ -435,7 +435,7 @@ Tuyệt vời, chúng ta đã đưa tập dữ liệu của mình vào Hub và n
-💡 Bạn cũng có thể tải tập dữ liệu lên Hugging Face Hub trực tiếp từ thiết bị đầu cuối bằng cách sử dụng `huggingface-cli` và một chút phép thuật từ Git. Tham khảo [hướng dẫn 🤗 Datasets](https://huggingface.co/docs/datasets/share.html#add-a-community-dataset) để biết chi tiết về cách thực hiện việc này.
+💡 Bạn cũng có thể tải tập dữ liệu lên Hugging Face Hub trực tiếp từ thiết bị đầu cuối bằng cách sử dụng `huggingface-cli` và một chút phép thuật từ Git. Tham khảo [hướng dẫn 🤗 Datasets](https://huggingface.co/docs/datasets/share#share-a-dataset-using-the-cli) để biết chi tiết về cách thực hiện việc này.
diff --git a/chapters/vi/chapter5/6.mdx b/chapters/vi/chapter5/6.mdx
index cfaca863b..6c29fff33 100644
--- a/chapters/vi/chapter5/6.mdx
+++ b/chapters/vi/chapter5/6.mdx
@@ -190,7 +190,7 @@ Dataset({
-✏️ **Thử nghiệm thôi!** Cùng xem liệu bạn có thể sử dụng `Dataset.map()` để khám phá cột `comments` của `issues_dataset` _mà không cần_ sử dụng Pandas hay không. Nó sẽ hơi khó khăn một chút; bạn có thể xem phần ["Batch mapping"](https://huggingface.co/docs/datasets/v1.12.1/about_map_batch.html?batch-mapping#batch-mapping) của tài liệu 🤗 Datasets, một tài liệu hữu ích cho tác vụ này.
+✏️ **Thử nghiệm thôi!** Cùng xem liệu bạn có thể sử dụng `Dataset.map()` để khám phá cột `comments` của `issues_dataset` _mà không cần_ sử dụng Pandas hay không. Nó sẽ hơi khó khăn một chút; bạn có thể xem phần ["Batch mapping"](https://huggingface.co/docs/datasets/about_map_batch#batch-mapping) của tài liệu 🤗 Datasets, một tài liệu hữu ích cho tác vụ này.
diff --git a/chapters/vi/chapter6/8.mdx b/chapters/vi/chapter6/8.mdx
index 7ce6fd78e..aa93a0fed 100644
--- a/chapters/vi/chapter6/8.mdx
+++ b/chapters/vi/chapter6/8.mdx
@@ -25,14 +25,14 @@ Thư viện 🤗 Tokenizers đã được xây dựng để cung cấp nhiều s
Chính xác hơn, thư viện được xây dựng tập trung vào lớp `Tokenizer` với các khối được tập hợp lại trong các mô-đun con:
-- `normalizers` chứa tất cả các kiểu `Normalizer` bạn có thể sử dụng (hoàn thiện danh sách tại [đây](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.normalizers)).
-- `pre_tokenizers` chứa tất cả các kiểu `PreTokenizer` bạn có thể sử dụng (hoàn thiện danh sách tại [đây](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.pre_tokenizers)).
-- `models` chứa tất cả các kiểu `Model` bạn có thể sử dụng, như `BPE`, `WordPiece`, and `Unigram` (hoàn thiện danh sách tại [đây](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.models)).
-- `trainers` chứa tất cả các kiểu `Trainer` khác nhau bạn có thể sử dụng để huấn luyện mô hình của bạn trên kho ngữ liệu (một cho mỗi loại mô hình; hoàn thiện danh sách tại [đây](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.trainers)).
-- `post_processors` chứa tất cả các kiểu `PostProcessor` bạn có thể sử dụng (hoàn thiện danh sách tại [đây](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.processors)).
-- `decoders` chứa tất cả các kiểu `Decoder` đa dạng bạn có thể sử dụng để giải mã đầu ra của tokenize (hoàn thiện danh sách tại [đây](https://huggingface.co/docs/tokenizers/python/latest/components.html#decoders)).
-
-Bạn có thể tìm được toàn bộ danh sách các khối tại [đây](https://huggingface.co/docs/tokenizers/python/latest/components.html).
+- `normalizers` chứa tất cả các kiểu `Normalizer` bạn có thể sử dụng (hoàn thiện danh sách tại [đây](https://huggingface.co/docs/tokenizers/api/normalizers)).
+- `pre_tokenizers` chứa tất cả các kiểu `PreTokenizer` bạn có thể sử dụng (hoàn thiện danh sách tại [đây](https://huggingface.co/docs/tokenizers/api/pre-tokenizers)).
+- `models` chứa tất cả các kiểu `Model` bạn có thể sử dụng, như `BPE`, `WordPiece` và `Unigram` (hoàn thiện danh sách tại [đây](https://huggingface.co/docs/tokenizers/api/models)).
+- `trainers` chứa tất cả các kiểu `Trainer` khác nhau bạn có thể sử dụng để huấn luyện mô hình của bạn trên kho ngữ liệu (một cho mỗi loại mô hình; hoàn thiện danh sách tại [đây](https://huggingface.co/docs/tokenizers/api/trainers)).
+- `post_processors` chứa tất cả các kiểu `PostProcessor` bạn có thể sử dụng (hoàn thiện danh sách tại [đây](https://huggingface.co/docs/tokenizers/api/post-processors)).
+- `decoders` chứa tất cả các kiểu `Decoder` đa dạng bạn có thể sử dụng để giải mã đầu ra của tokenize (hoàn thiện danh sách tại [đây](https://huggingface.co/docs/tokenizers/components#decoders)).
+
+Bạn có thể tìm được toàn bộ danh sách các khối tại [đây](https://huggingface.co/docs/tokenizers/components).
## Thu thập một kho ngữ liệu
diff --git a/chapters/zh-CN/chapter3/2.mdx b/chapters/zh-CN/chapter3/2.mdx
index 57e649e65..fbe6d00d3 100644
--- a/chapters/zh-CN/chapter3/2.mdx
+++ b/chapters/zh-CN/chapter3/2.mdx
@@ -235,7 +235,7 @@ tokenized_dataset = tokenizer(
这很有效,但它的缺点是返回字典(字典的键是**输入词id(input_ids)** , **注意力遮罩(attention_mask)** 和 **类型标记ID(token_type_ids)**,字典的值是键所对应值的列表)。而且只有当您在转换过程中有足够的内存来存储整个数据集时才不会出错(而🤗数据集库中的数据集是以[Apache Arrow](https://arrow.apache.org/)文件存储在磁盘上,因此您只需将接下来要用的数据加载在内存中,因此会对内存容量的需求要低一些)。
-为了将数据保存为数据集,我们将使用[Dataset.map()](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map)方法,如果我们需要做更多的预处理而不仅仅是标记化,那么这也给了我们一些额外的自定义的方法。这个方法的工作原理是在数据集的每个元素上应用一个函数,因此让我们定义一个标记输入的函数:
+为了将数据保存为数据集,我们将使用[Dataset.map()](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.map)方法,如果我们需要做更多的预处理而不仅仅是标记化,那么这也给了我们一些额外的自定义的方法。这个方法的工作原理是在数据集的每个元素上应用一个函数,因此让我们定义一个标记输入的函数:
```py
def tokenize_function(example):
diff --git a/chapters/zh-CN/chapter5/2.mdx b/chapters/zh-CN/chapter5/2.mdx
index 95b8691b1..555e35f70 100644
--- a/chapters/zh-CN/chapter5/2.mdx
+++ b/chapters/zh-CN/chapter5/2.mdx
@@ -160,7 +160,7 @@ squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
-✏️ **试试看!** 选择托管在GitHub或[UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)上的另一个数据集并尝试使用上述技术在本地和远程加载它。另外,可以尝试加载CSV或者文本格式存储的数据集(有关这些格式的更多信息,请参阅[文档](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files))。
+✏️ **试试看!** 选择托管在GitHub或[UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)上的另一个数据集并尝试使用上述技术在本地和远程加载它。另外,可以尝试加载CSV或者文本格式存储的数据集(有关这些格式的更多信息,请参阅[文档](https://huggingface.co/docs/datasets/loading#local-and-remote-files))。
diff --git a/chapters/zh-CN/chapter5/4.mdx b/chapters/zh-CN/chapter5/4.mdx
index 70ef713ca..bf34117ff 100644
--- a/chapters/zh-CN/chapter5/4.mdx
+++ b/chapters/zh-CN/chapter5/4.mdx
@@ -46,7 +46,7 @@ Dataset({
-✎ 默认情况下, 🤗 Datasets 会自动解压加载数据集所需的文件。 如果你想保留硬盘空间, 你可以传递 `DownloadConfig(delete_extracted=True)` 到 `download_config` 的 `load_dataset()`参数. 有关更多详细信息, 请参阅文档](https://huggingface.co/docs/datasets/package_reference/builder_classes.html?#datasets.utils.DownloadConfig)。
+✎ 默认情况下, 🤗 Datasets 会自动解压加载数据集所需的文件。 如果你想保留硬盘空间, 你可以传递 `DownloadConfig(delete_extracted=True)` 到 `load_dataset()` 的 `download_config` 参数。有关更多详细信息, 请参阅[文档](https://huggingface.co/docs/datasets/package_reference/builder_classes#datasets.DownloadConfig)。
diff --git a/chapters/zh-CN/chapter5/5.mdx b/chapters/zh-CN/chapter5/5.mdx
index a18ee94f4..f6454f914 100644
--- a/chapters/zh-CN/chapter5/5.mdx
+++ b/chapters/zh-CN/chapter5/5.mdx
@@ -360,7 +360,7 @@ Dataset({
-💡 您还可以使用一些 Git 魔法直接从终端将数据集上传到 Hugging Face Hub。有关如何执行此操作的详细信息,请参阅 [🤗 Datasets guide](https://huggingface.co/docs/datasets/share.html#add-a-community-dataset) 指南。
+💡 您还可以使用一些 Git 魔法直接从终端将数据集上传到 Hugging Face Hub。有关如何执行此操作的详细信息,请参阅 [🤗 Datasets guide](https://huggingface.co/docs/datasets/share#share-a-dataset-using-the-cli) 指南。
diff --git a/chapters/zh-CN/chapter5/6.mdx b/chapters/zh-CN/chapter5/6.mdx
index 8de5fb335..f3a1044e9 100644
--- a/chapters/zh-CN/chapter5/6.mdx
+++ b/chapters/zh-CN/chapter5/6.mdx
@@ -177,7 +177,7 @@ Dataset({
-✏️ **Try it out!** 看看能不能不用pandas就可以完成列的扩充; 这有点棘手; 你可能会发现 🤗 Datasets 文档的 ["Batch mapping"](https://huggingface.co/docs/datasets/v1.12.1/about_map_batch.html?batch-mapping#batch-mapping) 对这个任务很有用。
+✏️ **Try it out!** 看看能不能不用pandas就可以完成列的扩充; 这有点棘手; 你可能会发现 🤗 Datasets 文档的 ["Batch mapping"](https://huggingface.co/docs/datasets/about_map_batch#batch-mapping) 对这个任务很有用。
diff --git a/chapters/zh-CN/chapter6/8.mdx b/chapters/zh-CN/chapter6/8.mdx
index 7e6802bdc..73e07c7fa 100644
--- a/chapters/zh-CN/chapter6/8.mdx
+++ b/chapters/zh-CN/chapter6/8.mdx
@@ -27,14 +27,14 @@
更准确地说,该库是围绕一个中央“Tokenizer”类构建的,构建这个类的每一部分可以在子模块的列表中重新组合:
-- `normalizers` 包含你可以使用的所有可能的Normalizer类型(完整列表[在这里](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.normalizers))。
-- `pre_tokenizesr` 包含您可以使用的所有可能的PreTokenizer类型(完整列表[在这里](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.pre_tokenizers))。
-- `models` 包含您可以使用的各种类型的Model,如BPE、WordPiece和Unigram(完整列表[在这里](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.models))。
-- `trainers` 包含所有不同类型的 trainer,你可以使用一个语料库训练你的模型(每种模型一个;完整列表[在这里](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.trainers))。
-- `post_processors` 包含你可以使用的各种类型的PostProcessor(完整列表[在这里](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.processors))。
-- `decoders` 包含各种类型的Decoder,可以用来解码标记化的输出(完整列表[在这里](https://huggingface.co/docs/tokenizers/python/latest/components.html#decoders))。
-
-您可以[在这里](https://huggingface.co/docs/tokenizers/python/latest/components.html)找到完整的模块列表。
+- `normalizers` 包含你可以使用的所有可能的Normalizer类型(完整列表[在这里](https://huggingface.co/docs/tokenizers/api/normalizers))。
+- `pre_tokenizers` 包含您可以使用的所有可能的PreTokenizer类型(完整列表[在这里](https://huggingface.co/docs/tokenizers/api/pre-tokenizers))。
+- `models` 包含您可以使用的各种类型的Model,如BPE、WordPiece和Unigram(完整列表[在这里](https://huggingface.co/docs/tokenizers/api/models))。
+- `trainers` 包含所有不同类型的 trainer,你可以使用一个语料库训练你的模型(每种模型一个;完整列表[在这里](https://huggingface.co/docs/tokenizers/api/trainers))。
+- `post_processors` 包含你可以使用的各种类型的PostProcessor(完整列表[在这里](https://huggingface.co/docs/tokenizers/api/post-processors))。
+- `decoders` 包含各种类型的Decoder,可以用来解码标记化的输出(完整列表[在这里](https://huggingface.co/docs/tokenizers/components#decoders))。
+
+您可以[在这里](https://huggingface.co/docs/tokenizers/components)找到完整的模块列表。
## 获取语料库 [[获取语料库]]
diff --git a/chapters/zh-TW/chapter3/2.mdx b/chapters/zh-TW/chapter3/2.mdx
index 4970fb9eb..5dd10b0a0 100644
--- a/chapters/zh-TW/chapter3/2.mdx
+++ b/chapters/zh-TW/chapter3/2.mdx
@@ -235,7 +235,7 @@ tokenized_dataset = tokenizer(
這很有效,但它的缺點是返回字典(字典的鍵是**輸入詞id(input_ids)** , **注意力遮罩(attention_mask)** 和 **類型標記ID(token_type_ids)**,字典的值是鍵所對應值的列表)。而且只有當您在轉換過程中有足夠的內存來存儲整個數據集時才不會出錯(而🤗數據集庫中的數據集是以[Apache Arrow](https://arrow.apache.org/)文件存儲在磁盤上,因此您只需將接下來要用的數據加載在內存中,因此會對內存容量的需求要低一些)。
-為了將數據保存為數據集,我們將使用[Dataset.map()](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map)方法,如果我們需要做更多的預處理而不僅僅是標記化,那麼這也給了我們一些額外的自定義的方法。這個方法的工作原理是在數據集的每個元素上應用一個函數,因此讓我們定義一個標記輸入的函數:
+為了將數據保存為數據集,我們將使用[Dataset.map()](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.map)方法,如果我們需要做更多的預處理而不僅僅是標記化,那麼這也給了我們一些額外的自定義的方法。這個方法的工作原理是在數據集的每個元素上應用一個函數,因此讓我們定義一個標記輸入的函數:
```py
def tokenize_function(example):
diff --git a/chapters/zh-TW/chapter5/2.mdx b/chapters/zh-TW/chapter5/2.mdx
index f47239575..90cbfdc43 100644
--- a/chapters/zh-TW/chapter5/2.mdx
+++ b/chapters/zh-TW/chapter5/2.mdx
@@ -160,7 +160,7 @@ squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
-✏️ **試試看!** 選擇託管在GitHub或[UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)上的另一個數據集並嘗試使用上述技術在本地和遠程加載它。另外,可以嘗試加載CSV或者文本格式存儲的數據集(有關這些格式的更多信息,請參閱[文檔](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files))。
+✏️ **試試看!** 選擇託管在GitHub或[UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)上的另一個數據集並嘗試使用上述技術在本地和遠程加載它。另外,可以嘗試加載CSV或者文本格式存儲的數據集(有關這些格式的更多信息,請參閱[文檔](https://huggingface.co/docs/datasets/loading#local-and-remote-files))。
diff --git a/chapters/zh-TW/chapter5/4.mdx b/chapters/zh-TW/chapter5/4.mdx
index 4cd7ae730..96e11bb8c 100644
--- a/chapters/zh-TW/chapter5/4.mdx
+++ b/chapters/zh-TW/chapter5/4.mdx
@@ -46,7 +46,7 @@ Dataset({
-✎ 默認情況下, 🤗 Datasets 會自動解壓加載數據集所需的文件。 如果你想保留硬盤空間, 你可以傳遞 `DownloadConfig(delete_extracted=True)` 到 `download_config` 的 `load_dataset()`參數. 有關更多詳細信息, 請參閱文檔](https://huggingface.co/docs/datasets/package_reference/builder_classes.html?#datasets.utils.DownloadConfig)。
+✎ 默認情況下, 🤗 Datasets 會自動解壓加載數據集所需的文件。 如果你想保留硬盤空間, 你可以傳遞 `DownloadConfig(delete_extracted=True)` 到 `load_dataset()` 的 `download_config` 參數。有關更多詳細信息, 請參閱[文檔](https://huggingface.co/docs/datasets/package_reference/builder_classes#datasets.DownloadConfig)。
diff --git a/chapters/zh-TW/chapter5/5.mdx b/chapters/zh-TW/chapter5/5.mdx
index f65b1c0e2..6f9953910 100644
--- a/chapters/zh-TW/chapter5/5.mdx
+++ b/chapters/zh-TW/chapter5/5.mdx
@@ -423,7 +423,7 @@ Dataset({
-💡 您還可以使用一些 Git 魔法直接從終端將數據集上傳到 Hugging Face Hub。有關如何執行此操作的詳細信息,請參閱 [🤗 Datasets guide](https://huggingface.co/docs/datasets/share.html#add-a-community-dataset) 指南。
+💡 您還可以使用一些 Git 魔法直接從終端將數據集上傳到 Hugging Face Hub。有關如何執行此操作的詳細信息,請參閱 [🤗 Datasets guide](https://huggingface.co/docs/datasets/share#share-a-dataset-using-the-cli) 指南。
diff --git a/chapters/zh-TW/chapter5/6.mdx b/chapters/zh-TW/chapter5/6.mdx
index b5646f739..24bb8e9dd 100644
--- a/chapters/zh-TW/chapter5/6.mdx
+++ b/chapters/zh-TW/chapter5/6.mdx
@@ -188,7 +188,7 @@ Dataset({
-✏️ **Try it out!** 看看能不能不用pandas就可以完成列的擴充; 這有點棘手; 你可能會發現 🤗 Datasets 文檔的 ["Batch mapping"](https://huggingface.co/docs/datasets/v1.12.1/about_map_batch.html?batch-mapping#batch-mapping) 對這個任務很有用。
+✏️ **Try it out!** 看看能不能不用pandas就可以完成列的擴充; 這有點棘手; 你可能會發現 🤗 Datasets 文檔的 ["Batch mapping"](https://huggingface.co/docs/datasets/about_map_batch#batch-mapping) 對這個任務很有用。
diff --git a/chapters/zh-TW/chapter6/8.mdx b/chapters/zh-TW/chapter6/8.mdx
index 9a31a2bf1..c61e14d7f 100644
--- a/chapters/zh-TW/chapter6/8.mdx
+++ b/chapters/zh-TW/chapter6/8.mdx
@@ -27,14 +27,14 @@
更準確地說,該庫是圍繞一個中央「Tokenizer」類構建的,構建這個類的每一部分可以在子模塊的列表中重新組合:
-- `normalizers` 包含你可以使用的所有可能的Normalizer類型(完整列表[在這裡](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.normalizers))。
-- `pre_tokenizesr` 包含您可以使用的所有可能的PreTokenizer類型(完整列表[在這裡](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.pre_tokenizers))。
-- `models` 包含您可以使用的各種類型的Model,如BPE、WordPiece和Unigram(完整列表[在這裡](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.models))。
-- `trainers` 包含所有不同類型的 trainer,你可以使用一個語料庫訓練你的模型(每種模型一個;完整列表[在這裡](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.trainers))。
-- `post_processors` 包含你可以使用的各種類型的PostProcessor(完整列表[在這裡](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.processors))。
-- `decoders` 包含各種類型的Decoder,可以用來解碼標記化的輸出(完整列表[在這裡](https://huggingface.co/docs/tokenizers/python/latest/components.html#decoders))。
-
-您可以[在這裡](https://huggingface.co/docs/tokenizers/python/latest/components.html)找到完整的模塊列表。
+- `normalizers` 包含你可以使用的所有可能的Normalizer類型(完整列表[在這裡](https://huggingface.co/docs/tokenizers/api/normalizers))。
+- `pre_tokenizers` 包含您可以使用的所有可能的PreTokenizer類型(完整列表[在這裡](https://huggingface.co/docs/tokenizers/api/pre-tokenizers))。
+- `models` 包含您可以使用的各種類型的Model,如BPE、WordPiece和Unigram(完整列表[在這裡](https://huggingface.co/docs/tokenizers/api/models))。
+- `trainers` 包含所有不同類型的 trainer,你可以使用一個語料庫訓練你的模型(每種模型一個;完整列表[在這裡](https://huggingface.co/docs/tokenizers/api/trainers))。
+- `post_processors` 包含你可以使用的各種類型的PostProcessor(完整列表[在這裡](https://huggingface.co/docs/tokenizers/api/post-processors))。
+- `decoders` 包含各種類型的Decoder,可以用來解碼標記化的輸出(完整列表[在這裡](https://huggingface.co/docs/tokenizers/components#decoders))。
+
+您可以[在這裡](https://huggingface.co/docs/tokenizers/components)找到完整的模塊列表。
## 獲取語料庫