Fix links to HF docs #660

Merged · 1 commit · Jan 17, 2024
chapters/de/chapter1/3.mdx (1 addition & 1 deletion)

@@ -68,7 +68,7 @@ Wenn du einen Text an eine Pipeline übergibst, gibt es drei wichtige Schritte:
3. Die Vorhersagen des Modells werden so nachverarbeitet, sodass du sie nutzen kannst.


-Einige der derzeit [verfügbaren Pipelines](https://huggingface.co/transformers/main_classes/pipelines.html) sind:
+Einige der derzeit [verfügbaren Pipelines](https://huggingface.co/transformers/main_classes/pipelines) sind:

- `feature-extraction` (Vektordarstellung eines Textes erhalten)
- `fill-mask`
chapters/de/chapter1/5.mdx (5 additions & 5 deletions)

@@ -15,8 +15,8 @@ Rein Encoder-basierte Modelle eignen sich am besten für Aufgaben, die ein Verst

Zu dieser Modellfamilie gehören unter anderem:

-- [ALBERT](https://huggingface.co/transformers/model_doc/albert.html)
-- [BERT](https://huggingface.co/transformers/model_doc/bert.html)
-- [DistilBERT](https://huggingface.co/transformers/model_doc/distilbert.html)
-- [ELECTRA](https://huggingface.co/transformers/model_doc/electra.html)
-- [RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html)
+- [ALBERT](https://huggingface.co/transformers/model_doc/albert)
+- [BERT](https://huggingface.co/transformers/model_doc/bert)
+- [DistilBERT](https://huggingface.co/transformers/model_doc/distilbert)
+- [ELECTRA](https://huggingface.co/transformers/model_doc/electra)
+- [RoBERTa](https://huggingface.co/transformers/model_doc/roberta)
chapters/de/chapter1/6.mdx (3 additions & 3 deletions)

@@ -15,7 +15,7 @@ Diese Modelle sind am besten für Aufgaben geeignet, bei denen es um die Generie

Zu dieser Modellfamilie gehören unter anderem:

-- [CTRL](https://huggingface.co/transformers/model_doc/ctrl.html)
+- [CTRL](https://huggingface.co/transformers/model_doc/ctrl)
 - [GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)
-- [GPT-2](https://huggingface.co/transformers/model_doc/gpt2.html)
-- [Transformer XL](https://huggingface.co/transformers/model_doc/transformerxl.html)
+- [GPT-2](https://huggingface.co/transformers/model_doc/gpt2)
+- [Transformer XL](https://huggingface.co/transformers/model_doc/transformerxl)
chapters/de/chapter1/7.mdx (4 additions & 4 deletions)

@@ -15,7 +15,7 @@ Sequence-to-Sequence-Modelle eignen sich am besten für Aufgaben, bei denen es d

Vertreter dieser Modellfamilie sind u. a.:

-- [BART](https://huggingface.co/transformers/model_doc/bart.html)
-- [mBART](https://huggingface.co/transformers/model_doc/mbart.html)
-- [Marian](https://huggingface.co/transformers/model_doc/marian.html)
-- [T5](https://huggingface.co/transformers/model_doc/t5.html)
+- [BART](https://huggingface.co/transformers/model_doc/bart)
+- [mBART](https://huggingface.co/transformers/model_doc/mbart)
+- [Marian](https://huggingface.co/transformers/model_doc/marian)
+- [T5](https://huggingface.co/transformers/model_doc/t5)
chapters/de/chapter3/2.mdx (1 addition & 1 deletion)

@@ -235,7 +235,7 @@ tokenized_dataset = tokenizer(

Das funktioniert gut, hat aber den Nachteil, dass ein Dictionary zurückgegeben wird (mit unseren Schlüsselwörtern `input_ids`, `attention_mask` und `token_type_ids` und Werten aus Listen von Listen). Es funktioniert auch nur, wenn du genügend RAM hast, um den gesamten Datensatz während der Tokenisierung zu im RAM zwischen zu speichern (während die Datensätze aus der Bibliothek 🤗 Datasets [Apache Arrow](https://arrow.apache.org/) Dateien sind, die auf der Festplatte gespeichert sind, sodass nur die gewünschten Samples im RAM geladen sind).

-Um die Daten als Datensatz zu speichern, verwenden wir die Methode [`Dataset.map()`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map). Dies gewährt uns zusätzliche Flexibilität, wenn wir zusätzliche Vorverarbeitung als nur die Tokenisierung benötigen. Die `map()`-Methode funktioniert, indem sie eine Funktion auf jedes Element des Datensatzes anwendet, also definieren wir eine Funktion, die unsere Inputs tokenisiert:
+Um die Daten als Datensatz zu speichern, verwenden wir die Methode [`Dataset.map()`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.map). Dies gewährt uns zusätzliche Flexibilität, wenn wir zusätzliche Vorverarbeitung als nur die Tokenisierung benötigen. Die `map()`-Methode funktioniert, indem sie eine Funktion auf jedes Element des Datensatzes anwendet, also definieren wir eine Funktion, die unsere Inputs tokenisiert:

```py
def tokenize_function(example):
```
chapters/de/chapter4/2.mdx (2 additions & 2 deletions)

@@ -65,7 +65,7 @@ tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
```py
model = CamembertForMaskedLM.from_pretrained("camembert-base")
```

-Dennoch empfehlen wir, dass man die [`Auto*` classes](https://huggingface.co/transformers/model_doc/auto.html?highlight=auto#auto-classes) stattdessen benutzt, da diese architekturunabhängig sind. Das vorherige Code-Beispiel gilt nur für Checkpoints, die in die CamemBERT Architektur zu laden sind, aber mit den `Auto*` Klassen kann man Checkpoints ziemlich einfach tauschen:
+Dennoch empfehlen wir, dass man die [`Auto*` classes](https://huggingface.co/transformers/model_doc/auto?highlight=auto#auto-classes) stattdessen benutzt, da diese architekturunabhängig sind. Das vorherige Code-Beispiel gilt nur für Checkpoints, die in die CamemBERT Architektur zu laden sind, aber mit den `Auto*` Klassen kann man Checkpoints ziemlich einfach tauschen:

```py
from transformers import AutoTokenizer, AutoModelForMaskedLM
```

@@ -81,7 +81,7 @@ tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
```py
model = TFCamembertForMaskedLM.from_pretrained("camembert-base")
```

-Hier empfehlen wir auch, dass man stattdessen die [`TFAuto*` classes](https://huggingface.co/transformers/model_doc/auto.html?highlight=auto#auto-classes) benutzt, da diese architekturunabhängig sind. Das vorherige Code-Beispiel gilt nur für Checkpoints, die in die CamemBERT Architektur zu laden sind, aber mit den `TFAuto*` Klassen kann man Checkpoints einfach tauschen:
+Hier empfehlen wir auch, dass man stattdessen die [`TFAuto*` classes](https://huggingface.co/transformers/model_doc/auto?highlight=auto#auto-classes) benutzt, da diese architekturunabhängig sind. Das vorherige Code-Beispiel gilt nur für Checkpoints, die in die CamemBERT Architektur zu laden sind, aber mit den `TFAuto*` Klassen kann man Checkpoints einfach tauschen:

```py
from transformers import AutoTokenizer, TFAutoModelForMaskedLM
```
chapters/en/chapter1/3.mdx (1 addition & 1 deletion)

@@ -68,7 +68,7 @@ There are three main steps involved when you pass some text to a pipeline:
3. The predictions of the model are post-processed, so you can make sense of them.


-Some of the currently [available pipelines](https://huggingface.co/transformers/main_classes/pipelines.html) are:
+Some of the currently [available pipelines](https://huggingface.co/transformers/main_classes/pipelines) are:

- `feature-extraction` (get the vector representation of a text)
- `fill-mask`
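
For reference, a minimal sketch of the pipeline usage this hunk's context describes; the task, example input, and output values are illustrative, not taken from the chapter:

```py
from transformers import pipeline

# pipeline() bundles the three steps above: preprocessing the text,
# running it through the model, and post-processing the predictions.
classifier = pipeline("sentiment-analysis")
print(classifier("This course about Transformers is great!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```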
chapters/en/chapter1/6.mdx (3 additions & 3 deletions)

@@ -15,7 +15,7 @@ These models are best suited for tasks involving text generation.

Representatives of this family of models include:

-- [CTRL](https://huggingface.co/transformers/model_doc/ctrl.html)
+- [CTRL](https://huggingface.co/transformers/model_doc/ctrl)
 - [GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)
-- [GPT-2](https://huggingface.co/transformers/model_doc/gpt2.html)
-- [Transformer XL](https://huggingface.co/transformers/model_doc/transfo-xl.html)
+- [GPT-2](https://huggingface.co/transformers/model_doc/gpt2)
+- [Transformer XL](https://huggingface.co/transformers/model_doc/transfo-xl)
chapters/en/chapter1/7.mdx (4 additions & 4 deletions)

@@ -15,7 +15,7 @@ Sequence-to-sequence models are best suited for tasks revolving around generatin

Representatives of this family of models include:

-- [BART](https://huggingface.co/transformers/model_doc/bart.html)
-- [mBART](https://huggingface.co/transformers/model_doc/mbart.html)
-- [Marian](https://huggingface.co/transformers/model_doc/marian.html)
-- [T5](https://huggingface.co/transformers/model_doc/t5.html)
+- [BART](https://huggingface.co/transformers/model_doc/bart)
+- [mBART](https://huggingface.co/transformers/model_doc/mbart)
+- [Marian](https://huggingface.co/transformers/model_doc/marian)
+- [T5](https://huggingface.co/transformers/model_doc/t5)
chapters/en/chapter3/2.mdx (1 addition & 1 deletion)

@@ -235,7 +235,7 @@ tokenized_dataset = tokenizer(

This works well, but it has the disadvantage of returning a dictionary (with our keys, `input_ids`, `attention_mask`, and `token_type_ids`, and values that are lists of lists). It will also only work if you have enough RAM to store your whole dataset during the tokenization (whereas the datasets from the 🤗 Datasets library are [Apache Arrow](https://arrow.apache.org/) files stored on the disk, so you only keep the samples you ask for loaded in memory).

-To keep the data as a dataset, we will use the [`Dataset.map()`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) method. This also allows us some extra flexibility, if we need more preprocessing done than just tokenization. The `map()` method works by applying a function on each element of the dataset, so let's define a function that tokenizes our inputs:
+To keep the data as a dataset, we will use the [`Dataset.map()`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.map) method. This also allows us some extra flexibility, if we need more preprocessing done than just tokenization. The `map()` method works by applying a function on each element of the dataset, so let's define a function that tokenizes our inputs:

```py
def tokenize_function(example):
```
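
For reference, a minimal sketch of the `Dataset.map()` pattern this hunk links to, assuming the GLUE MRPC dataset and `bert-base-uncased` checkpoint used elsewhere in that chapter:

```py
from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("glue", "mrpc")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


def tokenize_function(example):
    # Tokenize both sentences; padding is deferred to a data collator later.
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


# batched=True passes batches of examples to the function and keeps the dataset on disk.
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
```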
chapters/en/chapter4/2.mdx (2 additions & 2 deletions)

@@ -65,7 +65,7 @@ tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
```py
model = CamembertForMaskedLM.from_pretrained("camembert-base")
```

-However, we recommend using the [`Auto*` classes](https://huggingface.co/transformers/model_doc/auto.html?highlight=auto#auto-classes) instead, as these are by design architecture-agnostic. While the previous code sample limits users to checkpoints loadable in the CamemBERT architecture, using the `Auto*` classes makes switching checkpoints simple:
+However, we recommend using the [`Auto*` classes](https://huggingface.co/transformers/model_doc/auto?highlight=auto#auto-classes) instead, as these are by design architecture-agnostic. While the previous code sample limits users to checkpoints loadable in the CamemBERT architecture, using the `Auto*` classes makes switching checkpoints simple:

```py
from transformers import AutoTokenizer, AutoModelForMaskedLM
```

@@ -81,7 +81,7 @@ tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
```py
model = TFCamembertForMaskedLM.from_pretrained("camembert-base")
```

-However, we recommend using the [`TFAuto*` classes](https://huggingface.co/transformers/model_doc/auto.html?highlight=auto#auto-classes) instead, as these are by design architecture-agnostic. While the previous code sample limits users to checkpoints loadable in the CamemBERT architecture, using the `TFAuto*` classes makes switching checkpoints simple:
+However, we recommend using the [`TFAuto*` classes](https://huggingface.co/transformers/model_doc/auto?highlight=auto#auto-classes) instead, as these are by design architecture-agnostic. While the previous code sample limits users to checkpoints loadable in the CamemBERT architecture, using the `TFAuto*` classes makes switching checkpoints simple:

```py
from transformers import AutoTokenizer, TFAutoModelForMaskedLM
```
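
As a quick illustration of the architecture-agnostic `Auto*`/`TFAuto*` usage both hunks describe (the checkpoint name is only an example and can be swapped for any compatible one on the Hub):

```py
from transformers import AutoTokenizer, AutoModelForMaskedLM

checkpoint = "camembert-base"  # swap for any masked-LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)
# TFAutoModelForMaskedLM.from_pretrained(checkpoint) is the TensorFlow counterpart.
```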
chapters/en/chapter4/3.mdx (1 addition & 1 deletion)

@@ -178,7 +178,7 @@ Click on the "Files and versions" tab, and you should see the files visible in t

</Tip>

-As you've seen, the `push_to_hub()` method accepts several arguments, making it possible to upload to a specific repository or organization namespace, or to use a different API token. We recommend you take a look at the method specification available directly in the [🤗 Transformers documentation](https://huggingface.co/transformers/model_sharing.html) to get an idea of what is possible.
+As you've seen, the `push_to_hub()` method accepts several arguments, making it possible to upload to a specific repository or organization namespace, or to use a different API token. We recommend you take a look at the method specification available directly in the [🤗 Transformers documentation](https://huggingface.co/transformers/model_sharing) to get an idea of what is possible.

The `push_to_hub()` method is backed by the [`huggingface_hub`](https://github.com/huggingface/huggingface_hub) Python package, which offers a direct API to the Hugging Face Hub. It's integrated within 🤗 Transformers and several other machine learning libraries, like [`allenlp`](https://github.com/allenai/allennlp). Although we focus on the 🤗 Transformers integration in this chapter, integrating it into your own code or library is simple.

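
A minimal sketch of the `push_to_hub()` call this paragraph refers to; the repository name is a placeholder, and the repository/organization/token arguments are covered in the linked documentation:

```py
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("camembert-base")
tokenizer = AutoTokenizer.from_pretrained("camembert-base")

# Requires being logged in first, e.g. via `huggingface-cli login`.
model.push_to_hub("dummy-model")
tokenizer.push_to_hub("dummy-model")
```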
chapters/en/chapter5/2.mdx (2 additions & 2 deletions)

@@ -128,7 +128,7 @@ This is exactly what we wanted. Now, we can apply various preprocessing techniqu

<Tip>

-The `data_files` argument of the `load_dataset()` function is quite flexible and can be either a single file path, a list of file paths, or a dictionary that maps split names to file paths. You can also glob files that match a specified pattern according to the rules used by the Unix shell (e.g., you can glob all the JSON files in a directory as a single split by setting `data_files="*.json"`). See the 🤗 Datasets [documentation](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files) for more details.
+The `data_files` argument of the `load_dataset()` function is quite flexible and can be either a single file path, a list of file paths, or a dictionary that maps split names to file paths. You can also glob files that match a specified pattern according to the rules used by the Unix shell (e.g., you can glob all the JSON files in a directory as a single split by setting `data_files="*.json"`). See the 🤗 Datasets [documentation](https://huggingface.co/docs/datasets/loading#local-and-remote-files) for more details.

</Tip>
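
A short sketch of the `data_files` forms mentioned in the tip; the loaders and file names are placeholders:

```py
from datasets import load_dataset

# A single file...
ds = load_dataset("csv", data_files="train.csv")
# ...a dictionary mapping split names to files...
ds = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
# ...or a Unix-style glob that loads every matching file as one split.
ds = load_dataset("json", data_files="*.json")
```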

@@ -160,7 +160,7 @@ This returns the same `DatasetDict` object obtained above, but saves us the step

<Tip>

-✏️ **Try it out!** Pick another dataset hosted on GitHub or the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) and try loading it both locally and remotely using the techniques introduced above. For bonus points, try loading a dataset that’s stored in a CSV or text format (see the [documentation](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files) for more information on these formats).
+✏️ **Try it out!** Pick another dataset hosted on GitHub or the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) and try loading it both locally and remotely using the techniques introduced above. For bonus points, try loading a dataset that’s stored in a CSV or text format (see the [documentation](https://huggingface.co/docs/datasets/loading#local-and-remote-files) for more information on these formats).

</Tip>

chapters/en/chapter5/3.mdx (2 additions & 2 deletions)

@@ -238,7 +238,7 @@ As you can see, this has removed around 15% of the reviews from our original tra

<Tip>

-✏️ **Try it out!** Use the `Dataset.sort()` function to inspect the reviews with the largest numbers of words. See the [documentation](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.sort) to see which argument you need to use sort the reviews by length in descending order.
+✏️ **Try it out!** Use the `Dataset.sort()` function to inspect the reviews with the largest numbers of words. See the [documentation](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.sort) to see which argument you need to use sort the reviews by length in descending order.

</Tip>
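
One possible answer to the exercise, shown on a tiny stand-in dataset (the `review_length` column mirrors the one computed earlier in that chapter):

```py
from datasets import Dataset

ds = Dataset.from_dict(
    {"review": ["Short.", "A much longer review with many words."], "review_length": [1, 7]}
)
# reverse=True sorts in descending order, so the longest reviews come first.
print(ds.sort("review_length", reverse=True)["review_length"])  # [7, 1]
```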

@@ -385,7 +385,7 @@ tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)
```
ArrowInvalid: Column 1 named condition expected length 1463 but got length 1000
```

-Oh no! That didn't work! Why not? Looking at the error message will give us a clue: there is a mismatch in the lengths of one of the columns, one being of length 1,463 and the other of length 1,000. If you've looked at the `Dataset.map()` [documentation](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map), you may recall that it's the number of samples passed to the function that we are mapping; here those 1,000 examples gave 1,463 new features, resulting in a shape error.
+Oh no! That didn't work! Why not? Looking at the error message will give us a clue: there is a mismatch in the lengths of one of the columns, one being of length 1,463 and the other of length 1,000. If you've looked at the `Dataset.map()` [documentation](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.map), you may recall that it's the number of samples passed to the function that we are mapping; here those 1,000 examples gave 1,463 new features, resulting in a shape error.

The problem is that we're trying to mix two different datasets of different sizes: the `drug_dataset` columns will have a certain number of examples (the 1,000 in our error), but the `tokenized_dataset` we are building will have more (the 1,463 in the error message; it is more than 1,000 because we are tokenizing long reviews into more than one example by using `return_overflowing_tokens=True`). That doesn't work for a `Dataset`, so we need to either remove the columns from the old dataset or make them the same size as they are in the new dataset. We can do the former with the `remove_columns` argument:

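
The code that follows this paragraph is cut off in the diff view; a self-contained sketch of the fix it describes, using a tiny stand-in for the chapter's `drug_dataset`:

```py
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Stand-in data: one long review that overflows max_length, one short review.
drug_dataset = Dataset.from_dict(
    {"review": ["It worked well for me. " * 40, "No side effects at all."], "condition": ["acne", "anxiety"]}
)


def tokenize_and_split(examples):
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )


# Dropping the old columns lets the longer tokenized output replace them cleanly.
tokenized_dataset = drug_dataset.map(
    tokenize_and_split, batched=True, remove_columns=drug_dataset.column_names
)
```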
chapters/en/chapter5/4.mdx (1 addition & 1 deletion)

@@ -46,7 +46,7 @@ We can see that there are 15,518,009 rows and 2 columns in our dataset -- that's

<Tip>

-✎ By default, 🤗 Datasets will decompress the files needed to load a dataset. If you want to preserve hard drive space, you can pass `DownloadConfig(delete_extracted=True)` to the `download_config` argument of `load_dataset()`. See the [documentation](https://huggingface.co/docs/datasets/package_reference/builder_classes.html?#datasets.utils.DownloadConfig) for more details.
+✎ By default, 🤗 Datasets will decompress the files needed to load a dataset. If you want to preserve hard drive space, you can pass `DownloadConfig(delete_extracted=True)` to the `download_config` argument of `load_dataset()`. See the [documentation](https://huggingface.co/docs/datasets/package_reference/builder_classes#datasets.DownloadConfig) for more details.

</Tip>
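
A sketch of the tip above; the loader and file name are placeholders:

```py
from datasets import DownloadConfig, load_dataset

# delete_extracted=True removes the decompressed copies once the dataset is built.
dataset = load_dataset(
    "json",
    data_files="data.jsonl.gz",
    download_config=DownloadConfig(delete_extracted=True),
)
```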

chapters/en/chapter5/5.mdx (1 addition & 1 deletion)

@@ -365,7 +365,7 @@ Cool, we've pushed our dataset to the Hub and it's available for others to use!

<Tip>

-💡 You can also upload a dataset to the Hugging Face Hub directly from the terminal by using `huggingface-cli` and a bit of Git magic. See the [🤗 Datasets guide](https://huggingface.co/docs/datasets/share.html#add-a-community-dataset) for details on how to do this.
+💡 You can also upload a dataset to the Hugging Face Hub directly from the terminal by using `huggingface-cli` and a bit of Git magic. See the [🤗 Datasets guide](https://huggingface.co/docs/datasets/share#share-a-dataset-using-the-cli) for details on how to do this.

</Tip>

chapters/en/chapter5/6.mdx (1 addition & 1 deletion)

@@ -178,7 +178,7 @@ Okay, this has given us a few thousand comments to work with!

<Tip>

-✏️ **Try it out!** See if you can use `Dataset.map()` to explode the `comments` column of `issues_dataset` _without_ resorting to the use of Pandas. This is a little tricky; you might find the ["Batch mapping"](https://huggingface.co/docs/datasets/v1.12.1/about_map_batch.html?batch-mapping#batch-mapping) section of the 🤗 Datasets documentation useful for this task.
+✏️ **Try it out!** See if you can use `Dataset.map()` to explode the `comments` column of `issues_dataset` _without_ resorting to the use of Pandas. This is a little tricky; you might find the ["Batch mapping"](https://huggingface.co/docs/datasets/about_map_batch#batch-mapping) section of the 🤗 Datasets documentation useful for this task.

</Tip>
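
One way to attempt the exercise without Pandas, shown on a tiny stand-in for `issues_dataset` (the helper function is hypothetical, not the chapter's own solution):

```py
from datasets import Dataset

issues_dataset = Dataset.from_dict(
    {
        "html_url": ["https://github.com/org/repo/issues/1", "https://github.com/org/repo/issues/2"],
        "comments": [["first comment", "second comment"], ["only comment"]],
    }
)


def explode_comments(batch):
    # Repeat each issue's URL once per comment and flatten the lists of comments.
    return {
        "html_url": [
            url
            for url, comments in zip(batch["html_url"], batch["comments"])
            for _ in comments
        ],
        "comments": [comment for comments in batch["comments"] for comment in comments],
    }


comments_dataset = issues_dataset.map(
    explode_comments, batched=True, remove_columns=issues_dataset.column_names
)
print(comments_dataset.num_rows)  # 3 -- one row per comment
```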
