huggingface · lhoestq · Nov 3, 2021 · Oct 28, 2021 · Oct 29, 2021 · Nov 2, 2021
diff --git a/docs/source/imgs/stream.gif b/docs/source/imgs/stream.gif
diff --git a/docs/source/loading.rst b/docs/source/loading.rst
@@ -101,6 +101,14 @@ To load remote CSV files via HTTP, you can pass the URLs:
    >>> base_url = "https://huggingface.co/datasets/lhoestq/demo1/resolve/main/data/"
    >>> dataset = load_dataset('csv', data_files={'train': base_url + 'train.csv', 'test': base_url + 'test.csv'})
 
+To load zipped CSV files:
+
+.. code::
+
+   >>> url = "https://domain.org/train_data.zip"
+   >>> data_files = {"train": url}
+   >>> dataset = load_dataset("csv", data_files=data_files)
+
 JSON
 ^^^^
 

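The zipped-CSV example in ``loading.rst`` depends on a remote URL. For experimenting offline, here is a minimal standard-library sketch of what reading a CSV out of a ZIP archive involves. It is not how 🤗 Datasets implements this; the helper names, the ``train.csv`` member name, and the rows are made up for illustration.

```python
import csv
import io
import zipfile


def build_zipped_csv(rows, member_name="train.csv"):
    """Return the bytes of a ZIP archive holding one CSV file (illustrative helper)."""
    text = io.StringIO()
    csv.writer(text).writerows(rows)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(member_name, text.getvalue())
    return buffer.getvalue()


def read_zipped_csv(archive_bytes):
    """Parse every .csv member of the archive into a list of dict rows."""
    records = []
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for name in archive.namelist():
            if not name.endswith(".csv"):
                continue  # skip anything that is not a CSV file
            with archive.open(name) as raw:
                wrapper = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                records.extend(csv.DictReader(wrapper))
    return records


# The first row is the header; all values come back as strings.
archive_bytes = build_zipped_csv([["text", "label"], ["hello", "1"], ["world", "0"]])
records = read_zipped_csv(archive_bytes)
```

Writing ``archive_bytes`` to a local ``.zip`` file gives a small fixture for trying the ``data_files`` pattern above; whether a local path behaves exactly like the remote URL is an assumption worth checking against the library documentation.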
diff --git a/docs/source/process.rst b/docs/source/process.rst
@@ -525,21 +525,40 @@ You can also concatenate two datasets horizontally (axis=1) as long as they have
    >>> bookcorpus_ids = Dataset.from_dict({"ids": list(range(len(bookcorpus)))})
    >>> bookcorpus_with_ids = concatenate_datasets([bookcorpus, bookcorpus_ids], axis=1)
 
+.. _format:
+
 Format
 ------
 
+Set a dataset to a TensorFlow-compatible format with :func:`datasets.Dataset.set_format`. Specify ``type='tensorflow'`` and the columns that should be formatted:
+
+.. code-block::
+
+   >>> import tensorflow as tf
+   >>> dataset.set_format(type='tensorflow', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
+
+Then you can wrap the dataset with ``tf.data.Dataset``. This method gives you more control over how to create a `TensorFlow Dataset <https://www.tensorflow.org/api_docs/python/tf/data/Dataset>`_. In the example below, the dataset is created with ``from_tensor_slices``, where ``features`` is a dictionary that maps each input column name to its tensor:
+
+.. code-block::
+
+   >>> tfdataset = tf.data.Dataset.from_tensor_slices((features, dataset["label"])).batch(32)
+
 :func:`datasets.Dataset.with_format` provides an alternative method to set the format. This method will return a new :class:`datasets.Dataset` object with your specified format:
 
 .. code::
 
-   >>> dataset.with_format(type='tensorflow', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
+   >>> dataset = dataset.with_format(type='tensorflow', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
+
+.. tip::
+
+   🤗 Datasets also provides support for other common data formats such as NumPy, PyTorch, Pandas, and JAX.
 
 Use :func:`datasets.Dataset.reset_format` if you need to reset the dataset to the original format:
 
 .. code-block::
 
    >>> dataset.format
-   {'type': 'torch', 'format_kwargs': {}, 'columns': ['label'], 'output_all_columns': False}
+   {'type': 'tensorflow', 'format_kwargs': {}, 'columns': ['label'], 'output_all_columns': False}
    >>> dataset.reset_format()
    >>> dataset.format
    {'type': 'python', 'format_kwargs': {}, 'columns': ['idx', 'label', 'sentence1', 'sentence2'], 'output_all_columns': False}

diff --git a/docs/source/stream.rst b/docs/source/stream.rst
@@ -6,6 +6,9 @@ Dataset streaming lets you get started with a dataset without waiting for the en
 * You don't want to wait for an extremely large dataset to download.
 * The dataset size exceeds the amount of disk space on your computer.
 
+.. image:: /imgs/stream.gif
+   :align: center
+
 For example, the English split of the `OSCAR <https://huggingface.co/datasets/oscar>`_ dataset is 1.2 terabytes, but you can use it instantly with streaming. Stream a dataset by setting ``streaming=True`` in :func:`datasets.load_dataset` as shown below:
 
 .. code-block::
@@ -103,3 +106,14 @@ Define sampling probabilities from each of the original datasets for more contro
    [{'text': 'Mtendere Village was inspired by the vision...}, {'text': 'Lily James cannot fight the music...}]
 
 Around 80% of the final dataset is made of the ``en_dataset``, and 20% of the ``fr_dataset``.
+
+Remove
+^^^^^^
+
+Remove columns on-the-fly with :func:`datasets.IterableDataset.remove_columns`. Specify the name of the column to remove:
+
+.. code-block::
+
+   >>> from datasets import load_dataset
+   >>> dataset = load_dataset('mc4', 'en', streaming=True, split='train')
+   >>> dataset = dataset.remove_columns('timestamp')
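Removing a column "on the fly" means the column is dropped as each example is produced, without materializing the dataset first. The plain-Python sketch below illustrates that idea with a generator. It is a conceptual model only, not the implementation of ``IterableDataset.remove_columns``; ``remove_columns_lazily`` and ``fake_stream`` are hypothetical names, and the example records are invented.

```python
def remove_columns_lazily(examples, column_names):
    """Yield each example without the given columns, one example at a time."""
    # Accept a single column name or a collection of names.
    names = {column_names} if isinstance(column_names, str) else set(column_names)
    for example in examples:
        yield {key: value for key, value in example.items() if key not in names}


def fake_stream():
    """Stand-in for a streamed dataset: examples are produced lazily."""
    yield {"text": "first", "timestamp": "2021-01-01", "url": "https://example.com/a"}
    yield {"text": "second", "timestamp": "2021-01-02", "url": "https://example.com/b"}


stream = remove_columns_lazily(fake_stream(), "timestamp")
first = next(stream)   # only now is the first example read and filtered
second = next(stream)
```

Because the filtering happens inside the generator, nothing is read from the source until the consumer asks for the next example.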
diff --git a/docs/source/use_dataset.rst b/docs/source/use_dataset.rst
@@ -44,26 +44,30 @@ Now you can tokenize ``sentence1`` field of the dataset:
 
 The tokenization process creates three new columns: ``input_ids``, ``token_type_ids``, and ``attention_mask``. These are the inputs to the model.
 
-Format
-------
+Use in PyTorch or TensorFlow
+----------------------------
 
-Set the format with :func:`datasets.Dataset.set_format`, which accepts two main arguments:
+Next, format the dataset into types compatible with PyTorch or TensorFlow.
 
-1. ``type`` defines the type of column to cast to. For example, ``torch`` returns PyTorch tensors and ``tensorflow`` returns TensorFlow tensors.
+PyTorch
+^^^^^^^
+
+If you are using PyTorch, set the format with :func:`datasets.Dataset.set_format`, which accepts two main arguments:
+
+1. ``type`` defines the type of column to cast to. For example, ``torch`` returns PyTorch tensors.
 
-2. ``columns`` specifies which columns should be formatted.
+2. ``columns`` specifies which columns should be formatted.
 
-After you set the format, wrap the dataset in a ``torch.utils.data.DataLoader`` or a ``tf.data.Dataset``:
+After you set the format, wrap the dataset with ``torch.utils.data.DataLoader``. Your dataset is now ready for use in a training loop!
 
-.. tab:: PyTorch
+.. code-block::
 
    >>> import torch
    >>> from datasets import load_dataset
    >>> from transformers import AutoTokenizer
    >>> dataset = load_dataset('glue', 'mrpc', split='train')
    >>> tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
    >>> dataset = dataset.map(lambda e: tokenizer(e['sentence1'], truncation=True, padding='max_length'), batched=True)
-   ...
    >>> dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
    >>> dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
    >>> next(iter(dataloader))
@@ -78,28 +82,53 @@ After you set the format, wrap the dataset in a ``torch.utils.data.DataLoader``
                             ...,
                             [0, 0, 0,  ..., 0, 0, 0]])}
 
-.. tab:: TensorFlow
+TensorFlow
+^^^^^^^^^^
+
+If you are using TensorFlow, convert the dataset with ``to_tf_dataset``, which accepts several arguments:
+
+1. ``columns`` specifies which columns should be formatted (including the inputs and labels).
+
+2. ``shuffle`` determines whether the dataset should be shuffled.
+
+3. ``batch_size`` specifies the batch size.
+
+4. ``collate_fn`` specifies a data collator that will batch each processed example and apply padding. If you are using a ``DataCollator``, make sure you set ``return_tensors="tf"`` when you initialize it to return ``tf.Tensor`` outputs.
+
+.. code-block::
 
    >>> import tensorflow as tf
    >>> from datasets import load_dataset
    >>> from transformers import AutoTokenizer
    >>> dataset = load_dataset('glue', 'mrpc', split='train')
    >>> tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
    >>> dataset = dataset.map(lambda e: tokenizer(e['sentence1'], truncation=True, padding='max_length'), batched=True)
-   ...
-   >>> dataset.set_format(type='tensorflow', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
-   >>> features = {x: dataset[x].to_tensor(default_value=0, shape=[None, tokenizer.model_max_length]) for x in ['input_ids', 'token_type_ids', 'attention_mask']}
-   >>> tfdataset = tf.data.Dataset.from_tensor_slices((features, dataset["label"])).batch(32)
-   >>> next(iter(tfdataset))
-   ({'input_ids': <tf.Tensor: shape=(32, 512), dtype=int32, numpy=
-   array([[  101,  7277,  2180, ...,     0,     0,     0],
-        ...,
-        [  101,   142,  1813, ...,     0,     0,     0]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(32, 512), dtype=int32, numpy=
-   array([[0, 0, 0, ..., 0, 0, 0],
-        ...,
-        [0, 0, 0, ..., 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(32, 512), dtype=int32, numpy=
-   array([[1, 1, 1, ..., 0, 0, 0],
-        ...,
-        [1, 1, 1, ..., 0, 0, 0]], dtype=int32)>}, <tf.Tensor: shape=(32,), dtype=int64, numpy=
-   array([1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1,
-        0, 1, 1, 1, 0, 0, 1, 1, 1, 0])>)
+   >>> from transformers import DataCollatorWithPadding
+   >>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
+   >>> train_dataset = dataset.to_tf_dataset(
+   ...   columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'],
+   ...   shuffle=True,
+   ...   batch_size=16,
+   ...   collate_fn=data_collator,
+   ... )
+   >>> next(iter(train_dataset))
+   {'attention_mask': <tf.Tensor: shape=(16, 512), dtype=int64, numpy=
+    array([[1, 1, 1, ..., 0, 0, 0],
+         ...,
+         [1, 1, 1, ..., 0, 0, 0]])>,
+    'input_ids': <tf.Tensor: shape=(16, 512), dtype=int64, numpy=
+     array([[  101, 11336, 11154, ...,     0,     0,     0],
+         ..., 
+         [  101,   156, 22705, ...,     0,     0,     0]])>,
+    'labels': <tf.Tensor: shape=(16,), dtype=int64, numpy=
+     array([1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0])>,
+    'token_type_ids': <tf.Tensor: shape=(16, 512), dtype=int64, numpy=
+     array([[0, 0, 0, ..., 0, 0, 0],
+          ...,
+         [0, 0, 0, ..., 0, 0, 0]])>
+   }
+
+.. tip::
+
+   ``to_tf_dataset`` is the easiest way to create a TensorFlow-compatible dataset. If you are looking for additional options for constructing a TensorFlow dataset, take a look at the :ref:`format` section!
+
+Your dataset is now ready for use in a training loop!
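The ``collate_fn`` argument described above batches examples and applies padding. As a rough illustration of that step, the plain-Python sketch below pads each sequence in a batch to the length of the longest one. ``pad_batch`` is a hypothetical helper, not ``DataCollatorWithPadding``: a real collator returns tensors and handles more fields. Renaming ``label`` to ``labels`` mirrors the key shown in the example output.

```python
def pad_batch(examples, pad_token_id=0):
    """Pad every example to the longest sequence in the batch (illustrative only)."""
    max_length = max(len(example["input_ids"]) for example in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for example in examples:
        length = len(example["input_ids"])
        padding = max_length - length
        # Real tokens get mask 1, padded positions get mask 0.
        batch["input_ids"].append(example["input_ids"] + [pad_token_id] * padding)
        batch["attention_mask"].append([1] * length + [0] * padding)
        batch["labels"].append(example["label"])
    return batch


batch = pad_batch(
    [
        {"input_ids": [101, 7277, 102], "label": 1},
        {"input_ids": [101, 102], "label": 0},
    ]
)
```

Padding to the longest sequence in each batch usually produces smaller batches than padding every example to a fixed maximum length up front.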