[Datasets] Add glossary (ray-project#32400)

Closes ray-project#30408

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
edoakes · Mar 22, 2023 · cee30c1
1 parent e777228
Showing 2 changed files with 139 additions and 1 deletion.
diff --git a/doc/source/_toc.yml b/doc/source/_toc.yml
@@ -141,6 +141,7 @@ parts:
               - file: data/random-access
           - file: data/faq
           - file: data/api/api
+          - file: data/glossary
           - file: data/integrations
 
       - file: train/train
@@ -380,4 +381,4 @@ parts:
               - file: ray-contribute/fake-autoscaler
               - file: ray-core/examples/testing-tips
           - file: ray-core/configure
-          - file: ray-contribute/whitepaper
+          - file: ray-contribute/whitepaper
diff --git a/doc/source/data/glossary.rst b/doc/source/data/glossary.rst
@@ -0,0 +1,137 @@
+.. _datasets_glossary:
+
+=====================
+Ray Datasets Glossary
+=====================
+
+.. glossary::
+
+    Batch format
+        The way batches of data are represented.
+
+        Set ``batch_format`` in methods like
+        :meth:`Dataset.iter_batches() <ray.data.Dataset.iter_batches>` and
+        :meth:`Dataset.map_batches() <ray.data.Dataset.map_batches>` to specify the
+        batch type.
+
+        .. doctest::
+
+            >>> import ray
+            >>> dataset = ray.data.range_table(10)
+            >>> next(iter(dataset.iter_batches(batch_format="numpy", batch_size=5)))
+            {'value': array([0, 1, 2, 3, 4])}
+            >>> next(iter(dataset.iter_batches(batch_format="pandas", batch_size=5)))
+               value
+            0      0
+            1      1
+            2      2
+            3      3
+            4      4
+
+        To learn more about batch formats, read
+        :ref:`UDF Input Batch Formats <transform_datasets_batch_formats>`.
+
+    Block
+        A processing unit of data. A :class:`~ray.data.Dataset` consists of a
+        collection of blocks.
+
+        Under the hood, :term:`Datasets <Datasets (library)>` partition :term:`records <Record>`
+        into a set of distributed data blocks. This allows Datasets to perform operations
+        in parallel.
+
+        Unlike a batch, which is a user-facing object, a block is an internal abstraction.
+
+    Block format
+        The way :term:`blocks <Block>` are represented.
+
+        A block is represented as an
+        `Arrow table <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>`_, a
+        `pandas DataFrame <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`_,
+        or a Python list. To determine the block format, call
+        :meth:`Dataset.dataset_format() <ray.data.Dataset.dataset_format>`.
+
+    Datasets (library)
+        A library for distributed data processing.
+
+        Datasets isn’t intended as a replacement for more general data processing systems.
+        Its utility is as the last-mile bridge from ETL pipeline outputs to distributed
+        ML applications and libraries in Ray.
+
+        To learn more about Ray Datasets, read :ref:`Key Concepts <dataset_concept>`.
+
+    Dataset (object)
+        A class that represents a distributed collection of data.
+
+        :class:`~ray.data.Dataset` exposes methods to read, transform, and consume data at scale.
+
+        To learn more about Datasets and the operations they support, read the :ref:`Datasets API Reference <data-api>`.
+
+    Datasource
+        A :class:`~ray.data.Datasource` specifies how to read from and write to
+        a variety of external storage systems and data formats.
+
+        Examples of Datasources include :class:`~ray.data.datasource.ParquetDatasource`,
+        :class:`~ray.data.datasource.ImageDatasource`,
+        :class:`~ray.data.datasource.TFRecordDatasource`,
+        :class:`~ray.data.datasource.CSVDatasource`, and
+        :class:`~ray.data.datasource.MongoDatasource`.
+
+        To learn more about Datasources, read :ref:`Creating a Custom Datasource <custom_datasources>`.
+
+    Record
+        A single data item.
+
+        If your dataset is :term:`tabular <Tabular Dataset>`, then records are :class:`TableRows <ray.data.row.TableRow>`.
+        If your dataset is :term:`simple <Simple Dataset>`, then records are arbitrary Python objects.
+        If your dataset is :term:`tensor <Tensor Dataset>`, then records are `NumPy ndarrays <https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html>`_.
+
+    Schema
+        The data type of a dataset.
+
+        If your dataset is :term:`tabular <Tabular Dataset>`, then the schema describes
+        the column names and data types. If your dataset is :term:`simple <Simple Dataset>`,
+        then the schema describes the Python object type. If your dataset is
+        :term:`tensor <Tensor Dataset>`, then the schema describes the per-element
+        tensor shape and data type.
+
+        To determine a dataset's schema, call
+        :meth:`Dataset.schema() <ray.data.Dataset.schema>`.
+
+    Simple Dataset
+        A Dataset that represents a collection of arbitrary Python objects.
+
+        .. doctest::
+
+            >>> import ray
+            >>> ray.data.from_items(["spam", "ham", "eggs"])
+            Dataset(num_blocks=3, num_rows=3, schema=<class 'str'>)
+
+    Tensor Dataset
+        A Dataset that represents a collection of ndarrays.
+
+        :term:`Tabular datasets <Tabular Dataset>` that contain tensor columns aren’t tensor datasets.
+
+        .. doctest::
+
+            >>> import numpy as np
+            >>> import ray
+            >>> ray.data.from_numpy(np.zeros((100, 32, 32, 3)))
+            Dataset(num_blocks=1, num_rows=100, schema={__value__: ArrowTensorType(shape=(32, 32, 3), dtype=double)})
+
+    Tabular Dataset
+        A Dataset that represents columnar data.
+
+        .. doctest::
+
+            >>> import ray
+            >>> ray.data.read_csv("s3://anonymous@air-example-data/iris.csv")
+            Dataset(num_blocks=1, num_rows=150, schema={sepal length (cm): double, sepal width (cm): double, petal length (cm): double, petal width (cm): double, target: int64})
+
+    User-defined function (UDF)
+        A callable that transforms batches or :term:`records <Record>` of data. UDFs let you apply arbitrary transformations to datasets.
+
+        Call :meth:`Dataset.map_batches() <ray.data.Dataset.map_batches>`,
+        :meth:`Dataset.map() <ray.data.Dataset.map>`, or
+        :meth:`Dataset.flat_map() <ray.data.Dataset.flat_map>` to apply UDFs.
+
+        To learn more about UDFs, read :ref:`Writing User-Defined Functions <transform_datasets_writing_udfs>`.
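
The glossary describes records being partitioned into blocks so that a UDF can be applied to them in parallel. The following is a minimal pure-Python sketch of that idea only. It does not use Ray and is not how Ray Datasets is implemented: the names `partition`, `map_blocks`, and `double` are hypothetical, and a thread pool stands in for a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Illustrative types only; real blocks are Arrow tables, pandas DataFrames, or lists.
Record = int
Block = List[Record]


def partition(records: List[Record], num_blocks: int) -> List[Block]:
    """Split records into at most num_blocks contiguous blocks (hypothetical helper)."""
    if num_blocks <= 0:
        raise ValueError("num_blocks must be positive")
    if not records:
        return []
    size = -(-len(records) // num_blocks)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_blocks(blocks: List[Block], udf: Callable[[Block], Block]) -> List[Block]:
    """Apply udf to every block; the thread pool is a stand-in for distributed workers."""
    with ThreadPoolExecutor() as pool:
        # pool.map preserves the input order of the blocks.
        return list(pool.map(udf, blocks))


def double(batch: Block) -> Block:
    """A user-defined function (UDF) that transforms one batch of records."""
    return [x * 2 for x in batch]


blocks = partition(list(range(10)), num_blocks=3)
result = map_blocks(blocks, double)
print(blocks)
print(result)
```

In this sketch the UDF receives a whole block at a time, which loosely mirrors the batch-oriented style of `Dataset.map_batches()`. A per-record UDF in the style of `Dataset.map()` would instead be called once per element.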