-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Datasets] Add glossary #32400
Merged
Merged
[Datasets] Add glossary #32400
Changes from all commits
Commits
Show all changes
5 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,137 @@ | ||
.. _datasets_glossary: | ||
|
||
===================== | ||
Ray Datasets Glossary | ||
===================== | ||
|
||
.. glossary:: | ||
|
||
Batch format | ||
The way batches of data are represented. | ||
|
||
Set ``batch_format`` in methods like | ||
:meth:`Dataset.iter_batches() <ray.data.Dataset.iter_batches>` and | ||
:meth:`Dataset.map_batches() <ray.data.Dataset.map_batches>` to specify the | ||
batch type. | ||
|
||
.. doctest:: | ||
|
||
>>> import ray | ||
>>> dataset = ray.data.range_table(10) | ||
>>> next(iter(dataset.iter_batches(batch_format="numpy", batch_size=5))) | ||
{'value': array([0, 1, 2, 3, 4])} | ||
>>> next(iter(dataset.iter_batches(batch_format="pandas", batch_size=5))) | ||
value | ||
0 0 | ||
1 1 | ||
2 2 | ||
3 3 | ||
4 4 | ||
|
||
To learn more about batch formats, read | ||
:ref:`UDF Input Batch Formats <transform_datasets_batch_formats>`. | ||
|
||
Block | ||
A processing unit of data. A :class:`~ray.data.Dataset` consists of a | ||
collection of blocks. | ||
|
||
Under the hood, :term:`Datasets <Datasets (library)>` partition :term:`records <Record>` | ||
into a set of distributed data blocks. This allows Datasets to perform operations | ||
in parallel. | ||
|
||
Unlike a batch, which is a user-facing object, a block is an internal abstraction. | ||
|
||
Block format | ||
The way :term:`blocks <Block>` are represented. | ||
|
||
Blocks are represented as | ||
`Arrow tables <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>`_, | ||
`pandas DataFrames <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`_, | ||
and Python lists. To determine the block format, call | ||
:meth:`Dataset.dataset_format() <ray.data.Dataset.dataset_format>`. | ||
|
||
Datasets (library) | ||
A library for distributed data processing. | ||
|
||
Datasets isn’t intended as a replacement for more general data processing systems. | ||
Its utility is as the last-mile bridge from ETL pipeline outputs to distributed | ||
ML applications and libraries in Ray. | ||
|
||
To learn more about Ray Datasets, read :ref:`Key Concepts <dataset_concept>`. | ||
|
||
Dataset (object) | ||
A class that represents a distributed collection of data. | ||
|
||
:class:`~ray.data.Dataset` exposes methods to read, transform, and consume data at scale. | ||
|
||
To learn more about Datasets and the operations they support, read the :ref:`Datasets API Reference <data-api>`. | ||
|
||
Datasource | ||
A :class:`~ray.data.Datasource` specifies how to read and write from | ||
a variety of external storage and data formats. | ||
|
||
Examples of Datasources include :class:`~ray.data.datasource.ParquetDatasource`, | ||
:class:`~ray.data.datasource.ImageDatasource`, | ||
:class:`~ray.data.datasource.TFRecordDatasource`, | ||
:class:`~ray.data.datasource.CSVDatasource`, and | ||
:class:`~ray.data.datasource.MongoDatasource`. | ||
|
||
To learn more about Datasources, read :ref:`Creating a Custom Datasource <custom_datasources>`. | ||
|
||
Record | ||
A single data item. | ||
|
||
If your dataset is :term:`tabular <Tabular Dataset>`, then records are :class:`TableRows <ray.data.row.TableRow>`. | ||
If your dataset is :term:`simple <Simple Dataset>`, then records are arbitrary Python objects. | ||
If your dataset is :term:`tensor <Tensor Dataset>`, then records are `NumPy ndarrays <https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html>`_. | ||
|
||
Schema | ||
The data type of a dataset. | ||
|
||
If your dataset is :term:`tabular <Tabular Dataset>`, then the schema describes | ||
the column names and data types. If your dataset is :term:`simple <Simple Dataset>`, | ||
then the schema describes the Python object type. If your dataset is | ||
:term:`tensor <Tensor Dataset>`, then the schema describes the per-element | ||
tensor shape and data type. | ||
|
||
To determine a dataset's schema, call | ||
:meth:`Dataset.schema() <ray.data.Dataset.schema>`. | ||
|
||
Simple Dataset | ||
A Dataset that represents a collection of arbitrary Python objects. | ||
|
||
.. doctest:: | ||
|
||
>>> import ray | ||
>>> ray.data.from_items(["spam", "ham", "eggs"]) | ||
Dataset(num_blocks=3, num_rows=3, schema=<class 'str'>) | ||
|
||
Tensor Dataset | ||
A Dataset that represents a collection of ndarrays. | ||
|
||
:term:`Tabular datasets <Tabular Dataset>` that contain tensor columns aren’t tensor datasets. | ||
|
||
.. doctest:: | ||
|
||
>>> import numpy as np | ||
>>> import ray | ||
>>> ray.data.from_numpy(np.zeros((100, 32, 32, 3))) | ||
Dataset(num_blocks=1, num_rows=100, schema={__value__: ArrowTensorType(shape=(32, 32, 3), dtype=double)}) | ||
|
||
Tabular Dataset | ||
A Dataset that represents columnar data. | ||
|
||
.. doctest:: | ||
|
||
>>> import ray | ||
>>> ray.data.read_csv("s3://anonymous@air-example-data/iris.csv") | ||
Dataset(num_blocks=1, num_rows=150, schema={sepal length (cm): double, sepal width (cm): double, petal length (cm): double, petal width (cm): double, target: int64}) | ||
|
||
User-defined function (UDF) | ||
A callable that transforms batches or :term:`records <Record>` of data. UDFs let you arbitrarily transform datasets. | ||
|
||
Call :meth:`Dataset.map_batches() <ray.data.Dataset.map_batches>`, | ||
:meth:`Dataset.map() <ray.data.Dataset.map>`, or | ||
:meth:`Dataset.flat_map() <ray.data.Dataset.flat_map>` to apply UDFs. | ||
|
||
To learn more about UDFs, read :ref:`Writing User-Defined Functions <transform_datasets_writing_udfs>`. |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we also add an example for
map_batches(batch_format=...)
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should. Meaningful
map_batches
examples are relatively complicated, and I don't think we should include long examples in the glossary. Also, since we link toUDF Input Batch Formats
, I think we should be okay.