[Data] [Docs] Improve docs around Parquet filter predicate / column selection pushdown (ray-project#48095)

## Why are these changes needed?

Improve docs around Parquet filter predicate / column selection
pushdown, so that they are easier to access from multiple parts of the
Ray Data docs, and improve the clarity of examples.

Modified pages:
- [Loading Data](https://anyscale-ray--48095.com.readthedocs.build/en/48095/data/loading-data.html)
  <img width="1298" alt="reading_files" src="https://github.com/user-attachments/assets/c760ecd4-4cfe-4547-8b88-3026fa12a13a">
- [Performance tips](https://anyscale-ray--48095.com.readthedocs.build/en/48095/data/performance-tips.html#parquet-column-pruning-projection-pushdown)
  <img width="1311" alt="performance_tips" src="https://github.com/user-attachments/assets/1fc894ae-dabf-4c33-bc27-ba0dcc9fddff">
- [`Dataset.select_columns`](https://anyscale-ray--48095.com.readthedocs.build/en/48095/data/api/doc/ray.data.Dataset.select_columns.html#ray.data.Dataset.select_columns)
  <img width="1338" alt="select_columns" src="https://github.com/user-attachments/assets/483a4bf2-acc7-4ca3-90e1-3f66563b3365">
- [`Dataset.filter`](https://anyscale-ray--48095.com.readthedocs.build/en/48095/data/api/doc/ray.data.Dataset.filter.html#ray.data.Dataset.filter)
  <img width="1319" alt="filter" src="https://github.com/user-attachments/assets/a9b1beb6-5b7c-415f-97b9-c2119513adde">

## Related issue number

## Checks

- [x] I've signed off every commit (using the `-s` flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in `doc/source/tune/api/` under the
  corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
  few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Scott Lee <sjl@anyscale.com>
scottjlee authored and JP-sDEV committed Nov 14, 2024
1 parent 55d8947 commit 3c93a9d
Showing 3 changed files with 55 additions and 17 deletions.
8 changes: 8 additions & 0 deletions doc/source/data/loading-data.rst
@@ -43,6 +43,14 @@ To view the full list of supported file formats, see the
             petal.width  double
             variety      string
 
+    .. tip::
+
+        When reading parquet files, you can take advantage of column and row pruning
+        to efficiently filter columns and rows at the file scan level. See
+        :ref:`Parquet column pruning <parquet_column_pruning>` and
+        :ref:`Parquet row pruning <parquet_row_pruning>` for more details
+        on the projection and filter pushdown features.
+
     .. tab-item:: Images
 
         To read raw images, call :func:`~ray.data.read_images`. Ray Data represents
52 changes: 38 additions & 14 deletions doc/source/data/performance-tips.rst
@@ -191,28 +191,52 @@ By default, Ray requests 1 CPU per read task, which means one read task per CPU
 For datasources that benefit from more IO parallelism, you can specify a lower ``num_cpus`` value for the read function with the ``ray_remote_args`` parameter.
 For example, use ``ray.data.read_parquet(path, ray_remote_args={"num_cpus": 0.25})`` to allow up to four read tasks per CPU.
 
-Parquet column pruning
-~~~~~~~~~~~~~~~~~~~~~~
+.. _parquet_column_pruning:
 
-Current Dataset reads all Parquet columns into memory.
+Parquet column pruning (projection pushdown)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+By default, :func:`ray.data.read_parquet` reads all columns in the Parquet files into memory.
 If you only need a subset of the columns, make sure to specify the list of columns
 explicitly when calling :func:`ray.data.read_parquet` to
-avoid loading unnecessary data (projection pushdown).
-For example, use ``ray.data.read_parquet("s3://anonymous@ray-example-data/iris.parquet", columns=["sepal.length", "variety"])`` to read
-just two of the five columns of Iris dataset.
+avoid loading unnecessary data (projection pushdown). Note that this is more efficient than
+calling :func:`~ray.data.Dataset.select_columns`, since column selection is pushed down to the file scan.
+
+.. testcode::
+
+    import ray
+
+    # Read just two of the five columns of the Iris dataset.
+    ray.data.read_parquet(
+        "s3://anonymous@ray-example-data/iris.parquet",
+        columns=["sepal.length", "variety"],
+    )
+
+.. testoutput::
+
+    Dataset(num_rows=150, schema={sepal.length: double, variety: string})
+
+.. _parquet_row_pruning:
 
-Parquet row pruning
-~~~~~~~~~~~~~~~~~~~
+Parquet row pruning (filter pushdown)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Similarly, you can pass in a filter to :func:`ray.data.read_parquet` (filter pushdown)
+Similar to Parquet column pruning, you can pass in a filter to :func:`ray.data.read_parquet` (filter pushdown)
 which is applied at the file scan so only rows that match the filter predicate
-are returned.
-For example, use ``ray.data.read_parquet("s3://anonymous@ray-example-data/iris.parquet", filter=pyarrow.dataset.field("sepal.length") > 5.0)``
-(where ``pyarrow`` has to be imported)
-to read rows with sepal.length greater than 5.0.
-This can be used in conjunction with column pruning when appropriate to get the benefits of both.
+are returned. This can be used in conjunction with column pruning when appropriate to get the benefits of both.
+
+.. testcode::
+
+    import pyarrow.dataset
+    import ray
+
+    # Only read rows with `sepal.length` greater than 5.0.
+    # The row count will be less than the total number of rows (150) in the full dataset.
+    ray.data.read_parquet(
+        "s3://anonymous@ray-example-data/iris.parquet",
+        filter=pyarrow.dataset.field("sepal.length") > 5.0,
+    ).count()
+
+.. testoutput::
+
+    118
+
 
 .. _data_memory:
12 changes: 9 additions & 3 deletions python/ray/data/dataset.py
@@ -831,6 +831,11 @@ def select_columns(
         Specified columns must be in the dataset schema.
 
+        .. tip::
+            If you're reading parquet files with :meth:`ray.data.read_parquet`,
+            you might be able to speed it up by using projection pushdown; see
+            :ref:`Parquet column pruning <parquet_column_pruning>` for details.
+
         Examples:
             >>> import ray
@@ -1103,9 +1108,10 @@ def filter(
         dropping rows.
 
         .. tip::
-            If you're using parquet and the filter is a simple predicate, you might
-            be able to speed it up by using filter pushdown, see
-            :ref:`Parquet row pruning <parquet_row_pruning>`.
+            If you're reading parquet files with :meth:`ray.data.read_parquet`
+            and the filter is a simple predicate, you might
+            be able to speed it up by using filter pushdown; see
+            :ref:`Parquet row pruning <parquet_row_pruning>` for details.
 
         Examples:
