
[Data] Update Dataset.count() to avoid unnecessarily keeping BlockRefs in-memory #46369

Merged · 21 commits · Jul 10, 2024

Conversation

@scottjlee (Contributor) commented Jul 1, 2024

Why are these changes needed?

Currently, the implementation of Dataset.count() retrieves the entire list of BlockRefs associated with the Dataset when calculating the number of rows per block. This PR is a minor performance improvement: it iterates over the BlockRefs instead, so that each reference can be dropped as soon as its block's row count is read, and the entire list of BlockRefs never needs to be held in memory.
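For illustration, a minimal, self-contained sketch of the before/after behavior, using mock objects in place of ObjectRef[Block] and row-count metadata (not Ray's actual code):

```python
from typing import Iterator, List, Tuple


class MockBlockRef:
    """Stands in for an ObjectRef[Block] pinning a block in the object store."""


def count_with_list(refs_md: List[Tuple[MockBlockRef, int]]) -> int:
    # Before: the full list of (ref, num_rows) pairs stays referenced
    # until the entire sum completes.
    return sum(num_rows for _, num_rows in refs_md)


def count_with_iterator(refs_md: Iterator[Tuple[MockBlockRef, int]]) -> int:
    # After: each consumed pair becomes unreachable as soon as its row
    # count is added, so the refs can be dropped incrementally.
    total = 0
    for _, num_rows in refs_md:
        total += num_rows
    return total


assert count_with_list([(MockBlockRef(), n) for n in (10, 20, 30)]) == 60
assert count_with_iterator((MockBlockRef(), n) for n in (10, 20, 30)) == 60
```

With a true generator source, as in the second call, nothing retains the consumed pairs, so each mock ref becomes unreachable as soon as its count is added.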

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

sjl and others added 2 commits July 1, 2024 23:23
@@ -4577,8 +4601,6 @@ def get_internal_block_refs(self) -> List[ObjectRef[Block]]:
>>> ds.get_internal_block_refs()
[ObjectRef(...)]

Time complexity: O(1)
Contributor Author: Removing this because it's no longer accurate.

@scottjlee scottjlee changed the title from "[Data] Update Dataset.count() to avoid unnecessary keeping BlockRefs in-memory" to "[Data] Update Dataset.count() to avoid unnecessarily keeping BlockRefs in-memory" Jul 2, 2024
@scottjlee scottjlee marked this pull request as ready for review July 2, 2024 03:19
An iterator over references to this Dataset's blocks.
"""
iter_block_refs_md, _, _ = self._plan.execute_to_iterator()
iter_block_refs = (block_ref for block_ref, _ in iter_block_refs_md)
Contributor: Just realized that we already have block metadata here, so there's no need to submit additional tasks to count rows. We can update this method to return Iterator[RefBundle].

Member: +1
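A hedged sketch of the suggestion above: if each yielded bundle already carries its metadata row count, counting needs no extra Ray tasks and never fetches block data. SimpleBundle and its fields below are illustrative stand-ins, not Ray's actual RefBundle.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class SimpleBundle:
    block_ref: object          # stands in for ObjectRef[Block]
    num_rows: Optional[int]    # row count recorded in the block's metadata


def count_rows(bundles: Iterator[SimpleBundle]) -> int:
    total = 0
    for bundle in bundles:
        # The row count comes straight from metadata, so the block
        # itself is never fetched or deserialized.
        assert bundle.num_rows is not None
        total += bundle.num_rows
    return total


print(count_rows(iter([SimpleBundle("ref-a", 5), SimpleBundle("ref-b", 7)])))  # 12
```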

Comment on lines 4573 to 4574
This function can be used for zero-copy access to the data. It does not
keep the data materialized in-memory.
Member: What does zero-copy access mean here? You might copy the data when you get the block reference, right?

Contributor Author: I had thought that when we get the RefBundle / BlockRef, it does not copy the data. That's the main advantage of passing the references instead of the blocks themselves, right?

Member: Oh, yeah. If you don't call ray.get there won't be any copies, although the way I read this makes it sound like I can access the actual Block without copies.

Contributor Author: Good point, let me just remove the line. I think saying "It does not keep the data materialized in-memory." is the more important point to get across.
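For context on the thread above, a small runnable illustration of the copy semantics (assumes a local Ray installation; only the public ray.put/ray.get API is used):

```python
import ray

ray.init(ignore_reinit_error=True)

# The list is serialized once into the shared object store.
ref = ray.put(list(range(1_000_000)))

# `ref` is a lightweight handle; no block data has been copied into
# this process yet. The copy (deserialization into local memory)
# happens only here:
data = ray.get(ref)
print(len(data))  # 1000000
```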

def iter_internal_block_refs(self) -> Iterator[ObjectRef[Block]]:
"""Get an iterator over references to the underlying blocks of this Dataset.

This function can be used for zero-copy access to the data. It does not
Member:

Suggested change
- This function can be used for zero-copy access to the data. It does not
+ This function can be used for zero-copy access to the data. It doesn't

An iterator over references to this Dataset's blocks.
"""
iter_block_refs_md, _, _ = self._plan.execute_to_iterator()
iter_block_refs = (block_ref for block_ref, _ in iter_block_refs_md)
Member: +1

sjl added 6 commits July 3, 2024 04:32
Signed-off-by: sjl <sjl@anyscale.com>
Signed-off-by: sjl <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
fix
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: sjl <sjl@anyscale.com>
@scottjlee scottjlee requested review from raulchen and bveeramani July 3, 2024 21:27
Comment on lines 2467 to 2470
"""Count the number of records in the dataset. For `Dataset`s
which only read Parquet files (created with :meth:`~ray.data.read_parquet`),
this method reads the file metadata to efficiently count the number of records
without reading in the entire data.
Member: Nit: docstring summary should be one line. Also, replace "records" with "rows" for consistency with our terminology elsewhere.

Suggested change
- """Count the number of records in the dataset. For `Dataset`s
- which only read Parquet files (created with :meth:`~ray.data.read_parquet`),
- this method reads the file metadata to efficiently count the number of records
- without reading in the entire data.
+ """Count the number of rows in the dataset.
+ For `Dataset`s
+ which only read Parquet files (created with :meth:`~ray.data.read_parquet`),
+ this method reads the file metadata to efficiently count the number of rows
+ without reading in the entire data.

Member: We use the metadata for other APIs, too (read_images, from_pandas, etc.). I'm wondering if we should just remove the thing about Parquet? It's an implementation detail.

Contributor Author: I think the specific case around Parquet has come up in questions from OSS users sometimes, which is why I expanded on the existing O(1) note for Parquet to clarify what it means. Let me know if you think it's too confusing, and we can remove it instead.

Member: Ah, got it. If people have been asking about it, sounds good to keep.

blocks: Tuple[ObjectRef[Block], BlockMetadata],
) -> RefBundle:
# Set `owns_blocks=True` so we can destroy the blocks eagerly
# after getting count from metadata.
Contributor:

  • Let's also update _plan.execute_to_iterator to return RefBundles?
  • Eager free won't work here unless we explicitly call destroy_if_owned after count. I think it's fine to not eager free; just a note about the comment.

Contributor Author:

> Let's also update _plan.execute_to_iterator to return RefBundles?

I'm thinking I can do that in a future PR, where I also replace get_internal_block_refs usages with the new iter_internal_ref_bundles method. What do you think?

> Eager free won't work here unless we explicitly call destroy_if_owned after count.

Ah, thanks, I had misunderstood how that worked. I will remove the comment but keep the owns_blocks=True usage.

Member:

> Let's also update _plan.execute_to_iterator to return RefBundles?

@scottjlee @raulchen I'll do this. It was a planned clean-up item from the LazyBlockList removal.
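To make the eager-free point in the thread above concrete, a hedged sketch with mock objects (the destroy() method stands in for Ray's destroy_if_owned; none of this is Ray's actual implementation): owning blocks only makes freeing possible, and memory is released only if the consumer explicitly calls the destroy hook.

```python
from typing import List, Optional


class OwnedBundle:
    def __init__(self, num_rows: int, owns_blocks: bool = True) -> None:
        self.num_rows = num_rows
        self.owns_blocks = owns_blocks
        self.block: Optional[bytes] = b"x" * 1024  # stand-in block data

    def destroy(self) -> None:
        # In Ray this would free object-store memory; it runs only if
        # the consumer explicitly calls it. owns_blocks alone frees nothing.
        if self.owns_blocks:
            self.block = None


def count_and_free(bundles: List[OwnedBundle]) -> int:
    total = 0
    for b in bundles:
        total += b.num_rows
        b.destroy()  # explicit eager free after reading the metadata count
    return total


bundles = [OwnedBundle(3), OwnedBundle(4)]
assert count_and_free(bundles) == 7
assert all(b.block is None for b in bundles)
```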

num_rows = ref_bundle.num_rows()
# Executing the dataset always returns blocks with valid `num_rows`.
assert num_rows is not None
total_rows += num_rows
Contributor: Not an issue with this PR, but we should make BlockMetadata.num_rows non-nullable, to avoid repeating this check.

Member: Yeah, let's definitely do this when we separate BlockMetadata from read tasks. Currently, BlockMetadata.num_rows must be nullable because some datasources don't know how many rows are yielded by each read task.

self._synchronize_progress_bar()
return iter_ref_bundles

@ConsumptionAPI(pattern="")
@DeveloperAPI
def get_internal_block_refs(self) -> List[ObjectRef[Block]]:
Contributor: (Can do this later.) There are only a few use cases of get_internal_block_refs; we can also update them to use iter_internal_block_refs.

Scott Lee added 3 commits July 3, 2024 14:55
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
@raulchen raulchen enabled auto-merge (squash) July 3, 2024 22:07
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Jul 3, 2024
sjl added 3 commits July 3, 2024 23:01
Signed-off-by: sjl <sjl@anyscale.com>
…01-count

Signed-off-by: sjl <sjl@anyscale.com>
@github-actions github-actions bot disabled auto-merge July 4, 2024 00:19
Signed-off-by: Scott Lee <sjl@anyscale.com>
Comment on lines 89 to 92
# Execute the dataset to get full schema.
ds = ds.materialize()
assert "{col1: int64, col2: int64, col3: object, col4: object}" in str(ds)

Contributor Author: Need to update tests for Zip, because in the test, we call ds.count() before attempting to check the schema from the Dataset.__str__ representation. After updating ds.count() to no longer execute and get the list of underlying Blocks, the schema is unknown for N-ary operators without executing: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/plan.py#L155-L158

@scottjlee (Contributor Author):

One side effect we overlooked with this PR: now that we avoid calling Dataset.get_internal_block_refs(), which calls ExecutionPlan.execute(), we no longer save the snapshot bundle to the ExecutionPlan after execution (which is what ExecutionPlan.execute() does). This has several effects:

  • Calling str(ds) on an unexecuted Dataset without a row count from metadata will result in Dataset(num_rows=?), since the row count is always unknown prior to execution. If one calls ds.count() and then str(ds), the proper num_rows is displayed, because the row count is obtained from ExecutionPlan._snapshot_bundle, which is saved during ds.count().
  • Similarly, the schema for Datasets involving N-ary operators (e.g. Zip) is no longer known.

For both cases, I think it makes sense that the count / schema is unknown prior to execution, since we would need to see all of the data to be certain. What do you think @raulchen @bveeramani?

Also, one optimization we could make: whenever Dataset.count() is called, we can cache the row count on the ExecutionPlan and reuse it in ExecutionPlan.get_plan_as_string() (equivalent to str(Dataset)), so that it's available without snapshot_bundle.
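A hedged sketch of that caching optimization (class and method names are illustrative, not ExecutionPlan's actual API):

```python
from typing import List, Optional


class PlanWithCachedCount:
    def __init__(self, bundle_row_counts: List[int]) -> None:
        # Stand-in for per-bundle metadata row counts.
        self._bundle_row_counts = bundle_row_counts
        self._cached_num_rows: Optional[int] = None

    def count(self) -> int:
        if self._cached_num_rows is None:
            # Stand-in for iterating ref bundles and summing metadata rows.
            self._cached_num_rows = sum(self._bundle_row_counts)
        return self._cached_num_rows

    def plan_as_string(self) -> str:
        n = self._cached_num_rows
        return f"Dataset(num_rows={n if n is not None else '?'})"


plan = PlanWithCachedCount([10, 20])
print(plan.plan_as_string())  # Dataset(num_rows=?)
plan.count()
print(plan.plan_as_string())  # Dataset(num_rows=30)
```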

Signed-off-by: Scott Lee <sjl@anyscale.com>
@bveeramani (Member):

> For both cases, I think it makes sense that the count / schema is unknown prior to execution, since we would need to see all of the data to be certain. What do you think @raulchen @bveeramani?

Yeah, that sounds reasonable to me. Makes sense to merge this PR now and somehow cache the count and metadata in a follow-up PR.

Scott Lee added 3 commits July 8, 2024 14:58
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
@scottjlee (Contributor Author):

> Yeah, that sounds reasonable to me. Makes sense to merge this PR now and somehow cache the count and metadata in a follow-up PR.

Unfortunately, we can't merge this PR without implementing it; otherwise a number of tests will fail. But I have added the implementation here; waiting for tests to pass, then will send for review again.

Signed-off-by: Scott Lee <sjl@anyscale.com>
@scottjlee scottjlee requested review from bveeramani and raulchen July 9, 2024 02:13
@bveeramani (Member) left a comment: Nice

# and count. This is calculated and cached when the plan is executed as an
# iterator (`execute_to_iterator()`), and avoids caching
# all of the output blocks in memory like in `self.snapshot_bundle`.
self._snapshot_metadata: Optional[BlockMetadata] = None
Member: I don't have any ideas off the top of my head, but I think it'd be good if we simplified how we cache bundles and metadata at some point. It might be confusing how execute_to_iterator uses _snapshot_metadata but execute doesn't.

Contributor Author: Good point, added a TODO comment.

Signed-off-by: Scott Lee <sjl@anyscale.com>
@bveeramani bveeramani merged commit f8ee70a into ray-project:master Jul 10, 2024
5 checks passed
bveeramani pushed a commit that referenced this pull request Jul 16, 2024
…dles` instead of `(Block, BlockMetadata)` (#46575)

Followup to #46369 and #46455. Update `ExecutionPlan.execute_to_iterator()` to return `RefBundles` instead of `(Block, BlockMetadata)`, to unify the logic between `RefBundle`s and `Block`s. Also refactor the `iter_batches()` code path accordingly to handle `RefBundle`s instead of raw `Block` and `BlockMetadata`.

Signed-off-by: sjl <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Labels
go add ONLY when ready to merge, run all tests

3 participants