
[data] Handle nullable fields in schema across blocks for parquet files #48478

Merged
merged 9 commits into ray-project:master on Nov 14, 2024

Conversation

@rickyyx (Contributor) commented Oct 31, 2024

Why are these changes needed?

When writing blocks to parquet, there might be blocks with fields that differ ONLY in nullability. By default, this would be rejected, since some blocks would then have a different schema than the one the ParquetWriter was created with. However, we can allow the write by tweaking the schema.

This PR goes through all blocks before writing them to parquet and merges schemas that differ only in the nullability of their fields. It also casts each table to the newly merged schema so that the write can happen.

Related issue number

Closes #48102

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@bveeramani (Member) left a comment

High level approach LGTM

@@ -75,10 +75,12 @@ def write(

 def write_blocks_to_path():
     with self.open_output_stream(write_path) as file:
-        schema = BlockAccessor.for_block(blocks[0]).to_arrow().schema
+        tables = [BlockAccessor.for_block(block).to_arrow() for block in blocks]
+        schema = self._try_merge_nullable_fields(tables)
Member

Rather than introducing a new method, we could extend the existing unify_schemas function:

Comment on lines 82 to 83
if not table.schema.equals(schema):
    table = table.cast(schema)
Member

What happens if we don't explicitly cast the tables?

Contributor Author

The table would still have a mismatched schema, i.e. table.schema.equals(schema) would still be false in this case.

Member

Gotcha. Wasn't sure if PyArrow would implicitly cast tables to match the specified schema under the hood.

Contributor Author

Yeah, it doesn't do the casting because there's a check on the schema equality here:

https://github.com/apache/arrow/blob/main/python/pyarrow/parquet/core.py#L1110-L1114
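
For illustration, a minimal standalone sketch of the failure mode being discussed (the field names and output path are hypothetical, not from this PR): ParquetWriter rejects a table whose schema differs from the writer schema only in nullability, and an explicit cast resolves it.

import pyarrow as pa
import pyarrow.parquet as pq

# The writer schema marks "b" as nullable; the table's schema marks it non-nullable.
writer_schema = pa.schema(
    [pa.field("a", pa.int64()), pa.field("b", pa.int64(), nullable=True)]
)
table = pa.table(
    {"a": [1, 2], "b": [3, 4]},
    schema=pa.schema(
        [pa.field("a", pa.int64()), pa.field("b", pa.int64(), nullable=False)]
    ),
)

# The schemas differ only in nullability, so they are not equal.
assert not table.schema.equals(writer_schema)

with pq.ParquetWriter("example.parquet", writer_schema) as writer:
    # Writing the table as-is would trip the schema-equality check linked above;
    # casting first makes the schemas equal and the write succeeds.
    writer.write_table(table.cast(writer_schema))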

@rickyyx (Contributor Author) commented Nov 4, 2024

Updates:

  • Use pyarrow.unify_schemas to unify the schemas from the various blocks. (I didn't use transform_arrows.py::unify_schemas since that routine seems to do a lot more extra work that isn't required here, but I'm open to using that as well; see the sketch after this list.)
  • Added some simple tests.
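
A minimal sketch of the pyarrow.unify_schemas approach described above (not the PR's exact code; the field names are hypothetical): schemas that differ only in nullability are unified, and each block is cast to the unified schema before writing.

import pyarrow as pa

# Two tables whose schemas differ only in the nullability of field "b".
t1 = pa.table({"a": [1, 2], "b": [None, 2]})  # "b" is inferred as nullable int64
t2 = pa.table(
    {"a": [3, 4], "b": [5, 6]},
    schema=pa.schema(
        [pa.field("a", pa.int64()), pa.field("b", pa.int64(), nullable=False)]
    ),
)

# unify_schemas merges fields by name; a nullability-only difference unifies to nullable.
unified = pa.unify_schemas([t1.schema, t2.schema])

# Cast each block so it matches the schema handed to the ParquetWriter.
tables = [t if t.schema.equals(unified) else t.cast(unified) for t in (t1, t2)]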

@rickyyx rickyyx marked this pull request as ready for review November 4, 2024 21:10
@bveeramani (Member) left a comment

LGTM

Comment on lines 1191 to 1193

if OperatorFusionRule in _PHYSICAL_RULES:
    _PHYSICAL_RULES.remove(OperatorFusionRule)
Member

Add a note why we're removing operator fusion here?

Contributor

Wait, why are we removing fusion?

We shouldn't need to do that.

Contributor Author

When we don't do this, I think there will only be 1 block somehow, so the repro here didn't work. I guess we would need some other examples/repros if we don't remove the rule?

Contributor

Oh, I see what you're saying.

Surely we can disable operator fusion, but that should be done through configuration, not by "physically" removing the rule from the list (just add a config to DataContext disabling it).

Member

> just add a config to DataContext disabling it

@alexeykudinkin do you envision us adding a config for each optimization rule, or special-case operator fusion?

In any case, adding an interface for disabling optimization rules seems orthogonal to the goal of this PR, and can probably be handled as a follow-up?

Contributor

@rickyyx no need to block this PR on this, let's just reshape your test a bit:

  • Instead of using ray.data.range as source, create 2 parquet files -- 1 without nulls, another with nulls
  • Read both of these and then write them out as a single one

Contributor Author

@alexeykudinkin I might be missing something here, but I am not sure how I can force writing the 2 files with a single block without disabling operator fusion.

Something like the below still only writes to the file with a single block (so there's technically no schema unification needed):

    # Write each row to a separate file.
    for i, row in enumerate(row_data):
        ray.data.from_pandas(pd.DataFrame([row])).write_parquet(
            os.path.join(tmp_path, f"file_{i}.parquet")
        )

    # Reading the files and merging them into a single file shouldn't error.
    ray.data.read_parquet(tmp_path).write_parquet(tmp_path, num_rows_per_file=2)

Member

> Instead of using ray.data.range as source, create 2 parquet files -- 1 without nulls, another with nulls
> Read both of these and then write them out as a single one

I don't think this'd reproduce the error. IIRC Ray Data will read both files in a single task, and then BlockOutputBuffer will combine the read data into a single block before passing it to the datasink

Contributor

Yeah, that might require some fiddling to make it work.

An alternative path is to specify num_cpus, which should make them diverge and hence avoid fusion.
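
A hedged sketch of that suggestion (the paths and the identity map function are placeholders, not from this PR): giving the map stage a resource request that differs from the read stage should keep the two operators from being fused into a single task.

import ray

# Placeholder input directory containing parquet files with differing nullability.
ds = ray.data.read_parquet("/tmp/input_parquet")

# Per the suggestion above, a num_cpus request that differs from the read stage
# should prevent the read and map operators from being fused.
ds = ds.map(lambda row: row, num_cpus=2)

ds.write_parquet("/tmp/output_parquet")  # placeholder output directory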

Contributor Author

I was able to force it by changing the target_max_block_size, if that's a better approach.

[
    [{"a": 1, "b": None}, {"a": 1, "b": 2}],
    [{"a": None, "b": None}, {"a": 1, "b": 2}],
    [{"a": 1, "b": 2}, {"a": 1, "b": "hi"}],
Member

What does the type get promoted to for "b" in this case?

Contributor Author

Oh, this shouldn't pass actually. It was somehow passing without removing the fusion.

python/ray/data/_internal/datasource/parquet_datasink.py (outdated comment, resolved)

Signed-off-by: rickyx <rickyx@anyscale.com>
@rickyyx rickyyx requested a review from srinathk10 as a code owner November 13, 2024 02:31
@rickyyx (Contributor Author) commented Nov 13, 2024

Updates:

  • Changed the test to avoid removing the operator fusion rule.
  • Resolved conflicts.

    ],
    ids=["row1_b_null", "row1_a_null", "row_each_null"],
)
def test_write_auto_infer_nullable_fields(tmp_path, ray_start_regular_shared, row_data):
Member

Add the restore_data_context fixture so that changes aren't persisted across tests

Suggested change
- def test_write_auto_infer_nullable_fields(tmp_path, ray_start_regular_shared, row_data):
+ def test_write_auto_infer_nullable_fields(tmp_path, ray_start_regular_shared, row_data, restore_data_context):

ctx = DataContext.get_current()
# So that we force multiple blocks on mapping.
ctx.target_max_block_size = 1
ds = ray.data.range(len(row_data)).map(lambda i: row_data[i["id"]])
Member

Nit: The name i makes me think that i is an int (index).

Suggested change
- ds = ray.data.range(len(row_data)).map(lambda i: row_data[i["id"]])
+ ds = ray.data.range(len(row_data)).map(lambda row: row_data[row["id"]])

(Feel free to keep it as-is, too)

Signed-off-by: rickyx <rickyx@anyscale.com>
Signed-off-by: rickyx <rickyx@anyscale.com>
@rickyyx rickyyx enabled auto-merge (squash) November 13, 2024 22:31
@github-actions bot added the "go" label (add ONLY when ready to merge, run all tests) on Nov 13, 2024
@rickyyx rickyyx merged commit 138e59a into ray-project:master Nov 14, 2024
7 checks passed
JP-sDEV pushed a commit to JP-sDEV/ray that referenced this pull request Nov 14, 2024
…es (ray-project#48478)

mohitjain2504 pushed a commit to mohitjain2504/ray that referenced this pull request Nov 15, 2024
…es (ray-project#48478)

dentiny pushed a commit to dentiny/ray that referenced this pull request Dec 7, 2024
…es (ray-project#48478)

Labels: go (add ONLY when ready to merge, run all tests)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Data] Schema error while writing Parquet files
3 participants