[Datasets] Add optional `tf_schema` parameter to `read_tfrecords` / `write_tfrecords` methods #32857

scottjlee · 2023-02-27T19:33:25Z

Why are these changes needed?

In cases where a TFRecord contains partially or fully missing feature values, we cannot infer the data type for these features. To support these use cases, we add an optional tf_schema parameter to the ray.data.read_tfrecords and Dataset.write_tfrecords methods, which allows users to read TFRecords containing empty features into Ray Datasets (and vice versa, writing Datasets with missing columns to TFRecords with a valid schema).

This PR also:

- consolidates individual data inputs into common TFRecord/data fixtures
- makes use of several new utility methods to simplify test operations
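To illustrate why inference breaks down on missing values, here is a minimal sketch in plain Python. It does not use Ray or TensorFlow; the `Feature` dict, `infer_feature_type`, and `resolve_type` are hypothetical stand-ins invented for this example, and the schema is modelled as a simple name-to-type mapping rather than a real schema proto.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for a TFRecord feature: a dict holding three value
# lists. All three being empty models a missing feature value.
Feature = Dict[str, List]


def infer_feature_type(feature: Feature) -> Optional[str]:
    """Return the kind of the first non-empty value list, or None if all are empty."""
    for kind in ("bytes", "float", "int"):
        if feature.get(kind):
            return kind
    return None


def resolve_type(
    name: str, feature: Feature, schema: Optional[Dict[str, str]] = None
) -> Optional[str]:
    """Prefer a user-supplied schema entry; otherwise fall back to inference."""
    if schema is not None and name in schema:
        return schema[name]
    return infer_feature_type(feature)


empty = {"bytes": [], "float": [], "int": []}

# Without a schema there is nothing to infer a type from.
print(resolve_type("age", empty))
# With a schema, the type is known even though the value is missing.
print(resolve_type("age", empty, {"age": "int"}))
```

The same idea applies in the write direction: a column that is entirely missing carries no type information of its own, so the schema has to supply it.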

Related issue number

Closes #32756

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Scott Lee <sjl@anyscale.com>

daikeshi · 2023-02-28T22:37:11Z

python/ray/data/datasource/tfrecords_datasource.py

-        record[feature_name] = _get_feature_value(feature)
+        schema_feature_type = None
+        if tf_schema is not None:
+            for schema_feature in tf_schema.feature:


nit: it might be more efficient to convert tf_schema into a dict first before looking up the feature type.
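A rough sketch of that suggestion, assuming only what the diff shows: the schema exposes a repeated `feature` field. The `name` and `type` attributes, the `SimpleNamespace` stand-ins, and the helper name `build_type_lookup` are assumptions made for illustration, not the PR's actual code.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the schema object: anything with a `feature`
# sequence whose items carry `name` and `type` attributes.
tf_schema = SimpleNamespace(
    feature=[
        SimpleNamespace(name="image", type="BYTES"),
        SimpleNamespace(name="label", type="INT"),
        SimpleNamespace(name="score", type="FLOAT"),
    ]
)


def build_type_lookup(schema):
    """Build the name -> type mapping once, instead of scanning per feature."""
    return {f.name: f.type for f in schema.feature}


# Built once per read or write call; each lookup is then a constant-time
# dict access rather than a loop over every schema feature.
type_lookup = build_type_lookup(tf_schema)
schema_feature_type = type_lookup.get("label")  # None if the name is absent
print(schema_feature_type)
```

With many features per record, this replaces a nested loop (every record feature against every schema feature) with a single pass over the schema.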

daikeshi · 2023-02-28T22:37:48Z

python/ray/data/datasource/tfrecords_datasource.py

-            features[name] = _value_to_feature(arrow_table[name][i])
+            schema_feature_type = None
+            if tf_schema is not None:
+                for schema_feature in tf_schema.feature:


daikeshi · 2023-02-28T22:57:42Z

python/ray/data/datasource/tfrecords_datasource.py

+    from tensorflow_metadata.proto.v0 import schema_pb2
+
+    detected_feature_type = {
+        "bytes": feature.HasField("bytes_list")


IMO, if a schema is provided, it should be the source of truth. We should ignore the type inferred from the data or even throw an error if the types don't match. WDYT?

Both options make sense to me. Should we emit a warning, but not throw an error if the types mismatch (and use the schema-specified type over the inferred type)? @clarkzinzow @c21 thoughts here?

IMO we should throw an error, as this means the provided schema is inconsistent with the actual schema in the file.
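A minimal sketch of the "schema is the source of truth, error on mismatch" behavior discussed here. The function name `reconcile_feature_type`, the string type labels, and the exact error wording are assumptions for illustration only.

```python
from typing import Optional


def reconcile_feature_type(
    name: str, detected: Optional[str], specified: Optional[str]
) -> Optional[str]:
    """Pick the type to use for a feature.

    detected:  type inferred from the data, or None if the value is empty.
    specified: type from the user-provided schema, or None if no schema entry.
    """
    if specified is None:
        # No schema entry: inference is all we have.
        return detected
    if detected is not None and detected != specified:
        # The data contradicts the schema, so fail loudly instead of guessing.
        raise ValueError(
            f"Feature '{name}': schema specifies type {specified}, "
            f"but underlying type is {detected}"
        )
    # Either the data is empty or it agrees with the schema.
    return specified


print(reconcile_feature_type("label", None, "int"))
print(reconcile_feature_type("label", "int", "int"))
try:
    reconcile_feature_type("label", "bytes", "int")
except ValueError as err:
    print(f"rejected: {err}")
```

Emitting a warning instead would only require replacing the `raise` with a logging call; the error variant surfaces an inconsistent schema as early as possible.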

Signed-off-by: Scott Lee <sjl@anyscale.com>

c21 · 2023-03-09T00:07:23Z

python/ray/data/datasource/tfrecords_datasource.py

+                f"{spec_type}, but underlying type is {und_type}",
+            )
+        # Override the underlying value type with the type in the user-specified schema.
+        underlying_feature_type = specified_feature_type


If schema_feature_type is not None, should we also change the logic on lines 189-193? Users who specify a schema may not want the list unboxed to a scalar, and vice versa. We should check with the user on this behavior.

Yes! If schema_feature_type is provided, we don't want the auto type conversion. The type should strictly follow the ones defined in the schema. Thanks!
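The agreed behavior can be sketched as follows. This assumes the default path unboxes single-element lists to scalars, as the comment above describes; the helper name `to_column_value` and its boolean flag are hypothetical.

```python
from typing import Any, List


def to_column_value(values: List[Any], schema_specified: bool) -> Any:
    """Convert a feature's value list into what is stored in the dataset column.

    Without a schema, a single-element list is unboxed to a scalar for
    convenience. With a schema, the list is always kept as a list so the
    column type strictly follows the schema.
    """
    if not schema_specified and len(values) == 1:
        return values[0]
    return list(values)


print(to_column_value([7], schema_specified=False))     # unboxed scalar
print(to_column_value([7], schema_specified=True))      # stays a list
print(to_column_value([1, 2], schema_specified=False))  # stays a list
```

Keeping lists intact under a schema also avoids a column whose rows mix scalars and lists when some records hold one value and others hold several.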

Signed-off-by: Scott Lee <sjl@anyscale.com>

c21

LG, cc @ericl to merge.

c21 · 2023-03-14T22:19:52Z

@scottjlee could you rebase to latest master? thanks.

Signed-off-by: Scott Lee <sjl@anyscale.com>

ericl · 2023-03-14T23:40:45Z

Pending tests. Please ping when tests are ready.

Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee · 2023-03-15T20:52:07Z

Tests are either (a) fixed by #33334, or (b) unrelated to this PR (looks to be related to RL, wandb)

…write_tfrecords` methods (ray-project#32857) In cases where a TFRecord contains partially or fully missing feature values, we cannot infer the data type for these features. To support these use cases, we add an optional `tf_schema` parameter to the `ray.data.read_tfrecords` and `Dataset.write_tfrecords` methods, which allows users to read TFRecords containing empty features into Ray Datasets (and vice versa, writing Datasets with missing columns to TFRecords with a valid schema). In addition, this PR also: - consolidates individual data inputs into common TFRecord/data fixtures - makes use of several new utility methods to simplify test operations Signed-off-by: Jack He <jackhe2345@gmail.com>

Scott Lee added 6 commits February 27, 2023 10:21

read/write with records, using empty list for all missing values

5b154a4

Signed-off-by: Scott Lee <sjl@anyscale.com>

merge test changes from master

2dfc533

Signed-off-by: Scott Lee <sjl@anyscale.com>

continue merge

726b049

Signed-off-by: Scott Lee <sjl@anyscale.com>

reorganize tests and data into fixtures

2525416

Signed-off-by: Scott Lee <sjl@anyscale.com>

update tests

e1eb519

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into read-tfrecords-schema

5a60ea4

Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee marked this pull request as ready for review February 28, 2023 19:39

scottjlee requested review from ericl, scv119, clarkzinzow, jjyao, jianoaix and c21 as code owners February 28, 2023 19:39

scottjlee assigned c21 Feb 28, 2023

daikeshi reviewed Feb 28, 2023

Scott Lee added 3 commits March 1, 2023 13:46

address comments, additional tests

c5e573a

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into read-tfrecords-schema

e8a09ae

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into read-tfrecords-schema

e3f3ed7

Signed-off-by: Scott Lee <sjl@anyscale.com>

c21 reviewed Mar 9, 2023

Scott Lee added 2 commits March 10, 2023 14:14

disable unwrapping when schema is specified

cb3f9d4

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into read-tfrecords-schema

51521c4

Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee requested review from daikeshi and c21 and removed request for daikeshi March 10, 2023 22:15

Merge branch 'master' into read-tfrecords-schema

67a8db9

Signed-off-by: Scott Lee <sjl@anyscale.com>

c21 approved these changes Mar 14, 2023

c21 assigned ericl Mar 14, 2023

Merge branch 'master' into read-tfrecords-schema

2aa4c63

Signed-off-by: Scott Lee <sjl@anyscale.com>

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Mar 14, 2023

Scott Lee added 2 commits March 15, 2023 10:03

Merge branch 'master' into read-tfrecords-schema

bcc698b

Signed-off-by: Scott Lee <sjl@anyscale.com>

format

918530a

Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Mar 15, 2023

ericl merged commit 9a48617 into ray-project:master Mar 15, 2023

matthewdeng mentioned this pull request Mar 17, 2023

[ci] doc:source/ray-air/doc_code/computer_vision failure #33420

Closed
