Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Datasets] Add optional tf_schema parameter to read_tfrecords / write_tfrecords methods #32857

Merged
merged 15 commits into from
Mar 15, 2023

Conversation

scottjlee
Copy link
Contributor

@scottjlee scottjlee commented Feb 27, 2023

Why are these changes needed?

In cases where a TFRecord contains partially or fully missing feature values, we cannot infer the data type for these features. To support these use cases, we add an optional tf_schema parameter to the ray.data.read_tfrecords and Dataset.write_tfrecords methods, which allows users to read TFRecords containing empty features into Ray Datasets (and vice versa, writing Datasets with missing columns to TFRecords with a valid schema).

In addition, this PR also:

  • consolidates individual data inputs into common TFRecord/data fixtures
  • makes use of several new utility methods to simplify test operations

Related issue number

Closes #32756

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Scott Lee added 6 commits February 27, 2023 10:21
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
@scottjlee scottjlee marked this pull request as ready for review February 28, 2023 19:39
record[feature_name] = _get_feature_value(feature)
schema_feature_type = None
if tf_schema is not None:
for schema_feature in tf_schema.feature:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it might be more efficient to covert tf_schema into a dict first before looking up the feature type.

features[name] = _value_to_feature(arrow_table[name][i])
schema_feature_type = None
if tf_schema is not None:
for schema_feature in tf_schema.feature:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

from tensorflow_metadata.proto.v0 import schema_pb2

detected_feature_type = {
"bytes": feature.HasField("bytes_list")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, if a schema is provided, it should be the source of truth. We should ignore the type inferred from the data or even throw an error if the types don't match. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both options make sense to me. Should we emit a warning, but not throw an error if the types mismatch (and use the schema-specified type over the inferred type)? @clarkzinzow @c21 thoughts here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should throw an error as this means the provided schema is inconsistent with actual schema in file.

Scott Lee added 3 commits March 1, 2023 13:46
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
python/ray/data/datasource/tfrecords_datasource.py Outdated Show resolved Hide resolved
f"{spec_type}, but underlying type is {und_type}",
)
# Override the underlying value type with the type in the user-specified schema.
underlying_feature_type = specified_feature_type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if schema_feature_type is not None, should we also change the logic of line 189-193 that users may not want to unbox the list to scalar, if they specify the schema, and vice versa? We shall check with the user on this behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! If schema_feature_type is provided, we don't want the auto type conversion. The type should strictly follow the ones defined in the schema. Thanks!

Scott Lee added 2 commits March 10, 2023 14:14
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
@scottjlee scottjlee requested review from daikeshi and c21 and removed request for daikeshi March 10, 2023 22:15
Signed-off-by: Scott Lee <sjl@anyscale.com>
Copy link
Contributor

@c21 c21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG, cc @ericl to merge.

@c21
Copy link
Contributor

c21 commented Mar 14, 2023

@scottjlee could you rebase to latest master? thanks.

Signed-off-by: Scott Lee <sjl@anyscale.com>
@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Mar 14, 2023
@ericl
Copy link
Contributor

ericl commented Mar 14, 2023

Pending tests. Please ping when tests are ready.

Scott Lee added 2 commits March 15, 2023 10:03
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
@scottjlee scottjlee added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Mar 15, 2023
@scottjlee
Copy link
Contributor Author

Tests are either (a) fixed by #33334, or (b) unrelated to this PR (looks to be related to RL, wandb)

@ericl ericl merged commit 9a48617 into ray-project:master Mar 15, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request Mar 21, 2023
…write_tfrecords` methods (ray-project#32857)

In cases where a TFRecord contains partially or fully missing feature values, we cannot infer the data type for these features. To support these use cases, we add an optional `tf_schema` parameter to the `ray.data.read_tfrecords` and `Dataset.write_tfrecords` methods, which allows users to read TFRecords containing empty features into Ray Datasets (and vice versa, writing Datasets with missing columns to TFRecords with a valid schema).

In addition, this PR also:
- consolidates individual data inputs into common TFRecord/data fixtures
- makes use of several new utility methods to simplify test operations

Signed-off-by: Jack He <jackhe2345@gmail.com>
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…write_tfrecords` methods (ray-project#32857)

In cases where a TFRecord contains partially or fully missing feature values, we cannot infer the data type for these features. To support these use cases, we add an optional `tf_schema` parameter to the `ray.data.read_tfrecords` and `Dataset.write_tfrecords` methods, which allows users to read TFRecords containing empty features into Ray Datasets (and vice versa, writing Datasets with missing columns to TFRecords with a valid schema).

In addition, this PR also:
- consolidates individual data inputs into common TFRecord/data fixtures
- makes use of several new utility methods to simplify test operations

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
peytondmurray pushed a commit to peytondmurray/ray that referenced this pull request Mar 22, 2023
…write_tfrecords` methods (ray-project#32857)

In cases where a TFRecord contains partially or fully missing feature values, we cannot infer the data type for these features. To support these use cases, we add an optional `tf_schema` parameter to the `ray.data.read_tfrecords` and `Dataset.write_tfrecords` methods, which allows users to read TFRecords containing empty features into Ray Datasets (and vice versa, writing Datasets with missing columns to TFRecords with a valid schema).

In addition, this PR also:
- consolidates individual data inputs into common TFRecord/data fixtures
- makes use of several new utility methods to simplify test operations
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
…write_tfrecords` methods (ray-project#32857)

In cases where a TFRecord contains partially or fully missing feature values, we cannot infer the data type for these features. To support these use cases, we add an optional `tf_schema` parameter to the `ray.data.read_tfrecords` and `Dataset.write_tfrecords` methods, which allows users to read TFRecords containing empty features into Ray Datasets (and vice versa, writing Datasets with missing columns to TFRecords with a valid schema).

In addition, this PR also:
- consolidates individual data inputs into common TFRecord/data fixtures
- makes use of several new utility methods to simplify test operations

Signed-off-by: elliottower <elliot@elliottower.com>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
…write_tfrecords` methods (ray-project#32857)

In cases where a TFRecord contains partially or fully missing feature values, we cannot infer the data type for these features. To support these use cases, we add an optional `tf_schema` parameter to the `ray.data.read_tfrecords` and `Dataset.write_tfrecords` methods, which allows users to read TFRecords containing empty features into Ray Datasets (and vice versa, writing Datasets with missing columns to TFRecords with a valid schema).

In addition, this PR also:
- consolidates individual data inputs into common TFRecord/data fixtures
- makes use of several new utility methods to simplify test operations

Signed-off-by: Jack He <jackhe2345@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Dataset] Add optional schema param to read_tfrecords()
4 participants