[Datasets] Add optional `tf_schema` parameter to `read_tfrecords` / `write_tfrecords` methods #32857

Conversation
Signed-off-by: Scott Lee <sjl@anyscale.com>
record[feature_name] = _get_feature_value(feature)
schema_feature_type = None
if tf_schema is not None:
    for schema_feature in tf_schema.feature:
nit: it might be more efficient to convert `tf_schema` into a dict first before looking up the feature type.
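The reviewer's suggestion could look roughly like the sketch below: build a name-to-type map once, so later lookups are O(1) instead of scanning the repeated `tf_schema.feature` field once per record. This is an illustrative sketch, not the Ray implementation; a `SimpleNamespace` stands in for a `tensorflow_metadata.proto.v0.schema_pb2.Schema` message, and `build_feature_type_map` is a hypothetical helper name.

```python
from types import SimpleNamespace

def build_feature_type_map(tf_schema):
    """Map each schema feature name to its declared type (empty if no schema)."""
    if tf_schema is None:
        return {}
    # Each entry in the repeated `feature` field carries a `name` and a `type`.
    return {feature.name: feature.type for feature in tf_schema.feature}

# Stand-in for a Schema proto with two declared features.
schema = SimpleNamespace(
    feature=[
        SimpleNamespace(name="label", type="INT"),
        SimpleNamespace(name="image", type="BYTES"),
    ]
)
```

With this map in hand, the per-record loop becomes a single dict lookup, e.g. `feature_type_map.get(feature_name)`.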
features[name] = _value_to_feature(arrow_table[name][i])
schema_feature_type = None
if tf_schema is not None:
    for schema_feature in tf_schema.feature:
Same here.
from tensorflow_metadata.proto.v0 import schema_pb2

detected_feature_type = {
    "bytes": feature.HasField("bytes_list")
IMO, if a schema is provided, it should be the source of truth. We should ignore the type inferred from the data or even throw an error if the types don't match. WDYT?
Both options make sense to me. Should we emit a warning, but not throw an error if the types mismatch (and use the schema-specified type over the inferred type)? @clarkzinzow @c21 thoughts here?
IMO we should throw an error, as this means the provided schema is inconsistent with the actual schema in the file.
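The resolution the thread converges on (schema as source of truth, error on mismatch) could be sketched as follows. Assumptions: `resolve_feature_type` is a hypothetical helper, and types are represented here as plain strings rather than proto enum values.

```python
def resolve_feature_type(name, inferred_type, schema_type):
    """Pick the feature type, treating a user-provided schema as authoritative.

    - No schema type: fall back to the type inferred from the data.
    - Schema type given and data type inferred: they must agree, else raise.
    - Schema type given and nothing inferred (e.g. empty feature): use the schema.
    """
    if schema_type is None:
        return inferred_type
    if inferred_type is not None and inferred_type != schema_type:
        raise ValueError(
            f"Schema field '{name}' has type {schema_type}, "
            f"but underlying type is {inferred_type}"
        )
    return schema_type
```

The empty-feature branch is the motivating case for this PR: with no values present, inference returns nothing, and only the schema can supply the type.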
    f"{spec_type}, but underlying type is {und_type}",
)
# Override the underlying value type with the type in the user-specified schema.
underlying_feature_type = specified_feature_type
If `schema_feature_type` is not None, should we also change the logic of lines 189-193? Users may not want to unbox the list to a scalar if they specify the schema, and vice versa. We should check with the user on this behavior.
Yes! If `schema_feature_type` is provided, we don't want the auto type conversion. The type should strictly follow the ones defined in the schema. Thanks!
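The behavior agreed on above, keeping values exactly as the schema dictates instead of auto-unboxing single-element lists, could be sketched like this. `maybe_unbox` is an illustrative helper name, not the function in the Ray source.

```python
def maybe_unbox(value, schema_feature_type=None):
    """Unbox a single-element list to a scalar only when no schema type is given."""
    if schema_feature_type is not None:
        # Schema provided: skip automatic list -> scalar conversion entirely.
        return value
    if isinstance(value, list) and len(value) == 1:
        return value[0]
    return value
```

Without a schema, `[7]` becomes `7`; with a schema type specified, `[7]` is kept as a list, so round-tripping a dataset through `write_tfrecords`/`read_tfrecords` with the same schema preserves column shapes.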
LG, cc @ericl to merge.
@scottjlee could you rebase to latest master? Thanks.
Pending tests. Please ping when tests are ready.
Tests are either (a) fixed by #33334, or (b) unrelated to this PR (they look to be related to RL and wandb).
…write_tfrecords` methods (ray-project#32857)

In cases where a TFRecord contains partially or fully missing feature values, we cannot infer the data type for these features. To support these use cases, we add an optional `tf_schema` parameter to the `ray.data.read_tfrecords` and `Dataset.write_tfrecords` methods, which allows users to read TFRecords containing empty features into Ray Datasets (and vice versa, writing Datasets with missing columns to TFRecords with a valid schema).

In addition, this PR also:
- consolidates individual data inputs into common TFRecord/data fixtures
- makes use of several new utility methods to simplify test operations

Signed-off-by: Jack He <jackhe2345@gmail.com>
Why are these changes needed?

In cases where a TFRecord contains partially or fully missing feature values, we cannot infer the data type for these features. To support these use cases, we add an optional `tf_schema` parameter to the `ray.data.read_tfrecords` and `Dataset.write_tfrecords` methods, which allows users to read TFRecords containing empty features into Ray Datasets (and vice versa, writing Datasets with missing columns to TFRecords with a valid schema).

In addition, this PR also:
- consolidates individual data inputs into common TFRecord/data fixtures
- makes use of several new utility methods to simplify test operations
Related issue number
Closes #32756
Checks

- I've signed off every commit (via `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.