[Datasets] Add benchmark for TFRecords read #30389
Conversation
import pyarrow as pa

# Convert images from NumPy to bytes
def images_to_bytes(batch):
    images_as_bytes = [image.tobytes() for image in batch]
    return pa.table({"image": images_as_bytes})
I don't have a lot of context here, but why the bytes representation rather than our tensor extension type?
Does ds.write_tfrecords() not work as expected with our tensor extension type?
The row format for TFRecord (tf.train.Example) only supports lists of ints/floats/bytes (doc):

Dict[str, Union[List[bytes], List[int64], List[float]]]
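For illustration (not part of this PR), this is roughly what those three storage types look like when constructing a tf.train.Example directly; the feature names here are made up:

import tensorflow as tf

example = tf.train.Example(
    features=tf.train.Features(
        feature={
            # int64_list, float_list, and bytes_list are the only options.
            "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[7])),
            "score": tf.train.Feature(float_list=tf.train.FloatList(value=[0.5, 1.5])),
            "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"\x00\x01"])),
        }
    )
)
serialized = example.SerializeToString()  # this is what gets written to the file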
Is there a way to get a bytes representation natively from the tensor extension type?
Yeah I suppose that I was imagining the TFRecords datasource would take care of that conversion during the write.
For Arrow 9+ (minor workaround needed for Arrow 6-8), this should return a NumPy ndarray for each element in a tensor extension column:
features[name] = _value_to_feature(arrow_table[name][i].as_py())
And we could have support for NumPy ndarrays when converting Python values to TFRecord feature types:
ray/python/ray/data/datasource/tfrecords_datasource.py
Lines 115 to 137 in 2631806
def _value_to_feature(value: Union[bytes, float, int, List]) -> "tf.train.Feature":
    import tensorflow as tf

    # A Feature stores a list of values.
    # If we have a single value, convert it to a singleton list first.
    values = [value] if not isinstance(value, list) else value
    if not values:
        raise ValueError(
            "Storing an empty value in a tf.train.Feature is not supported."
        )
    elif isinstance(values[0], bytes):
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))
    elif isinstance(values[0], float):
        return tf.train.Feature(float_list=tf.train.FloatList(value=values))
    elif isinstance(values[0], int):
        return tf.train.Feature(int64_list=tf.train.Int64List(value=values))
    else:
        raise ValueError(
            f"Value is of type {type(values[0])}, "
            "which is not a supported tf.train.Feature storage type "
            "(bytes, float, or int)."
        )
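A hypothetical sketch of what that ndarray support could look like (this branch does not exist in the PR; the raw-buffer encoding is an assumption, not the datasource's actual behavior):

import numpy as np

def _ndarray_to_feature(value: "np.ndarray") -> "tf.train.Feature":
    import tensorflow as tf

    # Store the raw array buffer as a single bytes element. Note that dtype
    # and shape are lost here, which is exactly the schema problem discussed
    # below.
    return tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[value.tobytes()])
    )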
In order for the roundtrip write + read to work, when reading, we'd need to be able to cast those bytes features to the correct parameterized tensor extension type, so the user would have to provide a schema on read...
Anyways, since this isn't already implemented as I had expected, writing the ndarrays as bytes for this benchmark seems fine!
@clarkzinzow re:
And we could have support for NumPy ndarrays when converting Python values to TFRecord feature types
I am wondering how we should make sure the schema isn't lost here? We'd need to convert the NumPy ndarray to bytes, right? And we'd also need to add additional information as a TFRecord feature.
Yep, TFRecords doesn't have support for schema/type metadata, unfortunately, so we would need to either:
1. serialize that metadata within the TFRecords data payload,
2. store a metadata sidecar file that contains this metadata, out-of-band for the TFRecord file format, or
3. enforce that the schema is specified on read.
(2) is what TensorFlow Datasets does: https://www.tensorflow.org/datasets/external_tfrecord
We'd have pros and cons to consider for each option, e.g. (1) would make the data written by us only readable by us, which isn't great. This might be mitigated by e.g. using pickle for the serialization, since that provides an easy path for it to be read by other libraries.
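As a rough illustration of option (2), the out-of-band approach: everything here (the sidecar layout, the helper names, the single "image" column) is hypothetical, not an existing Ray API:

import json
import numpy as np

def write_schema_sidecar(sidecar_path: str, dtype, shape) -> None:
    # Record the tensor dtype/shape out-of-band, next to the TFRecord files.
    with open(sidecar_path, "w") as f:
        json.dump(
            {"image": {"dtype": np.dtype(dtype).name, "shape": list(shape)}}, f
        )

def decode_image_feature(sidecar_path: str, raw: bytes) -> np.ndarray:
    # Cast a bytes feature back to an ndarray using the sidecar metadata.
    with open(sidecar_path) as f:
        meta = json.load(f)["image"]
    return np.frombuffer(raw, dtype=meta["dtype"]).reshape(meta["shape"])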
cluster_compute: single_node_benchmark_compute.yaml

run:
  timeout: 1800
What's the ballpark finish time we see in practice? Maybe comment it here so it's clearer what we expect to happen.
It's better to write the expectation in code, as people will not search the PR history to find this out.
@jianoaix - sure, added.
ds = ds.map_batches(images_to_bytes, batch_format="numpy")
tfrecords_dir = tempfile.mkdtemp()
ds.write_tfrecords(tfrecords_dir)
Do you plan to benchmark write_tfrecords? In this case, it looks like it's the write benchmark that makes it clear the read is so much worse.
@jianoaix - right, write_tfrecords can be added later; it should be simple to add given the methods added here.
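A minimal sketch of how such a write benchmark could pair with the read benchmark, reusing the images_to_bytes helper from the diff above (plain timing for illustration, not the actual benchmark harness used in this PR; dataset size and tensor shape are arbitrary):

import tempfile
import time

import ray

# Generate synthetic image-like tensors and convert them to bytes rows.
ds = ray.data.range_tensor(1000, shape=(256, 256, 3))
ds = ds.map_batches(images_to_bytes, batch_format="numpy")

tfrecords_dir = tempfile.mkdtemp()
start = time.perf_counter()
ds.write_tfrecords(tfrecords_dir)
print(f"write: {time.perf_counter() - start:.1f}s")

start = time.perf_counter()
for _ in ray.data.read_tfrecords(tfrecords_dir).iter_batches():
    pass  # force the full read
print(f"read: {time.perf_counter() - start:.1f}s")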
Signed-off-by: Cheng Su <scnju13@gmail.com>
…ray-project#30390) This is to change read_tfrecords output from Pandas to Arrow format. From benchmark ray-project#30389, we found that read_tfrecords is significantly slower than write_tfrecords. Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
This is to add a data benchmark for reading TFRecords files (the read_tfrecords API) on a single node. The input TFRecords files are generated from (1) randomly generated images, and (2) randomly generated int, float, and bytes values. Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Cheng Su scnju13@gmail.com
Why are these changes needed?
This is to add a data benchmark for reading TFRecords files (the read_tfrecords API) on a single node. The input TFRecords files are generated from (1) randomly generated images, and (2) randomly generated int, float, and bytes values. Ran the benchmark and verified it succeeds:
Related issue number
Checks

- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.