[Datasets] Add benchmark for TFRecords read #30389
Conversation
import pyarrow as pa

# Convert images from NumPy to bytes
def images_to_bytes(batch):
    images_as_bytes = [image.tobytes() for image in batch]
    return pa.table({"image": images_as_bytes})
I don't have a lot of context here, but why the bytes representation rather than our tensor extension type?
Does ds.write_tfrecords() not work as expected with our tensor extension type?
The row format for TFRecord (tf.train.Example) only supports lists of ints/floats/bytes (doc):

Dict[str, Union[List[bytes], List[int64], List[float]]]
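For illustration (not part of this PR), this is roughly what those three storage types look like when constructing a tf.train.Example directly; the feature names here are made up:

import tensorflow as tf

example = tf.train.Example(
    features=tf.train.Features(
        feature={
            # int64_list, float_list, and bytes_list are the only options.
            "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[7])),
            "score": tf.train.Feature(float_list=tf.train.FloatList(value=[0.5, 1.5])),
            "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"\x00\x01"])),
        }
    )
)
serialized = example.SerializeToString()  # this is what gets written to the file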
Is there a way to get a bytes representation natively from the tensor extension type?
Yeah I suppose that I was imagining the TFRecords datasource would take care of that conversion during the write.
For Arrow 9+ (minor workaround needed for Arrow 6-8), this should return a NumPy ndarray for each element in a tensor extension column:
features[name] = _value_to_feature(arrow_table[name][i].as_py())
And we could have support for NumPy ndarrays when converting Python values to TFRecord feature types:
ray/python/ray/data/datasource/tfrecords_datasource.py
Lines 115 to 137 in 2631806
def _value_to_feature(value: Union[bytes, float, int, List]) -> "tf.train.Feature":
    import tensorflow as tf

    # A Feature stores a list of values.
    # If we have a single value, convert it to a singleton list first.
    values = [value] if not isinstance(value, list) else value
    if not values:
        raise ValueError(
            "Storing an empty value in a tf.train.Feature is not supported."
        )
    elif isinstance(values[0], bytes):
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))
    elif isinstance(values[0], float):
        return tf.train.Feature(float_list=tf.train.FloatList(value=values))
    elif isinstance(values[0], int):
        return tf.train.Feature(int64_list=tf.train.Int64List(value=values))
    else:
        raise ValueError(
            f"Value is of type {type(values[0])}, "
            "which is not a supported tf.train.Feature storage type "
            "(bytes, float, or int)."
        )
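A hypothetical sketch of what that ndarray support could look like (this branch does not exist in the PR; the raw-buffer encoding is an assumption, not the datasource's actual behavior):

import numpy as np

def _ndarray_to_feature(value: "np.ndarray") -> "tf.train.Feature":
    import tensorflow as tf

    # Store the raw array buffer as a single bytes element. Note that dtype
    # and shape are lost here, which is exactly the schema problem discussed
    # below.
    return tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[value.tobytes()])
    )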
In order for the roundtrip write + read to work, when reading, we'd need to be able to cast those bytes features to the correct parameterized tensor extension type, so the user would have to provide a schema on read...
Anyways, since this isn't already implemented as I had expected, writing the ndarrays as bytes for this benchmark seems fine!
@clarkzinzow re:
And we could have support for NumPy ndarrays when converting Python values to TFRecord feature types
I am wondering how we should make sure the schema isn't lost here? We'd need to convert the NumPy ndarray to bytes, right? And we'd also need to add additional information as a TFRecord feature.
Yep, TFRecords doesn't have support for schema/type metadata, unfortunately, so we would need to either:
1. serialize that metadata within the TFRecords data payload,
2. store a metadata sidecar file that contains this metadata, out-of-band for the TFRecord file format, or
3. enforce that the schema is specified on read.
(2) is what TensorFlow Datasets does: https://www.tensorflow.org/datasets/external_tfrecord
We'd have pros and cons to consider for each option, e.g. (1) would make the data written by us only readable by us, which isn't great. This might be mitigated by e.g. using pickle for the serialization, since that provides an easy path for it to be read by other libraries.
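As a rough illustration of option (2), the out-of-band approach: everything here (the sidecar layout, the helper names, the single "image" column) is hypothetical, not an existing Ray API:

import json
import numpy as np

def write_schema_sidecar(sidecar_path: str, dtype, shape) -> None:
    # Record the tensor dtype/shape out-of-band, next to the TFRecord files.
    with open(sidecar_path, "w") as f:
        json.dump(
            {"image": {"dtype": np.dtype(dtype).name, "shape": list(shape)}}, f
        )

def decode_image_feature(sidecar_path: str, raw: bytes) -> np.ndarray:
    # Cast a bytes feature back to an ndarray using the sidecar metadata.
    with open(sidecar_path) as f:
        meta = json.load(f)["image"]
    return np.frombuffer(raw, dtype=meta["dtype"]).reshape(meta["shape"])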
cluster_compute: single_node_benchmark_compute.yaml

run:
  timeout: 1800
What's the ballpark finish time we see in practice? Maybe comment it here so it's clearer what we expect to happen.
It's better to write the expectation in code, as people will not search the PR history to find this out.
@jianoaix - sure, added.
ds = ds.map_batches(images_to_bytes, batch_format="numpy")
tfrecords_dir = tempfile.mkdtemp()
ds.write_tfrecords(tfrecords_dir)
Do you plan to benchmark write_tfrecords? In this case, it looks like it's the write benchmark that makes it clear the read is so much worse.
@jianoaix - right, write_tfrecords can be added later; it should be simple to add given the methods added here.
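A minimal sketch of how such a write benchmark could pair with the read benchmark, reusing the images_to_bytes helper from the diff above (plain timing for illustration, not the actual benchmark harness used in this PR; dataset size and tensor shape are arbitrary):

import tempfile
import time

import ray

# Generate synthetic image-like tensors and convert them to bytes rows.
ds = ray.data.range_tensor(1000, shape=(256, 256, 3))
ds = ds.map_batches(images_to_bytes, batch_format="numpy")

tfrecords_dir = tempfile.mkdtemp()
start = time.perf_counter()
ds.write_tfrecords(tfrecords_dir)
print(f"write: {time.perf_counter() - start:.1f}s")

start = time.perf_counter()
for _ in ray.data.read_tfrecords(tfrecords_dir).iter_batches():
    pass  # force the full read
print(f"read: {time.perf_counter() - start:.1f}s")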
Signed-off-by: Cheng Su <scnju13@gmail.com>
…ray-project#30390) This is to change read_tfrecords output from Pandas to Arrow format. From benchmark ray-project#30389, we found that read_tfrecords is significantly slower than write_tfrecords. Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
This is to add a data benchmark for reading TFRecords files (the read_tfrecords API) on a single node. The input TFRecords files are generated from (1) randomly generated images, and (2) randomly generated int, float, and bytes values. Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Cheng Su scnju13@gmail.com
Why are these changes needed?
This is to add a data benchmark for reading TFRecords files (the read_tfrecords API) on a single node. The input TFRecords files are generated from (1) randomly generated images, and (2) randomly generated int, float, and bytes values. Ran the benchmark and verified it succeeds:
Related issue number
Checks

- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.