
[Datasets] Add benchmark for TFRecords read #30389

Merged (2 commits) on Nov 22, 2022

Conversation

@c21 (Contributor) commented Nov 17, 2022

Signed-off-by: Cheng Su scnju13@gmail.com

Why are these changes needed?

This adds a data benchmark for reading TFRecords files (the read_tfrecords API) on a single node. The input TFRecords files are generated from (1) randomly generated images, and (2) randomly generated int, float, and bytes values.
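As a rough illustration, input data of the kind described above could be generated along these lines (a minimal sketch; the helper names and sizes are hypothetical, not taken from the PR):

```python
import numpy as np

def generate_random_images(num_images, height, width, channels=3):
    # Each image is a random uint8 HxWxC array, as a raw pixel buffer.
    return [
        np.random.randint(0, 256, size=(height, width, channels), dtype=np.uint8)
        for _ in range(num_images)
    ]

def generate_random_records(num_records):
    # Each record mirrors the tf.train.Example value types:
    # int64, float, and bytes values.
    rng = np.random.default_rng()
    return [
        {
            "int_feature": int(rng.integers(0, 2**31)),
            "float_feature": float(rng.random()),
            "bytes_feature": rng.bytes(16),
        }
        for _ in range(num_records)
    ]

images = generate_random_images(4, 256, 256)
records = generate_random_records(4)
```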

Ran the benchmark and verified it succeeded:

Running benchmark: read-tfrecords
Read->Map_Batches: 100%|███████████████████████████████████████████████| 32/32 [00:04<00:00,  7.84it/s]
Write Progress: 100%|██████████████████████████████████████████████████| 32/32 [00:06<00:00,  4.78it/s]
Read->Map_Batches: 100%|█████████████████████████████████████████████| 100/100 [00:03<00:00, 32.52it/s]
Write Progress: 100%|████████████████████████████████████████████████| 100/100 [00:05<00:00, 19.03it/s]
Read->Map_Batches: 100%|███████████████████████████████████████████████| 51/51 [00:02<00:00, 19.91it/s]
Write Progress: 100%|██████████████████████████████████████████████████| 51/51 [00:05<00:00,  9.85it/s]
Read->Map_Batches: 100%|███████████████████████████████████████████████| 80/80 [02:53<00:00,  2.17s/it]
Write Progress: 100%|██████████████████████████████████████████████████| 80/80 [03:46<00:00,  2.84s/it]
Read->Map_Batches: 100%|███████████████████████████████████████████████| 80/80 [01:07<00:00,  1.19it/s]
Write Progress: 100%|██████████████████████████████████████████████████| 80/80 [03:24<00:00,  2.56s/it]
Read->Map_Batches: 100%|███████████████████████████████████████████████| 32/32 [00:21<00:00,  1.51it/s]
Write Progress: 100%|██████████████████████████████████████████████████| 32/32 [00:16<00:00,  1.89it/s]
Running case: tfrecords-images-100-256
Read progress: 100%|███████████████████████████████████████████████████| 32/32 [00:03<00:00, 10.57it/s]
Result of case tfrecords-images-100-256: {'time': 3.16600097999617}
Running case: tfrecords-images-100-2048
Read progress: 100%|█████████████████████████████████████████████████| 100/100 [00:14<00:00,  6.80it/s]
Result of case tfrecords-images-100-2048: {'time': 15.941341369005386}
Running case: tfrecords-images-1000-mix
Read progress: 100%|███████████████████████████████████████████████████| 51/51 [00:01<00:00, 34.99it/s]
Result of case tfrecords-images-1000-mix: {'time': 1.7659148280072259}
Running case: tfrecords-random-int-1g
Read progress: 100%|███████████████████████████████████████████████████| 80/80 [03:03<00:00,  2.29s/it]
Result of case tfrecords-random-int-1g: {'time': 204.05738306099374}
Running case: tfrecords-images-float-1g
Read progress: 100%|███████████████████████████████████████████████████| 80/80 [02:47<00:00,  2.09s/it]
Result of case tfrecords-images-float-1g: {'time': 186.0372944809933}
Running case: tfrecords-images-bytes-1g
Read progress: 100%|███████████████████████████████████████████████████| 32/32 [00:17<00:00,  1.83it/s]
Result of case tfrecords-images-bytes-1g: {'time': 22.197236428997712}
Finish benchmark: read-tfrecords

real    21m47.484s
user    2m30.749s
sys     0m58.017s

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@c21 (Contributor Author) commented Nov 17, 2022

This depends on #30284. After #30284 is merged, this can be merged.

# Convert images from NumPy to bytes
def images_to_bytes(batch):
    images_as_bytes = [image.tobytes() for image in batch]
    return pa.table({"image": images_as_bytes})
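A point worth noting about this conversion (my illustration, not part of the PR): `tobytes()` keeps only the raw pixel buffer, so the shape and dtype must be known out-of-band to reconstruct the array. This is what the schema discussion below is about.

```python
import numpy as np

# A tiny 2x2 RGB "image" with known contents.
image = np.arange(12, dtype=np.uint8).reshape(2, 2, 3)

# tobytes() serializes only the raw buffer; shape and dtype are lost.
raw = image.tobytes()

# Reconstruction works only if the reader already knows dtype and shape.
restored = np.frombuffer(raw, dtype=np.uint8).reshape(2, 2, 3)
assert np.array_equal(image, restored)
```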
Contributor:

I don't have a lot of context here, but why the bytes representation rather than our tensor extension type?

Contributor:

Does ds.write_tfrecords() not work as expected with our tensor extension type?

@c21 (Contributor Author) Nov 17, 2022:

The row format for TFRecord (tf.train.Example) only supports lists of ints/floats/bytes (doc):

Dict[str,
     Union[List[bytes],
           List[int64],
           List[float]]]
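Concretely, a row that fits this format looks like the following (a sketch of my own to illustrate the constraint; the field names are made up):

```python
# A row matching Dict[str, Union[List[bytes], List[int], List[float]]]:
# every value is a flat, non-empty list of one of the three types.
row = {
    "label": [1],                    # List[int64]
    "scores": [0.25, 0.75],          # List[float]
    "image": [b"\x00\x01\x02\x03"],  # List[bytes]
}

def is_valid_tfrecord_row(row):
    # Checks the flat list-of-scalars shape; no tensors or nesting allowed.
    for values in row.values():
        if not isinstance(values, list) or not values:
            return False
        first = type(values[0])
        if first not in (bytes, int, float):
            return False
        if not all(type(v) is first for v in values):
            return False
    return True

assert is_valid_tfrecord_row(row)
```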

Is there a way to get bytes representation natively from tensor extension type?

Contributor:

Yeah I suppose that I was imagining the TFRecords datasource would take care of that conversion during the write.

For Arrow 9+ (a minor workaround is needed for Arrow 6-8), this should return a NumPy ndarray for each element in a tensor extension column:

features[name] = _value_to_feature(arrow_table[name][i].as_py())

And we could have support for NumPy ndarrays when converting Python values to TFRecord feature types:

def _value_to_feature(value: Union[bytes, float, int, List]) -> "tf.train.Feature":
    import tensorflow as tf

    # A Feature stores a list of values.
    # If we have a single value, convert it to a singleton list first.
    values = [value] if not isinstance(value, list) else value
    if not values:
        raise ValueError(
            "Storing an empty value in a tf.train.Feature is not supported."
        )
    elif isinstance(values[0], bytes):
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))
    elif isinstance(values[0], float):
        return tf.train.Feature(float_list=tf.train.FloatList(value=values))
    elif isinstance(values[0], int):
        return tf.train.Feature(int64_list=tf.train.Int64List(value=values))
    else:
        raise ValueError(
            f"Value is of type {type(values[0])}, "
            "which is not a supported tf.train.Feature storage type "
            "(bytes, float, or int)."
        )

In order for the roundtrip write + read to work, when reading, we'd need to be able to cast those bytes features to the correct parameterized tensor extension type, so the user would have to provide a schema on read...
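The ndarray branch being imagined here might look something like the following (my own sketch, without the TensorFlow dependency; the real `_value_to_feature` would wrap the result in a `tf.train.Feature`/`BytesList` rather than returning a plain list):

```python
import numpy as np

def value_to_storage(value):
    # Hypothetical sketch of ndarray support: a tensor is flattened to
    # its raw bytes and stored as a singleton bytes list, which is why
    # the dtype/shape must be recovered from a schema on read.
    if isinstance(value, np.ndarray):
        return [value.tobytes()]
    if not isinstance(value, list):
        return [value]
    return value
```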

Contributor:

Anyways, since this isn't already implemented as I had expected, writing the ndarrays as bytes for this benchmark seems fine!

@c21 (Contributor Author) Nov 17, 2022:

@clarkzinzow re:

And we could have support for NumPy ndarrays when converting Python values to TFRecord feature types

I am wondering how we should make sure the schema is not lost here. We need to convert the NumPy ndarray to bytes, right? We would also need to add the extra information as a TFRecord feature.

@clarkzinzow (Contributor) Nov 17, 2022:

Yep, TFRecords unfortunately doesn't support schema/type metadata, so we would need to either:

  1. serialize that metadata within the TFRecords data payload,
  2. store a metadata sidecar file that contains this metadata, out-of-band for the TFRecord file format,
  3. enforce that schema is specified on-read.

(2) is what TensorFlow Datasets does: https://www.tensorflow.org/datasets/external_tfrecord

We'd have pros and cons to consider for each option, e.g. (1) would make the data written by us only readable by us, which isn't great. This might be mitigated by e.g. using pickle for the serialization, since that provides an easy path for it to be read by other libraries.
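Option (2) could be sketched roughly like this (my illustration only; the `_schema.json` filename and the dtype/shape layout are assumptions, not anything Ray or TensorFlow Datasets actually defines):

```python
import json
import os
import tempfile

def write_schema_sidecar(directory, schema):
    # Option (2): store tensor dtype/shape metadata in a sidecar file
    # next to the TFRecord files, out-of-band of the file format itself.
    # "_schema.json" is an assumed name for illustration.
    path = os.path.join(directory, "_schema.json")
    with open(path, "w") as f:
        json.dump(schema, f)
    return path

def read_schema_sidecar(directory):
    # On read, the schema tells us how to cast bytes features back to
    # parameterized tensor types.
    with open(os.path.join(directory, "_schema.json")) as f:
        return json.load(f)

d = tempfile.mkdtemp()
write_schema_sidecar(d, {"image": {"dtype": "uint8", "shape": [256, 256, 3]}})
schema = read_schema_sidecar(d)
```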

cluster_compute: single_node_benchmark_compute.yaml

run:
timeout: 1800
Contributor:

What is the ballpark finish time we can expect in practice? Maybe comment it here to make it clearer what we expect to happen.

@c21 (Contributor Author):

@jianoaix - as in PR description, it's around 21 minutes with #30390 fix.

Contributor:

It's better to write the expectation in code, as people will not search the PR history to find this out.

@c21 (Contributor Author):

@jianoaix - sure, added.


ds = ds.map_batches(images_to_bytes, batch_format="numpy")
tfrecords_dir = tempfile.mkdtemp()
ds.write_tfrecords(tfrecords_dir)
Contributor:

Do you plan to benchmark write_tfrecords? In this case, it looks like it's the write benchmark that makes it clear the read is so much slower.

@c21 (Contributor Author):

@jianoaix - right, write_tfrecords can be added later; it should be simple given the methods added here.

c21 added 2 commits November 21, 2022 10:06
Signed-off-by: Cheng Su <scnju13@gmail.com>
Signed-off-by: Cheng Su <scnju13@gmail.com>
clarkzinzow pushed a commit that referenced this pull request Nov 21, 2022
…#30390)

This changes the read_tfrecords output from Pandas to Arrow format. The benchmark in #30389 found that read_tfrecords is significantly slower than write_tfrecords.
@c21 c21 added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Nov 21, 2022
@clarkzinzow clarkzinzow merged commit 55e22e4 into ray-project:master Nov 22, 2022
@c21 c21 deleted the read-benchmark branch November 22, 2022 21:32
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
…ray-project#30390)

This changes the read_tfrecords output from Pandas to Arrow format. The benchmark in ray-project#30389 found that read_tfrecords is significantly slower than write_tfrecords.

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
This adds a data benchmark for reading TFRecords files (the read_tfrecords API) on a single node. The input TFRecords files are generated from (1) randomly generated images, and (2) randomly generated int, float, and bytes values.

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
3 participants