Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compress TaskUpdateRequest JSON to avoid running out of native memory #16354

Closed
dzhi-lyft opened this issue Mar 3, 2023 · 12 comments
Closed

Comments

@dzhi-lyft
Copy link

dzhi-lyft commented Mar 3, 2023

The TrinoServer process in our coordinator pod sometimes allocate over 100 GB native memory (NOT Java heap, which is fine) and hence being OOM-killed by k8s due to the pod ran out of memory. We have tracked down the reason being each TaskUpdateRequest may contain hundreds of splits each with the full table schema (all columns types, names and comments!). In one example, a single request contains 500 splits total 140 MB (because the table has 600+ columns, and the exact schema is duplicated 500 times in the request). There could be thousands of such requests with minutes, each with a JSON body of similar size. As the SocketChannel layer allocate native memory from the direct buffer pool for each request, the native memory reserved could quickly exceeds 100 GB within few minutes, and the TrinoServer process get OOM killed.

We are eager to hear alternative effective solutions for such issues. One specific proposal is to compress the TaskUpdateRequest JSON into GZIP (which I measured lead to a size reduction of 12x).

Currently io.airlift.json.JsonCodec is used by sendUpdate() to serialize the object into JSON at https://github.com/trinodb/trino/blob/f6ad54257324bc5b891b03cec9ba82aeda24fb1b/core/trino-main/src/main/java/io/trino/server/remotetask/HttpRemoteTask.java#LL703 on the sender side.

On the receiving worker side, by the time it gets to createOrUpdateTask() at

. The JSON is already deserialized into the taskUpdateRequest object, underneath by JsonCodec likely due to following line in CoordinatorModule.java : jsonCodecBinder(binder).bindJsonCodec(TaskUpdateRequest.class);

I don't see an existing support to enable GZIP for the TaskUpdateRequest hence this issue to support GZIP compression for TaskUpdateRequest. It will avoid these severe OOM kill incidents due to TrinoServer process reserved too much native memory.

@raunaqmorarka
Copy link
Member

Could you take a look at using the feature added in #15721 and see if that solves this problem ?

cc: @sopel39 @Dith3r

@dzhi-lyft
Copy link
Author

I think it is related but won't fully address the issue. The request sizes I observed varies from 5MB~140MB in our attempt to root cause the issue. Individual size wise, it is ok, but given so many requests (and maybe number of threads also played a role), the total size of these requests and the native memory reserved as a result was astonishing at first observation.

The key issue is the large schema is duplicated for each split within the request, and there are 500 of them in the one I observed. Sending multiple smaller requests won't reduce the total size (maybe increase it a little bit due to overhead). The schema is NOT shared across all the split within the request. However when I looked at the code, it would involve bigger changes to possible let all the split within the request share the same schema. The compression/decompression proposal hopefully is less intrusive, and will decrease native memory allocation by over 10x.

@raunaqmorarka
Copy link
Member

raunaqmorarka commented Mar 3, 2023

Do you know what in the schema is taking up space ?
Which release are you on ? #15601 had reduced the size of HiveSplit from large schemas.

@dzhi-lyft
Copy link
Author

split_523.json.txt
Let me attach one (out of 500) of the splits in one of the request to illustrate the schema it containes. It is about 290kb and we can remove the attachment later to save space if necessary. We are running Trino 405.3.

@raunaqmorarka
Copy link
Member

Thanks, there is definitely a lot of info there that we don't need and can remove.
cc: @arhimondr

@arhimondr
Copy link
Contributor

@dzhi-lyft

There's a patch that we have already landed that limits the total size of a single update request: #15721

Also a couple of releases back we significantly reduced split size for most common formats (formats that Trino supports natively): #15601

What version are you running? Have you tried updating to the last version?

@dzhi-lyft
Copy link
Author

@arhimondr I replied above for #15721 (don't think it will help). For #15601, it is not clear to me exactly what is removed as a result of stripUnnecessaryProperties() in ParquetPageSourceFactory.java. maybe you can tell better given the split example I attached? We are running Trino 405.3 which appears slightly before 15601.

@raunaqmorarka
Copy link
Member

#15601 should remove everything other than one line "serialization.lib": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe" and most likely solve this.
Please re-open if you encounter this issue even after upgrade to 406.

@arhimondr
Copy link
Contributor

arhimondr commented Mar 3, 2023

@dzhi-lyft Per Hive convention to create a format reader all table and storage properties must be available. Sometimes people store all kind of things in table and storage properties (such as comments, descriptions, etc.). For the format readers Trino implements natively (e.g.: ORC, Parquet, RCFile) the list of properties necessary to create a reader is well known in advance, hence we can trim everything what is known to be not needed. However it is much harder to do universally, as it is not always obvious what table properties might be necessary to create a reader for some more exotic file format (for example for CSV the separator is stored as a table property)

@dzhi-lyft
Copy link
Author

@raunaqmorarka I just tried 15601 with the same test query, there is no noticeable difference, sendUpdate send in 1864 requests total 116 GB within 4 minutes. The split size 293kb and its content are similar as before. I will skip attach it as it is very similar to the previous one. The above stripUnnecessaryProperties() code, not very sure what it is doing, has the condition of LazyBinaryColumnarSerDe. Is it effective at all?

@dzhi-lyft
Copy link
Author

Oh, wait. let me ensure the new trino-hive-405.jar is used and redo the test and get back.

@dzhi-lyft
Copy link
Author

I confirm now the individual split size 1427 bytes, a reduction of 200x. The total size of TaskUpdateRequest is negligible comparing to the 116 GB before. I do expect this will indeed fix the OOM kill problem as we observed, despite to be verified in production.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants