Compress TaskUpdateRequest JSON to avoid running out of native memory #16354
I think it is related but won't fully address the issue. The request sizes I observed ranged from 5 MB to 140 MB in our attempt to root-cause the issue. Individually, each request size is acceptable, but given so many requests (and the number of threads may also have played a role), the total size of these requests, and the native memory reserved as a result, was astonishing at first observation. The key issue is that the large schema is duplicated for each split within the request, and there are 500 of them in the one I observed. Sending multiple smaller requests won't reduce the total size (it may even increase it slightly due to overhead). The schema is NOT shared across the splits within the request; however, when I looked at the code, letting all the splits within a request share the same schema would involve bigger changes. The compression/decompression proposal is hopefully less intrusive, and would decrease native memory allocation by over 10x.
Do you know what in the schema is taking up space?
Attached: split_523.json.txt
Thanks, there is definitely a lot of info there that we don't need and can remove.
There's a patch we have already landed that limits the total size of a single update request: #15721. Also, a couple of releases back we significantly reduced split size for the most common formats (formats that Trino supports natively): #15601. What version are you running? Have you tried updating to the latest version?
@arhimondr I replied above regarding #15721 (I don't think it will help). For #15601, it is not clear to me exactly what is removed as a result of stripUnnecessaryProperties() in ParquetPageSourceFactory.java; maybe you can tell better given the split example I attached? We are running Trino 405.3, which appears to slightly predate #15601.
#15601 should remove everything other than one line.
@dzhi-lyft Per Hive convention, all table and storage properties must be available to create a format reader. Sometimes people store all kinds of things in table and storage properties (such as comments, descriptions, etc.). For the format readers Trino implements natively (e.g., ORC, Parquet, RCFile), the list of properties necessary to create a reader is well known in advance, hence we can trim everything that is known to be unneeded. However, it is much harder to do universally, as it is not always obvious what table properties might be necessary to create a reader for some more exotic file format (for example, for CSV the separator is stored as a table property).
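For readers following along, here is a minimal sketch of the trimming idea described above: keep only an allowlist of properties known to be required to construct a reader for a natively supported format, and drop everything else. The allowlist contents below are illustrative assumptions, not the actual set used in #15601.

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class SchemaTrimmer
{
    // Illustrative allowlist; the real set of required properties differs per format
    private static final Set<String> REQUIRED_KEYS =
            Set.of("columns", "columns.types", "serialization.lib");

    // Keep only the properties a native format reader actually needs;
    // comments, descriptions, and other incidental properties are dropped
    public static Properties stripUnnecessaryProperties(Properties schema)
    {
        Properties trimmed = new Properties();
        for (Map.Entry<Object, Object> entry : schema.entrySet()) {
            if (REQUIRED_KEYS.contains((String) entry.getKey())) {
                trimmed.put(entry.getKey(), entry.getValue());
            }
        }
        return trimmed;
    }
}
```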
@raunaqmorarka I just tried #15601 with the same test query; there is no noticeable difference: sendUpdate sent 1864 requests totaling 116 GB within 4 minutes. The split size is 293 KB and its content is similar to before; I will skip attaching it as it is very similar to the previous one. The stripUnnecessaryProperties() code above (I'm not very sure what it is doing) has a condition on LazyBinaryColumnarSerDe. Is it effective at all?
Oh, wait. Let me ensure the new trino-hive-405.jar is used, redo the test, and get back.
I can now confirm the individual split size is 1427 bytes, a reduction of about 200x. The total size of the TaskUpdateRequest is negligible compared to the 116 GB before. I expect this will indeed fix the OOM-kill problem we observed, though this remains to be verified in production.
The TrinoServer process in our coordinator pod sometimes allocates over 100 GB of native memory (NOT Java heap, which is fine) and is hence OOM-killed by Kubernetes because the pod ran out of memory. We have tracked down the reason: each TaskUpdateRequest may contain hundreds of splits, each with the full table schema (all column types, names, and comments!). In one example, a single request contained 500 splits totaling 140 MB (because the table has 600+ columns, and the exact same schema is duplicated 500 times in the request). There can be thousands of such requests within minutes, each with a JSON body of similar size. As the SocketChannel layer allocates native memory from the direct buffer pool for each request, the native memory reserved can quickly exceed 100 GB within a few minutes, and the TrinoServer process gets OOM-killed.
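As a rough sanity check, a back-of-the-envelope calculation with the approximate figures reported in this thread (500 splits per request, ~293 KB of schema per split, 1864 requests totaling 116 GB) reproduces the observed magnitudes:

```java
public class NativeMemoryEstimate
{
    public static void main(String[] args)
    {
        long splitsPerRequest = 500;
        long bytesPerSplit = 293L * 1024;            // ~293 KB of duplicated schema per split
        long requestBytes = splitsPerRequest * bytesPerSplit;
        // ~143 MB per request, consistent with the ~140 MB request observed
        System.out.printf("per request: ~%d MB%n", requestBytes / (1024 * 1024));

        long totalBytes = 116L * 1024 * 1024 * 1024; // 116 GB sent within 4 minutes
        long requests = 1864;
        // ~63 MB average per request
        System.out.printf("average request: ~%d MB%n", totalBytes / requests / (1024 * 1024));
    }
}
```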
We are eager to hear alternative effective solutions for such issues. One specific proposal is to compress the TaskUpdateRequest JSON with GZIP (which I measured to yield a size reduction of about 12x).
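A minimal sketch of that measurement, assuming a captured request body saved to a local file (the file name is hypothetical): gzip the JSON and compare sizes. Because the same schema is repeated for every split, the body is highly redundant and compresses very well.

```java
import java.io.ByteArrayOutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.GZIPOutputStream;

public class CompressionRatio
{
    public static void main(String[] args) throws Exception
    {
        // Hypothetical file containing a captured TaskUpdateRequest JSON body
        byte[] json = Files.readAllBytes(Paths.get("task_update_request.json"));

        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
            gzip.write(json);
        }
        System.out.printf("raw: %d bytes, gzip: %d bytes, ratio: %.1fx%n",
                json.length, compressed.size(), (double) json.length / compressed.size());
    }
}
```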
Currently, on the sender side, io.airlift.json.JsonCodec is used by sendUpdate() to serialize the object into JSON at https://github.com/trinodb/trino/blob/f6ad54257324bc5b891b03cec9ba82aeda24fb1b/core/trino-main/src/main/java/io/trino/server/remotetask/HttpRemoteTask.java#LL703.
On the receiving worker side, the request is handled by createOrUpdateTask() in core/trino-main/src/main/java/io/trino/server/TaskResource.java (line 140 at commit f6ad542). I don't see existing support to enable GZIP for the TaskUpdateRequest, hence this issue to support GZIP compression for the TaskUpdateRequest. It will avoid these severe OOM-kill incidents caused by the TrinoServer process reserving too much native memory.
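A hedged sketch of what the proposal could look like, using only java.util.zip: compress the serialized bytes before they reach the HTTP layer in sendUpdate(), and decompress on the worker before the body is deserialized for createOrUpdateTask(). The actual wiring into airlift's HTTP client and the JAX-RS resource (e.g., honoring a Content-Encoding: gzip request header) is not shown here and would be part of the real change.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public final class TaskUpdateCompression
{
    private TaskUpdateCompression() {}

    // Coordinator side: wrap the JSON bytes produced by the JsonCodec
    // before handing them to the HTTP request body
    public static byte[] compress(byte[] json) throws IOException
    {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(json);
        }
        return out.toByteArray();
    }

    // Worker side: restore the JSON bytes before deserialization
    public static byte[] decompress(byte[] compressed) throws IOException
    {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gzip.readAllBytes();
        }
    }
}
```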