Compress TaskUpdateRequest JSON to avoid running out of native memory #16354
I think it is related but won't fully address the issue. The request sizes I observed ranged from 5 MB to 140 MB in our attempt to root-cause the issue. Individually, each request size is acceptable, but given so many requests (and the number of threads may also have played a role), the total size of these requests, and the native memory reserved as a result, was astonishing at first observation. The key issue is that the large schema is duplicated for each split within the request, and there are 500 of them in the one I observed. Sending multiple smaller requests won't reduce the total size (it may even increase it slightly due to overhead). The schema is NOT shared across the splits within the request; however, when I looked at the code, letting all the splits within a request share the same schema would involve bigger changes. The compression/decompression proposal is hopefully less intrusive, and would decrease native memory allocation by over 10x.
Do you know what in the schema is taking up space?
Attached: split_523.json.txt
Thanks, there is definitely a lot of info there that we don't need and can remove.
There's a patch we have already landed that limits the total size of a single update request: #15721. Also, a couple of releases back we significantly reduced split size for the most common formats (formats that Trino supports natively): #15601. What version are you running? Have you tried updating to the latest version?
@arhimondr I replied above regarding #15721 (I don't think it will help). For #15601, it is not clear to me exactly what is removed as a result of stripUnnecessaryProperties() in ParquetPageSourceFactory.java; maybe you can tell better given the split example I attached? We are running Trino 405.3, which appears to slightly predate #15601.
#15601 should remove everything other than one line.
@dzhi-lyft Per Hive convention, all table and storage properties must be available to create a format reader. Sometimes people store all kinds of things in table and storage properties (such as comments, descriptions, etc.). For the format readers Trino implements natively (e.g., ORC, Parquet, RCFile), the list of properties necessary to create a reader is well known in advance, hence we can trim everything that is known to be unneeded. However, it is much harder to do universally, as it is not always obvious what table properties might be necessary to create a reader for some more exotic file format (for example, for CSV the separator is stored as a table property).
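For readers following along, here is a minimal sketch of the trimming idea described above: keep only an allowlist of properties known to be required to construct a reader for a natively supported format, and drop everything else. The allowlist contents below are illustrative assumptions, not the actual set used in #15601.

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class SchemaTrimmer
{
    // Illustrative allowlist; the real set of required properties differs per format
    private static final Set<String> REQUIRED_KEYS =
            Set.of("columns", "columns.types", "serialization.lib");

    // Keep only the properties a native format reader actually needs;
    // comments, descriptions, and other incidental properties are dropped
    public static Properties stripUnnecessaryProperties(Properties schema)
    {
        Properties trimmed = new Properties();
        for (Map.Entry<Object, Object> entry : schema.entrySet()) {
            if (REQUIRED_KEYS.contains((String) entry.getKey())) {
                trimmed.put(entry.getKey(), entry.getValue());
            }
        }
        return trimmed;
    }
}
```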
@raunaqmorarka I just tried #15601 with the same test query; there is no noticeable difference: sendUpdate sent 1864 requests totaling 116 GB within 4 minutes. The split size is 293 KB and its content is similar to before; I will skip attaching it as it is very similar to the previous one. The stripUnnecessaryProperties() code above (I'm not very sure what it is doing) has a condition on LazyBinaryColumnarSerDe. Is it effective at all?
Oh, wait. Let me ensure the new trino-hive-405.jar is used, redo the test, and get back.
I can now confirm the individual split size is 1427 bytes, a reduction of about 200x. The total size of the TaskUpdateRequest is negligible compared to the 116 GB before. I expect this will indeed fix the OOM-kill problem we observed, though this remains to be verified in production.
The TrinoServer process in our coordinator pod sometimes allocates over 100 GB of native memory (NOT Java heap, which is fine) and is hence OOM-killed by Kubernetes because the pod ran out of memory. We have tracked down the reason: each TaskUpdateRequest may contain hundreds of splits, each with the full table schema (all column types, names, and comments!). In one example, a single request contained 500 splits totaling 140 MB (because the table has 600+ columns, and the exact same schema is duplicated 500 times in the request). There can be thousands of such requests within minutes, each with a JSON body of similar size. As the SocketChannel layer allocates native memory from the direct buffer pool for each request, the native memory reserved can quickly exceed 100 GB within a few minutes, and the TrinoServer process gets OOM-killed.
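As a rough sanity check, a back-of-the-envelope calculation with the approximate figures reported in this thread (500 splits per request, ~293 KB of schema per split, 1864 requests totaling 116 GB) reproduces the observed magnitudes:

```java
public class NativeMemoryEstimate
{
    public static void main(String[] args)
    {
        long splitsPerRequest = 500;
        long bytesPerSplit = 293L * 1024;            // ~293 KB of duplicated schema per split
        long requestBytes = splitsPerRequest * bytesPerSplit;
        // ~143 MB per request, consistent with the ~140 MB request observed
        System.out.printf("per request: ~%d MB%n", requestBytes / (1024 * 1024));

        long totalBytes = 116L * 1024 * 1024 * 1024; // 116 GB sent within 4 minutes
        long requests = 1864;
        // ~63 MB average per request
        System.out.printf("average request: ~%d MB%n", totalBytes / requests / (1024 * 1024));
    }
}
```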
We are eager to hear alternative effective solutions for such issues. One specific proposal is to compress the TaskUpdateRequest JSON with GZIP (which I measured to yield a size reduction of about 12x).
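A minimal sketch of that measurement, assuming a captured request body saved to a local file (the file name is hypothetical): gzip the JSON and compare sizes. Because the same schema is repeated for every split, the body is highly redundant and compresses very well.

```java
import java.io.ByteArrayOutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.GZIPOutputStream;

public class CompressionRatio
{
    public static void main(String[] args) throws Exception
    {
        // Hypothetical file containing a captured TaskUpdateRequest JSON body
        byte[] json = Files.readAllBytes(Paths.get("task_update_request.json"));

        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
            gzip.write(json);
        }
        System.out.printf("raw: %d bytes, gzip: %d bytes, ratio: %.1fx%n",
                json.length, compressed.size(), (double) json.length / compressed.size());
    }
}
```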
Currently, on the sender side, io.airlift.json.JsonCodec is used by sendUpdate() to serialize the object into JSON at https://github.com/trinodb/trino/blob/f6ad54257324bc5b891b03cec9ba82aeda24fb1b/core/trino-main/src/main/java/io/trino/server/remotetask/HttpRemoteTask.java#LL703.
On the receiving worker side, the request is handled by createOrUpdateTask() in core/trino-main/src/main/java/io/trino/server/TaskResource.java (line 140 at commit f6ad542). I don't see existing support to enable GZIP for the TaskUpdateRequest, hence this issue to support GZIP compression for the TaskUpdateRequest. It will avoid these severe OOM-kill incidents caused by the TrinoServer process reserving too much native memory.
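A hedged sketch of what the proposal could look like, using only java.util.zip: compress the serialized bytes before they reach the HTTP layer in sendUpdate(), and decompress on the worker before the body is deserialized for createOrUpdateTask(). The actual wiring into airlift's HTTP client and the JAX-RS resource (e.g., honoring a Content-Encoding: gzip request header) is not shown here and would be part of the real change.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public final class TaskUpdateCompression
{
    private TaskUpdateCompression() {}

    // Coordinator side: wrap the JSON bytes produced by the JsonCodec
    // before handing them to the HTTP request body
    public static byte[] compress(byte[] json) throws IOException
    {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(json);
        }
        return out.toByteArray();
    }

    // Worker side: restore the JSON bytes before deserialization
    public static byte[] decompress(byte[] compressed) throws IOException
    {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gzip.readAllBytes();
        }
    }
}
```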