Optionally distribute Hive writes on partition keys #579

dain · 2019-04-03T19:46:25Z

Today, when writing Hive tables, Presto arbitrarily distributes the data across the writer nodes. This parallelizes well in the common case of a query writing a single partition, since each worker writes a separate file for that partition. For queries that need to write hundreds or thousands of partitions, this behavior causes problems: each worker ends up writing a file to every partition, so each worker holds hundreds (or thousands) of large output buffers and open file streams.

To resolve this issue, we should add a session property that, when set, changes the InsertLayout and CreateTableLayout in HiveMetadata to declare a partitioning based on the partition keys.
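To make the difference concrete, here is a rough simulation of the two distribution strategies. It is only a sketch: `open_writers`, `total_open_files`, and the CRC32-based routing are hypothetical names and choices made for illustration, not Presto's actual partitioning function or API. It counts how many (worker, partition) file streams end up open under each strategy.

```python
import random
import zlib
from collections import defaultdict


def open_writers(rows, num_workers, distribute_on_partition_key):
    """Map each worker to the set of partitions it must keep a file open for.

    `rows` is a sequence of partition key values, one entry per row written.
    """
    writers = defaultdict(set)
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    for partition_key in rows:
        if distribute_on_partition_key:
            # Assumed layout: hash the partition key so that every row of a
            # given partition is routed to the same worker.
            digest = zlib.crc32(str(partition_key).encode("utf-8"))
            worker = digest % num_workers
        else:
            # Arbitrary distribution: any worker may receive any row.
            worker = rng.randrange(num_workers)
        writers[worker].add(partition_key)
    return writers


def total_open_files(writers):
    """Total number of (worker, partition) file streams across the cluster."""
    return sum(len(partitions) for partitions in writers.values())


if __name__ == "__main__":
    # 1000 partitions with 50 rows each, written by 10 workers.
    rows = [p for p in range(1000) for _ in range(50)]
    for flag in (False, True):
        count = total_open_files(open_writers(rows, 10, flag))
        print(f"distribute_on_partition_key={flag}: {count} open files")
```

With arbitrary distribution the count approaches workers × partitions, while distributing on the partition key keeps it at one file per partition. The cost is that a query writing a single partition would be handled by one worker only, which is why the behavior is proposed as optional.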

findepi · 2019-04-04T08:50:34Z

Duplicates #304

dain added the enhancement (New feature or request) label Apr 3, 2019

findepi closed this as completed Jun 6, 2019

findepi mentioned this issue Jun 6, 2019

Repartition writes across nodes when loading data into partitioned table #304

Closed
