
Error while using bucket partitions #274

Closed

moulimukherjee opened this issue Jul 10, 2019 · 6 comments

@moulimukherjee

Seeing the following error while using bucket partitioning:

at org.apache.iceberg.spark.source.Writer$PartitionedWriter.write(Writer.java:381)
at org.apache.iceberg.spark.source.Writer$PartitionedWriter.write(Writer.java:342)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:118)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:116)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:146)
at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:67)
at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:66)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

The relevant code looks like:

val spec = PartitionSpec.builderFor(schema).bucket("_id", 800).build()

Sorting by the column does not help, since it's bucketed using a hash.


rdblue commented Jul 11, 2019

Yeah, you need to get the bucket function to Spark and sort by that. Here's how to create a Spark UDF with the function:

import org.apache.iceberg.transforms.Transforms
import org.apache.iceberg.types.Types

// load the bucket transform from Iceberg to use as a UDF
val bucketTransform = Transforms.bucket[java.lang.Long](Types.LongType.get(), 16)

// needed because Scala has trouble with the Java transform type
def bucketFunc(id: Long): Int = bucketTransform.apply(id)

// create and register a UDF
val bucket16 = spark.udf.register("bucket16", bucketFunc _)

Then you can use it like this:

INSERT INTO table SELECT id, data FROM source ORDER BY bucket16(id)

@moulimukherjee

@rdblue Thanks, let me try it out


rdblue commented Jul 16, 2019

@moulimukherjee, did that solve the problem?

@moulimukherjee

@rdblue Yes, it did. Thanks!

github-actions bot commented Jan 20, 2024

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions bot added the stale label Jan 20, 2024

github-actions bot commented Feb 4, 2024

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.

github-actions bot closed this as not planned Feb 4, 2024