
Error while using bucket partitions #274

Closed

moulimukherjee opened this issue Jul 10, 2019 · 6 comments

@moulimukherjee

Seeing the following error while using bucket partitioning:

at org.apache.iceberg.spark.source.Writer$PartitionedWriter.write(Writer.java:381)
at org.apache.iceberg.spark.source.Writer$PartitionedWriter.write(Writer.java:342)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:118)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:116)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:146)
at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:67)
at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:66)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

The relevant code looks like:

val spec = PartitionSpec.builderFor(schema).bucket("_id", 800).build()

Sorting by the column does not help, since it's bucketed using a hash.


rdblue commented Jul 11, 2019

Yeah, you need to get the bucket function to Spark and sort by that. Here's how to create a Spark UDF with the function:

import org.apache.iceberg.transforms.Transforms
import org.apache.iceberg.types.Types

// load the bucket transform from Iceberg to use as a UDF
val bucketTransform = Transforms.bucket[java.lang.Long](Types.LongType.get(), 16)

// needed because Scala has trouble with the Java transform type
def bucketFunc(id: Long): Int = bucketTransform.apply(id)

// create and register a UDF
val bucket16 = spark.udf.register("bucket16", bucketFunc _)

Then you can use it like this:

INSERT INTO table SELECT id, data FROM source ORDER BY bucket16(id)

@moulimukherjee

@rdblue Thanks, let me try it out


rdblue commented Jul 16, 2019

@moulimukherjee, did that solve the problem?

@moulimukherjee

@rdblue Yes, it did. Thanks!

github-actions bot commented Jan 20, 2024

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions bot added the stale label Jan 20, 2024

github-actions bot commented Feb 4, 2024

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.

github-actions bot closed this as not planned Feb 4, 2024