Avoid duplicate sanitization step when reading JSON floats #5879

Merged on Jun 22, 2022 (1 commit)

andygrove
Contributor

Closes #4837

When reading floating-point values from JSON, we use a regular expression to filter out anything that is not a valid JSON floating-point number, and then call GpuCast, which applies its own regular-expression filtering. That second pass is redundant here because the input has already been validated. This PR avoids the redundant check.
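To make the idea concrete, here is a minimal CPU-side sketch in plain Scala. It is not the plugin's code: `JsonFloatSketch`, `sanitizeJson`, `castToDouble`, and the `alreadySanitized` flag are hypothetical names, and both regular expressions are assumed approximations rather than the patterns the plugin uses.

```scala
object JsonFloatSketch {
  // Assumed approximation of the JSON number grammar; not the plugin's actual pattern.
  private val jsonNumber = """-?(0|[1-9]\d*)(\.\d+)?([eE][+-]?\d+)?""".r

  // Assumed, more permissive pattern standing in for the cast's own validation.
  private val castNumber = """[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?""".r

  // Step 1: the reader-side filter. Strings that are not valid JSON numbers become None.
  def sanitizeJson(s: String): Option[String] =
    if (jsonNumber.pattern.matcher(s).matches()) Some(s) else None

  // Step 2: a hypothetical string-to-double cast. It re-validates its input
  // unless the caller states that the input was already sanitized.
  def castToDouble(s: String, alreadySanitized: Boolean): Option[Double] =
    if (alreadySanitized || castNumber.pattern.matcher(s.trim).matches()) Some(s.toDouble)
    else None

  // The JSON read path: validate once, then cast without a second regex pass.
  def readJsonDouble(s: String): Option[Double] =
    sanitizeJson(s).flatMap(v => castToDouble(v, alreadySanitized = true))
}
```

In this sketch the JSON path pays for one regular-expression match per value instead of two, while other callers of the cast can still request validation by passing `alreadySanitized = false`.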

I ran performance tests and see an improvement of roughly 8% in average wall-clock time with these changes:

// test data: 10 million doubles written out as JSON
val df = Range(0, 10000000).map(n => n * 3.141592).toDF("n")
df.write.json("/tmp/pies.json")

// test: read the file back with an explicit schema and time an aggregate
spark.conf.set("spark.rapids.sql.json.read.double.enabled", true)
spark.conf.set("spark.rapids.sql.format.json.enabled", true)
spark.conf.set("spark.rapids.sql.format.json.read.enabled", true)
import org.apache.spark.sql.types._
val schema = StructType(Seq(StructField("n", DataTypes.DoubleType, false)))
val df = spark.read.schema(schema).json("/tmp/pies.json")
df.createOrReplaceTempView("pi")
val df2 = spark.sql("SELECT avg(n) FROM pi")
spark.time(df2.collect)
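`spark.time` prints the elapsed time of a single run. For collecting several runs as numbers, a small helper along these lines could be used; `TimeRuns` and `timeRuns` are hypothetical names and not part of Spark or the plugin.

```scala
object TimeRuns {
  // Hypothetical helper: evaluates `body` `runs` times and returns the
  // elapsed wall-clock time of each run in milliseconds.
  def timeRuns[A](runs: Int)(body: => A): Seq[Long] =
    (1 to runs).map { _ =>
      val start = System.nanoTime()
      body // by-name parameter, so it is re-evaluated on every iteration
      (System.nanoTime() - start) / 1000000L
    }
}

// Possible usage in the session above (assumes df2 is defined):
//   val times = TimeRuns.timeRuns(3)(df2.collect())
```

Measuring with `System.nanoTime` keeps the helper free of Spark dependencies, so it works for any block of code.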

// without optimization
Time taken: 4040 ms
Time taken: 3998 ms
Time taken: 3878 ms

// with optimization
Time taken: 3495 ms
Time taken: 3685 ms
Time taken: 3728 ms
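As a quick check on the size of the improvement, the six timings above can be summarized as follows. This is only arithmetic on the numbers reported in this PR; `TimingSummary` is a hypothetical name, and three runs per configuration is a small sample.

```scala
object TimingSummary {
  // Timings in milliseconds, copied from the runs reported above.
  val withoutOpt: Seq[Int] = Seq(4040, 3998, 3878)
  val withOpt: Seq[Int] = Seq(3495, 3685, 3728)

  def mean(xs: Seq[Int]): Double = xs.sum.toDouble / xs.size

  // Relative reduction in mean wall-clock time, as a percentage.
  val reductionPct: Double =
    (mean(withoutOpt) - mean(withOpt)) / mean(withoutOpt) * 100

  def main(args: Array[String]): Unit = {
    println(f"mean without optimization: ${mean(withoutOpt)}%.0f ms")
    println(f"mean with optimization:    ${mean(withOpt)}%.0f ms")
    println(f"reduction:                 $reductionPct%.1f%%")
  }
}
```

The means come out to 3972 ms and 3636 ms, a reduction of about 8.5%. The slowest optimized run (3728 ms) is still faster than the fastest unoptimized run (3878 ms).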

@andygrove added the performance label (A performance related task/issue) on Jun 21, 2022
@andygrove added this to the Jun 20 - Jul 8 milestone on Jun 21, 2022
@andygrove self-assigned this on Jun 21, 2022
@andygrove changed the title from "avoid duplicate sanitization step when reading JSON floats" to "Avoid duplicate sanitization step when reading JSON floats" on Jun 21, 2022
Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove
Contributor Author

build

@andygrove merged commit 1b05262 into NVIDIA:branch-22.08 on Jun 22, 2022
@andygrove deleted the json-float-opt branch on June 22, 2022 14:29
Successfully merging this pull request may close these issues.

[FEA] Optimize JSON reading of floating-point values