[SQL][TEST] Re-run collation benchmark
### What changes were proposed in this pull request?
Re-running the collation benchmark with two modifications:

- UTF8_BINARY_LCASE has been renamed to UTF8_LCASE in apache#46924
- UTF8_BINARY should appear first in the collation benchmark results, so performance is relative to it
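
The second point can be checked against the published numbers. The following is a minimal sketch, assuming the `Relative` column is the first row's time divided by each row's time, which is consistent with the result tables in this commit. `RelativeColumn` and its members are hypothetical names for illustration, not part of Spark's benchmark utilities.

```scala
import java.util.Locale

// Hypothetical helper, not Spark's Benchmark class: shows how a "Relative"
// column can be derived once the baseline row comes first.
object RelativeColumn {
  // (collation name, per-row time in ns), copied from the new JDK 21
  // equalsFunction results in this commit.
  val perRowNs: Seq[(String, Double)] = Seq(
    "UTF8_BINARY" -> 13551.1,
    "UTF8_LCASE" -> 49826.4,
    "UNICODE" -> 182120.9,
    "UNICODE_CI" -> 175677.2
  )

  // Assumption: relative speed = baseline time / this row's time,
  // where the baseline is simply the first row.
  def relative(rows: Seq[(String, Double)]): Seq[(String, String)] = {
    val baseline = rows.head._2
    rows.map { case (name, ns) =>
      name -> String.format(Locale.ROOT, "%.1fX", Double.box(baseline / ns))
    }
  }

  def main(args: Array[String]): Unit =
    relative(perRowNs).foreach { case (name, rel) => println(s"$name $rel") }
}
```

With UTF8_BINARY first, every other row reads directly as a fraction of binary-collation speed.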

### Why are the changes needed?
We've changed the meaning of the LCASE collation in Spark, and also modified how equality checks, hashing, and expressions work with this collation, so we need to re-run the benchmarks and identify areas for improvement.
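
To illustrate why equality and hashing need to be measured together, here is a small sketch of a case-insensitive comparison in which both operations go through the same lowercase key. This is a hypothetical illustration only: `LowercaseKeySketch` and its methods are invented names, and the code does not reproduce Spark's actual UTF8_LCASE implementation.

```scala
import java.util.Locale

// Hypothetical illustration; NOT Spark's UTF8_LCASE implementation.
object LowercaseKeySketch {
  // Assumption: the comparison key is the locale-independent lowercase form.
  def key(s: String): String = s.toLowerCase(Locale.ROOT)

  // Two strings are equal when their lowercase keys are equal.
  def equalsIgnoringCase(a: String, b: String): Boolean = key(a) == key(b)

  // Hashing the same key keeps hashing consistent with equality:
  // strings that compare equal always produce the same hash.
  def hash(s: String): Int = key(s).hashCode
}
```

Both operations do extra per-row work compared with a plain byte comparison, which is the kind of overhead these benchmarks are meant to expose.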

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47030 from uros-db/collation-benchmarks.

Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
uros-db authored and attilapiros committed Oct 4, 2024
1 parent b015d73 commit aaaf90d
Showing 3 changed files with 61 additions and 61 deletions.
60 changes: 30 additions & 30 deletions sql/core/benchmarks/CollationBenchmark-jdk21-results.txt
@@ -1,54 +1,54 @@
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - equalsFunction: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 --------------------------------------------------------------------------------------------------------------------------
-UTF8_BINARY_LCASE 2948 2958 13 0.0 29483.6 1.0X
-UNICODE 2040 2042 3 0.0 20396.6 1.4X
-UTF8_BINARY 2043 2043 0 0.0 20426.3 1.4X
-UNICODE_CI 16318 16338 28 0.0 163178.4 0.2X
+UTF8_BINARY 1355 1358 4 0.1 13551.1 1.0X
+UTF8_LCASE 4983 4984 3 0.0 49826.4 0.3X
+UNICODE 18212 18220 12 0.0 182120.9 0.1X
+UNICODE_CI 17568 17577 14 0.0 175677.2 0.1X

-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - compareFunction: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ---------------------------------------------------------------------------------------------------------------------------
-UTF8_BINARY_LCASE 3227 3228 1 0.0 32272.1 1.0X
-UNICODE 16637 16643 9 0.0 166367.7 0.2X
-UTF8_BINARY 3132 3137 7 0.0 31319.2 1.0X
-UNICODE_CI 17816 17829 18 0.0 178162.4 0.2X
+UTF8_BINARY 1772 1774 3 0.1 17722.3 1.0X
+UTF8_LCASE 4365 4365 0 0.0 43649.6 0.4X
+UNICODE 16538 16544 9 0.0 165375.5 0.1X
+UNICODE_CI 16296 16305 12 0.0 162961.9 0.1X

-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - hashFunction: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-UTF8_BINARY_LCASE 4824 4824 0 0.0 48243.7 1.0X
-UNICODE 69416 69475 84 0.0 694158.3 0.1X
-UTF8_BINARY 3806 3808 2 0.0 38062.8 1.3X
-UNICODE_CI 60943 60975 45 0.0 609426.2 0.1X
+UTF8_BINARY 7279 7280 1 0.0 72791.2 1.0X
+UTF8_LCASE 18538 18543 6 0.0 185381.0 0.4X
+UNICODE 71514 71520 8 0.0 715144.6 0.1X
+UNICODE_CI 60488 60488 0 0.0 604880.9 0.1X

-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - contains: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-UTF8_BINARY_LCASE 11979 11980 1 0.0 119790.4 1.0X
-UNICODE 6469 6474 7 0.0 64694.8 1.9X
-UTF8_BINARY 7253 7253 1 0.0 72528.3 1.7X
-UNICODE_CI 319124 319881 1070 0.0 3191244.0 0.0X
+UTF8_BINARY 7516 7519 4 0.0 75162.9 1.0X
+UTF8_LCASE 120330 120338 12 0.0 1203299.2 0.1X
+UNICODE 371784 371946 228 0.0 3717840.7 0.0X
+UNICODE_CI 427401 427547 207 0.0 4274009.0 0.0X

-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - startsWith: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-UTF8_BINARY_LCASE 11584 11595 15 0.0 115841.4 1.0X
-UNICODE 6155 6156 2 0.0 61548.7 1.9X
-UTF8_BINARY 6979 6982 5 0.0 69785.6 1.7X
-UNICODE_CI 318228 318726 705 0.0 3182275.2 0.0X
+UTF8_BINARY 6504 6507 3 0.0 65044.6 1.0X
+UTF8_LCASE 60331 60359 40 0.0 603313.9 0.1X
+UNICODE 369394 369404 13 0.0 3693943.0 0.0X
+UNICODE_CI 427382 427421 55 0.0 4273819.7 0.0X

-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - endsWith: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-UTF8_BINARY_LCASE 11655 11664 12 0.0 116552.8 1.0X
-UNICODE 6235 6239 5 0.0 62350.8 1.9X
-UTF8_BINARY 7066 7069 5 0.0 70658.1 1.6X
-UNICODE_CI 313515 313999 685 0.0 3135149.1 0.0X
+UTF8_BINARY 6600 6601 1 0.0 66002.7 1.0X
+UTF8_LCASE 58723 58751 39 0.0 587230.1 0.1X
+UNICODE 379668 379789 172 0.0 3796677.7 0.0X
+UNICODE_CI 437119 437194 106 0.0 4371189.5 0.0X

60 changes: 30 additions & 30 deletions sql/core/benchmarks/CollationBenchmark-results.txt
@@ -1,54 +1,54 @@
-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1022-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - equalsFunction: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 --------------------------------------------------------------------------------------------------------------------------
-UTF8_BINARY_LCASE 3571 3576 7 0.0 35708.8 1.0X
-UNICODE 2235 2240 7 0.0 22349.2 1.6X
-UTF8_BINARY 2237 2242 6 0.0 22371.7 1.6X
-UNICODE_CI 18733 18817 118 0.0 187333.8 0.2X
+UTF8_BINARY 1370 1370 1 0.1 13698.4 1.0X
+UTF8_LCASE 4836 4836 0 0.0 48359.5 0.3X
+UNICODE 19239 19271 45 0.0 192391.8 0.1X
+UNICODE_CI 18936 18954 25 0.0 189362.4 0.1X

-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1022-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - compareFunction: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ---------------------------------------------------------------------------------------------------------------------------
-UTF8_BINARY_LCASE 4260 4290 41 0.0 42602.6 1.0X
-UNICODE 19536 19624 124 0.0 195360.2 0.2X
-UTF8_BINARY 3582 3612 43 0.0 35818.5 1.2X
-UNICODE_CI 20381 20454 103 0.0 203814.1 0.2X
+UTF8_BINARY 1726 1727 1 0.1 17260.4 1.0X
+UTF8_LCASE 6293 6304 16 0.0 62927.1 0.3X
+UNICODE 18677 18679 4 0.0 186768.3 0.1X
+UNICODE_CI 18488 18504 23 0.0 184879.6 0.1X

-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1022-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - hashFunction: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-UTF8_BINARY_LCASE 7347 7349 3 0.0 73467.1 1.0X
-UNICODE 73462 73608 206 0.0 734623.2 0.1X
-UTF8_BINARY 5775 5815 57 0.0 57746.0 1.3X
-UNICODE_CI 57543 57619 108 0.0 575425.2 0.1X
+UTF8_BINARY 3028 3029 1 0.0 30283.4 1.0X
+UTF8_LCASE 19773 19830 81 0.0 197726.4 0.2X
+UNICODE 68565 68594 41 0.0 685646.9 0.0X
+UNICODE_CI 53100 53101 2 0.0 530996.0 0.1X

-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1022-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - contains: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-UTF8_BINARY_LCASE 15415 15424 13 0.0 154147.1 1.0X
-UNICODE 8091 8108 25 0.0 80907.9 1.9X
-UTF8_BINARY 8964 8979 21 0.0 89643.5 1.7X
-UNICODE_CI 469123 474822 8060 0.0 4691227.7 0.0X
+UTF8_BINARY 7024 7026 3 0.0 70244.6 1.0X
+UTF8_LCASE 118693 118703 15 0.0 1186926.5 0.1X
+UNICODE 385409 386299 1257 0.0 3854093.7 0.0X
+UNICODE_CI 434618 435527 1285 0.0 4346181.0 0.0X

-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1022-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - startsWith: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-UTF8_BINARY_LCASE 13064 13080 23 0.0 130635.2 1.0X
-UNICODE 6836 6851 22 0.0 68360.1 1.9X
-UTF8_BINARY 7693 7719 36 0.0 76933.9 1.7X
-UNICODE_CI 488919 495530 9349 0.0 4889190.5 0.0X
+UTF8_BINARY 6069 6090 29 0.0 60691.9 1.0X
+UTF8_LCASE 61809 61828 27 0.0 618094.5 0.1X
+UNICODE 370523 371729 1705 0.0 3705229.7 0.0X
+UNICODE_CI 435805 436945 1612 0.0 4358051.5 0.0X

-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1022-azure
 AMD EPYC 7763 64-Core Processor
 collation unit benchmarks - endsWith: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-UTF8_BINARY_LCASE 13097 13112 21 0.0 130970.4 1.0X
-UNICODE 6960 6985 34 0.0 69603.9 1.9X
-UTF8_BINARY 7766 7768 3 0.0 77663.5 1.7X
-UNICODE_CI 456956 470733 19485 0.0 4569556.7 0.0X
+UTF8_BINARY 6725 6732 10 0.0 67247.9 1.0X
+UTF8_LCASE 54990 55010 28 0.0 549896.0 0.1X
+UNICODE 380872 383258 3375 0.0 3808722.0 0.0X
+UNICODE_CI 443911 444111 283 0.0 4439112.3 0.0X

@@ -24,7 +24,7 @@ import org.apache.spark.unsafe.types.UTF8String

 abstract class CollationBenchmarkBase extends BenchmarkBase {
   protected val collationTypes: Seq[String] =
-    Seq("UTF8_LCASE", "UNICODE", "UTF8_BINARY", "UNICODE_CI")
+    Seq("UTF8_BINARY", "UTF8_LCASE", "UNICODE", "UNICODE_CI")

   def generateSeqInput(n: Long): Seq[UTF8String]