[SPARK-48748][SQL] Cache numChars in UTF8String
### What changes were proposed in this pull request?
Cache the `numChars` value in a thread-safe way, using a `volatile` field that is populated on the first call to `numChars()`.

### Why are the changes needed?
Faster access to the `numChars()` method, which currently requires a scan of the entire UTF8String on every call.
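
To illustrate why the uncached count is linear in the byte length, here is a minimal sketch of counting code points by stepping over UTF-8 lead bytes. It is not Spark's code: the class `Utf8CountSketch` and the helpers `sequenceLength` and `countCodePoints` are hypothetical names, and treating malformed lead bytes as width 1 is an assumption of this sketch.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not part of UTF8String.
final class Utf8CountSketch {
    // Width in bytes of the UTF-8 sequence that starts with lead byte b.
    // Assumption: continuation or invalid lead bytes are stepped over one at a time.
    static int sequenceLength(byte b) {
        int v = b & 0xFF;
        if (v < 0x80) return 1;               // 0xxxxxxx: ASCII
        if (v >= 0xC0 && v < 0xE0) return 2;  // 110xxxxx
        if (v >= 0xE0 && v < 0xF0) return 3;  // 1110xxxx
        if (v >= 0xF0 && v < 0xF8) return 4;  // 11110xxx
        return 1;                             // continuation or invalid byte
    }

    // Every call walks the whole array: one step per code point.
    static int countCodePoints(byte[] utf8) {
        int count = 0;
        for (int i = 0; i < utf8.length; i += sequenceLength(utf8[i])) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'h', U+00E9, 'l', 'l', 'o', U+20AC
        byte[] bytes = "h\u00e9llo\u20ac".getBytes(StandardCharsets.UTF_8);
        // prints "9 bytes, 6 code points"
        System.out.println(bytes.length + " bytes, " + countCodePoints(bytes) + " code points");
    }
}
```

Because the byte length and the code point count differ for any non-ASCII input, the count cannot be derived from `numBytes` alone, which is what makes caching the result worthwhile.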

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47142 from uros-db/cache-numchars.

Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com>
Signed-off-by: Kent Yao <yao@apache.org>
uros-db authored and yaooqinn committed Jul 1, 2024
1 parent f49418b commit 0487d78
Showing 1 changed file with 11 additions and 0 deletions.
@@ -59,6 +59,7 @@ public final class UTF8String implements Comparable<UTF8String>, Externalizable,
  private Object base;
  private long offset;
  private int numBytes;
  private volatile int numChars = -1;

  public Object getBaseObject() { return base; }
  public long getBaseOffset() { return offset; }
@@ -254,6 +255,16 @@ public int numBytes() {
   * Returns the number of code points in it.
   */
  public int numChars() {
    if (numChars == -1) numChars = getNumChars();
    return numChars;
  }

  /**
   * Private helper method to calculate the number of code points in the UTF-8 string. Counting
   * the code points is a linear time operation, as we need to scan the entire UTF-8 string.
   * Hence, this method should generally only be called once for non-empty UTF-8 strings.
   */
  private int getNumChars() {
    int len = 0;
    for (int i = 0; i < numBytes; i += numBytesForFirstByte(getByte(i))) {
      len += 1;

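The caching pattern in this diff can be sketched in isolation as follows. This is a standalone illustration under stated assumptions, not the `UTF8String` implementation: the class `CachedLengthSketch`, its `scans` counter, and the continuation-byte counting in `computeNumChars` are all hypothetical and exist only to make the behavior observable. Unlike the committed code, the sketch reads the `volatile` field once into a local, which is a common variation rather than what the commit does.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of a lazily cached, volatile-backed count.
final class CachedLengthSketch {
    private final byte[] bytes;
    // -1 is the "not yet computed" sentinel; a real count is never negative.
    private volatile int numChars = -1;
    // Instrumentation for the demo only: how many full scans were performed.
    final AtomicInteger scans = new AtomicInteger();

    CachedLengthSketch(String s) {
        this.bytes = s.getBytes(StandardCharsets.UTF_8);
    }

    int numChars() {
        int cached = numChars;          // single volatile read
        if (cached == -1) {
            cached = computeNumChars(); // linear scan, done lazily
            numChars = cached;          // publish the result to other threads
        }
        return cached;
    }

    private int computeNumChars() {
        scans.incrementAndGet();
        int count = 0;
        for (byte b : bytes) {
            // Count every byte that is not a 10xxxxxx continuation byte.
            if ((b & 0xC0) != 0x80) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        CachedLengthSketch s = new CachedLengthSketch("h\u00e9llo");
        System.out.println(s.numChars());   // prints 5
        System.out.println(s.numChars());   // prints 5, served from the cache
        System.out.println(s.scans.get());  // prints 1
    }
}
```

No lock is taken, so two threads that call `numChars()` at the same time may both run the scan. Both compute the same value from the same bytes, so the duplicated work is harmless, and `volatile` ensures a later reader sees the stored count rather than a stale `-1`. The sketch assumes the bytes never change after construction; a class whose contents can be replaced would also need to reset the sentinel.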