[SPARK-48748][SQL] Cache numChars in UTF8String #47142

Closed
@@ -58,6 +58,7 @@ public final class UTF8String implements Comparable<UTF8String>, Externalizable,
  private Object base;
  private long offset;
  private int numBytes;
  private volatile int numChars = -1;

  public Object getBaseObject() { return base; }
  public long getBaseOffset() { return offset; }
@@ -253,6 +254,16 @@ public int numBytes() {
   * Returns the number of code points in it.
   */
  public int numChars() {
    if (numChars == -1) numChars = getNumChars();
    return numChars;
  }

  /**
   * Private helper method to calculate the number of code points in the UTF-8 string. Counting
   * the code points is a linear time operation, as we need to scan the entire UTF-8 string.
   * Hence, this method should generally only be called once for non-empty UTF-8 strings.
   */
  private int getNumChars() {
    int len = 0;
    for (int i = 0; i < numBytes; i += numBytesForFirstByte(getByte(i))) {
      len += 1;
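For readers who want to try the caching pattern from this change in isolation, here is a minimal standalone sketch. It is not Spark's `UTF8String`: the class `CachedUtf8`, the helper `strideForFirstByte`, and the `scans` counter are hypothetical names introduced only for illustration, and the stride helper is an assumed stand-in for `numBytesForFirstByte` that decodes the lead byte with bit masks. The sketch reads the volatile field once into a local, which is a stylistic variation and not something the diff itself does.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for illustration only; this is not Spark's UTF8String.
final class CachedUtf8 {
    private final byte[] bytes;
    // -1 means "not computed yet". A valid count is never negative.
    private volatile int numChars = -1;
    // Demonstration-only counter of how often the linear scan ran (not thread-safe).
    int scans = 0;

    CachedUtf8(String s) {
        this.bytes = s.getBytes(StandardCharsets.UTF_8);
    }

    // Returns the number of code points, scanning the bytes only on the first call.
    int numChars() {
        int n = numChars;           // single volatile read
        if (n == -1) {
            n = countCodePoints();  // linear scan over the encoded bytes
            numChars = n;           // racing threads would store the same value
        }
        return n;
    }

    // Assumed stand-in for numBytesForFirstByte: length of the sequence
    // announced by a UTF-8 lead byte; anything unexpected advances by 1.
    private static int strideForFirstByte(byte b) {
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        return 1;
    }

    private int countCodePoints() {
        scans++;
        int len = 0;
        for (int i = 0; i < bytes.length; i += strideForFirstByte(bytes[i])) {
            len += 1;
        }
        return len;
    }
}
```

Calling `numChars()` repeatedly on the same instance should leave `scans` at 1, including for the empty string, because a cached count of 0 is distinct from the -1 sentinel.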