[SPARK-48947][SQL] Use lowercased charset name to decrease cache missing in Charset.forName

### What changes were proposed in this pull request?

Since charset names can be non-literal values, their case might be inconsistent. In this pull request, we convert the charset name to lowercase before calling `Charset.forName`. This allows values like 'ISO-8859-1' and 'Iso-8859-1' to hit the 2-level cached Charset. By using lowercase instead of uppercase charset names, we align with the way `Charset.forName` performs its further lookup after a cache miss.

### Why are the changes needed?

Performance improvement.

- L1 lookup

```java
private static Charset lookup(String charsetName) {
    if (charsetName == null)
        throw new IllegalArgumentException("Null charset name");
    Object[] a;
    if ((a = cache1) != null && charsetName.equals(a[0]))
        return (Charset)a[1];
    // We expect most programs to use one Charset repeatedly.
    // We convey a hint to this effect to the VM by putting the
    // level 1 cache miss code in a separate method.
    return lookup2(charsetName);
}
```

- L2 lookup

```java
private static Charset lookup2(String charsetName) {
    Object[] a;
    if ((a = cache2) != null && charsetName.equals(a[0])) {
        cache2 = cache1;
        cache1 = a;
        return (Charset)a[1];
    }
    Charset cs;
    if ((cs = standardProvider.charsetForName(charsetName)) != null ||
        (cs = lookupExtendedCharset(charsetName)) != null ||
        (cs = lookupViaProviders(charsetName)) != null)
    {
        cache(charsetName, cs);
        return cs;
    }

    /* Only need to check the name if we didn't find a charset for it */
    checkName(charsetName);
    return null;
}
```

- After a cache miss (the lookup map is keyed by lowercase names)

```java
private Map<String,Charset> cache() {
    Map<String,Charset> map = cache;
    if (map == null) {
        map = new Cache();
        map.put("utf-8", UTF_8.INSTANCE);
        map.put("iso-8859-1", ISO_8859_1.INSTANCE);
        map.put("us-ascii", US_ASCII.INSTANCE);
        map.put("utf-16", java.nio.charset.StandardCharsets.UTF_16);
        map.put("utf-16be", java.nio.charset.StandardCharsets.UTF_16BE);
        map.put("utf-16le", java.nio.charset.StandardCharsets.UTF_16LE);
        cache = map;
    }
    return map;
}
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47420 from yaooqinn/SPARK-48947.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
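
For illustration only, the normalization described in this commit message could be sketched as follows. This is not the code from the patch: the class `CharsetLookupSketch` and the helper `forNameNormalized` are hypothetical names, and the use of `Locale.ROOT` for lowercasing is an assumption made so that the result does not depend on the default locale.

```java
import java.nio.charset.Charset;
import java.util.Locale;

public class CharsetLookupSketch {

    // Hypothetical helper: lowercase the (possibly non-literal) charset name
    // before the lookup, so that differently cased spellings of one charset
    // all reach Charset.forName as the same string.
    // Locale.ROOT is an assumption; it keeps lowercasing locale-independent.
    static Charset forNameNormalized(String charsetName) {
        return Charset.forName(charsetName.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Both spellings are passed to Charset.forName as "iso-8859-1".
        Charset a = forNameNormalized("ISO-8859-1");
        Charset b = forNameNormalized("Iso-8859-1");
        System.out.println(a.equals(b));
        System.out.println(a.name());
    }
}
```

Because the L1 and L2 caches compare names with `equals`, passing one canonical spelling is what lets repeated lookups of the same charset hit the cache regardless of how the caller cased the name.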