From a87015efb5cf36103bc4eb82ae8613874e2eb408 Mon Sep 17 00:00:00 2001
From: Hyukjin Kwon
Date: Thu, 22 Feb 2024 12:13:24 +0900
Subject: [PATCH] [SPARK-47125][SQL] Return null if Univocity never triggers parsing

### What changes were proposed in this pull request?

This PR proposes to guard against `tokenizer.getContext` returning `null`. This is similar to https://github.com/apache/spark/pull/28029.

`getContext` comes from the univocity library, and it can seemingly return null if `beginParsing` is not invoked (https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/common/AbstractParser.java#L53). This can happen when `parseLine` is not invoked at https://github.com/apache/spark/blob/e081f06ea401a2b6b8c214a36126583d35eaf55f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala#L300, because `parseLine` is what invokes `beginParsing`.

### Why are the changes needed?

To fix a bug.

### Does this PR introduce _any_ user-facing change?

Yes. In a very rare case, when `CsvToStructs` is used as the sole predicate against an empty row, it might trigger an NPE. This PR fixes it.

### How was this patch tested?

Manually tested; a test case will be added in a separate PR. We should backport this to all branches.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45210 from HyukjinKwon/SPARK-47125.
Authored-by: Hyukjin Kwon
Signed-off-by: Hyukjin Kwon
---
 .../org/apache/spark/sql/catalyst/csv/UnivocityParser.scala | 1 +
 1 file changed, 1 insertion(+)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
index 06057626461b5..a5158d8a22c6b 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
@@ -136,6 +136,7 @@ class UnivocityParser(

   // Retrieve the raw record string.
   private def getCurrentInput: UTF8String = {
+    if (tokenizer.getContext == null) return null
     val currentContent = tokenizer.getContext.currentParsedContent()
     if (currentContent == null) null else UTF8String.fromString(currentContent.stripLineEnd)
   }
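
Note: the guard added above can be illustrated outside Spark with a minimal sketch. `StubContext`, `StubTokenizer`, and `NullGuardSketch` are hypothetical stand-ins, not univocity or Spark classes. They only mimic the behavior described in the commit message, namely that `getContext` stays null until `parseLine` has been called. Plain `String` replaces `UTF8String` so the sketch is self-contained.

```scala
// Hypothetical stand-ins for the univocity tokenizer and its parsing context.
// They only mimic the behavior described in the commit message.
final class StubContext(content: String) {
  def currentParsedContent(): String = content
}

final class StubTokenizer {
  // Stays null until parseLine is called, mirroring the reported behavior.
  private var context: StubContext = null

  def parseLine(line: String): Array[String] = {
    context = new StubContext(line)
    line.split(",", -1)
  }

  def getContext: StubContext = context
}

object NullGuardSketch {
  // Same shape as the patched getCurrentInput, with String in place of UTF8String.
  def getCurrentInput(tokenizer: StubTokenizer): String = {
    // The added guard: no parsing has started, so there is no raw record to return.
    if (tokenizer.getContext == null) return null
    val currentContent = tokenizer.getContext.currentParsedContent()
    if (currentContent == null) null else currentContent.stripLineEnd
  }
}
```

Without the first check, calling `getCurrentInput` on a tokenizer whose `parseLine` was never invoked would dereference the null context and throw a `NullPointerException`, which matches the failure mode the commit message describes.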