[SPARK-31261][SQL] Avoid NPE when reading bad CSV input with `columnNameOfCorruptRecord` specified

### What changes were proposed in this pull request?

SPARK-25387 avoided an NPE for bad CSV input, but when bad CSV input is read with `columnNameOfCorruptRecord` specified, `getCurrentInput` is still called and throws an NPE.
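
For context, a minimal sketch of the failing scenario from a user's perspective; the local `SparkSession` setup and app name are illustrative, and the schema and sample bytes mirror the new test below:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object Spark31261Repro {
  def main(args: Array[String]): Unit = {
    // Illustrative local session; any SparkSession works.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-31261 repro")
      .getOrCreate()
    import spark.implicits._

    // Schema that routes malformed lines into a user-specified corrupt record column.
    val schema = StructType(
      StructField("a", IntegerType) ::
      StructField("_corrupt_record", StringType) :: Nil)

    // A line that cannot be parsed as the IntegerType column `a`.
    val input = Seq("\u0000\u0000\u0001234").toDS()

    // Before this fix, materializing the result threw a NullPointerException
    // from `getCurrentInput`; with the fix it returns a row instead.
    spark.read
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .csv(input)
      .show()

    spark.stop()
  }
}
```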

### Why are the changes needed?

Bug fix.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Add a test.

Closes apache#28029 from wzhfy/corrupt_column_npe.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 791d2ba)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
wzhfy authored and Sean Cunniff committed Nov 5, 2020
1 parent 375e827 commit 4be7094
Showing 1 changed file with 14 additions and 0 deletions.
```diff
@@ -1924,6 +1924,20 @@ abstract class CSVSuite extends QueryTest with SharedSparkSession with TestCsvData {
       Set(Row(null, "\u0001\u0000\u0001234")))
   }
 
+  test("SPARK-31261: bad csv input with `columnNameCorruptRecord` should not cause NPE") {
+    val schema = StructType(
+      StructField("a", IntegerType) :: StructField("_corrupt_record", StringType) :: Nil)
+    val input = spark.createDataset(Seq("\u0000\u0000\u0001234"))
+
+    checkAnswer(
+      spark.read
+        .option("columnNameOfCorruptRecord", "_corrupt_record")
+        .schema(schema)
+        .csv(input),
+      Row(null, null))
+    assert(spark.read.csv(input).collect().toSet == Set(Row()))
+  }
+
   test("field names of inferred schema shouldn't compare to the first row") {
     val input = Seq("1,2").toDS()
     val df = spark.read.option("enforceSchema", false).csv(input)
```
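
The test exercises both paths: with `columnNameOfCorruptRecord` set, the malformed line now comes back as `Row(null, null)` instead of raising a NullPointerException, and reading the same input without an explicit schema yields an empty row, as the second assertion checks.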
