
Empty strings not interpreted as null when reading CSV files #7797

Closed
66OJ66 opened this issue Oct 11, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@66OJ66

66OJ66 commented Oct 11, 2023

Describe the bug

Initial discussion here: #7761

In short, it seems like empty strings in CSV files aren't being interpreted as null

To Reproduce

Create a simple .csv file like this:

id,name
1,
2,bob

Run the following code:

use datafusion::arrow::util::pretty::pretty_format_batches;
use datafusion::error::DataFusionError;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<(), DataFusionError> {
    let ctx = SessionContext::new();

    // Read the CSV and keep only the rows where `name` is not null.
    let results = ctx
        .read_csv("input.csv", CsvReadOptions::new())
        .await?
        .filter(col("name").is_not_null())?
        .collect()
        .await?;

    let pretty_results = pretty_format_batches(&results)?.to_string();

    println!("{}", pretty_results);

    Ok(())
}

Expected behavior

I was expecting the output to look like this:

+----+------+
| id | name |
+----+------+
| 2  | bob  |
+----+------+

But the full dataset is returned instead:

+----+------+
| id | name |
+----+------+
| 1  |      |
| 2  | bob  |
+----+------+

Additional context

I've tested this on main and v31.0.0, and the result is the same
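
A possible stopgap until this is fixed (just a sketch using DataFusion's public expression API, not something the project recommends) is to also filter out the empty strings explicitly:

// Workaround sketch: treat empty strings like nulls by filtering both out.
let results = ctx
    .read_csv("input.csv", CsvReadOptions::new())
    .await?
    .filter(col("name").is_not_null().and(col("name").not_eq(lit(""))))?
    .collect()
    .await?;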

66OJ66 added the bug label on Oct 11, 2023
@alamb
Contributor

alamb commented Oct 13, 2023

I agree this should likely be fixed. Thank you for filing this @66OJ66

@haohuaijin
Contributor

haohuaijin commented Oct 15, 2023

I found that arrow-csv has the same problem; it seems arrow-csv never sets a string value to null. See the link below:
https://github.com/apache/arrow-rs/blob/bb8e42f6392284f4a7a39d3eec74144a603b481c/arrow-csv/src/reader/mod.rs#L792-L795

DataType::Utf8 => Ok(Arc::new(
    rows.iter()
        .map(|row| Some(row.get(i)))
        .collect::<StringArray>(),
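
For illustration, a null-aware version of that mapping could look like the sketch below (hypothetical, not the actual arrow-csv fix; it also cannot distinguish a quoted empty string "" from a truly missing value):

// Hypothetical sketch: emit null instead of "" for empty fields.
DataType::Utf8 => Ok(Arc::new(
    rows.iter()
        .map(|row| {
            let s = row.get(i);
            (!s.is_empty()).then_some(s)
        })
        .collect::<StringArray>(),
)),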

An example that shows this problem:
https://github.com/apache/arrow-rs/blob/bb8e42f6392284f4a7a39d3eec74144a603b481c/arrow-csv/src/reader/mod.rs#L1544-L1572

The test CSV, test/data/init_null_test.csv:

c_int,c_float,c_string,c_bool,c_null
,,,,
2,2.2,"a",TRUE,
3,,"b",true,
4,4.4,,False,
5,6.6,"",FALSE,

and the test:

#[test]
fn test_init_nulls_with_inference() {
    let format = Format::default().with_header(true).with_delimiter(b',');

    // Infer the schema, then rewind so the reader starts from the beginning of the file.
    let mut file = File::open("test/data/init_null_test.csv").unwrap();
    let (schema, _) = format.infer_schema(&mut file, None).unwrap();
    file.rewind().unwrap();

    // Build a CSV reader with the inferred schema and read the first record batch.
    let mut csv = ReaderBuilder::new(Arc::new(schema))
        .with_format(format)
        .build(file)
        .unwrap();

    let batch = csv.next().unwrap().unwrap();
    println!("{:?}", batch);
}

The printed result is:

With arrow-rs's inferred schema:
RecordBatch { schema: Schema { fields: [Field { name: "c_int", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c_float", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c_string", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c_bool", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c_null", data_type: Null, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [PrimitiveArray<Int64>
[
  null,
  2,
  3,
  4,
  5,
], PrimitiveArray<Float64>
[
  null,
  2.2,
  null,
  4.4,
  6.6,
], StringArray
[
  "",
  "a",
  "b",
  "",
  "",
], BooleanArray
[
  null,
  true,
  true,
  false,
  false,
], NullArray(5)], row_count: 5 }

I also found that DataFusion's inferred schema is different from arrow-csv's (the c_null column is inferred as Utf8 rather than Null):

With DataFusion's inferred schema:
RecordBatch { schema: Schema { fields: [Field { name: "c_int", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c_float", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c_string", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c_bool", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c_null", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [PrimitiveArray<Int64>
[
  null,
  2,
  3,
  4,
  5,
], PrimitiveArray<Float64>
[
  null,
  2.2,
  null,
  4.4,
  6.6,
], StringArray
[
  "",
  "a",
  "b",
  "",
  "",
], BooleanArray
[
  null,
  true,
  true,
  false,
  false,
], StringArray
[
  "",
  "",
  "",
  "",
  "",
]], row_count: 5 }
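
For reference, one way to see the schema DataFusion infers for the same file (a minimal sketch, assuming the same test file path and an async context with a SessionContext) is:

// Sketch: print DataFusion's inferred schema for the test CSV.
let ctx = SessionContext::new();
let df = ctx
    .read_csv("test/data/init_null_test.csv", CsvReadOptions::new())
    .await?;
println!("{:?}", df.schema());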

@66OJ66
Author

66OJ66 commented Dec 6, 2023

I tested the above code again using v33.0, and it's returning the expected output now

Thanks @haohuaijin for fixing the underlying issue!
