
Empty strings not interpreted as null when reading CSV files #7797

Closed
66OJ66 opened this issue Oct 11, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@66OJ66

66OJ66 commented Oct 11, 2023

Describe the bug

Initial discussion here: #7761

In short, it seems like empty strings in CSV files aren't being interpreted as null

To Reproduce

Create a simple .csv file like this:

id,name
1,
2,bob

Run the following code:

use datafusion::arrow::util::pretty::pretty_format_batches;
use datafusion::error::DataFusionError;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<(), DataFusionError> {
    let ctx = SessionContext::new();

    // Read the CSV and keep only the rows where `name` is not null.
    let results = ctx
        .read_csv("input.csv", CsvReadOptions::new())
        .await?
        .filter(col("name").is_not_null())?
        .collect()
        .await?;

    let pretty_results = pretty_format_batches(&results)?.to_string();

    println!("{}", pretty_results);

    Ok(())
}

Expected behavior

I was expecting the output to look like this:

+----+------+
| id | name |
+----+------+
| 2  | bob  |
+----+------+

But the full dataset is returned instead:

+----+------+
| id | name |
+----+------+
| 1  |      |
| 2  | bob  |
+----+------+

Additional context

I've tested this on main and v31.0.0, and the result is the same
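
A possible stopgap until this is fixed (just a sketch using DataFusion's public expression API, not something the project recommends) is to also filter out the empty strings explicitly:

// Workaround sketch: treat empty strings like nulls by filtering both out.
let results = ctx
    .read_csv("input.csv", CsvReadOptions::new())
    .await?
    .filter(col("name").is_not_null().and(col("name").not_eq(lit(""))))?
    .collect()
    .await?;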

66OJ66 added the bug label on Oct 11, 2023
@alamb
Contributor

alamb commented Oct 13, 2023

I agree this should likely be fixed. Thank you for filing this @66OJ66

@haohuaijin
Contributor

haohuaijin commented Oct 15, 2023

I found that arrow-csv has the same problem; it seems arrow-csv never sets a string value to null. See the link below:
https://github.com/apache/arrow-rs/blob/bb8e42f6392284f4a7a39d3eec74144a603b481c/arrow-csv/src/reader/mod.rs#L792-L795

DataType::Utf8 => Ok(Arc::new(
    rows.iter()
        .map(|row| Some(row.get(i)))
        .collect::<StringArray>(),
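
For illustration, a null-aware version of that mapping could look like the sketch below (hypothetical, not the actual arrow-csv fix; it also cannot distinguish a quoted empty string "" from a truly missing value):

// Hypothetical sketch: emit null instead of "" for empty fields.
DataType::Utf8 => Ok(Arc::new(
    rows.iter()
        .map(|row| {
            let s = row.get(i);
            (!s.is_empty()).then_some(s)
        })
        .collect::<StringArray>(),
)),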

An example that shows this problem:
https://github.com/apache/arrow-rs/blob/bb8e42f6392284f4a7a39d3eec74144a603b481c/arrow-csv/src/reader/mod.rs#L1544-L1572

The test CSV, test/data/init_null_test.csv:

c_int,c_float,c_string,c_bool,c_null
,,,,
2,2.2,"a",TRUE,
3,,"b",true,
4,4.4,,False,
5,6.6,"",FALSE,

and the test:

#[test]
fn test_init_nulls_with_inference() {
    let format = Format::default().with_header(true).with_delimiter(b',');

    // Infer the schema, then rewind so the reader starts from the beginning of the file.
    let mut file = File::open("test/data/init_null_test.csv").unwrap();
    let (schema, _) = format.infer_schema(&mut file, None).unwrap();
    file.rewind().unwrap();

    // Build a CSV reader with the inferred schema and read the first record batch.
    let mut csv = ReaderBuilder::new(Arc::new(schema))
        .with_format(format)
        .build(file)
        .unwrap();

    let batch = csv.next().unwrap().unwrap();
    println!("{:?}", batch);
}

The printed result is:

With arrow-rs's inferred schema:
RecordBatch { schema: Schema { fields: [Field { name: "c_int", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c_float", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c_string", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c_bool", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c_null", data_type: Null, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [PrimitiveArray<Int64>
[
  null,
  2,
  3,
  4,
  5,
], PrimitiveArray<Float64>
[
  null,
  2.2,
  null,
  4.4,
  6.6,
], StringArray
[
  "",
  "a",
  "b",
  "",
  "",
], BooleanArray
[
  null,
  true,
  true,
  false,
  false,
], NullArray(5)], row_count: 5 }

I also found that DataFusion's inferred schema is different from arrow-csv's (the c_null column is inferred as Utf8 rather than Null):

With DataFusion's inferred schema:
RecordBatch { schema: Schema { fields: [Field { name: "c_int", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c_float", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c_string", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c_bool", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c_null", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [PrimitiveArray<Int64>
[
  null,
  2,
  3,
  4,
  5,
], PrimitiveArray<Float64>
[
  null,
  2.2,
  null,
  4.4,
  6.6,
], StringArray
[
  "",
  "a",
  "b",
  "",
  "",
], BooleanArray
[
  null,
  true,
  true,
  false,
  false,
], StringArray
[
  "",
  "",
  "",
  "",
  "",
]], row_count: 5 }
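
For reference, one way to see the schema DataFusion infers for the same file (a minimal sketch, assuming the same test file path and an async context with a SessionContext) is:

// Sketch: print DataFusion's inferred schema for the test CSV.
let ctx = SessionContext::new();
let df = ctx
    .read_csv("test/data/init_null_test.csv", CsvReadOptions::new())
    .await?;
println!("{:?}", df.schema());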

@66OJ66
Author

66OJ66 commented Dec 6, 2023

I tested the above code again using v33.0, and it's returning the expected output now

Thanks @haohuaijin for fixing the underlying issue!
