Column `hasNulls` property is actually isNullable #428

Kopilov · 2023-07-22T09:01:05Z

Hello

If I get some dataframe with some column and I want to check if it contains null values, I use hasNulls property.

val nullable = DataColumn.create("nullable", listOf("a" as String?, "b" as String?, "c" as String?))
println(nullable.hasNulls)

expected: false
actual: true

From another side, column can be marked as not nullable but contain null values (this behavior might be a critical issue itself)

    val notNullable = DataColumn.create(
        "notNullable",
        listOf("a" as String?, "b" as String?, "c" as String?, null as String?),
        typeOf<String>().withNullability(false)
    )
    println(notNullable.hasNulls)

expected: true or IllegalArgumentException
actual: false

The text was updated successfully, but these errors were encountered:

koperagen · 2023-07-24T10:45:04Z

Hi! It's indeed seems confusing, but at the same time can be justified.
DataColumn tries to be an efficient constructor, that's why it relies on supplied reified type parameter by default and doesn't check anything. It has infer parameter to match nullability / type to data though.
Maybe it'll make sense to change its default behaviour to Infer.Nulls and maybe introduce another function with explicit naming, like createWithoutTypeInference

Kopilov · 2023-07-24T11:30:41Z

DataColumn.create behavior can be justified but what about hasNulls? The issue is about hasNulls primarily, without regard for data origin.

In real life, I want to save nullable (but without null values) column as not nullable in Arrow and get unexpected NullValueError

Jolanrensen · 2023-07-24T11:40:02Z

I agree about your change for hasNulls(), it should only return true if there are actually nulls in the column. But, do we also have a canHaveNulls()? Because that sounds like useful behavior.

koperagen · 2023-07-24T11:40:29Z

I still see it as a DataColumn.create problem, maybe i miss something.
Its DataColumn.create responsibility to match actual values with type and nullability, and hasNulls only looks at type supplied to constructor. So, wouldn't DataColumn.create("col", listOf("a" as String?, "b" as String?, "c" as String?), Infer.Throw) solve this problem? Like, if infered type is String?, but values are of type String, then throw an exception

i.e. it's expected that whenever someone creates data column, they should supply exact type (as if Infer.Type is passed), then hasNulls == isNullable

Jolanrensen · 2023-07-24T11:47:10Z

I still see it as a DataColumn.create problem, maybe i miss something. Its DataColumn.create responsibility to match actual values with type and nullability, and hasNulls only looks at type supplied to constructor. So, wouldn't DataColumn.create("col", listOf("a" as String?, "b" as String?, "c" as String?), Infer.Throw) solve this problem? Like, if infered type is String?, but values are of type String, then throw an exception

i.e. it's expected that whenever someone creates data column, they should supply exact type (as if Infer.Type is passed), then hasNulls == isNullable

yes, but when you're generating type schemas from, say, OpenApi, this is not always the case. The column is nullable, since it can contain null values, but it does not necessarily needs to contain them. Same as a List<Int?>. Your logic works if the data is static, but if you want to reuse your logic for data that might be nullable sometimes then hasNulls != isNullable

koperagen · 2023-07-24T12:23:00Z

DataColumn is created from known data and is immutable. There is no reason for it (DataColumn instance itself!) to have nullable type when created if there are no actual null values. You can still cast dataframe that contains it to schema with nullable property. From type system perspective it's fine.

Kopilov · 2023-07-25T13:02:33Z

@koperagen thank you for the opinion.

Currently I have made workaround in my project by rewrapping all columns with Infer.Type before saving. (At the same time, I change blank strings with nulls if target type is not String but number and so on)

koperagen · 2023-07-26T12:32:30Z

Initially i thought that you call DataColumn.create yourself from your code, it seems i misunderstood the problem. You know where this column is created?

Kopilov · 2023-07-26T13:45:28Z

Currently it is created by DataFrame.Companion.readArrowFeather method that keeps original Arrow schema (including nullability). Feather file is created in separate software and might be dirty but cleaner then Excel (that was used as main input on prevoius step, when I initially made Arrow IO). Also it can be any another data import. To make it absolutely clean, I also have to convert some columns to not nullable.

FYI, current project is BPLEX (Russian only: https://bia-tech.ru/solutions/platforma-dlya-optimizacii-2/) that actually is middleware for transferring data between business GUI or DWH and mathematics models as well as from one model output to another input.

Kopilov · 2023-07-26T15:11:14Z

And I have common entrypoint for internal data validation and saving (actually, 90% of it's code is Dataframe ArrowWriter itself)

koperagen · 2023-07-27T13:03:38Z

So, do you think it can (or should) be fixed on DataFrame.Companion.readArrowFeather side? I vaguely remember discussing nullability of columns when reading Arrow files, but not in much details.

Kopilov · 2023-07-28T07:15:44Z

It should not, in my mind.

Kotlin Dataframe is pretty good library, it's default behavior looks good for data science. But in can also be used (and is actually used, at least in BPLEX :) ) for data engineering. And in this case static schemas looks better. Arrow is also data engineering tool and it should keep it's approach together with Kotlin Dataframe, IMHO.

Once again, this issue is not about BPLEX (this is my problem) and not about Arrow (this is just one case) but about method name and it's actual behavior mismatch, without regard for data origin and sink. Please don't kill Dataframe data engineering opportunity.

koperagen · 2023-08-02T18:13:32Z

It's very cool and inspiring that dataframe is used for such tasks successfully :)
Regarding the issue, let me come back to it in a week or two. There's a decision to be made about how it should be done and i'm not ready to do it now.

Kopilov mentioned this issue Jul 22, 2023

Expected hasNulls behavior for #428 #429

Open

koperagen added the invalid This issue/PR doesn't seem right label Aug 2, 2023

koperagen added this to the 0.12.0 milestone Aug 2, 2023

zaleslaw modified the milestones: 0.12.0, 0.13.0 Oct 9, 2023

zaleslaw assigned koperagen Oct 9, 2023

koperagen mentioned this issue Dec 15, 2023

DataColumn nullability in JDBC #541

Closed

Kopilov mentioned this issue Jan 7, 2024

Reading Arrow NullVector #550

Merged

zaleslaw removed the invalid This issue/PR doesn't seem right label Apr 8, 2024

zaleslaw modified the milestones: 0.13.+, 0.14.0 Apr 25, 2024

zaleslaw modified the milestones: 0.14.0, 0.15.0 Sep 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Column `hasNulls` property is actually isNullable #428

Column `hasNulls` property is actually isNullable #428

Kopilov commented Jul 22, 2023

koperagen commented Jul 24, 2023 •

edited

Loading

Kopilov commented Jul 24, 2023

Jolanrensen commented Jul 24, 2023

koperagen commented Jul 24, 2023 •

edited

Loading

Jolanrensen commented Jul 24, 2023 •

edited

Loading

koperagen commented Jul 24, 2023 •

edited

Loading

Kopilov commented Jul 25, 2023

koperagen commented Jul 26, 2023

Kopilov commented Jul 26, 2023 •

edited

Loading

Kopilov commented Jul 26, 2023

koperagen commented Jul 27, 2023

Kopilov commented Jul 28, 2023 •

edited

Loading

koperagen commented Aug 2, 2023

Column hasNulls property is actually isNullable #428

Column hasNulls property is actually isNullable #428

Comments

Kopilov commented Jul 22, 2023

koperagen commented Jul 24, 2023 • edited Loading

Kopilov commented Jul 24, 2023

Jolanrensen commented Jul 24, 2023

koperagen commented Jul 24, 2023 • edited Loading

Jolanrensen commented Jul 24, 2023 • edited Loading

koperagen commented Jul 24, 2023 • edited Loading

Kopilov commented Jul 25, 2023

koperagen commented Jul 26, 2023

Kopilov commented Jul 26, 2023 • edited Loading

Kopilov commented Jul 26, 2023

koperagen commented Jul 27, 2023

Kopilov commented Jul 28, 2023 • edited Loading

koperagen commented Aug 2, 2023

Column `hasNulls` property is actually isNullable #428

Column `hasNulls` property is actually isNullable #428

koperagen commented Jul 24, 2023 •

edited

Loading

koperagen commented Jul 24, 2023 •

edited

Loading

Jolanrensen commented Jul 24, 2023 •

edited

Loading

koperagen commented Jul 24, 2023 •

edited

Loading

Kopilov commented Jul 26, 2023 •

edited

Loading

Kopilov commented Jul 28, 2023 •

edited

Loading