Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Column hasNulls property is actually isNullable #428

Open
Kopilov opened this issue Jul 22, 2023 · 13 comments
Open

Column hasNulls property is actually isNullable #428

Kopilov opened this issue Jul 22, 2023 · 13 comments
Assignees
Milestone

Comments

@Kopilov
Copy link
Contributor

Kopilov commented Jul 22, 2023

Hello

If I get some dataframe with some column and I want to check if it contains null values, I use hasNulls property.

val nullable = DataColumn.create("nullable", listOf("a" as String?, "b" as String?, "c" as String?))
println(nullable.hasNulls)

expected: false
actual: true

From another side, column can be marked as not nullable but contain null values (this behavior might be a critical issue itself)

    val notNullable = DataColumn.create(
        "notNullable",
        listOf("a" as String?, "b" as String?, "c" as String?, null as String?),
        typeOf<String>().withNullability(false)
    )
    println(notNullable.hasNulls)

expected: true or IllegalArgumentException
actual: false

@koperagen
Copy link
Collaborator

koperagen commented Jul 24, 2023

Hi! It's indeed seems confusing, but at the same time can be justified.
DataColumn tries to be an efficient constructor, that's why it relies on supplied reified type parameter by default and doesn't check anything. It has infer parameter to match nullability / type to data though.
Maybe it'll make sense to change its default behaviour to Infer.Nulls and maybe introduce another function with explicit naming, like createWithoutTypeInference

@Kopilov
Copy link
Contributor Author

Kopilov commented Jul 24, 2023

DataColumn.create behavior can be justified but what about hasNulls? The issue is about hasNulls primarily, without regard for data origin.

In real life, I want to save nullable (but without null values) column as not nullable in Arrow and get unexpected NullValueError

@Jolanrensen
Copy link
Collaborator

I agree about your change for hasNulls(), it should only return true if there are actually nulls in the column. But, do we also have a canHaveNulls()? Because that sounds like useful behavior.

@koperagen
Copy link
Collaborator

koperagen commented Jul 24, 2023

I still see it as a DataColumn.create problem, maybe i miss something.
Its DataColumn.create responsibility to match actual values with type and nullability, and hasNulls only looks at type supplied to constructor. So, wouldn't DataColumn.create("col", listOf("a" as String?, "b" as String?, "c" as String?), Infer.Throw) solve this problem? Like, if infered type is String?, but values are of type String, then throw an exception

i.e. it's expected that whenever someone creates data column, they should supply exact type (as if Infer.Type is passed), then hasNulls == isNullable

@Jolanrensen
Copy link
Collaborator

Jolanrensen commented Jul 24, 2023

I still see it as a DataColumn.create problem, maybe i miss something. Its DataColumn.create responsibility to match actual values with type and nullability, and hasNulls only looks at type supplied to constructor. So, wouldn't DataColumn.create("col", listOf("a" as String?, "b" as String?, "c" as String?), Infer.Throw) solve this problem? Like, if infered type is String?, but values are of type String, then throw an exception

i.e. it's expected that whenever someone creates data column, they should supply exact type (as if Infer.Type is passed), then hasNulls == isNullable

yes, but when you're generating type schemas from, say, OpenApi, this is not always the case. The column is nullable, since it can contain null values, but it does not necessarily needs to contain them. Same as a List<Int?>. Your logic works if the data is static, but if you want to reuse your logic for data that might be nullable sometimes then hasNulls != isNullable

@koperagen
Copy link
Collaborator

koperagen commented Jul 24, 2023

DataColumn is created from known data and is immutable. There is no reason for it (DataColumn instance itself!) to have nullable type when created if there are no actual null values. You can still cast dataframe that contains it to schema with nullable property. From type system perspective it's fine.

@Kopilov
Copy link
Contributor Author

Kopilov commented Jul 25, 2023

@koperagen thank you for the opinion.

Currently I have made workaround in my project by rewrapping all columns with Infer.Type before saving. (At the same time, I change blank strings with nulls if target type is not String but number and so on)

@koperagen
Copy link
Collaborator

Initially i thought that you call DataColumn.create yourself from your code, it seems i misunderstood the problem. You know where this column is created?

@Kopilov
Copy link
Contributor Author

Kopilov commented Jul 26, 2023

Currently it is created by DataFrame.Companion.readArrowFeather method that keeps original Arrow schema (including nullability). Feather file is created in separate software and might be dirty but cleaner then Excel (that was used as main input on prevoius step, when I initially made Arrow IO). Also it can be any another data import. To make it absolutely clean, I also have to convert some columns to not nullable.

FYI, current project is BPLEX (Russian only: https://bia-tech.ru/solutions/platforma-dlya-optimizacii-2/) that actually is middleware for transferring data between business GUI or DWH and mathematics models as well as from one model output to another input.

@Kopilov
Copy link
Contributor Author

Kopilov commented Jul 26, 2023

And I have common entrypoint for internal data validation and saving (actually, 90% of it's code is Dataframe ArrowWriter itself)

@koperagen
Copy link
Collaborator

So, do you think it can (or should) be fixed on DataFrame.Companion.readArrowFeather side? I vaguely remember discussing nullability of columns when reading Arrow files, but not in much details.

@Kopilov
Copy link
Contributor Author

Kopilov commented Jul 28, 2023

It should not, in my mind.

Kotlin Dataframe is pretty good library, it's default behavior looks good for data science. But in can also be used (and is actually used, at least in BPLEX :) ) for data engineering. And in this case static schemas looks better. Arrow is also data engineering tool and it should keep it's approach together with Kotlin Dataframe, IMHO.

Once again, this issue is not about BPLEX (this is my problem) and not about Arrow (this is just one case) but about method name and it's actual behavior mismatch, without regard for data origin and sink. Please don't kill Dataframe data engineering opportunity.

@koperagen koperagen added the invalid This issue/PR doesn't seem right label Aug 2, 2023
@koperagen koperagen added this to the 0.12.0 milestone Aug 2, 2023
@koperagen
Copy link
Collaborator

It's very cool and inspiring that dataframe is used for such tasks successfully :)
Regarding the issue, let me come back to it in a week or two. There's a decision to be made about how it should be done and i'm not ready to do it now.

@zaleslaw zaleslaw modified the milestones: 0.12.0, 0.13.0 Oct 9, 2023
@zaleslaw zaleslaw removed the invalid This issue/PR doesn't seem right label Apr 8, 2024
@zaleslaw zaleslaw modified the milestones: 0.13.+, 0.14.0 Apr 25, 2024
@zaleslaw zaleslaw modified the milestones: 0.14.0, 0.15.0 Sep 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants