Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

df.describe().describe() crashes #724

Closed
cmelchior opened this issue Jun 5, 2024 · 4 comments · Fixed by #726
Closed

df.describe().describe() crashes #724

cmelchior opened this issue Jun 5, 2024 · 4 comments · Fixed by #726
Assignees
Labels
bug Something isn't working
Milestone

Comments

@cmelchior
Copy link
Contributor

Ran into this by accident while testing AI Assistant.

It looks like the output of df.describe() creates a DataFrame with some invalid types. I managed to reproduce crashes in two cases:

%use dataframe
val df = DataFrame.read("https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv")
df.describe().describe()

// Output in Notebook
null
java.lang.ClassCastException

I also reproduced this in a unit test:

    @Test
    fun describeTwice() {
        val df = typed.group { age and weight }.into("info").groupBy { city }.toDataFrame()
        val desc = df.describe().describe()
        desc::class shouldBe DataFrame::class
    }

Which crashed with:

class java.lang.Integer cannot be cast to class java.lang.String (java.lang.Integer and java.lang.String are in module java.base of loader 'bootstrap')
java.lang.ClassCastException: class java.lang.Integer cannot be cast to class java.lang.String (java.lang.Integer and java.lang.String are in module java.base of loader 'bootstrap')
	at java.base/java.lang.String.compareTo(String.java:140)
	at kotlin.sequences.SequencesKt___SequencesKt.minOrNull(_Sequences.kt:2097)
	at org.jetbrains.kotlinx.dataframe.api.MinKt.minOrNull(min.kt:27)
	at org.jetbrains.kotlinx.dataframe.impl.api.DescribeKt$describeImpl$df$1$20.invoke(describe.kt:79)
	at org.jetbrains.kotlinx.dataframe.impl.api.DescribeKt$describeImpl$df$1$20.invoke(describe.kt:79)
	at org.jetbrains.kotlinx.dataframe.impl.api.DescribeKt$describeImpl$df$1.invoke(describe.kt:183)
	at org.jetbrains.kotlinx.dataframe.impl.api.DescribeKt$describeImpl$df$1.invoke(describe.kt:62)
	at org.jetbrains.kotlinx.dataframe.impl.api.ToDataFrameKt.createDataFrameImpl(toDataFrame.kt:155)
	at org.jetbrains.kotlinx.dataframe.impl.api.DescribeKt.describeImpl(describe.kt:113)
	at org.jetbrains.kotlinx.dataframe.api.DescribeKt.describe(describe.kt:45)
	at org.jetbrains.kotlinx.dataframe.api.DescribeKt.describe(describe.kt:42)
	at org.jetbrains.kotlinx.dataframe.testSets.person.DataFrameTests.describeTwice(DataFrameTests.kt:2364)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
	at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
	at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
	at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
	at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:112)
	at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58)
	at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:40)
	at org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:60)
	at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:52)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:36)
	at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
	at org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:33)
	at org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:94)
	at jdk.proxy1/jdk.proxy1.$Proxy2.processTestClass(Unknown Source)
	at org.gradle.api.internal.tasks.testing.worker.TestWorker$2.run(TestWorker.java:176)
	at org.gradle.api.internal.tasks.testing.worker.TestWorker.executeAndMaintainThreadName(TestWorker.java:129)
	at org.gradle.api.internal.tasks.testing.worker.TestWorker.execute(TestWorker.java:100)
	at org.gradle.api.internal.tasks.testing.worker.TestWorker.execute(TestWorker.java:60)
	at org.gradle.process.internal.worker.child.ActionExecutionWorker.execute(ActionExecutionWorker.java:56)
	at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:113)
	at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:65)
	at worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69)
	at worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)
@Jolanrensen Jolanrensen added the bug Something isn't working label Jun 5, 2024
@Jolanrensen Jolanrensen added this to the Backlog milestone Jun 5, 2024
@Jolanrensen
Copy link
Collaborator

This is a very interesting bug!

@Test
fun `describe twice 1`() {
    val df = dataFrameOf("a", "b")(1, 2, 3, 4)
    val desc1 = df.describe()
    val desc2 = desc1.describe()
    desc2::class shouldBe DataFrameImpl::class
}

works fine, but

@Test
fun `describe twice 2`() {
    val df = dataFrameOf("a", "b")(1, "foo", 3, "bar")
    val desc1 = df.describe()
    val desc2 = desc1.describe()
    desc2::class shouldBe DataFrameImpl::class
}

breaks.

I suspect this is due to columns being created with type Comparable<*>/Comparable<Nothing> after running describe():

   name:String type:Any count:Int unique:Int nulls:Int top:Comparable<*> freq:Int mean:Double? std:Double? min:Comparable<*> median:Comparable<*> max:Comparable<*>
 0           a      Int         2          2         0                 1        1          2.0    1.414214                 1                    2                 3
 1           b   String         2          2         0               foo        1         null        null               bar                  bar               foo

If you now run another describe() on this table, it will try to find the min of columns like top and compare Int and String. However, these two are incomparable, as we can see by the exception.

Our current implementation only checks if a column AnyCol.isComparable() = isSubtypeOf<Comparable<*>?>(), not whether the type T != Nothing. I'm not sure we can actually.

@Jolanrensen Jolanrensen self-assigned this Jun 5, 2024
@koperagen
Copy link
Collaborator

koperagen commented Jun 5, 2024

Our current implementation only checks if a column AnyCol.isComparable() = isSubtypeOf<Comparable<*>?>(), not whether the type T != Nothing. I'm not sure we can actually.

Maybe typeOf<Comparable<Any?>?>() will work as expected here. I suspect this code was written with an assumption that * means Any?, but Comparable has in variance and Comparable<*> == Comparable<Nothing> and from type system perspective you can't compare two Comparable<Nothing>
image
image

@Jolanrensen
Copy link
Collaborator

@koperagen thanks for the tip! But unfortunately:

typeOf<Int>().isSubtypeOf(typeOf<Comparable<Any?>>()) == false
typeOf<Int>().isSubtypeOf(typeOf<Comparable<Any>>()) == false

variance is fun :)

@Jolanrensen
Copy link
Collaborator

It can be fixed like:

/**
 * Returns `true` if [this] column is comparable, i.e. its type is a subtype of [Comparable] and its
 * type argument is not [Nothing].
 */
public fun AnyCol.isComparable(): Boolean = isSubtypeOf<Comparable<*>?>()
    && type().projectTo(Comparable::class).arguments[0].let {
        it != KTypeProjection.STAR &&
            it.type?.isNothing != true
    }

I'll probably make a PR later :)

Jolanrensen added a commit that referenced this issue Jun 6, 2024
…o check whether the values in the `Comparable` column are actually "comparable" aka, the generic type of Comparable is not */Nothing
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants