Unresolved reference when using Properties to access #703

Open
murfel opened this issue May 21, 2024 · 9 comments

@murfel

murfel commented May 21, 2024

I'm trying to use Properties to access a column, and it fails to compile with "Unresolved reference: title". Here's my demo repo, branch demo.

I'm using DataFrame version "0.13.1", as suggested in the onboarding documentation, and the movies.csv dataset.

package org.example

import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.column
import org.jetbrains.kotlinx.dataframe.api.print
import org.jetbrains.kotlinx.dataframe.io.read

fun main() {
    val df = DataFrame.read("movies.csv")
    println(df.columnNames())  // [movieId, title, genres]

    // Properties - doesn't work "Unresolved reference: title"
//    df.title

    // Accessors - OK
    val title by column<String>()
    df[title].print()

    // Strings - OK
    df["title"].print()
}

(Following the getColumns doc page.)

It is extremely bizarre that the documentation shows the Properties tab on almost every single page and yet doesn't mention whether I need any extra imports or other setup for it to work.

@murfel
Author

murfel commented May 21, 2024

Also a bit confusing that there's no tab which uses Indexes.

I do want to know that you can also select a column using an index, df.getColumn(1).
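
For reference, a minimal sketch of index-based access next to name-based access. It assumes the org.jetbrains.kotlinx:dataframe dependency is on the classpath. The in-memory sample rows and the helper names (sampleMovies, secondColumnName, titleValues) are made up for illustration and only stand in for movies.csv.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical stand-in for movies.csv with the same three column names.
fun sampleMovies() = dataFrameOf("movieId", "title", "genres")(
    1, "Toy Story", "Animation",
    2, "Jumanji", "Adventure",
)

// Column indexes are zero-based, so index 1 is the second column.
fun secondColumnName(): String = sampleMovies().getColumn(1).name()

// The same column, looked up by name instead of by position.
fun titleValues(): List<Any?> = sampleMovies()["title"].values().toList()
```

Calling secondColumnName() on this sample should give the name of the column at position 1, which is the same column that df["title"] returns.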

@murfel
Author

murfel commented May 21, 2024

Also here Properties and Accessors are completely identical code, except the Accessors tab has an extra variable definition. Surely the Properties style wouldn't work.

// Properties
df.getColumn { age }
// Accessors
val age by column<Int>()

df.getColumn { age }

https://kotlin.github.io/dataframe/getcolumn.html

@Jolanrensen
Collaborator

Hi there!
Thanks for reaching out. You indeed make a good point that our documentation is a bit confusing in this regard. The generated properties are available out-of-the-box in notebooks. After a cell is executed, the data inside dataframe instances is analysed and extension properties like df.title will work. This is why it's featured so prominently in the documentation.

In Gradle projects, it requires a bit more configuration: we need to tell the compiler how and where to generate these extension properties.
As described in that part of the documentation, there are three ways to do this in Gradle projects. You can:

  • Create an interface/data class annotated with @DataSchema. Then, after recompiling, DataFrames cast to that interface/class will have the extension properties available to them.
  • Add a reference to (a sample of) your data to a @file:ImportDataSchema(..) statement at the top of your file. This will generate @DataSchema interfaces and extension properties for you.
  • Add a dataframes { schema {} } block to your Gradle build file. This works the same as @file:ImportDataSchema.
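
To make "extension properties" concrete, here is a rough sketch that runs without the plugin. The Movie interface, the hand-written title properties, and the sample rows are all assumptions for illustration: they imitate what the plugin is expected to generate for a @DataSchema type and are not its actual output. In a real project the interface would carry @DataSchema and the two properties would be generated for you.

```kotlin
import org.jetbrains.kotlinx.dataframe.ColumnsContainer
import org.jetbrains.kotlinx.dataframe.DataColumn
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.DataRow
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical schema type describing the columns of movies.csv.
interface Movie {
    val movieId: Int
    val title: String
    val genres: String
}

// Hand-written stand-in for a generated property: df.title on a typed frame.
@Suppress("UNCHECKED_CAST")
val ColumnsContainer<Movie>.title: DataColumn<String>
    get() = this["title"] as DataColumn<String>

// Hand-written stand-in for a generated property: row.title on a typed row.
val DataRow<Movie>.title: String
    get() = this["title"] as String

// Sample data standing in for movies.csv, cast to the schema type.
fun typedMovies(): DataFrame<Movie> =
    dataFrameOf("movieId", "title", "genres")(
        1, "Toy Story", "Animation",
        2, "Jumanji", "Adventure",
    ).cast<Movie>()

// df.title resolves only because the frame is typed as DataFrame<Movie>.
fun titlesViaProperty(): List<String> = typedMovies().title.values().toList()

fun firstTitleViaRow(): String = typedMovies()[0].title
```

The point of the sketch is that df.title is not a built-in member: it only resolves once the frame has a schema type and matching extension properties exist, which is why an untyped DataFrame.read("movies.csv") reports an unresolved reference.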

As for indexing, you're right, it's mentioned only here as far as I can see. That said, our documentation website is far from all-inclusive and needs a lot of work still. Discovering the API and possibilities from the IDE's autocomplete is the best way to explore the functionalities of DataFrame :).

Hopefully, this answered some of your questions/concerns. Feel free to reach out if you have more questions!

@Jolanrensen Jolanrensen added the question Further information is requested label May 21, 2024
@murfel
Author

murfel commented May 21, 2024

Re indexing, the link you provided indexes the row, not the column.

Otherwise thank you for clarification!

I hadn't even looked at the DataSchema documentation, since nothing ever pointed me to it. It could be useful to at least link to it from Getting Started on Gradle, and ideally from each of the "Properties" tabs, too.

However, I understand that the documentation is not the priority right now. Something like a disclaimer, "Warning: documentation is in beta mode, missing information and discrepancies are possible", would be nice, so that users don't expect it to be perfect, double-check things, and don't get upset when something doesn't work.

@murfel
Author

murfel commented May 21, 2024

In general, do you need feedback at this stage?

A few things I noticed which are different from pandas -

  1. I cannot load a CSV without a header - I either get the first row as a header or I need to provide my header for each column.
  2. select doesn't allow selecting the same column twice because of the name conflict.

My use case is I have a CSV with 20 columns, out of which I only need 4, and two of them need to be duplicated. So I want to load, select required columns, and only then provide their headers.

val df = DataFrame.readCSV("filename.csv", header=???).select { cols(2, 4, 5, 17) } // cannot select { cols(2, 2) }

I understand that workarounds exist - create a fake header corresponding to indexes, header=(0..19).map { it.toString() } and insert the extra column afterwards, but it would be nice to have it out of the box if this is in the plans.

(I'm still unsure how to rename the header, though, apart from creating a new DataFrame.)

@murfel
Author

murfel commented May 21, 2024

Also this val title by column<String>() doesn't seem to pull the data.

This sample prints Amazing org.jetbrains.kotlinx.dataframe.impl.columns.ColumnAccessorImpl@70f02c32 instead of the actual title.

    val df = DataFrame.read("movies.csv")
    println(df.columnNames())  // [movieId, title, genres]
    val title by column<String>()
    val newDf = df.add("amazingTitle") { "Amazing $title" }
    println(newDf[0]["amazingTitle"])

@koperagen
Collaborator

Feedback is appreciated :)
Try this
df.select { col(2) named "col1" and col(2) named "col2" /* and so on */ }
In general, every operation creates a new dataframe, but that's OK because data is reused whenever possible.

To pull the data by column accessor in a DataRow context, you can use either the invoke or the get function:
val newDf = df.add("amazingTitle") { "Amazing ${title()}" }
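
A small sketch of the difference between the accessor object and the value it points to. It assumes the org.jetbrains.kotlinx:dataframe dependency is available. The sample rows and the helper names (sampleMovies, amazingTitles) are hypothetical and only stand in for movies.csv.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical stand-in for movies.csv with the same three column names.
fun sampleMovies() = dataFrameOf("movieId", "title", "genres")(
    1, "Toy Story", "Animation",
    2, "Jumanji", "Adventure",
)

fun amazingTitles(): List<Any?> {
    // `title` is a column accessor, not the column's data.
    val title by column<String>()

    // "Amazing $title" would interpolate the accessor object itself.
    // title() is evaluated per row and returns that row's String value.
    val newDf = sampleMovies().add("amazingTitle") { "Amazing ${title()}" }

    return newDf["amazingTitle"].values().toList()
}
```

With the sample rows above, each value in amazingTitle should be the word "Amazing" followed by that row's title rather than a ColumnAccessorImpl reference.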

@murfel
Author

murfel commented May 21, 2024

Thanks so much!

Aha, so I assume I cannot rename the header in place because DataFrame is immutable, in the persistent Kotlin style, but I can create a new df with the new titles.

@koperagen
Collaborator

Exactly this, yes
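
As a sketch of what that looks like in practice, the library's rename operation returns a new frame and leaves the original as it was. The example assumes the org.jetbrains.kotlinx:dataframe dependency is available. The sample rows, the new column name "name", and the helper function names are made up for illustration.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical stand-in for movies.csv with the same three column names.
fun sampleMovies() = dataFrameOf("movieId", "title", "genres")(
    1, "Toy Story", "Animation",
    2, "Jumanji", "Adventure",
)

// rename(...).into(...) produces a new frame with the renamed header.
fun renamedColumnNames(): List<String> =
    sampleMovies().rename("title").into("name").columnNames()

// The frame that rename was called on keeps its original header.
fun originalColumnNames(): List<String> {
    val df = sampleMovies()
    df.rename("title").into("name") // result deliberately ignored
    return df.columnNames()
}
```

Assigning the result, as in val renamed = df.rename("title").into("name"), is therefore the way to "rename the header".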

@Jolanrensen Jolanrensen added the documentation Improvements or additions to documentation (not KDocs) label May 21, 2024
@zaleslaw zaleslaw added this to the 0.14.0 milestone Jul 19, 2024
@zaleslaw zaleslaw self-assigned this Jul 19, 2024
@zaleslaw zaleslaw modified the milestones: 0.14.0, 0.15.0 Sep 4, 2024
@zaleslaw zaleslaw removed their assignment Oct 1, 2024
@zaleslaw zaleslaw modified the milestones: 0.15.0, 0.16.0 Oct 1, 2024
@zaleslaw zaleslaw self-assigned this Oct 1, 2024