Provide dataframe with hierarchical columns in PandasTransformComponent #204
Labels
Core
Core framework
Comments
RobbeSneyders added a commit that referenced this issue on Jun 14, 2023
Fixes #183

There are some to-dos left. We should:
- [ ] Look into the redefinition of the divisions after we clear them. Now that we take this out of the hands of the user, we should define which strategy we want to follow here. (#205)
- [ ] Move to `hierarchical columns`. Pandas can work with hierarchical columns, which would be a lot nicer as a user interface. I want to check if I can make this work with Dask, and otherwise move the translation from underscored names to hierarchical columns and back to the level of the `PandasTransformComponent`. (#204)
- [ ] Update the reusable components to leverage the `PandasTransformComponent`. (#203)
RobbeSneyders added a commit that referenced this issue on Jun 16, 2023
Fixes #204

Unfortunately, I wasn't able to propagate this through Dask. I got the Dask code to work with hierarchical columns as long as the index was not hierarchical, but I ran into issues when trying to write this data to parquet. This means that only the pandas component gets hierarchical columns, and I do the translation when we switch from Dask to Pandas.

I would propose to replace the `_` in the flat column names with a symbol that we expect to be used less frequently in the names of data columns or fields. E.g. `images_data` could become `images+data` instead. This will have no impact on users of the `PandasTransformComponent`, but it will change things for users of the `DaskTransformComponent` (it is not yet clear how many of those there are).

Retrieving the data using hierarchical columns is easy, but building a dataframe with hierarchical columns is harder. We might want to add some "cookbook" style examples for this.
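As one possible "cookbook" style example, here is a minimal sketch of building a dataframe with hierarchical columns in plain pandas and reading it back. The subset names (`images`, `captions`), the field names, and the `id` index are made up for illustration and are not taken from the project's code.

```python
import pandas as pd

# Hypothetical subsets and fields, chosen only for illustration.
index = pd.Index(["a", "b"], name="id")
images = pd.DataFrame({"data": [b"\x00", b"\x01"], "width": [64, 128]}, index=index)
captions = pd.DataFrame({"text": ["a cat", "a dog"]}, index=index)

# Passing a dict to pd.concat along axis=1 uses the dict keys as the
# outer column level, giving (subset, field) column tuples.
df = pd.concat({"images": images, "captions": captions}, axis=1)

# Retrieving is the easy part: selecting the outer level returns a plain
# dataframe whose columns are the fields of that subset.
image_fields = df["images"]

# A single field is addressed with a (subset, field) tuple.
widths = df[("images", "width")]
```

Building each subset as its own flat dataframe and concatenating at the end avoids constructing the `MultiIndex` by hand, which tends to be the part that is harder to get right.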
Hakimovich99 pushed a commit that referenced this issue on Oct 16, 2023
Hakimovich99 pushed a commit that referenced this issue on Oct 16, 2023
We currently provide the data to the user as a Dask dataframe with columns in the format "{subset}_{field}", since Dask doesn't fully support hierarchical columns (dask/dask#1493). Pandas does, and #200 adds a `PandasTransformComponent`, so we can now provide users with a Pandas dataframe with hierarchical columns.

We should investigate whether this can be propagated through Dask, as hierarchy might only be an issue for the dataframe index. If it's not possible, we can keep the current approach in Dask and translate to and from hierarchical columns in the `PandasTransformComponent`.
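The fallback of translating to and from hierarchical columns could look roughly like the sketch below. The helper names `to_hierarchical` and `to_flat` are hypothetical, not the project's API, and the sketch assumes every column name contains the separator at least once.

```python
import pandas as pd


def to_hierarchical(df: pd.DataFrame, sep: str = "_") -> pd.DataFrame:
    """Return a copy with flat "{subset}{sep}{field}" columns split into two levels."""
    # Split on the first separator only, so field names may contain it.
    # Subset names that contain the separator would be split incorrectly.
    tuples = [tuple(name.split(sep, 1)) for name in df.columns]
    out = df.copy()
    out.columns = pd.MultiIndex.from_tuples(tuples, names=["subset", "field"])
    return out


def to_flat(df: pd.DataFrame, sep: str = "_") -> pd.DataFrame:
    """Return a copy with (subset, field) columns joined back into flat names."""
    out = df.copy()
    out.columns = [sep.join(col) for col in df.columns]
    return out


# Hypothetical flat dataframe, as it might look after collecting from Dask.
flat = pd.DataFrame(
    {"images_data": [b"\x00"], "images_width": [64], "captions_text": ["a cat"]}
)
hierarchical = to_hierarchical(flat)
roundtrip = to_flat(hierarchical)
```

Because the split relies on the separator appearing in a predictable place, a subset name such as `raw_images` would be parsed as subset `raw` with field `images_data`. Any real implementation would need a rule for that case.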