Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide dataframe with hierarchical columns in PandasTransformComponent #204

Closed
RobbeSneyders opened this issue Jun 14, 2023 · 0 comments · Fixed by #211
Closed

Provide dataframe with hierarchical columns in PandasTransformComponent #204

RobbeSneyders opened this issue Jun 14, 2023 · 0 comments · Fixed by #211
Labels
Core Core framework

Comments

@RobbeSneyders
Copy link
Member

We currently provide the data to the user as a Dask dataframe with columns in the format of "{subset}_{field}" since Dask doesn't fully support hierarchical columns (dask/dask#1493). Pandas does, and #200 adds a PandasTransformComponent, so we can provide users with a Pandas dataframe with hierarchical columns now.

We should investigate if this can be propagated through Dask, as hierarchy might only be an issue for the Dataframe index. If it's not possible, we can keep the current approach in Dask and translate to / from hierarchical columns in the PandasTransformComponent.

@RobbeSneyders RobbeSneyders converted this from a draft issue Jun 14, 2023
@RobbeSneyders RobbeSneyders added the Core Core framework label Jun 14, 2023
@RobbeSneyders RobbeSneyders mentioned this issue Jun 14, 2023
3 tasks
RobbeSneyders added a commit that referenced this issue Jun 14, 2023
Fixes #183 

There's some todo's left, we should
- [ ] Look into the redefinition of the divisions after we clear them.
Now that we take this out of the hands of the user, we should define
which strategy we want to follow here. (#205)
- [ ] Move to `hierarchical columns`. Pandas can work with hierarchical
columns, which would be a lot nicer as a user interface. I want to check
if I can make this work with Dask, and otherwise move the translation
from underscored names to hierarchical columns and back at the level of
the `PandasTransformComponent` (#204)
- [ ] Update the reusable components to leverage the
`PandasTransformComponent` (#203)
@RobbeSneyders RobbeSneyders moved this from Breakdown to Validation in Fondant development Jun 15, 2023
RobbeSneyders added a commit that referenced this issue Jun 16, 2023
Fixes #204 

Unfortunately, wasn't able to propagate this through Dask. I got the
Dask code to work with hierarchical columns as long as the index was not
hierarchical, but I run into issues when trying to write this data to
parquet.

So this means that only the pandas component gets hierarchical columns
and I do the translation when we switch from Dask to Pandas.

I would propose to change our `_` for the flat column names to a symbol
which we expect to be less frequently used in the names of data columns
or fields. Eg. `images_data` could become `images+data` instead. This
will have no impact on users using the `PandasTransformComponent`, but
it will change for users using the `DaskTransformComponent` (which is
not clear yet how much there would be).

Retrieving the data using hierarchical columns is easy, but building a
dataframe with hierarchical columns is harder, we might want to add some
"cookbook" style examples for this.
@github-project-automation github-project-automation bot moved this from Validation to Done in Fondant development Jun 16, 2023
Hakimovich99 pushed a commit that referenced this issue Oct 16, 2023
Fixes #183 

There's some todo's left, we should
- [ ] Look into the redefinition of the divisions after we clear them.
Now that we take this out of the hands of the user, we should define
which strategy we want to follow here. (#205)
- [ ] Move to `hierarchical columns`. Pandas can work with hierarchical
columns, which would be a lot nicer as a user interface. I want to check
if I can make this work with Dask, and otherwise move the translation
from underscored names to hierarchical columns and back at the level of
the `PandasTransformComponent` (#204)
- [ ] Update the reusable components to leverage the
`PandasTransformComponent` (#203)
Hakimovich99 pushed a commit that referenced this issue Oct 16, 2023
Fixes #204 

Unfortunately, wasn't able to propagate this through Dask. I got the
Dask code to work with hierarchical columns as long as the index was not
hierarchical, but I run into issues when trying to write this data to
parquet.

So this means that only the pandas component gets hierarchical columns
and I do the translation when we switch from Dask to Pandas.

I would propose to change our `_` for the flat column names to a symbol
which we expect to be less frequently used in the names of data columns
or fields. Eg. `images_data` could become `images+data` instead. This
will have no impact on users using the `PandasTransformComponent`, but
it will change for users using the `DaskTransformComponent` (which is
not clear yet how much there would be).

Retrieving the data using hierarchical columns is easy, but building a
dataframe with hierarchical columns is harder, we might want to add some
"cookbook" style examples for this.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Core Core framework
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

1 participant