Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

issue 137/PR 145 automatically search for categorical variables #145

Merged

Conversation

patrickleonardy
Copy link
Contributor

Story Title

Automatically search for categorical variables #137

Changes made

  • Add functionality to Cobra to automatically assume which variables are continuous and which are discontinuous. one still is able to correct cobra if something is assumed in the wrong way.
  • Add get_continous_and_discreate_columns function to the PreProcesser class.
  • Changed fit and fit_transform to automatically call get_continous_and_discreate_columns if both continuous_vars, discrete_vars parameters are equal to None.

How does the solution address the problem

One can now set the values for continuous_vars, discrete_vars in fitand fit_transform equal to None. and Cobra will guess which variables are continuous and which are discontinuous.

Linked issues

Resolves #137

Change tests for newer sklearn
@patrickleonardy patrickleonardy merged commit 1163861 into develop Jan 16, 2023
Copy link
Contributor

@ZlaTanskY ZlaTanskY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review

@@ -223,28 +223,87 @@ def from_pipeline(cls, pipeline: dict):
target_encoder,
is_fitted=pipeline["_is_fitted"],
)

def get_continous_and_discreate_columns(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_continuous_and_discrete_columns(

target_column_name :str
) -> tuple:
"""Filters out the continious and discreate varaibles out of a dataframe and returns a tuple containing lists of column names
It assumes that numerical comumns with less than or equal to 10 different values are categorical
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

continuous instead of continious
discrete instead of discrete
variables instead of varaibles
columns instead of columns

Parameters
----------
df : pd.DataFrame
DataFrame that you want to divide in discreate and continous variables
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typos

f"""Cobra automaticaly assumes that following variables are
discrete: {discrete_vars}
continuous: {continuous_vars}
If you want to change this behaviour you can specify the discrete/continuous variables yourself with the continuous_vars and discrete_vars keywords. \nIt assumes that numerical comumns with less than or equal to 10 different values are categorical"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo comumns -> columns

):
"""Fit the data to the preprocessing pipeline.
If you put continious_vars and target_vars equal to `None` and give the id_col_name Cobra will guess which varaibles are continious and which are not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typos:
continuous, variables, continuous

"""
if not (continuous_vars and discrete_vars):
continuous_vars, discrete_vars = self.get_continous_and_discreate_columns(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typos in code can happen, but please always double-check your method names for typos, since this snowballs to the end user eventually.
get_continuous_and_discrete_columns(...)

"""Fit preprocessing pipeline and transform the data.
If you put continious_vars and target_vars equal to `None` and give the id_col_name Cobra will guess which varaibles are continious and which are not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typos

@sandervh14 sandervh14 added the enhancement New feature or request label Mar 9, 2023
@sandervh14 sandervh14 added this to the 2023-03 milestone Mar 9, 2023
@sandervh14
Copy link
Contributor

Hi @patrickleonardy, it looks like Jano reviewed this after you and Pedro merged it. Still, his comments are valuable. Can you check your work against his comments?

@patrickleonardy
Copy link
Contributor Author

patrickleonardy commented Mar 13, 2023

Hi Sander,
I had some problems figuring out how to reopen an already merged branch and thus reopened a branch to change the typos mentioned by Jano in #147 (the moment I figured out that one can reopen branches that were already merged I forgot to change this).
I would thus suggest that I add the changes done in #146 here and we delete #146.

@sandervh14 sandervh14 changed the title 137 automatically search for categorical variables 147 automatically search for categorical variables Apr 7, 2023
@sandervh14 sandervh14 changed the title 147 automatically search for categorical variables 1457 automatically search for categorical variables Apr 7, 2023
@sandervh14 sandervh14 changed the title 1457 automatically search for categorical variables issue 137/PR 145 automatically search for categorical variables Apr 7, 2023
@sandervh14 sandervh14 modified the milestones: 2023-03, 2023-04 Apr 7, 2023
@sandervh14 sandervh14 modified the milestones: 2023-04, 2023-05 May 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Automatically search for categorical variables
4 participants