feat(engine): dataset type detection, update flatline logic #22
@@ -64,6 +66,13 @@ def dtypes(self, dtypes: dict):
dtypes[col] = dtype
self._dtypes = dtypes

@property
Same logic as dtypes, except that it has no explicit setter (adding one would make sense if we decide to introduce a new argument such as dataset_type).
In __init__ the private attribute _df_types is stored as None. The first time the df_type property is accessed, the getter uses the default inference to set the private attribute.
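For readers following the thread, a minimal sketch of the lazy getter pattern described in this comment. This is not the code in the PR: the class name `LazyTyped`, the helper `infer_df_type`, its datetime-based rule and the `inference_calls` counter are all hypothetical, chosen only to make the example self-contained.

```python
from datetime import datetime
from enum import Enum
from typing import Optional


class DataFrameType(Enum):
    TABULAR = "tabular"
    TIMESERIES = "timeseries"


def infer_df_type(index: list) -> DataFrameType:
    """Hypothetical default inference: an all-datetime index means timeseries."""
    is_temporal = len(index) > 0 and all(isinstance(i, datetime) for i in index)
    return DataFrameType.TIMESERIES if is_temporal else DataFrameType.TABULAR


class LazyTyped:
    """Hypothetical stand-in for the engine class touched by the diff."""

    def __init__(self, index):
        self._index = list(index)
        # Nothing is inferred at construction time.
        self._df_types: Optional[DataFrameType] = None
        self.inference_calls = 0  # only here to show the inference runs once

    @property
    def df_type(self) -> DataFrameType:
        # First access triggers the default inference; later accesses reuse it.
        if self._df_types is None:
            self.inference_calls += 1
            self._df_types = infer_df_type(self._index)
        return self._df_types
```

Because there is no setter, assigning to `df_type` raises `AttributeError`. If a `dataset_type` argument is added later, a setter could write to the private attribute directly and skip the inference.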
@@ -189,31 +187,6 @@ def predict_missingness(df: pd.DataFrame, feature: str):
# 5. Return the area under the roc curve
return roc_auc_score(y_test, y_pred)

It makes no sense to keep this in modelling (smart inference could be a different story). Moved to auxiliary.
@@ -19,7 +20,10 @@ def __init__(self, df: pd.DataFrame, vmv_extensions: Optional[list]=[]):
vmv_extensions: A list of user provided Value Missing Values to append to defaults.
"""
super().__init__(df=df)
self._tests = ["flatlines", "predefined_valued_missing_values"]
if self.df_type == DataFrameType.TIMESERIES:
This affects what evaluate runs on the legacy engines.
@@ -78,10 +82,13 @@ def flatlines(self, th: int=5, skip: list=[]):
th: Defines the minimum length required for a flatline event to be reported.
skip: List of columns that will not be target of search for flatlines.
Pass '__index' inside skip list to skip looking for flatlines at the index."""
if self.df_type == DataFrameType.TABULAR:
This affects explicit calls to this method on the legacy engines.
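A rough sketch of how the two df_type checks in the hunks above might fit together. The branch bodies are not visible in the diff, so everything below is an assumption: that "flatlines" is only scheduled for timeseries data, and that an explicit call on tabular data returns early with `None`. The class `MissingsSketch` and its list-based input are hypothetical and stand in for the real DataFrame-backed engine.

```python
from enum import Enum
from itertools import groupby


class DataFrameType(Enum):
    TABULAR = "tabular"
    TIMESERIES = "timeseries"


class MissingsSketch:
    """Hypothetical engine working on a plain list instead of a DataFrame."""

    def __init__(self, values, df_type: DataFrameType):
        self.values = list(values)
        self.df_type = df_type
        # Assumption: flatlines is only part of the default test suite
        # (what evaluate would run) when the data is a timeseries.
        self._tests = ["predefined_valued_missing_values"]
        if self.df_type == DataFrameType.TIMESERIES:
            self._tests.append("flatlines")

    def flatlines(self, th: int = 5):
        # Assumption: an explicit call on tabular data is skipped, since row
        # order carries no meaning there.
        if self.df_type == DataFrameType.TABULAR:
            return None
        events = []
        start = 0
        for value, run in groupby(self.values):
            length = len(list(run))
            if length >= th:  # report runs of at least `th` equal values
                events.append((start, value, length))
            start += length
        return events
```

With this shape, the first check changes what the default evaluation runs, and the second changes what a direct `flatlines()` call returns, which matches the two impacts noted in the comments above.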
@@ -40,12 +41,6 @@ def tdf(self):
def __get_missing_labels(df: pd.DataFrame, label: str):
return df[df[label].isna()]

def _get_data_types(self):
This was unused
@@ -40,3 +41,40 @@ def random_split(df: Union[pd.DataFrame, pd.Series], split_size: float, shuffle:
split = sample.iloc[:split_len]
remainder = sample.iloc[split_len:]
return split, remainder

def infer_dtypes(df: Union[pd.DataFrame, pd.Series], skip: Union[list, set] = []):
This method makes more sense in auxiliary than in modelling.
010e05e to 5203eb5
All good here
A series of changes are introduced:
Important changes
Minor changes