feat(engine): dataset type detection, update flatline logic #22

Merged 1 commit on Sep 20, 2021

jfsantos-ds
Contributor

This PR introduces the following changes:

Important changes

  • Add the DataFrameType enum, covering both supported DataFrame types (tabular and timeseries)
  • Add the df_type property
  • Add the infer_df_type auxiliary method, which infers the type of a DataFrame from its index column
  • Update the flatline (VMV) logic so that it runs only on timeseries DataFrames and only on numerical dtypes
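
As an illustration of the first three items, here is a minimal sketch of an enum plus index-based inference. The member values and the rule itself (a datetime-like index means timeseries, anything else means tabular) are assumptions made for this example; the PR only states that inference is based on the index column.

```python
from enum import Enum

import pandas as pd


class DataFrameType(Enum):
    """The two DataFrame types named in this PR (member values are assumed)."""
    TABULAR = "tabular"
    TIMESERIES = "timeseries"


def infer_df_type(df: pd.DataFrame) -> DataFrameType:
    """Guess the DataFrame type from its index (assumed heuristic)."""
    # Assumption: a datetime-like index marks a timeseries; anything else is tabular.
    datetime_like = (pd.DatetimeIndex, pd.PeriodIndex, pd.TimedeltaIndex)
    if isinstance(df.index, datetime_like):
        return DataFrameType.TIMESERIES
    return DataFrameType.TABULAR
```

Under this heuristic, a DataFrame indexed by pd.date_range(...) would be classified as timeseries, while one with the default RangeIndex would be classified as tabular.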

Minor changes

  • Store an updated, corrupted version of the macrodata (VMV) example DataFrame
  • Move dtype inference to the auxiliary methods
  • Remove an unused method from the labelling engine

@@ -64,6 +66,13 @@ def dtypes(self, dtypes: dict):
            dtypes[col] = dtype
        self._dtypes = dtypes

    @property
Contributor Author

Same logic as dtypes, except that it has no explicit setter method (adding one would make sense if we later introduce a new argument, e.g. dataset_type).
In __init__ the private attribute _df_types is stored as None; the first time the df_type property is accessed, the getter uses the default inference to set the private attribute.
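
The lazy getter described in this comment can be sketched as follows. The Engine class, its has_datetime_index argument, the placeholder inference and the inference_calls counter are all hypothetical and exist only to make the caching visible; only the df_type property and the _df_types attribute are named in the PR.

```python
from enum import Enum
from typing import Optional


class DataFrameType(Enum):
    TABULAR = "tabular"
    TIMESERIES = "timeseries"


class Engine:
    """Hypothetical stand-in for the engine class touched by this PR."""

    def __init__(self, has_datetime_index: bool):
        self._has_datetime_index = has_datetime_index
        self._df_types: Optional[DataFrameType] = None  # nothing inferred yet
        self.inference_calls = 0  # illustration only: counts inference runs

    def _infer(self) -> DataFrameType:
        # Placeholder rule; the real inference is done by infer_df_type.
        self.inference_calls += 1
        if self._has_datetime_index:
            return DataFrameType.TIMESERIES
        return DataFrameType.TABULAR

    @property
    def df_type(self) -> DataFrameType:
        # Infer on first access, then reuse the cached value.
        if self._df_types is None:
            self._df_types = self._infer()
        return self._df_types
```

Because the property has no setter, assigning to engine.df_type raises AttributeError, so the type can only come from inference until a setter is added.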

@@ -189,31 +187,6 @@ def predict_missingness(df: pd.DataFrame, feature: str):
    # 5. Return the area under the roc curve
    return roc_auc_score(y_test, y_pred)

Contributor Author

It makes no sense to keep this in modelling (smart inference could be a different story). Moved to auxiliary.

@@ -19,7 +20,10 @@ def __init__(self, df: pd.DataFrame, vmv_extensions: Optional[list]=[]):
            vmv_extensions: A list of user provided Valued Missing Values to append to defaults.
        """
        super().__init__(df=df)
        self._tests = ["flatlines", "predefined_valued_missing_values"]
        if self.df_type == DataFrameType.TIMESERIES:
Contributor Author

This affects what evaluate runs on the legacy engines.

@@ -78,10 +82,13 @@ def flatlines(self, th: int=5, skip: list=[]):
            th: Defines the minimum length required for a flatline event to be reported.
            skip: List of columns that will not be target of search for flatlines.
                Pass '__index' inside skip list to skip looking for flatlines at the index."""
        if self.df_type == DataFrameType.TABULAR:
Contributor Author

This affects explicit calls to the method on the legacy engines.
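
A rough sketch of the updated behaviour (flatlines searched only in timeseries data and only in numerical columns) could look like the code below. The helper find_flatlines, the is_timeseries flag and the returned column names are hypothetical; only the th threshold mirrors the parameter in the diff, and the run detection shown is one common pandas idiom, not necessarily the one used in the PR.

```python
import pandas as pd


def find_flatlines(series: pd.Series, th: int = 5) -> pd.DataFrame:
    """Return one row per run of at least `th` identical consecutive values."""
    # A new run starts wherever the value differs from the previous one.
    # NaN never equals NaN, so missing values are never merged into a run.
    run_id = (series != series.shift()).cumsum()
    rows = []
    for _, run in series.groupby(run_id.to_numpy()):
        if len(run) >= th:
            rows.append({"start": run.index[0], "end": run.index[-1],
                         "value": run.iloc[0], "length": len(run)})
    return pd.DataFrame(rows, columns=["start", "end", "value", "length"])


def flatlines(df: pd.DataFrame, is_timeseries: bool, th: int = 5) -> dict:
    """Search numerical columns for flatlines, but only in timeseries data."""
    if not is_timeseries:
        return {}  # tabular data: skip the search entirely
    numeric = df.select_dtypes(include="number")  # drop non-numerical columns
    found = {col: find_flatlines(numeric[col], th) for col in numeric.columns}
    return {col: events for col, events in found.items() if not events.empty}
```

Returning an empty result for tabular data keeps explicit calls harmless there, which matches the guard on DataFrameType.TABULAR in the hunk above in spirit.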

@@ -40,12 +41,6 @@ def tdf(self):
    def __get_missing_labels(df: pd.DataFrame, label: str):
        return df[df[label].isna()]

    def _get_data_types(self):
Contributor Author

This was unused

@@ -40,3 +41,40 @@ def random_split(df: Union[pd.DataFrame, pd.Series], split_size: float, shuffle:
    split = sample.iloc[:split_len]
    remainder = sample.iloc[split_len:]
    return split, remainder

def infer_dtypes(df: Union[pd.DataFrame, pd.Series], skip: Union[list, set] = []):
Contributor Author

This method makes more sense in auxiliary than in modelling.

Contributor Author

@jfsantos-ds left a comment

All good here

@UrbanoFonseca UrbanoFonseca merged commit c88e333 into master Sep 20, 2021
@UrbanoFonseca UrbanoFonseca deleted the fix/smarter_flatline_detection branch September 21, 2021 21:10