feat(engine): dataset type detection, update flatline logic #22

jfsantos-ds · 2021-09-15T12:20:30Z

This PR introduces the following changes:

Important changes

Add the DataFrameType enum, covering both supported DataFrame types (tabular and timeseries)
Add the df_type property
Add the infer_df_type auxiliary method, which infers the type of a DataFrame from its index column
Update the flatline (VMV) logic so that it runs only on timeseries DataFrames and only on numerical dtypes
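
For readers who want a concrete picture of the type detection, here is a minimal sketch. It is not the code from this PR: the enum member values and the inference rule (a datetime-like index means timeseries, anything else means tabular) are assumptions made for illustration. Only the names DataFrameType, TABULAR, TIMESERIES and infer_df_type come from the PR.

```python
from enum import Enum

import pandas as pd


class DataFrameType(Enum):
    """Member names appear in the PR diff; the string values are assumed."""
    TABULAR = "tabular"
    TIMESERIES = "timeseries"


def infer_df_type(df: pd.DataFrame) -> DataFrameType:
    """Assumed rule: a datetime-like index marks the frame as a timeseries."""
    datetime_like = (pd.DatetimeIndex, pd.PeriodIndex, pd.TimedeltaIndex)
    if isinstance(df.index, datetime_like):
        return DataFrameType.TIMESERIES
    return DataFrameType.TABULAR


# Illustrative frames: one indexed by dates, one with the default integer index.
ts_df = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0]},
    index=pd.date_range("2021-01-01", periods=3, freq="D"),
)
tab_df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})

ts_type = infer_df_type(ts_df)
tab_type = infer_df_type(tab_df)
```

The real inference may apply additional checks on the index; the sketch only shows the general shape of an index-based decision.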

Minor changes

Store an updated, corrupted version of the macrodata (VMV) example DataFrame
Move dtype inference to auxiliary methods
Remove an unused method from the labelling engine

examples/macrodata/macrodata.csv

jfsantos-ds · 2021-09-15T12:26:19Z

src/ydata_quality/core/engine.py

@@ -64,6 +66,13 @@ def dtypes(self, dtypes: dict):
                dtypes[col] = dtype
        self._dtypes = dtypes

+    @property


Same logic as dtypes, except that it has no explicit setter method (adding one will make sense if we wish to introduce a new argument such as dataset_type).
In init, the private attribute _df_types is stored as None. The first time the df_type property is accessed, the getter uses the default inference to set the private attribute.
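
The lazy getter described above could look roughly like the sketch below. The class name LazyTypeEngine, the _infer helper, the inference_calls counter and the constructor argument are hypothetical and exist only to make the example self-contained; only df_type and _df_types are names taken from the comment.

```python
from typing import Optional


class LazyTypeEngine:
    """Hypothetical stand-in for the engine, showing only the lazy property."""

    def __init__(self, index_is_datetime: bool):
        self._index_is_datetime = index_is_datetime
        # Stored as None until the property is first accessed.
        self._df_types: Optional[str] = None
        # Counter used only to show that inference runs once.
        self.inference_calls = 0

    def _infer(self) -> str:
        # Placeholder for the default inference; the real rule lives elsewhere.
        self.inference_calls += 1
        return "timeseries" if self._index_is_datetime else "tabular"

    @property
    def df_type(self) -> str:
        # No setter is defined: the value always comes from inference.
        if self._df_types is None:
            self._df_types = self._infer()
        return self._df_types


engine = LazyTypeEngine(index_is_datetime=True)
first = engine.df_type
second = engine.df_type  # served from the cached private attribute
```

Because the value is cached after the first access, repeated reads of df_type do not rerun the inference.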

jfsantos-ds · 2021-09-15T12:29:21Z

src/ydata_quality/utils/modelling.py

@@ -189,31 +187,6 @@ def predict_missingness(df: pd.DataFrame, feature: str):
    # 5. Return the area under the roc curve
    return roc_auc_score(y_test, y_pred)



It makes no sense to keep this in modelling (smart inference could be a different story). Moved to auxiliary.

jfsantos-ds · 2021-09-15T12:29:50Z

src/ydata_quality/valued_missing_values/engine.py

@@ -19,7 +20,10 @@ def __init__(self, df: pd.DataFrame, vmv_extensions: Optional[list]=[]):
            vmv_extensions: A list of user provided Value Missing Values to append to defaults.
        """
        super().__init__(df=df)
-        self._tests = ["flatlines", "predefined_valued_missing_values"]
+        if self.df_type == DataFrameType.TIMESERIES:


This affects the evaluate method of the legacy engines.

jfsantos-ds · 2021-09-15T12:30:13Z

src/ydata_quality/valued_missing_values/engine.py

@@ -78,10 +82,13 @@ def flatlines(self, th: int=5, skip: list=[]):
            th: Defines the minimum length required for a flatline event to be reported.
            skip: List of columns that will not be target of search for flatlines.
                Pass '__index' inside skip list to skip looking for flatlines at the index."""
+        if self.df_type == DataFrameType.TABULAR:


This affects explicit calls to this method on the legacy engines.
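
As an illustration of the intended behaviour, the sketch below searches for flatlines only when the data is a timeseries and only in numerical columns. It is an assumption-laden example, not the engine's implementation: the function names find_flatlines and flatlines_by_column, the is_timeseries flag, the return shapes, and the choice to return an empty result for tabular data are all hypothetical. Only the th parameter mirrors the method signature shown in the diff.

```python
from typing import Dict, List, Tuple

import pandas as pd


def find_flatlines(series: pd.Series, th: int = 5) -> List[Tuple[object, object, int]]:
    """Return (start_label, end_label, length) for runs of equal values of length >= th."""
    # A new run starts wherever a value differs from the previous one.
    run_id = (series != series.shift()).cumsum().to_numpy()
    events = []
    for _, run in series.groupby(run_id):
        if len(run) >= th:
            events.append((run.index[0], run.index[-1], len(run)))
    return events


def flatlines_by_column(df: pd.DataFrame, is_timeseries: bool, th: int = 5) -> Dict[str, list]:
    """Assumed guard: skip tabular data entirely and ignore non-numerical columns."""
    if not is_timeseries:
        return {}
    numeric = df.select_dtypes(include="number")
    result = {}
    for col in numeric.columns:
        events = find_flatlines(numeric[col], th=th)
        if events:
            result[col] = events
    return result


index = pd.date_range("2021-01-01", periods=8, freq="D")
df = pd.DataFrame(
    {"a": [1, 2, 2, 2, 2, 2, 3, 4], "b": list("xxxxxxxx")},
    index=index,
)

ts_result = flatlines_by_column(df, is_timeseries=True, th=5)
tab_result = flatlines_by_column(df, is_timeseries=False, th=5)
```

Column "b" is constant but holds strings, so the sketch ignores it; column "a" contains one run of five equal values starting on 2021-01-02.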

jfsantos-ds · 2021-09-20T11:05:36Z

src/ydata_quality/labelling/engine.py

@@ -40,12 +41,6 @@ def tdf(self):
    def __get_missing_labels(df: pd.DataFrame, label: str):
        return df[df[label].isna()]

-    def _get_data_types(self):


This was unused

jfsantos-ds · 2021-09-20T11:06:26Z

src/ydata_quality/utils/auxiliary.py

@@ -40,3 +41,40 @@ def random_split(df: Union[pd.DataFrame, pd.Series], split_size: float, shuffle:
    split = sample.iloc[:split_len]
    remainder = sample.iloc[split_len:]
    return split, remainder
+
+def infer_dtypes(df: Union[pd.DataFrame, pd.Series], skip: Union[list, set] = []):


This method makes more sense in auxiliary than in modelling.

jfsantos-ds

All good here

jfsantos-ds requested a review from UrbanoFonseca September 15, 2021 12:20

jfsantos-ds self-assigned this Sep 15, 2021


fix(engine): dataset type & updated flatline logic

5203eb5

UrbanoFonseca force-pushed the fix/smarter_flatline_detection branch from 010e05e to 5203eb5 Compare September 20, 2021 16:27


UrbanoFonseca merged commit c88e333 into master Sep 20, 2021

UrbanoFonseca deleted the fix/smarter_flatline_detection branch September 21, 2021 21:10
