
FEAT: Add dbinfer DFS Feature Transformer #2

Open · wants to merge 11 commits into base: main
Conversation

@AnirudhDagar (Collaborator) commented Jun 7, 2024

Since the DFS feature transformer differs slightly from the other feature transformers, it inherits from BaseFeatureTransformer but leaves the _fit_dataframes and _transform_dataframes methods unused; instead, fit and transform are overridden, since the DFS transform is applied to the whole dataset at once. Moreover, there is technically no concept of "fit" for DFS, as no parameters are learned; the fit method here serves as setup before the actual transformation takes place in transform.
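To make the override pattern concrete, here is a minimal sketch of the shape described above. The class and attribute names beyond fit/transform are assumptions for illustration, not the PR's actual code:

```python
import pandas as pd

class BaseFeatureTransformer:
    # Stand-in for the real base class; subclasses normally implement
    # _fit_dataframes/_transform_dataframes, which DFS leaves unused.
    def fit(self, df: pd.DataFrame, **kwargs): ...
    def transform(self, df: pd.DataFrame, **kwargs): ...

class DFSFeatureTransformer(BaseFeatureTransformer):
    """Overrides fit/transform directly because DFS works on the whole dataset."""

    def fit(self, df: pd.DataFrame, **kwargs):
        # No parameters are learned for DFS; fit only sets up metadata/config.
        self.metadata = {"id_column": kwargs.get("id_column", "id")}  # hypothetical
        return self

    def transform(self, df: pd.DataFrame, **kwargs) -> pd.DataFrame:
        # Placeholder for the actual DFS call; the id column is required as
        # the key during DFS, so it is dropped only at the very end.
        transformed = df.copy()
        return transformed.drop(columns=[self.metadata["id_column"]], errors="ignore")
```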

IMPORTANT:

  1. This PR also moves dropping the id column to the end of the transformation pipeline, since single-table DFS requires id columns as the key.
  2. Intended to use the released version of dbinfer, but the specific DFS engine this code relies on (dfs2sql) has not been released publicly. Otherwise it would be very straightforward to switch from the internal tab2graph to the released dbinfer code, just by swapping the imports at the top.

TODO:

  • Needs benchmarking on evaluation_37.txt
  • Improve automatic metadata.yaml file creation (to be discussed and done iteratively in another PR)

@canerturkmen (Collaborator) left a comment:

Absolute masterclass. Thanks!!

I have a bunch of comments and questions; my top overarching questions are: 1/ what is the difference between tab2graph and dfs2sql, and can we decouple even more from tab2graph? 2/ seeing as this is the most non-trivial functionality we're building, let's cover it with some tests to at least make sure the key assumptions and input-output expectations are satisfied.

Really looking forward to the benchmark results as well.



_DTYPE_MAP = {
"Categorical": "category", # NOTE: Limitation, int dtype columns could be just categorical columns but AG won't infer.
Collaborator:

can't we use a trick to cast them as strings beforehand if there are fewer than X unique values? I thought this was the trick used by AG.

Collaborator Author:

https://github.com/autogluon/autogluon/blob/bda6174f4a1fb8398aef4f375d9eacfd29bb46d9/common/src/autogluon/common/features/infer_types.py#L12

AG is simply checking the dtypes for inference here. I agree that would work better, but is this something we should be supporting in AG?

Collaborator:

Let's maybe build it here as cat variables will be reused as FKs?
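The casting trick being discussed could look roughly like the following sketch. The threshold (`max_unique`) and function name are assumptions for illustration; the idea is just to treat low-cardinality int columns as categoricals so they can later be reused as FKs:

```python
import pandas as pd

def cast_low_cardinality_to_category(df: pd.DataFrame, max_unique: int = 20) -> pd.DataFrame:
    # Int columns with few unique values are often really categoricals, but a
    # dtype-based inference pass won't catch them; cast them explicitly.
    out = df.copy()
    for col in out.select_dtypes(include="integer").columns:
        if out[col].nunique() <= max_unique:
            out[col] = out[col].astype("category")
    return out
```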

}


# NOTE: Inherits from BaseFeatureTransformer but reimplements its methods
Collaborator:

Noted. We can convert this to a TODO for now.

raise NotImplementedError

def _get_time_column(self, column_name_types_map):
for key, value in column_name_types_map.items():
Collaborator:

interesting. how does DFS use the datetime column? if it's to ensure information from the future doesn't leak into rollups, is taking the first datetime column encountered too simplified?

we can maybe add a todo to make this another transformer in the future, as other feature generators can benefit from it.

Collaborator Author:

Yes, afaik the reason tab2graph needs the time column is to avoid temporal leakage: DFS should not aggregate information from the future, i.e. beyond the given timestamp (a.k.a. cutoff time).

And yes, taking the first datetime column here is too simplified, but this is again part of the problem I mention about automatically generating the metadata yaml file; we can iterate on this functionality and improve it over time. For now there are some assumptions; I should have added a NOTE/TODO comment here.

This can be improved by making the metadata creation more intelligent through _get_dtype_with_llm.

except:
logger.warning(f"FeatureTransformer {self.__class__.__name__} failed to transform.")
finally:
return task
Collaborator:

would be great to cover at least the fit_transform method of this class with tests, even if for some dummy tasks, just to make sure that columns are respected. this can help us both 1/ start setting up the unit-test scaffolding and 2/ make sure nothing gets lost or leaked in this class (this is our most non-trivial feature transformer at this point).

Collaborator Author:

great point, completely agree. I'll set up the tests and add some dummy data generation and tasks to test this.

column_type_dict = {k: v for k, v in column_type_dict.items() if v != "null"}
column_config = []
for k, v in column_type_dict.items():
if v == "ID":
Collaborator:

For my understanding: so the logic is that we mark only the PK of the main table as a "foreign_key" (why?) and nothing else as a key? But then in the pre-DFS processing, categorical variables will also be converted to FKs from a dummy table?

Collaborator Author:

A dataset could have multiple ID columns, especially if we already have a defined metadata yaml file passed from the user, as discussed offline. If we have an LLM prompt to handle this, we can hope to pick up more id columns if present and mark them as foreign keys.

> However then in the pre-DFS processing categorical variables will also be converted to FKs from a dummy table?

Yes, this is mentioned in the paper in appendix G4.

I'll add a note here as well. The _get_dtype_with_llm method is aimed at addressing all these questions, but I've skipped it for now to have the most basic implementation in place to start with.

Let's take this discussion offline for better sync.

"dfs": {
"max_depth": depth,
"use_cutoff_time": True,
"engine": "dfs2sql", # dfs2sql engine is not released yet in dbinfer; use tab2graph
Collaborator:

confused a little by this? aren't we specifying "dfs2sql" here? also, what is the difference?

Collaborator Author:

The TGIF team has two libraries: tab2graph and dbinfer. dbinfer refers to the code that has been open-sourced; tab2graph has some private functionality and is also under continuous development, afaics. The dfs2sql engine is the backend for DFS preprocessing that is available only in the private repository, hence we use the tab2graph repo. Otherwise we could have just depended on the dbinfer codebase.

Maybe the dfs2sql engine is not strictly necessary here; for example, see the configs in the open-sourced code (dbinfer codebase).

If we don't specify the engine key, one could still get by, but the codebase will probably fall back to the featuretools engine, which is supposed to be slower than the sql engine.
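The engine trade-off described above can be sketched as a small config builder. The config shape mirrors the diff snippet; the function name and the fallback behavior when "engine" is omitted are assumptions based on the discussion, not confirmed library behavior:

```python
def make_dfs_config(depth: int, use_sql_engine: bool = True) -> dict:
    config = {
        "dfs": {
            "max_depth": depth,
            # Avoid temporal leakage: don't aggregate past the cutoff timestamp.
            "use_cutoff_time": True,
        }
    }
    if use_sql_engine:
        # dfs2sql lives only in the private tab2graph repo; omitting "engine"
        # presumably falls back to the slower featuretools engine in dbinfer.
        config["dfs"]["engine"] = "dfs2sql"
    return config
```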

@AnirudhDagar (Collaborator Author) left a comment:

> what is the difference between tab2graph and dfs2sql and can we decouple even more from tab2graph?

See the comment in the review; and yes, it is possible to depend only on dbinfer if we don't use dfs2sql.

> seeing as this is the most non-trivial functionality we're building, let's cover it with some tests to at least make sure the key assumptions and input-output expectations are being satisfied.

I'll work on adding those tests, especially making sure there is no loss of data after the transformation.


@AnirudhDagar mentioned this pull request Oct 28, 2024