
FEAT: Add dbinfer DFS Feature Transformer #2

Open · wants to merge 11 commits into base: main
Conversation

@AnirudhDagar (Collaborator) commented Jun 7, 2024

Since the DFS feature transformer differs slightly from the other feature transformers, it inherits from BaseFeatureTransformer but leaves the _fit_dataframes and _transform_dataframes methods unused; instead, fit and transform are overridden, since the DFS transform is applied to the whole dataset at once. Moreover, there is technically no concept of "fit" for DFS, as no parameters are learned; the fit method here serves as setup before the actual transformation takes place in transform.
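To make the override pattern concrete, here is a minimal sketch of the shape described above. The class and attribute names beyond fit/transform are assumptions for illustration, not the PR's actual code:

```python
import pandas as pd

class BaseFeatureTransformer:
    # Stand-in for the real base class; subclasses normally implement
    # _fit_dataframes/_transform_dataframes, which DFS leaves unused.
    def fit(self, df: pd.DataFrame, **kwargs): ...
    def transform(self, df: pd.DataFrame, **kwargs): ...

class DFSFeatureTransformer(BaseFeatureTransformer):
    """Overrides fit/transform directly because DFS works on the whole dataset."""

    def fit(self, df: pd.DataFrame, **kwargs):
        # No parameters are learned for DFS; fit only sets up metadata/config.
        self.metadata = {"id_column": kwargs.get("id_column", "id")}  # hypothetical
        return self

    def transform(self, df: pd.DataFrame, **kwargs) -> pd.DataFrame:
        # Placeholder for the actual DFS call; the id column is required as
        # the key during DFS, so it is dropped only at the very end.
        transformed = df.copy()
        return transformed.drop(columns=[self.metadata["id_column"]], errors="ignore")
```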

IMPORTANT:

  1. This PR also moves dropping the id column to the end of the transformation pipeline, since single-table DFS requires id columns as the key.
  2. Intended to use the released version of dbinfer, but the specific DFS engine this code relies on (dfs2sql) has not been released publicly. Otherwise it would be very straightforward to switch from the internal tab2graph to the released dbinfer code, just by swapping the imports at the top.

TODO:

  • Needs benchmarking on evaluation_37.txt
  • Improve automatic metadata.yaml file creation (to be discussed and done iteratively in another PR)

@canerturkmen (Collaborator) left a comment:

Absolute masterclass. Thanks!!

I have a bunch of comments and questions; my top overarching questions are: 1/ what is the difference between tab2graph and dfs2sql, and can we decouple even more from tab2graph? 2/ seeing as this is the most non-trivial functionality we're building, let's cover it with some tests to at least make sure the key assumptions and input-output expectations are satisfied.

Really looking forward to the benchmark results as well.



_DTYPE_MAP = {
"Categorical": "category", # NOTE: Limitation, int dtype columns could be just categorical columns but AG won't infer.
Collaborator:

can't we use a trick to cast them as strings beforehand if there are fewer than X unique values? I thought this was the trick used by AG.

Collaborator Author:

https://github.com/autogluon/autogluon/blob/bda6174f4a1fb8398aef4f375d9eacfd29bb46d9/common/src/autogluon/common/features/infer_types.py#L12

AG is simply checking the dtypes for inference here. I agree that would work better, but is this something we should be supporting in AG?

Collaborator:

Let's maybe build it here as cat variables will be reused as FKs?
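The casting trick being discussed could look roughly like the following sketch. The threshold (`max_unique`) and function name are assumptions for illustration; the idea is just to treat low-cardinality int columns as categoricals so they can later be reused as FKs:

```python
import pandas as pd

def cast_low_cardinality_to_category(df: pd.DataFrame, max_unique: int = 20) -> pd.DataFrame:
    # Int columns with few unique values are often really categoricals, but a
    # dtype-based inference pass won't catch them; cast them explicitly.
    out = df.copy()
    for col in out.select_dtypes(include="integer").columns:
        if out[col].nunique() <= max_unique:
            out[col] = out[col].astype("category")
    return out
```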

}


# NOTE: Inherits from BaseFeatureTransformer but reimplements its methods
Collaborator:

Noted. We can convert this to a TODO for now.

raise NotImplementedError

def _get_time_column(self, column_name_types_map):
for key, value in column_name_types_map.items():
Collaborator:

interesting. how does DFS use the datetime column? if it's to ensure information from the future doesn't leak into rollups, is taking the first datetime column encountered too simplified?

we can maybe add a todo to make this another transformer in the future, as other feature generators can benefit from it.

Collaborator Author:

Yes, afaik the reason tab2graph needs the time column is to avoid temporal leakage: DFS should not aggregate information from the future, i.e. beyond the given timestamp (a.k.a. cutoff time).

And yes, taking the first datetime column here is too simplified, but this is again part of the problem I mention about automatically generating the metadata yaml file; we can iterate on this functionality and improve it over time. For now there are some assumptions; I should have added a NOTE/TODO comment here.

This can be improved by making the metadata creation more intelligent through _get_dtype_with_llm.

except:
logger.warning(f"FeatureTransformer {self.__class__.__name__} failed to transform.")
finally:
return task
Collaborator:

would be great to cover at least the fit_transform method of this class with tests, even if for some dummy tasks, just to make sure that columns are respected. this can help us both 1/ start setting up the unit-test scaffolding and 2/ make sure nothing gets lost or leaked in this class (this is our most non-trivial feature transformer at this point).

Collaborator Author:

great point, completely agree. I'll set up the tests and add some dummy data generation and tasks to test this.

column_type_dict = {k: v for k, v in column_type_dict.items() if v != "null"}
column_config = []
for k, v in column_type_dict.items():
if v == "ID":
Collaborator:

For my understanding: so the logic is that we mark only the PK of the main table as a "foreign_key" (why?) and nothing else as a key? But then in the pre-DFS processing, categorical variables will also be converted to FKs from a dummy table?

Collaborator Author:

A dataset could have multiple ID columns, especially if we already have a defined metadata yaml file passed from the user, as discussed offline. If we have an LLM prompt to handle this, we can hope to pick up more id columns if present and mark them as foreign keys.

> However then in the pre-DFS processing categorical variables will also be converted to FKs from a dummy table?

Yes, this is mentioned in the paper in appendix G4.

I'll add a note here as well. The _get_dtype_with_llm method is aimed at addressing all these questions, but I've skipped it for now to have the most basic implementation in place to start with.

Let's take this discussion offline for better sync.

"dfs": {
"max_depth": depth,
"use_cutoff_time": True,
"engine": "dfs2sql", # dfs2sql engine is not released yet in dbinfer; use tab2graph
Collaborator:

confused a little by this? aren't we specifying "dfs2sql" here? also, what is the difference?

Collaborator Author:

The TGIF team has two libraries: tab2graph and dbinfer. dbinfer refers to the code that has been open-sourced; tab2graph has some private functionality and is also under continuous development, afaics. The dfs2sql engine is the backend for DFS preprocessing that is available only in the private repository, hence we use the tab2graph repo. Otherwise we could have just depended on the dbinfer codebase.

Maybe the dfs2sql engine is not strictly necessary here; for example, see the configs in the open-sourced code (dbinfer codebase).

If we don't specify the engine key, one could still get by, but the codebase will probably fall back to the featuretools engine, which is supposed to be slower than the sql engine.
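The engine trade-off described above can be sketched as a small config builder. The config shape mirrors the diff snippet; the function name and the fallback behavior when "engine" is omitted are assumptions based on the discussion, not confirmed library behavior:

```python
def make_dfs_config(depth: int, use_sql_engine: bool = True) -> dict:
    config = {
        "dfs": {
            "max_depth": depth,
            # Avoid temporal leakage: don't aggregate past the cutoff timestamp.
            "use_cutoff_time": True,
        }
    }
    if use_sql_engine:
        # dfs2sql lives only in the private tab2graph repo; omitting "engine"
        # presumably falls back to the slower featuretools engine in dbinfer.
        config["dfs"]["engine"] = "dfs2sql"
    return config
```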

@AnirudhDagar (Collaborator Author) left a comment:

> what is the difference between tab2graph and dfs2sql and can we decouple even more from tab2graph?

See the comment in the review; and yes, it is possible to depend only on dbinfer if we don't use dfs2sql.

> seeing as this is the most non-trivial functionality we're building, let's cover it with some tests to at least make sure the key assumptions and input-output expectations are being satisfied.

I'll work on adding those tests, especially making sure there is no loss of data after the transformation.


@AnirudhDagar mentioned this pull request Oct 28, 2024