FEAT: Add dbinfer DFS Feature Transformer #2
base: main
Conversation
Absolute masterclass. Thanks!!
I have a bunch of comments and questions, and my two overarching questions are: 1/ what is the difference between `tab2graph` and `dfs2sql`, and can we decouple even more from `tab2graph`? 2/ seeing as this is the most non-trivial functionality we're building, let's cover it with some tests to at least make sure the key assumptions and input-output expectations are being satisfied.
Really looking forward to the benchmark results as well.
_DTYPE_MAP = {
    "Categorical": "category",  # NOTE: Limitation, int dtype columns could be just categorical columns but AG won't infer.
can't we use a trick to cast them as strings beforehand if there are fewer than X unique values? I thought this was the trick used by AG.
AG is simply checking the dtypes for inferring here. I agree, that should work better, but is this something that we should be supporting in AG?
Let's maybe build it here as cat variables will be reused as FKs?
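A minimal sketch of that pre-cast idea, assuming pandas DataFrames; the helper name and the `max_unique` threshold are hypothetical, not a documented AG default:

```python
import pandas as pd

def cast_low_cardinality_ints(df: pd.DataFrame, max_unique: int = 20) -> pd.DataFrame:
    """Cast low-cardinality integer columns to 'category' so downstream
    dtype-based inference treats them as categoricals (and they can later
    be reused as FKs). Threshold is an assumption for illustration."""
    df = df.copy()
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]) and df[col].nunique() <= max_unique:
            df[col] = df[col].astype("category")
    return df
```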
}
# NOTE: Inherits from BaseFeatureTransformer but reimplements methods
Noted. We can convert this to a TODO for now.
    raise NotImplementedError

def _get_time_column(self, column_name_types_map):
    for key, value in column_name_types_map.items():
interesting. how does DFS use the datetime column? if it's to ensure information from the future doesn't leak in rollups, is it too simplified to take the first datetime column encountered?
we can maybe add a todo to make this another transformer in the future as other feature generators can benefit from it.
Yes, afaik the reason tab2graph needs the time column is to avoid temporal leakage: DFS should not aggregate information from the future beyond the given timestamp (a.k.a. the cutoff time).

And yes, it is too simplified to take the first datetime column here, but this is again part of the problem I mentioned about automatically generating the metadata yaml file; we can iterate on this functionality and improve it over time. For now there are some assumptions, and I should have added a NOTE/TODO comment here. This can be improved by making the metadata creation more intelligent through `_get_dtype_with_llm`.
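As a hedged sketch of what that TODO could grow into (assuming `column_name_types_map` maps column names to type strings such as `"Datetime"`, which is an assumption from the surrounding code):

```python
import logging

logger = logging.getLogger(__name__)

def _get_time_column(self, column_name_types_map):
    # Collect all datetime-typed columns instead of silently taking the
    # first one encountered, and warn when the choice is ambiguous.
    datetime_cols = [
        name for name, dtype in column_name_types_map.items() if dtype == "Datetime"
    ]
    if not datetime_cols:
        return None
    if len(datetime_cols) > 1:
        # TODO: disambiguate via the user metadata yaml or _get_dtype_with_llm.
        logger.warning(
            f"Multiple datetime columns {datetime_cols}; using "
            f"{datetime_cols[0]} as the cutoff-time column."
        )
    return datetime_cols[0]
```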
except:
    logger.warning(f"FeatureTransformer {self.__class__.__name__} failed to transform.")
finally:
    return task
would be great to cover at least the `fit_transform` method of this class with tests, even if for some dummy tasks, just to make sure that columns are respected. this can help us both 1/ start setting up the unit test scaffolding and 2/ make sure nothing gets lost or leaked in this class (this is our most non-trivial feature transformer at this point).
great point, completely agree, I'll set up the tests and add some dummy data generation and tasks to test this.
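Something along these lines could serve as a first test. `DFSTransformer`, its constructor, and a DataFrame-friendly `fit_transform` are all placeholders here; the real class lives in src/autogluon_assistant/transformer/feature_transformers/dfs.py and works on tasks:

```python
import pandas as pd

def test_dfs_fit_transform_preserves_rows_and_columns():
    train = pd.DataFrame(
        {
            "id": [1, 2, 3, 4],
            "feature": [0.1, 0.2, 0.3, 0.4],
            "label": [0, 1, 0, 1],
        }
    )
    transformer = DFSTransformer()  # hypothetical instantiation
    transformed = transformer.fit_transform(train)
    # No rows may be lost or duplicated by the transform ...
    assert len(transformed) == len(train)
    # ... and the original columns must still be present (DFS only adds features).
    assert set(train.columns) <= set(transformed.columns)
```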
column_type_dict = {k: v for k, v in column_type_dict.items() if v != "null"}
column_config = []
for k, v in column_type_dict.items():
    if v == "ID":
For my understanding, so the logic is that we mark only the PK of the main table as a "foreign_key" (why?) and nothing else as a key? However then in the pre-DFS processing categorical variables will also be converted to FKs from a dummy table?
A dataset could have multiple ID columns, especially if we already have a defined metadata yaml file passed from the user, as discussed offline. If we have an LLM prompt to handle this, we can hope to pick up more ID columns if present and mark them as foreign keys.

> However then in the pre-DFS processing categorical variables will also be converted to FKs from a dummy table?

Yes, this is mentioned in the paper in appendix G4; I'll add a note here as well. The `_get_dtype_with_llm` method is aimed at improving all of these points, but I've skipped it for now to have the most basic implementation in place to start with.

Let's take this discussion offline for better sync.
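A hedged continuation of the loop in the snippet above, showing the direction discussed: every "ID"-typed column becomes a foreign key, not just the main table's PK. It assumes `column_type_dict` from the snippet (e.g. `{"user_id": "ID", "city": "Categorical"}`); the config dict keys are illustrative, not the exact tab2graph/dbinfer schema:

```python
column_config = []
for name, col_type in column_type_dict.items():
    if col_type == "ID":
        # With richer metadata (user yaml or _get_dtype_with_llm), all ID
        # columns get picked up and marked as foreign keys.
        column_config.append({"name": name, "dtype": "foreign_key"})
    elif col_type == "Categorical":
        # In pre-DFS processing, categoricals are themselves later converted
        # to FKs referencing dummy tables (paper, appendix G4).
        column_config.append({"name": name, "dtype": "category"})
    else:
        column_config.append({"name": name, "dtype": col_type})
```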
"dfs": { | ||
"max_depth": depth, | ||
"use_cutoff_time": True, | ||
"engine": "dfs2sql", # dfs2sql engine is not realeased yet in dbinfer; use tab2graph |
confused a little by this? aren't we specifying "dfs2sql" here? also, what is the difference?
The TGIF team has two libraries: tab2graph and dbinfer. dbinfer refers to the code that has been open-sourced; tab2graph has some private functionality and is also being developed continuously, afaics. The dfs2sql engine is the backend for DFS preprocessing that is available only in the private repository, hence we use the tab2graph repo; otherwise we could have just depended on the dbinfer codebase.

Maybe the dfs2sql engine is not strictly necessary here; see, for example, the configs in the open-sourced code (the dbinfer codebase). If we don't specify the engine key, one could still get by, but the codebase will then probably fall back to the featuretools engine, which is supposed to be slower than the SQL engine.
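An illustrative config builder based on this thread; the key names follow the snippet above, while the helper itself and its flag are hypothetical:

```python
def build_dfs_config(depth: int, have_private_tab2graph: bool) -> dict:
    dfs_config = {
        "max_depth": depth,
        "use_cutoff_time": True,  # avoid temporal leakage via cutoff times
    }
    if have_private_tab2graph:
        # "dfs2sql" is the SQL backend available only in the private repo.
        dfs_config["engine"] = "dfs2sql"
    # Omitting "engine" lets the open-source dbinfer fall back to the
    # (slower) featuretools engine.
    return {"dfs": dfs_config}
```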
> what is the difference between tab2graph and dfs2sql and can we decouple even more from tab2graph?

See my comment in the review; and yes, it is possible to depend only on dbinfer if we don't use dfs2sql.

> seeing as this is the most non-trivial functionality we're building, let's cover it with some tests to at least make sure the key assumptions and input-output expectations are being satisfied.

I'll work on adding those tests, especially making sure there is no loss of data after the transformation.
Committer: AnirudhDagar <anirudhdagar6@gmail.com>
Since the DFS feature transformer is slightly different from the other feature transformers, `BaseFeatureTransformer` is inherited but the `_fit_dataframes` and `_transform_dataframes` methods are unused; instead, `transform` and `fit` are overridden, as we work with the whole dataset at once when applying the DFS transform. Besides, there is technically no concept of fit for DFS, since no parameters are learnt: the `fit` method here serves as setup before the actual transformation takes place in `transform` (see the sketch below).

IMPORTANT:
TODO:
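A rough sketch of the class shape the description above implies; the method bodies and the helper names (`_build_metadata`, `_apply_dfs`) are illustrative only, not the PR's actual implementation:

```python
class DFSTransformer(BaseFeatureTransformer):
    def fit(self, task, **kwargs):
        # DFS learns no parameters; fit only performs setup (e.g. building
        # the table metadata) before the actual work happens in transform.
        self.metadata = self._build_metadata(task)  # hypothetical helper
        return self

    def transform(self, task, **kwargs):
        # Operates on the whole dataset at once, which is why the
        # per-dataframe hooks _fit_dataframes/_transform_dataframes
        # from the base class stay unused.
        return self._apply_dfs(task, self.metadata)  # hypothetical helper
```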