Infer min and max timestamps from entity_df to limit data read from BQ source #1665
Conversation
…Q source Signed-off-by: Mwad22 <51929507+Mwad22@users.noreply.github.com>
Hi @Mwad22. Thanks for your PR. I'm waiting for a feast-dev member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Thanks for this @Mwad22! A bit of a nitpick, but is anything preventing us from inferring and using the timestamps purely in SQL? I suspect it would be significantly faster than scanning a dataframe.
Codecov Report
@@ Coverage Diff @@
## master #1665 +/- ##
==========================================
+ Coverage 82.69% 82.75% +0.05%
==========================================
Files 76 76
Lines 6734 6754 +20
==========================================
+ Hits 5569 5589 +20
Misses 1165 1165
…ying BQ source Signed-off-by: Mwad22 <51929507+Mwad22@users.noreply.github.com>
Thanks for pointing that out, @woop. I just committed some changes that infer and use the timestamps purely in SQL as requested (and removed the …)
LGTM
Looks much nicer. @MattDelac, are you ok with this change?
TL;DR: I am fine with the current PR, as the overhead of the MIN/MAX computations should be minimal (especially compared to the overall cost of the whole SQL). That being said, the more long-term approach would be to fetch the min/max locally to avoid adding complexity to the SQL template. We could even plan to fetch some useful metadata first and reduce the complexity of the SQL template.
Reasoning:
It sounds like the MAX() and MIN() operations are going to be computed multiple times in BigQuery. For example, I just looked at the "Execution details" of the following query in BQ, and the MIN()/MAX() operations are computed at least twice:
WITH left_table AS (
SELECT "user 1" AS col_agg, 1 AS col_value, TIMESTAMP '2021-01-01' AS col_timestamp
UNION ALL
SELECT "user 1" AS col_agg, 2 AS col_value, TIMESTAMP '2021-01-02' AS col_timestamp
UNION ALL
SELECT "user 2" AS col_agg, 3 AS col_value, TIMESTAMP '2021-01-03' AS col_timestamp
UNION ALL
SELECT "user 2" AS col_agg, 4 AS col_value, TIMESTAMP '2021-01-04' AS col_timestamp
UNION ALL
SELECT "user 2" AS col_agg, 5 AS col_value, TIMESTAMP '2021-01-05' AS col_timestamp
),
subquery_table AS (
SELECT "user 1" AS col_agg, 6 AS col_value, TIMESTAMP '2021-01-01' AS col_timestamp
UNION ALL
SELECT "user 1" AS col_agg, 7 AS col_value, TIMESTAMP '2021-01-02' AS col_timestamp
UNION ALL
SELECT "user 2" AS col_agg, 8 AS col_value, TIMESTAMP '2021-01-03' AS col_timestamp
UNION ALL
SELECT "user 2" AS col_agg, 9 AS col_value, TIMESTAMP '2021-01-04' AS col_timestamp
UNION ALL
SELECT "user 2" AS col_agg, 10 AS col_value, TIMESTAMP '2021-01-05' AS col_timestamp
),
timestamp_bounds AS (
SELECT
MAX(col_timestamp) AS max_boundary,
MIN(col_timestamp) AS min_boundary
FROM left_table
),
subquery AS (
SELECT *
FROM left_table
WHERE left_table.col_timestamp <= (SELECT max_boundary FROM timestamp_bounds)
AND left_table.col_timestamp >= TIMESTAMP_SUB((SELECT min_boundary FROM timestamp_bounds), INTERVAL 3000 second)
),
compute_something AS (
SELECT
col_agg,
SUM(col_value) AS col_value_tot
FROM subquery
GROUP BY 1
)
SELECT *
FROM subquery
LEFT JOIN compute_something USING (col_agg)
So it sounds like it would be more efficient, and a better long-term approach, to just fetch the boundaries locally and avoid adding complexity to the current BQ template, especially since the bounds might be computed again and again because of the "for loop" happening on each FeatureView. For example:
query = """
SELECT
MIN(timestamp_col) AS min_timetamp,
MAX(timestamp_col) AS max_timetamp
FROM left_table_query_string
"""
boundary_df = bigquery.query(query=query).to_pandas()
min_timestamp = boundary_df.loc[0, "min_timetamp"]
max_timestamp = boundary_df.loc[0, "max_timetamp"]
Note: we could potentially get the min/max info from the pandas dataframe without performing another SQL query.
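A minimal sketch of that note, assuming the entity dataframe is already materialized locally; the column names and values below are made up for illustration:
import pandas as pd

# Hypothetical entity_df with an event_timestamp column; in practice this is
# the dataframe the user passes in.
entity_df = pd.DataFrame(
    {
        "user_id": ["user 1", "user 1", "user 2"],
        "event_timestamp": pd.to_datetime(
            ["2021-01-01", "2021-01-03", "2021-01-05"], utc=True
        ),
    }
)

# Read the bounds straight off the dataframe instead of issuing another
# BigQuery query.
min_timestamp = entity_df["event_timestamp"].min()
max_timestamp = entity_df["event_timestamp"].max()
print(min_timestamp, max_timestamp)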
I agree that we can further optimize the query by determining the min/max in another query.
lgtm
This PR originally scanned the local dataframe, but I'm not sure if that will hold up for large datasets. I'd want to move …
Oh yes, it would only work when you pass a pandas dataframe as the left table (instead of a query). So it does not solve all use cases.
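To make the two cases concrete, here is a hedged sketch of how the dispatch could look; the helper name and signature are illustrative, not the actual Feast code:
from typing import Tuple, Union

import pandas as pd


def infer_timestamp_bounds(
    entity_df: Union[pd.DataFrame, str],
    timestamp_column: str = "event_timestamp",
) -> Tuple[pd.Timestamp, pd.Timestamp]:
    # Local dataframe: the bounds are known without touching BigQuery.
    if isinstance(entity_df, pd.DataFrame):
        return (
            entity_df[timestamp_column].min(),
            entity_df[timestamp_column].max(),
        )
    # Query string: the bounds are not known locally, so they would have to be
    # computed in BigQuery (either inside the main query or in a separate one).
    raise NotImplementedError(
        "Bounds for a query-string entity_df must be computed in SQL"
    )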
lgtm pending the column naming change
…ep retrieval query simple Signed-off-by: Mwad22 <51929507+Mwad22@users.noreply.github.com>
Thank you all for your detailed reviews, definitely appreciated! If there are no other changes to be requested, I think this change should be good to go.
Overall it looks good to me. Just wanted to get your opinion on one last nitpick :) Otherwise I am happy to ship it.
An integration test for _get_entity_df_timestamp_bounds() would be lovely. Otherwise it LGTM.
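For reference, a rough sketch of what such a test could assert; the import path and the signature of _get_entity_df_timestamp_bounds() are assumptions here and would need to match the real helper:
import pandas as pd

# Assumed import path; adjust to wherever the helper actually lives.
from feast.infra.offline_stores.bigquery import _get_entity_df_timestamp_bounds


def test_get_entity_df_timestamp_bounds():
    entity_df = pd.DataFrame(
        {
            "driver_id": [1001, 1002, 1003],
            "event_timestamp": pd.to_datetime(
                ["2021-04-12", "2021-04-13", "2021-04-14"], utc=True
            ),
        }
    )
    # Assumed signature: (entity_df, timestamp_column) -> (min_ts, max_ts).
    min_ts, max_ts = _get_entity_df_timestamp_bounds(entity_df, "event_timestamp")
    assert min_ts == entity_df["event_timestamp"].min()
    assert max_ts == entity_df["event_timestamp"].max()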
@woop @MattDelac I've added my thoughts here on why I think subtracting the TTL in the query might be preferable. I am also open to adding an integration test for …
Ok, from my side you can leave the query as is. I think …
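For context on that trade-off, a small hedged illustration of the two places the TTL subtraction could live; the TTL value and column name are made up:
from datetime import timedelta

import pandas as pd

ttl_seconds = 3000  # example TTL, not taken from a real feature view
min_timestamp = pd.Timestamp("2021-01-01", tz="UTC")

# Option A: subtract the TTL locally and inline a literal into the query.
lower_bound = min_timestamp - timedelta(seconds=ttl_seconds)

# Option B: keep the subtraction in the SQL itself, as the PR currently does.
where_clause = (
    "WHERE event_timestamp >= "
    f"TIMESTAMP_SUB(TIMESTAMP '{min_timestamp.isoformat()}', "
    f"INTERVAL {ttl_seconds} second)"
)
print(lower_bound)
print(where_clause)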
Signed-off-by: Mwad22 <51929507+Mwad22@users.noreply.github.com>
Signed-off-by: Mwad22 <51929507+Mwad22@users.noreply.github.com>
…ical retrieval test Signed-off-by: Mwad22 <51929507+Mwad22@users.noreply.github.com>
…/max timestamp inference Signed-off-by: Mwad22 <51929507+Mwad22@users.noreply.github.com>
Signed-off-by: Mwad22 <51929507+Mwad22@users.noreply.github.com>
Signed-off-by: Mwad22 <51929507+Mwad22@users.noreply.github.com>
Looking great
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: MattDelac, Mwad22, woop
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/lgtm
Signed-off-by: Mwad22 51929507+Mwad22@users.noreply.github.com
What this PR does / why we need it:
Infers the minimum and maximum timestamps from the provided entity_df if possible (i.e., if entity_df is provided as a Pandas dataframe). Right now, too much data is being read, since the time range used for the feature data doesn't account for the min and max timestamps of the base data (entity_df). For instance, if the max timestamp on order ids is 5/1/2020, we want to avoid wasting time looking at data in the range 5/2/2020 to present day; we would not join feature data from this range.
In short, this change will allow us to read less data from BigQuery sources.
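As an illustration of the intent (not the actual Feast query template; table and column names are placeholders), the inferred bounds would constrain the source scan roughly like this:
import pandas as pd

entity_df = pd.DataFrame(
    {
        "order_id": [100, 101, 102],
        "event_timestamp": pd.to_datetime(
            ["2020-04-28", "2020-04-30", "2020-05-01"], utc=True
        ),
    }
)

min_ts = entity_df["event_timestamp"].min()
max_ts = entity_df["event_timestamp"].max()

# Only rows within the entity_df's time range (widened on the lower end by the
# feature view's TTL) need to be read from the BigQuery source.
source_scan = f"""
SELECT *
FROM `project.dataset.feature_table`
WHERE event_timestamp BETWEEN
    TIMESTAMP_SUB(TIMESTAMP '{min_ts.isoformat()}', INTERVAL 3000 second)
    AND TIMESTAMP '{max_ts.isoformat()}'
"""
print(source_scan)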
Which issue(s) this PR fixes:
No issue assigned to this at the moment, but it was left as a TODO item in bigquery.py here.
Does this PR introduce a user-facing change?: