Infer min and max timestamps from entity_df to limit data read from BQ source #1665
Conversation
…Q source Signed-off-by: Mwad22 <51929507+Mwad22@users.noreply.github.com>
Hi @Mwad22. Thanks for your PR. I'm waiting for a feast-dev member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Thanks for this @Mwad22! A bit of a nitpick, but is anything preventing us from inferring and using the timestamps purely in SQL? I suspect it would be significantly faster than scanning a dataframe.
Codecov Report
@@ Coverage Diff @@
## master #1665 +/- ##
==========================================
+ Coverage 82.69% 82.75% +0.05%
==========================================
Files 76 76
Lines 6734 6754 +20
==========================================
+ Hits 5569 5589 +20
Misses 1165 1165
…ying BQ source Signed-off-by: Mwad22 <51929507+Mwad22@users.noreply.github.com>
Thanks for pointing that out, @woop. I just committed some changes that infer and use the timestamps purely in SQL as requested (and removed the …)
LGTM
Looks much nicer. @MattDelac, are you ok with this change?
TL;DR: I am fine with the current PR, as the overhead of the MIN/MAX computations should be minimal (especially compared to the overall cost of the whole SQL). That being said, the more long-term approach would be to fetch the min/max locally to avoid adding complexity to the SQL template. We could even plan to fetch some useful metadata first and reduce the complexity of the SQL template.
Reasoning:
It sounds like the MAX() and MIN() operations are going to be computed multiple times in BigQuery. For example, I just looked at the "Execution details" of the following query in BQ, and the MIN()/MAX() operations are computed at least twice:
WITH left_table AS (
SELECT "user 1" AS col_agg, 1 AS col_value, TIMESTAMP '2021-01-01' AS col_timestamp
UNION ALL
SELECT "user 1" AS col_agg, 2 AS col_value, TIMESTAMP '2021-01-02' AS col_timestamp
UNION ALL
SELECT "user 2" AS col_agg, 3 AS col_value, TIMESTAMP '2021-01-03' AS col_timestamp
UNION ALL
SELECT "user 2" AS col_agg, 4 AS col_value, TIMESTAMP '2021-01-04' AS col_timestamp
UNION ALL
SELECT "user 2" AS col_agg, 5 AS col_value, TIMESTAMP '2021-01-05' AS col_timestamp
),
subquery_table AS (
SELECT "user 1" AS col_agg, 6 AS col_value, TIMESTAMP '2021-01-01' AS col_timestamp
UNION ALL
SELECT "user 1" AS col_agg, 7 AS col_value, TIMESTAMP '2021-01-02' AS col_timestamp
UNION ALL
SELECT "user 2" AS col_agg, 8 AS col_value, TIMESTAMP '2021-01-03' AS col_timestamp
UNION ALL
SELECT "user 2" AS col_agg, 9 AS col_value, TIMESTAMP '2021-01-04' AS col_timestamp
UNION ALL
SELECT "user 2" AS col_agg, 10 AS col_value, TIMESTAMP '2021-01-05' AS col_timestamp
),
timestamp_bounds AS (
SELECT
MAX(col_timestamp) AS max_boundary,
MIN(col_timestamp) AS min_boundary
FROM left_table
),
subquery AS (
SELECT *
FROM left_table
WHERE left_table.col_timestamp <= (SELECT max_boundary FROM timestamp_bounds)
AND left_table.col_timestamp >= TIMESTAMP_SUB((SELECT min_boundary FROM timestamp_bounds), INTERVAL 3000 second)
),
compute_something AS (
SELECT
col_agg,
SUM(col_value) AS col_value_tot
FROM subquery
GROUP BY 1
)
SELECT *
FROM subquery
LEFT JOIN compute_something USING (col_agg)
So it sounds like it would be more efficient, and a better long-term approach, to just fetch the boundaries locally and avoid adding complexity to the current BQ template, especially since the bounds might be computed again and again because of the "for loop" happening on each FeatureView. For example:
query = """
SELECT
MIN(timestamp_col) AS min_timetamp,
MAX(timestamp_col) AS max_timetamp
FROM left_table_query_string
"""
boundary_df = bigquery.query(query=query).to_pandas()
min_timestamp = boundary_df.loc[0, "min_timetamp"]
max_timestamp = boundary_df.loc[0, "max_timetamp"]
Note: we could potentially get the min/max info from the pandas dataframe without performing another SQL query.
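A minimal sketch of that note, assuming the entity dataframe is already materialized locally; the column names and values below are made up for illustration:
import pandas as pd

# Hypothetical entity_df with an event_timestamp column; in practice this is
# the dataframe the user passes in.
entity_df = pd.DataFrame(
    {
        "user_id": ["user 1", "user 1", "user 2"],
        "event_timestamp": pd.to_datetime(
            ["2021-01-01", "2021-01-03", "2021-01-05"], utc=True
        ),
    }
)

# Read the bounds straight off the dataframe instead of issuing another
# BigQuery query.
min_timestamp = entity_df["event_timestamp"].min()
max_timestamp = entity_df["event_timestamp"].max()
print(min_timestamp, max_timestamp)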
I agree that we can further optimize the query by determining the min/max in another query.
lgtm
This PR originally scanned the local dataframe, but I'm not sure if that will hold up for large datasets. I'd want to move …
Oh yes, it would only work when you pass a pandas dataframe as the left table (instead of a query). So it does not solve all use cases.
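To make the two cases concrete, here is a hedged sketch of how the dispatch could look; the helper name and signature are illustrative, not the actual Feast code:
from typing import Tuple, Union

import pandas as pd


def infer_timestamp_bounds(
    entity_df: Union[pd.DataFrame, str],
    timestamp_column: str = "event_timestamp",
) -> Tuple[pd.Timestamp, pd.Timestamp]:
    # Local dataframe: the bounds are known without touching BigQuery.
    if isinstance(entity_df, pd.DataFrame):
        return (
            entity_df[timestamp_column].min(),
            entity_df[timestamp_column].max(),
        )
    # Query string: the bounds are not known locally, so they would have to be
    # computed in BigQuery (either inside the main query or in a separate one).
    raise NotImplementedError(
        "Bounds for a query-string entity_df must be computed in SQL"
    )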
lgtm pending the column naming change
…ep retrieval query simple Signed-off-by: Mwad22 <51929507+Mwad22@users.noreply.github.com>
Thank you all for your detailed reviews, definitely appreciated! If there are no other changes to be requested, I think this change should be good to go.
Overall it looks good to me. Just wanted to get your opinion on one last nitpick :) Otherwise I am happy to ship it.
An integration test for _get_entity_df_timestamp_bounds() would be lovely. Otherwise it LGTM.
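For reference, a rough sketch of what such a test could assert; the import path and the signature of _get_entity_df_timestamp_bounds() are assumptions here and would need to match the real helper:
import pandas as pd

# Assumed import path; adjust to wherever the helper actually lives.
from feast.infra.offline_stores.bigquery import _get_entity_df_timestamp_bounds


def test_get_entity_df_timestamp_bounds():
    entity_df = pd.DataFrame(
        {
            "driver_id": [1001, 1002, 1003],
            "event_timestamp": pd.to_datetime(
                ["2021-04-12", "2021-04-13", "2021-04-14"], utc=True
            ),
        }
    )
    # Assumed signature: (entity_df, timestamp_column) -> (min_ts, max_ts).
    min_ts, max_ts = _get_entity_df_timestamp_bounds(entity_df, "event_timestamp")
    assert min_ts == entity_df["event_timestamp"].min()
    assert max_ts == entity_df["event_timestamp"].max()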
@woop @MattDelac I've added my thoughts here on why I think subtracting the TTL in the query might be preferable. I am also open to adding an integration test for …
Ok, from my side you can leave the query as is. I think …
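For context on that trade-off, a small hedged illustration of the two places the TTL subtraction could live; the TTL value and column name are made up:
from datetime import timedelta

import pandas as pd

ttl_seconds = 3000  # example TTL, not taken from a real feature view
min_timestamp = pd.Timestamp("2021-01-01", tz="UTC")

# Option A: subtract the TTL locally and inline a literal into the query.
lower_bound = min_timestamp - timedelta(seconds=ttl_seconds)

# Option B: keep the subtraction in the SQL itself, as the PR currently does.
where_clause = (
    "WHERE event_timestamp >= "
    f"TIMESTAMP_SUB(TIMESTAMP '{min_timestamp.isoformat()}', "
    f"INTERVAL {ttl_seconds} second)"
)
print(lower_bound)
print(where_clause)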
Signed-off-by: Mwad22 <51929507+Mwad22@users.noreply.github.com>
Signed-off-by: Mwad22 <51929507+Mwad22@users.noreply.github.com>
…ical retrieval test Signed-off-by: Mwad22 <51929507+Mwad22@users.noreply.github.com>
…/max timestamp inference Signed-off-by: Mwad22 <51929507+Mwad22@users.noreply.github.com>
Signed-off-by: Mwad22 <51929507+Mwad22@users.noreply.github.com>
Signed-off-by: Mwad22 <51929507+Mwad22@users.noreply.github.com>
Looking great
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: MattDelac, Mwad22, woop
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/lgtm
Signed-off-by: Mwad22 51929507+Mwad22@users.noreply.github.com
What this PR does / why we need it:
Infers the minimum and maximum timestamps from the provided entity_df if possible (i.e., if entity_df is provided as a Pandas dataframe). Right now, too much data is being read, since the time range used for the feature data doesn't account for the min and max timestamps of the base data (entity_df). For instance, if the max timestamp on order ids is 5/1/2020, we want to avoid wasting time looking at data in the range 5/2/2020 to present day; we would not join feature data from this range.
In short, this change will allow us to read less data from BigQuery sources.
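As an illustration of the intent (not the actual Feast query template; table and column names are placeholders), the inferred bounds would constrain the source scan roughly like this:
import pandas as pd

entity_df = pd.DataFrame(
    {
        "order_id": [100, 101, 102],
        "event_timestamp": pd.to_datetime(
            ["2020-04-28", "2020-04-30", "2020-05-01"], utc=True
        ),
    }
)

min_ts = entity_df["event_timestamp"].min()
max_ts = entity_df["event_timestamp"].max()

# Only rows within the entity_df's time range (widened on the lower end by the
# feature view's TTL) need to be read from the BigQuery source.
source_scan = f"""
SELECT *
FROM `project.dataset.feature_table`
WHERE event_timestamp BETWEEN
    TIMESTAMP_SUB(TIMESTAMP '{min_ts.isoformat()}', INTERVAL 3000 second)
    AND TIMESTAMP '{max_ts.isoformat()}'
"""
print(source_scan)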
Which issue(s) this PR fixes:
No issue assigned to this at the moment, but it was left as a TODO item in bigquery.py here.
Does this PR introduce a user-facing change?: