
Infer min and max timestamps from entity_df to limit data read from BQ source #1665

Merged: 11 commits into feast-dev:master, Jul 8, 2021

Conversation

@Mwad22 (Contributor) commented on Jun 23, 2021

Signed-off-by: Mwad22 <51929507+Mwad22@users.noreply.github.com>

What this PR does / why we need it:
Infers the minimum and maximum timestamps from the provided entity_df when possible (i.e., when entity_df is provided as a Pandas dataframe). Right now, too much data is being read, because the time range scanned for feature data doesn't account for the min and max timestamps of the base data (entity_df).

For instance, if the max timestamp on order IDs is 5/1/2020, we want to avoid wasting time scanning data in the range 5/2/2020 to the present day, since we would never join feature data from that range.

In short, this change allows us to read less data from BigQuery sources.

Which issue(s) this PR fixes:

No issue is assigned to this at the moment, but it was left as a TODO item in bigquery.py here.

Does this PR introduce a user-facing change?:

NONE
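
For illustration, here is a minimal sketch of the locally-inferred bounds described above (an editor's addition, not the PR's actual code; the helper name and the event_timestamp column default are assumptions):

import pandas as pd

def infer_entity_df_timestamp_bounds(entity_df: pd.DataFrame, ts_col: str = "event_timestamp"):
    # Normalize to UTC so min/max compare consistently, then take the bounds.
    timestamps = pd.to_datetime(entity_df[ts_col], utc=True)
    return timestamps.min(), timestamps.max()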

…Q source

Signed-off-by: Mwad22 <51929507+Mwad22@users.noreply.github.com>
@feast-ci-bot (Collaborator) commented:

Hi @Mwad22. Thanks for your PR.

I'm waiting for a feast-dev member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@woop (Member) commented on Jun 24, 2021

Thanks for this @Mwad22!

A bit of a nitpick, but is anything preventing us from inferring and using the timestamps purely in SQL? I suspect it would be significantly faster than scanning a dataframe.

@codecov-commenter commented on Jun 24, 2021

Codecov Report

Merging #1665 (c3a77dc) into master (2e0113e) will increase coverage by 0.05%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1665      +/-   ##
==========================================
+ Coverage   82.69%   82.75%   +0.05%     
==========================================
  Files          76       76              
  Lines        6734     6754      +20     
==========================================
+ Hits         5569     5589      +20     
  Misses       1165     1165              
Flag Coverage Δ
integrationtests 82.67% <100.00%> (+0.05%) ⬆️
unittests 69.73% <23.80%> (-0.15%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
sdk/python/feast/infra/offline_stores/bigquery.py 76.14% <100.00%> (+0.79%) ⬆️
sdk/python/tests/test_historical_retrieval.py 99.09% <100.00%> (+0.05%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2e0113e...c3a77dc. Read the comment docs.

…ying BQ source

Signed-off-by: Mwad22 <51929507+Mwad22@users.noreply.github.com>
@Mwad22 (Contributor, Author) commented on Jun 25, 2021

Thanks for this @Mwad22!

A bit of a nitpick, but is anything preventing us from inferring and using the timestamps purely in SQL? I suspect it would be significantly faster than scanning a dataframe.

Thanks for pointing that out @woop. I just committed some changes that infer and use the timestamps purely in SQL as requested (and removed the min/max_timestamps arguments as they are no longer needed).
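
For context, a minimal sketch of what inferring the bounds "purely in SQL" can look like (an editor's illustration, not the PR's actual template; the table names, the event_timestamp column, and the ttl_seconds placeholder are assumptions):

# Hypothetical template: compute the bounds in a CTE and filter the feature
# scan against them, all within a single BigQuery query.
query_template = """
WITH timestamp_bounds AS (
    SELECT
        MIN(event_timestamp) AS min_ts,
        MAX(event_timestamp) AS max_ts
    FROM `{entity_table}`
)
SELECT f.*
FROM `{feature_table}` f
WHERE f.event_timestamp <= (SELECT max_ts FROM timestamp_bounds)
  AND f.event_timestamp >= TIMESTAMP_SUB(
        (SELECT min_ts FROM timestamp_bounds), INTERVAL {ttl_seconds} SECOND)
"""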

@mavysavydav (Collaborator) commented:

LGTM

@woop changed the title from "infer min and max timestamps from entity_df to limit data read from BQ source" to "Infer min and max timestamps from entity_df to limit data read from BQ source" on Jun 25, 2021
@woop (Member) commented on Jun 25, 2021

Looks much nicer. @MattDelac are you ok with this change?

@MattDelac (Collaborator) left a comment:


TL;DR: I am fine with the current PR, as the overhead of the MIN/MAX computations should be minimal (especially compared to the overall cost of the whole SQL). That being said, the more long-term approach would be to fetch the min/max locally, to avoid adding complexity to the SQL template. We could even plan to fetch some useful metadata first and reduce the complexity of the SQL template.

Reasoning:
It sounds like the MAX() and MIN() operations are going to be computed multiple times in BigQuery.

For example, I just looked at the "Execution details" of the following query in BQ, and the MIN()/MAX() operations are computed at least twice:

WITH left_table AS (
    SELECT "user 1" AS col_agg, 1 AS col_value, TIMESTAMP '2021-01-01' AS col_timestamp
    UNION ALL
    SELECT "user 1" AS col_agg, 2 AS col_value, TIMESTAMP '2021-01-02' AS col_timestamp
    UNION ALL
    SELECT "user 2" AS col_agg, 3 AS col_value, TIMESTAMP '2021-01-03' AS col_timestamp
    UNION ALL
    SELECT "user 2" AS col_agg, 4 AS col_value, TIMESTAMP '2021-01-04' AS col_timestamp
    UNION ALL
    SELECT "user 2" AS col_agg, 5 AS col_value, TIMESTAMP '2021-01-05' AS col_timestamp
),

subquery_table AS (
    SELECT "user 1" AS col_agg, 6 AS col_value, TIMESTAMP '2021-01-01' AS col_timestamp
    UNION ALL
    SELECT "user 1" AS col_agg, 7 AS col_value, TIMESTAMP '2021-01-02' AS col_timestamp
    UNION ALL
    SELECT "user 2" AS col_agg, 8 AS col_value, TIMESTAMP '2021-01-03' AS col_timestamp
    UNION ALL
    SELECT "user 2" AS col_agg, 9 AS col_value, TIMESTAMP '2021-01-04' AS col_timestamp
    UNION ALL
    SELECT "user 2" AS col_agg, 10 AS col_value, TIMESTAMP '2021-01-05' AS col_timestamp
),

-- Compute the entity-timestamp bounds once from the left table.
timestamp_bounds AS (
    SELECT
        MAX(col_timestamp) AS max_boundary,
        MIN(col_timestamp) AS min_boundary
    FROM left_table
),

-- Restrict the scan to the inferred bounds, with a TTL-style lower margin.
subquery AS (
    SELECT *
    FROM left_table
    WHERE left_table.col_timestamp <= (SELECT max_boundary FROM timestamp_bounds)
    AND left_table.col_timestamp >= TIMESTAMP_SUB((SELECT min_boundary FROM timestamp_bounds), INTERVAL 3000 second)
),

compute_something AS (
    SELECT
        col_agg,
        SUM(col_value) AS col_value_tot
    FROM subquery 
    GROUP BY 1
)

SELECT *
FROM subquery
LEFT JOIN compute_something USING (col_agg)

So it sounds like the more optimized, long-term approach would be to just fetch the boundaries locally and avoid adding complexity to the current BQ template, especially since the bounds might be computed again and again because of the "for loop" happening over each FeatureView:

query = """
SELECT
  MIN(timestamp_col) AS min_timestamp,
  MAX(timestamp_col) AS max_timestamp
FROM left_table_query_string
"""
boundary_df = bigquery_client.query(query=query).to_dataframe()
min_timestamp = boundary_df.loc[0, "min_timestamp"]
max_timestamp = boundary_df.loc[0, "max_timestamp"]

Note: We could potentially get the min/max info of the pandas dataframe without performing another SQL query

(Review thread on sdk/python/feast/infra/offline_stores/bigquery.py: outdated, resolved)
@woop (Member) commented on Jun 28, 2021

So it sounds like the more optimized, long-term approach would be to just fetch the boundaries locally and avoid adding complexity to the current BQ template, especially since the bounds might be computed again and again because of the "for loop" happening over each FeatureView

I agree that we can further optimize the query by determining the min/max in another query.

query = """
SELECT
  MIN(timestamp_col) AS min_timestamp,
  MAX(timestamp_col) AS max_timestamp
FROM left_table_query_string
"""
boundary_df = bigquery_client.query(query=query).to_dataframe()
min_timestamp = boundary_df.loc[0, "min_timestamp"]
max_timestamp = boundary_df.loc[0, "max_timestamp"]

lgtm

Note: We could potentially get the min/max info of the pandas dataframe without performing another SQL query

This PR originally scanned the local dataframe, but I'm not sure that will hold up for large datasets. I'd want to move as much compute into the DWH as possible.

@MattDelac (Collaborator) commented:

This PR originally scanned the local dataframe, but I'm not sure that will hold up for large datasets. I'd want to move as much compute into the DWH as possible.

Oh yes, it would only work when you pass a pandas dataframe as the left entity_df (instead of a query), so it does not solve all use cases.
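
A minimal sketch of the branching this implies (an editor's illustration; the helper name and client wiring are assumptions): when entity_df is a DataFrame the bounds can be computed locally, but when it is a query string they must be computed in BigQuery.

from typing import Tuple, Union
import pandas as pd
from google.cloud import bigquery

def get_timestamp_bounds(
    entity_df: Union[pd.DataFrame, str],
    ts_col: str,
    client: bigquery.Client,
) -> Tuple:
    if isinstance(entity_df, pd.DataFrame):
        # Local case: the bounds are a cheap pandas reduction.
        timestamps = pd.to_datetime(entity_df[ts_col], utc=True)
        return timestamps.min(), timestamps.max()
    # Query case: entity_df is a SQL string, so compute the bounds in the DWH.
    bounds = client.query(
        f"SELECT MIN({ts_col}) AS min_ts, MAX({ts_col}) AS max_ts FROM ({entity_df})"
    ).to_dataframe()
    return bounds.loc[0, "min_ts"], bounds.loc[0, "max_ts"]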

@achals (Member) left a comment:


lgtm pending the column naming change

…ep retrieval query simple

Signed-off-by: Mwad22 <51929507+Mwad22@users.noreply.github.com>
@feast-ci-bot added the size/M label and removed the size/S label on Jul 2, 2021
@Mwad22 (Contributor, Author) commented on Jul 2, 2021

Thank you all for your detailed reviews, definitely appreciated! If there are no other changes to be requested, I think this change should be good to go.

@Mwad22 requested a review from tsotnet on July 2, 2021 at 22:49
@woop (Member) commented on Jul 5, 2021

Overall it looks good to me. Just wanted to get your opinion on one last nitpick :)

Otherwise I am happy to ship it.

@MattDelac (Collaborator) left a comment:


An integration test for _get_entity_df_timestamp_bounds() would be lovely

Otherwise it LGTM

@Mwad22 (Contributor, Author) commented on Jul 5, 2021

@woop @MattDelac I've added my thoughts here on why I think subtracting the TTL in the query might be preferable. I'm also open to adding an integration test for _get_entity_df_timestamp_bounds(), but I'm currently unsure where the best place for this test would be. Should I create a new file, test_bigquery.py, or add it to one (or both) of test_historical_retrieval.py and test_online_retrieval.py? Thanks!

@woop (Member) commented on Jul 5, 2021

Ok, from my side you can leave the query as is.

I think _get_entity_df_timestamp_bounds should be a new test. You can use test_historical_retrieval.py and add a standalone function for BigQuery. We will be refactoring the tests over time, so for now it's just to ensure we don't forget about it.
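
A minimal sketch of such a standalone test (an editor's illustration; the exact signature of _get_entity_df_timestamp_bounds is an assumption):

import pandas as pd
# Assumed import path, based on the module discussed in this PR.
from feast.infra.offline_stores.bigquery import _get_entity_df_timestamp_bounds

def test_get_entity_df_timestamp_bounds():
    entity_df = pd.DataFrame(
        {
            "driver_id": [1001, 1002],
            "event_timestamp": pd.to_datetime(
                ["2021-04-12 08:12:10", "2021-04-12 16:40:26"], utc=True
            ),
        }
    )
    # Assumed signature: (entity_df, timestamp_column) -> (min_ts, max_ts).
    min_ts, max_ts = _get_entity_df_timestamp_bounds(entity_df, "event_timestamp")
    assert min_ts == entity_df["event_timestamp"].min()
    assert max_ts == entity_df["event_timestamp"].max()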

Signed-off-by: Mwad22 <51929507+Mwad22@users.noreply.github.com>
Signed-off-by: Mwad22 <51929507+Mwad22@users.noreply.github.com>
…ical retrieval test

Signed-off-by: Mwad22 <51929507+Mwad22@users.noreply.github.com>
…/max timestamp inference

Signed-off-by: Mwad22 <51929507+Mwad22@users.noreply.github.com>
Signed-off-by: Mwad22 <51929507+Mwad22@users.noreply.github.com>
@Mwad22 requested reviews from woop and achals on July 7, 2021 at 23:10
Signed-off-by: Mwad22 <51929507+Mwad22@users.noreply.github.com>
@mavysavydav (Collaborator) commented:

Looking great

@feast-ci-bot (Collaborator) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MattDelac, Mwad22, woop

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@woop (Member) commented on Jul 8, 2021

/lgtm

@feast-ci-bot merged commit b3c0cce into feast-dev:master on Jul 8, 2021