Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid skewed join between entity_df & feature views #1712

Merged
merged 1 commit into from
Jul 19, 2021

Conversation

MattDelac
Copy link
Collaborator

@MattDelac MattDelac commented Jul 14, 2021

What this PR does / why we need it:

The problem is that if we ask for historical features coming from multiple entities that have a 1:many relationship between them, then we encounter skewed join

That's basically how the JOIN is performed with the current template
Imagine that driver_id=1 contains millions of rides
image

And that's what this PR is proposing
image

Let's have a look about the statistics of the SQL template on our use case (4 FeatureViews, 2 entities, entity_dataframe containing 100M rows)

SQL template currently in production
I cancelled the query as it was still running after 25min

9:14 AM Query has been running for 25 min

SQL template of this PR
Elapsed time 2 min 34 sec
Slot time consumed 1 day 21 hr
Bytes shuffled 3.27 TB
Bytes spilled to disk 0 B

Note: On our full entity_dataframe (3B rows) the current SQL template was still running after 45 min while the SQL template of this PR finished after 15min

Which issue(s) this PR fixes:

Fixes None

Does this PR introduce a user-facing change?:

NONE

@feast-ci-bot
Copy link
Collaborator

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@codecov-commenter
Copy link

codecov-commenter commented Jul 14, 2021

Codecov Report

Merging #1712 (c2e08d3) into master (703c4be) will increase coverage by 1.14%.
The diff coverage is 66.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1712      +/-   ##
==========================================
+ Coverage   83.32%   84.47%   +1.14%     
==========================================
  Files          76       79       +3     
  Lines        6794     7071     +277     
==========================================
+ Hits         5661     5973     +312     
+ Misses       1133     1098      -35     
Flag Coverage Δ
integrationtests 84.40% <66.66%> (+1.15%) ⬆️
unittests 69.45% <11.11%> (-0.34%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
sdk/python/feast/infra/offline_stores/bigquery.py 80.60% <66.66%> (+4.45%) ⬆️
sdk/python/feast/entity.py 88.28% <0.00%> (-8.11%) ⬇️
sdk/python/feast/infra/offline_stores/file.py 93.49% <0.00%> (-3.54%) ⬇️
sdk/python/feast/feature_view.py 84.25% <0.00%> (-0.88%) ⬇️
sdk/python/feast/registry.py 80.82% <0.00%> (-0.48%) ⬇️
sdk/python/feast/repo_operations.py 31.06% <0.00%> (-0.31%) ⬇️
sdk/python/tests/test_historical_retrieval.py 99.09% <0.00%> (-0.01%) ⬇️
...dk/python/tensorflow_metadata/proto/v0/path_pb2.py 100.00% <0.00%> (ø)
.../python/tensorflow_metadata/proto/v0/schema_pb2.py 100.00% <0.00%> (ø)
...hon/tensorflow_metadata/proto/v0/statistics_pb2.py 100.00% <0.00%> (ø)
... and 16 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 703c4be...c2e08d3. Read the comment docs.

@woop
Copy link
Member

woop commented Jul 14, 2021

What isn't clear to me is what the before and after of this test is. What is the problem that we are seeing and how do we know we are solving it? I realize it has to do with one-to-many relationships. Can we add a test that uses a 1:many relationship and shows how this test actually fixes the response? Or could we just extend our existing historical retrieval to have one of these relationships?

@MattDelac
Copy link
Collaborator Author

MattDelac commented Jul 15, 2021

What is the problem that we are seeing and how do we know we are solving it?

That's an optimization problem. So beside running some benchmark on my side and prove you that this new template is better to scale, I don't have an idea of a good unit test for it.

Can we add a test that uses a 1:many relationship and shows how this test actually fixes the response? Or could we just extend our existing historical retrieval to have one of these relationships?

As I was saying, this is an optimization problem. We can extend the current test if we think that the coverage is not enough. A dedicated test for that does not seem like a good option IMO

Also I am going to spend time in benchmarking the 2 templates on our use case and will publish as many detail as possible in this PR

@woop
Copy link
Member

woop commented Jul 16, 2021

What is the problem that we are seeing and how do we know we are solving it?

That's an optimization problem. So beside running some benchmark on my side and prove you that this new template is better to scale, I don't have an idea of a good unit test for it.

Can we add a test that uses a 1:many relationship and shows how this test actually fixes the response? Or could we just extend our existing historical retrieval to have one of these relationships?

As I was saying, this is an optimization problem. We can extend the current test if we think that the coverage is not enough. A dedicated test for that does not seem like a good option IMO

Also I am going to spend time in benchmarking the 2 templates on our use case and will publish as many detail as possible in this PR

Thanks. As long as it's purely an optimization change then I don't see a need for a new test. Let me know when/if you feel comfortable merging after your analysis.

Signed-off-by: Matt Delacour <matt.delacour@shopify.com>
@MattDelac MattDelac marked this pull request as ready for review July 19, 2021 13:50
@MattDelac MattDelac requested review from achals, tsotnet, woop and a team as code owners July 19, 2021 13:50
@MattDelac MattDelac changed the title Attempt to optimize potential skewed join Optimize potential skewed join between entity_df & feature views Jul 19, 2021
@MattDelac MattDelac changed the title Optimize potential skewed join between entity_df & feature views Avoid skewed join between entity_df & feature views Jul 19, 2021
@woop
Copy link
Member

woop commented Jul 19, 2021

/lgtm

@feast-ci-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MattDelac, woop

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@feast-ci-bot feast-ci-bot merged commit 8cfe914 into feast-dev:master Jul 19, 2021
@MattDelac MattDelac deleted the optimize_sql_template branch July 19, 2021 18:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants