Avoid skewed join between entity_df & feature views #1712
Skipping CI for Draft Pull Request.
7812c84 to 4ed6e5d
Codecov Report
@@            Coverage Diff             @@
##           master    #1712     +/-    ##
==========================================
+ Coverage   83.32%   84.47%   +1.14%
==========================================
  Files          76       79       +3
  Lines        6794     7071     +277
==========================================
+ Hits         5661     5973     +312
+ Misses       1133     1098      -35
What isn't clear to me is what the before and after of this test is. What is the problem that we are seeing and how do we know we are solving it? I realize it has to do with one-to-many relationships. Can we add a test that uses a 1:many relationship and shows how this test actually fixes the response? Or could we just extend our existing historical retrieval to have one of these relationships?
That's an optimization problem. So besides running some benchmarks on my side to prove that this new template scales better, I don't have a good idea for a unit test.
As I was saying, this is an optimization problem. We can extend the current test if we think the coverage is not enough, but a dedicated test does not seem like a good option IMO. Also, I am going to spend time benchmarking the two templates on our use case and will publish as many details as possible in this PR.
Thanks. As long as it's purely an optimization change then I don't see a need for a new test. Let me know when/if you feel comfortable merging after your analysis.
Signed-off-by: Matt Delacour <matt.delacour@shopify.com>
4ed6e5d to c2e08d3
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: MattDelac, woop
What this PR does / why we need it:
The problem is that when we request historical features for multiple entities that have a 1:many relationship between them, we end up with a skewed join.
That's basically how the JOIN is performed with the current template
Imagine that driver_id=1 contains millions of rides
And that's what this PR is proposing
Let's have a look at the statistics of the two SQL templates on our use case (4 FeatureViews, 2 entities, entity_dataframe containing 100M rows)
SQL template currently in production
I cancelled the query as it was still running after 25 min
SQL template of this PR
Elapsed time: 2 min 34 sec
Slot time consumed: 1 day 21 hr
Bytes shuffled: 3.27 TB
Bytes spilled to disk: 0 B
Note: On our full entity_dataframe (3B rows), the current SQL template was still running after 45 min, while the SQL template of this PR finished after 15 min
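To make the skew concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not the SQL template from this PR. The table names (entity_df, driver_features), the columns, and the toy data are all hypothetical, and the "deduplicate the join keys first" mitigation is an assumption about one common way to avoid this kind of blow-up. It only helps when many entity rows share the same (driver_id, event_ts) pair, which is what the toy data assumes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity_df (ride_id INTEGER, driver_id INTEGER, event_ts INTEGER);
    CREATE TABLE driver_features (driver_id INTEGER, feature_ts INTEGER, rating REAL);
""")

# Hypothetical skewed data: driver 1 has 1000 rides, drivers 2..5 have one each.
# All rides share event_ts = 100 (an assumption that makes deduplication useful).
rides = [(i, 1, 100) for i in range(1000)]
rides += [(1000 + d, d, 100) for d in range(2, 6)]
conn.executemany("INSERT INTO entity_df VALUES (?, ?, ?)", rides)

# 50 historical feature rows per driver, all older than the rides.
features = [(d, ts, d + ts / 100.0) for d in range(1, 6) for ts in range(1, 51)]
conn.executemany("INSERT INTO driver_features VALUES (?, ?, ?)", features)

# Naive point-in-time join: every ride row meets every older feature row.
naive_rows = conn.execute("""
    SELECT COUNT(*)
    FROM entity_df e
    JOIN driver_features f
      ON f.driver_id = e.driver_id AND f.feature_ts <= e.event_ts
""").fetchone()[0]

# Dedup-first join: only distinct (driver_id, event_ts) pairs meet the features.
dedup_rows = conn.execute("""
    WITH keys AS (SELECT DISTINCT driver_id, event_ts FROM entity_df)
    SELECT COUNT(*)
    FROM keys k
    JOIN driver_features f
      ON f.driver_id = k.driver_id AND f.feature_ts <= k.event_ts
""").fetchone()[0]

# Final result via the naive path: latest feature row per ride.
naive_result = conn.execute("""
    WITH latest AS (
        SELECT e.ride_id, e.driver_id, MAX(f.feature_ts) AS feature_ts
        FROM entity_df e
        JOIN driver_features f
          ON f.driver_id = e.driver_id AND f.feature_ts <= e.event_ts
        GROUP BY e.ride_id, e.driver_id
    )
    SELECT l.ride_id, f.rating
    FROM latest l
    JOIN driver_features f
      ON f.driver_id = l.driver_id AND f.feature_ts = l.feature_ts
    ORDER BY l.ride_id
""").fetchall()

# Final result via the dedup path: latest feature row per key, joined back to rides.
dedup_result = conn.execute("""
    WITH keys AS (SELECT DISTINCT driver_id, event_ts FROM entity_df),
    latest AS (
        SELECT k.driver_id, k.event_ts, MAX(f.feature_ts) AS feature_ts
        FROM keys k
        JOIN driver_features f
          ON f.driver_id = k.driver_id AND f.feature_ts <= k.event_ts
        GROUP BY k.driver_id, k.event_ts
    )
    SELECT e.ride_id, f.rating
    FROM entity_df e
    JOIN latest l
      ON l.driver_id = e.driver_id AND l.event_ts = e.event_ts
    JOIN driver_features f
      ON f.driver_id = l.driver_id AND f.feature_ts = l.feature_ts
    ORDER BY e.ride_id
""").fetchall()

print(naive_rows)   # 50200 intermediate rows (1004 rides x 50 feature rows)
print(dedup_rows)   # 250 intermediate rows (5 distinct keys x 50 feature rows)
print(naive_result == dedup_result)  # True
```

Both paths return the same feature value per ride, but the intermediate join is far smaller once the skewed key is collapsed to its distinct (entity, timestamp) pairs before meeting the feature table.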
Which issue(s) this PR fixes:
Fixes None
Does this PR introduce a user-facing change?: