Fix BQ historical retrieval with rows that got backfilled #1744
Codecov Report
@@ Coverage Diff @@
## master #1744 +/- ##
==========================================
+ Coverage 84.65% 84.73% +0.08%
==========================================
Files 85 85
Lines 6268 6297 +29
==========================================
+ Hits 5306 5336 +30
+ Misses 962 961 -1
Signed-off-by: Matt Delacour <matt.delacour@shopify.com>
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: MattDelac, woop
What this PR does / why we need it:
There is a bug in finding the "latest" values when rows have been backfilled (that is, when created_timestamp and event_timestamp do not follow the same order).
The problem is that the ANY_VALUE strategy does not work here: it takes the first occurrence it sees, which may come from a different row than the one holding the latest event_timestamp.
The JOIN happening later joins the entity_dataframe with the __latest CTE, where created_at and event_timestamp got mixed. Because the JOIN operation happens on USING(driver_id, event_timestamp, created_at), it cannot find a match and returns NULL for the desired feature.
The goal of this PR is to make this step work deterministically by using a window function, which preserves the pairing between event_timestamp and created_at on a given row.
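The failure mode described above can be reproduced on a small scale. The sketch below is not the Feast SQL template: it uses SQLite through Python's sqlite3 module (assuming a build with window-function support, SQLite 3.25 or later), a hypothetical features table, made-up rows, and MAX(created_at) as a stand-in for ANY_VALUE, which SQLite does not provide. The stand-in simply picks a created_at from a different row than the one holding the latest event_timestamp, which is one outcome ANY_VALUE is allowed to produce.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: driver 2 has a backfilled row whose event_timestamp is
# older but whose created_at is newer than the original row.
conn.executescript("""
CREATE TABLE features (
    driver_id INTEGER, event_timestamp TEXT, created_at TEXT, value INTEGER
);
INSERT INTO features VALUES
    (1, '2021-01-02', '2021-01-02', 10),
    (2, '2021-01-02', '2021-01-02', 40),
    (2, '2021-01-01', '2021-01-03', 35);
""")

# Independent aggregates: the two columns may come from different rows.
# MAX(created_at) stands in for ANY_VALUE (an assumption for this sketch).
AGGREGATE_SQL = """
WITH latest AS (
    SELECT driver_id,
           MAX(event_timestamp) AS event_timestamp,
           MAX(created_at) AS created_at
    FROM features
    GROUP BY driver_id
)
SELECT driver_id, value
FROM latest
LEFT JOIN features USING (driver_id, event_timestamp, created_at)
ORDER BY driver_id
"""

# Window function: rank whole rows, so both timestamps stay paired.
WINDOW_SQL = """
WITH ranked AS (
    SELECT driver_id, event_timestamp, created_at,
           ROW_NUMBER() OVER (
               PARTITION BY driver_id
               ORDER BY event_timestamp DESC, created_at DESC
           ) AS rn
    FROM features
),
latest AS (
    SELECT driver_id, event_timestamp, created_at FROM ranked WHERE rn = 1
)
SELECT driver_id, value
FROM latest
LEFT JOIN features USING (driver_id, event_timestamp, created_at)
ORDER BY driver_id
"""

aggregate_result = conn.execute(AGGREGATE_SQL).fetchall()
window_result = conn.execute(WINDOW_SQL).fetchall()
print(aggregate_result)  # driver 2 has no matching row, so its value is None
print(window_result)     # driver 2 resolves to the row with the latest event_timestamp
```

With the aggregate version, driver 2 ends up with a (event_timestamp, created_at) pair that exists in no row, so the join finds nothing. Ranking whole rows avoids this because the pair is always read from a single row.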
Tophat
Metrics (compared to current SQL template)
Elapsed time: 24 min 49 sec (vs 26 min 35 sec)
Slot time consumed: 21 days 6 hr (vs 26 days 0 hr)
Bytes shuffled: 23.11 TB (vs 22.86 TB)
Bytes spilled to disk: 7.87 TB (vs 10.09 TB)
I also confirmed that it fixes the bug we had on a specific feature view
The new test properly catches the problem encountered with the backfill rows (as the feature of the second driver is Null instead of 40)
Which issue(s) this PR fixes:
Fixes #
Does this PR introduce a user-facing change?: