Change relationships schema test to use LOJ instead of not in #799
…uter join where null
Thanks @rsmichaeldunn. I just had a chat about this with @jthandy this morning. We found that the explain plans were identical in Redshift, but it looks to me like the Can you point me to any docs/articles for BQ that would explain the performance gain for a join over a subquery? Or, your intuition is totally welcome here too.

My Redshift test was by no means rigorous, but I want to avoid a scenario where someone comes along 3 months from now with a PR to change this back :)

We have a way to implement this test differently on a per-adapter basis. We can totally do that, but I want to make sure we have good reason to before we go down that road.
@drewbanin For columnar DBs, for anything but really large implementations, it probably won't make too much difference. I haven't seen any documentation for BQ with this kind of specificity, just the tests that I ran.

The larger test I ran uses a table containing 2MM random hashes, each of which has between 1 and 100 timestamps, and each resulting row has either "entrance" or "exit" randomly assigned to it. (This is for a SQL assessment where I ask candidates to tell me who is currently in a secured space given an audit log of entrances and exits.) There is a second table containing the correct list of (~1MM) hashes that should be returned by the candidate's query. So I used this as a test case for the relationship schema test, expecting that all ids in the correct answer table exist in the test data table.

The LOJ method looks like this:

    select count(*)
    from (
        select
            t1.person_id
        from s_michael.candidate_test_correct_result t1
        left outer join s_michael.candidate_test_data t2
            on t1.person_id = t2.person_id
        where t1.person_id is not null
          and t2.person_id is null
    ) validation_errors

Steps in BQ:

    S00: Input
    S01: Input
    S02: Join+
    S03: Output

The not in method looks like this:

    select count(*)
    from (
        select
            person_id as id
        from s_michael.candidate_test_correct_result
        where person_id is not null
          and person_id not in (select person_id from s_michael.candidate_test_data)
    ) validation_errors

Steps in BQ:

    S00: Input
    S01: Input
    S02: Aggregate
    S03: Input
    S04: Join+
    S05: Aggregate+
    S06: Output

So it certainly appears to require more steps in BQ to do the not in.
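
For anyone who wants to check that the two query shapes above agree before benchmarking them, here is a minimal sketch that runs both against an in-memory SQLite database. It is an illustration only, not the dbt implementation: the table names `correct_result` and `test_data` and the helper `count_orphans` are hypothetical stand-ins, and SQLite says nothing about BQ or Redshift performance.

```python
import sqlite3

# Left-outer-join form of the relationships check (table names are hypothetical).
LOJ_SQL = """
select count(*)
from (
    select t1.person_id
    from correct_result t1
    left outer join test_data t2
        on t1.person_id = t2.person_id
    where t1.person_id is not null
      and t2.person_id is null
) validation_errors
"""

# "not in" form of the same check.
NOT_IN_SQL = """
select count(*)
from (
    select person_id as id
    from correct_result
    where person_id is not null
      and person_id not in (select person_id from test_data)
) validation_errors
"""


def count_orphans(child_ids, parent_ids):
    """Return (loj_count, not_in_count) for the given toy id lists."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("create table correct_result (person_id text)")
        conn.execute("create table test_data (person_id text)")
        conn.executemany(
            "insert into correct_result values (?)", [(i,) for i in child_ids]
        )
        conn.executemany(
            "insert into test_data values (?)", [(i,) for i in parent_ids]
        )
        loj = conn.execute(LOJ_SQL).fetchone()[0]
        not_in = conn.execute(NOT_IN_SQL).fetchone()[0]
        return loj, not_in
    finally:
        conn.close()


if __name__ == "__main__":
    # "c" has no parent row, so both forms should report one validation error.
    print(count_orphans(["a", "b", "c", None], ["a", "b", "b"]))
    # With a NULL in the parent column the two forms can disagree.
    print(count_orphans(["a", "c"], ["a", None]))
```

One caveat worth knowing when comparing the two: under SQL's three-valued logic, `not in` against a subquery that contains a NULL never evaluates to true, so the `not in` form can report zero errors where the join form reports some. The second call in the sketch exercises that case on SQLite; whether it matters depends on whether the parent column can contain NULLs.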
Thanks for the writeup! I'll do some similar benchmarking for Postgres / Redshift / Snowflake and report back here.

@rsmichaeldunn we ended up adding this in #921, incorporating much of your original code! Going to close this PR, but thanks very much for the original implementation and prose on performance. Keep an eye out for a credit in the release notes!
It's more efficient to do the join than to do the not in on Oracle (and thus, I presume, on other row-storage DBs), and it's more efficient in BigQuery as well, taking 72% less slot time (computational efficiency) than not in (10.398s vs 37.491s) and shuffling 97% less data (I/O efficiency) (5.76MB vs 186.04MB) in a small test case. On a larger dataset that I use for candidate SQL exercises (comparing a 1MM-record table to a 100MM-record table), the revised logic uses 2m56s of slot time vs 15m11s, and shuffles 71MB of data vs 186MB.
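
The percentages quoted above follow directly from the raw figures. A small sketch of the arithmetic, using only the numbers given in the description (the helper names `pct_less` and `to_seconds` are hypothetical):

```python
def pct_less(new, old):
    """Percentage reduction of `new` relative to `old`."""
    return (1 - new / old) * 100


def to_seconds(minutes, seconds):
    """Convert a minutes/seconds duration such as 2m56s to seconds."""
    return minutes * 60 + seconds


# Small test case: slot time in seconds, shuffled data in MB.
small_slot = pct_less(10.398, 37.491)
small_shuffle = pct_less(5.76, 186.04)

# Larger dataset: 2m56s vs 15m11s of slot time, 71MB vs 186MB shuffled.
large_slot = pct_less(to_seconds(2, 56), to_seconds(15, 11))
large_shuffle = pct_less(71, 186)

if __name__ == "__main__":
    for label, value in [
        ("small test, slot time", small_slot),
        ("small test, shuffle", small_shuffle),
        ("large test, slot time", large_slot),
        ("large test, shuffle", large_shuffle),
    ]:
        print(f"{label}: {value:.1f}% less")
```

The small-case results round to the 72% and 97% stated in the description. The description gives no percentages for the larger dataset; by the same formula they come out to roughly 81% less slot time and 62% less shuffled data.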