Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change relationships schema test to use LOJ instead of not in #799

Closed
wants to merge 1 commit into from

Conversation

ghost
Copy link

@ghost ghost commented Jun 18, 2018

It's more efficient to do the join than to do the not in on Oracle (and thus I presume on other row storage DBs), and it's more efficient in BigQuery as well, taking 72% less slot time (computational efficiency) as not in (10.398s vs 37.491s) and shuffling 97% less data (I/O efficiency) (5.76MB vs 186.04MB) in a small test case. On a larger dataset that I use for candidate SQL exercises (compares 1MM record table to 100MM record table), the revised logic uses 2m56s of slot time vs 15m11s, and shuffles 71MB of data vs 186MB.

@drewbanin
Copy link
Contributor

Thanks @rsmichaeldunn. I just had a chat about this with @jthandy this morning. We found that the explain plans were identical in Redshift, but it looks to me like the where ... in code is faster than the equivalent left outer join on Redshift??

Can you point me to any docs/articles for BQ that would explain the performance gain for a join over a subquery? Or, your intuition is totally welcomed here too. My Redshift test was by no means rigorous, but I want to avoid a scenario where someone comes along 3 months from now with a PR to change this back :)

We have a way to implement this test differently on a per-adapter basis. We can totally do that, but I want to make sure we have good reason to before we go down that road

@ghost
Copy link
Author

ghost commented Jun 19, 2018

@drewbanin For columnar DB's, for anything but really large implementations, it probably won't make too much difference. I haven't seen any documentation for BQ with this kind of specificity - just the tests that I ran. The larger test I ran uses a table containing 2MM random hashes, each of which has between 1 and 100 timestamps, and each resulting row has either "entrance" or "exit" randomly assigned to it. (This is for a SQL assessment where I ask candidates to tell me who is currently in a secured space given an audit log of entrances and exits). There is a second table containing the correct list of (~1MM) hashes that should be returned by the candidate's query. So I used this as a test case for the relationship schema test, expecting that all ids in the correct answer table exist in the test data table.

The LOJ method looks like this:

select count(*)
from (
    select
        t1.person_id
    from s_michael.candidate_test_correct_result t1
    left outer join s_michael.candidate_test_data t2
      on  t1.person_id = t2.person_id
    where t1.person_id is not null
      and	t2.person_id is null
) validation_errors

and has this execution plan:
image

S00: Input

Action Details
READ $1:person_id
FROM s_michael.candidate_test_data
WRITE $1
TO __stage00_output
BY HASH($1)

S01: Input

Action Details
READ $10:person_id
FROM s_michael.candidate_test_correct_result
WHERE not(is_null($10))
WRITE $10
TO __stage01_output
BY HASH($10)

S02: Join+

Action Details
READ $10
FROM __stage01_output
READ $1
FROM __stage00_output
AGGREGATE $30 := COUNT_STAR()
FILTER is_null($40)
JOIN LEFT OUTER HASH JOIN EACH WITH EACH ON $10 = $1
WRITE $30
TO __stage02_output

S03: Output

Action Details
READ $30
FROM __stage02_output
AGGREGATE $20 := SUM_OF_COUNTS($30)
WRITE $20
TO __stage03_output

The not in method looks like this:

select count(*)
from (
    select
        person_id as id
    from s_michael.candidate_test_correct_result
    where person_id is not null
      and person_id not in (select person_id from s_michael.candidate_test_data)
) validation_errors

and has this execution plan:
image

S00: Input

Action Details
READ $1:person_id
FROM s_michael.candidate_test_data
AGGREGATE $70 := SUM($90)
COMPUTE $90 := if(is_null($1), 1, 0)
WRITE $70
TO __stage00_output

S01: Input

Action Details
READ $10:person_id
FROM s_michael.candidate_test_data
COMPUTE $120 := 1
WRITE $10, $120
TO __stage01_output
BY HASH($10)

S02: Aggregate

Action Details
READ $70
FROM __stage00_output
AGGREGATE $50 := SUM($70)
WRITE $50
TO __stage02_output

S03: Input

Action Details
READ $20:person_id
FROM s_michael.candidate_test_correct_result
WHERE not(is_null($20))
COMPUTE $110 := ()
WRITE $20, $110
TO __stage03_output
BY HASH($20)

S04: Join+

Action Details
READ $20, $110
FROM __stage03_output
READ $10, $120
FROM __stage01_output
AGGREGATE GROUP BY $140 := $131
$80 := ANY_VALUE($100)
$81 := ANY_VALUE($130)
COMPUTE $100 := not(is_null($132))
JOIN LEFT OUTER HASH JOIN EACH WITH EACH ON $20 = $10
WRITE $80, $81, $140
TO __stage04_output
BY HASH($140)

S05: Aggregate+

Action Details
READ $50
FROM __stage02_output
READ $80, $81, $140
FROM __stage04_output
AGGREGATE $40 := COUNT_STAR()
FILTER not(if($60, 1, if(or(is_null($50), and(equal($50, 0), not(is_null($61)))), 0, NULL)))
AGGREGATE GROUP BY $150 := $140
$60 := ANY_VALUE($80)
$61 := ANY_VALUE($81)
WRITE $40
TO __stage05_output

S06: Output

Action Details
READ $40
FROM __stage05_output
AGGREGATE $30 := SUM_OF_COUNTS($40)
WRITE $30
TO __stage06_output

So it certainly appears to require more steps in BQ to do the not in method. If it was just BQ, then I'd say it's probably not worth going to the trouble of adding adapter-specific schema test methods, but what about Postgres? A few years ago on Oracle, moving from not in to LOJ was a significant performance boost, so I would imagine it would be similar for other non-columnar databases.

@drewbanin
Copy link
Contributor

Thanks for the writeup! I'll do some similar benchmarking for Postgres / Redshift / Snowflake and report back here

@drewbanin
Copy link
Contributor

@rsmichaeldunn we ended up adding this in #921, incorporating much of your original code! Going to close this PR, but thanks very much for the original implementation and prose on performance. Keep an eye out for a credit in the release notes!

@drewbanin drewbanin closed this Aug 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant