Change relationships schema test to use LOJ instead of not in #799

ghost · 2018-06-18T19:26:05Z

It's more efficient to do the join than to do the not in on Oracle (and thus I presume on other row storage DBs), and it's more efficient in BigQuery as well, taking 72% less slot time (computational efficiency) as not in (10.398s vs 37.491s) and shuffling 97% less data (I/O efficiency) (5.76MB vs 186.04MB) in a small test case. On a larger dataset that I use for candidate SQL exercises (compares 1MM record table to 100MM record table), the revised logic uses 2m56s of slot time vs 15m11s, and shuffles 71MB of data vs 186MB.

…uter join where null

drewbanin · 2018-06-18T19:52:57Z

Thanks @rsmichaeldunn. I just had a chat about this with @jthandy this morning. We found that the explain plans were identical in Redshift, but it looks to me like the where ... in code is faster than the equivalent left outer join on Redshift??

Can you point me to any docs/articles for BQ that would explain the performance gain for a join over a subquery? Or, your intuition is totally welcomed here too. My Redshift test was by no means rigorous, but I want to avoid a scenario where someone comes along 3 months from now with a PR to change this back :)

We have a way to implement this test differently on a per-adapter basis. We can totally do that, but I want to make sure we have good reason to before we go down that road

ghost · 2018-06-19T18:19:47Z

@drewbanin For columnar DB's, for anything but really large implementations, it probably won't make too much difference. I haven't seen any documentation for BQ with this kind of specificity - just the tests that I ran. The larger test I ran uses a table containing 2MM random hashes, each of which has between 1 and 100 timestamps, and each resulting row has either "entrance" or "exit" randomly assigned to it. (This is for a SQL assessment where I ask candidates to tell me who is currently in a secured space given an audit log of entrances and exits). There is a second table containing the correct list of (~1MM) hashes that should be returned by the candidate's query. So I used this as a test case for the relationship schema test, expecting that all ids in the correct answer table exist in the test data table.

The LOJ method looks like this:

select count(*)
from (
    select
        t1.person_id
    from s_michael.candidate_test_correct_result t1
    left outer join s_michael.candidate_test_data t2
      on  t1.person_id = t2.person_id
    where t1.person_id is not null
      and	t2.person_id is null
) validation_errors

and has this execution plan:

S00: Input

Action	Details
READ	$1:person_id FROM s_michael.candidate_test_data
WRITE	$1 TO __stage00_output BY HASH($1)

S01: Input

Action	Details
READ	$10:person_id FROM s_michael.candidate_test_correct_result WHERE not(is_null($10))
WRITE	$10 TO __stage01_output BY HASH($10)

S02: Join+

Action	Details
READ	$10 FROM __stage01_output
READ	$1 FROM __stage00_output
AGGREGATE	$30 := COUNT_STAR()
FILTER	is_null($40)
JOIN	LEFT OUTER HASH JOIN EACH WITH EACH ON $10 = $1
WRITE	$30 TO __stage02_output

S03: Output

Action	Details
READ	$30 FROM __stage02_output
AGGREGATE	$20 := SUM_OF_COUNTS($30)
WRITE	$20 TO __stage03_output

The not in method looks like this:

select count(*)
from (
    select
        person_id as id
    from s_michael.candidate_test_correct_result
    where person_id is not null
      and person_id not in (select person_id from s_michael.candidate_test_data)
) validation_errors

and has this execution plan:

S00: Input

Action	Details
READ	$1:person_id FROM s_michael.candidate_test_data
AGGREGATE	$70 := SUM($90)
COMPUTE	$90 := if(is_null($1), 1, 0)
WRITE	$70 TO __stage00_output

S01: Input

Action	Details
READ	$10:person_id FROM s_michael.candidate_test_data
COMPUTE	$120 := 1
WRITE	$10, $120 TO __stage01_output BY HASH($10)

S02: Aggregate

Action	Details
READ	$70 FROM __stage00_output
AGGREGATE	$50 := SUM($70)
WRITE	$50 TO __stage02_output

S03: Input

Action	Details
READ	$20:person_id FROM s_michael.candidate_test_correct_result WHERE not(is_null($20))
COMPUTE	$110 := ()
WRITE	$20, $110 TO __stage03_output BY HASH($20)

S04: Join+

Action	Details
READ	$20, $110 FROM __stage03_output
READ	$10, $120 FROM __stage01_output
AGGREGATE	GROUP BY $140 := $131 $80 := ANY_VALUE($100) $81 := ANY_VALUE($130)
COMPUTE	$100 := not(is_null($132))
JOIN	LEFT OUTER HASH JOIN EACH WITH EACH ON $20 = $10
WRITE	$80, $81, $140 TO __stage04_output BY HASH($140)

S05: Aggregate+

Action	Details
READ	$50 FROM __stage02_output
READ	$80, $81, $140 FROM __stage04_output
AGGREGATE	$40 := COUNT_STAR()
FILTER	not(if($60, 1, if(or(is_null($50), and(equal($50, 0), not(is_null($61)))), 0, NULL)))
AGGREGATE	GROUP BY $150 := $140 $60 := ANY_VALUE($80) $61 := ANY_VALUE($81)
WRITE	$40 TO __stage05_output

S06: Output

Action	Details
READ	$40 FROM __stage05_output
AGGREGATE	$30 := SUM_OF_COUNTS($40)
WRITE	$30 TO __stage06_output

So it certainly appears to require more steps in BQ to do the not in method. If it was just BQ, then I'd say it's probably not worth going to the trouble of adding adapter-specific schema test methods, but what about Postgres? A few years ago on Oracle, moving from not in to LOJ was a significant performance boost, so I would imagine it would be similar for other non-columnar databases.

drewbanin · 2018-06-20T13:15:58Z

Thanks for the writeup! I'll do some similar benchmarking for Postgres / Redshift / Snowflake and report back here

drewbanin · 2018-08-23T18:22:21Z

@rsmichaeldunn we ended up adding this in #921, incorporating much of your original code! Going to close this PR, but thanks very much for the original implementation and prose on performance. Keep an eye out for a credit in the release notes!

CHange relationships schema test from not in to more efficient left o…

b8bfc2a

…uter join where null

drewbanin self-requested a review June 20, 2018 13:16

drewbanin added the enhancement New feature or request label Jun 20, 2018

MartinLue mentioned this pull request Jul 23, 2018

Rewrite relationships schema test for improved performance #872

Closed

beckjake mentioned this pull request Aug 15, 2018

Relationships tests with nulls (#887) #921

Merged

drewbanin closed this Aug 23, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change relationships schema test to use LOJ instead of not in #799

Change relationships schema test to use LOJ instead of not in #799

ghost commented Jun 18, 2018

drewbanin commented Jun 18, 2018

ghost commented Jun 19, 2018 •

edited by ghost

Loading

drewbanin commented Jun 20, 2018

drewbanin commented Aug 23, 2018

Change relationships schema test to use LOJ instead of not in #799

Change relationships schema test to use LOJ instead of not in #799

Conversation

ghost commented Jun 18, 2018

drewbanin commented Jun 18, 2018

ghost commented Jun 19, 2018 • edited by ghost Loading

S00: Input

S01: Input

S02: Join+

S03: Output

S00: Input

S01: Input

S02: Aggregate

S03: Input

S04: Join+

S05: Aggregate+

S06: Output

drewbanin commented Jun 20, 2018

drewbanin commented Aug 23, 2018

ghost commented Jun 19, 2018 •

edited by ghost

Loading