[BUG] q88 regression between 21.10 and 21.12 #4280

Closed
abellina opened this issue Dec 3, 2021 · 9 comments
Labels
bug (Something isn't working), P0 (Must have for release), performance (A performance related task/issue)

Comments

@abellina
Collaborator

abellina commented Dec 3, 2021

We seem to have lost a little over a second in q88 between 21.10 and 21.12 in the spark2a environment.

The baseline time for 21.10 was 18.86 seconds, and the time for 21.12 is pretty steady between 20 and 21 seconds (the query times below are in milliseconds).

This issue is to track down the cause of the difference: is it something that needs to change in cuDF, is it the plugin, or is it the environment?

"queryTimes" : [ 20969 ],
"queryTimes" : [ 20747 ],
"queryTimes" : [ 20851 ],
"queryTimes" : [ 20996 ],
"queryTimes" : [ 20894 ]
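
As a rough check on the numbers above, here is a minimal Python sketch that averages the five 21.12 runs and compares them with the 21.10 baseline. It assumes the `queryTimes` values are in milliseconds and treats the single 18.86 second figure as the whole baseline; the function and variable names are illustrative, not part of the benchmark tooling.

```python
# Query times (ms) copied from the five 21.12 runs above.
times_21_12_ms = [20969, 20747, 20851, 20996, 20894]

# 21.10 baseline quoted above (18.86 seconds), converted to ms.
# Assumption: this single figure is representative of 21.10.
baseline_21_10_ms = 18860


def summarize(times_ms, baseline_ms):
    """Return (mean, max-min spread, slowdown in ms, slowdown as a fraction)."""
    mean_ms = sum(times_ms) / len(times_ms)
    spread_ms = max(times_ms) - min(times_ms)
    delta_ms = mean_ms - baseline_ms
    return mean_ms, spread_ms, delta_ms, delta_ms / baseline_ms


mean_ms, spread_ms, delta_ms, delta_frac = summarize(times_21_12_ms, baseline_21_10_ms)
print(f"mean={mean_ms:.1f} ms, spread={spread_ms} ms, "
      f"delta={delta_ms:.1f} ms ({delta_frac:.1%})")
```

A small spread relative to the delta is what makes the repeated runs useful as a mini baseline: run-to-run noise alone would not explain the gap.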

@abellina added the bug, ? - Needs Triage, performance, and P0 labels on Dec 3, 2021
@jbrennan333
Contributor

@abellina were those times all using the same properties? Can you give more details on the properties used so I can reproduce?

@abellina
Collaborator Author

abellina commented Dec 3, 2021

> @abellina were those times all using the same properties? Can you give more details on the properties used so I can reproduce?

They are. These are just repeats of the exact same parameters, one after the next. I am trying to establish a mini baseline without changing the jars or the configs. If repeated invocations of the same query, with all else the same, never reach the old value, something is likely amiss.

This is with UCX off, Spark 3.1.1, and decimals set to false.

@jbrennan333
Contributor

Testing this in branch-22.02, I believe a lot of the difference may be due to nvcomp being enabled by default in cuDF. With it turned off:

spark.executorEnv.LIBCUDF_NVCOMP_POLICY=OFF
q88-22.02-nonvcomp/tpcds-testjtb_no_nvcomp-gpu-aqe-on-ucx-off-16-cores-decimals-false-q88-1638657273360.json:  "queryTimes" : [ 19981 ],
q88-22.02-nonvcomp/tpcds-testjtb_no_nvcomp-gpu-aqe-on-ucx-off-16-cores-decimals-false-q88-1638657336190.json:  "queryTimes" : [ 19697 ],
q88-22.02-nonvcomp/tpcds-testjtb_no_nvcomp-gpu-aqe-on-ucx-off-16-cores-decimals-false-q88-1638657416402.json:  "queryTimes" : [ 19876 ],
q88-22.02-nonvcomp/tpcds-testjtb_no_nvcomp-gpu-aqe-on-ucx-off-16-cores-decimals-false-q88-1638657522317.json:  "queryTimes" : [ 19493 ],
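
For anyone reproducing this, the property above can be passed through `spark-submit --conf`. The sketch below only builds the command line; the jar name and the `--query` argument are hypothetical placeholders, and the allowed policy values are limited to the two that appear in this thread (`OFF` and `STABLE`).

```python
import shlex

# Only the policy values mentioned in this thread; cuDF may accept others.
KNOWN_POLICIES = {"OFF", "STABLE"}


def spark_submit_command(nvcomp_policy, app_args):
    """Build a spark-submit argv that sets LIBCUDF_NVCOMP_POLICY on executors."""
    if nvcomp_policy not in KNOWN_POLICIES:
        raise ValueError(f"unexpected nvcomp policy: {nvcomp_policy!r}")
    conf = f"spark.executorEnv.LIBCUDF_NVCOMP_POLICY={nvcomp_policy}"
    return ["spark-submit", "--conf", conf, *app_args]


# "benchmark.jar" and "--query q88" are hypothetical stand-ins for the real job.
cmd = spark_submit_command("OFF", ["benchmark.jar", "--query", "q88"])
print(shlex.join(cmd))
```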

When I run with the default setting of STABLE, I get:

q88-22.02/tpcds-testjtb-gpu-aqe-on-ucx-off-16-cores-decimals-false-q88-1638658848294.json:  "queryTimes" : [ 22046 ],
q88-22.02/tpcds-testjtb-gpu-aqe-on-ucx-off-16-cores-decimals-false-q88-1638658977523.json:  "queryTimes" : [ 21958 ],
q88-22.02/tpcds-testjtb-gpu-aqe-on-ucx-off-16-cores-decimals-false-q88-1638659050383.json:  "queryTimes" : [ 23032 ],
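
To compare the two sets of runs, here is a minimal Python sketch that parses the result lines above and averages the query times per output directory. It assumes each line has the shape `path:  "queryTimes" : [ N ],` as shown, and that the top-level directory name is enough to tell the configurations apart; the helper name is illustrative.

```python
import re
from statistics import mean

# Result lines copied verbatim from the runs above.
RAW = """\
q88-22.02-nonvcomp/tpcds-testjtb_no_nvcomp-gpu-aqe-on-ucx-off-16-cores-decimals-false-q88-1638657273360.json:  "queryTimes" : [ 19981 ],
q88-22.02-nonvcomp/tpcds-testjtb_no_nvcomp-gpu-aqe-on-ucx-off-16-cores-decimals-false-q88-1638657336190.json:  "queryTimes" : [ 19697 ],
q88-22.02-nonvcomp/tpcds-testjtb_no_nvcomp-gpu-aqe-on-ucx-off-16-cores-decimals-false-q88-1638657416402.json:  "queryTimes" : [ 19876 ],
q88-22.02-nonvcomp/tpcds-testjtb_no_nvcomp-gpu-aqe-on-ucx-off-16-cores-decimals-false-q88-1638657522317.json:  "queryTimes" : [ 19493 ],
q88-22.02/tpcds-testjtb-gpu-aqe-on-ucx-off-16-cores-decimals-false-q88-1638658848294.json:  "queryTimes" : [ 22046 ],
q88-22.02/tpcds-testjtb-gpu-aqe-on-ucx-off-16-cores-decimals-false-q88-1638658977523.json:  "queryTimes" : [ 21958 ],
q88-22.02/tpcds-testjtb-gpu-aqe-on-ucx-off-16-cores-decimals-false-q88-1638659050383.json:  "queryTimes" : [ 23032 ],
"""

# Matches the single-element queryTimes array on each line.
PATTERN = re.compile(r'"queryTimes"\s*:\s*\[\s*(\d+)\s*\]')


def mean_by_run_dir(raw):
    """Group query times (ms) by top-level directory and return the mean of each."""
    times = {}
    for line in raw.splitlines():
        match = PATTERN.search(line)
        if match is None:
            continue  # skip anything that is not a result line
        run_dir = line.split("/", 1)[0]
        times.setdefault(run_dir, []).append(int(match.group(1)))
    return {run_dir: mean(values) for run_dir, values in times.items()}


means = mean_by_run_dir(RAW)
for run_dir, value in means.items():
    print(f"{run_dir}: {value:.1f} ms")
print(f"difference: {means['q88-22.02'] - means['q88-22.02-nonvcomp']:.1f} ms")
```

With only three or four runs per configuration this is a rough comparison, not a statistically rigorous one.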

@jbrennan333
Contributor

This is the PR for enabling nvcomp in cudf: rapidsai/cudf#9582

@abellina
Collaborator Author

abellina commented Dec 6, 2021

This bug was specific to 21.12, so we shouldn't see extra slowdowns due to nvcomp being enabled yet. That said, what you are raising is an issue of its own. So it got slower in 21.12, and you are finding it's slower still in 22.02 (so far).

We should figure out what is changing. Is it kernel time for Parquet scans (comparing 21.10 against 21.12 and 22.02)? If so, we may need to file a cuDF bug soon.

@jlowe
Contributor

jlowe commented Dec 6, 2021

> This bug was specific to 21.12, so we shouldn't see extra slowdowns due to nvcomp being enabled yet.

The cudf PR went into 21.12, so nvcomp was enabled by default.

@jbrennan333
Contributor

I believe rapidsai/cudf#9582 was merged into branch-21.12.

@Salonijain27 removed the ? - Needs Triage label on Dec 7, 2021
@jbrennan333
Contributor

@abellina, @jlowe: do we still need this issue? I think the majority of the difference is explained by the change to use nvcomp as the default snappy compressor in 21.12. As noted above, we have other issues tracking possible improvements for this query.

@abellina
Collaborator Author

abellina commented Jan 4, 2022

+1 closing this.

@abellina closed this as completed on Jan 4, 2022