[testing] kfp-ci cluster related tests flaky #6815
/assign @capri-xiyue @Bobgy
/cc @chensun

Are there other causes for flakiness?
Saw another cause of flakiness in https://oss-prow.knative.dev/view/gs/oss-prow/pr-logs/pull/kubeflow_pipelines/6796/kubeflow-pipelines-samples-v2/1453537694247292928
It happened on the xgboost sample.
For the lock wait timeout issue, it looks like the backend API runs a lot of transactions against MySQL. A deadlock may have occurred, or MySQL may not be well tuned. FYI: https://stackoverflow.com/questions/5836623/getting-lock-wait-timeout-exceeded-try-restarting-transaction-even-though-im
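The failure mode described above (one transaction holding a lock until another writer times out) can be reproduced in miniature. This is an illustrative sketch only: it uses SQLite instead of MySQL, and the table name, column names, and timeout value are made up for the demo, not taken from the KFP schema. MySQL reports this as "Lock wait timeout exceeded; try restarting transaction"; SQLite raises "database is locked".

```python
import os
import sqlite3
import tempfile

# Two connections ("writers") contending for the same database file.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer1 = sqlite3.connect(path, timeout=0.2)
writer2 = sqlite3.connect(path, timeout=0.2)

# Made-up schema for illustration only.
writer1.execute("CREATE TABLE run (id INTEGER PRIMARY KEY, state TEXT)")
writer1.execute("INSERT INTO run VALUES (1, 'RUNNING')")
writer1.commit()

# writer1 opens a write transaction and holds it without committing,
# which is what a long-running backend transaction would do.
writer1.execute("BEGIN IMMEDIATE")
writer1.execute("UPDATE run SET state = 'ARCHIVED' WHERE id = 1")

# writer2 blocks waiting for the write lock and gives up after its timeout.
try:
    writer2.execute("BEGIN IMMEDIATE")
    error = None
except sqlite3.OperationalError as exc:
    error = str(exc)

print(error)  # "database is locked"
writer1.rollback()
```

The fix in either engine is the same shape: keep transactions short, or make the queries inside them fast enough that waiters don't hit the timeout.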
The following comments focus on the lock wait timeout problem, because it shows up most often.
Connect to in-cluster mysql DB via:
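The exact command was not captured above. A typical way to reach an in-cluster MySQL from a workstation is a kubectl port-forward plus a local mysql client; the service name, namespace, and user below are assumptions based on a default KFP install, not taken from this thread.

```shell
# Assumed service/namespace for a default KFP install; adjust for your cluster.
kubectl -n kubeflow port-forward svc/mysql 3306:3306 &

# Then connect with a local client (user is a placeholder; supply
# credentials as configured in your deployment).
mysql -h 127.0.0.1 -P 3306 -u root
```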
Here's current DB size:
Some queries on the run table seem to run for a very long time. I'm trying to use
This query stays in preparing state for more than 1 minute:
Based on https://dba.stackexchange.com/a/121846, it seems our query should not use a SELECT subquery; it should use a JOIN instead for better performance. I verified that an actual query is also slow:
I tried to rewrite this query using JOIN, and it's now much faster:
rewrote query
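The rewritten query itself isn't reproduced above. As an illustrative sketch of the general pattern, the following uses SQLite with a made-up two-table schema (not the KFP metadata schema) to show a SELECT-subquery filter and its JOIN equivalent returning the same rows. On MySQL the JOIN form can be much cheaper, because the optimizer may execute a dependent subquery once per outer row.

```python
import sqlite3

# Made-up schema for illustration only.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE run (id INTEGER PRIMARY KEY, experiment_id INTEGER, name TEXT);
CREATE TABLE experiment (id INTEGER PRIMARY KEY, archived INTEGER);
INSERT INTO experiment VALUES (1, 0), (2, 1);
INSERT INTO run VALUES (10, 1, 'a'), (11, 2, 'b'), (12, 2, 'c');
""")

# Subquery form: select runs whose experiment is archived.
subquery = db.execute("""
    SELECT id FROM run
    WHERE experiment_id IN (SELECT id FROM experiment WHERE archived = 1)
    ORDER BY id
""").fetchall()

# Equivalent JOIN form: same result set, typically faster on MySQL.
join = db.execute("""
    SELECT run.id FROM run
    JOIN experiment ON experiment.id = run.experiment_id
    WHERE experiment.archived = 1
    ORDER BY run.id
""").fetchall()

print(subquery == join, subquery)  # True [(11,), (12,)]
```

Running EXPLAIN on both forms in MySQL is the quickest way to confirm the plan actually changed.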
Found the offending query in the source code; jingzhang actually left a TODO to optimize it :)
Temporarily skipped the archive experiment step in tests to verify whether it helps resolve the flakiness.
https://testgrid.k8s.io/googleoss-kubeflow-pipelines#kubeflow-pipelines-periodic-functional-test
Closing because I think we got the flakiness resolved; please reopen if not.
example error log