-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrade AWX 19.2.2 -> 19.5.1 - UI Job Output Causing Postgres DB to hit 100% CPU - AWX Becomes Unresponsive #11647
Comments
Possibly related but this seems like a more extreme issue as it actively breaks the usage of the product: |
+1. same issue with 19.5.0 -> 19.5.1 |
I have the same issue after upgrading to 19.5.1 :( When can we expect a fix? |
I'm seeing this issue too with 19.5.1. I had to downgrade to make it work because 19.5.1 is unusable. |
I'm having the same issue with 19.5.1 on Kubernetes version v1.21.8 |
I'm wondering if anybody tested to see if the same BUG presents in 20.0.0 ? |
As mentioned in several other issues (#11765, #11818), the Playbook---
- hosts: localhost
gather_facts: no
tasks:
- ansible.builtin.debug:
msg: "An example message before warning: 1"
- ansible.builtin.debug:
msg: "An example message before warning: 2"
- ansible.builtin.debug:
msg: "An example message before warning: 3"
- name: Force to warn by using unsupported socks proxy
ansible.builtin.shell: curl --socks5 localhost:9000 http://www.ansible.com
args:
warn: true
changed_when: no
failed_when: no
- ansible.builtin.debug:
msg: "An example message before warning: 4"
- ansible.builtin.debug:
msg: "An example message before warning: 5"
- ansible.builtin.debug:
msg: "An example message before warning: 6" Example output in
|
I'm afraid to ask, but I'm asking anyway: anyone tried if same presents in 20.0.1 ? |
@sooslaca |
@keithjgrant |
i don't know if is realy a problem with Warning but with ?not__stdout="" the output is also good |
@mick1627 Is this with the UI job output page open in browser, or not? It sounds like there may be a couple different issues at play in the comments in this discussion |
Yes, exactly, it's with the UI job output page open in browser. |
I confirm that when the output of the job fails to render, the cpu on my db goes crazy. |
closing as duplicate of #11765 |
Please confirm the following
Summary
Following the upgrade to AWX 19.5.1 from 19.2.2, streaming job logs or scrolling though job logs either while the job is running or after the job has finished causes Postgres DB CPU Usage to reach 100% on 4 CPU cores.
Streaming job output via API (cli) does not yield the same issue.
The issue only seems to occur when the job output is longer than what fits on the initial output window size.
REVERTING TO 19.5.0 fixes the issue - the issue is with 19.5.1
AWX version
19.5.1
Installation method
kubernetes
Modifications
yes
Ansible version
EEs W/ 2.9.27 & 2.11.7
Operating system
Ubuntu as k8s Hosts
Web browser
Chrome
Steps to reproduce
observe postgres node cpu usage spiking extremely high while the load is occuring, coming from postgres processes (see screenshot)
Expected results
Fast STDout rendering, with low CPU overhead as in the API and previous versions.
Actual results
Extreme CPU usage on DB node.
Many select processes showing when using top / htop on the physical host that has the postgres DB on it.
We were also able to observe a large number of uncaught promises when scrolling in the job output UI, and the operation its self is causing a 500 internal server error, which is causing the uncaught promise error.
Manually navigating to one of the affected URLS:
https://awx.my.cool.company/api/v2/job_events/15472448/children/?counter__gt=4&order_by=counter&page_size=1
returns an X-API-Time of 2.485 Seconds This rough number seems consistent between many refreshes and many clients, different jobs and job events experience the same issue, including freshly run jobs following the upgrade.
Other APIs do not seem to be affected - listing users, for example, took 0.15 seconds
Listing job templates (>1000 items) took 0.597 seconds.
Partial job logs will also load sometimes, i.e we will see line 380 load, and line 56 load and no line inbetween or after, untill CPU usage settles down and it eventually all loads, again casuing other clients to experience perf issues.
Additional information
We run Custom EEs for job execution, however AWX Containers, DB containers and Control Plane EEs are standard.
Running AWX Operator 0.16.0
External Postgres DB Hosted on k8s v12.5.0
max_locks_per_transaction on postgres set to 512
Decreasing this value causes errors with shared memory exhaustion, increasing the value does not seem to provide any noticeable difference, that is, same performance issue.
We have seen this across two different deployments both upgraded following the same path
Two different physical host sets, same version of deployables
Both k8s versions at 1.16.15
Same issue occurs in brave browser, chrome, new edge
Additional Notes:
Reducing the number of awx nodes from 3 to 1 reduces the number of incoming select IPs visible in htop, however the CPU issue is not resolved.
For best results try and use a job with > 250 lines of output (1000+ does a really good job of breaking it)
The text was updated successfully, but these errors were encountered: