Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade AWX 19.2.2 -> 19.5.1 - UI Job Output Causing Postgres DB to hit 100% CPU - AWX Becomes Unresponsive #11647

Closed
3 tasks done
MrBones757 opened this issue Jan 31, 2022 · 17 comments

Comments

@MrBones757
Copy link

MrBones757 commented Jan 31, 2022

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.

Summary

Following the upgrade to AWX 19.5.1 from 19.2.2, streaming job logs or scrolling though job logs either while the job is running or after the job has finished causes Postgres DB CPU Usage to reach 100% on 4 CPU cores.
Streaming job output via API (cli) does not yield the same issue.
The issue only seems to occur when the job output is longer than what fits on the initial output window size.

REVERTING TO 19.5.0 fixes the issue - the issue is with 19.5.1

AWX version

19.5.1

Installation method

kubernetes

Modifications

yes

Ansible version

EEs W/ 2.9.27 & 2.11.7

Operating system

Ubuntu as k8s Hosts

Web browser

Chrome

Steps to reproduce

  1. Upgrade an instance from 19.2.2 - 19.5.1 (optional?)
  2. Run a job where the job output exceeds one loaded page in the UI
  3. observe results not load (taking upto minutes to render)
    observe postgres node cpu usage spiking extremely high while the load is occuring, coming from postgres processes (see screenshot)

Expected results

Fast STDout rendering, with low CPU overhead as in the API and previous versions.

Actual results

Extreme CPU usage on DB node.
Many select processes showing when using top / htop on the physical host that has the postgres DB on it.
CPU Screenshot
We were also able to observe a large number of uncaught promises when scrolling in the job output UI, and the operation its self is causing a 500 internal server error, which is causing the uncaught promise error.
DevTools

Manually navigating to one of the affected URLS:
https://awx.my.cool.company/api/v2/job_events/15472448/children/?counter__gt=4&order_by=counter&page_size=1
returns an X-API-Time of 2.485 Seconds This rough number seems consistent between many refreshes and many clients, different jobs and job events experience the same issue, including freshly run jobs following the upgrade.
Other APIs do not seem to be affected - listing users, for example, took 0.15 seconds
Listing job templates (>1000 items) took 0.597 seconds.

Partial job logs will also load sometimes, i.e we will see line 380 load, and line 56 load and no line inbetween or after, untill CPU usage settles down and it eventually all loads, again casuing other clients to experience perf issues.

Additional information

We run Custom EEs for job execution, however AWX Containers, DB containers and Control Plane EEs are standard.
Running AWX Operator 0.16.0
External Postgres DB Hosted on k8s v12.5.0
max_locks_per_transaction on postgres set to 512

Decreasing this value causes errors with shared memory exhaustion, increasing the value does not seem to provide any noticeable difference, that is, same performance issue.

We have seen this across two different deployments both upgraded following the same path
Two different physical host sets, same version of deployables

Both k8s versions at 1.16.15

Same issue occurs in brave browser, chrome, new edge

Additional Notes:
Reducing the number of awx nodes from 3 to 1 reduces the number of incoming select IPs visible in htop, however the CPU issue is not resolved.
For best results try and use a job with > 250 lines of output (1000+ does a really good job of breaking it)

@MrBones757
Copy link
Author

Possibly related but this seems like a more extreme issue as it actively breaks the usage of the product:
#11629

@sooslaca
Copy link

sooslaca commented Feb 1, 2022

+1. same issue with 19.5.0 -> 19.5.1

@dmagyar
Copy link

dmagyar commented Feb 1, 2022

I have the same issue after upgrading to 19.5.1 :( When can we expect a fix?

@mtannertdev
Copy link

mtannertdev commented Feb 1, 2022

I'm seeing this issue too with 19.5.1. I had to downgrade to make it work because 19.5.1 is unusable.

@chofstede
Copy link

I'm having the same issue with 19.5.1 on Kubernetes version v1.21.8

@sooslaca
Copy link

I'm wondering if anybody tested to see if the same BUG presents in 20.0.0 ?

@mick1627
Copy link

Same issue in 20.0.0

In the following graph you can see the CPU usage of the postgres rds database :

image

At 11:40 one job template launch in 20.0.0
At 12:50 same job template launch in 19.5.0

@kurokobo
Copy link
Contributor

kurokobo commented Mar 7, 2022

As mentioned in several other issues (#11765, #11818), the [WARNING] message in the playbook output may be the trigger of this issue. I've not digged into the code, but here is a minimal playbook to reproduce this issue for testing:

Playbook

---
- hosts: localhost
  gather_facts: no

  tasks:
    - ansible.builtin.debug:
        msg: "An example message before warning: 1"
    - ansible.builtin.debug:
        msg: "An example message before warning: 2"
    - ansible.builtin.debug:
        msg: "An example message before warning: 3"

    - name: Force to warn by using unsupported socks proxy
      ansible.builtin.shell: curl --socks5 localhost:9000 http://www.ansible.com
      args:
        warn: true
      changed_when: no
      failed_when: no

    - ansible.builtin.debug:
        msg: "An example message before warning: 4"
    - ansible.builtin.debug:
        msg: "An example message before warning: 5"
    - ansible.builtin.debug:
        msg: "An example message before warning: 6"

Example output in ansible-playbook

image

Example output in AWX 20.0.0

issue

As shown in this animation, the page with this issue kept creating new requests to following URLs forever. This may cause high PSQL load.

  • /api/v2/jobs/*/job_events/?order_by=counter&page=*&page_size=*
  • /api/v2/jobs/X/job_events/?counter__gt=*&counter__lt=*&order_by=counter&page_size=*
  • /api/v2/jobs/*/job_events/?uuid=*

In my environment, there appears to be some patterns;

@sooslaca
Copy link

sooslaca commented Mar 9, 2022

I'm afraid to ask, but I'm asking anyway: anyone tried if same presents in 20.0.1 ?

@kurokobo
Copy link
Contributor

kurokobo commented Mar 9, 2022

@sooslaca
I've confirmed that this issue still exists in 20.0.1.

@kurokobo
Copy link
Contributor

kurokobo commented Mar 9, 2022

@keithjgrant
I'm tagging you because you seemed to have expertise about this area according to the PR #11312.
It's a bit difficult for me to dig into the code because I'm not familiar with React yet, but do you have any ideas on this issue?

@chris93111
Copy link
Contributor

i don't know if is realy a problem with Warning but with ?not__stdout="" the output is also good

@keithjgrant keithjgrant self-assigned this Mar 11, 2022
@keithjgrant
Copy link
Member

@kurokobo Yeah, I'm looking into this one. At this point I think this has the same root cause as #11765... it looks like you saw my comment there

@keithjgrant
Copy link
Member

In the following graph you can see the CPU usage of the postgres rds database

@mick1627 Is this with the UI job output page open in browser, or not? It sounds like there may be a couple different issues at play in the comments in this discussion

@mick1627
Copy link

In the following graph you can see the CPU usage of the postgres rds database

@mick1627 Is this with the UI job output page open in browser, or not? It sounds like there may be a couple different issues at play in the comments in this discussion

Yes, exactly, it's with the UI job output page open in browser.

@infra-monkey
Copy link

I confirm that when the output of the job fails to render, the cpu on my db goes crazy.
If I navigate to another page (ie: jobs) cpu comes back to normal on the db. If I go back to the job output, cpu ramps up

@keithjgrant
Copy link
Member

closing as duplicate of #11765

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests