Upgrade AWX 19.2.2 -> 19.5.1 - UI Job Output Causing Postgres DB to hit 100% CPU - AWX Becomes Unresponsive #11647

MrBones757 · 2022-01-31T11:09:39Z

Please confirm the following

I agree to follow this project's code of conduct.
I have checked the current issues for duplicates.
I understand that AWX is open source software provided for free and that I might not receive a timely response.

Summary

Following the upgrade to AWX 19.5.1 from 19.2.2, streaming job logs or scrolling through job logs, either while the job is running or after it has finished, causes Postgres DB CPU usage to reach 100% on 4 CPU cores.
Streaming job output via API (cli) does not yield the same issue.
The issue only seems to occur when the job output is longer than what fits on the initial output window size.

REVERTING TO 19.5.0 fixes the issue - the issue is with 19.5.1

AWX version

19.5.1

Installation method

kubernetes

Modifications

yes

Ansible version

EEs W/ 2.9.27 & 2.11.7

Operating system

Ubuntu as k8s Hosts

Web browser

Chrome

Steps to reproduce

Upgrade an instance from 19.2.2 to 19.5.1 (optional?)
Run a job where the job output exceeds one loaded page in the UI
Observe that results do not load (taking up to minutes to render)
Observe postgres node CPU usage spiking extremely high while the load is occurring, coming from postgres processes (see screenshot)

Expected results

Fast STDout rendering, with low CPU overhead as in the API and previous versions.

Actual results

Extreme CPU usage on DB node.
Many select processes showing when using top / htop on the physical host that has the postgres DB on it.

We were also able to observe a large number of uncaught promises when scrolling in the job output UI. The operation itself returns a 500 internal server error, which causes the uncaught promise error.

Manually navigating to one of the affected URLs:
https://awx.my.cool.company/api/v2/job_events/15472448/children/?counter__gt=4&order_by=counter&page_size=1
returns an X-API-Time of 2.485 seconds. This rough number is consistent across many refreshes and many clients. Different jobs and job events show the same issue, including freshly run jobs following the upgrade.
Other APIs do not seem to be affected - listing users, for example, took 0.15 seconds
Listing job templates (>1000 items) took 0.597 seconds.
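For anyone wanting to repeat the timing comparison above, here is a minimal Python sketch. It assumes the server reports its processing time in an `X-API-Time` response header whose value starts with a number of seconds (for example `2.485s`), and that a bearer token is accepted for authentication. The function names and the token handling are illustrative assumptions, not part of AWX.

```python
import re
import urllib.request


def parse_api_time(header_value):
    """Extract the leading number of seconds from an X-API-Time value."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", header_value or "")
    if match is None:
        raise ValueError(f"unrecognised X-API-Time value: {header_value!r}")
    return float(match.group(1))


def fetch_api_time(url, token):
    """Request `url` and return the server-reported time in seconds.

    Assumption: the API accepts an OAuth2 bearer token in this header.
    """
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request) as response:
        return parse_api_time(response.headers.get("X-API-Time"))


def slowest_first(timings):
    """Sort a {url: seconds} mapping into (url, seconds) pairs, slowest first."""
    return sorted(timings.items(), key=lambda item: item[1], reverse=True)
```

Calling `fetch_api_time` for the job event URL and for a users or job templates listing, then passing the results to `slowest_first`, gives the same kind of side-by-side comparison as the numbers quoted above.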

Partial job logs will also load sometimes, i.e. we will see line 380 load and line 56 load, with no lines in between or after, until CPU usage settles down and it all eventually loads, again causing other clients to experience performance issues.

Additional information

We run Custom EEs for job execution, however AWX Containers, DB containers and Control Plane EEs are standard.
Running AWX Operator 0.16.0
External Postgres DB v12.5.0, hosted on k8s
max_locks_per_transaction on postgres set to 512

Decreasing this value causes errors with shared memory exhaustion, increasing the value does not seem to provide any noticeable difference, that is, same performance issue.

We have seen this across two different deployments both upgraded following the same path
Two different physical host sets, same version of deployables

Both k8s versions at 1.16.15

Same issue occurs in brave browser, chrome, new edge

Additional Notes:
Reducing the number of awx nodes from 3 to 1 reduces the number of incoming select IPs visible in htop, however the CPU issue is not resolved.
For best results try and use a job with > 250 lines of output (1000+ does a really good job of breaking it)
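To produce a job with that much output, one option is to generate a playbook made of many trivial tasks. The following Python sketch builds such a playbook as text. The function name and the message wording are hypothetical, and whether output length alone triggers the problem is not established in this report, so treat it only as a way to get a long job output.

```python
def build_playbook(task_count):
    """Return YAML text for a localhost playbook with `task_count` debug tasks."""
    lines = [
        "---",
        "- hosts: localhost",
        "  gather_facts: no",
        "",
        "  tasks:",
    ]
    for index in range(1, task_count + 1):
        # Each debug task prints at least a task banner and a result line.
        lines.append("    - ansible.builtin.debug:")
        lines.append(f'        msg: "Filler output line {index}"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # 300 tasks comfortably exceed 250 lines of job output.
    print(build_playbook(300), end="")
```

Redirecting the script's output to a file in the project repository and running it as a job template should give an output well past one loaded page in the UI.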

MrBones757 · 2022-01-31T11:11:38Z

Possibly related but this seems like a more extreme issue as it actively breaks the usage of the product:
#11629

sooslaca · 2022-02-01T14:27:56Z

+1. same issue with 19.5.0 -> 19.5.1

dmagyar · 2022-02-01T14:51:23Z

I have the same issue after upgrading to 19.5.1 :( When can we expect a fix?

mtannertdev · 2022-02-01T16:54:15Z

I'm seeing this issue too with 19.5.1. I had to downgrade to make it work because 19.5.1 is unusable.

chofstede · 2022-02-04T10:29:20Z

I'm having the same issue with 19.5.1 on Kubernetes version v1.21.8

sooslaca · 2022-02-16T09:39:14Z

I'm wondering whether anybody has tested if the same bug is present in 20.0.0?

mick1627 · 2022-02-17T12:00:20Z

Same issue in 20.0.0

In the following graph you can see the CPU usage of the postgres rds database:

At 11:40, one job template was launched in 20.0.0
At 12:50, the same job template was launched in 19.5.0

kurokobo · 2022-03-07T13:17:08Z

As mentioned in several other issues (#11765, #11818), the [WARNING] message in the playbook output may be the trigger of this issue. I haven't dug into the code, but here is a minimal playbook that reproduces this issue for testing:

Playbook

---
- hosts: localhost
  gather_facts: no

  tasks:
    - ansible.builtin.debug:
        msg: "An example message before warning: 1"
    - ansible.builtin.debug:
        msg: "An example message before warning: 2"
    - ansible.builtin.debug:
        msg: "An example message before warning: 3"

    - name: Force to warn by using unsupported socks proxy
      ansible.builtin.shell: curl --socks5 localhost:9000 http://www.ansible.com
      args:
        warn: true
      changed_when: no
      failed_when: no

    - ansible.builtin.debug:
        msg: "An example message before warning: 4"
    - ansible.builtin.debug:
        msg: "An example message before warning: 5"
    - ansible.builtin.debug:
        msg: "An example message before warning: 6"

Example output in `ansible-playbook`

Example output in AWX 20.0.0

As shown in this animation, the page with this issue kept creating new requests to the following URLs forever. This may be what causes the high PSQL load.

/api/v2/jobs/*/job_events/?order_by=counter&page=*&page_size=*
/api/v2/jobs/X/job_events/?counter__gt=*&counter__lt=*&order_by=counter&page_size=*
/api/v2/jobs/*/job_events/?uuid=*
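To check whether a browser session is stuck in this loop, the request paths (for example copied from the browser's network tab or a proxy log) can be tallied against the three patterns above. This is a minimal Python sketch: the pattern names and helper functions are made up for illustration, and the regular expressions assume numeric job IDs, counters, and page values.

```python
import re
from collections import Counter

# One regex per URL pattern listed above; the names are illustrative only.
PATTERNS = {
    "paged": re.compile(
        r"^/api/v2/jobs/\d+/job_events/\?order_by=counter&page=\d+&page_size=\d+$"
    ),
    "range": re.compile(
        r"^/api/v2/jobs/\d+/job_events/"
        r"\?counter__gt=\d+&counter__lt=\d+&order_by=counter&page_size=\d+$"
    ),
    "uuid": re.compile(r"^/api/v2/jobs/\d+/job_events/\?uuid=[^&]+$"),
}


def classify(path):
    """Return the name of the first matching pattern, or None."""
    for name, pattern in PATTERNS.items():
        if pattern.match(path):
            return name
    return None


def count_requests(paths):
    """Tally how many requests fall under each known pattern."""
    return Counter(name for name in map(classify, paths) if name is not None)
```

A count that keeps growing for the same job while the output page sits idle would be consistent with the endless requests described above.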

In my environment, there appear to be some patterns:

If the WARNING is in the middle of the output
- This exact issue is triggered.
- It causes high load on PSQL and slows AWX down.
If the WARNING is at the end of the output
- No HTTP requests are repeated and no high PSQL load occurs.
- But no output after the WARNING (e.g. PLAY RECAP) is displayed, as in other issues (Job output is not displayed fully #11765, Most of output is greyed out since upgrading to 20.0.0 #11818)

sooslaca · 2022-03-09T09:42:50Z

I'm afraid to ask, but I'm asking anyway: has anyone checked whether the same issue is present in 20.0.1?

kurokobo · 2022-03-09T12:19:54Z

@sooslaca
I've confirmed that this issue still exists in 20.0.1.

kurokobo · 2022-03-09T16:15:26Z

@keithjgrant
I'm tagging you because you seem to have expertise in this area, judging by PR #11312.
It's a bit difficult for me to dig into the code because I'm not familiar with React yet, but do you have any ideas on this issue?

chris93111 · 2022-03-09T20:00:17Z

I don't know if it is really a problem with the WARNING, but with ?not__stdout="" the output is also good

keithjgrant · 2022-03-16T17:37:23Z

@kurokobo Yeah, I'm looking into this one. At this point I think this has the same root cause as #11765... it looks like you saw my comment there

keithjgrant · 2022-03-24T15:52:46Z

> In the following graph you can see the CPU usage of the postgres rds database

@mick1627 Is this with the UI job output page open in browser, or not? It sounds like there may be a couple different issues at play in the comments in this discussion

mick1627 · 2022-03-24T16:04:59Z

> > In the following graph you can see the CPU usage of the postgres rds database
>
> @mick1627 Is this with the UI job output page open in browser, or not? It sounds like there may be a couple different issues at play in the comments in this discussion

Yes, exactly, it's with the UI job output page open in browser.

infra-monkey · 2022-03-25T09:20:38Z

I confirm that when the output of the job fails to render, the CPU on my DB goes crazy.
If I navigate to another page (e.g. Jobs), CPU comes back to normal on the DB. If I go back to the job output, CPU ramps up again.

keithjgrant · 2022-03-29T17:19:12Z

closing as duplicate of #11765

nixocio added the component:ui and needs_triage labels Jan 31, 2022

nixocio added the component:api label Feb 4, 2022

MrBones757 mentioned this issue Mar 3, 2022: Most of output is greyed out since upgrading to 20.0.0 #11818 (Closed)

AlexSCorey removed the needs_triage label Mar 4, 2022

guliaka mentioned this issue Mar 7, 2022: Job output is not displayed fully #11765 (Closed)

kurokobo mentioned this issue Mar 9, 2022: Missing Job Output - Maybe related to failures or error message #11878 (Closed)

keithjgrant self-assigned this Mar 11, 2022

keithjgrant mentioned this issue Mar 23, 2022: Use new children-summary endpoint data to traverse job event tree #11944 (Merged)

keithjgrant closed this as completed Mar 29, 2022
