Exec driver is showing cache+RSS in memory usage UI view #15830

Closed
Blefish opened this issue Jan 20, 2023 · 5 comments · Fixed by #15909
Labels: stage/accepted (confirmed, and intend to work on; no timeline commitment though), theme/driver/exec, theme/ui, type/bug

Comments

Blefish commented Jan 20, 2023

Nomad version

Nomad v1.4.3 (f464aca)

Operating system and Environment details

Amazon Linux 2

Issue

The Nomad exec driver appears to report cache+RSS in the UI memory view, which leads operators to believe the application's memory usage is too high.

Reproduction steps

Launch an exec driver job that uses an external volume and reads a large file.

Expected Result

Nomad's memory usage UI view should show the memory used by the process, as seems to be the case with the docker driver.

Actual Result

Nomad's memory usage UI shows cache+RSS.

UI allocation view
Screenshot from 2023-01-20 09-07-12

nomad alloc stats
Screenshot from 2023-01-20 09-07-35

shoenig (Member) commented Jan 20, 2023

Hi @Blefish, can you post the detected cgroup version? e.g.

➜ nomad node status -self -verbose | grep cgroup.version
unique.cgroup.version           = v2

Blefish (Author) commented Jan 23, 2023

Hi, it seems like v1 is currently being detected. Could that be the reason?

# nomad node status -self -verbose | grep cgroup.version
unique.cgroup.version                    = v1

lgfa29 (Contributor) commented Jan 24, 2023

Hi @Blefish 👋

Would you also be able to provide the output of the /v1/client/allocation/:alloc_id/stats endpoint for the allocation in question?

I think the UI uses the RSS value as fallback, so I want to make sure your API result is non-zero.

Also, is the Nomad agent running as root?

Thanks!
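
For readers who want to check the same thing, here is a minimal sketch of pulling the memory fields from that endpoint. The agent address `http://127.0.0.1:4646` (Nomad's default HTTP port) and the helper names `stats_url`, `memory_summary`, and `fetch_stats` are assumptions for illustration, and no ACL token handling is included.

```python
import json
import urllib.request

# Assumption: a local agent on the default HTTP port, no ACL token required.
DEFAULT_ADDR = "http://127.0.0.1:4646"


def stats_url(alloc_id: str, addr: str = DEFAULT_ADDR) -> str:
    """Build the URL of the allocation stats endpoint."""
    return f"{addr.rstrip('/')}/v1/client/allocation/{alloc_id}/stats"


def memory_summary(stats: dict) -> dict:
    """Pick the allocation-level memory fields out of a decoded stats response."""
    mem = stats["ResourceUsage"]["MemoryStats"]
    return {key: mem.get(key, 0) for key in ("RSS", "Cache", "Swap", "Usage")}


def fetch_stats(alloc_id: str, addr: str = DEFAULT_ADDR) -> dict:
    """Fetch and decode the stats for one allocation (performs a network call)."""
    with urllib.request.urlopen(stats_url(alloc_id, addr)) as resp:
        return json.load(resp)
```

Calling `memory_summary(fetch_stats("<alloc_id>"))` on the client node would then show whether `RSS` and `Usage` are both non-zero.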

lgfa29 added this to Needs Triage in Nomad - Community Issues Triage via automation Jan 24, 2023
lgfa29 moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Jan 24, 2023
lgfa29 self-assigned this Jan 24, 2023

Blefish (Author) commented Jan 24, 2023

Absolutely, not a problem. I took a different allocation, adding the information below.
UI view
Screenshot from 2023-01-24 09-58-17
nomad alloc status 53cb923c-8330-fe4b-4d88-91f63dc257f9
Screenshot from 2023-01-24 09-59-29
curl /v1/client/allocation/53cb923c-8330-fe4b-4d88-91f63dc257f9/stats | jq

{
  "ResourceUsage": {
    "MemoryStats": {
      "RSS": 74035200,
      "Cache": 902995968,
      "Swap": 983515136,
      "MappedFile": 0,
      "Usage": 983515136,
      "MaxUsage": 997433344,
      "KernelUsage": 6488064,
      "KernelMaxUsage": 8470528,
      "Measured": [
        "RSS",
        "Cache",
        "Swap",
        "Usage",
        "Max Usage",
        "Kernel Usage",
        "Kernel Max Usage"
      ]
    },
    "CpuStats": {
      "SystemMode": 0,
      "UserMode": 0.9996656738113818,
      "TotalTicks": 20.62405806735794,
      "ThrottledPeriods": 0,
      "ThrottledTime": 0,
      "Percent": 0.9378834955597062,
      "Measured": [
        "System Mode",
        "User Mode",
        "Throttled Periods",
        "Throttled Time",
        "Percent"
      ]
    },
    "DeviceStats": []
  },
  "Tasks": {
    "taskname": {
      "ResourceUsage": {
        "MemoryStats": {
          "RSS": 74035200,
          "Cache": 902995968,
          "Swap": 983515136,
          "MappedFile": 0,
          "Usage": 983515136,
          "MaxUsage": 997433344,
          "KernelUsage": 6488064,
          "KernelMaxUsage": 8470528,
          "Measured": [
            "RSS",
            "Cache",
            "Swap",
            "Usage",
            "Max Usage",
            "Kernel Usage",
            "Kernel Max Usage"
          ]
        },
        "CpuStats": {
          "SystemMode": 0,
          "UserMode": 0.9996656738113818,
          "TotalTicks": 20.62405806735794,
          "ThrottledPeriods": 0,
          "ThrottledTime": 0,
          "Percent": 0.9378834955597062,
          "Measured": [
            "System Mode",
            "User Mode",
            "Throttled Periods",
            "Throttled Time",
            "Percent"
          ]
        },
        "DeviceStats": null
      },
      "Timestamp": 1674547207760584700,
      "Pids": {
        "23947": {
          "MemoryStats": {
            "RSS": 619208704,
            "Cache": 0,
            "Swap": 0,
            "MappedFile": 0,
            "Usage": 0,
            "MaxUsage": 0,
            "KernelUsage": 0,
            "KernelMaxUsage": 0,
            "Measured": [
              "RSS",
              "Swap"
            ]
          },
          "CpuStats": {
            "SystemMode": 0,
            "UserMode": 0.9996659636175891,
            "TotalTicks": 0,
            "ThrottledPeriods": 0,
            "ThrottledTime": 0,
            "Percent": 0.9996659546724126,
            "Measured": [
              "System Mode",
              "User Mode",
              "Percent"
            ]
          },
          "DeviceStats": null
        }
      }
    }
  },
  "Timestamp": 1674547207760584700
}

Interestingly, it reports Swap as roughly 983 MB, the same value as Usage (983515136 bytes). The host does not have swap enabled, but the exec task uses a volume that is mounted on the host via LVM. A file on that volume is read into the task.
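
As a quick sanity check on the allocation-level numbers in the stats output, `Usage` is close to `RSS + Cache`, and `Swap` is byte-for-byte equal to `Usage`. The sketch below only redoes that arithmetic on the posted values; the variable names are illustrative.

```python
# Allocation-level values copied from the posted stats output (bytes).
rss = 74_035_200
cache = 902_995_968
swap = 983_515_136
usage = 983_515_136

rss_plus_cache = rss + cache
remainder = usage - rss_plus_cache  # part of Usage not explained by RSS + Cache

MIB = 1024 * 1024
print(f"RSS           : {rss / MIB:8.1f} MiB")
print(f"Cache         : {cache / MIB:8.1f} MiB")
print(f"RSS + Cache   : {rss_plus_cache / MIB:8.1f} MiB")
print(f"Usage         : {usage / MIB:8.1f} MiB")
print(f"Unexplained   : {remainder / MIB:8.1f} MiB")
print(f"Swap == Usage : {swap == usage}")
```

The remainder is about 6.5 MB, so the figure shown in the UI is dominated by page cache rather than by the process's own memory.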

cat /proc/pid/status

RssAnon:	   72180 kB
RssFile:	  536508 kB
RssShmem:	       0 kB
VmData:	  201920 kB
VmStk:	     228 kB
VmExe:	   25936 kB
VmLib:	    1928 kB
VmPTE:	    1360 kB
VmPMD:	      36 kB
VmSwap:	       0 kB
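
The `RssAnon` and `RssFile` lines are the useful ones here: together with `RssShmem` they make up the process's resident set as reported in `/proc/<pid>/status`. A minimal sketch for splitting them out follows; the function name `rss_breakdown` is hypothetical, and the sample text is taken from the excerpt posted above.

```python
def rss_breakdown(status_text: str) -> dict:
    """Return the Rss* fields of /proc/<pid>/status, in kB."""
    fields = {}
    for line in status_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.startswith("Rss"):
            # Values look like "   72180 kB"; keep the number only.
            fields[name] = int(value.split()[0])
    return fields


sample = """RssAnon:\t   72180 kB
RssFile:\t  536508 kB
RssShmem:\t       0 kB
VmSwap:\t       0 kB
"""

parts = rss_breakdown(sample)
total_kb = sum(parts.values())
print(parts)
print(f"resident total: {total_kb} kB")
```

With the posted values, most of the resident memory is file-backed (536508 kB of 608688 kB). The anonymous part, 72180 kB or about 73.9 MB, is close to the allocation-level `RSS` of 74035200 bytes in the stats output.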

Edit: I forgot to mention that the Nomad agent is running as the root user.

lgfa29 (Contributor) commented Jan 26, 2023

Thanks for the extra info @Blefish.

It turns out it was a mismatch between the value picked by the UI and the CLI. #15909 adjusts the UI logic so it's the same as the CLI.
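
To illustrate how a mismatch like this can arise, here is a hypothetical sketch in which two consumers read the same `MemoryStats` but prefer different fields. The function `pick_memory` and both preference orders are assumptions made for illustration, not the actual Nomad UI or CLI code; #15909 contains the real change.

```python
def pick_memory(mem: dict, preference: tuple) -> int:
    """Return the first non-zero field named in `preference`, or 0 if none is set."""
    for field in preference:
        value = mem.get(field, 0)
        if value:
            return value
    return 0


# Allocation-level values from the stats output posted in this issue (bytes).
mem = {"RSS": 74_035_200, "Cache": 902_995_968, "Usage": 983_515_136}

usage_first = pick_memory(mem, ("Usage", "RSS"))  # hypothetical order A
rss_first = pick_memory(mem, ("RSS", "Usage"))    # hypothetical order B

print(f"Usage preferred: {usage_first} bytes")
print(f"RSS preferred  : {rss_first} bytes")
```

With these numbers the two orders differ by more than a factor of ten, which matches the kind of confusion described in the original report.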
