Constant memory growth with the file backend #6010

Closed
0xAX opened this issue Oct 16, 2024 · 16 comments
Labels
defect (Suspected defect such as a bug or regression), k8s

Comments

@0xAX
Contributor

0xAX commented Oct 16, 2024

Observed behavior

Hello,

We are running a 3-replica NATS cluster in Kubernetes and I am trying to use NATS streams for locking, very similar to what is described in #4803. I create a few streams (not many, about 5) with:

stream = Stream.create(pid, %Stream{name: stream <> "_#{proc_name}",
                                    subjects: ["#{stream}.#{proc_name}.locks.*"],
                                    max_msgs_per_subject: 1,
                                    discard_new_per_subject: true,
                                    discard: :new,
                                    duplicate_window: 5000000000, # 5 seconds
                                    max_age: 5000000000, # 5 seconds
                                    storage: :file})

Sorry for the Elixir syntax, but it should be clear since the stream setting names closely match the original ones. After the streams are created, multiple instances of the application start to send and purge messages on these streams (on subjects like stream_name.proc_name.locks.{LOCK_ID}). If the storage is set to file (it does not reproduce with memory storage), we see constant memory growth even though the streams are not overflowing, at least as far as Grafana shows:

[screenshot: streams]

The memory consumption is:

[screenshot: nats-resources]

The NATS cluster configuration is:

  nats.conf: |
    {
      "cluster": {
        "name": "tposs-long-test-10-14-v2",
        "no_advertise": true,
        "port": 6222,
        "routes": [
          "nats://cdm-0.cdm-headless:6222",
          "nats://cdm-1.cdm-headless:6222",
          "nats://cdm-2.cdm-headless:6222"
        ]
      },
      "prof_port": 65432,
      "http_port": 8222,
      "jetstream": {
        "max_file_store": 10Gi,
        "max_memory_store": 0,
        "store_dir": "/data"
      },
      "lame_duck_duration": "30s",
      "lame_duck_grace_period": "10s",
      "leafnodes": {
        "no_advertise": true,
        "port": 7422,
        "remotes": [
          {
            "urls": [
              $CDM_CONNECTION_URL
            ]
          }
        ]
      },
      "pid_file": "/var/run/nats/nats.pid",
      "port": 4222,
      "server_name": $SERVER_NAME
    }

The GOMEMLIMIT is set to:

env:
  - name: GOMEMLIMIT
    value: 512MiB

Resources are not set in Kubernetes for this particular deployment, but I tried setting them and it didn't help. Information about the streams:

/ # ./nats stream report
Obtaining Stream stats

╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                            Stream Report                                                           │
├────────────────────────────┬─────────┬───────────┬───────────┬──────────┬───────┬──────┬─────────┬─────────────────────────────────┤
│ Stream                     │ Storage │ Placement │ Consumers │ Messages │ Bytes │ Lost │ Deleted │ Replicas                        │
├────────────────────────────┼─────────┼───────────┼───────────┼──────────┼───────┼──────┼─────────┼─────────────────────────────────┤
│ TPOSS_LOCK_MGR_gx_lock     │ File    │           │         0 │ 0        │ 0 B   │ 0    │       0 │ tposs-long-test-10-14-v2-cdm-1* │
│ TPOSS_LOCK_MGR_nchf_lock   │ File    │           │         0 │ 0        │ 0 B   │ 0    │       0 │ tposs-long-test-10-14-v2-cdm-2* │
│ TPOSS_LOCK_MGR_npcf_lock   │ File    │           │         0 │ 0        │ 0 B   │ 0    │       0 │ tposs-long-test-10-14-v2-cdm-0* │
│ TPOSS_LOCK_MGR_ruckus_lock │ File    │           │         0 │ 0        │ 0 B   │ 0    │       0 │ tposs-long-test-10-14-v2-cdm-1* │
│ TPOSS_LOCK_MGR_scg_lock    │ File    │           │         0 │ 0        │ 0 B   │ 0    │       0 │ tposs-long-test-10-14-v2-cdm-2* │
│ TPOSS_LOCK_MGR_radius_lock │ File    │           │         0 │ 1        │ 130 B │ 0    │       0 │ tposs-long-test-10-14-v2-cdm-1* │
│ TPOSS_LOCK_MGR_gy_lock     │ File    │           │         0 │ 1        │ 172 B │ 0    │       0 │ tposs-long-test-10-14-v2-cdm-0* │
╰────────────────────────────┴─────────┴───────────┴───────────┴──────────┴───────┴──────┴─────────┴─────────────────────────────────╯

Some streams are empty, as you can see above, but that is expected. As I wrote above, messages are constantly purged (or deleted because of the TTL), so the streams are clearly not overflowing with data.

Information about one of the streams (only one shown here, but I suppose it is very similar for all of them):

/ # ./nats stream info TPOSS_LOCK_MGR_gy_lock
Information for Stream TPOSS_LOCK_MGR_gy_lock created 2024-10-15 14:09:00

              Subjects: TPOSS_LOCK_MGR.gy_lock.locks.*
              Replicas: 1
               Storage: File

Options:

             Retention: Limits
       Acknowledgments: true
        Discard Policy: New Per Subject
      Duplicate Window: 5.00s
     Allows Msg Delete: true
          Allows Purge: true
        Allows Rollups: false

Limits:

      Maximum Messages: unlimited
   Maximum Per Subject: 1
         Maximum Bytes: unlimited
           Maximum Age: 5.00s
  Maximum Message Size: unlimited
     Maximum Consumers: unlimited

Cluster Information:

                  Name: tposs-long-test-10-14-v2
                Leader: tposs-long-test-10-14-v2-cdm-0

State:

              Messages: 1
                 Bytes: 172 B
        First Sequence: 41,837,957 @ 2024-10-16 07:52:51
         Last Sequence: 41,837,957 @ 2024-10-16 07:52:51
      Active Consumers: 0
    Number of Subjects: 1

Also attached is a memory profile taken as described in https://docs.nats.io/running-a-nats-service/nats_admin/profiling: mem.prof.zip. The profiler data was taken from the instance with the highest memory consumption (the yellow line in the Grafana screenshot).

Expected behavior

Stable memory consumption, since the number of messages within the streams does not grow.

Server and client version

The NATS version is nats:2.10.21-alpine. The client is the latest version of https://github.com/nats-io/nats.ex.

@0xAX 0xAX added the defect (Suspected defect such as a bug or regression) label Oct 16, 2024
@neilalexander
Member

The memory profile you posted shows less than 6MB of actual heap usage. Can you please look at what the actual RSS of the nats-server process is inside the container? And the output of nats server ls?

@0xAX
Contributor Author

0xAX commented Oct 16, 2024

@neilalexander the memory usage inside the container is certainly much lower as well:

/ # ps -o rss,comm | grep nats-server
 25m nats-server
2652 nats-server-con

By the way, could it be because of constant subject creation? We post messages to TPOSS_LOCK_MGR.gy_lock.locks.{LOCK_ID}, and the LOCK_ID part can always be different over time.

For nats server ls, with the config I've pasted above I see:

/ # ./nats server ls
nats: error: no results received, ensure the account used has system privileges and appropriate permissions

As soon as I figure out how to get it, I'll post it.

@neilalexander
Member

25MB is very much in the range of what we'd expect here. nats server ls will need to be run against the system account BTW, but no big deal; I suspect it'll just report the same.
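
For example, assuming a CLI context (or credentials) is set up for the system account (the context name here is just a placeholder), something like:

nats --context system server ls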

Inside the container, can you please do free -m and paste the results? I suspect some of this is kernel page caches.

@0xAX
Contributor Author

0xAX commented Oct 16, 2024

/ # free -m
              total        used        free      shared  buff/cache   available
Mem:          64189       20997       15680          11       27512       42462
Swap:          1950           0        1950

from within the container. And yes, that screenshot from Grafana shows container_memory_working_set_bytes; with container_memory_rss it looks pretty ok:

[screenshot: rss]

It is still concerning though, as container_memory_working_set_bytes is what the OOM killer looks at, I suppose.

@neilalexander
Member

How about the total_active_file metric? How does that one look?

@0xAX
Contributor Author

0xAX commented Oct 16, 2024

Unfortunately I do not have this metric for some reason, but maybe the info from the cgroup memory.stat will be useful:

cat /sys/fs/cgroup/memory.stat | grep file
file 857731072
file_mapped 0
file_dirty 4096
file_writeback 0
file_thp 0
inactive_file 644096000
active_file 213635072
workingset_refault_file 0
workingset_activate_file 0
workingset_restore_file 0

@neilalexander
Member

So we aren't experiencing a memory leak in the NATS Server process in this case, which is something.

I've been reading kubernetes/kubernetes#43916, which is quite the rollercoaster, but the consensus seems to be that Kubernetes incorrectly counts the page cache when deciding memory pressure. The "fix" seems to be to set equal values for the memory request and memory limit; that should cap the kernel page cache for that cgroup at the container limit.

Then, for good measure, set GOMEMLIMIT to some value below that limit, e.g. try something like 256MB for the memory request/limit in K8s, and then 128MB for GOMEMLIMIT.

@0xAX
Contributor Author

0xAX commented Oct 16, 2024

Thank you, I'll do a new test with GOMEMLIMIT 256 and then 128 afterwards, and also set the Kubernetes resources close to it.

@0xAX
Contributor Author

0xAX commented Oct 16, 2024

After setting:

resources:
  limits:
    cpu: 500m
    memory: 500Mi
  requests:
    cpu: 500m
    memory: 500Mi

and

- name: GOMEMLIMIT
  value: 256MiB

I see the following picture:

[screenshot: nats-memory]

Don't pay too much attention to the last slow-down; that's just the test stopping.

@neilalexander
Member

So just to be clear, in this case, the NATS Server process doesn't restart and maintains a stable RSS?

If so, after each "see-saw" drop, free -m should report that the buff/cache utilisation has dropped?

@giannello

giannello commented Oct 16, 2024

@0xAX are you running nats on a GKE cluster? Which version?

@0xAX
Contributor Author

0xAX commented Oct 17, 2024

No, it is not GKE. The version:

Server Version: version.Info{Major:"1", Minor:"30", GitVersion:"v1.30.4+rke2r1", GitCommit:"a51b3b711150f57ffc1f526a640ec058514ed596", GitTreeState:"clean", BuildDate:"2024-08-15T06:00:17Z", GoVersion:"go1.22.5 X:boringcrypto", Compiler:"gc", Platform:"linux/amd64"}

@0xAX
Contributor Author

0xAX commented Oct 17, 2024

So just to be clear, in this case, the NATS Server process doesn't restart and maintains a stable RSS?

No, it does not restart, and that is fine with me, but the thing is that clients start to get timeout errors around the time of such drops.

If so, after each "see saw" drop, free -m should report that the buff/cache utilisation has dropped?

There is a drop. Before:

/ # free -m
              total        used        free      shared  buff/cache   available
Mem:          64189       21215       14064          11       28910       42274
Swap:          1950           0        1950

after:

/ # free -m
              total        used        free      shared  buff/cache   available
Mem:          64189       21184       14340          11       28665       42319
Swap:          1950           0        1950

@0xAX
Contributor Author

0xAX commented Oct 18, 2024

Could it be because of the use case itself? We store one key per subject within the stream (created with the max_msgs_per_subject: 1 and discard_new_per_subject: true options and the subject topic.*). Each key is a locked-{DATETIME} message published to the subject topic.{KEY}, where KEY is different for each entity that wants to acquire a lock during the program's lifetime. After the 'lock' represented by this message is released, we purge the stream with the same subject topic.{KEY}. Could it be a problem that the stream accumulates too many subjects over time, since KEY is always different and in theory there could be an unlimited number of variations? Not too many subjects at any one moment, but over time, as locks are created and released continuously with different subjects. Or is this use case normal, and does the purge operation clean up the subject? It seems ok to me, without knowing the internals of NATS and judging by how it behaves when only in-memory storage is used, but maybe someone sees something wrong with this use case?
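
For reference, here is a rough sketch of one lock cycle using the nats CLI instead of our Elixir client (the stream and subject names are taken from the report above, and <LOCK_ID> is a placeholder):

# Acquire: publish one message to the per-lock subject. With
# max_msgs_per_subject: 1 and discard_new_per_subject: true, a second
# publish to the same subject is rejected until it is purged or max_age expires.
# Sending it as a request so the JetStream ack (or an error) comes back as the reply.
nats req 'TPOSS_LOCK_MGR.gy_lock.locks.<LOCK_ID>' "locked-$(date -Iseconds)"

# Release: purge only this subject from the stream.
nats stream purge TPOSS_LOCK_MGR_gy_lock --subject='TPOSS_LOCK_MGR.gy_lock.locks.<LOCK_ID>' --force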

@neilalexander
Member

neilalexander commented Oct 18, 2024

I don't think you have a usage pattern problem here. In fact I don't think you have a problem here at all. The continuous stream writes result in the creation of new filestore blocks (this is normal operation) and therefore the kernel page cache is growing with in-memory caches of those files.

What is supposed to happen is that the operating system will evict caches/buffers when the memory needs to be reclaimed for something else. In the free -m output, free and available are reported separately — free is how much memory is doing absolutely nothing and is ready for use, whereas available is how much memory can be reclaimed and freed if suddenly it's needed for something else.

In this case, the host system with 64GB RAM has more than enough memory to satisfy application pressure without having to evict the caches because, even with 28GB of buffers/caches, there's still 14GB that isn't being used for anything, so the buffers/caches will just keep growing to fill the free space. This isn't a bad thing BTW — these caches are really good for improving application read performance.

The reason this looks so bad is that Kubernetes counts the buffers/caches towards the pod's real-world memory usage (which, in my opinion, it shouldn't), which means that the container reports memory usage as being much higher than the sum of the RSS of the processes inside it.

The only bit that's particularly relevant to us is the nats-server RSS usage, which based on your earlier reporting is very low (at only 25MB-ish). That means that NATS isn't leaking memory, it's normal kernel page cache operation that you're seeing.
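
If it helps, here is a rough way to see this from inside the container (a sketch assuming cgroup v2 mounted at /sys/fs/cgroup; the subtraction approximates what cAdvisor exports as container_memory_working_set_bytes):

# Total memory charged to the cgroup, page cache included
usage=$(cat /sys/fs/cgroup/memory.current)
# Page cache the kernel considers easily reclaimable
inactive_file=$(awk '$1 == "inactive_file" {print $2}' /sys/fs/cgroup/memory.stat)

echo "cgroup memory.current: $usage bytes"
echo "approx. working set:   $((usage - inactive_file)) bytes"
# Compare with the process itself, which should stay flat at around 25MB
ps -o rss,comm | grep nats-server

The working set number will keep tracking the active page cache created by the filestore block writes, while the nats-server RSS stays flat.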

@0xAX
Contributor Author

0xAX commented Oct 18, 2024

@neilalexander thank you. We will run a longer test, and if it is not killed by the OOM killer, I am pretty OK with this. I'll close the issue for now and will come back if there are new findings.

@0xAX 0xAX closed this as completed Oct 18, 2024