
Pods restart because their log volume exceeds its limit #620

Closed
siegfriedweber opened this issue Jul 11, 2023 · 2 comments

Comments

@siegfriedweber
Member

Pods restart because their log volume exceeds its limit:

Usage of EmptyDir volume "log" exceeds the limit "11Mi".

This was observed in an OpenShift cluster for ZooKeeper 3.8.0 and HBase.

@siegfriedweber
Member Author

siegfriedweber commented Jul 11, 2023

Observations with ZooKeeper 3.8.1 in a local kind cluster:

The log file rollover works as desired:

  1. The logs are written to zookeeper.log4j.xml.
  2. When the log file exceeds 5 MiB (5,244,013 bytes in the test, i.e. 5 MiB + 1,133 bytes), it is renamed to zookeeper.log4j.xml.1 and new logs are again written to zookeeper.log4j.xml.
  3. When zookeeper.log4j.xml again exceeds 5 MiB (5,243,584 bytes in the test, i.e. 5 MiB + 704 bytes), zookeeper.log4j.xml.1 is deleted first (see also https://github.com/qos-ch/logback/blob/v_1.2.10/logback-core/src/main/java/ch/qos/logback/core/rolling/FixedWindowRollingPolicy.java#L128-L132), and then zookeeper.log4j.xml is renamed to zookeeper.log4j.xml.1. A minimal model of this rollover is sketched below.
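
A minimal sketch of that rollover behavior in Python (not the actual logback code; the file names and the 5 MiB threshold are taken from the observations above):

```python
from pathlib import Path

# Minimal model of a size-based rollover with a fixed window of one archive
# file, mirroring the steps observed above (not the actual logback code).
LOG = Path("zookeeper.log4j.xml")
ARCHIVE = Path("zookeeper.log4j.xml.1")
MAX_SIZE = 5 * 1024 * 1024  # rollover threshold: 5 MiB

def append_log_entry(entry: str) -> None:
    # Roll over once the active file has grown beyond the threshold.
    if LOG.exists() and LOG.stat().st_size > MAX_SIZE:
        if ARCHIVE.exists():
            ARCHIVE.unlink()   # delete zookeeper.log4j.xml.1 first
        LOG.rename(ARCHIVE)    # then rename the active file to .1
    with LOG.open("a") as f:   # new entries go to a fresh zookeeper.log4j.xml
        f.write(entry + "\n")
```

So at peak, two files of slightly more than 5 MiB each exist at the same time, which matches the totals below.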

The logs of the prepare container used 515 bytes.

A total of 10,488,112 bytes was used, which means that nearly 1 MiB (1,046,224 bytes) of the 11 MiB volume remained unused.
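
The numbers add up as follows (assuming the 11 MiB volume limit from the error message above):

```python
# Peak usage just before the second rollover (assumption: 11 MiB volume
# limit, as in the reported error message).
archived = 5_244_013  # zookeeper.log4j.xml.1, created by the first rollover
active   = 5_243_584  # zookeeper.log4j.xml, about to be rolled over
prepare  = 515        # logs of the prepare container

used  = archived + active + prepare  # 10,488,112 bytes
limit = 11 * 1024 * 1024             # 11 MiB = 11,534,336 bytes
print(used, limit - used)            # 10488112 1046224
```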

If the size limit of the log volume is set to 2 MiB, then the Pod is evicted with the error message shown above once the log file exceeds 2,300,856 bytes (~ 2.2 MiB). So the size limit is enforced in a kind cluster.

Possible explanations for the behavior in OpenShift:

  • The file system uses a large block size.
  • The deletion does not happen immediately, or the renaming temporarily occupies twice the file size.
  • The node ephemeral storage is not sufficient:

    A size limit can be specified for the default medium, which limits the capacity of the emptyDir volume. The storage is allocated from node ephemeral storage. If that is filled up from another source (for example, log files or image overlays), the emptyDir may run out of capacity before this limit.

    https://kubernetes.io/docs/concepts/storage/volumes/#emptydir

  • Some log entries are huge (more than 0.5 MiB).

@siegfriedweber
Member Author

The issue can be reproduced on our OpenShift cluster.

The problem is that the disk usage of a log file is not just the minimally required number of blocks times the block size; the file system can reserve considerably more blocks. For instance, a zookeeper.log4j.xml with 4,127,151 bytes occupied 4,032 blocks. Then further log entries were written, the actual file size increased to 4,132,477 bytes, and the file suddenly occupied 8,128 blocks. For smaller files, fewer additional blocks are reserved. The number of reserved blocks can also decrease again: for instance, zookeeper.log4j.xml.1 contains 5,243,985 bytes and occupies only 5,124 blocks.
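
A simple way to see the difference between the apparent file size and the allocated space is sketched below; this snippet is only an illustration, not what was run on the cluster:

```python
import os
import sys

# Compare a file's apparent size with the space actually allocated for it.
# os.stat reports st_blocks in 512-byte units; on file systems that
# preallocate space for growing files, the allocated size can be much
# larger than the apparent size and can shrink again later.
for path in sys.argv[1:]:
    st = os.stat(path)
    allocated = st.st_blocks * 512
    print(f"{path}: {st.st_size} bytes apparent, "
          f"{allocated} bytes allocated ({st.st_blocks} blocks)")
```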
