diff --git a/pip/pip-323.md b/pip/pip-323.md new file mode 100644 index 0000000000000..dc607fff3d58c --- /dev/null +++ b/pip/pip-323.md @@ -0,0 +1,171 @@ +# PIP-323: Complete Backlog Quota Telemetry + +# Background knowledge + +## Backlog + +A topic in Pulsar is the place where messages are written to. They are consumed by subscriptions. A topic can have many +subscriptions, and it is those that maintains the state of message acknowledgment, per subscription - which messages +were acknowledged and which were not. + +A subscription backlog is the set of unacknowledged messages in that subscription. +A subscription backlog size is the sum of the size of the unacknowledged messages (in bytes).. + +Since a topic can have many subscriptions, and each has its own backlog, how does one define a backlog for a topic? +A topic backlog is defined as the backlog of the subscription which has the **oldest** unacknowledged message. +Since acknowledged messages can be interleaved with unacknowledged messages, calculating the exact size of that +subscription backlog can be expensive as it requires I/O operations to read the messages from the ledgers. +For that reason, the topic backlog size is actually defined to be the *estimated* backlog size of that subscription. +It does so by summarizing the size of all the ledgers, starting from the current active one (the one being written to), +up to the ledger which contains the oldest unacknowledged message for that subscription (There is actually a faster +way to calculate it, but this was the definition chosen for this estimation in Pulsar). + +A topic backlog age is the age of the oldest unacknowledged message (same subscription as defined for topic backlog size). +If that message was written 30 minutes ago, its age is 30 minutes, and so is the topic backlog age. + +## Backlog Quota + +Pulsar has a feature called [backlog quota](https://pulsar.apache.org/docs/3.1.x/cookbooks-retention-expiry/#backlog-quotas). +It allows a user to define a quota - in effect, a limit - which limits the topic backlog. +There are two types of quotas: + +1. Size based: The limit is for the topic backlog size (as we defined above). +2. Time based: The limit is for the topic backlog age (as we defined above). + +Once a topic backlog exceeds either one of those limits, an action is taken to hold the backlog to that limit: + +* The producer write is placed on hold for a certain amount of time before failing. +* The producer write is failed +* The subscriptions oldest unacknowledged messages will be acknowledged in-order until both the topic backlog size or + age will fall inside the limit (quota). The process is called backlog eviction (happens every interval). + +The quotas can be defined as a default value for any topic, by using the following broker configuration keys: +`backlogQuotaDefaultLimitBytes` and `backlogQuotaDefaultLimitSecond`. + +The quota can also be specified directly for all topics in a given namespace using the namespace policy, +or a specific topic using a topic policy. + +## Monitoring Backlog Quota + +The user today can calculate quota used for size based limit, since there are two metrics exposed today on +a topic level: `pulsar_storage_backlog_quota_limit` and `pulsar_storage_backlog_size`. +You can just divide the two to get a percentage and know how close the topic backlog to its size limit. + +For the time-based limit, the only metric exposed today is the quota itself - `pulsar_storage_backlog_quota_limit_time` + +## Backlog Quota Eviction in the Broker + +The broker has a method called `BrokerService.monitorBacklogQuota()`. It is scheduled to run every x seconds, +as defined by the configuration `backlogQuotaCheckIntervalInSeconds`. +This method loops over all persistent topics, and for each topic is checks whether the topic backlog exceeded +either one of those topics. + +As mentioned before, checking backlog size is a memory-only calculation, since +each topic has the list of ledgers stored in-memory, including the size of each ledger. Same goes for the subscriptions, +they are all stored in memory, and the `ManagedCursor` keeps track of the subscription with the oldest unacknowledged +message, thus retrieveing it is O(1). Checking backlog based on time is costly if configuration key +`preciseTimeBasedBacklogQuotaCheck` was set to true. In that case, it needs to read the oldest message to obtain +its public timestamp, which is expensive in terms of I/O. If it was set to false, it's in-memory access only, since +it uses the age of the ledger instead of the message, and the ledgers metadata is kept in memory. + +For each topic which has exceeded its quota, if the policy chosen is eviction, then the process it performed +synchronously. This process consumes I/O, as it needs read messages (using skip) to know where to stop acknowledging +messages. + + +# Motivation + +Users which have defined backlog quota based on time, have no means today to monitor the backlog quota usage, +time-wise, to know whether the topic backlog is close to its time limit or even passed it. + +If it has passed it, the user has no means to know if it happened, when and how many times. + + +# Goals + +## In Scope +- Allow the user to know the backlog quota usage for time-based quota, per topic +- Allow the user to know how many times backlog eviction happened, and for which backlog quota type + +## Out of Scope + +None + + +# High Level Design + +We'll use the existing backlog monitoring process running in intervals. For each topic, the subscription with +the oldest unacknowledged message is retrieved, to calculate the topic backlog age. At that point, we will +cache the following for the oldest unacknowledged message: +* Subscription name +* Message position +* Message publish timestamp + +That cache will allow us to add a metric exposing the topic backlog age - `pulsar_storage_backlog_age_seconds`, +which will be both consistent (same ones used for deciding on backlog eviction) and cheap to retrieve +(no additional I/O involved). +Coupled with the existing `pulsar_storage_backlog_quota_limit_time` metric, the user can use both to divide and +get the usage of the quota (both are in seconds units). + +We will add the subscription name containing the oldest unacknowledged message to the Admin API +topic stats endpoints (`{tenant}/{namespace}/{topic}/stats` and `{tenant}/{namespace}/{topic}/partitioned-stats`), +allowing the user a complete workflow: alert using metrics when topic backlog is about to be exceeded, then +query topic stats for that topic to retrieve the subscription name which contains the oldest message. +For completeness, we will also add the backlog quota limits, both age and size, and the age of oldest +unacknowledged message. + +We will add a metric allowing the user to know how many times the usage exceeded the quota, both for time or size - +`pulsar_storage_backlog_quota_exceeded_evictions_total`, where the `quota_type` label will be either `time` or +`size`. Monitoring that counter over time will allow the user to know when a topic backlog exceeded its quota, +and if backlog eviction was chosen as action, then it happened, and how many times. + +Some users may want the backlog quota check to happen more frequently, and as a consequence, the backlog age +metric more frequently updated. They can modify `backlogQuotaCheckIntervalInSeconds` configuration key, but without +knowing how long this check takes, it will be hard for them. Hence, we will add the metric +`pulsar_storage_backlog_quota_check_duration_seconds` which will be of histogram type. + +# Detailed Design + +## Public-facing Changes + +### Public API +Adding the following to the response of topic stats, of both `{tenant}/{namespace}/{topic}/stats` +and `{tenant}/{namespace}/{topic}/partitioned-stats`: + +* `backlogQuotaLimitSize` - the size in bytes of the topic backlog quota +* `backlogQuotaLimitTime` - the topic backlog age quota, in seconds. +* `oldestBacklogMessageAgeSeconds` - the age of the oldest unacknowledged (i.e. backlog) message, measured by + the time elapsed from its published time, in seconds. This value is recorded every backlog quota check + interval, hence it represents the value seen in the last check. +* `oldestBacklogMessageSubscriptionName` - the name of the subscription containing the oldest unacknowledged message. + This value is recorded every backlog quota check interval, hence it represents the value seen in the last check. + + +### Metrics + +| Name | Description | Attributes | Units | +|----------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|--------------------------------------------------------|---------| +| `pulsar_storage_backlog_age_seconds` | Gauge. The age of the oldest unacknowledged message (backlog) | cluster, namespace, topic | seconds | +| `pulsar_storage_backlog_quota_exceeded_evictions_total` | Counter. The number of times a backlog was evicted since it has exceeded its quota | cluster, namespace, topic, quota_type = (time \| size) | | +| `pulsar_storage_backlog_quota_check_duration_seconds` | Histogram. The duration of the backlog quota check process. | cluster | seconds | +| `pulsar_broker_storage_backlog_quota_exceeded_evictions_total` | Counter. The number of times a backlog was evicted since it has exceeded its quota, in broker level | cluster, quota_type = (time \| size) | | + +* Since `pulsar_storage_backlog_age_seconds` can not be aggregated, with proper meaning, to a namespace-level, it will + not be included as a metric when configuration key `exposeTopicLevelMetricsInPrometheus` is set to false. +* `pulsar_storage_backlog_quota_exceeded_evictions_total` will be included as a metric also in namespace aggregation. + +# Alternatives + +One alternative is to separate the backlog quota check into 2 separate processes, running in their own frequency: +1. Check backlog quota exceeded for all persistent topics. The result will be marked in memory. + If precise time backlog quota was configured then this will the I/O cost as described before. +2. Evict messages for those topics marked. + +This *may* enable more frequent updates to the backlog age metric making it more fresh, but the cost associated with it +might be high, since it might result in more frequent I/O calls, especially with many topics. +Another disadvantage is that it makes the backlog check and eviction more complex. + +# Links + +* Mailing List discussion thread: https://lists.apache.org/thread/xv33xjjzc3t2n06ynz2gmcd4s06ckrqh +* Mailing List voting thread: https://lists.apache.org/thread/x2ypnft3x5jdyyxbwgvzxgcw20o44vps