From 1ff3a596d166b475a840d672081b6b0bb0fde9c3 Mon Sep 17 00:00:00 2001
From: Joe Elliott
Date: Tue, 24 Aug 2021 16:15:55 -0400
Subject: [PATCH] Doc improvements (#909)

* Updated to match querier poll cycle

Signed-off-by: Joe Elliott

* Removed incorrect sentence in runbook

Signed-off-by: Joe Elliott

* Added notes

Signed-off-by: Joe Elliott
---
 docs/tempo/website/configuration/polling.md          | 9 +++++++--
 docs/tempo/website/operations/polling.md             | 4 +++-
 operations/jsonnet/microservices/configmap.libsonnet | 2 +-
 operations/tempo-mixin/runbook.md                    | 5 +----
 4 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/docs/tempo/website/configuration/polling.md b/docs/tempo/website/configuration/polling.md
index 26f1a21396e..06520c96c19 100644
--- a/docs/tempo/website/configuration/polling.md
+++ b/docs/tempo/website/configuration/polling.md
@@ -37,11 +37,16 @@ ingester:
 
 The compactor `compacted_block_retention` is used to keep a block in the backend for a given period of time after it
 has been compacted and the data is no longer needed. This allows queriers with a stale blocklist to access
-these blocks successfully until they complete their polling cycles and have up to date blocklists.
+these blocks successfully until they complete their polling cycles and have up to date blocklists. Like the
+`complete_block_timeout`, this should be at a minimum 2x the configured `blocklist_poll` duration.
 
 ```
 compactor:
   compaction:
     # How long to leave a block in the backend after it has been compacted successfully. Default is 1h
     [compacted_block_retention: <duration>]
-```
\ No newline at end of file
+```
+
+Additionally, it is important that the querier `blocklist_poll` duration is greater than or equal to the compactor
+`blocklist_poll` duration. Otherwise, a querier may not correctly check all of its assigned blocks and may incorrectly return 404s.
+It is recommended to simply set both components to use the same poll duration.
\ No newline at end of file
diff --git a/docs/tempo/website/operations/polling.md b/docs/tempo/website/operations/polling.md
index 54bc9962ab9..33eb4fc9c04 100644
--- a/docs/tempo/website/operations/polling.md
+++ b/docs/tempo/website/operations/polling.md
@@ -14,7 +14,9 @@ what's called a tenant index. The tenant index is a gzip'ed json file located at
 an entry for every block and compacted block for that tenant. This is done once every `blocklist_poll` duration.
 All other compactors and all queriers then rely on downloading this file, unzipping it and using the contained list.
-Again this is done once every `blocklist_poll` duration.
+Again this is done once every `blocklist_poll` duration. **NOTE:** It is important that the querier `blocklist_poll` duration
+is greater than or equal to the compactor `blocklist_poll` duration. Otherwise, a querier may not correctly check
+all of its assigned blocks and may incorrectly return 404s.
 
 Due to this behavior a given compactor or querier will often have an out of date blocklist. During normal operation
 it will stale by at most 2x the configured `blocklist_poll`.
 See [configuration]({{< relref "../configuration/polling" >}})
diff --git a/operations/jsonnet/microservices/configmap.libsonnet b/operations/jsonnet/microservices/configmap.libsonnet
index c5e1aa29d28..d261a2b835b 100644
--- a/operations/jsonnet/microservices/configmap.libsonnet
+++ b/operations/jsonnet/microservices/configmap.libsonnet
@@ -73,7 +73,7 @@
     },
     storage+: {
       trace+: {
-        blocklist_poll: '10m',
+        blocklist_poll: '5m',
       },
     },
   },
diff --git a/operations/tempo-mixin/runbook.md b/operations/tempo-mixin/runbook.md
index 85823cb811c..8e84ee3283a 100644
--- a/operations/tempo-mixin/runbook.md
+++ b/operations/tempo-mixin/runbook.md
@@ -6,10 +6,7 @@ This document should help with remediating operational issues in Tempo.
 ## TempoRequestLatency
 
 Aside from obvious errors in the logs the only real lever you can pull here is scaling. Use the Reads or Writes dashboard
-to identify the component that is struggling and scale it up. It should be noted that right now quickly scaling the
-Ingester component can cause 404s on traces until they are flushed to the backend. For safety you may only want to
-scale one per hour. However, if Ingesters are falling over, it's better to scale fast, ingest successfully and throw 404s
-on query than to have an unstable ingest path. Make the call!
+to identify the component that is struggling and scale it up.
 
 The Query path is instrumented with tracing (!) and this can be used to diagnose issues with higher latency. View the logs of the Query Frontend, where you can find an info level message for every request.
 Filter for requests with high latency and view traces.
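
As a reader's aid, and not part of the patch itself, below is a minimal sketch of the configuration the added guidance describes, assuming a deployment where the compactor and querier each load their own Tempo config file. The key names follow the snippets quoted in the diff above; the `5m` and `1h` values are illustrative only.

```
# Compactor config file (illustrative sketch, values are examples only)
storage:
  trace:
    blocklist_poll: 5m              # how often the compactor rebuilds the per-tenant index
compactor:
  compaction:
    compacted_block_retention: 1h   # at least 2x blocklist_poll, so stale queriers can still read compacted blocks
---
# Querier config file (illustrative sketch)
storage:
  trace:
    blocklist_poll: 5m              # equal to (or greater than) the compactor's poll duration
```

Keeping both components on the same poll duration is the simplest way to satisfy the "greater than or equal" rule, and pairing it with a `compacted_block_retention` of at least 2x that duration gives a querier with a stale blocklist time to complete a poll cycle before compacted blocks are removed.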