promtail terminates when loki returns a 500 error from back-pressure #89

Closed
BrianChristie opened this issue Dec 13, 2018 · 3 comments
Labels
component/agent type/bug Something is not working as expected

BrianChristie commented Dec 13, 2018

It appears promtail is terminating (and the pod is restarting) when it receives a 500 error from the loki server.
"Error sending batch: Error doing write: 500 - 500 Internal Server Error"
From a discussion on Slack, this occurs when the remote end is overloaded. Possibly this should be a more specific 503 slow down error?

Perhaps back-pressure from the remote end should be expected and handled by promtail, by retrying the request with a capped exponential backoff with jitter.

Additionally, promtail could expose a metric indicating its consumer lag, that is, the delta between the current head of the log file and what it has successfully processed and sent to the remote server. That could be used in AlertManager to warn when there is a danger of losing logs (for example, in Kubernetes, Nodes automatically rotate and delete log files as they grow).

@tomwilkie
Contributor

Thanks for the report @BrianChristie! This should not happen. Would you mind posting the last few tens of lines from promtail before it exits? Do you know the exit code?

BrianChristie commented Dec 18, 2018

I failed to mention, this was with sending logs to logs-us-west1.grafana.net.
I just gave it a try again and I'm not seeing the error now, presumably the backend capacity has been increased. Also I may have been mistaken about the process terminating.

Here's the prior logs:

```
10:00:52.198 promtail level=info ts=2018-12-13T10:00:46.685374757Z caller=kubernetes.go:178 component=discovery discovery=k8s msg="Using pod service account via in-cluster config"
10:00:52.198 promtail level=info ts=2018-12-13T10:00:46.686164863Z caller=kubernetes.go:178 component=discovery discovery=k8s msg="Using pod service account via in-cluster config"
10:00:52.381 promtail level=info ts=2018-12-13T10:00:52.381542318Z caller=targetmanager.go:140 msg="Adding target" key="{app=\"node-exporter\", controller_revision_hash=\"3272103897\", instance=\"node-exporter-2lkz4\", job=\"monitoring/node-exporter\", namespace=\"monitoring\", pod_template_generation=\"6\"}"
10:00:52.381 promtail time="2018-12-13T10:00:52Z" level=info msg="newTarget{app=\"node-exporter\", controller_revision_hash=\"3272103897\", instance=\"node-exporter-2lkz4\", job=\"monitoring/node-exporter\", namespace=\"monitoring\", pod_template_generation=\"6\"}"
10:00:52.484 promtail level=info ts=2018-12-13T10:00:52.484306778Z caller=targetmanager.go:140 msg="Adding target" key="{app=\"cluster-fluentd-cloudwatch\", controller_revision_hash=\"811795448\", instance=\"cluster-fluentd-cloudwatch-t2mrv\", job=\"cluster-fluentd-cloudwatch/cluster-fluentd-cloudwatch\", namespace=\"cluster-fluentd-cloudwatch\", pod_template_generation=\"2\"}"
10:00:52.484 promtail time="2018-12-13T10:00:52Z" level=info msg="newTarget{app=\"cluster-fluentd-cloudwatch\", controller_revision_hash=\"811795448\", instance=\"cluster-fluentd-cloudwatch-t2mrv\", job=\"cluster-fluentd-cloudwatch/cluster-fluentd-cloudwatch\", namespace=\"cluster-fluentd-cloudwatch\", pod_template_generation=\"2\"}"
10:00:52.490 promtail 2018/12/13 10:00:52 Seeked /var/log/pods/eeb66730-d92b-11e8-b4be-0ab5bd2076ce/cluster-fluentd-cloudwatch/0.log - &{Offset:2450983 Whence:0}
10:00:52.501 promtail 2018/12/13 10:00:52 Seeked /var/log/pods/fda8ecf6-f3d4-11e8-aa30-02194dce8c18/node-exporter/0.log - &{Offset:5617 Whence:0}
10:00:52.501 promtail time="2018-12-13T10:00:52Z" level=info msg="Tailing file: /var/log/pods/fda8ecf6-f3d4-11e8-aa30-02194dce8c18/node-exporter/0.log"
10:00:52.586 promtail time="2018-12-13T10:00:52Z" level=info msg="Tailing file: /var/log/pods/eeb66730-d92b-11e8-b4be-0ab5bd2076ce/cluster-fluentd-cloudwatch/0.log"
10:00:52.587 promtail level=info ts=2018-12-13T10:00:52.587002987Z caller=targetmanager.go:140 msg="Adding target" key="{app=\"kiam\", controller_revision_hash=\"2838842168\", instance=\"kiam-agent-7kp8m\", job=\"cluster-kiam/kiam\", namespace=\"cluster-kiam\", pod_template_generation=\"1\", role=\"agent\"}"
10:00:52.587 promtail time="2018-12-13T10:00:52Z" level=info msg="newTarget{app=\"kiam\", controller_revision_hash=\"2838842168\", instance=\"kiam-agent-7kp8m\", job=\"cluster-kiam/kiam\", namespace=\"cluster-kiam\", pod_template_generation=\"1\", role=\"agent\"}"
10:00:52.681 promtail 2018/12/13 10:00:52 Seeked /var/log/pods/50127684-d604-11e8-b4be-0ab5bd2076ce/agent/0.log - &{Offset:5994548 Whence:0}
10:00:52.681 promtail time="2018-12-13T10:00:52Z" level=info msg="Tailing file: /var/log/pods/50127684-d604-11e8-b4be-0ab5bd2076ce/agent/0.log"
10:00:52.685 promtail 2018/12/13 10:00:52 Seeked /var/log/pods/0520614c-a703-11e8-b4be-0ab5bd2076ce/kiam/1.log - &{Offset:597 Whence:0}
10:00:52.685 promtail time="2018-12-13T10:00:52Z" level=info msg="Tailing file: /var/log/pods/0520614c-a703-11e8-b4be-0ab5bd2076ce/kiam/1.log"
10:00:52.685 promtail 2018/12/13 10:00:52 Seeked /var/log/pods/0520614c-a703-11e8-b4be-0ab5bd2076ce/kiam/2.log - &{Offset:2992790 Whence:0}
10:00:52.685 promtail time="2018-12-13T10:00:52Z" level=info msg="Tailing file: /var/log/pods/0520614c-a703-11e8-b4be-0ab5bd2076ce/kiam/2.log"
10:00:52.884 promtail level=info ts=2018-12-13T10:00:52.884543451Z caller=targetmanager.go:140 msg="Adding target" key="{app=\"flannel\", controller_revision_hash=\"3331141618\", instance=\"kube-flannel-ds-x4plh\", job=\"kube-system/flannel\", namespace=\"kube-system\", pod_template_generation=\"2\", role_kubernetes_io_networking=\"1\", tier=\"node\"}"
10:00:52.884 promtail time="2018-12-13T10:00:52Z" level=info msg="newTarget{app=\"flannel\", controller_revision_hash=\"3331141618\", instance=\"kube-flannel-ds-x4plh\", job=\"kube-system/flannel\", namespace=\"kube-system\", pod_template_generation=\"2\", role_kubernetes_io_networking=\"1\", tier=\"node\"}"
10:00:53.083 promtail level=info ts=2018-12-13T10:00:53.082819354Z caller=targetmanager.go:140 msg="Adding target" key="{app=\"scalyr-agent-2\", controller_revision_hash=\"154853095\", instance=\"scalyr-agent-2-8w5t4\", job=\"cluster-scalyr/scalyr-agent-2\", namespace=\"cluster-scalyr\", pod_template_generation=\"1\"}"
10:00:53.083 promtail time="2018-12-13T10:00:53Z" level=info msg="newTarget{app=\"scalyr-agent-2\", controller_revision_hash=\"154853095\", instance=\"scalyr-agent-2-8w5t4\", job=\"cluster-scalyr/scalyr-agent-2\", namespace=\"cluster-scalyr\", pod_template_generation=\"1\"}"
10:00:53.088 promtail level=info ts=2018-12-13T10:00:53.088129332Z caller=targetmanager.go:140 msg="Adding target" key="{instance=\"promtail-cm4fn\", job=\"cluster-loki/promtail\", namespace=\"cluster-loki\"}"
10:00:53.088 promtail time="2018-12-13T10:00:53Z" level=info msg="newTarget{instance=\"promtail-cm4fn\", job=\"cluster-loki/promtail\", namespace=\"cluster-loki\"}"
10:00:53.181 promtail 2018/12/13 10:00:53 Seeked /var/log/pods/7c5c1158-fa11-11e8-aa30-02194dce8c18/scalyr-agent/0.log - &{Offset:0 Whence:0}
10:00:53.181 promtail time="2018-12-13T10:00:53Z" level=info msg="Tailing file: /var/log/pods/7c5c1158-fa11-11e8-aa30-02194dce8c18/scalyr-agent/0.log"
10:00:53.284 promtail 2018/12/13 10:00:53 Seeked /var/log/pods/ef063e24-febd-11e8-aa30-02194dce8c18/promtail/0.log - &{Offset:0 Whence:0}
10:00:53.382 promtail time="2018-12-13T10:00:53Z" level=info msg="Tailing file: /var/log/pods/ef063e24-febd-11e8-aa30-02194dce8c18/promtail/0.log"
10:00:53.383 promtail 2018/12/13 10:00:53 Seeked /var/log/pods/0525c5e0-a703-11e8-b4be-0ab5bd2076ce/kube-flannel/0.log - &{Offset:4735 Whence:0}
10:00:53.383 promtail time="2018-12-13T10:00:53Z" level=info msg="Tailing file: /var/log/pods/0525c5e0-a703-11e8-b4be-0ab5bd2076ce/kube-flannel/0.log"
10:00:58.661 promtail time="2018-12-13T10:00:58Z" level=error msg="Error sending batch: Error doing write: 500 - 500 Internal Server Error"
```

@tomwilkie
Contributor

@BrianChristie we've fixed a bunch of errors on the backend now, yeah - you shouldn't see as many 500s. I don't think the process was terminating either, or at least I've not been able to reproduce this.

cyriltovena pushed a commit to cyriltovena/loki that referenced this issue Jun 11, 2021
* Squashed 'tools/' changes from b783528..1fe184f

1fe184f Bazel rules for building gogo protobufs (grafana#123)
b917bb8 Merge pull request grafana#122 from weaveworks/fix-scope-gc
c029ce0 Add regex to match scope VMs
0d4824b Merge pull request grafana#121 from weaveworks/provisioning-readme-terraform
5a82d64 Move terraform instructions to tf section
d285d78 Merge pull request grafana#120 from weaveworks/gocyclo-return-value
76b94a4 Do not spawn subshell when reading cyclo output
93b3c0d Use golang:1.9.2-stretch image
d40728f Gocyclo should return error code if issues detected
c4ac1c3 Merge pull request grafana#114 from weaveworks/tune-spell-check
8980656 Only check files
12ebc73 Don't spell-check pki files
578904a Special-case spell-check the same way we do code checks
e772ed5 Special-case on mime type and extension using just patterns
ae82b50 Merge pull request grafana#117 from weaveworks/test-verbose
8943473 Propagate verbose flag to 'go test'.
7c79b43 Merge pull request grafana#113 from weaveworks/update-shfmt-instructions
258ef01 Merge pull request grafana#115 from weaveworks/extra-linting
e690202 Use tools in built image to lint itself
126eb56 Add shellcheck to bring linting in line with scope
63ad68f Don't run lint on files under .git
51d908a Update shfmt instructions
e91cb0d Merge pull request grafana#112 from weaveworks/add-python-lint-tools
0c87554 Add yapf and flake8 to golang build image
35679ee Merge pull request grafana#110 from weaveworks/parallel-push-errors
3ae41b6 Remove unneeded if block
51ff31a Exit on first error
0faad9f Check for errors when pushing images in parallel
74dc626 Merge pull request grafana#108 from weaveworks/disable-apt-daily
b4f1d91 Merge pull request grafana#107 from weaveworks/docker-17-update
7436aa1 Override apt daily job to not run immediately on boot
7980f15 Merge pull request grafana#106 from weaveworks/document-docker-install-role
f741e53 Bump to Docker 17.06 from CE repo
61796a1 Update Docker CE Debian repo details
0d86f5e Allow for Docker package to be named docker-ce
065c68d Document selection of Docker installation role.
3809053 Just --porcelain; it defaults to v1
11400ea Merge pull request grafana#105 from weaveworks/remove-weaveplugin-remnants
b8b4d64 remove weaveplugin remnants
35099c9 Merge pull request grafana#104 from weaveworks/pull-docker-py
cdd48fc Pull docker-py to speed tests/builds up.
e1c6c24 Merge pull request grafana#103 from weaveworks/test-build-tags
d5d71e0 Add -tags option so callers can pass in build tags
8949b2b Merge pull request grafana#98 from weaveworks/git-status-tag
ac30687 Merge pull request grafana#100 from weaveworks/python_linting
4b125b5 Pin yapf & flake8 versions
7efb485 Lint python linting function
444755b Swap diff direction to reflect changes required
c5b2434 Install flake8 & yapf
5600eac Lint python in build-tools repo
0b02ca9 Add python linting
c011c0d Merge pull request grafana#79 from kinvolk/schu/python-shebang
6577d07 Merge pull request grafana#99 from weaveworks/shfmt-version
00ce0dc Use git status instead of diff to add 'WIP' tag
411fd13 Use shfmt v1.3.0 instead of latest from master.
0d6d4da Run shfmt 1.3 on the code.
5cdba32 Add sudo
c322ca8 circle.yml: Install shfmt binary.
e59c225 Install shfmt 1.3 binary.
30706e6 Install pyhcl in the build container.
960d222 Merge pull request grafana#97 from kinvolk/alban/update-shfmt-3
1d535c7 shellcheck: fix escaping issue
5542498 Merge pull request grafana#96 from kinvolk/alban/update-shfmt-2
32f7cc5 shfmt: fix coding style
09f72af lint: print the diff in case of error
571c7d7 Merge pull request grafana#95 from kinvolk/alban/update-shfmt
bead6ed Update for latest shfmt
b08dc4d Update for latest shfmt (grafana#94)
2ed8aaa Add no-race argument to test script (grafana#92)
80dd78e Merge pull request grafana#91 from weaveworks/upgrade-go-1.8.1
08dcd0d Please ./lint as shfmt changed its rules between 1.0.0 and 1.3.0.
a8bc9ab Upgrade default Go version to 1.8.1.
41c5622 Merge pull request grafana#90 from weaveworks/build-golang-service-conf
e8ebdd5 broaden imagetag regex to fix haskell build image
ba3fbfa Merge pull request grafana#89 from weaveworks/build-golang-service-conf
e506f1b Fix up test script for updated shfmt
9216db8 Add stuff for service-conf build to build-goland image
66a9a93 Merge pull request grafana#88 from weaveworks/haskell-image
cb3e3a2 shfmt
74a5239 Haskell build image
4ccd42b Trying circle quay login
b2c295f Merge branch 'common-build'
0ac746f Trim quay prefix in circle script
c405b31 Merge pull request grafana#87 from weaveworks/common-build
9672d7c Push build images to quay as they have sane robot accounts
a2bf112 Review feedback
fef9b7d Add protobuf tools
10a77ea Update readme
254f266 Don't need the image name in
ffb59fc Adding a weaveworks/build-golang image with tags
b817368 Update min Weave Net docker version
cf87ca3 Merge pull request grafana#86 from weaveworks/lock-kubeadm-version
3ae6919 Add example of custom SSH private key to tf_ssh's usage.
cf8bd8a Add example of custom SSH private key to tf_ansi's usage.
c7d3370 Lock kubeadm's Kubernetes version.
faaaa6f Merge pull request grafana#84 from weaveworks/centos-rhel
ef552e7 Select weave-kube YAML URL based on K8S version.
b4c1198 Upgrade default kubernetes_version to 1.6.1.
b82805e Use a fixed version of kubeadm.
f33888b Factorise and make kubeconfig option optional.
f7b8b89 Install EPEL repo for CentOS.
615917a Fix error in decrypting AWS access key and secret.
86f97b4 Add CentOS 7 AMI and username for AWS via Terraform.
eafd810 Add tf_ansi example with Ansible variables.
2b05787 Skip setup of Docker over TCP for CentOS/RHEL.
84c420b Add docker-ce role for CentOS/RHEL.
00a820c Add setup_weave-net_debug.yml playbook for user issues' debugging.
3eae480 Upgrade default kubernetes_version to 1.5.4.
753921c Allow injection of Docker installation role.
e1ff90d Fix kubectl taint command for 1.5.
b989e97 Fix typo in kubectl taint for single node K8S cluster.
541f58d Remove 'install_recommends: no' for ethtool.
c3f9711 Make Ansible role docker-from-get.docker.com work on RHEL/CentOS.
038c0ae Add frequently used OS images, for convenience.
d30649f Add --insecure-registry to docker.conf
1dd9218 shfmt -i 4 -w push-images
6de96ac Add option to not push docker hub images
310f53d Add push-images script from cortex
8641381 Add port 6443 to kubeadm join commands for K8S 1.6+.
50bf0bc Force type of K8S token to string.
08ab1c0 Remove trailing whitespaces.
ae9efb8 Enable testing against K8S release candidates.
9e32194 Secure GCP servers for Scope: open port 80.
a22536a Secure GCP servers for Scope.
89c3a29 Merge pull request grafana#78 from weaveworks/lint-merge-rebase-issue-in-docs
73ad56d Add linter function to avoid bad merge/rebase artefact
31d069d Change Python shebang to `#!/usr/bin/env python`
52d695c Merge pull request grafana#77 from kinvolk/schu/fix-relative-weave-path
77aed01 Merge pull request grafana#73 from weaveworks/mike/sched/fix-unicode-issue
7c080f4 integration/sanity_check: disable SC1090
d6d360a integration/gce.sh: update gcloud command
e8def2c provisioning/setup: fix shellcheck SC2140
cc02224 integration/config: fix weave path
9c0d6a5 Fix config_management/README.md
334708c Merge pull request grafana#75 from kinvolk/alban/external-build-1
da2505d gce.sh: template: print creation date
e676854 integration tests: fix user account
8530836 host nameing: add repo name
b556c0a gce.sh: fix deletion of gce instances
2ecd1c2 integration: fix GCE --zones/--zone parameter
3e863df sched: Fix unicode encoding issues
51785b5 Use rm -f and set current dir using BASH_SOURCE.
f5c6d68 Merge pull request grafana#71 from kinvolk/schu/fix-linter-warnings
0269628 Document requirement for `lint_sh`
9a3f09e Fix linter warnings
efcf9d2 Merge pull request grafana#53 from weaveworks/2647-testing-mvp
d31ea57 Weave Kube playbook now works with multiple nodes.
27868dd Add GCP firewall rule for FastDP crypto.
edc8bb3 Differentiated name of dev and test playbooks, to avoid confusion.
efa3df7 Moved utility Ansible Yaml to library directory.
fcd2769 Add shorthands to run Ansible playbooks against Terraform-provisioned virtual machines.
f7946fb Add shorthands to SSH into Terraform-provisioned virtual machines.
aad5c6f Mention Terraform and Ansible in README.md.
dddabf0 Add Terraform output required for templates' creation.
dcc7d02 Add Ansible configuration playbooks for development environments.
f86481c Add Ansible configuration playbooks for Docker, K8S and Weave-Net.
efedd25 Git-ignore Ansible retry files.
765c4ca Add helper functions to setup Terraform programmatically.
801dd1d Add Terraform cloud provisioning scripts.
b8017e1 Install hclfmt on CircleCI.
4815e19 Git-ignore Terraform state files.
0aaebc7 Add script to generate cartesian product of dependencies of cross-version testing.
007d90a Add script to list OS images from GCP, AWS and DO.
ca65cc0 Add script to list relevant versions of Go, Docker and Kubernetes.
aa66f44 Scripts now source dependencies using absolute path (previously breaking make depending on current directory).
7865e86 Add -p option to parallelise lint.
36c1835 Merge pull request grafana#69 from weaveworks/mflag
9857568 Use mflag and mflagext package from weaveworks/common.
9799112 Quote bash variable.
10a36b3 Merge pull request grafana#67 from weaveworks/shfmt-ignore
a59884f Add support for .lintignore.
03cc598 Don't lint generated protobuf code.
2b55c2d Merge pull request grafana#66 from weaveworks/reduce-test-timeout
d4e163c Make timeout a flag
49a8609 Reduce test timeout
8fa15cb Merge pull request grafana#63 from weaveworks/test-defaults

git-subtree-dir: tools
git-subtree-split: 1fe184f1f5330c4444c4377bef84f2d30e7dc7fe

* Use keyed fields in composite literal

* Squashed 'tools/' changes from 1fe184f..ccc8316

ccc8316 Revert "Gocyclo should return error code if issues detected" (grafana#124)

git-subtree-dir: tools
git-subtree-split: ccc831682b5d51e068b17fe9ad482f025abd1fbb
periklis added a commit to periklis/loki that referenced this issue Dec 6, 2021
periklis added a commit to periklis/loki that referenced this issue Dec 2, 2022
[release-5.6] Update from upstream/main repository
chaudum added the type/bug (Something is not working as expected) label Jun 14, 2023