doc: promtail known failure modes #924
Conversation
docs/promtail-failure-modes.md
Outdated
- `/app.log` size is >= the position before truncating

If the `/app.log` file size is less than the previous position, the file is detected as truncated and logs will be tailed starting from position `0`. Otherwise, if the `/app.log` file size is >= the previous position, `promtail` can't detect that it was truncated while not running and will continue tailing the file from position `100`.
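For illustration, here is a minimal Go sketch of the heuristic described above. It is not promtail's actual code, and the path and saved position are made up for the example:

```go
package main

import (
	"fmt"
	"os"
)

// resumeOffset sketches the size-vs-position heuristic: if the file is now
// smaller than the saved position, assume it was truncated and restart from 0;
// otherwise resume from the saved position. Note that this check alone cannot
// tell a truncated-and-regrown (or rolled) file apart from the original one.
func resumeOffset(path string, savedPosition int64) (int64, error) {
	info, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	if info.Size() < savedPosition {
		// File shrank while we were not watching: treat it as truncated.
		return 0, nil
	}
	// File is at least as large as before: keep tailing from the saved offset.
	return savedPosition, nil
}

func main() {
	// Hypothetical path and saved position, matching the example above.
	offset, err := resumeOffset("/app.log", 100)
	if err != nil {
		fmt.Println("stat failed:", err)
		return
	}
	fmt.Println("resuming tail at offset", offset)
}
```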
Might be worth noting that if the log file rolled multiple times while promtail wasn't running and the size is greater than the position from the positions file, it will start at the position from the positions file and not the beginning. Or, put another way, promtail does not do anything fancy like track a hash of the log file to know whether it's actually continuing from the same file or not.
Somewhere I feel like we need to document the advantages of using larger log files with regard to promtail and decreasing the odds of lost log lines. The less frequently log files are rolled, the better success promtail has when things go wrong... Not sure where this might go, but maybe it fits on this page?
docs/promtail-failure-modes.md
Outdated
When `promtail` shuts down gracefully, it saves the last read offsets in the positions file, so that on a subsequent restart it will continue tailing logs without duplicates or losses.

In the unlikely event of a crash, `promtail` can't save the last read offsets in the positions file. When restarted, `promtail` will read the positions file saved at the last sync period and will continue tailing the files from there. This means that if new log entries have been read and pushed to the ingester between the last sync period and the crash, these log entries will be sent again to the ingester on `promtail` restart.
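To illustrate the behaviour described above, here is a small Go sketch, not promtail's actual implementation; the interval value is only a stand-in for the configurable sync period:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Illustrative sketch: read offsets are tracked in memory and flushed to the
// positions file only once per sync period. A crash loses any in-memory
// progress made since the last flush, so those log lines are read and pushed
// again after a restart.

type positions struct {
	mu      sync.Mutex
	offsets map[string]int64
}

func (p *positions) advance(path string, n int64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.offsets[path] += n
}

func (p *positions) sync() {
	p.mu.Lock()
	defer p.mu.Unlock()
	// In promtail this would write the offsets to the positions file on disk;
	// here we just print them.
	fmt.Println("syncing positions:", p.offsets)
}

func main() {
	p := &positions{offsets: map[string]int64{}}
	syncPeriod := 100 * time.Millisecond // stand-in for the real sync period

	go func() {
		for range time.Tick(syncPeriod) {
			p.sync()
		}
	}()

	// Simulated tailing: offsets advance between syncs. A crash at any point
	// here means everything read since the last sync() is re-sent on restart.
	for i := 0; i < 5; i++ {
		p.advance("/app.log", 10)
		time.Sleep(40 * time.Millisecond)
	}
}
```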
haha, I'm not sure how unlikely this is :) but I appreciate your optimism... Though yes, crashing is hopefully unlikely, but OOMs in a Kubernetes-type environment could happen.
Maybe also note that resent logs would be ignored by Loki, as Loki currently rejects logs with older timestamps than it has already received? So you don't need to crank down the sync_period; sending some duplicates is ok.
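As an illustration of that out-of-order rejection, here is a simplified Go sketch, not Loki's actual implementation: an ingester that tracks the last accepted timestamp per stream and refuses anything at or before it would silently drop resent duplicates.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Simplified per-stream out-of-order rejection: entries with a timestamp at or
// before the last accepted one are refused, which is why re-pushing
// already-sent lines after a promtail crash is effectively harmless.

var errOutOfOrder = errors.New("entry out of order")

type stream struct {
	lastTimestamp time.Time
}

func (s *stream) push(ts time.Time, line string) error {
	if !ts.After(s.lastTimestamp) {
		return errOutOfOrder
	}
	s.lastTimestamp = ts
	fmt.Println("accepted:", line)
	return nil
}

func main() {
	s := &stream{}
	t0 := time.Now()

	s.push(t0, "line 1")                  // accepted
	s.push(t0.Add(time.Second), "line 2") // accepted
	err := s.push(t0, "line 1")           // resent duplicate: rejected
	fmt.Println("duplicate push:", err)
}
```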
> Maybe also note that resent logs would be ignored by Loki, as Loki currently rejects logs with older timestamps than it has already received? So you don't need to crank down the sync_period; sending some duplicates is ok.
You're definitely right. I will revisit that paragraph accordingly.
Force-pushed from 120ef85 to d282ce1
Thanks @slim-bean for taking the time to read it. I've tried to address your comments. Could you re-review it, please?
Force-pushed from d282ce1 to 0cba30f
Thanks again @slim-bean for reviewing it. I have addressed your last comment.
Thanks @pracucci! LGTM!
What this PR does / why we need it:
From the user perspective, I believe it's important to address known failure modes in the documentation, in order to set expectations and point the user in the right direction when it comes to configuration settings.
In this PR I'm suggesting to start documenting `promtail` known failure modes.