-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix newlogd crash due to VM's console output #3099
Conversation
pkg/newlog/cmd/newlogd.go
Outdated
@@ -1578,7 +1578,8 @@ func getDevTop10Inputs() { | |||
func getPtypeTimestamp(timeStr string) *timestamp.Timestamp { | |||
t, err := time.Parse(time.RFC3339, timeStr) | |||
if err != nil { | |||
log.Fatal(err) | |||
log.Warn(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a good idea to produce a log from the logger. This can lead to infinite loop if (for some reason) you again fail to parse something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. But in this case, getPtypeTimestamp() is called only at one point in the code (here). The only situation that would lead to a infinite loop as you described is that if log.Warn() generates wrong timestamps, which should not happen. However, silently change the timestamp can lead to the following situation:
- Months later nobody will remember or care about these changes
- Then somebody needs to debug a very annoying bug
- Let's say some logs have bad timestamps (for any reason): log messages will have timestamps with no warning at all; and these timestamps will not correspond to the reality (the exact time which events happened), it might confuse who is debugging the system
- It might be very difficult to realize timestamps were fixed by newlogd
Although I think is too much kill newlogd only because of wrong timestamps, I think is still the best option if you really won't use log.Warn() here.
Since I removed the parsing of VM console messages, we shouldn't have this timestamp problem anymore, unless some container or other entity that we care about it starts to send wrong timestamps. So maybe it might make sense let log.Fatal(). In a nutshell, we are not falling in this code anymore (at least in all my tests)
@eriknordmark, @rouming, thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd agree with @rouming on this one. Later we will forget about this and it can shoot us in the leg pretty bad. At this point you are relying on a 3rd party library to provide you correct time formatting. From two evils of infinite log generation and unknown missing logs I'd choose former :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So maybe it might make sense let log.Fatal()
Personally I don't like an idea of failing the whole system if you can't parse something. If you can't open a file - yes, crash it; if you can't create a directory for your logs - yes, but if parsing failed - no. So I would return nil and let the caller decide what to do: replace time on now() or something else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are right @rouming . I've changed my PR with a new approach:
- Do not use log.Warn() so we avoid any chance of recursive calls.
- Return the error to the caller instead of fail, so the caller can decide what to do.
- If the timestamp is invalid, we cannot trust in the time information, so return default (initial) Unix time instead of the current time. This can give a good indication to who is analyzing the logs that the timestamp is invalid
I've also rebased with master.
Formatted log messages (coming from memlogd, for instance) are suppose to have a timestamp in the RFC 3339 standard. In this format, dates can be represented through different ways. For example, the strings below represent the same date: 2020-12-31T21:07:14-05:00 2021-01-01T02:07:14Z The Z character in the second one means UTC time. In getMemlogMsg(), if ParseLogInfo() was not able to get the timestamp, the one used in the original message will be used. However, it assumes only the second format (with suffix Z). This commit removes this assumption and let the timestamp to be parsed in getPtypeTimeStamp(). If it fails, it considers Unix default (initial) time as the timestamp. Signed-off-by: Renê de Souza Pinto <rene@renesp.com.br>
The output console from an application VM can be logged, i.e., have contents sent from memlogd to newlogd container. In this case, it should not be parsed as a JSON log message. This commit removes the calling of ParseLogInfo() for any log message coming from an application and treat it as a normal log message. Also, it searches for "guest_vm" only in the field source of the log message. Otherwise, any JSON log message containing the string "guest_vm" would be considered as an application's log message. Signed-off-by: Renê de Souza Pinto <rene@renesp.com.br>
This commit makes a small improvement in newlogd.go: - Agregate two ifs into one: getAppStatsMap() must be called only if appuuid is not null - Remove redundant attribution: getAppStatsMap() already attributes the found item to the map ("appStatsMap[appuuid] = appM") inside the function Signed-off-by: Renê de Souza Pinto <rene@renesp.com.br>
var jsonOK bool | ||
var logInfo agentlog.Loginfo | ||
if strings.Contains(sourceName, "guest_vm") { | ||
isApp = true | ||
logInfo.Source = sourceName |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume you crash here, because 'logInfo' is nil (it is created in the else branch)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no @rouming, now I'm defining logInfo at line 575 (above the if), so there is no problem here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a pointer... sorry. brain fart.
logmetrics.AppMetrics.NumInputEvent++ | ||
var jsonOK bool | ||
var logInfo agentlog.Loginfo | ||
if strings.Contains(sourceName, "guest_vm") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would not it be better to check the sourceName
against the contains('guest_vm')
and not the whole log string? Because that what we are interested in after all: to filter out the 'guest_vm' source, which is exactly the the meaning of the parsed field 'sourceName'. Because if there is a log message which contains the 'guest_vm' string (regardless of the real source of the log message), it will be accounted as a vm source.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but this is exactly what I'm doing, and that is exactly what is explained in the commit message 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above. sorry for the noise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
@rene This PR broke eden test @rouming Please do not merge PR if all 4 eden integration test runs are failing. The newly introduced issue then just propagates to someone else's PR and it gets harder to find the culprit. |
@milan-zededa oops, my fault. I still don't understand when those are failing for real or they fail just accidentally. Next time will be more careful. Thanks for debugging. |
Thanks @milan-zededa for pointed this out. The problem was here: 3712cb0#diff-cc9fc15a79bf38bee943c9d5953fb00741b86a3de2e3041f4b34aaee549a5bbdR788 the attribution in this line was not redundant because we are not using pointers in the map (my bad, sorry). This commit was just a small improvement so his reversion doesn't cause any issue and it fixes Eden tests. |
This PR fixes the issue: #3090 (see issue for context and discussion)
pkg: newlog: Do not crash on a bad timestamp
pkg: newlog: Do not parse application's console as JSON log messages
pkg: newlog: make a small improvement