Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix newlogd crash due to VM's console output #3099

Merged
merged 3 commits into from
Mar 17, 2023
Merged

Conversation

rene
Copy link
Contributor

@rene rene commented Mar 14, 2023

This PR fixes the issue: #3090 (see issue for context and discussion)

pkg: newlog: Do not crash on a bad timestamp

Formatted log messages (coming from memlogd, for instance) are suppose to
have a timestamp in the RFC 3339 standard. In this format, dates can be
represented through different ways. For example, the strings below
represent the same date:

2020-12-31T21:07:14-05:00
2021-01-01T02:07:14Z

The Z character in the second one means UTC time. In getMemlogMsg(), if
ParseLogInfo() was not able to get the timestamp, the one used in the
original message will be used. However, it assumes only the second format
(with suffix Z). This commit removes this assumption and let the timestamp
to be parsed in getPtypeTimeStamp(). If it fails, it considers Unix default
(initial) time as the timestamp.

pkg: newlog: Do not parse application's console as JSON log messages

The output console from an application VM can be logged, i.e., have
contents sent from memlogd to newlogd container. In this case, it should
not be parsed as a JSON log message.

This commit removes the calling of ParseLogInfo() for any log message
coming from an application and treat it as a normal log message. Also, it
searches for "guest_vm" only in the field source of the log message.
Otherwise, any JSON log message containing the string "guest_vm" would be
considered as an application's log message.

pkg: newlog: make a small improvement

This commit makes a small improvement in newlogd.go:

- Agregate two ifs into one: getAppStatsMap() must be called only if
  appuuid is not null
- Remove redundant attribution: getAppStatsMap() already attributes the
  found item to the map ("appStatsMap[appuuid] = appM") inside the
  function

@rene rene requested review from eriknordmark and rvs as code owners March 14, 2023 12:56
@rene rene requested a review from rouming March 14, 2023 12:56
@@ -1578,7 +1578,8 @@ func getDevTop10Inputs() {
func getPtypeTimestamp(timeStr string) *timestamp.Timestamp {
t, err := time.Parse(time.RFC3339, timeStr)
if err != nil {
log.Fatal(err)
log.Warn(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a good idea to produce a log from the logger. This can lead to infinite loop if (for some reason) you again fail to parse something.

Copy link
Contributor Author

@rene rene Mar 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. But in this case, getPtypeTimestamp() is called only at one point in the code (here). The only situation that would lead to a infinite loop as you described is that if log.Warn() generates wrong timestamps, which should not happen. However, silently change the timestamp can lead to the following situation:

  1. Months later nobody will remember or care about these changes
  2. Then somebody needs to debug a very annoying bug
  3. Let's say some logs have bad timestamps (for any reason): log messages will have timestamps with no warning at all; and these timestamps will not correspond to the reality (the exact time which events happened), it might confuse who is debugging the system
  4. It might be very difficult to realize timestamps were fixed by newlogd

Although I think is too much kill newlogd only because of wrong timestamps, I think is still the best option if you really won't use log.Warn() here.

⚠️ One important observation ⚠️
Since I removed the parsing of VM console messages, we shouldn't have this timestamp problem anymore, unless some container or other entity that we care about it starts to send wrong timestamps. So maybe it might make sense let log.Fatal(). In a nutshell, we are not falling in this code anymore (at least in all my tests)

@eriknordmark, @rouming, thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd agree with @rouming on this one. Later we will forget about this and it can shoot us in the leg pretty bad. At this point you are relying on a 3rd party library to provide you correct time formatting. From two evils of infinite log generation and unknown missing logs I'd choose former :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So maybe it might make sense let log.Fatal()

Personally I don't like an idea of failing the whole system if you can't parse something. If you can't open a file - yes, crash it; if you can't create a directory for your logs - yes, but if parsing failed - no. So I would return nil and let the caller decide what to do: replace time on now() or something else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are right @rouming . I've changed my PR with a new approach:

  • Do not use log.Warn() so we avoid any chance of recursive calls.
  • Return the error to the caller instead of fail, so the caller can decide what to do.
  • If the timestamp is invalid, we cannot trust in the time information, so return default (initial) Unix time instead of the current time. This can give a good indication to who is analyzing the logs that the timestamp is invalid

I've also rebased with master.

rene added 3 commits March 15, 2023 11:35
Formatted log messages (coming from memlogd, for instance) are suppose to
have a timestamp in the RFC 3339 standard. In this format, dates can be
represented through different ways. For example, the strings below
represent the same date:

2020-12-31T21:07:14-05:00
2021-01-01T02:07:14Z

The Z character in the second one means UTC time. In getMemlogMsg(), if
ParseLogInfo() was not able to get the timestamp, the one used in the
original message will be used. However, it assumes only the second format
(with suffix Z). This commit removes this assumption and let the timestamp
to be parsed in getPtypeTimeStamp(). If it fails, it considers Unix default
(initial) time as the timestamp.

Signed-off-by: Renê de Souza Pinto <rene@renesp.com.br>
The output console from an application VM can be logged, i.e., have
contents sent from memlogd to newlogd container. In this case, it should
not be parsed as a JSON log message.

This commit removes the calling of ParseLogInfo() for any log message
coming from an application and treat it as a normal log message. Also, it
searches for "guest_vm" only in the field source of the log message.
Otherwise, any JSON log message containing the string "guest_vm" would be
considered as an application's log message.

Signed-off-by: Renê de Souza Pinto <rene@renesp.com.br>
This commit makes a small improvement in newlogd.go:

- Agregate two ifs into one: getAppStatsMap() must be called only if
  appuuid is not null
- Remove redundant attribution: getAppStatsMap() already attributes the
  found item to the map ("appStatsMap[appuuid] = appM") inside the
  function

Signed-off-by: Renê de Souza Pinto <rene@renesp.com.br>
var jsonOK bool
var logInfo agentlog.Loginfo
if strings.Contains(sourceName, "guest_vm") {
isApp = true
logInfo.Source = sourceName
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you crash here, because 'logInfo' is nil (it is created in the else branch)

Copy link
Contributor Author

@rene rene Mar 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no @rouming, now I'm defining logInfo at line 575 (above the if), so there is no problem here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a pointer... sorry. brain fart.

logmetrics.AppMetrics.NumInputEvent++
var jsonOK bool
var logInfo agentlog.Loginfo
if strings.Contains(sourceName, "guest_vm") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would not it be better to check the sourceName against the contains('guest_vm') and not the whole log string? Because that what we are interested in after all: to filter out the 'guest_vm' source, which is exactly the the meaning of the parsed field 'sourceName'. Because if there is a log message which contains the 'guest_vm' string (regardless of the real source of the log message), it will be accounted as a vm source.

Copy link
Contributor Author

@rene rene Mar 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but this is exactly what I'm doing, and that is exactly what is explained in the commit message 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above. sorry for the noise.

Copy link
Contributor

@rouming rouming left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@milan-zededa
Copy link
Contributor

@rene This PR broke eden test app_logs - see for example here. Test is fixed if I revert this commit of yours. Could you please have a look?

@rouming Please do not merge PR if all 4 eden integration test runs are failing. The newly introduced issue then just propagates to someone else's PR and it gets harder to find the culprit.

@rouming
Copy link
Contributor

rouming commented Mar 21, 2023

@milan-zededa oops, my fault. I still don't understand when those are failing for real or they fail just accidentally. Next time will be more careful. Thanks for debugging.

@rene
Copy link
Contributor Author

rene commented Mar 22, 2023

Thanks @milan-zededa for pointed this out. The problem was here: 3712cb0#diff-cc9fc15a79bf38bee943c9d5953fb00741b86a3de2e3041f4b34aaee549a5bbdR788 the attribution in this line was not redundant because we are not using pointers in the map (my bad, sorry). This commit was just a small improvement so his reversion doesn't cause any issue and it fixes Eden tests.
FYI: @rouming , @eriknordmark

@rene rene deleted the fix-newlogd branch May 24, 2023 14:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants