Filebeat's journald input might return an empty message that breaks multiline parser #41331

Open · Tracked by #37086
belimawr opened this issue Oct 18, 2024 · 3 comments
Labels: bug, Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team)

belimawr (Contributor) commented:
When using the journald input to read from a file, there is a race condition that happens when journalctl exits. Whenever the journald input starts and there is a chance of there being messages from previous boots (depending on the filtering arguments), we first start journalctl without the --follow flag, read all messages until it exits successfully, then restart it with the --follow flag.
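
For illustration, here is a minimal, hypothetical sketch of that two-phase startup. The function name runJournalctl and the exact flags are assumptions made for this example, not Filebeat's actual implementation:

package main

import (
	"bufio"
	"fmt"
	"os/exec"
)

// runJournalctl starts journalctl with the given arguments and prints every
// line it emits until the process exits or its stdout pipe is closed.
func runJournalctl(args ...string) error {
	cmd := exec.Command("journalctl", args...)
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return err
	}
	if err := cmd.Start(); err != nil {
		return err
	}

	reader := bufio.NewReader(stdout)
	for {
		line, err := reader.ReadBytes('\n')
		if len(line) > 0 {
			fmt.Print(string(line))
		}
		if err != nil {
			break // io.EOF once journalctl exits, or the race described below
		}
	}
	return cmd.Wait()
}

func main() {
	// First run: no --follow, journalctl prints the existing entries and exits.
	_ = runJournalctl("--output=json")
	// Second run: --follow, journalctl keeps streaming new entries.
	_ = runJournalctl("--output=json", "--follow")
}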

While reading from this first execution, a race condition can cause the read from journalctl's stdout to return the following error:

read |0: file already closed

That's an fs.PathError

	err := fs.PathError{
		Op:   "read",
		Path: "|0",
		Err:  errors.New("file already closed"),
	}

This happens here:

data, err := reader.ReadBytes('\n')

When this happens, we just restart journalctl and keep reading; no data is lost.
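
A minimal sketch of how such an error can be recognised, assuming the wrapped error is os.ErrClosed (which is how an *os.File, such as the stdout pipe, reports a read after Close); the helper name isClosedPipeError is made up for this example:

package sketch

import (
	"errors"
	"io/fs"
	"os"
)

// isClosedPipeError reports whether err is the "read |0: file already closed"
// *fs.PathError described above. Sketch only, not Filebeat's actual check.
func isClosedPipeError(err error) bool {
	var pathErr *fs.PathError
	return errors.As(err, &pathErr) && errors.Is(err, os.ErrClosed)
}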

However, this causes problems when the multiline parser is used. The error makes the reader return a recoverable error together with an empty message; the multiline parser cannot handle this situation and returns early, leading to an incorrect aggregation.
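
One possible guard against this is sketched below. It is only an illustration of the idea (drop the empty message and retry instead of handing it to the multiline parser); all names here are hypothetical, and the real fix may belong in the reader or the parser instead:

package sketch

import "errors"

// errRestartable marks read errors after which journalctl is simply
// restarted, such as the closed-pipe race above. Hypothetical name.
var errRestartable = errors.New("restartable read error")

// nextLine is a hypothetical wrapper around the raw reader: instead of
// returning an empty message together with a recoverable error (which the
// multiline parser would turn into a premature flush), it retries the read
// after the caller has restarted journalctl.
func nextLine(read func() ([]byte, error), restart func()) ([]byte, error) {
	for {
		line, err := read()
		if errors.Is(err, errRestartable) && len(line) == 0 {
			restart() // bring journalctl back and try again
			continue
		}
		return line, err
	}
}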

This can be seen by running the following test from filebeat/input/journald multiple times until it fails (usually the easiest way is to run the test with the -count flag: go test -count=100 -run=TestInputParsers).

// TestInputParsers ensures journald input support parsers,
// it only tests a single parser, but that is enough to ensure
// we're correctly using the parsers
func TestInputParsers(t *testing.T) {
	inputParsersExpected := []string{"1st line\n2nd line\n3rd line", "4th line\n5th line\n6th line"}
	env := newInputTestingEnvironment(t)

	inp := env.mustCreateInput(mapstr.M{
		"paths":                 []string{path.Join("testdata", "input-multiline-parser.journal")},
		"include_matches.match": []string{"_SYSTEMD_USER_UNIT=log-service.service"},
		"parsers": []mapstr.M{
			{
				"multiline": mapstr.M{
					"type":        "count",
					"count_lines": 3,
				},
			},
		},
	})

	ctx, cancelInput := context.WithCancel(context.Background())
	env.startInput(ctx, inp)
	env.waitUntilEventCount(len(inputParsersExpected))

	for idx, event := range env.pipeline.clients[0].GetEvents() {
		if got, expected := event.Fields["message"], inputParsersExpected[idx]; got != expected {
			t.Errorf("expecting event message %q, got %q", expected, got)
		}
	}

	cancelInput()
}

A similar early return might happen if multiline is used and journalctl exits unexpectedly. This case still needs confirmation.

belimawr added the bug and Team:Elastic-Agent-Data-Plane labels on Oct 18, 2024
elasticmachine (Collaborator) commented:

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

cmacknz (Member) commented on Oct 21, 2024:

What are the user visible symptoms of this?

> When this happens we just restart journalctl and keep reading, no data is lost.

This is good; it sounds like it shouldn't block GA of the input, but if this has a symptom that would lead to support cases I would want it to be fixed before GA regardless.

belimawr (Contributor, Author) commented:

> What are the user visible symptoms of this?

If the multiline parser is used, messages are sent before all the conditions are met. In the example we use the count method, which means every 3 lines of the input should generate a single event. In that case, we get one event that contains a single line from the input.

So far I have only seen that happening when reading from journal files.

One case that has just crossed my mind:
When #41244 is merged, we will have journalctl being restarted at least once during normal operation. That could cause the same observable behaviour. At the moment the only issue I can think of is the multiline parser. I need to test it to be 100% sure that's the case.

> but if this has a symptom that would lead to support cases I would want it to be fixed before GA regardless.

I believe it could generate support cases if people are using journald + multiline. I added a note to #41244 to at least check whether this happens or not.
