[Elastic-Agent] Elastic Agent index "logs-elastic.agent-default" is sometimes not present #21310

Closed
mdelapenya opened this issue Sep 24, 2020 · 6 comments
Labels: bug, failed-test, v7.11.0

Comments

mdelapenya (Contributor) commented Sep 24, 2020

This report comes from executing the E2E tests in https://github.com/elastic/e2e-testing

We have a test scenario that queries the Elastic Agent's log index in ES (logs-elastic.agent-default), and it randomly fails because the index is not found.

The test scenario (BDD)

@deploy-stand-alone
Scenario Outline: Deploying a <image> stand-alone agent
  When a "<image>" stand-alone agent is deployed
  Then there is new data in the index from agent
Examples:
| image   |
| default |
| ubi8    |

This scenario starts a Docker Compose file with Elasticsearch, Kibana and the Package Registry, and adds a container with the Elastic Agent image to the compose file (When). This is done for both the default and the UBI8 flavours, so the tests are exactly the same except for which Docker image is used to spin up the service. Finally, it queries ES using the official Go client (Then) with the following query (in Go), which we see retrieving documents in a non-deterministic manner:

timezone := "America/New_York"
hostname := "" // is received by the function calling this block
var startDate time.Time // is received by the function calling this block

esQuery := map[string]interface{}{
	"version": true,
	"size":    500,
	"docvalue_fields": []map[string]interface{}{
		{
			"field":  "@timestamp",
			"format": "date_time",
		},
		{
			"field":  "system.process.cpu.start_time",
			"format": "date_time",
		},
		{
			"field":  "system.service.state_since",
			"format": "date_time",
		},
	},
	"_source": map[string]interface{}{
		"excludes": []map[string]interface{}{},
	},
	"query": map[string]interface{}{
		"bool": map[string]interface{}{
			"must": []map[string]interface{}{},
			"filter": []map[string]interface{}{
				{
					"bool": map[string]interface{}{
						"filter": []map[string]interface{}{
							{
								"bool": map[string]interface{}{
									"should": []map[string]interface{}{
										{
											"match_phrase": map[string]interface{}{
												"host.name": hostname,
											},
										},
									},
									"minimum_should_match": 1,
								},
							},
							{
								"bool": map[string]interface{}{
									"should": []map[string]interface{}{
										{
											"range": map[string]interface{}{
												"@timestamp": map[string]interface{}{
													"gte":       startDate,
													"time_zone": timezone,
												},
											},
										},
									},
									"minimum_should_match": 1,
								},
							},
						},
					},
				},
				{
					"range": map[string]interface{}{
						"@timestamp": map[string]interface{}{
							"gte":    startDate,
							"format": "strict_date_optional_time",
						},
					},
				},
			},
			"should":   []map[string]interface{}{},
			"must_not": []map[string]interface{}{},
		},
	},
}
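
For reference, this is roughly how a query body like the one above can be sent to the logs-elastic.agent-default index with the official go-elasticsearch client. This is a minimal sketch, not the exact helper used by the test suite; the package name, the searchIndex function and the error handling are illustrative only.

package e2e // illustrative package name

import (
	"bytes"
	"encoding/json"
	"fmt"

	"github.com/elastic/go-elasticsearch/v7"
)

// searchIndex encodes the query body and runs it against the given index.
// A missing index surfaces here as a non-2xx response (the 404 shown below).
func searchIndex(es *elasticsearch.Client, index string, query map[string]interface{}) error {
	var body bytes.Buffer
	if err := json.NewEncoder(&body).Encode(query); err != nil {
		return err
	}

	res, err := es.Search(
		es.Search.WithIndex(index),
		es.Search.WithBody(&body),
		es.Search.WithTrackTotalHits(true),
	)
	if err != nil {
		return err
	}
	defer res.Body.Close()

	if res.IsError() {
		// For this issue, the error is a 404 index_not_found_exception.
		return fmt.Errorf("error response from Elasticsearch: %s", res.Status())
	}

	var result map[string]interface{}
	return json.NewDecoder(res.Body).Decode(&result)
}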

The ES client returns a 404 error with the following message:

[2020-09-24T05:51:50.616Z] time="2020-09-24T05:51:50Z" level=warning msg="There was an error executing the query" desiredHits=50 elapsedTime=26.350829ms error="Error getting response from Elasticsearch. Status: 404 Not Found, ResponseError: map[error:map[index:logs-elastic.agent-default index_uuid:na reason:no such index [logs-elastic.agent-default] resource.id:logs-elastic.agent-default resource.type:index_or_alias root_cause:[map[index:logs-elastic.agent-default index_uuid:na reason:no such index [logs-elastic.agent-default] resource.id:logs-elastic.agent-default resource.type:index_or_alias type:index_not_found_exception]] type:index_not_found_exception] status:404]" index=logs-elastic.agent-default retry=1

It's important to note that this query is executed with a backoff strategy for a maxTimeout of 3 minutes before failing, so we do not think it's a problem of the tests being too fast.
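
As an illustration of that strategy only (the real implementation lives in the e2e-testing repo and may differ), a retry loop with exponential backoff capped at 3 minutes, reusing the hypothetical searchIndex helper from the sketch above plus the standard fmt, log and time packages, would look roughly like this:

// searchWithBackoff keeps retrying the query until it succeeds or the
// 3-minute budget is exhausted, doubling the wait between attempts.
func searchWithBackoff(es *elasticsearch.Client, index string, query map[string]interface{}) error {
	const maxTimeout = 3 * time.Minute

	wait := 1 * time.Second
	deadline := time.Now().Add(maxTimeout)

	var err error
	for time.Now().Before(deadline) {
		if err = searchIndex(es, index, query); err == nil {
			return nil // the index exists and the query succeeded
		}
		log.Printf("query against %s failed, retrying in %s: %v", index, wait, err)
		time.Sleep(wait)
		wait *= 2
	}
	return fmt.Errorf("index %q still not queryable after %s: %w", index, maxTimeout, err)
}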

How to test this

  1. Install requirements: docker + docker-compose
  2. Clone the e2e repo: git clone https://github.com/elastic/e2e-testing
  3. Run the stand-alone tests
$ cd e2e-testing
$ SUITE=ingest-manager DEVELOPER_MODE=true FEATURE="stand_alone_mode" LOG_LEVEL=TRACE make -C e2e functional-test

Logs

Jenkins Logs and Tests

We always output the log of the elastic-agent container at the end of the scenario, so you'll find any other relevant information there.

Jenkins jobs finding the index (some of them)
  • 320.log: UBI8 passing but not the default Agent image
    [two screenshots of the Jenkins job output attached]

Jenkins job not finding the index
    [screenshot of the Jenkins job output attached]

cc/ @ph @EricDavisX

elasticmachine (Collaborator) commented

Pinging @elastic/ingest-management (Team:Ingest Management)

@EricDavisX added the failed-test label on Oct 5, 2020
michalpristas (Contributor) commented

Inspecting the logs, I see two reasons for failure.
The first one is the "text file busy" issue #21120.
The second one is an ASC hash mismatch; I will take a look to see what I can do to improve the experience here. This was due to a missing ASC file in the release bits, which should be fixed.

The problem there is that it compares the ASC of the released Metricbeat build against the Metricbeat that is packed with the agent itself (because the second iteration of the release was not finished).

ph (Contributor) commented Oct 14, 2020

@michalpristas still an issue?

@ph added the v7.11.0 label on Oct 14, 2020
EricDavisX (Contributor) commented Oct 19, 2020

I'm unsure, but it clearly relates: a problem was found in the logic or flow of restarting, #21835

EricDavisX (Contributor) commented

The fixes for the known problems are all in, and we're expecting the nightly runs to all pass! When we see that (formally), we can close this out.

mdelapenya (Contributor, Author) commented

I'd say that elastic/e2e-testing#376 fixed it! Closing
