Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log monitoring bulk failures #14356

Merged
merged 9 commits into from
Nov 14, 2019

Conversation

ycombinator
Copy link
Contributor

@ycombinator ycombinator commented Oct 31, 2019

Resolves #14303.

As reported in #14303, when the Elasticsearch monitoring reporter in libbeat sends a bulk API request to Elasticsearch, and that request fails, the errors are currently swallowed. This is because the actual response code for the bulk API request is 200 OK; the actual errors are embedded in the request's response body.

This PR teaches the Elasticsearch monitoring reporter to parse the bulk API response and log any errors. For the parsing, the same code as the Elasticsearch output is reused.

Testing this PR

  1. Start up Elasticsearch with security enabled. Make sure you know the password for the elastic superuser.

  2. Create a role that grants necessary privileges for managing and writing to metricbeat-* indices.

    curl -s -u elastic -H 'Content-Type: application/json' 'http://localhost:9200/_security/role/mb_writer' -d '{ "cluster": [ "monitor", "manage_ilm", "manage_index_templates" ], "indices": [ { "names": [ "metricbeat-*" ], "privileges": [ "all" ] } ] }'
    
  3. Create a user with the above role.

    curl -s -u elastic -H 'Content-Type: application/json' 'http://localhost:9200/_security/user/mb_writer' -d '{ "password": "mb_writer", "roles": [ "mb_writer" ] }'
    
  4. Build Metricbeat with this PR.

    cd $GOPATH/src/github.com/elastic/beats/metricbeat
    mage build
    
  5. Start Metricbeat with monitoring enabled and specifying the credentials of the above user for the elasticsearch output.

    ./metricbeat -e -E output.elasticsearch.username=mb_writer -E output.elasticsearch.password=mb_writer -E monitoring.enabled=true
    
  6. Verify that metricbeat-* indices are being created and populated in Elasticsearch but no .monitoring-beats-* indices are being created.

    curl -s -u elastic 'http://localhost:9200/_cat/indices'
    
  7. Verify that there are warnings in the log like so:

    2019-11-01T08:57:43.910-0700    WARN    elasticsearch/client.go:258     monitoring bulk item insert failed (i=0, status=403): {"type":"security_exception","reason":"action [indices:admin/create] is unauthorized for user [mb_writer]"}
    

@@ -548,9 +522,9 @@ func bulkCollectPublishFails(
return failed, stats
}

func itemStatus(reader *jsonReader) (int, []byte, error) {
func ItemStatus(reader *JSONReader) (int, []byte, error) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported function ItemStatus should have comment or be unexported

libbeat/outputs/elasticsearch/json_read.go Show resolved Hide resolved
libbeat/outputs/elasticsearch/json_read.go Show resolved Hide resolved
libbeat/outputs/elasticsearch/json_read.go Show resolved Hide resolved
return
}

for i, _ := range events {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should omit 2nd value from range; this loop is equivalent to for i := range ...

@elasticmachine
Copy link
Collaborator

Pinging @elastic/stack-monitoring (Stack monitoring)

@ycombinator ycombinator force-pushed the lb-mon-log-bulk-failures branch 3 times, most recently from 73c8a72 to 34227f6 Compare November 12, 2019 01:34
@ycombinator
Copy link
Contributor Author

jenkins, test this

Copy link
Contributor

@ph ph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code is OK to me, but I think we should have some tests added to cover that behavior and especially if the remote system changes his behavior. I don't link how the 200 vs the 403 response code is handled in this scenario.

Looking at existing code, there is currently no unit tests for the ES/reporter and adding that to the existing python system tests might be complicated but still worth investigating.

Also for BulkReadToItems we can surely add a test for it?

raw []byte
}
// BulkResult contains the result of a bulk API request.
type BulkResult json.RawMessage
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 nice change

libbeat/outputs/elasticsearch/client.go Show resolved Hide resolved
Copy link
Contributor

@ph ph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, we need to find a better way with system test, I think its a problem and we need to have a proposal for that. Maybe a way to use a specific docker-compose file for a set of test.

@ycombinator
Copy link
Contributor Author

Travis CI is green. Jenkins CI failures are unrelated. Merging.

@ycombinator ycombinator merged commit a9aff6f into elastic:master Nov 14, 2019
ycombinator added a commit that referenced this pull request Nov 19, 2019
* Log monitoring bulk failures (#14356)

* Log monitoring bulk failures

* Renaming function

* Simplifying type

* Removing extraneous second value

* Adding godoc comments

* Adding CHANGELOG entry

* Clarifying log messages

* WIP: adding unit test stubs

* Fleshing out unit tests

* [DOCS] Deprecate central management (#14104) (#14594)

* State minimum Go version (#14400) (#14598)

* [DOCS] Fix description of rename processor (#14408) (#14600)

* Log monitoring bulk failures (#14356)

* Log monitoring bulk failures

* Renaming function

* Simplifying type

* Removing extraneous second value

* Adding godoc comments

* Adding CHANGELOG entry

* Clarifying log messages

* WIP: adding unit test stubs

* Fleshing out unit tests

* Fixing up CHANGELOG
ycombinator added a commit that referenced this pull request Nov 21, 2019
* Log monitoring bulk failures

* Renaming function

* Simplifying type

* Removing extraneous second value

* Adding godoc comments

* Adding CHANGELOG entry

* Clarifying log messages

* WIP: adding unit test stubs

* Fleshing out unit tests
@ycombinator ycombinator deleted the lb-mon-log-bulk-failures branch December 25, 2019 11:08
leweafan pushed a commit to leweafan/beats that referenced this pull request Apr 28, 2023
* Log monitoring bulk failures (elastic#14356)

* Log monitoring bulk failures

* Renaming function

* Simplifying type

* Removing extraneous second value

* Adding godoc comments

* Adding CHANGELOG entry

* Clarifying log messages

* WIP: adding unit test stubs

* Fleshing out unit tests

* [DOCS] Deprecate central management (elastic#14104) (elastic#14594)

* State minimum Go version (elastic#14400) (elastic#14598)

* [DOCS] Fix description of rename processor (elastic#14408) (elastic#14600)

* Log monitoring bulk failures (elastic#14356)

* Log monitoring bulk failures

* Renaming function

* Simplifying type

* Removing extraneous second value

* Adding godoc comments

* Adding CHANGELOG entry

* Clarifying log messages

* WIP: adding unit test stubs

* Fleshing out unit tests

* Fixing up CHANGELOG
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Beats will not log monitoring bulk index failure
4 participants