Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

APM server monitoring #32515

Merged
merged 14 commits into from
Aug 27, 2018

Conversation

ycombinator
Copy link
Contributor

@ycombinator ycombinator commented Jul 31, 2018

This PR:

  • Adds anapm_system built-in role with privileges to call the bulk Monitoring API, analogous to the beats_system built-in role.
  • Adds an apm_system built-in user with the apm_system role, analogous to the beats_system built-in user.

@ycombinator
Copy link
Contributor Author

/cc @chrisronline

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

Copy link
Member

@pickypg pickypg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some thoughts about creating a new index. I'm definitely more against creating a new one than for, but it does follow a pattern.

"index.number_of_replicas": 0,
"index.number_of_shards": 1
},
"version": 7000001,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: When this is backported, it needs to share the same version as the other templates.

}
}
},
"beats_stats": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should roll APM server stuff into the Beats index. We already explicitly ignore apm-server data for all Beats monitoring UI, so it would help to avoid an extra template and set of indices, and the index is going to share a majority of the same fields with Beats.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've already run into a lot of confusion with using beats within apm-server - I think users who monitor their indices and don't use beats will be very surprised to see a beats index present.

Are there any security concerns that would push towards separating the indexes?

Copy link
Contributor Author

@ycombinator ycombinator Jul 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, @ruflin, @graphaelli and I discussed this briefly off PR. The advantages I see of giving APM server its own set of monitoring indices are:

  • We could, in the future, create a separate built-in security role for APM server monitoring and restrict it's write access to only APM server monitoring indices.
  • The mapping for APM server monitoring indices can grow to fit APM server needs without "polluting" the general beats mapping.
  • (Relatively minor) It removes the filtering in the Monitoring UI code for excluding APM server documents when looking at Beats monitoring and, vice versa, for only including APM server documents when looking at APM server monitoring.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The very big negative for adding a new index is that it creates a drain on Elasticsearch resources. Chances are, users won't have a lot of apm-server instances reporting to the Monitoring cluster, so creating a separate index for such a low scale use case means we're adding the shards for what will likely be about 24 * 60 * 60 / 10 = 8640 documents per day. At the same time, we'll require Elasticsearch to search each one of those indices for data whenever we try to summarize the stack views.

That's extremely wasteful, especially for small instances/clusters that can generally host monitoring data, but are now getting pushed closer to their limits because of the added load from threading the searches, joining the data in memory, growing the cluster state, and adding the segments.

I think users who monitor their indices and don't use beats will be very surprised to see a beats index present.

We could get around this fear by either documenting it, or updating the index pattern for .monitoring-beats-6-* to be .monitoring-beats-apm-6-*. We could do this for 6.5 and add an alias to .monitoring-beats-6-alias, to the template, so that there is no impact to the UI.

We could, in the future, create a separate built-in security role for APM server monitoring and restrict it's write access to only APM server monitoring indices.

We can still do this. The product user (Beats, Kibana, Logstash) writes indirectly to the monitoring data, so adding a user specific to APM does not require its own index. When Beats gains the ability to monitor APM then we can restrict access by adding a suffix to the index (e.g., .monitoring-beats-apm-6-apm-*) and limit the user to that, if we so choose. It would be ideal, even if we do split the indices, to think of them as a .monitoring-* index pattern rather than the subsets. This should help us, in the future, to avoid this kind of problem. We never should have separated Kibana into its own index. We did that for curation and this, but it's useless on its own, tiny, and never curated separately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a chat about this with @ycombinator last week and we expect that apm-server metrics potentially diverge from the Beats ones and we would prefer to keep it separate to have the flexibility in the future.

For the resource usage problem we should rather look into using ILM to have less indices and shards in my opinion instead of mixing multiple products into 1 index.

The other extreme we could go is having the data form all products in 1 index as we currently do with Metricbeat, but we should not start to mix both options.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @pickypg, I'd like to move this PR forward a bit this week. Could you please take a look at @ruflin's comment above? Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ruflin

we expect that apm-server metrics potentially diverge from the Beats ones and we would prefer to keep it separate to have the flexibility in the future.

Isolating data by the prefixed type has always worked for stack monitoring and nothing about that should be impacted by such deviations unless they try to share a type (e.g., beats_stats) while changing the meaning of specific fields. I don't expect that to happen (versus augmenting it, which is fine).

For the resource usage problem we should rather look into using ILM to have less indices and shards in my opinion instead of mixing multiple products into 1 index.

I agree that we should be using ILM/rollover when it is released, regardless of what we do here. However, at least for the short term, to avoid expanding the amount of data that we're storing on existing monitoring clusters, we will want to configure ILM to prune data based on time to match existing behavior (7 days by default). Longer term, particularly once Beats takes over, we can hopefully revisit this solution to be based more on size. Also, adding the shards and extra index mappings is an unnecessary impact that I really think that we should avoid.

The other extreme we could go is having the data form all products in 1 index as we currently do with Metricbeat, but we should not start to mix both options.

I strongly think that we should move in that direction. With Beats, if we / the user decides that they want their data isolated by product, then it's as simple as modifying the index like how I am suggesting to add the -apm suffix. It really does boil down to a huge mistake that we aren't already doing this.

Think about the impact on the cluster from sharing the index relative to Beats-based usage:

  • Searches use the same shards
  • Bulk activity can use the same shards

Versus separate indices:

  • Searches use separate shards
  • Index mappings are separated (growing the cluster state and routing table)
  • Bulk activity cannot be optimized further than by product

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chatted with @ruflin off-PR. For now, for the reasons that @pickypg has articulated in this thread, I'm going to not create a separate index (and template) for APM server monitoring documents. In the longer run, it's likely that we'll be changing indices altogether, perhaps consolidating all monitoring data into metricbeat indices; we can always revisit the indexing strategy then. So for now I'm going to update this PR to remove the index-related changes.

I will, however, keep the changes for a new built-in role and user for APM server monitoring, since asking users/admins to use beats users/roles could be easily confusing (as opposed to users/admins seeing monitoring indices with beats in the name as those are a bit more behind the scenes).

"type": "keyword"
},
"name": {
"type": "keyword"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thoughts on marking this and other fields not relevant to apm-server as enabled: false ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we could even just remove such fields from the mapping entirely for now, and add them as/when we need them. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would that require removing them from the libbeat reporting for apm-server?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I don't think so because the mapping uses dynamic: false, so any un-mapped fields will be simply ignored and not indexed (as opposed to if the mapping were to use dynamic: strict, in which case un-mapped fields would cause errors).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The belief is right. Anything unmapped just tags along in the _source but is not used by ES.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So @graphaelli, which fields specifically is it okay to remove from the mapping? Let me know and I'll take them out of this PR!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all of metrics.libbeat.config.* can go as we don't support modules or reloading

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ycombinator Perhaps we can also make it possible to not show up in the events if not there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ruflin Can you clarify? Do you mean, we can make it so that Beats (APM server) doesn't ship these fields to Monitoring?

Copy link
Member

@ruflin ruflin Aug 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ycombinator Yes, but not only applies to apm-server as other beats also don't have these fields. But it's not important at the moment I would say.

@ruflin
Copy link
Member

ruflin commented Aug 2, 2018

I think we will also have to add a user for apm-server to Elasticsearch.

@ycombinator
Copy link
Contributor Author

ycombinator commented Aug 2, 2018

I think we will also have to add a user for apm-server to Elasticsearch.

++ will add built-in role and user, analogous to beats_system.

@tvernum tvernum added the :Security/Security Security issues without another label label Aug 3, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-security

@@ -4,7 +4,7 @@
== elasticsearch-setup-passwords

The `elasticsearch-setup-passwords` command sets the passwords for the built-in
`elastic`, `kibana`, `logstash_system`, and `beats_system` users.
`elastic`, `kibana`, `logstash_system`, `beats_system`, and `apm_server_system` users.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to call the user apm-server_system? I mainly worry that users might mistype it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't mixing the dash and the underscore make it more easy for users to mistype the username?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it could be argued both ways and probably both options are going to cause trouble. Go with the option you feel more comfortable with.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@graphaelli Why don't you break the tie here, since this username is intended for APM customers after all 😄

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apm_server_system looks great, thanks

@ruflin
Copy link
Member

ruflin commented Aug 9, 2018

Looks like the failing test is related:

16:08:09 FAILURE 10.5s J3 | LocalExporterIntegTests.testExport <<< FAILURES!
16:08:09    > Throwable #1: java.lang.AssertionError: expected:<[.monitoring-logstash, .monitoring-es, .monitoring-alerts, .monitoring-beats, .monitoring-kibana]> but was:<[.monitoring-logstash, .monitoring-es, .monitoring-apm-server, .monitoring-beats, .monitoring-alerts, .monitoring-kibana]>

@ycombinator
Copy link
Contributor Author

Yeah, I can fix that test, but I'll have to "fix" it again if we decide not to go with a separate index. So I was hoping to get that resolved first.

@ycombinator
Copy link
Contributor Author

@pickypg @ruflin @graphaelli I've updated this PR per #32515 (comment). Ready for your review.

Additionally, could someone from @elastic/es-security review as well, please?

Thanks!

@@ -63,7 +64,7 @@
public class SetupPasswordTool extends LoggingAwareMultiCommand {

private static final char[] CHARS = ("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789").toCharArray();
public static final List<String> USERS = asList(ElasticUser.NAME, KibanaUser.NAME, LogstashSystemUser.NAME, BeatsSystemUser.NAME);
public static final List<String> USERS = asList(ElasticUser.NAME, KibanaUser.NAME, LogstashSystemUser.NAME, BeatsSystemUser.NAME, APMServerSystemUser.NAME);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line probably needs shortening. See CI error:

21:07:59 [ant:checkstyle] [ERROR] /var/lib/jenkins/workspace/elastic+elasticsearch+pull-request/x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/authc/esnative/tool/SetupPasswordTool.java:67: Line is longer than 140 characters (found 160). [LineLength]

@ycombinator ycombinator force-pushed the x-pack/monitoring/apm-server branch 2 times, most recently from 3ebd1f5 to 88875c1 Compare August 15, 2018 00:21
@ruflin
Copy link
Member

ruflin commented Aug 15, 2018

Now CI found a new place to complain about line length:

03:42:59 [ant:checkstyle] [ERROR] /var/lib/jenkins/workspace/elastic+elasticsearch+pull-request/x-pack/plugin/security/src/test/java/org/elasticsearch/xpack/security/authc/esnative/NativeUsersStoreTests.java:104: Line is longer than 140 characters (found 147). [LineLength]

Copy link
Member

@jaymode jaymode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. My only thought is around whether we need server in the username, but I'm fine with or without it.

@ycombinator
Copy link
Contributor Author

ycombinator commented Aug 17, 2018

My only thought is around whether we need server in the username

@graphaelli what's your opinion on this? I put server in the role/user names to try and be future-proof, in case we ever have the need for a role/user for the APM client in the future. But maybe that's not a valid use case and we could drop server from the role/user names in this PR?

@ycombinator
Copy link
Contributor Author

@jaymode CI is green again, mind taking a (hopefully final) peek? Thanks!

Shaunak

Copy link
Member

@jaymode jaymode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still LGTM

Copy link
Member

@pickypg pickypg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ycombinator ycombinator merged commit 1779d33 into elastic:master Aug 27, 2018
ycombinator added a commit that referenced this pull request Aug 27, 2018
* Adding new MonitoredSystem for APM server

* Teaching Monitoring template utils about APM server monitoring indices

* Documenting new monitoring index for APM server

* Adding monitoring index template for APM server

* Copy pasta typo

* Removing metrics.libbeat.config section from mapping

* Adding built-in user and role for APM server user

* Actually define the role :)

* Adding missing import

* Removing index template and system ID for apm server

* Shortening line lengths

* Updating expected number of built-in users in integration test

* Removing "system" from role and user names

* Rearranging users to make tests pass
@ycombinator
Copy link
Contributor Author

Backported to:

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Aug 27, 2018
* This was broken by elastic#32515 since the 5.x versions
were removed between PR creation and merge
original-brownbear added a commit that referenced this pull request Aug 27, 2018
* This was broken by #32515 since the 5.x versions
were removed between PR creation and merge
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Aug 27, 2018
* This was broken by elastic#32515 since the 5.x versions
were removed between PR creation and merge
jasontedor added a commit that referenced this pull request Aug 27, 2018
* master:
  Adjust BWC version on mapping version
  Token API supports the client_credentials grant (#33106)
  Build: forked compiler max memory matches jvmArgs (#33138)
  Introduce mapping version to index metadata (#33147)
  SQL: Enable aggregations to create a separate bucket for missing values (#32832)
  Fix grammar in contributing docs
  SECURITY: Fix Compile Error in ReservedRealmTests (#33166)
  APM server monitoring (#32515)
  Support only string `format` in date, root object & date range (#28117)
  [Rollup] Move toBuilders() methods out of rollup config objects (#32585)
  Fix forbiddenapis on java 11  (#33116)
  Apply publishing to genreate pom (#33094)
  Have circuit breaker succeed on unknown mem usage
  Do not lose default mapper on metadata updates (#33153)
  Fix a mappings update test (#33146)
  Reload Secure Settings REST specs & docs (#32990)
  Refactor CachingUsernamePassword realm (#32646)
jasontedor added a commit that referenced this pull request Aug 27, 2018
* 6.x:
  Introduce mapping version to index metadata (#33147)
  Move non duplicated actions back into xpack core (#32952)
  HLRC: Create server agnostic request and response (#32912)
  Build: forked compiler max memory matches jvmArgs (#33138)
  * Added breaking change section for GROUP BY behavior: now it considers null or empty values as a separate group/bucket. Previously, they were ignored. * This is part of backporting of #32832
  SQL: Enable aggregations to create a separate bucket for missing values (#32832)
  [TEST] version guard for reload rest-api-spec
  Fix grammar in contributing docs
  APM server monitoring (#32515)
  Support only string `format` in date, root object & date range (#28117)
  [Rollup] Move toBuilders() methods out of rollup config objects (#32585)
  Accept Gradle build scan agreement (#30645)
  Fix forbiddenapis on java 11  (#33116)
  Run forbidden api checks with runtimeJavaVersion (#32947)
  Apply publishing to genreate pom (#33094)
  Fix a mappings update test (#33146)
  Reload Secure Settings REST specs & docs (#32990)
  Refactor CachingUsernamePassword realm (#32646)
@colings86 colings86 added >feature and removed :Security/Security Security issues without another label labels Oct 25, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants