Upgrade elastic.apm.* configuration to "beta" #99328

Closed
4 of 7 tasks
joshdover opened this issue May 5, 2021 · 11 comments
Labels: performance, Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc)


@joshdover
Contributor

joshdover commented May 5, 2021

Though we've had a working APM agent integration since 7.12, we have not yet documented the configuration keys or allowed customers to configure these on Cloud/ESS.

This issue is to discuss the remaining work needed to get this feature up to a "beta" status in order to allow customers to debug problematic visualizations, dashboards, and other user-created features that may be causing performance problems in their Stack cluster.

To do

  • Change the default APM config to reflect the findings of "Measure APM agent impact on the platform performance" (#78792):
    • captureSpanStackTraces should default to false
    • breakdownMetrics should default to false
    • transactionSampleRate should default to 0.1
  • Add documentation for the supported elastic.apm.* configuration keys, marked as beta
  • Confirm that our configuration telemetry includes elastic.apm.* configs
  • Add the documented configuration keys to Cloud's allowlist for Kibana 7.13+
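
As a rough illustration of the defaults proposed in the list above, the following TypeScript sketch overlays user-supplied settings on top of them. The `ApmDefaults` type, the `PROPOSED_DEFAULTS` constant and the `withDefaults` helper are hypothetical names made up for this example; they are not the actual kbn-apm-config-loader code.

```typescript
// Hypothetical shape covering only the three keys discussed in this issue.
interface ApmDefaults {
  captureSpanStackTraces: boolean;
  breakdownMetrics: boolean;
  transactionSampleRate: number;
}

// Values taken from the to-do list above.
const PROPOSED_DEFAULTS: ApmDefaults = {
  captureSpanStackTraces: false,
  breakdownMetrics: false,
  transactionSampleRate: 0.1,
};

// Hypothetical helper: user-provided keys win over the defaults.
function withDefaults(userConfig: Partial<ApmDefaults>): ApmDefaults {
  return { ...PROPOSED_DEFAULTS, ...userConfig };
}
```

For example, `withDefaults({ breakdownMetrics: true })` would keep the other two defaults and only flip `breakdownMetrics`.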
@joshdover joshdover added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc performance labels May 5, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@mshustov mshustov self-assigned this May 5, 2021
@dgieselaar
Member

dgieselaar commented May 6, 2021

There is work ongoing to improve the instrumentation for task manager (and thus, alert/detection engine executions) here: #99160.

The most useful part of that change is the more granular grouping for transactions. We could also consider extracting that and backporting it to 7.13. The other changes concern distributed tracing for tasks (rule execution > action execution), some other minor improvements and a possible performance fix, which we'll likely split out as well.

@joshdover
Contributor Author

Grooming notes:

@dgieselaar
Member

@joshdover: @SylvainJuge from the APM Java Agent team has been looking into instrumenting ES. He might have an idea of how the slow query log can be correlated to a trace.

@dgieselaar
Member

@joshdover what do we want to get from the slow query logs that we cannot get from the Elasticsearch spans, other than those spans being sampled? We have an experimental feature for tail-based sampling, but I'm not sure if that is able to select outliers yet, I'll follow up on that.

@joshdover
Contributor Author

What's primarily driving this as a priority at this time is our inability to correlate slow & expensive queries in Elasticsearch to the underlying feature or visualization in Kibana. I'm concerned that finding slow ES queries with APM data only from Kibana may be hard, at least with the default UI. But I think we would actually have all the raw data we need. To find the slowest ES queries coming from Kibana, we'd just need to drop down into Discover, search for spans with span.destination.service.resource: elasticsearch, and sort by span.duration.us.
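
The search described above could also be expressed as an Elasticsearch request body, roughly along these lines. This is only a sketch: the `slowestEsSpansQuery` helper is a hypothetical name, and it assumes `span.destination.service.resource` is mapped as a keyword field and `span.duration.us` as a numeric field in the APM indices.

```typescript
// Hypothetical helper that builds a search body for the slowest
// Elasticsearch spans, sorted by duration in descending order.
function slowestEsSpansQuery(size: number = 20) {
  return {
    size,
    query: {
      // Exact match on the destination resource named in the comment above.
      term: { 'span.destination.service.resource': 'elasticsearch' },
    },
    // Longest spans first.
    sort: [{ 'span.duration.us': { order: 'desc' } }],
  };
}
```

Only sampled transactions produce spans, so a query like this sees the slowest queries among the sampled ones, not necessarily the slowest overall.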

> We have an experimental feature for tail-based sampling, but I'm not sure if that is able to select outliers yet, I'll follow up on that.

Anywhere I can read about this feature?

@dgieselaar
Member

You can also use span.subtype:elasticsearch (IIRC). span.destination.service.resource is for service maps, and its value might be node-specific in the future.

> Anywhere I can read about this feature?

There is none, I think: elastic/apm-server#4586.

I don't think it can select spans to sample based on duration, but I'm still trying to figure out whether that might be possible in the future.

@dgieselaar
Member

@joshdover it could be possible in the future to make sampling decisions based on slow Elasticsearch queries (e.g. keep all transactions that have a slow ES span). Not something we can use for a while though. It might be relevant here because if we have a very low sampling rate (e.g. 2%), most of the queries logged by ES as slow would not have a corresponding Elasticsearch span.
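
To put a rough number on that last point, here is a small sketch of the arithmetic. It assumes sampling is decided uniformly at random per transaction, independent of how slow its queries are, and that each slow query belongs to exactly one transaction; `expectedSlowQueriesWithoutSpan` is a hypothetical helper name.

```typescript
// Expected number of slow queries that have no corresponding span,
// under the uniform-sampling assumption stated above.
function expectedSlowQueriesWithoutSpan(
  slowQueryCount: number,
  sampleRate: number
): number {
  if (sampleRate < 0 || sampleRate > 1) {
    throw new RangeError('sampleRate must be between 0 and 1');
  }
  // Each slow query is missed with probability (1 - sampleRate).
  return slowQueryCount * (1 - sampleRate);
}
```

With a 2% sample rate, about 980 out of every 1000 slow queries logged by ES would have no matching span in APM.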

@lizozom
Contributor

lizozom commented Nov 11, 2021

Note https://github.com/elastic/cloud/pull/90839#issuecomment-960942239
As a first stage, we might opt into pre-configuring apm and only allowlisting the active flag.

@trentm
Member

trentm commented Nov 15, 2021

@joshdover

> captureSpanStackTraces should default to false
> breakdownMetrics should default to false
> transactionSampleRate should default to 0.1

These are checked above, but is it possible that the CENTRALIZED_*_CONFIG values in kbn-apm-config-loader will get skipped with cloud deployments that might set elastic.apm.serverUrl to a value other than the one hardcoded in the Kibana build?

#117492 suggests:

elastic.apm.serverUrl: <Regional APM cluster>

which may differ from the default serverUrl being updated in #117749, and which would then fail the condition in this code:

    if (
      !this.baseConfig?.serverUrl ||
      this.baseConfig.serverUrl === centralizedConfig.serverUrl
    ) {
      this.baseConfig = merge(this.baseConfig, centralizedConfig);
    }

skipping the CENTRALIZED_*_CONFIG blocks that are the only ones setting captureSpanStackTraces, breakdownMetrics, captureHeaders, captureBody, etc.

I'm not confident I've read this correctly.
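
A standalone sketch of the concern, under loud assumptions: the `ApmAgentConfig` type, the `applyCentralizedConfig` function and both URLs are hypothetical placeholders, and a shallow object spread stands in for the `merge` call in the quoted snippet. It is meant to illustrate the reading above, not to reproduce the real loader.

```typescript
// Hypothetical subset of the agent config, for illustration only.
interface ApmAgentConfig {
  serverUrl?: string;
  captureSpanStackTraces?: boolean;
  breakdownMetrics?: boolean;
}

// Placeholder values; the real centralized config lives in kbn-apm-config-loader.
const centralizedConfig: ApmAgentConfig = {
  serverUrl: 'https://apm.example.invalid',
  captureSpanStackTraces: false,
  breakdownMetrics: false,
};

function applyCentralizedConfig(baseConfig: ApmAgentConfig): ApmAgentConfig {
  // Same condition as the quoted snippet: only merge when no serverUrl is
  // set, or when it equals the centralized one.
  if (!baseConfig.serverUrl || baseConfig.serverUrl === centralizedConfig.serverUrl) {
    // Shallow spread used here in place of a deep merge (assumption).
    return { ...baseConfig, ...centralizedConfig };
  }
  // Different serverUrl: the centralized block is skipped entirely.
  return baseConfig;
}
```

In this sketch, calling `applyCentralizedConfig({ serverUrl: 'https://regional.example.invalid' })` returns the input unchanged, so `captureSpanStackTraces` and `breakdownMetrics` stay unset, which is the situation the comment above is asking about.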

@lizozom
Contributor

lizozom commented Apr 18, 2022

@joshdover

Docs on how to configure APM were added in #127892
Cloud now sets all configurations internally as part of https://github.com/elastic/dev/issues/1443

I'm closing the issue for now. Feel free to reopen if you find this necessary 🙏🏻

@lizozom lizozom closed this as completed Apr 18, 2022