Agents should collect cloud identification metadata #256

alex-fedotyev · 2020-04-16T19:52:33Z

Description of the issue

There are multiple use cases where knowing the cloud provider would help our users and ourselves:

App ops knowing where the applications are hosted
Linking across apm, logs and metrics
Elastic - monitoring where customers are deploying their applications to prioritize installation, integration and coverage work

Agents would need to report the following optional fields (based on ECS):

cloud.availability_zone
cloud.account.id
cloud.account.name
cloud.instance.id
cloud.instance.name
cloud.machine.type
cloud.project.id
cloud.project.name
cloud.provider
cloud.region
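
As a rough illustration, the dotted ECS field names above could be folded into a nested `cloud` object along these lines. This is only a sketch: the helper name `nest_fields` and the sample values are hypothetical, and the exact intake payload shape is not specified in this issue.

```python
def nest_fields(flat):
    """Turn {"cloud.account.id": "x"} into {"cloud": {"account": {"id": "x"}}}.

    Fields whose value is None are skipped, since every field is optional.
    """
    nested = {}
    for dotted_key, value in flat.items():
        if value is None:
            continue
        node = nested
        *parents, leaf = dotted_key.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


# Sample values are made up for illustration only.
example = nest_fields({
    "cloud.provider": "aws",
    "cloud.region": "us-east-1",
    "cloud.availability_zone": "us-east-1a",
    "cloud.instance.id": "i-0123456789abcdef0",
    "cloud.account.id": None,  # unknown, so omitted from the result
})
```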

APM Server can currently report this data only for co-located applications (it needs to be installed on the same host as the application).
This is the code that collects cloud metadata:
https://github.com/elastic/beats/tree/0ef472268ea41881f81accabe2af6cfb72eef682/libbeat/processors/add_cloud_metadata

Agent spec PR: #290

Agent Issues

basepi · 2020-04-21T14:58:48Z

We did a lot of this type of metadata collection in HubbleStack, which I worked on in my previous job. It can probably be improved (I didn't actually write this code) with environment variable checks and the like. Including here just for reference. https://github.com/hubblestack/hubble/blob/develop/hubblestack/extmods/grains/cloud_details.py

basepi · 2020-04-21T18:32:53Z

I'd be happy to lead the POC on this with the Python agent, once roadmap discussions have happened.

axw · 2020-04-22T02:32:09Z

It can probably be improved (I didn't actually write this code) with environment variable checks and the like

It would be great if we didn't have to make a network request for each cloud provider type, by detecting which cloud we're running in by some other means first. e.g. for EC2: https://serverfault.com/questions/462903/how-to-know-if-a-machine-is-an-ec2-instance. If nothing else, configuration to specify the cloud provider, which might be "none" to disable sniffing altogether.

basepi · 2020-04-22T15:10:44Z

Agreed. That's the primary way I would improve the linked code. :) It's just a useful reference for what kind of data is available and how to access each metadata endpoint for each cloud provider.

graphaelli · 2020-04-23T15:34:49Z

Following elastic/ecs#816 all fields are now in ECS, some still to be included in a release.

axw · 2020-05-04T01:45:06Z

I've taken the liberty of changing cloud.availability.zone to cloud.availability_zone. I assume that was a typo.

graphaelli · 2020-05-07T17:47:49Z

Intake for these will be available as of 7.8 and is now available in nightly snapshots.

@basepi a POC and guidelines for all agents to follow when implementing collection of this information would be great! I'll follow up on prioritization.

basepi · 2020-05-08T15:33:48Z

Tracking: elastic/apm-agent-python#822

basepi · 2020-05-12T20:17:50Z

It would be great if we didn't have to make a network request for each cloud provider type, by detecting which cloud we're running in by some other means first. e.g. for EC2: https://serverfault.com/questions/462903/how-to-know-if-a-machine-is-an-ec2-instance. If nothing else, configuration to specify the cloud provider, which might be "none" to disable sniffing altogether.

After some initial investigation, I don't think there's a reliable way for us to detect the cloud provider without making network requests. If you go through the above linked thread for EC2, you'll find that there are endless edge cases (well-documented in this answer) and it seems that the most reliable way is indeed to hit the metadata server.

Unfortunately, if the metadata server isn't there, we have to wait for the request to time out. On the bright side, we should only need to do this once, on startup. We can also be pretty aggressive with our timeouts, since the metadata server should be very low latency. (The cloud metadata timeout will also be configurable, in case we're too aggressive in some cases.)

My plan is to provide configuration to specify the cloud provider, as recommended by @axw above, including the ability to disable cloud metadata generation completely. I'll also implement some of the low-hanging fruit checks to reduce the blind checks as much as possible.
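
A minimal sketch of what that configuration handling might look like, assuming a single string setting. The function name `providers_to_check`, the `"auto"` default, and the provider names are assumptions made for illustration; only the `"none"` value comes from the discussion above.

```python
KNOWN_PROVIDERS = ("aws", "gcp", "azure")


def providers_to_check(setting):
    """Map a config value to the ordered list of providers to probe.

    "auto"  -> probe every known provider (assumed default)
    "none"  -> probe nothing (cloud metadata disabled)
    a name  -> probe only that provider
    """
    value = (setting or "auto").strip().lower()
    if value == "none":
        return []
    if value == "auto":
        return list(KNOWN_PROVIDERS)
    if value in KNOWN_PROVIDERS:
        return [value]
    raise ValueError(f"unknown cloud provider setting: {setting!r}")
```

With this shape, setting the provider explicitly limits the agent to one metadata request, and `"none"` skips the network entirely.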

basepi · 2020-06-12T18:06:39Z

My solution is code complete here. I still need to add some tests, and do some manual end to end testing, but it's working on AWS, Azure, and GCP.

Notice that there is no provider guessing. I have found no consistent method to detect provider outside of querying the metadata services.

In fact, even those AWS methods defined in the serverfault thread don't work. My AWS machine doesn't have amazonaws.com in hostname -d.

According to Amazon, the best way is to hit the metadata server. You can query the system's UUID but it's not guaranteed to be accurate because there's nothing stopping non-EC2 servers from starting their UUID with ec2.

For Azure, we could check against Azure's IP blocks, but that's a changing list we don't want to have to maintain.

dmidecode is one place we could probably get the required information for Azure (and maybe for GCP, I haven't checked), but it requires sudo which we won't (or, at least, shouldn't) have access to.

Luckily, all of the metadata services rely on non-routable addresses which fail immediately in my testing. So using trial and error should add effectively no overhead. (Even if it did, it only needs to happen once.)

felixbarny · 2020-06-15T14:39:37Z

Would you gather the cloud metadata in a blocking or async way on startup? If async, what should we do when we want to send events to APM Server before the metadata is available? Some options:

Send with incomplete metadata
Don't capture transactions until we have computed the metadata
Delay intake API requests until we have computed the metadata (queuing events in the meantime)

basepi · 2020-06-16T13:23:16Z

In the Python agent, it blocks while we're setting up the transport thread. In my testing, these metadata services are all local to the box and extremely fast, so I expect the overhead to be effectively zero.

felixbarny · 2020-06-16T13:35:58Z

Doing it synchronously sounds like a good tradeoff then, given the complexity of doing it async. I expect the Node.js agent will only have the option of doing it async, though.

basepi · 2020-06-16T15:06:32Z

Perhaps. I would probably recommend the "Delay intake API requests until we have computed the metadata (queuing events in the meantime)" option in that case.

basepi · 2020-06-24T22:25:02Z

The Python implementation is complete and tested. We ended up doing this work in the transport background thread. There are no issues with race conditions: the send queue is ready before the thread starts, but the thread won't start processing the queue until the metadata generation is complete. This is effectively the "Delay intake API requests until we have computed the metadata (queuing events in the meantime)" option.
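
To make the ordering concrete, here is a toy sketch of that pattern. It is not the agent's actual transport: the `Transport` class, its method names, and the `sent` list (standing in for real intake API requests) are all hypothetical.

```python
import queue
import threading


class Transport:
    """Toy transport: events queue up immediately, sending starts later."""

    def __init__(self, fetch_metadata):
        self._fetch_metadata = fetch_metadata
        self._queue = queue.Queue()   # ready before the thread starts
        self.sent = []                # stands in for real intake API requests
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def queue_event(self, event):
        self._queue.put(event)

    def close(self):
        self._queue.put(None)         # sentinel: stop after draining
        self._thread.join()

    def _run(self):
        metadata = self._fetch_metadata()   # blocks this thread only
        while True:
            event = self._queue.get()
            if event is None:
                break
            self.sent.append({"metadata": metadata, "event": event})


# Usage: the fetch is held back artificially to show that events are
# accepted, but not sent, while metadata collection is still pending.
gate = threading.Event()


def slow_fetch():
    gate.wait(timeout=5)
    return {"cloud": {"provider": "aws"}}


transport = Transport(slow_fetch)
transport.queue_event("transaction-1")
sent_before = len(transport.sent)
gate.set()
transport.close()
```

The application thread never waits on the metadata request; only the background thread does, and queued events are flushed once it returns.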

basepi · 2020-06-30T14:04:44Z

@elastic/apm-agent-devs The Python solution for this metadata is complete and can be used as a reference. Please open an issue for this for each of your teams. I think we're targeting having this available (even if the UI isn't there yet) for 7.9.0.

smith · 2020-07-01T15:49:14Z

elastic/kibana#70465 has been opened to collect some of these fields in the APM telemetry tasks.

basepi · 2020-07-07T21:52:23Z

@elastic/apm-agent-devs We found a bug in my implementation of AWS metadata collection. It turns out that if you try to PUT against the token endpoint from a Docker container in AWS Elastic Beanstalk, it fails with a ReadTimeout. So that token request needs exception handling and short timeouts (with no retries). See my fix here: elastic/apm-agent-python#884
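
For other agents, the shape of that guard might look something like the sketch below. It uses only the Python standard library and is not the code from the linked fix. The function name `fetch_imds_token`, the 1 second default timeout, the 300 second token TTL, and the injectable `urlopen` parameter (there only to make the function testable offline) are assumptions.

```python
import urllib.request

TOKEN_URL = "http://169.254.169.254/latest/api/token"


def fetch_imds_token(timeout=1.0, urlopen=urllib.request.urlopen):
    """Request an EC2 metadata session token; return None on network failure.

    A single attempt with a short timeout and no retries, so an endpoint
    that hangs cannot stall agent startup.
    """
    request = urllib.request.Request(
        TOKEN_URL,
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"},
    )
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except OSError:
        # Covers URLError, connection errors, and read timeouts, which are
        # all OSError subclasses in the standard library.
        return None
```

When this returns None, the caller can treat the token as unavailable and carry on, rather than letting the exception propagate into agent startup.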

smith · 2020-07-07T23:51:26Z

Telemetry support in APM UI for sending AZ, provider, and region is shipping in 7.9 (elastic/kibana#71008) and will be available in the telemetry cluster mapping once elastic/telemetry#393 is complete.

graphaelli · 2020-07-17T19:34:58Z

The 7.9 milestone has been met for the first implementation; all agents are expected to follow by 7.10 or sooner.

alex-fedotyev added apm-agents poll labels Apr 16, 2020

graphaelli mentioned this issue Apr 16, 2020

Cloud provider intake elastic/apm-server#3660

Closed

axw mentioned this issue May 4, 2020

Cloud metadata elastic/apm-server#3729

Merged

6 tasks

basepi mentioned this issue May 8, 2020

[Feature] Collect cloud metadata elastic/apm-agent-python#822

Closed

axw mentioned this issue May 20, 2020

exporter/elasticexporter: add Elastic APM exporter open-telemetry/opentelemetry-collector-contrib#240

Merged

This was referenced Jun 30, 2020

Standardize CLOUD_PROVIDER config #289

Closed

Add cloud metadata to agent development docs #290

Merged

eyalkoren mentioned this issue Jul 1, 2020

Collect cloud metadata elastic/apm-agent-java#1264

Closed

axw mentioned this issue Jul 1, 2020

Collect cloud metadata elastic/apm-agent-go#786

Closed

SergeyKleyman mentioned this issue Jul 1, 2020

Collect cloud identification metadata elastic/apm-agent-php#76

Open

graphaelli added this to the 7.9 milestone Jul 6, 2020

graphaelli modified the milestones: 7.9, 7.10 Jul 17, 2020

graphaelli closed this as completed Jul 17, 2020

gregkalapos mentioned this issue Jul 28, 2020

Collect cloud metadata elastic/apm-agent-dotnet#918

Closed

mikker mentioned this issue Aug 4, 2020

Collect cloud identification metadata elastic/apm-agent-ruby#841

Closed

felixbarny added discussion and removed poll labels Aug 25, 2020

felixbarny linked a pull request Sep 3, 2020 that will close this issue

Add cloud metadata to agent development docs #290

Merged

russcam mentioned this issue Nov 3, 2020

Add Cloud Metadata elastic/apm-agent-dotnet#1003

Merged
