Agents should collect cloud identification metadata #256

Closed
6 of 7 tasks
alex-fedotyev opened this issue Apr 16, 2020 · 20 comments · Fixed by #290

Comments

@alex-fedotyev

alex-fedotyev commented Apr 16, 2020

Description of the issue

There are multiple use cases where knowing the cloud provider would help our users and ourselves:

  • App ops knowing where the applications are hosted
  • Linking across apm, logs and metrics
  • Elastic - monitoring where customers are deploying their applications to prioritize installation, integration and coverage work

Agents would need to report the following optional fields (based on ECS):

  • cloud.availability_zone
  • cloud.account.id
  • cloud.account.name
  • cloud.instance.id
  • cloud.instance.name
  • cloud.machine.type
  • cloud.project.id
  • cloud.project.name
  • cloud.provider
  • cloud.region
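
As a rough illustration of how these optional fields might be assembled into a nested payload, the sketch below nests dotted ECS names into a dict and drops anything that was not collected. The helper name `nest_cloud_fields` and the sample values are hypothetical and not taken from any agent.

```python
def nest_cloud_fields(flat):
    """Turn {"cloud.account.id": "123"} into {"cloud": {"account": {"id": "123"}}}.

    Fields whose value is None are skipped, since every cloud field is optional.
    """
    nested = {}
    for dotted, value in flat.items():
        if value is None:
            continue
        node = nested
        *parents, leaf = dotted.split(".")
        for key in parents:
            # Create intermediate objects on demand.
            node = node.setdefault(key, {})
        node[leaf] = value
    return nested


example = nest_cloud_fields({
    "cloud.provider": "aws",
    "cloud.region": "us-east-1",
    "cloud.availability_zone": "us-east-1a",
    "cloud.instance.id": "i-0123456789abcdef0",
    "cloud.project.id": None,  # not collected here, so it is omitted
})
```

Omitting absent fields instead of sending empty strings keeps the "optional" semantics explicit.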

APM Server can currently report this data for co-located applications (it must be installed on the same host as the application).
This is the code that collects cloud metadata:
https://github.com/elastic/beats/tree/0ef472268ea41881f81accabe2af6cfb72eef682/libbeat/processors/add_cloud_metadata

Agent spec PR: #290

Agent Issues

@basepi
Contributor

basepi commented Apr 21, 2020

We did a lot of this type of metadata collection in HubbleStack, which I worked on in my previous job. It can probably be improved (I didn't actually write this code) with environment variable checks and the like. Including here just for reference. https://github.com/hubblestack/hubble/blob/develop/hubblestack/extmods/grains/cloud_details.py

@basepi
Contributor

basepi commented Apr 21, 2020

I'd be happy to lead the POC on this with the Python agent, once roadmap discussions have happened.

@axw
Member

axw commented Apr 22, 2020

It can probably be improved (I didn't actually write this code) with environment variable checks and the like

It would be great if we didn't have to make a network request for each cloud provider type, by detecting which cloud we're running in by some other means first. e.g. for EC2: https://serverfault.com/questions/462903/how-to-know-if-a-machine-is-an-ec2-instance. If nothing else, configuration to specify the cloud provider, which might be "none" to disable sniffing altogether.

@basepi
Contributor

basepi commented Apr 22, 2020

Agreed. That's the primary way I would improve the linked code. :) It's just a useful reference for what kind of data is available and how to access each metadata endpoint for each cloud provider.

@graphaelli
Member

Following elastic/ecs#816 all fields are now in ECS, some still to be included in a release.

@axw
Member

axw commented May 4, 2020

I've taken the liberty of changing cloud.availability.zone to cloud.availability_zone. I assume that was a typo.

@graphaelli
Member

Intake for these will be available as of 7.8 and is now available in nightly snapshots.

@basepi a POC and guidelines for all agents to follow when implementing collection of this information would be great! I'll follow up on prioritization.

@basepi
Contributor

basepi commented May 8, 2020

Tracking: elastic/apm-agent-python#822

@basepi
Contributor

basepi commented May 12, 2020

It would be great if we didn't have to make a network request for each cloud provider type, by detecting which cloud we're running in by some other means first. e.g. for EC2: https://serverfault.com/questions/462903/how-to-know-if-a-machine-is-an-ec2-instance. If nothing else, configuration to specify the cloud provider, which might be "none" to disable sniffing altogether.

After some initial investigation, I don't think there's a reliable way for us to detect the cloud provider without making network requests. If you go through the above linked thread for EC2, you'll find that there are endless edge cases (well-documented in this answer) and it seems that the most reliable way is indeed to hit the metadata server.

Unfortunately, if the metadata server isn't there, we have to wait for the request to time out. On the bright side, we should only need to do this once, on startup. We can also be pretty aggressive with our timeouts since the metadata server should be very low latency. (The cloud metadata timeout will also be configurable, in case we're too aggressive in some cases.)

My plan is to provide configuration to specify the cloud provider, as recommended by @axw above, including the ability to disable cloud metadata generation completely. I'll also implement some of the low-hanging fruit checks to reduce the blind checks as much as possible.
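
A minimal sketch of what such a setting could look like, assuming a hypothetical option named `cloud_provider`. The "none" value comes from the suggestion above; the "auto" default and the function name are assumptions made for illustration.

```python
KNOWN_PROVIDERS = ("aws", "gcp", "azure")


def providers_to_probe(setting):
    """Map a hypothetical `cloud_provider` setting to the providers worth querying."""
    setting = (setting or "auto").strip().lower()
    if setting == "none":
        # Sniffing disabled altogether: no network requests are made.
        return []
    if setting == "auto":
        # No hint from the user: fall back to trying each provider in turn.
        return list(KNOWN_PROVIDERS)
    if setting in KNOWN_PROVIDERS:
        # An explicit provider means a single metadata request at most.
        return [setting]
    raise ValueError("unknown cloud provider: %r" % setting)
```

With an explicit provider the agent would hit one metadata endpoint instead of three, which is the main saving this option offers.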

@basepi
Contributor

basepi commented Jun 12, 2020

My solution is code complete here. I still need to add some tests, and do some manual end to end testing, but it's working on AWS, Azure, and GCP.

Notice that there is no provider guessing. I have found no consistent method to detect provider outside of querying the metadata services.

In fact, even those AWS methods defined in the serverfault thread don't work. My AWS machine doesn't have amazonaws.com in hostname -d.

According to Amazon, the best way is to hit the metadata server. You can query the system's UUID but it's not guaranteed to be accurate because there's nothing stopping non-EC2 servers from starting their UUID with ec2.

For Azure, we could check against Azure's IP blocks, but that's a changing list we don't want to have to maintain.

dmidecode is one place we could probably get the required information for Azure (and maybe for GCP, I haven't checked), but it requires sudo which we won't (or, at least, shouldn't) have access to.

Luckily, all of the metadata services rely on non-routable addresses which fail immediately in my testing. So using trial and error should add effectively no overhead. (Even if it did, it only needs to happen once.)

@felixbarny
Member

Would you gather the cloud metadata in a blocking or async way on startup? If async, what should we do when we want to send events to APM Server before the metadata is available? Some options:

  • Send with incomplete metadata
  • Don't capture transactions until we have computed the metadata
  • Delay intake API requests until we have computed the metadata (queuing events in the meantime)

@basepi
Contributor

basepi commented Jun 16, 2020

In the Python agent, it's blocking when we're setting up the transport thread. In my testing these metadata services are all local to the box and extremely fast, so I expect the overhead to be effectively zero.

@felixbarny
Member

Doing it synchronously sounds like a good tradeoff then, given the complexity of doing it async. I expect the Node.js agent will only have the option to do it async, though.

@basepi
Contributor

basepi commented Jun 16, 2020

Perhaps. In that case I would probably recommend the "Delay intake API requests until we have computed the metadata (queuing events in the meantime)" option.

@basepi
Contributor

basepi commented Jun 24, 2020

The Python implementation is complete and tested. We ended up doing this work in the transport background thread. There are no issues with race conditions: the send queue is ready before the thread starts, but the thread won't start processing the queue until after the metadata generation is complete. This is effectively the "Delay intake API requests until we have computed the metadata (queuing events in the meantime)" option.
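
For other agents, the ordering described here could look something like the toy transport below. It is only a sketch of the idea, not the Python agent's actual transport; the `Transport` class, its method names, and the callbacks are all hypothetical.

```python
import queue
import threading


class Transport:
    """Toy transport: events queue up immediately, sending waits for metadata."""

    def __init__(self, collect_metadata, send):
        self._collect_metadata = collect_metadata  # may block briefly on network I/O
        self._send = send
        self._queue = queue.Queue()  # ready before the thread starts
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def queue_event(self, event):
        self._queue.put(event)

    def close(self):
        self._queue.put(None)  # sentinel: stop after draining earlier events
        self._thread.join()

    def _run(self):
        metadata = self._collect_metadata()  # runs once, before any send
        while True:
            event = self._queue.get()
            if event is None:
                break
            self._send(metadata, event)


# Demonstration with a metadata collector that is held back by an Event.
sent = []
gate = threading.Event()


def slow_metadata():
    gate.wait(timeout=5)
    return {"cloud": {"provider": "aws"}}


transport = Transport(slow_metadata, lambda meta, event: sent.append((meta, event)))
transport.queue_event("transaction-1")  # accepted while metadata is still pending
sent_before_metadata = list(sent)       # nothing can have been sent yet
gate.set()
transport.close()
```

The application thread never blocks on metadata collection; only the background thread does, and every event it sends already carries the complete metadata.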

@basepi
Contributor

basepi commented Jun 30, 2020

@elastic/apm-agent-devs The python solution for this metadata is complete, and can be used as a reference. Please open issues for this for each of your teams. I think we're targeting having this available (even if the UI isn't there yet) for 7.9.0.

@smith

smith commented Jul 1, 2020

elastic/kibana#70465 has been opened to collect some of these fields in the APM telemetry tasks.

@basepi
Contributor

basepi commented Jul 7, 2020

@elastic/apm-agent-devs We found a bug in my implementation of AWS metadata collection. Turns out if you try to PUT against the token endpoint on a docker container in AWS Elastic Beanstalk, it fails with a ReadTimeout. So that token request needs to have exception handling, and short timeouts (with no retries). See my fix here: elastic/apm-agent-python#884
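
A standard-library sketch of the defensive pattern, for agents that want a starting point. This is not the code from the linked fix: the function name, the injectable `urlopen` parameter, the 1-second timeout, and the choice to continue without a token when the PUT fails are assumptions made for illustration. The endpoint paths and header names are the documented EC2 instance metadata ones.

```python
import io
import json
import socket
import urllib.request

AWS_METADATA_HOST = "http://169.254.169.254"


def fetch_aws_metadata(urlopen=urllib.request.urlopen, timeout=1.0):
    """Return cloud fields for EC2, or {} when the metadata service is unavailable."""
    headers = {}
    try:
        token_request = urllib.request.Request(
            AWS_METADATA_HOST + "/latest/api/token",
            method="PUT",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"},
        )
        # Single attempt with a short timeout; urllib does not retry on its own.
        with urlopen(token_request, timeout=timeout) as response:
            headers["X-aws-ec2-metadata-token"] = response.read().decode("utf-8")
    except Exception:
        # Assumption: if the token PUT fails or times out, continue without a token.
        pass
    try:
        document_request = urllib.request.Request(
            AWS_METADATA_HOST + "/latest/dynamic/instance-identity/document",
            headers=headers,
        )
        with urlopen(document_request, timeout=timeout) as response:
            document = json.loads(response.read().decode("utf-8"))
    except Exception:
        return {}
    return {
        "provider": "aws",
        "account": {"id": document.get("accountId")},
        "instance": {"id": document.get("instanceId")},
        "machine": {"type": document.get("instanceType")},
        "region": document.get("region"),
        "availability_zone": document.get("availabilityZone"),
    }


# Stand-in for urlopen that mimics the Beanstalk behaviour: the PUT times out.
def fake_urlopen(request, timeout):
    if request.get_method() == "PUT":
        raise socket.timeout("token endpoint timed out")
    body = json.dumps({
        "accountId": "123456789012",
        "instanceId": "i-0abc",
        "instanceType": "t3.micro",
        "region": "us-east-1",
        "availabilityZone": "us-east-1a",
    })
    return io.BytesIO(body.encode("utf-8"))


result = fetch_aws_metadata(urlopen=fake_urlopen)
```

Keeping the token request in its own `try` block means a timeout there costs at most one timeout period and does not prevent the rest of the metadata from being collected.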

@smith

smith commented Jul 7, 2020

Telemetry support in APM UI for sending AZ, provider, and region is shipping in 7.9 (elastic/kibana#71008) and will be available in the telemetry cluster mapping once elastic/telemetry#393 is complete.

graphaelli modified the milestones: 7.9, 7.10 Jul 17, 2020
@graphaelli
Member

7.9 milestone met for first implementation, all agents are expected to follow by 7.10 or sooner.
