
Support Access Log Service (ALS) #1691

Closed · bgagnon opened this issue Oct 11, 2019 · 29 comments

Labels: area/logging, doc-impact, kind/feature, lifecycle/stale

@bgagnon (Contributor) commented Oct 11, 2019

Structured logging for the Envoy access logs (i.e., JSON access logs) was requested in #624 and implemented in #1511. Envoy supports a more advanced and flexible access logging option: an Access Log Service (ALS).

With ALS activated, Envoy uses gRPC streams to pass rich, strongly typed protobufs with all request details to a sink. This sink is free to do whatever it pleases with the access logs.
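
For context, on the Envoy side this means attaching a gRPC access logger to the HttpConnectionManager; a minimal sketch using the v3 API for clarity (the log and cluster names are placeholders):

```yaml
# Sketch: access_log fragment of an HttpConnectionManager config.
# "als" is a placeholder for whatever cluster the sink is reachable on.
access_log:
- name: envoy.access_loggers.http_grpc
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.grpc.v3.HttpGrpcAccessLogConfig
    common_config:
      log_name: als
      transport_api_version: V3
      grpc_service:
        envoy_grpc:
          cluster_name: als  # must exist in the Envoy bootstrap (or via CDS)
```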

Use cases valuable to us:

  • produce new time series of metrics (gauges, histograms, counters) based on values observed in access logs
  • use a custom JSON format not supported by Envoy (this is a limitation of Add support for JSON logging. #1511)
  • post-process the log messages to enrich/simplify them (ex: user agent parsing shenanigans where every browser claims to be Chrome, Mozilla, Firefox or all of them)
  • enrichment that requires Kubernetes API access (ex: identify the upstream pod and retrieve its metadata such as labels or namespace)

We've implemented this as a proof of concept (our strategy is described in a later comment below).

If Contour supported this, fewer hacks would be needed. I think the minimum would be:

  • a new CLI flag that activates this and allows the user to specify the ALS sink address/port/service
  • a small modification to LDS/RDS responses when that mode is activated

I don't think Contour needs to provide anything related to implementing an ALS receiver, though we'd be happy to contribute this somewhere if there is interest.

This may be too complex for the scope of Contour, and too niche a feature, but I thought I'd file an issue regardless, following @youngnick's recommendation.

@stevesloka (Member) commented Oct 11, 2019

Sounds like this could be implemented like what we discussed for the rate-limiting service: Contour configures the endpoint, and then the receiving end is responsible for implementing its side.

@davecheney (Contributor)

ping @m2 for backlog prioritisation.

@youngnick (Member)

Now that we've added the ExtensionService CRD, it should be much easier to have a way to configure this: we could conceivably have a config file item that asks for Envoy access logs to be sent to a given ExtensionService, with fallback to stdout if it's not available.

However, I think this feature will be much less usable without full support for #2495, as Contour has no way to tell you that the logging you've configured won't work.

@bgagnon (Contributor, Author) commented Sep 18, 2020

FWIW, for our use case, we don't need this to be configurable at the HTTPProxy or route level -- we'd be satisfied with a global ALS config in Contour's YAML config, for example.

But if we can leverage ExtensionService for a more dynamic experience, that's cool too.
Our current implementation works like this:

  • HttpConnectionManager is updated on the :80 and :443 listeners through an LDS proxy
  • the ALS cluster is statically injected into the Envoy bootstrap config and has a single cluster member (127.0.0.1, since it's a sidecar); no EDS or DNS needed (sketch below)
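
The statically injected cluster looks roughly like this (the name and port here are illustrative, not our exact config):

```yaml
# Sketch: static ALS cluster in the Envoy bootstrap; the sidecar listens
# on loopback, so a single STATIC endpoint is enough (no EDS/DNS).
static_resources:
  clusters:
  - name: access-log-sidecar
    type: STATIC
    connect_timeout: 0.25s
    http2_protocol_options: {}  # ALS is gRPC, so the cluster must speak HTTP/2
    load_assignment:
      cluster_name: access-log-sidecar
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 9001
```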

Now, for status propagation all the way to the HTTPProxy.status objects, I'm not sure exactly how that would work. For one thing, it's Envoy that talks to the gRPC ALS service, not Contour. Any errors sending logs to it would be visible only from Envoy logs and/or metrics.

Are you thinking that Contour would initiate its own gRPC checks against the ALS service before handing it off to Envoy? Does it do that just once or continuously? Would the fallback to STDOUT also be dynamic?

If Contour really wants to go down that path...

  • Contour would ship its own ALS sidecar for Envoy
  • this ALS server would output to STDOUT unless given a working destination, in which case it can forward the gRPC messages verbatim
  • the status can be reflected in the HTTPProxy

IMO, it's a lot of moving parts, and likely out of scope for Contour.

@youngnick (Member)

I meant more that, if the ALS service is running inside Kubernetes, it will have a Service object, with Endpoints that will come and go based on healthchecks, so the Endpoints (which Contour watches already) will tell us if the configured service is accepting traffic.

That's a simple way that we could get a Ready indicator for any given Service.

If the ALS is a sidecar, that is interesting. Any design that we propose definitely needs to be able to handle that use case as well, thanks.

@jpeach (Contributor) commented Sep 18, 2020

> I meant more that, if the ALS service is running inside Kubernetes, it will have a Service object, with Endpoints that will come and go based on healthchecks, so the Endpoints (which Contour watches already) will tell us if the configured service is accepting traffic.
>
> That's a simple way that we could get a Ready indicator for any given Service.
>
> If the ALS is a sidecar, that is interesting. Any design that we propose definitely needs to be able to handle that use case as well, thanks.

Maybe. We can consistently support ALS in the ExtensionService model, which lets the services scale independently (and arguably gives better resource management and resource visibility). The cost is increased deployment complexity, and potentially decreased reliability.

One way to support sidecar services could be a new CRD that would cause Contour to inject and configure the sidecar. The difficulty there is that Contour today doesn't manage the Envoy deployment at all, so we would need to deal with that. I guess another way would be a CRD that expresses the operator's commitment to have already configured such a sidecar :)

I don't know that ALS is special WRT being deployed as a sidecar (you could make a similar argument for deploying ext_authz and rate-limit proxies as sidecars).

@sunjayBhatia (Member) commented Dec 11, 2020

> I don't know that ALS is special WRT being deployed as a sidecar (you could make a similar argument for deploying ext_authz and rate-limit proxies as sidecars).

Seems like the sidecar question is a blocker to this work?

Could you sort of cheat and expose services running in a sidecar with an ExternalName or headless Service? (https://kubernetes.io/docs/concepts/services-networking/service/#externalname or https://kubernetes.io/docs/concepts/services-networking/service/#headless-services) That way we wouldn't have to do anything different from the existing ExtensionService model.

(Well, it might not work to try to expose 127.0.0.1 this way.)

Or instead, we could add another concept alongside ExtensionService, something like RawServiceAddress, which could be used to configure Envoy to reach services outside k8s, or inside a Pod via loopback addresses.
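
For illustration, the ExternalName variant could look like the following (all names invented); as noted above, neither variant helps for a pod-local 127.0.0.1 sidecar:

```yaml
# Hypothetical ExternalName Service aliasing an ALS sink that lives
# outside the cluster; ExtensionService could then reference it like
# any other Service. This cannot point at a sidecar's loopback address.
apiVersion: v1
kind: Service
metadata:
  name: als
  namespace: projectcontour
spec:
  type: ExternalName
  externalName: als.example.com
```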

@stevesloka (Member)

I don't think we need ExtensionService in play for this. You create your ALS service, which might be contour-als (for example) or localhost (if running as a sidecar). Envoy is then configured to use that and you log.

I feel like we're over-complicating this problem, or possibly I'm over-simplifying it. =)

@bgagnon (Contributor, Author) commented Dec 11, 2020

Using a sidecar is not a hard requirement.

ALS would work fine behind a Service and Deployment.

The sidecar pattern was our preference only because it scales linearly with the number of Envoy nodes (fair balancing) and offers the convenience of not needing endpoint discovery.

@sunjayBhatia (Member)

> I don't think we need ExtensionService in play for this. You create your ALS service, which might be contour-als (for example) or localhost (if running as a sidecar). Envoy is then configured to use that and you log.
>
> I feel like we're over-complicating this problem, or possibly I'm over-simplifying it. =)

Yeah, if the ask is just to add gRPC access logging support to the HTTP connection manager, this sounds right.

Using ExtensionService implies some more granular listener access logging, but we don't support that today, so it's maybe a moot point.

@sunjayBhatia (Member) commented Dec 11, 2020

If the consensus is to add to the existing global access log configuration, the required pieces potentially look like this (spelling it out, since I'm new, to validate my understanding):

  • Add the Access Log Service gRPC cluster to the contour bootstrap configuration (config file or CLI params? The config file seems better so we don't have to duplicate flags between the bootstrap and serve commands); see the sketch below
    • ALS address (can be a Service name, IP, or other DNS name) and port
    • TLS parameters (do we want to reuse the existing Envoy parameters for xDS or have a new set?)
    • Bootstrap writes some good defaults for the gRPC cluster (similar to the xDS config?)
    • ALS configuration will override file-based access logging (or vice versa? Maybe only one can be specified?)
  • On Listener filter responses, add the envoy.access_loggers.http_grpc access logger, pointing to the cluster configured in the bootstrap, to each HTTP filter chain
    • Will need to generate a distinct name for each access logger
    • If we were to allow an address or an explicit k8s service name+namespace, we could check the service status and configure the ALS based on that, falling back to stdout as alluded to above. Not sure if that is desirable or too much complexity for a first pass; we could always start with address+port and add an explicit k8s service+status check later
    • Might need to allow transport_api_version to be configurable between xDS v2 and v3, but other than that it seems like we can leave all the defaults for now

Let me know if this is involved enough to require a design doc instead.
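
To make the bootstrap bullet concrete, the config file additions might look something like this; every key below is hypothetical, purely to illustrate the shape:

```yaml
# Hypothetical Contour config file keys consumed by `contour bootstrap`;
# none of these keys exist today.
accesslog:
  als:
    address: contour-als.projectcontour  # Service name, IP, or DNS name
    port: 9001
    tls:                                 # reuse the xDS TLS params, or a new set?
      ca-certificate: /certs/ca.crt
      client-certificate: /certs/tls.crt
      client-key: /certs/tls.key
    transport-api-version: v3            # per the last bullet above
```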

@bgagnon (Contributor, Author) commented Dec 14, 2020

The list of request/response headers to send to the ALS is one of the configuration items needed; simple ones like Content-Type are not included by default.
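
In Envoy terms this maps onto the additional_request_headers_to_log / additional_response_headers_to_log fields of HttpGrpcAccessLogConfig, e.g. (cluster and log names are placeholders):

```yaml
# Sketch: asking Envoy to include extra headers in each ALS log entry.
typed_config:
  "@type": type.googleapis.com/envoy.extensions.access_loggers.grpc.v3.HttpGrpcAccessLogConfig
  common_config:
    log_name: als
    transport_api_version: V3
    grpc_service:
      envoy_grpc:
        cluster_name: als
  additional_request_headers_to_log: ["content-type", "user-agent"]
  additional_response_headers_to_log: ["content-type"]
```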

@jpeach (Contributor) commented Dec 14, 2020

The reason I brought up ExtensionService WRT sidecars is that for all these types of problems you need (1) to express where the API endpoint is, and (2) to express the policy for using the endpoint. Although ExtensionService was built for expressing remote API endpoints, I think it's a logical extrapolation of the concept to use it to say that the endpoint is running locally as a sidecar, or is something else more abstract than a Kubernetes Service. I have not thought through what the YAML would look like, but IMHO it's a consistent approach. What we should try to avoid (as much as we can) is having different ways to express similar concepts.

@youngnick (Member)

I agree that the point of ExtensionService is to express the location and the policy for a cluster we're going to tell Envoy about. That would fit well with "you're connecting to localhost" at a conceptual level, but we probably need some more general guidance about how we do policy for the cluster (this has come up in the discussion about tracing as well).

For the specific config here, we're going to need something similar to the discussions about Tracing in #399, probably.

@xaleeks commented Mar 2, 2021

Nick, are we ready to pick this up within the next couple of releases? Regarding granularity, I absolutely agree that we start at the global instance level and move down to the route level only if we really need to. Further, on James's point about watching the Envoy deployment: if we go with that approach, can we think about adding this logic to the Contour Operator? I know we're already stuffing a lot into the Operator.

Furthermore, can these access logs then be streamed to something like Fluent Bit? To be quite transparent, Fluent Bit is a very popular solution that is already well integrated into a few telemetry platforms in downstream DIY Kubernetes offerings on the market, where logs can be post-processed, plotted in time-series charts, etc.

@youngnick (Member) commented Mar 2, 2021

Envoy has its own gRPC logging protocol: that's ALS. Currently, it seems that Fluent Bit doesn't support ingesting ALS. So to fully implement this, we may need to implement a basic ALS sink (similar to how we made contour-authserver), or add ALS support to other projects (like Fluent Bit).

The work of actually adding the config items to enable sending ALS from Envoy is a reasonably straightforward addition to the contour bootstrap command, but being able to verify that the logs are being received requires us to have a sink that supports the ALS gRPC API.

So we can definitely provide the facility to configure this, but we won't be able to validate that it works in CI until we have a sink we can check.

@abhide (Member) commented Mar 3, 2021

@youngnick @skriss I am interested in working with someone to add ALS support to Contour.
I have been playing around with Envoy ALS and wrote a simple ALS sink.
I wrote a gist on how this can be configured: https://gist.github.com/abhide/805ea927500a73658ef696f04961a7d9

@youngnick As you said, we will have to add the ALS cluster as part of the contour bootstrap configuration, and this seems pretty easy. But users can specify whether access logs should be sent as part of the listener configuration in HttpConnectionManager. Take a look at https://github.com/abhide/envoy-getting-started/blob/master/k8s/als.yaml#L17
I'm trying to wrap my head around Contour's ExtensionService and how RLS achieves this.

Let me know what you think.

@youngnick (Member)

This is great work @abhide!

Yes, what I see is that we'd have a way to tell the Contour bootstrap to configure ALS by looking at a named ExtensionService object. The ExtensionService object then allows us to configure the cluster that will be generated in the bootstrap (timeouts and load-balancing policy, mainly). In addition, you can point an ExtensionService object at localhost to send traffic to another container running inside the Envoy Pod. The downside to this approach is that the ExtensionService must exist and be parsed by contour bootstrap, which is a fair bit of extra work.
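
For illustration, such an ExtensionService might look like this (name, namespace, port, and timeout are all invented):

```yaml
# Hypothetical ExtensionService describing the ALS sink; the bootstrap
# would translate it into an Envoy cluster with these settings.
apiVersion: projectcontour.io/v1alpha1
kind: ExtensionService
metadata:
  name: als
  namespace: projectcontour
spec:
  protocol: h2c          # ALS is gRPC; use h2 if the sink terminates TLS
  services:
  - name: contour-als    # the Kubernetes Service backing the sink
    port: 9001
  timeoutPolicy:
    response: 100ms
```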

You make a good point that we also need to decide how to enable this. I had originally thought that we would just have a boolean config item for contour serve (in the config file) that would enable sending logs to the configured ALS server, but given that we don't have any way to know whether the ALS server is configured in the bootstrap, it's a little complicated. I think we should start from "the Contour owner mandates everyone use ALS" and see how we go.

Maybe that looks like: We tell the controller the ExtensionService that controls ALS, and that serves both as a boolean that ALS should be enabled, and a check that the service is working before we configure the connection manager to do it for everything? This would be a global config file item for contour serve. Then, if people ask for it, we can add a "disable ALS for this vhost" to HTTPProxy at a later date. @bgagnon, any thoughts on what you'd like?

However, I think that this feature needs a design document that walks through what to add to contour bootstrap and contour serve, lays out a preferred option, and explains some alternatives and why they were not chosen.

@abhide Do you want to make a start and open a WIP PR? I'll help you with polish, so please feel free to submit early and we'll iterate.

@abhide (Member) commented Mar 4, 2021

Thanks @youngnick. Will start a WIP design doc PR and we can collaborate on this.

abhide added a commit to abhide/contour that referenced this issue Mar 16, 2021
Updates: projectcontour#1691

Signed-off-by: Amey Bhide <amey15@gmail.com>
@xaleeks added the kind/feature, area/logging, and doc-impact labels Apr 13, 2021
@sunjayBhatia (Member)

Seems like this will slip to 1.16, ok to move?

@youngnick (Member)

Yep, move this one out to 1.16.

@youngnick (Member)

Still in flight, moving to 1.17.

@skriss modified the milestones: Backlog, 1.19.0 Jul 29, 2021
@youngnick modified the milestones: 1.19.0, Backlog, 1.20.0 Sep 21, 2021
@youngnick (Member)

Okay, @abhide's design PR was auto-closed by the staleness bot, so I thought I would do an update of what I think the current status is here, and hopefully we can get this moving again.

What this comes down to is that adding the ALS config isn't very hard, but finding the correct place to specify that config is. Currently, the example Contour install recommends running contour bootstrap as an initContainer inside the Envoy Pods (whether they are a DaemonSet or a Deployment) to generate Envoy's bootstrap configuration. This bootstrap config is set up solely by convention and is not configurable by the end user at all.

In order to specify ALS config, Envoy needs the relevant clusters configured in its bootstrap. This doesn't seem changeable without a lot of Envoy work.

Some of the things we discussed in the PR:

  • We could use the ExtensionService object to define an ALS cluster, and pass that ExtensionService object to Envoy's bootstrap config somehow. We discussed getting that config to Envoy in a few ways:
    • via annotations on the ExtensionService objects. This would require the contour bootstrap process to talk to the Kubernetes API, which the Envoy Pod currently doesn't need to do. I, personally, am against granting the Envoy Pods any access to the apiserver without a very good reason, as they are in the request path and highly likely to be attacked.
    • via mounting the Contour config configmap into the Envoy Pod. This has the advantage that it's straightforward, and contour serve can use the existing config packages to bind to it. However, it doesn't intersect well with plans to move Contour's config to a CRD, since we will run into the "granting Envoy Pod apiserver access" problem again.
    • via always defining a logging cluster in the bootstrap (with a standard name) that would only be populated and used in the event that ALS is configured. At the time, I believed that this was hiding the real problem (that we have no way to configure the bootstrap), but I think I was being unfair, and this is a reasonable short-term solution; a minimal sketch follows this list.
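
A minimal sketch of that last option (the cluster name is invented; nothing here is implemented):

```yaml
# Hypothetical: `contour bootstrap` always emits a logging cluster under a
# conventional name; its endpoints are filled in only when ALS is enabled
# at bootstrap-generation time.
static_resources:
  clusters:
  - name: contour-als
    type: STRICT_DNS
    connect_timeout: 0.25s
    http2_protocol_options: {}
    load_assignment:
      cluster_name: contour-als
      endpoints: []  # empty unless ALS is configured
```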

For the next steps, I think that with the Managed Envoy (#3545) work coming, that's the best way to add this sort of functionality. We'll probably need a backstop as well for people who don't use managed Envoy, which I agree could be the named-by-convention, always-included-in-bootstrap cluster, maybe. @projectcontour/maintainers, thoughts?

@skriss modified the milestones: 1.20.0, 1.21.0 Jan 4, 2022
@skriss modified the milestones: 1.21.0, 1.22.0 May 3, 2022
@skriss modified the milestones: 1.22.0, 1.23.0 Jul 21, 2022
@github-actions (bot)

The Contour project currently lacks enough contributors to adequately respond to all Issues.

This bot triages Issues according to the following rules:

  • After 60d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the Issue is closed

You can:

  • Mark this Issue as fresh by commenting
  • Close this Issue
  • Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

@github-actions bot added the lifecycle/stale label Jun 16, 2024
@github-actions (bot)

The Contour project currently lacks enough contributors to adequately respond to all Issues.

This bot triages Issues according to the following rules:

  • After 60d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the Issue is closed

You can:

  • Mark this Issue as fresh by commenting
  • Close this Issue
  • Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

@github-actions bot closed this as not planned Jul 20, 2024