
Kafka protocol filter #2852

Closed
mattklein123 opened this issue Mar 20, 2018 · 70 comments
Labels: design proposal (needs design doc/proposal before implementation), enhancement (feature requests, not bugs or questions), help wanted
@mattklein123
Member

mattklein123 commented Mar 20, 2018

It's looking like Lyft may be able to fund Kafka protocol support in Envoy sometime this year.

Community, can you please chime in on what you would like to see? I know this will be a very popular feature. Stats (as in the Mongo filter) are a no-brainer. What else? Eventually routing and load balancing for languages in which the Kafka client drivers are not as robust?

@mattklein123 mattklein123 added enhancement Feature requests. Not bugs or questions. design proposal Needs design doc/proposal before implementation labels Mar 20, 2018
@gwenshap

gwenshap commented Mar 20, 2018

First, this is totally awesome.
Second, as someone with Kafka experience but rather new to service meshes, I have two types of suggestions: some additional features I'd want to use if I had a Kafka proxy, and a very unbaked suggestion for a different way to look at Kafka/mesh integration that I'd like to discuss.

Let's start with additional features (not all are my idea, folks from Confluent helped!):

  • You can use the proxy to validate events. Because Kafka is "content agnostic", misbehaving clients can write literally anything. A proxy can validate that the message is in Protobufs (or whatever), that it has mandatory headers, etc.
  • Rate limiting is useful.
  • Add headers that allow tracking lineage of events - this was one of the reasons headers were added to Kafka.
  • The message format on the server can't be bumped up until all the clients have upgraded, which can delay the introduction of new features for a long time. A proxy can convert the format.
  • Count events for monitoring
  • Really cool if possible: Failover to a DR cluster. This is easy for producers and currently super tricky for consumers (because offsets). Not sure if a service-mesh is enough for that one.
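A proxy-side validation hook like the one described above could be sketched as follows (Python used for illustration; the function name and the required-header policy are hypothetical, and a JSON parse stands in for a real Protobuf/Avro schema check):

```python
import json

# Hypothetical policy: headers every produced record must carry.
REQUIRED_HEADERS = {"trace-id", "schema-version"}

def validate_record(headers: dict, value: bytes) -> list:
    """Return a list of policy violations for one produced record."""
    errors = []
    missing = REQUIRED_HEADERS - headers.keys()
    if missing:
        errors.append(f"missing headers: {sorted(missing)}")
    try:
        json.loads(value)  # stand-in for a Protobuf/Avro schema check
    except ValueError:
        errors.append("value is not valid JSON")
    return errors
```

A well-formed record produces an empty error list; a filter would reject (or annotate) anything else before it reaches the broker.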

Here's the other point-of-view:
Kafka isn't just a service, "kinda like a database", Kafka is a way of sending messages from one service to another. Kinda like a transport layer, except async and persistent. I wonder if Envoy can integrate with Kafka even deeper and allow services to use Kafka to communicate with other services (instead of REST, gRPC, etc). And then you can "hide" transition from REST to Kafka communication in the same way Lyft used Envoy to move to gRPC.
Not sure if this makes total sense, since async programming is different, but worth mulling over.

@travisjeffery

travisjeffery commented Mar 20, 2018

This would be dope. A couple of use cases to start are request logs and stats. You can also build a nice audit log by taking your request logs and enriching them with the user's info. This could also help people write their own Kafka filters, adding features like upconverting old Kafka clients to newer protocol versions.

@mattklein123
Member Author

Thanks @gwenshap those are all great ideas. Would love to discuss more. If Confluent is potentially interested in helping with this (even if just design) can you or someone else reach out to me? My email address is easy to find or you can DM me on Twitter to connect.

@mattklein123
Member Author

Other interesting ideas that come to mind:

  • Add Kafka support to the upcoming tap/dump feature so that we can dump to a Kafka stream.
  • Shadow requests to a Kafka stream instead of HTTP/gRPC shadow.

@theduderog

theduderog commented Mar 20, 2018

@mattklein123 Do you mind explaining the primary use case you had in mind? Would this be a "Front Envoy" that might be used for ingress into Kubernetes? Or would a side car proxy pretend to be all Kafka brokers to local clients?

By "Front Envoy", I mean something like slide 15 in your deck here.

@wushujames

  • tracing/monitoring: who is writing to which topics? Create a graph of data flow from producers to topics to consumers. https://logallthethings.com/2017/05/17/visualizing-the-flow-of-data-as-it-moves-through-a-kafka-cluster/
  • add compression to apps which aren’t already using it
  • add/remove SSL for traffic headed to/from broker
  • fault injection. Trigger consumer rebalances, broker down, producer transaction failures
  • metrics: byte rate per client id. Byte rate per consumer group.
  • like @gwenshap said: validate that requests have certain attributes. Example: CreateTopic requests must have minimum replication factor. Like https://kafka.apache.org/0110/javadoc/org/apache/kafka/server/policy/CreateTopicPolicy.html but for all kafka API types.
  • automatic topic name conversion to/from a cluster. Like, an app would publish to topic foo, and it would actually go to topic application.foo. This would allow multi tenant clusters, but the application would think they have the whole namespace to themselves.
  • consumer lag monitoring for the entire datacenter
  • metrics about which apps are using which versions of the client libraries
  • +1 on failover for consumers to another datacenter. You can do offset->timestamp conversion on one datacenter, and then do timestamp->offset conversion on the failover datacenter.
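The topic-namespacing idea above (app publishes to `foo`, the cluster actually stores `application.foo`) can be sketched as a pair of pure rewrite functions; the names and the dot-prefix convention are illustrative, not part of any filter API:

```python
def to_cluster_topic(app_topic: str, tenant: str) -> str:
    """Rewrite the topic name on the way to the broker: foo -> application.foo."""
    return f"{tenant}.{app_topic}"

def to_app_topic(cluster_topic: str, tenant: str) -> str:
    """Reverse the rewrite on responses so the app never sees the prefix."""
    prefix = tenant + "."
    if not cluster_topic.startswith(prefix):
        raise ValueError(f"topic {cluster_topic!r} is outside tenant {tenant!r}")
    return cluster_topic[len(prefix):]
```

The reverse direction doubles as an isolation check: a response naming a topic outside the tenant's prefix is a policy violation rather than something to pass through.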

@alexandrfox

alexandrfox commented Mar 20, 2018

Awesome that you guys are looking into Kafka protocol support, that'd be an amazing feature to have!
+1 to @gwenshap and @wushujames ideas, also:

  • dynamic routing (for multicluster setups) of producers and consumers. This, in conjunction with a control plane would be a killer-feature: cluster/topic drain and rebalancing operations made easy;
  • double-producing (e.g. if user wants to produce data to 2 or more clusters/topics at the same time);

@sdotz

sdotz commented Mar 20, 2018

Here are some ideas I would find useful (some already mentioned)

  • Monitor consumer lag
  • Failover to another cluster/datacenter while maintaining log position (hard due to offset mismatch)
  • Mirroring topics to another cluster, or teeing publishes "exactly once" to maintain identical clusters
  • Automatic topic switching e.g. specify my_topic_* to consume my_topic_1 and switch to my_topic_2 when it becomes available, transparently to the consumer. This would be useful for data migrations without interrupting consumption. In other terms, the ability to hot swap topics unbeknownst to the consumer.
  • Filter data on the server before sending to the consumer.
  • Producer rate limiting
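The offset-to-timestamp-to-offset failover trick mentioned in this thread can be sketched in pure Python. A real implementation would build the timestamp index with Kafka's ListOffsets/OffsetsForTimes APIs against each cluster; the function and data shapes here are hypothetical:

```python
import bisect

def failover_offset(timestamp_ms: int, dst_index) -> int:
    """
    dst_index: (timestamp_ms, offset) pairs for the DR partition, sorted by timestamp.
    Returns the smallest DR offset at or after the given timestamp, i.e. where a
    consumer should resume after failing over (re-reading some records is possible).
    """
    timestamps = [t for t, _ in dst_index]
    i = bisect.bisect_left(timestamps, timestamp_ms)
    if i == len(dst_index):
        # Timestamp is newer than anything in DR: resume at the log end.
        return dst_index[-1][1] + 1 if dst_index else 0
    return dst_index[i][1]
```

Resuming at-or-before the last committed timestamp trades duplicates for no data loss, which is usually the right default for failover.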

@mbogoevici

Between @mattklein123 @gwenshap and @wushujames this is an awesome list of features.

As a general question, particularly for Matt: would you see any value in capturing some of the more generic features and turning them into a higher-level abstraction for messaging support in the service mesh?

@sdotz

sdotz commented Mar 20, 2018

Perhaps also look at some of what kafka-pixy does. I find the wrapping of Kafka's native protocol with REST/gRPC to be pretty compelling. This better supports usage from FaaS and apps that don't necessarily have the ability to hold a long-lived connection.

@rmichela

I'd like to see Envoy's Zipkin traces reported to Zipkin using Zipkin's Kafka collector.

@mattklein123
Member Author

Thanks everyone for the awesome suggestions that have been added to this issue. From Lyft's perspective, we are primarily interested in:

  • L7 protocol parsing for observability (stats, logging, and trace linking with HTTP RPCs)
  • Ratelimiting at both the connection and L7 message level

So I think this is where we will focus, probably starting in Q3. I will need to go through and do some basic SWAGing in terms of how much existing code in https://github.com/edenhill/librdkafka can be reused for the protocol parsing portion. We will also coordinate with folks at Confluent on this work as well. Please reach out if you are also interested in helping.

@ebroder

ebroder commented Mar 29, 2018

Are there any plans at this point for how to practically proxy the Kafka protocol to a pool of brokers? In general, clients connect to a seed node and send it a "metadata" request for the topic/partition they're interested in. The response to that includes a hostname and port, which clients then connect to directly. It means that in practice Kafka clients are (by design) very good at dis-intermediating proxies.

@gwenshap

@ebroder One way to do it would be to register the proxy address (probably localhost:port if we are using a sidecar) as the brokers' advertised listeners. Then they'll return this address to the clients.
In the latest release, advertised hosts will be a dynamic property, so this may become even easier to manage.
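Concretely, the sidecar approach described here pairs a broker config like the following with an Envoy listener on the advertised port (port numbers are illustrative, not from this thread):

```
# server.properties on each broker
# Address the broker actually binds:
listeners=PLAINTEXT://0.0.0.0:9092
# Address handed back to clients in Metadata responses -- points at the
# local Envoy sidecar instead of the broker itself:
advertised.listeners=PLAINTEXT://localhost:19092
```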

@wushujames

@gwenshap: Interesting. That would imply a Kafka cluster that would only work with the sidecars then, right?

@gwenshap

I didn't mean to imply that. That's why I said "one way". I know @travisjeffery and @theduderog have ideas about central proxies. Sidecars do seem to be Envoy's main mode of deployment.

@ebroder

ebroder commented Mar 29, 2018

That does require allocating a sidecar port for every Kafka broker you're running, right? It seems like the overhead/management costs there could add up quickly.

@gwenshap

I'm not sure? How expensive are ports? Kafka clusters with over 50 brokers are quite rare.

@mattklein123
Member Author

mattklein123 commented Mar 29, 2018

@ebroder @wushujames @gwenshap TBH I really have not gotten into the details yet. If the Kafka protocol does not support a built-in method of proxying (we should discuss), I think there are a few options:

  • Pre-configure all broker addresses in Envoy and have the seed node return Envoy addresses. Pro: Conceptually simple, Con: Annoying to configure.
  • Use some type of iptables interception to make sure all Kafka connections go through Envoy on the way to the brokers. Pro: Transparent. Con: Requires kernel/external scripts. Needs more investigation and thinking.
  • Have Envoy do active L7 proxying/mutation of the seed communication, and swap broker addresses with local Envoy, and then potentially remember which broker to send which messages to. Pro: No kernel magic, fully transparent. Con: Very complicated, involves handling parts of the client/broker handshake.

But again I haven't done any investigation. I was going to carve out some time to learn more about all of this in Q2 and potentially find some people who would like to help me learn more about it. :)
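Option 3 above boils down to rewriting the broker list inside Metadata responses while remembering the reverse mapping. A toy sketch of that bookkeeping, assuming one local Envoy listener per broker (all names and ports hypothetical; the real filter would operate on decoded Kafka protocol frames):

```python
def rewrite_metadata(brokers, envoy_host="127.0.0.1", base_port=19092):
    """
    brokers: list of (node_id, host, port) tuples from a Metadata response.
    Returns (rewritten_brokers, route_table): every broker address is replaced
    by a local Envoy listener, and route_table remembers which listener
    forwards to which real broker.
    """
    rewritten, routes = [], {}
    for i, (node_id, host, port) in enumerate(sorted(brokers)):
        local_port = base_port + i  # one dedicated listener per broker
        rewritten.append((node_id, envoy_host, local_port))
        routes[(envoy_host, local_port)] = (host, port)
    return rewritten, routes
```

The route table is the "remember which broker to send which messages to" part of option 3; each local listener simply proxies to its assigned broker.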

@wushujames

@mattklein123 @gwenshap @ebroder: Yeah, I had the same idea as Matt's option 3. Since the initial request to the brokers has to flow through the sidecar anyway, it can intercept and rewrite the response back to the client, and transform the request/responses as they flow between client/broker. Sounds expensive to me, but I know very little about envoy's performance.

sderosiaux added a commit to sderosiaux/every-single-day-i-tldr that referenced this issue Mar 30, 2018
@ilevine

ilevine commented Apr 4, 2018

@mattklein123 @gwenshap @ebroder @wushujames: take a look at https://medium.com/solo-io/introducing-gloo-nats-bring-events-to-your-api-f7ee450f7f79 and https://github.com/solo-io/envoy-nats-streaming. We created a NATS filter for Envoy and would love to get your thoughts.

@AssafKatz3

As @wushujames mentioned:

automatic topic name conversion to/from a cluster. Like, an app would publish to topic foo, and it would actually go to topic application.foo. This would allow multi tenant clusters, but the application would think they have the whole namespace to themselves.

This will be very useful for canary releases or blue/green deployments, since it will allow modifying the actual topic without any change in the application.

@mattklein123 mattklein123 self-assigned this May 12, 2018
@georgeteo

@mattklein123: There have been a lot of requests in this thread. Will there be a design doc with a list of which requested features will be supported?

@mattklein123
Member Author

@georgeteo yes when I start working on this (unsure when) I will provide a design doc once I do more research.

@alanconway

This may have some structural similarities to the AMQP support I'm working on in #3415. It will probably also need to use upstream filters #173. Raising this so we can watch for opportunities to reuse/cooperate on common infrastructure features in Envoy that support both cases.

@stale

stale bot commented Jul 7, 2018

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@prdoyle

prdoyle commented Jun 18, 2019

What's next now that #4950 is merged?

@mattklein123 mattklein123 modified the milestones: 1.11.0, 1.12.0 Jul 3, 2019
@z0r0

z0r0 commented Jul 18, 2019

Bumping this as well.

@andreportela

any status update on this?

@joewilliams

I am interested in the status on this issue as well. Thanks!

@dedalozzo

Me too.

@emedina

emedina commented Aug 13, 2019

Me as well

@georgicodes

Same!

@mhockelberg

Also interested in this issue.

@samcarpentier

Same! Any updates?

@blaizedsouza

I am interested in the status on this issue as well. Thanks!

@prdoyle

prdoyle commented Aug 21, 2019

Folks, please stop the "me too" posts. The way to express that is to thumbs-up Matt's comment at the top.

@cyxddgithub

How can I join the development of the Kafka protocol filter feature? Is there a checklist so I can pick some tasks to begin with? :) @mattklein123 @adamkotwasinski

@hendrikhalkow

I'd like to add transparent encryption and decryption to the list of features. In contrast to just doing TLS, this would allow me to have a zero knowledge broker.

@mattklein123
Member Author

I'm going to close this as the filter is implemented. Let's please open more specific feature requests for the filter so we can track things in a more granular fashion. Thank you @adamkotwasinski!!!

@adamkotwasinski
Contributor

@mattklein123 yeah, I'm planning to revisit this in April (hopefully) when I start work on the "fat-mesh" filter. Initially it will be very simple: a custom cluster that manages the internal Kafka discovery (somewhat similar to the redis-cluster code) and trivial (non-consumer-group) ProduceRequest & FetchRequest handling.

@adamkotwasinski
Contributor

All right, I got some initial "stateful" proxy features implemented allowing Envoy to act as a facade for multiple Kafka clusters:

  • producer proxy (handles the Produce requests) - Kafka-mesh filter #11936 - records received by Envoy are resubmitted to librdkafka producers that point at the right clusters,
  • consumer proxy (stateful; handles the Fetch requests) - [contrib] kafka: record-distributing Kafka consumer proxy for multiple upstream clusters (mesh-filter new feature) #24372 - Envoy uses embedded librdkafka consumers to consume records from the upstream clusters that match the requests received so far, with some caching. By definition this is stateful (we do not translate a downstream Fetch into an upstream Fetch), and records are distributed amongst all the downstream connections that express an interest in the same topic-partition.

More notes and things that might need to be improved at https://github.com/adamkotwasinski/envoy/blob/ff39845987af5cc5ff8796ad3b683f6a7e8dbe3f/docs/root/configuration/listeners/network_filters/kafka_mesh_filter.rst#notes

@adamkotwasinski
Contributor

All right, given that some code has been pushed to allow for response rewriting, we can now use Envoy without needing Kafka to change its configuration: #30669
This allows any user to just set up their fleet of Envoys (actually listeners) to do e.g. their own limiting or termination without needing to bother the Kafka service owners.

@adamkotwasinski
Contributor

Updated the protocol code to handle Kafka 3.8 : #36166

@adamkotwasinski
Contributor

Broker filter can now filter requests by API Key - #36978
