Potential new sink trait design #2122
Comments
This sounds good, @LucioFranco :) Do you see any possible complications in implementation? Will this change anything about our buffering or batching strategies?
It shouldn't change anything user-facing, only internals. We should be able to achieve the same buffering/batching as we do now.
@lukesteensen it'd be good for you to weigh in here. :)
Overall I think this seems reasonable. We'll very likely want to have more specific APIs for different types of sinks, but this makes sense as the bottom layer. One thing that'd be interesting to think about is whether there's a similarly general design that'd make it easier to "drive" sinks from the outside. This would be very useful for testing purposes and could help get rid of some of the sleeping and flaky tests we have now. Maybe something like a ...
@lukesteensen I am not following what you mean by ...
Basically, something like:

```rust
while let Some(event) = rx.next().await {
    // do stuff with the event here
}
```

If there's a feasible way to do it, it could be nice for the sink API to just be ... This could be totally wrong, just brainstorming to see if there's something that could help simplify tests, etc.
@lukesteensen well for current sinks that use the ..., the loop case would replace the streaming setup. The one thing that isn't solved is the acking, but I think that could be covered with combinator types while still allowing flexibility.
Update on this: I have spent the last few days endlessly cornering myself with our topology code. There are a couple of issues: we cannot use `async fn` in our sink trait since that is not object safe, and we also have a proliferation of uses of the ...
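For context on the object-safety point, a minimal sketch (the trait `DynSink` here is purely hypothetical): an `async fn` in a trait desugars into a method returning an opaque `impl Future`, which cannot go behind `dyn`, so the usual workaround is to return an explicitly boxed future, which is roughly what the `async-trait` crate generates.

```rust
use futures::future::BoxFuture;

// Hypothetical trait used only to illustrate the object-safety point.
pub trait DynSink {
    // NOT object safe: `async fn` desugars to `-> impl Future`, which has no
    // concrete type to place in a vtable, so `Box<dyn DynSink>` would be rejected.
    // async fn run(&mut self) -> Result<(), ()>;

    // Object safe: the returned future is boxed, so the return type is concrete.
    fn run(&mut self) -> BoxFuture<'_, Result<(), ()>>;
}
```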
So here is a quick write-up of some potential things we could do to improve topology. First, I will (as usual) note some problems with our current topology that should be motivation to update things.

### Current topology implementation

Currently, our topology implementation is quite complex. After over a year of working on this code base, I still very much struggle to totally understand what is going on. This is partially due to how complex the task is, but also, I believe, due to a confusing usage of types. To give a high-level understanding of what our current topology does, we can break it down into three main areas.

Sources are the simplest component; they are made up of two tasks: 1) a source-to-channel task (this is the one returned from the ...).

Transforms are a mix of what sinks provide, but they still output. Transforms are built up of three main pieces: 1) a ...

Sinks are similar to transforms except they do not have an output; their output is obviously some downstream service. Instead of being three tasks like transforms, sinks are only made up of two tasks: 1) the same as the transform, where we have a ...

To sum this part up, it usually looks like this:
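(A rough sketch of that flow, assuming the default channel sizes described in the next comment; the labels are descriptive rather than actual type names.)

```text
source task --> [channel, size 1000] --> fanout "pump" --> [channel, size 100] --> transform task
transform task --> fanout --> [sink buffer: in-memory channel or on-disk leveldb] --> sink task
```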
In general this is a very good approach and has worked quite well for us. That said, there are problems with this setup when it comes to upgrading our types. Not only that, we in fact have much more buffer space than what the user may add to their sink config. We currently allocate a size 1000 channel for sources -> our fanout pump, and we also allocate another size 100 channel in between the fanout pump and the input to our transforms. This means that with a basic source -> transform -> sink, we by default buffer up to 1100 events before we even hit our buffer type for the sink. This may not be a problem in the end, but it is still something to note.

So, I mentioned I would talk about fanout. ... This problem continues as we end up using the sink type in a lot of places. This problem becomes paramount with ...

### New proposal

My new proposal is a bit more intrusive than I originally expected, but I think the outcome is much better. What I suggest is that we shift the entire pipeline to use ... The problem here is that all transforms need an extra size 100 channel to buffer events. This is quite unfortunate, and it is a result of our ... This, though, can be solved by a new type called ... For adding a new transform/sink, we can create a new ...

The main issue with this approach is that we can no longer support the replace control action. This is due to the fact that these components need to own the incoming stream type. Because of this, it's almost impossible to rewire it (we might be able to implement some broadcast type that could support this). That said, I personally don't think it's worth it to support this much more complex replace-in-place behavior. We've seen it be a point of confusion in the past, and I believe it makes our topology much more complicated and makes testing this type of behavior very difficult. It also introduces plenty of surface for race conditions if not done correctly.

So, to generally sum up, I think we should simplify our topology implementation and lean on better types like ...

Another major thing I'd like to bring up with this proposal is simplifying our traits; we currently have a few traits all over the place. The big one I can think of is ...

TL;DR: Replace our usage of intermediate channels to satisfy ...

(Also, sorry, I said this would be short but ended up writing a lot; hopefully this can be useful for future devs that come across this code.)
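To make the shape of this concrete, a minimal sketch of a stream-in/stream-out transform; the trait name `StreamTransform` and its signature are illustrative assumptions, not the actual Vector types:

```rust
use futures::stream::BoxStream;

// Placeholder event type for the sketch.
pub struct Event;

// A transform owns its input stream and hands a new stream back to topology,
// so no intermediate channel is required between components.
pub trait StreamTransform: Send {
    fn transform(self: Box<Self>, input: BoxStream<'static, Event>) -> BoxStream<'static, Event>;
}

// Example: a no-op transform that simply forwards events downstream.
struct Noop;

impl StreamTransform for Noop {
    fn transform(self: Box<Self>, input: BoxStream<'static, Event>) -> BoxStream<'static, Event> {
        // A real transform would map, filter, or enrich events here.
        input
    }
}
```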
Noting, #2625 will likely address and supersede this.
Closing as this is superseded by (and largely implemented as part of) our new ...
I decided to not go with an RFC right off the bat, but I'd like some feedback on a new sink trait design.
### Current
Currently, we use the `futures01::Sink` trait as the main entry point for our sinks. We then break out and implement layers on top of this trait to provide the basic batching, partitioning, and streaming implementations that our sinks use. The current sink trait looks like this, and we force the `SinkItem` to be an `Event`. The relevant code is here.

`futures01::Sink` definition:
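(Paraphrased below from the futures 0.1 documentation for reference: `StartSend<T, E>` is an alias for `Result<AsyncSink<T>, E>`, which is what forces `start_send` to hand the item back when the sink is not ready.)

```rust
// futures 0.1 re-exports these type aliases at the crate root.
use futures::{Poll, StartSend};

pub trait Sink {
    type SinkItem;
    type SinkError;

    // Attempt to push an item into the sink. When the sink is applying back
    // pressure, the item is handed back via `AsyncSink::NotReady(item)`.
    fn start_send(&mut self, item: Self::SinkItem)
        -> StartSend<Self::SinkItem, Self::SinkError>;

    // Flush internal buffers, driving any pending work to completion.
    fn poll_complete(&mut self) -> Poll<(), Self::SinkError>;

    // Flush and then close the sink (provided with a default impl in
    // later 0.1 releases).
    fn close(&mut self) -> Poll<(), Self::SinkError>;
}
```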
There are a couple of flaws with this design:

1. It is complicated: it uses a poll-based API which is extremely error prone and must be written very carefully. This only becomes worse with `futures03::Sink`, which is far from stable (compared to `Stream` and `Future`; I also don't know of anyone actively working on stabilizing it). The new sink trait, which can be found here, is much, much more complicated, introducing pinning and two more required fns.
2. The sink trait is incompatible with our `encode_event` model, where we go from one type to another by encoding it into a different type. The reason this is incompatible is that if `start_send` is not able to send (aka applying back pressure), we must return the original event. The issue is that we need to encode the event to send it! Therefore, if we encode it one way, we must decode it to get it back into its original form to return it. This is very costly where it could happen often. The solution is to use an API like `tower::Service`, where there is a `Service::poll_ready` that doesn't require an item to be sent (see the sketch after this list). We already use this type of API within Vector, but we are required to buffer at least one event, and this introduces overhead.
3. Most of our sinks return futures instead of submitting items into a "sink" that gets driven. This leads me to think that an API like `poll_complete` is actually more low level than what we actually need. It again introduces complexity that isn't required and can lead to IO starvation issues like "`BatchSink` does not properly drive the inner IO resources" (#299).
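For reference, point 2 above leans on the general shape of `tower::Service`, paraphrased here from the tower documentation: readiness is polled separately from `call`, so applying back pressure never requires handing an already-encoded item back to the caller.

```rust
use std::future::Future;
use std::task::{Context, Poll};

pub trait Service<Request> {
    type Response;
    type Error;
    type Future: Future<Output = Result<Self::Response, Self::Error>>;

    // Ask whether the service can accept a request *before* giving it one.
    fn poll_ready(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>>;

    // Only called after `poll_ready` returned `Poll::Ready(Ok(()))`.
    fn call(&mut self, req: Request) -> Self::Future;
}
```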
### Proposal

I suggest that we move to a `Stream` based API rather than a push `Sink` based API. We have already experimented with this type of API in #1988 and #2096. It has so far proven to work quite well.

The key idea with the new design is that instead of "pushing" items into a `Sink`, we treat a Vector sink as a `Future` that is provided some `Stream<Item = Event>`. The sink can then "pull" events out of the `Stream` at its own rate. Usually, this stream is either an `mpsc` in-memory buffer or an on-disk `leveldb` backed buffer. In fact, we already have this in place within topology, but we instead use the `Stream::forward` combinator to stream events into the sink. This code can be found here.

What I suggest we do is, instead of calling the `Stream::forward` combinator, pass the `rx` end of the channel into our sink config. This would then allow us to return a `Future` that processes the stream asynchronously.

This model also fits well with async/await because we currently have generator support for `Future` but not for `Sink`. This means we can use the less error-prone API for implementing new sinks. The benefit here is a much better experience writing sinks.

Proposed trait:
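A minimal sketch of what such a trait could look like, based only on the description above; the name `StreamSink` and the exact signature are assumptions rather than the trait actually proposed:

```rust
use futures::future::BoxFuture;
use futures::stream::BoxStream;

// Placeholder event type for the sketch.
pub struct Event;

// The sink is handed the receiving end of its buffer as a `Stream` and
// returns a `Future` that pulls events at its own pace until the stream ends.
pub trait StreamSink: Send {
    fn run(self: Box<Self>, input: BoxStream<'static, Event>) -> BoxFuture<'static, Result<(), ()>>;
}
```

Topology would then spawn the returned future instead of wiring things up with `Stream::forward`, e.g. something like `tokio::spawn(sink.run(rx))`, where `rx` is the buffer's receiving stream (again, illustrative only).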
This would then allow us to write our batching and streaming sinks in a much more readable and maintainable way.