Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add queue inspector facility #29

Merged
merged 6 commits into from
Apr 25, 2018

Conversation

bartelink
Copy link
Member

@bartelink bartelink commented Apr 18, 2018

This PR adds a Health Monitoring hook, as described in the included README.md edit:-

Health Monitoring via the Monitor and Inspector interfaces

The Async wrapper is primarily intended to allow one to achieve minimal logging latency at all times, even when writing to sinks that may momentarily block during the course of their processing (e.g., a File Sink might block for a low number of ms while flushing). The dropping behavior is an important failsafe; it avoids having an unbounded buffering behaviour should logging throughput overwhelm the sink, or the sink ingestion throughput degrade.

In practice, this configuration (assuming one provisions an adequate bufferSize) achieves an efficient and resilient logging configuration that can safely handle load without impacting processing throughput. The risk is of course that events get be dropped if the buffer threshold gets breached. The inspection interface, IAsyncLogEventSinkInspector (obtained by providing an IAsyncLogEventSinkMonitor when configuring the Async Sink), enables a health monitoring mechanism to actively validate that the buffer allocation is not being exceeded in practice.

// Example check: log message to an out of band alarm channel if logging is showing signs of getting overwhelmed
void ExecuteAsyncBufferCheck(IAsyncLogEventSinkInspector inspector)
{
    var usagePct = inspector.Count * 100 / inspector.BoundedCapacity;
    if (usagePct > 50) SelfLog.WriteLine("Log buffer exceeded {0:p0} usage (limit: {1})", usagePct, inspector.BoundedCapacity);
}

class MonitorConfiguration : IAsyncLogEventSinkMonitor
{
    public void StartMonitoring(IAsyncLogEventSinkInspector inspector) =>
        HealthMonitor.AddPeriodicCheck(() => ExecuteAsyncBufferCheck(inspector));

    public void StopMonitoring(IAsyncLogEventSinkInspector inspector) 
    { /* reverse of StartMonitoring */ }
}

// Provide monitor so we can wire the health check to the inspector
var monitor = new MonitorConfiguration();
// Use default config (drop events if >10,000 backlog)
.WriteTo.Async(a => a.File("logs/myapp.log"), monitor: monitor) ...

UPDATE: at the moment, it also addresses #13 as a side-effect

@bartelink
Copy link
Member Author

bartelink commented Apr 18, 2018

I mentioned this in the gitter in passing but am more than aware I'm going outside protocol by barging in with a PR with not even an issue to validate the idea; Please do not hesitate to bat this back if a more appropriate solution should be used.

The doc update is speculative in nature; please feel free to deny it.

Likewise, I'm happy to remove the formatting changes, as they're a touch on the obnoxious side ;)

@bartelink bartelink force-pushed the feature/queue-monitoring branch 5 times, most recently from 7cf874d to b56e8e0 Compare April 18, 2018 18:30
@bartelink
Copy link
Member Author

bartelink commented Apr 18, 2018

This is more or less ready to review from my perspective. The third commit (below) has the meat - the preceding two are debatable, and I'm open to removing them; nobody likes a sticky-beak ;)

If the tests should be moved to another file (or should lean less on excessive sleeping; I could change to use signalling if you wish), or you'd like me to add a short test to the core tests in BackgroundWorkerSinkTests etc, let me know.

@nblumhardt
Copy link
Member

Thanks for the PR!

A couple of initial thoughts -

Is it possible (and not too much trouble) to send the blockWhenFull changes separately? I think this one's an obvious win, and backpressure here seems like a very reasonable thing to support - just could be quicker through the pipeline than the monitoring mechanism, which seems to have more nuances and a bigger design space for us to explore.

For the monitoring mechanism, is the regular reporting a necessity or a nice-to-have for you? E.g. would a simpler callback-based onFull: delegate parameter cover it, without introducing the need to manage a timer?

If the regular report is the main aim, could we abstract away the underlying queue design, and just pass the queue's size and maximum capacity to the callback? Is there any action the callback could take that would alleviate the problem or modify the queue?

Thanks again!

@nblumhardt
Copy link
Member

(The formatting/xUnit updates also look like a win, to me 👍 )

@bartelink
Copy link
Member Author

bartelink commented Apr 19, 2018

Thanks Nick! Will split off the blockWhenFull and formatting things - I assume putting the xUnit changes in there too makes the next discussion easier too.

Re monitoring, having the pushing be managed/owned by sink was more an "easiest way to get something going" assume more people are interested thing.

An onFull is way too late for me. I'm looking to fit this into a platform-driven health check mechanism, so I want to regularly tap into the queue tailback lengths. While it's possible to do this by maintaining a list of items in play to be inspected, my ideal approach is to have it fed on a push basis to a staging area (and/or into a prometheus metrics store) in order to have everything to hand for a health check call; the other benefit of driving it that way around is to be able have the monitor function as a heartbeat for the propagation. If there was a neat way to pass out a ref to the queue (and/or a wrapper over it) in the DSL, managing the timer externally is pretty logical (for starters, I'll likely have > 1 Async in my process).

My stretch goal for this is to be able to use a similar pattern to batch audit transmissions going to kafka with a way to alert on delays (i.e. if the design can be similar, tjat would be a good sign that it's a flexible scheme).

Another thing that springs to mind is that having a monitor (esp on a push basis, all-but-covers the onFull case - in fact one could design it to always report when full)

In my initial thinking I was going to pass 2 args to the action (count and limit), but replaced that with the whole object on the basis that there might be other things people would be interested to grab. However I can see that wrapping it would give important isolation, and it also takes some namespaces out of the picture. One complication is that the caller needs to consider the lifetime / ObjectDisposedException possibilities (on reflection they probably had to anyway, and an encapsulating wrapper can concea; that.)

Is there prior art for having the config DSL to hand out a ref to an inner object? I'll get cracking on calving off the tidying into a separate PR for now...

@bartelink
Copy link
Member Author

@jose3m @cocowalla Do either of you guys happen to have ideas as to what you'd like to see in an API such as this - can this scheme possibly address the desire behind #13 ?

@bartelink bartelink force-pushed the feature/queue-monitoring branch from 01aeb00 to cd58cbd Compare April 19, 2018 22:01
@bartelink bartelink force-pushed the feature/queue-monitoring branch from cd58cbd to 579a0cf Compare April 19, 2018 22:06
@bartelink
Copy link
Member Author

bartelink commented Apr 19, 2018

Having written far better tests in the preceding PR, I'm hating the one I wrote here ;)

I'm thinking that regardless of whether the API remains as an active Timer-driven callback or we provide a way to reference the inner buffer in the course of configuration,
it should definitely wrap the buffer in an abstraction exposing just the limit and the current usage.

I'm happy to work on that when we reach an agreement as to how it should look; right now all I'm thinking is "you don't want to go emitting refs to buffers as an out arg", without having any ideas that are not in line with the API that I have as it is.

(Spiking an out arg that wraps it; looks OK so far)

@bartelink
Copy link
Member Author

bartelink commented Apr 20, 2018

@nblumhardt Should hopefully be more acceptable now - the names can probably benefit from a second set of eyes and the readme perhaps needs to be less preachy, but this provides a more flexible hook without consuming a Timer or adding another target as I had previously done. I've left the commit that implements the first scheme on the branch for posterity but am more than open to eliding it - your call.

Wild stab: Arguably the test is informative enough that it merits moving into the core Test Class ? (though instinct tells me this is equally likely to just be confusing)

@bartelink bartelink force-pushed the feature/queue-monitoring branch from 3534d8e to 7ff190f Compare April 20, 2018 01:05
@bartelink bartelink changed the title Add queue monitor facility Add queue inspector facility Apr 20, 2018
@bartelink bartelink requested a review from nblumhardt April 20, 2018 01:11
@bartelink bartelink force-pushed the feature/queue-monitoring branch 2 times, most recently from e78f22f to efe288c Compare April 20, 2018 01:19
@cocowalla
Copy link
Contributor

@bartelink @nblumhardt

bartelink: @cocowalla Do either of you guys happen to have ideas as to what you'd like to see in an API such as this - can this scheme possibly address the desire behind #13 ?

Main thing I'd want is the possibility to do out-of-band alerting, which your changes certainly cover - I also like how simple this change is, while still providing a mechanism for the dev to plugin whatever monitoring mechanism they choose.

bartelink: Spiking an out arg that wraps it; looks OK so far

I wonder about this - is this done in any other sinks? Mind, I'm not sure Serilog actually exposes any mechanisms for the dev to otherwise get a reference to an individual ISink!

@nblumhardt also asked if there was anything a callback could do that mutates the queue in some way as to alleviate the problem. BlockingCollection is thread-safe, so there shouldn't be an issue with a monitoring thread making changes to it... I suppose in some circumstances you might want to, for example, dump the contents of the queue somewhere, then clear it. Thoughts?

@bartelink
Copy link
Member Author

@cocowalla I missed the point about the mutation, thanks for raising this.

My general view on this sort of thing is that we're best off keeping it simple in this impl (call it CQS if you wish!). Having the role of .Async remain as an easy to explain SRP decorator has its own value - for example, looking at the stuff I put into the readme makes me worry (perhaps the solution is to farm it out to a wiki article).

I can absolutely picture such a facility being useful for any number of tricks (one thing that springs to mind is doing compression out of band iff there is a tailback), I think starting off with the constraint of not having it is a good thing - it might force fanning out into some form of AsyncTee impl to prove the concept (and/or eventually taking the learnings from those and rolling them back in again eventually a la File vs RollingFile).

The other aspect is (if you look at my first commit), things get more hairy the more you expose in terms of portability and general multitargetting busywork; exposing the info only via a constraining wrapper which honours the ISP already takes a potentially complex interaction with a specific implementation out of the picture (and one might in time imagine another sink implementing such an introspection interface to allow one to inspect backlogs). Right now, this is an easy to reason about/review/trust implementation and I'd like it to remain that way (I'm here as a refuge from what happens if your Async trickery gets too complex in a competing system).

Re the specific idea of dumping the contents of the queue somewhere, that's a useful idea though; the way the Seq Sink works in Durable mode has overlaps with this too. In my case, the sort of problem I'm looking to solve for is for an ephemeral instance in a cluster being able to store and forward telemetry type info but being able to deal with a) an ingester being down b) not letting stuff get orphaned when shutting down an instance c) using the RAM thats available under normal conditions and not going to disk to optimize throughput given one would be using Disk more as a fallback when problems happen than as a first class strategy for Durability. As part of this, you want an Async to isolate one from the latency variation, and then a varying strategy behind that. In general doing another .Wrap inside the .Async (which is a Wrap) and encapsulating the semantics as a Sink/Decorator is likely to be a good way of splitting up the problem. Nick, ping me when you have something like that extracted from the Sinks.Seq ;)

As for the out, what can I say ... the biggest problem is that one is handing out a ref to something that one manages the lifetime of (which reminds me, I need to document the ObjectDisposedException/IOE one can trigger in props on the new interface); that's kinda OK if we think of this as being a CloseAndFlush at shutdown will cover it type of problem, but doing it on a push based on specific events is messier still, as it leaves you having to design an API for that..

@bartelink
Copy link
Member Author

bartelink commented Apr 20, 2018

I'm thinking this PR (primarily in service of clarifying the tests - asserting that messages actually got dropped not just be asserting they don't make it through), could even cover #13 by maintaining a forever-accumulating long dropped count which is accessible via the same interface - would that work @jose3m

@bartelink
Copy link
Member Author

bartelink commented Apr 20, 2018

Actually on reflection, having this count is as valuable to me as being able to glean the point in time state with regard to queue backlog. @nblumhardt what are your thoughts on this - is it breaking the PR SRP too much to add this in or do you think we should :kill: :allthebirds: ?

@bartelink bartelink force-pushed the feature/queue-monitoring branch from efe288c to dcd24ac Compare April 20, 2018 13:22
@bartelink
Copy link
Member Author

bartelink commented Apr 20, 2018

Added DroppedMessagesCount - it's a separate commit so can go into a separate PR if desired.

@bartelink bartelink force-pushed the feature/queue-monitoring branch from 06b8a92 to d3b5309 Compare April 20, 2018 14:05
/// <summary>
/// Provides a way to inspect the current state of Async wrapper's ingestion queue.
/// </summary>
public interface IQueueState
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit- would be preferable to have this in its own file.

Another nit :-) ... it's closer to being the sink state than the queue state, since the dropped count isn't logically a property of the queue..?

Just looking at namespacing etc., could we tuck this away under Serilog.Sinks.Async.IAsyncLogEventSinkState or something along those lines, so that it's less likely to conflict with other types? I can't imagine there are too many IQueueState interfaces out there, but since it's not something a user will type regularly, a more descriptive name might not hurt.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my language of choice it should not be a separate file ;) Agree and will do.
The specific name is way better (the original was a speculative generality land grab) - dropped events are not a thing one wants as a universal 'feature'

@nblumhardt
Copy link
Member

Looks great, Ruben 👍

I've just added some minor comments on naming/namespacing, open to ideas.

Also did wonder whether the out parameter will cause any issues for use with configuration; probably none to worry about too much, but if we were to pass in an IAsyncLogEventSinkMonitor rather than have a value passed out, it would have the benefit of working with XML and JSON config (although only when the monitor type has a parameterless constructor).

Any thoughts?

@bartelink
Copy link
Member Author

bartelink commented Apr 22, 2018

Not quite getting how a IAsyncLogEventSinkMonitor would look... while a DroppingLogEvent would be an advance for people looking to do something specific, all I can see is "BufferAdded" + "BufferRemoved" which might be useful for a general monitor handling multiple sinks in play (but from personal experience, it's best for the API to not encourage one to think of adding lots of Asyncs pipelines and instead maintain one and switch stuff under the control of it's consumption/push loop instead in order to maintain memory cache/context switching mechanical sympathy). Ultimately we are talking (or at least that's what I'm thinking) about surfacing volatile properties of a concurrent data structure more than building out some more general metrics facility (if we were, we might be providing a processed messages count etc)

Can you expand a little on how you can see it being shaped ? (out does make my skin crawl a bit, but it's growing on me as it's got an honesty to it - one is feeding innards outwards)

The other open question is what the out param (if it is to remain one) should be called given a type name of IAsyncLogEventSinkState; state, monitor, guts ... inspector ? I've left it as inspector in the update as my thesaurus is being unduly reticent on this

/// <summary>
/// Accumulated number of messages dropped due to attempted submission having breached <see cref="BufferSize"/> limit.
/// </summary>
long DroppedMessagesCount { get; }
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any advance on this name?

@bartelink
Copy link
Member Author

bartelink commented Apr 23, 2018

This is ready to merge from my perspective, but then I don't really understand what this needs to do to play well with the config mechanism; awaiting further guidance ;)
(The out arg is on a separate overload of Async - i.e. the existing API has identical semantics minus passing out the observation reference - thus as long as there is ambiguity in reflection driven configuration or such things in play, it should be safe? ... which reminds me - the existing impl could delegate to the new one for (arguably more) clarity - would that be clearer in your view ?)

@nblumhardt
Copy link
Member

Thanks for the follow-up :-)

I'm not wedded to this idea BTW, but just to clarify my suggestion about turning the mechanism inside out so that it works with config (which I didn't think through or explain very well), here's a bit more detail:

First, just to clarify, the PR as it's currently shaped won't (shouldn't) break configuration - it just won't be possible to use the new monitor feature via configuration.

The internals in a config-friendly version would be the same as they are in the PR, but instead of receiving the sink as an out parameter, we'd provide an object that would have the thing passed into it.

var monitor = new MyAsyncBufferMonitor();

// <snip>

    // Passing the monitor in, instead of getting the state out
    .WriteTo.Async(..., monitor: monitor)

I.e. we pass something in, rather than out.

Inside the WriteTo.Async() method, instead of assigning an out parameter, we'd just call:

monitor.MonitorState(queueState);

The monitor interface itself would get a single method, and the sink would dispose the implementation, if it implements IDisposable, so that the monitoring can stop when the logging pipeline is disposed.

The config tie-in then allows a fully-qualified type name to be specified for the monitor argument, which the config system will use to create an instance of the type and pass it through to the sink.

Rushing this reply out in transit, apologies if this is still vague, please let me know what you think :-) Cheers!

@bartelink
Copy link
Member Author

Ah, thanks for taking the time to explain clearly; thankfully I [hope I] get it ;) I agree this scheme should work well and it does resolve the lifetime management more clearly.
Do you know if there is prior art for something like this in any Serilog package that I should be trying to 100% align with ?

@bartelink
Copy link
Member Author

Added a skeleton implementation. Some observations:

  • In my case, I have 2 pipelines so need to implement a proxy to feed through the disposal notification IDisposable (in practice, the two pipelines both get Disposed at the same time so I won't actually do that)
  • I found it very hard to word the xmldoc on IAsyncLogEventSinkMonitor (vs an out parameter speaking for itself)
  • the Disposal notification can also be achieved by adding an event on the State
  • the overloading gets messy if you have an out-param and monitor interface side by side
  • I'm sceptical that there are many implementations for which the IAsyncLogEventSinkMonitor mechanism affords an ideal integration. While admittedly global health monitoring is likely to be an exception, the lifetime management (singletons etc) surrounding the establishing of a health monitoring timer is probably not best addressed in this way.
  • the other thing that's running through my mind is that the Async wrapper is a very low dependency sink in general, and has concurrency (with the attendant subtlety and complexity it brings) and many patterns to juggle - I don't know how to document it in the README in a succinct manner.

This makes me think we should avoid adding the complexity (surfacing and documenting another interface that there are no known users of) at this point for now.

The other thing for me is that the specific way one configures an Async wrapper is not the sort of thing one actually wants to be coding in a config file given the subtle and complex behaviour and perf implications. For me, such work is best done programmatically in a .UseSerilog and/or a bespoke App health checker implementation.

We could park this last commit in a PR in my repo (or the main repo if you prefer), with an issue marking the fact that a config interface is a TODO. Then, at the point where someone actually wants/needs this, the finessing can be done with much more context as to how someone is going to actually wire it up in practice.

Or am I wrong ? (I do hope I am and hence arrive back here in a few months with the perfect abstraction to straddle multiple hosting environments.)

@bartelink bartelink force-pushed the feature/queue-monitoring branch from f74117f to 6546958 Compare April 24, 2018 14:10
@bartelink
Copy link
Member Author

@nblumhardt Let me know your thoughts - if you still believe this approach is more idiomatic and/or feel I'm worrying too much, I am of course happy to add the tests and/or remove the out variant with a view to getting this merged asap (I'll have it into prod the day after we agree a final layout and I have a NuGet to reference)

@nblumhardt
Copy link
Member

Hey Ruben, thanks for all of the analysis and thoughts 👍

I would personally have not expected async to be used much from configuration, either, but it ended up being the number one feature request we received from when the async wrapper was created through to the addition of configuration support, e.g. here, here, here and here.

Hence, although I agree that it's not the prettiest API in the world 😀, I think it will avoid future work if we can preempt the inevitable requests for configuration support, and add it here.

In my case, I have 2 pipelines so need to implement a proxy to feed through the disposal notification IDisposable (in practice, the two pipelines both get Disposed at the same time so I won't actually do that)

Very good point; I think it was a mistake for me to suggest we dispose the monitor, since we didn't (always) create it; calling StopMonitoring(this) during disposal would be semantically more accurate.

the overloading gets messy if you have an out-param and monitor interface side by side

Agreed; we'd be better off with only the one way of doing things, as you've implemented in the PR.

I found it very hard to word the xmldoc on IAsyncLogEventSinkMonitor (vs an out parameter speaking for itself)

👍 - I think what's there is probably fine for now.

What do you think?

@bartelink
Copy link
Member Author

Agreed; we'd be better off with only the one way of doing things, as you've implemented in the PR.

For avoidance of doubt: the branch as it is has both in situ

It's certainly hard to argue that people don't want it. By the same token, I'm pretty confident it gets misused more than a little too!

So we're saying I should

  • remove the out param version and repurpose the associated tests to address the updated version
  • add a StopMonitoring

@nblumhardt
Copy link
Member

For avoidance of doubt: the branch as it is has both in situ

Ah, thanks - sorry, I'd missed that.

Yes, I think that's the way to go 👍

@bartelink bartelink force-pushed the feature/queue-monitoring branch 2 times, most recently from bcbf490 to 59452ab Compare April 25, 2018 02:26
@bartelink
Copy link
Member Author

bartelink commented Apr 25, 2018

OK, I think this feels pretty final, including updating the README and the test - I think the README example works out OK, or at least better than I was expecting yesterday.

Let me know if there's more to do (NB esp the README and the xmldoc need a thorough nitpicking; I've traversed it way too many times, but we'll obviously pay the price N times if we don't take the time to polish it - i.e., please be as harsh as you can on it so I can (ideally) get it polished and put to bed at some point tomorrow).

@bartelink bartelink force-pushed the feature/queue-monitoring branch from 59452ab to ab156fe Compare April 25, 2018 09:32
@bartelink bartelink force-pushed the feature/queue-monitoring branch from ab156fe to 17e0779 Compare April 25, 2018 09:53
@nblumhardt
Copy link
Member

Epic! Thank you for sticking with this, I think (hope?! :-) ) that we got a great solution out of all the back-and-forth. Dev build incoming! 🎉

@nblumhardt nblumhardt merged commit 6cee32a into serilog:dev Apr 25, 2018
@bartelink bartelink deleted the feature/queue-monitoring branch April 25, 2018 22:26
@nblumhardt nblumhardt mentioned this pull request May 9, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants