-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(outputs): Implement partial write errors #16146
Conversation
5608625
to
0dbf506
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! I tested this thoroughly, especially with the disk buffer and didn't find any issues. This included some manual tests, testing other plugin test cases with the disk buffer, and coverage tests. I added a couple comments as well
0dbf506
to
f2090a4
Compare
type Transaction struct { | ||
// Batch of metrics to write | ||
Batch []telegraf.Metric | ||
|
||
// Accept denotes the indices of metrics that were successfully written | ||
Accept []int | ||
// Reject denotes the indices of metrics that were not written but should | ||
// not be requeued | ||
Reject []int | ||
|
||
// Marks this transaction as valid | ||
valid bool | ||
|
||
// Internal state that can be used by the buffer implementation | ||
state interface{} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "transaction" is the wrong word to use here. In computing, a transaction is a sequence of steps that is executed atomically. Either it runs fully, or not at all. The structure Transaction
here records partially completed writes. It is explicitly providing support for a sequence of actions that is allowed to partially complete. Maybe I would call it PartialBatch
, or something else to indicate it need not all succeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm IMO it is a transaction on the buffer. The buffer is not modified until EndTransaction
is called on the buffer. Let's keep this for now.
MetricsAccept []int | ||
MetricsReject []int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using bit arrays here would lead to memory savings and performance improvements. Switching to them would be a significant change to the PR and I think the code is sound as it is, so not recommending that, but maybe as a future performance enhancement.
I would continue to use separate vars for "accept" and "reject", just switch them to bit arrays. They would be fixed in size at len(metrics)/8+1
bytes long, vs len(accepted)*64
, which I think in many cases would be approaching len(metrics)*64
. Not sure what the typical use case is here, would take some measurement of real-world scenarios, but I think chances are good it would be a decent savings.
The other advantage is that it can support simplified operations. The metrics to drop from the WAL after a single write can be found with the bit-wise union of the two bit arrays for accept and reject. Then you union that again with the long-running mask for the WAL. If all the bits are 1
the whole WAL is done. If not, to find out what prefix to remove from the WAL, you find the first non-zero bit in the mask.
Here's an example of a bit-array library you could use as a reference. I would not use a library, but instead write a simplified one that supplies just the operations needed here. When bit-array data structures are not made general they can be surprisingly small in code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I do agree that using a bitmask is better it seems a premature optimization at this point. We really need to optimize this further but IMO this also must include optimizing disk I/O as we are currently scratching the disk badly. In this step, the removal-masking should move into a dedicated WAL implementation and be converted to a bit-mask as you suggest. What do you think?
Download PR build artifacts for linux_amd64.tar.gz, darwin_arm64.tar.gz, and windows_amd64.zip. 📦 Click here to get additional PR build artifactsArtifact URLs |
Summary
This PR implements specification TSD-008 for partial write errors.
Checklist
Related issues
related to #11942
related to #14802
related to #15908