"unable to be deserialized" error on disk buffer blocks Vector from sending data out #18336
Comments
From your description, it looks like there are actually three different things going on here:
Buffer errors at startup

In your case, what is happening is that Vector was in the middle of writing a record (event) to the disk buffer when it was killed by Kubernetes. When Vector restarts and validates where it left off writing to/reading from the disk buffer, it detects that this partial write occurred and emits an error message to inform you of it. This is expected behavior: Vector cannot ensure atomic writes to disk, and modifications to disk buffer files can even occur while Vector isn't running (manipulating the files manually, filesystem/disk corruption, etc.). At best, Vector can inform you that the data in the disk buffer is invalid (in some way) and what that means (potentially lost events, etc.). We report this at the error level because it's serious -- as you note, you don't expect your data to be lost, and most users consider lost data a big issue -- and we want users to be able to understand when this occurs... but there's nothing to change about it, besides trying to adjust deployments to keep Vector from being OOM-killed where possible.

Having sources stop accepting data when a downstream sink has filled its disk buffer

This is actually the default behavior, but you're overriding it in your configuration by specifying a non-default when_full setting.

Kafka sink stops sending data

This looks like a potentially legitimate problem you're experiencing. Can you provide some more information about the test you were running? A few things specifically:
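For context on the when_full discussion above, here is a minimal sketch of where that knob lives in a Vector YAML config; the sink name and buffer size are illustrative and not taken from the reporter's configuration:

```yaml
sinks:
  kafka_out:                    # illustrative sink name; other kafka options omitted
    # ...
    buffer:
      type: disk
      max_size: 1073741824      # disk buffer size in bytes (1 GiB here)
      when_full: block          # default: a full buffer applies backpressure upstream
      # when_full: drop_newest  # alternative: discard incoming events once full
```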
Hi @tobz, thanks for replying to this issue. It's been a long time since I reported it, so I'll need to pick it back up, and some data may already be missing. We are running vector version timberio/vector:0.31.0-distroless-libc. The Kafka throughput dropping to zero does NOT happen at the same time as the restart; because the replica names have a random suffix, we can't see metrics from the same replica after Kubernetes kills it. For the buffer, I think your comments make total sense; I will probably change to the other when_full option so we stop sending data to the buggy replica. The thing I'm not sure is by design: when a replica runs into this state (blocked from reading the buffer), it just stays dead and does not recover by itself; we need a manual operation like killing the pod. I totally agree with you that this is a design choice: leaving it in the broken state at least gives us a chance to notice and debug.
We recently saw a similar issue. Our Vector pods (on 0.32.0 distroless) sending data to an Elasticsearch sink started being OOM-killed and then entered a crash/restart loop. We started seeing the same error message in the logs, "Last written record was unable to be deserialized. Corruption likely.", with reason "invalid data: check failed for struct member payload: pointer out of bounds: base 0x7fb7b0c00008 offset 1929447951 not in range 0x7fb7b0a00000..0x7fb7b0c00000". Data sent to Elasticsearch dropped and recovered after a few hours. The memory on the host machines never got full, so we're not sure what triggered the first OOM nor what helped it recover. When Vector encounters a corrupted disk buffer, what does it do? Continue while accumulating events in memory? How best to track/obtain information on what triggered this? Setting VECTOR_LOG=debug?
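Regarding the VECTOR_LOG=debug question above, a minimal sketch of how that environment variable could be set on the Vector container in a Kubernetes pod spec; the container name is hypothetical and only the env entry is the point here:

```yaml
containers:
  - name: vector                                   # hypothetical container name
    image: timberio/vector:0.32.0-distroless-libc  # version from the comment above
    env:
      - name: VECTOR_LOG
        value: "debug"                             # raise Vector's log verbosity to debug
```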
@awangc Best to open a new issue for this.
Do we have an update here? I see this deserialization error occurring very often. Is there any workaround for it?
Whether drop_newest or block is used, a corrupted message should either be sent to a "dead letter queue" or dropped. Stopping the entire pipeline due to a corrupted message introduces a complete failure in processing messages upon a restart (due to OOM, for example). It defeats the purpose of having disk buffers.
This is a bug. The only (unfortunate) workaround I'm aware of at the moment is to delete the disk buffer.
Problem
Recently we did some performance testing with Vector acting as a server collecting metrics data and forwarding it to Kafka. In our cluster we ran into some OOM crashes that caused some pods to restart. After the restart, several Vector replicas reported the "unable to be deserialized" error from the title.
The replicas with this error still receive metrics data (http_server source) but no longer forward it to Kafka (sink). We didn't have logging enabled at the time the error happened, and reproducing it is expensive because of the load generation required (I will update here when we run it again). This results in lost data, which I don't think is the behavior we expect. Not sure if anyone else has hit a similar problem; I would like to learn the root cause.

Screenshots (metrics during the incident):
- The Kafka buffer usage drops to zero
- The http source is still receiving data
- The Kafka request count drops to zero
We could put some alerting around the service, but if the buffer I/O runs into this kind of blocked state we would probably want to remove that replica from service. I also can't find how to configure the http_server source so that it stops accepting new data when the downstream buffer is full.
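For reference, Vector's sources don't appear to expose their own buffer/when_full setting; the blocking behavior described earlier in the thread is configured on the downstream sink's buffer. A minimal sketch of such a pipeline in YAML, where all names, addresses, and sizes are illustrative rather than the reporter's actual configuration:

```yaml
sources:
  http_in:
    type: http_server           # the source type mentioned in this report
    address: "0.0.0.0:8080"     # placeholder listen address
    decoding:
      codec: json

sinks:
  kafka_out:
    type: kafka
    inputs: ["http_in"]
    bootstrap_servers: "kafka:9092"   # placeholder
    topic: "metrics"                  # placeholder
    encoding:
      codec: json
    buffer:
      type: disk
      max_size: 1073741824            # 1 GiB disk buffer
      when_full: block                # when full, backpressure propagates to http_in
```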
Configuration
Version
timberio/vector:0.31.0-distroless-libc
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response