
"unable to be deserialized" error on disk buffer blocks Vector from sending data out #18336

Open
ee07b415 opened this issue Aug 21, 2023 · 7 comments
Labels: domain: buffers (Anything related to Vector's memory/disk buffers), type: bug (A code related bug.)

@ee07b415

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Recently we ran some performance tests with Vector acting as a server that collects metrics data and forwards it to Kafka. In our cluster we ran into some OOM crashes that caused some pods to restart; after restarting, several Vector replicas reported the following error:

2023-08-19T01:52:51.816411Z ERROR sink{component_kind="sink" component_id=*** component_type=kafka component_name=***}: vector_buffers::variants::disk_v2::writer: Last written record was unable to be deserialized. Corruption likely. reason="invalid data: check failed for struct member payload: pointer out of bounds: base 0x7f5c9ce00008 offset 1635021580 not in range 0x7f5c8f3e7000..0x7f5c90000000"

The replicas with this error still receive metrics data (http_server source) but no longer forward it to Kafka (sink). We didn't have logging enabled at the time the error happened, and reproducing it is costly because of the load we need to generate (I will update here when we try again). This results in data loss, which I don't think is the behavior we expect. I'm not sure if anyone else has hit a similar problem; I would like to understand the root cause.
The Kafka buffer usage drops to zero
[Screenshot 2023-08-21 at 4:32:18 PM]
The http_server source is still receiving data
[Screenshot 2023-08-21 at 4:32:13 PM]
The Kafka request count drops to zero
[Screenshot 2023-08-21 at 4:32:03 PM]

We could put some alerting around the service, but I feel that if the buffer I/O runs into this kind of blocked state, we probably want to remove the replica from service. However, I can't find how to configure the http_server source to block new data when the buffer is full.

Configuration

customConfig:
  api:
    address: *.*.*.*:**
    enabled: true
    playground: false
  data_dir: /var/lib/vector
  sinks:
    kafka_startree:
      type: kafka
      inputs:
        - http_server
      bootstrap_servers: ""
      topic: ""
      encoding:
        codec: "json"
      compression: "lz4"
      buffer:
        type: disk
        when_full: drop_newest
        max_size: 268435488
      sasl: 
        ****
      tls:
        enabled: true
  sources:
    http_server:
      type: http_server
      address: 0.0.0.0:8080
      encoding: json
      method: POST
      headers:

Version

timberio/vector:0.31.0-distroless-libc

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

ee07b415 added the type: bug (A code related bug.) label on Aug 21, 2023
@tobz (Contributor) commented Sep 14, 2023

From your description, it looks like there are actually three different things going on here:

  • you're seeing disk buffer errors at startup after pods are restarted
  • you want to be able to stop accepting data at the source when a disk buffer is full
  • your Kafka sink stops sending data after a period of time

Buffer errors at startup

In your case, what is happening is that Vector was in the middle of writing a record (event) to the disk buffer as it was killed by Kubernetes. When Vector restarts and validates where it left off with writing to/reading from the disk buffer, it realizes that this partial write occurred, and then emits an error message to inform you of this.

This is expected behavior because Vector cannot ensure atomic writes to disk, and even further, modifications to disk buffer files could occur without Vector even running: manipulating the files manually, filesystem/disk corruption, etc. At best, Vector is able to inform you that the data in the disk buffer is invalid (in some way) and what that means (potentially lost events, etc).

We report this at the error level because it's serious -- as you note, you don't expect your data to be lost, and most users consider lost data to be a big issue -- and we want users to be able to understand when this occurs... but there's nothing to change about it, besides trying to adjust deployments to keep Vector from being OOM killed where possible.
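
As an illustration on the deployment side only, assuming the configuration above comes from the Vector Helm chart (the customConfig key suggests the chart's values file), a hedged sketch of giving the pod explicit memory requests and limits looks like the following; the figures are placeholders to be sized from your own load tests, not recommendations:

# Hypothetical values.yaml snippet, sitting alongside the customConfig shown above
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    # keep headroom above Vector's observed peak memory usage
    memory: "1Gi"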

Having sources stop accepting data when a downstream sink has filled its disk buffer

This is actually the default behavior, but you're overriding it in your configuration by specifying when_full: drop_newest. Backpressure propagates from sinks to transforms to sources, so if a disk buffer fills at the sink level, that backpressure will eventually propagate back to your source, which in turn will block callers trying to write more data.
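
Concretely, a minimal sketch of the buffer section from the configuration above, relying on the default blocking behavior (when_full defaults to block, so the key can also be omitted entirely):

sinks:
  kafka_startree:
    type: kafka
    inputs:
      - http_server
    buffer:
      type: disk
      max_size: 268435488
      # `block` is the default; a full disk buffer then applies backpressure
      # all the way back to the http_server source instead of dropping events
      when_full: block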

Kafka sink stops sending data

This looks like a potentially legitimate problem you're experiencing.

Can you provide some more information about the test you were running? A few things specifically:

  • what version of Vector are you running?
  • did the Vector restarts occur at the same time that the processing rate to Kafka dropped to zero?

@ee07b415 (Author)

Hi @tobz, thanks for replying to this issue. It's been a long time since I reported it, so I need to pick it back up and some data may already be missing.

We are running Vector version timberio/vector:0.31.0-distroless-libc. The Kafka drop to zero did NOT happen at the same time as the restart; because the replicas have random name suffixes, we aren't able to follow the metrics from the same replica after Kubernetes kills it. For the buffer, your comments totally make sense; I will probably change to the other when_full option so we don't keep sending data to the buggy replica.

The thing I'm not sure is by design: when a replica runs into this state (blocked from reading the buffer), it is just dead and does not recover by itself; we need some manual operation like killing the pod. I can see the reasoning behind the design, though: leaving it in the erroring state gives us a chance to notice and debug it.

@awangc commented Sep 27, 2023

We recently saw a similar issue. Our Vector pods (on 0.32.0 distroless), sending data to an Elasticsearch sink, started being killed with OOM and then entered a crash/restart loop.

We started seeing the same error message in the logs, "Last written record was unable to be deserialized. Corruption likely.", with reason "invalid data: check failed for struct member payload: pointer out of bounds: base 0x7fb7b0c00008 offset 1929447951 not in range 0x7fb7b0a00000..0x7fb7b0c00000". Data sent to Elasticsearch drops and recovers after a few hours. The memory on the host machines never gets full; we're not sure what triggered the first OOM nor what helped it recover.

When Vector encounters a corrupted disk buffer, what does it do? Does it continue while accumulating events in memory? What is the best way to track or obtain information on what triggered this? Setting VECTOR_LOG=debug?

tobz added the domain: buffers (Anything related to Vector's memory/disk buffers) label on Oct 16, 2023
@tobz (Contributor) commented Oct 16, 2023

> We recently saw a similar issue. Our Vector pods (on 0.32.0 distroless), sending data to an Elasticsearch sink, started being killed with OOM and then entered a crash/restart loop.

@awangc Best to open a new issue for this.

@ShahroZafar

Do we have an update here? I see that this deserialization error occurs very often. Is there any workaround to solve it?

@orarnon commented Jun 24, 2024

Whether drop_newest or block is used, a corrupted message should either be sent to a "dead letter queue" or dropped. Stopping the entire pipeline due to a corrupted message introduces a complete failure in processing messages upon a restart (due to OOM for example). It defeats the purpose of having disk buffers.
Is that a bug? How can we proceed?

@jszwedko (Member)

This is a bug. The only, unfortunate, workaround I'm aware of at the moment is to delete the disk buffer.
