
Vector sink with disk buffer stuck after pod restart #19155

Open
kaarolch opened this issue Nov 15, 2023 · 3 comments
Labels
domain: buffers Anything related to Vector's memory/disk buffers type: bug A code related bug.

Comments


kaarolch commented Nov 15, 2023

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

We are using Vector 0.33.1 to process and forward Datadog metrics. Because of HPA and the Kubernetes node rotation mechanism, we enabled the disk buffer to be sure that all metrics will be forwarded to the DD server. Unfortunately, after a pod/node restart we observed a situation where the Vector pod starts and the process is up, but data is either not processed at all, or is processed but never sent to the sink. The persistent volume is mounted correctly, yet it still contains old buffer-data-*.dat files. Since we use Kubernetes HPA, multiple orphaned .dat files exist in the PVC store:

data_dir/buffer/v2/datadog_agent_misc_metrics$ ls -al
total 521996
drwxrwsr-x 2 1000 1000      4096 Nov 15 11:16 .
drwxrwsr-x 6 1000 1000      4096 Oct 20 12:31 ..
-rw-rw-r-- 1 1000 1000 134099968 Nov 14 10:54 buffer-data-20.dat
-rw-rw-r-- 1 1000 1000 133615360 Oct 23 12:53 buffer-data-21.dat
-rw-rw-r-- 1 1000 1000 133886400 Oct 23 12:59 buffer-data-25.dat
-rw-rw-r-- 1 1000 1000 132894968 Oct 23 13:07 buffer-data-32.dat
-rw-rw-r-- 1 1000 1000        24 Nov 15 10:23 buffer.db
-rw-r--r-- 1 1000 1000         0 Nov 15 11:19 buffer.lock
$ df -h
/dev/nvme4n1     30G  547M   29G   2% /data_dir

Also, could a detailed warning be added to the logs when Vector blocks the sink and cannot forward metrics? From an observability perspective, we can currently only catch this through Vector's internal metrics and Datadog agent metrics. It would be useful to have this information directly in the logs.

Configuration

## non-custom metrics coming from dd-agent - non DogStatsD
type: datadog_metrics
inputs:
  - metrics_route._unmatched
default_api_key: ${US_DATADOG_API_KEY}
endpoint: ${DD_SITE}
buffer:
   type: disk
   max_size: 15000000000 # close to 15GB
   when_full: block
batch:
   max_events: 4000
   timeout_secs: 1
acknowledgements:
   enabled: False
request:
   concurrency: "adaptive"
   rate_limit_duration_secs: 1
   rate_limit_num: 50
   retry_attempts: 15
   retry_max_duration_secs: 1800
   retry_initial_backoff_secs: 1
   timeout_secs: 5
   adaptive_concurrency:
       decrease_ratio: 0.7
       ewma_alpha: 0.4
       initial_concurrency: 2
       rtt_deviation_scale: 2.5

Version

0.33.1

Debug Output

2023-11-15T13:21:20.329877Z DEBUG vector::app: Internal log rate limit configured. internal_log_rate_secs=1
2023-11-15T13:21:20.329916Z  INFO vector::app: Log level is enabled. level="debug"
2023-11-15T13:21:20.329926Z DEBUG vector::app: messaged="Building runtime." worker_threads=30
2023-11-15T13:21:20.332023Z  INFO vector::app: Loading configs. paths=["/vector"]
2023-11-15T13:21:20.335263Z DEBUG vector::config::loading: No secret placeholder found, skipping secret resolution.
2023-11-15T13:21:20.342354Z DEBUG vector::topology::builder: Building new source. component=datadog_agent
2023-11-15T13:21:20.344376Z  INFO source{component_kind="source" component_id=datadog_agent component_type=datadog_agent component_name=datadog_agent}: vector::sources::datadog_agent: Building HTTP server. address=0.0.0.0:24869
2023-11-15T13:21:20.344414Z DEBUG vector::topology::builder: Building new source. component=dogstatsd
2023-11-15T13:21:20.345010Z DEBUG vector::topology::builder: Building new source. component=vector_internal_logs
2023-11-15T13:21:20.345646Z DEBUG vector::topology::builder: Building new source. component=vector_internal_metrics
2023-11-15T13:21:20.346231Z DEBUG vector::topology::builder: Building new transform. component=high_cardinality_limit
2023-11-15T13:21:20.346315Z DEBUG vector::topology::builder: Building new transform. component=internal_logs_filter
2023-11-15T13:21:20.346434Z DEBUG vector::topology::builder: Building new transform. component=internal_logs_metadata
2023-11-15T13:21:20.346526Z DEBUG vector::topology::builder: Building new transform. component=log_to_metrics_rejections
2023-11-15T13:21:20.346575Z DEBUG vector::topology::builder: Building new transform. component=low_cardinality_limit
2023-11-15T13:21:20.346616Z DEBUG vector::topology::builder: Building new transform. component=metrics_remove_dogstatsd_tag
2023-11-15T13:21:20.346739Z DEBUG vector::topology::builder: Building new transform. component=metrics_route
2023-11-15T13:21:20.346826Z DEBUG vector::topology::builder: Building new transform. component=metrics_route_limit
2023-11-15T13:21:20.347370Z DEBUG vector::topology::builder: Building new transform. component=normal_cardinality_limit
2023-11-15T13:21:20.347427Z DEBUG vector::topology::builder: Building new transform. component=tags_vector_internal_metrics
2023-11-15T13:21:20.350891Z DEBUG vector::topology::builder: Building new transform. component=test_cardinality_limit
2023-11-15T13:21:20.350981Z DEBUG vector::topology::builder: Building new sink. component=consume_not_used_datadog_sources
2023-11-15T13:21:20.351057Z DEBUG vector::topology::builder: Building new sink. component=datadog_agent_dogstatsd_metrics
2023-11-15T13:21:20.352196Z DEBUG sink{component_kind="sink" component_id=datadog_agent_dogstatsd_metrics component_type=datadog_metrics component_name=datadog_agent_dogstatsd_metrics}: vector_buffers::variants::disk_v2::ledger: Found existing data file. data_file="buffer-data-34716.dat" file_size=73956464 total_buffer_size=73956464
2023-11-15T13:21:20.352665Z DEBUG sink{component_kind="sink" component_id=datadog_agent_dogstatsd_metrics component_type=datadog_metrics component_name=datadog_agent_dogstatsd_metrics}:validate_last_write: vector_buffers::variants::disk_v2::writer: Validating last written record in current data file. current_writer_data_file="/data_dir/buffer/v2/datadog_agent_dogstatsd_metrics/buffer-data-34716.dat"
2023-11-15T13:21:20.352716Z DEBUG sink{component_kind="sink" component_id=datadog_agent_dogstatsd_metrics component_type=datadog_metrics component_name=datadog_agent_dogstatsd_metrics}:validate_last_write:ensure_ready_for_write: vector_buffers::variants::disk_v2::writer: Opened data file for writing. data_file_path="/data_dir/buffer/v2/datadog_agent_dogstatsd_metrics/buffer-data-34716.dat" existing_file_size=73956464
2023-11-15T13:21:20.354345Z DEBUG sink{component_kind="sink" component_id=datadog_agent_dogstatsd_metrics component_type=datadog_metrics component_name=datadog_agent_dogstatsd_metrics}:validate_last_write: vector_buffers::variants::disk_v2::writer: Synchronized with ledger. Writer ready. ledger_next=2163196284 last_record_id=2163196226 record_events=58
2023-11-15T13:21:20.354405Z DEBUG sink{component_kind="sink" component_id=datadog_agent_dogstatsd_metrics component_type=datadog_metrics component_name=datadog_agent_dogstatsd_metrics}: vector_buffers::variants::disk_v2::reader: Seeking to last acknowledged record for reader. last_acknowledged_record_id=2163196283
2023-11-15T13:21:20.354432Z DEBUG sink{component_kind="sink" component_id=datadog_agent_dogstatsd_metrics component_type=datadog_metrics component_name=datadog_agent_dogstatsd_metrics}: vector_buffers::variants::disk_v2::reader: Opened data file for reading. data_file_path="/data_dir/buffer/v2/datadog_agent_dogstatsd_metrics/buffer-data-34716.dat"
2023-11-15T13:21:21.160661Z DEBUG sink{component_kind="sink" component_id=datadog_agent_dogstatsd_metrics component_type=datadog_metrics component_name=datadog_agent_dogstatsd_metrics}: vector_buffers::variants::disk_v2::reader: Synchronized with ledger. Reader ready. last_record_id_read=2163196283
2023-11-15T13:21:21.171546Z DEBUG sink{component_kind="sink" component_id=datadog_agent_dogstatsd_metrics component_type=datadog_metrics component_name=datadog_agent_dogstatsd_metrics}: vector_core::tls::settings: Fetching system root certs.
2023-11-15T13:21:21.181361Z DEBUG sink{component_kind="sink" component_id=datadog_agent_dogstatsd_metrics component_type=datadog_metrics component_name=datadog_agent_dogstatsd_metrics}: vector_core::tls::settings: Fetching system root certs.
2023-11-15T13:21:21.190954Z DEBUG vector::topology::builder: Building new sink. component=datadog_agent_misc_metrics
2023-11-15T13:21:21.191226Z DEBUG sink{component_kind="sink" component_id=datadog_agent_misc_metrics component_type=datadog_metrics component_name=datadog_agent_misc_metrics}: vector_buffers::variants::disk_v2::ledger: Found existing data file. data_file="buffer-data-20.dat" file_size=134100176 total_buffer_size=134100176
2023-11-15T13:21:21.191248Z DEBUG sink{component_kind="sink" component_id=datadog_agent_misc_metrics component_type=datadog_metrics component_name=datadog_agent_misc_metrics}: vector_buffers::variants::disk_v2::ledger: Found existing data file. data_file="buffer-data-32.dat" file_size=132894968 total_buffer_size=266995144
2023-11-15T13:21:21.191266Z DEBUG sink{component_kind="sink" component_id=datadog_agent_misc_metrics component_type=datadog_metrics component_name=datadog_agent_misc_metrics}: vector_buffers::variants::disk_v2::ledger: Found existing data file. data_file="buffer-data-21.dat" file_size=133615360 total_buffer_size=400610504
2023-11-15T13:21:21.191285Z DEBUG sink{component_kind="sink" component_id=datadog_agent_misc_metrics component_type=datadog_metrics component_name=datadog_agent_misc_metrics}: vector_buffers::variants::disk_v2::ledger: Found existing data file. data_file="buffer-data-25.dat" file_size=133886400 total_buffer_size=534496904
2023-11-15T13:21:21.192066Z DEBUG sink{component_kind="sink" component_id=datadog_agent_misc_metrics component_type=datadog_metrics component_name=datadog_agent_misc_metrics}:validate_last_write: vector_buffers::variants::disk_v2::writer: Validating last written record in current data file. current_writer_data_file="/data_dir/buffer/v2/datadog_agent_misc_metrics/buffer-data-20.dat"
2023-11-15T13:21:21.192133Z DEBUG sink{component_kind="sink" component_id=datadog_agent_misc_metrics component_type=datadog_metrics component_name=datadog_agent_misc_metrics}:validate_last_write:ensure_ready_for_write: vector_buffers::variants::disk_v2::writer: Opened data file for writing. data_file_path="/data_dir/buffer/v2/datadog_agent_misc_metrics/buffer-data-20.dat" existing_file_size=134100176
2023-11-15T13:21:21.193436Z DEBUG sink{component_kind="sink" component_id=datadog_agent_misc_metrics component_type=datadog_metrics component_name=datadog_agent_misc_metrics}:validate_last_write: vector_buffers::variants::disk_v2::writer: Synchronized with ledger. Writer ready. ledger_next=4546695839 last_record_id=4546695838 record_events=1
2023-11-15T13:21:21.193464Z DEBUG sink{component_kind="sink" component_id=datadog_agent_misc_metrics component_type=datadog_metrics component_name=datadog_agent_misc_metrics}: vector_buffers::variants::disk_v2::reader: Seeking to last acknowledged record for reader. last_acknowledged_record_id=4546695838
2023-11-15T13:21:21.193496Z DEBUG sink{component_kind="sink" component_id=datadog_agent_misc_metrics component_type=datadog_metrics component_name=datadog_agent_misc_metrics}: vector_buffers::variants::disk_v2::reader: Opened data file for reading. data_file_path="/data_dir/buffer/v2/datadog_agent_misc_metrics/buffer-data-20.dat"
2023-11-15T13:21:22.942073Z DEBUG sink{component_kind="sink" component_id=datadog_agent_misc_metrics component_type=datadog_metrics component_name=datadog_agent_misc_metrics}: vector_buffers::variants::disk_v2::reader: Synchronized with ledger. Reader ready. last_record_id_read=4546695838
2023-11-15T13:21:22.954493Z DEBUG sink{component_kind="sink" component_id=datadog_agent_misc_metrics component_type=datadog_metrics component_name=datadog_agent_misc_metrics}: vector_core::tls::settings: Fetching system root certs.
2023-11-15T13:21:22.966827Z DEBUG sink{component_kind="sink" component_id=datadog_agent_misc_metrics component_type=datadog_metrics component_name=datadog_agent_misc_metrics}: vector_core::tls::settings: Fetching system root certs.

I also found this in debug:

2023-11-15T13:21:23.318191Z ERROR vector::topology::builder: msg="Healthcheck failed." error=Unexpected status: 307 Temporary Redirect component_kind="sink" component_type="datadog_metrics" component_id=datadog_agent_misc_metrics component_name=datadog_agent_misc_metrics

Example Data

pod logs:

2023-11-15T10:23:11.768577Z  INFO vector::app: Log level is enabled. level="info"
2023-11-15T10:23:11.770954Z  INFO vector::app: Loading configs. paths=["/vector"]
2023-11-15T10:23:11.784143Z  INFO source{component_kind="source" component_id=datadog_agent component_type=datadog_agent component_name=datadog_agent}: vector::sources::datadog_agent: Building HTTP server. address=0.0.0.0:24869
2023-11-15T10:23:13.987559Z  INFO vector::topology::running: Running healthchecks.
2023-11-15T10:23:13.987647Z  INFO vector::topology::builder: Healthcheck passed.
2023-11-15T10:23:13.987786Z  INFO vector: Vector has started. debug="false" version="0.33.1" arch="x86_64" revision="3cc27b9 2023-10-30 16:50:49.747931844"
2023-11-15T10:23:13.988639Z  INFO source{component_kind="source" component_id=dogstatsd component_type=statsd component_name=dogstatsd}: vector::sources::statsd: Listening. addr=0.0.0.0:28125 type="udp"
2023-11-15T10:23:13.991020Z  INFO vector::internal_events::api: API server running. address=0.0.0.0:8686 playground=http://0.0.0.0:8686/playground
2023-11-15T10:23:14.262245Z ERROR vector::topology::builder: msg="Healthcheck failed." error=Unexpected status: 307 Temporary Redirect component_kind="sink" component_type="datadog_metrics" component_id=datadog_vector_internal_metrics component_name=datadog_vector_internal_metrics
2023-11-15T10:23:26.859994Z  INFO source{component_kind="source" component_id=datadog_agent component_type=datadog_agent component_name=datadog_agent}: warp::filters::trace: processing request
2023-11-15T10:23:26.891934Z  INFO source{component_kind="source" component_id=datadog_agent component_type=datadog_agent component_name=datadog_agent}: warp::filters::trace: finished processing with success status=200
2023-11-15T10:24:15.695383Z  INFO source{component_kind="source" component_id=datadog_agent component_type=datadog_agent component_name=datadog_agent}: warp::filters::trace: processing request
2023-11-15T10:24:15.723062Z  INFO source{component_kind="source" component_id=datadog_agent component_type=datadog_agent component_name=datadog_agent}: warp::filters::trace: finished processing with success status=200
2023-11-15T10:24:30.693830Z  INFO source{component_kind="source" component_id=datadog_agent component_type=datadog_agent component_name=datadog_agent}: warp::filters::trace: processing request
2023-11-15T10:24:30.724864Z  INFO source{component_kind="source" component_id=datadog_agent component_type=datadog_agent component_name=datadog_agent}: warp::filters::trace: finished processing with success status=200
2023-11-15T10:26:44.452964Z  INFO source{component_kind="source" component_id=datadog_agent component_type=datadog_agent component_name=datadog_agent}: warp::filters::trace: processing request
2023-11-15T10:26:44.566431Z  INFO source{component_kind="source" component_id=datadog_agent component_type=datadog_agent component_name=datadog_agent}: warp::filters::trace: finished processing with success status=200
2023-11-15T10:26:44.577527Z  INFO source{component_kind="source" component_id=datadog_agent component_type=datadog_agent component_name=datadog_agent}: warp::filters::trace: processing request
2023-11-15T10:26:44.650057Z  INFO source{component_kind="source" component_id=datadog_agent component_type=datadog_agent component_name=datadog_agent}: warp::filters::trace: finished processing with success status=200
2023-11-15T10:26:44.688442Z  INFO source{component_kind="source" component_id=datadog_agent component_type=datadog_agent component_name=datadog_agent}: warp::filters::trace: processing request
2023-11-15T10:26:44.777261Z  INFO source{component_kind="source" component_id=datadog_agent component_type=datadog_agent component_name=datadog_agent}: warp::filters::trace: finished processing with success status=200
2023-11-15T10:26:44.815701Z  INFO source{component_kind="source" component_id=datadog_agent component_type=datadog_agent component_name=datadog_agent}: warp::filters::trace: processing request
2023-11-15T10:26:56.861818Z  INFO source{component_kind="source" component_id=datadog_agent component_type=datadog_agent component_name=datadog_agent}: warp::filters::trace: processing request
2023-11-15T10:29:31.536387Z  INFO source{component_kind="source" component_id=datadog_agent component_type=datadog_agent component_name=datadog_agent}: warp::filters::trace: processing request
2023-11-15T10:29:31.591044Z  INFO source{component_kind="source" component_id=datadog_agent component_type=datadog_agent component_name=datadog_agent}: warp::filters::trace: processing request
2023-11-15T10:29:31.807207Z  INFO source{component_kind="source" component_id=datadog_agent component_type=datadog_agent component_name=datadog_agent}: warp::filters::trace: processing request
2023-11-15T10:32:40.321602Z  INFO source{component_kind="source" component_id=datadog_agent component_type=datadog_agent component_name=datadog_agent}: warp::filters::trace: processing request
2023-11-15T10:33:01.588940Z  INFO source{component_kind="source" component_id=datadog_agent component_type=datadog_agent component_name=datadog_agent}: warp::filters::trace: processing request
2023-11-15T10:34:06.004464Z  INFO source{component_kind="source" component_id=datadog_agent component_type=datadog_agent component_name=datadog_agent}: warp::filters::trace: processing request
2023-11-15T10:34:21.085659Z  INFO source{component_kind="source" component_id=datadog_agent component_type=datadog_agent component_name=datadog_agent}: warp::filters::trace: processing request
2023-11-15T10:34:22.998308Z  INFO source{component_kind="source" component_id=datadog_agent component_type=datadog_agent component_name=datadog_agent}: warp::filters::trace: processing request
2023-11-15T10:34:26.003856Z  INFO source{component_kind="source" component_id=datadog_agent component_type=datadog_agent component_name=datadog_agent}: warp::filters::trace: finished processing with success status=200

Additional Context

  • Removing the *.lock file doesn't help.
  • Flows that do not use this sink, such as internal metrics, are forwarded correctly.
  • Vector doesn't create additional *.dat files during start.
  • When we remove all orphaned *.dat files from the persistent volume and restart the pod, data processing resumes.
  • Not all restarts reproduce this issue; some pods have been restarted by us multiple times and always process data correctly.

References

No response

@kaarolch kaarolch added the type: bug A code related bug. label Nov 15, 2023
@dsmith3197 dsmith3197 added the domain: buffers Anything related to Vector's memory/disk buffers label Dec 12, 2023
Contributor

yalinglee commented Jul 11, 2024

We run Vector as StatefulSets in EKS, so each Vector pod has a dedicated PV provisioned for it too. Recently we observed a similar issue with orphaned .dat files when some pods got OOM killed. After we increased the memory limit, Vector pods restarted successfully and continued to create and process new .dat files for a while. However, because the old .dat files were still around, Vector failed to increment the buffer file number due to a naming conflict with an old .dat file, and stopped sending logs to sinks with disk buffers. Once we deleted the old .dat files and restarted the affected pod, it started working again. I think this might have been the case for you too: in your example, it would not have been possible to create a newer .dat file named "buffer-data-21.dat" because an old one already existed with the same filename.

We didn't see any errors related to "Unexpected status: 307 Temporary Redirect", though. The only errors we saw were dropped events with "reason=Source send cancelled", along with CPU and buffer usage flatlining.
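The wrap-around collision described above can be sketched with a toy model. This is only an illustration: MAX_FILE_ID and next_data_file are hypothetical stand-ins, not Vector's actual constant or API (those live in vector_buffers::variants::disk_v2), but the file-name pattern matches the .dat files shown in the issue.

```rust
use std::collections::HashSet;

// Hypothetical wrap limit for illustration only; the real constant is
// defined inside vector_buffers::variants::disk_v2.
const MAX_FILE_ID: u64 = 1 << 16;

// Name of the next data file the writer would try to create, with the
// file ID wrapping around at MAX_FILE_ID.
fn next_data_file(current_id: u64) -> String {
    format!("buffer-data-{}.dat", (current_id + 1) % MAX_FILE_ID)
}

fn main() {
    // Orphaned files left behind on the volume, as in the `ls` output above.
    let orphans: HashSet<String> = ["buffer-data-21.dat", "buffer-data-25.dat"]
        .map(String::from)
        .into_iter()
        .collect();

    // The writer resumed at buffer-data-20.dat; rolling over collides with
    // an orphan that no reader will ever consume, so the writer waits forever.
    let next = next_data_file(20);
    if orphans.contains(&next) {
        println!("collision: {next} already exists; writer blocks on reader");
    }
}
```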

@jvperrin

It looks to me like if there are any old .dat files on startup that don't get cleaned up, then this will silently get stuck due to the waiting at

// The file is still present and waiting for a reader to finish reading it in order
// to delete it. Wait until the reader signals progress and try again.
debug!("Target data file is still present and not yet processed. Waiting for reader.");
self.ledger.wait_for_reader().await;

which will never free up, since the reader hasn't loaded these "orphaned" buffer files.

There's also an interesting explanation in a comment just a bit before that line about why it waits on the reader:

// Normally, readers will keep up with the writers, and so there will only ever be a
// single data file or two on disk. If there was an issue with a sink reading from this
// buffer, though, we could conceivably have a stalled reader while the writer
// progresses and continues to create new data file.
//
// At some point, the file ID will wrap around and the writer will want to open a "new"
// file for writing that already exists: a previously-written file that has not been
// read yet.
//
// In order to handle this situation, we loop here, trying to create the file. Readers
// are responsible deleting a file once they have read it entirely, so our first loop
// iteration is the happy path, trying to create the new file. If we can't create it,
// this may be because it already exists and we're just picking up where we left off
// from last time, but it could also be a data file that a reader hasn't completed yet.

Does this mean that some kind of old buffer file cleanup needs to be done on startup when loading in the old ledger? How would Vector be able to tell what is still a valid buffer file versus an old/orphaned one? I'm not really familiar with the data structure used here, although I'm interested in learning more about it!
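One conceivable startup check, sketched here purely as a thought experiment: if the ledger's reader and writer file IDs bound the "live" window on the wrapping ID ring, anything outside that window could be flagged as an orphan candidate. Both is_live and MAX_FILE_ID below are hypothetical, not Vector APIs.

```rust
// Hypothetical wrap limit for illustration; the real constant is defined
// inside vector_buffers::variants::disk_v2.
const MAX_FILE_ID: u64 = 1 << 16;

// True if file_id lies in the window [reader_id, writer_id] on the wrapping
// file-ID ring. Files outside that window would have been written on a
// previous "lap", and no reader will ever visit them again.
fn is_live(file_id: u64, reader_id: u64, writer_id: u64) -> bool {
    let window = (writer_id + MAX_FILE_ID - reader_id) % MAX_FILE_ID;
    let offset = (file_id + MAX_FILE_ID - reader_id) % MAX_FILE_ID;
    offset <= window
}

fn main() {
    // With reader and writer both at file 20 (as the ledger sync logs show),
    // files 21, 25, and 32 all fall outside the live window.
    for id in [20u64, 21, 25, 32] {
        println!("buffer-data-{id}.dat live: {}", is_live(id, 20, 20));
    }
}
```

Whether a check like this would actually be safe depends on details of the ledger format (for example, unacknowledged records sitting in older files), which is exactly the question raised above.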

@yalinglee
Contributor

That's great info! cc @tobz, who might be able to provide more guidance or suggestions for potential fixes.
