
Vector unable to slow down the source when buffer is full #18633

Closed

dsmith3197 opened this issue Sep 21, 2023 · Discussed in #18578 · 0 comments · Fixed by #18634

Comments

@dsmith3197
Contributor

Discussed in #18578

Originally posted by anirbansingharoy September 17, 2023
In our setup, we have deployed Vector in an AKS cluster as a DaemonSet. The agent collects logs using the kubernetes_logs source, applies a remap transform, and then pipes the events to the sinks. For the sink we use the kafka connector, which publishes the logs to Azure Event Hub.

The setup was working perfectly until we got a big surge in the volume of logs being generated. We found that under high load Vector kept shipping all events to Azure Event Hub, which was throttling the requests. This caused the Vector pods to get into a state where they were repeatedly OOMKilled.

We then read about Vector's buffering model and thought it could be used to slow down the source when Azure Event Hub is unable to keep up with the pace of logs.

Initially we started with memory-based buffering with a maximum of 2000 events. We expected that once 2000 events were in the buffer, the source would be slowed down. However, that didn't happen: Vector kept hammering Azure Event Hub with requests, and Azure Event Hub rejected them due to throttling.

The configuration is as follows:

sources:
  aks_application_logs:
    type: kubernetes_logs
    extra_namespace_label_selector: "collectlogs=yes"
    delay_deletion_ms: 300000
    max_line_bytes: 64000
    max_read_bytes: 1024000
    glob_minimum_cooldown_ms: 30000
  aks_ingress_logs:
    type: kubernetes_logs    
    extra_namespace_label_selector: "capturelogstostorageAcc=yes"

transforms:
  aks_application_logs_transform:
    type: remap
    inputs:
      - aks_application_logs
    source: |-
      if is_json(string!(.message)) {
          message_json = object!(parse_json(string!(.message)) ?? {})
          del(.message)
          . = merge(.,message_json)
      }
sinks:
  az_storage_acc:
    type: azure_blob
    inputs:
      - aks_ingress_logs
    container_name: logs
    storage_account: accountname
    blob_prefix: "date/%F/hour/%H/"
    healthcheck:
      enabled: false
    encoding:
      codec: json
  event_hub:
    type: kafka
    acknowledgements:
      enabled: true
    inputs:
      - aks_application_logs_transform
    bootstrap_servers: eventhub_instance:9093
    topic: applogs
    encoding:
      codec: "json"
    healthcheck:
      enabled: true
    batch:
      max_events: 10
    buffer:
      type: "memory"
      max_events: 2000
      when_full: "block"
    librdkafka_options:
      "retries": "2"
    sasl:
      enabled: true
      mechanism: PLAIN
      username: "$$$ConnectionString"
      password: "$${EH_CONNECTION_STRING}"
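
For clarity, the two parts of this sink configuration that we expected to create backpressure are the acknowledgements and buffer blocks. Annotated here with the same values as above (the comments are explanatory only):

  event_hub:
    type: kafka
    acknowledgements:
      enabled: true          # wait for Event Hub to acknowledge delivery before acking upstream
    buffer:
      type: "memory"
      max_events: 2000       # hold at most 2000 events in the sink buffer
      when_full: "block"     # when the buffer is full, block upstream instead of dropping events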

The pods were continuously OOMKilled during this time as well.

Later we switched to disk-based buffering to see if that would help, but Vector was still hammering Azure Event Hub with requests. The only way we have been able to handle the situation so far is by increasing the Azure Event Hub capacity.
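
For illustration, the disk-based variant attaches to the same sink; a minimal sketch (the max_size value is an assumed example, since disk buffers are sized in bytes rather than events):

  event_hub:
    type: kafka
    # ... other sink settings unchanged ...
    buffer:
      type: "disk"
      max_size: 268435456    # illustrative: roughly 256 MiB of on-disk buffer
      when_full: "block"     # block upstream when the disk buffer is full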

So we would like to know whether we are making a mistake in the buffer configuration that prevents the source from slowing down when the sink cannot keep up with the pace. Any other suggestions are also appreciated.
