
Vector unable to slow down the source when buffer is full #18633

Closed

dsmith3197 opened this issue Sep 21, 2023 · Discussed in #18578 · 0 comments · Fixed by #18634

Comments

@dsmith3197
Contributor

Discussed in #18578

Originally posted by anirbansingharoy September 17, 2023
In our setup, we have deployed Vector in an AKS cluster as a DaemonSet. The agent collects logs using the kubernetes_logs source, applies a remap transform, and then pipes the events to the sinks. For the sink we use the kafka connector, which publishes the logs to Azure Event Hub.

The setup was working perfectly until we got a big surge in the volume of logs being generated. We found that under high load Vector kept shipping all events to Azure Event Hub, which was throttling the requests. This caused the Vector pods to get into a state where they were repeatedly OOMKilled.

We then read about Vector's buffering model and thought it could be used to slow down the source when Azure Event Hub is unable to keep up with the pace of logs.

Initially we started with memory-based buffering with a maximum of 2000 events. We expected that once 2000 events were in the buffer, the source would be slowed down. However, that didn't happen: Vector kept hammering Azure Event Hub with requests, and Azure Event Hub rejected them due to throttling.

The configuration is as follows:

sources:
  aks_application_logs:
    type: kubernetes_logs
    extra_namespace_label_selector: "collectlogs=yes"
    delay_deletion_ms: 300000
    max_line_bytes: 64000
    max_read_bytes: 1024000
    glob_minimum_cooldown_ms: 30000
  aks_ingress_logs:
    type: kubernetes_logs    
    extra_namespace_label_selector: "capturelogstostorageAcc=yes"

transforms:
  aks_application_logs_transform:
    type: remap
    inputs:
      - aks_application_logs
    source: |-
      if is_json(string!(.message)) {
          message_json = object!(parse_json(string!(.message)) ?? {})
          del(.message)
          . = merge(.,message_json)
      }
sinks:
  az_storage_acc:
    type: azure_blob
    inputs:
      - aks_ingress_logs
    container_name: logs
    storage_account: accountname
    blob_prefix: "date/%F/hour/%H/"
    healthcheck:
      enabled: false
    encoding:
      codec: json
  event_hub:
    type: kafka
    acknowledgements:
      enabled: true
    inputs:
      - aks_application_logs_transform
    bootstrap_servers: eventhub_instance:9093
    topic: applogs
    encoding:
      codec: "json"
    healthcheck:
      enabled: true
    batch:
      max_events: 10
    buffer:
      type: "memory"
      max_events: 2000
      when_full: "block"
    librdkafka_options:
      "retries": "2"
    sasl:
      enabled: true
      mechanism: PLAIN
      username: "$$$ConnectionString"
      password: "$${EH_CONNECTION_STRING}"
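
For clarity, the two parts of this sink configuration that we expected to create backpressure are the acknowledgements and buffer blocks. Annotated here with the same values as above (the comments are explanatory only):

  event_hub:
    type: kafka
    acknowledgements:
      enabled: true          # wait for Event Hub to acknowledge delivery before acking upstream
    buffer:
      type: "memory"
      max_events: 2000       # hold at most 2000 events in the sink buffer
      when_full: "block"     # when the buffer is full, block upstream instead of dropping events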

The pods were continuously OOMKilled during this time as well.

Later we switched to disk-based buffering to see if that would help, but Vector was still hammering Azure Event Hub with requests. The only way we have been able to handle the situation so far is by increasing the Azure Event Hub capacity.
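
For illustration, the disk-based variant attaches to the same sink; a minimal sketch (the max_size value is an assumed example, since disk buffers are sized in bytes rather than events):

  event_hub:
    type: kafka
    # ... other sink settings unchanged ...
    buffer:
      type: "disk"
      max_size: 268435456    # illustrative: roughly 256 MiB of on-disk buffer
      when_full: "block"     # block upstream when the disk buffer is full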

So we would like to know whether we are making a mistake in the buffer configuration that prevents the source from slowing down when the sink cannot keep up with the pace. Any other suggestions are also appreciated.
