kubernetes_logs missing labels after upgrading to 0.30.0 #17487

Closed
JohanLindvall opened this issue May 25, 2023 · 8 comments · Fixed by #17746

Labels
meta: regression · source: file · source: kubernetes_logs · type: bug
Milestone
Vector 0.30.1

Comments

@JohanLindvall

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

After upgrading from vector 0.29.1 to 0.30.0, vector logs the following errors:

2023-05-24T06:34:55.588006Z ERROR sink{component_kind="sink" component_id=loki_sink component_type=loki component_name=loki_sink}: vector::internal_events::template: Internal log [Failed to render template for "label_value "{{ kubernetes.pod_name }}" with label_key "pod"".] has been suppressed 11064 times.

Switching back to 0.29.1 solves the issue. We are using the official Helm charts to deploy. Our cluster is AKS, running Kubernetes 1.24.9.

Configuration

expire_metrics_secs: 21600
sinks:
  loki_sink:
    auth:
      strategy: bearer
      token: ${LOKI_BEARER_TOKEN}
    buffer:
      max_size: 2147483648
      type: disk
    compression: gzip
    encoding:
      codec: text
    endpoint: http://loki-distributor:3100/
    inputs:
    - not_vector_filter
    labels:
      app: '{{ app }}'
      cluster: development
      component: '{{ component }}'
      container: '{{ kubernetes.container_name }}'
      file: '{{ file }}'
      image: '{{ kubernetes.container_image }}'
      instance: '{{ instance }}'
      ip: '{{ kubernetes.pod_ip }}'
      job: '{{ job }}'
      level: '{{ level }}'
      namespace: '{{ kubernetes.pod_namespace }}'
      node_name: '{{ node_name }}'
      pod: '{{ kubernetes.pod_name }}'
      product: '{{ product }}'
      source_context: '{{ source_context }}'
      source_type: '{{ source_type }}'
      stream: '{{ stream }}'
      template_hash: '{{ template_hash }}'
      truncated: '{{ truncated }}'
      unit: '{{ unit }}'
    out_of_order_action: accept
    type: loki
  loki_sink_vector:
    auth:
      strategy: bearer
      token: ${LOKI_BEARER_TOKEN}
    buffer:
      max_size: 268435488
      type: disk
    compression: gzip
    encoding:
      codec: text
    endpoint: http://loki-distributor:3100/
    inputs:
    - vector_filter
    labels:
      app: '{{ app }}'
      cluster: development
      component: '{{ component }}'
      container: '{{ kubernetes.container_name }}'
      file: '{{ file }}'
      image: '{{ kubernetes.container_image }}'
      instance: '{{ instance }}'
      ip: '{{ kubernetes.pod_ip }}'
      job: '{{ job }}'
      level: '{{ level }}'
      namespace: '{{ kubernetes.pod_namespace }}'
      node_name: '{{ node_name }}'
      pod: '{{ kubernetes.pod_name }}'
      product: '{{ product }}'
      source_context: '{{ source_context }}'
      source_type: '{{ source_type }}'
      stream: '{{ stream }}'
      template_hash: '{{ template_hash }}'
      truncated: '{{ truncated }}'
      unit: '{{ unit }}'
    out_of_order_action: accept
    type: loki
  prom_exporter:
    address: 0.0.0.0:9090
    inputs:
    - log_to_metric_kubernetes_logs
    - internal_metrics_filter
    type: prometheus_exporter
sources:
  file_source:
    exclude:
    - /var/log/azure/aks/**/*.log
    include:
    - /var/log/azure/**/*.log
    type: file
  internal_metrics_source:
    type: internal_metrics
  journald_source:
    type: journald
  kubernetes_source:
    type: kubernetes_logs
transforms:
  file_remap:
    inputs:
    - file_source
    source: |2

      .node_name="$VECTOR_SELF_NODE_NAME"
    type: remap
  internal_metrics_filter:
    condition: |
      .name=="component_errors_total"||.name=="component_discarded_events_total"||.name=="component_sent_events_total"||.name=="component_received_events_total"
    inputs:
    - internal_metrics_source
    type: filter
  journald_remap:
    inputs:
    - journald_source
    source: |2

      if !is_nullish(._SYSTEMD_UNIT) {
        .unit=._SYSTEMD_UNIT
      }
      .node_name=.host
      .stream=._TRANSPORT
    type: remap
  kubernetes_filter:
    condition: |
      .kubernetes.container_image != "combination/grpcui-sidecar:latest"
    inputs:
    - kubernetes_source
    type: filter
  kubernetes_remap:
    inputs:
    - kubernetes_filter
    source: |
      .app, err = get(.kubernetes.pod_labels, ["app.kubernetes.io/name"])
      if is_nullish(.app) {
        .app, err = get(.kubernetes.pod_labels, ["app"])
      }
      if is_nullish(.app) {
        .app = .kubernetes.pod_name
      }

      .instance, err = get(.kubernetes.pod_labels, ["app.kubernetes.io/instance"])
      if is_nullish(.instance) {
        .instance, err = get(.kubernetes.pod_labels, ["release"])
      }
      if is_nullish(.instance) {
        del(.instance)
      }

      .component, err = get(.kubernetes.pod_labels, ["app.kubernetes.io/component"])
      if is_nullish(.component) {
        .component, err = get(.kubernetes.pod_labels, ["component"])
      }
      if is_nullish(.component) {
        del(.component)
      }

      .node_name = .kubernetes.pod_node_name

      .job, err = .kubernetes.pod_namespace + "/" + .app
      if is_nullish(.job) {
        del(.job)
      }

      .product, err = get(.kubernetes.pod_labels, ["app.kubernetes.io/part-of"])
      if is_nullish(.product) {
        .product, err = get(.kubernetes.pod_labels, ["part-of"])
      }
      if is_nullish(.product) {
        del(.product)
      }

      if exists(.instance) {
        .instance_default = .instance
      } else {
        .instance_default = ""
      }

      if is_nullish(.kubernetes.container_image) {
      # calico-system/calico-node and others
        .kubernetes.container_image = "unknown"
      }

      del(.file)
    type: remap
  level_remap:
    inputs:
    - kubernetes_remap
    - journald_remap
    - file_remap
    source: |2

      structured, err = parse_json(.message)
      if err == null {
        if !is_nullish(structured.level) {
          .level = structured.level
        } else if !is_nullish(structured.@l) {
          .level = structured.@l
        }
        if !is_nullish(structured.SourceContext) && !is_nullish(structured.@i) {
          .source_context = structured.SourceContext
          .template_hash = structured.@i
        }
      } else {
        structured = parse_regex(.message, r'^\[(?P<level>[A-Z]+)\]') ??
          parse_regex(.message, r'^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} \[(?P<level>[a-z]+)\]') ??
          parse_regex(.message, r'^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z\t(?P<level>[a-z]+)\t') ??
          parse_regex(.message, r'^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z\s+(?P<level>[A-Z]+) ') ??
          parse_regex(.message, r'^\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+\]\[\d+\]\[(?P<level>[a-z]+)\]') ??
          parse_regex(.message, r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ \[(?P<level>[a-z]+)\]') ??
          parse_regex(.message, r'^\[\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} (?P<level>[A-Z]+) [^ ]+ \d+\]') ??
          parse_regex(.message, r'^(?P<level>[IWE])\d+ \d{2}:\d{2}:\d{2}\.\d+\s+\d+\s+[^ :]+:\d+\]') ??
          parse_regex(.message, r'level=(?P<level>[a-z]+)') ??
          {}

        if exists(structured.level) {
          .level = structured.level
        }
      }

      if exists(.level) {
        .level_default = .level
      } else {
        .level_default = ""
      }

      smessage, err=to_string(.message)
      if err == null && strlen(smessage) > 10000000 {
        .message=truncate(smessage, limit: 10000000, ellipsis: true)
      }
    type: remap
  log_to_metric_kubernetes_filter:
    condition: |
      .source_type == "kubernetes_logs"
    inputs:
    - level_remap
    type: filter
  log_to_metric_kubernetes_logs:
    inputs:
    - log_to_metric_kubernetes_filter
    metrics:
    - field: message
      name: log_entries_total
      namespace: processed
      tags:
        app: '{{ app }}'
        container: '{{ kubernetes.container_name }}'
        image: '{{ kubernetes.container_image }}'
        instance: '{{ instance_default }}'
        level: '{{ level_default }}'
        namespace: '{{ kubernetes.pod_namespace }}'
        source_type: '{{ source_type }}'
      type: counter
    type: log_to_metric
  not_vector_filter:
    condition: |
      .kubernetes.container_image != "$CMB_DEP_DOCKER_vector_REF"
    inputs:
    - level_remap
    type: filter
  vector_filter:
    condition: |
      .kubernetes.container_image == "$CMB_DEP_DOCKER_vector_REF"
    inputs:
    - level_remap
    type: filter

Version

docker.io/timberio/vector:0.30.0-debian

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

JohanLindvall added the type: bug label on May 25, 2023
@JohanLindvall
Author

JohanLindvall commented May 25, 2023

The issue doesn't seem to be limited to the kubernetes_logs source. We also get similar errors logged for file source messages.

jszwedko added the meta: regression, source: file, and source: kubernetes_logs labels on May 25, 2023
@fuchsnj
Member

fuchsnj commented May 25, 2023

Can you confirm you are using the exact same config for both vector 0.30.0 and 0.29.1?

It looks like some of the components that are used as inputs into loki_sink aren't adding the kubernetes.* fields to the event, which is why you are getting the template error.

For example, none of the components in the following path adds the kubernetes.pod_name field.
file_source -> file_remap -> level_remap -> not_vector_filter -> loki_sink
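
If those non-Kubernetes events do need to flow into loki_sink, one option is to give them placeholder fields in a remap step so the label templates always have something to render. A rough sketch only (the non_k8s_defaults name and the placeholder values are illustrative, not taken from your config):

transforms:
  non_k8s_defaults:
    type: remap
    inputs:
    - file_remap
    - journald_remap
    source: |
      # Events from the file and journald sources carry no kubernetes.* metadata,
      # so give the loki label templates something to render.
      if !exists(.kubernetes.pod_name) {
        .kubernetes.pod_name = "none"
      }
      if !exists(.kubernetes.container_name) {
        .kubernetes.container_name = "none"
      }
      # (repeat for any other fields referenced by the loki labels)

You'd also need to point level_remap (or the downstream filters) at this transform instead of file_remap/journald_remap.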

@JohanLindvall
Author

> Can you confirm you are using the exact same config for both vector 0.30.0 and 0.29.1?

Yes, I am using the exact same config in 0.29.1 and 0.30.0.

> It looks like some of the components that are used as inputs into loki_sink aren't adding the kubernetes.* fields to the event, which is why you are getting the template error.

> For example, none of the components in the following path adds the kubernetes.pod_name field. file_source -> file_remap -> level_remap -> not_vector_filter -> loki_sink

Yes, for some reason the fields are no longer being propagated to the sinks.

@JohanLindvall
Author

JohanLindvall commented May 29, 2023

The problem may be that the (vector) label values are actually not set and that the Loki sink rejects messages with labels referencing undefined vector labels. This used to work in 0.29.1.

Could it be #17052?

@jszwedko
Member

We discussed this briefly. The behavior change in 0.30.0 was to add the log message when a rendered template uses a field that doesn't exist, but the sink functionally still works the same as in 0.29.1: it still sends the data without the label. The conclusion we came to is that the error should become a warning and, potentially, that users should be able to suppress the warning to restore the old behavior from 0.29.x.

To be consistent with the rest of Vector’s template handling in sinks, rendering errors should cause events to be dropped; however, it does seem somewhat useful for Loki labels to be added conditionally, depending on whether the template field exists. We need to reflect a bit more on what a good configuration UX for that would be.

@JohanLindvall
Author

If anyone else is affected by this: I fixed our config to use dynamic labels in the loki sink, and everything works as expected again. Roughly, that looks like the sketch below.
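
This is an illustrative sketch only, not our exact config; it assumes the loki sink's wildcard label expansion, and the label_builder transform name and label keys are made up. The idea is to build an object containing only the fields that actually exist, then expand that object into labels in the sink:

transforms:
  label_builder:
    type: remap
    inputs:
    - level_remap
    source: |
      # Collect only the label values that actually exist on the event.
      .labels = {}
      if exists(.kubernetes.pod_name) { .labels.pod = .kubernetes.pod_name }
      if exists(.kubernetes.container_name) { .labels.container = .kubernetes.container_name }
      if exists(.app) { .labels.app = .app }
sinks:
  loki_sink:
    type: loki
    inputs:
    - label_builder
    endpoint: http://loki-distributor:3100/
    encoding:
      codec: text
    labels:
      cluster: development
      # Expand the labels object into individual Loki labels, so no template
      # is left referencing a field that might not exist.
      '*': '{{ labels }}'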

@onelapahead

So I want to note for others: upgrading to 0.30.0 is not just a logging change; internal Prometheus metrics are affected too.

In our use case we're sending kubernetes_logs to loki and vector sinks, and we found that, despite the chatter from the new errors in the logs, everything continued to work. As @jszwedko mentioned:

> it still sends the data without the label

However, we scrape Vector's own Prometheus metrics and (maybe naively) have the following custom alerting rule as a way to detect any issues sending to our sinks:

increase(vector_component_errors_total[5m]) > 0

So we saw alerts firing because each Vector was accumulating vector_component_errors_total{error_type="template_failed"} counts. For now, we've downgraded to 0.29.1 while we reconsider our alerting rule.

This all sounds expected, but I guess reading the thread I have two questions:

  1. With this log message moving to a warning instead of an error, and the potential feature of being able to disable it entirely, will the same be true for it incrementing the vector_component_errors_total metric? Does it make sense instead for these templating errors to be a separate metric entirely, rather than an error_type of template_failed?
  2. If our goal is to detect issues sending to sinks, does our alerting rule need to be limited to certain error_types, such as request_failed? Or is there a different metric you would recommend using? (A rough sketch of what we're considering follows this list.)
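
In the meantime we're experimenting with narrowing the rule, roughly like this (a sketch only; excluding template_failed is an assumption based on the counts we saw, and the group/alert names are made up):

groups:
- name: vector
  rules:
  - alert: VectorComponentErrors
    # Ignore the template rendering errors discussed above (they don't drop
    # events); alert on anything else, e.g. request_failed.
    expr: increase(vector_component_errors_total{error_type!="template_failed"}[5m]) > 0
    for: 10m
    labels:
      severity: warning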

Thanks for the input and the awesome project. We've had great success recently adopting Vector within our logging stack, and we're experimenting with it as a Prometheus metric receiver/forwarder as well 📈

jszwedko added this to the Vector 0.30.1 milestone on Jun 23, 2023
jszwedko added a commit that referenced this issue Jun 23, 2023
…to warnings

If a render error doesn't result in a dropped event, it seems more like a warning than an error.

For the places that currently emit template errors with `drop_event: false`:

* `loki` sink: skips inserting label if key or value fails to render; falls back to `None` for
  partitioning using `tenant_id`
* `throttle` transform: falls back to `None` for throttle key
* `log_to_metric` transform: skips tag addition
* `papertrail` sink: falls back to `vector` for the `process` field
* `splunk_hec_logs` sink: falls back to `None` for partition keys (source, sourcetype, index)
* `splunk_hec_metrics` sink: falls back to `None` for source, sourcetype, index

Fixes: #17487

Signed-off-by: Jesse Szwedko <jesse.szwedko@datadoghq.com>
@jszwedko
Member

Thanks for the discussion all! I opened #17746 to drop this back down to a warning, at least. That should be an improvement in that it won't publish the error metric.

github-merge-queue bot pushed a commit that referenced this issue Jun 23, 2023
…to warnings (#17746)
