
Loki 3.0 Feedback and Issues #12506

Open
slim-bean opened this issue Apr 8, 2024 · 83 comments

@slim-bean
Collaborator

slim-bean commented Apr 8, 2024

If you encounter any troubles upgrading to Loki 3.0 or have feedback for the upgrade process, please leave a comment on this issue!

You can also ask questions at https://slack.grafana.com/ in the channel #loki-3.

Known Issues:

@Hedius

Hedius commented Apr 8, 2024

Please update grafana.com/docs/loki before releasing a major update; it still shows the 2.9 documentation. :)

@onedr0p

onedr0p commented Apr 8, 2024

I tried upgrading the Helm chart ( 5.47.2 → 6.0.0 ) but encountered these errors:

❯ k -n observability logs loki-write-1
failed parsing config: /etc/loki/config/config.yaml: yaml: unmarshal errors:
  line 41: field shared_store not found in type compactor.Config
  line 62: field enforce_metric_name not found in type validation.plain. Use `-config.expand-env=true` flag if you want to expand environment variables in your config file
❯ k -n observability logs loki-read-779bd69757-rrdxt
failed parsing config: /etc/loki/config/config.yaml: yaml: unmarshal errors:
  line 41: field shared_store not found in type compactor.Config
  line 62: field enforce_metric_name not found in type validation.plain. Use `-config.expand-env=true` flag if you want to expand environment variables in your config file
❯ k logs -n observability loki-backend-1
Defaulted container "loki-sc-rules" out of: loki-sc-rules, loki
{"time": "2024-04-08T21:37:34.546399+00:00", "msg": "Starting collector", "level": "INFO"}
{"time": "2024-04-08T21:37:34.546577+00:00", "msg": "No folder annotation was provided, defaulting to k8s-sidecar-target-directory", "level": "WARNING"}
{"time": "2024-04-08T21:37:34.546733+00:00", "msg": "Loading incluster config ...", "level": "INFO"}
{"time": "2024-04-08T21:37:34.547477+00:00", "msg": "Config for cluster api at 'https://10.43.0.1:443' loaded...", "level": "INFO"}
{"time": "2024-04-08T21:37:34.547598+00:00", "msg": "Unique filenames will not be enforced.", "level": "INFO"}
{"time": "2024-04-08T21:37:34.547695+00:00", "msg": "5xx response content will not be enabled.", "level": "INFO"}

I'm pretty sure I adjusted for all the breaking changes described in the release notes, but maybe some of my custom config is not compatible?

My Helm values are located here, any help?

@Hedius

Hedius commented Apr 8, 2024

I tried upgrading the Helm chart ( 5.47.2 → 6.0.0 ) but encountered these errors:

...

You are setting shared_store in the compactor config; it was removed there as well.

See https://github.com/grafana/loki/blob/main/docs/sources/configure/_index.md#compactor

delete_request_store is now required
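
For anyone else hitting this, a minimal sketch of the rename, assuming an S3 object store (use whatever store your setup actually points at):

# Loki 2.9 style (rejected by 3.0)
compactor:
  shared_store: s3

# Loki 3.0 style
compactor:
  delete_request_store: s3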

@onedr0p

onedr0p commented Apr 8, 2024

So I should just be able to rename shared_store to delete_request_store and be good?

@alto-rlk

alto-rlk commented Apr 8, 2024

helm template grafana/loki --set loki.useTestSchema=true --set-json imagePullSecrets='["blah"]' fails for me with ...executing "loki.memcached.statefulSet" at <$.ctx.Values.image.pullSecrets>: nil pointer evaluating interface {}.pullSecrets

Adding --set-json image.pullSecrets='["blah2"]' to the previous command does work, but image.pullSecrets isn't documented in values.yaml, and would be kind of redundant, so I think maybe this is a typo for imagePullSecrets here?

@rknightion

rknightion commented Apr 9, 2024

Since the upgrade everything looks good in our environments, although the backend pods seem to be outputting a lot of:

level=info ts=2024-04-09T08:01:08.971329289Z caller=gateway.go:241 component=index-gateway msg="chunk filtering is not enabled"

with every Loki search. This wasn't happening before 3.0, from what we can tell.

I suspect that's because blooms aren't enabled, although when I do enable blooms we get a nil pointer dereference:

level=info ts=2024-04-09T08:17:29.692174397Z caller=bloomcompactor.go:458 component=bloom-compactor msg=compacting org_id=plprod table=index_19820 ownership=1f6c0f8500000000-1fa8b221ffffffff
ts=2024-04-09T08:17:31.535678052Z caller=memberlist_logger.go:74 level=warn msg="Got ping for unexpected node 'loki-backend-3-2e51d875' from=10.30.80.69:7946"
level=info ts=2024-04-09T08:17:31.610784021Z caller=scheduler.go:653 msg="this scheduler is in the ReplicationSet, will now accept requests."
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x1aec384]

goroutine 1430 [running]:
github.com/grafana/loki/v3/pkg/bloomcompactor.(*SimpleBloomController).buildGaps.OnceFunc.func4.1()
	/usr/local/go/src/sync/oncefunc.go:24 +0x7c
panic({0x2002700?, 0x42aae10?})
	/usr/local/go/src/runtime/panic.go:914 +0x218
github.com/grafana/loki/v3/pkg/bloomcompactor.(*SimpleBloomController).buildGaps.func2()
	/src/loki/pkg/bloomcompactor/controller.go:388 +0x24
github.com/grafana/loki/v3/pkg/bloomcompactor.(*SimpleBloomController).buildGaps.OnceFunc.func4()
	/usr/local/go/src/sync/oncefunc.go:27 +0x64
sync.(*Once).doSlow(0x4006e9f128?, 0x0?)
	/usr/local/go/src/sync/once.go:74 +0x100
sync.(*Once).Do(0x400004e800?, 0x21cc060?)
	/usr/local/go/src/sync/once.go:65 +0x24
github.com/grafana/loki/v3/pkg/bloomcompactor.(*SimpleBloomController).buildGaps.OnceFunc.func5()
	/usr/local/go/src/sync/oncefunc.go:31 +0x34
github.com/grafana/loki/v3/pkg/bloomcompactor.(*SimpleBloomController).buildGaps(0x4006e7e720, {0x2c70e48, 0x4006e6e7d0}, {0x4006867892, 0x6}, {{0x1f6c0f8500000000?}, {0x40005a0578?, 0x4d6c?}}, {0x4321220?, 0x0?}, ...)
	/src/loki/pkg/bloomcompactor/controller.go:396 +0x133c
github.com/grafana/loki/v3/pkg/bloomcompactor.(*SimpleBloomController).compactTenant(0x4006e7e720, {0x2c70e48, 0x4006e6e7d0}, {{0x2?}, {0x40005a0578?, 0x101000000226f98?}}, {0x4006867892, 0x6}, {0x2?, 0x0?}, ...)
	/src/loki/pkg/bloomcompactor/controller.go:115 +0x6a0
github.com/grafana/loki/v3/pkg/bloomcompactor.(*Compactor).compactTenantTable(0x40007eee00, {0x2c70e48, 0x4006e6e7d0}, 0x4001a7eab0, 0x0?)
	/src/loki/pkg/bloomcompactor/bloomcompactor.go:460 +0x2e8
github.com/grafana/loki/v3/pkg/bloomcompactor.(*Compactor).runWorkers.func2({0x2c70e48, 0x4006e6e7d0}, 0x0?)
	/src/loki/pkg/bloomcompactor/bloomcompactor.go:422 +0xe0
github.com/grafana/dskit/concurrency.ForEachJob.func1()
	/src/loki/vendor/github.com/grafana/dskit/concurrency/runner.go:105 +0xbc
golang.org/x/sync/errgroup.(*Group).Go.func1()
	/src/loki/vendor/golang.org/x/sync/errgroup/errgroup.go:78 +0x58
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 1428
	/src/loki/vendor/golang.org/x/sync/errgroup/errgroup.go:75 +0x98

@nomaster

nomaster commented Apr 9, 2024

When upgrading, the pod from the new stateful set 'loki-chunks-cache' couldn't be scheduled, because none of our nodes offer the requested 9830 MiB of memory.

@slim-bean
Collaborator Author

Please update grafana.com/docs/loki before releasing a major update; it still shows the 2.9 documentation. :)

Very sorry about this. We are working on a new release process and also had problems with our documentation updates. I think there are still a few things we are working out, but hopefully most of it is correct now.

@slim-bean
Collaborator Author

When upgrading, the pod from the new stateful set 'loki-chunks-cache' couldn't be scheduled, because none of our nodes offer the requested 9830 MiB of memory.

You could disable this external memcached entirely by setting enabled: false

Or you can make it smaller by reducing allocatedMemory; this will also automatically adjust the pod requests in k8s!

chunksCache:
  # -- Specifies whether memcached based chunks-cache should be enabled
  enabled: true
  # -- Amount of memory allocated to chunks-cache for object storage (in MB).
  allocatedMemory: 8192
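
For example, keeping the cache but shrinking it might look something like this (4096 is just an illustrative value, not a recommendation):

chunksCache:
  # Set to false to skip deploying the external memcached entirely
  enabled: true
  # Reduced from the 8192 MB default; the pod memory request is derived from this value
  allocatedMemory: 4096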

@sandstrom

Awesome with the new bloom filter, for unique IDs etc! 🎉

I'm looking forward to closing issue #91 (from 2018) once the experimental bloom filters are stable. 😄

Regarding docs, some feedback:

  • Would be nice to rewrite the 'Simple Scalable' to not assume Kubernetes. For example, move sentences such as "The write target is stateful and is controlled by a Kubernetes StatefulSet." into a separate sub-heading, named kubernetes details. That way, the general description of the simple scalable deployment mode doesn't need to dig into details on how to deploy it under kubernetes.
  • Clarify why the write-target and backend-targets are stateful. I thought any state was on S3 or in configuration files. Is this 'state' the WAL, or cached chunks on disk before being flushed to S3 (or other object storage)? If so, maybe clarify this.
  • Update the architecture section and skip any mention of BoltDB and other legacy stuff, it's just confusing. Include only information related to how it's operated under a regular 3.0 deployment (you can still keep old 1.x docs about BoltDB, just remove it from 3.x docs).
  • More of a feature request, but rename "fake" to "default", it's confusing: https://grafana.com/docs/loki/latest/get-started/architecture/#multi-tenancy
  • Update the docs to reflect 3.0, currently it says "For release 2.9 the components are:…"
  • Update https://grafana.com/docs/loki/latest/operations/storage/retention/ and explain how to use life-cycle rules on S3 (or similar) to handle retention. Remove legacy stuff here too.

Source:
https://grafana.com/docs/loki/latest/get-started/deployment-modes/

@JStickler added the 3.0 label Apr 9, 2024
@K1kc4

K1kc4 commented Apr 10, 2024

Trying to update Helm chart 5.43.2 to 6.1.0, but I am getting:

UPGRADE FAILED: template: loki/templates/single-binary/statefulset.yaml:44:28: executing "loki/templates/single-binary/statefulset.yaml" at <include (print .Template.BasePath "/config.yaml") .>: error calling include: template: loki/templates/config.yaml:19:7: executing "loki/templates/config.yaml" at <include "loki.calculatedConfig" .>: error calling include: template: loki/templates/_helpers.tpl:461:24: executing "loki.calculatedConfig" at <tpl .Values.loki.config .>: error calling tpl: error during tpl function execution for "{{- if .Values.enterprise.enabled}}\n{{- tpl .Values.enterprise.config . }}\n{{- else }}\nauth_enabled: {{ .Values.loki.auth_enabled }}\n{{- end }}\n\n{{- with .Values.loki.server }}\nserver:\n  {{- toYaml . | nindent 2}}\n{{- end}}\n\nmemberlist:\n{{- if .Values.loki.memberlistConfig }}\n  {{- toYaml .Values.loki.memberlistConfig | nindent 2 }}\n{{- else }}\n{{- if .Values.loki.extraMemberlistConfig}}\n{{- toYaml .Values.loki.extraMemberlistConfig | nindent 2}}\n{{- end }}\n  join_members:\n    - {{ include \"loki.memberlist\" . }}\n    {{- with .Values.migrate.fromDistributed }}\n    {{- if .enabled }}\n    - {{ .memberlistService }}\n    {{- end }}\n
  {{- end }}\n{{- end }}\n\n{{- with .Values.loki.ingester }}\ningester:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- if .Values.loki.commonConfig}}\ncommon:\n{{- toYaml .Values.loki.commonConfig | nindent 2}}\n  storage:\n  {{- include \"loki.commonStorageConfig\" . | nindent 4}}\n{{- end}}\n\n{{- with .Values.loki.limits_config }}\nlimits_config:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\nruntime_config:\n  file: /etc/loki/runtime-config/runtime-config.yaml\n\n{{- with .Values.chunksCache }}\n{{- if .enabled }}\nchunk_store_config:\n  chunk_cache_config:\n    default_validity: {{ .defaultValidity }}\n    background:\n      writeback_goroutines: {{ .writebackParallelism }}\n      writeback_buffer: {{ .writebackBuffer }}\n      writeback_size_limit: {{ .writebackSizeLimit }}\n    memcached:\n      batch_size: {{ .batchSize }}\n      parallelism: {{ .parallelism }}\n    memcached_client:\n      addresses: dnssrvnoa+_memcached-client._tcp.{{ template \"loki.fullname\" $ }}-chunks-cache.{{ $.Release.Namespace }}.svc\n      consistent_hash: true\n      timeout: {{ .timeout }}\n      max_idle_conns: 72\n{{- end }}\n{{- end }}\n\n{{- if .Values.loki.schemaConfig }}\nschema_config:\n{{- toYaml .Values.loki.schemaConfig | nindent 2}}\n{{- end }}\n\n{{- if .Values.loki.useTestSchema }}\nschema_config:\n{{- toYaml .Values.loki.testSchemaConfig | nindent 2}}\n{{- end }}\n\n{{ include \"loki.rulerConfig\" . }}\n\n{{- if or .Values.tableManager.retention_deletes_enabled .Values.tableManager.retention_period }}\ntable_manager:\n  retention_deletes_enabled: {{ .Values.tableManager.retention_deletes_enabled }}\n  retention_period: {{ .Values.tableManager.retention_period }}\n{{- end }}\n\nquery_range:\n  align_queries_with_step: true\n  {{- with .Values.loki.query_range }}\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n  {{- end }}\n  {{- if .Values.resultsCache.enabled }}\n  {{- with .Values.resultsCache }}\n  cache_results: true\n  results_cache:\n    cache:\n      default_validity: {{ .defaultValidity }}\n      background:\n        writeback_goroutines: {{ .writebackParallelism }}\n        writeback_buffer: {{ .writebackBuffer }}\n        writeback_size_limit: {{ .writebackSizeLimit }}\n      memcached_client:\n        consistent_hash: true\n        addresses: dnssrvnoa+_memcached-client._tcp.{{ template \"loki.fullname\" $ }}-results-cache.{{ $.Release.Namespace }}.svc\n        timeout: {{ .timeout }}\n        update_interval: 1m\n  {{- end }}\n  {{- end }}\n\n{{- with .Values.loki.storage_config }}\nstorage_config:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.query_scheduler }}\nquery_scheduler:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.compactor }}\ncompactor:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.analytics }}\nanalytics:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.querier }}\nquerier:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.index_gateway }}\nindex_gateway:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.frontend }}\nfrontend:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.frontend_worker }}\nfrontend_worker:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.distributor }}\ndistributor:\n  {{- tpl (. 
| toYaml) $ | nindent 4 }}\n{{- end }}\n\ntracing:\n  enabled: {{ .Values.loki.tracing.enabled }}\n": template: loki/templates/single-binary/statefulset.yaml:37:6: executing "loki/templates/single-binary/statefulset.yaml" at <include "loki.commonStorageConfig" .>: error calling include: template: loki/templates/_helpers.tpl:228:19: executing "loki.commonStorageConfig" at <$.Values.loki.storage.bucketNames.chunks>: nil pointer evaluating interface {}.chunks

@AllexVeldman
Contributor

AllexVeldman commented Apr 10, 2024

For the loki helm chart: #12067 changed the port name for the gateway service from http to http-metrics which caused it to be picked up by the loki ServiceMonitor.

The gateway responds with a 404 on the /metrics path causing the prometheus target to fail.

@tete17

tete17 commented Apr 10, 2024

For the loki chart we unfortunately had to face some downtime.

This change 79b876b#diff-89f4fd98934eb0f277b921d45e4c223e168490c44604e454a2192d28dab1c3e2R4 forced the recreation of all the gateway resources: Deployment, Service, PodDisruptionBudget and, most critically, Ingress.

This is problematic for 2 reasons:

  • The deployment and service will immediately get traffic even though the pods are literally starting and most likely still in the ImagePull phase.
  • Replacing an ingress with the exact same hostname and path combination is problematic if you are running nginx ingress, as is the case for a really good chunk of the community. This is in part because of its strict validating webhook that doesn't allow duplicate ingresses of that type. The only solution was to delete the ingress and quickly sync it, causing some downtime. Unfortunately promtail wasn't able to recover and send the accumulated log data, because it doesn't retry on the 404 errors that happen while the ingress is deleted.

@MartinEmrich

Two issues so far with my existing Helm values:

loki.schema_config apparently became loki.schemaConfig. After renaming the object, that part was accepted (also by the 5.x helm chart).

Then the loki ConfigMap failed to be generated. The config.yaml value is literally Error: 'error converting YAML to JSON: yaml: line 70: mapping values are not allowed in this context'.

Trying to render the helm chart locally with "helm --debug template" results in

Error: template: loki/templates/write/statefulset-write.yaml:46:28: executing "loki/templates/write/statefulset-write.yaml" at <include (print .Template.BasePath "/config.yaml") .>: error calling include: template: loki/templates/config.yaml:19:7: executing "loki/templates/config.ya
ml" at <include "loki.calculatedConfig" .>: error calling include: template: loki/templates/_helpers.tpl:461:24: executing "loki.calculatedConfig" at <tpl .Values.loki.config .>: error calling tpl: error during tpl function execution for "
<<<< template removed for brevity >>>
": template: loki/templates/write/statefulset-write.yaml:37:6: executing "loki/templates/write/statefulset-write.yaml" at <include "loki.commonStorageConfig" .>: error calling include: template: loki/templates/_helpers.tpl:228:19: executing "loki.commonStorageConfig" at <$.Values.loki.storage.bucketNames.chunks>: nil pointer evaluating interface {}.chunks

I am trying to understand the nested template structure in the Helm chart to figure out what is happening.

A short helm chart values set (which worked fine with 5.x) triggering the phenomenon:

values.yaml
serviceAccount:
  create: false
  name: loki
test:
  enabled: false
monitoring:
  dashboards:
    enable: false
  lokiCanary:
    enabled: false
  selfMonitoring:
    enabled: false
    grafanaAgent:
      installOperator: false
loki:
  auth_enabled: false
  limits_config:
    max_streams_per_user: 10000
    max_global_streams_per_user: 10000
  storage_config:
    aws:
      s3: s3://eu-central-1
      bucketnames: my-bucket-name
  schemaConfig:
    configs:
      - from: 2024-01-19
        store: tsdb
        object_store: aws
        schema: v11
        index:
          prefix: "some-prefix_"
          period: 24h
  query_range:
    split_queries_by_interval: 0
  query_scheduler:
    max_outstanding_requests_per_tenant: 8192
  analytics:
    reporting_enabled: false
  compactor:
    shared_store: s3
gateway:
  replicas: 3
read:
  replicas: 3
write:
  replicas: 3
compactor:
  enable: true
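
For reference, based on the nil pointer on $.Values.loki.storage.bucketNames.chunks and the compactor change mentioned earlier in this thread, the v6 values would presumably need something along these lines (the bucket names below are illustrative placeholders, not from the original values):

loki:
  storage:
    type: s3
    bucketNames:
      chunks: my-bucket-name
      ruler: my-bucket-name
      admin: my-bucket-name
  compactor:
    # shared_store was removed in Loki 3.0; the delete store is configured explicitly now
    delete_request_store: s3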

@slim-bean
Collaborator Author

slim-bean commented Apr 10, 2024

hahaha
image

I thought I recognized that github picture!!!

I'm looking forward to close issue #91 (from 2018) when the experimental bloom filters are stable. 😄

2018!!!

Thanks for the great feedback on the docs, very helpful.

One note regarding SSD mode: honestly, the original idea of SSD was to make Loki a lot more friendly outside of k8s environments. The problem we found ourselves in, though, is that we have had no good ability to support customers attempting to run Loki this way, and as such we largely require folks to use Kubernetes for our commercial offering. This is why the docs are so k8s-specific.

It continues to be a struggle to build an open source project which is extremely flexible for folks to run in many ways, but also a product that we have to provide support for.

I'd love to know though how many folks are successfully running SSD mode outside of kubernetes. I'm still a bit bullish on the idea but over time I kind of feel like it hasn't played out as well as we hoped.

@slim-bean
Collaborator Author

For the loki helm chart: #12067 changed the port name for the gateway service from http to http-metrics which caused it to be picked up by the loki ServiceMonitor.

The gateway responds with a 404 on the /metrics path causing the prometheus target to fail.

oh interesting, we'll take a look at this, not sure what happened here, thanks!

@slim-bean
Collaborator Author

@tete17 I created a new issue for what you found #12554

Thank you for reporting, sorry for the troubles :(

@slim-bean
Collaborator Author

@MartinEmrich thank you, I will update the upgrade guide around schemaConfig, sorry about that. And thank you for the sample test values file! very helpful!

@MarcBrendel

MarcBrendel commented Apr 10, 2024

Congratulations on the release! 🎉 :) Is there any way to verify that bloom filters are active and working? I cannot seem to find any metrics or log entries that might give a hint. There are also no bloom services listed on the /services endpoint:

curl -s -k https://localhost:3100/services
ruler => Running
compactor => Running
store => Running
ingester-querier => Running
query-scheduler => Running
ingester => Running
query-frontend => Running
distributor => Running
server => Running
ring => Running
query-frontend-tripperware => Running
analytics => Running
query-scheduler-ring => Running
querier => Running
cache-generation-loader => Running
memberlist-kv => Running

I tried deploying it on a single instance in monolithic mode via Docker by adding the following options:

limits_config:
  bloom_gateway_enable_filtering: true
  bloom_compactor_enable_compaction: true

bloom_compactor:
  enabled: true
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

bloom_gateway:
  enabled: true
  client:
    addresses: dns+localhost.localdomain:9095

Edit: My bad, it seems that the bloom components are not available when using -target=all. It needs to be set to -target=all,bloom-compactor,bloom-gateway,bloom-store for a monolithic deployment I guess? See https://grafana.com/docs/loki/latest/get-started/components/#loki-components.

@dakr0013

Not sure if this is intended, but in _helpers.tpl there is an if check which might be wrong:

{{- if "loki.deployment.isDistributed "}}

A similar check is done here, which looks like this:

{{- $isDistributed := eq (include "loki.deployment.isDistributed" .) "true" -}}
{{- if $isDistributed -}}

This causes the if check to always be true and thus frontend.tail_proxy_url to be set in the Loki config. But the configured tail_proxy_url does not point to an existing service (I used SSD deployment mode). Not sure if this has any impact.

@coro
Contributor

coro commented Apr 11, 2024

We encountered a bug in the rendering of the Loki config with the helm chart v6.0.0 that may be similar to what @MartinEmrich encountered above. These simple values will cause the rendering to fail:

loki:
  query_range:
    parallelise_shardable_queries: false
  useTestSchema: true

This causes .Values.loki.config to look like (note the extra indent):

query_range:
  align_queries_with_step: true
    parallelise_shardable_queries: false
  cache_results: true

I believe anything under loki.query_range is being misindented here.
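
For comparison, the intended rendering would presumably keep all three keys at the same level:

query_range:
  align_queries_with_step: true
  parallelise_shardable_queries: false
  cache_results: true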

EDIT: I've added a PR to solve the above but in general we've had trouble upgrading to Helm chart v6 as there are now two fields which are seemingly necessary where before they were not, and they're not listed in the upgrade guide:

  • As of 6.0: we must provide a schemaConfig whereas in v5 we could use a suggested default without needing a useTestSchema flag.
  • As of 6.1: we must provide storage defaults otherwise templating fails (see this comment).

In general I would personally prefer that I can always install a Helm chart with no values and get some kind of sensible default, even if only for testing out the chart. Later, when I want to go production-ready, I can tweak those parameters to something more appropriate.

@maksym-iv

maksym-iv commented Apr 11, 2024

On the upgrade attempt using Simple Scalable mode, scheduler_address is empty in the rendered config, whilst it was present before the upgrade:

    frontend:
      scheduler_address: ""
      tail_proxy_url: http://loki-querier.grafana.svc.gke-main-a.us-east1:3100
    frontend_worker:
      scheduler_address: ""

It looks like schedulerAddress is only defined for the Distributed mode; note that the query-scheduler-discovery service is still created.

@slim-bean
Collaborator Author

We encountered a bug in the rendering of the Loki config with the helm chart v6.0.0 that may be similar to what @MartinEmrich encountered above. These simple values will cause the rendering to fail:

...

Very helpful feedback, thank you!

The schemaConfig name change was an oversight on my part and I need to get it into the upgrade guide, apologies.

The forced requirement for a schemaConfig is an interesting problem, if we default it in the chart then people end up using it which means we can't change it without breaking their clusters because schemas can't be changed, only new ones added. I do supposed we could just add new ones but that feels a bit like forcing an upgrade on someone... I'm not sure, this is a hard problem that I don't have great answers to.

We decided that this time around we'd force people to define a schema, and provide the test schema config value that should be spit out in an error message if you want to just try the chart with data you plan on throwing away. It does seem like we need to update this error or that flag to also provide values for the storage defaults however.

@kunalmehta-eve

kunalmehta-eve commented May 13, 2024

Explore-logs-2024-05-13 16_56_45.txt

I am getting performance issues after upgrading Loki to 3.0.0 using Helm chart 6.0.0. Querying logs takes ages. I only upgraded the app version for now; the schema is still v12.

Please suggest.

loki:
  auth_enabled: false
  analytics:
    reporting_enabled: false
  storage:
    type: azure
    azure:
      accountName: ${azurerm_storage_account.loki.name}
    bucketNames:
      chunks: ${azurerm_storage_container.loki_chunks.name}
      ruler: ${azurerm_storage_container.loki_ruler.name}
      admin: ${azurerm_storage_container.loki_admin.name}
  ingester:
    max_chunk_age: 24h
  structuredConfig:
    query_range:
      # By default, Loki parallelises queries that can be split/sharded. This was a controversial change in v2.4.2
      # and causes the number of active connections to rise significantly. We don't really need this feature for our
      # current scale, so we therefore disable it. See https://github.com/grafana/loki/pull/5077/files#r781448453
      parallelise_shardable_queries: false
    server:
      # Without increasing the write timeout, long-running queries fail with a 502 Bad Gateway error
      # due to a i/o timeout in the read pod.
      http_server_write_timeout: 5m
  limits_config:
    allow_structured_metadata: false
  schemaConfig:
    configs:
      - from: 2022-01-11
        store: boltdb-shipper
        object_store: azure
        schema: v12
        index:
          prefix: loki_index_
          period: 24h

lokiCanary:
  resources:
    requests:
      cpu: "0.01"
      memory: 64Mi
    limits:
      cpu: "0.05"
      memory: 128Mi

monitoring:
  enabled: true
  selfMonitoring:
    enabled: true
    grafanaAgent:
      installOperator: true

write:
  replicas: 3
  resources:
    requests:
      cpu: "0.2"
      memory: 4Gi
    limits:
      cpu: "1"
      memory: 4Gi

read:
  replicas: 3
  resources:
    requests:
      cpu: "0.2"
      memory: 3Gi
    limits:
      # Allow read pods to spike to support larger queries. We assume that such large queries are rare
      # and thus don't impact the cluster significantly.
      cpu: "3"
      memory: 8Gi

backend:
  replicas: 3
  resources:
    requests:
      cpu: "0.1"
      memory: 512Mi
    limits:
      cpu: "0.2"
      memory: 1Gi

Please check the attached logs and let me know what needs to be fixed here. @drew-viles @slim-bean

@slim-bean
Collaborator Author

slim-bean commented May 14, 2024

Hey folks sorry for being slow to respond to some of these issues. Appreciate your feedback and help finding and fixing problems!

I've tried to make sure there are at least issues open for things folks are struggling with:

If I've missed anything please let me know!

@slim-bean
Collaborator Author

IMPORTANT: Can somebody shed a bit of light on why the monitoring part is being deprecated? When it is removed, how do we know if Loki is working, and working optimally? I saw it was swapped to another chart, but that chart provides much more detail than is needed if you only need Loki. I'm not sure this is a good decision 😓

A couple folks have commented on this, there are a few reasons we are removing the monitoring section from the Loki chart:

  • It does not play nicely with the charts for our other databases like Mimir/Tempo, which also installed similar sections, causing issues around multiple installations of the agent operator.
  • The agent operator itself is deprecated.
  • We found there is really not a good one-size-fits-all approach to monitoring. For example, this chart used to take the approach of using the Prometheus and agent operators to manage custom resources via things like PodLogs and PodMonitors. While some folks already use this method, many don't, and we can't easily also support helping folks install and operate in this fashion.
  • Decoupling all of our Helm charts to be installations of just the database simplifies them and makes them easier to maintain.
  • Providing a separate monitoring chart allows us to provide an approach for monitoring all of our databases (still a WIP).

I apologize, as I know for some folks this is disruptive and not making your lives any better, but it's already extremely time consuming to maintain this chart, so simplifying it is a huge advantage for us.

The new chart should come with options for just installing Grafana and dashboards as well as various methods for monitoring, although it's not where we'd like it to be yet (unfortunately there isn't a single binary or SSD version of Mimir or Tempo, so their installs are quite large).

I would also recommend folks try out using the monitoring chart with the free tier of Grafana Cloud as the backend; we can provision the dashboards you need via integrations, and this gives you an external mechanism for monitoring your clusters at no charge and hopefully makes everyone's lives easier.

@dragoangel

dragoangel commented May 14, 2024

A couple folks have commented on this, there are a few reasons we are removing the monitoring section from the Loki chart:

...

Hi @slim-bean, first of all thank you for the feedback!

I'm currently using the monitoring part without any Grafana operator, with a Loki canary that is scraped by promtail and then sent to Loki. I don't see a reason in general for dropping the monitoring section, as the only things it should do are deploy the Loki canary, service monitors and Grafana dashboards. I don't think such a stack will in any way confuse people or create issues in the parent Helm chart you mentioned. If that's not the case, then I would have to use my own Helm chart with all these resources created by myself and the Loki chart as a dependency, which is not the best option.

Also, as I understand it, promtail will become obsolete as well, which is not the best option in my view. A quick look at Alloy gives me the feeling that its config structure is much more complicated compared to promtail, and it lacks a web interface to inspect targets, so label configuration has to be guessed instead of checked. Also, having a DaemonSet responsible for multiple things that go unused, along with a bunch of metrics that are not needed, seems like overhead.

@YevhenLodovyi

Hi, when can we expect the next 3.x.x release? I am interested in a couple of bugfixes and do not want to use an untagged image.

@kunalmehta-eve

kunalmehta-eve commented May 15, 2024

@slim-bean

We are getting multiple errors like these:
caller=scheduler_processor.go:174 component=querier org_id=fake msg="error notifying scheduler about finished query" err=EOF

caller=retry.go:95 org_id=fake msg="error processing request" try=0 query="{app="loki"} | logfmt | level="warn" or level="error"" query_hash=901594686 start=2024-05-14T13:30:00Z end=2024-05-14T13:45:00Z start_delta=17h25m33.153641627s end_delta=17h10m33.153641727s length=15m0s retry_in=329.878123ms err="context canceled"

Can you please help?

@PlayMTL

PlayMTL commented May 15, 2024

Hey @slim-bean,

Can you please also have a look at my issue with the different S3 buckets and different access & secret keys? Not completely sure, but I think @JBodkin-Amphora has my issue as well.

Thank you :)

@kunalmehta-eve

level=error ts=2024-05-16T09:04:08.131652605Z caller=flush.go:152 component=ingester org_id=fake msg="failed to flush" err="failed to flush chunks: store put chunk: -> github.com/Azure/azure-storage-blob-go/azblob.newStorageError, /src/loki/vendor/github.com/Azure/azure-storage-blob-go/azblob/zc_storage_error.go:42\n===== RESPONSE ERROR (ServiceCode=InvalidBlockList) =====\nDescription=The specified block list is invalid.\nRequestId:13f410b4-901e-007f-4770-a7b251000000\nTime:2024-05-16T09:04:08.0437568Z, Details: \n Code: InvalidBlockList\n PUT https://testinglokiprd.blob.core.windows.net/chunks/fake/a663ab7e36edbebb/18f807ba885-18f80897cbf-1d839c2?comp=blocklist&timeout=31\n Authorization: REDACTED\n Content-Length: [128]\n Content-Type: [application/xml]\n User-Agent: [Azure-Storage/0.14 (go1.21.9; linux)]\n X-Ms-Blob-Cache-Control: []\n X-Ms-Blob-Content-Disposition: []\n X-Ms-Blob-Content-Encoding: []\n X-Ms-Blob-Content-Language: []\n X-Ms-Blob-Content-Type: []\n X-Ms-Client-Request-Id: [f5420ecf-70fc-4784-75ea-1220f12b3dd0]\n X-Ms-Date: [Thu, 16 May 2024 09:04:08 GMT]\n X-Ms-Version: [2020-04-08]\n --------------------------------------------------------------------------------\n RESPONSE Status: 400 The specified block list is invalid.\n Content-Length: [221]\n Content-Type: [application/xml]\n Date: [Thu, 16 May 2024 09:04:08 GMT]\n Server: [Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0]\n X-Ms-Client-Request-Id: [f5420ecf-70fc-4784-75ea-1220f12b3dd0]\n X-Ms-Error-Code: [InvalidBlockList]\n X-Ms-Request-Id: [13f410b4-901e-007f-4770-a7b251000000]\n X-Ms-Version: [2020-04-08]\n\n\n, num_chunks: 1, labels: {app="parquet-2grvk", container="main", filename="/var/log/pods/argo-workflows_parquet-2grvk-parquet-29307887_8b782254-47c5-4449-b4c8-0de438c02206/main/0.log", job="argo-workflows/parquet-2grvk", namespace="argo-workflows", node_name="aks-defaultgreen-11165910-vmss0000oy", pod="parquet-2grvk-parquet-29307887", stream="stderr"}"

What does this error mean? I started getting it after upgrading to Loki 3.0.0.

@slim-bean @drew-viles

@drew-viles

Hi @kunalmehta-eve - I'm probably not the right person to ask about this as I'm a consumer of Loki, not one of the maintainers. All I can recommend is checking the block list that it's flagging as invalid and comparing it to the requirements as defined in the 3.0 docs.

@huozhirui

huozhirui commented May 23, 2024

Is it necessary to add TSDB storage for Loki 3.x? Can't block storage be used to store indexes like in v2.x?
Is this configuration acceptable?

image

@QuentinBisson
Contributor

Is the bloom gateway supposed to work in simple scalable mode? Documentation on how to enable it is non-existent, both at https://grafana.com/docs/loki/latest/get-started/deployment-modes/ and in the Helm chart. Also, the current bloom gateway and compactor charts are made to work only with the distributed mode of Loki:

"expr": "sum(rate(loki_bloom_gateway_filtered_chunks_sum{job=\"$namespace/bloom-gateway\"}[$__rate_interval]))\n/\nsum(rate(loki_bloom_gateway_requested_chunks_sum{job=\"$namespace/bloom-gateway\"}[$__rate_interval]))",

@numa1985

Trying to update Helm chart 5.43.2 to 6.1.0, but I am getting:

...

@krimeshshah

Two issues so far with my existing Helm values:

...

Is this issue fixed? I am trying to migrate Loki to Helm chart version 6.X.X and I am getting the below error:

Error: template: logging-scalable/charts/loki/templates/write/statefulset-write.yaml:50:28: executing "logging-scalable/charts/loki/templates/write/statefulset-write.yaml" at <include (print .Template.BasePath "/config.yaml") .>: error calling include: template: logging-scalable/charts/loki/templates/config.yaml:19:7: executing "logging-scalable/charts/loki/templates/config.yaml" at <include "loki.calculatedConfig" .>: error calling include: template: logging-scalable/charts/loki/templates/_helpers.tpl:537:35: executing "loki.calculatedConfig" at <.Values.loki.config>: wrong type for value; expected string; got map[string]interface {

@JohanLindvall
Contributor

JohanLindvall commented Jun 4, 2024

We are seeing very high memory usage / memory leaks when ingesting logs with structured metadata. See https://community.grafana.com/t/memory-leaks-in-ingester-with-structured-metadata/123177 and #10994

Reported under #13123 and now fixed. Thanks :)

@zach-flaglerhealth

A couple folks have commented on this, there are a few reasons we are removing the monitoring section from the Loki chart:

...

The new chart should come with options for just installing Grafana and dashboards as well as various methods for monitoring, although it's not where we'd like it to be yet (unfortunately there isn't a single binary or SSD version of Mimir or Tempo, so their installs are quite large).

I would also recommend folks try out using the monitoring chart with the free tier of Grafana Cloud as the backend; we can provision the dashboards you need via integrations, and this gives you an external mechanism for monitoring your clusters at no charge and hopefully makes everyone's lives easier.

Thanks for the info, just trying to make sure I'm following.

It seems like a lot of your response is around the Grafana Agent Operator, and most of that configuration seems to be through the selfMonitoring: section of the values.yaml file. The serviceMonitor: section seems like fairly standard configuration I've seen in a number of Helm charts.

Looking at the meta-monitoring chart, it definitely seems configured to deploy its own entire stack of applications, which would seem to bypass any other metrics gathering that we might be doing on our own clusters ("no one size fits all"), with the goal being that logs and metrics from Loki, Mimir, and Tempo feed into a Loki and Mimir instance, which has a "turtles all the way down" feeling to it. It doesn't seem to have a serviceMonitor, other than a section configuring Loki that disables the serviceMonitor.

So is the intent that it's the entire monitoring: section that's being removed in favor of the meta chart? Or just the self-monitoring agent installation portion?

@dragoangel

dragoangel commented Jun 9, 2024

@zach-flaglerhealth I agree with you. If this is the case, I would end up writing my own Helm chart to ship my own service monitors and dashboards; not the best option. For me, using clouds for monitoring isn't an option, and migrating to Grafana Mimir instead of kube-prometheus-stack and Thanos just because of a couple of dashboards and monitors is not an option either.

I'm already using my own Helm chart that ships Loki and promtail with the needed configuration, where they both are set as dependencies. But I will someday have to move away from promtail as well :(

@krimeshshah

Hi Team,
How do I apply log retention if I want to use Loki in simple scalable mode? As per the Loki compactor template, it can only be deployed if I run Loki in distributed microservice mode: https://github.com/grafana/loki/blob/main/production/helm/loki/templates/compactor/statefulset-compactor.yaml#L1
Also, the table manager is going to be deprecated. Can someone suggest how to configure log retention for Loki 3.0 in simple scalable mode?
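
(For reference, retention in Loki 3.x is driven by the compactor rather than the table manager. A minimal sketch of the relevant Loki config, assuming an S3 object store and an illustrative 30-day retention period; how this maps onto chart values depends on the deployment mode:)

compactor:
  # Enable retention enforcement in the compactor
  retention_enabled: true
  # Required in 3.0; replaces the old shared_store for delete requests
  delete_request_store: s3
limits_config:
  # Global retention period; per-tenant overrides are also possible
  retention_period: 30d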

@MartinEmrich

MartinEmrich commented Jun 20, 2024

Just doing another upgrade attempt on a less-important environment. I still have issues doing the schema upgrade/schema config.
I tried multiple variants of a schema config entry for the old/previous data, but whatever I try, Loki will not return any of the older data. My current WIP:

  - from: 2024-01-19 ### old logs, where config/prefix was ignored.
    store: tsdb
    object_store: aws
    schema: v11
    index:
      prefix: "loki_index_"
      period: 24h
  - from: 2024-06-20 ### today: transition  during upgrade
    store: tsdb
    object_store: aws
    schema: v11
    index:
      prefix: "myprefix_"
      period: 24h
  - from: 2024-06-21 ### tomorrow: upgrade to v13
    store: tsdb
    object_store: aws
    schema: v13
    index:
      prefix: "myprefix_"
      period: 24h
...

Again, the old 2.x version apparently ignored the schema index prefix; I found mostly "loki_index_*" folders in the S3 bucket.
So I am content with losing the logs from today, as there's now some mixture from the middle entry (actually using myprefix). New logs are currently received and are retrievable (i.e. the middle block works), and from tomorrow on, v13 shall be used.

But the logs from yesterday and beyond should be retrievable, unless something in the first block does not match reality. I see no errors in backend or reader logs.

How could I reconstruct the correct schemaConfig entries for yesterday and earlier from looking at my actual S3 bucket contents?

Update: I noticed that the new index folders contain *.tsdb.gz files (which I would expect with "store: tsdb"). The older index folders only contain a "compactor-XXXXXXXXXX.r.gz" file. What could that hint at?

@MartinEmrich

... After trying lots of combinations, it looks like Schema v12, boltdb-shipper and "loki_index_" prefix did the trick.
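
Based on that combination, the schema entry covering the pre-upgrade data presumably ends up looking roughly like this (the from date is taken from the earlier WIP example and is specific to this setup):

  - from: 2024-01-19
    store: boltdb-shipper
    object_store: aws
    schema: v12
    index:
      prefix: "loki_index_"
      period: 24h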

@ethanliuu

@slim-bean

We are getting multiple errors like these caller=scheduler_processor.go:174 component=querier org_id=fake msg="error notifying scheduler about finished query" err=EOF

caller=retry.go:95 org_id=fake msg="error processing request" try=0 query="{app="loki"} | logfmt | level="warn" or level="error"" query_hash=901594686 start=2024-05-14T13:30:00Z end=2024-05-14T13:45:00Z start_delta=17h25m33.153641627s end_delta=17h10m33.153641727s length=15m0s retry_in=329.878123ms err="context canceled"

can you please help ?

Hello, I have also encountered this error repeatedly. May I ask if your problem has been resolved?

@Kybeer

Kybeer commented Jul 11, 2024

So I should just be able to rename shared_store to delete_request_store and be good?

Seems to have worked for me

@blackliner

I've got to say, the upgrade to Helm chart v6 was a bad experience. This whole schemaConfig thing is really putting me off; I don't want to have to mess around with these things as part of an upgrade, and even in a greenfield scenario I would like it to just work. Best of all, the documentation is completely empty and thus useless: https://grafana.com/docs/loki/latest/configuration/#schema_config

@MartinEmrich

I have to agree. After many pains, lost log periods and some critical glances from colleagues, my/our Loki updates are all done and seem to work, so it's time for a conclusion.
Sorry to be direct and harsh, but this was the reality for me:

  • The changes to the Helm chart values schema were poorly (if at all) documented. Out of the blue, options stopped working because they were moved to other places or simply changed their lower/upper-case composition. Instead of failing with an error, the chart just used some default value instead.
  • The Helm chart itself (personal opinion) is highly overengineered: in some places it introduces a whole new configuration schema which is then rendered into the actual Loki configuration file, while in other places I have to put in the Loki configuration as-is. Examples: why is there both a storage: and a storage_config: object? Why do I now have to give bucketNames.chunks, ruler and admin, even if they are all the same and I don't even know the reason?
  • As @blackliner also experienced: the explicit schemaConfig with manually keeping track of dates is a pain, causing actual data loss when the configuration does not match up perfectly. Loki should keep track of this by itself (tracking schema changes in a file/object in the storage backend), use the most current schema for new chunks, and even offer an option to migrate older chunks.

@JBodkin-Amphora

I've been looking at migrating to this Helm chart from the loki-distributed Helm chart; however, it is still impossible. The biggest issue seems to be that the affinity and topologySpreadConstraints sections cannot be templated. For example:

ingester:
  topologySpreadConstraints: |
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          {{- include "loki.ingesterSelectorLabels" . | nindent 6 }}
    - maxSkew: 1
      minDomains: 3
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          {{- include "loki.ingesterSelectorLabels" . | nindent 6 }}
  affinity: |
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            topologyKey: topology.kubernetes.io/zone
            labelSelector:
              matchLabels:
                {{- include "loki.ingesterSelectorLabels" . | nindent 12 }}
      requiredDuringSchedulingIgnoredDuringExecution:
        - topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              {{- include "loki.ingesterSelectorLabels" . | nindent 10 }}

Some of the other issues that I've encountered are:

  • Required to define loki.storage.bucketNames: {} although I use loki.structuredConfig instead
  • Why do I need to define backend, read and write replicas when I've already specified deploymentMode: Distributed?
  • Why are test.enabled and lokiCanary.enabled defaulted to true? They don't appear in the Loki documentation as components and at a glance seem to be about testing, so I don't understand why you would need them in production.
  • Why can you not disable the bloom builder from being deployed? I noticed it defaults to 0 replicas. The bloom compactor and gateway seem to be experimental at the moment; should they be opt-in? The bloom builder isn't mentioned as a component.

@sourcehawk

sourcehawk commented Sep 9, 2024

When updating the storageConfig in the v6 Helm chart to the following, setting the date of the new tsdb store to one day in the future, as stated by the documentation, results in errors in the Loki pods (read, write, backend):

- from: "2022-01-11",
  index:
    period: "24h"
    prefix: "loki_index_"
  object_store: "s3"
  schema: "v12"
  store: "boltdb-shipper"
- from: "2024-09-10",
  index:
    prefix: "index_"
    period: "24h"
  object_store: "s3"
  schema: "v13"
  store: "tsdb"

Error:

schema v13 is required to store Structured Metadata and use native OTLP ingestion, your schema version is v12.

Set allow_structured_metadata: false in the limits_config section or set the command line argument -validation.allow-structured-metadata=false and restart Loki.

Then proceed to update to schema v13 or newer before re-enabling this config, search for 'Storage Schema' in the docs for the schema update procedure

CONFIG ERROR: tsdb index type is required to store Structured Metadata and use native OTLP ingestion, your index type is boltdb-shipper (defined in the store parameter of the schema_config). Set allow_structured_metadata: false in the limits_config section or set the command line argument -validation.allow-structured-metadata=false and restart Loki.
Then proceed to update the schema to use index type tsdb before re-enabling this config, search for 'Storage Schema' in the docs for the schema update procedure"

This error does not occur when I set the from date in the new entry to the current date, but then I am forced to lose logs for that day, and for some reason my loki datasource won't work anymore.

The error is clear by saying that I should disable allow_structured_metadata, but why isn't this just done automatically according to the storage schema I am using? Why do I have to add the storage configuration and then enable/disable this twice, once before and once after the correct date has been reached for my second storage entry? As a user I couldn't care less whether you store structured metadata or not, and frankly I have no idea what it means. All I know is that it breaks the upgrade process.

Also, will the new tsdb store work without setting allow_structured_metadata to true again?
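
(For reference, the override that the error message points at, needed while the boltdb-shipper/v12 period is still the active one, would look something like this in the chart values; a minimal sketch:)

loki:
  limits_config:
    # Required until the active schema period is v13 + tsdb
    allow_structured_metadata: false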
