
When using single store (tsdb, boltdb-shipper), indexes and chunks are shipped to object storage but queries are (partly) empty #10529

Open
Christoph-AK opened this issue Sep 11, 2023 · 48 comments
Labels
type/bug (Something is not working as expected), type/question

Comments

@Christoph-AK

For about a week now my Loki has had weird gaps in the reported data, and right now it doesn't return any data at all.

This is super annoying, especially because alerts are triggering when their conditions are evaluated against the data Loki wrongly reports.

The last time I had similar problems with missing data, the local storage of my instance was full. But that is not currently the case.

What else can lead to this behaviour? How can I fix this?

compose:

  loki:
    image: grafana/loki:latest
    container_name: loki
    restart: unless-stopped
    logging:
      driver: "json-file"
      options:
        max-size: "512m"
    volumes:
      - ./config/loki.yaml:/etc/config/loki.yaml
      - ./loki:/loki
    entrypoint:
      - /usr/bin/loki
      - -config.file=/etc/config/loki.yaml
    ports:
      - "4100:4100"
    labels:
      - com.centurylinklabs.watchtower.enable=true

loki config:

auth_enabled: false

server:
  http_listen_port: 4100
  http_server_read_timeout: 3m
  http_server_write_timeout: 3m
  grpc_server_max_recv_msg_size: 10485760
  grpc_server_max_send_msg_size: 10485760

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 10s
  chunk_idle_period: 5m
  wal:
    dir: "/loki/wal"
    flush_on_shutdown: true
    replay_memory_ceiling: "512MB"

schema_config:
  configs:
    # - store: boltdb
    #   object_store: filesystem
    #   schema: v11
    #   index:
    #     prefix: index_
    #     period: 24h
    - store: boltdb-shipper
      object_store: aws
      schema: v11
      index:
        prefix: aws_index_
        period: 24h

compactor:
  working_directory: /loki/boltdb-shipper-compactor
  shared_store: aws

storage_config:
  # boltdb:
  #   directory: /loki/index

  # filesystem:
  #   directory: /loki/chunks

  aws:
    bucketnames: ak-loki-log
    region: eu-central-1
    access_key_id: <id>
    secret_access_key: <key>
    sse_encryption: true
    endpoint: s3.eu-central-1.amazonaws.com

  boltdb_shipper:
    active_index_directory: /loki/index
    shared_store: s3
    cache_location: /loki/boltdb-cache

limits_config:
  enforce_metric_name: false
  split_queries_by_interval: 24h

frontend:
  compress_responses: true
  max_outstanding_per_tenant: 4096

querier:
  max_concurrent: 1024
@bumarcell

bumarcell commented Sep 11, 2023

We're also experiencing log gaps on the frontend (Loki v2.9.0) despite the logs actually being there. A restart of the loki-backend StatefulSet brings the logs back, though.
I wonder if that's the same issue..

@Christoph-AK
Author

Hey @bumarcell, I can confirm that a docker compose restart loki seems to fix the issue. The service was restarted when the new version was released, so it wasn't running for more than a few days. It's a bit worrying if this is recurring.

@bumarcell

I believe we've experienced this behavior at least twice so far

@Christoph-AK
Author

I'm tempted to just restart Loki daily via cron, but that shouldn't be needed in a stable application, so I would rather help find the root cause. Are there any additional debug flags I can run to get more useful data about the state of Loki if this happens again?

@bumarcell

My laziness is telling me to wait for the minor updates 😄

@bumarcell

which hopefully we triggered already 🤞

@BackKot

BackKot commented Sep 12, 2023

Hi all! Similar problem. I decided to roll back to 2.8.4

@isshwar

isshwar commented Sep 12, 2023

Hi all,
I have also rolled back to 2.8.4. Hoping a fix is released soon!

@Christoph-AK
Author

(screenshot) Urgh, it's happening again. Rolling back as well.

@01cj

01cj commented Sep 12, 2023

Same issue, rolled back to 2.8.4

@andrejshapal

Hello,
Same on GCP with boltdb-shipper.
Reverted to 2.8.4

@jammiemil

Also using GCP with boltdb-shipper.

Upgraded last night to 2.9, and since then my queries only return log lines from BEFORE the upgrade or logs that are still sitting on my ingesters.

Reverting to 2.8.4 has resolved the issue for now, and I can see all the logs ingested since last night's upgrade.

@chaudum
Contributor

chaudum commented Sep 12, 2023

We are investigating this issue, which may have been introduced with #9710 (still needs to be verified).

chaudum added the type/bug (Something is not working as expected) label Sep 12, 2023
@chaudum
Contributor

chaudum commented Sep 13, 2023

Could someone facing this issue post logs (from indexgateway or backend or all target) here, even if they seem to be irrelevant?
Also to verify, this happens both with boltdb-shipper and tsdb, in single binary mode and simple scalable deployment (SSD)?

@stevenbrookes

I'm also seeing this issue. Downgrading to 2.8.4 fixed it. I'm running tsdb in single binary mode. Happy to post logs if you can tell me how to find them.

@BackKot

BackKot commented Sep 13, 2023

Could someone facing this issue post logs (from indexgateway or backend or all target) here, even if they seem to be irrelevant? Also to verify, this happens both with boltdb-shipper and tsdb, in single binary mode and simple scalable deployment (SSD)?

tsdb + SSD.
Unfortunately there are no more logs :(

@jmichalek132
Contributor

Could someone facing this issue post logs (from indexgateway or backend or all target) here, even if they seem to be irrelevant? Also to verify, this happens both with boltdb-shipper and tsdb, in single binary mode and simple scalable deployment (SSD)?

We are hitting this too, running in microservices mode with tsdb.
The issue for us is quite visible in the logs volume: sometimes there are gaps, and sometimes it returns the full time range.
I can share traces from a query that returned all data and a query that returned only partial data.
I'll provide logs in a bit, but based on the traces I have a suspicion it might be caused by the experimental in-memory (FIFO) chunk cache (embedded-cache).
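If anyone wants to rule the embedded chunk cache out, this is roughly how I'd disable it for a test; a minimal sketch assuming the standard chunk_store_config / cache_config layout, not a confirmed fix:

chunk_store_config:
  chunk_cache_config:
    embedded_cache:
      # experimental in-memory (FIFO) chunk cache suspected above;
      # disabling it here is only meant to test that suspicion
      enabled: false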

@pingping95

@chaudum

Our Environment is TSDB + AWS S3

I finally rolled back our Loki from 2.9.0 to 2.8.4

@chaudum
Contributor

chaudum commented Sep 14, 2023

As a workaround, you can set the per-tenant setting query_ready_index_num_days (docs) to a value greater than 0 (default value), e.g. to 1.
Note that this will pre-fetch index data to your local cache directory.
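In YAML this looks roughly like the following (a minimal sketch, assuming you set it globally in limits_config rather than via per-tenant overrides):

limits_config:
  # pre-download at least one day of index files into the local cache
  # so queries do not depend on on-demand downloads from object storage
  query_ready_index_num_days: 1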

@michaelkebe

michaelkebe commented Sep 14, 2023

Also to verify, this happens both with boltdb-shipper and tsdb, in single binary mode and simple scalable deployment (SSD)?

Can confirm gaps in logs after upgrading to 2.9.0, using boltdb-shipper and single binary mode. Querying logs from roughly the last 2h is possible, but older logs are not available.

Restarting Loki filled the gaps again. Downgraded to 2.8.4 until a proper patch is provided.

I think the issue should be renamed, because it does not occur on S3 only.

@chaudum
Contributor

chaudum commented Sep 14, 2023

I think the issue should be renamed, because it does not occur on S3 only.

Agree. Will rename the issue.

chaudum changed the title from "Logs with S3 connection are stored fine, but queries are empty" to "When using single store (tsdb, boltdb-shipper), indexes and chunks are shipped to object storage but queries are (partly) empty" Sep 14, 2023
@chaudum
Contributor

chaudum commented Sep 14, 2023

Good news, Loki 2.9.1 has been released.

@Christoph-AK
Author

Will try it out immediately, thank you so much for looking into it this quickly!

@isshwar

isshwar commented Sep 15, 2023

Doesn't seem to fix the issue. I quickly rolled out 2.9.1 and again started seeing logs only for the last 30 mins. Is this working for anyone after the upgrade?

@michaelkebe

Doesn't seem to fix the issue. I quickly rolled out 2.9.1 and again started seeing logs only for the last 30 mins. Is this working for anyone after the upgrade?

How long was the 2.9.1 loki instance running?

@isshwar

isshwar commented Sep 15, 2023

Doesn't seem to fix the issue. I quickly rolled out 2.9.1 and again started seeing logs only for the last 30 mins. Is this working for anyone after the upgrade?

How long was the 2.9.1 loki instance running?

I let it run for an hour.

@michaelkebe

Running with 2.9.1 for 2 hours now. Until now all fine.

@jmichalek132
Contributor

Running with 2.9.1 (in our dev env) for roughly 2 hours as well, and it seems to be fine for now.

@isshwar

isshwar commented Sep 15, 2023

With version 2.9.1 I could see only logs for the last 30-45 mins (screenshot).
With version 2.8.4 I could see all the logs (screenshot).

The Loki config is the same across both installations; the only change is the version generated from the Helm chart.

Log line from the read pod on 2.9.1:
level=info ts=2023-09-15T14:35:26.301702656Z caller=metrics.go:159 component=frontend org_id=fake traceID=018970543e2ecedc latency=fast query="{instance=~\".+\"}" query_hash=251209868 query_type=limited range_type=range length=6h0m0s start_delta=6h0m0.481689492s end_delta=481.689775ms step=10s duration=277.991764ms status=200 limit=1000 returned_lines=0 throughput=2.5MB total_bytes=706kB total_bytes_structured_metadata=0B lines_per_second=5564 total_lines=1547 post_filter_lines=1547 total_entries=801 store_chunks_download_time=159.234202ms queue_time=14.271999ms splits=25 shards=0 cache_chunk_req=7 cache_chunk_hit=6 cache_chunk_bytes_stored=9239 cache_chunk_bytes_fetched=47812 cache_chunk_download_time=121.847µs cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=0 cache_stats_results_hit=0 cache_stats_results_download_time=0s cache_result_req=0 cache_result_hit=0 cache_result_download_time=0s

Log line from the read pod on 2.8.4:
level=info ts=2023-09-15T14:44:46.623703883Z caller=metrics.go:152 component=frontend org_id=fake latency=fast query="{instance=~\".+\"}" query_hash=251209868 query_type=limited range_type=range length=6h0m0s start_delta=6h0m0.276677975s end_delta=276.678729ms step=10s duration=89.106795ms status=200 limit=1000 returned_lines=0 throughput=10MB total_bytes=902kB lines_per_second=22130 total_lines=1972 total_entries=1000 store_chunks_download_time=1.835368ms queue_time=2.97ms splits=4 shards=4 cache_chunk_req=10 cache_chunk_hit=10 cache_chunk_bytes_stored=0 cache_chunk_bytes_fetched=77702 cache_chunk_download_time=75.44µs cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_result_req=0 cache_result_hit=0 cache_result_download_time=0s

I can provide config/logs if that helps with the analysis.

@bumarcell

total_bytes_structured_metadata=0B: this looks off 🤔

@jmichalek132
Contributor

jmichalek132 commented Sep 19, 2023

Running with 2.9.1 (in our dev env) for roughly 2 hours as well, and it seems to be fine for now.

Running okay since then, so we will upgrade production too.

@Alexsandr-Random

Alexsandr-Random commented Sep 21, 2023

@isshwar we have the same problem, but it is not an issue with total_bytes_structured_metadata=0B: when we downgraded to 2.8.4 that field disappeared even though we did not change anything in the config files.
Please share your config, because after downgrading from 2.9.1 to 2.8.4 we also did not see any historical log data.

Update: after downgrading, you cannot see any historical data for a few hours (we have a high amount of logs stored in S3, so it could take even a day). I was very surprised in the morning when I could finally retrieve data from S3 again.
So, for now (27.10.23) the stable Loki version is 2.8.4.

@ehlomarcus

I'm seeing a similar issue, and we are running 2.9.0, 2.9.1 and now 2.9.2. Still the same issue.

I can't see anything in the logs that indicates any kind of problem.

If I port-forward to the loki-write-1/2/3 pods and issue POST /ingester/shutdown with curl, results come back after each of the pods has flushed blocks and restarted.

We have two Loki clusters and they have different time periods where search results are "missing". One has a gap of about 2-3 hours and the other about ~12 hours. We are using the "grafana/loki" chart.

@isshwar

isshwar commented Oct 27, 2023

@Alexsandr-Random
Here is my config file. I have now upgraded from 2.8.4 to 2.9.2 and the logs are not showing again 😢 @ehlomarcus were you able to find any reason for this?

    auth_enabled: false
    common:
      compactor_address: 'loki-backend'
      path_prefix: /var/loki
      replication_factor: ${LOKI_REPLICATION_FACTOR}
      storage:
        s3:
          access_key_id: ${S3_ACCESS_KEY_ID}
          bucketnames: ${S3_BUCKET_NAME}-chunks
          endpoint: ${S3_ENDPOINT}
          http_config:
            insecure_skip_verify: true
          insecure: false
          region: ${S3_REGION}
          s3forcepathstyle: true
          secret_access_key: $${q}{S3_SECRET_ACCESS_KEY}
    compactor:
      compaction_interval: 30m
      delete_request_cancel_period: 30m
      retention_enabled: true
      shared_store: s3
      working_directory: /var/loki/retention
    frontend:
      max_outstanding_per_tenant: 4096
      scheduler_address: query-scheduler-discovery.${K8S_NAMESPACE}.svc.${K8S_CLUSTER_NAME}.local.:9095
    frontend_worker:
      scheduler_address: query-scheduler-discovery.${K8S_NAMESPACE}.svc.${K8S_CLUSTER_NAME}.local.:9095
    index_gateway:
      mode: ring
    limits_config:
      enforce_metric_name: false
      ingestion_burst_size_mb: 30
      ingestion_rate_mb: 20
      max_cache_freshness_per_query: 10m
      max_chunks_per_query: 6000000
      max_entries_limit_per_query: 10000
      max_query_parallelism: 256
      max_query_series: 2000
      max_streams_matchers_per_query: 10000
      per_stream_rate_limit: 20MB
      query_timeout: 300s
      reject_old_samples: false
      reject_old_samples_max_age: 168h
      retention_period: 744h
      split_queries_by_interval: 15m
    memberlist:
      join_members:
      - loki-memberlist
    querier:
      engine:
        timeout: 300s
      max_concurrent: 2048
    query_range:
      align_queries_with_step: true
    query_scheduler:
      max_outstanding_requests_per_tenant: 32768
    ruler:
      alertmanager_client:
        basic_auth_password: $${q}{ALERTMANAGER_PASSWORD}
        basic_auth_username: ${ALERTMANAGER_USERNAME}
      alertmanager_url: ${ALERTMANAGER_URL}
      enable_alertmanager_v2: true
      enable_sharding: true
      evaluation_interval: ${EVALUATION_INTERVAL}
      remote_write:
        clients:
          prometheus-0:
            basic_auth:
              password: $${q}{PROMETHEUS_PASSWORD}
              username: ${PROMETHEUS_USERNAME}
            name: prom-0
            url: https://prometheus.pageplace.de/api/v1/write
        enabled: true
      storage:
        local:
          directory: /var/ruler
        type: local
      wal:
        dir: /var/loki/ruler-wal
    runtime_config:
      file: /etc/loki/runtime-config/runtime-config.yaml
    schema_config:
      configs:
      - from: "2022-01-11"
        index:
          period: 24h
          prefix: loki_index_
        object_store: s3
        schema: v12
        store: boltdb-shipper
      - from: "2023-08-09"
        index:
          period: 24h
          prefix: loki_index_
        object_store: s3
        schema: v12
        store: tsdb
    server:
      grpc_listen_port: 9095
      grpc_server_max_recv_msg_size: 10485760
      grpc_server_max_send_msg_size: 10485760
      http_listen_port: 3100
      http_server_idle_timeout: 310s
      http_server_read_timeout: 310s
      http_server_write_timeout: 310s
    storage_config:
      boltdb_shipper:
        active_index_directory: /var/loki/boltdb/index
        cache_location: /var/loki/boltdb-cache
        shared_store: s3
      hedging:
        at: 250ms
        max_per_second: 20
        up_to: 3
      tsdb_shipper:
        active_index_directory: /var/loki/index
        cache_location: /var/loki/tsdb-cache
        shared_store: s3

@waney316

waney316 commented Dec 3, 2023

Oh @Alexsandr-Random, I get the same issue. When I specify a time range of 12 hours, I can only see logs from the last hour or two. I am using version 2.9.2, with tsdb + S3 as the backend storage.

(screenshot)

@ehlomarcus

My issue is resolved. It was due to a configuration error I made.

I had made these changes to the ingester:

ingester:
    chunk_idle_period: 24h
    max_chunk_age: 48h

What I had not realised was that the querier setting querier.query_ingesters_within: 3h (the default) limits the ability to search logs that have not yet been written to S3.
I could have increased querier.query_ingesters_within, but from a performance perspective it made more sense to decrease the ingester settings instead.
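For reference, the kind of alignment I mean looks roughly like this (values are only examples, not recommendations):

ingester:
  # keep chunks on the ingesters for no longer than the window the
  # querier is willing to look back at them
  chunk_idle_period: 2h
  max_chunk_age: 2h

querier:
  # default value; must cover the period in which chunks may still be
  # sitting unflushed on the ingesters
  query_ingesters_within: 3h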

@MitchIonascu

This issue is still present, several months in.

2.8.4 works fine; anything higher cuts history down to 2-3 hours maximum.

@peter-miroshnikov

2.9.0 experiences the same issue.
Docker Compose, local installation.

Downgrading to 2.8.4 fixed the problem.

@sudoexec

sudoexec commented Aug 9, 2024

I'm using Loki 3.1.0 with the same issue.
The size of chunks and index in S3 is increasing, but I can only see logs from the last 2 hours.

@Adam-Hsieh

I've used Loki 2.9.8 and 2.9.1 with the same issue.
The size of chunks and indexes in the filesystem is increasing, but I can only see logs from the last 2-3 hours.
Is there any update?

@shengjiangfeng

I am using 2.9.0, same issue for me. There seems to be no update...

@wfjsw

wfjsw commented Oct 29, 2024

UPDATE: Solved. See https://community.grafana.com/t/logs-are-gone-after-flushing-off-ingester/135458/7

I think I am observing similar behavior on Loki 3.2.1, where logs are gone from the dashboard after they are flushed off the ingester. If I set up

querier:
  query_store_only: true

I will get almost no results.

My config:

target: all,write
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: info
  grpc_server_max_concurrent_streams: 1000

common:
  instance_addr: 
  path_prefix: /var/lib/loki
  instance_interface_names:
    - eth0
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks
      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory
    instance_enable_ipv6: true

ingester: 
  chunk_encoding: zstd
  max_chunk_age: 6h
  chunk_idle_period: 3h
  chunk_target_size: 16777216

ingester_rf1:
  enabled: false

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

pattern_ingester:
  enabled: true
  metric_aggregation:
    enabled: true
    loki_address: localhost:3100

ruler:
  alertmanager_url: http://localhost:9093

frontend:
  encoding: protobuf

compactor:
  working_directory: /var/lib/loki/retention
  compaction_interval: 1h
  retention_enabled: false
  retention_delete_delay: 24h
  retention_delete_worker_count: 150  
  delete_request_store: filesystem

table_manager:
  retention_period: 365d

limits_config:
  retention_period: 365d
  retention_stream: []
  max_query_parallelism: 16
  discover_log_levels: false

(screenshots)

It can be clearly seen that old logs are gradually fading out.

@tirelibirefe

11.2024, v2.9.3: same problem.

@ticup

ticup commented Dec 10, 2024

12.2024, 3.3.0.
Same problem; didn't have it with 2.9.1.

@skl256

skl256 commented Dec 10, 2024

@tirelibirefe @ticup can you try 3.1.2? For me, there is no issue on 3.1.2.

@ticup

ticup commented Dec 10, 2024

@skl256 Thanks for the reply. I'm running it through Helm; the latest version there seems to be 3.3.0.
Is there any way I can upgrade to 3.3.2? I've tried upgrading with helm upgrade --version 3.3.2, but that gives me:

Error: UPGRADE FAILED: resource mapping not found for name: "loki" namespace: "monitoring" from "": no matches for kind "PodLogs" in version "monitoring.grafana.com/v1alpha1"
ensure CRDs are installed first

Something other people have experienced as well #13409

@skl256

skl256 commented Dec 10, 2024

I'm sorry, I read your text incorrectly; I thought you had version 2.9.1.
For me, the defect only appeared on 2.9.x versions, and did not appear on 2.8.x and 3.1.x.

@ticup

ticup commented Dec 10, 2024

Wow, I actually found the issue. The problem was that the chunks-cache pod was not being scheduled anymore.
Since 3.x (or some other point after 2.9) the chart creates a chunks cache with 16 GB of memory. That's a bit much for my setup, so it couldn't be scheduled.
When I disabled the chunksCache with (Helm chart):

chunksCache:
  enabled: false

the disappearing-logs issue is actually fixed!
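An alternative to disabling it completely might be to shrink the cache so the pod can actually be scheduled. I believe the chart exposes an allocatedMemory value (in MB) for this, though I haven't verified the exact key:

chunksCache:
  enabled: true
  # assumption: allocatedMemory controls the memcached memory request;
  # lower it so the chunks-cache pod fits on a node
  allocatedMemory: 4096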
