
Kafka/zookeeper fatal error when disk runs out #3133

Closed
sposs opened this issue Jun 17, 2024 · 7 comments


sposs commented Jun 17, 2024

Self-Hosted Version

24.500

CPU Architecture

x86_64

Docker Version

26.1.4

Docker Compose Version

2.27.1

Steps to Reproduce

Install self-hosted. Run out of disk space. Kafka/Zookeeper will fail and cannot be recovered (see logs); my installation is doomed.

Expected Result

The service should not break to the point that it cannot be recovered. Maybe it should check the disk and kill itself. I'd rather lose a bunch of transactions than lose everything.

Actual Result

===> Launching kafka ... 
[2024-06-17 04:55:46,504] INFO Registered kafka:type=kafka.Log4jController MBean (kafka.utils.Log4jControllerRegistration$)
[2024-06-17 04:55:47,568] INFO Starting the log cleaner (kafka.log.LogCleaner)
[2024-06-17 04:55:47,915] INFO Updated connection-accept-rate max connection creation rate to 2147483647 (kafka.network.ConnectionQuotas)
[2024-06-17 04:55:47,936] INFO [SocketServer listenerType=ZK_BROKER, nodeId=1001] Created data-plane acceptor and processors for endpoint : ListenerName(PLAINTEXT) (kafka.network.SocketServer)
[2024-06-17 04:55:48,020] INFO Creating /brokers/ids/1001 (is it secure? false) (kafka.zk.KafkaZkClient)
[2024-06-17 04:55:48,033] INFO Stat of the created znode at /brokers/ids/1001 is: 1478,1478,1718600148028,1718600148028,1,0,0,72130214439944228,194,0,1478
 (kafka.zk.KafkaZkClient)
[2024-06-17 04:55:48,034] INFO Registered broker 1001 at path /brokers/ids/1001 with addresses: PLAINTEXT://kafka:9092, czxid (broker epoch): 1478 (kafka.zk.KafkaZkClient)
[2024-06-17 04:55:48,242] INFO [/config/changes-event-process-thread]: Starting (kafka.common.ZkNodeChangeNotificationListener$ChangeEventProcessThread)
[2024-06-17 04:55:48,259] WARN [Controller id=1001, targetBrokerId=1001] Connection to node 1001 (kafka/172.19.0.13:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2024-06-17 04:55:48,260] WARN [RequestSendThread controllerId=1001] Controller 1001's connection to broker kafka:9092 (id: 1001 rack: null) was unsuccessful (kafka.controller.RequestSendThread)
java.io.IOException: Connection to kafka:9092 (id: 1001 rack: null) failed.
	at org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:70)
	at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:298)
	at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:251)
	at org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:130)
[2024-06-17 04:55:48,341] INFO [SocketServer listenerType=ZK_BROKER, nodeId=1001] Enabling request processing. (kafka.network.SocketServer)
[2024-06-17 04:55:48,344] INFO Awaiting socket connections on 0.0.0.0:9092. (kafka.network.DataPlaneAcceptor)
[2024-06-17 04:56:20,444] ERROR Error while appending records to ingest-transactions-0 in dir /var/lib/kafka/data (org.apache.kafka.storage.internals.log.LogDirFailureChannel)
java.io.IOException: No space left on device
	at java.base/sun.nio.ch.FileDispatcherImpl.write0(Native Method)
	at java.base/sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:62)
	at java.base/sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:113)
	at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:79)
	at java.base/sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:280)
	at org.apache.kafka.common.record.MemoryRecords.writeFullyTo(MemoryRecords.java:90)
	at org.apache.kafka.common.record.FileRecords.append(FileRecords.java:188)
	at kafka.log.LogSegment.append(LogSegment.scala:160)
	at kafka.log.LocalLog.append(LocalLog.scala:439)
	at kafka.log.UnifiedLog.append(UnifiedLog.scala:911)
	at kafka.log.UnifiedLog.appendAsLeader(UnifiedLog.scala:719)
	at kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:1313)
	at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:1301)
	at kafka.server.ReplicaManager.$anonfun$appendToLocalLog$6(ReplicaManager.scala:1277)
	at scala.collection.StrictOptimizedMapOps.map(StrictOptimizedMapOps.scala:28)
	at scala.collection.StrictOptimizedMapOps.map$(StrictOptimizedMapOps.scala:27)
	at scala.collection.mutable.HashMap.map(HashMap.scala:35)
	at kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:1265)
	at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:868)
	at kafka.server.KafkaApis.handleProduceRequest(KafkaApis.scala:686)
	at kafka.server.KafkaApis.handle(KafkaApis.scala:180)
	at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:153)
	at java.base/java.lang.Thread.run(Thread.java:829)
[2024-06-17 04:56:20,445] WARN [ReplicaManager broker=1001] Stopping serving replicas in dir /var/lib/kafka/data (kafka.server.ReplicaManager)
[2024-06-17 04:56:20,464] WARN [ReplicaManager broker=1001] Broker 1001 stopped fetcher for partitions snuba-queries-0,outcomes-0,scheduled-subscriptions-transactions-0,events-0,cdc-0,profiles-call-tree-0,snuba-generic-metrics-sets-commit-log-0,__consumer_offsets-0,scheduled-subscriptions-events-0,outcomes-billing-0,ingest-performance-metrics-0,events-subscription-results-0,snuba-dead-letter-generic-events-0,transactions-0,snuba-dead-letter-replays-0,processed-profiles-0,snuba-dead-letter-metrics-0,snuba-attribution-0,scheduled-subscriptions-generic-metrics-distributions-0,snuba-generic-metrics-counters-commit-log-0,ingest-events-0,metrics-subscription-results-0,snuba-generic-metrics-gauges-commit-log-0,profiles-0,scheduled-subscriptions-generic-metrics-counters-0,scheduled-subscriptions-generic-metrics-sets-0,scheduled-subscriptions-generic-metrics-gauges-0,generic-metrics-subscription-results-0,snuba-transactions-commit-log-0,snuba-spans-0,ingest-replay-events-0,ingest-sessions-0,ingest-transactions-0,ingest-attachments-0,snuba-metrics-0,monitors-clock-tick-0,snuba-metrics-summaries-0,snuba-dead-letter-group-attributes-0,shared-resources-usage-0,ingest-monitors-0,ingest-occurrences-0,transactions-subscription-results-0,generic-events-0,snuba-dead-letter-generic-metrics-0,snuba-metrics-commit-log-0,ingest-metrics-0,group-attributes-0,snuba-generic-metrics-0,event-replacements-0,snuba-dead-letter-querylog-0,snuba-commit-log-0,snuba-generic-metrics-distributions-commit-log-0,ingest-replay-recordings-0,snuba-generic-events-commit-log-0,scheduled-subscriptions-metrics-0 and stopped moving logs for partitions  because they are in the failed log directory /var/lib/kafka/data. (kafka.server.ReplicaManager)
[2024-06-17 04:56:20,464] WARN Stopping serving logs in dir /var/lib/kafka/data (kafka.log.LogManager)
[2024-06-17 04:56:20,466] ERROR Shutdown broker because all log dirs in /var/lib/kafka/data have failed (kafka.log.LogManager)

And Zookeeper's logs:

Using log4j config /etc/kafka/log4j.properties
===> User
uid=1000(appuser) gid=1000(appuser) groups=1000(appuser)
===> Configuring ...
Running in Zookeeper mode...
===> Running preflight checks ... 
===> Check if /var/lib/kafka/data is writable ...
===> Check if Zookeeper is healthy ...
[2024-06-17 06:00:49,813] ERROR Unable to resolve address: zookeeper:2181 (org.apache.zookeeper.client.StaticHostProvider)
java.net.UnknownHostException: zookeeper: Name or service not known
	at java.base/java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
	at java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:930)
	at java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1543)
	at java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)
	at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1533)
	at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1386)
	at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1307)
	at org.apache.zookeeper.client.StaticHostProvider$1.getAllByName(StaticHostProvider.java:88)
	at org.apache.zookeeper.client.StaticHostProvider.resolve(StaticHostProvider.java:141)
	at org.apache.zookeeper.client.StaticHostProvider.next(StaticHostProvider.java:368)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1204)
[2024-06-17 06:00:49,818] WARN Session 0x0 for server zookeeper:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException. (org.apache.zookeeper.ClientCnxn)

Shutting down and restarting fails with dependency failed to start: container sentry-self-hosted-zookeeper-1 is unhealthy
Reinstalling fails with

dependency failed to start: container sentry-self-hosted-zookeeper-1 is unhealthy
Error in install/bootstrap-snuba.sh:3.
'$dcr snuba-api bootstrap --no-migrate --force' exited with status 1
-> ./install.sh:main:36
--> install/bootstrap-snuba.sh:source:3

Tried to follow the troubleshooting guide

sentry@workhorse:~/self-hosted$ docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --list
[+] Creating 1/0
 ✔ Container sentry-self-hosted-zookeeper-1  Created    0.0s
[+] Running 1/1
 ✔ Container sentry-self-hosted-zookeeper-1  Started    0.4s
dependency failed to start: container sentry-self-hosted-zookeeper-1 is unhealthy

Tried the nuclear option

sentry@workhorse:~/self-hosted$ docker compose down --volumes
[+] Running 13/13
 ✔ Container sentry-self-hosted-kafka-1             Removed    0.0s
 ✔ Container sentry-self-hosted-clickhouse-1        Removed    0.0s
 ✔ Container sentry-self-hosted-redis-1             Removed    0.0s
 ✔ Container sentry-self-hosted-zookeeper-1         Removed    0.1s
 ✔ Volume sentry-self-hosted_sentry-clickhouse-log  Removed    0.0s
 ✔ Volume sentry-self-hosted_sentry-vroom           Removed    0.4s
 ✔ Volume sentry-self-hosted_sentry-secrets         Removed    0.0s
 ✔ Volume sentry-self-hosted_sentry-kafka-log       Removed    0.4s
 ✔ Volume sentry-self-hosted_sentry-smtp            Removed    0.4s
 ✔ Volume sentry-self-hosted_sentry-smtp-log        Removed    0.4s
 ✔ Volume sentry-self-hosted_sentry-nginx-cache     Removed    0.4s
 ✔ Volume sentry-self-hosted_sentry-zookeeper-log   Removed    0.4s
 ✔ Network sentry-self-hosted_default               Removed    0.1s
sentry@workhorse:~/self-hosted$ docker volume rm sentry-kafka
sentry-kafka
sentry@workhorse:~/self-hosted$ docker volume rm sentry-zookeeper
sentry-zookeeper

But then reinstall fails

 Volume "sentry-self-hosted_sentry-nginx-cache"  Created
external volume "sentry-zookeeper" not found
Error in install/upgrade-clickhouse.sh:15.
'$dc up -d clickhouse' exited with status 1
-> ./install.sh:main:25
--> install/upgrade-clickhouse.sh:source:15

Event ID

No response


sposs commented Jun 17, 2024

Worst thing: I've removed everything (docker system prune -a), but now install always fails due to the missing volume.


sposs commented Jun 17, 2024

Apparently, docker system prune -a does not clean the volumes if their location is not standard, and that's why reinstalling fails.


djoeycl commented Jun 17, 2024

To get it working again, you need to run:

docker volume create sentry-zookeeper
docker volume create sentry-kafka
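
For completeness, the full reset sequence this thread converges on (tear down, drop the external volumes, recreate them empty, reinstall) would look roughly like this — a sketch only, assuming the stock self-hosted volume names and accepting the loss of unprocessed Kafka data:

cd ~/self-hosted
docker compose down --volumes                    # removes compose-managed volumes
docker volume rm sentry-kafka sentry-zookeeper   # the external volumes are not removed above
docker volume create sentry-kafka                # install.sh expects these external volumes to exist
docker volume create sentry-zookeeper
./install.sh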


getsantry bot commented Jul 10, 2024

This issue has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you remove the label Waiting for: Community, I will leave it alone ... forever!


"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀

@getsantry getsantry bot added the Stale label Jul 10, 2024

sposs commented Jul 10, 2024

Could the documentation be updated to mention this issue and how to properly resolve it? I've had the problem a couple of times already; I fight to prevent disk space issues, but fail. It would be really nice if Kafka did not fail so critically.


stayallive commented Jul 11, 2024

It's already documented: https://develop.sentry.dev/self-hosted/troubleshooting/#nuclear-option (see "Nuclear option" under the "Kafka" section).

Kafka can be deleted pretty safely since it only contains unprocessed data. So yes, you will lose some data, but only data that had not yet been processed into the database, so if you have to lose anything, Kafka is fortunately not that bad.

Unfortunately, I don't think there is much we can do to prevent Kafka from corrupting itself once your disk is full (this is probably more of a Kafka problem than a Sentry problem). You should add monitoring for disk space reaching critical levels and act before that happens, provision with a bit more headroom, or accept that you occasionally have to reset Kafka.
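
The disk-space monitoring suggested above can be as simple as a small watchdog run from cron. A minimal sketch (this helper is hypothetical, not part of Sentry self-hosted; the path and threshold are assumptions to adjust to your layout):

```shell
#!/bin/sh
# Hypothetical disk-space watchdog: returns non-zero once usage of a
# directory's filesystem crosses a threshold percent, so you can stop
# Kafka before it actually fills the disk and corrupts its log dir.
check_disk() {
  dir="$1"
  threshold="$2"
  # df -P: POSIX one-line-per-filesystem output; field 5 is "NN%"
  usage=$(df -P "$dir" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
  if [ "$usage" -ge "$threshold" ]; then
    echo "WARN: $dir at ${usage}% (>= ${threshold}%)"
    return 1
  fi
  echo "OK: $dir at ${usage}%"
}

# Example: warn at 90% on the Docker data root (path is an assumption)
check_disk / 90 || echo "time to stop Kafka and free space"
```

Run from cron every few minutes with something like `check_disk /var/lib/docker 90 || docker compose stop kafka` (paths and threshold are assumptions); pausing ingestion this way trades a backlog for a broker that stays recoverable.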

Anyway, hope the link will help if you need to resolve this in the future.


getsantry bot commented Aug 5, 2024

This issue has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you remove the label Waiting for: Community, I will leave it alone ... forever!


"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀

@getsantry getsantry bot added the Stale label Aug 5, 2024
@getsantry getsantry bot closed this as not planned Won't fix, can't repro, duplicate, stale Aug 13, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Aug 29, 2024
5 participants