This repository has been archived by the owner on Apr 4, 2023. It is now read-only.

Cassandra cluster can not recover if a C* pod and its data are deleted #350

wallrj opened this issue May 8, 2018 · 7 comments

wallrj (Member) commented May 8, 2018

If a C* Pod and its data are deleted, a replacement Pod will be started by the StatefulSet controller, but it will be started with an empty /var/lib/cassandra data directory.

This will cause a new Cassandra node UUID to be generated, and when the C* node attempts to join the cluster it will be treated as a new node rather than as a replacement.

This is much more likely if the cluster is configured to use persistent local storage rather than re-attachable networked PVs.

You can work around this by storing the C* node UUID as a field on e.g. a Pilot resource, or centrally on the CassandraCluster resource, so that when a pilot is restarted it can discover the original C* node UUID and start the Cassandra process with -Dcassandra.replace_address=<original_uuid>.
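A minimal sketch of that workaround in Python, assuming the pilot can read back a previously stored identity before launching Cassandra (the function and the storage mechanism are hypothetical, not part of Navigator):

```python
from typing import List, Optional


def cassandra_jvm_opts(stored_address: Optional[str]) -> List[str]:
    """Build the extra JVM options for a restarted pilot.

    `stored_address` stands for the identity previously recorded on
    the Pilot or CassandraCluster resource (a hypothetical field);
    None means this is a fresh node with nothing to reclaim.
    """
    opts: List[str] = []
    if stored_address:
        # Tell Cassandra this node replaces a dead member rather
        # than joining the ring as a brand-new node.
        opts.append("-Dcassandra.replace_address=" + stored_address)
    return opts


# The pilot would pass these through to the cassandra launcher.
print(cassandra_jvm_opts("10.0.0.12"))
```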

See cassandra-kubernetes-hostid, written as part of Improve Cassandra Example, which suggests that you can supply the "hostid" to -Dcassandra.replace_address when starting the C* node.

/kind bug

munnerz (Contributor) commented May 8, 2018

I'm going to reclassify this as a feature: storing the UUID on the Pilot resource is something we did not previously support, so when that identity is lost through some external means it is, as things stand, expected to be unrecoverable 😄

/kind feature

yanniszark commented May 30, 2018

Just for clarification 😄
When a Pod is deleted, the PV attached to that pod should not be deleted. That means there are two types of failure:

  1. The Pod is deleted/restarted but the node is healthy. If using local storage (I assume we are talking about Local Persistent Volumes (Beta) because hostpath is not supposed to be used for multi-node clusters), from what I know, this should be handled by C*, because you still have the PV with the data (source).
  2. The node is unhealthy and all its resources are unavailable: This is the case where a node replacement should occur, so I assume this issue is talking about this. When using network-attached storage (like on a cloud provider - NOT recommended by DataStax) the volume can be re-attached and this becomes a type 1 failure.

If I got something wrong please correct me
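To make the distinction above concrete, a type 1 failure only stays recoverable if the volume outlives the Pod. An illustrative (names hypothetical) local StorageClass for this setup: StatefulSet PVCs are not deleted when a Pod is, and `reclaimPolicy: Retain` additionally keeps the underlying volume, and so the C* data, even if the claim itself is deleted:

```
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-retained
provisioner: kubernetes.io/no-provisioner
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
```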

wallrj (Member, Author) commented May 30, 2018

Thanks @yanniszark. That's an accurate summary 👍

yanniszark commented May 30, 2018

I got too busy with the details and forgot to ask the actual question 😄
So in this issue, you are talking about a type 2 failure using local storage?

wallrj (Member, Author) commented May 31, 2018

So in this issue, you are talking about a type 2 failure using local storage?

Yep. :-)
Are you using / testing Navigator? Or interested in contributing? We'd be very interested to get your feedback on the project.

yanniszark commented May 31, 2018

I am interested in both 😄
For my thesis, I am looking at developing a cloud-native solution for C* to run on K8s.
This project is very interesting and I have learned a lot by browsing the issues you have encountered.
I am also looking at the Priam project by Netflix. Their model is very similar to what you have: a sidecar running alongside C* and a centralized storage (SimpleDB in their case, etcd in K8s).
Their system is tested in production for many years so I was thinking it could provide some good guidelines.

yanniszark commented Jul 2, 2018

A little follow-up on this:
From the source code, it seems such a thing is not feasible.
I will permalink the files I traced to arrive at this conclusion:

First, in StorageService.java where the join function for a new node is located:
https://github.com/apache/cassandra/blob/5cc68a87359dd02412bdb70a52dfcd718d44a5ba/src/java/org/apache/cassandra/service/StorageService.java#L475-L490

As we can see, it calls DatabaseDescriptor.getReplaceAddress() to get the replace address. This is of type InetAddressAndPort, so it is incompatible from the start, but we'll keep digging in case the UUID is resolved before it reaches this point.

The DatabaseDescriptor.getReplaceAddress() function:

https://github.com/apache/cassandra/blob/5cc68a87359dd02412bdb70a52dfcd718d44a5ba/src/java/org/apache/cassandra/config/DatabaseDescriptor.java#L1351-L1365

It calls InetAddressAndPort.getByName to retrieve the replace_address. The relevant function is this:

https://github.com/apache/cassandra/blob/5cc68a87359dd02412bdb70a52dfcd718d44a5ba/src/java/org/apache/cassandra/locator/InetAddressAndPort.java#L137-L157

This in turn calls HostAndPort.fromString to parse the address, which is part of the Google Core Libraries. A description can be found here.

Consequently, it seems it is not possible to provide a host_id to the replace_address option. Older releases of Cassandra had a replace_node option which accepted a UUID, but it was deprecated in favor of replace_address.

In this case, it seems the way to go is to store the IP address of the node.
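As a quick illustration of the type mismatch, here is a Python stand-in (not Cassandra's actual parsing, which goes through Guava's HostAndPort): an IP literal is a plausible replace_address value, while a host_id-style UUID can never be one:

```python
import ipaddress
import uuid


def looks_like_replace_address(value: str) -> bool:
    """Rough stand-in for the host-address check: replace_address
    must name a host, so an IP literal passes while a UUID cannot.
    Only the IP-literal case is mimicked here."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


print(looks_like_replace_address("10.0.0.12"))        # → True (IP literal)
print(looks_like_replace_address(str(uuid.uuid4())))  # → False (host_id-style UUID)
```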
