
Integration PR: HTTP Management Proxy Implementation #1356

Merged
merged 32 commits into master from integration/http-managementproxy on Oct 16, 2023

Conversation

Miles-Garnsey (Contributor)

This PR is for integration of the Reaper side of the HTTP management proxy.

Fixes issue:
Epic

@github-actions

No linked issues found. Please add the corresponding issues in the pull request description.
Use GitHub automation to close the issue when a PR is merged

@Miles-Garnsey Miles-Garnsey force-pushed the integration/http-managementproxy branch 2 times, most recently from 7f8b52f to 79844b3 Compare August 17, 2023 02:58
@Miles-Garnsey Miles-Garnsey changed the title Integration/http managementproxy Integration PR: HTTP Management Proxy Implementation Aug 17, 2023
@Miles-Garnsey (Contributor, Author)

Closed via #1408.

Miles-Garnsey and others added 24 commits October 3, 2023 11:29
Co-authored-by: Miles-Garnsey <miles.garnsey@datastax.com>
…ts into ICassandraManagementProxy and implement in both HTTP and JMX impls. (#1358)
Implement methods:
- getClusterName
- getLiveNodes
- clearSnapshot
- listSnapshots
- takeSnapshot
* Add stubbed polling of job details from the mgmt-api. This will not work without the actual client implementation.

Implement triggerRepair, getJobDetails, and the scheduler using apiClient, and add a simple test to ensure the state is managed correctly (see the sketch after this commit list).

* Merge test files after the rebase

* Add a test to verify the behavior of the notifications polling

* Address comments

* Implement schema methods in the HttpManagementProxy (fixes #1344)
* Move HTTP repair implementation over to new V2 endpoint.

* Strip out Cassandra <3 repair methods and fix method signatures to always use the newer RingRange.

* Fix tests.

* Remove tests for Cassandra <3.

* New cancelAllRepairs HTTP method. More tests.

* Bump version of management API client in pom.xml to bring in repair methods with the correct Long integer type.
…#1376)

* Implement HttpCassandraManagementProxy.getEndpointToHostId(), HttpCassandraManagementProxy.getLocalEndpoint(), HttpCassandraManagementProxy.getTokens()
…1373)

* Implement getTokenEndpointMap.
* Fix getTokens
* Implement getEndpointToHostId.

---------

Co-authored-by: Miles-Garnsey <miles.garnsey@datastax.com>
* Remove references to RunState NOT_EXISTING, since they cause spurious errors in tests and this state no longer exists.
* Remove references to JMX from ClusterFacade and make it more generic.
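
As a rough sketch of the trigger-and-poll flow described in the commits above (trigger a repair over HTTP, then poll job details until the job reaches a terminal state) — the client interface and method names below are placeholders for illustration, not the actual management API client signatures:

// Hypothetical sketch of the repair trigger + job polling loop the commits describe.
// The ManagementApiClient interface here is a stand-in, not the real management API client.
import java.time.Duration;
import java.util.List;

interface ManagementApiClient {
  String triggerRepairV2(String keyspace, List<String> tables); // returns a job id
  String getJobStatus(String jobId);                            // e.g. "RUNNING", "COMPLETED", "ERROR"
}

final class RepairJobPoller {
  private final ManagementApiClient apiClient;

  RepairJobPoller(ManagementApiClient apiClient) {
    this.apiClient = apiClient;
  }

  /** Triggers a repair over HTTP and polls job details until the job finishes one way or the other. */
  boolean repairAndWait(String keyspace, List<String> tables, Duration pollInterval)
      throws InterruptedException {
    String jobId = apiClient.triggerRepairV2(keyspace, tables);
    while (true) {
      String status = apiClient.getJobStatus(jobId);
      if ("COMPLETED".equals(status)) {
        return true;
      }
      if ("ERROR".equals(status)) {
        return false;
      }
      // In Reaper this polling would run on a scheduler rather than a busy sleep loop.
      Thread.sleep(pollInterval.toMillis());
    }
  }
}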
Miles-Garnsey and others added 2 commits October 3, 2023 11:29
Comment out test steps which rely on getPendingCompactions.
Comment out percent-repaired related test.
@Miles-Garnsey Miles-Garnsey force-pushed the integration/http-managementproxy branch from d72a78c to e2d1998 Compare October 3, 2023 00:29
adejanovski and others added 3 commits October 4, 2023 11:36
* Implement getPendingCompactions in the HttpManagementProxy

Since metrics are now exposed on a different port than the mgmt-api itself, this required going through the HttpMetricsProxy, which was partially implemented for that need.
It can now pull metrics from the metrics endpoint and parse them into GenericMetrics.
* Put a hook in the docker container's cassandra-reaper.yml so that the HTTP management proxy can be enabled via environment variable, instead of only through the config file.
* Update management API client and remove references to notifications in v2 repair requests.
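
A minimal sketch of the metrics-scraping path that getPendingCompactions now relies on, assuming a Prometheus-style text exposition on the separate metrics port; the URL path, port parameter, and metric name below are assumptions for illustration, not the actual HttpMetricsProxy API:

// Illustrative only: fetch the text exposition from the metrics port and pull out a
// pending-compactions gauge. The endpoint path and metric name are assumptions.
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.OptionalInt;

final class PendingCompactionsScraper {
  private static final HttpClient HTTP = HttpClient.newHttpClient();

  static OptionalInt pendingCompactions(String host, int metricsPort)
      throws IOException, InterruptedException {
    HttpRequest request = HttpRequest.newBuilder(
        URI.create("http://" + host + ":" + metricsPort + "/metrics")).GET().build();
    String body = HTTP.send(request, HttpResponse.BodyHandlers.ofString()).body();
    // Prometheus text format: one "<name>{labels} <value>" sample per line; '#' lines are comments.
    for (String line : body.split("\n")) {
      if (line.startsWith("#") || !line.contains("pending_compactions")) {
        continue;
      }
      String[] parts = line.trim().split("\\s+");
      double value = Double.parseDouble(parts[parts.length - 1]);
      return OptionalInt.of((int) value);
    }
    return OptionalInt.empty();
  }
}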
@Miles-Garnsey (Contributor, Author) commented on Oct 13, 2023:

Reporting the results of my tests:

  • Using a test Cassandra/Management API image built from here, I can get Reaper to start and connect via the HTTP API.
  • I am using a Reaper version built from the latest integration branch in this repo.
  • Starting a Repair gives the below results when a single node cluster is deployed. The repair appears to start correctly, and (because we've added no data) it correctly detects that nothing can be repaired.
INFO   [2023-10-13 05:16:04,066] [test:2d7afb80-6987-11ee-b146-dfa5cc7633db] i.c.s.RepairRunner - Next segment to run : 2d802ba0-6987-11ee-b146-dfa5cc7633db 
INFO   [2023-10-13 05:16:04,126] [test:2d7afb80-6987-11ee-b146-dfa5cc7633db:2d802ba0-6987-11ee-b146-dfa5cc7633db] i.c.s.SegmentRunner - Nothing to repair for segment 2d802ba0-6987-11ee-b146-dfa5cc7633db in keyspace reaper_db 
INFO   [2023-10-13 05:16:34,097] [test:2d7afb80-6987-11ee-b146-dfa5cc7633db] i.c.s.RepairRunner - Attempting to run new segment... 
...

The repair appears to be progressing normally up to this point.

I then expand the cluster to 3 nodes and use cqlsh to add some data to the keyspace under repair on the original node (in a newly created table, while the new nodes are coming up). I then see the following:

INFO   [2023-10-13 05:28:59,812] [test:2d7afb80-6987-11ee-b146-dfa5cc7633db] i.c.s.RepairRunner - Attempting to run new segment... 
ERROR  [2023-10-13 05:28:59,814] [test:2d7afb80-6987-11ee-b146-dfa5cc7633db] i.c.s.RepairRunner - RepairRun FAILURE, scheduling retry 
java.lang.NumberFormatException: For input string: "null"
	at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.base/java.lang.Integer.parseInt(Integer.java:652)
	at java.base/java.math.BigInteger.<init>(BigInteger.java:536)
	at java.base/java.math.BigInteger.<init>(BigInteger.java:674)
	at io.cassandrareaper.storage.repairsegment.CassandraRepairSegmentDao.createRepairSegmentFromRow(CassandraRepairSegmentDao.java:97)
	at io.cassandrareaper.storage.repairsegment.CassandraRepairSegmentDao.getRepairSegmentsForRun(CassandraRepairSegmentDao.java:270)
	at io.cassandrareaper.storage.repairsegment.CassandraRepairSegmentDao.getNextFreeSegments(CassandraRepairSegmentDao.java:278)
	at io.cassandrareaper.service.RepairRunner.startNextSegment(RepairRunner.java:489)
	at io.cassandrareaper.service.RepairRunner.run(RepairRunner.java:250)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
	at com.codahale.metrics.InstrumentedScheduledExecutorService$InstrumentedRunnable.run(InstrumentedScheduledExecutorService.java:241)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66)
	at java.base/java.lang.Thread.run(Thread.java:829)
ERROR  [2023-10-13 05:29:12,470] [clustername-reconnection-0] c.d.d.c.AbstractReconnectionHandler - Authentication error on host test-dc1-service/10.96.0.5:9042: Provided username test-reaper and/or password are incorrect 
ERROR  [2023-10-13 05:29:12,472] [clustername-reconnection-0] c.d.d.c.Cluster - Authentication error during reconnection to test-dc1-service/10.96.0.5:9042, scheduling retry in 256000 milliseconds 
com.datastax.driver.core.exceptions.AuthenticationException: Authentication error on host test-dc1-service/10.96.0.5:9042: Provided username test-reaper and/or password are incorrect
	at com.datastax.driver.core.Connection$9.apply(Connection.java:553)
	at com.datastax.driver.core.Connection$9.apply(Connection.java:515)
	at com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.doTransform(AbstractTransformFuture.java:211)
	at com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.doTransform(AbstractTransformFuture.java:200)
	at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:111)
	at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:398)
	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1027)
	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:868)
	at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:691)
	at com.datastax.driver.core.Connection$Future.onSet(Connection.java:1540)
	at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1290)
	at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1208)
	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:296)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)
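
For context on the first stack trace above: BigInteger's string constructor throws exactly this NumberFormatException when handed the literal string "null", which suggests the segment row read in createRepairSegmentFromRow contains a null or missing value that gets stringified before parsing. A minimal illustration of the failure mode and a defensive guard (hypothetical, not the actual Reaper code):

import java.math.BigInteger;

final class TokenParsing {
  /** Parses a token value read from a row, tolerating null or the string "null" instead of throwing. */
  static BigInteger parseTokenOrNull(String raw) {
    if (raw == null || "null".equals(raw)) {
      return null; // the DAO would need to decide how a missing token should be represented
    }
    return new BigInteger(raw);
  }

  public static void main(String[] args) {
    System.out.println(parseTokenOrNull("12345")); // 12345
    System.out.println(parseTokenOrNull("null"));  // null, instead of throwing
    try {
      new BigInteger("null");                      // reproduces the error seen above
    } catch (NumberFormatException e) {
      System.out.println(e.getMessage());          // prints something like: For input string: "null"
    }
  }
}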

Prior to that issue, I think Reaper was also throwing some errors around the connectAny method, saying it was unable to connect to any of the hosts (the host it was trying to connect to was, from recollection, the seed service).

The repair has stalled at this point; Reaper appears to be in a crash loop and the UI is not responding (I assume the probes aren't responding either, hence why it is being restarted).

I also see:

ERROR  [2023-10-13 05:33:41,044] [ReaperApplication-scheduler] i.c.ReaperApplication - Couldn't resume running repair runs 
io.cassandrareaper.ReaperException: com.datastax.driver.core.exceptions.UnauthorizedException: User test-reaper has no SELECT permission on <table reaper_db.cluster> or any of its parents
	at io.cassandrareaper.service.RepairManager.resumeRunningRepairRuns(RepairManager.java:183)
	at io.cassandrareaper.ReaperApplication.lambda$scheduleRepairManager$2(ReaperApplication.java:369)
	at com.codahale.metrics.InstrumentedScheduledExecutorService$InstrumentedRunnable.run(InstrumentedScheduledExecutorService.java:241)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66)
	at java.base/java.lang.Thread.run(Thread.java:829)

I am not convinced this is an issue with the new HTTP management logic, since authentication to the DB doesn't seem related to our changes. I will continue experimenting.

One of the logs from Cassandra itself is below:
server-system-logger.log

@Miles-Garnsey (Contributor, Author) commented on Oct 13, 2023:

I do note that when I delete the k8ssandra cluster, secrets remain in the namespace and are not cleaned up:

kubectl get secrets -n k8ssandra-operator 
NAME                                     TYPE                                  DATA   AGE
k8ssandra-operator-token                 kubernetes.io/service-account-token   3      45m
k8ssandra-operator-webhook-server-cert   kubernetes.io/tls                     3      45m
test-reaper                              Opaque                                2      45m
test-reaper-ui                           Opaque                                2      45m
test-superuser                           Opaque                                2      45m
webhook-server-cert                      kubernetes.io/tls                     3      45m

So maybe there is an issue where the superuser secret is changed when the cluster is scaled, but the new secret is not remounted (or hot reloaded) by Reaper.

@Miles-Garnsey (Contributor, Author):

When I start a repair on a 3 node cluster which has never been restarted (but contains no data), I get the following outcome:

INFO   [2023-10-13 06:09:28,656] [test:fc570d20-698e-11ee-b292-8b6204c14f24] i.c.s.RepairRunner - Next segment to run : fc6209a1-698e-11ee-b292-8b6204c14f24 
INFO   [2023-10-13 06:09:28,945] [test:fc570d20-698e-11ee-b292-8b6204c14f24:fc6209a1-698e-11ee-b292-8b6204c14f24] i.c.s.RepairRunner - Triggered repair of segment fc6209a1-698e-11ee-b292-8b6204c14f24 via host 10.96.2.6 
INFO   [2023-10-13 06:09:28,946] [test:fc570d20-698e-11ee-b292-8b6204c14f24:fc6209a1-698e-11ee-b292-8b6204c14f24] i.c.s.SegmentRunner - Repair for segment fc6209a1-698e-11ee-b292-8b6204c14f24 started, status wait will timeout in 1800000 millis 
INFO   [2023-10-13 06:09:33,134] [test:fc570d20-698e-11ee-b292-8b6204c14f24:fc6209a1-698e-11ee-b292-8b6204c14f24] i.c.s.SegmentRunner - Repair command 1 on segment fc6209a1-698e-11ee-b292-8b6204c14f24 returned with state DONE
INFO   [2023-10-13 06:09:58,748] [test:fc570d20-698e-11ee-b292-8b6204c14f24] i.c.s.RepairRunner - Attempting to run new segment...

The repair progresses. While I have not let it finish, it seems healthy.

If I add data using tlp-stress, delete some SSTables at random from /var/lib/cassandra/data, run nodetool flush and then nodetool refresh (all of which should create entropy), and then run a repair, I get the following log output:

INFO   [2023-10-13 06:33:01,937] [test:59a9dc70-6992-11ee-b292-8b6204c14f24] i.c.s.RepairRunner - Next segment to run : 59ac7484-6992-11ee-b292-8b6204c14f24 
INFO   [2023-10-13 06:33:02,021] [test:59a9dc70-6992-11ee-b292-8b6204c14f24:59ac7484-6992-11ee-b292-8b6204c14f24] i.c.s.RepairRunner - Triggered repair of segment 59ac7484-6992-11ee-b292-8b6204c14f24 via host 10.96.1.34 
INFO   [2023-10-13 06:33:02,021] [test:59a9dc70-6992-11ee-b292-8b6204c14f24:59ac7484-6992-11ee-b292-8b6204c14f24] i.c.s.SegmentRunner - Repair for segment 59ac7484-6992-11ee-b292-8b6204c14f24 started, status wait will timeout in 1800000 millis
INFO   [2023-10-13 06:33:31,977] [test:59a9dc70-6992-11ee-b292-8b6204c14f24] i.c.s.RepairRunner - Attempting to run new segment... 
INFO   [2023-10-13 06:33:31,991] [test:59a9dc70-6992-11ee-b292-8b6204c14f24] i.c.s.RepairRunner - All nodes are busy or have too many pending compactions for the remaining candidate segments. 
INFO   [2023-10-13 06:33:31,999] [test:59a9dc70-6992-11ee-b292-8b6204c14f24] i.c.s.RepairRunner - All nodes are busy or have too many pending compactions for the remaining candidate segments. 
INFO   [2023-10-13 06:33:32,008] [test:59a9dc70-6992-11ee-b292-8b6204c14f24] i.c.s.RepairRunner - Repair amount done 0.0
INFO   [2023-10-13 06:34:02,015] [test:59a9dc70-6992-11ee-b292-8b6204c14f24] i.c.s.RepairRunner - Attempting to run new segment... 

This actually seems like reasonable output. However, the repair never progresses, and the Cassandra nodes report errors too.

On the node from which the SSTables were deleted:

INFO  [Repair#5:1] 2023-10-13 07:03:05,732 RepairRunnable.java:216 - [repair #8dc123c0-6996-11ee-92b9-bfede922d060]Repair command #5 finished with error
INFO  [RepairSnapshotExecutor:1] 2023-10-13 07:03:05,732 ActiveRepairService.java:714 - [repair #8dc123c0-6996-11ee-92b9-bfede922d060] Clearing snapshots for tlp_stress.sensor_data
INFO  [RepairSnapshotExecutor:1] 2023-10-13 07:03:05,733 ActiveRepairService.java:724 - [repair #8dc123c0-6996-11ee-92b9-bfede922d060] Cleared snapshots in 0ms
ERROR [Stream-Deserializer-/10.96.0.7:7000-26824f63] 2023-10-13 07:03:06,104 StreamingInboundHandler.java:205 - [Stream channel: 26824f63] stream operation from /10.96.0.7:7000 failed
java.lang.IllegalStateException: unknown stream session: 8e1425c1-6996-11ee-92b9-bfede922d060 - 0
	at org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:45)
	at org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:38)
	at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:53)
	at org.apache.cassandra.streaming.async.StreamingInboundHandler$StreamDeserializingTask.run(StreamingInboundHandler.java:172)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)

On a node from which SSTables were not deleted:

INFO  [Stream-Deserializer-/10.96.1.34:7000-6cd9040e] 2023-10-13 07:03:06,100 StreamResultFuture.java:193 - [Stream #8e1425c0-6996-11ee-92b9-bfede922d060] Session with /10.96.1.34:7000 is complete
WARN  [Stream-Deserializer-/10.96.1.34:7000-6cd9040e] 2023-10-13 07:03:06,100 StreamResultFuture.java:220 - [Stream #8e1425c0-6996-11ee-92b9-bfede922d060] Stream failed
ERROR [Stream-Deserializer-/10.96.1.34:7000-e99d7b27] 2023-10-13 07:03:06,103 StreamingInboundHandler.java:205 - [Stream channel: e99d7b27] stream operation from /10.96.1.34:7000 failed
java.lang.IllegalStateException: unknown stream session: 8e1425c0-6996-11ee-92b9-bfede922d060 - 0
	at org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:45)
	at org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:38)
	at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:53)
	at org.apache.cassandra.streaming.async.StreamingInboundHandler$StreamDeserializingTask.run(StreamingInboundHandler.java:172)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)

@Miles-Garnsey Miles-Garnsey merged commit 8a26f9d into master Oct 16, 2023
11 checks passed