Fix ConcurrentContainer lifecycle issues #3406

LokeshAlamuri · 2024-08-03T16:51:02Z

This commit would fix the issue.

'isChildRunning' API would return false only after all the containers are actually stopped.
Add 'stopAbnormally' in a Lock.
Call childStarted in ConcurrentContainer from KafkaMessageListenerContainer right before publishing ConsumerStartedEvent.

artembilan

ConcurrentContainer start would be permitted only after all the containers running status is false.

Would you mind revising the logic so that these operations are idempotent?
So, if start() has been called before, that does not mean that we cannot call it again.
Same for stop().

Probably, if you think about these lifecycle hooks as idempotent operations, then the logic would be much simpler. Or the problem would go away altogether.

I am also curious how this fix correlates with your earlier fenced one.
At a glance, they contradict each other.

Thanks
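
To make the idempotency suggestion above concrete, here is a minimal sketch. `IdempotentLifecycle` is a hypothetical stand-alone class, not the spring-kafka container; it only illustrates that repeated start() or stop() calls can be harmless no-ops.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-alone model, not a spring-kafka class.
class IdempotentLifecycle {

    private final AtomicBoolean running = new AtomicBoolean();

    // Count how many times the real start/stop work was performed.
    final AtomicInteger startCount = new AtomicInteger();
    final AtomicInteger stopCount = new AtomicInteger();

    // Calling start() on an already running component does nothing.
    void start() {
        if (this.running.compareAndSet(false, true)) {
            this.startCount.incrementAndGet();
        }
    }

    // Calling stop() on an already stopped component does nothing.
    void stop() {
        if (this.running.compareAndSet(true, false)) {
            this.stopCount.incrementAndGet();
        }
    }

    boolean isRunning() {
        return this.running.get();
    }
}
```

With this shape, a caller never has to check the state before calling start() or stop(); a second call in a row simply has no effect.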

artembilan · 2024-08-06T19:24:16Z

...kafka/src/main/java/org/springframework/kafka/listener/AbstractMessageListenerContainer.java

@@ -277,6 +277,10 @@ public boolean isRunning() {
 		return this.running;
 	}

+	protected boolean canStop() {
+		return this.running;


Why is isRunning() not enough?
It is totally OK to have extra logic in the overridden method in ConcurrentMessageListenerContainer.

Assume concurrency == 2.
Cmain -- concurrent container
C0, C1 -- child containers

start the concurrent container.

Cmain -- running
C0 -- running
C1 -- running

stop container C0 manually.

Cmain -- not running
C0 -- not running
C1 -- running

stop 'Cmain' container.

As per the earlier condition, the concurrent container's running status is false, so it will not be stopped.

A new condition has to be added to verify whether the child containers have really been cleared. This is equivalent to checking whether stop was called previously.

stop container C0 manually.

Cmain -- not running
C0 -- not running
C1 -- running

But this situation is not correct.
The Cmain has to be running until any of its children is running.

But as per the API definition,

From interface:
org.springframework.context.Lifecycle Check whether this component is currently running.
In the case of a container, this will return true only if all components that apply are currently running

The Cmain has to be running until any of its children is running.

The Cmain ConcurrentContainer is not stopped. Only its running status will be set to false, to indicate that one or all containers are stopped, as per the API definition.

The problem with returning false when just one child container is stopped is that the rest are still running, and we don't know about that from the parent container since, as you said, it has to return false.
If that is what you want to implement here, then it is not OK to allow stopping any child container individually.
If we do allow it (and I don't see why not), then false for the parent container would lead to a resource leak in the next start() call.
Pay attention that we don't stop any running containers over there, but just create new instances.
Also pay attention that this.startedContainers.set(0); is wrong here, because we don't take into account the containers that are running at that moment.
And all of that just because we decided to return false for the situation when at least one child container is stopped.
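
The leak described above can be reproduced with a small model. This is only a sketch under stated assumptions: `LeakScenario` and its nested `Child` are hypothetical classes, the parent reports running only when all children run, and start() is guarded by isRunning() alone.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplified model of the scenario; not the spring-kafka classes.
class LeakScenario {

    static class Child {
        boolean running = true;
    }

    // Every child ever created, so still-running orphans can be counted.
    final List<Child> allCreated = new ArrayList<>();

    // Children of the current run.
    List<Child> containers = new ArrayList<>();

    final int concurrency;

    LeakScenario(int concurrency) {
        this.concurrency = concurrency;
    }

    // Policy under discussion: the parent is running only if ALL children run.
    boolean isRunning() {
        return !this.containers.isEmpty()
                && this.containers.stream().allMatch(c -> c.running);
    }

    // start() guarded only by isRunning(): it creates new instances and
    // does not stop children that are still running.
    void start() {
        if (!isRunning()) {
            this.containers = new ArrayList<>();
            for (int i = 0; i < this.concurrency; i++) {
                Child child = new Child();
                this.containers.add(child);
                this.allCreated.add(child);
            }
        }
    }

    long runningChildren() {
        return this.allCreated.stream().filter(c -> c.running).count();
    }
}
```

With concurrency 2, stopping C0 by hand and then calling start() again leaves three running children in this model: the orphaned C1 plus two new ones.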

The problem with returning false when just one child container is stopped is that the rest are still running, and we don't know about that from the parent container since, as you said, it has to return false.

If it returns false, that indicates that one, several, or all containers are stopped. We can find out which containers are running by iterating over the containers.

If that is what you want to implement here, then it is not OK to allow stopping any child container individually.
If we do allow it (and I don't see why not), then false for the parent container would lead to a resource leak in the next start() call.

As per the existing logic, running containers are stopped if stop is called on the ConcurrentContainer. There is no resource leak here. Please advise.

Also pay attention that this.startedContainers.set(0); is wrong here, because we don't take into account the containers that are running at that moment.

this.startedContainers.set(0) -- Indicates that initially no child container has started. The counter is incremented once a child starts and decremented once it stops.

The problem with returning false when just one child container is stopped is that the rest are still running, and we don't know about that from the parent container since, as you said, it has to return false.

As mentioned earlier, I am happy to implement this as per your comments.

Set the running status to false only if all the containers are stopped; otherwise, set it to true.

I will verify one more time and confirm this. I wanted to share my views. I am looking for your final review of my earlier comments before I implement this as you mentioned.

artembilan · 2024-08-06T19:29:03Z

...fka/src/main/java/org/springframework/kafka/listener/ConcurrentMessageListenerContainer.java

+					return true;
+				}
+			}
+			if ((!isRunning() && this.startedContainers.get() > 0)) {


I don't understand this extra logic.
The purpose of the isChildRunning() is to report if any child container is active at the moment.
Why do we need to check !this.isRunning() and then number of started containers?
Didn't we discuss with you in the other PR (#3377) that it is abnormal to have a child container running while its parent is stopped?

The condition on line 216 is the critical logic. The running status of the ConcurrentContainer will be set to false under the following conditions.

The ConcurrentContainer itself is stopped manually.

One of the child containers is stopped for any reason.

The startedContainers count is decremented only when the actual container exits. So, I check whether the ConcurrentContainer is stopped while it still has child containers processing messages. In that case, isChildRunning() should return true.

artembilan · 2024-08-06T19:30:32Z

...fka/src/main/java/org/springframework/kafka/listener/ConcurrentMessageListenerContainer.java

@@ -235,7 +243,7 @@ public boolean isChildRunning() {
 	 */
 	@Override
 	protected void doStart() {
-		if (!isRunning()) {
+		if (!isRunning() && this.containers.stream().allMatch(container -> !container.isRunning())) {


This again sounds like a contradiction to what we have with the fenced child container.
Why would one be running if the parent is stopped?

This condition prevents a start call on the ConcurrentContainer from proceeding before the child containers are stopped. It verifies that all the containers are really stopped.

This is similar to the earlier logic: if the container is in running status, subsequent start calls are ignored.

Earlier, the running status of the concurrent container was not set to false even if one of the containers was stopped. I have changed this so that the concurrent container's running status is set to false even if only one of the containers is stopped.

In this case, the condition that verifies whether the concurrent container is allowed to start needs to change. Start is allowed only if the running status of all the containers is 'false'.

In this case, the condition that verifies whether the concurrent container is allowed to start needs to change. Start is allowed only if the running status of all the containers is 'false'.

That's also not how I think about this concurrent container logic.
It is in the running state when any of its child containers is running.
When we start the concurrent container, it is supposed to start all of its children.
If some of them are already running, then the call is idempotent.
As we stated before: we just don't allow orphaned child containers to be restarted.
Probably a much more robust solution is to stop all the children when we call start on their parent.
This way any new start would give us a fresh state. Kind of a total renewal.
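
The "stop all the children on start" idea above could look roughly like the following. `RenewingParent` and `Child` are hypothetical names used only for illustration; the sketch assumes stopping a child is synchronous, which is not true of real listener containers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplified model; not the spring-kafka implementation.
class RenewingParent {

    static class Child {
        private boolean running = true;

        void stop() {
            this.running = false;
        }

        boolean isRunning() {
            return this.running;
        }
    }

    private final List<Child> containers = new ArrayList<>();

    private final int concurrency;

    RenewingParent(int concurrency) {
        this.concurrency = concurrency;
    }

    // Stop whatever is left from the previous run, then recreate every child.
    void start() {
        this.containers.forEach(Child::stop);
        this.containers.clear();
        for (int i = 0; i < this.concurrency; i++) {
            this.containers.add(new Child());
        }
    }

    // Defensive copy so callers cannot modify the internal list.
    List<Child> getContainers() {
        return new ArrayList<>(this.containers);
    }
}
```

Because start() always begins from a clean state here, no child from an earlier run can stay alive next to the new set.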

When we start the concurrent container, it is supposed to start all of its children.
If some of them are already running, then the call is idempotent.
As we stated before: we just don't allow orphaned child containers to be restarted.
Probably a much more robust solution is to stop all the children when we call start on their parent.
This way any new start would give us a fresh state. Kind of a total renewal.

This is already happening. I have not modified anything here.

That's also not how I think about this concurrent container logic.
It is in the running state when any of its child containers is running.

I have set the running status to false as per the API definition

I still believe that we have to return true if any of the children are running.

I am good with this.
But one question: what is your suggestion for finding out, from the API, whether any of the containers is stopped? Do we need to get all the containers and verify whether any of them is stopped?

Or leave it as is, because we just don't change concurrent running state even if we stop all its children manually.

I think here we must set the running status to false if all the containers are stopped. Let us not leave it as it is.

Well, that's start() of the concurrent container.
I think the best solution would be to stop all the currently running children and go ahead with the rest of the logic, where we really re-create the child containers.

I might agree that we can move the concurrent container to the not-running state if all of its children are stopped.
That is, if we have an API to notify via thisOrParentContainer that a child is stopped.

Do we need to get all the containers and verify whether any of them is stopped?

Why do we need to check if they are stopped?
I think iteration over this.containers and checking isRunning() is enough.
Is isChildRunning() OK?
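
As a rough illustration of the "iterate over this.containers and check isRunning()" idea, here is a minimal sketch. `ChildRunningCheck` is a hypothetical helper, and each `BooleanSupplier` merely stands in for one child's isRunning(); the real isChildRunning() in the PR may contain additional logic.

```java
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical helper; each BooleanSupplier stands in for a child's isRunning().
class ChildRunningCheck {

    // True if at least one child reports running; false for an empty list.
    static boolean isChildRunning(List<BooleanSupplier> children) {
        return children.stream().anyMatch(BooleanSupplier::getAsBoolean);
    }
}
```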

Well, that's start() of the concurrent container.
I think the best solution would be to stop all the currently running children and go ahead with the rest of the logic, where we really re-create the child containers.

As per the current code:

The start API starts new child containers.
The stop API stops any running containers.

I feel this is correct. Let us not change anything here. I have added new conditions to verify whether the ConcurrentContainer is really stopped.

As per the earlier code, a second start call is ignored. An application developer can stop and then start the ConcurrentContainer directly if they plan to run it again. However, it is good practice to wait until the isChildRunning API returns false, which indicates that all the resources have been released.

Do we need to get all the containers and verify whether any of them is stopped?

Why do we need to check if they are stopped? I think iteration over this.containers and checking isRunning() is enough. Is isChildRunning() OK?

The isChildRunning API returns false only when all message processing has actually stopped. Otherwise, it always returns true. This expected and requested behavior is really useful, since it prevents doubling the capacity requirements on the Kafka cluster as well as on the JVM.

Application developers need to iterate over the containers to verify whether any of them is stopped. As per your suggestion, the ConcurrentContainer running status is set to false only when all the containers are stopped.
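
The practice of waiting for isChildRunning to return false before starting again could be sketched as a small polling helper. `StopAwaiter` and `awaitChildrenStopped` are hypothetical names, not part of spring-kafka, and the 10 ms poll interval is an arbitrary choice.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of spring-kafka.
class StopAwaiter {

    // Polls until the supplier reports false or the timeout elapses.
    // Returns true if the children stopped in time, false otherwise.
    static boolean awaitChildrenStopped(BooleanSupplier isChildRunning, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (isChildRunning.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(10);
            }
            catch (InterruptedException ex) {
                // Preserve the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

An application might call it as `StopAwaiter.awaitChildrenStopped(container::isChildRunning, Duration.ofSeconds(30))` after `container.stop()`, and only call `container.start()` again when it returns true.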

artembilan · 2024-08-06T19:32:15Z

spring-kafka/src/main/java/org/springframework/kafka/listener/ContainerGroupSequencer.java

@@ -152,24 +152,22 @@ public synchronized void onApplicationEvent(ListenerContainerIdleEvent event) {
 	}

 	private synchronized void stopParentAndCheckGroup(MessageListenerContainer parent) {
-		if (parent.isRunning()) {


Why did you remove this condition?

I removed the condition that verifies that running is true, because the status of the ConcurrentContainer would be set to false if one of the containers is stopped. Earlier, it worked because the running status of the ConcurrentContainer was not set to false when any of the containers stopped, and was not set to true after it started.

artembilan · 2024-08-06T19:33:21Z

...rc/test/java/org/springframework/kafka/listener/ConcurrentMessageListenerContainerTests.java

+
+			@Override
+			protected Consumer<Integer, String> createKafkaConsumer(String groupId, String clientIdPrefix,
+																	String clientIdSuffixArg, Properties properties) {


This does not look like OK code formatting.
Can you revise it, please, so it looks nice from a review perspective?

artembilan · 2024-08-09T15:30:02Z

@LokeshAlamuri ,

let's step back and try to understand what the problem is at all.

Would you mind explaining the reasoning behind your work?
I thought you had implemented fenced before for the situation when a child container can be started while its parent is stopped.
So, that means that we are able to stop any particular child container, and I don't see a reason not to allow that.
But I feel like this is a completely different story and that is not what you are trying to achieve here.
However, I still cannot fully understand what you are doing and why.

Thanks

LokeshAlamuri · 2024-08-09T16:50:29Z

I thought you had implemented fenced before for the situation when a child container can be started while its parent is stopped.
So, that means that we are able to stop any particular child container, and I don't see a reason not to allow that.

The fenced child container issue is very clear and straightforward. Assume the ConcurrentContainer is stopped. If we still hold a reference to a child container, it is possible to start that child container. This is not correct, because the child holds a reference to the ConcurrentContainer. It is nothing but corrupting the ConcurrentContainer. I have provided a JUnit test that shows clearly how this scenario can be reproduced.

LokeshAlamuri · 2024-08-09T16:53:55Z

But I feel like this is a completely different story and that is not what you are trying to achieve here.
However, I still cannot fully understand what you are doing and why.

I am trying to fix the ConcurrentContainer lifecycle issues here, especially the isChildRunning API. As discussed earlier, the isChildRunning API should return false only when all the containers are stopped. This is the most critical thing for any application. It indicates, when exactly the spring-kafka component is completely stopped.

artembilan · 2024-08-09T17:06:42Z

It is nothing but corrupting the ConcurrentContainer.

How does it corrupt it with its start if we don't stop the ConcurrentContainer?
You can choose some child container to be stopped for any reason and then start it back whenever you need.

It indicates, when exactly the spring-kafka component is completely stopped.

Right. And that's what I mean by keeping the ConcurrentContainer running as long as at least one of its child containers is running.
Or, it feels even more natural to me to have the ConcurrentContainer always running until its stop() is explicitly called.
This way we still maintain the same set of this.containers, and it's up to the target application to decide what to do with every child container.
Why do you find this logic wrong?

LokeshAlamuri · 2024-08-09T17:14:22Z

It indicates, when exactly the spring-kafka component is completely stopped.

Right. And that's what I mean by keeping the ConcurrentContainer running as long as at least one of its child containers is running. Or, it feels even more natural to me to have the ConcurrentContainer always running until its stop() is explicitly called. This way we still maintain the same set of this.containers, and it's up to the target application to decide what to do with every child container. Why do you find this logic wrong?

Looks good to me. Only one issue: if we keep the running status of the ConcurrentContainer as true even after one or all of the containers are stopped, then anyone who wants to monitor the system through the API can no longer rely on the running status and always has to iterate over the containers to see whether everything is functioning properly.

But if we set the running status of the ConcurrentContainer to false if any of the containers fails, then from the API perspective it signals that something is wrong, and callers can add logic to handle the situation. A running status of true indicates that all is good. In this model, the running flag is really useful.

LokeshAlamuri · 2024-08-09T17:17:36Z

It is nothing but corrupting the ConcurrentContainer.

How does it corrupt it with its start if we don't stop the ConcurrentContainer? You can choose some child container to be stopped for any reason and then start it back whenever you need.

As mentioned in my previous comment, I am not trying to stop ConcurrentContainer. I am only trying to stop the fenced containers from starting once again since they are holding the reference to the ConcurrentContainer. That is the fix we gave for the issue #3371.

artembilan · 2024-08-09T17:29:07Z

set the running status of the ConcurrentContainer to false if any of the containers fails

Right, but then it is a chicken-and-egg problem.
I see the situation where other child containers are still running, yet the whole ConcurrentContainer says that nothing is running, so it is OK to call its start, and we end up in a leak situation where the ConcurrentContainer spawns new containers without stopping the others.
Even if we fix this by refreshing the whole state in start, I still feel like this is overhead and a pointless number of operations.
With running controlled only by the ConcurrentContainer itself, it is OK to deal with each of its children individually.
And we really refresh the state when we restart the whole ConcurrentContainer explicitly.
The combination of isRunning() and isChildRunning() should cover all the possible situations and simplify the target logic.
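
One way to read "the combination of isRunning() and isChildRunning() should cover all the possible situations" is as a four-way classification. The enum below is a hypothetical sketch, not part of spring-kafka; it assumes `childRunning` means "at least one child is still active", independent of the parent's own flag.

```java
// Hypothetical classification, not part of spring-kafka.
enum ContainerState {

    ACTIVE,    // parent running, at least one child running
    IDLE,      // parent running, every child stopped individually
    STOPPING,  // parent stopped, some child still finishing its work
    STOPPED;   // parent stopped and every child stopped

    static ContainerState of(boolean running, boolean childRunning) {
        if (running) {
            return childRunning ? ACTIVE : IDLE;
        }
        return childRunning ? STOPPING : STOPPED;
    }
}
```

Under this reading, only `STOPPED` tells the application that all resources of the previous run have been released.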

LokeshAlamuri · 2024-08-09T17:36:21Z

set the running status of the ConcurrentContainer to false if any of the containers fails

Right, but then it is a chicken-and-egg problem. I see the situation where other child containers are still running, yet the whole ConcurrentContainer says that nothing is running, so it is OK to call its start, and we end up in a leak situation where the ConcurrentContainer spawns new containers without stopping the others. Even if we fix this by refreshing the whole state in start, I still feel like this is overhead and a pointless number of operations. With running controlled only by the ConcurrentContainer itself, it is OK to deal with each of its children individually. And we really refresh the state when we restart the whole ConcurrentContainer explicitly. The combination of isRunning() and isChildRunning() should cover all the possible situations and simplify the target logic.

OK, then I will keep the running status logic as before and revert the corresponding changes in my PR. Once done, we can discuss further. Thank you.

LokeshAlamuri · 2024-08-14T17:13:06Z

Gentle reminder. I have updated the PR as per our discussion.

...fka/src/main/java/org/springframework/kafka/listener/ConcurrentMessageListenerContainer.java

artembilan · 2024-08-14T17:34:23Z

...fka/src/main/java/org/springframework/kafka/listener/ConcurrentMessageListenerContainer.java

+		this.lifecycleLock.lock();
+		try {
+			if (this.containers.contains(child) || this.stoppedContainers.contains(child)) {
+				this.startedContainers.incrementAndGet();


This logic somehow does not compile in my head.
Why do we need to track those stoppedContainers at all?
It does not look like that brings any benefit over the setFenced() introduced before.

To my mind, the this.containers collection is entirely sufficient to track the whole lifecycle of the children if their start/stop is done manually.
Well, I still see a performance benefit in the this.startedContainers counter, but this stoppedContainers smells like overhead.

When the container's stop is called, this.containers is cleared. Assume that not all containers have called childStopped yet. Those earlier child containers will invoke the childStopped API later. How can we know whether they belong to the active run of the ConcurrentContainer or to the run that was just stopped? So, even though stop has been called on the ConcurrentContainer, we need to keep the previous containers until a second run starts or all the child containers have really stopped.

this.stoppedContainers would be useful here.

Right. So, we must not do that this.containers.clear(); at the end of ConcurrentMessageListenerContainer.doStop(final Runnable callback, boolean normal).
Instead, reset it when we do a new ConcurrentMessageListenerContainer.start().
This way we would always keep track of the children until we go for a fresh start 😄
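
The approach agreed on above (keep the children after stop, reset them on the next start) can be sketched as follows. `TrackedParent` is a hypothetical simplified model; the field names `containers` and `startedContainers` mirror the ones mentioned in the review, but the logic is an assumption and not the merged implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical simplified model; not the spring-kafka implementation.
class TrackedParent {

    final List<Object> containers = new ArrayList<>();

    final AtomicInteger startedContainers = new AtomicInteger();

    private boolean running;

    // A fresh start discards the children of the previous run.
    synchronized void start(int concurrency) {
        if (!this.running) {
            this.containers.clear();
            this.startedContainers.set(0);
            for (int i = 0; i < concurrency; i++) {
                this.containers.add(new Object());
                this.startedContainers.incrementAndGet();
            }
            this.running = true;
        }
    }

    // stop() keeps the children so that late callbacks can still be matched.
    synchronized void stop() {
        this.running = false;
    }

    // Called by a child once it has really exited; unknown children are ignored.
    synchronized void childStopped(Object child) {
        if (this.containers.contains(child)) {
            this.startedContainers.decrementAndGet();
        }
    }

    synchronized boolean isChildRunning() {
        return this.startedContainers.get() > 0;
    }
}
```

Because the list survives stop(), a childStopped callback that arrives late is still attributed to the right run, while a callback from a child of an older run is ignored after the next start. The sketch assumes each child reports childStopped at most once.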

This commit would fix the following issues.
1) 'isChildRunning' API would return false only after all the containers are actually stopped.
2) ConcurrentContainer `start` would be permitted only after the running status of all the containers is false.
3) ConcurrentContainer `stop` would be permitted if the container is in running status or if the `stop` API has not been called previously.
4) Move the logic that verifies whether to permit the `stop` call to KafkaMessageListenerContainer and ConcurrentMessageListenerContainer.
5) Add 'stopAbnormally' in a Lock.
6) Set the ConcurrentContainer running status to true after `childStarted`.
7) Set the ConcurrentContainer running status to false after `childStopped`.
8) Call `childStarted` in ConcurrentContainer from KafkaMessageListenerContainer right before publishing ConsumerStartedEvent.

This commit would fix the issue of when exactly the ConcurrentContainer has to be stopped. As per the earlier logic, the running status was not set to false when any of the containers stopped. This was not correct, so the logic has been modified to set the running status to false even if only one of the containers is stopped. It is therefore sufficient to call the stop API directly on the parent container, which internally checks whether all the containers are stopped and executes the callback accordingly.

As per the review comments, this commit reverts the changes related to the ConcurrentContainer `running` status. Summary of all changes in this PR.
1) 'isChildRunning' API would return false only after all the containers are actually stopped.
2) Add 'stopAbnormally' in a Lock.
3) Call `childStarted` in ConcurrentContainer from KafkaMessageListenerContainer right before publishing ConsumerStartedEvent.

artembilan · 2024-08-15T18:06:42Z

...fka/src/main/java/org/springframework/kafka/listener/ConcurrentMessageListenerContainer.java

+		this.lifecycleLock.lock();
+		try {
+			if (this.containers.contains(child) || this.stoppedContainers.contains(child)) {
+				this.startedContainers.incrementAndGet();


Right. So, we must not do that this.containers.clear(); at the end of ConcurrentMessageListenerContainer.doStop(final Runnable callback, boolean normal).
Instead, reset it when we do a new ConcurrentMessageListenerContainer.start().
This way we would always keep track of the children until we go for a fresh start 😄

This commit changes the time at which the childContainers are cleared. Earlier, childContainers were cleared during the `stop` call. After this change, childContainers are cleared only during the next `start` call.

This commit includes the following changes.
1) Clear all the containers after all the child containers have stopped. The previous commit cleared them only during a fresh start.
2) Publish `ConcurrentContainerStoppedEvent` when the ConcurrentContainer and all the child containers are stopped. Previously, `ConcurrentContainerStoppedEvent` was emitted when all the containers were stopped.

LokeshAlamuri · 2024-08-19T14:08:04Z

I have updated the PR as per our discussion. Please review and give your comments.

artembilan · 2024-08-19T19:08:08Z

Thanks for the great contribution @LokeshAlamuri; looking forward to more!

We will try to back-port the fix to those supported versions...

artembilan · 2024-08-19T19:17:10Z

OK. We cannot back-port it to previous versions.
Apparently there is still something missing there.
So, leaving it only in main.

Feel free to contribute PRs for those supported versions.
However, they will only make it into the next release: we are doing the current one today.

Hope the problem you are trying to fix is not that critical if we deal with the container lifecycle properly.

LokeshAlamuri mentioned this pull request Aug 5, 2024

ConcurrentMessageListenerContainer isChildRunning API is returning false even though active MessageListenerContainer instances are processing messages. #3338

Closed

artembilan requested changes Aug 6, 2024

artembilan requested changes Aug 14, 2024

LokeshAlamuri added 7 commits August 15, 2024 22:45

Fix failed Junit

9e265bc

Reformat Junit

0677fd4

Reformat Junit

692b354

Revert changes in ContainerGroupSequencer

f6d8c4b

LokeshAlamuri force-pushed the GH-3338 branch from 0f1dc0c to f6d8c4b Compare August 15, 2024 17:24

artembilan requested changes Aug 15, 2024

LokeshAlamuri added 2 commits August 16, 2024 16:17

Clear previous childContainers during Start call

0ad1da8

artembilan merged commit ee7779c into spring-projects:main Aug 19, 2024
3 checks passed
