Fix ConcurrentContainer lifecycle issues #3406

Merged: 9 commits, Aug 19, 2024


@LokeshAlamuri LokeshAlamuri commented Aug 3, 2024

This PR fixes the following ConcurrentContainer lifecycle issues.

  1. The 'isChildRunning' API returns false only after all the containers have actually stopped.
  2. 'stopAbnormally' is called under a lock.
  3. 'childStarted' in ConcurrentContainer is called from KafkaMessageListenerContainer right before publishing ConsumerStartedEvent.

Member

@artembilan artembilan left a comment

> ConcurrentContainer start would be permitted only after all the containers running status is false.

Would you mind revising the logic so that these operations are idempotent?
So, if start() has been called before, that does not mean that we cannot call it again.
Same for stop().

If you think about these lifecycle hooks as idempotent operations, the logic would probably be much simpler. Or the problem will go away altogether.

I am also curious how this fix relates to your earlier fenced one.
At a glance they contradict each other.

Thanks
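
A minimal sketch of what idempotent lifecycle hooks could look like. The class `IdempotentLifecycleSketch` and its counters are hypothetical and only illustrate the idea; this is not how Spring Kafka implements its containers.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; not the Spring Kafka implementation.
class IdempotentLifecycleSketch {

    private final AtomicBoolean running = new AtomicBoolean();

    // Count how often real start/stop work happened (for demonstration only).
    final AtomicInteger startCount = new AtomicInteger();
    final AtomicInteger stopCount = new AtomicInteger();

    void start() {
        // compareAndSet turns repeated start() calls into no-ops.
        if (this.running.compareAndSet(false, true)) {
            this.startCount.incrementAndGet();
        }
    }

    void stop() {
        // Likewise, stop() on an already stopped component does nothing.
        if (this.running.compareAndSet(true, false)) {
            this.stopCount.incrementAndGet();
        }
    }

    boolean isRunning() {
        return this.running.get();
    }
}
```

With this shape, callers do not need to know whether `start()` or `stop()` was called before; calling either twice in a row is harmless.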

@@ -277,6 +277,10 @@ public boolean isRunning() {
        return this.running;
    }

    protected boolean canStop() {
        return this.running;
Member

Why is isRunning() not enough?
It is totally OK to have extra logic in the overridden method in ConcurrentMessageListenerContainer.

Contributor Author

Assume concurrency == 2.
Cmain -- concurrent container
C0, C1 -- child containers

  1. start the concurrent container.

    Cmain -- running
    C0 -- running
    C1 -- running

  2. stop container C0 manually.

    Cmain -- not running
    C0 -- not running
    C1 -- running

  3. stop 'Cmain' container.

    As per the earlier condition, the concurrent container's running status is false, so it will not be stopped.

    A new condition has to be added to verify whether the child containers have really been cleared. This is equivalent to checking whether stop was called previously.

Member

> stop container C0 manually.
>
> Cmain -- not running
> C0 -- not running
> C1 -- running

But this situation is not correct.
Cmain has to be running as long as any of its children is running.

Contributor Author

But as per the API definition in the interface org.springframework.context.Lifecycle:

> Check whether this component is currently running.
> In the case of a container, this will return true only if all components that apply are currently running

Contributor Author

> The Cmain has to be running until any of its children is running.

The Cmain ConcurrentContainer is not stopped. Only its running status will be set to false, to indicate that one or all of the containers are stopped, as per the API definition.

Member

The problem with returning false when just one child container is stopped is that the rest are still running, and we cannot tell that from the parent container since, as you said, it has to return false.
If that is what you want to implement here, then it is not OK to allow stopping any child container individually.
If we allow it (and I don't see why not), then false for the parent container would lead to a resource leak in the next start() call.
Note that we don't stop any running containers over there, but just create new instances.
Also note that this.startedContainers.set(0); is wrong here, because we don't take into account the containers that are running at that moment.
And all of that just because we decided to return false for the situation when at least one child container is stopped.
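
The leak described above can be modelled with a small sketch. Everything here is hypothetical (`LeakScenarioSketch`, `Child`, `everCreated`); it assumes the semantics under discussion, where the parent reports running only if all children are running and `start()` simply creates new children.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the scenario; names are illustrative only.
class LeakScenarioSketch {

    static class Child {
        boolean running = true;
    }

    final int concurrency;
    List<Child> containers = new ArrayList<>();
    // Every child ever created, so children dropped from 'containers' can still be counted.
    final List<Child> everCreated = new ArrayList<>();

    LeakScenarioSketch(int concurrency) {
        this.concurrency = concurrency;
    }

    // Assumed semantics: the parent is "running" only if all children are running.
    boolean isRunning() {
        return !this.containers.isEmpty() && this.containers.stream().allMatch(c -> c.running);
    }

    // A start() that only checks the parent's status and creates fresh children,
    // without stopping children left over from the previous run.
    void start() {
        if (!isRunning()) {
            this.containers = new ArrayList<>();
            for (int i = 0; i < this.concurrency; i++) {
                Child child = new Child();
                this.containers.add(child);
                this.everCreated.add(child);
            }
        }
    }

    long liveChildren() {
        return this.everCreated.stream().filter(c -> c.running).count();
    }
}
```

In this model, starting with concurrency 2, stopping one child manually, and calling `start()` again leaves three live children: the untouched old child plus two new ones.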

Contributor Author

> The problem with false just for one child container that the rest are still running and we don't know about that from parent container since, as you said, it has to return false.

If it returns false, that indicates one, several, or all of the containers are stopped. We can find out which containers are running by iterating over the containers.

Contributor Author

> If that is what you want to implement here, then it is not OK to allow to stop any child container individually.
> If we allow (and I don't see why not), then false for parent container would lead to the resource leak in the next start() call.

As per the existing logic, running containers are stopped if stop is called on the ConcurrentContainer. There is no resource leak here. Please advise.

Contributor Author

> Plus pay attention that this.startedContainers.set(0); is wrong here, because we don't take into account those running containers at the moment.

this.startedContainers.set(0) indicates that initially no child container has started. The counter is incremented when a child starts and decremented when it stops.

Contributor Author

@LokeshAlamuri LokeshAlamuri Aug 9, 2024

> The problem with false just for one child container that the rest are still running and we don't know about that from parent container since, as you said, it has to return false.

As mentioned earlier, I am fine with implementing this as per your comments:

Set the running status to false only if all the containers are stopped; otherwise, set it to true.

I will verify once more and confirm this. I wanted to share my views first, and I am looking for your final review of my earlier comments before I implement it as you suggested.

            return true;
        }
    }
    if ((!isRunning() && this.startedContainers.get() > 0)) {
Member

I don't understand this extra logic.
The purpose of isChildRunning() is to report whether any child container is active at the moment.
Why do we need to check !this.isRunning() and then the number of started containers?
Didn't we discuss with you in the other PR (#3377) that it is abnormal to have a child container running while its parent is stopped?

Contributor Author

The condition on line 216 is the critical logic. The running status of the ConcurrentContainer is set to false under the following conditions:

  1. The ConcurrentContainer itself is stopped manually.
  2. One of the child containers is stopped for any reason.

The startedContainers count is decremented only when the actual container exits. So, I check whether the ConcurrentContainer is stopped while some child containers are still processing messages. In this case, it should return true.
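
The counter-based idea can be sketched as follows. This is a simplified, hypothetical model (`ChildTrackingSketch` is not a Spring Kafka class): stopping the parent only flips its flag, while `isChildRunning()` stays true until every child has reported that it actually exited.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model; the real container logic is more involved.
class ChildTrackingSketch {

    private final AtomicInteger startedContainers = new AtomicInteger();
    private volatile boolean running;

    void start() {
        this.running = true;
    }

    // Only flips the flag; children are assumed to exit asynchronously later.
    void stop() {
        this.running = false;
    }

    // Called by a child once its consumer has really started.
    void childStarted() {
        this.startedContainers.incrementAndGet();
    }

    // Called by a child once its consumer has really exited.
    void childStopped() {
        this.startedContainers.decrementAndGet();
    }

    boolean isRunning() {
        return this.running;
    }

    // True while at least one child has started and not yet exited.
    boolean isChildRunning() {
        return this.startedContainers.get() > 0;
    }
}
```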

@@ -235,7 +243,7 @@ public boolean isChildRunning() {
      */
     @Override
     protected void doStart() {
-        if (!isRunning()) {
+        if (!isRunning() && this.containers.stream().allMatch(container -> !container.isRunning())) {
Member

This again sounds like a contradiction to what we have with the fenced child container.
Why would a child be running if its parent is stopped?

Contributor Author

This condition prevents a start call on the ConcurrentContainer before the child containers are stopped. It verifies that all the containers are really stopped.

This is similar to the earlier logic: if the container is in running status, subsequent start calls are ignored.

Earlier, the running status of the concurrent container was not set to false even if one of the containers was stopped. I have changed this so that the running status of the concurrent container is set to false even if only one of the containers is stopped.

In this case, the condition that verifies whether the concurrent container is allowed to start needs to change. Start is allowed only if the running status of all the containers is 'false'.

Member

> In this case, condition to verify if concurrent is allowed to start needs to be changed. If all the containers running status is set to 'false' then only run is allowed.

That's also not how I think about this concurrent container logic.
It is in the running state when any of its child containers is running.
When we start the concurrent container, it is supposed to start all of its children.
If some of them are already running, the call is idempotent.
As we stated before: we just don't allow orphaned child containers to be restarted.
Probably a much more robust solution is to stop all the children when we call start on their parent.
This way any new start would give us a fresh state. Kind of a total renewal.
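
A rough sketch of the "fresh state on every start" suggestion, using a hypothetical `FreshStartSketch` model rather than the actual container code: `start()` first stops whatever is left from the previous run and only then creates new children.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model; names are illustrative and not taken from Spring Kafka.
class FreshStartSketch {

    static class Child {
        boolean running = true;
    }

    private final int concurrency;
    final List<Child> containers = new ArrayList<>();
    // Every child ever created, to check that nothing keeps running unnoticed.
    final List<Child> everCreated = new ArrayList<>();

    FreshStartSketch(int concurrency) {
        this.concurrency = concurrency;
    }

    void start() {
        // Stop whatever is left from the previous run before creating new children.
        for (Child child : this.containers) {
            child.running = false;
        }
        this.containers.clear();
        for (int i = 0; i < this.concurrency; i++) {
            Child child = new Child();
            this.containers.add(child);
            this.everCreated.add(child);
        }
    }

    long liveChildren() {
        return this.everCreated.stream().filter(c -> c.running).count();
    }
}
```

Under this assumption, the number of live children after any `start()` equals the configured concurrency, regardless of which children were stopped manually beforehand.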

Contributor Author

> When we start concurrent container, it suppose to start all of its children.
> If some of them are running already, then idempotent.
> As we state before: we just don't allow orphaned child containers to be restarted.
> Probably much robust solution is to stop all the children when we call start of their parent.
> This way any new start would give us a fresh state. Kinda total renew.

This is already happening. I have not modified anything here.

Contributor Author

@LokeshAlamuri LokeshAlamuri Aug 7, 2024

> That's also not what I think about this concurrent container logic.
> It is in running state when any of its child containers running.

I have set the running status to false as per the API definition.

Contributor Author

> I still believe that we have to return true if any of the children are running.

I am fine with this.
One question, though: how do you suggest finding out from the API whether any of the containers is stopped? Do we need to get all the containers and check whether any of them is stopped?

Contributor Author

> Or leave it as is, because we just don't change concurrent running state even if we stop all its children manually.

I think here we must set the running status to false if all the containers are stopped. Let us not leave it as it is.

Member

Well, that's start() of the concurrent container.
I think the best solution would be to stop all the currently running children and go ahead with the rest of the logic, where we really re-create the child containers.

I might agree that we can move the concurrent container to the not running state if all of its children are stopped,
provided we have an API to notify via thisOrParentContainer that a child is stopped.

> Do we need to get all the containers and verify if any of it is stopped.

Why do we need to check if they are stopped?
I think iterating over this.containers and checking isRunning() is enough.
Is isChildRunning() OK?

Contributor Author

> Well, that's start() of the concurrent container.
> I think the best solution would be to stop all the currently running and go ahead with the rest of the logic where we really re-create child container.

As per the current code:

The start API starts new child containers.
The stop API stops any running containers.

I feel this is correct. Let us not change anything regarding this. I have added new conditions to verify whether the ConcurrentContainer is really stopped.

As per the earlier code, a second start is ignored. An application developer can stop and then start the ConcurrentContainer directly if they plan to run it again. However, it is good practice to wait until the isChildRunning API returns false, which indicates that all the resources are released.
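
The "wait until isChildRunning returns false" practice could be sketched as a small polling helper. `AwaitChildrenStoppedSketch` and `awaitStopped` are hypothetical names, and the timeout and poll interval are arbitrary assumptions; the check itself is passed in as a `BooleanSupplier`.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper; not part of Spring Kafka.
class AwaitChildrenStoppedSketch {

    // Polls the supplied check until it reports false or the timeout elapses.
    // Returns true if the children stopped in time, false on timeout or interrupt.
    static boolean awaitStopped(BooleanSupplier isChildRunning, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (isChildRunning.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            }
            catch (InterruptedException ex) {
                // Preserve the interrupt status for the caller.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

With a real container, the supplier would presumably be something like `container::isChildRunning`, called after `stop()` and before the next `start()`.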

Contributor Author

> > Do we need to get all the containers and verify if any of it is stopped.
>
> Why do we need to check if they are stopped? I think iteration over this.containers and checking isRunning() is enough. Is isChildRunning() OK?

The isChildRunning API returns false only when all message processing has actually stopped. Otherwise, it always returns true. This expected and requested behavior is really useful, since it prevents doubling the capacity requirements on the Kafka cluster as well as on the JVM.

App developers need to iterate over the containers to check whether any of them is stopped. As per your suggestion, the ConcurrentContainer running status is set to false only when all the containers are stopped.

@@ -152,24 +152,22 @@ public synchronized void onApplicationEvent(ListenerContainerIdleEvent event) {
    }

    private synchronized void stopParentAndCheckGroup(MessageListenerContainer parent) {
        if (parent.isRunning()) {
Member

Why did you remove this condition?

Contributor Author

@LokeshAlamuri LokeshAlamuri Aug 7, 2024

I removed the condition that checks whether running is true, because the status of the ConcurrentContainer would be set to false if one of the containers is stopped. Earlier, it worked because the running status of the ConcurrentContainer was not set to false when any of the containers was stopped, and it was not set to true after it was started.


@Override
protected Consumer<Integer, String> createKafkaConsumer(String groupId, String clientIdPrefix,
String clientIdSuffixArg, Properties properties) {
Member

This code formatting does not look OK.
Can you please revise it so that it reads well in review?

@artembilan
Member

@LokeshAlamuri,

let's step back and try to understand what the problem actually is.

Would you mind explaining the reasoning behind your work?
I thought you had implemented fenced earlier for the situation where a child container can be started while its parent is stopped.
So, that means that we are able to stop any particular child container, and I don't see a reason not to allow that.
But I feel like this is a completely different story and that is not what you are trying to achieve here.
However, I still cannot fully understand what you are doing and why.

Thanks

@LokeshAlamuri
Contributor Author

> I thought you have implemented fenced before for the situation when child container can be started when its parent is stopped.
> So, that mean that we are able to stop any particular child container and I don't see a reason why don't allow to that.

The fenced child container issue is clear and straightforward. Assume the ConcurrentContainer is stopped. If we still hold a reference to a child container, it is still possible to start that child container. This is not correct, because the child holds a reference to the ConcurrentContainer, so starting it effectively corrupts the ConcurrentContainer. I have provided a JUnit test that shows how this scenario can be reproduced.
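
The fencing idea can be illustrated with a toy model. `FencedChildSketch` is hypothetical; in particular, throwing `IllegalStateException` from `start()` is an assumption made for this sketch and not a statement about how Spring Kafka reacts to a fenced child.

```java
// Hypothetical model of a "fenced" child; not the Spring Kafka implementation.
class FencedChildSketch {

    static class Child {

        private boolean fenced;
        private boolean running;

        void setFenced(boolean fenced) {
            this.fenced = fenced;
        }

        void start() {
            // Assumption: a fenced child rejects restarts instead of silently ignoring them.
            if (this.fenced) {
                throw new IllegalStateException("Child was fenced by its stopped parent");
            }
            this.running = true;
        }

        void stop() {
            this.running = false;
        }

        boolean isRunning() {
            return this.running;
        }
    }

    // Parent stop: stop the child and fence it, so a leftover reference cannot restart it.
    static void stopParent(Child child) {
        child.stop();
        child.setFenced(true);
    }
}
```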

@LokeshAlamuri
Contributor Author

> But I feel like this is fully different story and that is not what you are trying to achieve here.
> However I still cannot fully understand what you are doing and why.

I am trying to fix the ConcurrentContainer lifecycle issues here, especially the isChildRunning API. As discussed earlier, the isChildRunning API should return false only when all the containers are stopped. This is the most critical thing for any application: it indicates exactly when the spring-kafka component is completely stopped.

@artembilan
Member

> It is nothing but corrupting the ConcurrentContainer.

How does starting it corrupt the ConcurrentContainer if we don't stop the ConcurrentContainer?
You can choose to stop some child container for any reason and then start it again whenever you need.

> It indicates, when exactly the spring-kafka component is completely stopped.

Right. And that's what I mean by having the ConcurrentContainer running as long as at least one of its child containers is running.
Or it feels even more natural to me to have the ConcurrentContainer always running until its stop() is explicitly called.
This way we still maintain the same set of this.containers, and it's up to the target application to decide what to do with every child container.
Why do you find this logic wrong?

@LokeshAlamuri
Contributor Author

LokeshAlamuri commented Aug 9, 2024

> > It indicates, when exactly the spring-kafka component is completely stopped.
>
> Right. And that's what I mean with having ConcurrentContainer running until at least one of its child container is running. Or I even feel more natural to have ConcurrentContainer always running until its stop() explicitly called. This way we still maintain the same set of this.containers and its up to target application to decide what to do with every child container. Why do you find this logic as wrong?

Looks good to me. There is only one issue: if we keep the running status of the ConcurrentContainer true even after one or all of the containers are stopped, then from the API perspective, anyone who wants to monitor the system can no longer rely on the running status and always has to iterate over the containers to see whether everything is functioning properly.

But if we set the running status of the ConcurrentContainer to false when any of the containers fails, then from the API perspective it signals that something is wrong, and they can add logic to handle the situation. A running status of true indicates that all is good. In this model, the running flag is really useful.

@LokeshAlamuri
Contributor Author

> > It is nothing but corrupting the ConcurrentContainer.
>
> How does it corrupt it with its start if we don't stop ConcurrentContainer? You can chose some child container to be stopped for any reason and then start it back when ever you need.

As mentioned in my previous comment, I am not trying to stop the ConcurrentContainer. I am only trying to prevent the fenced containers from starting again, since they hold a reference to the ConcurrentContainer. That is the fix we made for issue #3371.

@artembilan
Member

> set the running status of the ConcurrentContainer to false, if any of the container fails

Right, but then it is a chicken-and-egg problem.
I see a situation where other child containers are still running, yet the whole ConcurrentContainer says that nothing is running, so it is OK to call its start, and we end up in a leak situation where the ConcurrentContainer spawns new containers without stopping the others.
Even if we fix this by refreshing the whole state in start(), I still feel like this is overhead and a pointless number of operations.
With running controlled only by the ConcurrentContainer itself, it is OK to deal with each of its children individually.
And we really refresh the state when we restart the whole ConcurrentContainer explicitly.
The combination of isRunning() and isChildRunning() should cover all the possible situations and simplify target logic.
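
One way to read the four combinations of the two flags is sketched below. The helper `LifecycleStateSketch.describe` and its wording are hypothetical; they only summarise the interpretation discussed in this thread, where `running` is controlled by the parent alone.

```java
// Hypothetical helper summarising the two-flag model discussed in this thread.
class LifecycleStateSketch {

    static String describe(boolean running, boolean childRunning) {
        if (running && childRunning) {
            return "started, at least one child active";
        }
        if (running) {
            return "started, but no child is currently active";
        }
        if (childRunning) {
            return "stop requested, children still shutting down";
        }
        return "fully stopped";
    }
}
```

Under this reading, an application that wants to restart safely would wait for the "fully stopped" combination.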

@LokeshAlamuri
Contributor Author

> > set the running status of the ConcurrentContainer to false, if any of the container fails
>
> Right, but then it is chicken-egg problem. I see the situation when other child containers are still running, however the whole ConcurrentContainer says that nothing is running, so it is OK to call its run and we are in a leakage situation where ConcurrentContainer spawns new containers without stopping others. Even if we fix this via refresh the whole state in the start, I still feel like this is an overhead and pointless number of operations. With running controlled just only by the ConcurrentContainer, then it is OK to deal with every its children individually. And we really refresh the state when we restart the whole ConcurrentContainer explicitly. The combination of isRunning() and isChildRunning() should cover all the possible situations and simplify target logic.

OK, then I will keep the running status logic as before and revert the corresponding changes in my PR. Once done, we can discuss further. Thank you.

@LokeshAlamuri
Contributor Author

Gentle reminder. I have updated the PR as per our discussion.

this.lifecycleLock.lock();
try {
    if (this.containers.contains(child) || this.stoppedContainers.contains(child)) {
        this.startedContainers.incrementAndGet();
Member

This logic somehow does not compile in my head.
Why do we need to track those stoppedContainers at all?
It does not look like that brings any benefit over the setFenced() introduced before.

In my opinion, the this.containers collection is entirely sufficient to track the whole lifecycle of the children if their start/stop is done manually.
Well, I still see a performance benefit in the this.startedContainers counter, but this stoppedContainers smells like overhead.

Contributor Author

When stop is called on the container, this.containers is cleared. Assume that not all containers have called childStopped yet. Those earlier child containers will invoke the childStopped API later. How can we know whether these child containers belong to the active run of the ConcurrentContainer or to the run that was just stopped? So, even though stop is called on the ConcurrentContainer, we need to keep the previous containers until a second run starts or all the child containers are really stopped.

this.stoppedContainers would be useful here.

Member

Right. So, we must not call this.containers.clear(); at the end of ConcurrentMessageListenerContainer.doStop(final Runnable callback, boolean normal).
Instead, we reset it when we do a new ConcurrentMessageListenerContainer.start().
This way we would always keep track of the children until we go for a fresh start 😄
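
A simplified sketch of that suggestion, under the assumption that children report back through a `childStopped` callback. `RetainChildrenSketch` is a hypothetical model: `stop()` keeps the child list, so late callbacks from the stopped run are still recognised, and only the next `start()` resets it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model; not the actual ConcurrentMessageListenerContainer code.
class RetainChildrenSketch {

    static class Child {
    }

    private final List<Child> containers = new ArrayList<>();
    private final AtomicInteger startedContainers = new AtomicInteger();

    // A fresh start is the only place where the child list is reset.
    synchronized List<Child> start(int concurrency) {
        this.containers.clear();
        this.startedContainers.set(0);
        for (int i = 0; i < concurrency; i++) {
            this.containers.add(new Child());
            this.startedContainers.incrementAndGet();
        }
        return new ArrayList<>(this.containers);
    }

    // Deliberately does not clear 'containers'; children exit asynchronously.
    synchronized void stop() {
        // Signalling the children to stop is omitted in this sketch.
    }

    // Returns false for callbacks from children of an earlier, already replaced run.
    synchronized boolean childStopped(Child child) {
        if (this.containers.contains(child)) {
            this.startedContainers.decrementAndGet();
            return true;
        }
        return false;
    }

    synchronized boolean isChildRunning() {
        return this.startedContainers.get() > 0;
    }
}
```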

This commit would fix the following issues.

1) 'isChildRunning' API would return false only after all the containers are actually stopped.
2) ConcurrentContainer `start` would be permitted only after the running status of all the containers is false.
3) ConcurrentContainer `stop` would be permitted if the container is in running status or if the `stop` API was not called previously.
4) Move the logic that decides whether to permit the `stop` call to KafkaMessageListenerContainer and ConcurrentMessageListenerContainer.
5) Call 'stopAbnormally' under a lock.
6) Set the ConcurrentContainer running status to true after `childStarted`.
7) Set the ConcurrentContainer running status to false after `childStopped`.
8) Call `childStarted` in ConcurrentContainer from KafkaMessageListenerContainer right before publishing ConsumerStartedEvent.

This commit would fix the issue of when exactly the ConcurrentContainer has to be stopped. As per the earlier logic, the running status would not be set to false if any of the containers is stopped. This is not correct, so the logic is modified to set the running status to false even if only one of the containers is stopped. So, it is sufficient to call the stop API directly on the parent container, which would internally check whether all the containers are stopped and execute the callback accordingly.

As per the review comments, this commit reverts the changes related to the ConcurrentContainer `running` status.

Summary of all changes in this PR.

1) 'isChildRunning' API would return false only after all the containers are actually stopped.
2) Call 'stopAbnormally' under a lock.
3) Call `childStarted` in ConcurrentContainer from KafkaMessageListenerContainer right before publishing ConsumerStartedEvent.

This commit would change the time at which the childContainers are cleared. Earlier,
childContainers were cleared during the `stop` call. After this change, childContainers
would be cleared only during the next `start` call.

This commit would include the following changes.

1) Clear all the containers after all the child containers have stopped. The previous commit cleared them only during a fresh start.
2) Publish `ConcurrentContainerStoppedEvent` when the ConcurrentContainer and all the child containers are stopped. Previously, `ConcurrentContainerStoppedEvent` was emitted when all the containers were stopped.
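
The event condition in the second point could be modelled roughly as follows. `StoppedEventSketch` is hypothetical, and the event is represented by a plain string rather than a real Spring application event: it is recorded only once the parent is stopped and the last child has reported that it stopped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model; the event is a plain string, not a real application event.
class StoppedEventSketch {

    private boolean running;
    private int activeChildren;
    final List<String> publishedEvents = new ArrayList<>();

    synchronized void start(int concurrency) {
        this.running = true;
        this.activeChildren = concurrency;
    }

    synchronized void stop() {
        if (this.running) {
            this.running = false;
            publishIfFullyStopped();
        }
    }

    // Called when a child has really exited.
    synchronized void childStopped() {
        this.activeChildren--;
        publishIfFullyStopped();
    }

    // Both conditions must hold: parent stopped and no child left.
    private void publishIfFullyStopped() {
        if (!this.running && this.activeChildren == 0) {
            this.publishedEvents.add("ConcurrentContainerStoppedEvent");
        }
    }
}
```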
@LokeshAlamuri
Contributor Author

I have updated the PR as per our discussion. Please review and give your comments.

@artembilan artembilan merged commit ee7779c into spring-projects:main Aug 19, 2024
3 checks passed
@artembilan
Member

Thanks for the great contribution @LokeshAlamuri; looking forward to more!

We will try to back-port the fix to those supported versions...

@artembilan
Member

OK. We cannot back-port it to previous versions.
Apparently there is still something missing there.
So, we are leaving it only in main.

Feel free to contribute PRs for those supported versions.
However, they will only make it into the next release: we are doing the current one today.

I hope the problem you are trying to fix is not that critical if we deal with the container lifecycle properly.
