Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition in agent Traceflow controller #5954

Merged
merged 1 commit into from
Feb 2, 2024

Conversation

tnqn
Copy link
Member

@tnqn tnqn commented Feb 1, 2024

It may happen that a Traceflow is assigned with a tag that was just released from an old Traceflow but the controller hasn't processed the deletion event of the old Traceflow yet. Previously the controller skipped starting new Traceflow if the tag was already being used, which caused the Traceflow to timeout.

The commit adds a check when determining whether it should start a Traceflow. If the tag is associated with another Traceflow, it will clean it up then start a new trace for the current one.

It also fixes a bug in cleanupTraceflow, which might uninstall flows for another Traceflow if the tag is reassigned.

Fixes #5760
Fixes #5609

@tnqn tnqn added area/ops/traceflow Issues or PRs related to the Traceflow feature action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes. labels Feb 1, 2024
hangyan
hangyan previously approved these changes Feb 1, 2024
antoninbas
antoninbas previously approved these changes Feb 1, 2024
Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

c.runningTraceflowsMutex.Unlock()
// This may happen if a Traceflow is assigned with a tag that was just released from an old Traceflow but
// the agent hasn't processed the deletion event of the old Traceflow yet.
if ok && tfState.name != traceflowName {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we see a different TF with the same name before we process the deletion?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible in theory. If we create a TF, delete it, then recreate a new one with the same name. It's possible that the same tag is assigned to it. If agent processes the creation of the 2nd TF first, it will encounter this.
But this shouldn't happen with antctl which appends random string to the name.
To totally prevent it, we should use UID to identify TF.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might misunderstand your question. Now I feel you are asking in which case we will see the current bug happen?
It's due to the concurrent workers: we will receive the deletion of old TF before the creation of new TF. But after they are dispatched to workers, one worker may run faster than another. Then we end of with processing creation first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, you understood the question correctly.
Yes I was thinking we could store the UID in the Traceflow "state", and use something like tfState.UID != tf.UID here. We could probably keep indexing the Traceflow state map on the name. Would more changes be required?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be done, fixed

gran-vmv
gran-vmv previously approved these changes Feb 2, 2024
luolanzone
luolanzone previously approved these changes Feb 2, 2024
Copy link
Contributor

@luolanzone luolanzone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a nit.

pkg/agent/controller/traceflow/traceflow_controller.go Outdated Show resolved Hide resolved
It may happen that a Traceflow is assigned with a tag that was just
released from an old Traceflow but the controller hasn't processed the
deletion event of the old Traceflow yet. Previously the controller
skipped starting new Traceflow if the tag was already being used, which
caused the Traceflow to timeout.

The commit adds a check when determining whether it should start a
Traceflow. If the tag is associated with another Traceflow, it will
clean it up then start a new trace for the current one.

It also fixes a bug in cleanupTraceflow, which might uninstall flows for
another Traceflow if the tag is reassigned.

Signed-off-by: Quan Tian <qtian@vmware.com>
Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tnqn
Copy link
Member Author

tnqn commented Feb 2, 2024

/skip-all

1 similar comment
@tnqn
Copy link
Member Author

tnqn commented Feb 2, 2024

/skip-all

@tnqn tnqn merged commit ebb61fb into antrea-io:main Feb 2, 2024
47 of 48 checks passed
@tnqn tnqn deleted the fix-traceflow-race branch February 2, 2024 04:42
tnqn added a commit to tnqn/antrea that referenced this pull request Mar 6, 2024
It may happen that a Traceflow is assigned with a tag that was just
released from an old Traceflow but the controller hasn't processed the
deletion event of the old Traceflow yet. Previously the controller
skipped starting new Traceflow if the tag was already being used, which
caused the Traceflow to timeout.

The commit adds a check when determining whether it should start a
Traceflow. If the tag is associated with another Traceflow, it will
clean it up then start a new trace for the current one.

It also fixes a bug in cleanupTraceflow, which might uninstall flows for
another Traceflow if the tag is reassigned.

Signed-off-by: Quan Tian <qtian@vmware.com>
tnqn added a commit to tnqn/antrea that referenced this pull request Mar 6, 2024
It may happen that a Traceflow is assigned with a tag that was just
released from an old Traceflow but the controller hasn't processed the
deletion event of the old Traceflow yet. Previously the controller
skipped starting new Traceflow if the tag was already being used, which
caused the Traceflow to timeout.

The commit adds a check when determining whether it should start a
Traceflow. If the tag is associated with another Traceflow, it will
clean it up then start a new trace for the current one.

It also fixes a bug in cleanupTraceflow, which might uninstall flows for
another Traceflow if the tag is reassigned.

Signed-off-by: Quan Tian <qtian@vmware.com>
tnqn added a commit to tnqn/antrea that referenced this pull request Mar 25, 2024
It may happen that a Traceflow is assigned with a tag that was just
released from an old Traceflow but the controller hasn't processed the
deletion event of the old Traceflow yet. Previously the controller
skipped starting new Traceflow if the tag was already being used, which
caused the Traceflow to timeout.

The commit adds a check when determining whether it should start a
Traceflow. If the tag is associated with another Traceflow, it will
clean it up then start a new trace for the current one.

It also fixes a bug in cleanupTraceflow, which might uninstall flows for
another Traceflow if the tag is reassigned.

Signed-off-by: Quan Tian <qtian@vmware.com>
tnqn added a commit that referenced this pull request Mar 26, 2024
It may happen that a Traceflow is assigned with a tag that was just
released from an old Traceflow but the controller hasn't processed the
deletion event of the old Traceflow yet. Previously the controller
skipped starting new Traceflow if the tag was already being used, which
caused the Traceflow to timeout.

The commit adds a check when determining whether it should start a
Traceflow. If the tag is associated with another Traceflow, it will
clean it up then start a new trace for the current one.

It also fixes a bug in cleanupTraceflow, which might uninstall flows for
another Traceflow if the tag is reassigned.

Signed-off-by: Quan Tian <qtian@vmware.com>
@tnqn tnqn added the kind/bug Categorizes issue or PR as related to a bug. label Jul 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes. area/ops/traceflow Issues or PRs related to the Traceflow feature kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
5 participants