Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eliminate redundant dial mutex causing unbounded connection queue contention #3088

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

LINKIWI
Copy link
Contributor

@LINKIWI LINKIWI commented Aug 10, 2024

The immediate symptom this PR attempts to address: during periods of transient server connectivity errors, go-redis commands time out after upwards of 60s (or more), even though the socket read/write timeouts are 3s and the context timeout on the commands is 1s. We have root caused this bug to lock contention in the connection pool's lazy dialer.

There is a mutual exclusion lock in DialHook, which allows only one server dial to occur at the same time. In the event of server connectivity errors, this causes unbounded connection queueing under highly concurrent workloads.

Consider, for example, a concurrent workload with the default dial timeout of 5s, and an unresponsive server endpoint. During this period, all dials are timing out.

  1. The connection pool is empty, or all connections are currently occupied by in-flight I/O.
  2. N commands are executed concurrently, all of which need to acquire new connections.
  3. All N commands attempt to add a connection to the pool, which lazily calls DialHook.
  4. The first connection attempt acquires the mutex, and times out after 5s.
  5. The second connection attempt is blocked on mutex acquisition, and after acquiring the lock, itself also times out after 5s. The total wall clock time that the second command was blocked is now 10s.
  6. ...
  7. This results in a cascading failure mode where, under this scenario, individual commands can occupy multiple minutes of wall clock time due to lock contention.

DialHook itself does not mutate any state in the hooksMixin. I believe the mutex is redundant, and can be eliminated. The original commit that introduced the lock attempts to fix a race condition encountered in AddHook. The mutex that guards chain should be sufficient for this purpose.

We have validated that this change fixes the unbounded queueing, and prevents the system from entering a prolonged state of not serving any useful throughput, during these periods.

I have also added a unit test to capture the regression. Without this patch, the unit test correctly fails; Ping takes successively longer on each invocation (1s, 2s, 3s, etc.).

@LINKIWI
Copy link
Contributor Author

LINKIWI commented Aug 24, 2024

@monkey92t Are you the right person to review this? Thanks.

@monkey92t
Copy link
Collaborator

@monkey92t Are you the right person to review this? Thanks.

Sorry, I don't have the relevant permission. ping @ofekshenawa ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants