add retry for dict:incr in case of post event failed #48
Current status is that we're letting go of this library, in favour of event communications over unix sockets.
@Tieske Thank you for your reply. I looked at the implementation of the new library, and it also does not retry event insertion (post_event / queue.put) to make insertion as reliable as possible. Do you intend to leave retrying failed insertions to the caller? In that case, should the relevant implementation in Kong also be adjusted?
I haven't been involved with the new lib. Maybe @chronolaw has ideas?
https://github.com/Kong/lua-resty-worker-events/blob/master/lib/resty/worker/events.lua#L291 Is this error in the log file?
If no out-of-memory error is reported, the retry code will not resolve it; it may be caused by another issue. We have also occasionally encountered inconsistent data in shared memory. Is this a bug?
Retrying makes sense to me, but I think the code is flawed. The goal is to retry both the dict:incr and the dict:add.
So can we add an exponential back-off to the retries?
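A minimal sketch of what such a back-off could look like, assuming a helper with_backoff that is not part of the library; the dict name, retry counts, and delays are all illustrative, and ngx.sleep only works in phases that allow yielding (e.g. timer context), so a real patch might need a non-sleeping fallback:

```lua
-- Hypothetical sketch: retry a shared-dict operation with exponential back-off.
-- `op` returns (result, err); `retries` and `base_delay` are illustrative
-- parameters, not existing options of lua-resty-worker-events.
local function with_backoff(op, retries, base_delay)
  local res, err
  for attempt = 0, retries do
    res, err = op()
    if res then
      return res
    end
    if attempt < retries then
      -- back off 1x, 2x, 4x, ... of base_delay between attempts;
      -- ngx.sleep yields, so drop the sleep in phases that forbid it
      ngx.sleep(base_delay * 2 ^ attempt)
    end
  end
  return nil, err
end

local dict = ngx.shared.worker_events  -- illustrative shm name

-- retry the event-id increment (the init argument needs a recent OpenResty)
local event_id, err = with_backoff(function()
  return dict:incr("events-last", 1, 0)
end, 3, 0.001)
if not event_id then
  return nil, "failed to get event id: " .. tostring(err)
end

-- retry the payload insert; a real patch would probably retry only
-- transient errors such as "no memory", as shm_retries does for dict:add
local ok, err2 = with_backoff(function()
  return dict:add("events-data:" .. tostring(event_id), "payload")
end, 3, 0.001)
if not ok then
  return nil, "failed to store event data: " .. tostring(err2)
end
```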
Hi, I have the following problem when using Kong:
Specifically, I have encountered inconsistency across workers in the balance:targets:upstream_id cache. That is, some workers' L1 cache still holds the old data, while at least one worker can read the new data. Combined with the implementation of lua-resty-mlcache's delete, this indicates that delete_shm succeeded, but the other workers never evicted the old data from their L1 layer.
The following is the function call logic:
[kong] cache:invalidate(service_key, cluster) ---> [kong] self:invalidate_local(key) ---> [lua-resty-mlcache] self.mlcache:delete(key) ---> [lua-resty-worker-events] post(channel_name, channel, data) ---> [lua-resty-worker-events] post_event(source, event, data, unique) ---> [lua-resty-worker-events] dict:incr && dict:add
I think there may be problems in the following parts:
In lua-resty-worker-events' post_event function, dict:add has a retry mechanism (shm_retries), but dict:incr lacks one: if incr fails, the error is returned directly and post_event is interrupted (see the sketch after this list for the existing add behaviour).
In my case, this means the invalidate event is never inserted into the shm, so the other workers cannot consume it and therefore cannot remove the stale data from their L1 layer.
Refer to the implementation of dict:incr: when this function is called, a lock is acquired. Under high concurrency, other requests also contend for this lock, which can cause dict:incr to fail; this interrupts post_event, so the invalidate event cannot be inserted into the shm and the other workers cannot consume it.
Implementation of dict:incr in lua-nginx-module
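For reference, a rough approximation of how the existing shm_retries loop around dict:add behaves (paraphrased, not the verbatim library code; KEY_DATA, json, and the expiry value stand in for the real locals):

```lua
-- existing behaviour (approximated): dict:add is retried while the shm
-- reports "no memory", for up to shm_retries extra attempts
local ok, err
for _ = 0, shm_retries do
  ok, err = dict:add(KEY_DATA .. tostring(event_id), json, timeout)
  if ok or err ~= "no memory" then
    break
  end
end
if not ok then
  return nil, "failed to write event data: " .. tostring(err)
end
```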
Since this function is designed to post events via the shm, shouldn't we try to maximize the success rate of the post? I think we should also give dict:incr a few retries to mitigate this, like so:
https://github.com/Kong/lua-resty-worker-events/blob/master/lib/resty/worker/events.lua#L157
before:
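The snippet was lost from this comment; reconstructed from the linked source (identifiers approximated), the current code bails out on the first incr failure:

```lua
-- current: a single attempt; any error aborts post_event
local event_id, err = dict:incr(KEY_LAST_ID, 1)
if err then
  return event_id, err
end
```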
after:
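And a sketch of the proposed change, giving dict:incr the same few retries (reusing the shm_retries count here is an assumption, not the actual patch):

```lua
-- proposed: retry dict:incr a few times before giving up, mirroring
-- the retry loop that already guards dict:add
local event_id, err
for _ = 0, shm_retries do  -- hypothetical reuse of the shm_retries option
  event_id, err = dict:incr(KEY_LAST_ID, 1)
  if event_id then
    break
  end
end
if not event_id then
  return nil, err
end
```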