
add retry for dict:incr in case of post event failed #48

Open
zjlooojoe opened this issue Jan 31, 2023 · 7 comments

Comments

@zjlooojoe

zjlooojoe commented Jan 31, 2023

Hi, I have the following problem when using Kong:

According to the logic for updating the targets of a service: within a node, on_upstream_event evicts the key balance:targets:upstream_id by executing invalidate_local("balance:targets:upstream_id") for all workers, which in turn calls mlcache:delete(key), deleting both the L2 and L1 caches. The subsequent on_upstream_event -> get_upstream_by_id -> cache.get path shows that if there is a value in L1 it is returned directly; otherwise L2 is checked, and if L2 has no value either, the L3 callback is invoked and the result is stored via set_shm_set_lru.

Specifically, I have observed inconsistency of the balance:targets:upstream_id cache across workers: some workers' L1 still holds the old data, while at least one worker reads the new data. Combined with the implementation of lua-resty-mlcache's delete, this indicates that delete_shm succeeded, but some workers failed to remove the old data from their L1 layer.

The following is the function call logic:

[kong] cache:invalidate(service_key, cluster) ---> [kong] self:invalidate_local(key) ---> [lua-resty-mlcache] self.mlcache:delete(key) ---> [lua-resty-worker-events] post(channel_name, channel, data) ---> [lua-resty-worker-events] post_event(source, event, data, unique) ---> [lua-resty-worker-events] dict:incr && dict:add
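The failure mode in the call chain above can be illustrated with a small model. This is not Kong or mlcache code: `Worker`, `l1`, `l2`, `channel`, and `poll` are illustrative stand-ins for the per-worker LRU, the shared shm, and the worker-events channel. It shows how a lost invalidation event leaves another worker's L1 stale even though the shared L2 was cleared.

```python
# Hypothetical model (not Kong/mlcache code): two workers share an L2 dict,
# each keeps a private L1 dict, and invalidations travel as posted events.
# If the event post fails, only L2 is cleared and the other worker's L1
# keeps serving stale data.

class Worker:
    def __init__(self, l2, channel):
        self.l1 = {}            # per-worker cache (mlcache L1 analogue)
        self.l2 = l2            # shared cache (shm / L2 analogue)
        self.channel = channel  # worker-events channel analogue

    def get(self, key, l3_loader):
        if key in self.l1:                # L1 hit: returned directly
            return self.l1[key]
        if key not in self.l2:            # L2 miss: call L3 and store
            self.l2[key] = l3_loader()
        self.l1[key] = self.l2[key]
        return self.l1[key]

    def delete(self, key, post_succeeds=True):
        self.l2.pop(key, None)            # delete_shm: always succeeds here
        if post_succeeds:                 # post_event may fail (dict:incr error)
            self.channel.append(key)

    def poll(self):
        while self.channel:               # consume invalidation events
            self.l1.pop(self.channel.pop(0), None)

l2, channel = {}, []
w1, w2 = Worker(l2, channel), Worker(l2, channel)
w1.get("t", lambda: "old"); w2.get("t", lambda: "old")

w1.delete("t", post_succeeds=False)       # event lost, as in the report
w2.poll()
print(w2.get("t", lambda: "new"))         # prints "old": w2's L1 is stale
```

When the post succeeds, `poll` removes the key from the other worker's L1 and the next `get` falls through to L3, which is the behavior the invalidation relies on.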

I think there may be problems in the following parts:

In lua-resty-worker-events' post_event function, dict:add has a retry mechanism (shm_retries), but dict:incr has none: if incr fails, post_event returns immediately.
In my case this means the invalidate event is never inserted into the shm, so the other workers cannot consume it and cannot remove their L1-layer data.

Referring to the implementation of dict:incr: the function acquires a lock when called, so under high concurrency other requests may be holding the lock, causing dict:incr to fail. This interrupts post_event, the invalidate event cannot be inserted into the shm, and the other workers never consume it.

Implementation of dict:incr in lua-nginx-module:
[screenshot of the dict:incr implementation omitted]

Since this function is designed to post events into the shm, shouldn't we try to ensure the post succeeds? So I think we should also give dict:incr a few retries to mitigate this to some extent, like this:
https://github.com/Kong/lua-resty-worker-events/blob/master/lib/resty/worker/events.lua#L157
before:

_dict:add(KEY_LAST_ID, 0)
event_id, err = _dict:incr(KEY_LAST_ID, 1)
if err then return event_id, err end

after:

  _dict:add(KEY_LAST_ID, 0)
  local retries = 0
  while not event_id and retries < 5 do
    event_id, err = _dict:incr(KEY_LAST_ID, 1)
    if not event_id then
      retries = retries + 1
    end
  end
  if not event_id then
    log(WARN, "worker-events: could not incr events-last-id in shm after ", retries,
            " tries: ", tostring(err), ". Payload size: ", #json, " bytes, ",
            "source: '", tostring(source), "', event: '", tostring(event), "'")
    return event_id, err
  end
@Tieske
Member

Tieske commented Jan 31, 2023

Current status is that we're letting go of this library, in favour of event communications over unix sockets.

@zjlooojoe
Author

zjlooojoe commented Jan 31, 2023

Current status is that we're letting go of this library, in favour of event communications over unix sockets.

@Tieske Thank you for your reply. I looked at the implementation of the new library, and it also does not retry event insertion (post_event && queue.put) to make insertion succeed as far as possible. Is the intent to leave retrying failed event insertion to the caller? If so, should the relevant parts of Kong be adjusted as well?

@Tieske
Member

Tieske commented Jan 31, 2023

I haven't been involved with the new lib. Maybe @chronolaw has ideas?

@fairyqb

fairyqb commented Jul 11, 2023

@chronolaw
Contributor

In lua-resty-events, we will report error if event publishing fails, then you can try to post it again.
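If the library only reports the error and leaves recovery to the caller, the caller-side retry might look like the sketch below. This is a hedged Python model, not lua-resty-events code: `post_with_retry`, its `(ok, err)` return convention, and the `attempts`/`delay` parameters are all illustrative assumptions.

```python
# Illustrative caller-side retry wrapper, assuming the publish function
# returns an (ok, err) pair, as Lua-style APIs commonly do. All names
# here are hypothetical.
import time

def post_with_retry(post, source, event, data, attempts=3, delay=0.01):
    err = None
    for i in range(attempts):
        ok, err = post(source, event, data)
        if ok:
            return True, None
        time.sleep(delay * (2 ** i))   # brief, growing pause before retrying
    return False, err                  # give up; surface the last error
```

The wrapper gives up after a bounded number of attempts and returns the last error, so the caller can still log or escalate a persistent failure.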

@fairyqb

fairyqb commented Jul 12, 2023

If no out-of-memory error is reported, the retry code does not resolve it; it may be caused by another issue.

We have also encountered inconsistency in memory data, which occurs occasionally. Is this a bug?

@Tieske @chronolaw

@Tieske
Member

Tieske commented Jul 12, 2023

Retrying makes sense to me, but I think the proposed code is flawed. The goal is to retry both the dict:add and the dict:incr operation if they fail, and both are done in a tight loop.

  • dict:add will on each retry expel a number of entries (30, iirc) from the LRU, until there is enough room to fit the value. So during the loop, changes are made that improve the probability of success.
  • dict:incr does not change anything between tries. So I think the tight loop is bound to keep failing.

So can we add an exponential back-off to the dict:incr retries? That should improve the retry success rate. (We probably need to take care of yielding vs non-yielding phases.)
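The back-off idea above can be sketched as follows. This is a Python model, not the actual Lua change: `incr_with_backoff`, `retries`, and `base` are illustrative names, and the sleep function is injected precisely because of the phase concern raised above (in OpenResty, ngx.sleep is only usable in yielding phases; elsewhere a retry would have to busy-wait or be skipped).

```python
# Model of retrying an incr-style operation with exponential back-off and
# jitter. `incr` is assumed to return (value, err), with value = None on
# failure, mirroring ngx.shared.DICT:incr's nil-plus-error convention.
# All names are illustrative.
import random

def incr_with_backoff(incr, key, value, retries=5, base=0.001,
                      sleep=lambda s: None):
    event_id, err = incr(key, value)
    attempt = 0
    while event_id is None and attempt < retries:
        # Exponential back-off with jitter: each wait doubles, and the random
        # component spreads out contending callers so they stop colliding.
        sleep(base * (2 ** attempt) + random.uniform(0, base))
        event_id, err = incr(key, value)
        attempt += 1
    return event_id, err
```

Because the wait grows between attempts, contenders for the same lock are increasingly likely to find it free, which is what a tight loop cannot achieve.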
