add retry for dict:incr in case of post event failed #48
Current status is that we're letting go of this library, in favour of event communications over unix sockets.
@Tieske Thank you for your reply. I looked at the implementation of the new library, and it also does not retry event insertion (post_event / queue.put) to make insertion as reliable as possible. Do you intend to leave retrying failed insertions to the caller? In that case, should the relevant implementation in Kong also be adjusted?
I haven't been involved with the new lib. Maybe @chronolaw has ideas?
https://github.com/Kong/lua-resty-worker-events/blob/master/lib/resty/worker/events.lua#L291 Is this error in the log file?
If no out-of-memory error is reported, the retry code will not resolve it; it may be caused by another issue. We have also occasionally encountered inconsistent data in shared memory. Is this a bug?
Retrying makes sense to me, but I think the code is flawed. The goal is to retry both the dict:incr and the dict:add.
So can we add an exponential back-off to the retries?
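A minimal sketch of what such a back-off could look like, assuming a helper with_backoff that is not part of the library; the dict name, retry counts, and delays are all illustrative, and ngx.sleep only works in phases that allow yielding (e.g. timer context), so a real patch might need a non-sleeping fallback:

```lua
-- Hypothetical sketch: retry a shared-dict operation with exponential back-off.
-- `op` returns (result, err); `retries` and `base_delay` are illustrative
-- parameters, not existing options of lua-resty-worker-events.
local function with_backoff(op, retries, base_delay)
  local res, err
  for attempt = 0, retries do
    res, err = op()
    if res then
      return res
    end
    if attempt < retries then
      -- back off 1x, 2x, 4x, ... of base_delay between attempts;
      -- ngx.sleep yields, so drop the sleep in phases that forbid it
      ngx.sleep(base_delay * 2 ^ attempt)
    end
  end
  return nil, err
end

local dict = ngx.shared.worker_events  -- illustrative shm name

-- retry the event-id increment (the init argument needs a recent OpenResty)
local event_id, err = with_backoff(function()
  return dict:incr("events-last", 1, 0)
end, 3, 0.001)
if not event_id then
  return nil, "failed to get event id: " .. tostring(err)
end

-- retry the payload insert; a real patch would probably retry only
-- transient errors such as "no memory", as shm_retries does for dict:add
local ok, err2 = with_backoff(function()
  return dict:add("events-data:" .. tostring(event_id), "payload")
end, 3, 0.001)
if not ok then
  return nil, "failed to store event data: " .. tostring(err2)
end
```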
Hi, I have the following problem when using Kong:
Specifically, I have encountered inconsistency across workers in the balance:targets:upstream_id cache. That is, some workers' L1 cache still holds the old data, while at least one worker can read the new data. Combined with the implementation of lua-resty-mlcache's delete, this indicates that delete_shm succeeded, but the other workers never evicted the old data from their L1 layer.
The following is the function call logic:
[kong] cache:invalidate(service_key, cluster) ---> [kong] self:invalidate_local(key) ---> [lua-resty-mlcache] self.mlcache:delete(key) ---> [lua-resty-worker-events] post(channel_name, channel, data) ---> [lua-resty-worker-events] post_event(source, event, data, unique) ---> [lua-resty-worker-events] dict:incr && dict:add
I think there may be problems in the following parts:
In lua-resty-worker-events' post_event function, dict:add has a retry mechanism (shm_retries), but dict:incr lacks one: if incr fails, the error is returned directly and post_event is interrupted (see the sketch after this list for the existing add behaviour).
In my case, this means the invalidate event is never inserted into the shm, so the other workers cannot consume it and therefore cannot remove the stale data from their L1 layer.
Refer to the implementation of dict:incr: when this function is called, a lock is acquired. Under high concurrency, other requests also contend for this lock, which can cause dict:incr to fail; this interrupts post_event, so the invalidate event cannot be inserted into the shm and the other workers cannot consume it.
Implementation of dict:incr in lua-nginx-module
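For reference, a rough approximation of how the existing shm_retries loop around dict:add behaves (paraphrased, not the verbatim library code; KEY_DATA, json, and the expiry value stand in for the real locals):

```lua
-- existing behaviour (approximated): dict:add is retried while the shm
-- reports "no memory", for up to shm_retries extra attempts
local ok, err
for _ = 0, shm_retries do
  ok, err = dict:add(KEY_DATA .. tostring(event_id), json, timeout)
  if ok or err ~= "no memory" then
    break
  end
end
if not ok then
  return nil, "failed to write event data: " .. tostring(err)
end
```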
Since this function is designed to post events via the shm, shouldn't we try to maximize the success rate of the post? I think we should also give dict:incr a few retries to mitigate this, like so:
https://github.com/Kong/lua-resty-worker-events/blob/master/lib/resty/worker/events.lua#L157
before:
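The snippet was lost from this comment; reconstructed from the linked source (identifiers approximated), the current code bails out on the first incr failure:

```lua
-- current: a single attempt; any error aborts post_event
local event_id, err = dict:incr(KEY_LAST_ID, 1)
if err then
  return event_id, err
end
```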
after:
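And a sketch of the proposed change, giving dict:incr the same few retries (reusing the shm_retries count here is an assumption, not the actual patch):

```lua
-- proposed: retry dict:incr a few times before giving up, mirroring
-- the retry loop that already guards dict:add
local event_id, err
for _ = 0, shm_retries do  -- hypothetical reuse of the shm_retries option
  event_id, err = dict:incr(KEY_LAST_ID, 1)
  if event_id then
    break
  end
end
if not event_id then
  return nil, err
end
```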