
* 1st argument: the table identifier does not refer to an existing ETS table #181

Closed
maxguzenski opened this issue Jan 6, 2023 · 20 comments

Comments

@maxguzenski

Hi,
I have an app that makes intensive use of Nebulex, and roughly 100 times per day I see this error in the Elixir logs.
Can anyone help me?

(ArgumentError errors were found at the given arguments:

  * 1st argument: the table identifier does not refer to an existing ETS table
)

The pattern is: it happens many times at once, then the app runs for many hours without issue, then it happens again... (repeat).

My Nebulex config:

config :app, App.Cache,
  gc_cleanup_min_timeout: :timer.seconds(10),
  gc_cleanup_max_timeout: :timer.minutes(10),
  allocated_memory: 8_000_000_000

defmodule App.Cache do
  use Nebulex.Cache,
    otp_app: :app,
    adapter: Nebulex.Adapters.Local
end

Error on Sentry:

[Screenshot: Sentry error, 06 Jan 2023, 01:10 AM]

@cabol
Owner

cabol commented Jan 6, 2023

Hi 👋!! Real quick: are you using the latest Nebulex version? If not, which version are you using?

@maxguzenski
Author

Everything is updated.

Dependency              Current  Latest  Status
absinthe                1.6.6    1.7.0   Update possible
absinthe_phoenix        2.0.2    2.0.2   Up-to-date
absinthe_plug           1.5.8    1.5.8   Up-to-date
bamboo                  2.2.0    2.2.0   Up-to-date
bcrypt_elixir           2.3.0    3.0.1   Update not possible
broadway                1.0.5    1.0.5   Up-to-date
broadway_sqs            0.7.2    0.7.2   Up-to-date
comeonin                5.3.3    5.3.3   Up-to-date
cors_plug               2.0.3    3.0.3   Update not possible
cowboy                  2.9.0    2.9.0   Up-to-date
credo                   1.6.7    1.6.7   Up-to-date
dataloader              1.0.10   1.0.10  Up-to-date
decorator               1.4.0    1.4.0   Up-to-date
dialyxir                1.2.0    1.2.0   Up-to-date
ecto_psql_extras        0.7.10   0.7.10  Up-to-date
ecto_sql                3.9.2    3.9.2   Up-to-date
eqrcode                 0.1.10   0.1.10  Up-to-date
ex_aws                  2.4.1    2.4.1   Up-to-date
ex_aws_lambda           2.1.0    2.1.0   Up-to-date
ex_aws_s3               2.3.3    2.3.3   Up-to-date
ex_aws_sqs              3.3.1    3.4.0   Update possible
feeder_ex               1.1.0    1.1.0   Up-to-date
gettext                 0.20.0   0.21.0  Update possible
guardian                2.3.1    2.3.1   Up-to-date
hackney                 1.18.1   1.18.1  Up-to-date
httpoison               1.8.2    1.8.2   Up-to-date
jason                   1.4.0    1.4.0   Up-to-date
jose                    1.11.5   1.11.5  Up-to-date
mock                    0.3.7    0.3.7   Up-to-date
nebulex                 2.4.2    2.4.2   Up-to-date
phoenix                 1.6.15   1.6.15  Up-to-date
phoenix_ecto            4.4.0    4.4.0   Up-to-date
phoenix_html            3.2.0    3.2.0   Up-to-date
phoenix_live_dashboard  0.7.2    0.7.2   Up-to-date
phoenix_pubsub          2.1.1    2.1.1   Up-to-date
plug_cowboy             2.6.0    2.6.0   Up-to-date
poolboy                 1.5.2    1.5.2   Up-to-date
postgrex                0.16.5   0.16.5  Up-to-date
remote_ip               1.0.0    1.1.0   Update not possible
sentry                  8.0.6    8.0.6   Up-to-date
shards                  1.1.0    1.1.0   Up-to-date
sobelow                 0.11.1   0.11.1  Up-to-date
telemetry               1.1.0    1.2.0   Update not possible
telemetry_metrics       0.6.1    0.6.1   Up-to-date
telemetry_poller        1.0.0    1.0.0   Up-to-date
timex                   3.7.9    3.7.9   Up-to-date
ua_inspector            2.2.0    3.0.1   Update not possible
uuid                    1.1.8    1.1.8   Up-to-date
xxhash                  0.3.1    0.3.1   Up-to-date

@maxguzenski
Author

I moved back to :shards to give it a try; it created 48 shards (I run this app on an AWS c5.12xlarge, in an Ubuntu Docker container).
So far, with shards, it has happened 2 times.

[Screenshot: Sentry error, 06 Jan 2023, 13:11]

@maxguzenski
Author

Now, with shards, I'm getting other errors:

[Screenshot: Sentry error, 06 Jan 2023, 13:18]

ErlangError :shards_meta.get/1

[Screenshot: Sentry error, 06 Jan 2023, 13:20]

ArgumentError :erlang.send_after/3

[Screenshot: Sentry error, 06 Jan 2023, 13:19]

@maxguzenski
Author

Maybe when the old ETS table is removed there are still processes accessing it?
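
For reference, the error above is exactly what ETS raises when a process still holds a reference to a table that has been deleted; a minimal standalone sketch, outside Nebulex, with an illustrative table name:

tab = :ets.new(:old_generation, [:set, :public])
:ets.insert(tab, {"key", "value"})

# Deleting the table invalidates every reference other processes still hold.
:ets.delete(tab)

# Any later access through the stale reference raises the same error:
:ets.lookup(tab, "key")
#=> ** (ArgumentError) errors were found at the given arguments:
#=>   * 1st argument: the table identifier does not refer to an existing ETS table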

@cabol
Owner

cabol commented Jan 7, 2023

Right, let me check. It is weird, because when a new generation is created, the older one is just marked for deletion; it is not deleted immediately. That happens the next time a new generation is created, precisely to give time to processes still hitting it, which should only need a very short period, because the references to the new and old generations have already changed. Anyway, let me review this and try to reproduce it somehow.
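
A rough conceptual sketch of that generational scheme (illustrative only, not Nebulex's actual implementation; the module and function names are made up): the newest table takes writes, the previous one stays readable, and only on the next rotation is the oldest table actually deleted.

defmodule GenerationSketch do
  # Keep at most two live generations. The table deleted here already stopped
  # receiving writes on the previous rotation, so readers that still hold its
  # reference get roughly one full rotation as a grace period.
  def new_generation([newer, older]) do
    :ets.delete(older)
    [:ets.new(:generation, [:set, :public]), newer]
  end

  def new_generation(generations) do
    [:ets.new(:generation, [:set, :public]) | generations]
  end
end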

@cabol
Owner

cabol commented Jan 7, 2023

I was investigating, and the case where this may happen is when two (or more) generations are created one right after the other, almost at the same time, not giving existing processes time to update to the new generation list; but it is very weird. My first suspicion is that the GC timer and the cleanup timer are firing one right after the other, causing a situation like this. It is very unusual, but it may happen, especially since you mention you have a high volume of data and load on the cache. So, I made a fix to avoid it and pushed the changes to the master branch already. Please try it out and let me know if the issue persists; let's rule this out first.
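
A conceptual sketch of the race described above (illustrative only, not Nebulex internals): two timers scheduled independently with Process.send_after/3 can deliver their messages to the same process almost back to back, so two rotations run with almost no gap between them.

# Illustrative only: both timer messages land in the same server's mailbox.
Process.send_after(self(), :gc_timeout, :timer.seconds(10))
Process.send_after(self(), :cleanup_timeout, :timer.seconds(10))
# If each message triggers a new generation, the oldest table can be deleted
# before slower readers have updated to the new generation list.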

@maxguzenski
Author

OK, I'll give it a try in production... on Monday morning.

@maxguzenski
Author

maxguzenski commented Jan 7, 2023

This is my request count per minute; every request accesses Nebulex more than once.
In any case, the cache takes many hours to reach 8 GB.

[Screenshot: request count per minute, 07 Jan 2023, 08:05 AM]

So, comparing the number of requests to the number of errors in Sentry, the error rate is much less than 0.01%.

@cabol
Owner

cabol commented Jan 9, 2023

I'm still investigating, because it is not yet clear to me why this is happening. I've pushed some fixes/improvements in the meantime; they may not fix this issue, but let me know once you try them out.

@maxguzenski
Author

I haven't pushed your new code to production yet (I will soon), but I changed the config 3 days ago:

  # from
  gc_cleanup_min_timeout: :timer.minutes(10),
  gc_cleanup_max_timeout: :timer.minutes(60)

  # to
  gc_interval: :timer.hours(24),
  gc_cleanup_min_timeout: :timer.minutes(10),
  gc_cleanup_max_timeout: :timer.minutes(60)

And everything is running without issues; in this time, the cache cleaned itself up 4 times with no problems. I believe gc_interval (with its default value) is related to the bug.
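
For reference, a minimal sketch of the resulting cache configuration with gc_interval set explicitly (the values are the ones quoted in this thread):

config :app, App.Cache,
  gc_interval: :timer.hours(24),
  gc_cleanup_min_timeout: :timer.minutes(10),
  gc_cleanup_max_timeout: :timer.minutes(60),
  allocated_memory: 8_000_000_000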

@maxguzenski
Author

OK, there is a bug... I will make a video; something is very wrong.

@maxguzenski
Author

My mistake... I need more time to investigate. What I thought was a bug might not be one.

@cabol
Owner

cabol commented Jan 10, 2023

I haven't pushed your new code to production yet (I will soon), but I changed the config 3 days ago:

  # from
  gc_cleanup_min_timeout: :timer.minutes(10),
  gc_cleanup_max_timeout: :timer.minutes(60)

  # to
  gc_interval: :timer.hours(24),
  gc_cleanup_min_timeout: :timer.minutes(10),
  gc_cleanup_max_timeout: :timer.minutes(60)

And everything is running without issues; in this time, the cache cleaned itself up 4 times with no problems. I believe gc_interval (with its default value) is related to the bug.

Interesting, this gives me a better idea of where to look. I'll check and keep you posted, thanks!!

@maxguzenski
Author

maxguzenski commented Jan 10, 2023

How exactly is the allocated_memory parameter, together with a high gc_interval value, supposed to work when memory reaches the maximum allowed?

A) Everything is deleted and a new ETS table is created.
B) A new ETS table is created and the old cache is kept for some time.

I have the impression that in my app option "A" is what happens, because used memory drops from 16 GB (allocated_memory) to near zero.

I'm adding some telemetry data to help see how the cache behaves over time and I'll be putting it all into production soon.
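
A minimal sketch of what such a telemetry handler could look like, assuming the cache keeps Nebulex v2's default telemetry prefix for App.Cache ([:app, :cache]) and its command events; the handler id and printed message are illustrative:

:telemetry.attach(
  "nebulex-cache-commands",
  [:app, :cache, :command, :stop],
  fn _event, measurements, metadata, _config ->
    # :telemetry.span/3 reports :duration in native time units.
    IO.puts("cache command finished in #{measurements.duration} native units; " <>
      "metadata keys: #{inspect(Map.keys(metadata))}")
  end,
  nil
)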

@cabol
Owner

cabol commented Jan 14, 2023

Hey 👋!! There was a fix very much related to this one (#183); you may want to try it out, it is on the master branch. Let me know how it goes, stay tuned!

@maxguzenski
Author

Probably yes, because I tested it with several configurations of gc_interval and gc_cleanup_min_timeout, and the bug only happens when gc_interval is not set.

I'll test the master branch.

@cabol
Owner

cabol commented Mar 26, 2023

Closing this issue, but feel free to reopen it if the issue persists even with fix #183.

@cabol cabol closed this as completed Mar 26, 2023
@pierre-pretorius

@cabol we see the same issue with Nebulex 2.5.1. We will try it soon with the latest 2.5.2. The stack trace reports :ets.lookup/2:

ArgumentError
:ets.lookup/2

(ArgumentError errors were found at the given arguments:

  * 1st argument: the table identifier does not refer to an existing ETS table
)
Stack Trace


ArgumentError: errors were found at the given arguments:

  * 1st argument: the table identifier does not refer to an existing ETS table

  Module "ets", in :ets.lookup/2
  File "lib/nebulex/adapters/local.ex", line 753, in Nebulex.Adapters.Local.get_entry/4
  File "lib/nebulex/adapters/local.ex", line 465, in Nebulex.Adapters.Local.gen_fetch/4
  File "lib/nebulex/adapters/local.ex", line 456, in Nebulex.Adapters.Local.do_get/3
  File "lib/nebulex/adapters/local.ex", line 447, in Nebulex.Adapters.Local.get_/2
  File "lib/nebulex/adapters/local.ex", line 444, in anonymous fn/3 in Nebulex.Adapters.Local.get_/2
  File "/opt/sa_cloud/deps/telemetry/src/telemetry.erl", line 272, in :telemetry.span/3
  File "lib/nebulex/adapters/local.ex", line 440, in Nebulex.Adapters.Local.get/3

Sometimes the stack trace says :ets.take/2:

ArgumentError
:ets.take/2

(ArgumentError errors were found at the given arguments:
  * 1st argument: the table identifier does not refer to an existing ETS table
)

ArgumentError: errors were found at the given arguments:

  * 1st argument: the table identifier does not refer to an existing ETS table

  Module "ets", in :ets.take/2
  File "lib/nebulex/adapters/local.ex", line 761, in Nebulex.Adapters.Local.pop_entry/4
  File "lib/nebulex/adapters/local.ex", line 465, in Nebulex.Adapters.Local.gen_fetch/4
  File "lib/nebulex/adapters/local.ex", line 457, in Nebulex.Adapters.Local.do_get/3
  File "lib/nebulex/adapters/local.ex", line 447, in Nebulex.Adapters.Local.get_/2
  File "lib/nebulex/adapters/local.ex", line 444, in anonymous fn/3 in Nebulex.Adapters.Local.get_/2
  File "/opt/sa_cloud/deps/telemetry/src/telemetry.erl", line 272, in :telemetry.span/3
  File "lib/nebulex/adapters/local.ex", line 440, in Nebulex.Adapters.Local.get/3

Our Nebulex usage is relatively simple:

defmodule MyApp.Cache do
  use Nebulex.Cache,
    otp_app: :sa_cloud,
    adapter: Nebulex.Adapters.Replicated
end

The only functions we call are the cache put and get:

MyApp.Cache.get(string)
MyApp.Cache.put(string, data)

We do, however, run two Elixir nodes on the same machine so that we can restart Elixir without downtime.

rel/env.sh.eex:

#!/bin/sh

# Sets and enables heart (recommended only in daemon mode)
# case $RELEASE_COMMAND in
#   daemon*)
#     HEART_COMMAND="$RELEASE_ROOT/bin/$RELEASE_NAME $RELEASE_COMMAND"
#     export HEART_COMMAND
#     export ELIXIR_ERL_OPTIONS="-heart"
#     ;;
#   *)
#     ;;
# esac

# Set the release to work across nodes. If using the long name format like
# the one below (my_app@127.0.0.1), you need to also uncomment the
# RELEASE_DISTRIBUTION variable below. Must be "sname", "name" or "none".
export RELEASE_DISTRIBUTION=name
export RELEASE_MODE=embedded
export RELEASE_NODE=<%= @release.name %>_$PORT@127.0.0.1

config/releases.exs:

config :kernel,
       sync_nodes_optional: [:"my_app_3000@127.0.0.1", :"my_app_3001@127.0.0.1"],
       sync_nodes_timeout: 10000

@cabol
Owner

cabol commented Jul 20, 2023

Hey 👋! Could you share the cache config? Also, could you elaborate a bit more on when this happens, for example whether there is a particular situation that triggers it or it just happens randomly? Thanks!
