Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Application does not recover following loss of connectivity with DB #106

Closed
kimburgess opened this issue Sep 2, 2021 · 4 comments · Fixed by place-labs/clear#26 · May be fixed by place-labs/crystal-pg#1
Closed

Application does not recover following loss of connectivity with DB #106

kimburgess opened this issue Sep 2, 2021 · 4 comments · Fixed by place-labs/clear#26 · May be fixed by place-labs/crystal-pg#1
Assignees
Labels
type: bug something isn't working

Comments

@kimburgess
Copy link
Contributor

kimburgess commented Sep 2, 2021

Describe the bug
Following an interruption to database availability, the application can enter a state where requests fail and recovery becomes dependent on infrastructure automation.

Your environment
placeos-1.2108.4

Expected behavior
Reconnection should be attempted.

Additional context
Example log output under failure conditions:

tenant lookup failed on <REDACTED> with Error caught, last query was:
SELECT * FROM "tenants" WHERE ("domain" = '<REDACTED>') ORDER BY "tenants"."id" ASC LIMIT 1 (Clear::SQL::Error)
  from app/lib/clear/src/clear/sql/logger.cr:69:5 in 'first'
  from app/src/controllers/utilities/multi_tenant.cr:31:7 in 'determine_tenant_from_domain'
  from usr/share/crystal/src/slice.cr:70:18 in '???'
  from usr/share/crystal/src/primitives.cr:266:3 in 'call'
  from usr/share/crystal/src/http/server/handler.cr:28:7 in 'call'
  from usr/share/crystal/src/http/server/handler.cr:28:7 in 'call'
  from usr/share/crystal/src/http/server/request_processor.cr:51:11 in 'process'
  from usr/share/crystal/src/http/server.cr:500:5 in '->'
  from usr/share/crystal/src/primitives.cr:266:3 in 'run'
  from ???
Caused by:  (DB::ConnectionLost)
  from app/lib/db/src/db/statement.cr:92:7 in 'query'
  from app/lib/clear/src/clear/sql/query/fetch.cr:137:40 in 'first'
  from app/src/controllers/utilities/multi_tenant.cr:31:7 in 'determine_tenant_from_domain'
  from usr/share/crystal/src/slice.cr:70:18 in '???'
  from usr/share/crystal/src/primitives.cr:266:3 in 'call'
  from usr/share/crystal/src/http/server/handler.cr:28:7 in 'call'
  from usr/share/crystal/src/http/server/handler.cr:28:7 in 'call'
  from usr/share/crystal/src/http/server/request_processor.cr:51:11 in 'process'
  from usr/share/crystal/src/http/server.cr:500:5 in '->'
  from usr/share/crystal/src/primitives.cr:266:3 in 'run'
  from ???

The ORM in use appears to support this, but is disabled by default. If this is compatible, it would be good to enable this behaviour by default. Alternatively, if the upstream behaviour is not suitable, an alternative application level recover should be added here.

@kimburgess kimburgess added the type: bug something isn't working label Sep 2, 2021
@caspiano
Copy link
Contributor

caspiano commented Sep 3, 2021

crystal-db support reconnection natively.
Is this issue perhaps better reframed in terms of retries?
Retries are configurable via the connection URL, see here

#107 also benefits from the information above

@dukenguyenxyz
Copy link
Contributor

dukenguyenxyz commented Sep 3, 2021

So as long as this statement_with_retry is used https://github.com/crystal-lang/crystal-db/blob/bf5ca75d1ace7e15b00ca03ad21728b8b00cf007/src/db/pool_statement.cr#L41-L45 like here https://github.com/crystal-lang/crystal-db/blob/bf5ca75d1ace7e15b00ca03ad21728b8b00cf007/src/db/pool_statement.cr#L23-L25 then ConnectionLost won't be raised

def retry in statement_with_retry is defined here https://github.com/crystal-lang/crystal-db/blob/bf5ca75d1ace7e15b00ca03ad21728b8b00cf007/src/db/pool.cr#L179

Basically it rescues PoolResourceLost(T) errors (ConnectionLost < PoolResourceLost(Connection)), the only error statement_with_retry would not rescue is PoolRetryAttemptsExceeded which happens if retry_attempts + current connections is too low.

So what this means is somewhere in crystal-pg or clear, the query method that is implemented does not use statement_with_retry and execute the sql code directly.

If we look at https://github.com/will/crystal-pg/blob/cafe572e12966f7449477891fa65558954dbca5c/src/pg/statement.cr#L1-L44, which is used throughout clear (defined in db: https://github.com/crystal-lang/crystal-db/blob/bf5ca75d1ace7e15b00ca03ad21728b8b00cf007/src/db/statement.cr#L63-L110)

We can clearly see that there is no retry logic here. We can implement the retry logic like https://github.com/crystal-lang/crystal-db/blob/bf5ca75d1ace7e15b00ca03ad21728b8b00cf007/src/db/pool.cr#L165-L191 but we need access to @retry_attempts, @retry_delay, and @idle, however those are only available in DB::Pool(T), which is not in the same scope as PG::Statement https://github.com/will/crystal-pg/blob/master/src/pg/statement.cr#L1-L44, we have to pass those in somehow, then this would be resolved.

the recovery testing steps:

  • launch an app that uses clear and connects to a database
  • test the api is working
  • restart the database container
  • test the api is working

@dukenguyenxyz
Copy link
Contributor

Relevant: crystal-lang/crystal-db#45 (comment)

@bcardiff
Copy link

bcardiff commented Sep 4, 2021

Hi! I'm trying to understand the context of the issue reported in crystal-db a little bit.

👀 I think that place-labs/clear#11 will not recover from connection loss.

Once your are dealing with a connection object, there is no retry logic possible. A connection is almost a socket. If the socket is broken/closed you are done.

But if you run statements using the pool then the pool can recover. Because you don't care about the which connection is actually used. (This also implies like transactions can't recover. At least there is no API to retry the whole transaction if the connection goes down).

So, if Clear::SQL::ConnectionPool is expected to have a retry logic DB::Pool#retry or DB::Database#retry might need to be exposed I think.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: bug something isn't working
Projects
None yet
4 participants