Application does not recover following loss of connectivity with DB #106

kimburgess · 2021-09-02T11:37:42Z

Describe the bug
Following an interruption to database availability, the application can enter a state where requests fail and recovery becomes dependent on infrastructure automation.

Your environment
placeos-1.2108.4

Expected behavior
Reconnection should be attempted.

Additional context
Example log output under failure conditions:

tenant lookup failed on <REDACTED> with Error caught, last query was:
SELECT * FROM "tenants" WHERE ("domain" = '<REDACTED>') ORDER BY "tenants"."id" ASC LIMIT 1 (Clear::SQL::Error)
  from app/lib/clear/src/clear/sql/logger.cr:69:5 in 'first'
  from app/src/controllers/utilities/multi_tenant.cr:31:7 in 'determine_tenant_from_domain'
  from usr/share/crystal/src/slice.cr:70:18 in '???'
  from usr/share/crystal/src/primitives.cr:266:3 in 'call'
  from usr/share/crystal/src/http/server/handler.cr:28:7 in 'call'
  from usr/share/crystal/src/http/server/handler.cr:28:7 in 'call'
  from usr/share/crystal/src/http/server/request_processor.cr:51:11 in 'process'
  from usr/share/crystal/src/http/server.cr:500:5 in '->'
  from usr/share/crystal/src/primitives.cr:266:3 in 'run'
  from ???
Caused by:  (DB::ConnectionLost)
  from app/lib/db/src/db/statement.cr:92:7 in 'query'
  from app/lib/clear/src/clear/sql/query/fetch.cr:137:40 in 'first'
  from app/src/controllers/utilities/multi_tenant.cr:31:7 in 'determine_tenant_from_domain'
  from usr/share/crystal/src/slice.cr:70:18 in '???'
  from usr/share/crystal/src/primitives.cr:266:3 in 'call'
  from usr/share/crystal/src/http/server/handler.cr:28:7 in 'call'
  from usr/share/crystal/src/http/server/handler.cr:28:7 in 'call'
  from usr/share/crystal/src/http/server/request_processor.cr:51:11 in 'process'
  from usr/share/crystal/src/http/server.cr:500:5 in '->'
  from usr/share/crystal/src/primitives.cr:266:3 in 'run'
  from ???

The ORM in use appears to support this, but is disabled by default. If this is compatible, it would be good to enable this behaviour by default. Alternatively, if the upstream behaviour is not suitable, an alternative application level recover should be added here.

The text was updated successfully, but these errors were encountered:

caspiano · 2021-09-03T03:11:32Z

crystal-db support reconnection natively.
Is this issue perhaps better reframed in terms of retries?
Retries are configurable via the connection URL, see here

#107 also benefits from the information above

dukenguyenxyz · 2021-09-03T04:32:40Z

So as long as this statement_with_retry is used https://github.com/crystal-lang/crystal-db/blob/bf5ca75d1ace7e15b00ca03ad21728b8b00cf007/src/db/pool_statement.cr#L41-L45 like here https://github.com/crystal-lang/crystal-db/blob/bf5ca75d1ace7e15b00ca03ad21728b8b00cf007/src/db/pool_statement.cr#L23-L25 then ConnectionLost won't be raised

def retry in statement_with_retry is defined here https://github.com/crystal-lang/crystal-db/blob/bf5ca75d1ace7e15b00ca03ad21728b8b00cf007/src/db/pool.cr#L179

Basically it rescues PoolResourceLost(T) errors (ConnectionLost < PoolResourceLost(Connection)), the only error statement_with_retry would not rescue is PoolRetryAttemptsExceeded which happens if retry_attempts + current connections is too low.

So what this means is somewhere in crystal-pg or clear, the query method that is implemented does not use statement_with_retry and execute the sql code directly.

If we look at https://github.com/will/crystal-pg/blob/cafe572e12966f7449477891fa65558954dbca5c/src/pg/statement.cr#L1-L44, which is used throughout clear (defined in db: https://github.com/crystal-lang/crystal-db/blob/bf5ca75d1ace7e15b00ca03ad21728b8b00cf007/src/db/statement.cr#L63-L110)

We can clearly see that there is no retry logic here. We can implement the retry logic like https://github.com/crystal-lang/crystal-db/blob/bf5ca75d1ace7e15b00ca03ad21728b8b00cf007/src/db/pool.cr#L165-L191 but we need access to @retry_attempts, @retry_delay, and @idle, however those are only available in DB::Pool(T), which is not in the same scope as PG::Statement https://github.com/will/crystal-pg/blob/master/src/pg/statement.cr#L1-L44, we have to pass those in somehow, then this would be resolved.

the recovery testing steps:

launch an app that uses clear and connects to a database
test the api is working
restart the database container
test the api is working

dukenguyenxyz · 2021-09-03T05:23:28Z

Relevant: crystal-lang/crystal-db#45 (comment)

bcardiff · 2021-09-04T14:57:32Z

Hi! I'm trying to understand the context of the issue reported in crystal-db a little bit.

👀 I think that place-labs/clear#11 will not recover from connection loss.

Once your are dealing with a connection object, there is no retry logic possible. A connection is almost a socket. If the socket is broken/closed you are done.

But if you run statements using the pool then the pool can recover. Because you don't care about the which connection is actually used. (This also implies like transactions can't recover. At least there is no API to retry the whole transaction if the connection goes down).

So, if Clear::SQL::ConnectionPool is expected to have a retry logic DB::Pool#retry or DB::Database#retry might need to be exposed I think.

kimburgess added the type: bug something isn't working label Sep 2, 2021

jeremyw24 assigned dukenguyenxyz Sep 3, 2021

dukenguyenxyz mentioned this issue Sep 3, 2021

feat(statement.cr): add retry logic to perform_query and perform_exec place-labs/crystal-pg#1

Open

dukenguyenxyz mentioned this issue Sep 7, 2021

perf(connection_pool.cr): retry on using_connection place-labs/clear#26

Merged

dukenguyenxyz closed this as completed in place-labs/clear#26 Sep 15, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Application does not recover following loss of connectivity with DB #106

Application does not recover following loss of connectivity with DB #106

kimburgess commented Sep 2, 2021 •

edited

Loading

caspiano commented Sep 3, 2021

dukenguyenxyz commented Sep 3, 2021 •

edited

Loading

dukenguyenxyz commented Sep 3, 2021

bcardiff commented Sep 4, 2021

Application does not recover following loss of connectivity with DB #106

Application does not recover following loss of connectivity with DB #106

Comments

kimburgess commented Sep 2, 2021 • edited Loading

caspiano commented Sep 3, 2021

dukenguyenxyz commented Sep 3, 2021 • edited Loading

dukenguyenxyz commented Sep 3, 2021

bcardiff commented Sep 4, 2021

kimburgess commented Sep 2, 2021 •

edited

Loading

dukenguyenxyz commented Sep 3, 2021 •

edited

Loading