Establish a new connection instead of reconnect! #18010

jrafanie · 2018-09-21T19:29:48Z

https://bugzilla.redhat.com/show_bug.cgi?id=1626005

It was found that reconnect! against a postgresql SSL connection could
occur while postgresql was stopping or starting, even before the
"database system is ready to accept connections." If this is timed
correctly, this connection could fail SSL handshaking, and the
client could "give" up trying SSL connections and only ever try non-SSL.

The workaround until this commit, was to restart the client process such
as the worker or evm server.

This was evident in the postgresql log where you'd see a client,
172.168.1.99, initiate a SSL connection:

2018-08-30 11:23:52
EDT:172.168.1.99(56988):5b880c08.200c:root@vmdb_production:[8204]:LOG:
connection authorized: user=root database=vmdb_production SSL enabled
(protocol=TLSv1.2, cipher=ECDHE-RSA-AES256-GCM-SHA384, compression=off)

Then, sometime later postgresql restarts and this connection timed it
"right" and all future connections were attempted with SSL off and fail:

2018-08-30 11:23:52
EDT:172.168.1.99(56996):5b880c08.2218:root@vmdb_production:[8728]:FATAL:
no pg_hba.conf entry for host "172.168.1.99", user "root", database
"vmdb_production", SSL off

Somewhere deep in pg/libpq/openssl, the client code failed to initialize
the SSL handshaking code and will continually only attempt non-SSL
connections until you restart the process.

Until we can dig deep into the pg/libpq/openssl code, this code change
forces the re-initialization of the SSL handshaking code in the client
by discarding the existing connection and establishing a new one.

To test this, I first recreated this solely in an ActiveRecord
environment, by restarting postgresql in one ssh session on an appliance
and running this in rails console in another:

500.times do
  begin
    puts conn.select_value("select count(*) from users;")
  rescue PG::Error, ActiveRecord::StatementInvalid
    begin
      conn.reconnect!
    rescue
      retry
    end
  end
  sleep 0.001
end

Then, I verified this code change in a full appliance server/worker
environment by first recreating this (after many iterations) and then
confirming it did not recreate with this change applied:

for x in `seq 1 500`
do
  # restart postgresql is too fast, so I needed to first stop it, sleep
  # a little, then start it, sleep a bit longer, then keep doing that in
  # a loop, while the workers/servers would try to reconnect!.
  small_rand=`ruby -e 'puts rand(4)'`
  rand=`ruby -e 'puts rand(10) + 10'`
  echo $small_rand
  echo $rand
  systemctl stop rh-postgresql95-postgresql
  sleep $small_rand
  systemctl start rh-postgresql95-postgresql
  sleep $rand
done

jrafanie · 2018-09-21T19:30:51Z

Note, it looks like we do this in the messaging gem too:

/Users/joerafaniello/.gem/ruby/2.4.4/bundler/gems/manageiq-messaging-1031f1a3f4f8/lib/manageiq/messaging/stomp/background_job.rb:                  ActiveRecord::Base.connection.reconnect!

carbonin · 2018-09-21T19:49:04Z

app/models/miq_queue_worker_base/runner.rb

-          ActiveRecord::Base.connection.reconnect!
+
+          # Remove the connection and establish a new one since reconnect! doesn't always play nice with SSL postgresql connections
+          ActiveRecord::Base.establish_connection(ActiveRecord::Base.remove_connection("primary"))


Do we need to explicitly specify the primary connection name or will this work without a parameter?

yeah, it needs to disconnect the pool and drop the key for the specification name from the owner_to_pool hash

But I don't think that's the remove_connection we call.

irb(main):001:0> ActiveRecord::Base.method(:remove_connection).source_location => ["/home/ncarboni/.gem/ruby/2.4.3/gems/activerecord-5.0.7/lib/active_record/connection_handling.rb", 136]

Or is @connection_specification_name not "primary" here

Yeah, that should work too. Good point.

Note, I chose the hardcoded "primary" because these places were calling explicitly reconnect! on AR::Base and not a model/subclass. I guess calling connection_specification_name on AR::Base is nicer than the hardcoded "primary". I'd rather not rely on connection_handler.retrieve_connection_pool(name) returning "primary" when the name is nil though.

What do you think?

spec_name = ActiveRecord::Base.connection_specification_name ActiveRecord::Base.establish_connection(ActiveRecord::Base.remove_connection(spec_name))

Yeah, that looks good.

It does seem a bit strange that they choose to go straight to the instance variable in remove_connection rather than using the logic in the connection_specification_name method. I'm sure there are Reasons for the way it's done, but your version looks more like the behavior we want here.

The version you wrote up in the comment that is.... Just realized that was a bit ambiguous as both versions are "yours"

😆

Note, I was going to one line to avoid adding more lines but 3 AR::Base on one line is one too many.

carbonin · 2018-09-21T19:49:38Z

Great work reproducing this @jrafanie!

https://bugzilla.redhat.com/show_bug.cgi?id=1626005 It was found that reconnect! against a postgresql SSL connection could occur while postgresql was stopping or starting, even before the "database system is ready to accept connections." If this is timed correctly, this connection could fail SSL handshaking, and the client could "give" up trying SSL connections and only ever try non-SSL. The workaround until this commit, was to restart the client process such as the worker or evm server. This was evident in the postgresql log where you'd see a client, 172.168.1.99, initiate a SSL connection: 2018-08-30 11:23:52 EDT:172.168.1.99(56988):5b880c08.200c:root@vmdb_production:[8204]:LOG: connection authorized: user=root database=vmdb_production SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES256-GCM-SHA384, compression=off) Then, sometime later postgresql restarts and this connection timed it "right" and all future connections were attempted with SSL off and fail: 2018-08-30 11:23:52 EDT:172.168.1.99(56996):5b880c08.2218:root@vmdb_production:[8728]:FATAL: no pg_hba.conf entry for host "172.168.1.99", user "root", database "vmdb_production", SSL off Somewhere deep in pg/libpq/openssl, the client code failed to initialize the SSL handshaking code and will continually only attempt non-SSL connections until you restart the process. Until we can dig deep into the pg/libpq/openssl code, this code change forces the re-initialization of the SSL handshaking code in the client by discarding the existing connection and establishing a new one. To test this, I first recreated this solely in an ActiveRecord environment, by restarting postgresql in one ssh session on an appliance and running this in rails console in another: ``` 500.times do begin puts conn.select_value("select count(*) from users;") rescue PG::Error, ActiveRecord::StatementInvalid begin conn.reconnect! rescue retry end end sleep 0.001 end ``` Then, I verified this code change in a full appliance server/worker environment by first recreating this (after many iterations) and then confirming it did not recreate with this change applied: ``` for x in `seq 1 500` do # restart postgresql is too fast, so I needed to first stop it, sleep # a little, then start it, sleep a bit longer, then keep doing that in # a loop, while the workers/servers would try to reconnect!. small_rand=`ruby -e 'puts rand(4)'` rand=`ruby -e 'puts rand(10) + 10'` echo $small_rand echo $rand systemctl stop rh-postgresql95-postgresql sleep $small_rand systemctl start rh-postgresql95-postgresql sleep $rand done ```

jrafanie · 2018-09-21T21:01:32Z

For reference, this is an example where postgresql accepts a connection before it's ready to accept them... leading to this problem:

2018-09-21 20:43:29 GMT::5ba557e4.7c7:@:[1991]:LOG:  shutting down
                    2018-09-21 20:43:29 GMT::5ba557e4.7c7:@:[1991]:LOG:  database system is shut down
                    2018-09-21 20:43:32 GMT::5ba557f4.821:@:[2081]:LOG:  database system was shut down at 2018-09-21 20:43:29 GMT
hmmm ------------>  2018-09-21 20:43:32 GMT:::1(57872):5ba557f4.822:[unknown]@[unknown]:[2082]:LOG:  connection received: host=::1 port=57872
                    2018-09-21 20:43:32 GMT::5ba557f4.821:@:[2081]:LOG:  MultiXact member wraparound protections are now enabled
TOO LATE BUDDY -->  2018-09-21 20:43:32 GMT::5ba557f4.81f:@:[2079]:LOG:  database system is ready to accept connections
                    2018-09-21 20:43:32 GMT:::1(57872):5ba557f4.822:root@vmdb_production:[2082]:FATAL:  the database system is starting up
                    2018-09-21 20:43:32 GMT::5ba557f4.826:@:[2086]:LOG:  autovacuum launcher started
                    2018-09-21 20:43:32 GMT::5ba557f4.828:@:[2088]:LOG:  starting pglogical supervisor
                    2018-09-21 20:43:32 GMT:::1(57874):5ba557f4.829:[unknown]@[unknown]:[2089]:LOG:  connection received: host=::1 port=57874
NON-SSL now ----->  2018-09-21 20:43:32 GMT:::1(57874):5ba557f4.829:root@vmdb_production:[2089]:FATAL:  no pg_hba.conf entry for host "::1", user "root", database "vmdb_production", SSL off

jrafanie · 2018-09-21T21:07:22Z

ok, updated with @carbonin's suggestions, thanks!

miq-bot · 2018-09-21T21:07:26Z

Checked commit jrafanie@2f236f6 with ruby 2.3.3, rubocop 0.52.1, haml-lint 0.20.0, and yamllint 1.10.0
4 files checked, 0 offenses detected
Everything looks fine. 🏆

jrafanie · 2018-09-21T21:09:04Z

Note, using the above LOVELY for x in bash script, I can get this to reliably recreate without this code change within 30 minutes... with the code change, it went a few hours before I just killed it.

kbrock

Looking good billy rae

jrafanie added bug core labels Sep 21, 2018

jrafanie requested review from gtanzillo and carbonin September 21, 2018 19:29

carbonin reviewed Sep 21, 2018

View reviewed changes

jrafanie force-pushed the workaround_reconnect_failing_for_ssl_connections branch from ca3d193 to 2f236f6 Compare September 21, 2018 21:03

kbrock approved these changes Sep 21, 2018

View reviewed changes

carbonin self-assigned this Sep 21, 2018

carbonin added the gaprindashvili/no label Sep 21, 2018

carbonin merged commit 7467510 into ManageIQ:master Sep 21, 2018

carbonin added this to the Sprint 95 Ending Sep 24, 2018 milestone Sep 21, 2018

jrafanie deleted the workaround_reconnect_failing_for_ssl_connections branch September 24, 2018 16:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Establish a new connection instead of reconnect! #18010

Establish a new connection instead of reconnect! #18010

jrafanie commented Sep 21, 2018 •

edited

Loading

jrafanie commented Sep 21, 2018

carbonin Sep 21, 2018

jrafanie Sep 21, 2018 •

edited

Loading

carbonin Sep 21, 2018

jrafanie Sep 21, 2018 •

edited

Loading

carbonin Sep 21, 2018

carbonin Sep 21, 2018

jrafanie Sep 21, 2018

carbonin commented Sep 21, 2018

jrafanie commented Sep 21, 2018

jrafanie commented Sep 21, 2018

miq-bot commented Sep 21, 2018

jrafanie commented Sep 21, 2018

kbrock left a comment

Establish a new connection instead of reconnect! #18010

Establish a new connection instead of reconnect! #18010

Conversation

jrafanie commented Sep 21, 2018 • edited Loading

jrafanie commented Sep 21, 2018

carbonin Sep 21, 2018

Choose a reason for hiding this comment

jrafanie Sep 21, 2018 • edited Loading

Choose a reason for hiding this comment

carbonin Sep 21, 2018

Choose a reason for hiding this comment

jrafanie Sep 21, 2018 • edited Loading

Choose a reason for hiding this comment

carbonin Sep 21, 2018

Choose a reason for hiding this comment

carbonin Sep 21, 2018

Choose a reason for hiding this comment

jrafanie Sep 21, 2018

Choose a reason for hiding this comment

carbonin commented Sep 21, 2018

jrafanie commented Sep 21, 2018

jrafanie commented Sep 21, 2018

miq-bot commented Sep 21, 2018

jrafanie commented Sep 21, 2018

kbrock left a comment

Choose a reason for hiding this comment

jrafanie commented Sep 21, 2018 •

edited

Loading

jrafanie Sep 21, 2018 •

edited

Loading

jrafanie Sep 21, 2018 •

edited

Loading