Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Establish a new connection instead of reconnect! #18010

Conversation

jrafanie
Copy link
Member

@jrafanie jrafanie commented Sep 21, 2018

https://bugzilla.redhat.com/show_bug.cgi?id=1626005

It was found that reconnect! against a postgresql SSL connection could
occur while postgresql was stopping or starting, even before the
"database system is ready to accept connections." If this is timed
correctly, this connection could fail SSL handshaking, and the
client could "give" up trying SSL connections and only ever try non-SSL.

The workaround until this commit, was to restart the client process such
as the worker or evm server.

This was evident in the postgresql log where you'd see a client,
172.168.1.99, initiate a SSL connection:

2018-08-30 11:23:52
EDT:172.168.1.99(56988):5b880c08.200c:root@vmdb_production:[8204]:LOG:
connection authorized: user=root database=vmdb_production SSL enabled
(protocol=TLSv1.2, cipher=ECDHE-RSA-AES256-GCM-SHA384, compression=off)

Then, sometime later postgresql restarts and this connection timed it
"right" and all future connections were attempted with SSL off and fail:

2018-08-30 11:23:52
EDT:172.168.1.99(56996):5b880c08.2218:root@vmdb_production:[8728]:FATAL:
no pg_hba.conf entry for host "172.168.1.99", user "root", database
"vmdb_production", SSL off

Somewhere deep in pg/libpq/openssl, the client code failed to initialize
the SSL handshaking code and will continually only attempt non-SSL
connections until you restart the process.

Until we can dig deep into the pg/libpq/openssl code, this code change
forces the re-initialization of the SSL handshaking code in the client
by discarding the existing connection and establishing a new one.

To test this, I first recreated this solely in an ActiveRecord
environment, by restarting postgresql in one ssh session on an appliance
and running this in rails console in another:

500.times do
  begin
    puts conn.select_value("select count(*) from users;")
  rescue PG::Error, ActiveRecord::StatementInvalid
    begin
      conn.reconnect!
    rescue
      retry
    end
  end
  sleep 0.001
end

Then, I verified this code change in a full appliance server/worker
environment by first recreating this (after many iterations) and then
confirming it did not recreate with this change applied:

for x in `seq 1 500`
do
  # restart postgresql is too fast, so I needed to first stop it, sleep
  # a little, then start it, sleep a bit longer, then keep doing that in
  # a loop, while the workers/servers would try to reconnect!.
  small_rand=`ruby -e 'puts rand(4)'`
  rand=`ruby -e 'puts rand(10) + 10'`
  echo $small_rand
  echo $rand
  systemctl stop rh-postgresql95-postgresql
  sleep $small_rand
  systemctl start rh-postgresql95-postgresql
  sleep $rand
done

@jrafanie
Copy link
Member Author

Note, it looks like we do this in the messaging gem too:

/Users/joerafaniello/.gem/ruby/2.4.4/bundler/gems/manageiq-messaging-1031f1a3f4f8/lib/manageiq/messaging/stomp/background_job.rb:                  ActiveRecord::Base.connection.reconnect!

ActiveRecord::Base.connection.reconnect!

# Remove the connection and establish a new one since reconnect! doesn't always play nice with SSL postgresql connections
ActiveRecord::Base.establish_connection(ActiveRecord::Base.remove_connection("primary"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to explicitly specify the primary connection name or will this work without a parameter?

Copy link
Member Author

@jrafanie jrafanie Sep 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, it needs to disconnect the pool and drop the key for the specification name from the owner_to_pool hash

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I don't think that's the remove_connection we call.

irb(main):001:0> ActiveRecord::Base.method(:remove_connection).source_location
=> ["/home/ncarboni/.gem/ruby/2.4.3/gems/activerecord-5.0.7/lib/active_record/connection_handling.rb", 136]

Or is @connection_specification_name not "primary" here

Copy link
Member Author

@jrafanie jrafanie Sep 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that should work too. Good point.

Note, I chose the hardcoded "primary" because these places were calling explicitly reconnect! on AR::Base and not a model/subclass. I guess calling connection_specification_name on AR::Base is nicer than the hardcoded "primary". I'd rather not rely on connection_handler.retrieve_connection_pool(name) returning "primary" when the name is nil though.

What do you think?

spec_name = ActiveRecord::Base.connection_specification_name
ActiveRecord::Base.establish_connection(ActiveRecord::Base.remove_connection(spec_name))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that looks good.

It does seem a bit strange that they choose to go straight to the instance variable in remove_connection rather than using the logic in the connection_specification_name method. I'm sure there are Reasons for the way it's done, but your version looks more like the behavior we want here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The version you wrote up in the comment that is.... Just realized that was a bit ambiguous as both versions are "yours"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😆

Note, I was going to one line to avoid adding more lines but 3 AR::Base on one line is one too many.

@carbonin
Copy link
Member

Great work reproducing this @jrafanie!

https://bugzilla.redhat.com/show_bug.cgi?id=1626005

It was found that reconnect! against a postgresql SSL connection could
occur while postgresql was stopping or starting, even before the
"database system is ready to accept connections."  If this is timed
correctly, this connection could fail SSL handshaking, and the
client could "give" up trying SSL connections and only ever try non-SSL.

The workaround until this commit, was to restart the client process such
as the worker or evm server.

This was evident in the postgresql log where you'd see a client,
172.168.1.99, initiate a SSL connection:

2018-08-30 11:23:52
EDT:172.168.1.99(56988):5b880c08.200c:root@vmdb_production:[8204]:LOG:
connection authorized: user=root database=vmdb_production SSL enabled
(protocol=TLSv1.2, cipher=ECDHE-RSA-AES256-GCM-SHA384, compression=off)

Then, sometime later postgresql restarts and this connection timed it
"right" and all future connections were attempted with SSL off and fail:

2018-08-30 11:23:52
EDT:172.168.1.99(56996):5b880c08.2218:root@vmdb_production:[8728]:FATAL:
no pg_hba.conf entry for host "172.168.1.99", user "root", database
"vmdb_production", SSL off

Somewhere deep in pg/libpq/openssl, the client code failed to initialize
the SSL handshaking code and will continually only attempt non-SSL
connections until you restart the process.

Until we can dig deep into the pg/libpq/openssl code, this code change
forces the re-initialization of the SSL handshaking code in the client
by discarding the existing connection and establishing a new one.

To test this, I first recreated this solely in an ActiveRecord
environment, by restarting postgresql in one ssh session on an appliance
and running this in rails console in another:

```
500.times do
  begin
    puts conn.select_value("select count(*) from users;")
  rescue PG::Error, ActiveRecord::StatementInvalid
    begin
      conn.reconnect!
    rescue
      retry
    end
  end
  sleep 0.001
end
```

Then, I verified this code change in a full appliance server/worker
environment by first recreating this (after many iterations) and then
confirming it did not recreate with this change applied:

```
for x in `seq 1 500`
do
  # restart postgresql is too fast, so I needed to first stop it, sleep
  # a little, then start it, sleep a bit longer, then keep doing that in
  # a loop, while the workers/servers would try to reconnect!.
  small_rand=`ruby -e 'puts rand(4)'`
  rand=`ruby -e 'puts rand(10) + 10'`
  echo $small_rand
  echo $rand
  systemctl stop rh-postgresql95-postgresql
  sleep $small_rand
  systemctl start rh-postgresql95-postgresql
  sleep $rand
done
```
@jrafanie
Copy link
Member Author

For reference, this is an example where postgresql accepts a connection before it's ready to accept them... leading to this problem:

2018-09-21 20:43:29 GMT::5ba557e4.7c7:@:[1991]:LOG:  shutting down
                    2018-09-21 20:43:29 GMT::5ba557e4.7c7:@:[1991]:LOG:  database system is shut down
                    2018-09-21 20:43:32 GMT::5ba557f4.821:@:[2081]:LOG:  database system was shut down at 2018-09-21 20:43:29 GMT
hmmm ------------>  2018-09-21 20:43:32 GMT:::1(57872):5ba557f4.822:[unknown]@[unknown]:[2082]:LOG:  connection received: host=::1 port=57872
                    2018-09-21 20:43:32 GMT::5ba557f4.821:@:[2081]:LOG:  MultiXact member wraparound protections are now enabled
TOO LATE BUDDY -->  2018-09-21 20:43:32 GMT::5ba557f4.81f:@:[2079]:LOG:  database system is ready to accept connections
                    2018-09-21 20:43:32 GMT:::1(57872):5ba557f4.822:root@vmdb_production:[2082]:FATAL:  the database system is starting up
                    2018-09-21 20:43:32 GMT::5ba557f4.826:@:[2086]:LOG:  autovacuum launcher started
                    2018-09-21 20:43:32 GMT::5ba557f4.828:@:[2088]:LOG:  starting pglogical supervisor
                    2018-09-21 20:43:32 GMT:::1(57874):5ba557f4.829:[unknown]@[unknown]:[2089]:LOG:  connection received: host=::1 port=57874
NON-SSL now ----->  2018-09-21 20:43:32 GMT:::1(57874):5ba557f4.829:root@vmdb_production:[2089]:FATAL:  no pg_hba.conf entry for host "::1", user "root", database "vmdb_production", SSL off

@jrafanie jrafanie force-pushed the workaround_reconnect_failing_for_ssl_connections branch from ca3d193 to 2f236f6 Compare September 21, 2018 21:03
@jrafanie
Copy link
Member Author

ok, updated with @carbonin's suggestions, thanks!

@miq-bot
Copy link
Member

miq-bot commented Sep 21, 2018

Checked commit jrafanie@2f236f6 with ruby 2.3.3, rubocop 0.52.1, haml-lint 0.20.0, and yamllint 1.10.0
4 files checked, 0 offenses detected
Everything looks fine. 🏆

@jrafanie
Copy link
Member Author

Note, using the above LOVELY for x in bash script, I can get this to reliably recreate without this code change within 30 minutes... with the code change, it went a few hours before I just killed it.

Copy link
Member

@kbrock kbrock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good billy rae

@carbonin carbonin self-assigned this Sep 21, 2018
@carbonin carbonin merged commit 7467510 into ManageIQ:master Sep 21, 2018
@carbonin carbonin added this to the Sprint 95 Ending Sep 24, 2018 milestone Sep 21, 2018
@jrafanie jrafanie deleted the workaround_reconnect_failing_for_ssl_connections branch September 24, 2018 16:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants