Session requests failing after upgrade #8868

Closed
jkuester opened this issue Feb 13, 2024 · 11 comments
Labels: Priority: 1 - High (blocks the next release); Type: Bug (fix something that isn't working as intended)

Comments

@jkuester
Contributor

Describe the bug
Upgrading to 4.6.0-beta.2 somehow seems to break the session/authentication handling in the api server! You can no longer log in to the webapp. Also, even direct REST calls to the api with basic auth fail with a 400 error. (Though calls to open endpoints like api/v2/monitoring are successful.)

To Reproduce

  1. Deploy a CHT instance (e.g. via EKS or docker-helper).
    1. Of course I cannot repro this locally when just building from the 4.6.x branch....
  2. Upgrade it to 4.6.0-beta.2 via the admin app.
    1. If you Stage first and then Install, the upgrade will complete without any errors being displayed on the admin page. If you just directly Install without staging, then an error will display on the upgrade page, but the upgrade will actually complete successfully.
  3. Log out
  4. Try to log back in and see error
  5. Try directly calling an authenticated endpoint (e.g. api/v2/upgrade) and see the error (an example call is sketched after this list)
    1. The response I get back is:
      400 - "<html><body><h1>400 Bad request</h1>\nYour browser sent an invalid request.\n</body></html>\n"
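
As a purely illustrative sketch of the direct call in step 5 (host and credentials are placeholders, not values from this deployment; the actual calls in the logs below were made with Insomnia), using Node 18+'s global fetch in an ESM script:

// Placeholder host and credentials, for illustration only.
const auth = Buffer.from('medic:password').toString('base64');
const res = await fetch('https://example-cht-instance.dev/api/v2/upgrade', {
  headers: { Authorization: `Basic ${auth}` },
});
console.log(res.status); // 400 on 4.6.0-beta.2, per the logs below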
      

Expected behavior
Upgrading should not break the deployment. If it does, you are not going to space today....

Logs
Here are the logs directly after making 5 back-to-back calls to the api/v2/upgrade endpoint (and getting the above 400 error every time)

api

2024-02-13T20:15:29.563 REQ: 1f694f26-1dbf-4b33-b631-776a8a98bb67 216.147.121.215 medic GET /api/v2/upgrade HTTP/1.1
2024-02-13T20:15:29.572 RES: 1f694f26-1dbf-4b33-b631-776a8a98bb67 216.147.121.215 medic GET /api/v2/upgrade HTTP/1.1 400 - 8.494 ms
2024-02-13T20:15:30.355 REQ: d98c78b4-784f-450e-83c3-252e802f0db6 216.147.121.215 medic GET /api/v2/upgrade HTTP/1.1
2024-02-13T20:15:30.362 RES: d98c78b4-784f-450e-83c3-252e802f0db6 216.147.121.215 medic GET /api/v2/upgrade HTTP/1.1 400 - 5.724 ms
2024-02-13T20:15:31.315 REQ: 084cd065-f8f9-48cb-9686-852d51cb9c6d 216.147.121.215 medic GET /api/v2/upgrade HTTP/1.1
2024-02-13T20:15:31.321 RES: 084cd065-f8f9-48cb-9686-852d51cb9c6d 216.147.121.215 medic GET /api/v2/upgrade HTTP/1.1 400 - 5.863 ms
2024-02-13T20:15:32.083 REQ: 316e3468-c569-4da1-8013-97b90180ead9 216.147.121.215 medic GET /api/v2/upgrade HTTP/1.1
2024-02-13T20:15:32.089 RES: 316e3468-c569-4da1-8013-97b90180ead9 216.147.121.215 medic GET /api/v2/upgrade HTTP/1.1 400 - 5.747 ms
2024-02-13T20:15:33.084 REQ: 0d4ffc80-786f-496a-bb00-e456b4bc97f9 216.147.121.215 medic GET /api/v2/upgrade HTTP/1.1
2024-02-13T20:15:33.090 RES: 0d4ffc80-786f-496a-bb00-e456b4bc97f9 216.147.121.215 medic GET /api/v2/upgrade HTTP/1.1 400 - 5.917 ms

haproxy

<150>Feb 13 20:15:29 haproxy[12]: 100.64.207.147,couchdb.joshk-dev.svc.cluster.local,200,3,0,0,GET,/_session,-,medic,'-',323,1,148,'insomnia/2023.5.8'
<150>Feb 13 20:15:29 haproxy[12]: 100.64.207.147,<NOSRV>,400,0,0,-1,-,-,-,-,'-',0,-1,-,'-'
<150>Feb 13 20:15:30 haproxy[12]: 100.64.207.147,couchdb.joshk-dev.svc.cluster.local,200,2,0,0,GET,/_session,-,medic,'-',323,1,148,'insomnia/2023.5.8'
<150>Feb 13 20:15:30 haproxy[12]: 100.64.207.147,<NOSRV>,400,0,0,-1,-,-,-,-,'-',0,-1,-,'-'
<150>Feb 13 20:15:31 haproxy[12]: 100.64.207.147,couchdb.joshk-dev.svc.cluster.local,200,2,0,0,GET,/_session,-,medic,'-',323,1,148,'insomnia/2023.5.8'
<150>Feb 13 20:15:31 haproxy[12]: 100.64.207.147,<NOSRV>,400,0,0,-1,-,-,-,-,'-',0,-1,-,'-'
<150>Feb 13 20:15:32 haproxy[12]: 100.64.207.147,couchdb.joshk-dev.svc.cluster.local,200,2,0,0,GET,/_session,-,medic,'-',323,1,148,'insomnia/2023.5.8'
<150>Feb 13 20:15:32 haproxy[12]: 100.64.207.147,<NOSRV>,400,0,0,-1,-,-,-,-,'-',0,-1,-,'-'
<150>Feb 13 20:15:33 haproxy[12]: 100.64.207.147,couchdb.joshk-dev.svc.cluster.local,200,2,0,0,GET,/_session,-,medic,'-',323,1,148,'insomnia/2023.5.8'
<150>Feb 13 20:15:33 haproxy[12]: 100.64.207.147,<NOSRV>,400,0,0,-1,-,-,-,-,'-',0,-1,-,'-'

couch

[notice] 2024-02-13T20:15:29.569501Z couchdb@127.0.0.1 <0.3300.1> 6c245bafd8 josh45.dev.medicmobile.org 100.64.207.147 medic GET /_session 200 ok 0
[notice] 2024-02-13T20:15:30.360297Z couchdb@127.0.0.1 <0.3306.1> c06dacc4bd josh45.dev.medicmobile.org 100.64.207.147 medic GET /_session 200 ok 0
[notice] 2024-02-13T20:15:31.319901Z couchdb@127.0.0.1 <0.3314.1> 72522dca7e josh45.dev.medicmobile.org 100.64.207.147 medic GET /_session 200 ok 0
[notice] 2024-02-13T20:15:31.747582Z couchdb@127.0.0.1 <0.3317.1> 3587fac42b couchdb.joshk-dev.svc.cluster.local:5984 100.64.241.227 medic GET /_membership 200 ok 0
[notice] 2024-02-13T20:15:32.087835Z couchdb@127.0.0.1 <0.3318.1> a2f2462741 josh45.dev.medicmobile.org 100.64.207.147 medic GET /_session 200 ok 0
[notice] 2024-02-13T20:15:33.088422Z couchdb@127.0.0.1 <0.3354.1> 29679fb9ed josh45.dev.medicmobile.org 100.64.207.147 medic GET /_session 200 ok 0

Environment

  • Version: 4.6.0-beta.2

Additional context

We first noticed this issue because the Upgrade job is failing on the CI for the beta tag.

@jkuester jkuester added the Type: Bug (fix something that isn't working as intended) label Feb 13, 2024
@1yuv
Member

1yuv commented Feb 13, 2024

If you Stage first and then Install, the upgrade will complete without any errors being displayed on the admin page. If you just directly Install without staging, then an error will display on the upgrade page, but the upgrade will actually complete successfully.

If it's helpful, I saw exactly this behavior while upgrading instances from 4.5 to 4.5.2.

@dianabarsan
Member

dianabarsan commented Feb 14, 2024

This Haproxy log is telling.

<150>Feb 13 20:15:31 haproxy[12]: 100.64.207.147,<NOSRV>,400,0,0,-1,-,-,-,-,'-',0,-1,-,'-'

NOSRV means haproxy doesn't know where to send this request.

It seems haproxy has no connection to your CouchDB. How are you deploying this?
We test authentication quite extensively in e2e tests, running the same images that we publish. I'm suspecting something else is amiss here.

@jkuester
Contributor Author

@dianabarsan This was done via an EKS deployment. I started with 4.5.0 and had no issues upgrading to 4.5.1 and/or 4.5.2. But I was able to consistently recreate this behavior on repeated fresh EKS deployments.... I can try testing this again later on docker-helper to see if I get the same behavior!

@jkuester
Contributor Author

Okay, so while testing this locally via the docker-helper, I was able to reproduce the step #2 behavior where the upgrade completes, but the admin app shows an error. BUT, I did not run into the same session/auth errors. Originally I was thinking these two were related, but now I am not so sure.... 🤔 (Also Yuvraj's experience seems to support this.)

@jkuester
Contributor Author

I think the CI test is failing due to the error being displayed in the admin app. So, really the session/auth stuff might just be weirdness on EKS somehow? 🤔

@dianabarsan
Member

dianabarsan commented Feb 15, 2024

I think the upgrade error shown in the admin app is a UI display issue.
I think your EKS upgrade issue is not related, and is in fact related to https://github.com/medic/config-moh-togo/issues/178#issuecomment-1938030137 , which @nydr and I are investigating (one of us, at least). I am strongly suspecting a k8s upgrade service bug here, which is highly concerning.
I am less concerned about an AdminApp UI bug, but I'll look into that one as well.

Thanks a lot for following up!

@jkuester jkuester added this to the 4.6.0 milestone Feb 15, 2024
@dianabarsan
Member

Ok, I no longer believe this issue is the same as the moh-togo upgrade. I think this is isolated to 4.6.x and how API sends requests to haproxy.
This is now definitely a blocking issue for releasing 4.6.

@dianabarsan dianabarsan added the Priority: 1 - High (blocks the next release) label Feb 22, 2024
@dianabarsan
Member

dianabarsan commented Feb 22, 2024

This is caused by bumping Node to v20, and possibly some interaction between the request-promise-native library and the Node 20 http libraries.

This is the POST _session request that API sends to haproxy using rpn, which in turn should return the session cookie:

[22/Feb/2024:12:55:21.698] frontend http-in (#2): invalid request
  backend <NONE> (#-1), server <NONE> (#-1), event #0, src 10.42.0.33:49362
  buffer starts at 0 (including 0 out), 32597 free,
  len 171, wraps at 32672, error at position 26
  H1 connection flags 0x00000100, H1 stream flags 0x00000890
  H1 msg state MSG_RQMETH(2), H1 msg flags 0x00001400
  H1 chunk len 0 bytes, H1 body len 0 bytes :
  
  00000  roxy.dev.svc.cluster.local:5984\r\n
  00033  accept: application/json\r\n
  00059  content-type: application/json\r\n
  00091  content-length: 34\r\n
  00111  Connection: keep-alive\r\n
  00135  \r\n
  00137  {"name":"admin","password":"pass"}

It seems that haproxy somehow receives a truncated request: the buffer starts part-way through a header value ( roxy.dev.svc.clust...). Only wild guesses as to why this happens.

This request shows up in haproxy logs as:

<150>Feb 13 20:15:30 haproxy[12]: 100.64.207.147,<NOSRV>,400,0,0,-1,-,-,-,-,'-',0,-1,-,'-'
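
For context, here is a rough sketch (not the actual API source; the function name and URL are placeholders) of the kind of request-promise-native call that produces the POST _session request shown above:

import * as rpn from 'request-promise-native';

// Illustrative only: post credentials to CouchDB's _session endpoint and
// read back the session cookie from the response headers.
const getSessionCookie = async (couchUrl: string, name: string, password: string) => {
  const response = await rpn.post({
    url: `${couchUrl}/_session`,     // e.g. the haproxy address in front of CouchDB
    json: true,
    body: { name, password },
    resolveWithFullResponse: true,   // keep headers so Set-Cookie is available
  });
  return response.headers['set-cookie'];
};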

@dianabarsan
Member

Downgrading to Node 18 fixes the issue. It seems that Node 19 made some changes to the http library:
https://nodejs.org/en/blog/announcements/v19-release-announce#https11-keepalive-by-default
https://nodejs.org/en/blog/announcements/v19-release-announce#llhttp

I've tried removing all lua password obfuscation scripts from haproxy, but it has not resolved the problem.

At this moment, I'm strongly suspecting that the changes made to the http library are interacting in some unfortunate way with the request library.
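
For reference, the keep-alive change linked above can be demonstrated directly with Node's http module. This is only an illustration of the behavior change (the host is a placeholder), not the fix that was eventually shipped:

import * as http from 'node:http';

// Node 19+ defaults http.globalAgent to keepAlive: true, so outgoing requests
// reuse connections. Passing an explicit agent with keepAlive: false restores
// the pre-Node-19 behavior for a given request.
const agent = new http.Agent({ keepAlive: false });

http.get(
  { host: 'haproxy.example.svc.cluster.local', port: 5984, path: '/_session', agent },
  (res) => console.log(res.statusCode)
);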

@dianabarsan
Member

Node 19 adds a keep-alive header to requests by default. This makes the content-length header very significant, because it is what lets the server know when one request has been fully received and the next one begins on the same connection.
When authenticating a user, we take all headers from the original user's request and pass them to a GET /_session request, presumably so we don't have to filter out all the ways through which the user can authenticate (cookie, authorization, etc.). This meant that if the original request was a POST and it had a content-length header, that header was also forwarded. When the connection was later reused, the server (haproxy in this case) believed the previous request was not yet finished and consumed the previous request's content-length worth of characters from the start of the next request. Obviously, this truncated next request was invalid and generated a 400 status code.

This only seems to happen when accessing API directly or through the AWS load balancer, and never when running requests through nginx.

We shouldn't really be forwarding all headers from one request to another.
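
A minimal sketch of the mitigation described in the commits below: copy the caller's headers onto the GET /_session request but drop content-length, so a reused keep-alive connection is not left expecting a body that belongs to a different request. The function name is illustrative, not the actual API source:

import type { IncomingHttpHeaders } from 'node:http';

// Copy the incoming request's headers, minus content-length, before forwarding
// them to the GET /_session request (a GET carries no body, so a stale length
// would otherwise cause the server to swallow the start of the next request).
const sessionRequestHeaders = (incoming: IncomingHttpHeaders): IncomingHttpHeaders => {
  const headers: IncomingHttpHeaders = { ...incoming };
  delete headers['content-length'];
  return headers;
};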

dianabarsan added a commit that referenced this issue Mar 14, 2024
Node 19 adds keep-alive header to requests by default. This makes the content-length header very significant and important because it enables the server to know when one request has been fully received and another one is sent through the same connection.
When authenticating a user, we take all headers from the original user's request and pass them to a GET /_session request, presumably so we don't have to filter out all the ways through which the user can authenticate (cookie, authorization etc). This meant that if the original request was a POST, and it had a content-length header, this header was also forwarded. This meant that when the connection would later be reused, the server (haproxy in this case) would truncate the previous request's content-length number of characters from the next request, believing the previous request was not finished. Obviously, this new truncated request was invalid and generated a 400 status code.

This only seems to happen when accessing API directly or through the AWS load balancer, and never when running requests through nginx.

To avoid next request truncation, I am no longer forwarding content-length headers in the session request.

#8868
dianabarsan added a commit that referenced this issue Mar 14, 2024
Node 19 adds keep-alive header to requests by default. This makes the content-length header very significant and important because it enables the server to know when one request has been fully received and another one is sent through the same connection.
When authenticating a user, we take all headers from the original user's request and pass them to a GET /_session request, presumably so we don't have to filter out all the ways through which the user can authenticate (cookie, authorization etc). This meant that if the original request was a POST, and it had a content-length header, this header was also forwarded. This meant that when the connection would later be reused, the server (haproxy in this case) would truncate the previous request's content-length number of characters from the next request, believing the previous request was not finished. Obviously, this new truncated request was invalid and generated a 400 status code.

This only seems to happen when accessing API directly or through the AWS load balancer, and never when running requests through nginx.

To avoid next request truncation, I am no longer forwarding content-length headers in the session request.

#8868

(cherry picked from commit 831bd6e)
@jkuester jkuester changed the title from "Session/auth issues after upgrading to 4.6.0-beta.2" to "Session requests failing after upgrade" Mar 14, 2024
dianabarsan added a commit that referenced this issue Mar 15, 2024
Node 19 adds keep-alive header to requests by default. This makes the content-length header very significant and important because it enables the server to know when one request has been fully received and another one is sent through the same connection.
When authenticating a user, we take all headers from the original user's request and pass them to a GET /_session request, presumably so we don't have to filter out all the ways through which the user can authenticate (cookie, authorization etc). This meant that if the original request was a POST, and it had a content-length header, this header was also forwarded. This meant that when the connection would later be reused, the server (haproxy in this case) would truncate the previous request's content-length number of characters from the next request, believing the previous request was not finished. Obviously, this new truncated request was invalid and generated a 400 status code.

This only seems to happen when accessing API directly or through the AWS load balancer, and never when running requests through nginx.

To avoid next request truncation, I am no longer forwarding content-length headers in the session request.

#8868

(cherry picked from commit 831bd6e)
@dianabarsan
Member

Merged to master and backported to 4.6.x
