Session requests failing after upgrade #8868

Closed
jkuester opened this issue Feb 13, 2024 · 11 comments
Labels: Priority: 1 - High (blocks the next release); Type: Bug (fix something that isn't working as intended)

Comments

@jkuester
Contributor

Describe the bug
Upgrading to 4.6.0-beta.2 somehow seems to break the session/authentication handling in the api server! You can no longer log in to the webapp. Also, even direct REST calls to the api with basic auth fail with a 400 error. (Though calls to open endpoints like api/v2/monitoring are successful.)

To Reproduce

  1. Deploy a CHT instance (e.g. via EKS or docker-helper).
    1. Of course I cannot repro this locally when just building from the 4.6.x branch....
  2. Upgrade it to 4.6.0-beta.2 via the admin app.
    1. If you Stage first and then Install, the upgrade will complete without any errors being displayed on the admin page. If you just directly Install without staging, then an error will display on the upgrade page, but the upgrade will actually complete successfully.
  3. Log out
  4. Try to log back in and see error
  5. Try directly calling an authenticated endpoint (e.g. api/v2/upgrade) and see the error (an example call is sketched after this list)
    1. The response I get back is:
      400 - "<html><body><h1>400 Bad request</h1>\nYour browser sent an invalid request.\n</body></html>\n"
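
As a purely illustrative sketch of the direct call in step 5 (host and credentials are placeholders, not values from this deployment; the actual calls in the logs below were made with Insomnia), using Node 18+'s global fetch in an ESM script:

// Placeholder host and credentials, for illustration only.
const auth = Buffer.from('medic:password').toString('base64');
const res = await fetch('https://example-cht-instance.dev/api/v2/upgrade', {
  headers: { Authorization: `Basic ${auth}` },
});
console.log(res.status); // 400 on 4.6.0-beta.2, per the logs below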
      

Expected behavior
Upgrading should not break the deployment. If it does, you are not going to space today....

Logs
Here are the logs directly after making 5 back-to-back calls to the api/v2/upgrade endpoint (and getting the above 400 error every time)

api

2024-02-13T20:15:29.563 REQ: 1f694f26-1dbf-4b33-b631-776a8a98bb67 216.147.121.215 medic GET /api/v2/upgrade HTTP/1.1
2024-02-13T20:15:29.572 RES: 1f694f26-1dbf-4b33-b631-776a8a98bb67 216.147.121.215 medic GET /api/v2/upgrade HTTP/1.1 400 - 8.494 ms
2024-02-13T20:15:30.355 REQ: d98c78b4-784f-450e-83c3-252e802f0db6 216.147.121.215 medic GET /api/v2/upgrade HTTP/1.1
2024-02-13T20:15:30.362 RES: d98c78b4-784f-450e-83c3-252e802f0db6 216.147.121.215 medic GET /api/v2/upgrade HTTP/1.1 400 - 5.724 ms
2024-02-13T20:15:31.315 REQ: 084cd065-f8f9-48cb-9686-852d51cb9c6d 216.147.121.215 medic GET /api/v2/upgrade HTTP/1.1
2024-02-13T20:15:31.321 RES: 084cd065-f8f9-48cb-9686-852d51cb9c6d 216.147.121.215 medic GET /api/v2/upgrade HTTP/1.1 400 - 5.863 ms
2024-02-13T20:15:32.083 REQ: 316e3468-c569-4da1-8013-97b90180ead9 216.147.121.215 medic GET /api/v2/upgrade HTTP/1.1
2024-02-13T20:15:32.089 RES: 316e3468-c569-4da1-8013-97b90180ead9 216.147.121.215 medic GET /api/v2/upgrade HTTP/1.1 400 - 5.747 ms
2024-02-13T20:15:33.084 REQ: 0d4ffc80-786f-496a-bb00-e456b4bc97f9 216.147.121.215 medic GET /api/v2/upgrade HTTP/1.1
2024-02-13T20:15:33.090 RES: 0d4ffc80-786f-496a-bb00-e456b4bc97f9 216.147.121.215 medic GET /api/v2/upgrade HTTP/1.1 400 - 5.917 ms

haproxy

<150>Feb 13 20:15:29 haproxy[12]: 100.64.207.147,couchdb.joshk-dev.svc.cluster.local,200,3,0,0,GET,/_session,-,medic,'-',323,1,148,'insomnia/2023.5.8'
<150>Feb 13 20:15:29 haproxy[12]: 100.64.207.147,<NOSRV>,400,0,0,-1,-,-,-,-,'-',0,-1,-,'-'
<150>Feb 13 20:15:30 haproxy[12]: 100.64.207.147,couchdb.joshk-dev.svc.cluster.local,200,2,0,0,GET,/_session,-,medic,'-',323,1,148,'insomnia/2023.5.8'
<150>Feb 13 20:15:30 haproxy[12]: 100.64.207.147,<NOSRV>,400,0,0,-1,-,-,-,-,'-',0,-1,-,'-'
<150>Feb 13 20:15:31 haproxy[12]: 100.64.207.147,couchdb.joshk-dev.svc.cluster.local,200,2,0,0,GET,/_session,-,medic,'-',323,1,148,'insomnia/2023.5.8'
<150>Feb 13 20:15:31 haproxy[12]: 100.64.207.147,<NOSRV>,400,0,0,-1,-,-,-,-,'-',0,-1,-,'-'
<150>Feb 13 20:15:32 haproxy[12]: 100.64.207.147,couchdb.joshk-dev.svc.cluster.local,200,2,0,0,GET,/_session,-,medic,'-',323,1,148,'insomnia/2023.5.8'
<150>Feb 13 20:15:32 haproxy[12]: 100.64.207.147,<NOSRV>,400,0,0,-1,-,-,-,-,'-',0,-1,-,'-'
<150>Feb 13 20:15:33 haproxy[12]: 100.64.207.147,couchdb.joshk-dev.svc.cluster.local,200,2,0,0,GET,/_session,-,medic,'-',323,1,148,'insomnia/2023.5.8'
<150>Feb 13 20:15:33 haproxy[12]: 100.64.207.147,<NOSRV>,400,0,0,-1,-,-,-,-,'-',0,-1,-,'-'

couch

[notice] 2024-02-13T20:15:29.569501Z couchdb@127.0.0.1 <0.3300.1> 6c245bafd8 josh45.dev.medicmobile.org 100.64.207.147 medic GET /_session 200 ok 0
[notice] 2024-02-13T20:15:30.360297Z couchdb@127.0.0.1 <0.3306.1> c06dacc4bd josh45.dev.medicmobile.org 100.64.207.147 medic GET /_session 200 ok 0
[notice] 2024-02-13T20:15:31.319901Z couchdb@127.0.0.1 <0.3314.1> 72522dca7e josh45.dev.medicmobile.org 100.64.207.147 medic GET /_session 200 ok 0
[notice] 2024-02-13T20:15:31.747582Z couchdb@127.0.0.1 <0.3317.1> 3587fac42b couchdb.joshk-dev.svc.cluster.local:5984 100.64.241.227 medic GET /_membership 200 ok 0
[notice] 2024-02-13T20:15:32.087835Z couchdb@127.0.0.1 <0.3318.1> a2f2462741 josh45.dev.medicmobile.org 100.64.207.147 medic GET /_session 200 ok 0
[notice] 2024-02-13T20:15:33.088422Z couchdb@127.0.0.1 <0.3354.1> 29679fb9ed josh45.dev.medicmobile.org 100.64.207.147 medic GET /_session 200 ok 0

Environment

  • Version: 4.6.0-beta.2

Additional context

We first noticed this issue because the Upgrade job is failing on the CI for the beta tag.

@jkuester jkuester added the Type: Bug (fix something that isn't working as intended) label Feb 13, 2024
@1yuv
Member

1yuv commented Feb 13, 2024

If you Stage first and then Install, the upgrade will complete without any errors being displayed on the admin page. If you just directly Install without staging, then an error will display on the upgrade page, but the upgrade will actually complete successfully.

If it's helpful, I saw exactly this behavior while upgrading instances from 4.5 to 4.5.2.

@dianabarsan
Member

dianabarsan commented Feb 14, 2024

This Haproxy log is telling.

<150>Feb 13 20:15:31 haproxy[12]: 100.64.207.147,<NOSRV>,400,0,0,-1,-,-,-,-,'-',0,-1,-,'-'

NOSRV means haproxy doesn't know where to send this request.

It seems haproxy has no connection to your CouchDB. How are you deploying this?
We test authentication quite extensively in e2e tests, running the same images that we publish. I'm suspecting something else is amiss here.

@jkuester
Contributor Author

@dianabarsan This was done via an EKS deployment. I started with 4.5.0 and had no issues upgrading to 4.5.1 and/or 4.5.2. But I was able to consistently recreate this behavior on repeated fresh EKS deployments.... I can try testing this again later on docker-helper to see if I get the same behavior!

@jkuester
Contributor Author

Okay, so while testing this locally via the docker-helper, I was able to reproduce the step #2 behavior where the upgrade completes, but the admin app shows an error. BUT, I did not run into the same session/auth errors. Originally I was thinking these two were related, but now I am not so sure.... 🤔 (Also Yuvraj's experience seems to support this.)

@jkuester
Contributor Author

I think the CI test is failing due to the error being displayed in the admin app. So, really the session/auth stuff might just be weirdness on EKS somehow? 🤔

@dianabarsan
Member

dianabarsan commented Feb 15, 2024

I think the upgrade error shown in the admin app is a UI display issue.
I think your EKS upgrade issue is not related, and is in fact related to https://github.com/medic/config-moh-togo/issues/178#issuecomment-1938030137 , which @nydr and I are investigating (one of us, at least). I am strongly suspecting a k8s upgrade service bug here, which is highly concerning.
I am less concerned about an AdminApp UI bug, but I'll look into that one as well.

Thanks a lot for following up!

@jkuester jkuester added this to the 4.6.0 milestone Feb 15, 2024
@dianabarsan
Member

Ok, I no longer believe this issue is the same as the moh-togo upgrade. I think this is isolated to 4.6.x and how API sends requests to haproxy.
This is now definitely a blocking issue for releasing 4.6.

@dianabarsan dianabarsan added the Priority: 1 - High (blocks the next release) label Feb 22, 2024
@dianabarsan
Member

dianabarsan commented Feb 22, 2024

This is caused by bumping Node to v20, and possibly some interaction between the request-promise-native library and the Node 20 http libraries.

This is the POST _session request that API sends to haproxy using rpn, which in turn should return the session cookie:

[22/Feb/2024:12:55:21.698] frontend http-in (#2): invalid request
  backend <NONE> (#-1), server <NONE> (#-1), event #0, src 10.42.0.33:49362
  buffer starts at 0 (including 0 out), 32597 free,
  len 171, wraps at 32672, error at position 26
  H1 connection flags 0x00000100, H1 stream flags 0x00000890
  H1 msg state MSG_RQMETH(2), H1 msg flags 0x00001400
  H1 chunk len 0 bytes, H1 body len 0 bytes :
  
  00000  roxy.dev.svc.cluster.local:5984\r\n
  00033  accept: application/json\r\n
  00059  content-type: application/json\r\n
  00091  content-length: 34\r\n
  00111  Connection: keep-alive\r\n
  00135  \r\n
  00137  {"name":"admin","password":"pass"}

It seems that haproxy somehow receives a truncated request: the buffer starts part-way through a header value ( roxy.dev.svc.clust...). Only wild guesses as to why this happens.

This request shows up in haproxy logs as:

<150>Feb 13 20:15:30 haproxy[12]: 100.64.207.147,<NOSRV>,400,0,0,-1,-,-,-,-,'-',0,-1,-,'-'
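
For context, here is a rough sketch (not the actual API source; the function name and URL are placeholders) of the kind of request-promise-native call that produces the POST _session request shown above:

import * as rpn from 'request-promise-native';

// Illustrative only: post credentials to CouchDB's _session endpoint and
// read back the session cookie from the response headers.
const getSessionCookie = async (couchUrl: string, name: string, password: string) => {
  const response = await rpn.post({
    url: `${couchUrl}/_session`,     // e.g. the haproxy address in front of CouchDB
    json: true,
    body: { name, password },
    resolveWithFullResponse: true,   // keep headers so Set-Cookie is available
  });
  return response.headers['set-cookie'];
};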

@dianabarsan
Member

Downgrading to Node 18 fixes the issue. It seems that Node 19 made some changes to the http library:
https://nodejs.org/en/blog/announcements/v19-release-announce#https11-keepalive-by-default
https://nodejs.org/en/blog/announcements/v19-release-announce#llhttp

I've tried removing all lua password obfuscation scripts from haproxy, but it has not resolved the problem.

At this moment, I'm strongly suspecting that the changes made to the http library are interacting in some unfortunate way with the request library.
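
For reference, the keep-alive change linked above can be demonstrated directly with Node's http module. This is only an illustration of the behavior change (the host is a placeholder), not the fix that was eventually shipped:

import * as http from 'node:http';

// Node 19+ defaults http.globalAgent to keepAlive: true, so outgoing requests
// reuse connections. Passing an explicit agent with keepAlive: false restores
// the pre-Node-19 behavior for a given request.
const agent = new http.Agent({ keepAlive: false });

http.get(
  { host: 'haproxy.example.svc.cluster.local', port: 5984, path: '/_session', agent },
  (res) => console.log(res.statusCode)
);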

@dianabarsan
Member

Node 19 adds a keep-alive header to requests by default. This makes the content-length header very significant, because it is what lets the server know when one request has been fully received and the next one begins on the same connection.
When authenticating a user, we take all headers from the original user's request and pass them to a GET /_session request, presumably so we don't have to filter out all the ways through which the user can authenticate (cookie, authorization, etc.). This meant that if the original request was a POST and it had a content-length header, that header was also forwarded. When the connection was later reused, the server (haproxy in this case) believed the previous request was not yet finished and consumed the previous request's content-length worth of characters from the start of the next request. Obviously, this truncated next request was invalid and generated a 400 status code.

This only seems to happen when accessing API directly or through the AWS load balancer, and never when running requests through nginx.

We shouldn't really be forwarding all headers from one request to another.
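
A minimal sketch of the mitigation described in the commits below: copy the caller's headers onto the GET /_session request but drop content-length, so a reused keep-alive connection is not left expecting a body that belongs to a different request. The function name is illustrative, not the actual API source:

import type { IncomingHttpHeaders } from 'node:http';

// Copy the incoming request's headers, minus content-length, before forwarding
// them to the GET /_session request (a GET carries no body, so a stale length
// would otherwise cause the server to swallow the start of the next request).
const sessionRequestHeaders = (incoming: IncomingHttpHeaders): IncomingHttpHeaders => {
  const headers: IncomingHttpHeaders = { ...incoming };
  delete headers['content-length'];
  return headers;
};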

dianabarsan added a commit that referenced this issue Mar 14, 2024
Node 19 adds keep-alive header to requests by default. This makes the content-length header very significant and important because it enables the server to know when one request has been fully received and another one is sent through the same connection.
When authenticating a user, we take all headers from the original user's request and pass them to a GET /_session request, presumably so we don't have to filter out all the ways through which the user can authenticate (cookie, authorization etc). This meant that if the original request was a POST, and it had a content-length header, this header was also forwarded. This meant that when the connection would later be reused, the server (haproxy in this case) would truncate the previous request's content-length number of characters from the next request, believing the previous request was not finished. Obviously, this new truncated request was invalid and generated a 400 status code.

This only seems to happen when accessing API directly or through the AWS load balancer, and never when running requests through nginx.

To avoid next request truncation, I am no longer forwarding content-length headers in the session request.

#8868
dianabarsan added a commit that referenced this issue Mar 14, 2024
Node 19 adds keep-alive header to requests by default. This makes the content-length header very significant and important because it enables the server to know when one request has been fully received and another one is sent through the same connection.
When authenticating a user, we take all headers from the original user's request and pass them to a GET /_session request, presumably so we don't have to filter out all the ways through which the user can authenticate (cookie, authorization etc). This meant that if the original request was a POST, and it had a content-length header, this header was also forwarded. This meant that when the connection would later be reused, the server (haproxy in this case) would truncate the previous request's content-length number of characters from the next request, believing the previous request was not finished. Obviously, this new truncated request was invalid and generated a 400 status code.

This only seems to happen when accessing API directly or through the AWS load balancer, and never when running requests through nginx.

To avoid next request truncation, I am no longer forwarding content-length headers in the session request.

#8868

(cherry picked from commit 831bd6e)
@jkuester jkuester changed the title from "Session/auth issues after upgrading to 4.6.0-beta.2" to "Session requests failing after upgrade" Mar 14, 2024
dianabarsan added a commit that referenced this issue Mar 15, 2024
Node 19 adds keep-alive header to requests by default. This makes the content-length header very significant and important because it enables the server to know when one request has been fully received and another one is sent through the same connection.
When authenticating a user, we take all headers from the original user's request and pass them to a GET /_session request, presumably so we don't have to filter out all the ways through which the user can authenticate (cookie, authorization etc). This meant that if the original request was a POST, and it had a content-length header, this header was also forwarded. This meant that when the connection would later be reused, the server (haproxy in this case) would truncate the previous request's content-length number of characters from the next request, believing the previous request was not finished. Obviously, this new truncated request was invalid and generated a 400 status code.

This only seems to happen when accessing API directly or through the AWS load balancer, and never when running requests through nginx.

To avoid next request truncation, I am no longer forwarding content-length headers in the session request.

#8868

(cherry picked from commit 831bd6e)
@dianabarsan
Member

Merged to master and backported to 4.6.x
