
Complement TestWriteMDirectAccountData is flakey #13334

Closed
DMRobertson opened this issue Jul 20, 2022 · 45 comments
Labels
A-Testing: Issues related to testing in complement, synapse, etc
T-Task: Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.
Z-Dev-Wishlist: Makes developers' lives better, but doesn't have direct user impact
Z-Flake: Tests that give intermittent failures

Comments

@DMRobertson added the T-Task, Z-Flake and Z-Dev-Wishlist labels on Jul 20, 2022
@DMRobertson self-assigned this on Jul 20, 2022
@DMRobertson (Contributor, Author)

The errors here all look like something is killing Synapse before the /sync and the test complete. Maybe a test timeout?

  2022/07/04 17:38:02 Sharing [SERVER_NAME=hs1 SYNAPSE_COMPLEMENT_USE_WORKERS=true SYNAPSE_COMPLEMENT_DATABASE=postgres] host environment variables with container
  2022/07/04 17:38:23 Sharing [SERVER_NAME=hs1 SYNAPSE_COMPLEMENT_USE_WORKERS=true SYNAPSE_COMPLEMENT_DATABASE=postgres] host environment variables with container
      direct_messaging_test.go:21: Deploy times: 21.283694616s blueprints, 12.775381085s containers
      client.go:550: POST hs1/_matrix/client/r0/createRoom => 200 OK (856.713676ms)
      client.go:550: PUT hs1/_matrix/client/r0/user/@alice:hs1/account_data/m.direct => 200 OK (37.898421ms)
      client.go:550: GET hs1/_matrix/client/r0/sync => 200 OK (288.946171ms)
      client.go:550: POST hs1/_matrix/client/r0/createRoom => 200 OK (765.193117ms)
      client.go:550: PUT hs1/_matrix/client/r0/user/@alice:hs1/account_data/m.direct => 200 OK (53.574637ms)
      client.go:548: GET hs1/_matrix/client/r0/sync => error: read tcp [::1]:56598->[::1]:49162: read: connection reset by peer (1.009787379s)
      direct_messaging_test.go:58: CSAPI.DoFunc response returned error: Get "http://localhost:49162/_matrix/client/r0/sync?since=s15_3_0_1_2_1_1_3_0&timeout=1000": read tcp [::1]:56598->[::1]:49162: read: connection reset by peer
  2022/07/04 17:38:40 ============================================
  2022/07/11 09:45:53 Sharing [SERVER_NAME=hs1 SYNAPSE_COMPLEMENT_USE_WORKERS=true SYNAPSE_COMPLEMENT_DATABASE=postgres] host environment variables with container
      direct_messaging_test.go:21: Deploy times: 20.872806357s blueprints, 12.269527058s containers
      client.go:550: POST hs1/_matrix/client/r0/createRoom => 200 OK (897.854882ms)
      client.go:550: PUT hs1/_matrix/client/r0/user/@alice:hs1/account_data/m.direct => 200 OK (36.485377ms)
      client.go:550: GET hs1/_matrix/client/r0/sync => 200 OK (244.466278ms)
      client.go:550: POST hs1/_matrix/client/r0/createRoom => 200 OK (661.168877ms)
      client.go:550: PUT hs1/_matrix/client/r0/user/@alice:hs1/account_data/m.direct => 200 OK (64.203384ms)
      client.go:548: GET hs1/_matrix/client/r0/sync => error: read tcp [::1]:41930->[::1]:49162: read: connection reset by peer (601.598µs)
      direct_messaging_test.go:58: CSAPI.DoFunc response returned error: Get "http://localhost:49162/_matrix/client/r0/sync?since=s15_3_0_1_2_1_1_3_0&timeout=1000": read tcp [::1]:41930->[::1]:49162: read: connection reset by peer
  2022/07/11 09:46:08 ============================================
 2022/07/19 18:22:14 Sharing [SERVER_NAME=hs1 SYNAPSE_COMPLEMENT_USE_WORKERS=true SYNAPSE_COMPLEMENT_DATABASE=postgres] host environment variables with container
  2022/07/19 18:22:37 Sharing [SERVER_NAME=hs1 SYNAPSE_COMPLEMENT_USE_WORKERS=true SYNAPSE_COMPLEMENT_DATABASE=postgres] host environment variables with container
      direct_messaging_test.go:21: Deploy times: 22.695300434s blueprints, 12.262235796s containers
      client.go:550: POST hs1/_matrix/client/v3/createRoom => 200 OK (990.690686ms)
      client.go:550: PUT hs1/_matrix/client/v3/user/@alice:hs1/account_data/m.direct => 200 OK (68.471235ms)
      client.go:550: GET hs1/_matrix/client/v3/sync => 200 OK (279.595511ms)
      client.go:550: POST hs1/_matrix/client/v3/createRoom => 200 OK (873.907959ms)
      client.go:550: PUT hs1/_matrix/client/v3/user/@alice:hs1/account_data/m.direct => 200 OK (70.438659ms)
      client.go:548: GET hs1/_matrix/client/v3/sync => error: EOF (806.109µs)
      direct_messaging_test.go:58: CSAPI.DoFunc response returned error: Get "http://localhost:49162/_matrix/client/v3/sync?since=s15_3_0_1_2_1_1_3_0&timeout=1000": EOF
  2022/07/19 18:22:53 ============================================

@DMRobertson (Contributor, Author)

Concentrating on https://github.com/matrix-org/synapse/runs/7184887268?check_suite_focus=true for a moment. The Go testing.T logs report a connection reset:

       client.go:548: GET hs1/_matrix/client/r0/sync => error: read tcp [::1]:56598->[::1]:49162: read: connection reset by peer (1.009787379s)
      direct_messaging_test.go:58: CSAPI.DoFunc response returned error: Get "http://localhost:49162/_matrix/client/r0/sync?since=s15_3_0_1_2_1_1_3_0&timeout=1000": read tcp [::1]:56598->[::1]:49162: read: connection reset by peer

And the Synapse entrypoint logs report receiving a SIGTERM:

  2022-07-04 17:38:39,986 WARN received SIGTERM indicating exit request

It's not clear to me if the SIGTERM was sent before or after the /sync connection was reset. That is: did complement send a SIGTERM to shut down the container after the test failed, or did the test fail because of the SIGTERM?

Timestamps on the testing.T logs would help here.
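
As a rough sketch of what that could look like (the helper below is hypothetical and not part of Complement), each test log line could be prefixed with a wall-clock timestamp so it can be lined up against the container's supervisor and nginx logs:

package tests

import (
	"testing"
	"time"
)

// logWithTimestamp is a hypothetical helper (not part of Complement): it
// prefixes each test log line with a wall-clock timestamp so the testing.T
// output can be correlated with the container's supervisor and nginx logs.
func logWithTimestamp(t *testing.T, format string, args ...interface{}) {
	t.Helper()
	prefixed := append([]interface{}{time.Now().UTC().Format(time.RFC3339Nano)}, args...)
	t.Logf("%s "+format, prefixed...)
}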

@DMRobertson (Contributor, Author)

I will buy anyone who fixes this a nice meal or beverage.

@richvdh (Member) commented Sep 21, 2022

The errors here all look like something is killing Synapse before the /sync and the test complete. Maybe a test timeout?

The errors referred to here are:

      client.go:602: [CSAPI] GET hs1/_matrix/client/v3/sync => error: read tcp [::1]:56470->[::1]:49162: read: connection reset by peer (828.386µs)
      direct_messaging_test.go:49: CSAPI.DoFunc response returned error: Get "http://localhost:49162/_matrix/client/v3/sync?timeout=1000": read tcp [::1]:56470->[::1]:49162: read: connection reset by peer

This is, indeed, very mysterious. I don't understand what would cause the connection to drop other than Synapse being killed, which shouldn't happen while complement is still calling /sync. I'll add some debug.

richvdh added a commit to matrix-org/complement that referenced this issue Sep 21, 2022
@richvdh (Member) commented Sep 26, 2022

I am still mystified. It definitely looks like that /sync request is being dropped by Synapse before complement shuts down the container.

I hope #13914 will shed more light.

richvdh added a commit that referenced this issue Oct 5, 2022: "Have nginx send its logs to stderr/out, so that we can debug #13334."
@DMRobertson (Contributor, Author) commented Oct 6, 2022

@richvdh not sure if #14063 worked. Don't think I can see nginx logs in that run's output(?)

@richvdh (Member) commented Oct 6, 2022

@richvdh not sure if #14063 worked. Don't think I can see nginx logs in that run's output(?)

looks like that branch didn't include #14063. Interesting to see the same problem happen on a monolith build though.

@richvdh (Member) commented Oct 6, 2022

It feels significant that this is the first test that gets run by complement.

richvdh added a commit to matrix-org/complement that referenced this issue Oct 6, 2022
@DMRobertson (Contributor, Author)

...it seems suspicious that we haven't seen this since turning on the debug mode.

@richvdh (Member) commented Oct 25, 2022

It really does. It feels like something weird is happening in the Go HTTP client stack.

@DMRobertson (Contributor, Author)

AFAICS turning debug on only hits this and this, neither of which looks particularly menacing.

Maybe we just consider this fixed by debug mode? 🤷

@richvdh (Member) commented Oct 25, 2022

https://github.com/matrix-org/complement/blob/21646b51e62c25196ed978b6eb76d7f1c6ff95ff/internal/client/client.go#L542 gets side-eye from me. The call to DumpResponse has a side-effect of reading the response into a buffer. That might mean that the response gets read quicker, or it might mean that we can now read the response twice.
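
For reference, a small self-contained sketch (illustrative only, not Complement code) of that side-effect: DumpResponse with body=true drains and closes the original body and swaps in an in-memory copy, so the body is still readable afterwards.

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httputil"
	"strings"
)

func main() {
	// A synthetic response standing in for what the CSAPI client receives.
	const payload = `{"ok":true}`
	resp := &http.Response{
		StatusCode:    200,
		Proto:         "HTTP/1.1",
		ProtoMajor:    1,
		ProtoMinor:    1,
		ContentLength: int64(len(payload)),
		Header:        make(http.Header),
		Body:          io.NopCloser(strings.NewReader(payload)),
	}

	// With body=true, DumpResponse drains and closes the original Body, then
	// replaces resp.Body with an in-memory copy so it can still be read later.
	dump, err := httputil.DumpResponse(resp, true)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s\n", dump)

	body, _ := io.ReadAll(resp.Body) // the body is still available after the dump
	fmt.Printf("body after dump: %s\n", body)
}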

Maybe we just consider this fixed by debug mode?

Well, we could. It's going to drive me crazy not knowing though.

@richvdh (Member) commented Oct 25, 2022

matrix-org/complement@21646b5/internal/client/client.go#L542 gets side-eye from me. The call to DumpResponse has a side-effect of reading the response into a buffer. That might mean that the response gets read quicker, or it might mean that we can now read the response twice.

Oh wait, that happens after the error. bah.

@richvdh (Member) commented Oct 25, 2022

I think I want to turn off Debug mode again and see if it still happens, and if so inspect the nginx logs for clues.

@richvdh (Member) commented Oct 26, 2022

I think I want to turn off Debug mode again

done in #538

@squahtx self-assigned this on Nov 30, 2022
@squahtx (Contributor) commented Dec 1, 2022

...it seems suspicious that we haven't seen this since turning on the debug mode.

...DumpResponse...

DumpResponse does have the side effect of closing resp.Body, which we neglect to do in SetGlobalAccountData. As a result, we leak an open HTTP connection, which can't be reused for future requests. The /sync then probably has to start a new connection.
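
As an illustrative sketch of that failure mode (the helper below is hypothetical, not Complement's actual SetGlobalAccountData): in net/http, a response body that is not both fully read and closed pins the underlying connection, so the Transport cannot return it to its idle pool and the next request has to dial a new one.

package sketch

import (
	"io"
	"net/http"
)

// doPut is a hypothetical stand-in for a CSAPI helper such as SetGlobalAccountData.
func doPut(client *http.Client, url string, body io.Reader) error {
	req, err := http.NewRequest(http.MethodPut, url, body)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// Draining the body lets the Transport return the connection to its idle
	// pool; skipping this (and the Close above) leaks the connection, so the
	// following request, e.g. /sync, must open a new one.
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}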

@squahtx (Contributor) commented Dec 1, 2022

We can leak additional connections when starting homeservers. As part of startup, we poll /_matrix/client/versions every 50 ms until it succeeds, also neglecting to close the response body.
https://github.com/matrix-org/complement/blob/91a35bc17769c2367fcf181c763e6cc31405eea4/internal/docker/deployer.go#L454

But when I try leaking ~1012 connections on my machine, it produces a different error once the per-process file descriptor limit is reached (ulimit -n = 1024).

    client.go:602: [CSAPI] PUT hs1/_matrix/client/v3/user/@alice:hs1/account_data/m.direct => 200 OK (8.709116ms)
    ...
    client.go:602: [CSAPI] PUT hs1/_matrix/client/v3/user/@alice:hs1/account_data/m.direct => 200 OK (4.100741ms)
    client.go:602: [CSAPI] PUT hs1/_matrix/client/v3/user/@alice:hs1/account_data/m.direct => 200 OK (4.545437ms)
    client.go:600: [CSAPI] GET hs1/_matrix/client/v3/sync => error: dial tcp: lookup localhost: device or resource busy (82.68µs)
    direct_messaging_test.go:84: CSAPI.DoFunc response returned error: Get "http://localhost:54026/_matrix/client/v3/sync?since=s15_3_0_1_2_1_1_3_0&timeout=1000": dial tcp: lookup localhost: device or resource busy

so that explanation doesn't work.

@squahtx (Contributor) commented Dec 1, 2022

It's strange that the connection reset by peer only ever happens during /sync. If there were some racy interaction with the CSAPI tests running in parallel, we might expect it to affect a random request.

@richvdh (Member) commented Dec 1, 2022

https://github.com/matrix-org/synapse/actions/runs/3561656856/jobs/5982792177 !!

The nginx logs have only this:

  nginx | 172.20.0.1 - - [28/Nov/2022:04:26:37 +0000] "GET /_matrix/client/versions HTTP/1.1" 200 817 "-" "Go-http-client/1.1"
  nginx | 172.20.0.1 - - [28/Nov/2022:04:26:38 +0000] "POST /_matrix/client/v3/createRoom HTTP/1.1" 200 68 "-" "Go-http-client/1.1"
  nginx | 172.20.0.1 - - [28/Nov/2022:04:26:38 +0000] "PUT /_matrix/client/v3/user/@alice:hs1/account_data/m.direct HTTP/1.1" 200 33 "-" "Go-http-client/1.1"
  nginx | 172.20.0.1 - - [28/Nov/2022:04:26:38 +0000] "GET /_matrix/client/v3/sync?timeout=1000 HTTP/1.1" 200 8569 "-" "Go-http-client/1.1"
  nginx | 172.20.0.1 - - [28/Nov/2022:04:26:39 +0000] "POST /_matrix/client/v3/createRoom HTTP/1.1" 200 68 "-" "Go-http-client/1.1"
  nginx | 172.20.0.1 - - [28/Nov/2022:04:26:39 +0000] "PUT /_matrix/client/v3/user/@alice:hs1/account_data/m.direct HTTP/1.1" 200 33 "-" "Go-http-client/1.1"
  2022-11-28 04:26:39,981 INFO stopped: nginx (exit status 0)

... which makes it look like nginx never saw the request in question, putting this pretty squarely in the domain of being a complement bug rather than a synapse one.

Is it possible that Complement is attempting to re-use an existing HTTP connection, but that is racing with nginx deciding to close the connection?
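
One way to probe that hypothesis, sketched under the assumption that the test client's Transport could be swapped out (this is not how Complement is configured today): disable keep-alives so every request dials a fresh connection, and see whether the resets disappear.

package sketch

import (
	"net/http"
	"time"
)

// newNoKeepAliveClient builds an HTTP client that never reuses connections.
// If the flake is a race between reusing an idle connection and nginx (or
// docker-proxy) closing it, requests made with this client should not hit it.
func newNoKeepAliveClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{DisableKeepAlives: true},
		Timeout:   30 * time.Second,
	}
}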

@squahtx (Contributor) commented Dec 1, 2022

Is it possible that Complement is attempting to re-use an existing HTTP connection, but that is racing with nginx deciding to close the connection?

I'm having trouble figuring out how we'd see a connection reset by peer in that case.

From Stack Overflow, connection reset by peer indicates that the remote end closed the TCP connection, or exited while there was unread data.

In this case, the remote end would be the docker-proxy process listening on localhost:49152 and proxying to nginx at container-ip:8008. There's one obvious case where we would see a reset: when docker-proxy fails to connect to the destination and closes the connection without reading any data.

If nginx were to close the connection early, docker-proxy would close the connection to complement too and we would usually see an end-of-stream*. I believe when this happens, there is retry logic in the Go HTTP client that transparently attempts another connection, which can then reset if docker-proxy fails to connect to nginx (this can happen if the container is shut down prematurely during testing).

*I'm unable to get docker-proxy to send a reset once the connection to the backend has been established.

Test setup

docker-proxy

/usr/bin/docker-proxy -proto tcp -host-ip 127.0.0.1 -host-port 49152 -container-ip 127.0.0.1 -container-port 49153 3</dev/zero

nb: docker-proxy is hardcoded to expect fd 3 to exist, otherwise it fails to start.

backend which instantly closes the connection

#!/usr/bin/env python3
# Minimal backend: accept each incoming connection and close it immediately,
# without reading anything, to see whether docker-proxy turns that into a reset.

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0)
s.bind(("127.0.0.1", 49153))
s.listen(1)

while True:
    c, ip = s.accept()
    c.close()

curl sending endless requests

# Write a 1 KiB random request body, then POST it in a tight loop and surface
# any "connection reset" errors from curl.
head -c 1024 /dev/urandom > 1m
while true; do curl "http://localhost:49152" -X POST --data @1m -H "Expect:" 2>&1 | grep reset; done

When running the curl loop for 10 minutes locally, no resets are printed.

@squahtx (Contributor) commented Dec 1, 2022

... which makes it look like nginx never saw the request in question, putting this pretty squarely in the domain of being a complement bug rather than a synapse one.

Would we expect a log line when nginx is shut down before an in-flight request completes? Or would nginx wait for the request to complete before stopping?

@richvdh (Member) commented Dec 1, 2022

Would we expect a log line when nginx is shut down before an in-flight request completes? Or would nginx wait for the request to complete before stopping?

I'm not sure, but what we do know is that complement has received its "reset by peer" before it shuts down the containers. So yes, it's possible that nginx received the request and still considers it in progress, but in that case, where has the "reset by peer" come from?

@squahtx removed their assignment on Mar 29, 2023
@MadLittleMods added the A-Testing label on Apr 25, 2023
@erikjohnston (Member)

Not seen for a while, closing.
