LFS: Cloning objects / batch not found #8273

Closed · 2 of 7 tasks · gabyx opened this issue Sep 24, 2019 · 49 comments · Fixed by #8454

@gabyx

gabyx commented Sep 24, 2019

  • Gitea version: Bug contained in 1.8.3 - 1.9.3
  • Git version: 2.23.0 (local)
  • Operating system: Gitea (Linux, Docker); pushing to the repo from Windows
  • Database (use [x]):
    • PostgreSQL
    • MySQL
    • MSSQL
    • SQLite
  • Can you reproduce the bug at https://try.gitea.io:
    • Yes (provide example URL)
    • No
    • Not relevant

Description

When I upload a repo with LFS objects, the upload mostly works.
While cloning, the LFS smudge filter (here at 58%) always stalls
after some time, saying:

[screenshot of the error message]

After a night of debugging (updating successively through all versions with Docker),
we came to the conclusion that

  • this issue arises in all versions from 1.8.3 to 1.9.3.
  • Versions 1.7.4 - 1.8.2 all work correctly.
  • Setting the repository to private or public did not help (version 1.8.3).

Could it be that the following submissions into 1.8.3 are problematic:

The hints/workarounds in the discussion below did not solve this issue:
https://discourse.gitea.io/t/solved-git-lfs-upload-repeats-infinitely/635/2

Hopefully this gets some attention, since it's a nasty LFS bug which almost turned us into apple crumble. 🍎

@lunny lunny added the type/bug label Sep 26, 2019
@gabyx gabyx changed the title LFS Upload objects/batch not found LFS: Cloning objects / batch not found Sep 26, 2019
@m-a-v

m-a-v commented Sep 27, 2019

I've made some more tests. After compiling the version at commit dbd0a2e (Fix LFS Locks over SSH (#6999) (#7223)), the error appears. The LFS data is large (approximately 10 GB). One commit before (7697a28), everything works perfectly.

I've tried disabling the SSH server, but this doesn't change anything.

@zeripath Let me know if you need more information.

@m-a-v

m-a-v commented Sep 27, 2019

Here you can see the debug log output when the error occurs (PANIC:: runtime error: invalid memory address or nil pointer dereference):

2019/09/27 20:44:19 [D] Could not find repository: company/repository - dial tcp 172.18.0.6:3306: connect: cannot assign requested address
2019/09/27 20:44:19 [D] LFS request - Method: GET, URL: /company/repository.git/info/lfs/objects/063e23a8631392cc939b6b609df91e02d064f3fe279522c3eefeb1c5f1d738a3, Status 404
2019/09/27 20:44:19 [...les/context/panic.go:36 1()] [E] PANIC:: runtime error: invalid memory address or nil pointer dereference
/usr/local/go/src/runtime/panic.go:82 (0x44abc0)
/usr/local/go/src/runtime/signal_unix.go:390 (0x44a9ef)
/go/src/code.gitea.io/gitea/models/repo_permission.go:120 (0x108a0ed)
/go/src/code.gitea.io/gitea/models/repo_permission.go:120 (0x108a0ed)
/go/src/code.gitea.io/gitea/models/repo_permission.go:95 (0x1183338)
/go/src/code.gitea.io/gitea/modules/lfs/server.go:501 (0x118330a)
/go/src/code.gitea.io/gitea/modules/lfs/server.go:128 (0x117f2dd)
/go/src/code.gitea.io/gitea/modules/lfs/server.go:146 (0x117f468)
/go/src/code.gitea.io/gitea/modules/lfs/server.go:105 (0x117ef90)
/usr/local/go/src/reflect/value.go:447 (0x4cb930)
/usr/local/go/src/reflect/value.go:308 (0x4cb3b3)
/go/src/code.gitea.io/gitea/vendor/github.com/go-macaron/inject/inject.go:177 (0x9a1466)
/go/src/code.gitea.io/gitea/vendor/github.com/go-macaron/inject/inject.go:137 (0x9a0d5b)
/go/src/code.gitea.io/gitea/vendor/gopkg.in/macaron.v1/context.go:121 (0x9cff19)
/go/src/code.gitea.io/gitea/vendor/gopkg.in/macaron.v1/context.go:112 (0x11667e8)
/go/src/code.gitea.io/gitea/modules/context/panic.go:40 (0x11667db)
/usr/local/go/src/reflect/value.go:447 (0x4cb930)
/usr/local/go/src/reflect/value.go:308 (0x4cb3b3)
/go/src/code.gitea.io/gitea/vendor/github.com/go-macaron/inject/inject.go:177 (0x9a1466)
/go/src/code.gitea.io/gitea/vendor/github.com/go-macaron/inject/inject.go:137 (0x9a0d5b)
/go/src/code.gitea.io/gitea/vendor/gopkg.in/macaron.v1/context.go:121 (0x9cff19)
/go/src/code.gitea.io/gitea/vendor/gopkg.in/macaron.v1/context.go:112 (0x9efe76)
/go/src/code.gitea.io/gitea/vendor/github.com/go-macaron/session/session.go:192 (0x9efe61)
/go/src/code.gitea.io/gitea/vendor/gopkg.in/macaron.v1/context.go:79 (0x9cfdc0)
/go/src/code.gitea.io/gitea/vendor/github.com/go-macaron/inject/inject.go:157 (0x9a1120)
/go/src/code.gitea.io/gitea/vendor/github.com/go-macaron/inject/inject.go:135 (0x9a0e4a)
/go/src/code.gitea.io/gitea/vendor/gopkg.in/macaron.v1/context.go:121 (0x9cff19)
/go/src/code.gitea.io/gitea/vendor/gopkg.in/macaron.v1/context.go:112 (0x9e197f)
/go/src/code.gitea.io/gitea/vendor/gopkg.in/macaron.v1/recovery.go:161 (0x9e196d)
/go/src/code.gitea.io/gitea/vendor/gopkg.in/macaron.v1/logger.go:40 (0x9d3bb3)
/go/src/code.gitea.io/gitea/vendor/github.com/go-macaron/inject/inject.go:157 (0x9a1120)
/go/src/code.gitea.io/gitea/vendor/github.com/go-macaron/inject/inject.go:135 (0x9a0e4a)
/go/src/code.gitea.io/gitea/vendor/gopkg.in/macaron.v1/context.go:121 (0x9cff19)
/go/src/code.gitea.io/gitea/vendor/gopkg.in/macaron.v1/context.go:112 (0x9e0ca0)
/go/src/code.gitea.io/gitea/vendor/gopkg.in/macaron.v1/logger.go:52 (0x9e0c8b)
/go/src/code.gitea.io/gitea/vendor/gopkg.in/macaron.v1/logger.go:40 (0x9d3bb3)
/go/src/code.gitea.io/gitea/vendor/github.com/go-macaron/inject/inject.go:157 (0x9a1120)
/go/src/code.gitea.io/gitea/vendor/github.com/go-macaron/inject/inject.go:135 (0x9a0e4a)
/go/src/code.gitea.io/gitea/vendor/gopkg.in/macaron.v1/context.go:121 (0x9cff19)
/go/src/code.gitea.io/gitea/vendor/gopkg.in/macaron.v1/router.go:187 (0x9e2bc6)
/go/src/code.gitea.io/gitea/vendor/gopkg.in/macaron.v1/router.go:303 (0x9dc635)
/go/src/code.gitea.io/gitea/vendor/gopkg.in/macaron.v1/macaron.go:220 (0x9d4f8c)
/go/src/code.gitea.io/gitea/vendor/github.com/gorilla/context/context.go:141 (0xce374a)
/usr/local/go/src/net/http/server.go:1995 (0x6f63a3)
/usr/local/go/src/net/http/server.go:2774 (0x6f9677)
/usr/local/go/src/net/http/server.go:1878 (0x6f5360)
/usr/local/go/src/runtime/asm_amd64.s:1337 (0x464c20)
2019/09/27 20:44:19 [D] Template: status/500
2019/09/27 20:44:19 [...les/context/panic.go:36 1()] [E] PANIC:: runtime error: invalid memory address or nil pointer dereference (same stack trace as above, repeated while rendering the 500 page)

@m-a-v

m-a-v commented Sep 28, 2019

I suppose that Gitea is exceeding the number of local socket connections permitted by the OS.

Failure: cannot assign requested address

See also explanation and possible solution here:
golang/go#16012 (comment)

Where could I change the MaxIdleConnsPerHost setting and other LFS server settings to run further tests?
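
For reference, MaxIdleConnsPerHost is a field on Go's standard http.Transport, so any tuning would happen wherever the LFS HTTP client is constructed. A minimal sketch of what raising it looks like at the stdlib level (illustrative only, not Gitea's actual code; the values are placeholders):

```go
package main

import (
	"net/http"
	"time"
)

func main() {
	// Sketch: raise the per-host idle connection limit so keep-alive
	// connections get reused instead of opening a fresh socket per
	// request (each closed socket then lingers in TIME_WAIT).
	transport := &http.Transport{
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 100, // stdlib default is only 2
		IdleConnTimeout:     90 * time.Second,
	}
	client := &http.Client{
		Transport: transport,
		Timeout:   30 * time.Second,
	}
	_ = client // use client.Do/Get against the LFS endpoints
}
```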

@m-a-v

m-a-v commented Sep 28, 2019

BTW: The error PANIC:: runtime error: invalid memory address or nil pointer dereference does not always appear in the log output. Sometimes the server and client just hang.

@m-a-v

m-a-v commented Sep 28, 2019

@lunny Who could help to isolate this bug? Is there any Gitea programmer who could support us? I am willing to make more tests but I need some hints.

@gabyx
Author

gabyx commented Sep 29, 2019

@m-a-v: There is also a setting:

git -c lfs.concurrenttransfers=5 clone

which will probably affect the transfer; nevertheless, it should not crash the server...

@gabyx
Author

gabyx commented Sep 29, 2019

Another interesting read: https://www.fromdual.com/huge-amount-of-time-wait-connections

  • Check ulimit, maxfiles, and somaxconn. Possibly the system runs out of these limited resources. Link

@lunny lunny added this to the 1.9.4 milestone Sep 30, 2019
@lunny
Member

lunny commented Sep 30, 2019

@m-a-v I think @zeripath maybe. But if not, I can take a look at this.

@m-a-v

m-a-v commented Sep 30, 2019

The problem seems to be the huge number of connections for the GET requests (more than 10k connections for a single client!). See also here:

https://medium.com/@valyala/net-http-client-has-the-following-additional-limitations-318ac870ce9d.
https://medium.com/@nate510/don-t-use-go-s-default-http-client-4804cb19f779

@lunny lunny modified the milestones: 1.9.4, 1.9.5 Oct 8, 2019
@zeripath
Contributor

zeripath commented Oct 10, 2019

@m-a-v I've been very busy doing other things for a while so have been away from Gitea. I'll take a look at this.

I think you're on the right trail with the number of connections thing. IIRC there's another person who had a similar issue.

@zeripath
Contributor

@m-a-v I can't understand why dbd0a2e should break things, but I'll double-check.

Maybe the request body isn't being closed, or something stupid like that. If so, that would cause a leak and could explain the issue.

The other possibility is that dbd0a2e has nothing to do with things and it's a Heisenbug relating to the number of connections.

@guillep2k
Member

A netstat -an could be useful to see what state the connections are in when this happens. It doesn't need to make Gitea fail, but it will be useful as long as there is a large number of connections listed. It's not the same if the connections are in the ESTABLISHED state, or CLOSE_WAIT, FIN_WAIT1, etc.

@zeripath
Contributor

OK, so all these calls to ReadCloser() don't Close():

if err := contentStore.Put(meta, ctx.Req.Body().ReadCloser()); err != nil {

dec := json.NewDecoder(r.Body().ReadCloser())

dec := json.NewDecoder(r.Body().ReadCloser())

Whether that's the cause of your bug is another question - however, it would fit with dbd0a2e causing more issues because suddenly you get a lot more calls to unpack.

These should be closed so I guess that's at least a starting point for attempting to fix this. (If I find anything else I will update this.)
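
For context, the usual Go fix for this kind of leak is to close the ReadCloser (ideally with defer) as soon as the handler takes ownership of it. A minimal sketch of that pattern; the batchRequest type and decodeBatch helper are invented for illustration and are not the actual modules/lfs/server.go code:

```go
package main

import (
	"encoding/json"
	"io"
	"log"
	"strings"
)

// batchRequest is a stand-in for the LFS batch payload.
type batchRequest struct {
	Operation string `json:"operation"`
}

// decodeBatch decodes the request body and always closes it, even when
// decoding fails, so the underlying connection can be released or reused.
func decodeBatch(body io.ReadCloser) (*batchRequest, error) {
	defer body.Close()

	var req batchRequest
	if err := json.NewDecoder(body).Decode(&req); err != nil {
		return nil, err
	}
	return &req, nil
}

func main() {
	// In-memory body standing in for ctx.Req.Body().ReadCloser().
	body := io.NopCloser(strings.NewReader(`{"operation":"download"}`))
	req, err := decodeBatch(body)
	if err != nil {
		log.Fatal(err)
	}
	log.Println(req.Operation)
}
```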

@zeripath
Contributor

@m-a-v would you be able to rebuild from my PR #8454 and see if that solves your issue?

@m-a-v

m-a-v commented Oct 11, 2019

@zeripath Thanks a lot. It may take some time until I can test it, but I certainly will.

@zeripath
Contributor

It's actually been merged into the 1.10 and 1.9 branches already.

@m-a-v

m-a-v commented Oct 15, 2019

I've tested it again with 1.10 and it seems that the described LFS bug has been solved, or at least the error no longer appears for this specific scenario. Before @zeripath's fix we had more than 10k connections in a TIME_WAIT state. Now there are still approximately 3.5k connections in the TIME_WAIT state. I assume that if multiple clients access the LFS server, the same problem could still occur.

Any idea how to improve this? Are there other possible leaks? I assume that a connection which closes will not remain in a TIME_WAIT state. Can anyone confirm this?

@zeripath
Contributor

Hi @m-a-v, I guess this means that I must have missed some others. Is there any way of checking that they're all LFS connections?

@m-a-v

m-a-v commented Oct 15, 2019

Indirectly, yes. I had only one active client. Before the LFS checkout I had two connections on the MariaDB database server instance. During the LFS checkout there were about 3.5k connections, and some minutes later again 2 connections.

This article could be interesting:
http://www.serverframework.com/asynchronousevents/2011/01/time-wait-and-its-design-implications-for-protocols-and-scalable-servers.html

@zeripath
Contributor

LFS checkout causes 3.5K connections?! How many LFS objects do you have?

@m-a-v

m-a-v commented Oct 15, 2019

12k LFS objects.

@guillep2k
Member

@zeripath Any connections that Gitea leaves open should remain in either ESTABLISHED or CLOSE_WAIT states.

@zeripath
Contributor

Could it be that git lfs on the client is also leaking connections?

@guillep2k
Member

Could it be that git lfs on the client is also leaking connections?

That would be either FIN_WAIT_1 or FIN_WAIT_2.

TIME_WAIT is a state maintained by the OS to keep the port from being reused (by port I mean the client+server address & port pair).

@guillep2k
Member

This picture should help (but it's not easy to read, so I guess it doesn't):

[TCP state transition diagram]

@m-a-v

m-a-v commented Oct 15, 2019

I think the problem is more the following:

"Your problem is that you are not reusing your MySQL connections within your app but instead you are creating a new connection every time you want to run an SQL query. This involves not only setting up a TCP connection, but then also passing authentication credentials across it. And this is happening for every query (or at least every front-end web request) and it's wasteful and time consuming."

I think this would also speed up Gitea's LFS server a lot.

source: https://serverfault.com/questions/478691/avoid-time-wait-connections

@zeripath
Contributor

AHA! Excellent! Well done for finding that!

@zeripath
Contributor

zeripath commented Oct 15, 2019

OK. We do recycle connections. We use the underlying Go sql connection pool.

For MySQL there are the following in the [database] part of the app.ini:

  • MAX_IDLE_CONNS 0: Max idle database connections on connection pool, default is 0
  • CONN_MAX_LIFETIME 3s: Database connection max lifetime

https://docs.gitea.io/en-us/config-cheat-sheet/#database-database

I think MAX_IDLE_CONNECTIONS was set to 0 because MySQL doesn't like long-lasting connections.

I will, however, make a PR exposing SetConnMaxLifetime. Edit: I'm an idiot; it's already exposed for MySQL.

@zeripath
Contributor

I think what you need to do is tune those variables better. I think our defaults are highly likely to be incorrect - however, I think they were set to this because of other users complaining of problems.

I suspect that MAX_IDLE_CONNECTIONS being set to 0 happened before we adjusted CONN_MAX_LIFETIME, and it could be that we can be more generous with both of these, i.e. something like MAX_IDLE_CONNECTIONS 10 and CONN_MAX_LIFETIME 15m would work.
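
For illustration, those app.ini values map onto Go's standard database/sql pool settings. A minimal sketch of what MAX_IDLE_CONNS 10 and CONN_MAX_LIFETIME 15m correspond to under the hood (placeholder DSN; not Gitea's actual setup code):

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Placeholder DSN; in Gitea this comes from the [database] section of app.ini.
	db, err := sql.Open("mysql", "gitea:password@tcp(127.0.0.1:3306)/gitea")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	db.SetMaxIdleConns(10)                  // MAX_IDLE_CONNS = 10
	db.SetConnMaxLifetime(15 * time.Minute) // CONN_MAX_LIFETIME = 15m
	// MAX_OPEN_CONNS (where exposed) maps to db.SetMaxOpenConns.
}
```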

@m-a-v

m-a-v commented Oct 21, 2019

I could test it again with the repo. Which branch should I take? Which parameters (I've seen that the discussions continued)?

@m-a-v

m-a-v commented Oct 21, 2019

So I've spotted another unclosed thing, which is unlikely to be causing your issue; however, I am suspicious that we're not closing the response body in modules/lfs/server.go.

Did you also fix this?

@m-a-v

m-a-v commented Oct 31, 2019

I have made several experiments with the currently running Gitea server (v1.7.4) and with the new version (v1.9.5). The netstat snapshots were created at the peak of the number of open connections.

Version 1.7.4

root@917128b828cb:/# netstat -ant | awk '{print $6}' | sort | uniq -c | sort -n
      1 Foreign
      1 established)
      2 ESTABLISHED
      2 LISTEN
    162 TIME_WAIT

Version 1.9.5 (same default settings as with 1.7.4)

bash-5.0# netstat -ant | awk '{print $6}' | sort | uniq -c | sort -n
      1 ESTABLISHED
      1 Foreign
      1 established)
      5 LISTEN
  30064 TIME_WAIT

Version 1.9.5 (CONN_MAX_LIFETIME = 45s, MAX_IDLE_CONNS = 10, MAX_OPEN_CONNS = 10)

bash-5.0# netstat -ant | awk '{print $6}' | sort | uniq -c | sort -n
      1 ESTABLISHED
      1 Foreign
      1 established)
      5 LISTEN
  31095 TIME_WAIT

With both configurations the LFS server has far too many open connections, so I think we still have serious problems with large LFS repos.

$ git clone https://domain.org/repo.git test
Cloning into 'test'...
remote: Enumerating objects: 157392, done.
remote: Counting objects: 100% (157392/157392), done.
remote: Compressing objects: 100% (97424/97424), done.
remote: Total 157392 (delta 63574), reused 151365 (delta 57755)
Receiving objects: 100% (157392/157392), 6.99 GiB | 57.68 MiB/s, done.
Resolving deltas: 100% (63574/63574), done.
Updating files: 100% (99264/99264), done.
Filtering content:  53% (6594/12372), 4.13 GiB | 2.38 MiB/s

The clone process just freezes at a certain percentage (as soon as there are too many connections).

I think this bug should be reopened.

@zeripath
Contributor

#8528 was only backported to 1.10 as #8618. It was not backported to 1.9.5.

Setting MAX_OPEN_CONNS won't have any effect on 1.9.5.

Please try on 1.10-rc2 or master.

@m-a-v

m-a-v commented Oct 31, 2019

master (CONN_MAX_LIFETIME = 45s, MAX_IDLE_CONNS = 10, MAX_OPEN_CONNS = 10)

bash-5.0# netstat -ant | awk '{print $6}' | sort | uniq -c | sort -n
      1 FIN_WAIT1
      1 Foreign
      1 established)
      5 ESTABLISHED
      5 LISTEN
   8041 TIME_WAIT

The checkout succeeds, but many connections still remain in the TIME_WAIT state. If multiple clients accessed the LFS server, it could not handle it.

@zeripath
Contributor

zeripath commented Oct 31, 2019

Your max lifetime is probably too low; 45s seems aggressive.

Are you sure all of those connections are db connections? Lots of http connections will be made when dealing with lots of lfs objects. (There are probably some more efficiencies we can find.)

If they're all db then multiple users won't change it - you're likely at your max as it should be mathematically determinable:

Total connections = open + idle + timewait

If max open = max idle:
Max C = O + W

dC/dt = dO/dt + dW/dt

max dO/dt = 0 (as it's fixed)

max dW/dt = max_o/max_l - W/max_tw

dC/dt is positive around C=0, therefore dC/dt=0 should represent the max for positive C and thence maximize W.

max_W = max_tw * max_o / max_l

If they're all db then you have a very long max tw or I've messed up in my maths somewhere.

You can set your TIME_WAIT timeout at the server network stack level.
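
As a rough worked example with the numbers from this thread (assuming Linux's usual 60-second TIME_WAIT and that every TIME_WAIT socket were a DB connection): max_W = max_tw * max_o / max_l ≈ 60s * 10 / 45s ≈ 13 connections, which is nowhere near the ~8k observed, so most of those TIME_WAIT sockets are presumably HTTP connections rather than DB connections.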

@m-a-v

m-a-v commented Oct 31, 2019

I've chosen the 45 seconds from the discussion between you and @guillep2k in #8528.

How are the connections reused? Where is this done in the code? I assume that after a connection is closed it will go into the TIME_WAIT state.

I don't know if all are db connections. Why did it work with 1.7.4 almost perfectly (see above)?

@m-a-v

m-a-v commented Oct 31, 2019

This could be interesting:
https://stackoverflow.com/questions/1931043/avoiding-time-wait

"Probably the best option, if it's doable: refactor your protocol so that connections that are finished aren't closed, but go into an "idle" state so they can be re-used later, instead of opening up a new connection (like HTTP keep-alive)."

"Setting SO_REUSEADDR on the client side doesn't help the server side unless it also sets SO_REUSEADDR"

@guillep2k
Member

guillep2k commented Oct 31, 2019

@zeripath @m-a-v It should be noted that not all TIME_WAIT connections are from the database. Internal requests (e.g. the internal router) and many others will create quick http connections that may or may not be reused.

@m-a-v it would be cool if you'd break your statistics down by listening port number.

@guillep2k
Member

"Probably the best option, if it's doable: refactor your protocol so that connections that are finished aren't closed, but go into an "idle" state so they can be re-used later, instead of opening up a new connection (like HTTP keep-alive)."

"Setting SO_REUSEADDR on the client side doesn't help the server side unless it also sets SO_REUSEADDR"

I don't think SO_REUSEADDR applies here. If you're down to this level of optimization, I'd suggest tuning the tcp_fin_timeout parameter in the kernel. Too short a value will have ill side effects, though; I wouldn't set it below 30 seconds.

But TIME_WAIT is actually the symptom, not the problem.

@m-a-v

m-a-v commented Nov 1, 2019

@guillep2k What exactly do you mean by "it would be cool if you'd break your statistics down by listening port number"?

tcp_fin_timeout is set to 60 seconds on my system. Ubuntu 18.04 LTS standard configuration.

The question still remains. Why did it work perfectly with 1.7.4 (and earlier) and now anymore?

@guillep2k
Member

@m-a-v

# netstat -ant | grep TIME_WAIT | awk '{print $5 " " $6}' | cut -d: -f2 | sort | uniq -c | sort -n

@guillep2k
Member

The question still remains. Why did it work perfectly with 1.7.4 (and earlier) and now anymore?

I don't know, I'd need to check the code. The important thing is that it's taken care of now. 😁

@m-a-v

m-a-v commented Nov 1, 2019

The question still remains. Why did it work perfectly with 1.7.4 (and earlier) and now anymore?

I don't know, I'd need to check the code. The important thing is that it's taken care of now. 😁

I meant "and now not anymore".

@guillep2k
Member

I meant "and now not anymore".

I meant it's now solved by properly handling CONN_MAX_LIFETIME, MAX_IDLE_CONNS and MAX_OPEN_CONNS.

@m-a-v If you want to investigate which specific change between 1.7.4 and 1.9.5 caused this, I'd be interested in learning about your results.

@gabyx
Author

gabyx commented Dec 20, 2019

On 1.7.4 (9f33aa6) I also had lots of connections at the peak when cloning, during Filtering... -> LFS smudge:

$ netstat -ant | awk '{print $6}' | sort | uniq -c | sort -n
      1 Foreign
      1 established)
      5 LISTEN
     10 ESTABLISHED
   8599 TIME_WAIT

When running git lfs push --all origin, at the peak:

$ netstat -ant | grep TIME_WAIT | awk '{print $5 " " $6}' | cut -d: -f2 | sort | uniq -c
66

Suddenly the client hangs at 97%. GIT_TRACE=true does not show anything... it just hangs... possibly not related to Gitea.

@gabyx
Author

gabyx commented Jan 7, 2020

On 1.11.0+dev-563-gbcac7cb93:
netstat -ant | grep TIME_WAIT | awk '{print $5 " " $6}' | cut -d: -f2 | sort | uniq -c
Peak is 280 connections in TIME_WAIT.

@go-gitea go-gitea locked and limited conversation to collaborators Nov 24, 2020