Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

http connections are not properly closed without a http.Client Timeout property #1040

Closed
BogdanEAM opened this issue Apr 1, 2023 · 7 comments
Labels
bug Error or flaw in the code with unintended result

Comments

@BogdanEAM
Copy link

Describe the bug
We are starting a Java applications from a gcsfuse mount folder.
The app has 139 lib jars and many more prop files and classes.

With the gcsfuse version v0.41.12 it starts fine. However with versions v0.42.2 and v0.42.3 it does not start.
The difference that is causing the issue between these version is the removal of "Timeout" property from the http.Client created in "internal\storage\storage_handle.go" method NewStorageHandle.

Runtime flags
used for all versions:
-o noexec --dir-mode 700 --stackdriver-export-interval 60s --type-cache-ttl 20h -o ro -o nosuid -o nodev --file-mode 400 --enable-storage-client-library=true --implicit-dirs --enable-nonexistent-type-cache --stat-cache-ttl 20h --stat-cache-capacity 40960

Behaviour

with versions v0.42.2 and v0.42.3

The startup of the app always hits a dead stop when gcsfuse is reading a certain (always the same) file from the lib folder. With the _gcs flag we can see that it is always the same file.

By doing a netstat in parallel with the application startup we can see that the number of "ESTABLISHED" connections for the respective gcsfuse process grows until it reaches the number of 101 connections which is the default value for "max-conns-per-host".

Workaround 1
If we add the flag "--max-conns-per-host 300" the application start fine and the number of connections reported by netstat caps at around 115 connections.

Workaround 2
I have rebuild the code for v0.42.3 by reading the "Timeout" flag in "internal\storage\storage_handle.go" in the NewStorageHandle with the previous default value of 800 milliseconds.
The neststat in this case shows that connections are not pilling up, their number increases and decreases while application starts and never reaches a number higher than 15 connections.

If I set the Timeout to 10 seconds the max number of connections observed was 35 and again the app starts fine.

        httpClient := &http.Client{
                Transport: &oauth2.Transport{
                        Base:   transport,
                        Source: clientConfig.TokenSrc,
                },
                Timeout: 800 * time.Millisecond,
        }

with versions v0.41.12

The application start fine.
Running netstat in parallel as described above shows that connections are not pilling up, their number increases and decreases while application starts and never reaches a number higher than 20 connections.

To Collect more Debug logs
By adding some "log.Println" in the Close() method from "vendor\cloud.google.com\go\storage\http_client.go" we can see that the "r.body" is of different type depending if the Timeout property is set or not:

  • without Timeout set the r.body is of type: *http.bodyEOFSignal.Close()
  • with Timeout set the r.body is of type: *http.cancelTimerBody.Close()
func (r *httpReader) Close() error {
	log.Printf("closing body from %T.Close() using %T.Close()\n", r, r.body)
	return r.body.Close()
}

Another thing that we observed is that, after starting the applications with "--max-conns-per-host 300" or with Timeout property, after we stop the application we see some --debug_gcs specific logs like this:

gcs: 2023/03/31 12:10:31.975526 Req            0x389: -> Read("14.10.00.00.608516-SNAPSHOT/lib/file1.jar", [16384, 269906)) (49m36.102090196s): OK
gcs: 2023/03/31 12:10:31.975587 Req             0x23: -> Read("14.10.00.00.608516-SNAPSHOT/lib/file2.jar", [16384, 62983)) (49m53.576558928s): OK
gcs: 2023/03/31 12:10:31.975617 Req             0x26: -> Read("14.10.00.00.608516-SNAPSHOT/lib/file3.jar", [16384, 52764)) (49m53.509058704s): OK

Notice the duration times are in minutes and they are approximately how long the app was running.
This happens regardless if the Timeout is present or not but as I said, with this set the application starts.

System (please complete the following information):
NAME="CentOS Linux"
VERSION="7 (Core)"
Description: CentOS Linux release 7.9.2009 (Core)
Release: 7.9.2009

@BogdanEAM BogdanEAM added the bug Error or flaw in the code with unintended result label Apr 1, 2023
@raj-prince
Copy link
Collaborator

raj-prince commented Apr 4, 2023

Hi @BogdanEAM,

Thanks a lot for providing the thorough details and observations!

Based on the details you provided, below is my speculation:

  • It is clear, that app holds the connection while reading some jar/lib files. As I tried to simulate similar read operation for more than 200 files using fio, it works and all the Read(...) call is taking around 500ms not like in your case (49 minutes).
  • When you added time-out in that case: gcsfuse closes the connection explicitly after the time-out which avoids the case when app would hold the idle connection for reading a particular file.
  • When you provided more connection in this case: gcsfuse satisfies the connection requirement for the app, which would be around 115, and app works fine.

To debug further, we need to know or confirm about the reading behaviour of the application - as it holds the connection or not. It would be good, if you could provide gcsfuse logs with --debug_fs and --debug_gcs flags for all the three cases specially the first one: (1) application dead stop (2) in case of explicit time-out (3) when we provide more connections.

Regards and thanks,
Prince Kumar.

@BogdanEAM
Copy link
Author

Hi @raj-prince,
Thanks for looking into this issue.

I'm gathering those log files for you.
Can you explain what was the reason for removing the "Timeout" property from the http.Client?

Thanks,
Bogdan

@raj-prince
Copy link
Collaborator

raj-prince commented Apr 10, 2023

Hi @BogdanEAM,

Sorry for the late response!

The concept of time-out was introduce in gcsfuse via the go-storage-client library (existing default client). Before, with jacobsa/gcloud, we were not using the "time-out" property in http client.

During the testing, we found an issue (pre-condition error) due to newly introduced timeout property as mentioned in this PR. Since, we were not using this property in earlier version hence removed while using go-storage-client library also.

We will discuss within the team, should we keep time-out property or not and accordingly update this thread.

Regards,
Prince Kumar.

@BogdanEAM
Copy link
Author

Thanks for the info.
I think the best would be to keep the possibility to have a timeout (with a flag) but run without it as default.

I checked and it looks like the JVM is holding the locks for all the jar files that are loaded. This is how the JVM works, not something specific to our application.
This means that if the connection is closed (only) when the file handle is released none of the jar files connections will actually be closed until the app is stopped.
This is not good in our case because we have a lot of jars and their number can vary which also means that using the "max-conns-per-host" flag is not ideal because a value that works now might not work in month or so.

Another solution that I can think of is to update the gcsfuse code to close the connection when a file has been fully retrieved from GCS and not when the file handle is released.

Thanks you,
Bogdan

@sethiay
Copy link
Collaborator

sethiay commented Apr 12, 2023

Hey @BogdanEAM , Thanks for your reply. We are investigating the possible options here and we will get back to you once we take a decision.

@sethiay
Copy link
Collaborator

sethiay commented Apr 24, 2023

@BogdanEAM , Thanks for your patience. We have supported http client timeout in GCSFuse v0.42.4. The timeout can be controlled with --http-client-timeout flag. Closing this issue. Please re-open if face any further issues.

@sethiay sethiay closed this as completed Apr 24, 2023
@BogdanEAM
Copy link
Author

Thank you very much.
I confirm that using --http-client-timeout the application does not freeze during startup anymore and it starts and runs correctly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Error or flaw in the code with unintended result
Projects
None yet
Development

No branches or pull requests

3 participants