sum.golang.org: set a useful User-Agent string #44468

Closed
ddevault opened this issue Feb 20, 2021 · 16 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@ddevault

ddevault commented Feb 20, 2021

I was looking through some suspicious traffic on my git hosting service and it took me a while to understand that it was coming from sum.golang.org.

The code here:

https://github.com/golang/mod/blob/master/gosumcheck/main.go#L187

Should be updated to set a meaningful User-Agent so that admins like me are less confused when reading our access logs.

Aside: the page at sum.golang.org should include a link to the source code, it was not easy to find.

@hyangah
Contributor

hyangah commented Feb 20, 2021

That is an example program that checks a go.sum file against a checksum database such as sum.golang.org; it does not contact the git host.
sum.golang.org does not use that code either. sum.golang.org and proxy.golang.org access source hosting services through the go command (they invoke go mod download and go list). For the source code of the go command, please see https://go.googlesource.com/go/+/refs/tags/go1.16/src/cmd/go/

#35699 has prior discussion about adding User-Agent string to go commands.

@ddevault
Author

I ultimately blocked these IPs because they appeared to be wantonly crawling the server. This is poor behavior for a crawler. It caused service issues for Go users as a result. sum.golang.org is being a poor citizen of the web.

Another aside: it really frustrates me that all discussions are locked after they're "decided", this isn't the first time that I've had a new perspective or new information to contribute and couldn't because the decision was already made.

@ddevault
Author

A solution which would respect the privacy of the user is adding a GOUSERAGENT environment variable or some such similar thing and then just setting it appropriately on the proxy servers, and to a generic value for end-users.

@hyangah
Contributor

hyangah commented Feb 20, 2021

If the go command allowed setting the User-Agent, I think proxy.golang.org and sum.golang.org could consider setting the User-Agent field. I think we discussed this internally at some point in the past, as a way to help small git hosting services, but I can't find it in the issue tracker (maybe we never filed an issue for discussion?)

By the way, blocking all the traffic from sum.golang.org or proxy.golang.org will prevent all Go users of your package from getting the sum data. I suspect the traffic to refresh the latest and list information is too aggressive for some git hosting services. @ddevault can you tell us more about the traffic pattern you've seen? (logs are appreciated if possible).

cc @katiehockman @heschik @bcmills @jayconrod @matloob @rsc

This issue isn't locked or closed. We appreciate your input and new information.

@ddevault
Author

ddevault commented Feb 20, 2021

> By the way, blocking all the traffic from sum.golang.org or proxy.golang.org will prevent all Go users of your package from getting the sum data

Right. And we would not have blocked it if we had any idea what the clients were, i.e. if they set their User-Agent properly. As far as we could tell, it was just some skiddie on GCP running a scraper written in Go. The aggressiveness is fine (though you should obey robots.txt!), given the utility - supporting the Go ecosystem - but without knowing what it's being used for, we have no context and have to consider it a violation of our terms of service, which prohibit scraping outside of a few specific purposes.

> can you tell us more about the traffic pattern you've seen? (logs are appreciated if possible).

I'll update this if I see it again later, but I didn't save the logs and we get a lot of traffic - and because it's not easily distinguished from any other kind of traffic, it's hard to find the activity again.

@jayconrod
Contributor

Setting a user agent string for the go command seems reasonable, as long as it doesn't contain the version: we want to avoid different content being served to different versions of the go command.

Note however that the fetch service backing proxy.golang.org fetches modules using go mod download with GOPROXY=direct, so that's mostly downloading repos with Git and VCS tools. Go's user agent wouldn't apply there.

I'm not at all sure where the scraping is coming from though. The go command doesn't scrape. If you share logs we can look into whether there's a bug here.

@hyangah
Contributor

hyangah commented Feb 22, 2021

@jayconrod I am guessing that's part of import paths resolving traffic https://golang.org/cmd/go/#hdr-Remote_import_paths

@jayconrod
Contributor

Could be... those requests will all have the query string ?go-get=1. Should be easy to identify in logs.
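One way to pick those requests out of an access log is sketched below in Go. It assumes a combined-style log format in which the request line is the first double-quoted field; the function name `isGoGet` is hypothetical, and the program simply echoes matching lines read from standard input.

```go
package main

import (
	"bufio"
	"fmt"
	"net/url"
	"os"
	"strings"
)

// isGoGet reports whether an access-log line records a request whose
// query string contains go-get=1.
func isGoGet(line string) bool {
	// Assumption: the request ("METHOD TARGET PROTO") is the first
	// double-quoted field on the line.
	start := strings.IndexByte(line, '"')
	if start < 0 {
		return false
	}
	rest := line[start+1:]
	end := strings.IndexByte(rest, '"')
	if end < 0 {
		return false
	}
	fields := strings.Fields(rest[:end])
	if len(fields) < 2 {
		return false
	}
	u, err := url.ParseRequestURI(fields[1])
	if err != nil {
		return false
	}
	return u.Query().Get("go-get") == "1"
}

func main() {
	// Read log lines from standard input and print only the go-get requests.
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		if isGoGet(sc.Text()) {
			fmt.Println(sc.Text())
		}
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Parsing the query string rather than searching for the substring avoids false matches on paths or referers that merely contain the text `go-get=1`.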

@ddevault
Author

Here's an example:

74.125.182.164 - - [23/Feb/2021:00:41:34 +0000] "GET /~yoink00/zaplog/info/refs?service=git-upload-pack HTTP/2.0" 200 553 "-" "git/2.30.0" "-"
74.125.182.161 - - [23/Feb/2021:00:41:34 +0000] "GET /~yoink00/zaplog/info/refs?service=git-upload-pack HTTP/2.0" 200 553 "-" "git/2.30.0" "-"
74.125.182.161 - - [23/Feb/2021:00:41:34 +0000] "POST /~yoink00/zaplog/git-upload-pack HTTP/2.0" 200 56 "-" "git/2.30.0" "-"
74.125.182.161 - - [23/Feb/2021:00:41:34 +0000] "POST /~yoink00/zaplog/git-upload-pack HTTP/2.0" 200 9434 "-" "git/2.30.0" "-"

These IPs appear to come from Google. At the moment I'm getting several requests per second in this shape from Google IP blocks.

@cagedmantis cagedmantis added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 23, 2021
@cagedmantis cagedmantis added this to the Backlog milestone Feb 23, 2021
@ddevault
Author

This crawling is actually starting to get out of hand. Is there someone on the infrastructure team I can escalate to?

@hyangah
Contributor

hyangah commented Feb 23, 2021

@ddevault Thanks for sharing the example. They look like requests triggered by git remote calls from a go list or go mod download command. To help users who want to investigate and analyze the git traffic from our service, we are going to specify GIT_HTTP_USER_AGENT. However, some requests will still be sent without a User-Agent string because of limitations in the underlying tools.

If packages and modules hosted on your site are actively used by Go users, or the site hosts many packages, the aggregate volume of traffic originating from us may be significant. If it is causing an issue for your service, can you please file a separate issue with specific details of the requests you are seeing and the problem the traffic has caused? The problem no longer seems to be about a missing User-Agent string.

@ddevault
Author

I wonder if the latest Go release, with its changes to modules, is causing a larger burden on hosting services. In any case, I may follow up in a second ticket later on, but for infrastructure issues I would prefer to be contacted directly by the sysadmins responsible: sir@cmpwn.com

@hyangah
Contributor

hyangah commented Feb 23, 2021

@ddevault Can you please file a separate issue to discuss this? Other code hosting service owners may be interested in the topic and we'd like to keep our conversation public.

@ddevault
Author

See #44577

@hyangah
Contributor

hyangah commented Feb 26, 2021

The change that sets GIT_HTTP_USER_AGENT="GoModuleMirror/1.0 (+https://proxy.golang.org)" is now deployed.

@ddevault can you please verify git http requests coming from proxy.golang.org now have the User-Agent string? Thank you!

@ddevault
Author

I can verify that I'm receiving requests with that User-Agent now. Thanks!

@dmitshur dmitshur modified the milestones: Backlog, Unreleased Aug 19, 2021
@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Aug 19, 2021
@golang golang locked and limited conversation to collaborators Aug 19, 2022
6 participants