
Serve robots.txt to prevent crawlers from accessing UI #97

Merged 1 commit into master on Jul 21, 2024

Conversation

@brandur (Collaborator) commented Jul 19, 2024

Here, add a `robots.txt` that denies all crawler access to UI
installations. If a River UI is exposed publicly without authentication,
that's already bad and a security problem, but we don't have to make it
worse by allowing crawlers to find it.
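
For reference, a deny-all `robots.txt` is only two lines. This is
standard robots.txt syntax, though not necessarily the exact file that
landed in this PR:

```
User-agent: *
Disallow: /
```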

We'll add a separate `robots.txt` in the demo that allows basic access
to top-level pages, but denies crawling through the potentially tens of
thousands of jobs under `/jobs/`.
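
The demo's exact rules aren't shown in this conversation, but a
plausible version would disallow only the jobs listing while leaving
top-level pages crawlable:

```
User-agent: *
Disallow: /jobs/
```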

There were a number of plausible ways to go about this. I ended up
putting in an embedded file system that can easily pull static files
into the Go program, and which could be reused for future static files
if they're needed. It may be a little more than we need right now, but
it wasn't much harder to do than serving a one-off static file.
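
The PR's actual code isn't reproduced in this conversation, but the
general technique looks roughly like the sketch below: a `go:embed`
file system mounted into an HTTP handler. The `public` directory name,
the route, and the port are assumptions for illustration:

```go
package main

import (
	"embed"
	"io/fs"
	"log"
	"net/http"
)

// Compile static files into the binary so the handler has no runtime
// file dependencies. Assumes a public/robots.txt exists at build time.
//
//go:embed public
var publicFS embed.FS

func main() {
	// Strip the "public" prefix so public/robots.txt serves as /robots.txt.
	staticFS, err := fs.Sub(publicFS, "public")
	if err != nil {
		log.Fatal(err)
	}

	mux := http.NewServeMux()
	// http.FileServer resolves the request path "/robots.txt" to the
	// embedded file "robots.txt"; other static files could be registered
	// the same way later.
	mux.Handle("/robots.txt", http.FileServer(http.FS(staticFS)))

	log.Fatal(http.ListenAndServe(":8080", mux))
}
```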

@bgentry (Contributor) left a comment:

Do you think this makes sense in the generalized riverui handler, or should it be specific to the demo app? Open to either, just not sure if this is something regular users will want (or whether it makes sense path-prefixed within the handler).

@brandur force-pushed the brandur-robots-txt branch from b796423 to e55c19a on July 19, 2024 at 04:34.
@brandur changed the title from "Serve robots.txt to prevent Google indexing a zillion sample jobs" to "Serve robots.txt to prevent crawlers from accessing UI" on Jul 21, 2024.
@brandur force-pushed the brandur-robots-txt branch from e55c19a to 3bc12e0 on July 21, 2024 at 15:48.
@brandur force-pushed the brandur-robots-txt branch from 3bc12e0 to 9d9ad90 on July 21, 2024 at 17:43.
@brandur (Collaborator, Author) commented Jul 21, 2024

Thx.

Was going to mention too that what I ended up with here feels like a bit of overkill, in that it allows generic files to be added alongside robots.txt, but in the end it wasn't that much more code than having an endpoint for a single file only. I figure we can leave it like this and refactor at some point in the future if it turns out that most of this code has gone relatively unused.

@brandur force-pushed the brandur-robots-txt branch from 9d9ad90 to 3ce193b on July 21, 2024 at 17:50.
@brandur merged commit 924b677 into master on Jul 21, 2024, with 10 checks passed.
@brandur deleted the brandur-robots-txt branch on July 21, 2024 at 17:54.