
Serve robots.txt to prevent crawlers from accessing UI #97

Merged 1 commit into master on Jul 21, 2024

Conversation

@brandur (Collaborator) commented Jul 19, 2024

Here, add a `robots.txt` that denies all crawler access to UI
installations. If a River UI is exposed publicly without authentication,
that's already bad and a security problem, but we don't have to make it
worse by allowing crawlers to find it.
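
For reference, a deny-all `robots.txt` is only two lines. This is
standard robots.txt syntax, though not necessarily the exact file that
landed in this PR:

```
User-agent: *
Disallow: /
```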

We'll add a separate `robots.txt` in the demo that allows basic access
to top-level pages, but denies crawling through the potentially tens of
thousands of jobs under `/jobs/`.
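
The demo's exact rules aren't shown in this conversation, but a
plausible version would disallow only the jobs listing while leaving
top-level pages crawlable:

```
User-agent: *
Disallow: /jobs/
```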

There were a number of plausible ways to go about this. I ended up
putting in an embedded file system that can easily pull static files
into the Go program, and which could be reused for future static files
if they're needed. It may be a little more than we need right now, but
it wasn't much harder to do than serving a one-off static file.
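
The PR's actual code isn't reproduced in this conversation, but the
general technique looks roughly like the sketch below: a `go:embed`
file system mounted into an HTTP handler. The `public` directory name,
the route, and the port are assumptions for illustration:

```go
package main

import (
	"embed"
	"io/fs"
	"log"
	"net/http"
)

// Compile static files into the binary so the handler has no runtime
// file dependencies. Assumes a public/robots.txt exists at build time.
//
//go:embed public
var publicFS embed.FS

func main() {
	// Strip the "public" prefix so public/robots.txt serves as /robots.txt.
	staticFS, err := fs.Sub(publicFS, "public")
	if err != nil {
		log.Fatal(err)
	}

	mux := http.NewServeMux()
	// http.FileServer resolves the request path "/robots.txt" to the
	// embedded file "robots.txt"; other static files could be registered
	// the same way later.
	mux.Handle("/robots.txt", http.FileServer(http.FS(staticFS)))

	log.Fatal(http.ListenAndServe(":8080", mux))
}
```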

@bgentry (Contributor) left a comment:

Do you think this makes sense in the generalized riverui handler, or should it be specific to the demo app? Open to either, just not sure if this is something regular users will want (or whether it makes sense path-prefixed within the handler).

@brandur force-pushed the brandur-robots-txt branch from b796423 to e55c19a on July 19, 2024 at 04:34.
@brandur changed the title from "Serve robots.txt to prevent Google indexing a zillion sample jobs" to "Serve robots.txt to prevent crawlers from accessing UI" on Jul 21, 2024.
@brandur force-pushed the brandur-robots-txt branch from e55c19a to 3bc12e0 on July 21, 2024 at 15:48.
@brandur force-pushed the brandur-robots-txt branch from 3bc12e0 to 9d9ad90 on July 21, 2024 at 17:43.
@brandur (Collaborator, Author) commented Jul 21, 2024

Thx.

Was going to mention too that what I ended up with here feels like a bit of overkill, in that it allows generic files to be added alongside robots.txt, but in the end it wasn't that much more code than having an endpoint for a single file only. I figure we can leave it like this and refactor at some point in the future if it turns out that most of this code has gone relatively unused.

@brandur force-pushed the brandur-robots-txt branch from 9d9ad90 to 3ce193b on July 21, 2024 at 17:50.
@brandur merged commit 924b677 into master on Jul 21, 2024, with 10 checks passed.
@brandur deleted the brandur-robots-txt branch on July 21, 2024 at 17:54.