Put hashes in asset filenames, not querystrings #118

quis · 2021-06-18T11:57:00Z

Using querystring hashes is a DoS vulnerability because it means that a request with a new querystring will miss the edge and pass through to the origin.

rjbaker

Looks good. I updated the file_fingerprint test to match the change.

benthorner

I appreciate this has gone in already, but I've got some concerns about the approach we've taken here, which I'm finding quite subtle and pretty hard to follow. Was this an urgent change, or could it have waited a day or two? I appreciate this isn't my field of expertise, so I'd be interested to hear @tombye's thoughts on the approach as well.

gulpfile.js

Currently our specified image is not being shown on twitter cards, for example in https://twitter.com/msuxg/status/1407254592333697024 In our html we currently have ``` <meta property="og:image" content="https://www.gov.uk/alerts/assets/images/opengraph-image.png?f86f1d0dd106bfbcd8ad1ee5ea68e75e"> ... <meta property="og:image" content="/assets/images/govuk-opengraph-image.png"> ``` where the first is the image we want to show and the second is html added by the GOV.UK Frontend. https://github.com/alphagov/govuk-frontend/blob/v3.12.0/src/govuk/template.njk#L26 Note the second meta image currently 404s for us because it is a relative url rather than an absolute url. Twitter should be following the spec of open graph, where the first image is given preference and it ignores the second - https://ogp.me/#array. However, I think it is not following this correctly and that is why our image is not showing on twitter cards. This hypothesis backed by a manual test by me to remove the second meta tag and I see that twitter renders our card with our first image correctly. Therefore it leaves us with a few options 1. Get twitter to fix a potential bug with their twitter card implementation 2. Change the second meta tag to use an absolute URL 3. Get the design system to remove their fallover tag 4. Add in specific meta tags for twitter images Option 1 and 3 will take time and effort. Option 2 may not work because we won't have a file hosted in our assets directory at that location because of #118 Therefore I've gone for option 4 because I've tested it and it appears to work and it is also quick and non controversial. You can see the spec documented at https://developer.twitter.com/en/docs/twitter-for-websites/cards/overview/markup

quis · 2021-06-29T08:18:45Z

This is 90% working now. Just need to figure out why filenameFormat is being ignored in this one place:
https://github.com/alphagov/notifications-govuk-alerts/pull/118/files#diff-25789e3ba4c2adf4a68996260eb693a441b4a834c38b76167a120f0b51b969f7R52

templates/_layout.html

lib/utils.py

benthorner · 2021-06-29T11:44:51Z

Just in case it helps: would it be easier to get the Python code to do the renaming? It already knows where the compiled assets are, and how to fingerprint them, so it's not a big step to rename them as well.

rjbaker · 2021-06-30T08:39:05Z

@benthorner Totally forgot that Ollie had a go at doing it in Python: OllieJC@6126d5e

tombye · 2021-07-01T08:34:51Z

The only problem with doing it all in python is that the JavaScript build generates a sourcemap to give the browser a view of the uncompressed code. This is linked to the compressed code by a comment at it's end, containing the name of the sourcemap. In other words, if we rename the compressed file, we need to rename this too. If rollup (the library that builds the JS) is allowed to insert the hash itself, it will update this reference too.

All that being the case, we've been looking at how we can keep the hash generation in the frontend build but make it work with the python. Caveat: we could just do a search-and-replace on the sourcemap comment but messing with the contents of files doesn't seem a great idea if we can avoid it.

benthorner · 2021-07-01T10:00:56Z

Thanks for your thoughts @tombye.

I've wondered about source maps in the past: why do we have them? If they're just a nice-to-have, then we could just get rid of them instead of spending more effort to make them work.

tombye · 2021-07-01T10:33:26Z

I've wondered about source maps in the past: why do we have them? If they're just a nice-to-have, then we could just get rid of them instead of spending more effort to make them work.

It can be very hard to debug JS in the browser without them as we obfuscate our code as well as minify it. They aren't necessarily needed on production but I can't seem much value in doing extra work to remove them when:

they aren't loaded until devtools is open
we might need to debug JS on production at some point

tombye · 2021-07-02T11:08:32Z

Me and @quis have been working on some changes which move this to the approach I mentioned in #118 (comment).

This new approach:

leaves the process of adding SHAs to the static asset filenames to the Gulp tasks
changes the file_fingerprint filter from finding the SHA for filenames from their contents to finding the SHA'ed filename through a regex, built from the original filename

That means as long as assets have 8 character SHAs in their name, we trust Gulp to generate the right SHA for their contents.

This also includes:

some tidying up of asset URLs across the pages
passing all assets we can through file_fingerprint
extra logic to ignore a small set of assets we can't add SHAs to because we can't change their URLs

benthorner · 2021-07-02T15:53:34Z

Thanks for the update @tombye, that does sound like a simpler approach: all the knowledge about the hashing is in one place, and the Python code just needs to care about knowing what the file names are (a simpler goal).

I noticed there are quite a few "WIP" commits. Can we squash those at all? That might make the PR easier to review.

Using querystring hashes is a DoS vulnerability because it means that a request with a new querystring will miss the edge and pass through to the origin.

Signed-off-by: Richard Baker <richard.baker@digital.cabinet-office.gov.uk>

This requires a bit of bodging because `gulp-hash-filename` sees the filename and extension for `file.js.map` as `file.js` and `.map`. We should be able to do this with rollup by adding `entryFileNames: '[name]-[hash].js'` to its output config but that seems to use a different hashing algorithm so is incompatible with `gulp-hash-filename` and our backend code 🤡

Changes the plugin used to add a SHA to the filenames of all file types except JavaScript. For JavaScript files, use Rollup's own hashing method. Note: this method means the output file matches that input so references to it in templates need to change. This commit also creates a copy of each file it adds a SHA to, so the python filter code can find it without needing to know the SHA at any point.

It turns out that rollup's 'input' option allows you to define the 'name' property (used in the format string that defines the final name of output files). This can be done by passing in an object containing name:path pairs. For example: 'input': { 'foo': '/path/to/file.js' }, 'output': { 'entryFileNames': '[name].js' } ...will produce 'foo.js' These changes use this feature to return the JavaScript files to the names they had before and updates the HTML to reflect this. See the docs for the input option for more detail: https://www.rollupjs.org/guide/en/#input

The JavaScript files produced by rollup have SHAs in their names that are not just generated from their final contents*. Because of this, it is tricky to determine their filename. This proposes a different approach to the file_fingerprint filter, changing it so it gets the SHA'ed filename by regexp rather than using the same hashing as the gulp task. *This issue contains some useful information on how rollup generates its hashes: rollup/rollup#2839

Includes: - fix for how options are sent to gulp plugin that does the asset SHA'ing - give the HTML5 shiv a SHA in its name - add quotes to hrefs specified on certain tags without them* - stop the static asset build producing duplicate files for those with hashes in the name, these aren't used to verify the hash any more - remove unused gulp plugins and moves the gulp-sha256-filename plugin into dev dependencies - add hash to name of SVG fallback images *HTML5 allows the current form (which is why it all works) but it's a good idea to quote them for consistency.

Some image URLs are baked into the GOVUK Frontend components so can't be passed through our file_fingerprint filter, to get a SHA. Because of this, we need to keep them without the SHA.

1. updates the existing one so it covers the new behaviour 2. adds a test for when a file can't be found to match the path sent

tombye · 2021-07-02T20:16:36Z

I noticed there are quite a few "WIP" commits. Can we squash those at all? That might make the PR easier to review.

I went through those commits, squashed them into one and added a summary of the combined changes. Hopefully this should help.

benthorner

Thanks for trying @tombye ⭐.

I ended up looking at the overall diff: from a reviewers perspective, the commits are hard to follow because they tell a meandering story of dead ends and small fixes (stuff I don't need to know). Fortunately the overall diff wasn't too hard to look at in one go. Definitely something I'll be pushing for in future PRs I review, though. As a quick suggestion, I reckon you could squash all but the last two commits into one that explains the approach.

Anyway, nice one for pursuing this and finding a single approach for it, especially with the weird rollup [name]-[hash] bit 💯. I've only got a few more minor comments, so I'll approve this now.

benthorner · 2021-07-05T10:58:47Z

gulpfile.js

-        })
+        }),
+        // Terser is a replacement for UglifyJS
+        rollupPluginTerser()


This isn't mentioned in the commit. I can see all the other entries have it, so I'm assuming it's intentional. (Another argument for DRYing up 😉.)

benthorner · 2021-07-05T11:26:02Z

gulpfile.js

@@ -69,7 +91,8 @@ const javascripts = {
      ]
    }).then(bundle => {
      return bundle.write({
-        file: paths.dist + 'javascripts/govuk-frontend-details.js',
+        dir: paths.dist + 'javascripts/',
+        entryFileNames: '[name]-[hash].js',


It took me a long time to find the code that generates [hash], to be sure it will always be 8 chars long. Would have been helpful to have a reference in the commit message that introduces this: https://github.com/rollup/rollup/search?q=digest

gulpfile.js

We’re calling Rollup three times, repeating the same config each time. We can make this less duplicative by having each task share the same config. I’ve done this by making a wrapper function, rather than making the config itself reusable. The way I’ve done it avoids more duplication. Unfortunately I couldn’t work out to do this in such a way that let the tasks keep their names in the console output.

We moved the hash from the query string to file name in this PR: #118 The current code is based on the file name being unchanged.

rjbaker approved these changes Jun 18, 2021

View reviewed changes

quis force-pushed the hash-filenames branch 3 times, most recently from 71b2559 to aa72284 Compare June 18, 2021 16:16

benthorner reviewed Jun 21, 2021

View reviewed changes

gulpfile.js Outdated Show resolved Hide resolved

gulpfile.js Outdated Show resolved Hide resolved

gulpfile.js Outdated Show resolved Hide resolved

idavidmcdonald mentioned this pull request Jun 23, 2021

Fix twitter card not showing our image #124

Merged

quis commented Jun 29, 2021

View reviewed changes

templates/_layout.html Outdated Show resolved Hide resolved

quis commented Jun 29, 2021

View reviewed changes

lib/utils.py Outdated Show resolved Hide resolved

tombye force-pushed the hash-filenames branch from dd62d8d to 2f764d1 Compare July 2, 2021 11:09

quis and others added 9 commits July 2, 2021 21:00

Put hashes in asset filenames, not querystrings

8c4bb5f

Using querystring hashes is a DoS vulnerability because it means that a request with a new querystring will miss the edge and pass through to the origin.

Fix file_fingerprint util test to assert filepath

bbb2762

Signed-off-by: Richard Baker <richard.baker@digital.cabinet-office.gov.uk>

Keep some GOVUK images without SHAs

7fa99a8

Some image URLs are baked into the GOVUK Frontend components so can't be passed through our file_fingerprint filter, to get a SHA. Because of this, we need to keep them without the SHA.

Update tests for file_fingerprint filter

852ed41

1. updates the existing one so it covers the new behaviour 2. adds a test for when a file can't be found to match the path sent

tombye force-pushed the hash-filenames branch from 2f764d1 to 852ed41 Compare July 2, 2021 20:14

benthorner approved these changes Jul 5, 2021

View reviewed changes

quis force-pushed the hash-filenames branch from aadec78 to 4bd8783 Compare July 5, 2021 13:15

quis added 2 commits July 5, 2021 14:20

Add comment to explain rollup’s undocumented hashing algorithm

f91321e

quis force-pushed the hash-filenames branch from 4bd8783 to f91321e Compare July 5, 2021 13:27

quis merged commit 8704083 into main Jul 5, 2021

quis deleted the hash-filenames branch July 5, 2021 13:36

tombye added a commit that referenced this pull request Jul 13, 2021

Allow test for JS to match filename with hash

ad3d1e7

We moved the hash from the query string to file name in this PR: #118 The current code is based on the file name being unchanged.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Put hashes in asset filenames, not querystrings #118

Put hashes in asset filenames, not querystrings #118

quis commented Jun 18, 2021

rjbaker left a comment

benthorner left a comment

quis commented Jun 29, 2021

benthorner commented Jun 29, 2021

rjbaker commented Jun 30, 2021

tombye commented Jul 1, 2021

benthorner commented Jul 1, 2021

tombye commented Jul 1, 2021

tombye commented Jul 2, 2021

benthorner commented Jul 2, 2021

tombye commented Jul 2, 2021

benthorner left a comment

benthorner Jul 5, 2021

benthorner Jul 5, 2021

Put hashes in asset filenames, not querystrings #118

Put hashes in asset filenames, not querystrings #118

Conversation

quis commented Jun 18, 2021

rjbaker left a comment

Choose a reason for hiding this comment

benthorner left a comment

Choose a reason for hiding this comment

quis commented Jun 29, 2021

benthorner commented Jun 29, 2021

rjbaker commented Jun 30, 2021

tombye commented Jul 1, 2021

benthorner commented Jul 1, 2021

tombye commented Jul 1, 2021

tombye commented Jul 2, 2021

benthorner commented Jul 2, 2021

tombye commented Jul 2, 2021

benthorner left a comment

Choose a reason for hiding this comment

benthorner Jul 5, 2021

Choose a reason for hiding this comment

benthorner Jul 5, 2021

Choose a reason for hiding this comment