Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Put hashes in asset filenames, not querystrings #118

Merged
merged 11 commits into from
Jul 5, 2021
Merged

Put hashes in asset filenames, not querystrings #118

merged 11 commits into from
Jul 5, 2021

Conversation

quis
Copy link
Member

@quis quis commented Jun 18, 2021

Using querystring hashes is a DoS vulnerability because it means that a request with a new querystring will miss the edge and pass through to the origin.

Copy link
Contributor

@rjbaker rjbaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I updated the file_fingerprint test to match the change.

@quis quis force-pushed the hash-filenames branch 3 times, most recently from 71b2559 to aa72284 Compare June 18, 2021 16:16
Copy link
Contributor

@benthorner benthorner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate this has gone in already, but I've got some concerns about the approach we've taken here, which I'm finding quite subtle and pretty hard to follow. Was this an urgent change, or could it have waited a day or two? I appreciate this isn't my field of expertise, so I'd be interested to hear @tombye's thoughts on the approach as well.

gulpfile.js Outdated Show resolved Hide resolved
gulpfile.js Outdated Show resolved Hide resolved
gulpfile.js Outdated Show resolved Hide resolved
idavidmcdonald added a commit that referenced this pull request Jun 23, 2021
Currently our specified image is not being shown on twitter cards, for
example in https://twitter.com/msuxg/status/1407254592333697024

In our html we currently have
```
<meta property="og:image" content="https://www.gov.uk/alerts/assets/images/opengraph-image.png?f86f1d0dd106bfbcd8ad1ee5ea68e75e">
...
<meta property="og:image" content="/assets/images/govuk-opengraph-image.png">
```

where the first is the image we want to show and the second is html
added by the GOV.UK Frontend.
https://github.com/alphagov/govuk-frontend/blob/v3.12.0/src/govuk/template.njk#L26

Note the second meta image currently 404s for us because it is
a relative url rather than an absolute url.

Twitter should be following the spec of open graph, where the first
image is given preference and it ignores the second -
https://ogp.me/#array.

However, I think it is not following this correctly and that is why our image is not showing
on twitter cards. This hypothesis backed by a manual test by me to remove the
second meta tag and I see that twitter renders our card with our first
image correctly.

Therefore it leaves us with a few options
1. Get twitter to fix a potential bug with their twitter card
   implementation
2. Change the second meta tag to use an absolute URL
3. Get the design system to remove their fallover tag
4. Add in specific meta tags for twitter images

Option 1 and 3 will take time and effort.

Option 2 may not work because we won't have a file hosted in our assets
directory at that location because of #118

Therefore I've gone for option 4 because I've tested it and it appears
to work and it is also quick and non controversial. You can see the spec documented
at
https://developer.twitter.com/en/docs/twitter-for-websites/cards/overview/markup
@quis
Copy link
Member Author

quis commented Jun 29, 2021

This is 90% working now. Just need to figure out why filenameFormat is being ignored in this one place:
https://github.com/alphagov/notifications-govuk-alerts/pull/118/files#diff-25789e3ba4c2adf4a68996260eb693a441b4a834c38b76167a120f0b51b969f7R52

templates/_layout.html Outdated Show resolved Hide resolved
lib/utils.py Outdated Show resolved Hide resolved
@benthorner
Copy link
Contributor

Just in case it helps: would it be easier to get the Python code to do the renaming? It already knows where the compiled assets are, and how to fingerprint them, so it's not a big step to rename them as well.

@rjbaker
Copy link
Contributor

rjbaker commented Jun 30, 2021

@benthorner Totally forgot that Ollie had a go at doing it in Python: OllieJC@6126d5e

@tombye
Copy link
Contributor

tombye commented Jul 1, 2021

The only problem with doing it all in python is that the JavaScript build generates a sourcemap to give the browser a view of the uncompressed code. This is linked to the compressed code by a comment at it's end, containing the name of the sourcemap. In other words, if we rename the compressed file, we need to rename this too. If rollup (the library that builds the JS) is allowed to insert the hash itself, it will update this reference too.

All that being the case, we've been looking at how we can keep the hash generation in the frontend build but make it work with the python. Caveat: we could just do a search-and-replace on the sourcemap comment but messing with the contents of files doesn't seem a great idea if we can avoid it.

@benthorner
Copy link
Contributor

Thanks for your thoughts @tombye.

I've wondered about source maps in the past: why do we have them? If they're just a nice-to-have, then we could just get rid of them instead of spending more effort to make them work.

@tombye
Copy link
Contributor

tombye commented Jul 1, 2021

I've wondered about source maps in the past: why do we have them? If they're just a nice-to-have, then we could just get rid of them instead of spending more effort to make them work.

It can be very hard to debug JS in the browser without them as we obfuscate our code as well as minify it. They aren't necessarily needed on production but I can't seem much value in doing extra work to remove them when:

  • they aren't loaded until devtools is open
  • we might need to debug JS on production at some point

@tombye
Copy link
Contributor

tombye commented Jul 2, 2021

Me and @quis have been working on some changes which move this to the approach I mentioned in #118 (comment).

This new approach:

  • leaves the process of adding SHAs to the static asset filenames to the Gulp tasks
  • changes the file_fingerprint filter from finding the SHA for filenames from their contents to finding the SHA'ed filename through a regex, built from the original filename

That means as long as assets have 8 character SHAs in their name, we trust Gulp to generate the right SHA for their contents.

This also includes:

  • some tidying up of asset URLs across the pages
  • passing all assets we can through file_fingerprint
  • extra logic to ignore a small set of assets we can't add SHAs to because we can't change their URLs

@benthorner
Copy link
Contributor

Thanks for the update @tombye, that does sound like a simpler approach: all the knowledge about the hashing is in one place, and the Python code just needs to care about knowing what the file names are (a simpler goal).

I noticed there are quite a few "WIP" commits. Can we squash those at all? That might make the PR easier to review.

quis and others added 9 commits July 2, 2021 21:00
Using querystring hashes is a DoS vulnerability because it means that a
request with a new querystring will miss the edge and pass through to
the origin.
Signed-off-by: Richard Baker <richard.baker@digital.cabinet-office.gov.uk>
This requires a bit of bodging because `gulp-hash-filename` sees the
filename and extension for `file.js.map` as `file.js` and `.map`.

We should be able to do this with rollup by adding
`entryFileNames: '[name]-[hash].js'` to its output config but that seems
to use a different hashing algorithm so is incompatible with
`gulp-hash-filename` and our backend code 🤡
Changes the plugin used to add a SHA to the
filenames of all file types except JavaScript.

For JavaScript files, use Rollup's own hashing
method. Note: this method means the output file
matches that input so references to it in
templates need to change.

This commit also creates a copy of each file it
adds a SHA to, so the python filter code can find
it without needing to know the SHA at any point.
It turns out that rollup's 'input' option allows
you to define the 'name' property (used in the
format string that defines the final name of
output files). This can be done by passing in an
object containing name:path pairs.

For example:

'input': {
  'foo': '/path/to/file.js'
},
'output': {
  'entryFileNames': '[name].js'
}

...will produce 'foo.js'

These changes use this feature to return the
JavaScript files to the names they had before and
updates the HTML to reflect this.

See the docs for the input option for more detail:

https://www.rollupjs.org/guide/en/#input
The JavaScript files produced by rollup have SHAs
in their names that are not just generated from
their final contents*. Because of this, it is
tricky to determine their filename.

This proposes a different approach to the
file_fingerprint filter, changing it so it gets
the SHA'ed filename by regexp rather than using
the same hashing as the gulp task.

*This issue contains some useful information on
how rollup generates its hashes:

rollup/rollup#2839
Includes:
- fix for how options are sent to gulp plugin that
  does the asset SHA'ing
- give the HTML5 shiv a SHA in its name
- add quotes to hrefs specified on certain tags
  without them*
- stop the static asset build producing duplicate
  files for those with hashes in the name, these
  aren't used to verify the hash any more
- remove unused gulp plugins and moves the
  gulp-sha256-filename plugin into dev
  dependencies
- add hash to name of SVG fallback images

*HTML5 allows the current form (which is why it all
works) but it's a good idea to quote them for
consistency.
Some image URLs are baked into the GOVUK Frontend
components so can't be passed through our
file_fingerprint filter, to get a SHA. Because of
this, we need to keep them without the SHA.
1. updates the existing one so it covers the new
   behaviour
2. adds a test for when a file can't be found to
   match the path sent
@tombye
Copy link
Contributor

tombye commented Jul 2, 2021

I noticed there are quite a few "WIP" commits. Can we squash those at all? That might make the PR easier to review.

I went through those commits, squashed them into one and added a summary of the combined changes. Hopefully this should help.

Copy link
Contributor

@benthorner benthorner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for trying @tombye ⭐.

I ended up looking at the overall diff: from a reviewers perspective, the commits are hard to follow because they tell a meandering story of dead ends and small fixes (stuff I don't need to know). Fortunately the overall diff wasn't too hard to look at in one go. Definitely something I'll be pushing for in future PRs I review, though. As a quick suggestion, I reckon you could squash all but the last two commits into one that explains the approach.

Anyway, nice one for pursuing this and finding a single approach for it, especially with the weird rollup [name]-[hash] bit 💯. I've only got a few more minor comments, so I'll approve this now.

gulpfile.js Outdated
})
}),
// Terser is a replacement for UglifyJS
rollupPluginTerser()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't mentioned in the commit. I can see all the other entries have it, so I'm assuming it's intentional. (Another argument for DRYing up 😉.)

gulpfile.js Outdated
@@ -69,7 +91,8 @@ const javascripts = {
]
}).then(bundle => {
return bundle.write({
file: paths.dist + 'javascripts/govuk-frontend-details.js',
dir: paths.dist + 'javascripts/',
entryFileNames: '[name]-[hash].js',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a long time to find the code that generates [hash], to be sure it will always be 8 chars long. Would have been helpful to have a reference in the commit message that introduces this: https://github.com/rollup/rollup/search?q=digest

gulpfile.js Outdated Show resolved Hide resolved
quis added 2 commits July 5, 2021 14:20
We’re calling Rollup three times, repeating the same config each time.
We can make this less duplicative by having each task share the same
config. I’ve done this by making a wrapper function, rather than making
the config itself reusable. The way I’ve done it avoids more
duplication.

Unfortunately I couldn’t work out to do this in such a way that let the
tasks keep their names in the console output.
@quis quis merged commit 8704083 into main Jul 5, 2021
@quis quis deleted the hash-filenames branch July 5, 2021 13:36
tombye added a commit that referenced this pull request Jul 13, 2021
We moved the hash from the query string to file
name in this PR:

#118

The current code is based on the file name being
unchanged.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants