fix: encode URLs correctly (fix #15298) #15311

Merged: 7 commits, Mar 12, 2024


pzerelles
Contributor

@pzerelles pzerelles commented Dec 11, 2023

Description

Fixes #15298. The problem was that special characters were not encoded correctly in URLs created for assets. This was also a problem for vite-imagetools when there were spaces or other special characters in the filename.



@pzerelles pzerelles force-pushed the fix-url-encoding branch 4 times, most recently from d4dbd3c to c667958 Compare December 11, 2023 16:08
@patak-dev patak-dev added the p3-minor-bug An edge case that only affects very specific usage (priority) label Dec 11, 2023
@patak-dev patak-dev added this to the 5.1 milestone Dec 11, 2023
@patak-dev
Member

patak-dev commented Dec 11, 2023

Thanks for the PR! I mentioned we had this problem here #15246 (review). I think it is a good idea to do this, but I don't know if we should merge this one in a patch. We always had this issue and it would be good to see how all the encoding affects performance. I added it to the 5.1 milestone for now.

@pzerelles
Contributor Author

pzerelles commented Dec 12, 2023 via email

benmccann
benmccann previously approved these changes Dec 15, 2023
Collaborator

@benmccann benmccann left a comment


I'm not sure how much we should worry about performance when deciding to merge this or not. Correctness seems more important. Of course, if we can optimize the implementation, we should certainly do so.

@patak-dev
Member

I agree that correctness is more important than performance here. I still think we should merge this one as part of the next minor. We discussed with the team and decided to start the beta for 5.1 after the holidays. This change could introduce subtle bugs, for example, I think this changed line isn't right https://github.com/vitejs/vite/pull/15311/files#diff-f2f744fef86a2c562dd5142240912f7a2d28404fac536740a2424daf628aa609R409. That normalized URL is later used in both warmup and to update the module graph. In both cases, the URL should be decoded. So we need to move that later on, and only encode when we write a URL into the code.

@pzerelles
Contributor Author

pzerelles commented Dec 15, 2023

You mean the line in packages/vite/src/node/plugins/importAnalysis.ts? That's the only line I was not 100% sure about, but without the change it throws for some URLs. There is a decodeURI somewhere that fails if the URL contains a %, for example.

@patak-dev
Member

I think the encoding may need to be moved here instead https://github.com/pzerelles/vite/blob/f45775c9515cf9a3d3c9ffc448b871dfc78efe84/packages/vite/src/node/plugins/importAnalysis.ts#L602C17-L602C17

@pzerelles
Contributor Author

Thanks so much for providing the right location. I probably didn't understand the code well enough. The original problem is also fixed when moving the encode to that location.

@patak-dev
Member

To set expectations, we may start work on the 5.1 beta in January. My take is that we should merge it then. I'll be off for a few weeks, then do a review. Other maintainers may get to this one sooner though, and I'm fine with them deciding to move forward if performance looks fine and the test cases are enough to cover this one. We should have a test covering an id that wouldn't work with the encoding as it was before the last commit. I think a file called module%.js should already trigger the issue.
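
The failure mode behind a name like module%.js can be reproduced outside Vite. The following is a minimal sketch that relies only on the standard `decodeURI` built-in; the helper name `tryDecode` is hypothetical and is not part of Vite.

```typescript
// Hypothetical helper (not a Vite API): decode a URL path and report a
// failure instead of throwing.
type DecodeResult = { ok: true; value: string } | { ok: false; error: string }

function tryDecode(path: string): DecodeResult {
  try {
    return { ok: true, value: decodeURI(path) }
  } catch (e) {
    // decodeURI throws a URIError when "%" is not followed by two hex digits.
    return { ok: false, error: e instanceof Error ? e.name : String(e) }
  }
}

// "%" followed by ".j" is not a valid escape sequence, so decoding fails.
const raw = tryDecode('/src/module%.js')

// With "%" escaped as "%25", decoding succeeds and restores the literal "%".
const escaped = tryDecode('/src/module%25.js')
```

Here `raw` carries a `URIError`, while `escaped` decodes back to `/src/module%.js`, which is presumably why a test file with a `%` in its name is enough to exercise the bug.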

@pzerelles
Contributor Author

pzerelles commented Dec 15, 2023

Yes. There is also another thing. Right now I encode only the placeholders __VITE_ASSET_{id}__ and not __VITE_PUBLIC_ASSET_{id}__, because I found no documentation about it or about when it is used.

@pzerelles
Contributor Author

pzerelles commented Dec 15, 2023

Instead of adding a new test I extended the existing unicode url test to include a % character.
That test was wrong before anyway because in dev mode it expected an unencoded url. I corrected that and the test should fail against the current version.

@bluwy
Member

bluwy commented Dec 27, 2023

If performance is a concern, I did some tests and found that browsers can generally handle un-encoded imports too. They encode them automatically when requesting. Even if the import string has a mix of encoded and unencoded characters, e.g. import './test%20テother.js', the browser encodes only the "テ" and does not turn the %20 into %2520 (double-encoding). So we can probably get away with encoding only % as %25 instead of using encodeURI.
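
To make the trade-off concrete, here is a minimal sketch of a percent-only encoder next to the built-in `encodeURI`. The function name `encodePercentOnly` is an assumption for illustration and is not taken from the Vite codebase.

```typescript
// Hypothetical helper: escape only "%" and leave every other character alone.
function encodePercentOnly(specifier: string): string {
  return specifier.replace(/%/g, '%25')
}

// A literal "%" in a file name becomes "%25", so a later decodeURI
// restores the original name instead of throwing.
const partial = encodePercentOnly('./module%.js')
const roundTrip = decodeURI(partial)

// Full encoding also escapes spaces and non-ASCII characters, which the
// browser would otherwise escape on its own when it requests the module.
const full = encodeURI('./test テ.js')
```

Under these assumptions `partial` is `./module%25.js`, `roundTrip` is the original `./module%.js`, and `full` is `./test%20%E3%83%86.js`. The percent-only variant does less work per specifier, at the cost of relying on the browser for everything except `%`.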

@patak-dev
Member

@pzerelles we discussed this in today's team meeting and decided to move forward with the change, but, for performance reasons, using a helper that only transforms % into %25 instead of doing a full encoding, as this should be enough. Let us know if you'd like to modify this PR to do so; if not, we can do the work in another PR.

@pzerelles
Contributor Author

pzerelles commented Jan 10, 2024

> @pzerelles we discussed about this in today's team meeting and decided to move forward with the change, but using a helper for performance reasons that will only transform % into %25 instead of a full encoding as this should be enough. Let us know if you'd like to modify this PR to do so, if not, we can work in another PR.

But this will not fix URLs in general: output from libraries like vite-imagetools, and image URLs in general, will still be wrong if the file names contain spaces or other special characters. Even though browsers tolerate un-encoded special characters, that is against the RFC. It may work for imports, but not for images that are not imported and are instead served from disk. URLs for images must be encoded. And if, for example, vite-imagetools encodes the filename on disk, the web server will decode the URL and then fail to find a file with the decoded name.

@patak-dev
Member

@pzerelles would you create a minimal repro for the issue you describe with image-tools?

@pzerelles
Contributor Author

Of course: https://github.com/pzerelles/vite-bug-repro

The second test logo will not be loaded, because vite-imagetools (currently) writes the transformed image to disk with url-encoded filename (which is also wrong), but the web server will look for a file with the url decoded.

The main reason for vite-imagetools doing this is because vite does not url-encode asset urls when they are written to html or js output (I assume). I already have a PR in the vite-imagetools repository to remove this behavior and output files unencoded, and maintainers have signaled to me that it will be accepted if vite can provide the url-encoding, which in their opinion is also the correct place to do it.

@bluwy
Member

bluwy commented Jan 24, 2024

Based on the repro, is your main concern the inconsistency in this HTML during vite preview?

    <img src="/assets/logo 10-rYk2SNri.png" class="logo" alt="Test logo">
    <img src="/assets/logo%2010-4FQfBhVH.webp" class="logo" alt="Test logo">

The strings come from this source code:

import testLogo from "./logo 10.png";
import testLogoScaled from "./logo 10.png?w=64&format=webp";

document.querySelector<HTMLDivElement>("#app")!.innerHTML = `
    <img src="${testLogo}" class="logo" alt="Test logo" />
    <img src="${testLogoScaled}" class="logo" alt="Test logo" />
`

So from what I can tell, we only need to fix the URL string from asset imports. We don't have to handle the encoding of import specifiers to fix vite-imagetools? I think it's good to fix the inconsistency for asset imports specifically, which means using encodeURI on this filename. filename itself shouldn't be pre-encoded because it's not a url string. (implied by its name)

But for others, it looks like vite-imagetools should not be writing files in encoded characters in the first place and it should work. If I manually edit the name after build, it's also working fine.
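
The mismatch described above can be simulated without a real server. This is a rough sketch that assumes a static file server which decodes the request path before looking the file up; `filesOnDisk` and `isServed` are hypothetical stand-ins, and the file names are the two from the repro.

```typescript
// File names as they end up in the build output of the repro: the second
// one was written to disk already percent-encoded.
const filesOnDisk = new Set(['logo 10-rYk2SNri.png', 'logo%2010-4FQfBhVH.webp'])

// Hypothetical stand-in for a static file server: decode the request path,
// then look the decoded name up on disk.
function isServed(requestPath: string): boolean {
  const name = decodeURI(requestPath).replace(/^\/assets\//, '')
  return filesOnDisk.has(name)
}

// Encoding the URL string, not the file name, keeps both sides consistent.
const pngUrl = encodeURI('/assets/logo 10-rYk2SNri.png')
const pngFound = isServed(pngUrl)

// The pre-encoded file is requested as "%20", decoded to a space, and the
// lookup then misses because the name on disk still contains "%20".
const webpFound = isServed('/assets/logo%2010-4FQfBhVH.webp')
```

In this sketch `pngFound` is true and `webpFound` is false, which matches the observation that renaming the file after the build makes the second image load.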

@pzerelles
Contributor Author

pzerelles commented Jan 24, 2024

> Based on the repro, is your main concern of this inconsistency on this HTML during vite preview?
>
>     <img src="/assets/logo 10-rYk2SNri.png" class="logo" alt="Test logo">
>     <img src="/assets/logo%2010-4FQfBhVH.webp" class="logo" alt="Test logo">
>
> The strings come from this source code:
>
> import testLogo from "./logo 10.png";
> import testLogoScaled from "./logo 10.png?w=64&format=webp";
>
> document.querySelector<HTMLDivElement>("#app")!.innerHTML = `
>     <img src="${testLogo}" class="logo" alt="Test logo" />
>     <img src="${testLogoScaled}" class="logo" alt="Test logo" />
> `
>
> So from what I can tell, we only need to fix the URL string from asset imports. We don't have to handle the encoding of import specifiers to fix vite-imagetools? I think it's good to fix the inconsistency for asset imports specifically, which means using encodeURI on this filename. filename itself shouldn't be pre-encoded because it's not a url string. (implied by its name)
>
> But for others, it looks like vite-imagetools should not be writing files in encoded characters in the first place and it should work. If I manually edit the name after build, it's also working fine.

It is not only for vite preview. The problem also exists in SSR builds, where the URL can differ from the one seen in client-side JavaScript: on the client the string is often wrapped in the URL class, which does the encoding, while in SSR Vite currently does no encoding at all.

That is correct, vite-imagetools should not be writing files in encoded form and I have a PR waiting there to fix that, too.
If vite can correctly encode the URL string from asset imports, that would be the solution and I think that is what my PR here is about.

@bluwy
Member

bluwy commented Jan 24, 2024

I think this PR is doing more than expected to fix the issue for vite-imagetools though. We only need to encode the filename as linked before, but this PR is also:

  1. Encoding the filename passed to renderBuiltUrl
  2. Encoding import specifiers (import-analysis)

Which I don't think are needed for vite-imagetools. no2 is sort-of a real (but different) issue though which we suggested only encoding the %. I don't quite understand how no2 affects vite-imagetools otherwise since it shouldn't be a concern for plugins. Vite is only normalizing the import specifiers to bridge them back to the server when it's requested by the client/browser.


I guess what I'm proposing is that, we can scope this PR down by

  1. Only using encodeURI on the filename here.
  2. For import analysis, only handle % encoding for import specifiers. (Not related to vite-imagetools I think, but could be fixed together if you'd like)

@pzerelles
Contributor Author

But isn't renderBuiltUrl used to insert the URL into HTML during SSR? If yes, that needs to be encoded as well.

@bluwy
Member

bluwy commented Jan 24, 2024

The consumer who uses renderBuiltUrl will need to encode the filename themselves in this case. In my mind, "filename" is un-encoded, and the returned built "url" is encoded. Unless we rename the filename to url so it's less confusing. Our docs also suggest usage without encoding 🤔 Maybe @patak-dev has some thoughts on this.

@patak-dev
Member

To be honest, I wasn't thinking about encoding when designing the renderBuiltUrl API or the docs, as this wasn't on the table at that point. From what I see of the usage in the ecosystem, everyone is adding a base or wrapping the filename in a function call and then returning it without encoding.

What if we keep filename as is, and start encoding (only replacing %) on our side, applied to the output of renderBuiltUrl? We have many places in our API where we say url but mean an unencoded url.

@pzerelles
Contributor Author

pzerelles commented Jan 24, 2024

But isn't that the distinction, filename or url? I agree that the filename should not be url-encoded, but the url should be if it is represented as a string.

From a developer standpoint, I import an image and get back a string to use as src in my img tag. The RFC says that it should be encoded. If that is not how it should work, everyone will need to url-encode the imported url separately.

And vite-imagetools can not work around that, because the Vite pipeline generates the urls from filenames.

@patak-dev
Member

> But isn't that the distinction, filename or url. I agree that the filename should not be url-encoded, but the url should be if it is represented as string.

We have two definitions for URL internally:

  • Browser URL: these are the ones your browser will see. They are wrapped when needed during dev with /@id/, \0 is replaced by __x00__, for example. They also have queries like t and import. And these should be encoded (at least the %).
  • Server URL: these are the ones we pass to transformRequest, resolveId, the moduleGraph APIs, etc. These are unwrapped (no /@id/), they can start with \0, and they don't have the t and import queries (but they can have a direct query for css). These should not be encoded. These are the URLs you use when authoring source code too.
  • id: these are resolved Server URLs

I think we need to properly document this in the docs, but it is tricky. We should first find good names for these.
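
As a rough illustration of the two forms, the sketch below converts between them for an id that needs wrapping. The constants reflect the `/@id/` prefix and `__x00__` placeholder named above; the function names `toBrowserUrl` and `toServerUrl` are hypothetical, and the real logic in Vite handles more cases (queries, ids that need no wrapping).

```typescript
const ID_PREFIX = '/@id/'    // prefix mentioned in the comment above
const NULL_BYTE = '__x00__'  // placeholder for "\0" mentioned above

// Hypothetical: turn an unencoded server-side id into the form a browser
// requests. Only "%" is escaped here, following the percent-only proposal.
function toBrowserUrl(id: string): string {
  return ID_PREFIX + id.replace('\0', NULL_BYTE).replace(/%/g, '%25')
}

// Hypothetical inverse: unwrap, decode, and restore the leading "\0".
function toServerUrl(url: string): string {
  const unwrapped = url.startsWith(ID_PREFIX) ? url.slice(ID_PREFIX.length) : url
  return decodeURI(unwrapped).replace(NULL_BYTE, '\0')
}

const serverId = '\0virtual:mod%'
const browserUrl = toBrowserUrl(serverId)
const restored = toServerUrl(browserUrl)
```

With these assumptions `browserUrl` is `/@id/__x00__virtual:mod%25` and `restored` equals the original `serverId`, so the encoded form exists only on the browser-facing side.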

> From a developer standpoint, I import an image and get back a string to use as src in my img tag. The RFC says that it should be encoded. If that is not how it should work, everyone will need to url-encode the imported url separately.

I'm proposing moving the encoding from renderBuiltUrl(encode(filename)) to encode(renderBuiltUrl(filename)) inside Vite, so it should be the same for vite-imagetools, no?
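
The difference between the two orderings is mostly about what the user's hook gets to see. Below is a minimal sketch with a simplified, hypothetical hook in the style of renderBuiltUrl (the real hook takes more arguments and can return other shapes), using `encodeURI` as a stand-in for whichever encoder Vite ends up applying.

```typescript
// Records every filename the hook receives, for comparison below.
const seenByHook: string[] = []

// Simplified, hypothetical hook: prepend a CDN base to the filename.
function renderBuiltUrl(filename: string): string {
  seenByHook.push(filename)
  return `https://cdn.example.com/${filename}`
}

const filename = 'assets/logo 10-rYk2SNri.png'

// Ordering 1: encode first, so the hook receives an encoded "filename".
const encodeFirst = renderBuiltUrl(encodeURI(filename))

// Ordering 2: the hook receives the raw filename and the result is encoded.
const encodeLast = encodeURI(renderBuiltUrl(filename))
```

For a hook this simple both orderings yield `https://cdn.example.com/assets/logo%2010-rYk2SNri.png`, but only the second one keeps the value passed to the hook an actual file name, which is the property the proposal is after.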

@bluwy
Member

bluwy commented Jan 25, 2024

Hmm, maybe I'm also looking at the issue wrongly. @patak-dev I think I agree that perhaps we should leave renderBuiltUrl filename un-encoded, and also leave its returned url un-encoded so we don't have to change that.

If we do this encoding/decoding at a low level (e.g. the renderBuiltUrl primitive) that many higher-level APIs rely on, it becomes hard to reason about the encoding state of the URL. For simplicity, and from what I've observed in the codebase so far, I think what we should do is:

  1. For every logic that deals with URLs internally, it must be decoded.
  2. At the start of the logic, if the URL is encoded, we decode it.
  3. At the end of the logic, if the URL needs to be encoded, then we encode the (surely) decoded URL.
Examples of no2:

const urlReplacer: CssUrlReplacer = async (url, importer) => {
  const decodedUrl = decodeURI(url)

const url = new URL(req.url!.replace(/^\/{2,}/, '/'), 'http://example.com')
const pathname = decodeURI(url.pathname)

Examples of no3:

fetch(
  new URL(
    `${base}__open-in-editor?file=${encodeURIComponent(file)}`,
    import.meta.url,
  ),
)

url.pathname = encodeURI(newPathname)
req.url = url.href.slice(url.origin.length)
serveFromRoot(req, res, next)

If the internals keep flipping between encoded and decoded, I think it'll be hard to maintain over time.
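
The three rules can be put together in one small function. This is a sketch under stated assumptions rather than code from Vite: `rewriteRequestUrl` is a hypothetical name, and the "logic" in the middle is just stripping a base path.

```typescript
// Hypothetical request rewrite that follows the three rules above.
function rewriteRequestUrl(rawUrl: string, base: string): string {
  // Split off the query so only the path is decoded and re-encoded.
  const queryStart = rawUrl.indexOf('?')
  const rawPath = queryStart === -1 ? rawUrl : rawUrl.slice(0, queryStart)
  const query = queryStart === -1 ? '' : rawUrl.slice(queryStart)

  // Rule 2: decode once, at the start.
  const pathname = decodeURI(rawPath)

  // Rule 1: the internal logic only ever sees the decoded path.
  const stripped = pathname.startsWith(base)
    ? '/' + pathname.slice(base.length)
    : pathname

  // Rule 3: encode once, at the end, knowing the input is decoded.
  return encodeURI(stripped) + query
}

const rewritten = rewriteRequestUrl('/app/logo%2010.png?v=1', '/app/')
```

Here `rewritten` is `/logo%2010.png?v=1`. Because the path is decoded exactly once and encoded exactly once, there is no point at which the function has to guess which state the string is in.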


So I guess my new suggestion now is to update this part (which is the end of the logic) and encode it 🤔

return `export default ${JSON.stringify(url)}`

@patak-dev I think for this case, we might need to do a full encode since we don't know how it's going to be used. We could only do the partial encode for import specifiers I believe since we have control over that.
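
A minimal sketch of what encoding at that point could look like, assuming the URL that reaches it is still decoded. The function name `assetModuleCode` is hypothetical, and `encodeURI` stands in for the "full encode" mentioned above.

```typescript
// Hypothetical: build the module source for an asset import, encoding the
// URL only at the moment it is written out as a string.
function assetModuleCode(url: string): string {
  return `export default ${JSON.stringify(encodeURI(url))}`
}

// A decoded URL is encoded exactly once.
const code = assetModuleCode('/assets/logo 10-rYk2SNri.png')

// Caveat: an already-encoded URL would be double-encoded ("%20" -> "%2520"),
// which is why the input has to be known to be decoded at this point.
const doubled = assetModuleCode('/assets/logo%2010.png')
```

With these inputs `code` is `export default "/assets/logo%2010-rYk2SNri.png"`, while `doubled` contains `%2520`, which shows why the decoded-until-the-end rule matters for this call site.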

@bluwy
Member

bluwy commented Feb 7, 2024

Moving this to 5.2 since we're releasing 5.1 soon and the PR needs an update to finish up.

@bluwy bluwy modified the milestones: 5.1, 5.2 Feb 7, 2024
@pzerelles
Contributor Author

pzerelles commented Feb 9, 2024

> So I guess my new suggestion now is to update this part (which is the end of the logic) and encode it 🤔
>
> return `export default ${JSON.stringify(url)}`
>
> @patak-dev I think for this case, we might need to do a full encode since we don't know how it's going to be used. We could only do the partial encode for import specifiers I believe since we have control over that.

@bluwy I changed the PR to this and it works for dev server. But in build mode, the URLs are unencoded and the original Repro from #15298 fails with preview.

I updated the test HTML: I removed an invalid unencoded image URL from it and added the problematic % to an encoded image URL in index.html. The tests still pass, but while they run I see URI malformed messages. Strangely, that encoded URL from index.html arrives at viteHtmlFallbackMiddleware with only the % at the end already decoded, while the rest of the unicode is still encoded. I'd appreciate any help with this.

@pzerelles
Contributor Author

I added the URL encoding again in toOutputFilePathInJS, but after renderBuiltUrl. Everything works now; the only thing left is the mysterious URI malformed message during tests, which does not make the tests fail.

@bluwy
Member

bluwy commented Mar 8, 2024

I went ahead and rebased this, and fixed the URI malformed issue. The issue has long existed in Vite, but it only shows up when you reference an asset with % in the filename.

I also removed the encoding in toOutputFilePathInJS. I searched the codebase and made sure that every place where we render the URL as a string (which is the end of the lifecycle of a URL string) calls encodeURI (or partialEncodeURI if only the % encoding is needed). The places include:

  • toAssetPathFromHtml
  • toAssetPathFromCss
  • toAssetPathFromHtml
  • assetUrlRE
  • publicAssetUrlRE
  • cssUrlAssetRE
  • workerAssetUrlRE

Hopefully this change doesn't blow everything up, and it should be more accurate.

@pzerelles
Contributor Author

@bluwy Thank you very much. I probably wouldn't have found all those places. I'll check whether it solves all the issues I had.

patak-dev
patak-dev previously approved these changes Mar 8, 2024
Member

@patak-dev patak-dev left a comment


Looks great!

@bluwy
Member

bluwy commented Mar 8, 2024

/ecosystem-ci run

Labels
p3-minor-bug An edge case that only affects very specific usage (priority)
Development

Successfully merging this pull request may close these issues.

"Internal server error: URI malformed" when using a percent sign in URL
4 participants