Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: remove newline entities #46

Merged
merged 4 commits into from
Nov 9, 2022
Merged

fix: remove newline entities #46

merged 4 commits into from
Nov 9, 2022

Conversation

kniemasik
Copy link
Contributor

No description provided.

@kniemasik kniemasik requested a review from a team as a code owner November 7, 2022 09:20
src/index.ts Outdated
@@ -1,6 +1,6 @@
const invalidProtocolRegex = /^([^\w]*)(javascript|data|vbscript)/im;
const htmlEntitiesRegex = /&#(\w+)(^\w|;)?/g;
const htmlTabEntityRegex = /&tab;/gi;
const htmlSpaceEntityRegex = /&(tab|newline);/gi;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there any other html entities that represent blank spaces?

I'm also wondering if we can just trash html entities entirely, or if there's a world in which an html entity is a valid part of a url?

I guess it technically is for query params. Should we be using the URL class here and only remove the html entities from the domain itself, and leave them in the query params? (would depend again on whether or not we continue to support IE11)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous regex (htmlEntitiesRegex) has the ; as an optional piece. Is there a world in which new lines or tabs can be added without using the ; part of it?

I'm still concerned that there are other space characters that are vulnerable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i believe the problem came from whitespace ctrl characters specifically that were making it through our checks - other whitespace html entities appear to be properly encoded in the browser and i think are valid in a url, so i'm inclined to not strip those (or any other html entity) out.

based on what i can find, the ; is always required for tabs and new lines

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, perfect ❤️

src/index.ts Outdated
@@ -12,7 +12,7 @@ function isRelativeUrlWithoutProtocol(url: string): boolean {

// adapted from https://stackoverflow.com/a/29824550/2601552
function decodeHtmlCharacters(str: string) {
str = str.replace(htmlTabEntityRegex, "	");
str = str.replace(htmlSpaceEntityRegex, "	");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little odd that &tab; and &newline; both get converted to 	, which is a tab. The 	 then gets decoded and stripped out as a control character. Since we know &tab; and &newline; are both control characters, can we just eliminate them here?

  str = str.replace(htmlSpaceEntityRegex, "");

That does change the nature of what decodeHtmlCharacters does, though, but decoding &newline; as a tab is already doing that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah - i like that a lot better (eliminating them entirely). fixed!

@crookedneighbor crookedneighbor merged commit a39ca11 into main Nov 9, 2022
@kniemasik kniemasik deleted the fix_newline_entity branch November 9, 2022 20:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants