This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Use <meta> tags to discover the per-page encoding of html previews #4183

Merged
merged 4 commits into from
Nov 15, 2018

Conversation


@hawkowl hawkowl commented Nov 14, 2018

No description provided.

@hawkowl hawkowl requested a review from a team November 14, 2018 04:51

hawkowl commented Nov 14, 2018

Fixes #2891

@@ -53,6 +53,9 @@

logger = logging.getLogger(__name__)

_charset_match = re.compile(br"<\s*meta[^>]*charset\s*=\s*([a-z0-9-]+)", flags=re.I)
Member

Obligatory reference to https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454 here.

(yeah, I don't know how you're supposed to parse the HTML before you know what encoding it is, either)

Contributor Author

Yeah, when I was coming up with this, I had two solutions:

  1. Decode as latin1 (which will decode any 8-bit stream), or as ascii with errors set to ignore
  2. Use a regex that looks for the start of a <meta tag followed by charset=<foo>, which is likely to be inside an http-equiv="Content-Type" tag.

Since I consider this best-effort (correctly configured servers ought to be serving the correct charset in the Content-Type header), I decided a regex was better than parsing it, then parsing it again. I chose not to check for the http-equiv part, since it can go before or after the content="charset=utf8" bit and would make the regex far more complex.
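
For readers following along, here is a minimal sketch of the regex approach described in this comment. The pattern is copied from the diff above. The helper name `sniff_meta_charset`, the `default` parameter, and the `codecs.lookup` validation step are assumptions made for illustration and are not taken from the PR. How the result should be combined with the HTTP Content-Type header is left out.

```python
import codecs
import re

# Pattern copied from the diff above. It is a bytes pattern, so it can be
# applied to the raw body before any decoding has happened.
_charset_match = re.compile(br"<\s*meta[^>]*charset\s*=\s*([a-z0-9-]+)", flags=re.I)


def sniff_meta_charset(body, default="utf-8"):
    """Hypothetical helper: best-effort charset guess from a <meta> tag.

    `body` is the undecoded HTML as bytes. Returns the charset name found in
    the first matching <meta> tag, or `default` if nothing usable is found.
    """
    match = _charset_match.search(body)
    if not match:
        return default

    # The capture group only allows ASCII letters, digits and "-",
    # so decoding it as ASCII cannot fail.
    name = match.group(1).decode("ascii")

    # Assumption for this sketch: ignore names Python has no codec for,
    # so that a later body.decode(name) does not raise LookupError.
    try:
        codecs.lookup(name)
    except LookupError:
        return default
    return name


if __name__ == "__main__":
    page = (
        b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=windows-1251"></head></html>'
    )
    encoding = sniff_meta_charset(page)
    print(encoding)
    print(page.decode(encoding, errors="replace")[:20])
```

Note that, as written, the pattern expects the charset value to follow the `=` directly. It matches the `content="text/html; charset=..."` form and an unquoted `<meta charset=utf-8>`, but not a quoted `<meta charset="utf-8">`, which falls through to the default in this sketch.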

Member

fully agreed

Contributor Author

It's only a 2/10 on the HE COMES scale :)

@richvdh richvdh left a comment

lgtm

@hawkowl hawkowl merged commit df758e1 into develop Nov 15, 2018
@hawkowl hawkowl deleted the hawkowl/http-equiv-encodings branch November 15, 2018 17:05