Use <meta> tags to discover the per-page encoding of html previews #4183
Fixes #2891

@@ -53,6 +53,9 @@

logger = logging.getLogger(__name__)

_charset_match = re.compile(br"<\s*meta[^>]*charset\s*=\s*([a-z0-9-]+)", flags=re.I)
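To make the pattern's behaviour concrete, here is a small sketch that runs it against a few byte strings. The pattern is copied from the diff; the helper name `sniff_meta_charset` is hypothetical and is not taken from the PR.

```python
import re

# Pattern copied verbatim from the diff under review.
_charset_match = re.compile(br"<\s*meta[^>]*charset\s*=\s*([a-z0-9-]+)", flags=re.I)


def sniff_meta_charset(body):
    """Hypothetical helper: return the charset named in a <meta> tag, or None.

    `body` is the raw, still-undecoded response body (bytes).
    """
    match = _charset_match.search(body)
    # The character class only admits ASCII letters, digits and hyphens,
    # so decoding the captured group as ASCII cannot fail.
    return match.group(1).decode("ascii") if match else None


if __name__ == "__main__":
    samples = [
        b'<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">',
        b"<META CHARSET=ISO-8859-1>",
        b'<meta charset="utf-8">',
        b"<p>no meta tag here</p>",
    ]
    for sample in samples:
        print(sample, "->", sniff_meta_charset(sample))
```

As written, the pattern expects the value to follow the equals sign directly, so the http-equiv form and an unquoted `charset=` attribute match, while a quoted `<meta charset="utf-8">` does not.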
Obligatory reference to https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454 here.
(yeah, I don't know how you're supposed to parse the HTML before you know what encoding it is, either)
Yeah, when I was coming up with this, I had two solutions:
- Decode as latin1 (which will decode any 8 bit stream) or ascii with errors set to ignore
- Regex looking for a <meta beginning block and then charset=<foo>, which is likely to be in a http-equiv="Content-Type".
Since I consider this best-effort (correctly configured servers ought to be serving the correct charset in the Content-Type header), I decided a regex was better than parsing it, then parsing it again. I chose not to check for the http-equiv part, since it can go before or after the content="charset=utf8" bit and checking for it would make the regex far more complex.
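For contrast, the first option (decode permissively, parse, then decode again) might look roughly like the sketch below. It is not code from the PR: the class and function names are hypothetical, and it uses only the standard library's `html.parser`.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Hypothetical parser that records the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names and strips quotes from values.
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            # HTML5 form: <meta charset="utf-8">
            self.charset = attrs["charset"].strip()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # Older form: content="text/html; charset=utf-8"
            for part in (attrs.get("content") or "").split(";"):
                key, _, value = part.strip().partition("=")
                if key.strip().lower() == "charset" and value.strip():
                    self.charset = value.strip()


def decode_via_two_passes(body, fallback="utf-8"):
    """Decode `body` (bytes) using a <meta> charset found by a first, throwaway parse."""
    # First pass: latin1 maps every byte to a code point, so this decode never fails.
    finder = MetaCharsetFinder()
    finder.feed(body.decode("latin1"))
    finder.close()
    encoding = finder.charset or fallback
    # Second pass: decode for real with whatever the first pass discovered.
    try:
        return body.decode(encoding, errors="replace")
    except LookupError:
        # The page declared an encoding Python does not know; use the fallback.
        return body.decode(fallback, errors="replace")
```

This handles both the quoted and http-equiv forms, at the cost of walking the whole document once just to learn how to decode it, which is the overhead the regex avoids.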
fully agreed
It's only a 2/10 on the HE COMES scale :)
lgtm