Replace cChardet with chardetng_py. #7559
I think it's still going to be preferable to remove chardet completely, we can add chardetng to the docs for the few users that might want it. Also, should the |
Also, the article on chardetng says that it explicitly doesn't support utf-8 and will never select that as an encoding. I see, for example, that Github use |
I agree it's difficult to get right (actually, probably impossible), but I'm just suggesting we move it to documentation and tell the user to use this library if relevant to them. We should be able to provide a simple copy/paste bit of code that shows how to do it (with allow_utf8 and tld included for best accuracy). |
I've updated my PR with the following changes:
Is there a separate PR which drops character detection completely? I can't find it. I can go ahead and work on opening a pull request which does that, along with adding the documentation you suggested. |
Not yet, so feel free. I was going to look at it after a couple of other tasks I'm dealing with currently. I assume something like this would be sufficient for the documentation:
Although, not sure how well it'll play with TLDs that have multiple parts (but, that's another hard problem that requires an up-to-date list of TLDs). |
That's likely wrong, unfortunately. You probably want to look at the CONTENT_TYPE header first. I believe this is closer:
try:
    text = await resp.text()
except UnicodeDecodeError:
    tld = resp.url.host.rsplit(".")[-1]
    body = await resp.read()
    text = body.decode(chardetng_py.detect(body, allow_utf8=True, tld=tld))
I'll see if I can get that other PR opened. |
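For anyone copying the snippet above: the `tld` value there is simply the last dot-separated label of the host. A minimal sketch of that step as a standalone helper follows. The name `rightmost_label` is hypothetical, and it assumes the detector only wants the rightmost DNS label in lower case, in which case multi-part suffixes such as `co.uk` reduce to `uk`.

```python
from typing import Optional


def rightmost_label(host: Optional[str]) -> Optional[str]:
    """Return the last DNS label of ``host`` in lower case, or None."""
    if not host:
        # No host available (for example a relative URL): nothing to hint with.
        return None
    # Drop a trailing root dot, then keep only the text after the last dot.
    label = host.rstrip(".").rsplit(".", 1)[-1].lower()
    return label or None
```

Note that an IP-literal host would produce a numeric label here, so callers may prefer to pass no hint at all in that case.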
Right, maybe then it's worth adding a parameter to
Could also include the parameter in ClientSession, so it only has to be set once. At which point there's really no convenience lost, just copy/paste a couple of lines of code at setup time and you have the charset behaviour back. Users are also in full control over which libraries to use too, so we also don't have to worry about cchardet being abandoned etc. |
Maybe it should only be set in ClientSession and we can skip the parameter in .text() (we can always add it later if users ask for it). |
So a few more things:
You don't have to worry about cchardet being abandoned if you accept this pull request. It's not based on chardet at all; it's a different library written in Rust. In terms of maintenance burden, it's much easier to keep Rust bindings updated with maturin/pyo3 than it is with the way cchardet is written. I can pinky promise I'll keep it updated if that makes you feel better. 😆
I have opened the pull request to remove charset detection completely here: #7560
Regarding setting a fallback character set at the session level: I'm not sure I love that either. The second the session accesses multiple URLs from multiple domains, it's going to get confusing whether you've picked the right fallback or not. |
Not sure what you mean, my example was using a user-supplied function to reimplement the current behaviour. If a mimetype isn't present in Content-Type, then we call that function (which can just default to |
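To make the callable idea concrete, here is a rough sketch under stated assumptions. `SketchSession`, `charset_from_content_type` and the `fallback_charset_resolver` parameter name are all hypothetical stand-ins, not aiohttp's API. The resolver is only consulted when the Content-Type value carries no charset parameter.

```python
from typing import Callable, Optional

# Hypothetical resolver signature: takes the raw body, returns an encoding name.
CharsetResolver = Callable[[bytes], str]


def charset_from_content_type(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, if present."""
    for param in content_type.split(";")[1:]:
        key, _, value = param.strip().partition("=")
        if key.lower() == "charset" and value:
            return value.strip('"')
    return None


class SketchSession:
    """Stand-in for a client session; this is not aiohttp's ClientSession."""

    def __init__(
        self, fallback_charset_resolver: CharsetResolver = lambda body: "utf-8"
    ) -> None:
        # Set once at setup time, then reused for every response.
        self.fallback_charset_resolver = fallback_charset_resolver

    def decode(self, body: bytes, content_type: str) -> str:
        charset = charset_from_content_type(content_type)
        if charset is None:
            # No charset declared by the server: ask the user-supplied callable.
            charset = self.fallback_charset_resolver(body)
        return body.decode(charset)
```

A user who wants detection back would pass their own callable (for instance one wrapping a detection library) when constructing the session; everyone else gets the utf-8 default.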
Oh, I understand now. You were recommending setting a callable at the client session level. An alternate approach would be to direct users to subclass ClientSession.
class ClientSessionWithCharsetDetection(ClientSession):
    async def text(self, *args, **kwargs):
        try:
            return await super().text(*args, **kwargs)
        except UnicodeDecodeError:
            tld = self.url.host.rsplit(".")[-1]
            return self._body.decode(
                chardetng_py.detect(self._body, allow_utf8=True, tld=tld)
            )
I think that achieves the same function as a callable. I generally try to avoid inheritance, but it looks clean there. |
ClientSession is final: Line 167 in db2c274
So, a callable would be needed. |
Wow, I had no idea about that.
I do have one more proposal: change the behavior of charset detection so that the encoding specified in the header (or utf-8 by default) is tried first, and character set detection is attempted only if decoding fails. This will actually improve performance, and a warning in the character set detection block (presumably DeprecationWarning) would give users of the library a more visible heads-up that their code won't work when character detection is removed at a later time. Does that sound reasonable?
EDIT: Unrelated, but for anyone reading, it looks like subclassing ClientSession is forbidden in |
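As a rough illustration of this proposal (not the code from any of the linked PRs), the order of operations could look like the sketch below. `decode_body` and its `detect` parameter are hypothetical names; `detect` stands in for whatever detection library is plugged in.

```python
import warnings
from typing import Callable, Optional


def decode_body(
    body: bytes,
    declared_charset: Optional[str],
    detect: Callable[[bytes], str],
) -> str:
    """Decode with the declared charset (or utf-8); detect only on failure."""
    encoding = declared_charset or "utf-8"
    try:
        # Fast path: no detection cost when the declared/default charset works.
        return body.decode(encoding)
    except UnicodeDecodeError:
        # Slow path: warn so users learn that they rely on detection.
        warnings.warn(
            "Automatic charset detection is deprecated; "
            "pass an explicit encoding instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return body.decode(detect(body))
```

The performance gain comes from the fast path: detection runs only for bodies that fail to decode, rather than for every response without a declared charset.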
Could be an improvement, let's take a look at it in a PR. |
@Dreamsorcerer Pull request is here: #7561 I included the chardetng_py changes in there, because I really do think if character set detection is being offered, even in a deprecated form, that it's a nice addition. |
What do these changes do?
This removes cChardet as an optional speedup dependency and replaces it with chardetng_py, which is a Python binding to Mozilla's chardetng (or chardet Next Generation) library.
Advantages over cChardet:
Other notes
i. chardetng is as-fast-or-faster than cchardet in my testing
ii. encoding detection is as-good-or-better than cchardet
Are there changes in behavior for the user?
The exact encoding returned by chardetng might not match what was returned by cchardet, but it is in production use with Firefox. If you have specific questions about the operation of the library, you should read this blog post: https://hsivonen.fi/chardetng/
Related issue number
Should close #7126
chardetng_py also supports incremental encoding detection, where the buffer can be fed into the detector in chunks.
Implementing that would solve #4112
Docs: https://chardetng-py.readthedocs.io/en/latest/class_reference.html
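A minimal sketch of what chunked feeding could look like is shown below. The `IncrementalDetector` protocol, its `feed(data, last=...)` and `guess()` method shapes, and the `detect_from_chunks` helper are assumptions made for illustration; check the class reference linked above for the real signatures before relying on them.

```python
from typing import Iterable, Optional, Protocol


class IncrementalDetector(Protocol):
    """Assumed detector shape: feed chunks, then ask for a guess."""

    def feed(self, data: bytes, last: bool) -> None: ...

    def guess(self) -> str: ...


def detect_from_chunks(
    detector: IncrementalDetector, chunks: Iterable[bytes]
) -> str:
    """Feed every chunk to the detector, flagging only the final one as last."""
    previous: Optional[bytes] = None
    for chunk in chunks:
        if previous is not None:
            # A later chunk exists, so the held-back one was not the last.
            detector.feed(previous, last=False)
        previous = chunk
    # Flush the held-back chunk (or an empty buffer) as the final input.
    detector.feed(previous if previous is not None else b"", last=True)
    return detector.guess()
```

Holding back one chunk lets the helper mark the end of the stream without knowing the total length in advance, which is the situation when a response body arrives piece by piece.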
Checklist
CONTRIBUTORS.txt
CHANGES folder
<issue_id>.<type>, for example (588.bugfix)
issue_id: change it to the pr id after creating the pr
.feature: Signifying a new feature.
.bugfix: Signifying a bug fix.
.doc: Signifying a documentation improvement.
.removal: Signifying a deprecation or removal of public API.
.misc: A ticket has been closed, but it is not of interest to users.