Problems decoding content #77

Open
cvlli opened this issue Dec 9, 2016 · 9 comments

Comments

@cvlli

cvlli commented Dec 9, 2016

Traceback (most recent call last):
  File "/root/PycharmProjects/Teste/main.py", line 26, in <module>
    next_message = next(all_messages)
  File "/usr/local/lib/python3.5/dist-packages/imbox/__init__.py", line 50, in fetch_list
    yield (uid, self.fetch_by_uid(uid))
  File "/usr/local/lib/python3.5/dist-packages/imbox/__init__.py", line 41, in fetch_by_uid
    email_object = parse_email(raw_email)
  File "/usr/local/lib/python3.5/dist-packages/imbox/parser.py", line 151, in parse_email
    content = decode_content(part)
  File "/usr/local/lib/python3.5/dist-packages/imbox/parser.py", line 119, in decode_content
    return content.decode(charset)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 906: invalid start byte

Some emails have content encoded in Latin-1, ISO-8859-15 (Latin-9), or another charset.
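The error in the traceback can be reproduced without imbox. A minimal sketch, assuming the body was written in Latin-1 but is decoded as UTF-8 (the sample text is made up for illustration):

```python
# Hypothetical body text; 0xa9 is the copyright sign in Latin-1.
payload = "Copyright © 2016".encode("latin-1")

try:
    payload.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0xa9 is a UTF-8 continuation byte, so it cannot start a character.
    error = exc
    print(exc)

# Decoding with the charset the bytes were actually written in works.
text = payload.decode("latin-1")
```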

One option would be to add a chardet-based fallback:

`
import chardet
...

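def _decode_content(message):
    content = message.get_payload(decode=True)
    if content is None:
        # Multipart containers carry no payload of their own.
        return content
    charset = message.get_content_charset('utf-8')
    try:
        # Trust the declared charset first (UTF-8 when none is declared).
        return content.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # Declared charset is wrong or unknown: fall back to chardet's guess.
        guessed = chardet.detect(content).get("encoding") or "latin-1"
        return content.decode(guessed, errors="replace")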

`

@sblondon
Contributor

I confirm the issue. I have an example where there is an encoding error:
'utf-8' codec can't decode byte 0xc3 in position 1856: invalid continuation byte

The e-mail is parsed correctly when I change the line:
raw_email = str_encode(raw_email, 'utf-8')
to
raw_email = str_encode(raw_email, 'latin-1')
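The 'latin-1' change most likely works because Latin-1 maps every byte value to a character, so decoding with it can never fail. A short sketch of that trade-off, assuming str_encode ends up decoding the raw bytes with the given codec (the sample bytes are made up):

```python
# Latin-1 assigns a character to each of the 256 byte values,
# so decoding arbitrary bytes with it never raises.
all_bytes = bytes(range(256))
decoded_all = all_bytes.decode("latin-1")

# The catch: UTF-8 text decoded as Latin-1 comes back garbled.
utf8_bytes = "café".encode("utf-8")     # b'caf\xc3\xa9'
garbled = utf8_bytes.decode("latin-1")  # two characters where 'é' was

# 0xc3 followed by a byte that is not a continuation byte triggers
# the "invalid continuation byte" error quoted above.
try:
    b"caf\xc3(".decode("utf-8")
except UnicodeDecodeError as exc:
    reason = exc.reason
```

So the change hides the exception, but messages that really are UTF-8 would then be decoded incorrectly.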

This problem is probably a duplicate of #64.

The chardet library could fix the problem, but perhaps there are other solutions?
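One chardet-free alternative would be to try a short list of candidate charsets in order. A minimal sketch using only the standard library; the helper name `decode_with_fallbacks` and the candidate order are assumptions for illustration, not imbox code:

```python
from email import message_from_bytes


def decode_with_fallbacks(part, candidates=("utf-8", "latin-1")):
    """Decode a non-multipart message part, trying several charsets."""
    content = part.get_payload(decode=True)
    if content is None:
        return None
    declared = part.get_content_charset()
    # Declared charset first, then the generic candidates.
    for charset in ([declared] if declared else []) + list(candidates):
        try:
            return content.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: keep what is decodable and mark the rest.
    return content.decode("utf-8", errors="replace")


# Hypothetical message: declared as UTF-8 but actually Latin-1 bytes.
raw = (b"Content-Type: text/plain; charset=utf-8\r\n"
       b"Content-Transfer-Encoding: 8bit\r\n"
       b"\r\n"
       b"Copyright \xa9 2016\r\n")
body = decode_with_fallbacks(message_from_bytes(raw))
```

Because Latin-1 accepts any byte sequence, it only makes sense as the last candidate; anything listed after it would never be tried.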

@martinrusev What do you think? Would you be interested in a pull request?

By the way, I think the if at the beginning of the parse_email() function is not necessary because raw_email is always a bytes object: parse_email() is only called by imbox.Imbox.fetch_by_uid(), and the data returned by imaplib seems to always be bytes.

@martinrusev
Owner

@sblondon chardet would be a nice fix for this problem. A pull request is always welcome!

@sblondon
Contributor

I checked the erroneous message with the latest imbox version from the repository and I can't reproduce the error. So I will not send a pull-request until I get a new error.
I have no idea when it will occur again. Perhaps never?

@cvlli If you still have errors, could you provide an example file? If you have login access to the IMAP server, the file is probably in ~/Maildir/cur (or tmp). The goal is to add another test case to fix the issue and avoid encoding errors in the future.
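To find a candidate file, the Maildir could be scanned for messages whose raw bytes are not valid UTF-8. A rough sketch using the standard-library `mailbox` module; the function name is an assumption, and a failed strict UTF-8 decode is only a hint, not proof that imbox would fail on that message:

```python
import mailbox


def find_non_utf8_messages(maildir_path):
    """Return (key, raw_bytes) pairs for messages that are not valid UTF-8."""
    # create=False: raise instead of creating a new Maildir by accident.
    box = mailbox.Maildir(maildir_path, create=False)
    suspects = []
    for key in box.iterkeys():
        raw = box.get_bytes(key)
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            suspects.append((key, raw))
    return suspects
```

Calling `find_non_utf8_messages("~/Maildir")` would return the raw messages, which could be saved as test fixtures once any private data has been removed.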

@ghost

ghost commented Nov 20, 2017

This has most probably been fixed by #96 and can be closed.

@ghost

ghost commented Nov 20, 2017

And #78

@wesinator

Similar decoding error in 0.9.5:

  File ".local/lib/python3.6/site-packages/imbox/__init__.py", line 57, in fetch_list
    yield (uid, self.fetch_by_uid(uid))
  File ".local/lib/python3.6/site-packages/imbox/__init__.py", line 48, in fetch_by_uid
    email_object = parse_email(raw_email, policy=self.parser_policy)
  File ".local/lib/python3.6/site-packages/imbox/parser.py", line 181, in parse_email
    parsed_email['sent_from'] = get_mail_addresses(email_message, 'from')
  File ".local/lib/python3.6/site-packages/imbox/parser.py", line 55, in get_mail_addresses
    addresses[index] = {'name': decode_mail_header(address_name),
  File ".local/lib/python3.6/site-packages/imbox/parser.py", line 36, in decode_mail_header
    logger.debug("Mail header no. {}: {} encoding {}".format(index, str_decode(text, charset or 'utf-8'), charset))
  File ".local/lib/python3.6/site-packages/imbox/utils.py", line 12, in str_decode
    return value.decode(encoding or 'utf-8', errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 20: invalid start byte
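This traceback is consistent with a header word that is declared (or assumed) to be UTF-8 but contains a byte such as 0xae, which is the registered-trademark sign in Latin-1. A defensive sketch using the standard library; the helper name `decode_header_lenient` and the Latin-1 fallback are assumptions for illustration, not imbox's actual code:

```python
from email.header import decode_header


def decode_header_lenient(raw_header, fallback="latin-1"):
    """Decode a header value, falling back when the declared charset is wrong."""
    parts = []
    for text, charset in decode_header(raw_header):
        if isinstance(text, str):
            # Headers without encoded words come back as plain str.
            parts.append(text)
            continue
        try:
            parts.append(text.decode(charset or "utf-8"))
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: Latin-1 accepts any byte.
            parts.append(text.decode(fallback))
    return "".join(parts)


# Hypothetical header: declared as UTF-8, but 0xAE is not valid UTF-8 here.
name = decode_header_lenient("=?utf-8?q?Acme=AE_Corp?=")
```

A fallback like this trades a crash for a possibly wrong character, which may be acceptable for a display name or a debug log line.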

@sblondon
Contributor

@wesinator Can you provide the header that produces the bug?

@wesinator

@sblondon No, unfortunately I lost that specific one; it got deleted. But if I see it again, I'll try to provide the header.

@sblondon
Contributor

ok, thanks @wesinator
