Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mail::Body#to_s uses wrong String encoding #1413

Open
koppen opened this issue Sep 25, 2020 · 6 comments
Open

Mail::Body#to_s uses wrong String encoding #1413

koppen opened this issue Sep 25, 2020 · 6 comments

Comments

@koppen
Copy link

koppen commented Sep 25, 2020

Given an email encoded with a ISO-8859-1 charset, I'd expect the strings coming out of mail methods to be either:

  1. Encoded with ISO-8859-1
  2. Converted correctly to UTF-8

However, we see them returned with ASCII-8BIT encodings which doesn't seem right - and which causes problems on subsequent handling of those strings.

How to reproduce

The following script illustrates the issue:

require "mail"

mail = Mail.new('
Content-Type: multipart/alternative; boundary="_000_AM0PR09MB233743E6638C8665A901BF08CA220AM0PR09MB2337eurp_"

--_000_AM0PR09MB233743E6638C8665A901BF08CA220AM0PR09MB2337eurp_
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

R=F8dgr=F8d med fl=F8de p=E5!
')
text_part = mail.text_part

charset = text_part.charset
puts "charset: #{charset.inspect}"

body_string = text_part.body.to_s
puts "encoding: #{body_string.encoding.inspect}"

# This triggers a conversion of the string to UTF-8, which then fails.
require "json"
JSON.generate({body: body_string})

Running it outputs

charset: "iso-8859-1"
encoding: #<Encoding:ASCII-8BIT>
Traceback (most recent call last):
	2: from repro2.rb:22:in `<main>'
	1: from ~/.rvm/gems/ruby-2.6.5/gems/json-2.3.0/lib/json/common.rb:224:in `generate'
~/.rvm/gems/ruby-2.6.5/gems/json-2.3.0/lib/json/common.rb:224:in `generate': "\xF8" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)

While the exception comes from inside the JSON gem, I'd risk assessing that the root cause is because body_string.encoding is #<Encoding:ASCII-8BIT> and not #<Encoding:ISO-8859-1>. To verify this we can add a

body_string.force_encoding(charset)

before performing the JSON encoding, which makes the script run without exceptions.

Versions

  • mail 2.8.0.edge
  • ruby 2.6.5 and 2.7

Possibly related issues

I did look through the existing issues, and while I wasn't able to find any that matches the issue exactly, there are quite a few that seem related. Many are older, though, and/or missing reproduction steps:

@benlangfeld
Copy link

This was originally reported in #809 but closed based on a misunderstanding of later comments.

@koppen
Copy link
Author

koppen commented Nov 4, 2020

Nicely spotted, that is the exact issue we're seeing.

And misunderstandings so easily occur here, when we're talking about encodings, but not those encodings, the other encoding, which is really charset. 🙄 And it's especially hard because you can't visually tell the difference and in so many cases everything still works even though it is wrong.

@koppen
Copy link
Author

koppen commented Nov 4, 2020

It does seem, though, that using Mail::Message#decoded and Mail::Part#decoded, as @jeremy suggests in #809 (comment), works. Perhaps not entirely as expected as it always returns a UTF-8 encoded String, but that does satisfy my expectation number 2 above.

That said, it still seems problematic that Mail::Body#decoded can return strings with a decidedly incorrect encoding, but I wonder if this might be more an issue of improving the docs around #decoded?

@benlangfeld
Copy link

benlangfeld commented Nov 4, 2020

It does seem, though, that using Mail::Message#decoded and Mail::Part#decoded, as @jeremy suggests in #809 (comment), works.

Only if the email is multi-part. Otherwise, Mail::Message.parts returns [].

Perhaps not entirely as expected as it always returns a UTF-8 encoded String, but that does satisfy my expectation number 2 above.

That said, it still seems problematic that Mail::Body#decoded can return strings with a decidedly incorrect encoding, but I wonder if this might be more an issue of improving the docs around #decoded?

I agree, it's broken.

>> s = "Delivered-To: foo@example.com\r\nFrom: Ben Langfeld <blangfeld@powerhrg.com>\r\nContent-Type: text/plain;\r\n\tcharset=utf-8\r\nContent-Transfer-Encoding: quoted-printable\r\nMime-Version: 1.0 (Mac OS X Mail 13.4 \\(3608.120.23.2.1\\))\r\nSubject: Test\r\nMessage-Id: <CC698DA3-B6BA-4DE4-8E5A-8200D066FCC3@powerhrg.com>\r\nTo: foo@example.com\r\n\r\n=F0=9F=91=8D\r\n\r\nBen=\r\n"
=> "Delivered-To: foo@example.com\r\nFrom: Ben Langfeld <blangfeld@powerhrg.com>\r\nContent-Type: text/plain;\r\n\tcharset=utf-8\r\nContent-Transfer-Encoding: quoted-printable\r\nMime-Version: 1.0 (Mac OS X Mail 13.4 \\(3608.120.23.2.1\\))\r\nSubject: Test\r\nMessage-Id: <CC698DA3-B6BA-4DE4-8E5A-8200D066FCC3@powerhrg.com>\r\nTo: foo@example.com\r\n\r\n=F0=9F=91=8D\r\n\r\nBen=\r\n"
>> mail = Mail.read_from_string(s)
=> #<Mail::Message:70158438295940, Multipart: false, Headers: <From: Ben Langfeld <blangfeld@powerhrg.com>>, <To: foo@example.com>, <Message-ID: <CC698DA3-B6BA-4DE4-8E5A-8200D066FCC3@powerhrg.com>>, <Subject: Test>, <Mime-Version: 1.0 (Mac OS X Mail 13.4 \(3608.120.23.2.1\))>, <Content-Type: text/plain;	charset=utf-8>, <Content-Transfer-Encoding: quoted-printable>, <Delivered-To: foo@example.com>>
>> mail.body
=> #<Mail::Body:0x00007f9e118bc9d0 @boundary=nil, @preamble=nil, @epilogue=nil, @charset="US-ASCII", @part_sort_order=["text/plain", "text/enriched", "text/html", "multipart/alternative"], @parts=[], @raw_source="=F0=9F=91=8D\r\n\r\nBen=\r\n", @ascii_only=true, @encoding="quoted-printable">
>> mail.body.decoded
=> "\xF0\x9F\x91\x8D\r\n\r\nBen"
>> mail.decoded
=> "👍\r\n\r\nBen"
>> mail.parts
=> []
>> mail.part
=> [#<Mail::Part:70158574799800, Multipart: false, Headers: <Content-Type: text/plain;>>, #<Mail::Part:70158574804860, Multipart: false, Headers: >]
>> mail.part[0].decoded
=> "\xF0\x9F\x91\x8D\r\n\r\nBen"

@TylerRick
Copy link
Contributor

I'm running this issue too, while processing a message sent into the app from Cloudmailin.

Encoding::UndefinedConversionError: "\xC2" from ASCII-8BIT to UTF-8

So what's the solution?

What do we expect decoded to return in these cases? How can we fix it so that it does so?

(And why is this 5-year-old (#1216), very serious, app-crashing bug still unresolved?)

@TylerRick
Copy link
Contributor

Not a proper solution to the underlying problem, but here's the quick workaround I came up with, in case it's helpful to anyone...

  html = fix_encoding(@inbound.html_part)

  private def fix_encoding(part)
    body_str = part.body.decoded
    # Why is the encoding sometimes ASCII-8BIT (and sometimes already converted to UTF-8) instead of
    # respecting the claimed encoding? Why does trying to encode it as UTF-8 result in:
    # Encoding::UndefinedConversionError ("\xE2" from ASCII-8BIT to UTF-8)? We may never know. (See
    # https://github.com/mikel/mail/issues/1413, etc.) But here is a workaround.
    logger.debug "body_str.encoding: #{body_str.encoding} (part.charset: #{part.charset})"
    unless body_str.encoding == Encoding.find('utf-8')
      new_encoding =
        if (body_str.encode('utf-8') rescue false)
          'utf-8'
        else
          part.charset
        end
      logger.warn "Using force_encoding to change from #{body_str.encoding} to #{new_encoding} (part.charset: #{part.charset})"
      body_str = body_str.force_encoding(new_encoding)
    end
    body_str
  end

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants