Mail::Body#to_s uses wrong String encoding #1413

koppen · 2020-09-25T11:48:12Z

Given an email encoded with a ISO-8859-1 charset, I'd expect the strings coming out of mail methods to be either:

Encoded with ISO-8859-1
Converted correctly to UTF-8

However, we see them returned with ASCII-8BIT encodings which doesn't seem right - and which causes problems on subsequent handling of those strings.

How to reproduce

The following script illustrates the issue:

require "mail"

mail = Mail.new('
Content-Type: multipart/alternative; boundary="_000_AM0PR09MB233743E6638C8665A901BF08CA220AM0PR09MB2337eurp_"

--_000_AM0PR09MB233743E6638C8665A901BF08CA220AM0PR09MB2337eurp_
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

R=F8dgr=F8d med fl=F8de p=E5!
')
text_part = mail.text_part

charset = text_part.charset
puts "charset: #{charset.inspect}"

body_string = text_part.body.to_s
puts "encoding: #{body_string.encoding.inspect}"

# This triggers a conversion of the string to UTF-8, which then fails.
require "json"
JSON.generate({body: body_string})

Running it outputs

charset: "iso-8859-1"
encoding: #<Encoding:ASCII-8BIT>
Traceback (most recent call last):
	2: from repro2.rb:22:in `<main>'
	1: from ~/.rvm/gems/ruby-2.6.5/gems/json-2.3.0/lib/json/common.rb:224:in `generate'
~/.rvm/gems/ruby-2.6.5/gems/json-2.3.0/lib/json/common.rb:224:in `generate': "\xF8" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)

While the exception comes from inside the JSON gem, I'd risk assessing that the root cause is because body_string.encoding is #<Encoding:ASCII-8BIT> and not #<Encoding:ISO-8859-1>. To verify this we can add a

body_string.force_encoding(charset)

before performing the JSON encoding, which makes the script run without exceptions.

Versions

mail 2.8.0.edge
ruby 2.6.5 and 2.7

Possibly related issues

I did look through the existing issues, and while I wasn't able to find any that matches the issue exactly, there are quite a few that seem related. Many are older, though, and/or missing reproduction steps:

The text was updated successfully, but these errors were encountered:

benlangfeld · 2020-11-04T01:44:13Z

This was originally reported in #809 but closed based on a misunderstanding of later comments.

koppen · 2020-11-04T07:38:27Z

Nicely spotted, that is the exact issue we're seeing.

And misunderstandings so easily occur here, when we're talking about encodings, but not those encodings, the other encoding, which is really charset. 🙄 And it's especially hard because you can't visually tell the difference and in so many cases everything still works even though it is wrong.

koppen · 2020-11-04T08:38:35Z

It does seem, though, that using Mail::Message#decoded and Mail::Part#decoded, as @jeremy suggests in #809 (comment), works. Perhaps not entirely as expected as it always returns a UTF-8 encoded String, but that does satisfy my expectation number 2 above.

That said, it still seems problematic that Mail::Body#decoded can return strings with a decidedly incorrect encoding, but I wonder if this might be more an issue of improving the docs around #decoded?

benlangfeld · 2020-11-04T11:24:07Z

It does seem, though, that using Mail::Message#decoded and Mail::Part#decoded, as @jeremy suggests in #809 (comment), works.

Only if the email is multi-part. Otherwise, Mail::Message.parts returns [].

Perhaps not entirely as expected as it always returns a UTF-8 encoded String, but that does satisfy my expectation number 2 above.

That said, it still seems problematic that Mail::Body#decoded can return strings with a decidedly incorrect encoding, but I wonder if this might be more an issue of improving the docs around #decoded?

I agree, it's broken.

>> s = "Delivered-To: foo@example.com\r\nFrom: Ben Langfeld <blangfeld@powerhrg.com>\r\nContent-Type: text/plain;\r\n\tcharset=utf-8\r\nContent-Transfer-Encoding: quoted-printable\r\nMime-Version: 1.0 (Mac OS X Mail 13.4 \\(3608.120.23.2.1\\))\r\nSubject: Test\r\nMessage-Id: <CC698DA3-B6BA-4DE4-8E5A-8200D066FCC3@powerhrg.com>\r\nTo: foo@example.com\r\n\r\n=F0=9F=91=8D\r\n\r\nBen=\r\n"
=> "Delivered-To: foo@example.com\r\nFrom: Ben Langfeld <blangfeld@powerhrg.com>\r\nContent-Type: text/plain;\r\n\tcharset=utf-8\r\nContent-Transfer-Encoding: quoted-printable\r\nMime-Version: 1.0 (Mac OS X Mail 13.4 \\(3608.120.23.2.1\\))\r\nSubject: Test\r\nMessage-Id: <CC698DA3-B6BA-4DE4-8E5A-8200D066FCC3@powerhrg.com>\r\nTo: foo@example.com\r\n\r\n=F0=9F=91=8D\r\n\r\nBen=\r\n"
>> mail = Mail.read_from_string(s)
=> #<Mail::Message:70158438295940, Multipart: false, Headers: <From: Ben Langfeld <blangfeld@powerhrg.com>>, <To: foo@example.com>, <Message-ID: <CC698DA3-B6BA-4DE4-8E5A-8200D066FCC3@powerhrg.com>>, <Subject: Test>, <Mime-Version: 1.0 (Mac OS X Mail 13.4 \(3608.120.23.2.1\))>, <Content-Type: text/plain;	charset=utf-8>, <Content-Transfer-Encoding: quoted-printable>, <Delivered-To: foo@example.com>>
>> mail.body
=> #<Mail::Body:0x00007f9e118bc9d0 @boundary=nil, @preamble=nil, @epilogue=nil, @charset="US-ASCII", @part_sort_order=["text/plain", "text/enriched", "text/html", "multipart/alternative"], @parts=[], @raw_source="=F0=9F=91=8D\r\n\r\nBen=\r\n", @ascii_only=true, @encoding="quoted-printable">
>> mail.body.decoded
=> "\xF0\x9F\x91\x8D\r\n\r\nBen"
>> mail.decoded
=> "👍\r\n\r\nBen"
>> mail.parts
=> []
>> mail.part
=> [#<Mail::Part:70158574799800, Multipart: false, Headers: <Content-Type: text/plain;>>, #<Mail::Part:70158574804860, Multipart: false, Headers: >]
>> mail.part[0].decoded
=> "\xF0\x9F\x91\x8D\r\n\r\nBen"

TylerRick · 2023-01-13T18:54:40Z

I'm running this issue too, while processing a message sent into the app from Cloudmailin.

Encoding::UndefinedConversionError: "\xC2" from ASCII-8BIT to UTF-8

So what's the solution?

What do we expect decoded to return in these cases? How can we fix it so that it does so?

(And why is this 5-year-old (#1216), very serious, app-crashing bug still unresolved?)

TylerRick · 2023-01-24T00:13:18Z

Not a proper solution to the underlying problem, but here's the quick workaround I came up with, in case it's helpful to anyone...

  html = fix_encoding(@inbound.html_part)

  private def fix_encoding(part)
    body_str = part.body.decoded
    # Why is the encoding sometimes ASCII-8BIT (and sometimes already converted to UTF-8) instead of
    # respecting the claimed encoding? Why does trying to encode it as UTF-8 result in:
    # Encoding::UndefinedConversionError ("\xE2" from ASCII-8BIT to UTF-8)? We may never know. (See
    # https://github.com/mikel/mail/issues/1413, etc.) But here is a workaround.
    logger.debug "body_str.encoding: #{body_str.encoding} (part.charset: #{part.charset})"
    unless body_str.encoding == Encoding.find('utf-8')
      new_encoding =
        if (body_str.encode('utf-8') rescue false)
          'utf-8'
        else
          part.charset
        end
      logger.warn "Using force_encoding to change from #{body_str.encoding} to #{new_encoding} (part.charset: #{part.charset})"
      body_str = body_str.force_encoding(new_encoding)
    end
    body_str
  end

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mail::Body#to_s uses wrong String encoding #1413

Mail::Body#to_s uses wrong String encoding #1413

koppen commented Sep 25, 2020 •

edited

Loading

benlangfeld commented Nov 4, 2020

koppen commented Nov 4, 2020

koppen commented Nov 4, 2020

benlangfeld commented Nov 4, 2020 •

edited

Loading

TylerRick commented Jan 13, 2023

TylerRick commented Jan 24, 2023

Mail::Body#to_s uses wrong String encoding #1413

Mail::Body#to_s uses wrong String encoding #1413

Comments

koppen commented Sep 25, 2020 • edited Loading

How to reproduce

Versions

Possibly related issues

benlangfeld commented Nov 4, 2020

koppen commented Nov 4, 2020

koppen commented Nov 4, 2020

benlangfeld commented Nov 4, 2020 • edited Loading

TylerRick commented Jan 13, 2023

TylerRick commented Jan 24, 2023

koppen commented Sep 25, 2020 •

edited

Loading

benlangfeld commented Nov 4, 2020 •

edited

Loading