UUencoded attachment parsing #80

sylencecc · 2020-11-24T13:40:06Z

When dealing with attachments encoded via uuencoding (Content-transfer-encoding is uuencode or x-uuencode), mail-parser treats them as text, as can be seen in parse() (mailparser.py:378):

if transfer_encoding == "base64" or (
  transfer_encoding == "quoted-\
  printable" and "application" in mail_content_type):
    ...
else:
  payload = ported_string(p.get_payload(decode=True), encoding=charset)
  log.debug("Filename {!r} part {!r} is not binary".format(filename, i))

Within the else block, the payload is correctly decoded with p.get_payload(decode=True), but then passed to ported_string() which attempts to encode the returned bytes to UTF-8 in utils.py:85:

def ported_string(raw_data, encoding='utf-8', errors='ignore'):
...
  try:
    return six.text_type(raw_data, encoding).strip()
  except (LookupError, UnicodeDecodeError):
    return six.text_type(raw_data, "utf-8", errors).strip()

Since errors are ignored, encoding doesn't fail, but returns a attachment stripped of all bytes that can't be encoded in utf-8 (that can be easily verified by attempting to write that binary to disk with write_attachments).

I encountered this issue while porting SpamScope to Python3, which has a test test_store_samples_unicode_error that parses and saves a uuencoded attachment. According to the test, the resulting file should have a MD5 checksum of 2ea90c996ca28f751d4841e6c67892b8. That test passes with Python2, because the incorrectly parsed payload does indeed have that hash. However, with Python3 the hash changes due to differences in unicode handling. However, the correct checksum is actually 4f2cf891e7cfb349fca812091f184ecc.

The text was updated successfully, but these errors were encountered:

sylencecc · 2020-11-24T14:04:56Z

I added the respective uuencoded mail sample from the SpamScope repository (its name is mail_test_10 over there) and a test case that highlights the issue. That test fails with py2 while attempting to write the binary data as utf-8:

File "mailparser/utils.py", line 562, in write_sample
  f.write(payload)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0612' in position 57: ordinal not in range(128

SpamScope doesn't run into that exception, because it ships with its own write method that succeeds in writing the (now broken) file.
For py3, the test fails as expected due to an invalid hash:

AssertionError: 'c2d03e7cb36e48c625d9243ee3fdc401' != '4f2cf891e7cfb349fca812091f184ecc'
- c2d03e7cb36e48c625d9243ee3fdc401
+ 4f2cf891e7cfb349fca812091f184ecc

As for the fix, I decided to simply re-encode the uuencoded attachment in base64 and call it a day. That's because the binary flag is mostly treated as base64 all over the place (even in SpamScope), even though an attachment could also be encoded as uuencode (amongst others).
To highlight a few examples:

In mailparser/utils.py:557 binary-encoded attachments are always treated as if they were base64

if binary:
  with open(sample, "wb") as f:
    f.write(base64.b64decode(payload))

Same for SpamScope, e.g. in src/modules/attachments/attachments.py:259:

if raw_dict["binary"]:
  try:
    payload = base64.b64decode(raw_dict["payload"])

However, just keeping binary set to False causes other issues, because then the payload will be treated as utf-8 string in various places.

For a better solution it might be beneficial to simply treat payload as a byte string or bytearray if binary is True instead of expecting a base64-encoded string. Since that would require various rewrites all over mail-parser and SpamScope, I went with this easier fix. The obvious drawback of my solution is that the transfer encoding is now incorrectly reported as base64 even though the actual attachment was encoded with uuencode. That's still better than not being able to parse those attachments at all.

fedelemantuano · 2020-11-24T18:37:09Z

Hi @sylencecc,

thanks a lot for your PR. Are you working on SpamScope porting?

sylencecc · 2020-11-25T08:48:09Z

Yeah, we have some py3-only dependencies, which is why I'm currently in the process of porting it over. If you're interested in the results, I'll happily send a pull request as soon as I'm done. However, I'm not keeping backwards compatibility: it won't run with py2 anymore. In addition to that, I'm only using the Docker-based version, so I won't touch the Ansible stuff for now. Docker-wise, since the SpamScope image depends on fmantuano/spamscope-deps (which I didn't find a repository for) and this again depends on fmantuano/apache-storm (which I DID find a repository for), I did the following:

Forked the repository for fmantuano/apache-storm and updated Apache Storm to version 2.2.0 (see here)
Merged the Dockerfile for spamscope-deps into the Dockerfile within the SpamScope repository and updated all dependencies. The drawback of that merge is that building the v8 engine for thug takes forever, however now users have the chance to manage their dependencies which I find better than depending on the outdated soamscope-deps from Docker Hub.
Conversion process itself is still underway (it's just a side project), but SpamScope tests are passing now with py3.

Let me know if I should send you PR requests for all that stuff. A separate branch might be appropriate.

fedelemantuano · 2020-11-26T21:08:13Z

Yes you can send me all PRs.

I'm not keeping backwards compatibility: it won't run with py2 anymore.
Me too.

Good job.

sylencecc · 2020-12-01T17:54:03Z

Please have a look at my SpamScope fork and the Storm Dockerfile which the new SpamScope image depends on. So far, the included tests run fine, as does the default debug topology. The project we're using SpamScope in also seems to run on py3 without further issues. However, due to the mediocre test coverage we can't be all too confident that this update doesn't break anything. Moreover, I didn't update the Ansible playbooks due to a lack of time.

Do you want to merge that info master/develop or create a separate branch just for py3 support?
Should I squash my commits into a single "py3 update" commit?
Did you encounter any issues with my changes, anything I forgot etc.?

fedelemantuano · 2020-12-05T10:09:29Z

Hi @sylencecc,

I will check your code. Now I can't.
I prefer to use a custum branch, git flow style feature/py3.
Yes, it better squash your commits.

For code I will test it.

sylencecc · 2021-02-10T00:33:07Z

Any news on that? Did you find time to test the py3 changes?

fedelemantuano · 2021-02-10T07:45:24Z

Hi @sylencecc I will test it. I'm working on it best effort.

sylencecc mentioned this issue Nov 24, 2020

Proper handling of uuencoded attachments by re-encoding them in base64 #81

Merged

fedelemantuano closed this as completed in #81 Nov 24, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UUencoded attachment parsing #80

UUencoded attachment parsing #80

sylencecc commented Nov 24, 2020

sylencecc commented Nov 24, 2020 •

edited

Loading

fedelemantuano commented Nov 24, 2020

sylencecc commented Nov 25, 2020 •

edited

Loading

fedelemantuano commented Nov 26, 2020

sylencecc commented Dec 1, 2020 •

edited

Loading

fedelemantuano commented Dec 5, 2020

sylencecc commented Feb 10, 2021

fedelemantuano commented Feb 10, 2021

UUencoded attachment parsing #80

UUencoded attachment parsing #80

Comments

sylencecc commented Nov 24, 2020

sylencecc commented Nov 24, 2020 • edited Loading

fedelemantuano commented Nov 24, 2020

sylencecc commented Nov 25, 2020 • edited Loading

fedelemantuano commented Nov 26, 2020

sylencecc commented Dec 1, 2020 • edited Loading

fedelemantuano commented Dec 5, 2020

sylencecc commented Feb 10, 2021

fedelemantuano commented Feb 10, 2021

sylencecc commented Nov 24, 2020 •

edited

Loading

sylencecc commented Nov 25, 2020 •

edited

Loading

sylencecc commented Dec 1, 2020 •

edited

Loading