Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PreviewText field in MessageSummary is incorrect for the email with charset="ks_c_5601-1987" #1755

Closed
aldayneko opened this issue May 20, 2024 · 14 comments
Labels
server-bug The bug appears to be in the server

Comments

@aldayneko
Copy link

aldayneko commented May 20, 2024

IMailFolder.FetchAsync returns incorrect PreviewText in case when email has Content-Type: text/plain; charset="ks_c_5601-1987"
The result is "� � � �ȳ�ϼ� �, � �⸦ �ٶ�ϴ�. �̹� �ָ� � �ȹ� � �ֳ�? �ٵ� � �" but should be like this "안녕하세요 여러분 내 주제는 여기 안녕하세요 여러분"
This charset is registered on the environment and I can read the correct body when fetch body parts, but can't get the correct value inside PreviewText.
I can see in the code that finally PreviewText value is read like UTF8 but what to do if there is another encoding. Is there way to customize it ?

public async Task<string> ReadLiteralAsync (CancellationToken cancellationToken)
{
	if (Stream.Mode != ImapStreamMode.Literal)
		throw new InvalidOperationException ();

	int literalLength = Stream.LiteralLength;
	var buf = ArrayPool<byte>.Shared.Rent (literalLength);

	try {
		int n, nread = 0;

		do {
			if ((n = await Stream.ReadAsync (buf, nread, literalLength - nread, cancellationToken).ConfigureAwait (false)) > 0)
				nread += n;
		} while (nread < literalLength);

		try {
			return TextEncodings.UTF8.GetString (buf, 0, nread);
		} catch {
			return TextEncodings.Latin1.GetString (buf, 0, nread);
		}
	} finally {
		ArrayPool<byte>.Shared.Return (buf);
	}
}

Platform (please complete the following information):

  • MailKit Version: 4.6

Exception
NO

To Reproduce
Call FetchAsync and check the result when message has charset="ks_c_5601-1987"

Expected behavior
PreviewText should have readable value

@jstedfast
Copy link
Owner

Could you construct a sample message with such a message body for me? This way I could add it to my unit tests, debug it, and figure out a proper solution?

Thanks!

@aldayneko
Copy link
Author

Sure, here is the text file with message source. Please, let me know if it's what you need
korean.txt
Thanks.

@jstedfast
Copy link
Owner

Quick question:

I started to take a look at this and there are 2 implementations for preview text.

  1. The server supports the PREVIEW feature.
  2. The server does not support the PREVIEW feature, so MailKit attempts to fetch a small chunk of the message body itself.

Option 2 is by far the most prevalent because very few servers actually support the PREVIEW feature. That said, based on your original bug report, it sounds like you've already done some debugging and have narrowed the bug down to the ReadLiteral method doing incorrect character conversion.

Does that mean your server supports the PREVIEW feature?

I just want to make sure I'm looking into the correct bits of code.

@aldayneko
Copy link
Author

Yeah, I have been trying to debug it. Server doesn't support PREVIEW feature, so I see that previewText value finally is read as a small chunk of the body and decoded inside ReadLiteralAsync as UTF8, which is incorrect since it's not UTF8. It seems like we need somehow to put here the correct encoding from the body charset

@jstedfast
Copy link
Owner

I would have assumed that https://github.com/jstedfast/MailKit/blob/master/MailKit/Net/Imap/ImapFolderFetch.cs#L918 would have taken care of this because QueueFetchPreviewCommand sets up a FetchPreviewContext which gets the raw body data added to it via the FetchStreamAsync callback method which consumes the body content into a Stream and then calls ctx.AddAsync() which eventually makes it to FetchPreviewTextContext.Add()

Perhaps what is happening is that there is a DecoderException for ks_c_5601-1987 (due to incomplete data that ends in the middle of a multibyte character?) which then causes fallback to iso-8859-1?

@jstedfast
Copy link
Owner

Take a look at the above commit - it attempts to reproduce the issue in the unit tests but it works fine for me.

That said, I suspect that it's working for me because the unit test is likely missing something important but I don't know what it is.

As per my previous comment, perhaps the issue is that the unit test has a shorter message body than the message body that is causing you problems and if the message body were longer, then it might hit a charset conversion issue that results in fallback to iso-8859-1.

Do you think it would be possible for you to take a look at the above unit test that I just added use that as a template for obtaining an IMAP server response that causes this issue?

@jstedfast
Copy link
Owner

I've scraped some Korean text off of Wikipedia to try and get a long string of text and it still works, so my thinking is that perhaps you did not register the international text encodings in your app?

Make sure at program startup, you call the following line of code:

System.Text.Encoding.RegisterProvider (System.Text.CodePagesEncodingProvider.Instance);

@aldayneko
Copy link
Author

Sorry, I didn't have time to try your recommendations above, I'll do it tomorrow or during the weekend

Regarding encodings - it's registered exactly as you wrote and I'm able to read body and subject in correct format when I call MailFolder.GetMessageAsync()

@aldayneko
Copy link
Author

I've tried to copy-paste responses from the server into unit test that you mentioned, but I'm not sure that I did it correctly.
I'm getting some exceptions about parsing string, probably because I copied wrong data.
So, I caught all raw logs via ProtocolLogger and wrote to the file
Could you please check it with this data ? It exactly what I get from the imap server
imap_log.txt

@jstedfast
Copy link
Owner

jstedfast commented May 24, 2024

Okay, so this is what we're interested in:

C: D00000007 UID FETCH 284 (UID FLAGS INTERNALDATE ENVELOPE BODYSTRUCTURE PREVIEW BODY.PEEK[HEADER.FIELDS (IMPORTANCE SECURE-REPLY-SENDER)])
S: * 144 FETCH (UID 284 FLAGS (\Seen) INTERNALDATE "17-May-2024 07:48:43 +0000" PREVIEW {213}
S: � � � �ȳϼ� �, � �⸦ �ٶ�ϴ�. �̹� �ָ� � �ȹ� � �ֳ�? �ٵ� � �ñ� �ٶ�Կ�! � �ȹ�ϴ� �̵� �ֽ�ϴ�. � � �غ�. � �! ENVELOPE [snip] BODY[HEADER.FIELDS (IMPORTANCE SECURE-REPLY-SENDER)] {2}
S: 
S: )
S: D00000007 OK Fetch completed (0.001 + 0.000 secs).

Unfortunately, the IMAP server does claim to support the PREVIEW feature and it is providing the preview text in the wrong charset. It MUST provide the preview text in UTF-8, but it is providing it in ks_c_5601-1987 as you've noted.

You could try submitting a bug here: https://github.com/dovecot/core

The specification in question can be found here: https://www.rfc-editor.org/rfc/rfc8970.html
Specifically, section 3.3 notes:

The generated string MUST NOT be content transfer encoded and MUST be
encoded in UTF-8 [RFC3629].  The server SHOULD remove any formatting
markup and do whatever processing might be useful in rendering the
preview as plain text.

If you file a bug against Dovecot, feel free to /cc me so that I can follow along and chime in if needed.

As a work-around, you could try disabling server-side PREVIEW by doing this in your code after authenticating:

client.Capabilities &= ~ImapCapabilities.Preview;

This will unfortunately make the query a bit slower, but it should provide better results until Dovecot fixes the bug.

Hope that helps.

@jstedfast jstedfast added the server-bug The bug appears to be in the server label May 24, 2024
@aldayneko
Copy link
Author

aldayneko commented May 27, 2024

Yes, you're right, the IMAP server supports preview.
Sorry, when I was debugging here https://github.com/jstedfast/MailKit/blob/master/MailKit/Net/Imap/ImapFolderFetch.cs#L1273 the previewText variable was false that's why I thought preview is not supported.
Btw, regarding rfc, the link that you've provided has a status 'Proposed Standard' - so, it's not final and we can't say it's a bug ( or we can ? :))
Anyway, we're trying to fix it with dovecat configuration. Disabling server-side PREVIEW is an option, but not preferable
Just a question - why we can't read response from preview base on body charset instead of UTF-8
Or it'll be breaking change because in most cases it's returned as UTF-8 ?

@jstedfast
Copy link
Owner

the previewText variable was false that's why I thought preview is not supported.

Don't worry about it, I can understand why you misunderstood.

Btw, regarding rfc, the link that you've provided has a status 'Proposed Standard' - so, it's not final and we can't say it's a bug ( or we can ? :))

We can say it's a bug because clients and servers are supposed to implement specifications exactly or they aren't worth writing.

Anyway, we're trying to fix it with dovecat configuration. Disabling server-side PREVIEW is an option, but not preferable

You don't have to disable it in Dovecot in a config file, just disable it in MailKit by using the snippet I pasted in my previous comment.

Just a question - why we can't read response from preview base on body charset instead of UTF-8
Or it'll be breaking change because in most cases it's returned as UTF-8 ?

Because we don't necessarily know the body charset.

@aldayneko
Copy link
Author

aldayneko commented May 28, 2024

if we don't know anything about body charset, how do we read Korean text in the subject ? I thought, in most cases it should be the same.

@jstedfast
Copy link
Owner

Every header (or even every email address in a To/From/Cc header) and every MIME part of the message can use a different charset.

You can't rely on a subject charset to be the same as the body.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
server-bug The bug appears to be in the server
Projects
None yet
Development

No branches or pull requests

2 participants