Resolve #405. Unicode please!!! #406

davidshen84 · 2018-06-02T06:48:58Z

I converted that peculiar dot into utf-8 encoding, and the solution can build on my Windows 10 64 bit system which has GB2312 as the default character encoding system.

Also, if possible, can we remove that dot?

ghost · 2018-06-02T13:18:59Z

Hi David

Thanks for this. Do you think we should run something like this over the entire repository? I tested it and it does not break any tests on my side.

class Program
{
	static void Main(string[] args)
	{
		var inputDirectory = @"c:\code\windsor"; // <- change to your own checkout path

		WriteUTF8WithoutBOM(inputDirectory, "*.cs");
		WriteUTF8WithoutBOM(inputDirectory, "*.xml");
		WriteUTF8WithoutBOM(inputDirectory, "*.config");
		WriteUTF8WithoutBOM(inputDirectory, "*.csproj");
		WriteUTF8WithoutBOM(inputDirectory, "*.props");


		Console.ReadLine();
	}

	private static void WriteUTF8WithoutBOM(string inputDirectory, string searchPattern)
	{
		foreach (var file in Directory.GetFiles(inputDirectory, searchPattern, SearchOption.AllDirectories))
		{
			if (file.Contains(".git\\")) continue;
			if (file.Contains(".vs\\")) continue;

			Console.WriteLine(file);

			Encoding utf8WithoutBom = new UTF8Encoding(false);
			var fileContents = File.ReadAllText(file, utf8WithoutBom);
			var streamWriter = new StreamWriter(file, false, utf8WithoutBom);
			streamWriter.Write(fileContents);
			streamWriter.Dispose();
		}
	}
}

Also don't forget to add a CHANGELOG.md entry.

ghost · 2018-06-02T13:21:01Z

Also, if possible, can we remove that dot?

We don't have to remove the dot if the file is encoded properly.

ghost

We should maybe fix all the files. There are some files with UTF8 BOMS in them too.

Please see: #406 (comment)

davidshen84 · 2018-06-03T06:12:12Z

@fir3pho3nixx , good idea. I have updated the PR.

ghost · 2018-06-03T13:06:11Z

@davidshen84 - Don't forget to make a note in the CHANGELOG.md

jonorossi · 2018-06-03T17:47:51Z

@fir3pho3nixx why did you want to remove BOMs? They are valid as part of the Unicode standard and not hurting anyone; Visual Studio is actually the one saving them there, not sure if the new versions of VS still adds them to new files.

I don't see a reason to mention this in the changelog, unless the change affects users of the library.

If I recall looking at the debugger views in Windsor the dots might be those fancy angle brackets, I recall seeing them somewhere in Windsor debugger views (added in 2011: d7ff1da). I see no reason not to just make this function output exactly what you'd type as a C# developer, no need for fancy characters.

stakx · 2018-06-03T18:08:04Z

If you need a (sort of) good excuse to normalize BOMs across the whole codebase, you could just update the year to 2018 in the copyright (if that's in fact something that needs to be done at all)... and do the BOM adjustment as an aside. ;-)

ghost · 2018-06-04T19:26:40Z

Oh dear this was never about BOM’s. Sorry if that is what you thought. The code I supplied actually converts ANSI to UTF8 but happens to remove BOM’s. I thought this was a good approach as BOM’s infer things about character streams.

@jonorossi This actually a defect which prevents people with non standard encoding from building the solution. So I thought it was a fix of some description. Updating the change log is neither here not there. As long as it builds! Sweet.

@stakx I am not sure I follow. License headers and UTF encoding of source files that allow this solution to be built in Asia are very different. The real problem is we have some ANSI duff ruining the party.

stakx · 2018-06-04T23:04:52Z

@fir3pho3nixx, please disregard my remark, I was half joking. (But to explain what I meant, I was also under the impression that this was about BOMs mainly, and since the diffs already affect the first line of each code file, which happens to contain the copyright line, you could've killed two birds with one stone but put the focus on only one of those. Sneaky, and not altogether serious.)

ghost · 2018-06-05T00:03:28Z

@stakx - Awww man, if only the BOM's were distributed across all code files. Then we could update the headers! :)

ghost · 2018-06-05T00:07:44Z

CHANGELOG.md

@@ -5,6 +5,7 @@
 Enhancements:
 - Added XML documentation to BeginScope and RequireScope lifetime extensions (@jonorossi)
 - Upgraded build to use NUnit Adapters (@fir3pho3nixx, #243)
+- Remove BOM from all source code files (@davidshen84, #405)


Maybe we should say:

Fixed encoding problem where builds fail under GB2312 character encoding system

davidshen84 · 2018-06-05T00:28:03Z

Yeah~the discussions carried me away, and I forgot why I opened this bug in the first place :)

…

On Tue, Jun 5, 2018 at 8:07 AM Gavin van der Merwe ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In CHANGELOG.md <#406 (comment)>: > @@ -5,6 +5,7 @@ Enhancements: - Added XML documentation to BeginScope and RequireScope lifetime extensions ***@***.***) - Upgraded build to use NUnit Adapters ***@***.***, #243) +- Remove BOM from all source code files ***@***.***, #405) Maybe we should say: Fixed encoding problems prevented build under GB2312 character encoding system — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#406 (review)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEAzQk48WdvI8hf5aY0JhUygPDI-dKoks5t5cvUgaJpZM4UXk4N> .

stakx · 2018-06-05T11:05:43Z

Also, if possible, can we remove that dot?

Including this request in the PR referenced above.

ghost · 2018-06-05T21:25:29Z

Guys, should we put the BOM'S back? @davidshen84 sorry if the answer is yes. I will submit a PR to your fork if this is the case. Are we happy to unify on UTF-8 with BOM'S?

ghost · 2018-06-05T21:52:27Z

Interesting to note we have +/-624 files with BOM's and 1508 without BOM's.

ghost · 2018-06-05T22:23:02Z

https://en.wikipedia.org/wiki/UTF-8#Byte_order_mark

stakx · 2018-06-05T22:24:45Z

What endianness are you referring to? UTF-8 does not have any endianness, it's a byte stream. Endianness would only be applicable with UTF-16 or UTF-32 (AFAIK).

ghost · 2018-06-05T22:25:05Z

Programming language parsers not explicitly designed for UTF-8 can often handle UTF-8 in string constants and comments, but will choke on encountering an UTF-8 BOM at the start of the file.

Quite telling, right?

ghost · 2018-06-05T22:27:37Z

What endianness are you referring to? UTF-8 does not have any endianness, it's a byte stream. Endianness would only be applicable with UTF-16 or UTF-32 (AFAIK).

It infers endian nature for character streams.
https://en.wikipedia.org/wiki/Byte_order_mark

stakx · 2018-06-05T22:31:49Z

It infers endian nature for character streams.
https://en.wikipedia.org/wiki/Byte_order_mark

I am still not sure I understand what you're referring to. Unlike with UTF-16 (where the BOM can be the byte sequence fe ff or ff fe, depending on endianness), there is no such thing as Little Endian UTF-8 or Big Endian UTF-8. There's only UTF-8, and its BOM is invariably the byte sequence ef bb bf. So unless you're converting everything to UTF-16, the question of endianness simply should not arise.

See e.g. https://unicode.org/faq/utf_bom.html#bom5.

ghost · 2018-06-05T22:37:04Z

Yes and with UTF-8 it is 0xEF,0xBB,0xBF.

stakx · 2018-06-05T22:39:58Z

As to the question of whether to reverse the BOM change or not: I was just passing by, so I'll leave the decision up to you folks. (My purely personal opinion is that the the IT industry has got to a point where tools can finally be expected to handle UTF-8 properly. We should no longer have to adjust to broken tooling, so removing the BOM for that reason seems like the wrong incentive. The broken tooling should be made UTF-ready instead. On a second note, the presence of a UTF-8 BOM can help tools recognize UTF-8 encoding, but today most good tooling probably assumes that encoding by default, anyway. So whether the BOM is present or not shouldn't make much difference, as long as the characters in the file are actually encoded as UTF-8. I agree that from an "aesthetic" viewpoint, it feels nice to have uniformity, but it's probably not worth spending too much time on the matter. It's fine either way.)

ghost · 2018-06-05T22:49:20Z

My purely personal opinion is that the the IT industry has got to a point where tools can finally be expected to handle UTF-8 properly. We should no longer have to adjust to broken tooling, so removing the BOM for that reason seems like the wrong incentive. The broken tooling should be made UTF-ready instead.

Cool. This PR just grew to a change set over 624 files to one of ~~1508~~ 904 files.

On a second note, the presence of a UTF-8 BOM can help tools recognize UTF-8 encoding, but today most good tooling probably assumes that encoding by default, anyway. So whether the BOM is present or not shouldn't make much difference, as long as the characters in the file are actually encoded as UTF-8. I agree that from an "aesthetic" viewpoint, it feels nice to have uniformity, but it's probably not worth spending too much time on the matter. It's fine either way.)

Pick one.

ghost · 2018-06-05T23:01:58Z

@davidshen84 - Can you please merge my PR into your master branch?

Reverse the BOM

davidshen84 · 2018-06-05T23:06:20Z

@fir3pho3nixx, merged. I think we should have opened another issue to discuss this BOM problem. This issue was only about a few files that were in ANSI encoding.

ghost · 2018-06-05T23:08:45Z

@fir3pho3nixx, merged. I think we should have opened another issue to discuss this BOM problem.

True. I just felt encoding was a valid way to bring this up.

This issue was only about a few files that were in ANSI encoding.

1 from what I could tell.

ghost · 2018-06-05T23:11:46Z

Sorry for all the stress people!

stakx · 2018-06-05T23:12:13Z

@fir3pho3nixx - Please accept my apologies if any of my above statements upset you... that certainly wasn't my intention at all! While I don't see any immediate need for it, I can still appreciate why you'd like to harmonize BOMs across the codebase. I'll stay out of this. Again, sorry if I upset you.

ghost · 2018-06-05T23:14:51Z

Please accept my apologies if any of my above statements upset you...

Never. You can tell me off anytime! :)

I'll stay out of this. Again, sorry if I upset you.

No. You are always welcome.

jonorossi · 2018-06-06T14:14:01Z

I think we should have opened another issue to discuss this BOM problem. This issue was only about a few files that were in ANSI encoding.

Oops, yer this thread appears to have gone a bit off track. Looking at your commit that matters (62e2461), @stakx has made this fix in #407 along with others to TypeUtil.ToCSharpString which was a little broken so we can close this pull request without merging. Thanks for bringing up this problem.

I've logged #408 to continue the BOM discussion.

ghost · 2018-06-07T18:40:34Z

Sorry sorry sorry

Resolve #405. Unicode please!!!

62e2461

ghost suggested changes Jun 2, 2018

View reviewed changes

Resolve #405 Convert all code file encodings to UTF-8 without BOM.

f343ed4

ghost approved these changes Jun 3, 2018

View reviewed changes

Mention bug change #405 in change log.

0029b60

ghost reviewed Jun 5, 2018

View reviewed changes

Update change log description.

8522639

stakx mentioned this pull request Jun 5, 2018

Make TypeUtil.ToCSharpString produce more C#-like formatting #407

Merged

Reverse the BOM

ddd0c37

Merge pull request #1 from fir3pho3nixx/reverse-the-bomb

7de7463

Reverse the BOM

jonorossi mentioned this pull request Jun 6, 2018

Byte order marks in source files #408

Closed

jonorossi closed this Jun 6, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Resolve #405. Unicode please!!! #406

Resolve #405. Unicode please!!! #406

davidshen84 commented Jun 2, 2018

ghost commented Jun 2, 2018 •

edited by ghost

Loading

ghost commented Jun 2, 2018

ghost left a comment

davidshen84 commented Jun 3, 2018

ghost commented Jun 3, 2018

jonorossi commented Jun 3, 2018

stakx commented Jun 3, 2018 •

edited

Loading

ghost commented Jun 4, 2018

stakx commented Jun 4, 2018

ghost commented Jun 5, 2018

ghost Jun 5, 2018 •

edited by ghost

Loading

davidshen84 commented Jun 5, 2018 via email

stakx commented Jun 5, 2018 •

edited

Loading

ghost commented Jun 5, 2018

ghost commented Jun 5, 2018 •

edited by ghost

Loading

ghost commented Jun 5, 2018

stakx commented Jun 5, 2018

ghost commented Jun 5, 2018

ghost commented Jun 5, 2018

stakx commented Jun 5, 2018 •

edited

Loading

ghost commented Jun 5, 2018

stakx commented Jun 5, 2018

ghost commented Jun 5, 2018 •

edited by ghost

Loading

ghost commented Jun 5, 2018 •

edited by ghost

Loading

davidshen84 commented Jun 5, 2018

ghost commented Jun 5, 2018

ghost commented Jun 5, 2018

stakx commented Jun 5, 2018

ghost commented Jun 5, 2018

jonorossi commented Jun 6, 2018

ghost commented Jun 7, 2018

Resolve #405. Unicode please!!! #406

Resolve #405. Unicode please!!! #406

Conversation

davidshen84 commented Jun 2, 2018

ghost commented Jun 2, 2018 • edited by ghost Loading

ghost commented Jun 2, 2018

ghost left a comment

Choose a reason for hiding this comment

davidshen84 commented Jun 3, 2018

ghost commented Jun 3, 2018

jonorossi commented Jun 3, 2018

stakx commented Jun 3, 2018 • edited Loading

ghost commented Jun 4, 2018

stakx commented Jun 4, 2018

ghost commented Jun 5, 2018

ghost Jun 5, 2018 • edited by ghost Loading

Choose a reason for hiding this comment

davidshen84 commented Jun 5, 2018 via email

stakx commented Jun 5, 2018 • edited Loading

ghost commented Jun 5, 2018

ghost commented Jun 5, 2018 • edited by ghost Loading

ghost commented Jun 5, 2018

stakx commented Jun 5, 2018

ghost commented Jun 5, 2018

ghost commented Jun 5, 2018

stakx commented Jun 5, 2018 • edited Loading

ghost commented Jun 5, 2018

stakx commented Jun 5, 2018

ghost commented Jun 5, 2018 • edited by ghost Loading

ghost commented Jun 5, 2018 • edited by ghost Loading

davidshen84 commented Jun 5, 2018

ghost commented Jun 5, 2018

ghost commented Jun 5, 2018

stakx commented Jun 5, 2018

ghost commented Jun 5, 2018

jonorossi commented Jun 6, 2018

ghost commented Jun 7, 2018

ghost commented Jun 2, 2018 •

edited by ghost

Loading

stakx commented Jun 3, 2018 •

edited

Loading

ghost Jun 5, 2018 •

edited by ghost

Loading

stakx commented Jun 5, 2018 •

edited

Loading

ghost commented Jun 5, 2018 •

edited by ghost

Loading

stakx commented Jun 5, 2018 •

edited

Loading

ghost commented Jun 5, 2018 •

edited by ghost

Loading

ghost commented Jun 5, 2018 •

edited by ghost

Loading