Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolve #405. Unicode please!!! #406

Closed
wants to merge 6 commits into from
Closed

Resolve #405. Unicode please!!! #406

wants to merge 6 commits into from

Conversation

davidshen84
Copy link

I converted that peculiar dot into utf-8 encoding, and the solution can build on my Windows 10 64 bit system which has GB2312 as the default character encoding system.

Also, if possible, can we remove that dot?

@ghost
Copy link

ghost commented Jun 2, 2018

Hi David

Thanks for this. Do you think we should run something like this over the entire repository? I tested it and it does not break any tests on my side.

class Program
{
	static void Main(string[] args)
	{
		var inputDirectory = @"c:\code\windsor"; // <- change to your own checkout path

		WriteUTF8WithoutBOM(inputDirectory, "*.cs");
		WriteUTF8WithoutBOM(inputDirectory, "*.xml");
		WriteUTF8WithoutBOM(inputDirectory, "*.config");
		WriteUTF8WithoutBOM(inputDirectory, "*.csproj");
		WriteUTF8WithoutBOM(inputDirectory, "*.props");


		Console.ReadLine();
	}

	private static void WriteUTF8WithoutBOM(string inputDirectory, string searchPattern)
	{
		foreach (var file in Directory.GetFiles(inputDirectory, searchPattern, SearchOption.AllDirectories))
		{
			if (file.Contains(".git\\")) continue;
			if (file.Contains(".vs\\")) continue;

			Console.WriteLine(file);

			Encoding utf8WithoutBom = new UTF8Encoding(false);
			var fileContents = File.ReadAllText(file, utf8WithoutBom);
			var streamWriter = new StreamWriter(file, false, utf8WithoutBom);
			streamWriter.Write(fileContents);
			streamWriter.Dispose();
		}
	}
}

Also don't forget to add a CHANGELOG.md entry.

@ghost
Copy link

ghost commented Jun 2, 2018

Also, if possible, can we remove that dot?

We don't have to remove the dot if the file is encoded properly.

Copy link

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should maybe fix all the files. There are some files with UTF8 BOMS in them too.

Please see: #406 (comment)

@davidshen84
Copy link
Author

@fir3pho3nixx , good idea. I have updated the PR.

@ghost
Copy link

ghost commented Jun 3, 2018

@davidshen84 - Don't forget to make a note in the CHANGELOG.md

@jonorossi
Copy link
Member

@fir3pho3nixx why did you want to remove BOMs? They are valid as part of the Unicode standard and not hurting anyone; Visual Studio is actually the one saving them there, not sure if the new versions of VS still adds them to new files.

I don't see a reason to mention this in the changelog, unless the change affects users of the library.

If I recall looking at the debugger views in Windsor the dots might be those fancy angle brackets, I recall seeing them somewhere in Windsor debugger views (added in 2011: d7ff1da). I see no reason not to just make this function output exactly what you'd type as a C# developer, no need for fancy characters.

@stakx
Copy link
Member

stakx commented Jun 3, 2018

If you need a (sort of) good excuse to normalize BOMs across the whole codebase, you could just update the year to 2018 in the copyright (if that's in fact something that needs to be done at all)... and do the BOM adjustment as an aside. ;-)

@ghost
Copy link

ghost commented Jun 4, 2018

Oh dear this was never about BOM’s. Sorry if that is what you thought. The code I supplied actually converts ANSI to UTF8 but happens to remove BOM’s. I thought this was a good approach as BOM’s infer things about character streams.

@jonorossi This actually a defect which prevents people with non standard encoding from building the solution. So I thought it was a fix of some description. Updating the change log is neither here not there. As long as it builds! Sweet.

@stakx I am not sure I follow. License headers and UTF encoding of source files that allow this solution to be built in Asia are very different. The real problem is we have some ANSI duff ruining the party.

@stakx
Copy link
Member

stakx commented Jun 4, 2018

@fir3pho3nixx, please disregard my remark, I was half joking. (But to explain what I meant, I was also under the impression that this was about BOMs mainly, and since the diffs already affect the first line of each code file, which happens to contain the copyright line, you could've killed two birds with one stone but put the focus on only one of those. Sneaky, and not altogether serious.)

@ghost
Copy link

ghost commented Jun 5, 2018

@stakx - Awww man, if only the BOM's were distributed across all code files. Then we could update the headers! :)

CHANGELOG.md Outdated
@@ -5,6 +5,7 @@
Enhancements:
- Added XML documentation to BeginScope and RequireScope lifetime extensions (@jonorossi)
- Upgraded build to use NUnit Adapters (@fir3pho3nixx, #243)
- Remove BOM from all source code files (@davidshen84, #405)
Copy link

@ghost ghost Jun 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should say:

Fixed encoding problem where builds fail under GB2312 character encoding system

@davidshen84
Copy link
Author

davidshen84 commented Jun 5, 2018 via email

@stakx
Copy link
Member

stakx commented Jun 5, 2018

Also, if possible, can we remove that dot?

Including this request in the PR referenced above.

@ghost
Copy link

ghost commented Jun 5, 2018

Guys, should we put the BOM'S back? @davidshen84 sorry if the answer is yes. I will submit a PR to your fork if this is the case. Are we happy to unify on UTF-8 with BOM'S?

@ghost
Copy link

ghost commented Jun 5, 2018

Interesting to note we have +/-624 files with BOM's and 1508 without BOM's.

@ghost
Copy link

ghost commented Jun 5, 2018

@stakx
Copy link
Member

stakx commented Jun 5, 2018

What endianness are you referring to? UTF-8 does not have any endianness, it's a byte stream. Endianness would only be applicable with UTF-16 or UTF-32 (AFAIK).

@ghost
Copy link

ghost commented Jun 5, 2018

Programming language parsers not explicitly designed for UTF-8 can often handle UTF-8 in string constants and comments, but will choke on encountering an UTF-8 BOM at the start of the file.

Quite telling, right?

@ghost
Copy link

ghost commented Jun 5, 2018

What endianness are you referring to? UTF-8 does not have any endianness, it's a byte stream. Endianness would only be applicable with UTF-16 or UTF-32 (AFAIK).

It infers endian nature for character streams.
https://en.wikipedia.org/wiki/Byte_order_mark

@stakx
Copy link
Member

stakx commented Jun 5, 2018

It infers endian nature for character streams.
https://en.wikipedia.org/wiki/Byte_order_mark

I am still not sure I understand what you're referring to. Unlike with UTF-16 (where the BOM can be the byte sequence fe ff or ff fe, depending on endianness), there is no such thing as Little Endian UTF-8 or Big Endian UTF-8. There's only UTF-8, and its BOM is invariably the byte sequence ef bb bf. So unless you're converting everything to UTF-16, the question of endianness simply should not arise.

See e.g. https://unicode.org/faq/utf_bom.html#bom5.

@ghost
Copy link

ghost commented Jun 5, 2018

Yes and with UTF-8 it is 0xEF,0xBB,0xBF.

@stakx
Copy link
Member

stakx commented Jun 5, 2018

As to the question of whether to reverse the BOM change or not: I was just passing by, so I'll leave the decision up to you folks. (My purely personal opinion is that the the IT industry has got to a point where tools can finally be expected to handle UTF-8 properly. We should no longer have to adjust to broken tooling, so removing the BOM for that reason seems like the wrong incentive. The broken tooling should be made UTF-ready instead. On a second note, the presence of a UTF-8 BOM can help tools recognize UTF-8 encoding, but today most good tooling probably assumes that encoding by default, anyway. So whether the BOM is present or not shouldn't make much difference, as long as the characters in the file are actually encoded as UTF-8. I agree that from an "aesthetic" viewpoint, it feels nice to have uniformity, but it's probably not worth spending too much time on the matter. It's fine either way.)

@ghost
Copy link

ghost commented Jun 5, 2018

My purely personal opinion is that the the IT industry has got to a point where tools can finally be expected to handle UTF-8 properly. We should no longer have to adjust to broken tooling, so removing the BOM for that reason seems like the wrong incentive. The broken tooling should be made UTF-ready instead.

Cool. This PR just grew to a change set over 624 files to one of 1508 904 files.

On a second note, the presence of a UTF-8 BOM can help tools recognize UTF-8 encoding, but today most good tooling probably assumes that encoding by default, anyway. So whether the BOM is present or not shouldn't make much difference, as long as the characters in the file are actually encoded as UTF-8. I agree that from an "aesthetic" viewpoint, it feels nice to have uniformity, but it's probably not worth spending too much time on the matter. It's fine either way.)

Pick one.

@ghost
Copy link

ghost commented Jun 5, 2018

@davidshen84 - Can you please merge my PR into your master branch?

@davidshen84
Copy link
Author

@fir3pho3nixx, merged. I think we should have opened another issue to discuss this BOM problem. This issue was only about a few files that were in ANSI encoding.

@ghost
Copy link

ghost commented Jun 5, 2018

@fir3pho3nixx, merged. I think we should have opened another issue to discuss this BOM problem.

True. I just felt encoding was a valid way to bring this up.

This issue was only about a few files that were in ANSI encoding.

1 from what I could tell.

@ghost
Copy link

ghost commented Jun 5, 2018

Sorry for all the stress people!

@stakx
Copy link
Member

stakx commented Jun 5, 2018

@fir3pho3nixx - Please accept my apologies if any of my above statements upset you... that certainly wasn't my intention at all! While I don't see any immediate need for it, I can still appreciate why you'd like to harmonize BOMs across the codebase. I'll stay out of this. Again, sorry if I upset you.

@ghost
Copy link

ghost commented Jun 5, 2018

Please accept my apologies if any of my above statements upset you...

Never. You can tell me off anytime! :)

I'll stay out of this. Again, sorry if I upset you.

No. You are always welcome.

@jonorossi
Copy link
Member

I think we should have opened another issue to discuss this BOM problem. This issue was only about a few files that were in ANSI encoding.

Oops, yer this thread appears to have gone a bit off track. Looking at your commit that matters (62e2461), @stakx has made this fix in #407 along with others to TypeUtil.ToCSharpString which was a little broken so we can close this pull request without merging. Thanks for bringing up this problem.

I've logged #408 to continue the BOM discussion.

@jonorossi jonorossi closed this Jun 6, 2018
@ghost
Copy link

ghost commented Jun 7, 2018

Sorry sorry sorry

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants