Added utf-8 BOM removal #630

Raibaz · 2019-10-23T13:30:50Z

This fixes #272 by adding explicit removal of the UTF-8 BOM from the content of the files being parsed; it also introduces a minor refactoring that reduces code duplication.

Note that this also removes the BOM from the output when formatting files; it shouldn't be a big deal, as UTF-8 files can work without it and it is in fact suggested not to include it.
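For readers following along, here is a minimal sketch of what stripping the BOM from already-decoded text can look like. It is not the code from this PR: `UTF8_BOM` and `stripBom` are hypothetical names, and the sketch assumes the file content has been decoded into a `String`, so the BOM appears as the single character U+FEFF.

```kotlin
// Hypothetical helper for illustration; not taken from the ktlint sources.
// Once a UTF-8 file is decoded, its BOM shows up as the single character U+FEFF.
val UTF8_BOM = "\uFEFF"

// Removes the BOM only when it is the very first character of the text.
// A U+FEFF that appears anywhere else is left untouched.
fun stripBom(text: String): String = text.removePrefix(UTF8_BOM)
```

Calling `stripBom` on text without a BOM returns the text unchanged, so it should be safe to apply unconditionally before parsing.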

Tapchicoma · 2019-11-04T20:59:42Z

After reading some discussions on the internet, I would say that removing the UTF-8 BOM when formatting a file is not a good idea.

Raibaz · 2019-11-04T21:46:14Z

Why do you think it's not a good idea?

Apparently, the BOM is not mandatory for UTF-8 files, and it trips up non-UTF-8 applications (including ktlint) that don't expect non-ASCII characters at the beginning of a file and try to parse them as ASCII characters, which results in the file being parsed incorrectly.
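To make the failure mode concrete, the sketch below shows how the three BOM bytes (EF BB BF) end up in decoded text on the JVM. It assumes the file is decoded with the standard UTF-8 charset, which keeps the BOM as a leading U+FEFF character instead of dropping it; `bomBytes`, `decodeUtf8`, and `startsWithBom` are hypothetical names, not ktlint APIs.

```kotlin
// The UTF-8 encoding of U+FEFF, i.e. the byte sequence known as the UTF-8 BOM.
val bomBytes = byteArrayOf(0xEF.toByte(), 0xBB.toByte(), 0xBF.toByte())

// Decodes raw file bytes as UTF-8. The JVM decoder does not strip the BOM,
// so it survives as the first character of the resulting String.
fun decodeUtf8(bytes: ByteArray): String = String(bytes, Charsets.UTF_8)

// Reports whether decoded text begins with the BOM character.
fun startsWithBom(text: String): Boolean = text.startsWith('\uFEFF')
```

Because U+FEFF is not classified as whitespace, a call such as `trim()` does not remove it, and a parser that expects the file to begin with a keyword or a comment sees an unexpected character instead.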

My opinion is that removing it from the start of the file is safe: the BOM bytes serve no other purpose there, and their presence just prevents ktlint from operating as expected.

Do you have any other ideas in mind on how to handle UTF-8 files with BOM? I guess it can probably be dealt with relatively easily when doing just file validation, but when fixing style violations it will likely be trickier, as the whole text content of the file is going to be manipulated.

Tapchicoma · 2019-11-05T19:56:09Z

For example, I've read the following issue: editorconfig/editorconfig#297, where some people complain about the removal of BOM support.

Generally, I would say that it is not ktlint's responsibility to remove the BOM when formatting a file. The BOM itself is unrelated to Kotlin code style, and people may have added it intentionally.

> Do you have any other ideas in mind on how to handle UTF-8 files with BOM? I guess it can probably be dealt with relatively easily when doing just file validation, but when fixing style violations it will likely be trickier, as the whole text content of the file is going to be manipulated.

The current approach of removing it in the lint() method is OK, as it is not destructive. For the format() method, the BOM could be saved in a String field and added back when returning the formatted file as a string here:

ktlint/ktlint-core/src/main/kotlin/com/pinterest/ktlint/core/KtLint.kt

Line 413 in 2e7d67c

    
    return if (mutated) rootNode.text.replace("\n", determineLineSeparator(params.text, params.userData)) else params.text
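The save-and-restore idea described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the change that was merged: `BOM` and `formatPreservingBom` are hypothetical names, and `formatBody` is a placeholder for whatever performs the actual formatting on BOM-free text.

```kotlin
// The BOM as it appears in decoded text (hypothetical constant name).
val BOM = "\uFEFF"

// Hypothetical wrapper: remembers whether the input started with a BOM,
// formats the BOM-free text, then puts the BOM back if it was there.
fun formatPreservingBom(text: String, formatBody: (String) -> String): String {
    val hadBom = text.startsWith(BOM)
    val body = if (hadBom) text.removePrefix(BOM) else text
    val formatted = formatBody(body)
    return if (hadBom) BOM + formatted else formatted
}
```

With this shape, files that had a BOM keep it after formatting and files that had none do not gain one, so the formatter stays neutral on whether a BOM should be present.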

Raibaz · 2019-11-06T10:20:04Z

Right, makes sense, I updated my PR accordingly.

Tapchicoma · 2019-11-11T21:59:44Z

@Raibaz could you fix the code style? Other than that, your PR looks good to me.

Tapchicoma

Thank you for your contribution!

Added utf-8 BOM removal

e0a8366

Restore UTF8 BOM after formatting if it was present

06933ee

Fixed code style

019c8e5

Tapchicoma approved these changes Nov 12, 2019

Tapchicoma merged commit 6864374 into pinterest:master Nov 12, 2019
