Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added utf-8 BOM removal #630

Merged
merged 3 commits into from
Nov 12, 2019
Merged

Conversation

Raibaz
Copy link
Contributor

@Raibaz Raibaz commented Oct 23, 2019

This fixes #272 by adding explicit removal of the UTF-8 BOM from the content of the files being parsed; it also introduces a minor refactoring that reduces code duplication.

Note that this also removes the BOM from the output when formatting files; it shouldn't be a big deal, as UTF-8 files can work without it and it is in fact suggested not to include it.

@Tapchicoma
Copy link
Collaborator

Tapchicoma commented Nov 4, 2019

After reading some discussions over internet, I would say removing UTF-8 BOM symbol on file format is not a nice idea.

@Raibaz
Copy link
Contributor Author

Raibaz commented Nov 4, 2019

Why do you think it's not a good idea?

Apparently, the BOM is not mandatory for UTF-8 files and messes up with non-UTF-8 applications (including ktlint) that don't expect the non-ASCII characters at the beginning of the file and try to parse them as ASCII characters, resulting in bad parsing of the file.

My opinion is that removing it from the start of the file is safe, as the bytes in the BOM cannot be there for any other reason that would need them to be there, and their presence is just preventing ktlint from operating as expected.

Do you have any other ideas in mind on how to handle UTF-8 files with BOM? I guess it can probably be dealt with relatively easily when doing just file validation, but when fixing style violations it will likely be trickier, as the whole text content of the file is going to be manipulated.

@Tapchicoma
Copy link
Collaborator

For example, I've read following issue: editorconfig/editorconfig#297, where some people complain about removing BOM support.

Generally, I would say that it is not responsibility of ktlint to remove BOM on file format. BOM itself does not relate to Kotlin code style and it may happen that people added it intentionally.

Do you have any other ideas in mind on how to handle UTF-8 files with BOM? I guess it can probably be dealt with relatively easily when doing just file validation, but when fixing style violations it will likely be trickier, as the whole text content of the file is going to be manipulated.

Current approach for removing it in lint() method is ok as it is not destructive. For format() method BOM could be saved in some String field that would be added back on returning formatted file as string here:

return if (mutated) rootNode.text.replace("\n", determineLineSeparator(params.text, params.userData)) else params.text

@Raibaz
Copy link
Contributor Author

Raibaz commented Nov 6, 2019

Right, makes sense, I updated my PR accordingly.

@Tapchicoma
Copy link
Collaborator

@Raibaz could you fix code style? Other then that your PR looks good to me.

Copy link
Collaborator

@Tapchicoma Tapchicoma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution!

@Tapchicoma Tapchicoma merged commit 6864374 into pinterest:master Nov 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

UTF-8 with BOM results in "Not a valid Kotlin file (expecting a top level declaration)"
2 participants