Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minification inadvertently converts ISO-8859-1 non-ASCII characters to UTF-8 #107

Closed
SetTrend opened this issue Jul 17, 2019 · 11 comments
Closed

Comments

@SetTrend
Copy link

  • Operating System: Windows 10x64
  • Node Version: v10.3.0
  • NPM Version: 6.1.0
  • webpack Version: 4.35.0
  • terser-webpack-plugin Version: coming with webpack

Expected Behavior

Minifying string literals in a UTF-8 Javascript file should not alter the text's non-ASCII characters.

Actual Behavior

Non-ASCII characters in a string literal from a UTF-8 JavaScript file are inadvertently converted to their UTF-8 counterparts.

Code

See this sample repository.

How Do We Reproduce?

See above.

@alexander-akait
Copy link
Member

hm, looks like a terser bug, can you try compress file without webpack plugin and check all fine or not?

@SetTrend
Copy link
Author

@evilebottnawi :

Would you mind elaborating on how I would do that? I'm not too deep into WebPack at this time, I'm afraid.

@alexander-akait
Copy link
Member

@SetTrend install terser https://github.com/terser-js/terser and try to use CLI for compress file, also can you try ascii_only option https://github.com/terser-js/terser#output-options

@SetTrend
Copy link
Author

SetTrend commented Jul 29, 2019

I just did as you suggested and ran the proposed test. It looks like terser is the issue here, indeed:

Running ...

npx tsc

results in these four files:

Dist result

With TestUC.js displaying the following content:

TSC encoding

(Which is fine until that point.)

Then running ...

npx terser "dist/*.js" -c -m -o "dist2/test.js"

results in this file:

Dist2-Result

With test.js displaying the following content:

Terser encoding

(Which is wrong. The ISO 8859-1 encoding has falsely been converted to its UTF-8 equivalent. This is also true when TSC is run with option "emitBOM": false, so inadvertent source file encoding guessing is not the cause of this issue.)

What would you suggest on how to proceed?

@alexander-akait
Copy link
Member

alexander-akait commented Jul 30, 2019

@SetTrend i can't understand, bug exists in terser? Why do you use ISO 8859-1?

@SetTrend
Copy link
Author

SetTrend commented Jul 30, 2019

@evilebottnawi :

That's not the case. Let me iterate:

  • The JavaScript source is UTF-8, as you can see from the screenshots. There's a control character in the string literals, that's all. Yet, control characters in string literals should not be altered.

  • Regardless, and although it's irrelevant to this issue: the ECMAScript specification (latest) doesn't set any restrictions to JavaScript source code encoding:

    10 - ECMAScript Language: Source Code
    10.1 - Source Text
    [...]
    The actual encodings used to store and interchange ECMAScript source text is not relevant to this specification.


What Terser does: It actually interprets the literal strings' encoding, concludes from the control character within the string literals that the string literals must be ISO 8895-1 and converts what it believes to have found into UTF-8. That's false behaviour. Instead, in general, string literals must be left untouched.

No matter what character points a string literal contains, Terser is supposed to render it 1:1, unchanged.

@alexander-akait
Copy link
Member

@SetTrend i think better open issue in terser repo, we just wrapper for this tool, i think we can't do something on our side

@endiliey
Copy link

agree that it is an issue in terser.

@SetTrend
Copy link
Author

SetTrend commented Aug 6, 2019

@evilebottnawi, @endiliey

After the issue has been confirmed by one of the Terser contributors, he offered a workaround: The ascii_only option.

Trying to add that I noticed that the Terser ascii_only option is not supported by webpack-contrib/terser-webpack-plugin. Would you mind adding it?

@SetTrend SetTrend reopened this Aug 6, 2019
@alexander-akait
Copy link
Member

You can set any custom options for terser using https://github.com/webpack-contrib/terser-webpack-plugin#terseroptions

@alexander-akait
Copy link
Member

output: { ascii_only: true }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants