UTF8 characters not processed properly with custom shell.html #10802

Closed
raysan5 opened this issue Mar 30, 2020 · 12 comments

raysan5 commented Mar 30, 2020

When using a custom shell.html that includes Unicode characters (e.g. emojis) encoded as UTF-8, the characters are not read properly when the file is processed; they are converted to the ASCII equivalent byte values:

example compiled with 1.38.42-upstream:

shell_icons_ok

example compiled with 1.39.9:

shell_icons_wrong

With the latest version, each icon (4 bytes in UTF-8) is converted to the equivalent ASCII characters (4 bytes, one character per byte).
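To make the symptom concrete, here is a minimal sketch of how one 4-byte UTF-8 emoji turns into four unrelated characters when its bytes are decoded with a single-byte codec. It assumes cp1252 as that codec (the value that `locale.getpreferredencoding()` reports later in this thread); it is an illustration only, not code from Emscripten.

```python
# One emoji, encoded as UTF-8: a single code point stored in 4 bytes.
emoji = "\U0001F604"  # the smiling-face emoji used elsewhere in this thread
utf8_bytes = emoji.encode("utf-8")

# Decoding the same bytes with a single-byte codec (cp1252 is an assumption)
# yields one character per byte instead of one emoji.
mojibake = utf8_bytes.decode("cp1252")

print(len(utf8_bytes))                          # bytes in the UTF-8 encoding
print(len(mojibake))                            # characters after the wrong decode
print(mojibake.encode("cp1252") == utf8_bytes)  # the underlying bytes are unchanged
```

The bytes on disk are never damaged in this scenario; only the interpretation of them changes, which is why the output keeps the same length in bytes.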

Issue related here: #6511 (comment)

Member

kripken commented Mar 30, 2020

It would be good to add a testcase here. Then we can bisect and quickly find where this broke.

Member

kripken commented Mar 30, 2020

(I see there is a shell.html in the link, but we also need a full testcase - source file to build, command, etc., basically all the steps to be able to compile locally and see your problem.)

Author

raysan5 commented Mar 31, 2020

@kripken I created a self-contained example for testing: emscripten_test_case.zip

It includes all the files required to compile the example; you can see the difference between the input shell.html (using Unicode characters) and the generated output (no Unicode characters).

Compilation line used:

emcc -o core_basic_window.html core_basic_window.c -Wall -std=c99 -D_DEFAULT_SOURCE -Wno-missing-braces -O3 -s USE_GLFW=3 -s ASYNCIFY --shell-file shell.html libraylib.bc

Member

kripken commented Apr 1, 2020

Thanks for the full testcase!

It works for me on emsdk install tot (latest tip-of-tree build, which is a little newer than 1.39.11). I tested on both Chrome and Firefox, so it doesn't look like a browser issue. But I don't remember us fixing anything related to this since 1.39.9, so that's weird... Maybe I forgot something; please check on tot.

Author

raysan5 commented Apr 3, 2020

Hi @kripken, I tested it with emsdk install tot, same issue.

Investigating a bit more, I saw that the issue actually lies in the Python update. I just tested it with python 2.7.13.1_64bit (keeping the same emsdk version) and it works fine. It seems python 3.7.4_64bit is not processing the file properly on my side...

So... a couple of hours later going down the rabbit hole... I found the problem! 😄

I dug into the following functions:

  • emcc.py [generate_traditional_runtime_html()] ->
    • shared.py [read_and_preprocess()] ->
      • preprocessor.js [ENVIRONMENT_IS_NODE -> read()]

I verified that from_html and to_html were correct; even print() -> process['stdout'].write(x + '\n'); is correct. But the string is now in stdout, and it needs to be read back... as UTF-8!

shared.py: line 3459 (read_and_preprocess()):

  out = open(stdout, 'r').read()   # This line does not read the string as utf-8 (at least on my system)

it should be:

  out = open(stdout, 'r', encoding="utf-8").read()

Investigating a bit further, it seems that support for the encoding="utf-8" parameter in open() was introduced in Python 3 (link).
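The read step can be reproduced in isolation. The sketch below is not the code from shared.py: a temporary file stands in for the preprocessor's stdout file, `read_text_file` is a hypothetical helper, and cp1252 is passed explicitly to simulate a system whose locale default is not UTF-8.

```python
import os
import tempfile


def read_text_file(path, encoding):
    # Hypothetical helper: read a whole text file with an explicit encoding.
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# Stand-in for the preprocessor output: HTML with an emoji, stored as UTF-8.
html = "<button>\U0001F604 Run</button>\n"
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "wb") as f:
    f.write(html.encode("utf-8"))

try:
    as_utf8 = read_text_file(path, "utf-8")     # explicit encoding round-trips
    as_cp1252 = read_text_file(path, "cp1252")  # simulates a cp1252 locale default
finally:
    os.remove(path)

print(as_utf8 == html)    # the text survives the read
print(as_cp1252 == html)  # the emoji has been split into single-byte characters
```

On Python 3, calling open() without `encoding` falls back to the locale's preferred encoding, so the second read is roughly what happens implicitly on a system where that encoding is cp1252.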

It's just a small tweak, do you want me to send a PR?

Collaborator

sbc100 commented Apr 3, 2020

We've had a few different issues raised regarding encoding and python3 on windows.

They seem to be mostly related to subprocess stdin and stdout:

Firstly there was this issue back in December:

#10027

Which got solved by setting PYTHONUTF8 in the environment:

WebAssembly/waterfall#608

Then there was this issue more recently:

#10551

I tried to fix that but it got reverted:

#10558

Collaborator

sbc100 commented Apr 3, 2020

Sadly we can't just add encoding="utf-8" since we need to continue to support python2 for a bit longer.
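One possible way around the Python 2 constraint, offered here only as a sketch and not as what the maintainers chose, is `io.open`: it accepts `encoding` on Python 2.6+ and is the same function as the built-in open() on Python 3. The helper name `read_utf8` and the temporary file are assumptions for illustration.

```python
import io
import os
import tempfile


def read_utf8(path):
    # io.open takes `encoding` on both Python 2.6+ and Python 3.
    with io.open(path, "r", encoding="utf-8") as f:
        return f.read()


expected = u"\U0001F604 ok\n"

# Write known UTF-8 bytes to a temporary file, then read them back.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(expected.encode("utf-8"))

try:
    text = read_utf8(path)
finally:
    os.remove(path)

print(text == expected)
```

A caveat with this approach: on Python 2, `io.open` returns `unicode` rather than `str`, so any caller that mixes the result with byte strings would need checking. Whether that fits the rest of shared.py is not something this thread settles.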

Author

raysan5 commented Apr 3, 2020

@sbc100 ok, no worries, at least the problem has been detected and it's documented in this issue. 😄

Collaborator

sbc100 commented Apr 3, 2020

Perhaps you could help me debug this. I've been having trouble figuring out the correct solution.

On your system, what does sys.getdefaultencoding() produce? And what about locale.getpreferredencoding()?
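For anyone else who wants to report the same two values, a minimal script along these lines should do; the variable names are only illustrative.

```python
import locale
import sys

# Encoding Python uses for implicit str <-> bytes conversions.
default_encoding = sys.getdefaultencoding()

# Encoding derived from the user's locale; on Python 3, open() uses it
# when no `encoding` argument is given.
preferred_encoding = locale.getpreferredencoding()

print("sys.getdefaultencoding():", default_encoding)
print("locale.getpreferredencoding():", preferred_encoding)
```

If the second value is not a UTF-8 variant, any open() call without an explicit `encoding` will decode UTF-8 files with that other codec.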

Collaborator

sbc100 commented Apr 3, 2020

If you set PYTHONUTF8 in the environment, does that solve the issue for you?

Author

raysan5 commented Apr 3, 2020

@sbc100 sure!

  • I got utf-8 and cp1252 without setting PYTHONUTF8.
  • I got utf-8 and UTF-8 after setting PYTHONUTF8 and now open(stdout, 'r').read() command works properly.

Nice! :D
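The effect of the variable can be checked without changing the current shell. This sketch assumes Python 3.7 or newer and assumes the value `1` (the thread does not say which value was used); `preferred_encoding_in_child` is a hypothetical helper that starts a fresh interpreter and asks it for its preferred encoding.

```python
import os
import subprocess
import sys


def preferred_encoding_in_child(extra_env):
    # Hypothetical helper: run a fresh interpreter with extra environment
    # variables and return what locale.getpreferredencoding() reports there.
    env = dict(os.environ)
    env.update(extra_env)
    code = "import locale; print(locale.getpreferredencoding())"
    out = subprocess.check_output([sys.executable, "-c", code], env=env)
    return out.decode("ascii").strip()


# PYTHONUTF8=1 enables UTF-8 mode (Python 3.7+), which overrides the locale.
with_utf8_mode = preferred_encoding_in_child({"PYTHONUTF8": "1"})
print(with_utf8_mode)
```

The variable has to be set before the interpreter starts, which is why the check runs in a child process rather than setting `os.environ` in the current one.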


stale bot commented Jun 4, 2021

This issue has been automatically marked as stale because there has been no activity in the past year. It will be closed automatically if no further activity occurs in the next 30 days. Feel free to re-open at any time if this issue is still relevant.

stale bot added the wontfix label Jun 4, 2021
stale bot closed this as completed Jul 8, 2021