UTF8 characters not processed properly with custom shell.html #10802

raysan5 · 2020-03-30T19:44:14Z

When using a custom shell.html that includes Unicode characters (e.g. emojis) encoded as UTF-8, the characters are not read properly once the file is processed; they are converted to the ASCII-equivalent characters of their byte values:

example compiled with 1.38.42-upstream:

example compiled with 1.39.9:

With the latest version, icons (4-byte UTF-8 sequences) are converted to the equivalent ASCII characters (4 bytes).

Issue related here: #6511 (comment)


kripken · 2020-03-30T23:06:59Z

It would be good to add a testcase here. Then we can bisect and quickly find where this broke.

kripken · 2020-03-30T23:09:07Z

(I see there is a shell.html in the link, but we also need a full testcase - source file to build, command, etc., basically all the steps to be able to compile locally and see your problem.)

raysan5 · 2020-03-31T12:02:38Z

@kripken I created a self-contained example for testing: emscripten_test_case.zip

It includes all required files to compile the example, you can see the difference between input shell.html (using Unicode characters) and generated output (no Unicode characters).

Compilation line used:

emcc -o core_basic_window.html core_basic_window.c -Wall -std=c99 -D_DEFAULT_SOURCE -Wno-missing-braces -O3 -s USE_GLFW=3 -s ASYNCIFY --shell-file shell.html libraylib.bc

kripken · 2020-04-01T20:41:55Z

Thanks for the full testcase!

It works for me on emsdk install tot (latest tip of tree build, which is a little newer than 1.39.11). I tested on both chrome and firefox so it doesn't look like a browser issue. But I don't remember us fixing anything related to this since 1.39.9, so that's weird... But maybe I forgot something, please check on tot.

raysan5 · 2020-04-03T16:49:35Z

Hi @kripken, I tested it with emsdk install tot, same issue.

Investigating a bit more, I saw that the issue actually comes from the Python update. I just tested it with python 2.7.13.1_64bit (keeping the emsdk version) and it works OK. It seems python 3.7.4_64bit is not processing the file properly on my side...

<time_span>

So... a couple of hours later going down the rabbit hole... I found the problem! 😄

I dug into the following functions:

emcc.py [generate_traditional_runtime_html()] ->
- shared.py [read_and_preprocess()] ->
  - preprocessor.js [ENVIRONMENT_IS_NODE -> read()]

I verified that from_html and to_html were correct; even print() -> process['stdout'].write(x + '\n'); is correct... but the string is in stdout now... and it needs to be retrieved... as utf-8!

shared.py: line 3459 (read_and_preprocess()):

  out = open(stdout, 'r').read()   # This line does not read the string as utf-8 (at least on my system)

it should be:

  out = open(stdout, 'r', encoding="utf-8").read()

Investigating a bit further, it seems the encoding="utf-8" parameter of open() was introduced in Python 3 (link).
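To illustrate the symptom, here is a minimal Python 3 sketch, not the emscripten code itself. It assumes the platform default encoding behaves like cp1252 (as reported later in this thread) and forces that encoding explicitly so the result is the same on any system. The temporary file stands in for the preprocessed shell.html output.

```python
import os
import tempfile

emoji = "\U0001F604"  # one code point, encoded as 4 bytes in UTF-8 (F0 9F 98 84)

# Write a tiny HTML fragment as UTF-8 bytes (stand-in for the preprocessor output).
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as f:
    f.write(("<p>" + emoji + "</p>").encode("utf-8"))
    path = f.name

try:
    # Simulates open(path, 'r') on a system whose default encoding is cp1252.
    with open(path, "r", encoding="cp1252", errors="replace") as f:
        wrong = f.read()
    # Explicit encoding: the 4 bytes are decoded back into a single character.
    with open(path, "r", encoding="utf-8") as f:
        right = f.read()
finally:
    os.remove(path)

print(len(right), len(wrong))  # the cp1252 read yields one character per byte
```

With the cp1252 read, each of the emoji's four bytes becomes its own character, which matches the "4 bytes utf8 converted to 4 ASCII characters" behaviour described above.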

It's just a small tweak, do you want me to send a PR?

sbc100 · 2020-04-03T17:13:53Z

We've had a few different issues raised regarding encoding and python3 on windows.

They seem to mostly be related to subprocess stdin and stdout:

Firstly there was this issue back in December:

#10027

Which got solved by setting PYTHONUTF8 in the environment:

WebAssembly/waterfall#608

Then there was this issue more recently:

#10551

I tried to fix that but it got reverted:

#10558

sbc100 · 2020-04-03T17:15:39Z

Sadly we can't just add encoding="utf-8" since we need to continue to support python2 for a bit longer.
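For reference, one possible way around the Python 2 constraint is io.open(), which accepts an encoding argument on both Python 2.6+ and Python 3. The sketch below is only an illustration under that assumption; read_text_utf8 is a hypothetical helper name, not something in shared.py, and on Python 2 it returns unicode rather than str, which callers would have to tolerate.

```python
import io


def read_text_utf8(path):
    """Read a text file as UTF-8 regardless of the platform default encoding.

    Hypothetical helper for illustration only; not part of emscripten.
    io.open() takes `encoding` on both Python 2.6+ and Python 3.
    """
    with io.open(path, "r", encoding="utf-8") as f:
        return f.read()
```

Usage would be a drop-in for the read in read_and_preprocess(), e.g. out = read_text_utf8(stdout), assuming stdout is the path of the temporary output file as in the snippet quoted earlier.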

raysan5 · 2020-04-03T17:23:24Z

@sbc100 ok, no worries, at least the problem has been detected and it's documented in this issue. 😄

sbc100 · 2020-04-03T17:28:35Z

Perhaps you could help me debug this. I've been having trouble figuring out the correct solution.

On your system, what does sys.getdefaultencoding() produce? And what about locale.getpreferredencoding()?
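A small script along these lines can collect both values in one go. It only uses the two standard-library calls named above; the report_encodings function name is made up for this example.

```python
import locale
import sys


def report_encodings():
    """Return the two encodings asked about, keyed by the call that produced them."""
    return {
        "sys.getdefaultencoding()": sys.getdefaultencoding(),
        "locale.getpreferredencoding()": locale.getpreferredencoding(),
    }


for name, value in report_encodings().items():
    print(name, "=", value)
```

On Python 3, open() without an encoding argument uses the locale's preferred encoding, so the second value is the one that matters for the open(stdout, 'r') call.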

sbc100 · 2020-04-03T17:28:56Z

If you set PYTHONUTF8 in the environment, does that solve the issue for you?

raysan5 · 2020-04-03T17:56:35Z

@sbc100 sure!

I got utf-8 and cp1252 without setting PYTHONUTF8.
I got utf-8 and UTF-8 after setting PYTHONUTF8, and now the open(stdout, 'r').read() call works properly.

Nice! :D
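The effect of PYTHONUTF8 can be checked without touching the current shell by starting a child interpreter with the variable set. This is a rough sketch assuming Python 3.7 or newer (where UTF-8 mode exists) and the value 1 for the variable; preferred_encoding_in_child is a name invented for this example.

```python
import os
import subprocess
import sys


def preferred_encoding_in_child(utf8_mode):
    """Run a child Python and return its locale.getpreferredencoding(False).

    Assumes Python 3.7+; PYTHONUTF8=1 enables UTF-8 mode in the child only.
    """
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)  # start from a known state
    if utf8_mode:
        env["PYTHONUTF8"] = "1"
    code = "import locale; print(locale.getpreferredencoding(False))"
    out = subprocess.check_output([sys.executable, "-c", code], env=env)
    return out.decode("ascii").strip()


print("default   :", preferred_encoding_in_child(False))
print("PYTHONUTF8:", preferred_encoding_in_child(True))
```

On a Windows setup like the one reported here, the first line would be expected to show cp1252 and the second a UTF-8 variant; the exact spelling of the name may differ between Python versions.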

stale · 2021-06-04T14:50:31Z

This issue has been automatically marked as stale because there has been no activity in the past year. It will be closed automatically if no further activity occurs in the next 30 days. Feel free to re-open at any time if this issue is still relevant.

raysan5 mentioned this issue Mar 30, 2020

Recent Chrome changes to sound autoplay settings #6511

Closed

sbc100 mentioned this issue May 28, 2020

win10 Emscripten encountered a problem compiling C / C + + files #11272

Closed

stale bot added the wontfix label Jun 4, 2021

stale bot closed this as completed Jul 8, 2021
