Build Failure with UnicodeDecodeError #10551

nokotan · 2020-02-24T09:13:43Z

Updating EMSDK (Python 2.x to Python 3.7) results in a build error (this seems to be an encoding problem).

OS: Windows 10 Home version 1909
Language: Japanese

Project Files:
https://github.com/nokotan/DxLibForHTML5-VSCode

Command:

em++ Main.cpp -o index.html -O1 -std=c++17 -g4 -IDxLibForHTML5 -LDxLibForHTML5 -lbullet -lfreetype -logg -lpng -lvorbis -lz -lDxDrawFunc -lDxUseCLib -lDxLib --emrun -s ASSERTIONS=1 -s MAIN_MODULE=1 -s FULL_ES3=1 -s ALLOW_MEMORY_GROWTH=1 --source-map-base http://localhost:8080/ --preload-file assets --shell-file template.html

Traceback:

Traceback (most recent call last):
  File "***/emsdk/upstream/emscripten/emcc.py", line 3905, in <module>
    sys.exit(run(sys.argv))
  File "***/emsdk/upstream/emscripten/emcc.py", line 2373, in run
    final = do_emscripten(final, shared.replace_or_append_suffix(target, '.mem'))
  File "***/emsdk/upstream/emscripten/emcc.py", line 475, in do_emscripten
    emscripten.run(infile, outfile, memfile)
  File "***/emsdk/upstream/emscripten/emscripten.py", line 2794, in run
    return temp_files.run_and_clean(lambda: emscripter(
  File "***/emsdk/upstream/emscripten/tools/tempfiles.py", line 105, in run_and_clean
    return func()
  File "***/emsdk/upstream/emscripten/emscripten.py", line 2795, in <lambda>
    infile, outfile_obj, memfile, temp_files, shared.DEBUG)
  File "***/emsdk/upstream/emscripten/emscripten.py", line 2207, in emscript_wasm_backend
    glue, forwarded_data = compile_settings(temp_files)
  File "***/emsdk/upstream/emscripten/emscripten.py", line 773, in compile_settings
    cwd=path_from_root('src'), env=env)
  File "***/emsdk/upstream/emscripten/tools/jsrun.py", line 100, in run_js_tool
    return shared.check_call(command, *args, **kw).stdout
  File "***/emsdk/upstream/emscripten/tools/shared.py", line 200, in check_call
    return run_process(cmd, *args, **kw)
  File "***/emsdk/upstream/emscripten/tools/shared.py", line 181, in run_process
    ret = subprocess.run(cmd, check=check, input=input, *args, **kw)
  File "D:/obj/windows-release/37amd64_Release/msi_python/zip_amd64/subprocess.py", line 474, in run
  File "D:/obj/windows-release/37amd64_Release/msi_python/zip_amd64/subprocess.py", line 926, in communicate
UnicodeDecodeError: 'cp932' codec can't decode byte 0x86 in position 24221: illegal multibyte sequence

sbc100 · 2020-02-24T22:01:40Z

Interesting. It looks like python3 is choosing to use the system default rather than utf8. I see a PEP out that fixes this: https://www.python.org/dev/peps/pep-0597/

In the meantime we should probably force utf8 somehow.

This seems to mostly be a problem on windows. There is even a PEP out to make this the default: https://www.python.org/dev/peps/pep-0597 Fixes: #10551
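
A minimal sketch of what forcing UTF-8 on subprocess output can look like, assuming the tool being run emits UTF-8 bytes. The helper name `run_utf8` and the child command are hypothetical and only illustrate the `encoding` argument of `subprocess.run`; this is not necessarily the change that was made in Emscripten.

```python
import subprocess
import sys


def run_utf8(cmd):
    """Run cmd and decode its stdout as UTF-8, whatever the system locale is.

    Hypothetical helper for illustration; not part of Emscripten.
    """
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, check=True, encoding="utf-8")
    return proc.stdout


# The child writes raw UTF-8 bytes to stdout, standing in for a tool
# that emits non-ASCII text.
child_cmd = [
    sys.executable,
    "-c",
    r"import sys; sys.stdout.buffer.write('\u54c8\u54c8'.encode('utf-8'))",
]
output = run_utf8(child_cmd)
print(output)
```

Without the explicit `encoding`, a text-mode pipe is decoded with the locale encoding, which is where a codec such as cp932 can come into play on Windows.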

sbc100 · 2020-04-03T17:29:39Z

Re-opening because the fix got reverted.

sbc100 · 2020-04-03T17:29:43Z

If you set PYTHONUTF8 in the environment, does that solve the issue for you?

nokotan · 2020-04-05T16:53:49Z

Thank you for your comment.

After setting the environment variable "PYTHONUTF8" to 1, compiling my project on Python 3.7.7 succeeded.
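
One way to check whether PYTHONUTF8 is actually reaching the interpreter is to ask a child Python for `sys.flags.utf8_mode` (available in Python 3.7 and later). The sketch below assumes `sys.executable` is the same Python that the build uses; the helper name `child_utf8_mode` is hypothetical.

```python
import os
import subprocess
import sys


def child_utf8_mode(env_overrides):
    """Return sys.flags.utf8_mode as reported by a child interpreter."""
    env = dict(os.environ)
    env.update(env_overrides)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.flags.utf8_mode)"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
        encoding="ascii",
    )
    return int(result.stdout.strip())


enabled = child_utf8_mode({"PYTHONUTF8": "1"})
disabled = child_utf8_mode({"PYTHONUTF8": "0"})
print(enabled, disabled)
```

If the first value is not 1 in a given setup, the variable is probably not reaching the interpreter that runs the build.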

jeromelaban · 2020-05-05T13:59:43Z

I'm having a similar issue, but setting PYTHONUTF8=1 does not seem to have an effect.

Is there something else that can be changed to work around this issue?

(I removed the stack trace; it was the same as above.)

nokotan · 2020-05-05T17:00:27Z

Does setting PYTHONIOENCODING=utf8 instead of PYTHONUTF8=1 resolve your problem?
PYTHONUTF8=1 should be an available option in Python 3.7 and later.
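
For reference, PYTHONIOENCODING only overrides the encoding of the interpreter's own stdin, stdout and stderr; it does not change the locale encoding that `open()` and text-mode subprocess pipes fall back to. A small sketch that makes the difference visible (the `probe_encodings` helper is hypothetical):

```python
import os
import subprocess
import sys

# Child script: print the normalized stdout codec, then the locale encoding.
PROBE = (
    "import codecs, locale, sys; "
    "print(codecs.lookup(sys.stdout.encoding).name); "
    "print(locale.getpreferredencoding(False))"
)


def probe_encodings(env_overrides):
    """Return (stdout codec, locale encoding) as seen by a child interpreter."""
    env = dict(os.environ)
    env.update(env_overrides)
    result = subprocess.run(
        [sys.executable, "-c", PROBE],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
        encoding="ascii",
    )
    stdout_codec, locale_codec = result.stdout.split()
    return stdout_codec, locale_codec


stdout_codec, locale_codec = probe_encodings({"PYTHONIOENCODING": "utf8"})
print(stdout_codec, locale_codec)
```

The second value is whatever the platform reports, so on a cp932 or GBK system it would typically stay cp932 or cp936 even though the first value is utf-8.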

jeromelaban · 2020-05-05T19:52:46Z

@nokotan thanks for the help. It's not having any effect, unfortunately. It looks like the ascii decoder is still used in subprocess.

I'll update the Python I'm running; it must be related.

stale · 2021-06-02T17:18:28Z

This issue has been automatically marked as stale because there has been no activity in the past year. It will be closed automatically if no further activity occurs in the next 30 days. Feel free to re-open at any time if this issue is still relevant.

nokotan · 2021-07-10T04:21:18Z

I can no longer reproduce this problem with the latest emscripten. Since the stale bot has just closed this issue, I'll keep it closed.

19890843006 · 2021-07-15T11:59:29Z

@nokotan I can reproduce this problem with emcc --version 2.0.3:
OS: Windows 10 Home version 1909
Python: Version 3.9.5 [sys.getdefaultencoding() = 'utf-8']
IDE: Vscode
Language: Chinese

command:

emcc index.cc --pre-js pre.js -o index.js

// pre.js
Module = {};
Module.onRuntimeInitialized = function () {
  postMessage("Worker Ready.");
};
self.onmessage = function (e) {
  console.log("哈哈:  data from Main-js:" + e.data); 
  console.log("开始: start.");
  var p = Module._Pi(e.data);
  postMessage(p);
  console.log("Worker: 结束.");
};

// index.cc
#include <cstdint>  // int64_t
#include <cstdio>   // printf

int main()
{
    int64_t a = 9223372036854775806; //0x7FFFFFFFFFFFFFFE
    a += 1;
    printf("%lld\n", a);
}

When index.js includes Chinese characters such as "哈哈，？", the build with emcc fails.
When I delete the Chinese characters, the build succeeds.

The error output:
Traceback (most recent call last):
  File "D:\Work\project-study\dems-h265player\emsdk\upstream\emscripten\emcc.py", line 3731, in <module>
    sys.exit(main(sys.argv))
  File "D:\Work\project-study\dems-h265player\emsdk\upstream\emscripten\emcc.py", line 3724, in main
    ret = run(args)
  File "D:\Work\project-study\dems-h265player\emsdk\upstream\emscripten\emcc.py", line 1062, in run
    options, newargs, settings_map = phase_parse_arguments(state)
  File "C:\Users\wwx1028182\AppData\Local\Programs\Python\Python39\lib\contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "D:\Work\project-study\dems-h265player\emsdk\upstream\emscripten\emcc.py", line 1175, in phase_parse_arguments
    options, settings_changes, user_js_defines, newargs = parse_args(newargs)
  File "D:\Work\project-study\dems-h265player\emsdk\upstream\emscripten\emcc.py", line 2778, in parse_args
    options.pre_js += open(consume_arg_file()).read() + '\n'
UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 224: illegal multibyte sequence
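
The failure can be reproduced without emcc: the UTF-8 bytes of one of the Chinese strings in pre.js are not valid GBK. A minimal sketch, using a single line modeled on the pre.js above:

```python
# One line modeled on the pre.js above; "开始" is encoded in UTF-8 as the
# bytes E5 BC 80 E5 A7 8B, and the byte 0x80 has no meaning in GBK.
pre_js_line = 'console.log("开始: start.");'
utf8_bytes = pre_js_line.encode("utf-8")

try:
    utf8_bytes.decode("gbk")  # what a GBK-locale text read effectively does
    failure = None
except UnicodeDecodeError as exc:
    failure = exc

# Decoding with the encoding the file was actually saved in works.
recovered = utf8_bytes.decode("utf-8")
print(failure)
print(recovered)
```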

sbc100 · 2021-07-16T19:58:47Z

What encoding is your file in? It seems that python is trying to use an encoding called gbk which doesn't seem to match the file encoding .. any idea why it would be doing that?

19890843006 · 2021-07-20T02:41:58Z

@sbc100 Thank you for your reply.
Here is all the information about this error:
EMCC VERSION: 2.0.23
emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 2.0.23 (b15ca40)
clang version 13.0.0 (Cswircachegitchromium.googlesource.com-external-github.com-llvm-llvm--project 5852582532b3eb3ea8da51a1e272d8d017bd36c9)
Target: wasm32-unknown-emscripten
Thread model: posix
InstalledDir: D:/Work/project-study/dems-h265player/emsdk/upstream/bin

OS: Windows 10 Home version 1909
Python Version: 3.9.5 [sys.getdefaultencoding() = 'utf-8']
IDE: Vscode
VSCODE Files Encoding Setting: UTF-8
Language: Chinese
Command: emcc index.cc -o index.js --pre-js pre.js

I have found the cause of this problem.
The error occurs because I use Python 3.9.5 and the default encoding of the Windows system I am using is GBK (chcp 936).
In Python 3.x, the built-in open function uses the system default encoding (not Python's sys.getdefaultencoding() = 'utf-8') when the encoding parameter is not specified.
In emcc.py, --pre-js uses:

# line 2778
elif check_arg('--pre-js'):
  options.pre_js += open(consume_arg_file()).read() + '\n'

The encoding parameter is not specified, and my pkg.js is UTF-8 encoded, so an error occurs when Python opens the file.
The following is the documentation of the open function in Python 3.9.5:

(function) open: (file: _OpenFile, mode: str, buffering: int = ..., encoding: str | None = ..., errors: str | None = ..., newline: str | None = ..., closefd: bool = ..., opener: _Opener | None = ...) -> IO
Open file and return a stream. Raise OSError upon failure.


file is either a text or byte string giving the name (and the path if the file isn't in the current working directory) of the file to be opened or an integer file descriptor of the file to be wrapped. (If a file descriptor is given, it is closed when the returned I/O object is closed, unless closefd is set to False.)


mode is an optional string that specifies the mode in which the file is opened. It defaults to 'r' which means open for reading in text mode. Other common values are 'w' for writing (truncating the file if it already exists), 'x' for creating and writing to a new file, and 'a' for appending (which on some Unix systems, means that all writes append to the end of the file regardless of the current seek position). In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.) The available modes are:


Character	Meaning
'r'	open for reading (default)
'w'	open for writing, truncating the file first
'x'	create a new file and open it for writing
'a'	open for writing, appending to the end of the file if it exists
'b'	binary mode
't'	text mode (default)
'+'	open a disk file for updating (reading and writing)
'U'	universal newline mode (deprecated)
The default mode is 'rt' (open for reading text). For binary random access, the mode 'w+b' opens and truncates the file to 0 bytes, while 'r+b' opens the file without truncation. The 'x' mode implies 'w' and raises an FileExistsError if the file already exists.


Python distinguishes between files opened in binary and text modes, even when the underlying operating system doesn't. Files opened in binary mode (appending 'b' to the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when 't' is appended to the mode argument), the contents of the file are returned as strings, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given.


'U' mode is deprecated and will raise an exception in future versions of Python. It has no effect in Python 3. Use newline to control universal newlines mode.


buffering is an optional integer used to set the buffering policy. Pass 0 to switch buffering off (only allowed in binary mode), 1 to select line buffering (only usable in text mode), and an integer > 1 to indicate the size of a fixed-size chunk buffer. When no buffering argument is given, the default buffering policy works as follows:


Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device's "block size" and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long.


"Interactive" text files (files for which isatty() returns True) use line buffering. Other text files use the policy described above for binary files.


encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent, but any encoding supported by Python can be passed. See the codecs module for the list of supported encodings.
...
...

Of these, the description of the encoding parameter of open is:

encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent, but any encoding supported by Python can be passed. See the codecs module for the list of supported encodings.

So when open is called without an encoding, the encoding is "platform dependent". (In my case, the Windows encoding is "GBK".)
This problem does not usually appear on Linux, because the default encoding there is typically "UTF-8". On Windows, when the platform encoding is "GBK" or anything other than "UTF-8", changing the Windows platform encoding to "UTF-8" solves this error.

I think this solution is a bit complicated, but I don't know if there is a better way to solve this error.
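
The two encodings discussed above can be inspected directly. The sketch below is only a diagnostic; the file name probe.txt is arbitrary, and the values other than `sys.getdefaultencoding()` depend on the platform and Python version.

```python
import locale
import os
import sys
import tempfile

# Encoding used for str <-> bytes conversions such as str.encode();
# this is 'utf-8' in Python 3.
default_codec = sys.getdefaultencoding()

# Encoding that the open() documentation quoted above falls back to
# when no encoding is passed.
locale_codec = locale.getpreferredencoding(False)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "probe.txt")
    with open(path, "w") as f:  # no encoding: platform dependent
        implicit = f.encoding
    with open(path, "w", encoding="utf-8") as f:  # explicit encoding
        explicit = f.encoding

print(default_codec, locale_codec, implicit, explicit)
```

On a system like the one described, `locale_codec` and `implicit` would be expected to report cp936 (GBK) while `default_codec` stays utf-8.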

sbc100 · 2021-07-20T02:45:36Z

I see, so it sounds like there is not much emscripten can do to solve that problem is there? If a user sets their default encoding to "GBK" then we should probably reasonably expect their source code to be encoded using this encoding I guess?

The alternative would be to force UTF-8, but that sounds just as likely to cause issues for users.

19890843006 · 2021-07-20T03:35:09Z

@sbc100
Because the main cause of the error lies in Python and the platform, there is no good way to fix it except changing the default encoding of the platform.

Perhaps Emscripten could allow an encoding to be specified, and pass it as the encoding parameter of Python's open when opening files for options such as --pre-js, --post-js, --extern-pre-js, ... (any option that reads or writes files).

But I don't know if it is necessary.
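
A sketch of what reading such files with an explicit encoding could look like. The helper `read_text_file` and its `encoding` parameter are hypothetical, not an existing Emscripten function or option; the point is only that passing `encoding` to `open` removes the dependence on the platform default.

```python
import os
import tempfile


def read_text_file(path, encoding="utf-8"):
    """Hypothetical helper: read a text file with an explicit encoding
    instead of relying on the platform default."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


source = 'console.log("Worker: 结束.");\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pre.js")
    # Save the file as UTF-8, as an editor configured for UTF-8 would.
    with open(path, "w", encoding="utf-8") as f:
        f.write(source)
    content = read_text_file(path)

print(content)
```

A caller whose sources really are GBK could pass `encoding="gbk"`, which is the trade-off raised earlier about forcing UTF-8.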

nokotan · 2022-11-12T04:13:15Z

Maybe this issue will be affected by #16736?
It seems that the Python command line option '-E' also disables PYTHONUTF8.
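
Python's `-E` option makes the interpreter ignore all PYTHON* environment variables, which includes PYTHONUTF8. The `-X utf8` command line option is not an environment variable, so it still takes effect. A small sketch, assuming Python 3.7 or later (the helper name is hypothetical):

```python
import os
import subprocess
import sys


def utf8_mode_with_flags(flags, env_overrides):
    """Return sys.flags.utf8_mode of a child started with the given flags."""
    env = dict(os.environ)
    env.update(env_overrides)
    result = subprocess.run(
        [sys.executable, *flags, "-c", "import sys; print(sys.flags.utf8_mode)"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
        encoding="ascii",
    )
    return int(result.stdout.strip())


# With -E the variable is ignored, so this value depends on the locale
# and Python version rather than on PYTHONUTF8.
with_env_only = utf8_mode_with_flags(["-E"], {"PYTHONUTF8": "1"})

# -X utf8 enables UTF-8 mode even when -E is present.
forced = utf8_mode_with_flags(["-E", "-X", "utf8"], {})
print(with_env_only, forced)
```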

Compilation to web target is currently failing with the following bug (debug lines obtained with EMCC_DEBUG=1):

make raylib_game
make[1]: Entering directory '/home/fcasas/Music/nokia-pod-racer/src'
emcc -o raylib_game.html raylib_game.o screen_logo.o screen_title.o screen_options.o screen_gameplay.o screen_ending.o -std=c99 -Wall -Wno-missing-braces -Wunused-result -D_DEFAULT_SOURCE -Os -s MINIFY_HTML=0 -I. -I/home/fcasas/Videos/raylib-web/src -I/home/fcasas/Videos/raylib-web/src/external -I/home/fcasas/Videos/raylib-web/src/extras -L. -L/home/fcasas/Videos/raylib-web/src -L/home/fcasas/Videos/raylib-web/src -s USE_GLFW=3 -s TOTAL_MEMORY=134217728 -s FORCE_FILESYSTEM=1 --preload-file resources --shell-file minshell.html /home/fcasas/Videos/raylib-web/src/libraylib.a -DPLATFORM_WEB
...
emcc:DEBUG: minifying HTML file raylib_game.html
profiler:DEBUG: block "final emitting" raised an exception after 0.066 seconds
profiler:DEBUG: block "post_link" raised an exception after 3.469 seconds
profiler:DEBUG: block "main" raised an exception after 3.653 seconds
Traceback (most recent call last):
  File "/usr/share/emscripten/emcc.py", line 3929, in <module>
    sys.exit(main(sys.argv))
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/usr/share/emscripten/emcc.py", line 3922, in main
    ret = run(args)
  File "/usr/share/emscripten/emcc.py", line 1194, in run
    phase_post_link(options, state, wasm_target, wasm_target, target)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/usr/share/emscripten/emcc.py", line 2740, in phase_post_link
    phase_final_emitting(options, state, target, wasm_target, memfile)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/usr/share/emscripten/emcc.py", line 2867, in phase_final_emitting
    generate_html(target, options, js_target, target_basename,
  File "/usr/share/emscripten/emcc.py", line 3663, in generate_html
    minify_html(target)
  File "/usr/share/emscripten/emcc.py", line 3637, in minify_html
    shared.check_call(['htmlmin', opts, '--', filename, filename])
  File "/usr/share/emscripten/tools/shared.py", line 221, in check_call
    return run_process(cmd, *args, **kw)
  File "/usr/share/emscripten/tools/shared.py", line 105, in run_process
    ret = subprocess.run(cmd, check=check, input=input, *args, **kw)
  File "/usr/lib/python3.10/subprocess.py", line 501, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.10/subprocess.py", line 969, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.10/subprocess.py", line 1778, in _execute_child
    self.pid = _posixsubprocess.fork_exec(
TypeError: expected str, bytes or os.PathLike object, not list

This seems related to this bug in emscripten: emscripten-core/emscripten#10551

The solution is doing the same as in: emscripten-core/emscripten#8547 adding the following compilation flag:

CFLAGS += -s MINIFY_HTML=0

sbc100 added a commit that referenced this issue Feb 24, 2020

Use utf-8 encoding by default for all subprocess communication.

8277aea

This seems to mostly be a problem on windows. There is even a PEP out to make this the default: https://www.python.org/dev/peps/pep-0597 Fixes: #10551

sbc100 mentioned this issue Feb 24, 2020

Use utf-8 encoding by default for all subprocess communication. #10558

Merged

sbc100 closed this as completed in #10558 Feb 25, 2020

sbc100 added a commit that referenced this issue Feb 25, 2020

Use utf-8 encoding by default for all subprocess communication. (#10558)

7b9b09a

sbc100 mentioned this issue Apr 3, 2020

UTF8 characters not processed properly with custom shell.html #10802

Closed

sbc100 reopened this Apr 3, 2020

jeromelaban mentioned this issue May 5, 2020

chore: Bump to emscripten 1.39.14 unoplatform/Uno.SkiaSharp#19

Closed

jeromelaban mentioned this issue Jun 29, 2020

Seems "wasm-strip" is no more unoplatform/Uno.Wasm.Bootstrap#235

Closed

jeromelaban mentioned this issue Jul 9, 2020

Some changes needed by Wave Engine unoplatform/Uno.Wasm.Bootstrap#245

Closed

kichikuou mentioned this issue Mar 10, 2021

Emscripten Linking Error on Windows kichikuou/system3-sdl2#8

Closed

nokotan mentioned this issue Apr 6, 2021

最新のEMSDKでビルドができない (Cannot build with the latest EMSDK) nokotan/DxLibForHTML5-VSCode#2

Closed

stale bot added the wontfix label Jun 2, 2021

stale bot closed this as completed Jul 9, 2021

nokotan changed the title ~~Build Fails with latest EMSDK~~ Build Failure with UnicodeDecodeError Nov 12, 2022
