Build Failure with UnicodeDecodeError #10551

Closed
nokotan opened this issue Feb 24, 2020 · 15 comments · Fixed by #10558
Contributor

nokotan commented Feb 24, 2020

Updating EMSDK (Python 2.x to Python 3.7) results in a build error (it seems to be an encoding problem).

OS: Windows 10 Home version 1909
Language: Japanese

Project Files:
https://github.com/nokotan/DxLibForHTML5-VSCode

Command:

em++ Main.cpp -o index.html -O1 -std=c++17 -g4 -IDxLibForHTML5 -LDxLibForHTML5 -lbullet -lfreetype -logg -lpng -lvorbis -lz -lDxDrawFunc -lDxUseCLib -lDxLib --emrun -s ASSERTIONS=1 -s MAIN_MODULE=1 -s FULL_ES3=1 -s ALLOW_MEMORY_GROWTH=1 --source-map-base http://localhost:8080/ --preload-file assets --shell-file template.html

Traceback:

Traceback (most recent call last):
  File "\*\*\*/emsdk/upstream/emscripten/emcc.py", line 3905, in <module>
    sys.exit(run(sys.argv))
  File "\*\*\*/emsdk/upstream/emscripten/emcc.py", line 2373, in run
    final = do_emscripten(final, shared.replace_or_append_suffix(target, '.mem'))
  File "\*\*\*/emsdk/upstream/emscripten/emcc.py", line 475, in do_emscripten
    emscripten.run(infile, outfile, memfile)
  File "\*\*\*/emsdk/upstream/emscripten/emscripten.py", line 2794, in run
    return temp_files.run_and_clean(lambda: emscripter(
  File "\*\*\*/emsdk/upstream/emscripten/tools/tempfiles.py", line 105, in run_and_clean
    return func()
  File "\*\*\*/emsdk/upstream/emscripten/emscripten.py", line 2795, in <lambda>
    infile, outfile_obj, memfile, temp_files, shared.DEBUG)
  File "\*\*\*/emsdk/upstream/emscripten/emscripten.py", line 2207, in emscript_wasm_backend
    glue, forwarded_data = compile_settings(temp_files)
  File "\*\*\*/emsdk/upstream/emscripten/emscripten.py", line 773, in compile_settings
    cwd=path_from_root('src'), env=env)
  File "\*\*\*/emsdk/upstream/emscripten/tools/jsrun.py", line 100, in run_js_tool
    return shared.check_call(command, *args, **kw).stdout
  File "\*\*\*/emsdk/upstream/emscripten/tools/shared.py", line 200, in check_call
    return run_process(cmd, *args, **kw)
  File "\*\*\*/emsdk/upstream/emscripten/tools/shared.py", line 181, in run_process
    ret = subprocess.run(cmd, check=check, input=input, *args, **kw)
  File "D:/obj/windows-release/37amd64_Release/msi_python/zip_amd64/subprocess.py", line 474, in run
  File "D:/obj/windows-release/37amd64_Release/msi_python/zip_amd64/subprocess.py", line 926, in communicate
UnicodeDecodeError: 'cp932' codec can't decode byte 0x86 in position 24221: illegal multibyte sequence
Collaborator

sbc100 commented Feb 24, 2020

Interesting. It looks like python3 is choosing to use the system default rather than utf8. I see a PEP out that fixes this: https://www.python.org/dev/peps/pep-0597/

In the meantime we should probably force utf8 somehow.
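A minimal sketch of what forcing UTF-8 could look like for a subprocess wrapper. `run_process_utf8` is a hypothetical helper, not the actual emscripten `run_process`; it only illustrates passing `encoding="utf-8"` to `subprocess.run` so that the child's output is not decoded with the locale code page (cp932 in the report above).

```python
import subprocess
import sys


def run_process_utf8(cmd):
    """Run cmd and decode its stdout as UTF-8 regardless of the locale.

    Hypothetical helper for illustration only.
    """
    return subprocess.run(
        cmd,
        check=True,
        stdout=subprocess.PIPE,
        encoding="utf-8",  # explicit, so the locale code page is never used
    )


# The child writes raw UTF-8 bytes for three Japanese characters, much like
# a JS tool emitting non-ASCII text. Escapes keep the command line ASCII-only.
child_code = (
    "import sys; "
    "sys.stdout.buffer.write('\\u65e5\\u672c\\u8a9e'.encode('utf-8'))"
)
result = run_process_utf8([sys.executable, "-c", child_code])
print(result.stdout == "\u65e5\u672c\u8a9e")
```

Without the `encoding` argument, `subprocess.run(..., text=True)` would decode the pipe with the locale's preferred encoding, which is where a cp932 or GBK system can fail.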

sbc100 added a commit that referenced this issue Feb 24, 2020
This seems to mostly be a problem on windows. There is even a PEP
out to make this the default: https://www.python.org/dev/peps/pep-0597

Fixes: #10551
sbc100 added a commit that referenced this issue Feb 25, 2020
This seems to mostly be a problem on windows. There is even a PEP
out to make this the default: https://www.python.org/dev/peps/pep-0597

Fixes: #10551
Collaborator

sbc100 commented Apr 3, 2020

Re-opening because the fix got reverted.

@sbc100 sbc100 reopened this Apr 3, 2020
Collaborator

sbc100 commented Apr 3, 2020

If you set PYTHONUTF8 in the environment, does that solve the issue for you?

Contributor Author

nokotan commented Apr 5, 2020

Thank you for your comment.

After setting the environment variable "PYTHONUTF8" to 1, compiling my project on Python 3.7.7 succeeded.
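For anyone who wants to confirm that the variable is actually picked up, here is a small sketch that starts a child interpreter with `PYTHONUTF8=1` and asks it whether UTF-8 mode is active. It assumes Python 3.7 or later; the variable names are illustrative only.

```python
import os
import subprocess
import sys

# Copy the current environment and enable Python's UTF-8 mode (Python 3.7+).
env = dict(os.environ, PYTHONUTF8="1")

# Ask a child interpreter whether UTF-8 mode is active and which encoding
# open() would fall back to when no encoding argument is given.
probe = (
    "import sys, locale; "
    "print(sys.flags.utf8_mode, locale.getpreferredencoding(False))"
)
output = subprocess.run(
    [sys.executable, "-c", probe],
    env=env,
    check=True,
    stdout=subprocess.PIPE,
    encoding="utf-8",
).stdout
utf8_mode, preferred = output.split()
print(utf8_mode, preferred)
```

If the first value is `0`, the interpreter that ran did not see the variable, for example because it was started with an option that ignores `PYTHON*` environment variables.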


jeromelaban commented May 5, 2020

I'm having a similar issue, but setting PYTHONUTF8=1 does not seem to have an effect.

Is there something else that could be changed to work around this issue?

(I removed the stack trace because it was the same as above.)

Contributor Author

nokotan commented May 5, 2020

Does setting PYTHONIOENCODING=utf8 instead of PYTHONUTF8=1 resolve your problem?
PYTHONUTF8=1 should be an available option in Python 3.7 and later.
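One thing worth keeping in mind when comparing the two variables: `PYTHONIOENCODING` only sets the encoding of the interpreter's own `stdin`/`stdout`/`stderr`, while the default used by `open()` and by text-mode pipes in `subprocess` still comes from the locale. A small sketch to observe this (the probe string and variable names are illustrative):

```python
import codecs
import os
import subprocess
import sys

# Set only PYTHONIOENCODING; drop PYTHONUTF8 so it does not mask the result.
env = dict(os.environ, PYTHONIOENCODING="utf8")
env.pop("PYTHONUTF8", None)

probe = (
    "import sys, locale; "
    "print(sys.stdout.encoding); "
    "print(locale.getpreferredencoding(False))"
)
lines = subprocess.run(
    [sys.executable, "-c", probe],
    env=env,
    check=True,
    stdout=subprocess.PIPE,
    encoding="utf-8",
).stdout.splitlines()

# Normalize spellings such as "utf8" and "UTF-8" to the codec's canonical name.
stdout_encoding = codecs.lookup(lines[0]).name
open_default = lines[1]  # locale dependent, unaffected by PYTHONIOENCODING
print(stdout_encoding, open_default)
```

On a system with a non-UTF-8 locale, the second value stays at the locale encoding, which would be consistent with the variable having no visible effect on this error.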


jeromelaban commented May 5, 2020

@nokotan thanks for the help. It's not having any effect, unfortunately. It looks like the ascii decoder is still used in subprocess.

I'll update my running Python, must be related.


stale bot commented Jun 2, 2021

This issue has been automatically marked as stale because there has been no activity in the past year. It will be closed automatically if no further activity occurs in the next 30 days. Feel free to re-open at any time if this issue is still relevant.

@stale stale bot added the wontfix label Jun 2, 2021
@stale stale bot closed this as completed Jul 9, 2021
Contributor Author

nokotan commented Jul 10, 2021

I can no longer reproduce this problem with the latest emscripten, so I am keeping this issue closed now that the stale bot has closed it.


19890843006 commented Jul 15, 2021

@nokotan I can reproduce this problem with emcc 2.0.23.
OS: Windows 10 Home version 1909
Python version: 3.9.5 [sys.getdefaultencoding() = 'utf-8']
IDE: VS Code
Language: Chinese

command:

emcc index.cc --pre-js pre.js -o index.js

// pre.js
Module = {};
Module.onRuntimeInitialized = function () {
  postMessage("Worker Ready.");
};
self.onmessage = function (e) {
  console.log("哈哈:  data from Main-js:" + e.data); 
  console.log("开始: start.");
  var p = Module._Pi(e.data);
  postMessage(p);
  console.log("Worker: 结束.");
};
// index.cc
#include <cstdint>
#include <cstdio>

int main()
{
    int64_t a = 9223372036854775806; // 0x7FFFFFFFFFFFFFFE
    a += 1;
    printf("%lld\n", (long long)a);
}

When pre.js includes Chinese characters, such as "哈哈", the build fails with emcc.
When I delete the Chinese characters, the build succeeds.

The error output:
Traceback (most recent call last):
  File "D:\Work\project-study\dems-h265player\emsdk\upstream\emscripten\emcc.py", line 3731, in <module>
    sys.exit(main(sys.argv))
  File "D:\Work\project-study\dems-h265player\emsdk\upstream\emscripten\emcc.py", line 3724, in main
    ret = run(args)
  File "D:\Work\project-study\dems-h265player\emsdk\upstream\emscripten\emcc.py", line 1062, in run
    options, newargs, settings_map = phase_parse_arguments(state)
  File "C:\Users\wwx1028182\AppData\Local\Programs\Python\Python39\lib\contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "D:\Work\project-study\dems-h265player\emsdk\upstream\emscripten\emcc.py", line 1175, in phase_parse_arguments
    options, settings_changes, user_js_defines, newargs = parse_args(newargs)
  File "D:\Work\project-study\dems-h265player\emsdk\upstream\emscripten\emcc.py", line 2778, in parse_args
    options.pre_js += open(consume_arg_file()).read() + '\n'
UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 224: illegal multibyte sequence
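The failure can be reproduced outside emcc. The sketch below writes one line similar to the `pre.js` above as UTF-8 and then reads it back with the `gbk` codec named explicitly, which stands in for a GBK locale. The temporary path and variable names are illustrative only.

```python
import os
import tempfile

# One line similar to pre.js; the two escaped characters are "开始", whose
# UTF-8 bytes are E5 BC 80 E5 A7 8B.
text = 'console.log("\u5f00\u59cb: start.");\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pre.js")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Name the codec explicitly to simulate what open() does on a GBK locale.
    try:
        with open(path, encoding="gbk") as f:
            f.read()
        error = None
    except UnicodeDecodeError as exc:
        error = exc

    # Reading with the file's real encoding round-trips cleanly.
    with open(path, encoding="utf-8") as f:
        roundtrip = f.read()

print(type(error).__name__, roundtrip == text)
```

Note that not every UTF-8 file fails this way: some byte sequences happen to be valid GBK and decode silently into the wrong characters, which is arguably worse than an exception.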


Collaborator

sbc100 commented Jul 16, 2021

What encoding is your file in? It seems that python is trying to use an encoding called gbk which doesn't seem to match the file encoding .. any idea why it would be doing that?


19890843006 commented Jul 20, 2021

@sbc100 Thank you for your reply.
Here is all the information about this error:
EMCC VERSION: 2.0.23
emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 2.0.23 (b15ca40)
clang version 13.0.0 (Cswircachegitchromium.googlesource.com-external-github.com-llvm-llvm--project 5852582532b3eb3ea8da51a1e272d8d017bd36c9)
Target: wasm32-unknown-emscripten
Thread model: posix
InstalledDir: D:/Work/project-study/dems-h265player/emsdk/upstream/bin

OS: Windows 10 Home version 1909
Python version: 3.9.5 [sys.getdefaultencoding() = 'utf-8']
IDE: VS Code
VS Code files encoding setting: UTF-8
Language: Chinese
Command: emcc index.cc -o index.js --pre-js pre.js

I have found the cause of this problem.
The error occurs because I am using Python 3.9.5 and the default encoding of my Windows system is GBK (chcp 936).
In Python 3.x, the built-in open function uses the system default encoding (not Python's sys.getdefaultencoding(), which is 'utf-8') when the encoding parameter is not specified.
In emcc.py, --pre-js uses:

# line 2778
elif check_arg('--pre-js'):
  options.pre_js += open(consume_arg_file()).read() + '\n'

The encoding parameter is not specified, and my pre.js is UTF-8 encoded, so an error occurs when Python opens the file.
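A minimal sketch of reading such a file with an explicit encoding instead of the platform default. `read_text_file` and `append_pre_js` are hypothetical helpers written for this comment, not emscripten functions, and UTF-8 is assumed to be the intended encoding of the input file.

```python
import os
import tempfile


def read_text_file(path, encoding="utf-8"):
    """Read a text file with an explicit encoding (hypothetical helper).

    encoding is passed by keyword because the positional order of open()
    is file, mode, buffering, encoding.
    """
    with open(path, encoding=encoding) as f:
        return f.read()


def append_pre_js(existing, path):
    """Mimic the shape of the --pre-js line above, with explicit UTF-8."""
    return existing + read_text_file(path) + "\n"


# Demonstration with a temporary UTF-8 file containing non-ASCII text.
source = 'console.log("\u54c8\u54c8");'
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pre.js")
    with open(path, "w", encoding="utf-8") as f:
        f.write(source)
    pre_js = append_pre_js("", path)

print(pre_js == source + "\n")
```

Whether UTF-8 is the right default for every user's input files is a separate question; the sketch only shows how to stop depending on the locale.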
The following is the documentation of the built-in open function in Python 3.9.5:

(function) open: (file: _OpenFile, mode: str, buffering: int = ..., encoding: str | None = ..., errors: str | None = ..., newline: str | None = ..., closefd: bool = ..., opener: _Opener | None = ...) -> IO
Open file and return a stream. Raise OSError upon failure.


file is either a text or byte string giving the name (and the path if the file isn't in the current working directory) of the file to be opened or an integer file descriptor of the file to be wrapped. (If a file descriptor is given, it is closed when the returned I/O object is closed, unless closefd is set to False.)


mode is an optional string that specifies the mode in which the file is opened. It defaults to 'r' which means open for reading in text mode. Other common values are 'w' for writing (truncating the file if it already exists), 'x' for creating and writing to a new file, and 'a' for appending (which on some Unix systems, means that all writes append to the end of the file regardless of the current seek position). In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.) The available modes are:


Character	Meaning
'r'	open for reading (default)
'w'	open for writing, truncating the file first
'x'	create a new file and open it for writing
'a'	open for writing, appending to the end of the file if it exists
'b'	binary mode
't'	text mode (default)
'+'	open a disk file for updating (reading and writing)
'U'	universal newline mode (deprecated)
The default mode is 'rt' (open for reading text). For binary random access, the mode 'w+b' opens and truncates the file to 0 bytes, while 'r+b' opens the file without truncation. The 'x' mode implies 'w' and raises an FileExistsError if the file already exists.


Python distinguishes between files opened in binary and text modes, even when the underlying operating system doesn't. Files opened in binary mode (appending 'b' to the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when 't' is appended to the mode argument), the contents of the file are returned as strings, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given.


'U' mode is deprecated and will raise an exception in future versions of Python. It has no effect in Python 3. Use newline to control universal newlines mode.


buffering is an optional integer used to set the buffering policy. Pass 0 to switch buffering off (only allowed in binary mode), 1 to select line buffering (only usable in text mode), and an integer > 1 to indicate the size of a fixed-size chunk buffer. When no buffering argument is given, the default buffering policy works as follows:


Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device's "block size" and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long.


"Interactive" text files (files for which isatty() returns True) use line buffering. Other text files use the policy described above for binary files.


encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent, but any encoding supported by Python can be passed. See the codecs module for the list of supported encodings.
...
...

Among these, the description of the encoding parameter of open is:

encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent, but any encoding supported by Python can be passed. See the codecs module for the list of supported encodings.

It can be seen that when open is called without an explicit encoding, the encoding used is platform dependent. (In my case, the Windows encoding is GBK.)
There is usually no such problem on Linux, because the default encoding on Linux systems is typically UTF-8. On Windows, when the platform encoding is GBK or anything other than UTF-8, changing the Windows platform encoding to UTF-8 resolves this error.
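To check which encoding a given machine would pick, the following sketch opens a temporary file without an `encoding` argument and compares the result with `locale.getpreferredencoding(False)`. The `normalized` helper is illustrative; it only maps different spellings of the same codec to one name.

```python
import codecs
import locale
import os
import tempfile


def normalized(name):
    """Map spellings such as 'UTF-8', 'utf8' or 'cp936' to a canonical name."""
    return codecs.lookup(name).name


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "probe.txt")
    # No encoding argument on purpose: this is the platform-dependent default.
    with open(path, "w") as f:
        default_used = normalized(f.encoding)

locale_encoding = normalized(locale.getpreferredencoding(False))
print(default_used, locale_encoding, default_used == locale_encoding)
```

On a Windows system configured for Simplified Chinese this would be expected to report a GBK-family codec unless UTF-8 mode is enabled.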

I think this solution is a bit complicated, but I don't know if there is a better way to solve this error.

Collaborator

sbc100 commented Jul 20, 2021

I see, so it sounds like there is not much emscripten can do to solve that problem is there? If a user sets their default encoding to "GBK" then we should probably reasonably expect their source code to be encoded using this encoding I guess?

The alternative would be to force UTF-8, but that sounds just as likely to cause issues for users.


19890843006 commented Jul 20, 2021

@sbc100
Because the main cause of the error lies in Python and the platform, there is no good way to avoid it other than changing the platform's default encoding.

Perhaps Emscripten could accept an encoding setting and pass it as the encoding parameter of Python's open for options such as --pre-js, --post-js, --extern-pre-js (any option that reads or writes files).

But I don't know whether that is necessary.
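To make the suggestion concrete, here is a sketch of how a tool could accept an encoding on the command line and pass it through to `open`. Everything here is hypothetical: `--file-encoding` is not an emcc option, and `demo-tool`, `build_parser` and `load_pre_js` are names invented for the illustration.

```python
import argparse
import os
import tempfile


def build_parser():
    parser = argparse.ArgumentParser(prog="demo-tool")
    parser.add_argument("--pre-js", action="append", default=[])
    # Hypothetical option: lets the user state the encoding of input files.
    parser.add_argument("--file-encoding", default="utf-8")
    return parser


def load_pre_js(args):
    """Read every --pre-js file using the requested encoding."""
    chunks = []
    for path in args.pre_js:
        with open(path, encoding=args.file_encoding) as f:
            chunks.append(f.read())
    return "\n".join(chunks)


# Demonstration: a GBK-encoded input file is read correctly once the
# encoding is stated explicitly.
text = 'console.log("\u5f00\u59cb");'
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pre.js")
    with open(path, "w", encoding="gbk") as f:
        f.write(text)
    args = build_parser().parse_args(
        ["--pre-js", path, "--file-encoding", "gbk"]
    )
    loaded = load_pre_js(args)

print(loaded == text)
```

A default of UTF-8 with an explicit override would cover both the UTF-8 file on a GBK system described here and users whose sources really are in the locale encoding.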

nokotan changed the title from "Build Fails with latest EMSDK" to "Build Failure with UnicodeDecodeError" Nov 12, 2022
Contributor Author

nokotan commented Nov 12, 2022

Maybe this issue is affected by #16736?
It seems that the Python command-line option '-E' also disables PYTHONUTF8.
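A sketch for checking this interaction. `-E` makes the interpreter ignore all `PYTHON*` environment variables, while `-X utf8` is a command-line option and is therefore still honored. `utf8_mode_of_child` is an illustrative helper, and the result for `-E` alone is platform dependent because some locales enable UTF-8 mode on their own.

```python
import os
import subprocess
import sys


def utf8_mode_of_child(extra_args, env):
    """Start a child interpreter and return its sys.flags.utf8_mode value."""
    output = subprocess.run(
        [sys.executable, *extra_args, "-c",
         "import sys; print(sys.flags.utf8_mode)"],
        env=env,
        check=True,
        stdout=subprocess.PIPE,
        encoding="utf-8",
    ).stdout
    return int(output.strip())


env = dict(os.environ, PYTHONUTF8="1")

with_env_var = utf8_mode_of_child([], env)                      # variable honored
with_dash_e = utf8_mode_of_child(["-E"], env)                   # variable ignored
with_x_option = utf8_mode_of_child(["-E", "-X", "utf8"], env)   # option honored
print(with_env_var, with_dash_e, with_x_option)
```

So a launcher that starts Python with `-E` would need `-X utf8` (or explicit `encoding=` arguments in the code) for UTF-8 mode to take effect.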

autopawn added a commit to autopawn/nokia-pod-racer that referenced this issue Feb 26, 2023
Compilation to web target is currently failing with the following bug
(debug lines obtained with EMCC_DEBUG=1):

make raylib_game
make[1]: Entering directory '/home/fcasas/Music/nokia-pod-racer/src'
emcc -o raylib_game.html  raylib_game.o  screen_logo.o  screen_title.o  screen_options.o  screen_gameplay.o  screen_ending.o -std=c99 -Wall -Wno-missing-braces -Wunused-result -D_DEFAULT_SOURCE -Os -s MINIFY_HTML=0 -I. -I/home/fcasas/Videos/raylib-web/src -I/home/fcasas/Videos/raylib-web/src/external -I/home/fcasas/Videos/raylib-web/src/extras -L. -L/home/fcasas/Videos/raylib-web/src -L/home/fcasas/Videos/raylib-web/src -s USE_GLFW=3 -s TOTAL_MEMORY=134217728 -s FORCE_FILESYSTEM=1 --preload-file resources --shell-file minshell.html /home/fcasas/Videos/raylib-web/src/libraylib.a -DPLATFORM_WEB
...
emcc:DEBUG: minifying HTML file raylib_game.html
profiler:DEBUG: block "final emitting" raised an exception after 0.066 seconds
profiler:DEBUG: block "post_link" raised an exception after 3.469 seconds
profiler:DEBUG: block "main" raised an exception after 3.653 seconds
Traceback (most recent call last):
  File "/usr/share/emscripten/emcc.py", line 3929, in <module>
    sys.exit(main(sys.argv))
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/usr/share/emscripten/emcc.py", line 3922, in main
    ret = run(args)
  File "/usr/share/emscripten/emcc.py", line 1194, in run
    phase_post_link(options, state, wasm_target, wasm_target, target)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/usr/share/emscripten/emcc.py", line 2740, in phase_post_link
    phase_final_emitting(options, state, target, wasm_target, memfile)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/usr/share/emscripten/emcc.py", line 2867, in phase_final_emitting
    generate_html(target, options, js_target, target_basename,
  File "/usr/share/emscripten/emcc.py", line 3663, in generate_html
    minify_html(target)
  File "/usr/share/emscripten/emcc.py", line 3637, in minify_html
    shared.check_call(['htmlmin', opts, '--', filename, filename])
  File "/usr/share/emscripten/tools/shared.py", line 221, in check_call
    return run_process(cmd, *args, **kw)
  File "/usr/share/emscripten/tools/shared.py", line 105, in run_process
    ret = subprocess.run(cmd, check=check, input=input, *args, **kw)
  File "/usr/lib/python3.10/subprocess.py", line 501, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.10/subprocess.py", line 969, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.10/subprocess.py", line 1778, in _execute_child
    self.pid = _posixsubprocess.fork_exec(
TypeError: expected str, bytes or os.PathLike object, not list

This seems related to this bug in emscripten:

emscripten-core/emscripten#10551

The solution is doing the same as in:

emscripten-core/emscripten#8547

adding the following compilation flag:

CFLAGS += -s MINIFY_HTML=0
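
A note on the traceback in the commit message above: the `TypeError` is raised while building the child's argument vector, which suggests that one element of the command list (presumably `opts`) was itself a list. That is an inference from the error message, not something confirmed in this thread. A minimal sketch of the pattern, using the Python interpreter as a stand-in for `htmlmin`:

```python
import subprocess
import sys

# Stand-in for a list of options that should be spliced into the command.
opts = ["-c", "print('ok')"]

# Nesting the list as a single argv element is rejected by subprocess.
try:
    subprocess.run([sys.executable, opts], check=True)
    nested_error = None
except TypeError as exc:
    nested_error = exc

# Unpacking the options yields a flat list of strings, which works.
flat = subprocess.run(
    [sys.executable, *opts],
    check=True,
    stdout=subprocess.PIPE,
    encoding="utf-8",
)
print(type(nested_error).__name__, flat.stdout.strip())
```

This is a different failure from the `UnicodeDecodeError` discussed in the rest of the thread.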