Salt fails with UnicodeDecodeError on non-ascii characters on Windows (code pages 850 or 437). #19166

marbx · 2014-12-21T19:54:39Z

Hello
I use a Windows 7 Enterprise 64 bit minion with a German locale.

Two commands (of many) fail: cmd.run dir and cmd.run whoami
To reproduce, you need a German locale or a file with a non-ascii character (like ä).

From reading the stacktrace:
0x84 (hex) is 132 (decimal).
Character 132 in Codepage 437 (Windows) is ä, which you find...
at position 8 of of dir = Datenträger in Laufwerk c: (= Volume in drive c:).
at position 10 of of cmd.run whoami = nt-autorität\system (= nt-authority\system).

Character 132 in UTF-8 is isolatin-1. Character 132 in isolatin-1 is undefined.

A utf-8 decoder fails with UnicodeDecodeError on character 132.
An example is raw_input().decode('utf-8') found at http://stackoverflow.com/questions/22772888/german-umlauts-read-in-with-raw-input-in-python-2-7
I assume utf_8_decode(), used in salt/output/nested.py, behaves the same.

The above stackoverflow link contains a possible solution.

The stacktraces:

root@MP2Pdebian:/home/markus# uname -a
Linux MP2Pdebian 3.2.0-4-amd64 #1 SMP Debian 3.2.63-2+deb7u2 x86_64 GNU/Linux
root@MP2Pdebian:/home/markus# salt --versions
           Salt: 2014.7.0
         Python: 2.7.3 (default, Mar 13 2014, 11:03:55)
         Jinja2: 2.6
       M2Crypto: 0.21.1
 msgpack-python: 0.1.10
   msgpack-pure: Not Installed
       pycrypto: 2.6
        libnacl: Not Installed
         PyYAML: 3.10
          ioflo: Not Installed
          PyZMQ: 13.1.0
           RAET: Not Installed
            ZMQ: 3.2.3
           Mako: 0.7.0

root@MP2Pdebian:/home/markus# salt MarkusTp test.version
MarkusTp:
    2014.7.0

root@MP2Pdebian:/home/markus# salt MarkusTp cmd.run dir
[ERROR   ] An un-handled exception was caught by salt's global exception handler:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x84 in position 8: invalid start byte
Traceback (most recent call last):
--snip--
  File "/usr/lib/python2.7/dist-packages/salt/output/nested.py", line 123, in output
    return nest.display(ret, __opts__.get('nested_indent', 0), '', '')
  File "/usr/lib/python2.7/dist-packages/salt/output/nested.py", line 114, in display
    out = self.display(val, indent + 4, '', out)
  File "/usr/lib/python2.7/dist-packages/salt/output/nested.py", line 89, in display
    prefix=prefix)
  File "/usr/lib/python2.7/dist-packages/salt/output/nested.py", line 58, in ustring
    indent, color, prefix, msg.decode(encoding), endc, suffix)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x84 in position 8: invalid start byte


root@MP2Pdebian:~# salt MarkusTp cmd.run whoami
[ERROR   ] An un-handled exception was caught by salt's global exception handler:
--snip--
UnicodeDecodeError: 'utf8' codec can't decode byte 0x84 in position 10: invalid start byte

The text was updated successfully, but these errors were encountered:

jfindlay · 2014-12-22T18:35:16Z

@markuskramerIgitt, thanks for opening a new issue. It seems that this has been fixed already on the 2014.7 branch. A new function called salt.utils.sdecode was created for this purpose and has been applied to the line causing the problem.

jfindlay · 2014-12-22T18:50:08Z

You could try updating your salt minion, but we'll be releasing 2014.7.1 within the next few weeks.

marbx · 2015-01-07T23:14:42Z

Hello @jfindlay
please pardon me if I sound harsh and like a broken record.

The function sdecode(string) is documented with

Since we don't know where a string is coming from and that string will
need to be safely decoded, this function will attempt to decode the string
until if has a working string that does not stack trace

Why? When you issue cmd.run, you know the minion where the string is coming back from.
Network configuration software should always know where "the string is coming from".

Apart from that sdecode() for cmd.run does not solve the problem because it uses get_encodings() (luckily the function above) which in turn uses sys.getdefaultencoding().

sys.getdefaultencoding() returns ascii on Windows and Darwin.

Please forget sys.getdefaultencoding()
Documentation says Return the name of the current default string encoding used by the Unicode implementation.

Instead use sys.stdin.encoding
sys.stdin.encoding returns cp850 on Windows and UTF-8 on Darwin.
Character 132 is ä in both code page 850 and 437, so this is the correct function.

get_encodings() always appends utf-8 and latin-1
Only therefore non-ascii characters on UNIX can be decoded.

The function get_encodings() is documented with

return a list of string encodings to try

Why? Use the right encoding. Networking configuration software should not "try a list of string encodings".

On Windows code page 850 or 437 get_encodings() fails to decode ä and therefore the problem remains.

I seriously doubt that someone tested this fix on Windows.
Therefore: please use Jenkins to test not only UNIX but Windows too. I will repeat that over and over again.

jfindlay · 2015-01-08T00:06:38Z

@markuskramerIgitt, you bring up some good points. Please don't think that your criticism is unwelcome; we are very glad to incorporate changes into salt such as those you've described. I am not directly involved with the encoding efforts, perhaps @thatch45 or @s0undt3ch can respond to the issues you've raised.

If you think that you've got a better way to write sdecode, then please feel free to submit a pull request making the changes necessary. If you do, please reference this issue in the pull request.

Autotesting windows with jenkins is one of our priorities, once we get jenkins stabilized. :-)

s0undt3ch · 2015-01-31T01:16:29Z

@markuskramerIgitt It has been our will from the very first day we setup Jenkins to also test Windows, however, till this day, we haven't had the bandwidth to add support for it, we run destructive tests so we need to be able to recreate the machine over and over again(before salt is installed).

About, "we don't know where a string is coming", that comment is referring to where as in, from the CLI? From a direct method call to a python library? ... not which minion ...

As for sys.stdin.encoding it can be None, and please remember that Salt's daemons, master and minion, run detached, so this wouldn't work:

salt zjenkins-debian7-build000716 cmd.run 'python -c "print 1; import sys; print sys.stdin.encoding; print sys.stdout.encoding"'
zjenkins-debian7-build000716:
    1
    None
    None

I agree with you that this solution is not yet the perfect one, specially on Windows, but it's a better solution than no solution. It's at least a start.

marbx · 2015-02-05T11:00:21Z

@s0undt3ch, thank you for the command, it supports my suggestion!

Here, W is a Windows Minion:

root@M:~# salt W cmd.run 'python -c "import sys; print sys.stdin.encoding; print sys.stdout.encoding"'
W:
    cp437
    None

This is correct, which means that sys.stdin.encoding is useful.

It also demonstrates that sys.getdefaultencoding() is useless, as mentioned abowe:

root@M:~# salt W cmd.run 'python -c "import sys; print sys.stdin.encoding; print sys.getdefaultencoding()"'
W:
    cp437
    ascii

This is a step forward.

I kindly suggest that the encoding of the Minion can be queried without cmd.run.
Maybe it should be part of the "Grain". If the encoding can change, then maybe not the Grain, I have no experience with Grain and Pillar. But any client has an enconding and sdecode() should use that encoding.
I can only provide a code snippet for sdecode, I have no plans to compile Salt.

If sys.stdin.encoding results in None, then your client still has one enconding and there must be at least one way to find out, and Salt should know it.

Why does sys.stdin.encoding results in None on zjenkins-debian7-build000716, anyhow?
Here, M is a Debian Master, and has an utf-8 encoding:

root@M:~# python -c "import sys; print sys.stdin.encoding; print sys.getdefaultencoding()"
UTF-8
ascii

Let us get an overview of the "origin of a string" problem:

When the string comes from a CLI, then the encoding is the encoding of the Minion.
The encoding of the Minion should be (must be) known to Salt.
Why is a "direct method call to a python library" a problem?
Why is " Salt's daemons, master and minion, run detached" a problem?

Markus

joehealy · 2015-02-05T14:17:26Z

I don't think this bug applies to just windows. A file.managed with binary data can trigger it on linux as well. Rather than trying to guess the encoding, using repr() may assist by providing an escaped string if no encoding is found.

marbx · 2015-02-05T17:06:54Z

Hello Joe,
You are right, this error should also occur under Unix.
Can you create a file with a non-ascii character (like ä) on Unix and issue cmd.run ls?

I disagree: Salt should never guess the encoding. Can you think of a situation where Salt cannot know the encoding?

Because the encoding can always be found, repr() should never applied on human readable text.

As a goody, the decoder could detect binary data and escape it with repr().
When you ls a binary file, you get a question like "this is binary data, do you really want to see it?"

s0undt3ch · 2015-03-09T13:27:04Z

A step forward has been made in a247dc0 and 92dfa30.

These changes have been merged into 2015.2 and we will start investigating if this detection is trustworthy(since it's done, or at least attempted, before the process has been daemonized), and once we know we can trust that information, then we can start taking care of the Unicode problems that we've been having.

@markuskramerIgitt that command probably works on the windows minion because its probably not a detached daemon, ie, no forking has occurred. Just for information sake, is that minion using threads or multiprocessing?

marbx · 2015-04-05T10:32:43Z

Hello @s0undt3ch,
Thank you for the update.
Please tell me the steps to find out if my minion is using threads or multiprocessing.

Best regards,
Markus

Three notes:

From your comment in a247dc0 I start to get an idea of the size of the problem: ("sys.stdin.encoding is the most reliable source but it's reset to None once a process is daemonized")

In 92dfa30 you have put the encoding in the grains.
I think that makes sense: the documentation says "grains are bits of information loaded when the salt minion starts, so this information is static", and I think the encoding is static information about a minion.

Do you know this encoding guessing library: https://pypi.python.org/pypi/chardet ?
I have no experience with it.

s0undt3ch · 2015-04-05T17:18:54Z

Besides requiring an additional external library(in case we used it), Chardet is most reliable when you feed it enough data. Salt's data is usually a line or two, not enough...

Consider this:

In [3]: chardet.detect('Pão')                                                                                                                                                                  
Out[3]: {'confidence': 0.73, 'encoding': 'windows-1252'}

In [4]: chardet.detect('Alimentção')                                                                                                                                                           
Out[4]: {'confidence': 0.7525, 'encoding': 'utf-8'}

In [5]: chardet.detect('último')                                                                                                                                                               
Out[5]: {'confidence': 0.7539685890654046, 'encoding': 'ISO-8859-2'}

In [6]: chardet.detect('prático')                                                                                                                                                              
Out[6]: {'confidence': 0.99, 'encoding': 'TIS-620'}

In [7]: chardet.detect('pão alimentção último prático')
Out[7]: {'confidence': 0.9690625, 'encoding': 'utf-8'}

In [8]: chardet.detect(u'pão alimentção último prático'.encode('iso-8859-15'))
Out[8]: {'confidence': 0.99, 'encoding': 'windows-1251'}

In [9]:

Those are all Portuguese words with accent characters.
I'm using UTF-8 but when not using it, it should be one of, windows-1252/Cp1252(for windows platforms), ISO-8859-1(no euro character), ISO-8859-9(still no euro character) and the most commonly used, ISO-8859-15(with euro(€) character). windows-1252 and utf-8 were the only correct guesses...

About threading/multi-processing... Linux defaults to multi-processes, and Windows not defaults to what @UtahDave ?

marbx · 2015-04-05T18:56:08Z

Hello Pedro,
by a strange coincidence, I try to learn Portuguese since January, so
all my comprehension goes to chardet: Portuguese is very difficult to
get right, especially the nasal sound :-)

But jokes aside: thank you for letting me know the limitations of
chardet.
What I see from your examples is that chardet can be very sure 0.99 but
is wrong, as in example #6 and #8
The only reason that I mentioned it is that I have it on my list of
module candidates for removal from Deluge, the BitTorrent client I try
do tame and administer with Salt.
If you don't need to look it up: which library does chardet need?

Have you tried calling the windows command chcp (change codepage)?

Bye,
Markus

Am 05.04.2015 19:19 schrieb Pedro Algarvio:

Besides requiring an additional external library(in case we used it), Chardet is most reliable when you feed it enough data. Salt's data is usually a line or two, not enough...

Consider this:

In [3]: chardet.detect('Pão')
Out[3]: {'confidence': 0.73, 'encoding': 'windows-1252'}

In [4]: chardet.detect('Alimentção')
Out[4]: {'confidence': 0.7525, 'encoding': 'utf-8'}

In [5]: chardet.detect('último')
Out[5]: {'confidence': 0.7539685890654046, 'encoding': 'ISO-8859-2'}

In [6]: chardet.detect('prático')
Out[6]: {'confidence': 0.99, 'encoding': 'TIS-620'}

In [7]: chardet.detect('pão alimentção último prático')
Out[7]: {'confidence': 0.9690625, 'encoding': 'utf-8'}

In [8]: chardet.detect(u'pão alimentção último prático'.encode('iso-8859-15'))
Out[8]: {'confidence': 0.99, 'encoding': 'windows-1251'}

In [9]:

Those are all Portuguese words with accent characters.
I'm using UTF-8 but when not using it, it should be one of, windows-1252/Cp1252(for windows platforms), ISO-8859-1(no euro character), ISO-8859-9(still no euro character) and the most commonly used, ISO-8859-15(with euro(EUR) character). windows-1252 and utf-8 were the only correct guesses...

About threading/multi-processing... Linux defaults to multi-processes, and Windows not defaults to what @UtahDave [1] ?

Reply to this email directly or view it on GitHub [2].

Links:

[1] https://github.com/UtahDave
[2] #19166 (comment)

terminalmage · 2015-06-02T21:27:33Z

The traceback was fixed a while back in 2014.7.0, but the characters still aren't being printed properly. I have opened a new issue to track this: #24344

marbx mentioned this issue Dec 21, 2014

State fails with cmd.run on Windows #18503

Closed

jfindlay self-assigned this Dec 22, 2014

jfindlay added this to the Under Review milestone Dec 22, 2014

jfindlay added Bug broken, incorrect, or confusing behavior severity-low 4th level, cosemtic problems, work around exists help-wanted Community help is needed to resolve this labels Dec 22, 2014

jfindlay modified the milestones: Approved, Under Review Dec 22, 2014

jfindlay removed their assignment Dec 22, 2014

jfindlay removed the help-wanted Community help is needed to resolve this label Dec 22, 2014

jfindlay added the fixed-pls-verify fix is linked, bug author to confirm fix label Dec 22, 2014

marbx closed this as completed Jan 7, 2015

marbx changed the title ~~Salt fails with UnicodeDecodeError on non-ascii characters in codebase 437 (Windows).~~ Salt fails with UnicodeDecodeError on non-ascii characters on Windows (code pages 850 or 437). Jan 7, 2015

marbx reopened this Jan 7, 2015

marbx closed this as completed Feb 5, 2015

marbx reopened this Feb 5, 2015

marbx mentioned this issue Feb 5, 2015

UnicodeDecodeError in highstate outputter #16139

Closed

rallytime removed the fixed-pls-verify fix is linked, bug author to confirm fix label Feb 24, 2015

terminalmage added Windows P2 Priority 2 labels Jun 2, 2015

terminalmage closed this as completed Jun 2, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Salt fails with UnicodeDecodeError on non-ascii characters on Windows (code pages 850 or 437). #19166

Salt fails with UnicodeDecodeError on non-ascii characters on Windows (code pages 850 or 437). #19166

marbx commented Dec 21, 2014

jfindlay commented Dec 22, 2014

jfindlay commented Dec 22, 2014

marbx commented Jan 7, 2015

jfindlay commented Jan 8, 2015

s0undt3ch commented Jan 31, 2015

marbx commented Feb 5, 2015

joehealy commented Feb 5, 2015

marbx commented Feb 5, 2015

s0undt3ch commented Mar 9, 2015

marbx commented Apr 5, 2015

s0undt3ch commented Apr 5, 2015

marbx commented Apr 5, 2015

terminalmage commented Jun 2, 2015

Salt fails with UnicodeDecodeError on non-ascii characters on Windows (code pages 850 or 437). #19166

Salt fails with UnicodeDecodeError on non-ascii characters on Windows (code pages 850 or 437). #19166

Comments

marbx commented Dec 21, 2014

jfindlay commented Dec 22, 2014

jfindlay commented Dec 22, 2014

marbx commented Jan 7, 2015

jfindlay commented Jan 8, 2015

s0undt3ch commented Jan 31, 2015

marbx commented Feb 5, 2015

joehealy commented Feb 5, 2015

marbx commented Feb 5, 2015

s0undt3ch commented Mar 9, 2015

marbx commented Apr 5, 2015

s0undt3ch commented Apr 5, 2015

marbx commented Apr 5, 2015

Links:

terminalmage commented Jun 2, 2015