Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Salt fails with UnicodeDecodeError on non-ascii characters on Windows (code pages 850 or 437). #19166

Closed
marbx opened this issue Dec 21, 2014 · 13 comments
Labels
Bug broken, incorrect, or confusing behavior P2 Priority 2 severity-low 4th level, cosemtic problems, work around exists Windows
Milestone

Comments

@marbx
Copy link
Contributor

marbx commented Dec 21, 2014

Hello
I use a Windows 7 Enterprise 64 bit minion with a German locale.

Two commands (of many) fail: cmd.run dir and cmd.run whoami
To reproduce, you need a German locale or a file with a non-ascii character (like ä).

From reading the stacktrace:
0x84 (hex) is 132 (decimal).
Character 132 in Codepage 437 (Windows) is ä, which you find...
at position 8 of of dir = Datenträger in Laufwerk c: (= Volume in drive c:).
at position 10 of of cmd.run whoami = nt-autorität\system (= nt-authority\system).

Character 132 in UTF-8 is isolatin-1. Character 132 in isolatin-1 is undefined.

A utf-8 decoder fails with UnicodeDecodeError on character 132.
An example is raw_input().decode('utf-8') found at http://stackoverflow.com/questions/22772888/german-umlauts-read-in-with-raw-input-in-python-2-7
I assume utf_8_decode(), used in salt/output/nested.py, behaves the same.

The above stackoverflow link contains a possible solution.

The stacktraces:

root@MP2Pdebian:/home/markus# uname -a
Linux MP2Pdebian 3.2.0-4-amd64 #1 SMP Debian 3.2.63-2+deb7u2 x86_64 GNU/Linux
root@MP2Pdebian:/home/markus# salt --versions
           Salt: 2014.7.0
         Python: 2.7.3 (default, Mar 13 2014, 11:03:55)
         Jinja2: 2.6
       M2Crypto: 0.21.1
 msgpack-python: 0.1.10
   msgpack-pure: Not Installed
       pycrypto: 2.6
        libnacl: Not Installed
         PyYAML: 3.10
          ioflo: Not Installed
          PyZMQ: 13.1.0
           RAET: Not Installed
            ZMQ: 3.2.3
           Mako: 0.7.0

root@MP2Pdebian:/home/markus# salt MarkusTp test.version
MarkusTp:
    2014.7.0

root@MP2Pdebian:/home/markus# salt MarkusTp cmd.run dir
[ERROR   ] An un-handled exception was caught by salt's global exception handler:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x84 in position 8: invalid start byte
Traceback (most recent call last):
--snip--
  File "/usr/lib/python2.7/dist-packages/salt/output/nested.py", line 123, in output
    return nest.display(ret, __opts__.get('nested_indent', 0), '', '')
  File "/usr/lib/python2.7/dist-packages/salt/output/nested.py", line 114, in display
    out = self.display(val, indent + 4, '', out)
  File "/usr/lib/python2.7/dist-packages/salt/output/nested.py", line 89, in display
    prefix=prefix)
  File "/usr/lib/python2.7/dist-packages/salt/output/nested.py", line 58, in ustring
    indent, color, prefix, msg.decode(encoding), endc, suffix)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x84 in position 8: invalid start byte


root@MP2Pdebian:~# salt MarkusTp cmd.run whoami
[ERROR   ] An un-handled exception was caught by salt's global exception handler:
--snip--
UnicodeDecodeError: 'utf8' codec can't decode byte 0x84 in position 10: invalid start byte
@jfindlay jfindlay self-assigned this Dec 22, 2014
@jfindlay jfindlay added this to the Under Review milestone Dec 22, 2014
@jfindlay jfindlay added Bug broken, incorrect, or confusing behavior severity-low 4th level, cosemtic problems, work around exists help-wanted Community help is needed to resolve this labels Dec 22, 2014
@jfindlay jfindlay modified the milestones: Approved, Under Review Dec 22, 2014
@jfindlay jfindlay removed their assignment Dec 22, 2014
@jfindlay
Copy link
Contributor

@markuskramerIgitt, thanks for opening a new issue. It seems that this has been fixed already on the 2014.7 branch. A new function called salt.utils.sdecode was created for this purpose and has been applied to the line causing the problem.

@jfindlay jfindlay removed the help-wanted Community help is needed to resolve this label Dec 22, 2014
@jfindlay
Copy link
Contributor

You could try updating your salt minion, but we'll be releasing 2014.7.1 within the next few weeks.

@jfindlay jfindlay added the fixed-pls-verify fix is linked, bug author to confirm fix label Dec 22, 2014
@marbx
Copy link
Contributor Author

marbx commented Jan 7, 2015

Hello @jfindlay
please pardon me if I sound harsh and like a broken record.

The function sdecode(string) is documented with

Since we don't know where a string is coming from and that string will
need to be safely decoded, this function will attempt to decode the string
until if has a working string that does not stack trace

Why? When you issue cmd.run, you know the minion where the string is coming back from.
Network configuration software should always know where "the string is coming from".

Apart from that sdecode() for cmd.run does not solve the problem because it uses get_encodings() (luckily the function above) which in turn uses sys.getdefaultencoding().

sys.getdefaultencoding() returns ascii on Windows and Darwin.

Please forget sys.getdefaultencoding()
Documentation says Return the name of the current default string encoding used by the Unicode implementation.

Instead use sys.stdin.encoding
sys.stdin.encoding returns cp850 on Windows and UTF-8 on Darwin.
Character 132 is ä in both code page 850 and 437, so this is the correct function.

get_encodings() always appends utf-8 and latin-1
Only therefore non-ascii characters on UNIX can be decoded.

The function get_encodings() is documented with

return a list of string encodings to try

Why? Use the right encoding. Networking configuration software should not "try a list of string encodings".

On Windows code page 850 or 437 get_encodings() fails to decode ä and therefore the problem remains.

I seriously doubt that someone tested this fix on Windows.
Therefore: please use Jenkins to test not only UNIX but Windows too. I will repeat that over and over again.

@marbx marbx closed this as completed Jan 7, 2015
@marbx marbx changed the title Salt fails with UnicodeDecodeError on non-ascii characters in codebase 437 (Windows). Salt fails with UnicodeDecodeError on non-ascii characters on Windows (code pages 850 or 437). Jan 7, 2015
@marbx marbx reopened this Jan 7, 2015
@jfindlay
Copy link
Contributor

jfindlay commented Jan 8, 2015

@markuskramerIgitt, you bring up some good points. Please don't think that your criticism is unwelcome; we are very glad to incorporate changes into salt such as those you've described. I am not directly involved with the encoding efforts, perhaps @thatch45 or @s0undt3ch can respond to the issues you've raised.

If you think that you've got a better way to write sdecode, then please feel free to submit a pull request making the changes necessary. If you do, please reference this issue in the pull request.

Autotesting windows with jenkins is one of our priorities, once we get jenkins stabilized. :-)

@s0undt3ch
Copy link
Collaborator

@markuskramerIgitt It has been our will from the very first day we setup Jenkins to also test Windows, however, till this day, we haven't had the bandwidth to add support for it, we run destructive tests so we need to be able to recreate the machine over and over again(before salt is installed).

About, "we don't know where a string is coming", that comment is referring to where as in, from the CLI? From a direct method call to a python library? ... not which minion ...

As for sys.stdin.encoding it can be None, and please remember that Salt's daemons, master and minion, run detached, so this wouldn't work:

salt zjenkins-debian7-build000716 cmd.run 'python -c "print 1; import sys; print sys.stdin.encoding; print sys.stdout.encoding"'
zjenkins-debian7-build000716:
    1
    None
    None

I agree with you that this solution is not yet the perfect one, specially on Windows, but it's a better solution than no solution. It's at least a start.

@marbx
Copy link
Contributor Author

marbx commented Feb 5, 2015

@s0undt3ch, thank you for the command, it supports my suggestion!

Here, W is a Windows Minion:

root@M:~# salt W cmd.run 'python -c "import sys; print sys.stdin.encoding; print sys.stdout.encoding"'
W:
    cp437
    None

This is correct, which means that sys.stdin.encoding is useful.

It also demonstrates that sys.getdefaultencoding() is useless, as mentioned abowe:

root@M:~# salt W cmd.run 'python -c "import sys; print sys.stdin.encoding; print sys.getdefaultencoding()"'
W:
    cp437
    ascii

This is a step forward.

I kindly suggest that the encoding of the Minion can be queried without cmd.run.
Maybe it should be part of the "Grain". If the encoding can change, then maybe not the Grain, I have no experience with Grain and Pillar. But any client has an enconding and sdecode() should use that encoding.
I can only provide a code snippet for sdecode, I have no plans to compile Salt.

If sys.stdin.encoding results in None, then your client still has one enconding and there must be at least one way to find out, and Salt should know it.

Why does sys.stdin.encoding results in None on zjenkins-debian7-build000716, anyhow?
Here, M is a Debian Master, and has an utf-8 encoding:

root@M:~# python -c "import sys; print sys.stdin.encoding; print sys.getdefaultencoding()"
UTF-8
ascii

Let us get an overview of the "origin of a string" problem:

  • When the string comes from a CLI, then the encoding is the encoding of the Minion.
  • The encoding of the Minion should be (must be) known to Salt.
  • Why is a "direct method call to a python library" a problem?
  • Why is " Salt's daemons, master and minion, run detached" a problem?

Markus

@joehealy
Copy link
Contributor

joehealy commented Feb 5, 2015

I don't think this bug applies to just windows. A file.managed with binary data can trigger it on linux as well. Rather than trying to guess the encoding, using repr() may assist by providing an escaped string if no encoding is found.

@marbx
Copy link
Contributor Author

marbx commented Feb 5, 2015

Hello Joe,
You are right, this error should also occur under Unix.
Can you create a file with a non-ascii character (like ä) on Unix and issue cmd.run ls?

I disagree: Salt should never guess the encoding. Can you think of a situation where Salt cannot know the encoding?

Because the encoding can always be found, repr() should never applied on human readable text.

As a goody, the decoder could detect binary data and escape it with repr().
When you ls a binary file, you get a question like "this is binary data, do you really want to see it?"

@rallytime rallytime removed the fixed-pls-verify fix is linked, bug author to confirm fix label Feb 24, 2015
@s0undt3ch
Copy link
Collaborator

A step forward has been made in a247dc0 and 92dfa30.

These changes have been merged into 2015.2 and we will start investigating if this detection is trustworthy(since it's done, or at least attempted, before the process has been daemonized), and once we know we can trust that information, then we can start taking care of the Unicode problems that we've been having.

@markuskramerIgitt that command probably works on the windows minion because its probably not a detached daemon, ie, no forking has occurred. Just for information sake, is that minion using threads or multiprocessing?

@marbx
Copy link
Contributor Author

marbx commented Apr 5, 2015

Hello @s0undt3ch,
Thank you for the update.
Please tell me the steps to find out if my minion is using threads or multiprocessing.

Best regards,
Markus

Three notes:

From your comment in a247dc0 I start to get an idea of the size of the problem: ("sys.stdin.encoding is the most reliable source but it's reset to None once a process is daemonized")

In 92dfa30 you have put the encoding in the grains.
I think that makes sense: the documentation says "grains are bits of information loaded when the salt minion starts, so this information is static", and I think the encoding is static information about a minion.

Do you know this encoding guessing library: https://pypi.python.org/pypi/chardet ?
I have no experience with it.

@s0undt3ch
Copy link
Collaborator

Besides requiring an additional external library(in case we used it), Chardet is most reliable when you feed it enough data. Salt's data is usually a line or two, not enough...

Consider this:

In [3]: chardet.detect('Pão')                                                                                                                                                                  
Out[3]: {'confidence': 0.73, 'encoding': 'windows-1252'}

In [4]: chardet.detect('Alimentção')                                                                                                                                                           
Out[4]: {'confidence': 0.7525, 'encoding': 'utf-8'}

In [5]: chardet.detect('último')                                                                                                                                                               
Out[5]: {'confidence': 0.7539685890654046, 'encoding': 'ISO-8859-2'}

In [6]: chardet.detect('prático')                                                                                                                                                              
Out[6]: {'confidence': 0.99, 'encoding': 'TIS-620'}

In [7]: chardet.detect('pão alimentção último prático')
Out[7]: {'confidence': 0.9690625, 'encoding': 'utf-8'}

In [8]: chardet.detect(u'pão alimentção último prático'.encode('iso-8859-15'))
Out[8]: {'confidence': 0.99, 'encoding': 'windows-1251'}

In [9]: 

Those are all Portuguese words with accent characters.
I'm using UTF-8 but when not using it, it should be one of, windows-1252/Cp1252(for windows platforms), ISO-8859-1(no euro character), ISO-8859-9(still no euro character) and the most commonly used, ISO-8859-15(with euro(€) character). windows-1252 and utf-8 were the only correct guesses...

About threading/multi-processing... Linux defaults to multi-processes, and Windows not defaults to what @UtahDave ?

@marbx
Copy link
Contributor Author

marbx commented Apr 5, 2015

Hello Pedro,
by a strange coincidence, I try to learn Portuguese since January, so
all my comprehension goes to chardet: Portuguese is very difficult to
get right, especially the nasal sound :-)

But jokes aside: thank you for letting me know the limitations of
chardet.
What I see from your examples is that chardet can be very sure 0.99 but
is wrong, as in example #6 and #8
The only reason that I mentioned it is that I have it on my list of
module candidates for removal from Deluge, the BitTorrent client I try
do tame and administer with Salt.
If you don't need to look it up: which library does chardet need?

Have you tried calling the windows command chcp (change codepage)?

Bye,
Markus

Am 05.04.2015 19:19 schrieb Pedro Algarvio:

Besides requiring an additional external library(in case we used it), Chardet is most reliable when you feed it enough data. Salt's data is usually a line or two, not enough...

Consider this:

In [3]: chardet.detect('Pão')
Out[3]: {'confidence': 0.73, 'encoding': 'windows-1252'}

In [4]: chardet.detect('Alimentção')
Out[4]: {'confidence': 0.7525, 'encoding': 'utf-8'}

In [5]: chardet.detect('último')
Out[5]: {'confidence': 0.7539685890654046, 'encoding': 'ISO-8859-2'}

In [6]: chardet.detect('prático')
Out[6]: {'confidence': 0.99, 'encoding': 'TIS-620'}

In [7]: chardet.detect('pão alimentção último prático')
Out[7]: {'confidence': 0.9690625, 'encoding': 'utf-8'}

In [8]: chardet.detect(u'pão alimentção último prático'.encode('iso-8859-15'))
Out[8]: {'confidence': 0.99, 'encoding': 'windows-1251'}

In [9]:

Those are all Portuguese words with accent characters.
I'm using UTF-8 but when not using it, it should be one of, windows-1252/Cp1252(for windows platforms), ISO-8859-1(no euro character), ISO-8859-9(still no euro character) and the most commonly used, ISO-8859-15(with euro(EUR) character). windows-1252 and utf-8 were the only correct guesses...

About threading/multi-processing... Linux defaults to multi-processes, and Windows not defaults to what @UtahDave [1] ?

Reply to this email directly or view it on GitHub [2].

Links:

[1] https://github.com/UtahDave
[2] #19166 (comment)

@terminalmage terminalmage added Windows P2 Priority 2 labels Jun 2, 2015
@terminalmage
Copy link
Contributor

The traceback was fixed a while back in 2014.7.0, but the characters still aren't being printed properly. I have opened a new issue to track this: #24344

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug broken, incorrect, or confusing behavior P2 Priority 2 severity-low 4th level, cosemtic problems, work around exists Windows
Projects
None yet
Development

No branches or pull requests

6 participants