UnicodeDecodeError when reading value labels #4

Closed
mdbecker opened this issue Oct 3, 2018 · 18 comments

Comments

@mdbecker

mdbecker commented Oct 3, 2018

Thanks for the great library! I'm running into a UnicodeDecodeError when reading a catalog file. I tried reading the catalog separately, per the instructions, to isolate the problem. I also tried switching the encoding to 'latin1' and 'windows-1252', but the error is unchanged. Looking through run_readstat_parser:

value_label_handler = <readstat_value_label_handler> handle_value_label
note_handler = <readstat_note_handler> handle_note
retcode = readstat_set_metadata_handler(parser, metadata_handler)
retcode = readstat_set_variable_handler(parser, variable_handler)
retcode = readstat_set_value_label_handler(parser, value_label_handler)
retcode = readstat_set_note_handler(parser, note_handler)
if not metaonly:
    retcode = readstat_set_value_handler(parser, value_handler)
# if the user set the encoding manually
if data.user_encoding:
    encoding_bytes = data.user_encoding.encode("utf-8")
    readstat_set_file_character_encoding(parser, <char *> encoding_bytes)
it looks like the encoding is set after handle_value_label is called? Is this a bug? Is there any workaround to this? Thanks!

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-9-3ded1d336483> in <module>()
----> 1 df_empty, catalog = pyreadstat.read_sas7bcat('formats.sas7bcat', encoding='windows-1252')

pyreadstat/pyreadstat.pyx in pyreadstat.pyreadstat.read_sas7bcat()

pyreadstat/_readstat_parser.pyx in pyreadstat._readstat_parser.run_conversion()

pyreadstat/_readstat_parser.pyx in pyreadstat._readstat_parser.handle_value_label()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 35: invalid start byte

P.S.
I tried forking the repo and moving the encoding part of the code before the call to handle_value_label but it didn't seem to help 😢 unfortunately:

https://github.com/mdbecker/pyreadstat/blob/cd64eef89ce328a3fb717f1e9675c4b9792c3b89/pyreadstat/_readstat_parser.pyx#L550-L560

mdbecker added a commit to mdbecker/pyreadstat that referenced this issue Oct 3, 2018
@ofajardo
Collaborator

ofajardo commented Oct 3, 2018

Hi,

No, all the code you selected just sets things up; nothing is called until line 564 (error = parse_func(parser, filename, ctx)), so I think the position of that line is not important. I actually played with it a bit: starting from a good file that was read correctly, I changed the encoding until it crashed, which makes me think it is fine.
If you feel brave, you can move the line a bit earlier and recompile to see if it helps; if it does, I would be very happy to take the change. If you cannot, you can send me the file and I can try for you (but late next week at the earliest).

Otherwise the real solution would be to find out the actual encoding of that file ... it may also be that the file is corrupt ... is it possible to read it correctly in SAS? What does SAS say the encoding is?

In general, SAS catalog files seem really painful to deal with: even in SAS they may not work if they were saved in an older SAS version or come from SAS running on a different OS. For that reason, people over here suggest avoiding them and just keeping the labels in the code.

I have seen a similar thing here: tidyverse/haven#312. I downloaded the sample files and was also not able to read them. I think R is still not doing it correctly either (it does not fail, but the labels I see contain the strange character). What do you get if you try to read your file with R-Haven?

As a last resort I could emulate R's current behavior: I can go from C bytes to Python bytes first and then convert them to strings (your file would fail in this second step), while adding a flag to skip that conversion or, if the conversion fails, to fall back to bytes. You would get Python bytes with some strange characters in them, and you would have to handle that yourself in some way ... but it may be unreadable anyway, so I am not sure how much that helps, and in addition it would slow down the code a bit. It would take me a while to make such a change.

@ofajardo
Collaborator

ofajardo commented Oct 3, 2018

quick heads up: I moved lines 559 - 561 to before 550 and tried with the bad sas7bcat file from the R issue. I got the same error as before (UnicodeError). So it seems that does not solve the issue.

@mdbecker
Author

mdbecker commented Oct 3, 2018

🤔 I don't actually have SAS so I can't check if the file is corrupt or not but I know people in my organization have read this into SAS so I have to assume it works with at least some version of SAS. It would be nice if there was a flag to read as bytes instead of string. Maybe I could read it using python 2.7, dump it to json, and then re-read it in python 3.6? I could probably upload the file somewhere if it helps with debugging? It sounds though like you've been able to reproduce the issue so you don't really need my file?

@mdbecker
Author

mdbecker commented Oct 3, 2018

P.S. The error is 'utf-8' codec can't decode byte even when I set the encoding to something other than utf-8. Why is this? Also the stack trace says the error is happening in the handle_value_label function but I don't see any code in there dealing with encoding/decoding so I'm a little confused as to where the error is being thrown from (sorry I'm a cython newb). My only guess is that the error has something to do with the casting being done on line 459?

@ofajardo
Collaborator

ofajardo commented Oct 3, 2018

Yeah, it should be line 459. Basically, what happens is this: the C code of readstat should already deliver good UTF-8 bytes, through its use of the iconv library. Then, in line 459, we simply cast those C bytes to a Python UTF-8 string. But as there are bad characters that cannot be interpreted, it fails.
Something you could do for now is to change <str> c_str_value to <bytes> c_str_value in line 459; it would also be necessary to change the declaration of py_str_value from str to bytes in line 436. Maybe there is something else, but that may be enough for you to get bytes instead of a string.
Then you can handle that however you want; you can, for example, ignore the bad characters ...
I think that's what I would do to get started. If it solves the issue, I would later try to translate those bytes into strings and, if an error occurs, leave them as bytes and raise a warning.

Regarding Python 2.7, I am not sure why you need that workaround; Python 3 bytes will work for this case.
As for uploading your file: it depends. If you make the change I suggest and are happy with it, I can just add the checks etc. If it does not work, I can try other things with that sample file and then you can check my changes with your file. If it still does not work, then I would like to take a look at your file. Or you can upload it from the start so I can check from the beginning. It depends on how sensitive the data in it is.
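
The fallback described above (try to turn the bytes into a string, and on failure keep the bytes and raise a warning) could be sketched in plain Python roughly as follows. This is only an illustration, not the code in _readstat_parser.pyx; the helper name decode_or_bytes is hypothetical.

```python
import warnings


def decode_or_bytes(raw: bytes, encoding: str = "utf-8"):
    """Return `raw` decoded as text; if decoding fails, warn and return the bytes unchanged.

    Hypothetical helper for illustration only; it is not part of pyreadstat.
    """
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        warnings.warn(f"could not decode {raw!r} as {encoding}; leaving it as bytes")
        return raw


print(decode_or_bytes(b"Male"))     # decodes cleanly to a str
print(decode_or_bytes(b"2\xb0"))    # emits a warning and stays bytes
```

If dropping the offending characters is acceptable, raw.decode("utf-8", errors="ignore") is the standard-library way to do that, at the cost of silently losing data.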

@ofajardo
Collaborator

ofajardo commented Oct 3, 2018

OK, for me the lines to change are 449 and 452, so that value_label_name is bytes:

cdef bytes value_label_name

var_label = <str> val_labels
value_label_name = <bytes> label

That lets me read that bad sas7bcat file. Can you test with yours? You may also need to change 459 or 451. I think these are the three possibilities, and a permanent solution would probably have to change all of them.

If you then check the meta.value_labels you should see some funny characters.

@mdbecker
Author

mdbecker commented Oct 3, 2018

Alright, let me give it a go. If I'm understanding you properly, it might be that I'm passing the wrong encoding to iconv? Maybe I should see what chardet thinks the encoding of the file is?

@mdbecker
Author

mdbecker commented Oct 3, 2018

None of those changes seemed to have any effect. Am I recompiling correctly?

pip uninstall pyreadstat
pip install --no-cache-dir git+https://github.com/mdbecker/pyreadstat.git

I don't see anything that looks like output from a compiler so I'm worried that it's caching something somewhere. Here's the output of the pip install:

Collecting git+https://github.com/mdbecker/pyreadstat.git
  Cloning https://github.com/mdbecker/pyreadstat.git to /private/var/folders/55/nkkrq4d93n90cp0j_mknh3dw0000gn/T/pip-req-build-_g74_25j
Installing collected packages: pyreadstat
  Running setup.py install for pyreadstat ... done
Successfully installed pyreadstat-0.1.8

@ofajardo
Collaborator

ofajardo commented Oct 3, 2018

"it might be that I'm passing the wrong encoding to iconv?" -> Yes, that is the first possibility. In the documentation there is a link to a list of all possible iconv encodings. Put them into a list and loop through them to see if any works. The second possibility is that, from iconv's point of view, the file is malformed. I would not be too surprised, because sometimes not even SAS can read sas7bcat files correctly.
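
A minimal sketch of that loop, assuming the reader is any callable that accepts an encoding keyword the way read_sas7bcat does in the traceback above. The helper name first_working_encoding is hypothetical, and the broad except is deliberate, since a wrong encoding may surface as different exception types.

```python
def first_working_encoding(reader, path, encodings):
    """Try each encoding in turn.

    Returns (encoding, result) for the first encoding `reader` accepts,
    or (None, None) if every candidate fails.
    Hypothetical helper for illustration; `reader` is passed in by the caller.
    """
    for enc in encodings:
        try:
            result = reader(path, encoding=enc)
        except Exception as exc:  # broad on purpose: failures may differ by encoding
            print(f"{enc}: failed ({type(exc).__name__})")
            continue
        return enc, result
    return None, None
```

With pyreadstat installed, a call might look like first_working_encoding(pyreadstat.read_sas7bcat, 'formats.sas7bcat', ['UTF-8', 'LATIN1', 'WINDOWS-1252']). A read that succeeds does not prove the labels are right, so the returned value labels still need a visual check.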

Regarding compilation: first do pip uninstall pyreadstat, then clone the repo, make the changes in the code, open a shell, navigate to the repo folder and run:

python3 setup.py build_ext --inplace --use-cython

Now you can import pyreadstat, but only from within this folder, because the library has been compiled and placed locally.

@mdbecker
Author

mdbecker commented Oct 3, 2018

FYI, using Python 2.7 seems to work. I haven't found any funny characters yet, and meta.file_encoding says 'windows-1252'. The only issue I ran into is that it didn't apply the categoricals in the catalog file for some reason. I'm not sure how these are supposed to get applied, but I did notice that meta.variable_value_labels is an empty dictionary. In any case, it's not too hard for me to apply the catalog by hand, so at least I have a reasonable workaround for now. I'll try recompiling with the changes in a bit and send you more details. Thanks!

@ofajardo
Collaborator

ofajardo commented Oct 3, 2018

wow, very smart move!
And it is a very intriguing finding that in Python 2.7 you don't see any strange characters in meta.value_labels. What if you try to convert them to unicode? If you don't get any error, that would be super strange, because it would mean the problem is somehow in Python 3.
In order to apply the catalog labels, the sas7bdat must have meta.variable_to_label. Basically, meta.variable_to_label says that variable X has a label set Y. Then meta.value_labels from the sas7bcat says that for label set Y, value 1 has the label Male and value 2 has the label Female. Merging both, you get meta.variable_value_labels, which says that for variable X value 1 is labeled Male and value 2 is labeled Female, and this you can apply directly in pandas. So it needs one piece coming from the sas7bdat and one piece from the sas7bcat.
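
The merge described above can be sketched with plain dictionaries. The variable name "sex" and the label-set name "SEXFMT" are made up for illustration, and merge_labels is a hypothetical helper, not pyreadstat's own implementation.

```python
def merge_labels(variable_to_label, value_labels):
    """Combine variable -> label-set name (from the sas7bdat) with
    label-set name -> {value: label} (from the sas7bcat)."""
    merged = {}
    for variable, label_name in variable_to_label.items():
        # Skip variables whose label set is missing from the catalog.
        if label_name in value_labels:
            merged[variable] = value_labels[label_name]
    return merged


variable_to_label = {"sex": "SEXFMT"}                    # piece from the sas7bdat
value_labels = {"SEXFMT": {1: "Male", 2: "Female"}}      # piece from the sas7bcat
variable_value_labels = merge_labels(variable_to_label, value_labels)
print(variable_value_labels)  # {'sex': {1: 'Male', 2: 'Female'}}
```

Applying the result by hand in pandas could then be as simple as df["sex"].map(variable_value_labels["sex"]), assuming df holds the data read from the sas7bdat.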

@mdbecker
Author

mdbecker commented Oct 3, 2018

Well I found the bad character!
The string had '2\xb0' in it which when decoded with 'latin1' results in 2°. Oddly I tried passing in the encoding of latin1 to read_sas7bcat but I still get the error. It seems like it's using utf-8 even when I specify latin1?
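
For reference, the behavior of that byte sequence can be reproduced in plain Python, independent of pyreadstat: 0xb0 is a continuation byte in UTF-8 and cannot start a character, while in latin1 and windows-1252 it is the degree sign.

```python
raw = b"2\xb0"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # Same kind of error as in the traceback: 0xb0 is not a valid UTF-8 start byte.
    print(exc)

print(raw.decode("latin1"))        # 2°
print(raw.decode("windows-1252"))  # 2°
```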

@ofajardo
Collaborator

ofajardo commented Oct 3, 2018

wow, you're a good detective!
As far as I remember, if you pass some nonsensical encoding it breaks, which makes me think that in general it does listen to the parameter ... so I don't know what's happening. At this point it would be good to determine whether the problem is in my code or in readstat ... the only way to check that is to try the same thing in R-haven ... or it could also be that the bug is in iconv ...

@mdbecker
Author

mdbecker commented Oct 3, 2018

Alright I'll try to get that installed and report back. Thanks for the assistance!

@ofajardo
Collaborator

ofajardo commented Oct 3, 2018

"2\xb0" can be decoded with both latin1 and windows-1252, and since iconv thinks it is windows-1252 it should handle it correctly. If you pass latin1, is meta.file_encoding then latin1? If yes, that's OK.
The problem is that after iconv handles it, it should be valid UTF-8, and it's not ...
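
To make the expectation concrete, here is a small sketch of what a correct windows-1252 to UTF-8 conversion of that string produces. Python's built-in codecs stand in for iconv here, which is an assumption made for illustration; transcode_to_utf8 is a hypothetical helper.

```python
def transcode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from `source_encoding` to UTF-8 (Python codecs standing in for iconv)."""
    return raw.decode(source_encoding).encode("utf-8")


converted = transcode_to_utf8(b"2\xb0", "windows-1252")
print(converted)                  # b'2\xc2\xb0'
print(converted.decode("utf-8"))  # 2°
```

A lone 0xb0 reaching the UTF-8 decode step, as in the original traceback, therefore suggests the bytes were not transcoded at all before being cast to a string.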

@ofajardo
Collaborator

OK, I had a deeper look, and I think I found the bug in readstat. I have filed an issue on their repo; it should be relatively easy to fix, I think. I'll update pyreadstat once it gets fixed in readstat; then you can try with your file again, and hopefully that will solve the issue.

@ofajardo
Collaborator

ofajardo commented Oct 12, 2018

@mdbecker I have updated pyreadstat and the issue with the sample sas7bcat file I have is solved. It would be great if you could confirm it also solves the issue for your file.
I have uploaded the new version (0.1.9) to PyPI, so you can install it with pip install pyreadstat

@ofajardo
Collaborator

This issue is most likely solved. Closed after 1 month of inactivity. Please re-open the issue if necessary.
