UnicodeDecodeError when reading value labels #4

mdbecker · 2018-10-03T15:49:51Z

Thanks for the great library! I'm running into a UnicodeDecodeError when reading a catalog file. I tried reading the catalog separately per the instructions to try and isolate the problem. I tried switching the encoding to 'latin1' and 'windows-1252' but the error seems unchanged as a result. Looking through run_readstat_parser

pyreadstat/pyreadstat/_readstat_parser.pyx

Lines 547 to 561 in 6e90e91

    
    value_label_handler = <readstat_value_label_handler> handle_value_label
    note_handler = <readstat_note_handler> handle_note
    retcode = readstat_set_metadata_handler(parser, metadata_handler)
    retcode = readstat_set_variable_handler(parser, variable_handler)
    retcode = readstat_set_value_label_handler(parser, value_label_handler)
    retcode = readstat_set_note_handler(parser, note_handler)
    if not metaonly:
        retcode = readstat_set_value_handler(parser, value_handler)
    # if the user set the encoding manually
    if data.user_encoding:
        encoding_bytes = data.user_encoding.encode("utf-8")
        readstat_set_file_character_encoding(parser, <char *> encoding_bytes)

It looks like the encoding is set after handle_value_label is called. Is this a bug? Is there a workaround? Thanks!

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-9-3ded1d336483> in <module>()
----> 1 df_empty, catalog = pyreadstat.read_sas7bcat('formats.sas7bcat', encoding='windows-1252')

pyreadstat/pyreadstat.pyx in pyreadstat.pyreadstat.read_sas7bcat()

pyreadstat/_readstat_parser.pyx in pyreadstat._readstat_parser.run_conversion()

pyreadstat/_readstat_parser.pyx in pyreadstat._readstat_parser.handle_value_label()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 35: invalid start byte

P.S.
I tried forking the repo and moving the encoding part of the code before the call to handle_value_label but it didn't seem to help 😢 unfortunately:

https://github.com/mdbecker/pyreadstat/blob/cd64eef89ce328a3fb717f1e9675c4b9792c3b89/pyreadstat/_readstat_parser.pyx#L550-L560


ofajardo · 2018-10-03T16:39:51Z

Hi,

No, all the code you selected is just setting things up; nothing is called until line 564 (error = parse_func(parser, filename, ctx)), so I don't think the position of that line matters. I actually played with it a bit: starting from a good file that was read correctly, I kept changing the encoding until it crashed, which makes me think the encoding setting is being honored.
If you feel brave you can move the line a bit earlier and recompile to see if it helps; if it does, I would be very happy to take the change. If you can't, you can send me the file and I can try it for you (but late next week at the earliest).
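The difference between registering a handler and invoking it can be sketched in plain Python. This is only an illustration of the callback pattern described above: ToyParser and its methods are hypothetical stand-ins, not part of readstat or pyreadstat.

```python
# Toy stand-in for a callback-based parser. ToyParser and its methods are
# hypothetical; they only mimic the "register now, call later" pattern.
class ToyParser:
    def __init__(self):
        self.handlers = {}
        self.encoding = None

    def set_handler(self, name, func):
        # Registration only stores the function; nothing is called here.
        self.handlers[name] = func

    def set_encoding(self, encoding):
        self.encoding = encoding

    def parse(self, raw_labels):
        # Handlers run only now, after every setup call has completed.
        handler = self.handlers["value_label"]
        return [handler(raw, self.encoding) for raw in raw_labels]


def handle_value_label(raw, encoding):
    return raw.decode(encoding)


parser = ToyParser()
parser.set_handler("value_label", handle_value_label)  # registered first
parser.set_encoding("windows-1252")                    # encoding set afterwards
labels = parser.parse([b"2\xb0"])
print(labels)
```

Because the handler is only called inside parse, the order of the two setup calls does not change what the handler sees.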

Otherwise the real solution would be to figure out the real encoding of that file ... it may also be that the file is corrupt ... Is it possible to read it correctly in SAS? What does SAS say the encoding is?

In general, SAS catalog files seem really painful to deal with; even in SAS they may not work if they were saved in an older SAS version or come from SAS running on a different OS. For that reason people over here suggest it is a good idea to avoid them and just keep the labels in the code.

I have seen a similar thing here: tidyverse/haven#312. I downloaded the sample files and was not able to read them either. I think R is still not doing it correctly (it does not fail, but the labels I see contain the strange character). What do you get if you try to read your file with R-Haven?

As a last resort I could emulate R's current behavior: take the C bytes to Python bytes first and then convert them to strings (this second step is where your file would fail), with either a flag to skip that conversion or a fallback to bytes when the conversion fails. You would get Python bytes with some strange characters in them, and you would have to handle that yourself somehow ... but the text may be unreadable anyway, so I am not sure how much that helps, and it would also slow the code down a bit. It would take me a while to make such a change.

ofajardo · 2018-10-03T17:10:02Z

quick heads up: I moved lines 559 - 561 to before 550 and tried with the bad sas7bcat file from the R issue. I got the same error as before (UnicodeError). So it seems that does not solve the issue.

mdbecker · 2018-10-03T17:27:41Z

🤔 I don't actually have SAS so I can't check if the file is corrupt or not but I know people in my organization have read this into SAS so I have to assume it works with at least some version of SAS. It would be nice if there was a flag to read as bytes instead of string. Maybe I could read it using python 2.7, dump it to json, and then re-read it in python 3.6? I could probably upload the file somewhere if it helps with debugging? It sounds though like you've been able to reproduce the issue so you don't really need my file?

mdbecker · 2018-10-03T17:36:16Z

P.S. The error is 'utf-8' codec can't decode byte even when I set the encoding to something other than utf-8. Why is this? Also the stack trace says the error is happening in the handle_value_label function but I don't see any code in there dealing with encoding/decoding so I'm a little confused as to where the error is being thrown from (sorry I'm a cython newb). My only guess is that the error has something to do with the casting being done on line 459?

ofajardo · 2018-10-03T17:53:02Z

Yeah, it should be line 459. Basically what happens is this: the C code of readstat should already return good UTF-8 bytes, via the iconv library. Then, in line 459, we simply cast those C bytes to a Python UTF-8 string. But since there are bad characters that cannot be interpreted, it fails.
Something you could do for now is to change <str> c_str_value to <bytes> c_str_value in line 459; it would also be necessary to change the declaration of py_str_value from str to bytes in line 436. Maybe there is something else, but that may be enough for you to get bytes instead of a string.
Then you can handle that however you want; you can, for example, ignore the bad characters ...
I think that's what I would do to get started. If it solves the issue, I would later try to translate those bytes into strings, and if an error occurs, leave them as bytes and raise a warning.

Regarding python 2.7 I am not sure why you need that work around, python 3 bytes will work for this case.
Uploading your file ... it depends if you do this change I suggest and you are happy with it, I can just add the checkings etc. If it does not work, I can try other things with that sample file and then you can check my changes with your file. If it still does not work then I would like to take a look to your file. Or you can upload it from the start so I can check from the beginning. It depends on how sensitive is the data contained there.
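
The "try to decode, fall back to bytes with a warning" idea mentioned above could look roughly like this in plain Python. This is a minimal sketch, not the pyreadstat implementation; decode_or_bytes is a hypothetical helper name.

```python
import warnings


def decode_or_bytes(raw, encoding="utf-8"):
    """Return raw decoded as text, or the untouched bytes if decoding fails.

    Hypothetical helper for illustration; not part of pyreadstat.
    """
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        warnings.warn("could not decode %r as %s; returning bytes" % (raw, encoding))
        return raw


good = decode_or_bytes(b"Male")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    bad = decode_or_bytes(b"2\xb0")  # 0xb0 alone is not valid UTF-8

# Alternative: drop the undecodable bytes instead of keeping raw bytes.
lossy = b"2\xb0".decode("utf-8", errors="ignore")

print(good, bad, lossy)
```

Keeping the bytes preserves the information for later repair, while errors="ignore" silently loses the offending character, so which one is preferable depends on what the labels are used for.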

ofajardo · 2018-10-03T18:06:07Z

OK, for me the line to change is 452 and 449 to set value_label_name to bytes

cdef bytes value_label_name

var_label = <str> val_labels
value_label_name = <bytes> label

That lets me read the bad sas7bcat file. Can you test with yours? You may need the change in 459 in addition, or also in 451. I think these are the three possibilities, and a permanent solution would probably have to change all of them.

If you then check the meta.value_labels you should see some funny characters.

mdbecker · 2018-10-03T18:11:31Z

Alright, let me give it a go. If I'm understanding you properly, it might be that I'm passing the wrong encoding to iconv? Maybe I should see what chardet thinks the encoding of the file is?

mdbecker · 2018-10-03T18:25:53Z

None of those changes seemed to have any effect. Am I recompiling correctly?

pip uninstall pyreadstat
pip install --no-cache-dir git+https://github.com/mdbecker/pyreadstat.git

I don't see anything that looks like output from a compiler so I'm worried that it's caching something somewhere. Here's the output of the pip install:

Collecting git+https://github.com/mdbecker/pyreadstat.git
  Cloning https://github.com/mdbecker/pyreadstat.git to /private/var/folders/55/nkkrq4d93n90cp0j_mknh3dw0000gn/T/pip-req-build-_g74_25j
Installing collected packages: pyreadstat
  Running setup.py install for pyreadstat ... done
Successfully installed pyreadstat-0.1.8

ofajardo · 2018-10-03T18:33:58Z

"it might be that I'm passing the wrong encoding to iconv?" -> Yes, that is the first possibility. In the documentation there is a link to a list of all possible iconv encodings. Put them into a list and loop through them to see if any of them works. The second possibility is that, from iconv's point of view, the file is malformed. I would not be too surprised, because sometimes not even SAS can read sas7bcat files correctly.
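
Looping over candidate encodings could be sketched as follows. first_working_encoding is a hypothetical helper, and the fake_reader below only stands in for a real reader so the snippet runs without the file; the only thing assumed about the real reader is that it accepts an encoding keyword, as in the read_sas7bcat call from the traceback.

```python
def first_working_encoding(reader, path, encodings):
    """Try each encoding in turn; return (encoding, result, failures).

    Hypothetical helper. `reader` is any callable accepting
    reader(path, encoding=...); failures maps encoding -> exception.
    """
    failures = {}
    for enc in encodings:
        try:
            result = reader(path, encoding=enc)
        except Exception as exc:  # broad on purpose: error types may vary
            failures[enc] = exc
        else:
            return enc, result, failures
    return None, None, failures


# Stand-in reader so the sketch is runnable without a real catalog file.
def fake_reader(path, encoding=None):
    return b"2\xb0".decode(encoding)


enc, result, failures = first_working_encoding(
    fake_reader, "formats.sas7bcat", ["utf-8", "ascii", "windows-1252"]
)
print(enc, result, sorted(failures))
```

With the real library one would pass pyreadstat.read_sas7bcat as reader. An encoding that reads without error is not necessarily the right one, so the resulting labels still need a visual check.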

Regarding compilation: first do pip uninstall pyreadstat, then clone the repo, make the changes in the code, open a shell, navigate to the repo folder and run:

python3 setup.py build_ext --inplace --use-cython

Now you can import pyreadstat, but only from within this folder, because the library has been compiled and placed locally.

mdbecker · 2018-10-03T19:04:54Z

FYI, using Python 2.7 seems to be working. I haven't found any funny characters yet, and meta.file_encoding says 'windows-1252'. The only issue I ran into is that it didn't apply the categoricals from the catalog file for some reason. I'm not sure how these are supposed to get applied, but I did notice that meta.variable_value_labels is an empty dictionary. In any case it's not too hard for me to apply the catalog by hand, so at least I have a reasonable workaround for now. I'll try recompiling with the changes in a bit and send you more details. Thanks!

ofajardo · 2018-10-03T19:49:14Z

wow, very smart move!
And very intriguing finding that in python 2.7 you don't see any strange character in meta.value_labels. What if you try to convert them to unicode? If you don't get any error then it would be super strange, because it means the problem is somehow in Python 3.
In order to apply the catalog labels, the sas7bdat must have meta.variable_to_label. Basically, meta.variable_to_label says that variable X has label set Y. Then meta.value_labels from the sas7bcat says that in label set Y, value 1 is labeled Male and value 2 is labeled Female. Merging both, you get meta.variable_value_labels, which says that for variable X, value 1 is labeled Male and value 2 is labeled Female, and this you can apply directly in pandas. So it needs one piece coming from the sas7bdat and one piece from the sas7bcat.
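
The merge described above can be sketched with plain dictionaries. The shapes used here (variable name to label-set name, and label-set name to a value/label mapping) follow the description in this comment; merge_value_labels is a hypothetical helper, not a pyreadstat function.

```python
def merge_value_labels(variable_to_label, value_labels):
    """Combine the two metadata pieces into variable -> {value: label}.

    Hypothetical helper; variables whose label set is missing from the
    catalog are skipped.
    """
    return {
        variable: value_labels[label_name]
        for variable, label_name in variable_to_label.items()
        if label_name in value_labels
    }


# Piece from the sas7bdat: variable X uses label set Y.
variable_to_label = {"X": "Y", "Z": "NOT_IN_CATALOG"}
# Piece from the sas7bcat: label set Y maps 1 -> Male, 2 -> Female.
value_labels = {"Y": {1: "Male", 2: "Female"}}

variable_value_labels = merge_value_labels(variable_to_label, value_labels)
print(variable_value_labels)
```

Applying the result by hand in pandas would then be something like df["X"].map(variable_value_labels["X"]), assuming df is the DataFrame read from the sas7bdat.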

mdbecker · 2018-10-03T20:24:38Z

Well I found the bad character!
The string had '2\xb0' in it which when decoded with 'latin1' results in 2°. Oddly I tried passing in the encoding of latin1 to read_sas7bcat but I still get the error. It seems like it's using utf-8 even when I specify latin1?
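
For reference, the behavior of that byte can be checked in isolation with Python's built-in codecs, without pyreadstat. The last line shows what a correct conversion to UTF-8 would produce, assuming the source really is windows-1252.

```python
raw = b"2\xb0"

# Decoding as UTF-8 fails: 0xb0 cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = str(exc)

# Both single-byte encodings map 0xb0 to the degree sign.
as_latin1 = raw.decode("latin1")
as_cp1252 = raw.decode("windows-1252")

# A correct windows-1252 -> UTF-8 conversion yields two bytes for the sign.
expected_utf8 = as_cp1252.encode("utf-8")

print(utf8_error)
print(as_latin1, as_cp1252, expected_utf8)
```

So if the value label still contains a bare 0xb0 when it reaches Python, the bytes were most likely never converted from the file encoding to UTF-8.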

ofajardo · 2018-10-03T20:39:14Z

wow, you're a good detective!
As far as I remember, if you pass some nonsensical encoding it breaks, which makes me think that in general it does listen to the parameter ... so I don't know what's happening. At this point it would be good to work out whether the problem is in my code or in readstat ... the only way to check that is to try the same thing in R-haven ... or it could also be that the bug is in iconv ...

mdbecker · 2018-10-03T20:41:37Z

Alright I'll try to get that installed and report back. Thanks for the assistance!

ofajardo · 2018-10-03T21:38:32Z

"2\xb0" can be decoded with both latin1 and windows-1252, and since iconv thinks it is win1252 it should do it correctly. If you say latin-1, then meta.file_encoding is latin1? If yes thats ok.
The problem is that after iconv handles it, it should be valid UTF-8, and it's not ...

ofajardo · 2018-10-11T16:05:57Z

OK, I had a deeper look, and I think I found the bug in readstat. I have filed an issue on their repo; I think it should be relatively easy to fix. I'll update pyreadstat once it gets fixed in readstat, then you can try with your file again, and hopefully that solves the issue.

ofajardo · 2018-10-12T08:31:21Z

@mdbecker I have updated pyreadstat, and the issue with the sample sas7bcat file I have is solved. It would be great if you could confirm that it also solves the issue for your file.
I have uploaded the new version (0.1.9) to PyPI, so you can install it with pip install pyreadstat

ofajardo · 2018-11-12T14:38:48Z

This issue is most likely solved. Closed after 1 month of inactivity. Please re-open the issue if necessary.

mdbecker added a commit to mdbecker/pyreadstat that referenced this issue Oct 3, 2018

_readstat_parser: Try setting encoding earlier

cd64eef

Hoping this will fix Roche#4

This was referenced Oct 11, 2018

readstat not converting encoding of sas7bcat labels WizardMac/ReadStat#152

Closed

read_sas: encoding of .sas7bdat tidyverse/haven#394

Closed

ofajardo closed this as completed Nov 12, 2018
