UnicodeDecodeError when reading value labels #4
Comments
Hoping this will fix Roche#4
Hi,

No, all the code you selected is just setting things up; nothing is called until line 564 (error = parse_func(parser, filename, ctx)), so I think the position of that line is not important. I actually played with it a bit: taking a good file that was read correctly, I kept changing the encoding until I made it crash, which makes me think that part is fine.

Otherwise the real solution would be to guess the real encoding of that file. It may also be that the file is corrupt. Is it possible to read it correctly in SAS? What does SAS say the encoding is? In general, SAS catalog files seem really painful to deal with: even in SAS they may not work if they were saved in an older SAS version or come from SAS running on a different OS. For that reason people over here suggest that it is a good idea to avoid them and just keep the labels in the code.

I have seen a similar thing in tidyverse/haven#312. I downloaded the sample files and was not able to read them either. I think R is still not doing it correctly (it does not fail, but the labels I see contain the strange character). What do you get if you try to read your file with R haven?

As a last resort I could emulate R's current behavior: take the C bytes to Python bytes first and then convert them to strings (your file would fail in this second step), with a flag to skip that conversion, or with a fallback to bytes if the conversion fails. You would get Python bytes with some strange characters in them, and you would have to handle that yourself in some way. It may be unreadable anyway, so I am not sure how much that helps, and it would also slow down the code a bit. It would take me a while to make such a change. |
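For readers following along, the "fall back to bytes" idea from this comment can be sketched in plain Python. This is only an illustration of the idea, not pyreadstat's actual code: the helper name `decode_or_bytes` is hypothetical, and it assumes the label arrives as a Python bytes object that is supposed to be UTF-8.

```python
def decode_or_bytes(raw, encoding="utf-8"):
    """Return raw decoded as text, or raw unchanged if it cannot be decoded."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # Hypothetical fallback: hand the undecodable bytes back to the caller.
        return raw


# First label is valid UTF-8; the second contains a stray single-byte character.
labels = [b"2\xc2\xb0", b"2\xb0"]
print([decode_or_bytes(label) for label in labels])
```

With such a fallback the caller gets a mix of str and bytes values and has to decide what to do with the bytes, which is the trade-off mentioned above.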
quick heads up: I moved lines 559 - 561 to before 550 and tried with the bad sas7bcat file from the R issue. I got the same error as before (UnicodeError). So it seems that does not solve the issue. |
🤔 I don't actually have SAS so I can't check if the file is corrupt or not but I know people in my organization have read this into SAS so I have to assume it works with at least some version of SAS. It would be nice if there was a flag to read as bytes instead of string. Maybe I could read it using python 2.7, dump it to json, and then re-read it in python 3.6? I could probably upload the file somewhere if it helps with debugging? It sounds though like you've been able to reproduce the issue so you don't really need my file? |
P.S. The error is |
Yeah, it should be line 459. Basically what happens is this: the C code of readstat should already deliver good UTF-8 bytes through its use of the iconv library. Then, in line 459, we simply cast those C bytes to a Python UTF-8 string. But as there are bad characters that cannot be interpreted, it fails. Regarding Python 2.7, I am not sure why you need that workaround; Python 3 bytes will work for this case. |
OK, for me the lines to change are 452 and 449, to set value_label_name to bytes
That lets me read that bad sas7bcat file. Can you test with yours? It may be that you also need 459, or 451. I think these are the three possibilities, and probably one would have to change all of them for a permanent solution. If you then check meta.value_labels you should see some funny characters. |
Alright, let me give it a go. If I'm understanding you properly, it might be that I'm passing the wrong encoding to iconv? Maybe I should see what chardet thinks the encoding of the file is? |
None of those changes seemed to have any effect. Am I recompiling correctly?
pip uninstall pyreadstat
pip install --no-cache-dir git+https://github.com/mdbecker/pyreadstat.git
I don't see anything that looks like output from a compiler, so I'm worried that it's caching something somewhere. Here's the output of the pip install:
|
"it might be that I'm passing the wrong encoding to iconv?" -> Yes, that is the first possibility. In the documentation there is a link to a list of all possible iconv encodings. Put them into a list and loop through them to see if any work. The second possibility is that, from iconv's point of view, the file is malformed. I would not be too surprised, because sometimes not even SAS can read sas7bcat files correctly.

Regarding compilation: first do pip uninstall pyreadstat, then clone the repo, make the changes in the code, open a shell, navigate to the repo folder and run:
python3 setup.py build_ext --inplace --use-cython
Now you can import pyreadstat if you are in this folder, and only in this folder, because the library has been compiled and placed locally. |
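The "loop through the encodings" suggestion could look roughly like the sketch below. The helper name `find_working_encodings` and its `reader` parameter are made up for illustration; the only pyreadstat call assumed is `pyreadstat.read_sas7bcat(path, encoding=...)`, and the candidate list has to be filled in from the iconv list linked in the documentation.

```python
def find_working_encodings(path, encodings, reader=None):
    """Try each encoding and return those that read the file without an error."""
    if reader is None:
        # Imported lazily so the helper can be exercised without pyreadstat installed.
        import pyreadstat
        reader = pyreadstat.read_sas7bcat
    working = []
    for enc in encodings:
        try:
            reader(path, encoding=enc)
        except Exception as exc:  # the error type may vary, so catch broadly here
            print(f"{enc}: failed ({type(exc).__name__})")
        else:
            working.append(enc)
    return working


# Example call (the file name is a placeholder):
# find_working_encodings("formats.sas7bcat", ["LATIN1", "WINDOWS-1252", "UTF-8"])
```

An encoding that reads without an error is not necessarily the right one, so it is still worth inspecting meta.value_labels for each candidate that survives.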
FYI using Python 2.7 seems to be working. I haven't found any funny characters yet, and when I look at meta.file_encoding it's 'windows-1252'. The only issue I ran into is that it didn't apply the categoricals in the catalog file for some reason. I'm not sure how these are supposed to get applied but I did notice |
wow, very smart move! |
Well I found the bad character! |
wow, you're a good detective! |
Alright I'll try to get that installed and report back. Thanks for the assistance! |
"2\xb0" can be decoded with both latin1 and windows-1252, and since iconv thinks it is win1252 it should do it correctly. If you pass latin-1, is meta.file_encoding then latin1? If yes, that's OK. |
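A quick way to check the claim about "2\xb0" with nothing but the Python standard library (cp1252 is Python's codec name for windows-1252):

```python
raw = b"2\xb0"  # the bytes quoted above

for codec in ("latin1", "cp1252"):
    # Both single-byte codecs map 0xB0 to the degree sign.
    print(codec, repr(raw.decode(codec)))

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # A lone 0xB0 is a continuation byte, so the sequence is not valid UTF-8.
    print("utf-8 fails:", exc.reason)
```

So the byte itself is fine in either single-byte encoding; it only breaks when it reaches a UTF-8 decode without having been converted first.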
OK, I had a deeper look, and I think I found the bug in readstat. I have filed an issue on their repo; it should be relatively easy to fix, I think. I'll update pyreadstat once it gets fixed in readstat, then you can try with your file again and hopefully it will solve the issue. |
@mdbecker I have updated pyreadstat and the issue with the sample sas7bcat file I have is solved. It would be great if you could confirm it also solves the issue for your file. |
This issue is most likely solved. Closed after 1 month of inactivity. Please re-open the issue if necessary. |
Thanks for the great library! I'm running into a UnicodeDecodeError when reading a catalog file. I tried reading the catalog separately per the instructions to try and isolate the problem. I tried switching the encoding to 'latin1' and 'windows-1252' but the error seems unchanged as a result. Looking through run_readstat_parser (pyreadstat/pyreadstat/_readstat_parser.pyx, lines 547 to 561 in 6e90e91), it looks like the encoding is set after handle_value_label is called? Is this a bug? Is there any workaround to this? Thanks!

P.S. I tried forking the repo and moving the encoding part of the code before the call to handle_value_label, but unfortunately it didn't seem to help 😢: https://github.com/mdbecker/pyreadstat/blob/cd64eef89ce328a3fb717f1e9675c4b9792c3b89/pyreadstat/_readstat_parser.pyx#L550-L560