working with big spss files #79

Open
AtanasAtanasovIpsos opened this issue Sep 21, 2020 · 10 comments
Labels
enhancement (New feature or request), requires changes in Readstat (waiting for changes in the C library Readstat)

Comments

@AtanasAtanasovIpsos

First I want to say this library is great!
We have some raw SPSS files that are extremely large (about 6 GB, with 1.3 million variables). SPSS itself can work with them. Pyreadstat, however, cannot handle them, even with the option of reading only the metadata, although plenty of RAM is still available (Python uses about 1.5 GB and the machine has 64 GB of RAM). The stack trace is as follows:

File "C:\Users\thomas\Downloads\Ipsos\Carlsberg\build_column_overview.py", line 32, in <module>
    df, meta = pyreadstat.read_sav(os.path.join(path, file), metadataonly=True)
  File "pyreadstat\pyreadstat.pyx", line 325, in pyreadstat.pyreadstat.read_sav
  File "pyreadstat\_readstat_parser.pyx", line 945, in pyreadstat._readstat_parser.run_conversion
  File "pyreadstat\_readstat_parser.pyx", line 784, in pyreadstat._readstat_parser.run_readstat_parser
  File "pyreadstat\_readstat_parser.pyx", line 714, in pyreadstat._readstat_parser.check_exit_status
ReadstatError: Unable to allocate memory

This happens on both Windows 10 (64-bit) and Linux (64-bit), with 64-bit Python 3.8 and pyreadstat 1.0.2.

I understand that SPSS is probably not the best file format for this data, but unfortunately that is what we have.

@ofajardo
Collaborator

Thanks for the report. Would you be able to write some Python code that uses pyreadstat.write_sav to produce a large sample file that raises the error on your end? That way we can reproduce the issue without you having to transfer the file itself (only the code that produces it).

@AtanasAtanasovIpsos
Author

Thanks for the reply. I will experiment with write_sav and see if I can produce such a file.

@AtanasAtanasovIpsos
Author

Here is example code that generates a file of about 84.6 MB that cannot be read back because of the same error.

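import numpy as np
import pandas as pd
import pyreadstat

# One row of random numbers spread over 1.3 million columns named A1..A1300000
N = 1300000
DataSet = pd.DataFrame(np.random.randn(1, N), columns=['A' + str(x) for x in range(1, N + 1)])
pyreadstat.write_sav(DataSet, 'DataFile.sav')
#%%
# Reading the file back fails with "ReadstatError: Unable to allocate memory",
# even though only the metadata is requested
df, meta = pyreadstat.read_sav('DataFile.sav', metadataonly=True)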
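
To narrow down how many variables it takes to trigger the error, the reproduction above could be wrapped in a bisection search. This is only a sketch: `read_fails` and `smallest_failing` are hypothetical helper names, and the code assumes that the failure is monotonic in the number of columns and that the error message matches the one in the traceback.

```python
import os
import tempfile


def read_fails(n_columns):
    """Write a one-row .sav file with n_columns variables and report whether
    reading its metadata back raises the allocation error."""
    # Imported lazily so the search helper below works without these packages.
    import numpy as np
    import pandas as pd
    import pyreadstat

    columns = ['A' + str(x) for x in range(1, n_columns + 1)]
    frame = pd.DataFrame(np.random.randn(1, n_columns), columns=columns)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'DataFile.sav')
        pyreadstat.write_sav(frame, path)
        try:
            pyreadstat.read_sav(path, metadataonly=True)
        except Exception as exc:
            # Match on the message from the traceback, not on an exception class.
            if 'Unable to allocate memory' in str(exc):
                return True
            raise
    return False


def smallest_failing(fails, lo, hi):
    """Bisect for the smallest n in (lo, hi] for which fails(n) is True.

    Assumes fails(lo) is False, fails(hi) is True, and that failure is
    monotonic in n.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

Calling `smallest_failing(read_fails, 1, 1300000)` would take roughly twenty write/read cycles; the upper bound comes from the report above, and the lower bound assumes a one-column file reads fine.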

@evanmiller

Hi, ReadStat restricts individual memory allocations to 16 MB. This is to prevent denial-of-service scenarios with malformed data. With 1.3 million variables in your file, you are likely hitting that limit with the column metadata.
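
As a rough back-of-the-envelope check of why 1.3 million variables would run into a 16 MB per-allocation cap, the arithmetic can be sketched as follows. The entry sizes tried here are hypothetical and are not taken from ReadStat's source; the sketch also assumes that "16 MB" means 16 * 2**20 bytes and that some per-variable metadata lives in a single contiguous array.

```python
LIMIT_BYTES = 16 * 1024 * 1024   # 16 MB per allocation, assuming MB means 2**20 bytes
N_VARIABLES = 1_300_000          # variable count from the report


def allocation_size(n_entries, bytes_per_entry):
    """Size in bytes of one contiguous array with n_entries entries."""
    return n_entries * bytes_per_entry


def exceeds_limit(n_entries, bytes_per_entry, limit=LIMIT_BYTES):
    """True if a single array of that shape would be larger than the limit."""
    return allocation_size(n_entries, bytes_per_entry) > limit


# Largest whole per-variable entry size that still fits under the limit.
max_entry_bytes = LIMIT_BYTES // N_VARIABLES

for entry_bytes in (8, 16, 64):  # hypothetical entry sizes, not from ReadStat
    size_mb = allocation_size(N_VARIABLES, entry_bytes) / 2**20
    print(entry_bytes, round(size_mb, 1), exceeds_limit(N_VARIABLES, entry_bytes))
```

Under these assumptions, any single array that needs more than 12 bytes per variable would be refused, so an array of 8-byte entries would fit while one of 16-byte entries would not.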

Some options are 1) increasing the limit, 2) adding an option to specify the limit, and 3) removing the limit altogether.

@AtanasAtanasovIpsos
Author

AtanasAtanasovIpsos commented Sep 21, 2020

The second option would be the best one for me. Or something like:

df, meta = pyreadstat.read_sav('DataFile.sav', metadataonly=True, safetylimits=False)

@ofajardo
Collaborator

ofajardo commented Sep 21, 2020

That's a good suggestion.

Given the experience with pyreadr, I think option 1 is not good, because there will always be somebody with a larger file that hits the new limit. I personally think removing it would be better, as was done in pyreadr. That will be less confusing for users, as they won't need to be aware of an extra flag to deactivate the limit.
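
To make the trade-off between the three options concrete, here is a toy Python sketch of an allocation guard. It is not ReadStat's C code: `guarded_alloc`, `AllocationRefused`, and the `limit` parameter are invented for illustration only.

```python
DEFAULT_LIMIT = 16 * 1024 * 1024  # the 16 MB cap mentioned above, assuming 2**20-byte MB


class AllocationRefused(Exception):
    """Raised by this toy guard when a request exceeds the active limit."""


def guarded_alloc(n_bytes, limit=DEFAULT_LIMIT):
    """Return a zeroed buffer of n_bytes, or refuse if it exceeds the limit.

    limit=None disables the check entirely (option 3); any other value is a
    cap chosen by the library (option 1) or by the caller (option 2).
    """
    if limit is not None and n_bytes > limit:
        raise AllocationRefused(f'{n_bytes} bytes exceeds limit of {limit} bytes')
    return bytearray(n_bytes)
```

With a raised default, a request just above the new cap is still refused, which is the objection to option 1. Passing a larger `limit` corresponds to option 2 and requires the caller to know the flag exists, while `limit=None` corresponds to option 3.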

@AtanasAtanasovIpsos
Author

You are right about the bigger files. Now that I think more about it, removing the limit also seems like a good solution :)

@ofajardo
Collaborator

ofajardo commented Dec 3, 2020

Hi @evanmiller, is this something that is coming in ReadStat version 1.1.5, or not yet? (Just for clarity.)

@evanmiller

@ofajardo No solution yet

@ofajardo
Collaborator

ofajardo commented Dec 3, 2020

@evanmiller Ok thanks!

@ofajardo added the labels enhancement (New feature or request) and requires changes in Readstat (waiting for changes in the C library Readstat) on Jan 29, 2021