
ValueError: Unable to allocate memory #3

Closed · saegomat opened this issue Jan 10, 2019 · 40 comments
Labels: bug (Something isn't working)

saegomat commented Jan 10, 2019

Hello @ofajardo,
This is a great package!

My RDS files are 300 MB+ and I run into memory issues:

import pyreadr
scr = 'xyz.rds'
result = pyreadr.read_r(scr)
Traceback (most recent call last):
File "", line 1, in
File "C:\Users\tsaeger\AppData\Local\Continuum\anaconda3\lib\site-packages\pyreadr\pyreadr.py", line 39, in read_r
parser.parse(path)
File "pyreadr\librdata.pyx", line 113, in pyreadr.librdata.Parser.parse
File "pyreadr\librdata.pyx", line 138, in pyreadr.librdata.Parser.parse
ValueError: Unable to allocate memory

It works like a champ for smaller RDS files. I have not tested where the cut-off is. My system has 32GB of RAM.

Best,
--T

ofajardo (Owner) commented Jan 11, 2019

Hi @saegomat,

Thanks a lot for trying the package and for the positive feedback!

Short answer: unfortunately, I think you would need more RAM. It is also difficult to estimate, just from the size of the .rdata file, how much RAM you will need.

More elaboration:

I was doing tests with a 40 MB (compressed) .rdata file. Written out as a CSV it is 440 MB, meaning it was compressed about 11 times. Read into memory it occupies 1.3 GB, both in R and in Python, which is normal. That means that for this case you need 32.5 times more RAM than the size of the .rdata file.

Now, I have 16 GB of RAM, so theoretically I should be able to load a .rdata file of 16 GB / 32.5 ≈ 492 MB. BUT you have to take into account that Windows is already using a lot of RAM, in my case at least 4 GB, so I have only about 12 GB free. That means I should be able to read a file of 12 GB / 32.5 ≈ 369 MB, and that should be the cutoff. By replicating the original data frame I produced .rdata files of several sizes, and indeed I was able to read files up to 360 MB, while a 400 MB file raises the error you have seen. So the calculation works. (I don't know what Windows is doing, but sometimes the busy RAM was 5 or even as high as 8 GB, and sometimes just 3 GB, with the same programs open.)

Now, why can I read almost 400 MB and you can't?

The first thing would be to check how much FREE RAM you actually have; maybe you are running lots of things and have much less than you think (use the Windows Task Manager).

Another thing is that the degree of compression is quite variable. I also tried a big matrix of just numbers: the .rdata file was 7 GB and it was 7 GB uncompressed as well (so essentially no compression), and I managed to read it into memory. For a matrix of just strings, the file compressed to only about half its uncompressed size. I guess the initial data frame I was working with (the 40 MB one) has a lot of empty strings, missing values and repeated values, and those compress very well.

So it is difficult to say, just looking at the file, how large it will be uncompressed and then in RAM. My guess would be that yours is heavily compressed, and that is why you fail to read it.

You could try to uncompress it to see how large it is; as a rule of thumb you then need about 5 times that much RAM. The only reliable way to know would be to open it in R (if you manage) and look at the size of the resulting data frame. If it is much larger than your free RAM, then that would be the explanation.

Are you generating the files yourself or did you get them from somewhere? If you are generating them yourself, check how large the object is in R; my guess is that you are right at the border of what your machine can handle.
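
A quick way to check the uncompressed size from Python (a minimal sketch; it assumes the file uses R's default gzip compression, so files saved with bzip2 or xz would need the bz2 or lzma modules instead, and "xyz.rds" is just the placeholder path from above):

import gzip

# Stream-decompress and count bytes; .rds/.RData files written with R's default
# settings are gzip-compressed, so this gives the size of the serialized data.
total = 0
with gzip.open("xyz.rds", "rb") as f:
    while True:
        chunk = f.read(1024 * 1024)  # 1 MB at a time, to avoid holding everything in memory
        if not chunk:
            break
        total += len(chunk)

print("Uncompressed size: %.1f MB" % (total / 1024 ** 2))
# Rule of thumb from above: budget roughly 5x this amount of free RAM.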

ofajardo (Owner) commented:

One more technical detail/precision (notes for myself):

The error comes when zlib is trying to decompress the file into memory and thinks there is not enough memory to do so.

It should be coming from rdata_read.c, in the zlib/gzip decompression path:

if (inflateInit2(ctx->z_strm, (15+32)) != Z_OK) {
        retval = RDATA_ERROR_MALLOC;
        goto cleanup;
    }

Nothing can be done :(

saegomat (Author) commented:

Hi @ofajardo,
Thanks much for the feedback. Opening the data file through other means shows ~28 GB of RAM used, so I'll need to work on a workaround :-)

Best,
--T

ofajardo (Owner) commented:

Closed after 1 month of inactivity.

cpury commented Mar 1, 2019

Hey, I'm having the same issue, but I find it hard to believe... My file is only 22 MB, yet I can't open it even though my machine has 12 GB of RAM free... How is that possible? I've never run into this kind of problem with other file types, and I can't imagine R has some superior compression algorithm that squeezes tens of GB into only 22 MB...

cpury commented Mar 1, 2019

I just opened my file via rpy2 with no problem. The data in there, uncompressed, is 200 MB... I'm new to R, but it would be a major issue if opening 200 MB of data needed tens of GB of RAM.

ofajardo (Owner) commented Mar 1, 2019

Would you share the file so I can take a look?

cpury commented Mar 1, 2019

Unfortunately I can't share it :(

ofajardo (Owner) commented Mar 1, 2019

A pity ... as your case seems really extreme, it would have been good to see what exactly is happening.

Just one thing, in case you have a few minutes: can you make a copy of your file, rename the extension from .RData or .rds to .zip, and then unzip it? How large is the unzipped file?

cpury commented Mar 2, 2019

Hey Otto, unzipped it's 187.2 MB in size.

ofajardo (Owner) commented Mar 3, 2019

Thanks. Your case definitely looks like a bug, but without being able to reproduce it, nothing can be done.

Is only that file failing, or any file? (You have some samples in the test_data folder.)

If more files are failing, one easy thing you could try is installing in a different way (for example, if you used pip, try conda or compiling from source); maybe it is a shared-library conflict.

cpury commented Mar 3, 2019

Thanks for looking into it, Otto. It's a pity I can't share the file. I will try again next time I get an R file and let you know if I run into similar trouble or not.

ofajardo (Owner) commented Mar 5, 2019

I have filed an issue in librdata to see if something can be done to improve this situation. No guarantee that it can be fixed, though.

That would be for the general case; the case reported by @cpury may be a different bug, and for that we will need a sample file if the problem appears again.

Gootjes commented Mar 29, 2019

For me, the memory allocation error appears for any dataset with more than 2 to the power of 22 rows (8 MB as an .Rdata file, 11 MB unzipped). I am on an 8 GB RAM system and have about 4 GB at my disposal.

Using the following code, I can see the magic cutoff happening. I am not sure whether it will replicate on your side, as it probably depends on RAM.

# R: create two test files, one exactly at the cutoff and one just past it
d <- data.frame(ID=1:((2**22)+0)); save(d, file = "2pow22.Rdata")
d <- data.frame(ID=1:((2**22)+1)); save(d, file = "2pow22_and_1.Rdata")

# Python: the first file reads fine, the second raises the error
import pyreadr as P
P.read_r("2pow22.Rdata")
P.read_r("2pow22_and_1.Rdata")

The interesting thing is that the issue is not the dataset size per se, but rather the number of rows, as this code runs fine:

# R: twice the data (two columns of 2**22 rows each) still reads fine
d <- data.frame(ID=1:(2**22), ID2=1:(2**22)); save(d, file = "2pow22_by_2.Rdata")
# Python
P.read_r("2pow22_by_2.Rdata")

@ofajardo ofajardo reopened this Mar 29, 2019
ofajardo (Owner) commented Mar 30, 2019

Hi @Gootjes,

Brilliant! Really nice detective work, thanks a lot for the report.

I can fully reproduce the issue. Furthermore, if all the values in the ID vector are 0 the same error still arises, meaning it is not the numeric value inside the vector that causes the issue, but the number of rows itself, as you say. I can also reproduce the issue after decompressing the files, meaning the error is not coming from zlib.

As a next step I need to see whether the same error arises when reading these files with C directly, so that I know whether the issue comes from the C library or from my code. I am actually using a variant of the original C code that compiles on Windows, so I have to check that as well. At the moment, however, the original C library has a bug that is preventing me from testing, but I have already opened an issue over there and will follow up.

I would also like to know to what extent people's issues come from this bug or from a lack of RAM. In your case, did you encounter the problem while reading a normal file and then manage to reduce it to this minimal example? I just re-checked the example I made up earlier in this issue, and in that case I can get the memory error with fewer than 2**22 rows, meaning that one is indeed a lack of RAM.

@cpury, @saegomat: do your files have more than 2**22 rows, and would this explain your issues? In the case of @cpury I would say probably yes, but in the case of @saegomat, which of the two is it?

@ofajardo ofajardo added the "bug (Something isn't working)" and "enhancement (New feature or request)" labels Mar 30, 2019
Gootjes commented Mar 30, 2019

Hi @ofajardo,

I did not encounter this issue myself on a normal file; I produced the minimal example because someone reported this issue (jamovi/jamovi#689) with a dataset of 5.2 million cases (rows).

I am not well-versed in C, so I cannot check whether it is in the librdata C library or in your code. I did check out your initial pyreadr commit be4a941, and the issue is already apparent there, so if it is in your code, it has been there from the start :).

ofajardo (Owner) commented Mar 30, 2019

@Gootjes I found the problem: it is in the C library, in rdata_read.c at line 1106. The culprit is the function rdata_malloc at line 89:

static void *rdata_malloc(size_t len) {
    if (len > MAX_BUFFER_SIZE || len == 0)
        return NULL;

    return malloc(len);
}

Basically, MAX_BUFFER_SIZE is hard-coded to 16777216 (2**24) bytes, and if the vector needs more bytes than that, the function returns NULL and the error is raised. For example, 2**22 rows of integers (your example) is 2**22 * 4 bytes, which is exactly the maximum buffer size; one more integer and the error comes. A vector of numeric (8 bytes per element) only needs to be half that length to raise the error. I manually increased MAX_BUFFER_SIZE and the error disappeared. I am not sure why it was hard-coded to this particular value.
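
A quick back-of-the-envelope check of that arithmetic (a minimal sketch in Python; the element sizes assume R's standard 4-byte integers and 8-byte doubles):

MAX_BUFFER_SIZE = 2 ** 24          # 16777216 bytes, the value hard-coded in rdata_read.c
INT_BYTES, DOUBLE_BYTES = 4, 8     # R integer vs. numeric (double) element sizes

print(MAX_BUFFER_SIZE // INT_BYTES)     # 4194304 == 2**22, so 2**22 integers fit exactly
print(MAX_BUFFER_SIZE // DOUBLE_BYTES)  # 2097152 == 2**21, so only 2**21 doubles fit
# One element more than that and rdata_malloc() returns NULL, which surfaces as
# "Unable to allocate memory".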

I will report in librdata for them to take a look.

Gootjes commented Mar 30, 2019

That makes complete sense, awesome find! Thanks

saegomat (Author) commented:

(quoting @ofajardo's question above about whether the files have more than 2**22 rows)

Hello,
My data set has 25 million rows, so yes, above the 2**22 row cutoff.

Thx, --T

@ofajardo ofajardo added the "waiting for librdata changes" label (the issue needs fixes to the C library librdata before it can be solved) Apr 3, 2019
@ofajardo ofajardo removed the "enhancement (New feature or request)" label Apr 14, 2019
@ofajardo ofajardo removed the "waiting for librdata changes" label Apr 14, 2019
ofajardo (Owner) commented Apr 14, 2019

The limit (MAX_BUFFER_SIZE) has been changed in librdata; the maximum size of a vector is now 2**32 bytes (4 GB), meaning 2**30 elements for an integer vector or 2**29 elements for a double vector. This is now explained in the README under known limitations.

Although there is still a hard-coded limit, this should hopefully be enough for practical applications.

If somebody still encounters issues, please report them.

There is a new version 0.1.9 of pyreadr on PyPI (pip) and conda with the fix.

prdctofchem commented Feb 12, 2020

I am receiving the same error as those above, but my file is not large (only 111 KB). I have a single data.table in the .rds file and plenty of memory (64 GB RAM, ~11 GB in use). I am running Python 3.7 and pyreadr version 0.2.2. The exact error message is:

---------------------------------------------------------------------------
LibrdataError                             Traceback (most recent call last)
<ipython-input-17-e56e8f0640e6> in <module>
----> 1 dt = pyreadr.read_r('dir_path\\import_file.rds')

C:\Python37\lib\site-packages\pyreadr\pyreadr.py in read_r(path, use_objects, timezone)
     38     if timezone:
     39         parser.set_timezone(timezone)
---> 40     parser.parse(path)
     41 
     42     result = OrderedDict()

C:\Python37\lib\site-packages\pyreadr\librdata.pyx in pyreadr.librdata.Parser.parse()

C:\Python37\lib\site-packages\pyreadr\librdata.pyx in pyreadr.librdata.Parser.parse()

LibrdataError: Unable to allocate memory

I've tried changing some of the parameters of saveRDS to see if it had anything to do with the compression method, and I've looked around to see if there could be permission issues with the librdata library, but have not come across anything helpful. I also tried saving an R data.frame in the RDS to see if it had anything to do with the object class, and surprisingly I received a different error:

---------------------------------------------------------------------------
LibrdataError                             Traceback (most recent call last)
<ipython-input-14-8c720a500237> in <module>
----> 1 dt = pyreadr.read_r('dir_path\\import_file.rds')

C:\Python37\lib\site-packages\pyreadr\pyreadr.py in read_r(path, use_objects, timezone)
     38     if timezone:
     39         parser.set_timezone(timezone)
---> 40     parser.parse(path)
     41 
     42     result = OrderedDict()

C:\Python37\lib\site-packages\pyreadr\librdata.pyx in pyreadr.librdata.Parser.parse()

C:\Python37\lib\site-packages\pyreadr\librdata.pyx in pyreadr.librdata.Parser.parse()

LibrdataError: Invalid file, or file has unsupported features

Hopefully that offers a clue, since I can't understand how this could actually be a memory-related problem. Thank you for your help.
-R

Edit:
The single data.table object, stored in an .rds file, is here:
https://drive.google.com/file/d/14pfyluEH-ADJC6wf8P0zoklikCsEFxV9/view?usp=sharing

ofajardo (Owner) commented:

Thanks for the report. Without an example file, though, and without being able to reproduce it, there is not much I can do.

prdctofchem commented:

Hey Otto,
I updated my comment with the file I am attempting to import.
Thank you,
-R

@ofajardo ofajardo reopened this Feb 14, 2020
ofajardo (Owner) commented:

Thanks, I can reproduce the error; I'll take a look.

ofajardo (Owner) commented:

The error is coming from the C library librdata, so I have reported it there: WizardMac/librdata#28

ofajardo added a commit that referenced this issue Feb 18, 2020
prdctofchem commented:

Thank you for your attention on this! Issue resolved.
-R

ofajardo (Owner) commented:

Great!

prdctofchem commented:

Is there any support for handling column attributes from a data.table object? If not, I can open a new issue if you'd like.

ofajardo (Owner) commented:

I don't think there is, but if you open a ticket with a good example I can bring it upstream, and maybe one day we get it.

beyondpie commented:

I get the same error. My data is 144 MB and includes one matrix with about 10^5 rows and 10^3 columns.

ofajardo (Owner) commented Jul 6, 2020

Hmm, 10^5 rows is roughly 2^17 rows. In a previous comment I said the limit is now 2^30 integers or 2^29 doubles (numeric), so it seems your issue is somewhere else. Please provide a file and/or code to reproduce it.
Please also make sure you are using the latest version of the package.
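
One quick way to confirm which pyreadr version is installed (a small sketch using only the standard library; needs Python 3.8+):

from importlib.metadata import version

print(version("pyreadr"))  # compare against the latest release on PyPI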

beyondpie commented:

@ofajardo
Thank you so much for your response; I will check it!

beyondpie commented:

@ofajardo

I notice that my matrix actually has 7 * 10^5 rows and 9 * 10^3 columns. The total number of elements (integers) is therefore about 6.3 * 10^9, which is beyond 2^30...

ofajardo (Owner) commented Jul 6, 2020

The limit is per column.

beyondpie commented:

Thank you! I checked that I am using the newest version. I use save() directly to save the matrix. Or do I need to save it as a list or in some other format?

ofajardo (Owner) commented Jul 6, 2020

Ahhh, you are using a matrix. In that case the limit applies to the whole matrix. Use a data frame instead.
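
A rough illustration of why that helps (a sketch only; it assumes 4-byte integer elements, as reported above, and the post-fix librdata cap of 2**32 bytes per vector; an R matrix is stored as a single vector, while each data.frame column is its own vector):

CAP_BYTES = 2 ** 32        # per-vector limit after the MAX_BUFFER_SIZE change
INT_BYTES = 4              # assuming integer elements

rows, cols = 7 * 10 ** 5, 9 * 10 ** 3

matrix_bytes = rows * cols * INT_BYTES   # the whole matrix is one vector
column_bytes = rows * INT_BYTES          # one data.frame column is one vector

print(matrix_bytes > CAP_BYTES)   # True:  the matrix exceeds the cap, so the read fails
print(column_bytes > CAP_BYTES)   # False: each column fits comfortably, so the read works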

@ofajardo ofajardo reopened this Jul 6, 2020
beyondpie commented Jul 6, 2020

Thank you! Using data.frame works!

Much appreciated!

ofajardo (Owner) commented:

The column limit discussed in this issue is removed in pyreadr version 0.3.0.

psureshmagadi17 commented:

Hey, I'm getting this issue even with a small RDS file (1.5 MB). Is there a fix for this?

ofajardo (Owner) commented Sep 3, 2020

@psureshmagadi17 Update the package to the latest version; that should fix it. If you still get the error, open a separate issue, as the bug would be something else, but please provide a sample file. It is truly impossible to help without being able to reproduce the issue.
