
ValueError: Unable to allocate memory #3

Closed · saegomat opened this issue Jan 10, 2019 · 40 comments
Labels: bug (Something isn't working)

saegomat commented Jan 10, 2019

Hello @ofajardo,
This is a great package!

My RDS files are 300 MB+ and I run into memory issues:

import pyreadr
scr = 'xyz.rds'
result = pyreadr.read_r(scr)
Traceback (most recent call last):
File "", line 1, in
File "C:\Users\tsaeger\AppData\Local\Continuum\anaconda3\lib\site-packages\pyreadr\pyreadr.py", line 39, in read_r
parser.parse(path)
File "pyreadr\librdata.pyx", line 113, in pyreadr.librdata.Parser.parse
File "pyreadr\librdata.pyx", line 138, in pyreadr.librdata.Parser.parse
ValueError: Unable to allocate memory

It works like a champ for smaller RDS files. I have not tested where the cut-off is. My system has 32GB of RAM.

Best,
--T

ofajardo (Owner) commented Jan 11, 2019

Hi @saegomat,

Thanks a lot for trying the package and for the positive feedback!

Short answer: unfortunately, I think you would need more RAM. It is also difficult to estimate, just from the size of the .rdata file, how much RAM you will need.

More elaboration:

I was doing tests with a 40 MB (compressed) .rdata file. Written out as a CSV it is 440 MB, meaning it was compressed about 11 times. Read into memory it occupies 1.3 GB, both in R and in Python, which is normal. That means that for this case you need 32.5 times more RAM than the size of the .rdata file.

Now, I have 16 GB of RAM, so theoretically I should be able to load a .rdata file of 16 GB / 32.5 ≈ 492 MB. BUT you have to take into account that Windows is already using a lot of RAM, in my case at least 4 GB, so I have only about 12 GB free. That means I should be able to read a file of 12 GB / 32.5 ≈ 369 MB, and that should be the cutoff. By replicating the original data frame I produced .rdata files of several sizes, and indeed I was able to read files up to 360 MB, while a 400 MB file raises the error you have seen. So the calculation works. (I don't know what Windows is doing, but sometimes the busy RAM was 5 or even as high as 8 GB, and sometimes just 3 GB, with the same programs open.)

Now, why can I read almost 400 MB and you can't?

The first thing would be to check how much FREE RAM you actually have; maybe you are running lots of things and have much less than you think (use the Windows Task Manager).

Another thing is that the degree of compression is quite variable. I also tried a big matrix of just numbers: the .rdata file was 7 GB and it was 7 GB uncompressed as well (so essentially no compression), and I managed to read it into memory. For a matrix of just strings, the file compressed to only about half its uncompressed size. I guess the initial data frame I was working with (the 40 MB one) has a lot of empty strings, missing values and repeated values, and those compress very well.

So it is difficult to say, just looking at the file, how large it will be uncompressed and then in RAM. My guess would be that yours is heavily compressed, and that is why you fail to read it.

You could try to uncompress it to see how large it is; as a rule of thumb you then need about 5 times that much RAM. The only reliable way to know would be to open it in R (if you manage) and look at the size of the resulting data frame. If it is much larger than your free RAM, then that would be the explanation.

Are you generating the files yourself or did you get them from somewhere? If you are generating them yourself, check how large the object is in R; my guess is that you are right at the border of what your machine can handle.
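
A quick way to check the uncompressed size from Python (a minimal sketch; it assumes the file uses R's default gzip compression, so files saved with bzip2 or xz would need the bz2 or lzma modules instead, and "xyz.rds" is just the placeholder path from above):

import gzip

# Stream-decompress and count bytes; .rds/.RData files written with R's default
# settings are gzip-compressed, so this gives the size of the serialized data.
total = 0
with gzip.open("xyz.rds", "rb") as f:
    while True:
        chunk = f.read(1024 * 1024)  # 1 MB at a time, to avoid holding everything in memory
        if not chunk:
            break
        total += len(chunk)

print("Uncompressed size: %.1f MB" % (total / 1024 ** 2))
# Rule of thumb from above: budget roughly 5x this amount of free RAM.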

ofajardo (Owner) commented:

One more technical detail/precision (notes for myself):

The error comes when zlib is trying to decompress the file into memory and thinks there is not enough memory to do so.

It should be coming from rdata_read.c, in the zlib/gzip decompression path:

if (inflateInit2(ctx->z_strm, (15+32)) != Z_OK) {
        retval = RDATA_ERROR_MALLOC;
        goto cleanup;
    }

Nothing can be done :(

saegomat (Author) commented:

Hi @ofajardo,
Thanks much for the feedback. Opening the data file through other means shows ~28 GB of RAM used, so I'll need to work on a workaround :-)

Best,
--T

ofajardo (Owner) commented:

Closed after 1 month of inactivity.

cpury commented Mar 1, 2019

Hey, I'm having the same issue, but I find it hard to believe... My file is only 22 MB, yet I can't open it even though my machine has 12 GB of RAM free... How is that possible? I've never run into this kind of problem with other file types, and I can't imagine R has some superior compression algorithm that squeezes tens of GB into only 22 MB...

cpury commented Mar 1, 2019

I just opened my file via rpy2 with no problem. The data in there, uncompressed, is 200 MB... I'm new to R, but it would be a major issue if opening 200 MB of data needed tens of GB of RAM.

ofajardo (Owner) commented Mar 1, 2019

Would you share the file so I can take a look?

cpury commented Mar 1, 2019

Unfortunately I can't share it :(

ofajardo (Owner) commented Mar 1, 2019

A pity ... as your case seems really extreme, it would have been good to see what exactly is happening.

Just one thing, in case you have a few minutes: can you make a copy of your file, rename the extension from .RData or .rds to .zip, and then unzip it? How large is the unzipped file?

cpury commented Mar 2, 2019

Hey Otto, unzipped it's 187.2 MB in size.

ofajardo (Owner) commented Mar 3, 2019

Thanks. Your case definitely looks like a bug, but without being able to reproduce it, nothing can be done.

Is only that file failing, or any file? (You have some samples in the test_data folder.)

If more files are failing, one easy thing you could try is installing in a different way (for example, if you used pip, try conda or compiling from source); maybe it is a shared-library conflict.

cpury commented Mar 3, 2019

Thanks for looking into it, Otto. It's a pity I can't share the file. I will try again next time I get an R file and let you know if I run into similar trouble or not.

ofajardo (Owner) commented Mar 5, 2019

I have filed an issue in librdata to see if something can be done to improve this situation. No guarantee that it can be fixed, though.

That would be for the general case; the case reported by @cpury may be a different bug, and for that we will need a sample file if the problem appears again.

Gootjes commented Mar 29, 2019

For me, the memory allocation error appears for any dataset with more than 2 to the power of 22 rows (8 MB as an .Rdata file, 11 MB unzipped). I am on an 8 GB RAM system and have about 4 GB at my disposal.

Using the following code, I can see the magic cutoff happening. I am not sure whether it will replicate on your side, as it probably depends on RAM.

# R: create two test files, one exactly at the cutoff and one just past it
d <- data.frame(ID=1:((2**22)+0)); save(d, file = "2pow22.Rdata")
d <- data.frame(ID=1:((2**22)+1)); save(d, file = "2pow22_and_1.Rdata")

# Python: the first file reads fine, the second raises the error
import pyreadr as P
P.read_r("2pow22.Rdata")
P.read_r("2pow22_and_1.Rdata")

The interesting thing is that the issue is not the dataset size per se, but rather the number of rows, as this code runs fine:

# R: twice the data (two columns of 2**22 rows each) still reads fine
d <- data.frame(ID=1:(2**22), ID2=1:(2**22)); save(d, file = "2pow22_by_2.Rdata")
# Python
P.read_r("2pow22_by_2.Rdata")

@ofajardo ofajardo reopened this Mar 29, 2019
ofajardo (Owner) commented Mar 30, 2019

Hi @Gootjes,

Brilliant! Really nice detective work, thanks a lot for the report.

I can fully reproduce the issue. Furthermore, if all the values in the ID vector are 0 the same error still arises, meaning it is not the numeric value inside the vector that causes the issue, but the number of rows itself, as you say. I can also reproduce the issue after decompressing the files, meaning the error is not coming from zlib.

As a next step I need to see whether the same error arises when reading these files with C directly, so that I know whether the issue comes from the C library or from my code. I am actually using a variant of the original C code that compiles on Windows, so I have to check that as well. At the moment, however, the original C library has a bug that is preventing me from testing, but I have already opened an issue over there and will follow up.

I would also like to know to what extent people's issues come from this bug or from a lack of RAM. In your case, did you encounter the problem while reading a normal file and then manage to reduce it to this minimal example? I just re-checked the example I made up earlier in this issue, and in that case I can get the memory error with fewer than 2**22 rows, meaning that one is indeed a lack of RAM.

@cpury, @saegomat: do your files have more than 2**22 rows, and would this explain your issues? In the case of @cpury I would say probably yes, but in the case of @saegomat, which of the two is it?

@ofajardo ofajardo added the "bug (Something isn't working)" and "enhancement (New feature or request)" labels Mar 30, 2019
Gootjes commented Mar 30, 2019

Hi @ofajardo,

I did not encounter this issue myself on a normal file; I produced the minimal example because someone reported this issue (jamovi/jamovi#689) with a dataset of 5.2 million cases (rows).

I am not well-versed in C, so I cannot check whether it is in the librdata C library or in your code. I did check out your initial pyreadr commit be4a941, and the issue is already apparent there, so if it is in your code, it has been there from the start :).

ofajardo (Owner) commented Mar 30, 2019

@Gootjes I found the problem: it is in the C library, in rdata_read.c at line 1106. The culprit is the function rdata_malloc at line 89:

static void *rdata_malloc(size_t len) {
    if (len > MAX_BUFFER_SIZE || len == 0)
        return NULL;

    return malloc(len);
}

Basically, MAX_BUFFER_SIZE is hard-coded to 16777216 (2**24) bytes, and if the vector needs more bytes than that, the function returns NULL and the error is raised. For example, 2**22 rows of integers (your example) is 2**22 * 4 bytes, which is exactly the maximum buffer size; one more integer and the error comes. A vector of numeric (8 bytes per element) only needs to be half that length to raise the error. I manually increased MAX_BUFFER_SIZE and the error disappeared. I am not sure why it was hard-coded to this particular value.
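
A quick back-of-the-envelope check of that arithmetic (a minimal sketch in Python; the element sizes assume R's standard 4-byte integers and 8-byte doubles):

MAX_BUFFER_SIZE = 2 ** 24          # 16777216 bytes, the value hard-coded in rdata_read.c
INT_BYTES, DOUBLE_BYTES = 4, 8     # R integer vs. numeric (double) element sizes

print(MAX_BUFFER_SIZE // INT_BYTES)     # 4194304 == 2**22, so 2**22 integers fit exactly
print(MAX_BUFFER_SIZE // DOUBLE_BYTES)  # 2097152 == 2**21, so only 2**21 doubles fit
# One element more than that and rdata_malloc() returns NULL, which surfaces as
# "Unable to allocate memory".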

I will report in librdata for them to take a look.

Gootjes commented Mar 30, 2019

That makes complete sense, awesome find! Thanks

saegomat (Author) commented:

(quoting @ofajardo's question above about whether the files have more than 2**22 rows)

Hello,
My data set has 25 million rows, so yes, above the 2**22 row cutoff.

Thx, --T

@ofajardo ofajardo added the "waiting for librdata changes" label (the issue needs fixes to the C library librdata before it can be solved) Apr 3, 2019
@ofajardo ofajardo removed the "enhancement (New feature or request)" label Apr 14, 2019
@ofajardo ofajardo removed the "waiting for librdata changes" label Apr 14, 2019
ofajardo (Owner) commented Apr 14, 2019

The limit (MAX_BUFFER_SIZE) has been changed in librdata; the maximum size of a vector is now 2**32 bytes (4 GB), meaning 2**30 elements for an integer vector or 2**29 elements for a double vector. This is now explained in the README under known limitations.

Although there is still a hard-coded limit, this should hopefully be enough for practical applications.

If somebody still encounters issues, please report them.

There is a new version 0.1.9 of pyreadr on PyPI (pip) and conda with the fix.

prdctofchem commented Feb 12, 2020

I am receiving the same error as those above, but my file is not large (only 111 KB). I have a single data.table in the .rds file and plenty of memory (64 GB RAM, ~11 GB in use). I am running Python 3.7 and pyreadr version 0.2.2. The exact error message is:

---------------------------------------------------------------------------
LibrdataError                             Traceback (most recent call last)
<ipython-input-17-e56e8f0640e6> in <module>
----> 1 dt = pyreadr.read_r('dir_path\\import_file.rds')

C:\Python37\lib\site-packages\pyreadr\pyreadr.py in read_r(path, use_objects, timezone)
     38     if timezone:
     39         parser.set_timezone(timezone)
---> 40     parser.parse(path)
     41 
     42     result = OrderedDict()

C:\Python37\lib\site-packages\pyreadr\librdata.pyx in pyreadr.librdata.Parser.parse()

C:\Python37\lib\site-packages\pyreadr\librdata.pyx in pyreadr.librdata.Parser.parse()

LibrdataError: Unable to allocate memory

I've tried changing some of the parameters of saveRDS to see if it had anything to do with the compression method, and I've looked around to see if there could be permission issues with the librdata library, but have not come across anything helpful. I also tried saving an R data.frame in the RDS to see if it had anything to do with the object class, and surprisingly I received a different error:

---------------------------------------------------------------------------
LibrdataError                             Traceback (most recent call last)
<ipython-input-14-8c720a500237> in <module>
----> 1 dt = pyreadr.read_r('dir_path\\import_file.rds')

C:\Python37\lib\site-packages\pyreadr\pyreadr.py in read_r(path, use_objects, timezone)
     38     if timezone:
     39         parser.set_timezone(timezone)
---> 40     parser.parse(path)
     41 
     42     result = OrderedDict()

C:\Python37\lib\site-packages\pyreadr\librdata.pyx in pyreadr.librdata.Parser.parse()

C:\Python37\lib\site-packages\pyreadr\librdata.pyx in pyreadr.librdata.Parser.parse()

LibrdataError: Invalid file, or file has unsupported features

Hopefully that offers a clue, since I can't understand how this could actually be a memory-related problem. Thank you for your help.
-R

Edit:
The single data.table object, stored in an .rds file, is here:
https://drive.google.com/file/d/14pfyluEH-ADJC6wf8P0zoklikCsEFxV9/view?usp=sharing

ofajardo (Owner) commented:

Thanks for the report. Without an example file, though, and without being able to reproduce it, there is not much I can do.

prdctofchem commented:

Hey Otto,
I updated my comment with the file I am attempting to import.
Thank you,
-R

@ofajardo ofajardo reopened this Feb 14, 2020
ofajardo (Owner) commented:

Thanks, I can reproduce the error; I'll take a look.

ofajardo (Owner) commented:

The error is coming from the C library librdata, so I have reported it there: WizardMac/librdata#28

ofajardo added a commit that referenced this issue Feb 18, 2020
prdctofchem commented:

Thank you for your attention on this! Issue resolved.
-R

ofajardo (Owner) commented:

Great!

prdctofchem commented:

Is there any support for handling column attributes from a data.table object? If not, I can open a new issue if you'd like.

ofajardo (Owner) commented:

I don't think there is, but if you open a ticket with a good example I can bring it upstream, and maybe one day we get it.

beyondpie commented:

I get the same error. My data is 144 MB and includes one matrix with about 10^5 rows and 10^3 columns.

ofajardo (Owner) commented Jul 6, 2020

Hmm, 10^5 rows is roughly 2^17 rows. In a previous comment I said the limit is now 2^30 integers or 2^29 doubles (numeric), so it seems your issue is somewhere else. Please provide a file and/or code to reproduce it.
Please also make sure you are using the latest version of the package.
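
One quick way to confirm which pyreadr version is installed (a small sketch using only the standard library; needs Python 3.8+):

from importlib.metadata import version

print(version("pyreadr"))  # compare against the latest release on PyPI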

beyondpie commented:

@ofajardo
Thank you so much for your response; I will check it!

beyondpie commented:

@ofajardo

I notice that my matrix actually has 7 * 10^5 rows and 9 * 10^3 columns. The total number of elements (integers) is therefore about 6.3 * 10^9, which is beyond 2^30...

ofajardo (Owner) commented Jul 6, 2020

The limit is per column.

beyondpie commented:

Thank you! I checked that I am using the newest version. I use save() directly to save the matrix. Or do I need to save it as a list or in some other format?

ofajardo (Owner) commented Jul 6, 2020

Ahhh, you are using a matrix. In that case the limit applies to the whole matrix. Use a data frame instead.
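
A rough illustration of why that helps (a sketch only; it assumes 4-byte integer elements, as reported above, and the post-fix librdata cap of 2**32 bytes per vector; an R matrix is stored as a single vector, while each data.frame column is its own vector):

CAP_BYTES = 2 ** 32        # per-vector limit after the MAX_BUFFER_SIZE change
INT_BYTES = 4              # assuming integer elements

rows, cols = 7 * 10 ** 5, 9 * 10 ** 3

matrix_bytes = rows * cols * INT_BYTES   # the whole matrix is one vector
column_bytes = rows * INT_BYTES          # one data.frame column is one vector

print(matrix_bytes > CAP_BYTES)   # True:  the matrix exceeds the cap, so the read fails
print(column_bytes > CAP_BYTES)   # False: each column fits comfortably, so the read works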

@ofajardo ofajardo reopened this Jul 6, 2020
beyondpie commented Jul 6, 2020

Thank you! Using data.frame works!

Much appreciated!

ofajardo (Owner) commented:

The column limit discussed in this issue is removed in pyreadr version 0.3.0.

psureshmagadi17 commented:

Hey, I'm getting this issue even with a small RDS file (1.5 MB). Is there a fix for this?

ofajardo (Owner) commented Sep 3, 2020

@psureshmagadi17 Update the package to the latest version; that should fix it. If you still get the error, open a separate issue, as the bug would be something else, but please provide a sample file. It is truly impossible to help without being able to reproduce the issue.
