[Python][GcsFileSystem][Parquet] fails to create ParquetFile from GCS after a few hundred files #35318
The error (…)
In the same code, if I download the blob as bytes, there were no issues, so I doubt it's a connection issue. I don't know how …
@coryan Have you seen this error before?
No, that is a new one for me. What version of Apache Arrow is this running with? Does it include the fixes in #34051? If it does not include those fixes, I speculate (as in "I am not sure, but maybe") that this is starting 300+ downloads. That would consume about 600 sockets and could exhaust some resource (e.g. file descriptors).
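If descriptor exhaustion is the suspicion, one quick check from inside the Python process is to compare the number of open descriptors against the soft limit. A minimal sketch (my addition, assuming Linux and only the standard library):

```python
import os
import resource

def fd_usage():
    # /proc/self/fd has one entry per open descriptor (Linux only);
    # the count briefly includes the descriptor listdir itself uses.
    open_fds = len(os.listdir("/proc/self/fd"))
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fds, soft, hard

open_fds, soft, hard = fd_usage()
print(f"{open_fds} fds open; limits: soft={soft}, hard={hard}")
```

Calling this inside the download loop would show whether the count climbs toward the limit as files are opened.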
It was using the latest version as of the time of the issue report.
The latest PyPI-published version, that is.
Okay. 12.0.0 was released on 2023-05-02, about a week after this issue report. The fixes are not in the 11.0.0 release; I think it is worthwhile to try with 12.0.0.
With pyarrow 12.0.0, I got this error after 333 files: …
I'm looping through https://github.com/zpz/biglist/blob/7910c60524aeeee19a037245a61fc58d8638e600/src/biglist/_parquet.py#L49 objects, each getting a GCS path. In the loop I call …
From curl/curl#8921 it would seem that too many open file descriptors is indeed a very likely culprit. @zpz, can you show what you get from …? Another possible cause is that the kernel is running out of memory. @zpz, can you share the value of …?
I run it within a Docker container. In the container (with no active work) I got: …
My code did have the problem that, as I looped through the hundreds of files, previous files stayed around. I have now avoided that situation, but still got the error after 333 files: …
Ok, I took a brief tour through the libcurl source code: …
Let's dive a bit into google-cloud-cpp. There are two similar functions named … We are left with an error returned from …
We can eliminate EFAULT as … This leaves us with EINTR, which can happen in the case that …
@zpz Perhaps you can try to use strace to see if your program is receiving any signals?
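For reference, a rough in-process alternative to attaching strace is to install handlers that log catchable signals as they arrive. This is only a sketch of the idea, not what was actually run here:

```python
import signal

def log_catchable_signals():
    # Print a message whenever one of these signals arrives. SIGKILL and
    # SIGSTOP cannot be caught, and strace remains the authoritative tool
    # for hunting EINTR, but this can confirm ordinary signal traffic.
    def handler(signum, frame):
        print(f"received {signal.Signals(signum).name}")
    for sig in (signal.SIGHUP, signal.SIGTERM, signal.SIGUSR1, signal.SIGUSR2):
        signal.signal(sig, handler)
```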
It should also be possible to monitor RAM usage during the execution. Given that …
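One way to monitor RAM from within the loop, without extra dependencies, is to read the resident set size from /proc/self/status. A Linux-only sketch (my addition):

```python
def rss_mib():
    # VmRSS is reported in kiB in /proc/self/status (Linux only).
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024.0
    return None  # field missing: not a Linux procfs

print(f"current RSS: {rss_mib():.1f} MiB")
```

Logging this once per file in the loop would show whether memory grows steadily toward exhaustion.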
Current status on this: …
We'll have to bump our bundled version of …
A new …
Opened #35879
While there is apparently a valid bug related to this, I should report that I found a bug in my code, which failed to …
My working code is here: https://github.com/zpz/biglist/blob/main/src/biglist/_parquet.py#L86. The pyarrow behavior seems flawed in that it should take care of this itself. It has a context manager, but in this case, where the context manager doesn't do much, many applications may not use it, and the code should handle finalization regardless. This is the case in multiple places in the standard multiprocessing code.
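The finalization pattern being suggested could look roughly like the following, using weakref.finalize in the spirit of the Finalize helper in the stdlib multiprocessing module. This is a hypothetical wrapper, not pyarrow's actual implementation:

```python
import weakref

class ManagedFile:
    """Hypothetical wrapper that closes its source even when the caller
    skips both close() and the context manager."""

    def __init__(self, source):
        self._source = source
        # weakref.finalize runs source.close() at garbage collection
        # (or interpreter exit) if close() was never called explicitly.
        self._finalizer = weakref.finalize(self, source.close)

    def close(self):
        self._finalizer()  # idempotent: the callback runs at most once

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()
```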
@zpz the … (the referenced snippets: arrow/python/pyarrow/parquet/core.py, lines 346 to 350 in e798e2a, and lines 431 to 433 in e798e2a)
(note that … cc @jorisvandenbossche for the potential …)
My code does not use a context manager on this ParquetFile. My previous code had a bug in closing it.
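For completeness, the fix on the caller side amounts to closing each file deterministically instead of leaving it to garbage collection; either explicit close() calls or the context manager work. A minimal sketch of the latter (the bucket and path are hypothetical):

```python
import pyarrow.parquet as pq
from pyarrow.fs import GcsFileSystem

fs = GcsFileSystem()

# "my-bucket/data/part-0.parquet" is a placeholder path.
with fs.open_input_file("my-bucket/data/part-0.parquet") as source:
    with pq.ParquetFile(source) as pf:
        table = pf.read_row_group(0)
# Both the ParquetFile and the GCS input stream are closed here,
# so the connection's file descriptors are released promptly.
```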
### Rationale for this change

The version will fix #35318.

### What changes are included in this PR?

Use the latest released version.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* Closes: #35879

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Describe the bug, including details regarding any error messages, version, and platform.
I posted the question on SO https://stackoverflow.com/questions/76012391/pyarrow-fails-to-create-parquetfile-from-blob-in-google-cloud-storage
My guess is that the issue is in either GcsFileSystem or its interaction with GCS. I don't have a code snippet to reproduce the issue. For me it happens after looping through 300+ files, and after that, the issue seems to persist.
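Although there is no verified reproducer, the access pattern described would look roughly like this (the paths and counts are illustrative, not the actual code):

```python
import pyarrow.parquet as pq
from pyarrow.fs import GcsFileSystem

fs = GcsFileSystem()

# Open many files lazily without ever closing them; each ParquetFile
# keeps its GCS input stream (and its descriptors) alive until GC.
readers = []
for i in range(400):  # reportedly fails around file ~330
    source = fs.open_input_file(f"my-bucket/data/part-{i}.parquet")
    readers.append(pq.ParquetFile(source))
```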
The gist of it is using biglist.ParquetFileReader.load_file. With lazy=False, it works fine. With lazy=True, after 300+ files, it starts to fail with …

Component(s)
Parquet, Python