
[Python][GcsFileSystem][Parquet] fails to create ParquetFile from GCS after a few hundred files #35318

Closed
zpz opened this issue Apr 25, 2023 · 22 comments · Fixed by #36119

@zpz

zpz commented Apr 25, 2023

Describe the bug, including details regarding any error messages, version, and platform.

I posted the question on SO https://stackoverflow.com/questions/76012391/pyarrow-fails-to-create-parquetfile-from-blob-in-google-cloud-storage

My guess is that the issue lies either in GcsFileSystem or in its interaction with GCS. I don't have a code snippet that reproduces the issue; for me it happens after looping through 300+ files, and once it starts, it seems to persist.

The gist of it is that I'm using biglist.ParquetFileReader.load_file (a rough sketch of the pattern is shown after the traceback below):

  • if lazy=False, it works fine.
  • if lazy=True, after 300+ files, it starts to fail with
    File "/usr/local/lib/python3.10/dist-packages/pyarrow/parquet/core.py", line 319, in __init__
      source = filesystem.open_input_file(source)
    File "pyarrow/_fs.pyx", line 770, in pyarrow._fs.FileSystem.open_input_file
    File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
    File "pyarrow/error.pxi", line 138, in pyarrow.lib.check_status
  pyarrow.lib.ArrowException: Unknown error: google::cloud::Status(UNKNOWN: Permanent error GetObjectMetadata: WaitForHandles(): unexpected error code in curl_multi_*, [12]=Unrecoverable error in select/poll)
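
For illustration, here is a minimal sketch of the access pattern (hypothetical bucket and file names, not the actual biglist code): many ParquetFile objects are opened lazily through GcsFileSystem, and each handle stays open unless it is closed explicitly.

# Sketch of the failing pattern (hypothetical paths, not the original code):
# open many Parquet files lazily via GcsFileSystem.
from pyarrow import fs
import pyarrow.parquet as pq

gcs = fs.GcsFileSystem()
paths = [f"my-bucket/data/part-{i:05d}.parquet" for i in range(500)]  # hypothetical

for p in paths:
    f = pq.ParquetFile(p, filesystem=gcs)  # lazy: only the footer/metadata is read
    print(f.metadata.num_rows)
    # the underlying GCS input stream stays open here unless f.close() is called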

Component(s)

Parquet, Python

@zpz
Author

zpz commented Apr 25, 2023

@westonpace
Member

The error (unrecoverable error from select/poll) originates from curl. So it seems this is related to curl / GCS. Is there any possibility the connection is having issues?

@zpz
Author

zpz commented Apr 27, 2023

In the same code, if I download the blob as bytes, there were no issues, so I doubt it's a connection issue. I don't know how curl is used; it's not used in my code. I feel the issue is some interaction between GcsFileSystem and the GCS service. Note that the issue happens only after processing a few hundred blobs, so something seems to be building up.

@pitrou
Member

pitrou commented May 11, 2023

@coryan Have you already seen this error?

@coryan
Contributor

coryan commented May 11, 2023

Have you already seen this error?

No, that is a new one for me.

What version of Apache Arrow is this running with? Does it include the fixes in #34051?

If it does not include those fixes, I speculate (as in "I am not sure, but maybe") that this is starting 300+ downloads. That would consume about 600 sockets and could exhaust some resource (e.g. file descriptors).
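
One way to test that speculation (a hypothetical Linux-only diagnostic, not something from the original report) is to count the process's open file descriptors while the loop runs and see whether the number keeps climbing toward the ulimit -n value:

import os

def open_fd_count() -> int:
    # number of file descriptors currently open in this process (Linux only)
    return len(os.listdir("/proc/self/fd"))

# call open_fd_count() every few iterations of the file loop; a steadily
# growing count suggests handles are not being released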

@zpz
Author

zpz commented May 11, 2023

It was using the latest version as of the time of the issue report.

@zpz
Author

zpz commented May 11, 2023

The latest pypi published version, that is

@coryan
Contributor

coryan commented May 11, 2023

The latest pypi published version, that is

Okay. 12.0.0 was released on 2023-05-02, about a week after this issue report. The fixes are not in the 11.0.0 release:

771c37a

I think it is worthwhile to try with 12.0.0.

@zpz
Author

zpz commented May 12, 2023

With pyarrow 12.0.0, I got this error after 333 files:

error after 0.4718797499954235 seconds
<class 'pyarrow.lib.ArrowException'>
Unknown error: google::cloud::Status(UNKNOWN: Permanent error ReadObjectNotWrapped: WaitForHandles(): unexpected error code in curl_multi_*, [12]=Unrecoverable error in select/poll)
('Unknown error: google::cloud::Status(UNKNOWN: Permanent error ReadObjectNotWrapped: WaitForHandles(): unexpected error code in curl_multi_*, [12]=Unrecoverable error in select/poll)',)


Traceback (most recent call last):
  File "/home/docker-user/sunny/tests/manual/parq.py", line 67, in <module>
    main()
  File "/home/docker-user/sunny/tests/manual/parq.py", line 41, in main
    n = len(batch)
  File "/usr/local/lib/python3.10/site-packages/biglist/_parquet.py", line 171, in __len__
    return self.num_rows
  File "/usr/local/lib/python3.10/site-packages/biglist/_parquet.py", line 203, in num_rows
    return self.metadata.num_rows
  File "/usr/local/lib/python3.10/site-packages/biglist/_parquet.py", line 199, in metadata
    return self.file.metadata
  File "/usr/local/lib/python3.10/site-packages/biglist/_parquet.py", line 194, in file
    self._file = self.load_file(self.path, lazy=self.lazy)
  File "/usr/local/lib/python3.10/site-packages/biglist/_parquet.py", line 97, in load_file
    file = ParquetFile(pp, filesystem=ff)
  File "/usr/local/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 334, in __init__
    self.reader.open(
  File "pyarrow/_parquet.pyx", line 1220, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/error.pxi", line 138, in pyarrow.lib.check_status
pyarrow.lib.ArrowException: Unknown error: google::cloud::Status(UNKNOWN: Permanent error ReadObjectNotWrapped: WaitForHandles(): unexpected error code in curl_multi_*, [12]=Unrecoverable error in select/poll)

I'm looping through https://github.com/zpz/biglist/blob/7910c60524aeeee19a037245a61fc58d8638e600/src/biglist/_parquet.py#L49 objects, each holding a GCS path. In the loop I call len(obj), which calls its load_file with lazy=True.

@westonpace
Member

From curl/curl#8921 it would seem that too many open file descriptors is indeed a very likely culprit.

@zpz can you show what you get from ulimit -a?

Another possible cause is that the kernel is running out of memory. @zpz can you share the value of cat /proc/sys/vm/overcommit_memory? If overcommit is disabled (i.e. if that command returns 2) then it is possible the kernel will decide it is out of memory well before it actually uses all physical memory.

@zpz
Author

zpz commented May 17, 2023

I run it within a Docker container. Inside the container (with no active work) I get:

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 62474
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
$ cat /proc/sys/vm/overcommit_memory 
1

@zpz
Author

zpz commented May 17, 2023

My code did have the problem that, as I looped through the hundreds of files, previously opened files stayed around. I have now avoided that situation, but I still got the error after 333 files:

error after 0.4439328750013374 seconds
<class 'pyarrow.lib.ArrowException'>
Unknown error: google::cloud::Status(UNKNOWN: Permanent error ReadObjectNotWrapped: WaitForHandles(): unexpected error code in curl_multi_*, [12]=Unrecoverable error in select/poll)
('Unknown error: google::cloud::Status(UNKNOWN: Permanent error ReadObjectNotWrapped: WaitForHandles(): unexpected error code in curl_multi_*, [12]=Unrecoverable error in select/poll)',)

.
.
.

  File "/usr/local/lib/python3.10/site-packages/biglist/_parquet.py", line 171, in __len__
    return self.num_rows
  File "/usr/local/lib/python3.10/site-packages/biglist/_parquet.py", line 203, in num_rows
    return self.metadata.num_rows
  File "/usr/local/lib/python3.10/site-packages/biglist/_parquet.py", line 199, in metadata
    return self.file.metadata
  File "/usr/local/lib/python3.10/site-packages/biglist/_parquet.py", line 194, in file
    self._file = self.load_file(self.path, lazy=self.lazy)
  File "/usr/local/lib/python3.10/site-packages/biglist/_parquet.py", line 97, in load_file
    file = ParquetFile(pp, filesystem=ff)
  File "/usr/local/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 334, in __init__
    self.reader.open(
  File "pyarrow/_parquet.pyx", line 1220, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/error.pxi", line 138, in pyarrow.lib.check_status
pyarrow.lib.ArrowException: Unknown error: google::cloud::Status(UNKNOWN: Permanent error ReadObjectNotWrapped: WaitForHandles(): unexpected error code in curl_multi_*, [12]=Unrecoverable error in select/poll)

@pitrou
Member

pitrou commented May 18, 2023

Ok, I took a brief tour through the libcurl source code:

  • The "Unrecoverable error in select/poll" error is generated in curl_multi_wait if Curl_poll returns -1
  • Curl_poll (which, understandably, is a wrapper around poll on Unix) returns -1 in three situations:
    1. nfds is non-zero and poll returns an error that's not EINTR
    2. nfds is zero and the given timeout is negative
    3. nfds is zero and poll returns an error including EINTR

Let's dive a bit into google-cloud-cpp. There are two similar functions named WaitForHandles (CurlImpl::WaitForHandles and CurlDownloadRequest::WaitForHandles). Both call curl_multi_wait with zero extra file descriptors and a hard-coded positive timeout. This eliminates the "negative timeout" situation above.

We are left with an error returned from poll. According to the Linux man page, these can be:

       EFAULT fds points outside the process's accessible address space. The
              array given as argument was not contained in the calling
              program's address space.

       EINTR  A signal occurred before any requested event; see signal(7).

       EINVAL The nfds value exceeds the RLIMIT_NOFILE value.

       ENOMEM Unable to allocate memory for kernel data structures.

We can eliminate EFAULT as curl_poll ensures the fds point to accessible memory.
EINVAL is extremely unlikely given a limit of 1048576 open files in #35318 (comment) .
ENOMEM cannot be ruled out, but I guess exhaustion of kernel data space would manifest randomly in other ways?

This leaves us with EINTR, which can happen in the case that curl_multi_wait doesn't find any file descriptors to wait for.

@pitrou
Member

pitrou commented May 18, 2023

@zpz Perhaps you can try to use strace to see if your program is receiving any signals?
See https://unix.stackexchange.com/a/372581

@westonpace
Member

ENOMEM cannot be ruled out, but I guess exhaustion of kernel data space would manifest randomly in other ways?

It should also be possible to monitor RAM usage during the execution. Given that overcommit_memory=1 I think we should only see ENOMEM if free memory is close to 0. However, I agree this is not the likely culprit.

@pitrou
Member

pitrou commented May 19, 2023

Current status on this:

  • We'll have to bump our bundled version of google-cloud-cpp when a new release gets done.
  • As for libcurl, users will often rely on a system version thereof, and we can't expect it to get a fix.

@pitrou
Member

pitrou commented Jun 1, 2023

A new google-cloud-cpp version has been released with the fix:
https://github.com/googleapis/google-cloud-cpp/releases/tag/v2.11.0

@pitrou
Member

pitrou commented Jun 1, 2023

Opened #35879

@zpz
Author

zpz commented Jun 19, 2023

While there is apparently a valid bug related to this, I should report that I found a bug in my own code: it failed to close the ParquetFile objects, which led to a buildup of memory consumption. After fixing that, my immediate problem seems to be solved.

@zpz
Author

zpz commented Jun 19, 2023

My working code is here: https://github.com/zpz/biglist/blob/main/src/biglist/_parquet.py#L86. The pyarrow behavior here seems flawed in that it should take care of this itself. It does have a context manager, but in a case like this where the context manager doesn't do much, many applications won't bother to use it, and the code should handle finalization regardless. That is how it is done in multiple places in the standard multiprocessing module.
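
For what it's worth, here is a minimal sketch of the "finalize even without a context manager" idea argued for above (a hypothetical wrapper, not the actual biglist code), using weakref.finalize to close the reader once the wrapper is garbage-collected:

import weakref
from pyarrow import fs
import pyarrow.parquet as pq

class LazyParquet:
    def __init__(self, path: str):
        self._fs = fs.GcsFileSystem()
        self._file = pq.ParquetFile(path, filesystem=self._fs)
        # close the underlying reader even if the caller never calls close()
        # and never uses a context manager
        weakref.finalize(self, self._file.close)

    @property
    def num_rows(self) -> int:
        return self._file.metadata.num_rows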

@pitrou
Member

pitrou commented Jun 19, 2023

@zpz the ParquetFile context manager should ensure that the reader is closed; does that not happen for you?

def __enter__(self):
    return self

def __exit__(self, *args, **kwargs):
    self.close()

def close(self, force: bool = False):
    if self._close_source or force:
        self.reader.close()

(note that _close_source is True if you initialized the ParquetFile with a filesystem argument)

cc @jorisvandenbossche for the potential ParquetFile issue.
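
For reference, both usage patterns below release the underlying GCS stream when the file was opened with a filesystem argument (a sketch with a hypothetical path):

from pyarrow import fs
import pyarrow.parquet as pq

gcs = fs.GcsFileSystem()

# context-manager form: the reader is closed on exit
with pq.ParquetFile("my-bucket/data/part-00000.parquet", filesystem=gcs) as f:
    n = f.metadata.num_rows

# explicit form, e.g. when the file is kept as an attribute of another object
f = pq.ParquetFile("my-bucket/data/part-00000.parquet", filesystem=gcs)
try:
    n = f.metadata.num_rows
finally:
    f.close()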

@zpz
Author

zpz commented Jun 19, 2023

My code does not use a context manager on this ParquetFile. My previous code had a bug in closing it.

kou added a commit that referenced this issue Jun 27, 2023
### Rationale for this change

The version will fix #35318.

### What changes are included in this PR?

Use the latest released version.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* Closes: #35879

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>