
[Python][GcsFileSystem][Parquet] fails to create ParquetFile from GCS after a few hundred files #35318

Closed
zpz opened this issue Apr 25, 2023 · 22 comments · Fixed by #36119

@zpz

zpz commented Apr 25, 2023

Describe the bug, including details regarding any error messages, version, and platform.

I posted the question on SO https://stackoverflow.com/questions/76012391/pyarrow-fails-to-create-parquetfile-from-blob-in-google-cloud-storage

My guess is that the issue lies either in GcsFileSystem or in its interaction with GCS. I don't have a code snippet that reproduces the issue; for me it happens after looping through 300+ files, and once it starts, it seems to persist.

The gist of it is that I'm using biglist.ParquetFileReader.load_file (a rough sketch of the pattern is shown after the traceback below):

  • if lazy=False, it works fine.
  • if lazy=True, after 300+ files, it starts to fail with
    File "/usr/local/lib/python3.10/dist-packages/pyarrow/parquet/core.py", line 319, in __init__
      source = filesystem.open_input_file(source)
    File "pyarrow/_fs.pyx", line 770, in pyarrow._fs.FileSystem.open_input_file
    File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
    File "pyarrow/error.pxi", line 138, in pyarrow.lib.check_status
  pyarrow.lib.ArrowException: Unknown error: google::cloud::Status(UNKNOWN: Permanent error GetObjectMetadata: WaitForHandles(): unexpected error code in curl_multi_*, [12]=Unrecoverable error in select/poll)
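
For illustration, here is a minimal sketch of the access pattern (hypothetical bucket and file names, not the actual biglist code): many ParquetFile objects are opened lazily through GcsFileSystem, and each handle stays open unless it is closed explicitly.

# Sketch of the failing pattern (hypothetical paths, not the original code):
# open many Parquet files lazily via GcsFileSystem.
from pyarrow import fs
import pyarrow.parquet as pq

gcs = fs.GcsFileSystem()
paths = [f"my-bucket/data/part-{i:05d}.parquet" for i in range(500)]  # hypothetical

for p in paths:
    f = pq.ParquetFile(p, filesystem=gcs)  # lazy: only the footer/metadata is read
    print(f.metadata.num_rows)
    # the underlying GCS input stream stays open here unless f.close() is called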

Component(s)

Parquet, Python

@zpz
Author

zpz commented Apr 25, 2023

@westonpace
Member

The error (unrecoverable error from select/poll) originates from curl. So it seems this is related to curl / GCS. Is there any possibility the connection is having issues?

@zpz
Author

zpz commented Apr 27, 2023

In the same code, if I download the blob as bytes, there were no issues, so I doubt it's a connection issue. I don't know how curl is used; it's not used in my code. I feel the issue is some interaction between GcsFileSystem and the GCS service. Note that the issue happens only after processing a few hundred blobs, so something seems to be building up.

@pitrou
Member

pitrou commented May 11, 2023

@coryan Have you already seen this error?

@coryan
Contributor

coryan commented May 11, 2023

Have you already seen this error?

No, that is a new one for me.

What version of Apache Arrow is this running with? Does it include the fixes in #34051?

If it does not include those fixes, I speculate (as in "I am not sure, but maybe") that this is starting 300+ downloads. That would consume about 600 sockets and could exhaust some resource (e.g. file descriptors).
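
One way to test that speculation (a hypothetical Linux-only diagnostic, not something from the original report) is to count the process's open file descriptors while the loop runs and see whether the number keeps climbing toward the ulimit -n value:

import os

def open_fd_count() -> int:
    # number of file descriptors currently open in this process (Linux only)
    return len(os.listdir("/proc/self/fd"))

# call open_fd_count() every few iterations of the file loop; a steadily
# growing count suggests handles are not being released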

@zpz
Author

zpz commented May 11, 2023

It was using the latest version as of the time of the issue report.

@zpz
Author

zpz commented May 11, 2023

The latest pypi published version, that is

@coryan
Contributor

coryan commented May 11, 2023

The latest pypi published version, that is

Okay. 12.0.0 was released on 2023-05-02, about a week after this issue report. The fixes are not in the 11.0.0 release:

771c37a

I think it is worthwhile to try with 12.0.0.

@zpz
Author

zpz commented May 12, 2023

With pyarrow 12.0.0, I got this error after 333 files:

error after 0.4718797499954235 seconds
<class 'pyarrow.lib.ArrowException'>
Unknown error: google::cloud::Status(UNKNOWN: Permanent error ReadObjectNotWrapped: WaitForHandles(): unexpected error code in curl_multi_*, [12]=Unrecoverable error in select/poll)
('Unknown error: google::cloud::Status(UNKNOWN: Permanent error ReadObjectNotWrapped: WaitForHandles(): unexpected error code in curl_multi_*, [12]=Unrecoverable error in select/poll)',)


Traceback (most recent call last):
  File "/home/docker-user/sunny/tests/manual/parq.py", line 67, in <module>
    main()
  File "/home/docker-user/sunny/tests/manual/parq.py", line 41, in main
    n = len(batch)
  File "/usr/local/lib/python3.10/site-packages/biglist/_parquet.py", line 171, in __len__
    return self.num_rows
  File "/usr/local/lib/python3.10/site-packages/biglist/_parquet.py", line 203, in num_rows
    return self.metadata.num_rows
  File "/usr/local/lib/python3.10/site-packages/biglist/_parquet.py", line 199, in metadata
    return self.file.metadata
  File "/usr/local/lib/python3.10/site-packages/biglist/_parquet.py", line 194, in file
    self._file = self.load_file(self.path, lazy=self.lazy)
  File "/usr/local/lib/python3.10/site-packages/biglist/_parquet.py", line 97, in load_file
    file = ParquetFile(pp, filesystem=ff)
  File "/usr/local/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 334, in __init__
    self.reader.open(
  File "pyarrow/_parquet.pyx", line 1220, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/error.pxi", line 138, in pyarrow.lib.check_status
pyarrow.lib.ArrowException: Unknown error: google::cloud::Status(UNKNOWN: Permanent error ReadObjectNotWrapped: WaitForHandles(): unexpected error code in curl_multi_*, [12]=Unrecoverable error in select/poll)

I'm looping through https://github.com/zpz/biglist/blob/7910c60524aeeee19a037245a61fc58d8638e600/src/biglist/_parquet.py#L49 objects, each holding a GCS path. In the loop I call len(obj), which calls its load_file with lazy=True.

@westonpace
Member

From curl/curl#8921 it would seem that too many open file descriptors is indeed a very likely culprit.

@zpz can you show what you get from ulimit -a?

Another possible cause is that the kernel is running out of memory. @zpz can you share the value of cat /proc/sys/vm/overcommit_memory? If overcommit is disabled (i.e. if that command returns 2) then it is possible the kernel will decide it is out of memory well before it actually uses all physical memory.

@zpz
Author

zpz commented May 17, 2023

I run it within a Docker container. Inside the container (with no active work) I get:

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 62474
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
$ cat /proc/sys/vm/overcommit_memory 
1

@zpz
Author

zpz commented May 17, 2023

My code did have the problem that, as I looped through the hundreds of files, previously opened files stayed around. I have now avoided that situation, but I still got the error after 333 files:

error after 0.4439328750013374 seconds
<class 'pyarrow.lib.ArrowException'>
Unknown error: google::cloud::Status(UNKNOWN: Permanent error ReadObjectNotWrapped: WaitForHandles(): unexpected error code in curl_multi_*, [12]=Unrecoverable error in select/poll)
('Unknown error: google::cloud::Status(UNKNOWN: Permanent error ReadObjectNotWrapped: WaitForHandles(): unexpected error code in curl_multi_*, [12]=Unrecoverable error in select/poll)',)

.
.
.

  File "/usr/local/lib/python3.10/site-packages/biglist/_parquet.py", line 171, in __len__
    return self.num_rows
  File "/usr/local/lib/python3.10/site-packages/biglist/_parquet.py", line 203, in num_rows
    return self.metadata.num_rows
  File "/usr/local/lib/python3.10/site-packages/biglist/_parquet.py", line 199, in metadata
    return self.file.metadata
  File "/usr/local/lib/python3.10/site-packages/biglist/_parquet.py", line 194, in file
    self._file = self.load_file(self.path, lazy=self.lazy)
  File "/usr/local/lib/python3.10/site-packages/biglist/_parquet.py", line 97, in load_file
    file = ParquetFile(pp, filesystem=ff)
  File "/usr/local/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 334, in __init__
    self.reader.open(
  File "pyarrow/_parquet.pyx", line 1220, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/error.pxi", line 138, in pyarrow.lib.check_status
pyarrow.lib.ArrowException: Unknown error: google::cloud::Status(UNKNOWN: Permanent error ReadObjectNotWrapped: WaitForHandles(): unexpected error code in curl_multi_*, [12]=Unrecoverable error in select/poll)

@pitrou
Member

pitrou commented May 18, 2023

Ok, I took a brief tour through the libcurl source code:

  • The "Unrecoverable error in select/poll" error is generated in curl_multi_wait if Curl_poll returns -1
  • Curl_poll (which, understandably, is a wrapper around poll on Unix) returns -1 in three situations:
    1. nfds is non-zero and poll returns an error that's not EINTR
    2. nfds is zero and the given timeout is negative
    3. nfds is zero and poll returns an error including EINTR

Let's dive a bit into google-cloud-cpp. There are two similar functions named WaitForHandles (CurlImpl::WaitForHandles and CurlDownloadRequest::WaitForHandles). Both call curl_multi_wait with zero extra file descriptors and a hard-coded positive timeout. This eliminates the "negative timeout" situation above.

We are left with an error returned from poll. According to the Linux man page, these can be:

       EFAULT fds points outside the process's accessible address space. The
              array given as argument was not contained in the calling
              program's address space.

       EINTR  A signal occurred before any requested event; see signal(7).

       EINVAL The nfds value exceeds the RLIMIT_NOFILE value.

       ENOMEM Unable to allocate memory for kernel data structures.

We can eliminate EFAULT as curl_poll ensures the fds point to accessible memory.
EINVAL is extremely unlikely given a limit of 1048576 open files in #35318 (comment) .
ENOMEM cannot be ruled out, but I guess exhaustion of kernel data space would manifest randomly in other ways?

This leaves us with EINTR, which can happen in the case that curl_multi_wait doesn't find any file descriptors to wait for.

@pitrou
Member

pitrou commented May 18, 2023

@zpz Perhaps you can try to use strace to see if your program is receiving any signals?
See https://unix.stackexchange.com/a/372581

@westonpace
Member

ENOMEM cannot be ruled out, but I guess exhaustion of kernel data space would manifest randomly in other ways?

It should also be possible to monitor RAM usage during the execution. Given that overcommit_memory=1 I think we should only see ENOMEM if free memory is close to 0. However, I agree this is not the likely culprit.

@pitrou
Member

pitrou commented May 19, 2023

Current status on this:

  • We'll have to bump our bundled version of google-cloud-cpp when a new release gets done.
  • As for libcurl, users will often rely on a system version thereof, and we can't expect it to get a fix.

@pitrou
Member

pitrou commented Jun 1, 2023

A new google-cloud-cpp version has been released with the fix:
https://github.com/googleapis/google-cloud-cpp/releases/tag/v2.11.0

@pitrou
Member

pitrou commented Jun 1, 2023

Opened #35879

@zpz
Author

zpz commented Jun 19, 2023

While there is apparently a valid bug related to this, I should report that I found a bug in my own code: it failed to close the ParquetFile objects, which led to a buildup of memory consumption. After fixing that, my immediate problem seems to be solved.

@zpz
Author

zpz commented Jun 19, 2023

My working code is here: https://github.com/zpz/biglist/blob/main/src/biglist/_parquet.py#L86. The pyarrow behavior here seems flawed in that it should take care of this itself. It does have a context manager, but in a case like this where the context manager doesn't do much, many applications won't bother to use it, and the code should handle finalization regardless. That is how it is done in multiple places in the standard multiprocessing module.
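
For what it's worth, here is a minimal sketch of the "finalize even without a context manager" idea argued for above (a hypothetical wrapper, not the actual biglist code), using weakref.finalize to close the reader once the wrapper is garbage-collected:

import weakref
from pyarrow import fs
import pyarrow.parquet as pq

class LazyParquet:
    def __init__(self, path: str):
        self._fs = fs.GcsFileSystem()
        self._file = pq.ParquetFile(path, filesystem=self._fs)
        # close the underlying reader even if the caller never calls close()
        # and never uses a context manager
        weakref.finalize(self, self._file.close)

    @property
    def num_rows(self) -> int:
        return self._file.metadata.num_rows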

@pitrou
Member

pitrou commented Jun 19, 2023

@zpz the ParquetFile context manager should ensure that the reader is closed; does that not happen for you?

def __enter__(self):
    return self

def __exit__(self, *args, **kwargs):
    self.close()

def close(self, force: bool = False):
    if self._close_source or force:
        self.reader.close()

(note that _close_source is True if you initialized the ParquetFile with a filesystem argument)

cc @jorisvandenbossche for the potential ParquetFile issue.
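
For reference, both usage patterns below release the underlying GCS stream when the file was opened with a filesystem argument (a sketch with a hypothetical path):

from pyarrow import fs
import pyarrow.parquet as pq

gcs = fs.GcsFileSystem()

# context-manager form: the reader is closed on exit
with pq.ParquetFile("my-bucket/data/part-00000.parquet", filesystem=gcs) as f:
    n = f.metadata.num_rows

# explicit form, e.g. when the file is kept as an attribute of another object
f = pq.ParquetFile("my-bucket/data/part-00000.parquet", filesystem=gcs)
try:
    n = f.metadata.num_rows
finally:
    f.close()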

@zpz
Author

zpz commented Jun 19, 2023

My code does not use a context manager on this ParquetFile. My previous code had a bug in closing it.

kou added a commit that referenced this issue Jun 27, 2023
### Rationale for this change

The version will fix #35318.

### What changes are included in this PR?

Use the latest released version.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* Closes: #35879

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>