Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python][FlightRPC] Segmentation Fault when invoking authenticate concurrently over a same FlightClient #38565

Open
jdanap opened this issue Nov 2, 2023 · 15 comments

Comments

@jdanap
Copy link

jdanap commented Nov 2, 2023

Describe the bug, including details regarding any error messages, version, and platform.

Hello. I encountered issue with segmentation fault/crashing pyarrow.flight.FlightClient whenever the client is used concurrently.

As we have narrowed down the cause of the failure to the FlightClient's authenticate function, I have simplified the example reproducible usage scenario to the following:

import concurrent.futures
from time import perf_counter

from pyarrow import flight
from pyarrow.flight import ClientAuthHandler

HOST = 'grpc://obviously.does.not.exist.io:80'
CONCURRENT_TASKS=20
client: flight.FlightClient = flight.FlightClient(HOST)
class NoOpClientAuth(ClientAuthHandler):
    def __init__(self):
        self.token = None

    def authenticate(self, outgoing, incoming):
        print("no-op")

    def get_token(self):
        return self.token

def task1():
    return client.authenticate(NoOpClientAuth())

def task2():
    return client.authenticate(NoOpClientAuth())

def run_concurrently(tasks_info, max_workers, cancel_future=None):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = []
        for task_and_parameters in tasks_info:
            task, parameters = task_and_parameters[0], task_and_parameters[1:]
            modified_task = task
            futures.append(executor.submit(modified_task, *parameters))

        results = []
        for future in futures:
            try:
                results.append(future.result())
            except Exception:
                if cancel_future:
                    cancel_future.cancel()
                raise
        return results

if __name__ == '__main__':

    tasks = []
    for _ in range(CONCURRENT_TASKS):
       tasks.append([task1])
       tasks.append([task2])

    t1_start = perf_counter()
    results = run_concurrently(tasks, 10)
    t1_stop = perf_counter()
    print("Elapsed time during the whole program in seconds:", t1_stop-t1_start)

With lower level of concurrency, we are less likely to encounter the crash. Instead, we would get 404 or DNS resolution error as expected (or in real use case, be able to get through the authentication successfully just fine). But set it 20 and I have been able to reproduce the crash consistently, albeit after varying number of authenticate calls before the script above exits with a Segmentation Fault in Unix environment.

Debugging in dev mode, the point of failure was located to be here, where it generates std::bad_function_call

We could work around this by always instantiating a new client instead of reusing a singleton one. However, to my understanding, reusing the same client is encouraged.

I am wondering what the original source of failure is, and/or if re-authenticating over/reusing the same client is not the intended usage in this case.

Version(s)

pyarrow=12.0.0 and pyarrow=14.0.0

Component(s)

FlightRPC, Python

@kou
Copy link
Member

kou commented Nov 3, 2023

Could you share full backtrace?

@donatobarone
Copy link

donatobarone commented Nov 6, 2023

Hi @kou this is related to #38519 ticket in a way, I was trying to debug the issue myself, but @jdanap took over the investigation now. Considering I am not able to go any deeper, I can just say that the backtrace looks like this:

libc++abi: terminating due to uncaught exception of type std::__1::bad_function_call: std::exception
Process 95253 stopped
* thread #36, stop reason = signal SIGABRT
    frame #0: 0x00000001a7b60744 libsystem_kernel.dylib`__pthread_kill + 8
libsystem_kernel.dylib`:
->  0x1a7b60744 <+8>:  b.lo   0x1a7b60764               ; <+40>
    0x1a7b60748 <+12>: pacibsp 
    0x1a7b6074c <+16>: stp    x29, x30, [sp, #-0x10]!
    0x1a7b60750 <+20>: mov    x29, sp
Target 0: (Python) stopped.
(lldb) bt
* thread #36, stop reason = signal SIGABRT
  * frame #0: 0x00000001a7b60744 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x00000001a7b97c28 libsystem_pthread.dylib`pthread_kill + 288
    frame #2: 0x00000001a7aa5ae8 libsystem_c.dylib`abort + 180
    frame #3: 0x00000001a7b50b84 libc++abi.dylib`abort_message + 132
    frame #4: 0x00000001a7b403b4 libc++abi.dylib`demangling_terminate_handler() + 320
    frame #5: 0x00000001a7816e68 libobjc.A.dylib`_objc_terminate() + 160
    frame #6: 0x00000001a7b4ff48 libc++abi.dylib`std::__terminate(void (*)()) + 16
    frame #7: 0x00000001a7b52d34 libc++abi.dylib`__cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*) + 36
    frame #8: 0x00000001a7b52ce0 libc++abi.dylib`__cxa_throw + 140
    frame #9: 0x000000010570e9d8 libarrow_python_flight.dylib`std::__1::__throw_bad_function_call[abi:v15006]() + 56
    frame #10: 0x000000010570eee8 libarrow_python_flight.dylib`std::__1::__function::__value_func<arrow::Status (_object*, arrow::flight::ClientAuthSender*, arrow::flight::ClientAuthReader*)>::operator()[abi:v15006](_object*&&, arrow::flight::ClientAuthSender*&&, arrow::flight::ClientAuthReader*&&) const + 68
    frame #11: 0x000000010570ee98 libarrow_python_flight.dylib`std::__1::function<arrow::Status (_object*, arrow::flight::ClientAuthSender*, arrow::flight::ClientAuthReader*)>::operator()(_object*, arrow::flight::ClientAuthSender*, arrow::flight::ClientAuthReader*) const + 68
    frame #12: 0x000000010570ed54 libarrow_python_flight.dylib`arrow::py::flight::PyClientAuthHandler::Authenticate(arrow::flight::ClientAuthSender*, arrow::flight::ClientAuthReader*)::$_2::operator()() const + 80
    frame #13: 0x0000000105703c24 libarrow_python_flight.dylib`decltype(fp()) arrow::py::SafeCallIntoPython<arrow::py::flight::PyClientAuthHandler::Authenticate(arrow::flight::ClientAuthSender*, arrow::flight::ClientAuthReader*)::$_2>(arrow::py::flight::PyClientAuthHandler::Authenticate(arrow::flight::ClientAuthSender*, arrow::flight::ClientAuthReader*)::$_2&&) + 80
    frame #14: 0x0000000105703bc8 libarrow_python_flight.dylib`arrow::py::flight::PyClientAuthHandler::Authenticate(arrow::flight::ClientAuthSender*, arrow::flight::ClientAuthReader*) + 64
    frame #15: 0x000000013d346fc8 libarrow_flight.1400.dylib`arrow::flight::transport::grpc::(anonymous namespace)::GrpcClientImpl::AuthenticateInternal(this=0x0000000103534be0, rpc=0x0000000179c9e398)::ClientRpc&) at grpc_client.cc:1058:7
    frame #16: 0x000000013d332b60 libarrow_flight.1400.dylib`arrow::flight::transport::grpc::(anonymous namespace)::GrpcClientImpl::Authenticate(this=0x0000000103534be0, options=0x0000000123ed3f00, auth_handler=<unavailable>) at grpc_client.cc:864:12
    frame #17: 0x000000013d2d9540 libarrow_flight.1400.dylib`arrow::flight::FlightClient::Authenticate(this=0x0000000103542e30, options=0x0000000123ed3f00, auth_handler=nullptr) at client.cc:567:22
    frame #18: 0x0000000123e0c130 _flight.cpython-311-darwin.so`__pyx_pf_7pyarrow_7_flight_12FlightClient_8authenticate(__pyx_obj_7pyarrow_7_flight_FlightClient*, _object*, __pyx_obj_7pyarrow_7_flight_FlightCallOptions*) + 1112
    frame #19: 0x0000000123e068bc _flight.cpython-311-darwin.so`__pyx_pw_7pyarrow_7_flight_12FlightClient_9authenticate(_object*, _object* const*, long, _object*) + 1152
    frame #20: 0x0000000123e7f488 _flight.cpython-311-darwin.so`__Pyx_CyFunction_Vectorcall_FASTCALL_KEYWORDS(_object*, _object* const*, unsigned long, _object*) + 208
    frame #21: 0x000000010063fde0 Python`_PyEval_EvalFrameDefault + 40368
    frame #22: 0x0000000100644564 Python`_PyEval_Vector + 116
    frame #23: 0x0000000100641cdc Python`_PyEval_EvalFrameDefault + 48300
    frame #24: 0x0000000100644564 Python`_PyEval_Vector + 116
    frame #25: 0x0000000100641cdc Python`_PyEval_EvalFrameDefault + 48300
    frame #26: 0x0000000100644564 Python`_PyEval_Vector + 116
    frame #27: 0x000000010056a198 Python`method_vectorcall + 380
    frame #28: 0x00000001006f6ad4 Python`thread_run + 168
    frame #29: 0x00000001006981cc Python`pythread_wrapper + 48
    frame #30: 0x00000001a7b97fa8 libsystem_pthread.dylib`_pthread_start + 148

The issue is happening exactly as @jdanap mentioned, which should be here

const Status status = vtable_.authenticate(handler_.obj(), outgoing, incoming);

Let me know if there is anything else you need

@kou
Copy link
Member

kou commented Nov 7, 2023

Thanks.

Could you execute the following commands in the debugger session?

(lldb) f 10
(lldb) p vtable_

@kou
Copy link
Member

kou commented Nov 7, 2023

Ah, arrow::flight::FlightClient::Authenticate() isn't thread safe:

auth_handler_ = std::move(auth_handler);

@lidavidm We don't want to recommend Authenticate() because it's stateful, right?

@lidavidm
Copy link
Member

lidavidm commented Nov 7, 2023

Thanks for finding that! We could/should probably add a lock for this to be consistent with other methods, but yeah, I've been slowly marking the old authentication method deprecated (not for removal) anyways because of its statefulness.

@donatobarone
Copy link

Thanks.

Could you execute the following commands in the debugger session?

(lldb) f 10
(lldb) p vtable_

I wish I could be able to see the symbols :D that has been the problem described in #38519.

@kou @lidavidm do you have any suggestions on what should be the recommended approach here to avoid the problem?

@lidavidm
Copy link
Member

lidavidm commented Nov 7, 2023

Do you need to authenticate concurrently with other calls/with itself? I would expect a client to authenticate once, up front, and then enter concurrent usage

@donatobarone
Copy link

So our scenario is as follows:

  • Service B written in python (using the above example) gets data from service A
  • Different users can reach out to service B at the same time
  • Each user has to be authenticated against service A in order to get data

In this scenario, unless we put a lock in the code in service B to avoid the authenticate being called at the same time (which I remember trying, but didn't seem to help, still I want to validate that again), don't think there is any other way, unless I am missing something. Do you agree @lidavidm ??

@lidavidm
Copy link
Member

lidavidm commented Nov 7, 2023

No, then each user/request should use its own client. Authenticate is explicitly stateful, and is not appropriate if you are using multiple authentication contexts.

@kou
Copy link
Member

kou commented Nov 7, 2023

We could/should probably add a lock for this to be consistent with other methods

I think that we can't do it...
Because auth_handler_ is also used in other methods such as ListFlights() and GetFlightInfo(). We can't use a lock to support concurrent multiple auth_handler_s in one client.

@lidavidm
Copy link
Member

lidavidm commented Nov 7, 2023

Ah...we'd have to put a reader/writer lock over every method - I'm not sure that's worth the tradeoff. We should document the requirements for Authenticate instead, then.

@kou
Copy link
Member

kou commented Nov 7, 2023

We should document the requirements for Authenticate instead, then.

+1

@jdanap
Copy link
Author

jdanap commented Nov 8, 2023

Documenting the requirements/limitations for Authenticate makes sense to me.

Correct me if I am wrong - but I believe even if we can make authenticate stateless or thread-safe by locking the auth_handler (here and in any other place that uses it) or any other smarter means, we would still run into a race condition between the authenticate call itself, and the methods affected by the auth_handler such as the actual do_get/do_action and stream reading. Under a multi-user context scenario, authentication under user A should not be overridden, thread-safely or not, by user B until user A completes their request following Authenticate.

@jcferretti
Copy link

There is likely no need for locking, just making SetToken take a shared_ptr instead of a raw pointer may be enough, see linked issue.

@jcferretti
Copy link

There is likely no need for locking, just making SetToken take a shared_ptr instead of a raw pointer may be enough, see linked issue.

Actually that's wrong, there is no guarantee that a mutation and a read of a shared_ptr is consistent; thanks to lidavidm for pointing out

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants