Any idea why this isn't working in an Apache PySpark UDF? #253

Open
rphes opened this issue Mar 21, 2024 · 1 comment

rphes commented Mar 21, 2024

Hi, and thanks for maintaining this. You might not be familiar with it, but PySpark lets you run distributed computations across many nodes. You can process data with Python code via a so-called user-defined function (UDF). Python UDFs are serialized with cloudpickle so that they can be sent to, and executed on, worker nodes.

I have a UDF that requests some data from an API using requests, and I'd like to mock this API in my tests. So far, this seems like an ideal use case for requests-mock. The issue, though, is that the session instance I wrap in my UDF seems to get un-patched somewhere before it reaches the executor that actually runs the UDF.

Now there is a lot of complexity involved here, but perhaps you have some idea of what could cause a requests-mock session patch to get undone. I hope you can help me.

jamielennox (Owner) commented

Sorry, I haven't used it in combination with PySpark. But my guess is the same as yours: something in the way the mocking is done is being replaced by the PySpark process. If you can figure out where, we can look at supporting it.
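
One way that guess could play out, offered as a sketch and not a diagnosis: a mock installed by patching objects in the test's own Python process is not visible in any other Python process, and PySpark typically runs Python UDFs in separate worker processes. The example below uses only the standard library to show that effect. `urllib.request.urlopen` stands in for the patched requests session, and `CHILD_CODE` and `patched_in_child` are hypothetical names made up for this illustration; nothing here is taken from requests-mock or PySpark.

```python
import subprocess
import sys
import urllib.request
from unittest import mock

# Code run by a fresh interpreter, standing in for a separate worker process.
# It reports whether urlopen looks like a mock object there.
CHILD_CODE = (
    "import urllib.request, unittest.mock\n"
    "print(isinstance(urllib.request.urlopen, unittest.mock.Mock))"
)


def patched_in_child() -> bool:
    """Start a new Python process and check whether it sees the patch."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "True"


# Patch urlopen in this process only, the way a test fixture might.
with mock.patch("urllib.request.urlopen"):
    patched_in_parent = isinstance(urllib.request.urlopen, mock.Mock)
    child_sees_patch = patched_in_child()

print("patched in parent:", patched_in_parent)
print("patched in child:", child_sees_patch)
```

If this is what happens in the PySpark case, the patch would need to be applied inside the process that runs the UDF, or the mocked behaviour would need to travel with the serialized function. Whether either is practical with requests-mock is not established in this thread.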
