Support XML-RPC multicall #3778

di · 2018-04-20T20:18:28Z

ewdurbin · 2018-04-20T20:20:50Z

warehouse/legacy/api/xmlrpc/views.py

+        raise XMLRPCWrappedError(
+            ValueError('Method name not provided')
+        )
+


If we're going to support this, I'd love to at least limit the number of calls one can issue in a multi call.

No idea what a good number would be though.

Fine with me. Shall we set it to the legacy 100 amount and see how it goes for now?

Well it's already broken... so why not try something like 10 and see how it goes.

oh hey you know what, I have data for this. back in a jiff

welp.

zgrep -h 'multicall' xmlrpc.log* | jq '.params[]|length' | sort | uniq -c | sort -n

      2 15
      2 16
      2 25
      2 30
      2 32
      2 52
      2 62
      2 86
      4 14
      4 22
      4 46
      4 8
      6 44
      6 68
     12 98
     16 58
     18 73
     20 11
     20 31
     20 72
     34 12
     42 10
     68 48
     68 59
     78 61
    114 6
    138 4
    152 2
    158 60
    216 3
    895 100
   1284 24
   2036 5
   6170 1
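For readers without jq at hand, the same tally can be sketched in Python. This assumes, as the jq filter implies, that each log line is a JSON object whose `params` field is a list of arrays (one array of calls per multicall). The function name `multicall_sizes` is hypothetical, and each output pair reads as "number of requests, calls per multicall".

```python
import json
from collections import Counter


def multicall_sizes(lines):
    """Count how many multicall requests carried each number of calls.

    Assumes every matching line is a JSON object with a "params" list
    whose items are arrays, mirroring `jq '.params[]|length'`.
    """
    counts = Counter()
    for line in lines:
        # Same coarse filter as `zgrep 'multicall'`.
        if "multicall" not in line:
            continue
        record = json.loads(line)
        for param in record["params"]:
            counts[len(param)] += 1
    return counts


def format_counts(counts):
    """Render the tally like `uniq -c | sort -n`: count first, then size."""
    ordered = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    return [f"{count} {size}" for size, count in ordered]
```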

I'm the 6170 people calling multicall with a single operation.

@dstufft @di I'm going to suggest we start with a cap of 20... and see who complains. It may be something where it is trivial for them to tune the batching.

A more restrictive multi call may be a good way to start some conversations with heavy XMLRPC users.

Honestly, number of calls is possibly not the right answer here. What we're really trying to avoid is buffering 30GB of data in memory before we send things out to the end user. For some operations that's going to take 1000 multi calls, for others it's going to take 2, and for still others it's going to vary based upon the parameters passed and the exact return value.

I see a few possible ways of resolving this conflict.

1. Do the "typical" thing here, and just limit the number of multicalls you can make in a single call, and tune that up/down as we get experience and metrics on how bad of a value that is.

2. Institute a limit on response size for these, and sum the size of our response; once we hit some defined limit, bail out and raise an HTTP error.

3. Drop buffering completely, and implement a generator-based approach that will feed responses, one call at a time, to the requestor so that by definition a multi call is no more memory expensive than a long-lived connection and a single call.

Given we don't particularly like or want to support XMLRPC long term, I have no problem with (1) being the answer here. It's a quick, easy thing to do and lets us move on. However, it would still be possible to drastically inflate our memory usage (find our biggest long_description and request it N times).
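A minimal sketch of what (1) could look like, not the code from this PR. `run_multicall`, `dispatch`, and `MAX_MULTICALLS` are hypothetical names, the cap of 20 is the number floated earlier in this thread, and a plain `ValueError` stands in for whatever error wrapper the real view uses.

```python
MAX_MULTICALLS = 20  # assumed cap, taken from the number floated in this thread


def run_multicall(dispatch, calls, limit=MAX_MULTICALLS):
    """Run each call through ``dispatch`` after checking the batch size.

    ``dispatch`` is a hypothetical callable: dispatch(method_name, params).
    ``calls`` is a list of dicts with "methodName" and optional "params".
    """
    # Reject oversized batches before doing any work.
    if len(calls) > limit:
        raise ValueError(
            f"Multicall limited to {limit} calls, got {len(calls)}"
        )

    responses = []
    for call in calls:
        name = call.get("methodName")
        if not name:
            raise ValueError("Method name not provided")
        responses.append(dispatch(name, call.get("params", [])))
    return responses
```

Note that the cap bounds the number of buffered results, not their size, which is the weakness described above.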

(2) is probably a better implementation of the "limit and then bail" approach, except that it's harder for callers to ensure they're staying within the limits because they're not going to have a deterministic way to know if any particular multi call is going to succeed or not. If these were all new clients, I'd say that's fine because they can implement back offs or limit the usage to procedures that don't return huge blobs of data like release_data does, but that's more annoying in a legacy API with clients that may have been written a decade ago (RPC is a horrible way to write APIs that are designed to last a decade+ :( ).
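A rough sketch of (2), under stated assumptions: `run_multicall_with_budget`, `dispatch`, and `ResponseTooLarge` are hypothetical names, and the size of each result is estimated by serializing it with the standard library's `xmlrpc.client.dumps`, which may not match the byte count the real renderer produces.

```python
import xmlrpc.client


class ResponseTooLarge(Exception):
    """Hypothetical error; a real view would map this to an HTTP error."""


def run_multicall_with_budget(dispatch, calls, max_bytes):
    """Collect results until their estimated serialized size passes max_bytes.

    ``dispatch`` is a hypothetical callable: dispatch(method_name, params).
    """
    total = 0
    responses = []
    for call in calls:
        result = dispatch(call["methodName"], call.get("params", []))
        # Estimate the cost of this result by serializing it on its own.
        encoded = xmlrpc.client.dumps((result,), methodresponse=True)
        total += len(encoded.encode("utf-8"))
        if total > max_bytes:
            raise ResponseTooLarge(
                f"Response exceeded {max_bytes} bytes after "
                f"{len(responses) + 1} calls"
            )
        responses.append(result)
    return responses
```

The caller cannot know in advance whether a batch fits, which is the unpredictability described above. The sketch also serializes every result one extra time just to measure it.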

(3) is probably the best of all possible worlds, since it completely eliminates the ability to blow up memory usage by buffering a million things in memory, but it's also the hardest one to actually implement. I'm not even sure if pyramid_rpc can handle a streaming response (Pyramid can, but don't know about pyramid_rpc), so you might end up having to fight stuff at multiple layers of the stack and take over more of the serialization aspects in order to implement this method.
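To make (3) concrete, here is a sketch that builds the XML-RPC response by hand and yields it one call at a time, using only the standard library. It is an illustration of the idea, not something pyramid_rpc is known to support: `stream_multicall`, `_value_xml`, and `dispatch` are hypothetical names, fault handling is omitted, and wrapping each result in a one-item array follows the usual `system.multicall` convention.

```python
import xmlrpc.client


def _value_xml(value):
    """Return just the <value>...</value> element for ``value``."""
    xml = xmlrpc.client.dumps((value,))
    start = xml.index("<value>")
    end = xml.rindex("</value>") + len("</value>")
    return xml[start:end]


def stream_multicall(dispatch, calls):
    """Yield an XML-RPC multicall response in chunks, one call at a time.

    Only one result is held in memory at any moment. ``dispatch`` is a
    hypothetical callable: dispatch(method_name, params).
    """
    yield (
        "<?xml version='1.0'?>\n<methodResponse>\n<params>\n<param>\n"
        "<value><array><data>\n"
    )
    for call in calls:
        result = dispatch(call["methodName"], call.get("params", []))
        # Each successful result is wrapped in a one-item array.
        yield _value_xml([result]) + "\n"
    yield "</data></array></value>\n</param>\n</params>\n</methodResponse>\n"
```

The chunks would still need to be encoded to bytes and handed to something that streams them, for example as an `app_iter` on a Pyramid response. Whether that plays well with the rest of the stack is exactly the open question raised above.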

All in all, this is a lot of words to say go for any of the above methods, I think they're all fine with different tradeoffs, and I don't personally care a whole lot about delving too deeply into making this legacy API work great for all consumers ever.

Maybe: let's do #1, but also add some metrics about the size of these responses?

dstufft · 2018-04-20T20:21:54Z

warehouse/legacy/api/xmlrpc/views.py

+        )
+
+    responses = []
+    for arg in args:


FWIW, it looks like the original code had a limit of 100 multicalls in a single response:

https://github.com/pypa/pypi-legacy/blob/a418131180a14768328713df5533c600c425760d/rpc.py#L221-L224

Although even 100 multi calls feels like it's a lot and can end up buffering a HUGE response here.

ewdurbin · 2018-04-24T14:08:32Z

tests/unit/legacy/api/xmlrpc/test_init.py

@@ -0,0 +1,48 @@
+import pretend


license header

D'oh, done.

ewdurbin · 2018-04-24T14:10:31Z

warehouse/legacy/api/xmlrpc/__init__.py

+
+
+def includeme(config):
+    config.add_subscriber(on_new_response, NewResponse)


will this affect all responses? is there any way to only apply this to the xmlrpc views?

Only responses that have set request.content_length_metric_name to some value.

aye, but we're still going to be executing the logic on every request.

this is cleaner for a metric we might be submitting with every request, but I'm wondering what it really gains us over submitting the metric inline.

The subscriber is going to bail pretty quickly if that attribute doesn't exist on the request, I don't think it's much overhead. We're doing a similar thing already here: https://github.com/pypa/warehouse/blob/0e33d86537ce10469ec93e8fd15532b16f579dea/warehouse/datadog/pyramid_datadog.py#L105-L108

The reason I did it via a subscriber is because it's difficult to get the content length of the response in the view, because it just returns a list of results, not an actual Response.

The alternative would be either a) estimating the actual Content-Length in the view (messy, and unnecessarily expensive because it serializes the payload twice), or b) using character length as a proxy for content length (which is not quite right).

Hmm, perhaps this could be done with a response callback instead... (https://docs.pylonsproject.org/projects/pyramid/en/latest/narr/hooks.html#using-response-callbacks)
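A sketch of the response callback idea, assuming Pyramid's `request.add_response_callback`, which calls the registered function with `(request, response)` once the response exists. The names `track_content_length` and `emit` and the metric name are hypothetical; `emit` stands in for whatever metrics client is in use, and the sketch assumes `response.content_length` is populated by the time the callback runs.

```python
def make_content_length_callback(metric_name, emit):
    """Build a response callback that reports the response size.

    ``emit`` is a hypothetical callable: emit(metric_name, value).
    """
    def callback(request, response):
        # Runs only for requests that registered it, after rendering.
        emit(metric_name, response.content_length)
    return callback


def track_content_length(request, metric_name, emit):
    """Register the callback on this request only.

    Intended to be called from inside the XML-RPC view, so other views
    pay no cost at all.
    """
    request.add_response_callback(
        make_content_length_callback(metric_name, emit)
    )
```

Compared with a `NewResponse` subscriber, nothing runs for requests that never call `track_content_length`, which addresses the concern about executing the logic on every request.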

ewdurbin reviewed Apr 20, 2018

dstufft reviewed Apr 20, 2018

di force-pushed the xmlrpc-multicall branch from 53fdade to 36e3aaf Compare April 20, 2018 20:39

di added 3 commits April 23, 2018 13:54

Support XML-RPC multicall

20a53c7

Limit number of multicalls to 20

f38ae14

Add metrics for multicall response size

bc95bc7

di force-pushed the xmlrpc-multicall branch from 36e3aaf to bc95bc7 Compare April 23, 2018 18:55

ewdurbin reviewed Apr 24, 2018

Add license header

d59763f

di force-pushed the xmlrpc-multicall branch from 0a608fd to 31ad71b Compare April 24, 2018 15:43

Use a response callback instead

e458894

di force-pushed the xmlrpc-multicall branch from 31ad71b to e458894 Compare April 24, 2018 15:46

Merge branch 'master' into xmlrpc-multicall

a169126

dstufft merged commit 754c88a into pypi:master Apr 25, 2018

dstufft deleted the xmlrpc-multicall branch April 25, 2018 13:00



