Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support XML-RPC multicall #3778

Merged
merged 6 commits into from
Apr 25, 2018
Merged

Support XML-RPC multicall #3778

merged 6 commits into from
Apr 25, 2018

Conversation

di
Copy link
Member

@di di commented Apr 20, 2018

Fixes #3734.

raise XMLRPCWrappedError(
ValueError('Method name not provided')
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to support this, I'd love to at least limit the number of calls one can issue in a multi call.

No idea what a good number would be though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine with me. Shall we set it to the legacy 100 amount and see how it goes for now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well it's already broken... so why not try something like 10 and see how it goes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh hey you know what, I have data for this. back in a jiff

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

welp.

zgrep -h 'multicall' xmlrpc.log* | jq '.params[]|length' | sort | uniq -c  | sort -n
      2 15
      2 16
      2 25
      2 30
      2 32
      2 52
      2 62
      2 86
      4 14
      4 22
      4 46
      4 8
      6 44
      6 68
     12 98
     16 58
     18 73
     20 11
     20 31
     20 72
     34 12
     42 10
     68 48
     68 59
     78 61
    114 6
    138 4
    152 2
    158 60
    216 3
    895 100
   1284 24
   2036 5
   6170 1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm the 6170 people calling multicall with a single operation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dstufft @di I'm going to suggest we start with a cap of 20... and see who complains. It may be something where it is trivial for them to tune the batching.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A more restrictive multi call may be a good way to start some conversations with heavy XMLRPC users.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, number of calls is possibly not the right answer here. The thing we're really looking to avoid is to prevent buffering 30GB of data in memory before we send things out to the end user, for some operations that's going to be 1000 multi calls, other ones that's going to be 2 multi calls, and for still others that's going to vary based upon the parameters passed to the exact return value.

I see a few possible ways of resolving this conflict.

  1. Do the "typical" thing here, and just limit the number of multicalls you can make in a single call, and tune that up/down as we get experience and metrics on how bad of a value that is.
  2. Institute a limit on response size for these, and sum the size of our response, once we hit some defined limit, bail out and raise a http error.
  3. Drop buffering completely, and implement a generator based approach that will feed responses, one call at a time to the requestor so that by definition a multi call is no more memory expensive than a long lived connection and a single call.

Given we don't particularly like or want to support XMLRPC long term, I have no problem with (1) being the answer here. It's a quick, easy thing to do and lets us move on. However It would still be possible to drastically inflate our memory usage (find our biggest long_description and request it N times).

(2) is probably a better implementation of the "limit and then bail" approach, except that it's harder for callers to ensure they're staying within the limits because they're not going to have a deterministic way to know if any particular multi call is going to succeed or not. If these were all new clients, I'd say that's fine because they can implement back offs or limit the usage to procedures that don't return huge blobs of data like release_data does, but that's more annoying in a legacy API with clients that may have been written a decade ago (RPC is a horrible way to write APIs that are designed to last a decade+ :( ).

(3) is probably the best of all possible worlds, since it completely eliminates the ability to blow up memory usage by buffering a million things in memory, but it's also the hardest one to actually implement. I'm not even sure if pyramid_rpc can handle a streaming response (Pyramid can, but don't know about pyramid_rpc), so you might end up having to fight stuff at multiple layers of the stack and take over more of the serialization aspects in order to implement this method.

All in all, this is a lot of words to say go for any of the above methods, I think they're all fine with different tradeoffs, and I don't personally care a whole lot about delving too deeply into making this legacy API work great for all consumers ever.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe: let's do #1, but also add some metrics about the size of these responses?

)

responses = []
for arg in args:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, it looks like the original code had a limit of 100 multicalls in a single response:

https://github.com/pypa/pypi-legacy/blob/a418131180a14768328713df5533c600c425760d/rpc.py#L221-L224

Although even 100 multi calls feels like it's a lot and can end up buffering a HUGE response here.

@di di force-pushed the xmlrpc-multicall branch from 53fdade to 36e3aaf Compare April 20, 2018 20:39
@di di force-pushed the xmlrpc-multicall branch from 36e3aaf to bc95bc7 Compare April 23, 2018 18:55
@@ -0,0 +1,48 @@
import pretend
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

license header

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

D'oh, done.



def includeme(config):
config.add_subscriber(on_new_response, NewResponse)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this affect all responses? is there any way to only apply this to the xmlrpc views?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only responses that have set request.content_length_metric_name to some value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aye, but we're still going to be executing the logic on every request.

this is cleaner for a metric we might be submitting with every request, but I'm wondering what it really gains us over submitting the metric inline.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The subscriber is going to bail pretty quickly if that attribute doesn't exist on the request, I don't think it's much overhead. We're doing a similar thing already here: https://github.com/pypa/warehouse/blob/0e33d86537ce10469ec93e8fd15532b16f579dea/warehouse/datadog/pyramid_datadog.py#L105-L108

The reason I did it via a subscriber is because it's difficult to get the content length of the response in the view, because it just returns a list of results, not an actual Response.

The alternative would be either a) estimating the actual Content-Length in the view (messy/unnecessarily expensive to serialize the payload twice) , or b) use character length as a proxy for content length (which is not quite right).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, perhaps this could be done with a response callback instead... (https://docs.pylonsproject.org/projects/pyramid/en/latest/narr/hooks.html#using-response-callbacks)

@di di force-pushed the xmlrpc-multicall branch from 0a608fd to 31ad71b Compare April 24, 2018 15:43
@di di force-pushed the xmlrpc-multicall branch from 31ad71b to e458894 Compare April 24, 2018 15:46
@dstufft dstufft merged commit 754c88a into pypi:master Apr 25, 2018
@dstufft dstufft deleted the xmlrpc-multicall branch April 25, 2018 13:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants