chore: transition the library to the new microgenerator #158

Merged
merged 35 commits into from
Sep 7, 2020

Conversation

plamut
Contributor

@plamut plamut commented Jul 15, 2020

Closes #131.
Closes #168.

This PR replaces the old code generator; the generated parts of the code now use the microgenerator. The latter also implies dropping support for Python 2.7 and 3.5.

There are a lot of changes and the transition itself was far from smooth, thus it's probably best to review this commit by commit (I tried to make the commits self-contained with each containing a single change/fix).

Things to focus on in reviews

  • Regenerating the code overrides some of the URLs in the samples README. Seems like a synthtool issue.

  • The clients do not support the client_config argument anymore. At least the ordering keys feature used that to change the timeout to "infinity". We need to see if that's crucial, and if the same can be set in a different way (maybe through the retry policy...)

  • SERVICE_ADDRESS and _DEFAULT_SCOPES constants in clients might be obsolete. Let's see if there are more modern alternatives (they are currently still injected into the generated clients).

  • Regenerating the code out of the box fails, because the Bazel tool incorrectly tries to use Python 2, resulting in syntax errors (could be just a problem on my machine, but it is a known Bazel issue).

    Workaround:

    • Add the snippet from the comment to the google/pubsub/v1/BUILD.bazel file in the local googleapis clone and point synthtool to it:
    $ export SYNTHTOOL_GOOGLEAPIS=/path/to/googleapis
    
    • Patch the local synthtool installation. Add the following two Bazel arguments to the _generate_code() method in synthtool/gcp/gapic_bazel.py (lines 177-178):
      "--python_top=//google/pubsub/v1:myruntime",
      "--incompatible_use_python_toolchains=false",

    The workaround should convince Bazel to use Python 3, as this is the Python version in the configs.

Things left to do

  • Double check that the new client performance is adequate. There have been reports of possibly degraded performance with the new microgenerator. The 10% performance hit was declared acceptable, not a release blocker anymore.
  • After approvals, re-generate the code one more time to make sure it works without errors (such as a new version of black wanting to re-generate some files and causing the CI check to fail).
  • Lint samples. It appears that the "fixup_keywords" step in the migration guide made the linter unhappy. 😄
  • Fix samples and their tests.
  • Fix system tests.
  • Determine how to handle methods that are now either missing, e.g. get_iam_policy(), or do not support all config options anymore, e.g. create_subscription(). Adjust or delete?
  • Determine a replacement for now-unsupported client_config argument to client constructor. Or does the code generator need an update? client_config has been replaced by custom Retry settings passed to the GAPIC client publish(). If we want to support custom retries, we must update the user-facing client's publish() method.
  • Add UPGRADING guide to the docs (also depends on the previous point - some of the changes might actually be bugs in the new generator).
  • Make the samples CI checks required?

PR checklist

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

@plamut plamut added the type: process A process-related concern. May include testing, release, or the like. label Jul 15, 2020
@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Jul 15, 2020
@plamut plamut changed the title Transition the library to the new microgenerator chore: transition the library to the new microgenerator Jul 15, 2020
@plamut plamut force-pushed the use-microgenerator branch 4 times, most recently from a18ea36 to 1ecc48f Compare July 16, 2020 10:02
@plamut plamut force-pushed the use-microgenerator branch from 1ecc48f to 2b01cc4 Compare July 16, 2020 11:44
@plamut
Contributor Author

plamut commented Jul 16, 2020

I believe this PR is now ready to be reviewed. The things that still fail are related to incompatible changes in the client itself, for example:

  • Missing IAM methods - get_iam_policy(), set_iam_policy(), test_iam_permissions()
  • Missing create_subscription() parameters - dead_letter_policy, expiration_policy, etc.
  • Missing client_config parameter in publisher client constructor (is there an alternative mechanism?)

We need to figure out how to handle these and I will add the UPGRADING.md file to the docs after all these things have been clarified.

Since this PR is quite complex, every pair of eyes would be beneficial. 🙂

  • @pradn Please check if ordering keys are affected in any way due to the lack of client_config options - they need "infinite" timeouts, right?
  • @anguillanneuf Any feedback on samples would be great, as you're probably most familiar with them.
  • @busunkim96 General feedback on the migration process in case I missed something.

Let's not rush and take the time to review this thoroughly, thanks!

@plamut plamut marked this pull request as ready for review July 16, 2020 11:57
@plamut plamut requested a review from pradn July 16, 2020 11:57
@software-dov

I don't think I understand the benchmark result very well; are the two different throughput numbers for cps_publisher_task.py and cps_subscriber_task.py respectively?

Also, just had a thought that may be relevant: is the manual layer invoking the asynchronous client or the synchronous? I see a lot of task/future names and semantics, which made me ask. The asynchronous surface has had basically no optimization focus.

@plamut
Contributor Author

plamut commented Aug 20, 2020

I did some publisher profiling with yappi using roughly the code the benchmark framework uses and it seems that in the main thread a lot of time (~70%) is spent constructing PubsubMessage instances inside the publisher.publish() method:

from google.pubsub_v1 import types as gapic_types

...
# Create the Pub/Sub message object.
message = gapic_types.PubsubMessage(
    data=data, ordering_key=ordering_key, attributes=attrs
)

FWIW, profiling the current released version (i.e. the master branch) did not show that piece of code to be problematic.

Is instantiating a PubsubMessage more expensive with the microgenerator and are there ways to construct it faster, e.g. by circumventing any checks/extra logic that might be redundant for this use case?
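For readers who want to reproduce this kind of measurement without the benchmark framework, a minimal sketch follows. It uses the standard library's cProfile rather than yappi, and FakeMessage and publish() are hypothetical stand-ins, not the real PubsubMessage or publisher client; only the profiling pattern carries over.

```python
import cProfile
import pstats


class FakeMessage:
    """Hypothetical stand-in for a message type with costly construction."""

    def __init__(self, data, ordering_key, attributes):
        self.data = bytes(data)
        self.ordering_key = str(ordering_key)
        # Copy and convert every attribute, mimicking marshalling work.
        self.attributes = {str(k): str(v) for k, v in attributes.items()}


def publish(data, ordering_key="", **attrs):
    """Hypothetical publish path: build the message, then 'send' it."""
    message = FakeMessage(data, ordering_key, attrs)
    return len(message.data)


def profile_publish(n_messages=10_000):
    """Profile n_messages publish() calls and return the collected stats."""
    profiler = cProfile.Profile()
    profiler.enable()
    for seq in range(n_messages):
        publish(b"A" * 250, clientId="0", sequenceNumber=str(seq))
    profiler.disable()
    return pstats.Stats(profiler)


stats = profile_publish()
# Show the five entries with the highest cumulative time.
stats.sort_stats("cumulative").print_stats(5)
```

Comparing the cumulative time of the constructor with that of publish() gives the share of time spent building messages, which is the kind of figure quoted above.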

@software-dov

Yes, constructing messages is more expensive with the microgenerator because of its use of proto-plus. Depending on what data, ordering_key, and attrs (actual params, not formal) are, proto-plus may be doing non-trivial work to marshal python structures to protobuf structures.

Can you send me the raw data in some way? I'd like to take a closer look at which part of construction is expensive.
There's a potential hack to get around this, but it's not really something we can recommend our users do.
It's possible to interact with 'vanilla' proto types in Python and then just shove them into proto-plus types before passing the result into a method call.

E.g.

# PubsubMessage.pb() called without an argument returns the underlying protobuf class.
vanilla_pb = gapic_types.PubsubMessage.pb()(data=data, ordering_key=ordering_key, attributes=attrs)
proto_plus_pb = gapic_types.PubsubMessage.wrap(vanilla_pb)

@plamut
Contributor Author

plamut commented Aug 21, 2020

@software-dov

There's a potential hack to get around this, but it's not really something we can recommend our users do.

This actually improved things considerably - the benchmark showed a publisher throughput of ~57 MB/s, which is only ~11% worse than the current stable version. In addition, the PubsubMessage instance is constructed internally (users only pass in the message data and kwargs), meaning that we can actually use it.

Can you send me the raw data in some way?

The following is roughly what the messages look like in the benchmark and in my local profiling script:

data = b"A" * 250
ordering_key = ""
attrs = {'clientId': '0', 'sendTime': '1598046901869', 'sequenceNumber': '0'}
...
message = gapic_types.PubsubMessage(
    data=data, ordering_key=ordering_key, attributes=attrs
)

(sequenceNumber monotonically increases)

If using the hack, instantiation is considerably faster, but there is still quite a lot of time spent in Marshal.to_proto() - if we can speed this up further, it would be great.

I'll also see if the same can be used to improve subscriber performance.

I don't think I understand the benchmark result very well; are the two different throughput numbers for cps_publisher_task.py and cps_subscriber_task.py respectively?

Correct. The benchmark framework spins up separate compute instances for publisher and subscriber and then measures how well they perform individually.

Also, just had a thought that may be relevant: is the manual layer invoking the asynchronous client or the synchronous? I see a lot of task/future names and semantics, which made me ask.

It's the synchronous client, i.e. google.pubsub_v1.services.publisher.client.PublisherClient. Those futures are subclasses of google.api_core.future.Future and are managed manually by the (hand-written) publisher client itself.

@plamut
Contributor Author

plamut commented Aug 22, 2020

I have some good news - circumventing the wrapper classes around the raw protobuf messages to speed up instantiation and attribute access seems to help a lot. Running the benchmarks with the two most recent experimental commits produces the following:

INFO-Results for cps-gcloud-python-publisher:
1157113 [main] INFO com.google.pubsub.flic.Driver  - Results for cps-gcloud-python-publisher:
INFO-50%: 5.0625 - 7.59375
1157124 [main] INFO com.google.pubsub.flic.Driver  - 50%: 5.0625 - 7.59375
INFO-99%: 86.49755859375 - 129.746337890625
1157124 [main] INFO com.google.pubsub.flic.Driver  - 99%: 86.49755859375 - 129.746337890625
INFO-99.9%: 194.6195068359375 - 291.92926025390625
1157124 [main] INFO com.google.pubsub.flic.Driver  - 99.9%: 194.6195068359375 - 291.92926025390625
INFO-Average throughput: 57.02 MB/s
1157126 [main] INFO com.google.pubsub.flic.Driver  - Average throughput: 57.02 MB/s
INFO-Results for cps-gcloud-python-subscriber:
1157126 [main] INFO com.google.pubsub.flic.Driver  - Results for cps-gcloud-python-subscriber:
INFO-50%: 2216.8378200531006 - 3325.256730079651
1157127 [main] INFO com.google.pubsub.flic.Driver  - 50%: 2216.8378200531006 - 3325.256730079651
INFO-99%: 16834.112196028233 - 25251.16829404235
1157127 [main] INFO com.google.pubsub.flic.Driver  - 99%: 16834.112196028233 - 25251.16829404235
INFO-99.9%: 25251.16829404235 - 37876.75244106352
1157128 [main] INFO com.google.pubsub.flic.Driver  - 99.9%: 25251.16829404235 - 37876.75244106352
INFO-Average throughput: 57.83 MB/s
1157128 [main] INFO com.google.pubsub.flic.Driver  - Average throughput: 57.83 MB/s

The performance hit is now only around 10% (publisher ~12%, subscriber ~8%), which might actually be acceptable.
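The averages can be pulled out of driver output like the log above with a few lines of Python. This is only a sketch: the regular expressions assume the exact "INFO-Results for ..." and "INFO-Average throughput: ..." line formats shown in this thread, and parse_throughput() is a hypothetical helper, not part of the benchmark framework.

```python
import re

# Line formats are taken from the benchmark output in this thread;
# other versions of the driver may log differently.
RESULTS_RE = re.compile(r"^INFO-Results for (?P<task>[\w-]+):$")
THROUGHPUT_RE = re.compile(
    r"^INFO-Average throughput: (?P<mbps>\d+(?:\.\d+)?) MB/s$"
)


def parse_throughput(log_text):
    """Return {task name: average throughput in MB/s} from driver output."""
    results = {}
    current_task = None
    for line in log_text.splitlines():
        line = line.strip()
        match = RESULTS_RE.match(line)
        if match:
            current_task = match.group("task")
            continue
        match = THROUGHPUT_RE.match(line)
        if match and current_task is not None:
            results[current_task] = float(match.group("mbps"))
    return results


sample = """\
INFO-Results for cps-gcloud-python-publisher:
1157113 [main] INFO com.google.pubsub.flic.Driver  - Results for cps-gcloud-python-publisher:
INFO-Average throughput: 57.02 MB/s
INFO-Results for cps-gcloud-python-subscriber:
INFO-Average throughput: 57.83 MB/s
"""
print(parse_throughput(sample))
```

The timestamped duplicate lines are skipped because they do not start with "INFO-", so each task is recorded once.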

Profiling shows that the speed of creating a new pubsub message and
the speed of accessing the message's attributes significantly affect
the throughput of publisher and subscriber.

This commit makes everything faster by circumventing the wrapper class
around the raw protobuf pubsub messages where possible.
@plamut plamut force-pushed the use-microgenerator branch from 30dce66 to c29d7f8 Compare August 22, 2020 11:20
@software-dov

Optimizing marshal.to_proto is a mid-low priority ongoing effort. The good news is that any changes made to proto-plus will be available to all users for free and transparently, so if the new throughput numbers are acceptable they may eventually get better without having to cut a new release.
The not so good news is that I don't think to_proto is likely to get much faster in the general case. There may be specific instances we can optimize, or some crazy fast-path shenanigans, but the code there is doing nonzero work in order to facilitate a more flexible and ergonomic interface. The downside of proto-plus is that its user visible benefits come with a performance cost.

@software-dov

My bad, when I said "the raw data" above, I meant the profiling data. I don't particularly want to set up the benchmark environment, but I do want to try looking through the data to see what the hot spots are and if they could be optimized.

@software-dov

I have a special, optimized-but-not-reviewed-or-production-ready tarball of proto-plus. Is it possible to run the benchmark suite using it? My own benchmarking has these changes saving about 10-15 milliseconds out of about 500 milliseconds. YMMV, but if it improves the throughput enough it could be worth putting the proto-plus changes up for review.
proto-plus-1.7.1.tar.gz

@plamut
Contributor Author

plamut commented Aug 25, 2020

I have a special, optimized-but-not-reviewed-or-production-ready tarball of proto-plus. Is it possible to run the benchmark suite using it?

If pip can access and install it, we can specify this version as a dependency in the framework's requirements.txt; that should be doable.

Just to clarify, would you like to run the benchmark using the tip of this PR branch, i.e. the version that already includes the optimizations that try to bypass the protobuf buffer classes? Or on the version that excludes this last commit and uses the wrapper classes more heavily?

My bad, when I said "the raw data" above, I meant the profiling data.

No worries, sent you an email with the data just now.

@software-dov

The tip of this PR branch.

Changes are also available in this branch of my fork: https://github.com/software-dov/proto-plus-python/tree/hackaday

@plamut
Contributor Author

plamut commented Aug 26, 2020

First run, using proto-plus 1.7.1 on the tip of this PR branch. The results seem more or less similar to the previous benchmark:

INFO-Results for cps-gcloud-python-publisher:
1151295 [main] INFO com.google.pubsub.flic.Driver  - Results for cps-gcloud-python-publisher:
INFO-50%: 5.0625 - 7.59375
1151304 [main] INFO com.google.pubsub.flic.Driver  - 50%: 5.0625 - 7.59375
INFO-99%: 86.49755859375 - 129.746337890625
1151304 [main] INFO com.google.pubsub.flic.Driver  - 99%: 86.49755859375 - 129.746337890625
INFO-99.9%: 194.6195068359375 - 291.92926025390625
1151304 [main] INFO com.google.pubsub.flic.Driver  - 99.9%: 194.6195068359375 - 291.92926025390625
INFO-Average throughput: 56.58 MB/s
1151306 [main] INFO com.google.pubsub.flic.Driver  - Average throughput: 56.58 MB/s
INFO-Results for cps-gcloud-python-subscriber:
1151306 [main] INFO com.google.pubsub.flic.Driver  - Results for cps-gcloud-python-subscriber:
INFO-50%: 1477.8918800354004 - 2216.8378200531006
1151306 [main] INFO com.google.pubsub.flic.Driver  - 50%: 1477.8918800354004 - 2216.8378200531006
INFO-99%: 25251.16829404235 - 37876.75244106352
1151307 [main] INFO com.google.pubsub.flic.Driver  - 99%: 25251.16829404235 - 37876.75244106352
INFO-99.9%: 37876.75244106352 - 56815.128661595285
1151307 [main] INFO com.google.pubsub.flic.Driver  - 99.9%: 37876.75244106352 - 56815.128661595285
INFO-Average throughput: 57.37 MB/s
1151307 [main] INFO com.google.pubsub.flic.Driver  - Average throughput: 57.37 MB/s

I also did another run, the measured throughput was around 55 MB/s - a bit worse, but within the normal results variance (the typical difference appears to be somewhere in the 1-2 MB/s range).

@software-dov

Okay. In light of those numbers I'm inclined not to merge the proto-plus changes. Is there an approval process for accepting the throughput regression?

@plamut
Contributor Author

plamut commented Aug 26, 2020

Please just keep in mind that the proto-plus optimizations might not have a noticeable effect here, as the optimizations added in this PR actively try to circumvent proto-plus, but other libraries might still see benefits.

I'm not aware of any formal processes, but we do have weekly PubSub meetings on Thursdays where we discuss these things. I added the question to the agenda whether a -10% performance hit is acceptable in the new major release.

Update: Was confirmed, -10% is good enough for now, although we should strive to improve this further in the mid-term.
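For reference, the percentages discussed here are plain relative changes against a baseline throughput. The sketch below shows the arithmetic; the baseline value is hypothetical, since the stable release's measured figure is not stated in this thread, and relative_change() is an illustrative helper only.

```python
def relative_change(baseline_mbps, measured_mbps):
    """Fractional change of measured throughput relative to the baseline."""
    if baseline_mbps <= 0:
        raise ValueError("baseline throughput must be positive")
    return (measured_mbps - baseline_mbps) / baseline_mbps


# Hypothetical baseline: the thread does not state the stable release's figure.
baseline = 64.0
measured = 57.02  # publisher average from the Aug 22 run above

change = relative_change(baseline, measured)
print(f"{change:+.1%}")
```

A negative result means the measured throughput is below the baseline.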

@plamut plamut requested a review from cguardia August 28, 2020 10:02
@plamut plamut mentioned this pull request Aug 31, 2020
4 tasks
Contributor

@cguardia cguardia left a comment

This looks good to me. Sorry this took so long. Had to go commit by commit.

@plamut
Contributor Author

plamut commented Sep 7, 2020

@cguardia That's fine, appreciated. I will re-generate and benchmark the code again now, just in case, and see if we are ready for the major release.

@plamut
Contributor Author

plamut commented Sep 7, 2020

After re-generating the code yet again, performance still appears to be in line with the previous benchmarks:

INFO-Results for cps-gcloud-python-publisher:
1095535 [main] INFO com.google.pubsub.flic.Driver  - Results for cps-gcloud-python-publisher:
INFO-50%: 5.0625 - 7.59375
1095545 [main] INFO com.google.pubsub.flic.Driver  - 50%: 5.0625 - 7.59375
INFO-99%: 86.49755859375 - 129.746337890625
1095546 [main] INFO com.google.pubsub.flic.Driver  - 99%: 86.49755859375 - 129.746337890625
INFO-99.9%: 194.6195068359375 - 291.92926025390625
1095546 [main] INFO com.google.pubsub.flic.Driver  - 99.9%: 194.6195068359375 - 291.92926025390625
INFO-Average throughput: 56.04 MB/s
1095547 [main] INFO com.google.pubsub.flic.Driver  - Average throughput: 56.04 MB/s
INFO-Results for cps-gcloud-python-subscriber:
1095547 [main] INFO com.google.pubsub.flic.Driver  - Results for cps-gcloud-python-subscriber:
INFO-50%: 194.6195068359375 - 291.92926025390625
1095549 [main] INFO com.google.pubsub.flic.Driver  - 50%: 194.6195068359375 - 291.92926025390625
INFO-99%: 16834.112196028233 - 25251.16829404235
1095549 [main] INFO com.google.pubsub.flic.Driver  - 99%: 16834.112196028233 - 25251.16829404235
INFO-99.9%: 37876.75244106352 - 56815.128661595285
1095550 [main] INFO com.google.pubsub.flic.Driver  - 99.9%: 37876.75244106352 - 56815.128661595285
INFO-Average throughput: 56.33 MB/s
1095550 [main] INFO com.google.pubsub.flic.Driver  - Average throughput: 56.33 MB/s
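One rough way to check that the runs agree is to compare the average throughput across the three benchmark logs posted in this thread. The sketch below does that with the logged figures; the unlogged run of around 55 MB/s is left out because its exact values were not posted, and spread() is a hypothetical helper.

```python
# Average throughput (MB/s) copied from the three benchmark logs in this thread
# (Aug 22, Aug 26, Sep 7).
runs = {
    "publisher": [57.02, 56.58, 56.04],
    "subscriber": [57.83, 57.37, 56.33],
}


def spread(values):
    """Difference between the best and worst run, rounded to 2 decimals."""
    return round(max(values) - min(values), 2)


for task, values in runs.items():
    print(f"{task}: spread {spread(values)} MB/s over {len(values)} runs")
```

Both spreads come out at or below the 1-2 MB/s run-to-run variance mentioned earlier.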

Merging. 🎉

Edit: Ah, @kamalaboulhosn expressed a wish to review the UPGRADING guide, will wait with merging a bit more.

Contributor

@kamalaboulhosn kamalaboulhosn left a comment

The upgrade guide looks good.

Labels
cla: yes This human has signed the Contributor License Agreement. type: process A process-related concern. May include testing, release, or the like.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PublisherOptions Not Found in API Docs Transition the library to the new microgenerator
9 participants