wvxvw/protopy

Documentation

For now, the only way to obtain documentation is to generate it. You can only generate it once the project is installed (Sphinx loads Python modules in order to generate documentation, so, unless the project is installed, the imports will not work).

The way to do this is to:

virtualenv ./.venv
. ./.venv/bin/activate
./setup.py install
./setup.py build_sphinx

About

This project is an attempt to re-implement the Protocol Buffers parser (both for source files and for the binary wire format) as a C extension for Python.

Only version 3 of Protocol Buffers is, and will be, supported. Only Python 3 (and, possibly, future versions of it) is supported.

Unfortunately, protoc, the official implementation of Protocol Buffers, accepts a lot more than the specification allows; in particular, it accepts virtually any Protobuf v2 syntax in Protobuf v3 source files. This implementation tries to match that behavior.

Note

Imports are resolved differently in this implementation than in protoc: this implementation allows parsing source files in parallel, which resolves certain edge cases of import statements differently from how protoc does it. This is intended.

What’s Different?

Protobuf was initially written for C++, where run-time code generation is problematic. When the C++ approach was copied into other languages, it may or may not have made sense there. In particular, in Python, having generated Python code for parsing messages doesn’t make much sense. This is why this implementation doesn’t create descriptors or Python sources for doing the deserialization. All you need is to point the parser at the Protobuf source files.
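
For illustration, here is a minimal sketch of that workflow. It mirrors the parsing calls from the example in the customization section further below; the file name test.proto, the message name Wrapper, and the payload.bin file are placeholders for your own setup, and the import of BinParser is omitted here, as it is in that later example.

# Read a binary payload produced elsewhere.
with open('payload.bin', 'rb') as f:
    content = f.read()

# Point the parser at the directory (or directories) containing the .proto sources.
parser = BinParser(['.'])

# Decode the payload as the message type Wrapper defined in test.proto.
result = parser.parse('test.proto', 'Wrapper', content)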

Some other differences

  • Options are parsed but ignored. Options are needed for protoc plugins to do pre- or post-processing on generated sources. protopy works in a very different way, so none of the protoc plugins would even make sense here.
  • Options on fields (such as [default=true] on bool fields) are ignored. The behavior of fields that weren’t sent over the wire is different: they will always be set to None. There’s no concept of default values. This is intended.
  • Fields sent over the wire that are not in the definition are silently ignored by protoc-generated code, but result in errors in protopy. In general, I find the latter to be the saner approach. In the future, I might consider making this configurable.
  • protopy doesn’t attempt to find errors in your definitions. It may allow things that aren’t valid for protoc. If you want better error reporting, protoc is still the better tool. One particular difference here is how definitions get resolved to their full names. It is technically possible to create an ambiguous name using the Protobuf IDL, because it is not possible to tell which part of the name represents the package and which part represents the local name. Thus:
    foo.bar.baz
    ^   ^--- local name: "bar.baz"
    ^------- package: "foo"
        

    or

    foo.bar.baz
    ^       ^-- local name: "baz"
    ^---------- package: "foo.bar"
        

    protoc will actually alert you when you have this kind of conflict. protopy will use whichever definition it encountered last.

  • Zeroth enumeration values: my reading of the documentation is that
    enum Foo {
        BAR = 0;
        BAZ = 1;
    }
        

    Shouldn’t even be valid. But protoc accepts this. Actually, it requires enums to have at least one member, and at least one member whose value is zero. protopy makes no distinction between the zeroth member of an enum and the other members, but since protoc assigns a special meaning to the zeroth member (it will use it as the default if no value is given for the field), you should always give every enumeration a zeroth member to stay compatible.

Descriptors

Descriptors in Protobuf deserve a separate paragraph.

Imagine you are a C++ programmer in the late ’90s or early 2000s. Imagine now the insurmountable task of having to generate C++ code that creates some static data structures which describe the layout of your data. Well, you are in a pickle. You only have literals to describe numerical values, and C-style arrays. Initializer lists don’t really work. I mean, they do, but nesting is horribly broken.

This is how the creative and crafty authors of Protobuf invented descriptors. Here’s the basic idea: instead of encoding arbitrarily complex structures that describe Protobuf messages in the C++ source code, one would embed a serialized description of those same structures. One would only need to hand-code the meta-structure, the Descriptor. Thus, a descriptor is an actual Protobuf message, which you can find here. The code generated by protoc creates a description of every Protobuf message and serializes it using this message definition, which it then embeds in the generated code in binary form.

On top of information pertaining to the message itself, generated descriptors contain auxiliary information pertaining to certain features of the serialized message, such as the location of the original source file containing the message definition, and even a scheme for translating Protobuf message field names into JSON. The latter deserves a specially warm place in hell, but we’ll deal with it later.

Does Python need Descriptors?

Well, now you are a Python programmer in the late second decade of the 21st century… Do you have a problem encoding complex data structures such as lists and hash tables using Python syntax? Obviously not. But the generator code for Protobuf was already written to use descriptors, and major changes are hard. However, with time, the developers of protoc realized that descriptors don’t capture all the information they need in order to create Protobuf types, and this is why, when you open a generated *_pb2.py file (misnamed, by the way, because Protobuf v3 will also generate files with this extension), you see massive amounts of copy-and-paste-style, extremely poorly named and equally poorly formatted code.

So, you have this hodge-podge mess of descriptors and badly generated Python code. But, generated code has this particular aura of magic around it: all these abbreviated vague terms with lots of underscores thrown in make you think that something very complex and important is going on there. Something you better not even try understanding because it’s so beyond your, mere mortal, abilities.

However, armed with some experience, you may be far too quick to dismiss this not-so-valuable feature (which is what I did). Here’s the payback.

Unexpectedly: JSON!

There is nothing that says field names in Protobuf should be encoded using snake case, but it must have been a convention accepted at Google at the time the format was created. Similarly, there is no particular reason why JSON field names should use camel case, but it’s kind of a tradition in JavaScript, so maybe in JSON too… but not really, no.

However, in their eternal wisdom, the protoc authors decided that they would provide a “standardized” way to convert between Protobuf and JSON. The idea is very similar to how the JSON parser and serializer are implemented in Go: they translate JSON names into Go names using a pre-defined scheme, but one can also augment Go structs with “tags”, which may completely alter the way field names are translated.

Protobuf supports this through a special option you can attach to message fields: json_name. You don’t need to know this information, and, in the overwhelming majority of cases, you can correctly predict the desired JSON field name given just the Protobuf field name. Unfortunately, this will not work all the time.
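
For instance, the usual prediction, a lowerCamelCase name derived from the snake_case Protobuf field name, can be sketched in plain Python like this (this is just an illustration of the naming convention, not part of protopy; predict_json_name is a made-up helper):

def predict_json_name(proto_name):
    # Drop the underscores and capitalize every part after the first one.
    head, *rest = proto_name.split('_')
    return head + ''.join(part.capitalize() for part in rest)

assert predict_json_name('wrapped_type') == 'wrappedType'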

As you might have figured out by now, I think json_name was a horrible idea, something that increased parsing complexity with no real benefit to the user. This is why I don’t and will not support it in this implementation.

GRPC

I have no plans to support this feature; service and rpc definitions are accepted by the parser but result in empty nodes.

Benchmarks

Performance is in the gutter… C++ rocks, Python sucks. I finally more or less finished the generator and the performance tester. Here’s what I’ve learned:

I assumed there would be no difference between using Python data structures and C++ data structures to store the contents of the messages. I’ve never been so wrong.

Here’s how Google’s Protobuf for Python does it: it parses the messages into C++ structs and exposes accessors to Python, so whenever the Python side of the code wants to know a value in such a struct, it is converted on demand.

protopy parses everything into Python tuples, dictionaries, enums and lists. So, what’s the bottom line? protopy is 3-8 times slower than protobuf.

It is still about 10 times faster than the pure-Python implementation of protobuf. The speed will also depend a lot on the structure of your messages. Smaller messages with fewer nested messages will be parsed faster by protopy than by protobuf. Messages with a lot of nesting, long lists and especially hash tables will tend to be parsed faster by protobuf.

I’m considering rewriting the parser to use a C struct for storing message data rather than using Python’s data-structures, but it will be a long process.

You can find the benchmark project in test/performance.

Good news

Well, not so bad news: on Python 3.7, protopy is only about half as slow as C++ code.

Serializing

The approach taken by this library differs from how the protobuf module works. protobuf is designed in a way that is, perhaps, suitable for C++ or Java, but is foreign to Python on a conceptual level. Python’s type system is structural at heart: types in Python are, in effect, defined by the operations (like addition and multiplication) that their basic constituents support, not by their names. Thus, given:

from collections import namedtuple

class Foo:
    def __init__(self, x, y):
        self.x = x
        self.y = y

Bar = namedtuple('Bar', 'x, y')

def baz(arg):
    print('arg: {} {}'.format(arg.x, arg.y))

Foo(1, True) is of the same type as Bar(1, True) from the standpoint of baz.

Java’s type system, by contrast, is nominative at heart. Types are almost never considered the same unless they are intentionally defined to be interchangeable in the source of the program.

Since serialization operates across programs, there is no real way for environments like Java or C++ to ensure type correctness of serialized data. And this is why, I believe, the approach taken by this library is better. protopy.Serializer doesn’t try to establish that the serialized object belongs to a specific class. The requirement for an object to be serialized as a particular message is that it responds to very broadly defined protocols. In particular, in order for an object to be serialized as a message, it needs to be iterable, and the iteration must provide a mapping between field values and their desired keys.

Similarly, if a field is encoded as repeated, its value must implement the sequence protocol. If it maps to a map field, it must implement the mapping protocol. Atomic values must implement the corresponding protocols too: if a field is to be encoded as int32, its value should either be an integer, or an object whose class inherits from int, or it must define an __int__() method. If it is meant to be encoded as a string, it must define __str__(), and so on.
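
To illustrate these protocols, here is a sketch of a plain domain object that could stand in for a message (the class names, field names, and the exact shape of the pairs yielded by __iter__ are my own illustration of the description above, not an API taken from protopy):

class Millimetres(float):
    # A value destined for an int32 field only needs to be convertible
    # with __int__(); it doesn't have to be an int subclass.
    def __int__(self):
        return round(self)

class Point:
    # No special base class: any iterable yielding (field name, value)
    # pairs can play the role of a message.
    def __init__(self, x, y, label):
        self.x = x
        self.y = y
        self.label = label

    def __iter__(self):
        yield 'x', self.x
        yield 'y', self.y
        yield 'label', self.label

point = Point(Millimetres(3.7), Millimetres(1.2), 'origin offset')
print(dict(point))  # {'x': 3.7, 'y': 1.2, 'label': 'origin offset'}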

This also means that, unlike in the protobuf package, there are no special message classes that implement MessageToString() or MessageFromString() methods, etc. I believe that is a bad way to go about serialization: it encourages boilerplate code that translates domain objects into objects whose only purpose is to be accepted by the serializer before being discarded. This is just wasteful.

Unlike the protobuf package, protopy can serialize atomic values such as integers and strings. This removes the need for proprietary extensions such as google.protobuf.wrappers.StringValue.

Customizing (de)serialization

Sometimes you may encounter Protobuf messages with special serialization behavior. For example, the StringValue mentioned above is intended to map to str instances in Python. protopy doesn’t do this translation automatically. Instead, it allows you to substitute known message types with your own constructors.

For example, you could do this:

content = ...
parser = BinParser(['.'])

result = parser.parse('test.proto', 'Wrapper', content)
original = parser.def_parser.find_definition(b'Wrapper')

class Replacement:

    def __init__(self, original):
        self.original = original

    def replacement(self, *args):
        return {'replaced': self.original(*args)}

rp = Replacement(original)
parser.def_parser.update_definition(b'Wrapper', rp.replacement)

result = parser.parse('test.proto', 'Wrapper', content)
assert result['replaced'].wrapped_type == 'Wrapped'

However, you should be extremely careful when doing this. The above example will prevent you from serializing the resulting message back into its binary form.

Speedups

Python enums are very complex and slow, yet enumerations are one of the basic building blocks of Protobuf definitions. DefParser allows you to customize the way enumerations are instantiated when a binary payload is read. In particular, there’s a helper procedure, protopy.parser.simple_enum, that you can substitute in place of the default enumeration factory in order to speed up parsing. Note that using this function will make the resulting enumeration values impossible to serialize back.
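
As a rough illustration of why this matters (plain standard-library Python, unrelated to protopy’s internals), compare instantiating enum members by value with just keeping the raw wire value around:

import enum
import timeit

class Color(enum.Enum):
    RED = 0
    GREEN = 1
    BLUE = 2

# Turning a wire value into an Enum member goes through Enum.__call__
# and a by-value lookup.
enum_time = timeit.timeit(lambda: Color(1), number=100_000)

# Keeping the raw integer (or a small tuple around it) is much cheaper.
plain_time = timeit.timeit(lambda: (1, 'GREEN'), number=100_000)

print(f'Enum: {enum_time:.3f}s, plain value: {plain_time:.3f}s')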

Competitors

I’ve tried some other projects claiming to implement some degree of the Protobuf format. Here’s what I’ve found:

  1. pyrobuf: doesn’t support maps. I wasn’t able to verify their performance claims. Documentation does not exist.
  2. protocyt: doesn’t support block comments; I couldn’t really get past this error. Documentation does not exist.
  3. protobuf-gen: actually just a tool for fixing a few problems with Python definitions generated by Google’s compiler.
  4. mypy-protobuf: actually a plugin for protoc; not sure about its purpose.
  5. protobuf-py3: an obsolete fork of Google’s Protobuf.
  6. pystream-protobuf: a wrapper for Google’s Protobuf.
  7. python3-protobuf: just another name for the protobuf package; they point to the same source code, but this one doesn’t have a lot of Google’s proprietary extensions.
  8. protol: a tool for automating Python stub generation using protoc.

Progress

Many parts of the code are still of prototype quality; however, the interface is more or less stable.

Priority Tasks

  • [X] Memory (de)allocation needs to:
    1. Be done from APR pools.
    2. Use more fine-grained pools.
  • [ ] Naming needs work, some names use inconsistent conventions.
  • [ ] Const correctness. A lot of code lacks this.
  • [X] Revise how arguments are supplied to message constructors, maybe we can shave some fat there by creating a tuple right away rather than collecting them into a hash-table and then into a tuple.
  • [X] Rewrite setup.py so that it also builds the lexer and the parser (maybe conditionally), then get rid of main.c and a few more junk files in lib.
  • [X] A few more exotic types need testing: very long varints and floats; I think they don’t parse correctly.
  • [X] defparser is kind of a mess; it can be reorganized and cleaned up a bit.
  • [X] ints in lists could be encoded into pointers instead of allocating extra memory.
  • [X] cons may have an alternative version where it doesn’t allocate more memory, but uses the value as is. Irrelevant since using APR pools.
  • [X] Some code in protopy.y never releases memory / could allocate less. Irrelevant since using APR pools.
  • [ ] Serializer needs work, a lot of functionality there repeats, and may be consolidated.
  • [ ] It seems like there’s a bug in the scheduling of file parsing; somehow very few threads get scheduled when reading files in bulk.

Medium priority

Keep this number low

find ./protopy \( -name '*.py' -or -name '*.[chyl]' \) -exec wc -l {} +

Right now it’s 11315; I would like to get it under 10K.

License

Finally, I was able to put this project under a free license.

This project is licensed under LGPL v3. It relies on the Apache Portable Runtime library, which is licensed under the Apache 2.0 license (the license text can be found under the lib directory).
