diff --git a/Considerations.md b/Considerations.md
new file mode 100644
index 0000000..60e5725
--- /dev/null
+++ b/Considerations.md
@@ -0,0 +1,191 @@
+# Considerations on the Protocol Buffers Language
+
+There are several parts of the [specification](./LanguageSpec.md) that are complicated and
+inconsistent. These parts exist as unintentional side effects of implementation
+details in Google's reference compiler, `protoc`.
+
+This demonstrates why having a clear language spec is valuable: if the spec had been
+written first, the resulting compiler implementation would be clearer.
+These peculiar behaviors in the compiler could then be clearly categorized as bugs
+and fixed. However, because there was no language spec at the outset of
+developing Protobuf, these implementation details have become the de facto spec.
+These quirks cannot be clearly categorized as bugs because there may be users
+and source code that _rely_ on this behavior. (This is a sort of corollary to
+[Hyrum's Law](https://www.hyrumslaw.com/).)
+
+The following sections describe potential _changes_ to the specification that
+would make it much simpler, more straightforward, and more consistent. These
+changes are not backwards compatible, so introducing them would require a new
+syntax (e.g. "proto3-strict" or "proto4").
+
+## Resolving Relative References
+The most complicated and least consistent part of the specification is the section
+on [resolving references](./LanguageSpec.md#reference-resolution).
+
+There are four inconsistencies in particular that ideally would be corrected:
+1. An option name and a field type name inside a message are resolved
+   inconsistently. A field type name may refer to a sibling nested message or
+   enum without any qualifiers. However, an option name may not refer to a
+   sibling nested extension: it must qualify the extension name with the name
+   of the enclosing message.
+2. An unqualified name for the extendee of an "extend" block or for a field
+   type name will not fail if there is a sibling extension with the same name,
+   as long as a type with the correct name exists in an ancestor scope. But
+   this is not true for other type references: the input and output type names
+   of a method. For methods, if a nearer scope has an eponymous extension (for
+   example), the reference is considered invalid.
+3. Similarly, resolving custom option names will fail if there is a
+   nearer scope with an eponymous element that is not an extension. But the
+   logic for skipping non-type elements when resolving a field type reference
+   could be generalized: resolving a custom option name could skip elements
+   that are not extensions.
+4. Unqualified names are handled differently from qualified names. When the
+   name is qualified, the behavior described in the previous bullet does not
+   apply (where a matching symbol in a nearby scope will be ignored if it is
+   the wrong type, in favor of a matching symbol in a farther scope). The
+   behavior in the qualified case has more conditions (like matching the
+   first name component only to a composite element), which makes it more
+   error-prone in practice. An alternative is to always treat them as if
+   they are unqualified. Combined with the change in bullet 3 above, this
+   would improve the ergonomics of all relative references as there would be
+   far fewer cases where the resolution incorrectly finds an element of the
+   wrong kind.
+
+Another aspect that is unergonomic and a candidate for change is the existence
+of the service scope. None of the elements in this scope (services and methods)
+can actually be referenced by other declarations. So resolution when inside a
+service (such as service or method options and methods' input and output types)
+could behave as if they were in the enclosing package scope. This way, some other
+method would never even be considered when resolving a method's input or output
+type name. Admittedly, if a feature were ever added to the language that allowed
+for referring to a service or method, then such a change would be
+counter-productive.
+
+## Lack of Coherence in Option Values
+It is an unfortunate inconsistency that one cannot use array literal notation
+(i.e. a sequence of values enclosed in brackets, `[` and `]`) to define the
+value for a custom option field that is repeated. This notation is only allowed
+inside a message literal.
+
+Inside a message literal, the syntax switches to the Protobuf text
+format. Instead, the Protobuf language could use a subset of the text format (or
+an alternate syntax that is similar) that is more streamlined and more
+consistent. The following aspects of the text format are particularly inconsistent
+with the rest of the language:
+
+1. The Protobuf IDL uses curly braces (`{` and `}`) for block elements. But
+   the Protobuf text format also allows for the use of `<` and `>`.
+2. The text format allows for eliding the colon between a field name and its
+   value if the value is a message literal (or a list of message literals).
+   While this may be convenient, the inconsistency is confusing (especially
+   since only lists of messages are supported, not lists of scalar values).
+   Two alternatives for increasing the consistency follow:
+   1. Always require the colon. This makes the message literal more
+      closely resemble formats like JSON and YAML.
+   2. Allow the colon to be elided for any list value with the observation
+      that the `[` preceding the value is effective as the separator from
+      the field name (just as `{` or `<` is for message literal values).
+3. The text format encloses extension names in brackets (`[` and `]`). But
+   other aspects of the IDL, such as option naming, use parentheses to
+   enclose extension names (`(` and `)`). Allowing message literals to use
+   parentheses (and perhaps dropping support for brackets) would make the syntax
+   and punctuation in the language more internally consistent.
+4. The text format allows `,` or `;` as a separator, but they are not
+   required. The IDL syntax in other places is not as flexible: you must use
+   a comma in compact options and in range lists, and you must use a
+   semicolon for separating most other elements. So requiring a comma (and
+   not allowing a semicolon) would make the syntax and punctuation in the
+   language more internally consistent.
+5. Though the text format has no context or scope (since it can be used as
+   a data exchange format), message literals in the Protobuf IDL **do** have
+   such a context. So it would be convenient if extension names in message
+   literals supported the same kinds of references as option names: relative
+   references allowed and a leading dot (`.`) allowed to indicate the name
+   is fully-qualified.
+
+Addressing these issues would go a long way towards making this part of the
+language syntax more coherent.
+
+## Overloading Keywords
+In the Protobuf IDL, keywords are allowed as identifiers for user-defined
+elements, such as messages, fields, enums, etc. There are a handful of
+places where they may not be used, such as the first component of a type name
+for a field declaration that omits the cardinality. But this is only to
+prevent ambiguity for the parser (and is mostly due to an implementation detail
+of the hand-written recursive descent parser in `protoc`).
+
+A stronger stance on keywords would be to prevent their use in user-defined
+identifiers. This is how many languages are specified, and it can lead to
+simpler parsing, as well as making the source easier to read since language
+keywords aren't overloaded.
+
+If keywords were strictly disallowed in identifiers, a new category of
+"predeclared identifiers" could be created, which would be a subset of the
+current keywords. This distinction exists in case there are some
+words in the language that _should_ be usable as user-defined identifiers.
+Keywords cannot be used this way; predeclared identifiers can be.
+
+## Extensions and Any
+Both extensions (in proto2) and the `google.protobuf.Any` type attempt to solve
+similar problems: the ability to extend the content of a message to include
+other user-defined types, but without the message needing a priori (compile-time)
+knowledge of those user-defined types. This is most useful when writing generic
+container and envelope types.
+
+Extensions effectively reverse the dependency: instead of the base message
+definition needing to import the user-defined field type, the user-defined field
+type needs to import the base message in order to extend it.
+
+* The "good":
+  * Extensions are easily serialized just like any other field of the base
+    message would be.
+  * Extensions have semantic names, which aids readability.
+  * Extensions are treated just like any other unrecognized field when the
+    consumer of a message is not aware of all extensions.
+  * Extensions are declared in the IDL, so the association of the user-defined
+    type with the base message is part of the schema.
+* The "bad":
+  * Extensions require some form of coordination amongst all extenders to avoid
+    a tag conflict.
+  * If an extension is present in the JSON format but the consumer of the data
+    does not recognize it, it is discarded. (Though the same is true of normal
+    fields, too.)
+  * Transcoding from binary to JSON and vice versa requires knowledge of
+    the extensions. There is no way to translate these formats without
+    a descriptor registry of some sort.
+
+`Any` messages completely decouple the base message and any user-defined types:
+they are completely unrelated in the Protobuf IDL. They are, for the most part,
+implemented completely in the runtime and not necessarily part of the language.
+The only place they appear in the language is in their custom text format, which is
+used when declaring option values whose type is `google.protobuf.Any`.
+
+* The "good":
+  * Implemented almost entirely in the runtime; very little concern needed in
+    the language specification itself.
+  * When using the binary format, unrecognized message types can conveniently
+    be ignored, similar to unrecognized fields.
+  * Fields of type `Any` can be easily serialized to the proto binary format.
+* The "bad":
+  * The data is identified by the fully-qualified name of the user-defined
+    message type. This means there is no semantic name, unless the message
+    type itself has a semantic name. For scalar values, a custom wrapper
+    type must be created with a semantic name. Use of generic message types,
+    such as the well-known types, is bad practice due to lack of a semantic
+    message name.
+  * Since they are not part of the language, there is no way to define the
+    "schema", such as what kinds of messages are allowed in any given field
+    of type `google.protobuf.Any`. So a field of type `Any` is a total
+    "free for all".
+  * Transcoding from binary to JSON and vice versa requires knowledge of
+    the concrete types inside. There is no way to translate these formats
+    without a descriptor registry of some sort.
+  * Unlike with extensions, where unrecognized extensions are ignored
+    when serializing to JSON, trying to serialize an `Any` that contains
+    an unrecognized type to JSON results in runtime errors. One must
+    manually strip unrecognized message types (if the desired outcome
+    is to ignore the unrecognized data).
+
+Given the downsides of each of these, there may be room for a different
+feature that could suffice as a replacement for extensions in a "proto4"
+syntax. (No specific ideas yet.)