add doc with considerations on the language and its future (#3)
jhump authored Jul 30, 2022
1 parent 1e30835 commit 0c133f8
191 changes: 191 additions & 0 deletions Considerations.md
# Considerations on the Protocol Buffers Language

There are several parts of the [specification](./LanguageSpec.md) that are complicated and
inconsistent. These parts exist as unintentional side effects of implementation
details in Google's reference compiler, `protoc`.

This demonstrates why having a clear language spec is valuable: had the spec been
written first, the resulting compiler implementation would have been clearer, and
these peculiar behaviors in the compiler could have been categorized as bugs
and then fixed. However, because there was no language spec at the outset of
developing Protobuf, these implementation details have become the de facto spec.
These quirks cannot be clearly categorized as bugs because there may be users
and source code that _rely_ on this behavior. (This is a sort of corollary to
[Hyrum's Law](https://www.hyrumslaw.com/).)

The following sections describe potential _changes_ to the specification that
would make it much simpler, more straightforward, and more consistent. These
changes are not backwards compatible, so introducing them would require a new
syntax (e.g. "proto3-strict" or "proto4").

## Resolving Relative References
The most complicated and least consistent part of the specification is the section
on [resolving references](./LanguageSpec.md#reference-resolution).

There are four inconsistencies in particular that ideally would be corrected:
1. The way an option name is resolved vs. a field type name inside a message is
inconsistent. A field type name may refer to a sibling nested message or
enum without any qualifiers. However, an option name may not refer to a
sibling nested extension: it must qualify the extension name with the name
of the enclosing message.
2. An unqualified name for the extendee of an "extend" block or for a field
type name will not fail if there is a sibling extension with the same name,
as long as a type with the correct name exists in an ancestor scope. But
this is not true for other type references: the input and output type names
of a method. For methods, if a nearer scope has an eponymous extension (for
example), the reference is considered invalid.
3. Similar to above, resolving custom option names will fail if there is a
nearer scope with an eponymous element that is not an extension. But the
logic for skipping non-type elements when resolving a field type reference
could be generalized: resolving a custom option name could skip elements
that are not extensions.
4. Unqualified names are handled differently from qualified names. When the
name is qualified, the behavior described in the previous bullet (where a
matching symbol in a nearby scope is ignored if it is the wrong kind, in
favor of a matching symbol in a farther scope) does not apply. The
behavior in the qualified case has more conditions (like matching the
first name component only to a composite element), which makes it more
error-prone in practice. An alternative is to always treat qualified names
as if they were unqualified. Combined with the change in bullet 3 above, this
would improve the ergonomics of all relative references, as there would be
far fewer cases where the resolution incorrectly finds an element of the
wrong kind.
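
The first inconsistency can be sketched with a small proto2 file. This is a minimal illustration, not taken from the spec: the names `example`, `Outer`, `Inner`, and `note`, and the extension number, are all hypothetical, and the rejected form is described as the text above states it rather than as verified against a specific `protoc` release.

```protobuf
syntax = "proto2";

package example;

import "google/protobuf/descriptor.proto";

message Outer {
  message Inner {}

  // Hypothetical custom option, declared as a sibling of Inner.
  extend google.protobuf.MessageOptions {
    optional string note = 50001;
  }

  // Field type name: the sibling nested message resolves without qualifiers.
  optional Inner inner = 1;

  // Option name: the sibling extension has to be qualified with the name
  // of the enclosing message.
  option (Outer.note) = "qualified";

  // Per item 1 above, the unqualified form is expected to be rejected:
  // option (note) = "unqualified";
}
```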

Another aspect that is unergonomic and a candidate for change is the existence
of the service scope. None of the elements in this scope (services and methods)
can actually be referenced by other declarations. So references that appear inside a
service (in service or method options and in methods' input and output types)
could be resolved as if they appeared in the enclosing package scope. This way, some other
method would never even be considered when resolving a method's input or output
type name. Admittedly, if a feature were ever added to the language that allowed
for referring to a service or method, then such a change would be
counter-productive.
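
A short sketch of how the service scope gets in the way. The package, message, and service names here are hypothetical, and the commented-out method shows the form that is expected to fail under the current resolution rules described above.

```protobuf
syntax = "proto3";

package example;

message Ping {}
message Pong {}

service Health {
  // A method named Ping lives in the service scope as example.Health.Ping.
  rpc Ping(.example.Ping) returns (Pong);

  // Expected to be rejected: the unqualified name "Ping" is found first in
  // the service scope, where it matches the method above, not the message.
  // rpc Check(Ping) returns (Pong);

  // Workaround: a fully-qualified name with a leading dot skips the
  // service scope entirely.
  rpc Check(.example.Ping) returns (Pong);
}
```

If references inside a service resolved from the package scope instead, the commented-out form would find the message `example.Ping` directly.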

## Lack of Coherence in Option Values
It is an unfortunate inconsistency that one cannot use array literal notation
(i.e. a sequence of values enclosed in brackets, `[` and `]`) to define the
value for a custom option field that is repeated. This notation is only allowed
inside a message literal.
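
As a minimal sketch of this limitation, assume two hypothetical custom file options, `tags` and `labels` (the names and extension numbers are illustrative only):

```protobuf
syntax = "proto2";

package example;

import "google/protobuf/descriptor.proto";

message Labels {
  repeated string values = 1;
}

extend google.protobuf.FileOptions {
  repeated string tags = 50002;
  optional Labels labels = 50003;
}

// Not allowed: array literal as the value of a repeated option.
// option (tags) = ["a", "b"];

// Instead, the option statement is repeated once per element.
option (tags) = "a";
option (tags) = "b";

// Inside a message literal, the array notation is accepted.
option (labels) = { values: ["a", "b"] };
```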

Inside a message literal, the syntax switches to the Protobuf text
format. Instead, the Protobuf language could use a subset of the text format (or
an alternate syntax that is similar) that is more streamlined and more
consistent. The following aspects of the text format are particularly inconsistent
with the rest of the language:

1. The Protobuf IDL uses curly braces (`{` and `}`) for block elements. But
   the Protobuf text format also allows for the use of `<` and `>`.
2. The text format allows for eliding the colon between a field name and its
   value if the value is a message literal (or a list of message literals).
   While this may be convenient, the inconsistency is confusing (especially
   since only lists of messages are supported, not lists of scalar values).
   Two alternatives for increasing the consistency follow:
   1. Always require the colon. This makes the message literal more
      closely resemble formats like JSON and YAML.
   2. Allow the colon to be elided for any list value with the observation
      that the `[` preceding the value is effective as the separator from
      the field name (just as `{` or `<` is for message literal values).
3. The text format encloses extension names in brackets (`[` and `]`). But
   other aspects of the IDL, such as option naming, use parentheses to
   enclose extension names (`(` and `)`). Allowing message literals to use
   parentheses (and perhaps omitting support for brackets) would make the syntax
   and punctuation in the language more internally consistent.
4. The text format allows `,` or `;` as a separator, but neither is
   required. The IDL syntax in other places is not as flexible: you must use
   a comma in compact options and in range lists, and you must use a
   semicolon for separating most other elements. So requiring a comma (and
   not allowing a semicolon) would make the syntax and punctuation in the
   language more internally consistent.
5. Though the text format has no context or scope (since it can be used as
   a data exchange format), message literals in the Protobuf IDL **do** have
   such a context. So it would be convenient if extension names in message
   literals supported the same kinds of references as option names: relative
   references allowed and a leading dot (`.`) allowed to indicate the name
   is fully-qualified.

Addressing these issues would go a long way towards making this part of the
language syntax more coherent.
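
The text-format variations listed above can be seen side by side in a single option value. This is only a sketch: `Config`, `Rule`, `extra`, and `config` are hypothetical names, as are the field and extension numbers.

```protobuf
syntax = "proto2";

package example;

import "google/protobuf/descriptor.proto";

message Rule {
  optional int32 id = 1;
}

message Config {
  optional string name = 1;
  repeated Rule rules = 2;
  repeated int32 ids = 3;
  extensions 100 to 199;
}

extend Config {
  optional string extra = 100;
}

extend google.protobuf.FileOptions {
  optional Config config = 50004;
}

// In order, the literal below uses: a comma separator, an elided colon with
// a semicolon separator, angle brackets with no separator, an elided colon
// before a list of messages, a list of scalars (colon required), and an
// extension name in brackets that is written fully-qualified.
option (config) = {
  name: "demo",
  rules { id: 1 };
  rules: < id: 2 >
  rules [ { id: 3 }, { id: 4 } ]
  ids: [5, 6]
  [example.extra]: "bracketed"
};
```

Outside the message literal, the same extension would be written in parentheses, as in `option (config) = ...`, which is the mismatch item 3 describes.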

## Overloading Keywords
In the Protobuf IDL, keywords are allowed as identifiers for user-defined
elements, such as messages, fields, enums, etc. There are a small handful of
places where they may not be used, such as the first component of a type name
for a field declaration that omits the cardinality. But this is only to
prevent ambiguity for the parser (and is mostly due to an implementation detail
of the hand-written recursive descent parser in `protoc`).
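
For illustration, the following sketch uses keywords as names for a message, its fields, an enum, and an enum value. It is written in proto2 so that every field carries an explicit cardinality label, which keeps it clear of the restricted position mentioned above. The package name is hypothetical.

```protobuf
syntax = "proto2";

package example;

// A message whose name is the keyword "message".
message message {
  optional string syntax = 1;
  optional int32 package = 2;

  // The field type here is the enclosing message, named by a keyword. The
  // explicit "repeated" label is what keeps this unambiguous for the parser.
  repeated message option = 3;
}

// An enum and an enum value that are both named with keywords.
enum enum {
  service = 0;
}
```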

A stronger stance on keywords would be to prevent their use in user-defined
identifiers. This is how many languages are specified, and it can lead to
simpler parsing, as well as making the source easier to read since language
keywords aren't overloaded.

If keywords were strictly disallowed in identifiers, a new category of
"predeclared identifiers" could be created, which would be a subset of the
current keywords. The reason for this distinction is in case there are some
words in the language that _should_ be usable as user-defined identifiers.
Keywords cannot be used this way; predeclared identifiers can be.

## Extensions and Any
Both extensions (in proto2) and the `google.protobuf.Any` type address a
similar need: extending the content of a message to include
other user-defined types, without the message needing a priori (compile-time)
knowledge of those user-defined types. This is most useful when writing generic
container and envelope types.

Extensions effectively reverse the dependency: instead of the base message
definition needing to import the user-defined field type, the user-defined field
type needs to import the base message in order to extend it.
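
A minimal sketch of this reversed dependency, shown as two files in one listing. The file names, packages, message names, and tag numbers are all hypothetical.

```protobuf
// ---- base.proto: knows nothing about the types that extend it ----
syntax = "proto2";

package base;

message Envelope {
  optional string id = 1;

  // Tag range set aside for extensions declared elsewhere.
  extensions 100 to 199;
}

// ---- user.proto: imports the base message in order to extend it ----
syntax = "proto2";

package user;

import "base.proto";

message Payment {
  optional int64 cents = 1;
}

extend base.Envelope {
  // The tag must not collide with any other extension of Envelope.
  optional Payment payment = 100;
}
```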

* The "good":
  * Extensions are easily serialized just like any other field of the base
    message would be.
  * Extensions have semantic names, which aids readability.
  * Extensions are treated just like any other unrecognized field when the
    consumer of a message is not aware of all extensions.
  * Extensions are declared in the IDL, so the association of the user-defined
    type with the base message is part of the schema.
* The "bad":
  * Extensions require some form of coordination amongst all extenders to avoid
    a tag conflict.
  * If an extension is present in the JSON format but the consumer of the data
    does not recognize it, it is discarded. (Though the same is true of normal
    fields, too.)
  * Transcoding from binary to JSON and vice versa requires knowledge of
    the extensions. There is no way to translate these formats without
    a descriptor registry of some sort.

`Any` messages decouple the base message and any user-defined types entirely:
the two are unrelated in the Protobuf IDL. `Any` messages are, for the most part,
implemented in the runtime and not necessarily part of the language.
The only place they appear in the language is in their custom text format, which is
used when declaring option values whose type is `google.protobuf.Any`.
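
The sketch below shows both sides: an envelope with an `Any` field that has no IDL-level relationship to its payload types, and an option value written in the custom text format for `Any`. The names `Note`, `Envelope`, `meta`, and `Tagged` are hypothetical, and the `type.googleapis.com/` prefix is assumed here as the type URL prefix.

```protobuf
syntax = "proto3";

package example;

import "google/protobuf/any.proto";
import "google/protobuf/descriptor.proto";

message Note {
  string text = 1;
}

// The envelope does not import or mention any payload type.
message Envelope {
  string id = 1;
  google.protobuf.Any payload = 2;
}

// Hypothetical custom option whose type is Any.
extend google.protobuf.MessageOptions {
  google.protobuf.Any meta = 50005;
}

message Tagged {
  // Custom text format for Any: a bracketed type URL followed by a message
  // literal for the packed type.
  option (meta) = {
    [type.googleapis.com/example.Note] { text: "hello" }
  };
}
```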

* The "good":
  * Implemented almost entirely in the runtime; very little concern needed in
    the language specification itself.
  * When using the binary format, unrecognized message types can conveniently
    be ignored, similar to unrecognized fields.
  * Fields of type `Any` can be easily serialized to the proto binary format.
* The "bad":
  * The data is identified by the fully-qualified name of the user-defined
    message type. This means there is no semantic name, unless the message
    type itself has a semantic name. For scalar values, a custom wrapper
    type must be created with a semantic name. Use of generic message types,
    such as the well-known types, is bad practice due to the lack of a semantic
    message name.
  * Since they are not part of the language, there is no way to define the
    "schema", such as what kinds of messages are allowed in any given field
    of type `google.protobuf.Any`. So a field of type `Any` is a total
    "free for all".
  * Transcoding from binary to JSON and vice versa requires knowledge of
    the concrete types inside. There is no way to translate these formats
    without a descriptor registry of some sort.
  * Unlike with extensions, where unrecognized extensions are ignored
    when serializing to JSON, trying to serialize an `Any` that contains
    an unrecognized type to JSON results in runtime errors. One must
    manually strip unrecognized message types (if the desired outcome
    is to ignore the unrecognized data).

Given the downsides to each of these, there may be room for a different
feature that could suffice as a replacement for extensions in a "proto4"
syntax. (No specific ideas yet.)
