add doc with considerations on the language and its future (#3)

bufbuild · Jul 30, 2022 · 0c133f8 · 0c133f8
1 parent 1e30835
commit 0c133f8
Showing 1 changed file with 191 additions and 0 deletions.
diff --git a/Considerations.md b/Considerations.md
@@ -0,0 +1,191 @@
+# Considerations on the Protocol Buffers Language
+
+There are several parts of the [specification](./LanguageSpec.md) that are complicated and
+inconsistent. These parts exist as unintentional side effects of implementation
+details in Google's reference compiler, `protoc`.
+
+This demonstrates why having a clear language spec is valuable: if the spec had
+been written first, then the resulting compiler implementation would be clearer.
+These peculiar behaviors in the compiler could be clearly categorized as bugs
+and then fixed. However, because there was no language spec at the outset of
+developing Protobuf, these implementation details have become the de facto spec.
+These quirks cannot be clearly categorized as bugs because there may be users
+and source code that _rely_ on this behavior. (This is a sort of corollary to
+[Hyrum's Law](https://www.hyrumslaw.com/).)
+
+The following sections describe potential _changes_ to the specification that
+would make it much simpler, more straightforward, and more consistent. These
+changes are not backwards compatible, so introducing them would require a new
+syntax (e.g. "proto3-strict" or "proto4").
+
+## Resolving Relative References
+The most complicated and least consistent part of the specification is the section
+on [resolving references](./LanguageSpec.md#reference-resolution).
+
+There are four inconsistencies in particular that ideally would be corrected:
+1. The way an option name is resolved vs. a field type name inside a message is
+   inconsistent. A field type name may refer to a sibling nested message or
+   enum without any qualifiers. However, an option name may not refer to a
+   sibling nested extension: it must qualify the extension name with the name
+   of the enclosing message.
+2. An unqualified name for the extendee of an "extend" block or for a field
+   type name will not fail if there is a sibling extension with the same name,
+   as long as a type with the correct name exists in an ancestor scope. But
+   this is not true for other type references: the input and output type names
+   of a method. For methods, if a nearer scope has an eponymous extension (for
+   example), the reference is considered invalid.
+3. Similar to above, resolving custom option names will fail if there is a
+   nearer scope with an eponymous element that is not an extension. But the
+   logic for skipping non-type elements when resolving a field type reference
+   could be generalized: resolving a custom option name could skip elements
+   that are not extensions.
+4. Unqualified names are handled differently from qualified names. When the
+   name is qualified, the behavior described in the previous bullet does not
+   apply (where a matching symbol in a nearby scope will be ignored if it is
+   the wrong type, in favor of a matching symbol in a farther scope). The
+   behavior in the qualified case has more conditions (like matching the
+   first name component only to a composite element), which makes it more
+   error-prone in practice. An alternative is to always treat them as if
+   they are unqualified. Combined with the change in bullet 3 above, this
+   would improve the ergonomics of all relative references as there would be
+   far fewer cases where the resolution incorrectly finds an element of the
+   wrong kind.
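+
+As a minimal sketch of the first inconsistency, consider a message that declares
+a nested message, a nested extension, a field, and a message option. All names
+here (`example`, `Outer`, `Inner`, `note`) and the extension number are made up
+for illustration.
+
+```proto
+syntax = "proto2";
+package example;
+
+import "google/protobuf/descriptor.proto";
+
+message Outer {
+  message Inner {}
+
+  extend google.protobuf.MessageOptions {
+    optional string note = 50001;
+  }
+
+  // Field type name: the sibling nested message needs no qualifier.
+  optional Inner inner = 1;
+
+  // Option name: per item 1, the bare sibling name "(note)" does not
+  // resolve here, so the enclosing message name must be included.
+  option (Outer.note) = "hello";
+}
+```
+
+Under the proposed change, `(note)` would be expected to resolve the same way
+`Inner` does.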
+
+Another aspect that is unergonomic and a candidate for change is the existence
+of the service scope. None of the elements in this scope (services and methods)
+can actually be referenced by other declarations. So references that appear
+inside a service (service or method options and methods' input and output
+types) could be resolved as if they were in the enclosing package scope. This
+way, another method would never even be considered when resolving a method's
+input or output type name. Admittedly, if a feature were ever added to the
+language that allowed for referring to a service or method, then such a change
+would be counter-productive.
+
+## Lack of Coherence in Option Values
+It is an unfortunate inconsistency that one cannot use array literal notation
+(i.e. a sequence of values enclosed in brackets, `[` and `]`) to define the
+value for a custom option field that is repeated. This notation is only allowed
+inside a message literal.
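+
+The following sketch shows the asymmetry. The names `label`, `settings`,
+`Settings`, and `tags` and the extension numbers are hypothetical, chosen only
+for this example.
+
+```proto
+syntax = "proto2";
+package example;
+
+import "google/protobuf/descriptor.proto";
+
+message Settings {
+  repeated string tags = 1;
+}
+
+extend google.protobuf.FileOptions {
+  repeated string label = 50002;
+  optional Settings settings = 50003;
+}
+
+// A repeated option takes one value per option statement. An array
+// literal such as `option (label) = ["a", "b"];` is not allowed here.
+option (label) = "a";
+option (label) = "b";
+
+// Inside a message literal, the array notation is allowed.
+option (settings) = { tags: ["a", "b"] };
+```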
+
+Inside a message literal, the syntax switches to the Protobuf text format.
+Instead, the Protobuf language could use a subset of the text format (or
+an alternate syntax that is similar) that is more streamlined and more
+consistent. The following aspects of the text format are particularly inconsistent
+with the rest of the language:
+
+1. The Protobuf IDL uses curly braces (`{` and `}`) for block elements. But
+   the Protobuf text format also allows for the use of `<` and `>`.
+2. The text format allows for eliding the colon between a field name and its
+   value if the value is a message literal (or a list of message literals).
+   While this may be convenient, the inconsistency is confusing (especially
+   since only lists of messages are supported, not lists of scalar values).
+   Two alternatives for increasing the consistency follow:
+   1. Always require the colon. This makes the message literal more
+      closely resemble formats like JSON and YAML.
+   2. Allow the colon to be elided for any list value with the observation
+      that the `[` preceding the value is effective as the separator from
+      the field name (just as `{` or `<` is for message literal values).
+3. The text format encloses extension names in brackets (`[` and `]`). But
+   other aspects of the IDL, such as option naming, use parentheses to
+   enclose extension names (`(` and `)`). Allowing message literals to use
+   parentheses (and perhaps omitting support for brackets) would make the syntax
+   and punctuation in the language more internally consistent.
+4. The text format allows `,` or `;` as a separator, but neither is
+   required. The IDL syntax in other places is not as flexible: you must use
+   a comma in compact options and in range lists, and you must use a
+   semicolon for separating most other elements. So requiring a comma (and
+   not allowing a semicolon) would make the syntax and punctuation in the
+   language more internally consistent.
+5. Though the text format has no context or scope (since it can be used as
+   a data exchange format), message literals in the Protobuf IDL **do** have
+   such a context. So it would be convenient if extension names in message
+   literals supported the same kinds of references as option names: relative
+   references allowed and a leading dot (`.`) allowed to indicate the name
+   is fully-qualified.
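+
+To make several of these points concrete, here is a sketch of a message literal
+that mixes the punctuation variants listed above. The types `Rule` and
+`Subject`, the extensions `weight` and `rule`, and their numbers are invented
+for illustration; this is our reading of what the text format permits, not an
+exhaustive catalog.
+
+```proto
+syntax = "proto2";
+package example;
+
+import "google/protobuf/descriptor.proto";
+
+message Rule {
+  optional string name = 1;
+  optional Rule child = 2;
+  extensions 100 to 199;
+}
+
+extend Rule {
+  optional int32 weight = 100;
+}
+
+extend google.protobuf.MessageOptions {
+  optional Rule rule = 50004;
+}
+
+message Subject {
+  // The option name uses parentheses and a relative reference. Inside the
+  // literal, the lines use a ';' separator, then '<' and '>' with the colon
+  // elided followed by a ',' separator, then a bracketed extension name.
+  option (rule) = {
+    name: "a";
+    child < name: "b" >,
+    [example.weight]: 3
+  };
+}
+```
+
+With the changes suggested in the list, the same literal would use only braces,
+colons, commas, and a parenthesized extension name.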
+
+Addressing these issues would go a long way towards making this part of the
+language syntax more coherent.
+
+## Overloading Keywords
+In the Protobuf IDL, keywords are allowed as identifiers for user-defined
+elements such as messages, fields, and enums. There is a small handful of
+places where they may not be used, such as the first component of a type name
+for a field declaration that omits the cardinality. But this is only to
+prevent ambiguity for the parser (and is mostly due to an implementation detail
+of the hand-written recursive descent parser in `protoc`).
+
+A stronger stance on keywords would be to prevent their use in user-defined
+identifiers. This is how many languages are specified, and it can lead to
+simpler parsing, as well as making the source easier to read since language
+keywords aren't overloaded.
+
+If keywords were strictly disallowed in identifiers, a new category of
+"predeclared identifiers" could be created, which would be a subset of the
+current keywords. The reason for this distinction is in case there are some
+words in the language that _should_ be usable as user-defined identifiers.
+Keywords cannot be used this way; predeclared identifiers can be.
+
+## Extensions and Any
+Both extensions (in proto2) and the `google.protobuf.Any` type attempt to solve
+similar problems: the ability to extend the content of a message to include
+other user-defined types, but without the message needing a priori (compile-time)
+knowledge of those user-defined types. This is most useful when writing generic
+container and envelope types.
+
+Extensions effectively reverse the dependency: instead of the base message
+definition needing to import the user-defined field type, the user-defined field
+type needs to import the base message in order to extend it.
+
+* The "good":
+  * Extensions are easily serialized just like any other field of the base
+    message would be.
+  * Extensions have semantic names, which aids readability.
+  * Extensions are treated just like any other unrecognized field when the
+    consumer of a message is not aware of all extensions.
+  * Extensions are declared in the IDL, so the association of the user-defined
+    type with the base message is part of the schema.
+* The "bad":
+  * Extensions require some form of coordination amongst all extenders to avoid
+    a tag conflict.
+  * If an extension is present in the JSON format but the consumer of the data
+    does not recognize it, it is discarded. (Though the same is true of normal
+    fields, too.)
+  * Transcoding from binary to JSON and vice versa requires knowledge of
+    the extensions. There is no way to translate these formats without
+    a descriptor registry of some sort.
+
+`Any` messages completely decouple the base message and any user-defined types:
+they are completely unrelated in the Protobuf IDL. They are, for the most part,
+implemented completely in the runtime and not necessarily part of the language.
+The only place they appear in the language is for their custom text format, for
+use with declaring option values whose type is `google.protobuf.Any`.
+
+* The "good":
+   * Implemented almost entirely in the runtime; very little concern needed in
+     the language specification itself.
+   * When using the binary format, unrecognized message types can conveniently
+     be ignored, similar to unrecognized fields.
+   * Fields of type `Any` can be easily serialized to the proto binary format.
+* The "bad":
+   * The data is identified by the fully-qualified name of the user-defined
+     message type. This means there is no semantic name, unless the message
+     type itself has a semantic name. For scalar values, a custom wrapper
+     type must be created with a semantic name. Use of generic message types,
+     such as the well-known types, is bad practice due to lack of a semantic
+     message name.
+   * Since they are not part of the language, there is no way to define the
+     "schema", such as what kinds of messages are allowed in any given field
+     of type `google.protobuf.Any`. So a field of type `Any` is a total
+     "free for all".
+   * Transcoding from binary to JSON and vice versa requires knowledge of
+     the concrete types inside. There is no way to translate these formats
+     without a descriptor registry of some sort.
+   * Unlike with extensions, where unrecognized extensions are ignored
+     when serializing to JSON, trying to serialize an `Any` that contains
+     an unrecognized type to JSON results in runtime errors. One must
+     manually strip unrecognized message types (if the desired outcome
+     is to ignore the unrecognized data).
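+
+The two approaches can be compared side by side in a small sketch. The
+messages `Envelope`, `AnyEnvelope`, and `Invoice` and all field numbers are
+hypothetical.
+
+```proto
+syntax = "proto2";
+package example;
+
+import "google/protobuf/any.proto";
+
+// Extension-based envelope: extenders must coordinate on numbers
+// within the declared range.
+message Envelope {
+  optional string id = 1;
+  extensions 100 to max;
+}
+
+message Invoice {
+  optional int64 cents = 1;
+}
+
+// Declared alongside the user-defined type, which depends on Envelope
+// rather than the other way around. The extension has a semantic name.
+extend Envelope {
+  optional Invoice invoice = 100;
+}
+
+// Any-based envelope: the schema says nothing about which message
+// types may appear in the payload.
+message AnyEnvelope {
+  optional string id = 1;
+  optional google.protobuf.Any payload = 2;
+}
+```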
+
+Given the downsides to each of these, there may be room for a different
+feature that could suffice as a replacement for extensions in a "proto4"
+syntax. (No specific ideas yet.)