diff --git a/docs/project/principles/information_accumulation.md b/docs/project/principles/information_accumulation.md new file mode 100644 index 0000000000000..6fcfb632c5337 --- /dev/null +++ b/docs/project/principles/information_accumulation.md @@ -0,0 +1,99 @@ +# Principle: Information accumulation + + + + + +## Table of contents + +- [Background](#background) +- [Principle](#principle) +- [Applications of this principle](#applications-of-this-principle) +- [Exceptions](#exceptions) +- [Alternatives considered](#alternatives-considered) + + + +## Background + +There are many different sources of information in a program, and a tool or a +human interpreting code will not in general have full information, but will +still want to draw conclusions about the code. + +Different languages take different approaches to this problem. For example: + +- In C, information is accumulated linearly in each source file independently, + and only information from earlier in the same file is available. A program + can observe that information is incomplete at one point and complete at + another. +- In C++, the behavior is largely similar to C, except: + - Within certain contexts in a class, information from later in the class + definition is available. + - With C++20 modules, information from other source files can be made + available. + - It is easier to observe -- perhaps even accidentally -- that information + is accumulated incrementally. +- In Rust, all information from the entire crate is available everywhere + within that crate, with exceptions for constructs like proc macros that can + see the state of the program being incrementally built. +- In Swift, all information from the entire source file is available within + that source file. + +## Principle + +In Carbon, information is accumulated incrementally within each source file. +Carbon programs are invalid if they would have a different meaning if more +information were available. + +Carbon source files can be interpreted top-down, without referring to +information that appears substantially later in a file. Source files are +expected to be organized into a topological order where that makes sense, with +forward declarations used to introduce names before they are first referenced +when necessary. + +If a program attempts to use information that has not yet been provided, the +program is invalid. There are multiple options for how this can be reported: + +- The program can be rejected as soon as it tries to use information that + might not be known yet. +- For the case where the information can only be provided in the same source + file, an assumption about the information can be made at the point where it + is needed, and the program can be rejected only if that assumption turns out + to be incorrect. + +Disallowing programs from changing meaning in the context of more information +ensures that the program is interpreted consistently or is rejected. This is +especially important to the coherence of generics and templates. + +## Applications of this principle + +- As in C++, and unlike in Rust and Swift, name lookup only finds names + declared earlier. +- Classes are incomplete until the end of their definition. Unlike in C++, any + attempt to observe a property of an incomplete class that is not known until + the class is complete renders the program invalid. +- When an `impl` needs to be resolved, only those `impl`s that have already + been declared are considered. However, if a later `impl` would change the + result of any earlier `impl` lookup, the program is invalid. + +## Exceptions + +Because a class is not complete until its definition has been fully parsed, +applying this rule would make it impossible to define most member functions +within the class definition. In order to still provide the convenience of +defining class member functions inline, such member function bodies are deferred +and processed as if they appeared immediately after the end of the outermost +enclosing class, like in C++. + +## Alternatives considered + +- Allow information to be used before it is provided + [globally](/proposals/p0875.md#strict-global-consistency), + [within a file](/proposals/p0875.md#context-sensitive-local-consistency), or + [within a top-level declaration](/proposals/p0875.md#top-down-with-minimally-deferred-type-checking). +- [Do not allow inline method bodies to use members before they are declared](/proposals/p0875.md#strict-top-down) +- [Do not allow separate declaration and definition](/proposals/p0875.md#disallow-separate-declaration-and-definition) diff --git a/proposals/p0875.md b/proposals/p0875.md new file mode 100644 index 0000000000000..cd0c48aa91e49 --- /dev/null +++ b/proposals/p0875.md @@ -0,0 +1,1001 @@ +# Principle: information accumulation + + + +[Pull request](https://github.com/carbon-language/carbon-lang/pull/875) + + + +## Table of contents + +- [Problem](#problem) +- [Background](#background) + - [Single-pass "splat" compilation](#single-pass-splat-compilation) + - [Global consistency](#global-consistency) + - [The C++ compromise](#the-c-compromise) + - [Separate declarations and definitions](#separate-declarations-and-definitions) +- [Proposal](#proposal) +- [Details](#details) + - [Goals](#goals) + - [Choice of alternative](#choice-of-alternative) +- [Rationale based on Carbon's goals](#rationale-based-on-carbons-goals) +- [Alternatives considered](#alternatives-considered) + - [Separate declaration and definition](#separate-declaration-and-definition) + - [Allow separate declaration and definition](#allow-separate-declaration-and-definition) + - [Analysis](#analysis) + - [Disallow separate declaration and definition](#disallow-separate-declaration-and-definition) + - [Analysis](#analysis-1) + - [Information flow](#information-flow) + - [Strict top-down](#strict-top-down) + - [Analysis](#analysis-2) + - [Strict global consistency](#strict-global-consistency) + - [Block-scope exception](#block-scope-exception) + - [Analysis](#analysis-3) + - [Top-down with minimally deferred type checking](#top-down-with-minimally-deferred-type-checking) + - [Analysis](#analysis-4) + - [Top-down with deferred method bodies](#top-down-with-deferred-method-bodies) + - [Analysis](#analysis-5) + - [Context-sensitive local consistency](#context-sensitive-local-consistency) + - [Analysis](#analysis-6) + + + +## Problem + +We should have consistent rules describing what information about a program is +visible where. + +## Background + +Information in a source file is provided incrementally, with each source +utterance providing a small piece of the overall picture. Different languages +have different rules for which information is available where. + +### Single-pass "splat" compilation + +In C and other languages of a similar age, single-pass compilation was highly +desirable, due to resource limits and performance concerns. In these languages: + +- Information is accumulated top-down, and can only be used lexically after it + appears. +- Most information can be discarded soon after it is provided: function bodies + don't need to be kept around once they've been converted to the output + format, and no information on local variable or parameter names needs to + persist past the end of the variable's scope. However, the types of globals + and the contents of type definitions must be retained. +- The behavior of an entity can be different at different places in the same + source file. An early use may fail if it depends on information that's + provided later, and in some cases a later use may fail when an earlier use + succeeded because the use is invalid in a way that was not visible at the + point of an earlier use. + +### Global consistency + +In more modern languages such as C#, Rust, Java, and Swift, there is no lexical +information ordering. In these languages: + +- Information is effectively accumulated and processed in separate passes. +- The language design and implementation ensure that the behavior of an entity + is the same everywhere: both before its definition, after its definition, + within its definition, and in any other source file in which it was made + visible. +- Dependency cycles between program properties are carefully avoided by the + language designers. + +### The C++ compromise + +In C++, a hybrid approach is taken. There is a C-like lexical information +ordering rule, but this rule is subverted within classes by -- effectively -- +reordering certain parts of a class that appear within the class definition so +that they are processed after the class definition. This primarily applies to +the bodies of member functions. Here: + +- Information is mostly accumulated top-down, and is accumulated fully + top-down after the reordering step. +- The behavior of a class is the same within member function bodies that are + defined inside the class as it is within member function bodies defined + lexically after the class. +- The language designers need to ensure that the bounds of the member function + bodies and similar constructs can be determined without parsing them, so + that the late-parsed portions can be separated from the early-parsed + portions. In C++, this was not done successfully, and there are constructs + for which this determination is very hard or impossible. + +### Separate declarations and definitions + +Somewhat separate from the direction of information flow is the ability to +separate the information about an entity into multiple distinct regions of +source files. In C and C++, entities can be separately declared and defined. As +a consequence, these languages need rules to determine whether two declarations +declare the same entity. + +In C++, especially for templated declarations, these rules can be incredibly +complex, and even now, more than 30 years after the introduction of templates in +C++, [basic questions are not fully answered](https://wg21.link/cwg2), and +implementations disagree about which declarations declare the same entity in +fairly simple examples. + +One key benefit of this separation is in reduction of _physical dependencies_: +in order to validate a usage of an entity, we need only see a source file +containing a declaration of that entity, and need never consider the source file +containing its definition. This both reduces the number of steps required for an +incremental rebuild and reduces the input information and processing required +for each individual step. + +The ability to break physical dependencies is limited to the cases where +information can actually be hidden from the users of the entity. For example, if +the user actually needs a function body, either because they will evaluate a +call to the function during compilation or because they will inline it prior to +linking, it cannot be physically isolated from the user of that information. As +a consequence, in C++, a programmer must carefully manage which information they +put in the source files that are exposed to client code and which information is +kept separate. + +Another key benefit is that the exported interface of a source file can become +more readable, by presenting an interface that contains only the facts that are +salient to a user and not the implementation details. + +## Proposal + +Carbon will use a +[top-down approach with deferred method bodies](#top-down-with-deferred-method-bodies), +with [forward declarations](#allow-separate-declaration-and-definition). + +In any case where the [strict global consistency](#strict-global-consistency) +model with the [block-scope exception](#block-scope-exception) would consider +the program to be invalid or would assign it a different meaning, the program is +invalid. + +Put another way: we accept exactly the cases where those two models both accept +and give the same result, and reject all other cases. + +## Details + +Each entity has only one meaning and only one set of properties; that meaning +and those properties do not depend on which source file you are within or where +in that source file you are. However, not all information will be available +everywhere, and it is an error to attempt to use information that is not +available -- either because it's in a different source file and not exported or +not imported, or because it appears later in the same source file. + +For example, a full definition of a class `Widget` may be available in some +parts of the program, with only a forward declaration available in other parts +of the program. In this case, any attempt to use `Widget` where only a forward +declaration is available will always either result in the program being invalid +or doing the same thing that would happen if the full information were +available. This is a stronger guarantee than is provided by C++'s One Definition +Rule, which guarantees that every use sees an equivalent definition or no +definition, but does not guarantee that every use has the same behavior -- in +C++ a function could do one thing if `Widget` is defined and a different thing +if `Widget` is only forward declared, and that will not be possible in Carbon. + +Name lookup only provides earlier-declared names. However, the bodies of class +member functions that are defined inside the class are reordered to be +separately defined after the class, so all class member names are declared +before the class body is seen. Unlike in C++, classes are either incomplete or +complete, with no intermediate state. Any attempt to perform name lookup into an +incomplete class is an error, even if the name being looked up has already been +declared. + +``` +class A { + fn F() { + // OK: A and B are complete class types here. + var b: A.B; + } + + class B {} + + // Error: cannot look up B within incomplete class A. + var p: A.B*; +} +``` + +Forward declarations are permitted in order to allow separation of interface and +implementation whenever the developer wishes to do so, and introduce names early +in order to break cycles or when the developer does not wish to define entities +in topological order. All declarations of an entity are required to be part of +the same library, but can be in different source files. If a point within a +source file can only see a declaration of an entity and not its definition, some +uses of that entity will be invalid that would be valid if the definition were +visible. For example, without a class definition, we may be unable to create +instances of the class, and without a function definition, we may be unable to +use the function to compute a compile-time constant. + +### Goals + +For this proposal, we have the following goals as refinements of the overall +Carbon goals: + +- _Comprehensibility._ Our rules should be understandable, and should minimize + surprise and gotchas. Our behavior should be self-consistent, and + explainable in only a few sentences. +- _Ergonomics._ It should be easy to express common developer desires, without + a lot of boilerplate or repetitive code. +- _Readability._ Code written using our rules should be as straightforward as + possible for Carbon developers to read and reason about. +- _Efficient and simple compilation._ It should be relatively straightforward + to implement our semantic rules. Implementation heroics shouldn't be + required, and the number of special cases required should be minimized. +- _Diagnosability._ An implementation should be able to explain coding errors + in ways that are easy to understand and are well-correlated with the error + and its remedy. Diagnostics should appear in an order and style that guides + the developer through logical steps to fix their mistakes. +- _Toolability._ Relatively simple tools should be able to understand simpler + properties of Carbon code. It should ideally be possible to identify which + names can be used in a particular context and what those names mean without + full processing. It should ideally be possible to gather useful and mostly + complete information about a potentially-invalid source file that is + currently being edited, for which it may be desirable to assume there is a + "hole" in the source file at the cursor position that will be filled by + unknown code. + +### Choice of alternative + +We have a large space of options here, all of which have both benefits and +drawbacks. The rationale for our choice is as follows: + +The first consideration is whether to provide a +[globally consistent view](#strict-global-consistency) of code. Providing such a +view would require us to find a compatible build process that can scale to large +projects; see the [detailed analysis](#analysis-3) for details. Moreover, this +is likely to be a choice that is relatively hard to evolve away from. A +non-scalable build process would be an existential threat to the Carbon language +under its stated goals, so we set this option aside. This should be revisited if +a complete design for a scalable build process with a globally-consistent +semantic model is proposed. + +Once we step away from strict global consistency, there will be cases where the +meaning of an entity differs in different parts of the program. In order for +programs to behave consistently and predictably, especially in the presence of +templates and generics, we do not want to allow the same operation to have two +or more different behaviors depending on which information is available. So we +choose to make programs invalid if they would make use of information that is +not available. + +Even though we have chosen not to provide a globally consistent view, we could +still choose to provide a consistent view throughout a package, library, or +source file, such as in the +[context-sensitive local consistency](#context-sensitive-local-consistency) +model. However, again for build performance reasons, we want to avoid +package-at-a-time or even library-at-a-time compilation, and would like a +file-at-a-time compilation strategy. The cost of having different views in +different files is substantially reduced given that we have already chosen to +have different views at least in different packages. So we allow the known +information about an entity to vary between source files, even in the same +library. + +This implies that we at least need forward declarations for entities that are +declared in an `api` file for a library and defined in an `impl` file. However, +this leaves the question of what happens within a single source file: do +entities need to be manually written in dependency order, or is the +implementation expected to use information that appears after its point of +declaration? Neither option guarantees that an entity will behave consistently +wherever it's used, because we already gave up that guarantee for entities +visible across source files, and we need to cope with the resulting risk of +incoherence anyway. + +The simplest option for implementation, and the one most similar to C++, would +then be to accumulate information top-down. This is also the choice that is most +consistent with the cross-source-file behavior: moving a prefix of a source file +into a separate, imported file should generally preserve the validity of the +program. Further, this choice gives us the most opportunity for future language +evolution, as it is the most restrictive option, and hence it is +forward-compatible with the other rules under consideration, as well as +supporting paths for evolution that require a top-down view, such as some +approaches to metaprogramming. + +This choice implies that we need forward declarations, both for declarations in +different source files from their definitions and to resolve cycles or cases +where the developer cannot or does not wish to write code in topological order. +Allowing separate declaration and definition may also be valuable to allow the +code author to hide implementation details from readers of the interface, and +present a condensed catalog or table of contents for the library. This is also +an important aspect in providing a language that is familiar to C++ developers. +So we choose to +[allow separate declaration and definition](#allow-separate-declaration-and-definition) +in general. + +Given the above decisions, class member functions could not be meaningfully +defined inline under a strict top-down rule, because the class would not yet be +complete. We make an ergonomic affordance of reordering the bodies of such +functions as if they appear after the class definition. This also improves +consistency with C++, which has a similar rule. + +## Rationale based on Carbon's goals + +- [Language tools and ecosystem](/docs/project/goals.md#language-tools-and-ecosystem) + - See "Toolability" goal. +- [Software and language evolution](/docs/project/goals.md#software-and-language-evolution) + - Ensuring that the program interpretation is consistent with an + interpretation with full information makes it easier to evolve code, as + changing the amount of visible information -- for example, by reordering + code or by adding or removing imports -- cannot change the meaning of a + program from one valid meaning to another. + - Selecting a rule that disallows information from flowing backwards keeps + open more language evolution paths: + - We could allow more information to flow backwards. + - We could expose metaprogramming constructs that can inspect the + state of a partial source file without creating inconsistency in our + model. +- [Code that is easy to read, understand, and write](/docs/project/goals.md#code-that-is-easy-to-read-understand-and-write) + - See "Readability", "Ergonomics", and "Comprehensibility" gaols. +- [Fast and scalable development](/docs/project/goals.md#fast-and-scalable-development) + - See "Efficient and simple compilation" and "Diagnosability" goals. +- [Interoperability with and migration from existing C++ code](/docs/project/goals.md#interoperability-with-and-migration-from-existing-c-code) + - The chosen rule is mostly the same as the corresponding rule in C++, + both in accumulating information top-down and in treating inline method + bodies as a special case. This should keep migration simple and improve + familiarity for experienced C++ developers. + +It should be noted that the decision between the model presented here and the +leading alternative, which was +[context-sensitive local consistency](#context-sensitive-local-consistency), was +very close, and came down to minor details. For each of our goals, neither +alternative had a significant advantage. The similarity to C++ and evolvability +of the alternative in this proposal were ultimately the deciding factors. + +## Alternatives considered + +Below, various alternatives are presented and rated according to the +[goals](#goals) for this proposal. + +Generally we need to pick a rule for allowing or disallowing separate +declaration and definition, and a rule for information flow. These choices are +not independent: some information flow rules require separate declaration and +definition. + +### Separate declaration and definition + +#### Allow separate declaration and definition + +We could allow two or more declarations of each entity, with at most one +declaration providing a definition and the other declarations merely introducing +the existence of the entity and its basic properties: + +``` +// Forward declaration. +fn F(); +// ... +// Another forward declaration of the same function. +fn F(); +// One and only definition. +fn F() {} +// Yet another forward declaration. +fn F(); +``` + +This could be done in either a free-for-all fashion as in the above example, +where any number of declarations is permitted, or in a more restricted fashion, +where there can be at most one declaration introducing an entity, and, if that +declaration is not a definition, at most one separate implementation providing +the definition. In this case, the separate definition could contain a syntactic +marker indicating that a prior declaration should exist: + +``` +// Forward declaration. +fn F(); +// ... +// Separate definition, with `impl` marker to indicate that this is just an +// implementation of an entity that was already declared. +impl fn F() {} +``` + +##### Analysis + +_Comprehensibility:_ In general, determining whether two declarations declare +the same entity may not be straightforward, particularly for a redeclaration of +an overloaded function. We have some options, but they all have challenges for +comprehensibility or ergonomics of the rule: + +- We could require token-for-token identical declarations, but this may result + in ergonomic problems -- developers may wish to leave out information in an + interface that isn't relevant for consumers of that interface, or spell the + same type or parameter name differently in an implementation. +- We could allow the declarations to differ so long as they have the same + meaning. However, there may be cases, as there are in C++, where it's + unclear whether two declarations are "sufficiently similar" so as to declare + the same entity. +- We could require all overloads of a function to be declared together in a + group, perhaps with some dedicated syntax, and then match up redeclarations + based on position in the group rather than signature. Such an approach is + likely to lead to developers, especially those familiar with the C++ rule, + being surprised. It will likely also interact poorly with cases where some + but not all of the overloads are defined in their first declaration and the + rest are defined separately, or where it is desirable for different + overloads to be defined in different implementation files. + +_Ergonomics:_ Any case where the declaration and definition are separated +results in repetitive code. However, this repetition may serve a purpose to the +extent that it's presenting an API description or acting as a compilation +barrier. + +_Readability:_ Readability of code may be substantially improved by separating a +declaration of an API from its definition, and permitting an API user to read, +in a small localized portion of a source file, only the information they care +about, without regard for implementation details. + +_Efficient and simple compilation:_ Allowing multiple declarations of an entity +introduces some implementation complexity, as the implementation must look up, +validate, and link together multiple declarations of the same entity. If +redeclarations are permitted across source files -- such as between the +interface and implementation of an API, or between multiple implementation files +-- then this may also require some stable mechanism for identifying the +declaration, such as a name mangling scheme. This may be especially complex in +the case of overloaded function templates, which might differ in arbitrary +Carbon expressions appearing in the function's declaration. This complexity +could be greatly reduced if we require all overloads to be declared together -- +at least in the same source file -- as we can then identify them by name and +index. + +This approach supports physical separation of implementation from interface, +potentially leading to build time wins through reduced recompilation and through +reducing the amount of information fed into each compilation step. + +_Diagnosability:_ Identifying errors where a redeclaration doesn't exactly match +a prior declaration is not completely straightforward. This is especially the +case when function overloads can be added by the same syntax with which a +definition of a prior declaration would be introduced. Even if declarations and +definitions use a different syntax, diagnosing a mismatch between a definition +of an overloaded function and a prior set of declarations requires some amount +of fuzzy matching to infer which one was probably intended. + +_Toolability:_ Supporting separate declarations and definitions will present the +same kinds of complexity for non-compiler tools as for compilers. Tools will +need to be able to reason about multiple declarations providing distinct +information about an entity and the effect of that on how the entity can be used +at different source locations. + +#### Disallow separate declaration and definition + +We could require each entity to be declared in exactly one place, and reject any +cases where the same entity is declared more than once. + +##### Analysis + +_Comprehensibility:_ This rule is easy to explain and easy to reason about. + +_Ergonomics:_ Minimizes the work required by the developer to implement new +functionality and maintain existing functionality. Reduces the chance that a +change in one place will be forgotten in another. + +_Readability:_ Each entity has only one declaration, giving a single canonical +location to include documentation, and no question as to where information about +the entity might be found. This will likely improve the navigability of source +code and the ease with which all the information about an entity can be +gathered, and reduce the chance that information in the declaration or +definition will be incomplete or that they will be inconsistent with each other. + +However, there would be no ability to present a function or class as source code +without also including implementation details. Such implementation details +cannot be physically separated from the API as presented to clients. The ability +to read and understand an interface will depend more heavily on tools that can +hide implementation details, such as documentation generators and IDEs with +outlining, and implementation techniques such as using `interface`s to present +the API of a class separate from its implementation. + +For developers familiar with C++ in particular, the absence of the ability to +separate declaration from definition may create friction, due to the familiarity +of that feature and having their thinking about code layout shaped around it. +Given that Carbon aims to be familiar to those coming from C++, an absence of +forward declarations will work against that goal to some extent, although +emiprical evidence suggests that the extent to which this is an issue varies +widely between existing C++ developers. + +_Efficient and simple compilation:_ This approach leads to a simpler compilation +strategy. However, in the absence of forward declarations, every compilation +referring to an entity needs to have a build dependency on the full definition +of that entity, meaning there is no physical separation between the APIs of +entities and their definitions. This would imposes a potentially-significant +build time cost due to increased recompilations when the definition of an entity +changes. + +There are possibilities to reduce this build time cost. For example, instead of +each declaration identifying whether it is public or private to a library, we +could add a third visibility level to say that the declaration but not the +definition is public, and use a stripping tool as part of the build process to +avoid dependencies on non-public definitions from cascading into unnecessary +rebuilds. + +This approach is also unlikely to be acceptable from an ergonomic standpoint +unless declarations within a source file are visible throughout at least the +entire library containing that file, although this is strictly an independent +choice. Making definitions from a source file visible throughout its library +would presumably require a library-at-a-time build strategy, which is also +expected to introduce build time costs due to changes to a library requiring +more code in that library to be recompiled. This cost could be reduced by using +a reactive compilation model, recompiling only the portions of a library that +are affected by a code change, at some complexity cost. + +_Diagnosability:_ Because each entity can only be declared once, there is no +significant challenge in diagnosing redeclaration issues. However, there is +still some work required to diagnose conflicts between similar or functionally +identical function overloads. + +_Toolability:_ Having a unique source location for each entity allows for +somewhat simpler tooling. For example, there is no need to distinguish between +"jump to declaration" and "jump to definition", or to decide which declaration +should be consulted to find documentation comments, parameter names, and so on. + +### Information flow + +#### Strict top-down + +Carbon could accumulate information top-down. We could require that each program +utterance is type-checked and fully validated before any later code is +considered. + +In order to support this and still permit cyclic references between entities, we +would need to +[allow separate declaration and definition](#allow-separate-declaration-and-definition). + +##### Analysis + +_Comprehensibility:_ This rule is simple to explain, and has no special cases. +However, the inability to look at information from later in the source file is +likely to result in gotchas: + +``` +class Base { + var n: i32; +} +class Derived extends Base { + // Returns me.(Base.n), not me.(Derived.n), because the latter has not + // been declared yet. + fn Get[me: Self]() -> i32 { return me.n; } + var n: i32; +} +``` + +It might be possible to require a diagnostic in such cases, when we find a +declaration that would change the meaning of prior code if it had appeared +earlier, but that would result in implementation complexity, and the fact that +such cases are rejected would still be a surprise. + +_Ergonomics:_ The developer is required to topologically sort their source files +in dependency order, manually breaking cycles with forward declarations. Common +refactoring tasks such as reorganizing code may require effort or tooling +assistance in order to preserve a topological order. + +Developers could adapt to this ruleset by beginning each source file with a +collection of forward declarations. This would mitigate the need to produce a +topological ordering, except within those forward declarations themselves, and +other declarations required to provide those forward declarations. For example, +a forward declaration of a class member will likely only be possible within a +class definition, and the order in which that class definition is given can be +relevant to the validity of other class definitions. However, our experience +with C++ indicates that this will not be done, and instead an ad-hoc and +undisciplined combination of a topological sort and occasional forward +declarations will be used. Indeed, including these additional forward +declarations would increase the maintenance burden and introduce an additional +opportunity for errors due to mismatches between declaration and definition. + +_Readability:_ Developers wishing to understand code have the advantage that +they need only consider prior code, and there is no possibility that a later +source utterance could change the meaning of the code they're reading. However, +it is rare to read code top-down, so the effect of this advantage may be modest. + +This advantage leads to a significant disadvantange: the behaviour of an entity +can be different at different places within a source file. For example, a type +can be incomplete in one place and complete in another, or can fail to implement +an interface when inspected early and then found to implement that interface +later. This can lead to very subtle incoherent behavior. + +In practice, the topological ordering constraint tends to lead to good locality +of information: helpers for particular functionality are often located near to +the functionality. However, this is not a universal advantage, and the +topological constraint sometimes leads to internal helpers being ordered +immediately before their first use instead of in a more logical position near +correlated functionality. + +_Efficient and simple compilation:_ This rule is mostly simple and efficient to +implement, and even allows single-pass processing of source files. + +_Diagnosability:_ Because information is provided top-down, diagnostics can also +be provided top-down and in every case the diagnostic will be caused by an error +at the given location or earlier. Fixing errors should require little or no +backtracking by the developer. + +However, an implementation that strictly confines its processing to top-down +order and produces diagnostics eagerly cannot deliver diagnostics that react +intelligently to contextual cues that appear after the point of the diagnostic. +This approach diminishes the ability for an implementation to pinpoint the cause +of the error and describe it in a developer-oriented fashion. + +_Toolability:_ Limiting information flow to top-down means that tools such as +code completion tools need only consider context prior to the cursor, and they +can be confident that if all the code prior to the cursor is valid that it can +be type-checked and suitable completions offered. + +However, in the case where the user wants to refer to a later-declared entity, +such tools would not be able to use this strategy. They would need to parse as +if there were not a top-down rule in order to find such later-declared entities, +and would likely additionally need the ability to add forward declarations or to +reorder declarations in order to satisfy the ordering requirement. + +#### Strict global consistency + +Carbon could follow an approach of requiring the behavior of every entity to be +globally consistent. In this approach, the behavior of every entity would be as +if the entire program could be consulted to determine facts about that entity. + +In practice, to make this work, we would need to limit where those facts can be +declared. For example, we limit implementations of interfaces to appear only in +source files that must already be in use wherever the question "does this type +implement that interface?" can be asked. + +In addition, we need to reject at least the case where some property of the +program recursively depends upon itself: + +``` +struct X { + var a: Array(sizeof(X), i8); +} +``` + +In order to give globally consistent semantics to, for example, a package name, +we would likely need to process all source files comprising a package at the +same time. This is likely to encourage packages to be small, whereas we have +designed package support on the assumption that the scope of a package matches +that of a complete source code repository. + +This alternative can be considered either with or without the ability to +separate declarations from definitions. If we permit separate declaration and +definition, we would likely require declarations of the same entity to appear in +the same library; however, that limitation is desirable regardless of which +strategy we choose. + +##### Block-scope exception + +Applying this rule to local name lookup in block scope does result in some +surprises. For example, C# uses this approach, and combined with its +disallowance of shadowing of local variables, this +[confuses some developers](https://stackoverflow.com/questions/1196941/variable-scope-confusion-in-c-sharp). +As a variant of this alternative, it would be reasonable and in line with likely +programmer expectations to not apply this rule to names in block scope. We call +this the _block-scope exception_. + +For example: + +``` +var m: i32 = 1; +fn F(b: bool) -> i32 { + if (b) { + var n: i32 = 2; + // With the block-scope exception: + // * `n` is unambiguously the variable declared on the line above, and + // * `m` is unambiguously the global. + // + // Without the block-scope exception: + // * depending on the rules we choose for shadowing, `n` might name the + // variable above, might be ambiguous, or might be invalid because it + // shadows the variable `n` declared below, and + // * `m` would name the local variable declared below. + return n + m; + } + var n: i32 = 3; + var m: i32 = 4; + return n * m; +} +``` + +##### Analysis + +_Comprehensibility:_ This rule is simple to explain, and has no special cases, +other than perhaps the block scope variant. The disallowance of semantic cycles +is likely to be unsurprising as it is a logical necessity in any rule. + +_Ergonomics:_ The developer can organize or arrange their code in any way they +desire. There is never a requirement to forward-declare or repeat an interface +declaration. Refactoring and code reorganization do not require any non-obvious +changes, because the same code means the same thing regardless of how it is +located relative to other code. + +_Readability:_ Reasoning about code is simple in this model, as such reasoning +is largely not context-sensitivity. Instead of questioning "what does this do +here?" we can instead consider "what does this do?". Some context sensitivity +may remain, for example due to access and name bindings differing in different +contexts. + +However, to developers accustomed to a top-down semantic model, the ability to +defer giving key information about an entity -- or even declaring it at all -- +until long after it is first used may hinder readability in some circumstances, +particularly when reading code top-down. + +_Efficient and simple compilation:_ This model forces the compilation process to +operate in multiple stages rather than as a single pass. + +Some form of cycle detection is necessary if cycles are possible. However, such +a mechanism is likely to be necessary for template instantiation too, so this is +likely not a novel requirement for Carbon. + +Forcing all files within a package to be compiled together in order to provide +consistent semantics for the package name may place an undesirable scalability +barrier on the build system. This will also tend to push up build latency, +especially for incremental builds, by decreasing the granule size of +compilation. It may increase the scope of recompilations due to additional +physical dependencies unless we implement a mechanism to detect whether a +library has changed in an "important" way. + +We do not have an existence proof of an efficient build strategy for a +Carbon-like language that follows this model. For example, Rust has known build +scalability issues, as does Java for large projects unless a tool like +[ijar](https://github.com/bazelbuild/bazel/tree/master/third_party/ijar) is used +to automatically create something resembling a C++ header file from a source +file. There is a significant risk that such an automated tool would be very hard +to create for Carbon: because the definitions of classes and functions can be +exposed in various ways through compile-time function evaluation, templates, and +metaprogramming, it seems challenging to put any useful bound on which entities +can be stripped without restricting the capabilities of client code. + +_Diagnosability:_ The implementation is likely to have more contextual +information when providing diagnostics, improving their quality. However, the +diagnostics may appear in a confusing order: if an early declaration needs +information from a later declaration in order to type-check, diagnostics +associated with that later declaration may be produced first, or may be +interleaved with diagnostics for the earlier declaration, leading to the +programmer potentially revisiting the same code multiple times during a +compile-edit cycle. + +_Toolability:_ This model requires tools to consider the whole file as context, +because code may refer to entities that are only introduced later. For an IDE +scenario, where the cursor represents a location where an arbitrary chunk of +code may be missing, this presents a challenge of determining how to +resynchronize the input in order to determine how to interpret the portion of +the source file following the cursor. + +Sophisticated tooling for a top-down model may wish to inspect the trailing +portion of the file anyway, in order to provide a better developer experience, +but this complexity would be forced upon tools with this model. + +#### Top-down with minimally deferred type checking + +We could follow a top-down approach generally, but defer type-checking some or +all top-level entities until we reach the end of that entity. For example, we +would check an entire class as a single unit, following the same principles as +in the globally-consistent rule, but using only information provided prior to +the end of the class definition. This would allow class members to use other +members that have not yet been declared, while not permitting a function +definition preceding the class definition to use such members. + +Following C++'s lead, we would apply this to at least classes and interfaces, +but perhaps to nothing else. We may still want to apply the top-down rule to +function definitions appearing within a class or interface, as the C# behavior +that the scope of a local variable extends backwards before its declaration may +be surprising. + +##### Analysis + +This approach generally has the properties as +[strict top-down](#strict-top-down), except as follows: + +_Comprehensibility:_ Slightly reduced due to additional special-casing of +top-level declarations. Gotchas would be rarer, but may not be fully eliminated. +For example: + +``` +import X; +// G from package X, not G from class X below. +fn F() { X.G(); } +class X { + fn G() {} +} +class Y { + // G from class X below, not G from package X. + fn F() { X.G(); } + class X { + fn G() {} + } +} +``` + +_Ergonomics:_ Improved within classes, as members can now be named everywhere +within the class. + +_Readability:_ The ability to read and understand code may be somewhat harmed, +as the rules for behavior across top-level declarations aren't the same as the +rules for behavior within a top-level declaration. It may be especially jarring +that code that is valid within a class definition is not valid at the top level: + +``` +class A { + // OK + var b: B; + class B {} +} + +// Error, I have not heard of B. +var b: B; +class B {} +``` + +However, the behavior of an entity no longer differs at different points within +the same declaration of that entity, so the ability to reason about the behavior +of an entity is somewhat improved compared to the strict top-down approach. For +entities without a separate declaration and definition, the behavior of those +entities is the same everywhere. + +_Efficient and simple compilation:_ This has the same complications as the +strict global consistency approach, except that it does not apply to contexts +that span multiple files, so it doesn't have the build time disadvantages of +requiring all source files in a library to be built together. + +_Diagnosability:_ Diagnostics should be encountered by the implementation in the +top-down order of the top-level declaration containing the error, but may in +principle appear in any order within that top-level declaration, if there are +forward references that require later code to be analyzed before earlier code +is. + +Diagnostics can be caused by errors appearing both before and after the location +reported, but not arbitrarily far after: later top-level declarations cannot +affect an earlier diagnostic. + +_Toolability:_ Tools that want to correctly model Carbon semantics would be +required to deal with incomplete source code in the vicinity of the cursor. This +is probably the hardest kind of incomplete source code to handle, as the region +of "damaged" code is most likely to be relevant context. In practice, this is +likely to present the same difficulties as tooling in the +[strict global consistency](#strict-global-consistency) model. + +#### Top-down with deferred method bodies + +**This is the proposed approach.** + +We could follow a top-down approach generally, but defer all processing of +bodies of class member functions until we have finished processing the class, at +which point we would process those member function bodies in the order they +appeared within the class. For example, we would reinterpret: + +```carbon +class A { + fn F[me: Self]() -> i32 { return me.n; } + fn Make() -> A { return {.n = 5}; } + var n: i32; +} +``` + +exactly as if it were written with the member function bodies out of line: + +```carbon +class A { + fn F[me: Self]() -> i32; + fn Make() -> A; + var n: i32; +} +fn A.F[me: Self]() -> i32 { return me.n; } +fn A.Make() -> A { return {.n = 5}; } +``` + +The [strict top-down](#strict-top-down) would be applied to this rewritten form +of the program. + +As in C++, member functions of nested classes would be deferred until the +outermost class is complete. This deferral would apply only to the bodies of +member functions, unlike in C++ where it also applies to default arguments, +default member initializers, and exception specifications, and unlike +[top-down with minimally deferred type checking](#top-down-with-minimally-deferred-type-checking), +where it also applies to member function signatures and the declarations and +definitions of non-function members. + +##### Analysis + +This approach generally has the properties as +[strict top-down](#strict-top-down), except as follows: + +_Comprehensibility:_ Slightly reduced due to additional special-casing of class +member functions, but these rules are well-aligned with the rules from C++, +which do not seem to suffer from major comprehensibility problems in this area. + +_Ergonomics:_ Improved within classes, as members can now be named within member +function bodies. + +_Readability:_ The ability to read and understand code may be somewhat harmed, +as the rules for behavior within a member function body are different from the +rules everywhere else in the language. However, the very similar rule in C++ is +not known to cause readability problems, so no major concerns are anticipated. + +_Efficient and simple compilation:_ The complexity of this approach is not +substantially greater than the simple top-down approach. Processing of member +function bodies must be deferred, but this can be done either by storing and +replaying the tokens, or by forming a parse tree but deferring the semantic +analysis and type-checking until the end of the class. + +_Diagnosability:_ Diagnostics for errors in member functions may appear after +diagnostics for later code in the same class. + +Developers may be confused by diagnostics claiming that a class member function +body is not available yet and referring to a point in the program text that +lexically follows the member function definition. It is likely possible to +special-case such diagnostics to explain the nature of the problem. + +_Toolability:_ Tools that want to parse incomplete Carbon code would need to +cope with member function bodies containing errors. For the most part, skipping +a brace-balanced function body should be straightforward, but tools will also +need to consider whether they attempt to detect and recover from mismatched +braces within member functions. + +#### Context-sensitive local consistency + +We could use different behaviors in different contexts, as follows: + +- For contexts that are fundamentally ordered, such as function bodies, a + top-down rule is used. +- For contexts that are defined across multiple source files, such as packages + and namespaces, we guarantee consistent behavior within each source file, + but the behavior may be inconsistent across source files: different source + files may see different sets of names within a package or namespace, + depending on what they have imported. +- For contexts that are defined within a single source file, such as a class + or an interface, we guarantee globally consistent behavior. + +Compared to the [strict global consistency](#strict-global-consistency) model, +this would not guarantee that the contents of a package or namespace are the +same everywhere: you only see the names that you declare or import into a +package or namespace. Also, a top-down rule is used within function bodies, as +that likely better matches programmer expectations. + +Compared to the +[top-down with minimally deferred type checking](#top-down-with-minimally-deferred-type-checking) +and +[top-down with deferred method bodies](#top-down-with-deferred-method-bodies) +models, this would not require forward declarations for namespace and package +members that are declared or defined later in the same source file. + +##### Analysis + +_Comprehensibility:_ This model is probably a little easier to understand than +the minimally-deferred type-checking model, because there is no difference +between top-level declarations and nested declarations. + +Because the behavior of every entity is consistent within a file, and every name +lookup is consistent within a scope -- other than the top-down behavior within +function scopes -- there are fewer opportunities for surprises than in the +top-down approaches, and nearly as few as in the strict global consistency +model. + +_Ergonomics:_ As with the strict global consistency model, code can be organized +as the developer desires, and refactoring doesn't have any surprises due to +disrupting a topological order. + +_Readability:_ As with the strict global consistency model, readability is +improved by not being context-sensitive. However, it is file-sensitive, and a +different set of imports may lead to different behavior. + +Similarly, as with the strict global consistency model, developers accustomed to +the top-down model may find it harder to understand code in which a name can be +referenced before it is declared. + +_Efficient and simple compilation:_ This model has the same general costs as the +top-down with minimal deferred type-checking model. The implementation needs to +be able to separate parsing from type-checking, and type-check lazily in order +to handle forward references. However, more state needs to be accumulated before +type-checking can begin. + +This model supports separate compilation of source files, as no attempt is made +to ensure consistency across source files, only within a source file. + +_Diagnosability:_ The compilation process will be split up into multiple phases, +and it is likely that diagnostics from one phase will precede those from +another. For example, a parsing error for a later construct may precede a type +error for an earlier one. + +If an earlier phase fails, it may not be feasible to diagnose later phases. For +example, if parsing encounters an unrecoverable failure, type errors for the +successfully-parsed portion cannot be generated, because they may depend on +constructs appearing later in the same source file. However, this may also mean +that better diagnostics are produced for cases such as a missing `}`, where the +syntactic error will precede or prevent diagnostics for semantic errors caused +by the misinterpretation of the program resulting from that missing `}`. + +Within a particular phase of processing, diagnostics will generally be produced +in a topological order, and the processing can be arranged such that they are +produced in top-down order except where a forward reference requires that a +later declaration is checked earlier. + +_Toolability:_ Largely the same as the strict global consistency model, except +that tools only need to use the current source file as context. Also largely the +same as the top-down with minimally deferred type checking model.