- Feature Name:
or_patterns
- Start Date: 2018-08-29
- RFC PR: rust-lang/rfcs#2535
- Rust Issue: rust-lang/rust#54883
Allow |
to be arbitrarily nested within a pattern such
that Some(A(0) | B(1 | 2))
becomes a valid pattern.
Nothing this RFC proposes adds anything with respect to expressive power. Instead, the aim is to make the power we already have more easy to wield. For example, we wish to improve ergonomics, readability, and the mental model.
Consider an example match arm such as (1):
Some(Enum::A) | Some(Enum::B) | Some(Enum::C) | Some(Enum::D) => ..
Here, we are repeating Some($pat)
three times.
Compare (1) to how we could have written this with this RFC (2):
Some(Enum::A | Enum::B | Enum::C | Enum::D) => ..
We can see that this is clearly shorter and that the amount of extra work we have to do scales linearly with the number of inner variants we mention. The ability to nest patterns in this way therefore results in improved writing ergonomics.
However, as we know, code is read more than it is written. So are we trading readability for increased ergonomics? We believe this is not the case. Instead, this RFC aims to improve the readability of code by reducing the amount of redundant information that needs to be scanned.
In addition, we aim to more closely align Rust with the mental model that humans have and how we usually speak and communicate.
Consider that you wanted to ask someone what the colour of their car was. Would you be more inclined to ask:
Is your car red, white, or blue?
Or would you instead ask:
Is your car red, your car white, or your car blue?
When researching for this RFC; many people were asked and all of them preferred the first alternative. This user testing was done on both programmers and non-programmers alike and included speakers of: English, German (2) Swedish (3), French (2), Portuguese (1), Spanish (2), Farsi (3), Finnish (1), Esperanto (1), and Japanese (1).
Thus, we conjecture that it's more common for humans to not distribute and to instead use something akin to conjunctive normal form (CNF) when communicating. A likely consequence of this is that a common way to understand snippet (1) formulated in disjunctive normal form (DNF) is to first mentally reconstruct it into CNF and then understand the implications of the pattern.
By allowing users to encode their logic in the way they think instead of going through more indirect routes, we can improve the understandability of code.
A principal way in which programming languages accumulate complexity is by adding more and more rules that a programmer needs to keep in their head to write or understand what a program does. A consequence of this is that often times, caveats and corner cases make for a language that is harder to learn, understand, and write in. To avoid such caveats, it is thus imperative that we should try to keep the language more uniform rather than less. This is an important means through which it becomes possible to give users more expressiveness but at the same time limit the cost each feature takes from our complexity budget.
With this RFC, we try to reduce the complexity of the language by extending a feature which already exists, and which many users already know about, to another place. In a sense, giving the user more capabilities results in a negative increase in complexity.
In concrete terms, where before we only allowed a pattern of the form
pat | pat
at the top level of match
and similar constructs,
which special cased the language, we now allow pat | pat
anywhere a pattern
may occur whereby we simplify the ruleset of the language.
In fact, there are already users that try this expecting it to work but
then find out that it does not.
Furthermore, allowing pat | pat
in the pattern grammar also allows macros to
produce disjunctions such as $p | $q
.
This RFC wouldn't be complete without concrete use cases which it would
facilitate. While there are not an overabundance of cases where pat | pat
would help, there are some where it would. Let's go through a few of them.
-
One example which was raised in the precursor to this RFC was building a state machine which is iterating through
chars_indices
:match iter.next() { Some(_, ' ' | '\n' | '\r' | '\u{21A1}') => { // Change state } Some(index, ch) => { // Look at char } None => return Err(Eof), }
-
Other examples are listed in the equivalent GHC proposal.
-
Another example which was provided in the precursor RFC was:
for event in event_pump.poll_iter() { use sdl2::event::Event; use sdl2::keyboard::Keycode::{Escape, Q}; match event { Event::KeyDown { keycode: Some(Escape | Q), ... } => break 'game, _ => {}, } ... }
-
Other cases where this feature was requested include:
-
Another use case due to @alercah is:
pub fn is_green(self) -> bool { match self { | Tile::Suited(Suit::Souzu, 2 | 3 | 4 | 6 | 8) | Tile::Dragon(Dragon::Green) => true, _ => false, } }
-
Some further examples found with sourcegraph include:
From cc-rs:
match (self.cpp_set_stdlib.as_ref(), cmd.family) { (None, _) => {} (Some(stdlib), ToolFamily::Gnu | ToolFamily::Clang) => { cmd.push_cc_arg(format!("-stdlib=lib{}", stdlib).into()); } _ => { ... } }
From capnproto:
// Check whether the size is compatible. match expected_element_size { None | Some(Void | InlineComposite) => (), Some(Bit) => { ... } Some(Byte | TwoBytes | FourBytes | EightBytes) => { ... }, ... }
From chrono:
fn resolve_year(y: Option<i32>, q: Option<i32>, r: Option<i32>) -> ParseResult<Option<i32>> { match (y, q, r) { (y, None, None) => Ok(y), (Some(y), q, r @ (Some(0...99) | None)) => { ... }, ... } }
From maidsafe's routing:
match self.peer_mgr.connection_info_received(...) { ..., Ok(IsProxy | IsClient | IsJoiningNode) => { ... }, Ok(Waiting | IsConnected) | Err(_) => (), }
Also from routing:
match self.peer_mgr.connection_info_received(...) { Ok(Ready(our_info, their_info)) => { ... } Ok(Prepare(_) | IsProxy | IsClient | IsJoiningNode) => { ... } Ok(Waiting | IsConnected) | Err(_) => (), }
From termion:
for c in self.bytes() { match c { Err(e) => return Err(e), Ok(0 | 3 | 4) => return Ok(None), Ok(0x7f) => { buf.pop(); } Ok(b'\n' | b'\r') => break, Ok(c) => buf.push(c), } }
-
Some other use cases are:
In code using git2-rs:
match obj.kind() { Some(Commit | Tag | Tree) => ... Some(Blob) => ... None => ... }
From debcargo:
match (op, &mmp.clone()) { (&Lt, &(M(0) | MM(0, 0) | MMP(0, 0, 0))) => debcargo_bail!( "Unrepresentable dependency version predicate: {} {:?}", dep.name(), p ), (&Tilde, &(M(_) | MM(_, _))) => { vr.constrain_lt(mmp.inclast()); vr.constrain_ge(mmp); } (&Compatible, &(MMP(0, minor, _) | MM(0, minor))) => { vr.constrain_lt(MM(0, minor + 1)); vr.constrain_ge(mmp); } (&Compatible, &(MMP(major, _, _) | MM(major, _) | M(major))) => { vr.constrain_lt(M(major + 1)); vr.constrain_ge(mmp); } ..., }
-
From rustc, we have:
In
src/librustc_mir/interpret/eval_context.rs
:Some(Def::Static(..) | Def::Const(..) | Def::AssociatedConst(..)) => {},
In
src/librustc_mir/util/borrowck_errors.rs
:(&ty::TyArray(_, _), Some(true) | None) => "array",
In
src/librustc/middle/reachable.rs
:Some(Def::Local(node_id) | Def::Upvar(node_id, ..)) => { .. }
In
src/librustc/infer/error_reporting/mod.rs
:Some(hir_map::NodeBlock(_) | hir_map::NodeExpr(_)) => "body",
In
src/libfmt_macros/lib.rs
:Some((_, '>' | '<' | '^')) => { .. }
In
src/librustc/traits/select.rs
:ty::TyInfer(ty::IntVar(_) | ty::FloatVar(_)) | .. => { .. }
In
src/librustc_typeck/check/mod.rs
:ty::TyInt(ast::IntTy::I8 | ast::IntTy::I16) | ty::TyBool => { .. } ... ty::TyUint(ast::UintTy::U8 | ast::UintTy::U16) => { .. }
In
src/tools/cargo/src/cargo/sources/path.rs
:Some("Cargo.lock" | "target") => continue,
In
src/libsyntax_ext/format_foreign.rs
:('h' | 'l' | 'L' | 'z' | 'j' | 't' | 'q', _) => { state = Type; length = Some(at.slice_between(next).unwrap()); move_to!(next); }, ... let width = match self.width { Some(Num::Next) => { // NOTE: Rust doesn't support this. return None; } w @ Some(Num::Arg(_) | Num::Num(_)) => w, None => None, };
In
src/libsyntax/parse/token.rs
:BinOp(Minus | Star | Or | And) | OrOr => true,
Simply put, $p | $q
where $p
and $q
are some patterns is now itself
a legal pattern.
This means that you may for example write:
enum Foo<T> {
Bar,
Baz,
Quux(T),
}
fn main() {
match Some(Foo::Bar) {
Some(Foo::Bar | Foo::Baz) => { .. },
_ => { .. },
}
}
Because $p | $q
is itself a pattern, this means that you can nest arbitrarily:
fn main() {
match Some(Foo::Bar) {
Some(Foo::Bar | Foo::Quux(0 | 1 | 3)) => { .. },
_ => { .. }
}
}
Note that the operator |
has a low precedence. This means that if you
want the same outcome as foo @ 1 | foo @ 2 | foo @ 3
, you have to write
foo @ (1 | 2 | 3)
instead of writing foo @ 1 | 2 | 3
.
This is discussed in the rationale.
You can also use p | q
in:
-
if let
expressions:if let Foo::Bar | Foo::Quux(1 | 2) = some_computation() { ... }
-
while let
expressions:while let Ok(1 | 2) | Err(3) = different_computation() { ... }
-
let
statements:let Ok(x) | Err(x) = another_computation();
In this case, the pattern must be irrefutable as
Ok(x) | Err(x)
is. -
fn
arguments:fn foo((Ok(x) | Err(x)): Result<u8, u8>) { ... }
Here too, the pattern must be irrefutable.
-
closure arguments:
let closure = |(Ok(x) | Err(x))| x + 1;
Notice that in this case, we have to wrap the pattern in parenthesis. This restriction is currently enforced to avoid backtracking but may possibly be lifted in the future based on other developments in the grammar.
-
macros by example:
macro_rules! foo { ($p:pat) => { ... } } foo!((Ok(x) | Err(x)));
Here we must wrap the pattern in parenthesis since
$p:pat | $q:pat
is already legal in patterns.
We parameterize the pat
grammar by the choice whether to allow top level
pat | pat
. We then change the pattern grammar to:
pat<allow_top_alt>
: pat<allow_top_alt> '|' pat<allow_top_alt>
| ...
;
pat<no_top_alt>
: "(" pat<allow_top_alt> ")"
| ...
;
Here |
has the lowest precedence.
In particular, the operator @
binds more tightly than |
does.
Thus, i @ p | q
associates as (i @ p) | q
as opposed to i @ (p | q)
.
Note: pat<T>
does not entail that the grammar of Rust is context sensitive
because we "monomorphize" the parameterization below.
We then introduce a production:
top_pat : '|'? pat<allow_top_alt> ;
We then change the grammar of let
statements to (as compared to RFC 2175):
let : LET top_pat maybe_ty_ascription maybe_init_expr ';' ;
We change the grammar of if let
expressions to:
expr_if_let : IF LET top_pat '=' expr_nostruct block (ELSE block_or_if)? ;
And for while let
expressions:
expr_while_let : maybe_label WHILE LET top_pat '=' expr_nostruct block ;
For for
loop expressions we now have:
expr_for : maybe_label FOR top_pat IN expr_nostruct block ;
For match
expressions we now have:
expr_match : MATCH expr_nostruct '{' match_clause* nonblock_match_clause? '}' ;
match_clause : nonblock_match_clause ',' | block_match_clause ','? ;
nonblock_match_clause : match_arm (nonblock_expr | block_expr_dot) ;
block_match_clause : match_arm (block | block_expr) ;
match_arm : maybe_outer_attrs top_pat (IF expr_nostruct)? FAT_ARROW ;
In other words, in all of the contexts where a pattern is currently accepted,
the compiler will now accept pattern alternations of form p | q
where
p
and q
are arbitrary patterns.
For the patterns of fn
arguments we now have:
param : pat<no_top_alt> ':' ty_sum ;
For closures we now have:
inferable_param : pat<no_top_alt> maybe_ty_ascription ;
Finally, pat
macro fragment specifiers will also match the pat<no_top_alt>
production as opposed to pat<allow_top_alt>
.
As previously noted, the precedence of the operator |
is lower than that of
the operator @
. This results in i @ p | q
being interpreted as (i @ p) | q
.
In turn, this would result in an error because i
is not defined in all
alternations. An example:
fn main() {
match 1 {
i @ 0 | 1 => {},
}
}
This would result in:
error[E0408]: variable `i` is not bound in all patterns
--> src/main.rs:3:17
|
3 | i @ 0 | 1 => {},
| - ^ pattern doesn't bind `i`
| |
| variable not in all patterns
However, it is quite likely that a user who wrote i @ p | q
wanted the
semantics of i @ (p | q)
because it would be the only thing that would
be a well formed pattern. To guide the user on the way, we recommend special
casing the error message for such circumstances with for example:
error[E0408]: variable `i` is not bound in all patterns
--> src/main.rs:3:17
|
3 | i @ 0 | 1 => {},
| - ^ pattern doesn't bind `i`
| |
| variable not in all patterns
|
| hint: if you wanted `i` to cover both cases, try adding parentheses around:
|
| i @ 0 | 1
| ^^^^^
The particular design of such an error message is left open to implementations.
-
Given a pattern
p | q
at some depth for some arbitrary patternsp
andq
, the pattern is considered ill-formed if:- the type inferred for
p
does not unify with the type inferred forq
, or - the same set of bindings are not introduced in
p
andq
, or - the type of any two bindings with the same name in
p
andq
do not unify with respect to types or binding modes.
Unification of types is in all instances aforementioned exact and implicit type coercions do not apply.
- the type inferred for
-
When type checking an expression
match e_s { a_1 => e_1, ... a_n => e_n }
, for each match arma_i
which contains a pattern of formp_i | q_i
, the patternp_i | q_i
is considered ill formed if, at the depthd
where it exists the fragment ofe_s
at depthd
, the type of the expression fragment does not unify withp_i | q_i
. -
With respect to exhaustiveness checking, a pattern
p | q
is considered to coverp
as well asq
. For some constructorc(x, ..)
the distributive law applies such thatc(p | q, ..rest)
covers the same set of value asc(p, ..rest) | c(q, ..rest)
does. This can be applied recursively until there are no more nested patterns of formp | q
other than those that exist at the top level.Note that by "constructor" we do not refer to tuple struct patterns, but rather we refer to a pattern for any product type. This includes enum variants, tuple structs, structs with named fields, arrays, tuples, and slices.
- The dynamic semantics of pattern matching a scrutinee expression
e_s
against a patternc(p | q, ..rest)
at depthd
wherec
is some constructor,p
andq
are arbitrary patterns, andrest
is optionally any remaining potential factors inc
, is defined as being the same as that ofc(p, ..rest) | c(q, ..rest)
.
With respect to both static and dynamic semantics,
it is always valid to first desugar a pattern c(p | q)
in CNF to its equivalent form in DNF, i.e. c(p) | c(q)
.
However, implementing c(p | q)
in terms of a pure desugaring to c(p) | c(q)
may not be optimal as the desugaring can result in multiplicative blow-up of patterns.
An example of such blow up can be seen with:
match expr {
(0 | 1, 0 | 1, 0 | 1, 0 | 1) => { ... },
}
If we expanded this naively to DNF we would get:
match expr {
| (0, 0, 0, 0)
| (0, 0, 0, 1)
| (0, 0, 1, 0)
| (0, 0, 1, 1)
| (0, 1, 0, 0)
| (0, 1, 0, 1)
| (0, 1, 1, 0)
| (0, 1, 1, 1)
| (1, 0, 0, 0)
| (1, 0, 0, 1)
| (1, 0, 1, 0)
| (1, 0, 1, 1)
| (1, 1, 0, 0)
| (1, 1, 0, 1)
| (1, 1, 1, 0)
| (1, 1, 1, 1)
=> { ... },
}
Instead, it is more likely that a one-step case analysis will be more efficient.
Which implementation technique to use is left open to each Rust compiler.
- Some parsers will have to be rewritten by a tiny bit; We do this with any syntactic change in the language so there should not be any problem.
As for why the change as proposed in this RFC should be done, it is discussed in the motivation.
Since we already use |
for alternation at the top level, the only consistent
operator syntax for alternations in nested patterns would be |
.
Therefore, there are not many design choices to make with respect to how
this change should be done rather than if.
With respect to the precedence of |
, we cannot interpret i @ p | q
as i @ (p | q)
because it is already legal to write i @ p | j @ q
at the top level of a pattern. Therefore, if we say that |
binds more tightly,
then i @ p | j @ q
will associate as i @ (p | j @ q)
which as a different
meaning than what we currently have, thus causing a breaking change.
And even if we could associate i @ p | q
as i @ (p | q)
there is a good
reason why we should not. Simply put, we should understand @
as a
pattern / set intersection operator and the operator |
as the union operator.
This is analogous to multiplication and addition as well as conjunction and
disjunction in logic. In these fields, it is customary for multiplication and
conjunction to bind more tightly. That is, we interpret a * b + c
as
(a * b) + c
and not a * (b + c)
. Similarly, we interpret p ∧ q ∨ r
as (p ∧ q) ∨ r
and not p ∧ (q ∨ r)
.
The only real choice that we do have to make is whether the new addition to the
pattern grammar should be pat : .. | pat "|" pat ;
or if it instead should be
pat : .. | "|"? pat "|" pat ;
. We have chosen the former for 4 reasons:
-
If we chose the former we can later change to the latter but not vice versa. This is thus the conservative choice.
-
There is precedent for such a decision due to OCaml.
-
The benefit to macros is dubious as they don't have to produce leading alternations.
-
Leading alternations inside patterns is considered poor style.
However, there is one notable advantage to permitting leading |
in nested
pattern:
- Libraries or tools such as
syn
will have slightly easier time parsing the grammar of Rust.
In this RFC, we allow p | q
inside patterns of fn
arguments.
The rationale for this is simply consistency with let
which also permit
these and did so before this RFC at the top level with RFC 2175.
See the section on unresolved questions for a brief discussion.
In CSS4 (draft proposal), it is possible to write a selector
div > *:matches(ul, ol)
which is equivalent to div > ul, div > ol
.
The moral equivalent of this in Rust would be: Div(Ul | Ol)
.
Most regular expression formalisms support at least the
following operations (where a
, b
, and c
are arbitrary regexes):
-
Concatenation: "
a
followed byb
". Commonly written by just sayingab
. -
Alternation: "first match
a
or otherwise matchb
" Commonly written asa | b
.|
binds more loosely than concatenation. -
Grouping: used to define the scope of what operators apply to. Commonly written as
(a)
.
Formally, the minimal formalism we need is:
pat : terminal | pat pat | pat "|" pat | "(" pat ")" ;
Given this formalism, it is then possible to encode a regex:
a(b | c)
By the law of distributivity, we can rewrite this as:
ab | ac
This is supported in OCaml. An example from "Real World OCaml" is:
let is_ocaml_source s =
match String.rsplit2 s ~on:'.' with
| Some (_, ("ml" | "mli")) -> true
| _ -> false
While OCaml will permit the following:
let foo =
match Some(1) with
| Some(1 | 2) -> true
| _ -> false
the OCaml compiler will reject:
let foo =
match Some(1) with
| Some(| 1 | 2) -> true (* Note in particular the leading | in Some(..). *)
| _ -> false
We have chosen to impose the same restriction as OCaml here with respect to
not allowing leading |
in nested pattern alternations.
A language which is quite similar to OCaml is F#. With respect to pattern matching, we may write:
let detectZeroOR point =
match point with
| (0, 0) | (0, _) | (_, 0) -> printfn "Zero found."
| _ -> printfn "Both nonzero."
F# calls these "OR pattern"s and includes
pattern1 | pattern2
in the pattern grammar.
The equivalent proposal is currently being discussed for inclusion in Haskell.
There is support for or-patterns in various lisp libraries.
-
Should we allow
top_pat
orpat<allow_top_alt>
ininferable_param
such that closures permit|Ok(x) | Err(x)|
without first wrapping in parenthesis?We defer this decision to stabilization as it may depend on experimentation. Our current inclination is to keep the RFC as-is because the ambiguity is not just for the compiler; for humans, it is likely also ambiguous and thus harder to read.
This also applies to functions which, although do not look as ambiguous, benefit from better consistency with closures. With respect to function arguments there's also the issue that not disambiguating with parenthesis makes it less clear whether the type ascription applies to the or-pattern as a whole or just the last alternative.
-
Should the
pat
macro fragment specifier matchtop_pat
in different Rust editions or should it matchpat<no_top_alt>
as currently specified? We defer such decisions to stabilization because it depends on the outcome of crater runs to see what the extent of the breakage would be.
The benefit of avoiding pat<no_top_alt>
in as many places as possible would
both be grammatical consistency and fewer surprises for uses.
The drawbacks would be possible ambiguity or backtracking for closures and
breakage for macros.