Supporting a new language (classic)

Andy Massimino edited this page Jan 16, 2022 · 8 revisions

Note: match-up offers utility functions to make managing b:match_words easier. When placed in an autocommand or in a file after/ftplugin/{&filetype}.vim, they can be used to customize the matching regular expressions for a particular file type.

matchup#util#append_match_words:

call matchup#util#append_match_words('some:words')

Adds a set of patterns to b:match_words, adding a comma if necessary. Use this instead of concatenating directly.

matchup#util#patch_match_words:

call matchup#util#patch_match_words(before, after)

This function replaces the literal string in before contained in b:match_words with the literal string in after.
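As a rough sketch of how the two helpers might be used together, assume a hypothetical filetype "mylang" whose b:match_words has already been set by its ftplugin and contains the group '\<if\>:\<endif\>'. The filetype name, the keywords, and the existing group are all assumptions for illustration.

```vim
" after/ftplugin/mylang.vim  ("mylang" is a hypothetical filetype)

" Add an extra group; the helper inserts the separating comma if needed.
call matchup#util#append_match_words('\<repeat\>:\<until\>')

" Replace one literal substring of b:match_words with another.
" Assumes b:match_words already contains '\<if\>:\<endif\>'.
call matchup#util#patch_match_words(
      \ '\<if\>:\<endif\>',
      \ '\<if\>:\<else\>:\<endif\>')
```

Because patch_match_words works on literal strings, the first argument has to match the existing text of b:match_words character for character.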

Match words

In order for match-up to support a new language, you must define a suitable pattern for b:match_words. If your language has a complicated syntax, or many keywords, you will need to know something about vim's regular-expressions.

The format for b:match_words is similar to that of the 'matchpairs' option: it is a comma (,)-separated list of groups; each group is a colon (:)-separated list of patterns (regular expressions). Commas and colons that are part of a pattern should be escaped with backslashes ('\:' and '\,'). It is OK to have only one group; the effect is undefined if a group has only one pattern. A simple example is

:let b:match_words = '\<if\>:\<endif\>,'
	\ . '\<while\>:\<continue\>:\<break\>:\<endwhile\>'

(In vim regular expressions, \< and \> denote word boundaries. Thus "if" matches the end of "endif" but "\<if\>" does not.) Then banging on the "%" key will bounce the cursor between "if" and the matching "endif"; and from "while" to any matching "continue" or "break", then to the matching "endwhile" and back to the "while". It is almost always easier to use literal-strings (single quotes) as above: '\<if\>' rather than "\\<if\\>" and so on.
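The effect of the word boundaries, and the difference between the two quoting styles, can be checked directly with Vim's =~# and ==# operators. This is only a small illustration and does not involve match-up itself.

```vim
" Without word boundaries, "if" is found inside "endif".
echo 'endif' =~# 'if'
" prints 1

" With \< and \>, "if" must be a whole word, so "endif" does not match.
echo 'endif' =~# '\<if\>'
" prints 0
echo 'if x' =~# '\<if\>'
" prints 1

" A literal-string (single quotes) needs half as many backslashes
" as the equivalent double-quoted string.
echo '\<if\>' ==# "\\<if\\>"
" prints 1
```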

Exception: If the ":" character does not appear in b:match_words, then it is treated as an expression to be evaluated. For example,

:let b:match_words = 'GetMatchWords()'

allows you to define a function. This can return a different string depending on the current syntax, for example. Note: this is deprecated in match-up; avoid it if possible.

Once you have defined the appropriate value of b:match_words, you will probably want to have this set automatically each time you edit the appropriate file type. The recommended way to do this is by adding the definition to a filetype-plugin file.
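A filetype-plugin file for this might look roughly like the following. The filetype name "mylang" and the keywords are hypothetical; b:undo_ftplugin is the usual Vim convention for undoing buffer-local settings when the filetype changes.

```vim
" after/ftplugin/mylang.vim  ("mylang" is a hypothetical filetype)

" Pair if/endif, and while/continue/break/endwhile.
let b:match_words = '\<if\>:\<endif\>,'
let b:match_words .= '\<while\>:\<continue\>:\<break\>:\<endwhile\>'

" Remove the setting again when the filetype changes.
let b:undo_ftplugin = get(b:, 'undo_ftplugin', '')
if !empty(b:undo_ftplugin)
  let b:undo_ftplugin .= ' | '
endif
let b:undo_ftplugin .= 'unlet! b:match_words'
```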

Tips: Be careful that your initial pattern does not match your final pattern. See the example above for the use of word-boundary expressions. It is usually better to use ".\{-}" (as few as possible) instead of ".*" (as many as possible). See :help \{-. For example, in the string "<tag>label</tag>", "<.*>" matches the whole string whereas "<.\{-}>" and "<[^>]*>" match "<tag>" and "</tag>".
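The difference between the greedy and non-greedy forms can be seen with matchstr(), which returns the first match of a pattern in a string. The variable name is only for illustration.

```vim
let text = '<tag>label</tag>'

" Greedy: .* takes as many characters as possible.
echo matchstr(text, '<.*>')
" prints <tag>label</tag>

" Non-greedy: .\{-} takes as few characters as possible.
echo matchstr(text, '<.\{-}>')
" prints <tag>

" Excluding ">" inside the tag has the same effect here.
echo matchstr(text, '<[^>]*>')
" prints <tag>
```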

Spaces

If "if" is to be paired with "end if" (Note the space!) then word boundaries are not enough. Instead, define a regular expression s:notend that will match anything but "end" and use it as follows:

:let s:notend = '\%(\<end\s\+\)\@<!'
:let b:match_words = s:notend . '\<if\>:\<end\s\+if\>'

This is a simplified version of what is done for Ada. The s:notend is a script-variable. Similarly, you may want to define a start-of-line regular expression

:let s:sol = '\%(^\|;\)\s*'

if keywords are only recognized after the start of a line or after a semicolon (;), with optional white space.
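To see what these two helper patterns accept and reject, they can be tried against a few sample strings. This is a sketch meant to be sourced from a script file, since s: variables are script-local; the sample strings are made up.

```vim
let s:notend = '\%(\<end\s\+\)\@<!'
let s:sol = '\%(^\|;\)\s*'

" An "if" that is not preceded by "end" is accepted.
echo 'if x then' =~# s:notend . '\<if\>'
" prints 1

" The "if" in "end if" is rejected by the negative look-behind.
echo 'end if' =~# s:notend . '\<if\>'
" prints 0

" "if" after a semicolon and white space is accepted.
echo 'x = 1; if y' =~# s:sol . 'if\>'
" prints 1

" "if" in the middle of a statement is not.
echo 'x = elif y' =~# s:sol . 'if\>'
" prints 0
```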

Backrefs

In any group, the expressions \1, \2, ..., \9 refer to parts of the INITIAL pattern enclosed in \(escaped parentheses\). These are referred to as back references, or backrefs. For example,

:let b:match_words = '\<b\(o\+\)\>:\(h\)\1\>'

means that "bo" pairs with "ho" and "boo" pairs with "hoo" and so on. Note that "\1" does not refer to the "\(h\)" in this example. If you have "\(nested \(parentheses\)\)" then "\d" refers to the d-th "\(" and everything up to and including the matching "\)": in "\(nested\(parentheses\)\)", "\1" refers to everything and "\2" refers to "\(parentheses\)". If you use a variable such as s:notend or s:sol in the previous paragraph then remember to count any "\(" patterns in this variable. You do not have to count groups defined by \%(\).
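As an illustration of counting groups when a helper variable is involved, the following sketch combines a look-behind prefix with a back reference. The keywords are hypothetical; the point is that the prefix is written with \%(\), so it adds no numbered group and "\1" still refers to the first "\(" of the initial pattern.

```vim
" This prefix uses \%(...\), so it does not shift the group numbers.
let s:notend = '\%(\<end\s\+\)\@<!'

" \1 is the keyword captured in the initial pattern:
" "if" pairs with "end if" and "while" pairs with "end while".
let b:match_words = s:notend . '\<\(if\|while\)\>:\<end\s\+\1\>'
```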

It should be possible to resolve back references from any pattern in the group. For example,

:let b:match_words = '\(foo\)\(bar\):more\1:and\2:end\1\2'

would not work because "\2" cannot be determined from "morefoo" and "\1" cannot be determined from "andbar". On the other hand,

:let b:match_words = '\(\(foo\)\(bar\)\):\3\2:end\1'

should work (and have the same effect as "foobar:barfoo:endfoobar"), although this has not been thoroughly tested.

You can use zero-width patterns such as \@<= and \zs. For example, if the keyword "if" must occur at the start of the line, with optional white space, you might use the pattern "\(^\s*\)\@<=if" so that the cursor will end on the "i" instead of at the start of the line. For another example, if HTML had only one tag then one could

:let b:match_words = '<:>,<\@<=tag>:<\@<=/tag>'

so that "%" can bounce between matching "<" and ">" pairs or (starting on "tag" or "/tag") between matching tags. Without the \@<=, the script would bounce from "tag" to the "<" in "</tag>", and another "%" would not take you back to where you started.

match-up extensions

On top of matchit compatibility, match-up provides a few extensions to support additional languages.

\g expressions

In your regular expressions, you can use the \g{} pseudo-atom to give special handling. This is entirely a match-up extension; vim's regex engine does not define \g in regular expressions. The syntax is as follows:

/\g{tag;arg1;arg2}/

Currently, two such tags are possible:

  • \g{hlend} terminates highlighting at this place in the regex. This is similar to but distinct from \ze, since \ze would also terminate the match for the purposes of motions and text objects. I.e., hlend only applies to highlighting.

  • \g{syn;+offset;group} and \g{syn;-offset;!group}. This is experimental. When matching, disambiguate two matches by the syntax group under the match position. The offset is how many bytes from the match position to grab the syntax. In the first alternative, the group must match the regular expression group. In the second, with !, the group must not match the regular expression group.
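A possible use of \g{hlend}, based only on the description above, is sketched here. The language and its "begin NAME ... end" blocks are hypothetical, and the exact placement of the pseudo-atom is an assumption rather than a documented recipe.

```vim
" Hypothetical: "begin NAME" opens a block and "end" closes it.
" The match covers "begin NAME" for motions and text objects,
" but highlighting is meant to stop right after "begin".
call matchup#util#append_match_words(
      \ '\<begin\>\g{hlend}\s\+\w\+:\<end\>')
```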

The midmap

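Some languages have mid words that belong to one kind of block yet may appear inside another kind of block that closes with the same end marker, so the end marker alone cannot tell which block a mid belongs to. As an example, consider a language with function:return:end and if:end and the following snippet: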

function foo(x)
    if x
        return -1
    end
end

In matchit, the return will be incorrectly matched with if/end since it simply takes the nearest block. In match-up however, we have a special option b:match_midmap to fix this. It is specified in a list of pairs as follows (for example, in ruby):

let b:match_midmap = [
      \ ['rubyRepeat', 'next'],
      \ ['rubyDefine', 'return'],
      \]

The first element of each pair is the syntax group which must be present on the block for the mid in the second element (here, return) to be considered a match for it. Suppose return would match an if without the midmap, but the if does not have the group rubyDefine. Then that if is rejected, and match-up instead tries the next outer block (repeating this process as many times as necessary).
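For the function/if language used in the snippet above, a configuration might look roughly like this. The syntax group name "mylangFunction" is an assumption: it stands for whatever group the language's syntax file assigns to the function keyword.

```vim
" Hypothetical language: function ... return ... end, and if ... end.
let b:match_words = '\<\%(function\|if\)\>:\<return\>:\<end\>'

" Only match "return" with a block carrying the (assumed) syntax
" group mylangFunction, skipping any enclosing if ... end blocks.
let b:match_midmap = [
      \ ['mylangFunction', 'return'],
      \]
```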

As it is syntax group based, this mechanism only works and is only required in classic matching.

Reference

Adapted from matchit.txt.