vberlier/tokenstream

tokenstream

A versatile token stream for handwritten parsers.

from tokenstream import TokenStream

def parse_sexp(stream: TokenStream):
    """A basic S-expression parser."""
    with stream.syntax(brace=r"\(|\)", number=r"\d+", name=r"\w+"):
        brace, number, name = stream.expect(("brace", "("), "number", "name")
        if brace:
            return [parse_sexp(stream) for _ in stream.peek_until(("brace", ")"))]
        elif number:
            return int(number.value)
        elif name:
            return name.value

print(parse_sexp(TokenStream("(hello (world 42))")))  # ['hello', ['world', 42]]

Introduction

Writing recursive-descent parsers by hand can be quite elegant, but it is often more verbose than expected, especially when it comes to handling indentation and reporting proper syntax errors. This package provides a general-purpose token stream that addresses these issues and more.

Features

  • Define the set of recognizable tokens dynamically with regular expressions
  • Transparently skip over irrelevant tokens
  • Expressive API for matching, collecting, peeking, and expecting tokens
  • Clean error reporting with line numbers and column numbers
  • Contextual support for indentation-based syntax
  • Checkpoints for backtracking parsers
  • Works well with Python 3.10+ match statements

Check out the examples directory for practical examples.

Installation

The package can be installed with pip.

pip install tokenstream

Getting started

You can define tokens with the syntax() method. Each keyword argument associates a token type with a regular expression pattern. The method returns a context manager, and the specified tokens are recognized only inside the corresponding with block.

stream = TokenStream("hello world")

with stream.syntax(word=r"\w+"):
    print([token.value for token in stream])  # ['hello', 'world']
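
To give an intuition for how keyword patterns can drive a tokenizer, here is a minimal sketch built only on the standard re module. It is not tokenstream's implementation: the tokenize() helper and its behavior are assumptions made for illustration. It combines the patterns into one alternation of named groups and skips whitespace, similar to the stream's default.

```python
import re

def tokenize(source: str, **patterns: str):
    """Yield (type, value) pairs; whitespace is skipped."""
    # Hypothetical helper: one named group per token type, tried in the
    # order the keyword arguments were given.
    combined = "|".join(f"(?P<{name}>{pattern})" for name, pattern in patterns.items())
    regex = re.compile(combined + r"|(?P<whitespace>\s+)", re.MULTILINE)
    pos = 0
    while pos < len(source):
        match = regex.match(source, pos)
        # Reject input that no pattern matches, and zero-width matches
        # that would otherwise loop forever.
        if match is None or match.end() == pos:
            raise SyntaxError(f"invalid syntax at offset {pos}")
        if match.lastgroup != "whitespace":
            yield match.lastgroup, match.group()
        pos = match.end()

tokens = list(tokenize("hello world 42", number=r"\d+", word=r"\w+"))
print(tokens)  # [('word', 'hello'), ('word', 'world'), ('number', '42')]
```

Because the alternatives are tried in order, number is listed before word here so that "42" is not swallowed by \w+.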

Check out the full API reference for more details.

Expecting tokens

The token stream is iterable and will yield all the extracted tokens one after the other. You can also retrieve tokens from the token stream one at a time by using the expect() method.

stream = TokenStream("hello world")

with stream.syntax(word=r"\w+"):
    print(stream.expect().value)  # "hello"
    print(stream.expect().value)  # "world"

The expect() method also lets you require that the extracted token matches a specified type. It raises an UnexpectedToken exception otherwise.

stream = TokenStream("hello world")

with stream.syntax(number=r"\d+", word=r"\w+"):
    print(stream.expect("word").value)  # "hello"
    print(stream.expect("number").value)  # UnexpectedToken: Expected number but got word 'world'

Filtering the stream

Newlines and whitespace are ignored by default. You can reject interspersed whitespace by intercepting the built-in newline and whitespace tokens.

stream = TokenStream("hello world")

with stream.syntax(word=r"\w+"), stream.intercept("newline", "whitespace"):
    print(stream.expect("word").value)  # "hello"
    print(stream.expect("word").value)  # UnexpectedToken: Expected word but got whitespace ' '

The opposite of the intercept() method is ignore(). It tells the stream to skip the given token types, which makes it easy to handle comments.

stream = TokenStream(
    """
    # this is a comment
    hello # also a comment
    world
    """
)

with stream.syntax(word=r"\w+", comment=r"#.+$"), stream.ignore("comment"):
    print([token.value for token in stream])  # ['hello', 'world']

Indentation

To enable indentation tracking, use the indent() method. Inside the with block the stream yields balanced pairs of indent and dedent tokens whenever the indentation changes.

source = """
hello
    world
"""
stream = TokenStream(source)

with stream.syntax(word=r"\w+"), stream.indent():
    stream.expect("word")
    stream.expect("indent")
    stream.expect("word")
    stream.expect("dedent")
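
The idea of balanced indent and dedent tokens can be sketched with a stack of indentation widths. This is a simplified standard-library illustration, not the library's algorithm: the indentation_events() function is hypothetical, only counts leading spaces, and does not report inconsistent dedents.

```python
def indentation_events(source: str) -> list[str]:
    """Return 'indent', 'dedent' and 'line' events for a block of text."""
    levels = [0]  # stack of currently open indentation widths
    events: list[str] = []
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines never change the indentation level
        width = len(line) - len(line.lstrip(" "))
        if width > levels[-1]:
            levels.append(width)
            events.append("indent")
        while width < levels[-1]:
            levels.pop()
            events.append("dedent")
        events.append("line")
    # Close whatever is still open so indents and dedents stay balanced.
    events.extend("dedent" for _ in levels[1:])
    return events

events = indentation_events("\nhello\n    world\n")
print(events)  # ['line', 'indent', 'line', 'dedent']
```

The trailing dedent is emitted at the end of the input, which is why every indent has a matching dedent even when the source ends inside an indented block.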

To prevent some tokens from triggering unwanted indentation changes you can use the skip argument.

source = """
hello
        # some comment
    world
"""
stream = TokenStream(source)

with stream.syntax(word=r"\w+", comment=r"#.+$"), stream.indent(skip=["comment"]):
    stream.expect("word")
    stream.expect("comment")
    stream.expect("indent")
    stream.expect("word")
    stream.expect("dedent")

Checkpoints

The checkpoint() method returns a context manager that, at the end of the with statement, resets the stream to the token it was at when the block was entered. Call the commit() function it provides to keep the state of the stream at the end of the with statement instead.

stream = TokenStream("hello world")

with stream.syntax(word=r"\w+"):
    with stream.checkpoint():
        print([token.value for token in stream])  # ['hello', 'world']
    with stream.checkpoint() as commit:
        print([token.value for token in stream])  # ['hello', 'world']
        commit()
    print([token.value for token in stream])  # []
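
The rollback-unless-committed pattern can be sketched independently of the library with contextlib. The Cursor class below is a hypothetical stand-in for a stream (a list plus an index) and only illustrates the control flow. Rolling back when an exception escapes the block is a choice made in this sketch, not a statement about tokenstream's behavior.

```python
from contextlib import contextmanager

class Cursor:
    """Hypothetical stand-in for a stream: a list of items plus an index."""

    def __init__(self, items):
        self.items = list(items)
        self.index = 0

    def take_all(self):
        """Consume and return every remaining item."""
        rest = self.items[self.index:]
        self.index = len(self.items)
        return rest

    @contextmanager
    def checkpoint(self):
        saved = self.index
        committed = False

        def commit():
            nonlocal committed
            committed = True

        try:
            yield commit
        finally:
            if not committed:
                self.index = saved  # roll back unless commit() was called

cursor = Cursor(["hello", "world"])
with cursor.checkpoint():
    first = cursor.take_all()   # consumed, then rolled back
with cursor.checkpoint() as commit:
    second = cursor.take_all()  # consumed and kept
    commit()
third = cursor.take_all()
print(first, second, third)  # ['hello', 'world'] ['hello', 'world'] []
```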

Match statements

Match statements make it very intuitive to process tokens extracted from the token stream. If you're using Python 3.10+ give it a try and see if you like it.

from tokenstream import TokenStream, Token

def parse_sexp(stream: TokenStream):
    """A basic S-expression parser that uses Python 3.10+ match statements."""
    with stream.syntax(brace=r"\(|\)", number=r"\d+", name=r"\w+"):
        match stream.expect_any(("brace", "("), "number", "name"):
            case Token(type="brace"):
                return [parse_sexp(stream) for _ in stream.peek_until(("brace", ")"))]
            case Token(type="number") as number:
                return int(number.value)
            case Token(type="name") as name:
                return name.value

Contributing

Contributions are welcome. Please open an issue to discuss the problem or the new feature before creating a pull request. The project uses poetry.

$ poetry install

You can run the tests with poetry run pytest.

$ poetry run pytest

The project must type-check with pyright. If you're using VS Code, the Pylance extension should report diagnostics automatically. You can also install the type checker locally with npm install and run it from the command line.

$ npm run watch
$ npm run check
$ npm run verifytypes

The code follows the black code style. Import statements are sorted with isort.

$ poetry run isort tokenstream examples tests
$ poetry run black tokenstream examples tests
$ poetry run black --check tokenstream examples tests

License - MIT