To run, do this at the command prompt:
$ python -m doctest tests.md
Note: nothing will be shown if tests pass. You can add a verbose flag
(-v
) to all results.
Note 2: The toolbox
module was created for Python3, but it may work for
Python2, as well (no promises). These tests, however, will likely only work
for Python3, so, depending on your distribution, you may need to explicitly
use it at the command line:
$ python3 -m doctest tests.md
The toolbox
module is meant to be used as a library, not a script, so
your own code would import it as here:
>>> import toolbox
Since toolbox
is a common module name, in the event of a collision,
you can rename it on import as done here:
>>> import toolbox as tb
The rest of the tests below will use tb
as the module name.
The read_toolbox_file()
function takes a file-like object as the first
parameter. In general this will be an open file, but if the data exists
in memory (as in this test), io.StringIO
objects can be used.
>>> from io import StringIO
Also note that read_toolbox_file()
returns a generator, so you'll need
to cast it as a list, if that's what you need, or just iterate over the
results. The generator yields tuples of (marker, text).
>>> next(tb.read_toolbox_file(StringIO('\\mkr text')))
('\\mkr', 'text')
>>> list(tb.read_toolbox_file(StringIO('\\mkr text')))
[('\\mkr', 'text')]
For convenience, I define a simple lambda expression rtf()
that deals
with the boilerplate code here. It behaves like read_toolbox_file()
, but
takes a string as the first parameter, and returns a list.
>>> rtf = lambda s, strip=True: list(
... tb.read_toolbox_file(StringIO(s), strip=strip)
... )
The tests below show the behavior of the Toolbox parser:
The most basic functionality is to split a marker and the following text:
>>> rtf('\\mkr basic tokens')
[('\\mkr', 'basic tokens')]
The space after the marker is mandatory and not included in the text, but all other non-trailing spacing should be maintained:
>>> rtf('\\mkr maintain internal whitespace')
[('\\mkr', ' maintain internal whitespace')]
Even newlines:
>>> rtf('''\\mkr multi-
... line''')
[('\\mkr', 'multi-\nline')]
Even blank lines:
>>> rtf('''\\mkr multiline
...
... with a gap''')
[('\\mkr', 'multiline\n\nwith a gap')]
But whitespace (including blank lines) at the end will be trimmed:
>>> rtf('''\\mkr multiline
...
... with a gap and trailing newline
... ''')
[('\\mkr', 'multiline\n\nwith a gap and trailing newline')]
Unless the strip=False
parameter is used:
>>> rtf('\\mkr trailing space ', strip=False)
[('\\mkr', 'trailing space ')]
>>> rtf('''\\mkr trailing newline
... ''', strip=False)
[('\\mkr', 'trailing newline\n')]
>>> rtf('''\mkr trailing newline and spaces
... ''', strip=False)
[('\\mkr', 'trailing newline and spaces\n ')]
If no space appears after the marker, the text is considered null:
>>> rtf('\\mkr')
[('\\mkr', None)]
But if a space and nothing else follows, it's considered empty content:
>>> rtf('\\mkr ')
[('\\mkr', '')]
Markers only appear at the beginning of a line:
>>> rtf('\\mkr \\mkr is just a token')
[('\\mkr', '\\mkr is just a token')]
And furthermore only in the first column (note: this means that you can't have marker-like text as the first thing after a newline):
>>> rtf('''\\mkr newline then a marker
... \\mkr''')
[('\\mkr', 'newline then a marker\n \\mkr')]
Markers can be very long:
>>> rtf('\\this_is_a_very_long_marker and some text')
[('\\this_is_a_very_long_marker', 'and some text')]
But in practice they are very short:
>>> rtf('\\f just one char')
[('\\f', 'just one char')]
But they can't be nothing or just a space:
>>> rtf('\\ text')
[]
>>> rtf('\\ ')
[]
>>> rtf('\\\nmkr text')
[]
>>> rtf('\\')
[]
Unicode text is not a problem:
>>> rtf('\\mkr unicode テキスト')
[('\\mkr', 'unicode テキスト')]
Even for markers:
>>> rtf('\\マーカー テキスト')
[('\\マーカー', 'テキスト')]
But (currently) double-width spaces don't count as a marker delimiter:
>>> rtf('\\マーカー テキスト')
[]
But the double-width spaces do get trimmed from the right edge:
>>> rtf('\\マーカー テキスト ')
[('\\マーカー', 'テキスト')]
>>> rtf('\\マーカー テキスト ', strip=False)
[('\\マーカー', 'テキスト ')]
Multiple markers and text are ok, but preceding text without a marker will be ignored:
>>> lines = rtf('''some text
... without \\mkr or some other marker
...
... and some more
... \\mkr finally a marker
... \\mkr2 and another with
... a newline, followed by
...
... \\mkr3 yet another''')
>>> for mkr, text in lines:
... print((mkr, text))
('\\mkr', 'finally a marker')
('\\mkr2', 'and another with\na newline, followed by')
('\\mkr3', 'yet another')
Groups of (marker, value) pairs often come in blocks of related data.
The toolbox.iterparse()
function helps in processing this kind of
data. Given some keys, it will report events with the associated data.
The 'key', 'start', and 'end' events simply give the (marker, value)
pairs as the result if '\key', '+key', or '-key', respectively, were
seen. The result of the 'data' event is the list of (marker, value)
pairs that occur between keys.
>>> pairs = rtf('''
... \\ref 1
... \\a val
... \\b val
... \\ref 2
... \\a val2''')
>>> records = tb.iterparse(pairs, keys=['\\ref'])
>>> for event, result in records:
... print((event, result))
('key', ('\\ref', '1'))
('data', [('\\a', 'val'), ('\\b', 'val')])
('key', ('\\ref', '2'))
('data', [('\\a', 'val2')])
The 'start' and 'end' events are treated as normal keys (no special
grouping of data between related start and end events is performed),
but the user only needs to specify the key without the +
or -
:
>>> pairs = rtf('''
... \\+block
... \\a val
... \\+subblock
... \\b val
... \\-subblock
... \\c val
... \\-block''')
>>> events = tb.iterparse(pairs, keys=['\\block', '\\subblock'])
>>> for event, result in events:
... print((event, result))
('start', ('\\block', None))
('data', [('\\a', 'val')])
('start', ('\\subblock', None))
('data', [('\\b', 'val')])
('end', ('\\subblock', None))
('data', [('\\c', 'val')])
('end', ('\\block', None))
>>> pairs = rtf('''
... \\header info
... \\ref 1
... \\t some words
... \\m some word -s
... \\ref 2
... \\t more words
... \\m more word -s''')
>>> recs = tb.records(pairs, '\\ref')
>>> for ctxt, data in recs:
... print((sorted(ctxt.items()), data))
([('\\ref', None)], [('\\header', 'info')])
([('\\ref', '1')], [('\\t', 'some words'), ('\\m', 'some word -s')])
([('\\ref', '2')], [('\\t', 'more words'), ('\\m', 'more word -s')])
>>> pairs = rtf('''
... \\ref 1
... \\t text
... \\id i1
... \\header info
... \\ref 2
... \\t text again
... \\ref 3
... \\t and again''')
>>> recs = tb.records(pairs, ['\\id', '\\ref'])
>>> for ctxt, data in recs:
... print((sorted(ctxt.items()), data))
([('\\id', None), ('\\ref', '1')], [('\\t', 'text')])
([('\\id', 'i1'), ('\\ref', None)], [('\\header', 'info')])
([('\\id', 'i1'), ('\\ref', '2')], [('\\t', 'text again')])
([('\\id', 'i1'), ('\\ref', '3')], [('\\t', 'and again')])
>>> pairs = rtf('''
... \\ref 1
... \\t text
... \\id i1
... \\header info
... \\ref 2
... \\t text again
... \\page 5
... \\ref 3
... \\t and again''')
>>> recs = tb.records(pairs, ['\\id', '\\ref'], context_keys=['\\page'])
>>> for ctxt, data in recs:
... print((sorted(ctxt.items()), data))
([('\\id', None), ('\\page', None), ('\\ref', '1')], [('\\t', 'text')])
([('\\id', 'i1'), ('\\page', None), ('\\ref', None)], [('\\header', 'info')])
([('\\id', 'i1'), ('\\page', None), ('\\ref', '2')], [('\\t', 'text again')])
([('\\id', 'i1'), ('\\page', '5'), ('\\ref', '3')], [('\\t', 'and again')])
>>> pairs = rtf('''
... \\j 犬が吠える
... \\t inu=ga hoeru
... \\m inu =ga hoe -ru
... \\g dog =NOM bark -IPFV
... \\f A dog barks.
... ''')
>>> for pair_list in tb.field_groups(pairs, set(['\\t', '\\m', '\\g'])):
... print(pair_list) # doctest: +NORMALIZE_WHITESPACE
[('\\j', '犬が吠える')]
[('\\t', 'inu=ga hoeru'),
('\\m', 'inu =ga hoe -ru'),
('\\g', 'dog =NOM bark -IPFV')]
[('\\f', 'A dog barks.')]
Aligned fields that Toolbox has wrapped can be reconstructed into single lines that keep the column spacing intact. Unaligned fields will be wrapped without considering spaced columns.
>>> pairs = rtf('''
... \\ref s1
... \\t waadorappu sareta
... \\m waadorappu sare-ta
... \\g word.wrap PASS-PFV
... \\t tekisuto
... \\m tekisuto
... \\g text
... \\f word-wrapped text
... ''')
>>> normrecs = tb.normalize_record(pairs, set(['\\t', '\\m', '\\g']))
>>> for mkr, val in normrecs:
... print((mkr, val))
('\\ref', 's1')
('\\t', 'waadorappu sareta tekisuto')
('\\m', 'waadorappu sare-ta tekisuto')
('\\g', 'word.wrap PASS-PFV text')
('\\f', 'word-wrapped text')
By default, trailing spaces are stripped from each line, but (as in
toolbox.read_toolbox_file()
), this behavior can be turned off by setting
strip=False
:
>>> pairs = rtf('''
... \\a abc
... \\b s
... \\a def
... \\b tuvwxyz
... ''')
>>> for mkr, val in tb.normalize_record(pairs, ['\\a', '\\b']):
... print((mkr, val))
('\\a', 'abc def')
('\\b', 's tuvwxyz')
>>> for mkr, val in tb.normalize_record(pairs, ['\\a', '\\b'], strip=False):
... print((mkr, val))
('\\a', 'abc def ')
('\\b', 's tuvwxyz')
The align_fields()
function attempts to interpret the interlinear structure
that is implicitly encoded in the spacing of the tokens. Not all lines are
so aligned, so align_fields()
requires a dictionary mapping the marker of
one line to the marker of the line it aligns to. For example, often the \m
morphemes line aligns to the \t
text line, so '\m': '\t'
would encode
this alignment. This function returns tuples of (marker, alignments) for each
line, where marker is the line's marker and alignments is a list of
(token, aligned_tokens) tuples.
>>> pairs = rtf('''
... \\t inu=ga ippiki hoeru
... \\m inu =ga ichi -hiki hoe -ru
... \\g dog =NOM one -CLF.ANIMAL bark -IPFV
... \\f One dog barks.
... \\x''')
>>> algns = tb.align_fields(pairs, alignments={'\\m': '\\t', '\\g': '\\m'})
>>> for algn in algns:
... print(algn) # doctest: +NORMALIZE_WHITESPACE
('\\t', [('inu=ga ippiki hoeru', ['inu=ga', 'ippiki', 'hoeru'])])
('\\m', [('inu=ga', ['inu', '=ga']),
('ippiki', ['ichi', '-hiki']),
('hoeru', ['hoe', '-ru'])])
('\\g', [('inu', ['dog']),
('=ga', ['=NOM']),
('ichi', ['one']),
('-hiki', ['-CLF.ANIMAL']),
('hoe', ['bark']),
('-ru', ['-IPFV'])])
('\\f', [(None, ['One dog barks.'])])
('\\x', [(None, None)])
This function also has some capability to recover from inaccurate column
spacings. There is an optional parameter errors
that defaults to strict
but can be set to ratio
or reanalyze
. The strict
setting raises an error
when a token crosses an observed column boundary, but ratio
will group the
token with whichever column it is most covered by, and reanalyze
will ignore
tabular spacing and rely on token delimiters to find groupings. The possible
misalignment will still raise an warning (instead of an error), but these can
be silenced. However, it may be useful for the maintainer of a corpus to see
the warnings, as it can point out possible errors in the corpus. For example,
here is the ratio
method:
>>> import warnings
>>> warnings.simplefilter('ignore')
>>> pairs = rtf('''
... \\t inu=ga ippiki hoeru
... \\m inu =ga ichi -hiki hoe -ru
... \\g dog =NOM one -CLF.ANIMAL bark -IPFV
... \\f One dog barks.''')
>>> algns = tb.align_fields(
... pairs,
... alignments={'\\m': '\\t', '\\g': '\\m'},
... errors='ratio'
... )
>>> for algn in algns:
... print(algn) # doctest: +NORMALIZE_WHITESPACE
('\\t', [('inu=ga ippiki hoeru', ['inu=ga', 'ippiki', 'hoeru'])])
('\\m', [('inu=ga', ['inu', '=ga']),
('ippiki', ['ichi', '-hiki']),
('hoeru', ['hoe', '-ru'])])
('\\g', [('inu', ['dog']),
('=ga', ['=NOM']),
('ichi', ['one']),
('-hiki', ['-CLF.ANIMAL']),
('hoe', ['bark']),
('-ru', ['-IPFV'])])
('\\f', [(None, ['One dog barks.'])])
And here is the reanalyze
method. Note that morpheme delimiters that group
with their morpheme are kept attached, but unattached delimiters or those not
adjacent to a space get separated as tokens (because it is impossible to
determine which side it attaches to):
>>> import warnings
>>> warnings.simplefilter('ignore')
>>> pairs = rtf('''
... \\t inu=ga ippiki hoeru
... \\m inu =ga ichi-hiki hoe - ru
... \\g dog =NOM one-CLF.ANIMAL bark - IPFV
... \\f One dog barks.''')
>>> algns = tb.align_fields(
... pairs,
... alignments={'\\m': '\\t', '\\g': '\\m'},
... errors='reanalyze'
... )
>>> for algn in algns:
... print(algn) # doctest: +NORMALIZE_WHITESPACE
('\\t', [('inu=ga ippiki hoeru', ['inu=ga', 'ippiki', 'hoeru'])])
('\\m', [('inu=ga', ['inu', '=ga']),
('ippiki', ['ichi', '-', 'hiki']),
('hoeru', ['hoe', '-', 'ru'])])
('\\g', [('inu', ['dog']),
('=ga', ['=NOM']),
('ichi', ['one']),
('-', ['-']),
('hiki', ['CLF', '.', 'ANIMAL']),
('hoe', ['bark']),
('-', ['-']),
('ru', ['IPFV'])])
('\\f', [(None, ['One dog barks.'])])