HfstLookUp
Perform fast transducer lookup, i.e. look up a set of input strings in the transducer and print the corresponding output strings.
The help message:
Usage: hfst-lookup [OPTIONS...] [INFILE]
perform transducer lookup (apply)
Common options:
-h, --help Print help message
-V, --version Print version info
-v, --verbose Print verbosely while processing
-q, --quiet Only print fatal erros and requested output
-s, --silent Alias of --quiet
Input/Output options:
-i, --input=INFILE Read input transducer from INFILE
-o, --output=OUTFILE Write output to OUTFILE
-p, --pipe-mode[=STREAM] Control input and output streams
Lookup options:
-I, --input-strings=SFILE Read lookup strings from SFILE
-O, --output-format=OFORMAT Use OFORMAT printing results sets
-e, --epsilon-format=EPS Print epsilon as EPS
-F, --input-format=IFORMAT Use IFORMAT parsing input
-x, --statistics Print statistics
-X, --xfst=VARIABLE Toggle xfst VARIABLE
-c, --cycles=INT How many times to follow input epsilon cycles
-b, --beam=B Output only analyses whose weight is within B from
the best analysis
-t, --time-cutoff=S Limit search after having used S seconds per input
(currently only works in optimized-lookup mode
-P, --progress Show neat progress bar if possible
If OUTFILE or INFILE is missing or -, standard streams will be used.
Format of result depends on format of INFILE
OFORMAT is one of {xerox,cg,apertium}, xerox being default
IFORMAT is one of {text,spaced,apertium}, default being text,
unless OFORMAT is apertium
VARIABLEs relevant to lookup are {print-pairs,print-space,
quote-special,show-flags,obey-flags}
Input epsilon cycles are followed by default INT=5 times.
Epsilon is printed by default as an empty string.
B must be a non-negative float.
S must be a non-negative float. The default, 0.0, indicates no cutoff.
If the input contains several transducers, a set containing
results from all transducers is printed for each input string.
STREAM can be { input, output, both }. If not given, defaults to {both}.
If input file is not specified with -I, input is read interactively line by
line from the user. If you redirect input from a file, use --pipe-mode=input.
--pipe-mode=output is ignored on non-windows platforms.
Todo:
For optimized lookup format, only strings that pass flag diacritic checks
are printed and flag diacritic symbols are not printed.
Support VARIABLE 'print-space' for optimized lookup format
Known bugs:
'quote-special' quotes spaces that come from 'print-space';
Report bugs to <hfst-bugs@helsinki.fi> or directly to our bug tracker at:
<https://sourceforge.net/tracker/?atid=1061990&group_id=224521&func=browse>
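The --beam option described in the help message can be illustrated by post-filtering lookup output. The sketch below is not how hfst-lookup implements the option; it only mimics the stated rule under the assumption that "within B from the best analysis" means a weight no larger than the lowest weight for the same input plus B. The function name `beam_filter` and the sample weights are hypothetical, and the output is assumed to be whitespace-separated columns of input, output and weight.

```shell
#!/bin/sh
# Hypothetical weighted results (input, output, weight); not real tool output.
weighted_sample='cactus cacti 1.0
cactus cactuses 3.5
cat cats 0.0'

# beam_filter B: keep only analyses whose weight is at most B above the
# best (lowest) weight found for the same input string.
beam_filter() {
    awk -v beam="$1" '
        NF >= 3 && $NF != "inf" {
            n++; word[n] = $1; line[n] = $0; w[n] = $NF + 0
            # Track the lowest weight seen for each input string.
            if (!($1 in best) || w[n] < best[$1]) best[$1] = w[n]
        }
        END {
            for (i = 1; i <= n; i++)
                if (w[i] <= best[word[i]] + beam) print line[i]
        }'
}

printf '%s\n' "$weighted_sample" | beam_filter 2
```

With a beam of 2, the 'cactuses' analysis is dropped because its weight is 2.5 above the best analysis for 'cactus'.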
The option --input defines the transducer where strings are looked up. The free argument can also be used to give the transducer. The option --input-strings defines where lookup strings are read. If either option is not defined, the standard input is used. The following are equivalent commands:
hfst-lookup --input transducer.hfst --input-strings strings.txt
hfst-lookup transducer.hfst --input-strings strings.txt
cat strings.txt | hfst-lookup transducer.hfst
cat transducer.hfst | hfst-lookup --input-strings strings.txt
NOTE: If the transducer is not in optimized lookup format, the tool will give a warning that the lookup will be slow. You can convert a transducer into optimized lookup format with the tool hfst-fst2fst.
The option --input-format defines the format of the strings that are looked up. The formats are { text, spaced, apertium }, text being the default unless apertium is used as output format (then the default is apertium for input format as well).
If we want to look up words 'cat' and 'dog', we would use the following inputs with different input formats.
input format | input | more information |
---|---|---|
text | todo | |
spaced | todo | |
apertium | todo | http://wiki.apertium.org/wiki/Apertium_stream_format |
Output format can be chosen with the option --output-format from { xerox, cg, apertium }, xerox being the default.
For example, if we have a weighted transducer cat2chat.hfst that maps 'cat' to 'chat' with weight 3 and the following file named words.txt that contains the words to look up,
cat
dog
the command
hfst-lookup --input cat2chat.hfst --input-strings words.txt --output-format output_format
gives us the following results with different values of output_format:
(TODO)
See hfst-fst2strings. However, -X quote-special works differently in hfst-lookup, see below.
Special symbols are printed as follows unless options -X or -e are used:
symbol | printed as | note |
---|---|---|
epsilon | '' | can be changed to EPS with -e EPS |
colon | ':' | printed as '\:' if -X quote-special is requested |
tabulator | as such | printed as '\ ' if -X quote-special is requested |
space | ' ' | printed as '\ ' if -X quote-special is requested |
flag diacritics | '' | printed if -X show-flags is requested |
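If output produced with -X quote-special is passed on to another tool, the backslash escapes listed in the table may need to be undone. The following is a minimal sketch that handles only the colon and space escapes; `unquote_special` is a hypothetical helper name, not part of HFST, and the sample string is made up for illustration.

```shell
#!/bin/sh
# Remove the backslash in front of a colon or a space, i.e. turn
# '\:' into ':' and '\ ' into ' '. Other backslashes are left alone.
unquote_special() {
    sed 's/\\\([: ]\)/\1/g'
}

# Made-up example string containing both escapes.
printf 'a\\:b\\ c\n' | unquote_special
```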
We first create a simple transducer singular2plural.hfst that maps words in singular to their plural forms:
echo 'cat:cats
mouse:mice
cactus:cacti
cactus:cactuses' | hfst-strings2fst -j -f sfst > singular2plural.hfst
Then we look up a set of words in the transducer:
echo 'cat
dog
mouse
cactus' | hfst-lookup singular2plural.hfst
We get the following results:
cat cats 0.000000
dog dog+? inf
mouse mice 0.000000
cactus cacti 0.000000
cactus cactuses 0.000000
We see that the transducer singular2plural.hfst gives one result for the strings 'cat' and 'mouse', two results for the string 'cactus' and no results for the string 'dog'.
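Results like these are easy to post-process with standard tools. The sketch below reuses the result lines shown above as sample data and assumes each line consists of whitespace-separated input, output and weight columns, with an unrecognized input marked by the weight inf. The function names `unknown_words` and `count_results` are hypothetical helpers.

```shell
#!/bin/sh
# Result lines copied from the example above.
sample='cat cats 0.000000
dog dog+? inf
mouse mice 0.000000
cactus cacti 0.000000
cactus cactuses 0.000000'

# Print each input word for which no result was found (weight "inf").
unknown_words() {
    awk 'NF >= 3 && $NF == "inf" { print $1 }'
}

# Print the number of results for each recognized input word.
count_results() {
    awk 'NF >= 3 && $NF != "inf" { n[$1]++ }
         END { for (w in n) print w, n[w] }' | sort
}

printf '%s\n' "$sample" | unknown_words
printf '%s\n' "$sample" | count_results
```

On real data the same functions could read directly from a pipe, for example after `hfst-lookup singular2plural.hfst`.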
hfst-lookup is very fast if the transducer is in optimized lookup (OL) format. In other cases (openfst-tropical, sfst, foma) the transducer is converted into the generic HFST basic transducer format, whose lookup is relatively slow. It is advisable to first convert a non-OL transducer into OL format with hfst-fst2fst to achieve better performance with hfst-lookup.
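The conversion step could look roughly like the sketch below. It assumes that hfst-fst2fst accepts -w to write weighted optimized-lookup format (worth confirming with hfst-fst2fst --help on your installation). The .hfstol suffix, the file names and the helper functions `ol_name` and `convert_and_lookup` are illustrative choices, not something the tools require.

```shell
#!/bin/sh
# Derive an output file name by replacing the .hfst suffix with .hfstol
# (the suffix is only a naming convention assumed for this example).
ol_name() {
    printf '%s\n' "${1%.hfst}.hfstol"
}

# Convert a transducer to weighted optimized-lookup format and look up
# strings in the converted file. The step is skipped when the HFST tools
# or the input transducer are not available.
convert_and_lookup() {
    fst=$1
    strings=$2
    ol=$(ol_name "$fst")
    if ! command -v hfst-fst2fst >/dev/null 2>&1 || [ ! -f "$fst" ]; then
        echo "skipping: HFST tools or $fst not available" >&2
        return 0
    fi
    hfst-fst2fst -w -i "$fst" -o "$ol" &&
        hfst-lookup --input "$ol" --input-strings "$strings"
}

convert_and_lookup singular2plural.hfst words.txt
```

Converting once and reusing the converted file avoids repeating the slow generic lookup on every run.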