Skip to content

OptimizedLookupFormat

eaxelson edited this page Nov 21, 2017 · 2 revisions

Optimized Lookup Format

A binary transducer format optimized for lookup speed (eg. hundreds of thousands of words per second). It is described in HFST Runtime Format - A Compacted Transducer Format Allowing for Fast Lookup.

Lookup is fairly simple to implement. The HFST project has implementations in three languages: one written in C++, by far the fastest and recommended for general use, and, as demonstrations, hfst-optimized-lookup-python?? and hfst-optimized-lookup-java??. The C++ implementation can be used with command line tools hfst-lookup and, to a limited extent, hfst-xfst.

By current convention, transducers in this format are suffixed with .hfst.ol. HFST is currently distributing the following transducers:

  • Finnish (Omorfi), GPL
  • French (Morphalou), Morphalou license
  • German (Morphisto), Creative Commons Attribution-NonCommercial 2.0 license
  • Northern Sami (Divvun project), GPL
  • Swedish (Swelex, based on Den stora svenska ordlistan, copyright (c) 2003 Tom Westerberg), Creative Commons ShareAlike 1.0 license

They are available separately or with the hfst-optimized-lookup-all package (incl. C++ lookup implementation). A conversion utility from the standard hfst format is also available.

Note that hfst-optimized-lookup is not guaranteed to behave identically to hfst-lookup (although it almost always does): input-side multicharacter symbols are not fully supported. If the first character of such a symbol is an ASCII symbol also matching a single-character symbol, it will be tokenized as such.

Testing

This section discusses the C++ lookup utility. All tests were run on hfst's development machine, on which the analysis process ran on a dedicated Intel Xeon E5450 @ 3.00GHz processor and ample RAM.

Performance

Performance is highly dependent on the complexity of the underlying transducer. For example, take the following compiled transducers:

french.hfst.ol   3.1M   (Morphalou project)
sami.hfst.ol     5.0M   (Divvun project)
finnish.hfst.ol  7.0M   (Omorfi project)
french.hwfst.ol  4.1M
sami.hwfst.ol    6.9M
finnish.hwfst.ol 9.5M

The French transducer is a fairly straightforward and simple lexicon with some inflection, Sámi features eg. productive compounding and Finnish is a full-fledged morphology including a prodigious capacity for compounding and some derivation. The averages out of 20 test runs for two million word corpora were as follows:

normal:
  french:  308 000 words per second
  sami:    77 000 words per second
  finnish: 67 000 words per second
fast printing:
  french:  422 000 words per second
  sami:    83 000 words per second
  finnish: 68 000 words per second
weighted transducers:
  french:  283 000 words per second
  sami:    73 000 words per second
  finnish: 65 000 words per second

The times were inclusive of (negligible) startup time. "Fast printing" refers to a command line option which disables eg. suppressing duplicate results, sorting results by weight and only printing n (best) results.

Coverage and performance with the Europarl corpus

Hfst distributes some prebuilt optimized-lookup transducers from third-party projects for German, French, Swedish, Finnish and Sámi. The first four of these were tested for coverage and performance with the Europarl corpus. The corpora were tokenized, normalized and filtered for eg. proper names by a simple custom script to maximize effective coverage. A word was considered to have coverage if it received at least one analysis. The results were as follows:

  • German: 27 023 000 words analysed at 136 000 words per second with 98.2% coverage
  • French: 45 651 000 words analysed at 445 000 words per second with 98.8% coverage
  • Swedish: 33 796 000 words analysed at 392 000 words per second with 98.8% coverage
  • Finnish: 26 251 000 words analysed at 62 000 words per second with 98.4% coverage

Downloading and Installing

NOTE: HFST will be migrated under Github at the end of Februrary 2016.

Currently, the Sourceforge download page provides various packages with installation instructions included. A UNIX-type environment is currently required to run the lookup implementations.

Runtime License

The HFST Runtime Interface distributed by the HFST project has been coded by the HFST project and contains no external code. All rights belong to the University of Helsinki. The HFST Runtime API is licensed under the Apache license. Other licenses are negotiable.

Example

Consider the transducer

0       1       d       d
1       2       o       o
2       3       g       g
3       4       @0@     +NOUN
3       6       @0@     +VERB
4       5       @0@     +SG
5
6       5       @0@     +PRES
6       7       g       +PAST
7       8       e       @0@
8       5       d       @0@

where @0@ marks epsilon. It recognizes the input forms dog and dogged and gives the following analyses

dog:dog+NOUN+SG
dog:dog+VERB+PRES
dogged:dog+VERB+PAST

We give an annotated hexdump of the transducer in hfst optimized lookup format

           <-- NUMERICAL PROPERTIES -->
 0005 ............................. 5 input symbols.
 000a ............................. 10 symbols all in all.
 000c 0000 ........................ 12 entries in the transition index array.
 0013 0000 ........................ 19 entries in the transition table.
 0009 0000 ........................ 9 states.
 0000 0000
           <-- BOOLEAN PROPERTIES -->
 0000 0000 ........................ unweighted
 0001 0000 ........................ deterministic
 0000 0000 ........................ not input deterministic
 0001 0000 ........................ minimized as an automaton
 0000 0000 ........................ not cyclic
 0000 0000 ........................ has no epsilon:epsilon transitions
 0001 0000 ........................ has epsilon:X transitions.
 0000 0000 ........................ has no epsilon:X cycles.
 0000 0000 ........................ has no unweighted epsilon:X cycles.
           <-- TRANSDUCER ALPHABET -->
 3040 0040 ........................ "@0@" the epsilon symbol used.
 0064 ............................. "d" an input symbol.
 006f ............................. "o" an input symbol.
 0067 ............................. "g" an input symbol.
 0065 ............................. "e" an input symbol.
                                    --------------------
                                    4 regular input symbols and epsilon.

 4e2b 554f 004e ................... "+NOUN" a symbol.
 532b 0047 ........................ "+SG" a symbol.
 562b 5245 0042 ................... "+VERB" a symbol.
 502b 5341 0054 ................... "+PAST" a symbol.
 502b 4552 0053 ................... "+PRES" a symbol.
                                    ---------------------
                                    5 output symbols.

           <-- TRANSITION INDEX ARRAY (TIA) -->
 ffff ffff ffff ................... EMPTY
 ffff ffff ffff ................... EMPTY
 0001 0000 8000 ................... there are transitions d:X in TA index 0.
 0000 0009 8000 ................... there are transitions @0@:X in TA index 9.
 ffff ffff ffff ................... EMPTY
 ffff ffff ffff ................... EMPTY
 0003 000a 8000 ................... there are transitions g:X in TA index 10.
 ffff ffff ffff ................... EMPTY
 ffff ffff ffff ................... EMPTY <-- The TIA is padded with empty
 ffff ffff ffff ................... EMPTY     transitions at the end to
 ffff ffff ffff ................... EMPTY     prevent index overflow -->
 ffff ffff ffff ................... EMPTY

           <-- TRANSITION ARRAY (TA) -->
 0001 0001 0001 8000 .............. Transition d:d to TA index 1.
 ffff ffff ffff ffff .............. Non-final state.
 0002 0002 0003 8000 .............. Transition o:o to TA index 3.
 ffff ffff ffff ffff .............. Non-final state.
 0003 0003 0005 8000 .............. Transition g:g to TA index 5.
 ffff ffff ffff ffff .............. Non-final state.
 0000 0005 0010 8000 .............. Transition @0@:+NOUN to TA index 16.
 0000 0007 0002 0000 .............. Transition @0@:+VERB to TIA index 2.
 ffff ffff ffff ffff .............. Non-final state.
 0000 0009 000f 8000 .............. Transition @0@:+PRES to TA index 15.
 0003 0008 000b 8000 .............. Transition g:+PAST to TA index 11.
 ffff ffff ffff ffff .............. Non-final state.
 0004 0000 000d 8000 .............. Transition e:@0@ to TA index 13.
 ffff ffff ffff ffff .............. non-final state.
 0001 0000 000f 8000 .............. Transition d:@0@ to TA index 15.
 ffff ffff 0001 0000 .............. final-state.
 ffff ffff ffff ffff .............. non-final state.
 0000 0006 000f 8000 .............. Transition @0@:+SG to TA index 15.
 ffff ffff ffff ffff .............. extra transition to prevent over-flow.

Flag diacritics

Flag diacritics are detected in the alphabet and processed by the reader, but certain assumptions about transitions with flag diacritics are made.

Flag diacritics may be placed anywhere in the alphabet. Symbols are parsed as flags if they begin and end with "@" (in general, @-delimited symbols are special throughout hfst), are at least 5 characters long, have one of {P,N,R,D,C,U} as their second character and "." as their third character. Flag semantics are as per standard, with the important note that R and D flags with empty value fields respectively require and disallow that the feature in question is set to a non-neutral value, not to a neutral value.

Transitions with flag diacritics on the input side trigger flag diacritic processing. Flags on the output side have no effect. Transition table entries with flag diacritics must be placed before any of the non-epsilon transitions (in other words, epsilons and flag diacritics come first, before everything else). In the index table, flag diacritics have the same index as epsilon, ie. 0.

Weighted transducers

In weighted transducers, transition entries are suffixed with a 4-byte IEEE 754 float representing weight. For final transitions, this must be

static_cast<float>(UINT_MAX)

Final indices also have a weight in place of their target index, denoting an additional final weight for that index.

Clone this wiki locally