lt-trim: new option --match-section

The option may be given multiple times. Any analyser section whose
name (id@type) is given this way will only be trimmed against the
section with the same name in the bidix. This is useful for regex
sections, which tend to have a very different structure from regular
entries (few states with lots of transitions and loops), leading to
slowdown when intersecting.
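
The matching rule described above can be sketched in isolation. This is only an illustration under stated assumptions: plain strings stand in for transducers, "merging" is modelled as collecting section names, and the helper names (Group, group_bidix_sections, partner_for) are hypothetical and do not exist in lttoolbox.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Stand-in for a transducer: the list of bidix section names merged into it.
using Group = std::vector<std::string>;

// Key for the merged "everything else" group. The empty string is not a
// valid section name, so it cannot clash with a real section.
const std::string kUnionName = "";

// Split the bidix sections into one group per name listed in match_sections
// and a single merged group holding all remaining sections.
std::map<std::string, Group>
group_bidix_sections(const std::vector<std::string>& bidix_sections,
                     const std::set<std::string>& match_sections)
{
  std::map<std::string, Group> groups;
  groups[kUnionName];  // always present, possibly empty
  for (const auto& name : bidix_sections) {
    if (match_sections.count(name) > 0) {
      groups[name].push_back(name);
    } else {
      groups[kUnionName].push_back(name);
    }
  }
  return groups;
}

// Pick the bidix group that an analyser section is trimmed against:
// the group with the same name if there is one, else the merged group.
const Group&
partner_for(const std::map<std::string, Group>& groups,
            const std::string& analyser_section)
{
  auto it = groups.find(analyser_section);
  if (it == groups.end()) {
    it = groups.find(kUnionName);
  }
  assert(it != groups.end());
  return it->second;
}
```

With the three sections from the timing runs below and regex@standard as the only matched name, regex@standard is paired with just its namesake, while main@standard and final@inconditional are each paired with the merged group of the other two bidix sections.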

This gives a 4x speedup (60s → 15s elapsed) on nob→nno, with peak
memory (maxresident) dropping from 2280784k to 382136k:

BEFORE:

$ \time lttoolbox/lttoolbox/lt-trim apertium-nob/nob.automorf.bin apertium-nno-nob/nob-nno.autobil.bin /tmp/before.bin
final@inconditional 26 76
main@standard 168643 350041
regex@standard 403 7475
58.73user 0.97system 1:00.45elapsed 98%CPU (0avgtext+0avgdata 2280784maxresident)k
0inputs+3288outputs (0major+574892minor)pagefaults 0swaps

AFTER:

$ \time lttoolbox/lttoolbox/lt-trim --match-section=regex@standard apertium-nob/nob.automorf.bin apertium-nno-nob/nob-nno.autobil.bin /tmp/after.bin
Matched sections regex@standard
final@inconditional 26 76
main@standard 168643 350041
regex@standard 389 7405
14.36user 0.24system 0:14.77elapsed 98%CPU (0avgtext+0avgdata 382136maxresident)k
0inputs+3288outputs (0major+102452minor)pagefaults 0swaps

(timings are the same if lt-comp -j was used to make nob.automorf.bin)
unhammer committed Sep 30, 2022
1 parent e6a3ade commit 5fa5a97
Showing 2 changed files with 54 additions and 12 deletions.
18 changes: 18 additions & 0 deletions lttoolbox/lt-trim.1
@@ -85,6 +85,24 @@ You should not trim a generator unless you have a
.Em very
simple translator pipeline,
since the output of bidix seldom goes unchanged through transfer.
.Sh OPTIONS
.Bl -tag -width Ds
.It Fl s , Fl Fl match-section
A section with this name (id@type) in the analyser will only be
trimmed against a section with the same name in the bidix. (The default
is to trim all sections of the analyser against all sections of the
bidix.) Using this option can sometimes speed up trimming
considerably. For example, if you have some complicated regular
expressions, try putting them in a

<section id="regex" type="standard">

in both .dix files and passing
.Dq regex@standard
to \fI--match-section\fP.
.Pp
This argument may be used multiple times to specify multiple sections
that must match by name.
.Sh FILES
.Bl -tag -width Ds
.It Ar analyser_binary
48 changes: 36 additions & 12 deletions lttoolbox/lt_trim.cc
@@ -21,7 +21,7 @@
#include <iostream>

void
trim(FILE* file_mono, FILE* file_bi, FILE* file_out)
trim(FILE* file_mono, FILE* file_bi, FILE* file_out, std::set<UString> match_sections)
{
Alphabet alph_mono;
std::set<UChar32> letters_mono;
@@ -41,37 +41,53 @@ trim(FILE* file_mono, FILE* file_bi, FILE* file_out)
std::set<int> loopback_symbols; // ints refer to alph_prefix
alph_prefix.createLoopbackSymbols(loopback_symbols, alph_mono, Alphabet::right);

UString union_name = u""; // Not a valid section name, used as key for those where we don't care about names matching
std::map<UString, Transducer> moved_bi_transducers;
for (auto& it : trans_bi) {
if (union_transducer.isEmpty()) {
union_transducer = it.second;
} else {
union_transducer.unionWith(alph_bi, it.second);
if(match_sections.contains(it.first)) {
moved_bi_transducers[it.first] = it.second.appendDotStar(loopback_symbols).moveLemqsLast(alph_prefix);
}
else {
if (union_transducer.isEmpty()) {
union_transducer = it.second;
}
else {
union_transducer.unionWith(alph_bi, it.second);
}
}
}
union_transducer.minimize();

Transducer prefix_transducer = union_transducer.appendDotStar(loopback_symbols);
// prefix_transducer should _not_ be minimized (both useless and takes forever)
Transducer moved_transducer = prefix_transducer.moveLemqsLast(alph_prefix);
// prefix/moved transducer should _not_ be minimized (both useless and takes forever)
moved_bi_transducers[union_name] = union_transducer.appendDotStar(loopback_symbols).moveLemqsLast(alph_prefix);

std::map<UString, Transducer> trans_trim;
std::set<UString> sections_unmatched = match_sections; // just used to warn if user asked for a match that never happened

for (auto& it : trans_mono) {
if (it.second.numberOfTransitions() == 0) {
std::cerr << "Warning: section " << it.first << " is empty! Skipping it..." << std::endl;
continue;
}
// TODO: parallelise this loop (as in lt_compose.cc)
if (moved_bi_transducers.contains(it.first)) {
sections_unmatched.erase(it.first);
}
Transducer& moved_transducer = moved_bi_transducers.contains(it.first)
? moved_bi_transducers[it.first]
: moved_bi_transducers[union_name];
Transducer trimmed = it.second.trim(moved_transducer,
alph_mono,
alph_prefix);
alph_mono,
alph_prefix);
if (trimmed.hasNoFinals()) {
std::cerr << "Warning: section " << it.first << " had no final state after trimming! Skipping it..." << std::endl;
continue;
}
trimmed.minimize();
trans_trim[it.first] = trimmed;
}
for (const auto &name : sections_unmatched) {
std::cerr << "Warning: section " << name << " was not found in both transducers! Skipping if in just one..." << std::endl;
}

if (trans_trim.empty()) {
std::cerr << "Error: Trimming gave empty transducer!" << std::endl;
@@ -91,13 +107,21 @@ int main(int argc, char *argv[])
cli.add_file_arg("analyser_bin_file", false);
cli.add_file_arg("bidix_bin_file");
cli.add_file_arg("trimmed_bin_file");
cli.add_str_arg('s', "match-section", "A section with this name (id@type) will only be trimmed against a section with the same name. This argument may be used multiple times.", "section_name");
cli.parse_args(argc, argv);

auto strs = cli.get_strs();
std::set<UString> match_sections;
if (strs.find("match-section") != strs.end()) {
for (auto &it : strs["match-section"]) {
match_sections.insert(to_ustring(it.c_str()));
}
}
FILE* analyser = openInBinFile(cli.get_files()[0]);
FILE* bidix = openInBinFile(cli.get_files()[1]);
FILE* output = openOutBinFile(cli.get_files()[2]);

trim(analyser, bidix, output);
trim(analyser, bidix, output, match_sections);

fclose(analyser);
fclose(bidix);
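
The option handling in main above gathers every value given for the repeated option into a set, so duplicates collapse. A rough standalone sketch of that idea follows. It assumes plain argument strings rather than the lttoolbox CLI class, the helper collect_match_sections is hypothetical, and accepting the "-s NAME" and "--match-section NAME" spellings alongside "--match-section=NAME" is an assumption about how such an option is commonly written, not a description of the real parser.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper, not part of lttoolbox: gather every value given for
// -s / --match-section from an argument list into a set.
std::set<std::string>
collect_match_sections(const std::vector<std::string>& args)
{
  const std::string long_eq = "--match-section=";
  std::set<std::string> names;
  for (std::size_t i = 0; i < args.size(); ++i) {
    const std::string& arg = args[i];
    if (arg.compare(0, long_eq.size(), long_eq) == 0) {
      // --match-section=NAME: the value follows the equals sign
      names.insert(arg.substr(long_eq.size()));
    } else if ((arg == "-s" || arg == "--match-section") && i + 1 < args.size()) {
      // -s NAME or --match-section NAME: the value is the next argument
      names.insert(args[++i]);
    }
  }
  return names;
}
```

Using a set rather than a list means that naming the same section twice has no extra effect, which matches how match_sections is used as a membership test in trim().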