strange lt-tmxproc number copying #191

unhammer · 2024-11-04T15:32:28Z

$ cat test.tmx
<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header
    creationtool="foo"
    creationtoolversion="1.0"
    segtype="phrase"
    o-tmf="tmx"
    adminlang="nb-NO"
    srclang="nb-NO"
    datatype="plaintext"
  />
  <body>
    <tu>
      <tuv xml:lang="nob">
        <seg>foo 1</seg>
      </tuv>
      <tuv xml:lang="nno">
        <seg>foo 1</seg>
      </tuv>
    </tu>
  </body>
</tmx>

$ lt-tmxcomp nob-nno test.tmx test.tmx.bin
nob->nno 9 8
$ echo '4 foo 1'| lt-tmxproc -s test.tmx.bin
4 [foo 4]
$ echo 'foo 4'| lt-tmxproc -s test.tmx.bin
[foo 4]
$ echo '5 foo 4'| lt-tmxproc -s test.tmx.bin
5 [foo 5]
$ echo '5 foo 1'| lt-tmxproc -s test.tmx.bin
5 [foo 5]

Why does it match numbers other than 1, and why does it "copy" previously seen numbers? 😵‍💫

unhammer · 2024-11-04T20:44:42Z

So apparently the TMX handling has this fancy feature for matching a stored segment even when the numbers in the input differ from the numbers in the segment:

$ lt-print test.tmx.bin
0       1       f       f       0.000000
1       2       o       o       0.000000
2       3       o       o       0.000000
3       4                       0.000000
4       5       <n>     @       0.000000
5       6       ε       (       0.000000
6       7       ε       1       0.000000
7       8       ε       )       0.000000
8       0.000000

lttoolbox/lttoolbox/fst_processor.cc

Lines 281 to 303 in 39db772

    
    case '1':
    case '2':
    case '3':
    case '4':
    case '5':
    case '6':
    case '7':
    case '8':
    case '9':
      {
        UString ws;
        do
        {
          ws += val;
          val = input.get();
        } while(u_isdigit(val));
        input.unget(val);
        input_buffer.add(alphabet(u"<n>"));
        numbers.push_back(ws);
        return alphabet(u"<n>");
      }
      break;

Maybe handy if you have big tmx files with things like "Stock market things went up by 999 % in Q5" and you want to use that even if they went up by just 234 %.

But it probably shouldn't copy the previous number – the alignment shouldn't look at stuff outside the matched segment. (And we may want to turn it off completely too?)
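
To make the symptom concrete, here is a minimal sketch of the behaviour described above. It is not the lttoolbox code: collapse_numbers and resolve are hypothetical names, and the sketch assumes every digit run (including runs starting with 0) becomes one <n> placeholder, with the digits saved in input order.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in for the digit handling quoted above: each maximal
// run of digits is replaced by the placeholder "<n>", and the digits are
// appended to `numbers` in input order.
std::string collapse_numbers(const std::string& input,
                             std::vector<std::string>& numbers)
{
  std::string out;
  std::size_t i = 0;
  while (i < input.size()) {
    if (std::isdigit(static_cast<unsigned char>(input[i]))) {
      std::string run;
      while (i < input.size() &&
             std::isdigit(static_cast<unsigned char>(input[i]))) {
        run += input[i++];
      }
      numbers.push_back(run);
      out += "<n>";
    } else {
      out += input[i++];
    }
  }
  return out;
}

// Resolve a 1-based reference, such as the "1" in "@(1)", against `numbers`.
std::string resolve(const std::vector<std::string>& numbers, std::size_t ref)
{
  assert(ref >= 1 && ref <= numbers.size());
  return numbers[ref - 1];
}
```

With the input "5 foo 1", numbers ends up holding "5" and then "1". The compiled segment refers to its first number as @(1), so if nothing is cleared in between, the reference resolves to "5" rather than "1", which is consistent with the "5 [foo 5]" output in the report.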

unhammer · 2024-11-05T13:44:29Z

lttoolbox/lttoolbox/state.cc

Lines 714 to 749 in 39db772

    
    {
      bool substitute = false;
      for(int j = fragment[i].size() - 1; j >= 0; j--)
      {
        if(fragment[i].size()-j > 3 && fragment[i][j] == '\\' &&
           fragment[i][j+1] == '@' && fragment[i][j+2] == '(')
        {
          int num = 0;
          bool correct = true;
          for(unsigned int k = (unsigned int) j+3, limit2 = fragment[i].size();
              k != limit2; k++)
          {
            if(u_isdigit(fragment[i][k]))
            {
              num = num * 10;
              num += (int) fragment[i][k] - 48;
            }
            else
            {
              correct = false;
              break;
            }
          }
          if(correct)
          {
            fragment[i] = fragment[i].substr(0, j) + numbers[num - 1];
            substitute = true;
            break;
          }
        }
      }
      if(substitute == false)
      {
        fragment[i] += ')';
      }
    }

😱
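
The backward scan in the quoted lines is easier to follow when written as a standalone helper. The following is only a sketch under stated assumptions: substitute_reference is a hypothetical name, it works on a plain std::string instead of UString, and it rejects out-of-range references, which the quoted code does not check for.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: if `fragment` ends in "\@(" followed only by digits,
// replace that tail with the referenced number (1-based) and return true.
// Otherwise leave `fragment` untouched and return false.
bool substitute_reference(std::string& fragment,
                          const std::vector<std::string>& numbers)
{
  const std::string marker = "\\@(";
  const std::size_t pos = fragment.rfind(marker);
  if (pos == std::string::npos) {
    return false;
  }
  const std::size_t first_digit = pos + marker.size();
  assert(first_digit <= fragment.size());  // marker lies inside the fragment
  if (first_digit == fragment.size()) {
    return false;  // no digits after the marker
  }
  std::size_t ref = 0;
  for (std::size_t k = first_digit; k < fragment.size(); k++) {
    const char c = fragment[k];
    if (c < '0' || c > '9') {
      return false;  // something other than a digit follows the marker
    }
    ref = ref * 10 + static_cast<std::size_t>(c - '0');
  }
  if (ref == 0 || ref > numbers.size()) {
    return false;  // assumption: refuse references outside the saved numbers
  }
  fragment = fragment.substr(0, pos) + numbers[ref - 1];
  return true;
}
```

In the quoted code the caller appends ')' when no substitution happens; a caller of this sketch would do the same when it returns false.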

unhammer · 2024-11-05T14:13:39Z

This:
    fragment[i] = fragment[i].substr(0, j) + numbers[num - 1];
assumes that the numbers vector holds only the numbers matched by the fragment, but the processor also adds numbers that precede the match. The numbers vector is only cleared on a match, but it should also be cleared before a match starts.
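
The intended behaviour can be modelled in a few lines. This is a sketch of the idea only, not the actual change: MatcherSketch and its members are hypothetical names, and the assumption is that a reset happens whenever matching restarts from the initial state.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of the matcher state; the names are illustrative and
// not taken from lttoolbox.
struct MatcherSketch {
  std::vector<std::string> numbers;

  // Called for every digit run read from the input.
  void see_number(const std::string& digits) { numbers.push_back(digits); }

  // Called whenever matching restarts from the initial state. Dropping the
  // saved numbers here means index 0 always refers to the first number of
  // the segment that is about to be matched.
  void reset() { numbers.clear(); }

  // Resolve a 1-based reference such as the "1" in "@(1)".
  std::string resolve(std::size_t ref) const {
    assert(ref >= 1 && ref <= numbers.size());
    return numbers[ref - 1];
  }
};
```

For the input "5 foo 1", the leading "5" is seen, the match attempt fails and the state is reset, and then "foo 1" is matched with numbers holding only "1", so @(1) resolves to "1" as expected.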

(The expected failures failed before HEAD^ as well.)

should fix #191

unhammer closed this as completed in 48c24e6 Nov 5, 2024

unhammer added a commit that referenced this issue Nov 5, 2024

Tests for #191

3d32b18

(The expected failures failed before HEAD^ as well.)

unhammer mentioned this issue Nov 5, 2024

lt-tmxproc / lt-tmxcomp flag to turn off special number handling #192

Open

unhammer added a commit that referenced this issue Dec 10, 2024

lt-tmxproc: Clear numbers vector when resetting state

a1f3845

should fix #191

unhammer added a commit that referenced this issue Dec 10, 2024

Tests for #191

50138c4

(The expected failures failed before HEAD^ as well.)
