strange lt-tmxproc number copying #191

Closed
unhammer opened this issue Nov 4, 2024 · 3 comments


unhammer commented Nov 4, 2024

$ cat test.tmx
<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header
    creationtool="foo"
    creationtoolversion="1.0"
    segtype="phrase"
    o-tmf="tmx"
    adminlang="nb-NO"
    srclang="nb-NO"
    datatype="plaintext"
  />
  <body>
    <tu>
      <tuv xml:lang="nob">
        <seg>foo 1</seg>
      </tuv>
      <tuv xml:lang="nno">
        <seg>foo 1</seg>
      </tuv>
    </tu>
  </body>
</tmx>
$ lt-tmxcomp nob-nno test.tmx test.tmx.bin
nob->nno 9 8
$ echo '4 foo 1'| lt-tmxproc -s test.tmx.bin
4 [foo 4]
$ echo 'foo 4'| lt-tmxproc -s test.tmx.bin
[foo 4]
$ echo '5 foo 4'| lt-tmxproc -s test.tmx.bin
5 [foo 5]
$ echo '5 foo 1'| lt-tmxproc -s test.tmx.bin
5 [foo 5]

Why does it match numbers that are not 1, and why does it "copy" previously seen numbers? 😵‍💫


unhammer commented Nov 4, 2024

So apparently the tmx handling has this fancy feature for aligning translations even if there are numbers that might differ:

$ lt-print test.tmx.bin
0       1       f       f       0.000000
1       2       o       o       0.000000
2       3       o       o       0.000000
3       4                       0.000000
4       5       <n>     @       0.000000
5       6       ε       (       0.000000
6       7       ε       1       0.000000
7       8       ε       )       0.000000
8       0.000000

case '1':
case '2':
case '3':
case '4':
case '5':
case '6':
case '7':
case '8':
case '9':
  {
    UString ws;
    do
    {
      ws += val;
      val = input.get();
    } while(u_isdigit(val));
    input.unget(val);
    input_buffer.add(alphabet(u"<n>"));
    numbers.push_back(ws);
    return alphabet(u"<n>");
  }
  break;

Maybe handy if you have big tmx files with things like "Stock market things went up by 999 % in Q5" and you want to use that even if they went up by just 234 %.
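
To make the idea concrete, here is a minimal sketch of that normalization step. It is not the lttoolbox code: `normalizeNumbers` is a hypothetical helper, it works on plain `std::string` rather than `UString`, and it treats any ASCII digit as the start of a number, which may differ from the reader quoted above.

```cpp
#include <cassert>  // only needed for the check in the usage note
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, not part of lttoolbox: collapse every run of ASCII
// digits into the placeholder "<n>" and remember the digits that were
// replaced, in order of appearance.
std::pair<std::string, std::vector<std::string>>
normalizeNumbers(const std::string& text)
{
  std::string out;
  std::vector<std::string> numbers;
  std::size_t i = 0;
  while (i < text.size()) {
    if (std::isdigit(static_cast<unsigned char>(text[i]))) {
      // Consume the whole digit run, like the do/while loop in the snippet.
      std::string run;
      while (i < text.size() &&
             std::isdigit(static_cast<unsigned char>(text[i]))) {
        run += text[i++];
      }
      numbers.push_back(run);
      out += "<n>";
    } else {
      out += text[i++];
    }
  }
  return {out, numbers};
}
```

With this sketch, "went up by 999 % in Q5" and "went up by 234 % in Q5" normalize to the same string, so `assert(normalizeNumbers("foo 1").first == normalizeNumbers("foo 4").first)` holds, and only the recorded numbers differ.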

But it probably shouldn't copy the previous number – the alignment shouldn't look at stuff outside the matched segment. (And we may want to turn it off completely too?)


unhammer commented Nov 5, 2024

{
  bool substitute = false;
  for(int j = fragment[i].size() - 1; j >= 0; j--)
  {
    if(fragment[i].size()-j > 3 && fragment[i][j] == '\\' &&
       fragment[i][j+1] == '@' && fragment[i][j+2] == '(')
    {
      int num = 0;
      bool correct = true;
      for(unsigned int k = (unsigned int) j+3, limit2 = fragment[i].size();
          k != limit2; k++)
      {
        if(u_isdigit(fragment[i][k]))
        {
          num = num * 10;
          num += (int) fragment[i][k] - 48;
        }
        else
        {
          correct = false;
          break;
        }
      }
      if(correct)
      {
        fragment[i] = fragment[i].substr(0, j) + numbers[num - 1];
        substitute = true;
        break;
      }
    }
  }
  if(substitute == false)
  {
    fragment[i] += ')';
  }
}

😱
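
For comparison, the same back-reference expansion can be sketched more defensively. This is an illustration only, not a proposed patch: `expandNumberRefs` is a hypothetical name, it uses `std::string` instead of `UString`, it looks for a plain `@(k)` without the escaping backslash, and it rejects an out-of-range index instead of indexing `numbers` blindly.

```cpp
#include <cassert>  // only needed for the check in the usage note
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper, not part of lttoolbox: replace each "@(k)" in tmpl
// with numbers[k - 1]. Returns false (leaving out in an unspecified state)
// when k is zero or larger than numbers.size().
bool expandNumberRefs(const std::string& tmpl,
                      const std::vector<std::string>& numbers,
                      std::string& out)
{
  out.clear();
  std::size_t i = 0;
  while (i < tmpl.size()) {
    if (tmpl.compare(i, 2, "@(") == 0) {
      std::size_t k = i + 2;
      std::size_t num = 0;
      bool anyDigit = false;
      while (k < tmpl.size() &&
             std::isdigit(static_cast<unsigned char>(tmpl[k]))) {
        // Stop accumulating once num is already out of range, so a very
        // long digit string cannot overflow.
        if (num <= numbers.size()) {
          num = num * 10 + static_cast<std::size_t>(tmpl[k] - '0');
        }
        anyDigit = true;
        ++k;
      }
      if (anyDigit && k < tmpl.size() && tmpl[k] == ')') {
        if (num == 0 || num > numbers.size()) {
          return false;  // reference to a number that was never recorded
        }
        out += numbers[num - 1];
        i = k + 1;
        continue;
      }
    }
    out += tmpl[i++];  // ordinary character, or a malformed reference
  }
  return true;
}
```

Given `numbers = {"4"}`, expanding "foo @(1)" yields "foo 4", which is the behaviour seen for `echo 'foo 4'` above. A check such as `assert(!expandNumberRefs("foo @(2)", numbers, out))` shows the bounds test rejecting an index with no recorded number.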


unhammer commented Nov 5, 2024

This line:
fragment[i] = fragment[i].substr(0, j) + numbers[num - 1];
assumes that the numbers vector holds only numbers matched within the fragment, but the processor also pushes numbers that precede the match. The numbers vector is cleared only on a match; it should also be cleared before a match starts.
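
The effect can be reproduced with a toy model. The sketch below is an assumption-laden stand-in, not the lttoolbox processor: `processToy` is a hypothetical function that knows a single entry ("foo <n>" translating to "foo @(1)"), and its `clearBeforeMatch` flag models the fix described above.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Read the run of ASCII digits starting at pos and advance pos past it.
static std::string readDigits(const std::string& s, std::size_t& pos)
{
  std::string run;
  while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos]))) {
    run += s[pos++];
  }
  return run;
}

// Toy stand-in for the TM processor (hypothetical, for illustration only).
// Every number read is pushed onto `numbers`, matched or not; a match on
// "foo <n>" is emitted as "[foo @(1)]" with @(1) resolved to numbers[0].
std::string processToy(const std::string& input, bool clearBeforeMatch)
{
  std::vector<std::string> numbers;
  std::string out;
  std::size_t i = 0;
  while (i < input.size()) {
    const bool startsMatch =
        input.compare(i, 4, "foo ") == 0 && i + 4 < input.size() &&
        std::isdigit(static_cast<unsigned char>(input[i + 4]));
    if (startsMatch) {
      if (clearBeforeMatch) {
        numbers.clear();  // the fix: forget numbers seen before the match
      }
      i += 4;
      numbers.push_back(readDigits(input, i));
      assert(!numbers.empty());
      out += "[foo " + numbers[0] + "]";  // @(1) means numbers[1 - 1]
      numbers.clear();                    // cleared on a match, as before
    } else if (std::isdigit(static_cast<unsigned char>(input[i]))) {
      const std::string run = readDigits(input, i);
      numbers.push_back(run);  // unmatched number still gets recorded
      out += run;
    } else {
      out += input[i++];
    }
  }
  return out;
}
```

With `clearBeforeMatch` set to false, `processToy("5 foo 1", false)` gives "5 [foo 5]", the same shape as the output reported at the top of this issue. With it set to true, the stale 5 is dropped and the result is "5 [foo 1]".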

unhammer added a commit that referenced this issue Nov 5, 2024
(The expected failures failed before HEAD^ as well.)
unhammer added a commit that referenced this issue Dec 10, 2024
unhammer added a commit that referenced this issue Dec 10, 2024
(The expected failures failed before HEAD^ as well.)