
wordMAP suggests target words in wrong occurrence order #6237

Open

cckozie opened this issue Jul 11, 2019 · 36 comments
@cckozie

cckozie commented Jul 11, 2019

  • If identical words are suggested, always put them in numerical order (e.g. if 3 "the"s are suggested, they should never show up in the order 1,3,2.)
    • NOTE: Someday in the future, this may change. For instance, if we suggest a noun ("book") and we know (either by morphology or syntax trees or something else) that "the" is linked to "book", then we may need to display identical words in a different order than they appear. But this is still a ways in the future.

1.2.0 (89d7abe)
Reported by Robert using 1.1.4
[screenshot attached]

@RobH123

RobH123 commented Jul 12, 2019

Here's another example where to2 is suggested (and to3) but not to1.

[screenshot attached]

@jag3773
Contributor

jag3773 commented Aug 21, 2019

Here is an example from Zec 1:4 in tC 2.0.0 (8a6a8c5): [screenshot attached]

@benjore

benjore commented Sep 3, 2019

Suggestion: SPIKE it out to figure out how to address this.

@da1nerd
Contributor

da1nerd commented Oct 16, 2019

@jag3773 @RobH123 Has this only been seen in Hebrew text? And are the suggestions always sequential but reversed?
My hunch is this has something to do with Hebrew being RTL instead of LTR. wordMAP doesn't have any notion of language direction, so this is probably what's causing the reversed suggestion order.

@da1nerd
Contributor

da1nerd commented Oct 17, 2019

I've identified three different areas where this bug could be coming from, plus a separate bug altogether. The problem is related to alignment memory and how it is used to score predictions:

  1. lemma n-gram frequency algorithm
  2. n-gram frequency algorithm
  3. alignment memory weighting

I also saw that the corpus index is not using the user-defined maximum n-gram length. This isn't likely to be very noticeable, but it could result in some lost suggestions.

@da1nerd da1nerd self-assigned this Oct 17, 2019
@da1nerd
Contributor

da1nerd commented Oct 17, 2019

@RobH123 do you happen to have a sample project where this issue shows up? If so, could you share it here?

@RobH123

RobH123 commented Oct 17, 2019

It may be correct that it's only in Hebrew (I haven't been in the NT lately). I'm not completely sure what I can give you, @neutrinog. I'm aligning UST 1 Samuel and I have lots of other projects loaded in tCore 2.0, as recommended by Larry Sallee to give extra context. It occurs frequently, more likely of course in longer verses. Are you after a Book/Chapter/Verse reference or a zip file or what? I just uploaded to https://git.door43.org/RobH/en_ust_1sa_book.

@da1nerd
Contributor

da1nerd commented Oct 18, 2019

@RobH123 a zip like the above, yes. But also a chapter/verse where you see this issue in the book.

@RobH123

RobH123 commented Oct 18, 2019

1 Sam 14:32 suggests they2 but not they1. v34 suggests to2 and to3 but not to1. v36 suggests soldiers3 but not soldiers1 or soldiers2. v39 suggests execute2 before execute1. v52 suggests Saul2 before Saul1.

@da1nerd
Contributor

da1nerd commented Oct 18, 2019

OK, I think I've discovered the problem here.
@klappy we have algorithms for alignment occurrence and alignment position; however, we do not take into account relative occurrence, that is, how similar the source and target tokens' occurrence positions are within the sentence. Right now the alignment position score is winning for tokens that occur later in the sentence.

For example, let's say we have the following alignment permutations, where the numbers indicate the token's occurrence within the sentence:

  • x(1)->y(1)
  • x(2)->y(2)
  • x(2)->y(1)
  • x(1)->y(2)

Visually we can see that the obvious prediction should be x(1)->y(1) and x(2)->y(2).

@da1nerd
Contributor

da1nerd commented Oct 18, 2019

Alignment Relative Occurrence

Here's my thought for an algorithm.

  • Given Tx, the total occurrences of a token x within the target sentence, and Ty, the total occurrences of a token y within the source sentence.
  • And given that we want to determine the relative occurrence of token y compared to token x.

Sample data:

Ty = 5
Tx = 3

Our known points of equivalence are (1,1) and (3, 5). These two points represent a state of identical relative occurrence. In other words, if both tokens are the first occurrence, or both tokens are the last occurrence, they are relatively equivalent.

Measure the slope between the two points above using the "Two Point Slope Form" equation:

(y' - y1)/(y2 - y1) = (x' - x1)/(x2 - x1)

(y' - 1)/(5 - 1) = (x' - 1)/(3 - 1)

Which simplifies to:

y' = 2x' - 1

This graph illustrates that we can now translate occurrences between the source and target text:

[graph attached]

Now we can evaluate the relative occurrence of two tokens.

  • Given a token x with occurrence 2
  • And given a token y with occurrence 4
    Determine their equivalence.
y' = 2(2) - 1
y' = 3

NOTE: we could have solved for y' or x'. It doesn't matter.

Now we have two relative occurrences that we can accurately compare.

y' = 3 // translated from x = 2
y = 4

Next we must compare how close these values are to each other relative to their range 1-Ty.

Normalize the range from 1-Ty to 0-1:

ny' = 3 / Ty = 3/5 = 0.6
ny = 4 / Ty = 4/5 = 0.8

disparity = abs(ny' - ny) = 0.2

Interpretation:

  • A disparity close to zero indicates the two tokens are very similar in order of occurrence.
  • A disparity close to one indicates the two tokens are very different in order of occurrence.
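
For concreteness, here is a minimal TypeScript sketch of the disparity calculation above. The function name and signature are illustrative, not wordMAP's actual API, and the handling of single-occurrence tokens is an assumption not covered above.

```ts
/**
 * Measures how similar two tokens' relative occurrence positions are.
 * x: occurrence of the target token (1-based); totalX: its total occurrences (Tx)
 * y: occurrence of the source token (1-based); totalY: its total occurrences (Ty)
 * Returns a disparity in [0, 1], where 0 means identical relative occurrence.
 */
function occurrenceDisparity(x: number, totalX: number, y: number, totalY: number): number {
  // Translate x's occurrence into y's range using the line through the
  // equivalence points (1, 1) and (totalX, totalY).
  // When totalX is 1 the slope is undefined; treat the lone occurrence as
  // the first (an assumption).
  const yPrime = totalX === 1 ? 1 : 1 + ((x - 1) * (totalY - 1)) / (totalX - 1);
  // Normalize both occurrences to the 0-1 range and compare.
  return Math.abs(yPrime / totalY - y / totalY);
}

// Worked example from above: Tx = 3, Ty = 5, x = 2, y = 4
console.log(occurrenceDisparity(2, 3, 4, 5)); // ≈ 0.2
```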

@da1nerd
Contributor

da1nerd commented Oct 21, 2019

The above should actually be performed on n-grams; a uni-gram would cover the single-token case.
This would require updates to a few other algorithms, so I'll restrict this to just uni-grams for now so we can quickly test whether this solves the problem.

@da1nerd
Contributor

da1nerd commented Oct 21, 2019

OK, here are the results!
This new algorithm doesn't add anything to the total prediction score. It's only used to balance out the alignment position score.

Click on the images for a closer look.

Before (notice "with"): [screenshot attached]

After: [screenshot attached]

@PhotoNomad0
Contributor

Verified in translationCore 2.1.0 (36a0501)

@da1nerd
Contributor

da1nerd commented Oct 29, 2019

Update: I've fixed a lot of bugs in wordMAP and am closing in on the ones presented in this issue. I expect to finish in a day or two.

@da1nerd da1nerd mentioned this issue Oct 29, 2019
@cckozie
Author

cckozie commented Nov 11, 2019

2.1.0 (d45bc64)
6237.zip
I'm still seeing wrong occurrence suggestions. These examples were taken with only the attached two projects in tC. (Moving this scenario to a new issue)
[screenshot attached]

[screenshot attached]

@da1nerd
Contributor

da1nerd commented Nov 12, 2019

@cckozie the first screenshot should be a valid suggestion. Notice the paired suggestion "they did". The reason we see the second occurrence of "did" and not the first is that only the second occurrence has an adjacent "they". Right now wordMAP only supports suggesting alignments of contiguous tokens (contiguous within the alignment, not across the entire set of suggestions). If it is desirable to support dis-contiguous alignments then someone needs to vote for this new feature 😉. However, note that we decided early on not to support this due to its increased complexity.
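
As a hypothetical illustration of that contiguity constraint (the helper and token positions below are invented for the example, not wordMAP's API):

```ts
// True when a set of target-token positions forms one unbroken run.
function isContiguous(positions: number[]): boolean {
  const sorted = [...positions].sort((a, b) => a - b);
  return sorted.every((pos, i) => i === 0 || pos === sorted[i - 1] + 1);
}

// "they did(2)" can form a single alignment because the tokens are adjacent:
console.log(isContiguous([6, 7])); // true
// did(1) with no adjacent "they" cannot join the same alignment:
console.log(isContiguous([2, 7])); // false
```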

The second screenshot does appear to be out of order of occurrence.

@cckozie
Author

cckozie commented Nov 12, 2019

@neutrinog OK, thanks. I understand that scenario 1 is working as designed even though it does not comply with the stated requirement in the first comment above. And to the average user it will most likely appear to be a bug (as it did to me:). I will write up a new issue for that scenario for future consideration.

@da1nerd
Contributor

da1nerd commented Jan 21, 2020

The remaining bug here (see previous comment) has been resolved in #6647.

Note: since we are basically brute forcing the occurrence order, I left a few escape hatches to prevent it from killing performance in rare situations. In these corner cases wordMAP will slowly become less strict when checking the occurrence order. I've simulated some of these in the unit tests, however I haven't seen these yet in translationCore.

@birchamp birchamp modified the milestones: tC Sprint #88, tC Sprint #89 Jan 29, 2020
@PhotoNomad0
Contributor

@cckozie - Joel's fixes are here: https://github.com/unfoldingWord/translationCore/actions/runs/32631095

I only verified the performance issue on the Mac build.

@cckozie
Author

cckozie commented Jan 29, 2020

2.1.0 (228ed0f)
Judges 7:9 now looks good, but Judges 6:27 now does not show any suggestions.

@da1nerd
Contributor

da1nerd commented Feb 5, 2020

@cckozie 6:27 doesn't show any suggestions now because it cannot satisfy the stricter occurrence rules.

In this case it only had one suggested use of the word "did" (the second occurrence) e.g. "they did(2)". However, since suggested words are now required to begin with the first occurrence (see #6237 (comment)), there was no valid combination of suggestions.

The only solutions I see to this are:

  1. remove the requirement that suggested words begin with the first occurrence (numerical order will still be enforced).
  2. remove the offending suggestion entirely (occurrence order will be preserved at the cost of a suggestion).
  3. if an empty set comes up, take the first best guess at the suggestions (occurrence order won't be guaranteed). This already happens if a strict occurrence prediction can't be completed within a reasonable amount of time.

The easiest ones to implement would be 1 and 3. Option 2 could take O(n²) time depending on how many problem words there are and where they are in the sentence.

Edit

We probably need to implement option 3 in either case, because even without the requirement that words begin at the first occurrence, it's still possible to get an empty set if the alignment memory has two pairs of words that are forced out of order by their counterparts, e.g. "the word(1)" and "a word(3)", where another word(2) exists but cannot be used because it is not adjacent to a "the" or an "a".
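
A minimal sketch of option 3's control flow, assuming candidate suggestion sets are already scored (names and types are illustrative, not wordMAP's actual API):

```ts
interface SuggestionSet {
  score: number;
  satisfiesOccurrenceOrder: boolean;
}

// Prefer the highest-scoring set that satisfies the strict occurrence rules;
// if no such set exists (the "empty set" case above), fall back to the overall
// best guess, without guaranteeing occurrence order.
function pickSuggestion(sets: SuggestionSet[]): SuggestionSet | undefined {
  const ranked = [...sets].sort((a, b) => b.score - a.score);
  return ranked.find((s) => s.satisfiesOccurrenceOrder) ?? ranked[0];
}
```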

@cckozie
Author

cckozie commented Feb 5, 2020

@neutrinog - I think #3 looks like a good option for now. @birchamp - What say you?

@birchamp
Contributor

birchamp commented Feb 5, 2020

I'll take what's behind door #3. @cckozie @neutrinog since it looks like option #3 needs to be implemented, let's go with that. It follows what we're already doing. tC is only making "suggestions", and we've given users a clear path to reject them.

@da1nerd
Contributor

da1nerd commented Mar 11, 2020

ok I've implemented number 3. See this PR #6748

@PhotoNomad0
Contributor

Verified in 2.2.0 (5170810)

@cckozie
Author

cckozie commented Mar 16, 2020

2.2.0 (2375ae5)
Seeing suggestions in 6:27 now, and it looks like they're all in the correct order.
[screenshot attached]

07-JDG.usfm.zip
7:9 has the third occurrence of "that" and "to" suggested instead of the second occurrences. This was also the case in 2.1 and 2.2 builds before this fix.
[screenshot attached]

@cckozie
Author

cckozie commented Mar 16, 2020

@neutrinog - Are the wrong order occurrences in 7:9 above expected?

@da1nerd
Contributor

da1nerd commented Mar 17, 2020

@cckozie nope, that's definitely due to a bug in the code. It took a while to track down, but now that I've found it, it should be pretty easy to fix.

@da1nerd
Contributor

da1nerd commented Mar 17, 2020

Well... I was able to fix the problem, sort of. Strangely enough, I recreated the problem in a unit test and fixed it there, but when I use the update in translationCore there's no difference. At least now the logs tell you what's going on. I'll try sleeping on it.

@da1nerd
Contributor

da1nerd commented Mar 24, 2020

I went ahead and put in a PR for this at #6784.

Through some painful debugging and manually constructing input data I was able to discover and fix some more bugs in wordMAP. However, the issue above still remains.
I've gotten to the point where I need to replicate the wordMAP environment as seen in production. But that's not sustainable unless I build in a way to export and import environments.

@BincyJ

BincyJ commented Apr 29, 2020

@birchamp to advise on next steps based on discussions scheduled later this week.

@birchamp
Contributor

Could @jag3773 take a look at this and tell us the impact on the content and GL teams?

@cckozie
Author

cckozie commented Jun 30, 2020

Rather than a large refactor, maybe we could just have tC run a check after wordMAP to see if any target words are out of order. If so, put up an alert informing the user that one or more target words are out of order.
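
A rough sketch of such a check, assuming each suggested target word exposes its text and occurrence (the AlignedWord shape is invented for illustration; translationCore's real suggestion model differs):

```ts
interface AlignedWord {
  text: string;
  occurrence: number; // 1-based occurrence within the target sentence
}

// Returns the target words whose suggested occurrences appear out of order.
function outOfOrderWords(suggested: AlignedWord[]): string[] {
  const lastSeen = new Map<string, number>();
  const bad = new Set<string>();
  for (const word of suggested) {
    if ((lastSeen.get(word.text) ?? 0) > word.occurrence) bad.add(word.text);
    lastSeen.set(word.text, word.occurrence);
  }
  return [...bad];
}

// tC could run this after wordMAP and, if the result is non-empty, show the
// alert described above.
```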

@jag3773
Contributor

jag3773 commented Jul 1, 2020

I don't think this should hold up the 3.0 release. If it can't be fixed relatively simply then I'd save it for later (some of the wrong-order scenarios were fixed, which helps). I'm not sure that a pop-up would be desired; that might be too distracting.
