
wordMAP suggests target words in wrong occurrence order #6237

Open

cckozie opened this issue Jul 11, 2019 · 36 comments
@cckozie

cckozie commented Jul 11, 2019

  • If identical words are suggested, always put them in numerical order (e.g. if 3 "the"s are suggested, they should never show up in the order 1,3,2.)
    • NOTE: Someday in the future, this may change. For instance, if we suggest a noun ("book") and we know (either by morphology or syntax trees or something else) that "the" is linked to "book", then we may need to display identical words in a different order than they appear. But this is still a ways in the future.

1.2.0 (89d7abe)
Reported by Robert using 1.1.4
[screenshot attached]

@RobH123

RobH123 commented Jul 12, 2019

Here's another example where to2 is suggested (and to3) but not to1.

[screenshot attached]

@jag3773
Contributor

jag3773 commented Aug 21, 2019

Here is an example from Zec 1:4 in tC 2.0.0 (8a6a8c5): [screenshot attached]

@benjore

benjore commented Sep 3, 2019

Suggestion: SPIKE it out to figure out how to address this.

@da1nerd
Contributor

da1nerd commented Oct 16, 2019

@jag3773 @RobH123 Has this only been seen in Hebrew text? And are the suggestions always sequential but reversed?
My hunch is this has something to do with Hebrew being RTL instead of LTR. wordMAP doesn't have any notion of language direction, so this is probably what's causing the reversed suggestion order.

@da1nerd
Contributor

da1nerd commented Oct 17, 2019

I've identified three different areas where this bug could be coming from, plus a separate bug altogether. The problem is related to alignment memory and how it is used to score predictions:

  1. lemma n-gram frequency algorithm
  2. n-gram frequency algorithm
  3. alignment memory weighting

I also saw that the corpus index is not using the user-defined maximum n-gram length. This isn't likely to be very noticeable, but it could result in some lost suggestions.

@da1nerd da1nerd self-assigned this Oct 17, 2019
@da1nerd
Contributor

da1nerd commented Oct 17, 2019

@RobH123 do you happen to have a sample project where this issue shows up? If so, could you share it here?

@RobH123

RobH123 commented Oct 17, 2019

It may be correct that it's only in Hebrew (I haven't been in the NT lately). I'm not completely sure what I can give you, @neutrinog. I'm aligning UST 1 Samuel and I have lots of other projects loaded in tCore 2.0, as recommended by Larry Sallee to give extra context. It occurs frequently, more likely of course in longer verses. Are you after a Book/Chapter/Verse reference or a zip file or what? I just uploaded to https://git.door43.org/RobH/en_ust_1sa_book.

@da1nerd
Contributor

da1nerd commented Oct 18, 2019

@RobH123 a zip like the above, yes. But also a chapter/verse where you see this issue in the book.

@RobH123

RobH123 commented Oct 18, 2019

1 Sam 14:32 suggests they2 but not they1. v34 suggests to2 and to3 but not to1. v36 suggests soldiers3 but not soldiers1 or soldiers2. v39 suggests execute2 before execute1. v52 suggests Saul2 before Saul1.

@da1nerd
Contributor

da1nerd commented Oct 18, 2019

OK, I think I've discovered the problem here.
@klappy we have algorithms for alignment occurrence and alignment position; however, we do not take into account relative occurrence, that is, how similar the source and target tokens' occurrence positions are within the sentence. Right now the alignment position score is winning for tokens that occur later in the sentence.

For example, let's say we have the following alignment permutations, where the numbers indicate the token's occurrence within the sentence:

  • x(1)->y(1)
  • x(2)->y(2)
  • x(2)->y(1)
  • x(1)->y(2)

Visually we can see that the obvious prediction should be x(1)->y(1) and x(2)->y(2).

@da1nerd
Contributor

da1nerd commented Oct 18, 2019

Alignment Relative Occurrence

Here's my thought for an algorithm.

  • Given Tx, the total occurrences of a token x within the target sentence, and Ty, the total occurrences of a token y within the source sentence.
  • And given that we want to determine the relative occurrence of token y compared to token x.

Sample data:

Ty = 5
Tx = 3

Our known points of equivalence are (1,1) and (3, 5). These two points represent a state of identical relative occurrence. In other words, if both tokens are the first occurrence, or both tokens are the last occurrence, they are relatively equivalent.

Measure the slope between the two points above using the "Two Point Slope Form" equation:

(y' - y1)/(y2 - y1) = (x' - x1)/(x2 - x1)

(y' - 1)/(5 - 1) = (x' - 1)/(3 - 1)

Which simplifies to:

y' = 2x' - 1

This graph illustrates that we can now translate occurrences between the source and target text:

[graph attached]

Now we can evaluate the relative occurrence of two tokens.

  • Given a token x with occurrence 2
  • And given a token y with occurrence 4
    Determine their equivalence.
y' = 2(2) - 1
y' = 3

NOTE: we could have solved for y' or x'. It doesn't matter.

Now we have two relative occurrences that we can accurately compare.

y' = 3 // translated from x = 2
y = 4

Next we must compare how close these values are to each other relative to their range 1-Ty.

Normalize the range from 1-Ty to 0-1:

ny' = 3 / Ty = 3/5 = 0.6
ny = 4 / Ty = 4/5 = 0.8

disparity = abs(ny' - ny) = 0.2

Interpretation:

  • A disparity close to zero indicates the two tokens are very similar in order of occurrence.
  • A disparity close to one indicates the two tokens are very different in order of occurrence.
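
For concreteness, here is a minimal TypeScript sketch of the disparity calculation above. The function name and signature are illustrative, not wordMAP's actual API, and the handling of single-occurrence tokens is an assumption not covered above.

```ts
/**
 * Measures how similar two tokens' relative occurrence positions are.
 * x: occurrence of the target token (1-based); totalX: its total occurrences (Tx)
 * y: occurrence of the source token (1-based); totalY: its total occurrences (Ty)
 * Returns a disparity in [0, 1], where 0 means identical relative occurrence.
 */
function occurrenceDisparity(x: number, totalX: number, y: number, totalY: number): number {
  // Translate x's occurrence into y's range using the line through the
  // equivalence points (1, 1) and (totalX, totalY).
  // When totalX is 1 the slope is undefined; treat the lone occurrence as
  // the first (an assumption).
  const yPrime = totalX === 1 ? 1 : 1 + ((x - 1) * (totalY - 1)) / (totalX - 1);
  // Normalize both occurrences to the 0-1 range and compare.
  return Math.abs(yPrime / totalY - y / totalY);
}

// Worked example from above: Tx = 3, Ty = 5, x = 2, y = 4
console.log(occurrenceDisparity(2, 3, 4, 5)); // ≈ 0.2
```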

@da1nerd
Contributor

da1nerd commented Oct 21, 2019

The above should actually be performed on n-grams; a uni-gram would cover the single-token case.
This would require updates to a few other algorithms, so I'll restrict this to just uni-grams for now so we can quickly test whether this solves the problem.

@da1nerd
Contributor

da1nerd commented Oct 21, 2019

OK, here are the results!
This new algorithm doesn't add anything to the total prediction score. It's only used to balance out the alignment position score.

Click on the images for a closer look.

Before (notice "with"): [screenshot attached]

After: [screenshot attached]

@PhotoNomad0
Contributor

Verified in translationCore 2.1.0 (36a0501)

@da1nerd
Contributor

da1nerd commented Oct 29, 2019

Update: I've fixed a lot of bugs in wordMAP and am closing in on the ones presented in this issue. I expect to finish in a day or two.

@da1nerd da1nerd mentioned this issue Oct 29, 2019
@cckozie
Author

cckozie commented Nov 11, 2019

2.1.0 (d45bc64)
6237.zip
I'm still seeing wrong occurrence suggestions. These examples were taken with only the attached two projects in tC. (Moving this scenario to a new issue)
[screenshot attached]

[screenshot attached]

@da1nerd
Contributor

da1nerd commented Nov 12, 2019

@cckozie the first screenshot should be a valid suggestion. Notice the paired suggestion "they did". The reason we see the second occurrence of "did" and not the first is that only the second occurrence has an adjacent "they". Right now wordMAP only supports suggesting alignments of contiguous tokens (contiguous within the alignment, not across the entire set of suggestions). If it is desirable to support dis-contiguous alignments then someone needs to vote for this new feature 😉. However, note that we decided early on not to support this due to its increased complexity.
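
As a hypothetical illustration of that contiguity constraint (the helper and token positions below are invented for the example, not wordMAP's API):

```ts
// True when a set of target-token positions forms one unbroken run.
function isContiguous(positions: number[]): boolean {
  const sorted = [...positions].sort((a, b) => a - b);
  return sorted.every((pos, i) => i === 0 || pos === sorted[i - 1] + 1);
}

// "they did(2)" can form a single alignment because the tokens are adjacent:
console.log(isContiguous([6, 7])); // true
// did(1) with no adjacent "they" cannot join the same alignment:
console.log(isContiguous([2, 7])); // false
```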

The second screenshot does appear to be out of order of occurrence.

@cckozie
Author

cckozie commented Nov 12, 2019

@neutrinog OK, thanks. I understand that scenario 1 is working as designed even though it does not comply with the stated requirement in the first comment above. And to the average user it will most likely appear to be a bug (as it did to me:). I will write up a new issue for that scenario for future consideration.

@da1nerd
Contributor

da1nerd commented Jan 21, 2020

The remaining bug here (see previous comment) has been resolved in #6647.

Note: since we are basically brute forcing the occurrence order, I left a few escape hatches to prevent it from killing performance in rare situations. In these corner cases wordMAP will slowly become less strict when checking the occurrence order. I've simulated some of these in the unit tests, however I haven't seen these yet in translationCore.

@birchamp birchamp modified the milestones: tC Sprint #88, tC Sprint #89 Jan 29, 2020
@PhotoNomad0
Contributor

@cckozie - Joel's fixes are here: https://github.com/unfoldingWord/translationCore/actions/runs/32631095

I only verified the performance issue on the Mac build.

@cckozie
Author

cckozie commented Jan 29, 2020

2.1.0 (228ed0f)
Judges 7:9 now looks good, but Judges 6:27 now does not show any suggestions.

@da1nerd
Contributor

da1nerd commented Feb 5, 2020

@cckozie 6:27 doesn't show any suggestions now because it cannot satisfy the stricter occurrence rules.

In this case it only had one suggested use of the word "did" (the second occurrence) e.g. "they did(2)". However, since suggested words are now required to begin with the first occurrence (see #6237 (comment)), there was no valid combination of suggestions.

The only solutions I see to this are:

  1. remove the requirement that suggested words begin with the first occurrence (numerical order will still be enforced).
  2. remove the offending suggestion entirely (occurrence order will be preserved at the cost of a suggestion).
  3. if an empty set comes up, take the first best guess at the suggestions (occurrence order won't be guaranteed). This already happens if a strict occurrence prediction can't be completed within a reasonable amount of time.

The easiest ones to implement would be 1 and 3. Option 2 could take O(n²) time depending on how many problem words there are and where they are in the sentence.

Edit

We probably need to implement option 3 in either case, because even without the requirement that words begin at the first occurrence, it's still possible to get an empty set if the alignment memory has two pairs of words that are forced out of order by their counterparts, e.g. "the word(1)" and "a word(3)", where another word(2) exists but cannot be used because it is not adjacent to a "the" or an "a".
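
A minimal sketch of option 3's control flow, assuming candidate suggestion sets are already scored (names and types are illustrative, not wordMAP's actual API):

```ts
interface SuggestionSet {
  score: number;
  satisfiesOccurrenceOrder: boolean;
}

// Prefer the highest-scoring set that satisfies the strict occurrence rules;
// if no such set exists (the "empty set" case above), fall back to the overall
// best guess, without guaranteeing occurrence order.
function pickSuggestion(sets: SuggestionSet[]): SuggestionSet | undefined {
  const ranked = [...sets].sort((a, b) => b.score - a.score);
  return ranked.find((s) => s.satisfiesOccurrenceOrder) ?? ranked[0];
}
```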

@cckozie
Author

cckozie commented Feb 5, 2020

@neutrinog - I think #3 looks like a good option for now. @birchamp - What say you?

@birchamp
Contributor

birchamp commented Feb 5, 2020

I'll take what's behind door #3. @cckozie @neutrinog since it looks like option #3 needs to be implemented, let's go with that. It follows what we're already doing. tC is only making "suggestions", and we've given users a clear path to reject them.

@da1nerd
Contributor

da1nerd commented Mar 11, 2020

ok I've implemented number 3. See this PR #6748

@PhotoNomad0
Contributor

Verified in 2.2.0 (5170810)

@cckozie
Author

cckozie commented Mar 16, 2020

2.2.0 (2375ae5)
Seeing suggestions in 6:27 now, and it looks like they're all in the correct order.
[screenshot attached]

07-JDG.usfm.zip
7:9 has the third occurrence of "that" and "to" suggested instead of the second occurrences. This was also the case in 2.1 and 2.2 builds before this fix.
[screenshot attached]

@cckozie
Author

cckozie commented Mar 16, 2020

@neutrinog - Are the wrong order occurrences in 7:9 above expected?

@da1nerd
Contributor

da1nerd commented Mar 17, 2020

@cckozie nope, that's definitely due to a bug in the code. It took a while to track down, but now that I've found it, it should be pretty easy to fix.

@da1nerd
Contributor

da1nerd commented Mar 17, 2020

Well... I was able to fix the problem, sort of. Strangely enough, I recreated the problem in a unit test and fixed it there, but when I use the update in translationCore there's no difference. At least now the logs tell you what's going on. I'll try sleeping on it.

@da1nerd
Contributor

da1nerd commented Mar 24, 2020

I went ahead and put in a PR for this at #6784.

Through some painful debugging and manually constructing input data I was able to discover and fix some more bugs in wordMAP. However, the issue above still remains.
I've gotten to the point where I need to replicate the wordMAP environment as seen in production. But that's not sustainable unless I build in a way to export and import environments.

@BincyJ

BincyJ commented Apr 29, 2020

@birchamp to advise on next steps based on discussions scheduled later this week.

@birchamp
Contributor

Could @jag3773 take a look at this and tell us the impact on the content and GL teams?

@cckozie
Author

cckozie commented Jun 30, 2020

Rather than a large refactor, maybe we could just have tC run a check after wordMAP to see if any target words are out of order. If so, put up an alert informing the user that one or more target words are out of order.
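
A rough sketch of such a check, assuming each suggested target word exposes its text and occurrence (the AlignedWord shape is invented for illustration; translationCore's real suggestion model differs):

```ts
interface AlignedWord {
  text: string;
  occurrence: number; // 1-based occurrence within the target sentence
}

// Returns the target words whose suggested occurrences appear out of order.
function outOfOrderWords(suggested: AlignedWord[]): string[] {
  const lastSeen = new Map<string, number>();
  const bad = new Set<string>();
  for (const word of suggested) {
    if ((lastSeen.get(word.text) ?? 0) > word.occurrence) bad.add(word.text);
    lastSeen.set(word.text, word.occurrence);
  }
  return [...bad];
}

// tC could run this after wordMAP and, if the result is non-empty, show the
// alert described above.
```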

@jag3773
Contributor

jag3773 commented Jul 1, 2020

I don't think this should hold up the 3.0 release. If it can't be fixed relatively simply then I'd save it for later (some of the wrong-order scenarios were fixed, which helps). I'm not sure that a pop-up would be desired; that might be too distracting.
