Overlapping errors cause bad suggestions #29

Closed
snomos opened this issue Jul 13, 2019 · 4 comments

@snomos
Member

snomos commented Jul 13, 2019

$ echo "Ii oktage dieđe gean lea ovddasvástadus ." | divvun-checker -a tools/grammarcheckers/se.zcheck | jq .
{
  "errs": [
    [
      "ovddasvástadus",
      25,
      39,
      "typo",
      "Ii leat sátnelisttus",
      [
        "ovddasfástádus",
        "ovddasvástádus"
      ],
      "Čállinmeattáhusat"
    ],
    [
      ".",
      40,
      41,
      "space-before-punct-mark",
      "Lea gaska \".\" ovddas",
      [
        "ovddasvástadus."
      ],
      "Sátnegaskameattáhusat"
    ]
  ],
  "text": "Ii oktage dieđe gean lea ovddasvástadus ."
}

The punctuation error's correction suggestion includes the preceding word (uncorrected), while the spelling error corrects the same word independently. The end result, at least when running automatically / unsupervised, is that the misspelled word gets duplicated. This makes automated testing much harder.
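
To make the duplication concrete, here is a minimal sketch of an unsupervised run over the JSON above. The helper `apply_first_suggestions` is hypothetical (it is not part of divvun-checker), and it assumes each error is `[form, beg, end, tag, message, suggestions, title]` with offsets that can be used directly as Python string indices.

```python
# Hypothetical helper, not part of divvun-checker: apply the first
# suggestion of every reported error to the input text.
def apply_first_suggestions(result):
    text = result["text"]
    # Work right to left so earlier offsets stay valid after each replacement.
    for err in sorted(result["errs"], key=lambda e: e[1], reverse=True):
        beg, end, suggestions = err[1], err[2], err[5]
        if suggestions:
            text = text[:beg] + suggestions[0] + text[end:]
    return text

# The checker output quoted above, copied as a Python literal.
result = {
    "errs": [
        ["ovddasvástadus", 25, 39, "typo", "Ii leat sátnelisttus",
         ["ovddasfástádus", "ovddasvástádus"], "Čállinmeattáhusat"],
        [".", 40, 41, "space-before-punct-mark", 'Lea gaska "." ovddas',
         ["ovddasvástadus."], "Sátnegaskameattáhusat"],
    ],
    "text": "Ii oktage dieđe gean lea ovddasvástadus .",
}

fixed = apply_first_suggestions(result)
print(fixed)
# prints: Ii oktage dieđe gean lea ovddasfástádus ovddasvástadus.
```

The word shows up twice: once from the spelling suggestion, and once more because the punctuation suggestion carries the (uncorrected) context word but only replaces the single character at 40–41.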

@snomos snomos added the bug label Jul 13, 2019
@snomos snomos changed the title from "Overlapping errors causes bad suggestions" to "Overlapping errors cause bad suggestions" Jul 13, 2019
@snomos
Member Author

snomos commented Jul 13, 2019

Actually, the errors themselves do not overlap, which is part of the problem: because the punctuation error is so short in terms of character length, we extend the error context to the preceding (or following) word, to make the error visible in an interactive context (= blue underline in LO etc). In those contexts the whole string is replaced, including the preceding/following word, whereas in the command line interface the only string replaced is the actual error, but the replacement still contains the full context as given by the CG rules (= preceding/following word). In practice this leads to a duplication of the context word in question.

@snomos snomos added this to the 0.3.5 milestone Jul 13, 2019
@unhammer
Member

unhammer commented Aug 6, 2019

From the JSON, it's obvious the indices are wrong in the second error (40–41, i.e. just one character; it should be 25–41).

When I look at the grammar checker output, I see one oddity: There are two error tags on the same &LINK reading "ovddasvástádus" […] &LINK &space-before-punct-mark &typo – so when divvun-suggest is trying to find what reading to connect "." […] &space-before-punct-mark R:LEFT:6 to, it gets confused.

Full output from grammar checker:

$ echo "Ii oktage dieđe gean lea ovddasvástadus ." | $GTHOME/langs/sme/tools/grammarcheckers/modes/smegram8-gc.mode 
"<Ii>"
        "ii" <aux> V IV Neg Ind Sg3 <W:0.0> @+FAUXV #1->1
: 
"<oktage>"
        "okta" Pron Indef Sg Nom Foc/Neg-ge <W:0.0> @<SUBJ #2->2
        "okta" Pron Indef Sg Nom Foc/Pos-ge <W:0.0> @<SUBJ #2->2
: 
"<dieđe>"
        "diehtit" <mv> V <EX-Nom-Ani> <TH-Acc-Any><TH-Inf> <TH-Acc-Any><TH-PrfPrc> <TH-Acc-Any><TH-AktioEss> <TH-birra-Any> <TH-FS-Qpron> <TH-FS-Qst> <TH-ahte> <TH-Acc-Any> TV Imprt ConNeg <W:0.0> @-FMAINV #3->3
        "diehtit" <mv> V <EX-Nom-Ani> <TH-Acc-Any><TH-Inf> <TH-Acc-Any><TH-PrfPrc> <TH-Acc-Any><TH-AktioEss> <TH-birra-Any> <TH-FS-Qpron> <TH-FS-Qst> <TH-ahte> <TH-Acc-Any> TV Ind Prs ConNeg <W:0.0> @-FMAINV #3->3
: 
"<gean>"
        "gii" §TH Pron Sem/Hum Rel Sg Acc <W:0.0> @<OBJ #4->3
: 
"<lea>"
        "leat" <mv> §TH V <copula> <TH-Nom-Any> <mielde> <OR-Loc-HumGroup> <OR-eret-Plc> <dušše><TH-Inf> <árvvus> <LO-Loc-johtu><DE-Ill-Plc> <AT-Loc-Mat> <AT-Abe-Any> <AT-Nom-Any> <AT-Nom-Adj><EX-Ill-Ani> <PO-Loc-Hum> <PO-Gen-Hum> <MA-mielde-Any> <MA-Adv-Manner> <XT-Gen-Measr> <LO-maŋŋil-Time> <LO-Acc-Time> <LO-Loc-Time> <CO-Com-Ani> <ID-Nom-Any> <TH-Nom-Any><RO-Ess-Any><EX-Ill-Any> <EX-Ill-Ani><TH-Nom-Adj> <EX-Ill-Ani> <TH-Nom-Obj><RE-Ill-Ani> <LO-Loc-Any> <AktioEss> <BE-Ill-Ani><PU-Ess-Any> <RO-Ess-Any><PU-Ill-Act> <RO-Ess-Any> IV Ind Prs Sg3 <W:0.0> @FS-<ADVL #5->3
: 
"<ovddasvástadus>"
        "ovddasvástádus" Err/Orth-a-á N <BE-Ill-Any> Sem/Perc-emo Sg Nom <W:0.0> @<SUBJ &LINK &space-before-punct-mark &typo #6->6 ID:6
        "ovddasvástádus" N <BE-Ill-Any> Sem/Perc-emo Sg Nom <W:0.0> @<SUBJ &typo &SUGGEST #6->6 ID:6
: 
"<.>"
        "." CLB <W:0.0> <SpaceBeforePunctMark> &space-before-punct-mark #7->7 ID:7 R:LEFT:6
        "." CLB <W:0.0> <SpaceBeforePunctMark> "<ovddasvástadus.>" &space-before-punct-mark &SUGGESTWF #7->7 ID:7 R:LEFT:6
:\n

If I make them separate readings, so we have

"<ovddasvástadus>"
	"ovddasvástádus" Err/Orth-a-á N <BE-Ill-Any> Sem/Perc-emo Sg Nom <W:0.0> @<SUBJ &LINK &space-before-punct-mark #6->6 ID:6
	"ovddasvástádus" Err/Orth-a-á N <BE-Ill-Any> Sem/Perc-emo Sg Nom <W:0.0> @<SUBJ &typo #6->6 ID:6
	"ovddasvástádus" N <BE-Ill-Any> Sem/Perc-emo Sg Nom <W:0.0> @<SUBJ &typo &SUGGEST #6->6 ID:6

and send it all through divvun-suggest, both errors will cover the same range:

{
  "errs": [
    [
      "ovddasvástadus .",
      25,
      41,
      "typo",
      "Ii leat sátnelisttus",
      [
        "ovddasfástádus .",
        "ovddasvástádus ."
      ],
      "Čállinmeattáhusat"
    ],
    [
      "ovddasvástadus .",
      25,
      41,
      "space-before-punct-mark",
      "Lea gaska \".\" ovddas",
      [
        "ovddasvástadus."
      ],
      "Sátnegaskameattáhusat"
    ]
  ],
  "text": "Ii oktage dieđe gean lea ovddasvástadus .\n"
}
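
Since both errors now cover the same range, an unsupervised test runner has to pick at most one replacement per stretch of text. A minimal sketch of one way to do that follows; `select_non_overlapping` is a hypothetical helper, not something divvun-suggest provides, and the "first error wins" policy is an assumption.

```python
# Hypothetical helper: keep the first error for each stretch of text and
# drop any later error whose [beg, end) range overlaps one already kept.
def select_non_overlapping(errs):
    chosen = []
    # sorted() is stable, so errors with identical ranges keep their order.
    for err in sorted(errs, key=lambda e: (e[1], e[2])):
        beg, end = err[1], err[2]
        if all(end <= c[1] or beg >= c[2] for c in chosen):
            chosen.append(err)
    return chosen

# The two errors from the output above, copied as Python literals.
errs = [
    ["ovddasvástadus .", 25, 41, "typo", "Ii leat sátnelisttus",
     ["ovddasfástádus .", "ovddasvástádus ."], "Čállinmeattáhusat"],
    ["ovddasvástadus .", 25, 41, "space-before-punct-mark", 'Lea gaska "." ovddas',
     ["ovddasvástadus."], "Sátnegaskameattáhusat"],
]

kept = select_non_overlapping(errs)
print([e[3] for e in kept])
# prints: ['typo']
```

With this policy only the typo is fixed in the first pass; the space before the full stop would need a second run of the checker on the corrected text.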

unhammer added a commit that referenced this issue Aug 6, 2019
@unhammer
Member

unhammer commented Aug 6, 2019

divvun-suggest expected at most one error tag (&typo etc.) per reading. I've changed this in d43a550, so it should now handle having several.

(In this case it's fine to have several error tags on one reading, it's just about stretching the underline, but IIRC there are cases where we still need to put error tags on separate readings in CG.)
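
For illustration only, here is a small sketch of how several error tags on one reading can be picked out of a CG output line. It is not the divvun-suggest implementation; the names `ampersand_tags`, `error_tags` and `CONTROL_TAGS` are hypothetical. It assumes that error tags are the whitespace-separated tokens starting with `&`, and that `&LINK`, `&SUGGEST` and `&SUGGESTWF` are control tags rather than error types.

```python
import re

# Assumption: every whitespace-separated token starting with "&" is a tag of interest.
TAG_RE = re.compile(r"(?<!\S)&\S+")

# Assumption: these mark linking/suggestion behaviour, not an error type.
CONTROL_TAGS = {"&LINK", "&SUGGEST", "&SUGGESTWF"}

def ampersand_tags(reading_line):
    """Return all &-prefixed tags on a reading line, in order."""
    return TAG_RE.findall(reading_line)

def error_tags(reading_line):
    """Return only the tags that name an error type."""
    return [t for t in ampersand_tags(reading_line) if t not in CONTROL_TAGS]

# The first reading of "<ovddasvástadus>" from the grammar checker output above.
reading = ('"ovddasvástádus" Err/Orth-a-á N <BE-Ill-Any> Sem/Perc-emo Sg Nom '
           '<W:0.0> @<SUBJ &LINK &space-before-punct-mark &typo #6->6 ID:6')

print(error_tags(reading))
# prints: ['&space-before-punct-mark', '&typo']
```

A reading like this one yields two error tags, which is the case that previously confused the lookup.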

@unhammer
Member

unhammer commented Aug 6, 2019

$ echo "Ii oktage dieđe gean lea ovddasvástadus ." | $GTHOME/langs/sme/tools/grammarcheckers/modes/smegram8-gc.mode |src/divvun-suggest -g $GTHOME/langs/sme/tools/grammarcheckers/generator-gramcheck-gt-norm.hfstol -m $GTHOME/langs/sme/tools/grammarcheckers/errors.xml -l se  -j|jq .
{
  "errs": [
    [
      "ovddasvástadus .",
      25,
      41,
      "typo",
      "Ii leat sátnelisttus",
      [
        "ovddasfástádus .",
        "ovddasvástádus ."
      ],
      "Čállinmeattáhusat"
    ],
    [
      "ovddasvástadus .",
      25,
      41,
      "space-before-punct-mark",
      "Lea gaska \".\" ovddas",
      [
        "ovddasvástadus."
      ],
      "Sátnegaskameattáhusat"
    ]
  ],
  "text": "Ii oktage dieđe gean lea ovddasvástadus .\n"
}

@unhammer unhammer closed this as completed Aug 6, 2019