Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Random results from foliavalidator and folia2txt #100

Open
kosloot opened this issue May 25, 2021 · 19 comments
Open

Random results from foliavalidator and folia2txt #100

kosloot opened this issue May 25, 2021 · 19 comments

Comments

@kosloot
Copy link
Collaborator

kosloot commented May 25, 2021

given this file:
buggy.txt

foliavalidator and folia2text give random outcome:

> foliavalidator buggy.txt 
Validated successfully: buggy.txt
> foliavalidator buggy.txt 
VALIDATION ERROR on full parse by library (stage 2/3), in buggy.txt
ParseError: FoLiA exception in handling of <p> @ line 35 (in parent <text> @ parent line 34) : [InconsistentText] Text for <Sentence at 139872518657136 id=KB.DDD17C.NicolineCrowd.73272.p.1.s.5 set=None class=None>, is inconsistent: EXPECTED (deep text after normalization) *****>
Den Primo Vizir heeft aen de Bassa, die in Ongaryen zijn, orde gesonden, om een Corps van de beste Troupen in Buda te brengen, en al de , welcke in die Stadt leggen daer uyt te trecken.
****> BUT FOUND (strict text after normalization) ****>
Den Primo Vizir heeft aen de Bassa, die in Ongaryen zijn, orde gesonden, om een Corps van de beste Troupen in Buda te brengen, en al de nieugeworvene, welcke in die Stadt leggen daer uyt te trecken.
******* DEVIATION POINT:  en al de <*HERE*>nieugeworv
(also checked against older rules prior to FoLiA v2.4.1)
@kosloot
Copy link
Collaborator Author

kosloot commented May 25, 2021

I can reproduce with a much smaller file too:
buggy2.txt

@proycon
Copy link
Owner

proycon commented May 27, 2021

I'm not sure whether the example file is a real-world example or just an example to illustrate this issue, but I see two problems:

  1. correction KB.DDD17C.NicolineCrowd.73272.p.1.s.1.correction.1 corrects only the N. It would make more sense to correct all the previous words. In the current way it is expressed, these words are not removed! (but they have no text representation in the TICCL class). If I fix this I get https://download.anaproy.nl/buggy3.xml which does pass the validation.
  2. Though allowed, it is a bit confusing to use the default current class when it is not in fact the 'current' most authoritative text representation. As I argued before, I'd recommend to make the ticcl class the current one (assuming this is the text representation that is deemed most current/correct and should be used as a basis for further processing), and assign something like an OCR or original class to the other text representation.

@proycon
Copy link
Owner

proycon commented May 27, 2021

I'm still a bit puzzled what caused the actual randomness though..

@proycon
Copy link
Owner

proycon commented May 27, 2021

Actually, I think I the randomness may have been caused by the unordered map from which text classes were read, that order was not guaranteed and behaviour differed based on which was checked first..

@kosloot
Copy link
Collaborator Author

kosloot commented May 27, 2021

Ok, buggy2.txt was just a made-up excerpt. And the use of class-names is questionable, but still: I think it is valid FoLiA, and foliavalidator should be deterministic

@proycon
Copy link
Owner

proycon commented May 27, 2021

Yes, the non-deterministic quirk is definitely not good. I did commit a fix for that.

@kosloot
Copy link
Collaborator Author

kosloot commented May 29, 2021

But the fix isn't released, so I wonder if that helps @martinreynaert using LaMachine?

@proycon
Copy link
Owner

proycon commented May 31, 2021

released now

@proycon proycon closed this as completed May 31, 2021
@kosloot
Copy link
Collaborator Author

kosloot commented Jun 1, 2021

I am quite sure this is NOT solved completely yet.
Given this file: incorrect.txt
I really think foliavalidator is wrong here:

foliavalidator incorrect.txt
VALIDATION ERROR on full parse by library (stage 2/3), in incorrect.txt
ParseError: FoLiA exception in handling of

@ line 35 (in parent @ parent line 34) : [InconsistentText] Text for <Sentence at 139728228001272 id=KB.DDD17C.NicolineCrowd.73272.p.1.s.3 set=None class=None>, is inconsistent: EXPECTED (deep text after normalization) *****>
Malvasie ,
****> BUT FOUND (strict text after normalization) ****>
Malvasie komende,
******* DEVIATION POINT: Malvasie <HERE>komende,
(also checked against older rules prior to FoLiA v2.4.1)

@kosloot kosloot reopened this Jun 1, 2021
@kosloot
Copy link
Collaborator Author

kosloot commented Jun 1, 2021

Addition:
I tried to rename the <new> node to <current>. Which I thought might solve the problem, but then:

> foliavalidator notsilly.xml 
VALIDATION ERROR on full parse by library (stage 2/3), in notsilly.xml

ParseError: FoLiA exception in handling of <correction> @ line 43 (in parent <s> @ parent line 36) : [Exception] Can't add 
Original item to Correction if there is a Current item

Which puzzles me too.

@proycon
Copy link
Owner

proycon commented Jun 2, 2021

I really think foliavalidator is wrong here

Agreed, the document looks okay, the validator seems to miss a word entirely when getting the deep text..

I tried to rename the node to . Which I thought might solve the problem, but then: Original item to Correction if there is a Current item

That's a legit error, Original can only come with New, not with Current (Current (and New) can come with Suggestion).

@kosloot
Copy link
Collaborator Author

kosloot commented Jun 2, 2021

That's a legit error, Original can only come with New, not with Current (Current (and New) can come with Suggestion).

Ok, I will add this limitation to libfolia too.

@proycon
Copy link
Owner

proycon commented Jun 3, 2021

There was already a failing test case I missed that covers this issue actually, the release was a bit too premature...

@kosloot
Copy link
Collaborator Author

kosloot commented Jun 3, 2021

Ok, I will add this limitation to libfolia too.

Done. Needed quite a bit of rework, unfortunately

@proycon
Copy link
Owner

proycon commented Jun 3, 2021

This should now be solved, the tests are all green again and the example you gave also validates. Has been released as foliapy v2.5.3.

@kosloot
Copy link
Collaborator Author

kosloot commented Jun 4, 2021

Still, I assume something is wrong. see:
special.txt

Here the (unlogical) correction from class="current" to class="Ticcl" is made.
I still assume the FoLiA is correct, but foliavalidator rejects it.
folialint accepts this construction.

@kosloot kosloot reopened this Jun 4, 2021
@proycon
Copy link
Owner

proycon commented Jun 4, 2021

yes, you're right, there's still some assumption that the "current" class is always the latest/most-current class. I'll see if we can get rid of that without breaking anything else.

@proycon proycon removed the PRIORITY label Jun 4, 2021
@kosloot
Copy link
Collaborator Author

kosloot commented Jun 4, 2021

I'm not sure about the Python implementation, but libfolia checks the text for every textclass in the document.
So in this case both the implicit "current" and "Ticcl".

@proycon
Copy link
Owner

proycon commented Jun 4, 2021

yeah, foliapy does the same

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants