do we need textclass attribute on structure nodes? #32

kosloot · 2017-08-22T07:47:01Z

Consider the following FoLiA fragment:

<s id="s1">
  <w id="w1" class="WORD-WITHSUFFIX">
   <t>Zo</t>
   <t textclass="other">Zo'n</t>
  </w>
  <w id="w2" class="WORD">
   <t>probleem</t>
   <t textclass="other">probleem</t>
  </w>
</s>

Clearly the sentence was tokenized by ucto on the 'other' textclass, but there is no way to express this.
A simple solution would be to allow for textclass on the word level:

<s id="s1">
  <w id="w1" class="WORD-WITHSUFFIX" textclass="other">
   <t>Zo</t>
   <t textclass="other">Zo'n</t>
  </w>
  <w id="w2" class="WORD">
   <t>probleem</t>
   <t textclass="other">probleem</t>
  </w>
</s>
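
To illustrate how software could consume such a word-level textclass, here is a minimal Python sketch. It is not part of any FoLiA library: it parses the fragment above with the standard `xml.etree.ElementTree`, ignores the FoLiA namespace for brevity, and assumes that a missing textclass means "current". The helper names `text_by_class` and `tokenisation_source` are hypothetical.

```python
import xml.etree.ElementTree as ET

# The proposed fragment from above (no namespace, as in this issue).
FRAGMENT = """<s id="s1">
  <w id="w1" class="WORD-WITHSUFFIX" textclass="other">
   <t>Zo</t>
   <t textclass="other">Zo'n</t>
  </w>
  <w id="w2" class="WORD">
   <t>probleem</t>
   <t textclass="other">probleem</t>
  </w>
</s>"""

# Assumption: a <t> or <w> without a textclass attribute is in class "current".
DEFAULT_CLASS = "current"


def text_by_class(word):
    """Map each text class found on a <w> to the text of its <t> child."""
    return {
        t.get("textclass", DEFAULT_CLASS): (t.text or "")
        for t in word.findall("t")
    }


def tokenisation_source(word):
    """Return (class, text) for the text class the word declares as its source."""
    cls = word.get("textclass", DEFAULT_CLASS)
    return cls, text_by_class(word).get(cls)


sentence = ET.fromstring(FRAGMENT)
report = {w.get("id"): tokenisation_source(w) for w in sentence.findall("w")}
for word_id, (cls, text) in report.items():
    print(word_id, cls, text)
```

With this reading, w1 reports the 'other' text while w2 silently falls back to 'current', which shows that the attribute would have to be set consistently to be useful.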

But this raises some questions about the 'orphaned' current text. Wouldn't it be better to connect it to a separate word? Like this:

<s id="s1">
  <w id="w1.1">
   <t>Zo</t>
  </w>
  <w id="w1" class="WORD-WITHSUFFIX" textclass="other">
   <t textclass="other">Zo'n</t>
  </w>
  <w id="w2.1">
   <t>probleem</t>
  </w>
  <w id="w2" class="WORD">
   <t textclass="other">probleem</t>
  </w>
</s>

This could also be raised to the sentence level then:

<s id="s1.1">
  <t>Zo probleem</t>
  <w id="w1.1">
   <t>Zo</t>
  </w>
  <w id="w2.1">
   <t>probleem</t>
  </w>
</s>
<s id="s1" textclass="other">
  <t textclass="other">Zo'n probleem</t>
  <w id="w1" class="WORD-WITHSUFFIX" textclass="other">
   <t textclass="other">Zo'n</t>
  </w>
  <w id="w2" class="WORD">
   <t textclass="other">probleem</t>
  </w>
</s>

This might be a solution for the problem of multiple/different tokenizations in one FoLiA document.
But again it raises questions:

Should we then disallow or discourage multiple <t> nodes per structure?
Would that make textclass on <t> nodes redundant, or implicit?

proycon · 2017-08-22T08:21:14Z

textclass on w seems a decent solution to make explicit which text was the source for tokenisation. I'm not sure it's even needed for other structural elements then.

Multiple mutually exclusive structural nodes would be quite a change and problematic, especially with regard to backward compatibility. It relates to the current limitation that FoLiA can't really deal with multiple tokenisations. Revising that would be a major operation (FoLiA v2.0?), and I'm not even sure this would be the way to go about it.

kosloot · 2017-08-23T08:09:50Z

I agree that this has a major impact, and needs more thought.
So let's introduce textclass on words only for now.

I would strongly suggest adding a constraint too:
IF a word has an explicit textclass attribute
THEN it may have 1 textcontent child only, of the same class.

This ensures that when software starts to use this new feature, it never constructs FoLiA that would need repair in the future.

For backward compatibility, we need to accept words without an explicit textclass and several textcontents in different classes. (the latter is enforced already)
A FoLiA 2.0 would need to repair these.
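
A minimal Python sketch of how such a check could look, under some assumptions: it uses the standard `xml.etree.ElementTree` rather than any FoLiA library, ignores namespaces, treats a `<t>` without textclass as class "current", and reads "1 textcontent child only" as "at most one". The function name `constraint_violations` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: a <t> without a textclass attribute is in class "current".
DEFAULT_CLASS = "current"


def constraint_violations(root):
    """Return (word id, message) pairs for every <w> that breaks the rule."""
    problems = []
    for w in root.iter("w"):
        declared = w.get("textclass")
        if declared is None:
            # Backward compatibility: no explicit textclass, nothing to check.
            continue
        classes = [t.get("textclass", DEFAULT_CLASS) for t in w.findall("t")]
        if len(classes) > 1:
            problems.append(
                (w.get("id"), f"{len(classes)} <t> children, expected 1")
            )
        elif classes and classes[0] != declared:
            problems.append(
                (w.get("id"), f"<t> class {classes[0]!r} differs from {declared!r}")
            )
    return problems


# A word with an explicit textclass and two <t> children in different classes.
MIXED = ET.fromstring("""<s id="s1">
  <w id="w1" class="WORD-WITHSUFFIX" textclass="other">
   <t>Zo</t>
   <t textclass="other">Zo'n</t>
  </w>
  <w id="w2" class="WORD">
   <t>probleem</t>
   <t textclass="other">probleem</t>
  </w>
</s>""")

# The same word split into one <w> per text class.
SPLIT = ET.fromstring("""<s id="s1">
  <w id="w1.1">
   <t>Zo</t>
  </w>
  <w id="w1" class="WORD-WITHSUFFIX" textclass="other">
   <t textclass="other">Zo'n</t>
  </w>
</s>""")

mixed_problems = constraint_violations(MIXED)
split_problems = constraint_violations(SPLIT)
print(mixed_problems)
print(split_problems)
```

Under these assumptions only w1 in the first fragment is flagged. w2 passes because it carries no explicit textclass, which is the backward-compatible case.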

kosloot · 2017-09-07T12:22:47Z

OK, the constraint sounds plausible BUT:

It would disallow the example we started with, which already exists.
If every <w> has at most 1 <t>, then what's the point of having textclass on the <w>?

So it will not work this way.

proycon self-assigned this Aug 22, 2017

proycon added the enhancement label Aug 22, 2017

proycon mentioned this issue Aug 25, 2017

Frog mblem crash: folia::ValueError: attribute 'class' is required for lemma (empty class passed) LanguageMachines/frog#38

Closed
