do we need textclass attribute on structure nodes? #32

Open
kosloot opened this issue Aug 22, 2017 · 3 comments
@kosloot
Collaborator

kosloot commented Aug 22, 2017

Consider the following FoLiA fragment:

<s id="s1">
  <w id="w1" class="WORD-WITHSUFFIX">
   <t>Zo</t>
   <t textclass="other">Zo'n</t>
  </w>
  <w id="w2" class="WORD">
   <t>probleem</t>
   <t textclass="other">probleem</t>
  </w>
</s>

Clearly the sentence was tokenized by ucto on the 'other' textclass, but there is no way to express this.
A simple solution would be to allow textclass on the word level:

<s id="s1">
  <w id="w1" class="WORD-WITHSUFFIX" textclass="other">
   <t>Zo</t>
   <t textclass="other">Zo'n</t>
  </w>
  <w id="w2" class="WORD">
   <t>probleem</t>
   <t textclass="other">probleem</t>
  </w>
</s>
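To make the intent concrete, here is a minimal sketch of how a reader could use a word-level textclass to pick the text that tokenization was based on. It treats the fragment as plain XML without the FoLiA namespace, the helper name `source_text` is hypothetical, and it assumes that a missing textclass means the default class 'current'.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<s id="s1">
  <w id="w1" class="WORD-WITHSUFFIX" textclass="other">
   <t>Zo</t>
   <t textclass="other">Zo'n</t>
  </w>
  <w id="w2" class="WORD">
   <t>probleem</t>
   <t textclass="other">probleem</t>
  </w>
</s>"""


def source_text(w):
    """Return the text of the <t> child whose class matches the word's textclass.

    Assumption: an absent textclass attribute stands for the class 'current',
    both on <w> and on <t>.
    """
    wanted = w.get("textclass", "current")
    for t in w.findall("t"):
        if t.get("textclass", "current") == wanted:
            return t.text
    return None  # no text content in the requested class


sentence = ET.fromstring(FRAGMENT)
tokens = {w.get("id"): source_text(w) for w in sentence.iter("w")}
print(tokens)
```

Note that w2 carries no textclass here, so this sketch falls back to its 'current' text; whether that is the desired reading is exactly the kind of question the proposal leaves open.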

But this raises some questions about the 'orphaned' current text. Wouldn't it be better to have it connected to another word? Like this:

<s id="s1">
  <w id="w1.1">
   <t>Zo</t>
  </w>
  <w id="w1" class="WORD-WITHSUFFIX" textclass="other">
   <t textclass="other">Zo'n</t>
  </w>
  <w id="w2.1">
   <t>probleem</t>
  </w>
  <w id="w2" class="WORD">
   <t textclass="other">probleem</t>
  </w>
</s>

This could also be raised to the sentence level then:

<s id="s1.1">
  <t>Zo probleem</t>
  <w id="w1.1">
   <t>Zo</t>
  </w>
  <w id="w2.1">
   <t>probleem</t>
  </w>
</s>
<s id="s1" textclass="other">
  <t textclass="other">Zo'n probleem</t>
  <w id="w1" class="WORD-WITHSUFFIX" textclass="other">
   <t textclass="other">Zo'n</t>
  </w>
  <w id="w2" class="WORD">
   <t textclass="other">probleem</t>
  </w>
</s>

This might be a solution for the problem of multiple/different tokenizations in one FoLiA document.
But again it raises questions:

  • Should we then disallow or discourage multiple <t> nodes per structure?
  • Would that make textclass on <t> nodes redundant, or implicit?
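As an illustration of the sentence-level variant, the following sketch reads each tokenization back by grouping the words of every `<s>` under that sentence's textclass. It is only a sketch under stated assumptions: the wrapping `<p>` element and the function name `tokenizations` are hypothetical, the fragment is parsed as plain XML without the FoLiA namespace, and a missing textclass is assumed to mean 'current'.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The <p> wrapper is an assumption, added only so the two <s> elements
# share a single root and the fragment parses as one XML document.
FRAGMENT = """<p>
<s id="s1.1">
  <t>Zo probleem</t>
  <w id="w1.1"><t>Zo</t></w>
  <w id="w2.1"><t>probleem</t></w>
</s>
<s id="s1" textclass="other">
  <t textclass="other">Zo'n probleem</t>
  <w id="w1" class="WORD-WITHSUFFIX" textclass="other"><t textclass="other">Zo'n</t></w>
  <w id="w2" class="WORD"><t textclass="other">probleem</t></w>
</s>
</p>"""


def tokenizations(root):
    """Map each sentence-level textclass to the list of its word texts."""
    result = defaultdict(list)
    for s in root.iter("s"):
        sclass = s.get("textclass", "current")  # assumed default class
        for w in s.findall("w"):
            t = w.find("t")  # each word has a single <t> in this variant
            if t is not None:
                result[sclass].append(t.text)
    return dict(result)


print(tokenizations(ET.fromstring(FRAGMENT)))
```

With one `<t>` per word, the textclass on the `<t>` nodes is never consulted here, which is the redundancy the second question points at.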

@proycon proycon self-assigned this Aug 22, 2017
@proycon
Owner

proycon commented Aug 22, 2017

A textclass attribute on w seems a decent way to make explicit which text was the source for tokenisation. I'm not sure it is even needed for other structural elements then.

Multiple mutually exclusive structural nodes would be quite a change and problematic, especially with regard to backward compatibility. It relates to the current limitation that FoLiA can't really deal with multiple tokenisations; revising that would be a major operation (FoLiA v2.0?) and I'm not even sure that this would be the way to go about it.

@kosloot
Collaborator Author

kosloot commented Aug 23, 2017

I agree that this has a major impact and needs more thought.
So let's introduce the textclass on words only for now.

I would strongly suggest adding a constraint too:
IF a word has an explicit textclass attribute
THEN it may have 1 textcontent child only, of the same class.

This ensures that when software starts to use this new feature, it never constructs FoLiA that would need repair in the future.

For backward compatibility, we need to accept words without an explicit textclass and with several textcontents in different classes (the latter is enforced already).
A FoLiA 2.0 would need to repair these.
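A rough sketch of how such a check could look, assuming plain XML without the FoLiA namespace, a hypothetical function name `violations`, and 'current' as the class of any `<t>` that has no textclass attribute:

```python
import xml.etree.ElementTree as ET


def violations(fragment):
    """Return the ids of <w> elements that break the proposed constraint.

    Constraint: IF a word has an explicit textclass THEN it has exactly one
    <t> child, and that child is in the same class.
    """
    root = ET.fromstring(fragment)
    bad = []
    for w in root.iter("w"):
        wclass = w.get("textclass")
        if wclass is None:
            continue  # no explicit textclass: the constraint does not apply
        # Assumption: a <t> without textclass is in the default class 'current'.
        classes = [t.get("textclass", "current") for t in w.findall("t")]
        if classes != [wclass]:
            bad.append(w.get("id"))
    return bad


# A word with an explicit textclass but two <t> children is rejected.
two_texts = """<s id="s1">
  <w id="w1" class="WORD-WITHSUFFIX" textclass="other">
   <t>Zo</t>
   <t textclass="other">Zo'n</t>
  </w>
</s>"""
print(violations(two_texts))
```

Words without an explicit textclass are skipped on purpose, so existing documents with several textcontents per word would still pass this check.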

@kosloot
Collaborator Author

kosloot commented Sep 7, 2017

OK, the constraint sounds plausible BUT:

  • it would disallow the example we started with, which already exists.
  • if every <w> has at most 1 <t>, then what's the point of having textclass on the <w>?

So it will not work this way.
