The code that assigns higher-level FoLiA structure tags to tokenized text from FoLiA documents is rather messy.
It tries to determine whether the 'root' element bearing the text is a structure element or not.
But this check is not exhaustive (recently we added Cell to the list).
A more generic solution would be preferable.
I tried such an approach, but that raises a question: do we always want to generate a Paragraph, even when only one sentence is present? With the more generic approach, a cell holding a single sentence would also get a paragraph.
This redundancy seems a bit of overkill, but now consider this example:
<?xml version="1.0" encoding="UTF-8"?>
<FoLiA generator="teiExtractText.pl" version="1.4" xml:id="doc" xmlns="http://ilk.uvt.nl/folia">
  <metadata>
    <annotations>
    </annotations>
  </metadata>
  <text xml:id="text">
    <div xml:id="div.1">
      <table xml:id="table.1">
        <row xml:id="row.1">
          <cell xml:id="cell.1">
            <t>Een lange zin. Gevolgde door nog een Zin. Dit is dus een paragraaf?</t>
          </cell>
        </row>
      </table>
    </div>
  </text>
</FoLiA>
After tokenization, I think the result for this example is wrong, or at least questionable.
A quick fix is 'easy': always add a paragraph level.
Alternatively, we could count sentences and leave the paragraph out when only one sentence is present.
That would require exceptions again, I guess for 'div' and 'text' nodes at least. Maybe 'head' and others too?
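To make the sentence-counting idea concrete, here is a minimal Python sketch. It is not the tokenizer's actual code. The names `ALWAYS_PARAGRAPH`, `needs_paragraph` and `attach_sentences` are hypothetical, and it assumes the exceptions mean that 'div' and 'text' always get a paragraph, even for a single sentence. Namespaces, xml:id values and word tokens are left out for brevity.

```python
import xml.etree.ElementTree as ET

# Assumption: containers that always get a paragraph level,
# even when they hold a single sentence. 'head' and others might belong here too.
ALWAYS_PARAGRAPH = {"text", "div"}


def needs_paragraph(parent_tag, sentence_count):
    """Decide whether the sentences under parent_tag should be wrapped in a <p>."""
    if sentence_count == 0:
        return False
    if parent_tag in ALWAYS_PARAGRAPH:
        return True
    # Everywhere else (e.g. a cell), only add a paragraph for multiple sentences.
    return sentence_count > 1


def attach_sentences(parent, sentences):
    """Attach each sentence as <s><t>...</t></s>, inserting a <p> level when the rule asks for it."""
    target = parent
    if needs_paragraph(parent.tag, len(sentences)):
        target = ET.SubElement(parent, "p")
    for sentence_text in sentences:
        s = ET.SubElement(target, "s")
        t = ET.SubElement(s, "t")
        t.text = sentence_text
    return parent


# The cell from the example above holds three sentences, so it gets a paragraph.
cell = attach_sentences(
    ET.Element("cell"),
    ["Een lange zin.", "Gevolgde door nog een Zin.", "Dit is dus een paragraaf?"],
)
print(ET.tostring(cell, encoding="unicode"))
```

With this rule a cell holding one sentence gets the sentence directly, while the three-sentence cell gets a paragraph around its sentences. The open question remains which element types belong in the exception set.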