assigning paragraphs to FoLiA structure elements, yes, no, maybe? #40

kosloot · 2017-12-05T14:30:10Z

The code that assigns higher-level FoLiA structure tags to tokenized text from FoLiA documents is rather messy.
It attempts to determine whether the 'root' element bearing the text is a structure element or not.
But this check is not exhaustive (we recently added Cell to the list).
A more generic solution would be preferable.
I tried such an approach, but it raises a question: do we always want to generate a Paragraph, even when only one sentence is present? That might be overkill.

Example:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA generator="teiExtractText.pl" version="1.4" xml:id="doc" xmlns="http://ilk.uvt.nl/folia">
  <metadata>
    <annotations>
    </annotations>
  </metadata>
  <text xml:id="text">
    <div xml:id="div.1">
      <table xml:id="table.1">
        <row xml:id="row.1">
          <cell xml:id="cell.1">
            <t>Word one</t>
          </cell>
        </row>
      </table>
    </div>
  </text>
</FoLiA>

The current implementation generates the following tokenization:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="doc" generator="libfolia-v1.11" version="1.4">
  <metadata type="native">
    <annotations>
      <token-annotation annotator="ucto" annotatortype="auto" datetime="2017-12-05T15:14:58" set="tokconfig-nld"/>
    </annotations>
    <meta id="language">nld</meta>
  </metadata>
  <text xml:id="text">
    <div xml:id="div.1">
      <table xml:id="table.1">
        <row xml:id="row.1">
          <cell xml:id="cell.1">
            <t>Word one</t>
            <s xml:id="cell.1.s.1">
              <w xml:id="cell.1.s.1.w.1" class="WORD">
                <t>Word</t>
              </w>
              <w xml:id="cell.1.s.1.w.2" class="WORD">
                <t>one</t>
              </w>
            </s>
          </cell>
        </row>
      </table>
    </div>
  </text>
</FoLiA>

With a more generic approach, the cell would also get a paragraph:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="doc" generator="libfolia-v1.11" version="1.4">
  <metadata type="native">
    <annotations>
      <token-annotation annotator="ucto" annotatortype="auto" datetime="2017-12-05T15:17:01" set="tokconfig-nld"/>
    </annotations>
    <meta id="language">nld</meta>
  </metadata>
  <text xml:id="text">
    <div xml:id="div.1">
      <table xml:id="table.1">
        <row xml:id="row.1">
          <cell xml:id="cell.1">
            <t>Word one</t>
            <p xml:id="cell.1.p.1">
              <s xml:id="cell.1.s.1">
                <w xml:id="cell.1.s.1.w.1" class="WORD">
                  <t>Word</t>
                </w>
                <w xml:id="cell.1.s.1.w.2" class="WORD">
                  <t>one</t>
                </w>
              </s>
            </p>
          </cell>
        </row>
      </table>
    </div>
  </text>
</FoLiA>

The extra paragraph level seems redundant here, but now consider this example:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA generator="teiExtractText.pl" version="1.4" xml:id="doc" xmlns="http://ilk.uvt.nl/folia">
  <metadata>
    <annotations>
    </annotations>
  </metadata>
  <text xml:id="text">
    <div xml:id="div.1">
      <table xml:id="table.1">
        <row xml:id="row.1">
          <cell xml:id="cell.1">
            <t>Een lange zin. Gevolgde door nog een Zin. Dit is dus een paragraaf?</t>
          </cell>
        </row>
      </table>
    </div>
  </text>
</FoLiA>

After tokenization we get:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="doc" generator="libfolia-v1.11" version="1.4">
  <metadata type="native">
    <annotations>
      <token-annotation annotator="ucto" annotatortype="auto" datetime="2017-12-05T15:21:48" set="tokconfig-nld"/>
    </annotations>
    <meta id="language">nld</meta>
  </metadata>
  <text xml:id="text">
    <div xml:id="div.1">
      <table xml:id="table.1">
        <row xml:id="row.1">
          <cell xml:id="cell.1">
            <t>Een lange zin. Gevolgde door nog een Zin. Dit is dus een paragraaf?</t>
            <s xml:id="cell.1.s.1">
              <w xml:id="cell.1.s.1.w.1" class="WORD">
                <t>Een</t>
              </w>
              <w xml:id="cell.1.s.1.w.2" class="WORD">
                <t>lange</t>
              </w>
              <w xml:id="cell.1.s.1.w.3" class="WORD" space="no">
                <t>zin</t>
              </w>
              <w xml:id="cell.1.s.1.w.4" class="PUNCTUATION">
                <t>.</t>
              </w>
            </s>
            <s xml:id="cell.1.s.2">
              <w xml:id="cell.1.s.2.w.1" class="WORD">
                <t>Gevolgde</t>
              </w>
              <w xml:id="cell.1.s.2.w.2" class="WORD">
                <t>door</t>
              </w>
              <w xml:id="cell.1.s.2.w.3" class="WORD">
                <t>nog</t>
              </w>
              <w xml:id="cell.1.s.2.w.4" class="WORD">
                <t>een</t>
              </w>
              <w xml:id="cell.1.s.2.w.5" class="WORD" space="no">
                <t>Zin</t>
              </w>
              <w xml:id="cell.1.s.2.w.6" class="PUNCTUATION">
                <t>.</t>
              </w>
            </s>
            <s xml:id="cell.1.s.3">
              <w xml:id="cell.1.s.3.w.1" class="WORD">
                <t>Dit</t>
              </w>
              <w xml:id="cell.1.s.3.w.2" class="WORD">
                <t>is</t>
              </w>
              <w xml:id="cell.1.s.3.w.3" class="WORD">
                <t>dus</t>
              </w>
              <w xml:id="cell.1.s.3.w.4" class="WORD">
                <t>een</t>
              </w>
              <w xml:id="cell.1.s.3.w.5" class="WORD" space="no">
                <t>paragraaf</t>
              </w>
              <w xml:id="cell.1.s.3.w.6" class="PUNCTUATION">
                <t>?</t>
              </w>
            </s>
          </cell>
        </row>
      </table>
    </div>
  </text>
</FoLiA>

I think this is WRONG, or at least questionable.
Shouldn't it rather be:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="doc" generator="libfolia-v1.11" version="1.4">
  <metadata type="native">
    <annotations>
      <token-annotation annotator="ucto" annotatortype="auto" datetime="2017-12-05T15:21:48" set="tokconfig-nld"/>
    </annotations>
    <meta id="language">nld</meta>
  </metadata>
  <text xml:id="text">
    <div xml:id="div.1">
      <table xml:id="table.1">
        <row xml:id="row.1">
          <cell xml:id="cell.1">
            <t>Een lange zin. Gevolgde door nog een Zin. Dit is dus een paragraaf?</t>
            <p xml:id="cell.1.p.1">
              <s xml:id="cell.1.s.1">
                <w xml:id="cell.1.s.1.w.1" class="WORD">
                  <t>Een</t>
                </w>
                <w xml:id="cell.1.s.1.w.2" class="WORD">
                  <t>lange</t>
                </w>
                <w xml:id="cell.1.s.1.w.3" class="WORD" space="no">
                  <t>zin</t>
                </w>
                <w xml:id="cell.1.s.1.w.4" class="PUNCTUATION">
                  <t>.</t>
                </w>
              </s>
              <s xml:id="cell.1.s.2">
                <w xml:id="cell.1.s.2.w.1" class="WORD">
                  <t>Gevolgde</t>
                </w>
                <w xml:id="cell.1.s.2.w.2" class="WORD">
                  <t>door</t>
                </w>
                <w xml:id="cell.1.s.2.w.3" class="WORD">
                  <t>nog</t>
                </w>
                <w xml:id="cell.1.s.2.w.4" class="WORD">
                  <t>een</t>
                </w>
                <w xml:id="cell.1.s.2.w.5" class="WORD" space="no">
                  <t>Zin</t>
                </w>
                <w xml:id="cell.1.s.2.w.6" class="PUNCTUATION">
                  <t>.</t>
                </w>
              </s>
              <s xml:id="cell.1.s.3">
                <w xml:id="cell.1.s.3.w.1" class="WORD">
                  <t>Dit</t>
                </w>
                <w xml:id="cell.1.s.3.w.2" class="WORD">
                  <t>is</t>
                </w>
                <w xml:id="cell.1.s.3.w.3" class="WORD">
                  <t>dus</t>
                </w>
                <w xml:id="cell.1.s.3.w.4" class="WORD">
                  <t>een</t>
                </w>
                <w xml:id="cell.1.s.3.w.5" class="WORD" space="no">
                  <t>paragraaf</t>
                </w>
                <w xml:id="cell.1.s.3.w.6" class="PUNCTUATION">
                  <t>?</t>
                </w>
              </s>
            </p>
          </cell>
        </row>
      </table>
    </div>
  </text>
</FoLiA>

A quick fix is 'easy': always add a paragraph level.
Alternatively, we could count the sentences and leave the paragraph out when only one sentence is present.
That would require exceptions again, I guess for 'div' and 'text' nodes at least. Maybe 'head' and others too?
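
To make the counting idea concrete, here is a minimal sketch in Python using only the standard library. It is not ucto or libfolia code: the helper names (`local_name`, `wrap_sentences`) and the `ALWAYS_WRAP` set are hypothetical, and treating 'div' and 'text' as parents that always get a paragraph is only one possible reading of the exceptions mentioned above. The paragraph id simply follows the `cell.1.p.1` pattern from the examples.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Assumption: these parents always get a paragraph, even for a single sentence.
ALWAYS_WRAP = frozenset({"div", "text"})


def local_name(elem):
    """Return the tag name without its namespace part."""
    return elem.tag.rsplit("}", 1)[-1]


def wrap_sentences(parent, always_wrap=ALWAYS_WRAP):
    """Wrap the direct <s> children of parent in one <p> when warranted.

    Returns the new <p> element, or None if nothing was wrapped.
    """
    if local_name(parent) == "p":
        return None  # already a paragraph, nothing to add
    sentences = [child for child in parent if local_name(child) == "s"]
    if not sentences:
        return None
    if len(sentences) == 1 and local_name(parent) not in always_wrap:
        return None  # a lone sentence stays directly under its parent
    para = ET.Element(f"{{{FOLIA_NS}}}p")
    parent_id = parent.get(XML_ID)
    if parent_id:
        para.set(XML_ID, f"{parent_id}.p.1")
    # Remember where the first sentence was, then move all sentences into <p>.
    index = list(parent).index(sentences[0])
    for sentence in sentences:
        parent.remove(sentence)
        para.append(sentence)
    parent.insert(index, para)
    return para


# Usage: a cell with two sentences gets a paragraph.
cell = ET.fromstring(
    f'<cell xmlns="{FOLIA_NS}" xml:id="cell.1">'
    '<t>A. B.</t><s xml:id="cell.1.s.1"/><s xml:id="cell.1.s.2"/></cell>'
)
wrap_sentences(cell)
print([local_name(child) for child in cell])  # prints ['t', 'p']
```

The open question stays the same: every tag added to `ALWAYS_WRAP` (or to a similar "never wrap" set for something like 'head') is another special case to maintain.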

kosloot · 2019-11-18T11:57:20Z

Since we reworked ucto completely (using FoLiA Engine), this is now solved differently.

kosloot added enhancement question labels Dec 5, 2017

kosloot assigned proycon and kosloot Dec 5, 2017

kosloot mentioned this issue Dec 8, 2017

revise class hierarchy considering paragraphs and sentences proycon/folia#42

Closed

kosloot closed this as completed Nov 18, 2019
