assigning paragraphs to FoLiA structure elements, yes, no, maybe? #40

Closed
kosloot opened this issue Dec 5, 2017 · 1 comment

kosloot commented Dec 5, 2017

The code that assigns higher-level FoLiA structure elements to tokenized text from FoLiA documents is rather messy.
It tries to determine whether the 'root' element bearing the text is a structure element or not.
But that check is not exhaustive (we recently added Cell to the list).
A more generic solution would be preferable.
I tried such an approach, but it raises a question: do we always want to generate a Paragraph, even when only one sentence is present? That might be overkill.

Example:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA generator="teiExtractText.pl" version="1.4" xml:id="doc" xmlns="http://ilk.uvt.nl/folia">
  <metadata>
    <annotations>
    </annotations>
  </metadata>
  <text xml:id="text">
    <div xml:id="div.1">
      <table xml:id="table.1">
        <row xml:id="row.1">
          <cell xml:id="cell.1">
            <t>Word one</t>
          </cell>
        </row>
      </table>
    </div>
  </text>
</FoLiA>

The current implementation generates the following tokenization:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="doc" generator="libfolia-v1.11" version="1.4">
  <metadata type="native">
    <annotations>
      <token-annotation annotator="ucto" annotatortype="auto" datetime="2017-12-05T15:14:58" set="tokconfig-nld"/>
    </annotations>
    <meta id="language">nld</meta>
  </metadata>
  <text xml:id="text">
    <div xml:id="div.1">
      <table xml:id="table.1">
        <row xml:id="row.1">
          <cell xml:id="cell.1">
            <t>Word one</t>
            <s xml:id="cell.1.s.1">
              <w xml:id="cell.1.s.1.w.1" class="WORD">
                <t>Word</t>
              </w>
              <w xml:id="cell.1.s.1.w.2" class="WORD">
                <t>one</t>
              </w>
            </s>
          </cell>
        </row>
      </table>
    </div>
  </text>
</FoLiA>

With a more generic approach, the cell would also get a paragraph:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="doc" generator="libfolia-v1.11" version="1.4">
  <metadata type="native">
    <annotations>
      <token-annotation annotator="ucto" annotatortype="auto" datetime="2017-12-05T15:17:01" set="tokconfig-nld"/>
    </annotations>
    <meta id="language">nld</meta>
  </metadata>
  <text xml:id="text">
    <div xml:id="div.1">
      <table xml:id="table.1">
        <row xml:id="row.1">
          <cell xml:id="cell.1">
            <t>Word one</t>
            <p xml:id="cell.1.p.1">
              <s xml:id="cell.1.s.1">
                <w xml:id="cell.1.s.1.w.1" class="WORD">
                  <t>Word</t>
                </w>
                <w xml:id="cell.1.s.1.w.2" class="WORD">
                  <t>one</t>
                </w>
              </s>
            </p>
          </cell>
        </row>
      </table>
    </div>
  </text>
</FoLiA>

This redundancy seems a bit of overkill, but now consider this example:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA generator="teiExtractText.pl" version="1.4" xml:id="doc" xmlns="http://ilk.uvt.nl/folia">
  <metadata>
    <annotations>
    </annotations>
  </metadata>
  <text xml:id="text">
    <div xml:id="div.1">
      <table xml:id="table.1">
        <row xml:id="row.1">
          <cell xml:id="cell.1">
            <t>Een lange zin. Gevolgde door nog een Zin. Dit is dus een paragraaf?</t>
          </cell>
        </row>
      </table>
    </div>
  </text>
</FoLiA>

After tokenization we get:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="doc" generator="libfolia-v1.11" version="1.4">
  <metadata type="native">
    <annotations>
      <token-annotation annotator="ucto" annotatortype="auto" datetime="2017-12-05T15:21:48" set="tokconfig-nld"/>
    </annotations>
    <meta id="language">nld</meta>
  </metadata>
  <text xml:id="text">
    <div xml:id="div.1">
      <table xml:id="table.1">
        <row xml:id="row.1">
          <cell xml:id="cell.1">
            <t>Een lange zin. Gevolgde door nog een Zin. Dit is dus een paragraaf?</t>
            <s xml:id="cell.1.s.1">
              <w xml:id="cell.1.s.1.w.1" class="WORD">
                <t>Een</t>
              </w>
              <w xml:id="cell.1.s.1.w.2" class="WORD">
                <t>lange</t>
              </w>
              <w xml:id="cell.1.s.1.w.3" class="WORD" space="no">
                <t>zin</t>
              </w>
              <w xml:id="cell.1.s.1.w.4" class="PUNCTUATION">
                <t>.</t>
              </w>
            </s>
            <s xml:id="cell.1.s.2">
              <w xml:id="cell.1.s.2.w.1" class="WORD">
                <t>Gevolgde</t>
              </w>
              <w xml:id="cell.1.s.2.w.2" class="WORD">
                <t>door</t>
              </w>
              <w xml:id="cell.1.s.2.w.3" class="WORD">
                <t>nog</t>
              </w>
              <w xml:id="cell.1.s.2.w.4" class="WORD">
                <t>een</t>
              </w>
              <w xml:id="cell.1.s.2.w.5" class="WORD" space="no">
                <t>Zin</t>
              </w>
              <w xml:id="cell.1.s.2.w.6" class="PUNCTUATION">
                <t>.</t>
              </w>
            </s>
            <s xml:id="cell.1.s.3">
              <w xml:id="cell.1.s.3.w.1" class="WORD">
                <t>Dit</t>
              </w>
              <w xml:id="cell.1.s.3.w.2" class="WORD">
                <t>is</t>
              </w>
              <w xml:id="cell.1.s.3.w.3" class="WORD">
                <t>dus</t>
              </w>
              <w xml:id="cell.1.s.3.w.4" class="WORD">
                <t>een</t>
              </w>
              <w xml:id="cell.1.s.3.w.5" class="WORD" space="no">
                <t>paragraaf</t>
              </w>
              <w xml:id="cell.1.s.3.w.6" class="PUNCTUATION">
                <t>?</t>
              </w>
            </s>
          </cell>
        </row>
      </table>
    </div>
  </text>
</FoLiA>

And I think this is WRONG, or at least questionable.
Shouldn't it rather be:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="doc" generator="libfolia-v1.11" version="1.4">
  <metadata type="native">
    <annotations>
      <token-annotation annotator="ucto" annotatortype="auto" datetime="2017-12-05T15:21:48" set="tokconfig-nld"/>
    </annotations>
    <meta id="language">nld</meta>
  </metadata>
  <text xml:id="text">
    <div xml:id="div.1">
      <table xml:id="table.1">
        <row xml:id="row.1">
          <cell xml:id="cell.1">
            <t>Een lange zin. Gevolgde door nog een Zin. Dit is dus een paragraaf?</t>
            <p xml:id="cell.1.p.1">
              <s xml:id="cell.1.s.1">
                <w xml:id="cell.1.s.1.w.1" class="WORD">
                  <t>Een</t>
                </w>
                <w xml:id="cell.1.s.1.w.2" class="WORD">
                  <t>lange</t>
                </w>
                <w xml:id="cell.1.s.1.w.3" class="WORD" space="no">
                  <t>zin</t>
                </w>
                <w xml:id="cell.1.s.1.w.4" class="PUNCTUATION">
                  <t>.</t>
                </w>
              </s>
              <s xml:id="cell.1.s.2">
                <w xml:id="cell.1.s.2.w.1" class="WORD">
                  <t>Gevolgde</t>
                </w>
                <w xml:id="cell.1.s.2.w.2" class="WORD">
                  <t>door</t>
                </w>
                <w xml:id="cell.1.s.2.w.3" class="WORD">
                  <t>nog</t>
                </w>
                <w xml:id="cell.1.s.2.w.4" class="WORD">
                  <t>een</t>
                </w>
                <w xml:id="cell.1.s.2.w.5" class="WORD" space="no">
                  <t>Zin</t>
                </w>
                <w xml:id="cell.1.s.2.w.6" class="PUNCTUATION">
                  <t>.</t>
                </w>
              </s>
              <s xml:id="cell.1.s.3">
                <w xml:id="cell.1.s.3.w.1" class="WORD">
                  <t>Dit</t>
                </w>
                <w xml:id="cell.1.s.3.w.2" class="WORD">
                  <t>is</t>
                </w>
                <w xml:id="cell.1.s.3.w.3" class="WORD">
                  <t>dus</t>
                </w>
                <w xml:id="cell.1.s.3.w.4" class="WORD">
                  <t>een</t>
                </w>
                <w xml:id="cell.1.s.3.w.5" class="WORD" space="no">
                  <t>paragraaf</t>
                </w>
                <w xml:id="cell.1.s.3.w.6" class="PUNCTUATION">
                  <t>?</t>
                </w>
              </s>
            </p>
          </cell>
        </row>
      </table>
    </div>
  </text>
</FoLiA>

A quick fix is 'easy': always add a paragraph level.
Alternatively, we could 'count' the sentences and leave the paragraph out when only one sentence is present.
That would require exceptions again, I guess for 'div' and 'text' nodes at least. Maybe 'head' and others too?
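
To make the counting rule concrete, here is a minimal sketch in Python using only the standard library. It is not ucto's implementation. The function name `wrap_sentences` and the `ALWAYS_WRAP` set are hypothetical, and treating 'div' and 'text' as nodes that always get a paragraph is just one possible reading of the exceptions mentioned above. Sentence ids are left unchanged, as in the examples.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Assumption: parents that always get a <p>, even around a single sentence.
ALWAYS_WRAP = {"div", "text"}


def wrap_sentences(parent, always_wrap=ALWAYS_WRAP):
    """Wrap the direct <s> children of `parent` in one <p>.

    Returns the new <p> element, or None when nothing was wrapped
    (no sentences, or a single sentence in a parent that is not
    listed in `always_wrap`).
    """
    sentence_tag = f"{{{FOLIA_NS}}}s"
    sentences = [child for child in parent if child.tag == sentence_tag]
    if not sentences:
        return None

    # Strip the namespace part to get the local element name.
    local_name = parent.tag.split("}", 1)[-1]
    if len(sentences) == 1 and local_name not in always_wrap:
        return None  # a paragraph around one sentence would be redundant

    # The paragraph takes the position of the first sentence.
    position = list(parent).index(sentences[0])
    paragraph = ET.Element(f"{{{FOLIA_NS}}}p")
    parent_id = parent.get(XML_ID)
    if parent_id:
        paragraph.set(XML_ID, f"{parent_id}.p.1")

    for sentence in sentences:
        parent.remove(sentence)
        paragraph.append(sentence)
    parent.insert(position, paragraph)
    return paragraph
```

Called on the cell of the one-sentence example, this leaves the cell untouched. Called on the cell of the three-sentence example, it moves the three sentences into a new paragraph with id `cell.1.p.1`. Whether 'head' or other elements belong in the exception set is exactly the open question.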


kosloot commented Nov 18, 2019

Since we reworked ucto completely (using the FoLiA Engine), this is now solved differently.

@kosloot kosloot closed this as completed Nov 18, 2019