Commit

some corrections in metadata
anne17 committed Nov 22, 2024
1 parent 9ee55c8 commit 2df648a
Showing 4 changed files with 93 additions and 30 deletions.
14 changes: 14 additions & 0 deletions sparv/modules/hunpos/metadata.yaml
@@ -102,6 +102,13 @@ example_output: |-
<token pos="PN">sig</token>
<token pos="MAD">.</token>
```
example_extra: |-
In order to use this annotation you need to add the following setting to your Sparv corpus configuration file:
```yaml
metadata:
language: swe
variety: "1800"
```
model: |-
- [suc3_suc-tags_default-setting_utf8.model](https://github.com/spraakbanken/sparv-models/blob/master/hunpos/suc3_suc-tags_default-setting_utf8.model?raw=true)
- a word list along with the words' morphosyntactic information generated from the [Dalin
@@ -145,6 +152,13 @@ example_output: |-
<token msd="PN.UTR+NEU.SIN+PLU.DEF.OBJ">sig</token>
<token msd="MAD">.</token>
```
example_extra: |-
In order to use this annotation you need to add the following setting to your Sparv corpus configuration file:
```yaml
metadata:
language: swe
variety: "1800"
```
model: |-
- [suc3_suc-tags_default-setting_utf8.model](https://github.com/spraakbanken/sparv-models/blob/master/hunpos/suc3_suc-tags_default-setting_utf8.model?raw=true)
- a word list along with the words' morphosyntactic information generated from the [Dalin
56 changes: 28 additions & 28 deletions sparv/modules/lexical_classes/metadata.yaml
@@ -68,11 +68,11 @@ created: 2017-09-05
id: swe-lexical_classes_text-sparv-blingbring
parent: blingbring-parent
name:
swe: Lexikala klasser från Blingbring, dokumentnivå
eng: Lexical classes from Blingbring, document-level
swe: Lexikala klasser från Blingbring, textnivå
eng: Lexical classes from Blingbring, text-level
short_description:
swe: Lexikala klasser från Blingbring på dokumentnivå
eng: Lexical classes from Blingbring on document-level
swe: Lexikala klasser från Blingbring på textnivå
eng: Lexical classes from Blingbring on text-level
annotations:
- <text>:lexical_classes.blingbring
example_output: |-
@@ -97,26 +97,26 @@ example_output: |-
```
description:
swe: |-
Token slås upp i Blingbring för att berikas med information om sina lexikala klasser. Därefter berikas dokument med
Token slås upp i Blingbring för att berikas med information om sina lexikala klasser. Därefter berikas texter med
information om lexikala klasser baserat på vilka klasser som är relevanta för token i dem.
[Blingbring-frekvensmodellen](https://github.com/spraakbanken/sparv-models/blob/master/lexical_classes/blingbring.freq.gp2008%2Bsuc3%2Bromi.pickle)
(tränad på [Göteborgsposten 2008](https://spraakbanken.gu.se/resurser/gp2008), [SUC
3.0](https://spraakbanken.gu.se/resurser/suc3) och [Bonniersromaner I
(1976–77)](https://spraakbanken.gu.se/resurser/romi)) används som referens för att rangordna de Blingbring-klasser
som förekommer i varje dokument. Med hjälp av informationen om lexikala klasser på tokennivå beräknar och tilldelar
modellen de mest relevanta klasserna för varje dokument. Dessa klasser filtreras och rangordnas baserat på sin
som förekommer i varje text. Med hjälp av informationen om lexikala klasser på tokennivå beräknar och tilldelar
modellen de mest relevanta klasserna för varje text. Dessa klasser filtreras och rangordnas baserat på sin
frekvens och dominans jämfört med referensmaterialet.
Dominans avser i detta fallet den relativa betydelsen för en lexikal klass i ett givet dokument jämfört med ett
referensmaterial. Dominansen beräknas genom att jämföra den observerade frekvensen av en lexikal klass i dokumentet
Dominans avser i detta fallet den relativa betydelsen för en lexikal klass i en given text jämfört med ett
referensmaterial. Dominansen beräknas genom att jämföra den observerade frekvensen av en lexikal klass i texten
med dess förväntade (relativa) frekvens i referensmaterialet.
Blingbring (version 0.2) bygger på innehållet i Brings Svenskt ordförråd ordnat i begreppsklasser (1930). Ingångarna
i Blingbring har försetts med motsvarande SALDO-ordbetydelser. I föreliggande version är ordbetydelselänkarna ibland
flertydiga, något som kommer att åtgärdas i framtida versioner.
eng: |-
Tokens are looked up in Blingbring in order to enrich them with information about their lexical classes. Documents
Tokens are looked up in Blingbring in order to enrich them with information about their lexical classes. Texts
are then enriched with information about lexical classes based on which classes are relevant for the tokens within
them.
@@ -125,12 +125,12 @@ description:
(trained on [Göteborgsposten 2008](https://spraakbanken.gu.se/resurser/gp2008), [SUC
3.0](https://spraakbanken.gu.se/resurser/suc3) and [Bonniersromaner I
(1976–77)](https://spraakbanken.gu.se/resurser/romi)) is used as reference for ranking the Blingbring classes
occurring in each document. Using token-level lexical class information, it calculates and assigns the most relevant
classes for each document. These classes are filtered and ranked based on their frequency and dominance compared to
occurring in each text. Using token-level lexical class information, it calculates and assigns the most relevant
classes for each text. These classes are filtered and ranked based on their frequency and dominance compared to
the reference material.
Dominance refers to the relative importance or prominence of a lexical class in a given document compared to a
reference material. Dominance is derived by comparing the observed frequency of a lexical class in the document to
Dominance refers to the relative importance or prominence of a lexical class in a given text compared to a
reference material. Dominance is derived by comparing the observed frequency of a lexical class in the text to
its expected (relative) frequency in the reference material.
Blingbring (version 0.2) is based on the content of Bring's Svenskt ordförråd ordnat i begreppsklasser [The Swedish
@@ -189,11 +189,11 @@ created: 2017-09-21
id: swe-lexical_classes_text-sparv-swefn
parent: swefn-parent
name:
swe: Lexikala klasser från SweFN, dokumentnivå
eng: Lexical classes from SweFN, document-level
swe: Lexikala klasser från SweFN, textnivå
eng: Lexical classes from SweFN, text-level
short_description:
swe: Lexikala klasser från SweFN på dokumentnivå
eng: Lexical classes from SweFN on document-level
swe: Lexikala klasser från SweFN på textnivå
eng: Lexical classes from SweFN on text-level
annotations:
- <text>:lexical_classes.swefn
example_output: |-
@@ -220,35 +220,35 @@ description:
swe: |-
Token slås upp i [Svenskt frasnät](https://spraakbanken.gu.se/resurser/swefn) (SweFN, en lexikal-semantisk resurs
som är baserad på teorin om ramsemantik) för att berikas med information om sina lexikala klasser. Därefter berikas
dokument med information om lexikala klasser baserat på vilka klasser som är relevanta för token i dem.
texter med information om lexikala klasser baserat på vilka klasser som är relevanta för token i dem.
[SweFN-frekvensmodellen](https://github.com/spraakbanken/sparv-models/blob/master/lexical_classes/swefn.freq.gp2008%2Bsuc3%2Bromi.pickle)
(tränad på [Göteborgsposten 2008](https://spraakbanken.gu.se/resurser/gp2008), [SUC
3.0](https://spraakbanken.gu.se/resurser/suc3) och [Bonniersromaner I
(1976–77)](https://spraakbanken.gu.se/resurser/romi)) används som referens för att rangordna de SweFN-klasser som
förekommer i varje dokument. Med hjälp av informationen om lexikala klasser på tokennivå beräknar och tilldelar
modellen de mest relevanta klasserna för varje dokument. Dessa klasser filtreras och rangordnas baserat på sin
förekommer i varje text. Med hjälp av informationen om lexikala klasser på tokennivå beräknar och tilldelar
modellen de mest relevanta klasserna för varje text. Dessa klasser filtreras och rangordnas baserat på sin
frekvens och dominans jämfört med referensmaterialet.
Dominans avser i detta fallet den relativa betydelsen för en lexikal klass i ett givet dokument jämfört med ett
referensmaterial. Dominansen beräknas genom att jämföra den observerade frekvensen av en lexikal klass i dokumentet
Dominans avser i detta fallet den relativa betydelsen för en lexikal klass i en given text jämfört med ett
referensmaterial. Dominansen beräknas genom att jämföra den observerade frekvensen av en lexikal klass i texten
med dess förväntade (relativa) frekvens i referensmaterialet.
eng: |-
Tokens are looked up in [Swedish FrameNet](https://spraakbanken.gu.se/en/resources/swefn) (SweFN, lexical-semantic
resource that follows the theory of Frame Semantics) in order to enrich them with information about their lexical
classes. Documents are then enriched with information about lexical classes based on which classes are relevant for
classes. Texts are then enriched with information about lexical classes based on which classes are relevant for
the tokens within them.
The [SweFN frequency
model](https://github.com/spraakbanken/sparv-models/blob/master/lexical_classes/swefn.freq.gp2008%2Bsuc3%2Bromi.pickle)
(trained on [Göteborgsposten 2008](https://spraakbanken.gu.se/resurser/gp2008), [SUC
3.0](https://spraakbanken.gu.se/resurser/suc3) and [Bonniersromaner I
(1976–77)](https://spraakbanken.gu.se/resurser/romi)) is used as reference for ranking the SweFN classes occurring
in each document. Using token-level lexical class information, it calculates and assigns the most relevant classes
for each document. These classes are filtered and ranked based on their frequency and dominance compared to the
in each text. Using token-level lexical class information, it calculates and assigns the most relevant classes
for each text. These classes are filtered and ranked based on their frequency and dominance compared to the
reference material.
Dominance refers to the relative importance or prominence of a lexical class in a given document compared to a
reference material. Dominance is derived by comparing the observed frequency of a lexical class in the document to
Dominance refers to the relative importance or prominence of a lexical class in a given text compared to a
reference material. Dominance is derived by comparing the observed frequency of a lexical class in the text to
its expected (relative) frequency in the reference material.
created: 2017-09-21
2 changes: 0 additions & 2 deletions sparv/modules/segment/metadata.yaml
@@ -399,8 +399,6 @@ example_output: |-
```
model: "[punkt-nltk-svenska.pickle](https://github.com/spraakbanken/sparv-models/blob/master/segment/punkt-nltk-svenska.pickle?raw=true)"
trained_on: "[StorSUC](https://spraakbanken.gu.se/resurser/storsuc)"
tagset: ''
evaluation_results: ''
description:
swe: |-
Meningssegmenteraren är baserad på NLTKs
51 changes: 51 additions & 0 deletions sparv/modules/stanza/metadata.yaml
@@ -224,6 +224,12 @@ example_output: |-
<token pos="NN">corpus</token>
<token pos=".">.</token>
```
example_extra: |-
In order to use this annotation you need to add the following setting to your Sparv corpus configuration file:
```yaml
metadata:
language: eng
```
---
id: eng-sentence-stanza
parent: stanza-parent-eng
@@ -260,6 +266,12 @@ example_output: |-
<token>.</token>
</sentence>
```
example_extra: |-
In order to use this annotation you need to add the following setting to your Sparv corpus configuration file:
```yaml
metadata:
language: eng
```
---
id: eng-tokenization-stanza
parent: stanza-parent-eng
@@ -280,6 +292,15 @@ example_output: |-
<token>corpus</token>
<token>.</token>
```
example_extra: |-
In order to use this annotation you need to add the following settings to your Sparv corpus configuration file:
```yaml
metadata:
language: eng
classes:
token: stanza.token
```
---
id: eng-lemmatization-stanza
parent: stanza-parent-eng
@@ -303,6 +324,12 @@ example_output: |-
<token baseform="word">words</token>
<token baseform=".">.</token>
```
example_extra: |-
In order to use this annotation you need to add the following setting to your Sparv corpus configuration file:
```yaml
metadata:
language: eng
```
---
id: eng-dependency-stanza
parent: stanza-parent-eng
@@ -329,6 +356,12 @@ example_output: |-
<token dephead_ref="5" deprel="obj" ref="7">words</token>
<token dephead_ref="4" deprel="punct" ref="8">.</token>
```
example_extra: |-
In order to use this annotation you need to add the following setting to your Sparv corpus configuration file:
```yaml
metadata:
language: eng
```
---
id: eng-namedentity-stanza
parent: stanza-parent-eng
@@ -368,6 +401,12 @@ example_output: |-
</ne>
<token>.</token>
```
example_extra: |-
In order to use this annotation you need to add the following setting to your Sparv corpus configuration file:
```yaml
metadata:
language: eng
```
description:
swe: |-
Namnigenkänning (NER) gör det möjligt att märka upp namnentiteter (som t.ex. personnamn, organisationer, ortnamn) i
@@ -396,6 +435,12 @@ example_output: |-
<token upos="NOUN">corpus</token>
<token upos="PUNCT">.</token>
```
example_extra: |-
In order to use this annotation you need to add the following setting to your Sparv corpus configuration file:
```yaml
metadata:
language: eng
```
---
id: eng-msd-stanza-ufeats
parent: stanza-parent-eng
@@ -417,3 +462,9 @@ example_output: |-
<token ufeats="Number=Sing">corpus</token>
<token>.</token>
```
example_extra: |-
In order to use this annotation you need to add the following setting to your Sparv corpus configuration file:
```yaml
metadata:
language: eng
```
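
Note on the `lexical_classes` descriptions changed in this commit: they say that dominance compares the observed frequency of a lexical class in a text with its expected (relative) frequency in the reference material, but the diff does not show the formula. The sketch below is one plausible reading, assuming dominance is the plain ratio of the two relative frequencies. The function name `dominance_ranking`, its parameters, and the example class labels are hypothetical and are not taken from Sparv.

```python
from collections import Counter


def dominance_ranking(text_classes, reference_rel_freq, top_n=3):
    """Rank the lexical classes of one text by an assumed dominance ratio.

    text_classes: list of class labels, one entry per token-level class hit.
    reference_rel_freq: dict mapping class label -> relative frequency
        in the reference material (hypothetical stand-in for the frequency model).
    """
    counts = Counter(text_classes)
    total = sum(counts.values())
    scores = {}
    for cls, count in counts.items():
        expected = reference_rel_freq.get(cls)
        if not expected:
            # Classes missing from the reference cannot be compared; skip them.
            continue
        observed = count / total  # relative frequency within this text
        scores[cls] = observed / expected  # assumption: dominance = observed / expected
    # Highest ratio first, keeping only the top_n classes.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Made-up labels and reference frequencies, for illustration only.
ranking = dominance_ranking(
    ["Music", "Music", "Food", "Music"],
    {"Music": 0.25, "Food": 0.5},
)
print(ranking)  # [('Music', 3.0), ('Food', 0.5)]
```

A ratio above 1 means the class is more frequent in the text than in the reference material. The actual Sparv implementation may weight or filter classes differently.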
