Normalization in SVALA means editing the original learner text so that the sentences a) are correct with respect to orthography, morphology and syntax and b) are semantically coherent. Point a) corresponds to the Minimal Target Hypothesis in the German learner corpus Falko (Reznicek et al. 2012, page 42 ff.), whereas point b) goes beyond it by allowing changes that affect meaning. A German example that highlights the difference (from Boyd 2018, slide 7) is the Minimal Target Hypothesis "Rufst du deine Antwort an?", which is syntactically correct but semantically anomalous ("Do you call your answer?"). To represent corrections of semantics, pragmatics and style, Falko has an additional level called the Extended Target Hypothesis, which in this example is "Rufst du mich wegen deiner Antwort an?" ("Will you call me regarding your answer?"). The target hypothesis in SVALA is defined more widely than the Minimal Target Hypothesis in Falko because SVALA has only a single level of target hypotheses.
The purpose of normalization is twofold:
- To render the text in a version that is amenable to automatic annotation using a standard linguistic analysis pipeline. (For Swedish, such a pipeline is efselab or Sparv. The SweLL data is currently processed with the Sparv pipeline. It is possible to re-annotate later with other pipelines.)
- To obtain a separate, explicit representation of the corrections (that is, the target hypotheses). Such a representation is highly useful because it allows many new types of corpus search:
  - Finding places where a construction is missing from the learner text although it could or should have been used, for example where something that ought to have been expressed with a passive was not.
  - Finding mismatches between the learner text and the corrected text, for example an adjective in the learner text that corresponds to an adverb in the corrected text.
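The mismatch search can be illustrated with a small sketch. It assumes that learner and corrected tokens are already aligned pairwise and carry part-of-speech tags. The `AlignedPair` type, the tag labels `ADJ` and `ADV`, the function `find_pos_mismatches` and the toy sentence are all hypothetical; none of them is the actual SVALA or Sparv data format.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class AlignedPair:
    """One learner token aligned with its corrected counterpart (hypothetical format)."""
    learner: str
    learner_pos: str
    target: str
    target_pos: str


def find_pos_mismatches(pairs: Iterable[AlignedPair],
                        learner_pos: str,
                        target_pos: str) -> List[AlignedPair]:
    """Return pairs tagged learner_pos in the learner text but target_pos in the correction."""
    return [p for p in pairs
            if p.learner_pos == learner_pos and p.target_pos == target_pos]


# Invented toy data with hand-assigned tags, for illustration only:
# learner text "Hon sjunger vacker", corrected to "Hon sjunger vackert".
pairs = [
    AlignedPair("Hon", "PRON", "Hon", "PRON"),
    AlignedPair("sjunger", "VERB", "sjunger", "VERB"),
    AlignedPair("vacker", "ADJ", "vackert", "ADV"),
]

hits = find_pos_mismatches(pairs, "ADJ", "ADV")
print([(p.learner, p.target) for p in hits])
```

A query of this kind is only possible because the corrections are stored as a separate layer that is linked to the learner tokens.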
Although we strive to write guidelines to maximize inter-annotator agreement with respect to normalization, there will typically (if not always) be multiple possible normalizations of a deviating expression in a learner text. There are two reasons for this:
1. Different target hypotheses can be assumed.
Jag trivs mycket bor med dem ← Original
Jag trivs mycket med att bo med dem ← Target hypothesis 1 (preferred)
Jag trivs mycket bra med dem ← Target hypothesis 2
Here, it is not clear whether by "mycket bor" the learner meant "mycket med att bo" or "mycket bra". Since our notion of target hypothesis requires not changing the meaning (or at least changing it as little as possible), we prefer "mycket med att bo" even though it changes more tokens (see further below).
2. Assuming a particular target hypothesis, several normalizations may be possible.
Mit Bostaden är stor och ser gul farg fint hus ← Original
Min bostad är stor och har gul färg , ett fint hus ← Smaller change (preferred)
Min bostad är stor och har gul färg , och ligger i ett fint hus ← Bigger change
Min bostad är stor och gul , och ligger i ett fint hus ← Even bigger change
The basic principle of normalization in SVALA is that of minimal change. This means changing as few tokens as possible and the meaning as little as possible while correcting orthography, morphology and syntax and ensuring that the sentence is semantically coherent. In case of conflict between these, token changes are preferred to meaning changes. The pragmatics and style should not be changed.
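The token-counting half of this principle can be made concrete with a token-level edit distance. The following is a minimal sketch and not part of the SVALA tooling: the function name `token_edit_distance` and the choice of unit costs for insertion, deletion and substitution are assumptions. It counts token changes only and says nothing about meaning, which takes priority according to the principle above.

```python
from typing import Sequence


def token_edit_distance(source: Sequence[str], target: Sequence[str]) -> int:
    """Levenshtein distance over tokens: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(target) + 1))
    for i, src_tok in enumerate(source, start=1):
        current = [i]
        for j, tgt_tok in enumerate(target, start=1):
            substitution = previous[j - 1] + (src_tok != tgt_tok)
            current.append(min(previous[j] + 1,      # delete src_tok
                               current[j - 1] + 1,   # insert tgt_tok
                               substitution))        # keep or replace
        previous = current
    return previous[-1]


# The original and two of the normalizations from the example above.
original = "Mit Bostaden är stor och ser gul farg fint hus".split()
smaller = "Min bostad är stor och har gul färg , ett fint hus".split()
bigger = "Min bostad är stor och har gul färg , och ligger i ett fint hus".split()

print(token_edit_distance(original, smaller))  # 6
print(token_edit_distance(original, bigger))   # 9
```

Under these assumptions the preferred normalization changes six tokens and the bigger one nine, which matches the ordering given in the example.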
- Change as few tokens as possible.
In item (2) above, inserting "och ligger i ett fint hus" is more idiomatic but not preferred, since it changes more words than "ett fint hus".
ett fint hus ← Preferred change to obtain a proper noun phrase
och ligger i ett fint hus ← Bigger change
- Change the meaning as little as possible, and do not change pragmatics or style.
Examples of style that we do not correct:
Main-clause word order in dependent clause with "att": "Allt är bra för att jag kan inte bo ensam"
"Orsaken beror på..."
- If there is a conflict between minimizing the number of tokens and retaining the meaning, then give priority to retaining the meaning (permitting more tokens to be changed).
In item (1) above (different target hypotheses), changing "bor" to "bra" is a bigger change than changing "bor" to "med att bo", because it replaces one word with another word that has a different meaning.
If a verb does not match its arguments, retain the verb and adjust the arguments. Dummies are added for missing mandatory objects. The same applies to verb-dependent prepositional objects. (Adapted from Falko (Reznicek et al. 2012, page 45). Is it relevant? Swedish examples?)
In the absence of agreement between determiner, attribute(s) and nominal head, the head is retained. (Adapted from Falko (Reznicek et al. 2012, page 46).)
(Is this a consequence of the previous principle?) Keep the form of the content word and change its surrounding context, rather than keeping the function word and changing the content word.
det är mycket skillnader
det är många skillnader ← Preferred change
det är stora skillnader ← "stora" is more idiomatic but is further away from the original ("mycket")
det är stor skillnad ← The form of the content word is changed, which we regard as a bigger change
min kompis och jag träffar till klubb
min kompis och jag träffas på klubb ← Preferred change (function word changed)
min kompis och jag går till klubb ← Content word changed
If it is otherwise not clear which target hypothesis to prefer, use the one that is most frequent in Korp (or some other reference material), and which can thus be assumed to be the more conventionalised expression.
Det var mycket svårt för oss att hitta den lägenhet i Stockholm
Det var mycket svårt för oss att hitta en lägenhet i Stockholm ← Preferred change
Frequency in Korp (all corpora selected): "hitta lägenhet" 1904, "hitta en lägenhet" 4180
In cases where it is difficult to decide whether an expression is correct or incorrect, assume that the learner was correct.
Do we have grammatical examples of this?
Note: This principle is also useful in transcription, in case of squiggles that could be interpreted in different ways. For example, in some handwriting it is difficult to distinguish "a" from "o". We may then assume, for example, that a partly illegible word should be read as the present verb form "dansar" (English: dance) rather than the illegal form "dansor". Similarly, we may assume strokes above "a" or "o" to be correct diacritics for the Swedish characters "å", "ä" or "ö", rather than errors.
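The order of priority among the principles above (retain the meaning first, then change few tokens, then fall back on corpus frequency) can be sketched as a sort key. This is only an illustration of the ordering, not an automatic procedure used in SVALA: the `Candidate` type and the `rank` function are hypothetical, and `meaning_change` is an assumed numeric stand-in for the annotator's own judgement. The frequencies are the Korp counts quoted in the example above; the token edit counts (one substitution or one deletion) and the zero meaning change are assumptions made for the illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    text: str
    meaning_change: int  # annotator's judgement, 0 = meaning retained (hypothetical scale)
    token_edits: int     # number of changed tokens
    frequency: int       # corpus frequency of the resulting expression


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Sort by least meaning change, then fewest token edits, then highest frequency."""
    return sorted(candidates,
                  key=lambda c: (c.meaning_change, c.token_edits, -c.frequency))


# "hitta den lägenhet": replacing "den" with "en" or deleting "den" each changes one token.
candidates = [
    Candidate("hitta lägenhet", meaning_change=0, token_edits=1, frequency=1904),
    Candidate("hitta en lägenhet", meaning_change=0, token_edits=1, frequency=4180),
]

print(rank(candidates)[0].text)  # hitta en lägenhet
```

Because frequency is the last element of the key, it only decides between candidates that are otherwise equally good, which mirrors the wording "if it is otherwise not clear which target hypothesis to prefer".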