WordNet Gloss Corpus

This is a continuation of the project originally developed at Princeton; the original files can be found at this link or in the princeton branch of this repository.

See the releases page for the official releases.

What

This repository hosts a semantic concordance – a textual corpus (the WordNet glosses) and a lexicon (WordNet) where every content word in the corpus is linked to its sense(s) in the lexicon.

Why

There are several corpora with WordNet sense annotations (SemCor, the Senseval datasets), but only by sense-tagging WordNet itself can we guarantee its definitional completeness – the property that all of its definitions use only words that are themselves defined in WordNet.

How

We are annotating the corpus using this tool. It depends on the gloss corpus in the format available in this repository. Details about the format itself can be found in the tool’s repository.

Release

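To prepare a release, split the annotation file into chunks of 1000 lines each and give every chunk a .jl extension:

cd data
split -l 1000 -a 2 annotation.json annotation-
for f in annotation-??; do mv "$f" "$f.jl"; done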
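After splitting, it can be useful to confirm that no line was lost or reordered. The following is a minimal sketch, not part of the repository: `check_chunks` is a hypothetical helper name, and it assumes the chunks sit next to the original file and follow the two-letter suffix naming produced by the commands above.

```shell
# Hypothetical helper (not in the repository): succeeds only if the chunks,
# concatenated in name order, are byte-for-byte identical to the original.
check_chunks() {
  original=$1   # e.g. annotation.json
  prefix=$2     # e.g. annotation-
  # The glob expands in sorted order (aa, ab, ...), which is the order
  # in which split wrote the chunks.
  cat "$prefix"??.jl | cmp -s - "$original"
}
```

From inside the data directory, `check_chunks annotation.json annotation- && echo "chunks match"` would then print the message only when the round trip is exact.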

Checking

Number of tokens per kind:

for f in *.jl ; do jq -r ".tokens | .[] | .kind | .[0] " $f; done | sort | uniq -c
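The loop above starts one jq process per file. Since jq accepts several input files in one invocation, the same tally can be sketched as a single call. `count_kinds` is a hypothetical helper name, and the sketch assumes what the command above implies: each line of a .jl file is a JSON object with a `tokens` array whose items carry a `kind` array.

```shell
# Hypothetical helper (not in the repository): tallies the first element of
# each token's "kind" across all files given as arguments, most frequent first.
count_kinds() {
  jq -r '.tokens[] | .kind[0]' "$@" | sort | uniq -c | sort -nr
}
```

Calling `count_kinds *.jl` in the data directory should give the same counts as the loop, ordered by frequency.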

Extracting word forms:

for f in *.new; do jq -r ".tokens|.[] |.form " $f ; done | sort | uniq -c | sort -nr >> ../words.txt

Statistics

globs

sense tagged	53212	0.94
not sense tagged	3350	0.06
total taggable	56562	1.00

Among the sense-tagged ones, the last two lines are errors:

39864	auto
13348	man
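The proportions in the globs table can be recomputed from the raw counts. The sketch below uses awk for the division; `proportion` is a hypothetical helper name introduced only for illustration.

```shell
# Hypothetical helper: prints part/total rounded to two decimals.
proportion() {
  awk -v part="$1" -v total="$2" 'BEGIN { printf "%.2f\n", part / total }'
}

proportion 53212 56562   # sense tagged     -> 0.94
proportion 3350 56562    # not sense tagged -> 0.06
```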

word forms

sense tagged	448091	0.56
not sense tagged	334068	0.43
total taggable	792281	1.00

Among the sense-tagged ones:

126940	auto
321151	man
