texttechnologylab/biofid-gazetteer

BIOfid Gazetteer

Skip-Gram Based Taxon Tagger for the TextImager Pipeline

Description

A Java-based gazetteer tagger developed for the BIOfid project. It recognizes biological entities in texts by matching them against large term lists (gazetteers).

The tagger uses a tree-search algorithm backed by Java's ConcurrentHashMap and parallelized with Java 8 streams. It tags arbitrary texts of n words in O(c · n) time, where c is the cost of looking up a single word in a previously built tree. Each node in the tree represents a word from the given input lists. All leaves must have a label (usually a URI); any other node in the tree may have a label as well. The tagger can also generate skip-grams and abbreviations from the input terms.
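
The word-tree lookup described above can be illustrated with a minimal sketch. This is not the repository's actual code: the class and method names (GazetteerTreeSketch, add, tag), the greedy longest-match strategy, and the example.org labels are all assumptions made for illustration. The sketch runs sequentially and leaves out skip-grams, abbreviations, and the stream-based parallelization.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a word-level tree lookup; not the project's API.
class GazetteerTreeSketch {

    /** One node per word; label stays null unless a gazetteer entry ends here. */
    static final class Node {
        final Map<String, Node> children = new ConcurrentHashMap<>();
        volatile String label;
    }

    /** A tagged span: token offsets [begin, end) plus the label of the entry. */
    static final class Match {
        final int begin;
        final int end;
        final String label;

        Match(int begin, int end, String label) {
            this.begin = begin;
            this.end = end;
            this.label = label;
        }

        @Override
        public String toString() {
            return "[" + begin + ", " + end + ") -> " + label;
        }
    }

    private final Node root = new Node();

    /** Inserts a (possibly multi-word) entry, one tree level per word. */
    void add(String entry, String label) {
        Node node = root;
        for (String word : entry.trim().split("\\s+")) {
            node = node.children.computeIfAbsent(word, w -> new Node());
        }
        node.label = label;
    }

    /** Walks the tree from every start position and keeps the longest labeled path. */
    List<Match> tag(List<String> tokens) {
        List<Match> matches = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            Node node = root;
            Match longest = null;
            for (int j = i; j < tokens.size(); j++) {
                node = node.children.get(tokens.get(j));
                if (node == null) {
                    break; // no entry continues with this word
                }
                if (node.label != null) {
                    longest = new Match(i, j + 1, node.label);
                }
            }
            if (longest != null) {
                matches.add(longest);
                i = longest.end; // continue after the matched span
            } else {
                i++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        GazetteerTreeSketch gazetteer = new GazetteerTreeSketch();
        // Labels are placeholder URIs, not real BIOfid identifiers.
        gazetteer.add("Quercus", "https://example.org/taxon/quercus");
        gazetteer.add("Quercus robur", "https://example.org/taxon/quercus-robur");

        List<String> tokens = Arrays.asList("Die", "Quercus", "robur", "und", "Quercus", "alba");
        for (Match match : gazetteer.tag(tokens)) {
            System.out.println(match);
        }
    }
}
```

In this sketch the inner loop is bounded by the depth of the tree, that is, by the longest entry in the gazetteer, so the work per start position does not grow with the length of the text. Matching is exact and case-sensitive here, which is a simplification.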

Note

The tagger is highly susceptible to false positives, such as vernacular names that double as common names of people (especially prominent in German, e.g. Schneider). Please keep this in mind while curating input lists/gazetteers.

⚠ Deprecation Pending ⚠

This repository is in the process of being replaced by a Rust implementation: gazetteer-rs

Citation

The tool was used in the creation of all BIOfid corpora. Please cite
