Gazetteers
A vital part of an NER system is a list of named entities. A list of Czech first names is available at https://cs.wikipedia.org/wiki/Seznam_k%C5%99estn%C3%ADch_jmen
and can be downloaded using the wikipedia package:
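pip install wikipedia

import wikipedia

wikipedia.set_lang("cs")
p = wikipedia.page("Seznam Křestních jmen")
# content is a property holding the plain-text body of the page
content = p.content.split('\n')
with open("output.txt", "w", encoding="utf-8") as f:
    for line in content:
        # skip empty lines and section headings such as "== A =="
        if len(line) == 0:
            continue
        if line.startswith('=='):
            continue
        f.write(line)
        f.write('\n')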
Note that the output ends with a few useless lines that should be removed.
A list of Czech addresses can be found on the website of the Czech Ministry of the Interior: http://aplikace.mvcr.cz/adresy/
It is in XML format, which requires a bit more work:
import xml.etree.ElementTree

e = xml.etree.ElementTree.parse("adresy.xml").getroot()
with open('output.txt', 'w', encoding='utf-8') as f:
    # iter() replaces getiterator(), which was removed in Python 3.9
    for street in e.iter('ulice'):
        f.write(street.get('nazev'))
        f.write('\n')
We could possibly add city names as well?
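The same extraction generalises to other element types. Below is a minimal sketch, assuming city entries carry a `nazev` attribute just like the streets. The tag name `obec`, the helper `collect_names`, and the inline sample XML are all hypothetical and should be checked against the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real adresy.xml layout may differ.
SAMPLE = """
<adresy>
  <obec nazev="Brno">
    <ulice nazev="Kounicova"/>
    <ulice nazev="Veveří"/>
  </obec>
  <obec nazev="Praha">
    <ulice nazev="Veveří"/>
  </obec>
</adresy>
"""


def collect_names(root, tag, attr="nazev"):
    """Return the sorted, de-duplicated attribute values of all <tag> elements."""
    names = {el.get(attr) for el in root.iter(tag)}
    names.discard(None)  # skip elements that lack the attribute
    return sorted(names)


root = ET.fromstring(SAMPLE)
streets = collect_names(root, "ulice")
cities = collect_names(root, "obec")  # "obec" is an assumed tag name
```

Collecting into a set first removes street names that occur in several cities, which keeps the gazetteer file small.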
The Ministry of the Interior publishes a list of Czech last names: http://www.mvcr.cz/clanek/cetnost-jmen-a-prijmeni-722752.aspx
Sadly, it is saved in XLS format, so you have to open it in Excel/Calc and convert it to TSV/CSV first.
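Once the spreadsheet has been exported to CSV, the surname column can be written out one name per line. This is a minimal sketch, assuming the surname sits in the first column and the first row is a header. The function name `csv_to_gazetteer` and that column layout are assumptions, not taken from the ministry file.

```python
import csv


def csv_to_gazetteer(csv_path, out_path, column=0, skip_header=True):
    """Write the unique, non-empty values of one CSV column, one per line."""
    seen = set()
    with open(csv_path, newline="", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        reader = csv.reader(src)
        if skip_header:
            next(reader, None)  # drop the header row if present
        for row in reader:
            if len(row) <= column:
                continue  # skip short or empty rows
            name = row[column].strip()
            if name and name not in seen:
                seen.add(name)
                dst.write(name + "\n")
    return len(seen)
```

For a TSV export, passing `delimiter="\t"` to `csv.reader` would be the only change needed.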
To add a new gazetteer as a feature function, look at src/common/feature_extractor.py
- Add a loading function that creates a set of words in the feature extractor:
    def _load_name_gzttr(self, filename):
        self.name_gzttr = set()
        with open(filename) as f:
            for l in f:
                # strip the trailing newline so that tokens can match
                self.name_gzttr.add(l.strip())
- Add the feature function itself:
    def ft_name_gzttr(self, *params, init=False):
        if init:
            self._load_name_gzttr(params[0])
            return
        token = params[0]
        flag = token in self.name_gzttr
        return "name", flag
Note that the function must take an init flag if it loads a file.
- Add the function name to the external_functs collection:
external_functs = {'addr_gzttr', 'name_gzttr', 'POS_curr', 'clusters_8'}
Now, when you write the model.txt file, add the filename of the gazetteer after the function name. The gazetteer will then be loaded automatically at initialisation, for example:
gazett label to_lower name_gzttr czech_names
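Putting the steps together, the two-phase call pattern can be sketched as below. This is a simplified stand-in, not the actual class in src/common/feature_extractor.py. The class name `GazetteerDemo` and the temporary gazetteer file are hypothetical; only the `init` convention is taken from the steps above.

```python
import os
import tempfile


class GazetteerDemo:
    """Hypothetical stand-in for the real feature extractor."""

    def _load_name_gzttr(self, filename):
        # One entry per line; strip the newline so lookups can match tokens.
        with open(filename, encoding="utf-8") as f:
            self.name_gzttr = {line.strip() for line in f if line.strip()}

    def ft_name_gzttr(self, *params, init=False):
        if init:
            # Initialisation call: params[0] is the gazetteer filename.
            self._load_name_gzttr(params[0])
            return None
        # Feature call: params[0] is the token to look up.
        return "name", params[0] in self.name_gzttr


# Write a tiny gazetteer file to demonstrate both phases.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("adam\neva\n")

extractor = GazetteerDemo()
extractor.ft_name_gzttr(tmp.name, init=True)  # load phase
hit = extractor.ft_name_gzttr("eva")          # token is in the gazetteer
miss = extractor.ft_name_gzttr("brno")        # token is not in the gazetteer
os.remove(tmp.name)
```

The same function serves both as loader and as feature, so the model file only needs to name one function plus the gazetteer file.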