
Gazetteers

nguyeho7 edited this page Sep 29, 2016 · 18 revisions

A vital part of a NER system is a list of named entities. A list of Czech first names is available at https://cs.wikipedia.org/wiki/Seznam_k%C5%99estn%C3%ADch_jmen

It can be downloaded using the wikipedia package:

pip install wikipedia

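import wikipedia
wikipedia.set_lang("cs")
p = wikipedia.page("Seznam Křestních jmen")
# content is a property of the page object, not a method
content = p.content.split('\n')
with open("output.txt", "w", encoding="utf-8") as f:
    for line in content:
        # skip empty lines and section headings (lines starting with '==')
        if len(line) == 0 or line.startswith('=='):
            continue
        f.write(line)
        f.write('\n')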

Note that some useless lines are added at the end of the output.

A list of Czech addresses can be found on the Czech Ministry of Interior website: http://aplikace.mvcr.cz/adresy/

It's in XML format, which requires a bit more work:

import xml.etree.ElementTree as ET
e = ET.parse("adresy.xml").getroot()
with open('output.txt', 'w', encoding='utf-8') as f:
    # getiterator() was removed in Python 3.9; iter() is its replacement
    for street in e.iter('ulice'):
        f.write(street.get('nazev'))
        f.write('\n')

We could possibly add city names as well?
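If city names are added, the street loop can probably be reused with a different tag. The following is only a sketch: it assumes the `ulice` elements and `nazev` attribute used above, while the `obec` tag, the `adresy` root and the sample data are hypothetical and need to be checked against the real file. It also de-duplicates the values, since the same street name is likely to appear in more than one town.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; only <ulice nazev="..."> mirrors the text above.
SAMPLE = """<adresy>
  <obec nazev="Praha">
    <ulice nazev="Nádražní"/>
    <ulice nazev="Karlova"/>
  </obec>
  <obec nazev="Brno">
    <ulice nazev="Nádražní"/>
    <ulice/>
  </obec>
</adresy>"""


def collect_attr(source, tag, attr="nazev"):
    """Return the sorted, de-duplicated values of attr on every tag element."""
    root = ET.parse(source).getroot()
    values = set()
    for elem in root.iter(tag):
        value = elem.get(attr)
        if value:  # skip elements that lack the attribute
            values.add(value.strip())
    return sorted(values)


streets = collect_attr(io.StringIO(SAMPLE), "ulice")
cities = collect_attr(io.StringIO(SAMPLE), "obec")  # 'obec' is an assumed tag name
```

Each resulting list can then be written one entry per line, the same way as the other gazetteer files.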

For Czech last names, the Ministry of Interior has a list at http://www.mvcr.cz/clanek/cetnost-jmen-a-prijmeni-722752.aspx

Sadly, it's saved in XLS format, so you have to open it in Excel/Calc and convert it to TSV/CSV before using it.
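Once the spreadsheet has been exported to CSV, the surnames can be written out one per line like the other gazetteers. A minimal sketch, assuming the export has a header row and a surname column; the column names `PRIJMENI` and `CETNOST` and the sample rows are hypothetical, so check them against the real export.

```python
import csv
import io

# Hypothetical export; the real column names and layout may differ.
SAMPLE_CSV = "PRIJMENI,CETNOST\nNOVÁK,3\nSVOBODA,2\n,1\nNOVÁK,1\n"


def surnames_from_csv(fileobj, column="PRIJMENI", delimiter=","):
    """Collect the unique, non-empty values of one column, in first-seen order."""
    seen = set()
    names = []
    for row in csv.DictReader(fileobj, delimiter=delimiter):
        name = (row.get(column) or "").strip()
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    return names


def write_gazetteer(names, path):
    """Write one entry per line."""
    with open(path, "w", encoding="utf-8") as out:
        for name in names:
            out.write(name + "\n")


surnames = surnames_from_csv(io.StringIO(SAMPLE_CSV))
```

For a TSV export, pass `delimiter="\t"`; `write_gazetteer(surnames, "output.txt")` then produces the gazetteer file.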

Adding a custom gazetteer

To add a new gazetteer as a feature function, look at src/common/feature_extractor.py

  1. Add a loading function that creates a set of words in the feature extractor:

    def _load_name_gzttr(self, filename):
        self.name_gzttr = set()
        with open(filename) as f:
            for l in f:
                # strip the trailing newline, otherwise no token will match
                self.name_gzttr.add(l.strip())

  2. Add the feature function itself:

    def ft_name_gzttr(self, *params, init=False):
        if init:
            self._load_name_gzttr(params[0])
            return
        token = params[0]
        flag = token in self.name_gzttr
        return "name", flag

Note that the function should accept an init flag if it loads a file.

  3. Add the function name to the external_functs set:

    external_functs = {'addr_gzttr', 'name_gzttr', 'POS_curr', 'clusters_8'}

Now, when you write the model.txt file, add the filename of the gazetteer after the function name. The gazetteer is then loaded during initialisation:

gazett label to_lower name_gzttr czech_names
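
The init-flag pattern from the steps above can be tried in isolation. This is only a sketch: `GazetteerSketch` is a hypothetical stand-in for the class in src/common/feature_extractor.py, and the two calls at the bottom are an assumption about how the extractor uses the function (once with `init=True` and the filename from model.txt, then once per token), since that dispatch code is not shown here.

```python
import os
import tempfile


class GazetteerSketch:
    """Hypothetical stand-in; only the two gazetteer methods are mirrored."""

    def _load_name_gzttr(self, filename):
        self.name_gzttr = set()
        with open(filename, encoding="utf-8") as f:
            for line in f:
                self.name_gzttr.add(line.strip())

    def ft_name_gzttr(self, *params, init=False):
        if init:
            # init call: params[0] is the gazetteer filename
            self._load_name_gzttr(params[0])
            return
        # normal call: params[0] is the token
        return "name", params[0] in self.name_gzttr


# Write a tiny hypothetical gazetteer file to load.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Jan\nPetr\n")
    gazetteer_path = tmp.name

fe = GazetteerSketch()
fe.ft_name_gzttr(gazetteer_path, init=True)  # load the file once
hit = fe.ft_name_gzttr("Jan")      # token present in the gazetteer
miss = fe.ft_name_gzttr("Praha")   # token absent from the gazetteer
os.remove(gazetteer_path)
```

Here `hit` is `("name", True)` and `miss` is `("name", False)`.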