TokensRegex or regexner annotators in corenlp Python #33

tanusrib opened this issue Oct 15, 2015 · 3 comments

@tanusrib

I am wondering whether there is any documentation on how to use the regexner and TokensRegex annotators in the Python wrapper for CoreNLP. Also, how can I use my own customised regular expressions?

@matthayes

This may be helpful: http://nlp.stanford.edu/pubs/tokensregex-tr-2014.pdf
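
For a concrete flavour of what that report covers, a minimal TokensRegex rules file might look like the sketch below: it assigns a custom NER tag to a matched token sequence. The rule syntax follows the CoreNLP TokensRegex documentation; the annotator name and property mentioned afterwards are assumptions to verify against your CoreNLP version.

ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

{ ruleType: "tokens",
  pattern: ( /p53/ /transcription/ /factor/ ),
  action: Annotate($0, ner, "PROTEIN") }

A rules file like this would typically be loaded by adding the tokensregex annotator to the pipeline and pointing a tokensregex.rules property at the file, analogous to the regexner.mapping setup shown further down in this thread.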

@victoriastuart commented Oct 27, 2017

Update (2020-01): this repo (stanford-corenlp-python) is old and appears to be unmaintained -- the last commit was in October 2014.

The stanfordnlp (Python) repo -- which is maintained by Stanford and provides Pythonic access to a CoreNLP server -- is more recent and well supported.

Superseding my older answer below, I just posted an issue at stanfordnlp that describes how to blend default CoreNLP NER tagging and RegexNER tagging in Python (with a link there describing how to accomplish the same task in Java, if that is your preference):

Can we call RegexNER in stanfordnlp?
https://github.com/stanfordnlp/stanfordnlp/issues/184
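
As a rough illustration of that route, here is a minimal sketch using the stanfordnlp client. Assumptions: the stanfordnlp package is installed, CORENLP_HOME points at an unzipped CoreNLP distribution, and entities.txt is a tab-delimited RegexNER mapping file like the one shown further down in this comment; the exact keyword arguments vary slightly between stanfordnlp versions and its successor, stanza.

# Sketch: RegexNER through the stanfordnlp CoreNLP client (see assumptions above).
from stanfordnlp.server import CoreNLPClient

text = ("A p53 Super-tumor Suppressor Reveals a Tumor Suppressive "
        "p53-Ptpn14-Yap Axis in Pancreatic Cancer.")

with CoreNLPClient(
        annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'regexner'],
        properties={'regexner.mapping': '/home/victoria/projects/ie/entities.txt'},
        timeout=30000, memory='4G') as client:
    ann = client.annotate(text)           # protobuf Document
    for sentence in ann.sentence:
        for token in sentence.token:
            # prints the custom tags (GENE, PROTEIN, ...) where the mapping matched
            print(token.word, token.ner)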


It is possible! :-D

I edited my corenlp.py file to work with the latest CoreNLP (3.7.0) -- a sketch of that change follows the properties below -- then edited the default.properties file, basically as shown here:

# Works:
# annotators = tokenize, ssplit, pos, lemma, ner, parse, dcoref, regexner
# All of these appear to be required for regexner to work:
annotators = tokenize, ssplit, pos, lemma, ner, parse, regexner

# A true-casing annotator is also available (see below)
#annotators = tokenize, ssplit, pos, lemma, truecase
# ----------------------------------------------------------------------------
# REGEXNER:
# A simple regex NER annotator is also available
# annotators = tokenize, ssplit, regexner
# Victoria -- regexner depends on tokenize + ssplit
# More:
#   https://nlp.stanford.edu/software/regexner.html
#   https://stanfordnlp.github.io/CoreNLP/regexner.html#description
regexner.mapping = /home/victoria/projects/ie/entities.txt
# ----------------------------------------------------------------------------
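
The corenlp.py edit is not reproduced above; in this wrapper it essentially amounts to updating the jar list that corenlp.py uses to build the Java classpath so that it matches the jars in the unzipped CoreNLP folder. A minimal sketch, with jar names assumed from a 3.7.0 distribution (check your own folder):

# corenlp.py (sketch): the classpath jars must match the CoreNLP release you installed
jars = ["stanford-corenlp-3.7.0.jar",
        "stanford-corenlp-3.7.0-models.jar",
        "joda-time.jar",
        "jollyday.jar",
        "xom.jar",
        "ejml-0.23.jar"]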

My tab-delimited entities.txt file (just for testing; the path is defined in default.properties, above) is shown below. Each line is a phrase followed by the NER tag to assign, and an optional third column lists existing NER tags that the mapping is allowed to overwrite (e.g. PERSON on the Yap line):

p53	GENE
super-tumor suppressor	PROTEIN
tumor	DISEASE
p53-ptpn14-yap	GENE_COMPLEX
pancreatic cancer	MOLECULAR_PROCESS
p53 transcription factor	PROTEIN
Ptpn14	GENE
Yap	GENE	PERSON
Yap oncoprotein	PROTEIN

Usage (Python 2.7 venv; Arch Linux):

(py27) [victoria@victoria stanford-corenlp-python]$ pwd
/mnt/Vancouver/apps/stanford-corenlp-python

(py27) [victoria@victoria stanford-corenlp-python]$ ls -l
total 204
-rw-r--r-- 1 victoria victoria   535 Oct 26 15:37  client.py
-rw-r--r-- 1 victoria victoria 11103 Oct 26 16:49  corenlp.py
-rw-r--r-- 1 victoria victoria  8263 Oct 26 16:49  corenlp.pyc
-rw-r--r-- 1 victoria victoria  3885 Oct 26 16:52  default.properties
drwxr-xr-x 3 victoria victoria  4096 Oct 26 15:38  docs
-rw-r--r-- 1 victoria victoria 43179 Oct 26 15:37  jsonrpc.py
-rw-r--r-- 1 victoria victoria 45801 Oct 26 15:45  jsonrpc.pyc
-rw-r--r-- 1 victoria victoria 18092 Oct 26 15:37  LICENSE
-rw-r--r-- 1 victoria victoria 13562 Oct 26 15:37  progressbar.py
-rw-r--r-- 1 victoria victoria 16945 Oct 26 15:45  progressbar.pyc
drwxr-xr-x 2 victoria victoria  4096 Oct 26 16:23  __pycache__
-rw-r--r-- 1 victoria victoria  9463 Oct 26 15:37  README.md
-rw-r--r-- 1 victoria victoria   662 Oct 26 15:41 '_readme - stanford-corenlp-python - Victoria.txt'

(py27) [victoria@victoria stanford-corenlp-python]$ P
[P: python]
Python 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>> from corenlp import *
>>> corenlp = StanfordCoreNLP()
Loading Models: 5/5                                                                                                                                                                               

>>> parse_test = corenlp.parse("A p53 Super-tumor Suppressor Reveals a Tumor Suppressive p53-Ptpn14-Yap Axis in Pancreatic Cancer.")

>>> parse_test
'{"sentences": [{"parsetree": "[Text=p53 CharacterOffsetBegin=2 CharacterOffsetEnd=5 PartOfSpeech=NN Lemma=p53 NamedEntityTag=GENE] [Text=Super-tumor CharacterOffsetBegin=6 CharacterOffsetEnd=17 PartOfSpeech=NN Lemma=super-tumor NamedEntityTag=O] [Text=Suppressor CharacterOffsetBegin=18 CharacterOffsetEnd=28 PartOfSpeech=NNP Lemma=Suppressor NamedEntityTag=O] [Text=Reveals CharacterOffsetBegin=29 CharacterOffsetEnd=36 PartOfSpeech=VBZ Lemma=reveal NamedEntityTag=O] [Text=a CharacterOffsetBegin=37 CharacterOffsetEnd=38 PartOfSpeech=DT Lemma=a NamedEntityTag=O] [Text=Tumor CharacterOffsetBegin=39 CharacterOffsetEnd=44 PartOfSpeech=NN Lemma=tumor NamedEntityTag=MISC] [Text=Suppressive CharacterOffsetBegin=45 CharacterOffsetEnd=56 PartOfSpeech=JJ Lemma=suppressive NamedEntityTag=MISC] [Text=p53-Ptpn14-Yap CharacterOffsetBegin=57 CharacterOffsetEnd=71 PartOfSpeech=NN Lemma=p53-ptpn14-yap NamedEntityTag=MISC] [Text=Axis CharacterOffsetBegin=72 CharacterOffsetEnd=76 PartOfSpeech=NNP Lemma=Axis NamedEntityTag=MISC] [Text=in CharacterOffsetBegin=77 CharacterOffsetEnd=79 PartOfSpeech=IN Lemma=in NamedEntityTag=O] [Text=Pancreatic CharacterOffsetBegin=80 CharacterOffsetEnd=90 PartOfSpeech=JJ Lemma=pancreatic NamedEntityTag=O] [Text=Cancer CharacterOffsetBegin=91 CharacterOffsetEnd=97 PartOfSpeech=NN Lemma=cancer NamedEntityTag=O] [Text=. CharacterOffsetBegin=97 CharacterOffsetEnd=98 PartOfSpeech=. Lemma=. NamedEntityTag=O] (ROOT (S (NP (DT A) (NN p53) (NN Super-tumor) (NNP Suppressor)) (VP (VBZ Reveals) (S (NP (DT a) (NN Tumor) (JJ Suppressive) (NN p53-Ptpn14-Yap)) (NP (NP (NNP Axis)) (PP (IN in) (NP (JJ Pancreatic) (NN Cancer)))))) (. .)))", "text": "A p53 Super-tumor Suppressor Reveals a Tumor Suppressive p53-Ptpn14-Yap Axis in Pancreatic Cancer.", "dependencies": [["root", "ROOT", "Reveals"], ["det", "Suppressor", "A"], ["compound", "Suppressor", "p53"], ["compound", "Suppressor", "Super-tumor"], ["nsubj", "Reveals", "Suppressor"], ["det", "p53-Ptpn14-Yap", "a"], ["compound", "p53-Ptpn14-Yap", "Tumor"], ["amod", "p53-Ptpn14-Yap", "Suppressive"], ["nsubj", "Axis", "p53-Ptpn14-Yap"], ["xcomp", "Reveals", "Axis"], ["case", "Cancer", "in"], ["amod", "Cancer", "Pancreatic"], ["nmod:in", "Axis", "Cancer"], ["punct", "Reveals", "."]], "words": [["A", {"NamedEntityTag": "O", "CharacterOffsetEnd": "1", "Lemma": "a", "PartOfSpeech": "DT", "CharacterOffsetBegin": "0"}]]}]}'
>>>

This is just a demo (I've only been trying it out today), but this repo (stanford-corenlp-python) is the only Pythonic way I've found to access the CoreNLP regexner class outside of Java!

P.S. Here is that output, in a more readable ("wrapped") format:

parse_test '{"sentences": [{"parsetree": "[Text=p53 CharacterOffsetBegin=2
CharacterOffsetEnd=5 PartOfSpeech=NN Lemma=p53 NamedEntityTag=GENE]
[Text=Super-tumor CharacterOffsetBegin=6 CharacterOffsetEnd=17 PartOfSpeech=NN
Lemma=super-tumor NamedEntityTag=O] [Text=Suppressor CharacterOffsetBegin=18
CharacterOffsetEnd=28 PartOfSpeech=NNP Lemma=Suppressor NamedEntityTag=O]
[Text=Reveals CharacterOffsetBegin=29 CharacterOffsetEnd=36 PartOfSpeech=VBZ
Lemma=reveal NamedEntityTag=O] [Text=a CharacterOffsetBegin=37
CharacterOffsetEnd=38 PartOfSpeech=DT Lemma=a NamedEntityTag=O] [Text=Tumor
CharacterOffsetBegin=39 CharacterOffsetEnd=44 PartOfSpeech=NN Lemma=tumor
NamedEntityTag=MISC] [Text=Suppressive CharacterOffsetBegin=45
CharacterOffsetEnd=56 PartOfSpeech=JJ Lemma=suppressive NamedEntityTag=MISC]
[Text=p53-Ptpn14-Yap CharacterOffsetBegin=57 CharacterOffsetEnd=71
PartOfSpeech=NN Lemma=p53-ptpn14-yap NamedEntityTag=MISC] [Text=Axis
CharacterOffsetBegin=72 CharacterOffsetEnd=76 PartOfSpeech=NNP Lemma=Axis
NamedEntityTag=MISC] [Text=in CharacterOffsetBegin=77 CharacterOffsetEnd=79
PartOfSpeech=IN Lemma=in NamedEntityTag=O] [Text=Pancreatic
CharacterOffsetBegin=80 CharacterOffsetEnd=90 PartOfSpeech=JJ Lemma=pancreatic
NamedEntityTag=O] [Text=Cancer CharacterOffsetBegin=91 CharacterOffsetEnd=97
PartOfSpeech=NN Lemma=cancer NamedEntityTag=O] [Text=. CharacterOffsetBegin=97
CharacterOffsetEnd=98 PartOfSpeech=. Lemma=. NamedEntityTag=O] (ROOT (S (NP
(DT A) (NN p53) (NN Super-tumor) (NNP Suppressor)) (VP (VBZ Reveals) (S (NP
(DT a) (NN Tumor) (JJ Suppressive) (NN p53-Ptpn14-Yap)) (NP (NP (NNP Axis))
(PP (IN in) (NP (JJ Pancreatic) (NN Cancer)))))) (. .)))", "text": "A p53
Super-tumor Suppressor Reveals a Tumor Suppressive p53-Ptpn14-Yap Axis in
Pancreatic Cancer.", "dependencies": [["root", "ROOT", "Reveals"], ["det",
"Suppressor", "A"], ["compound", "Suppressor", "p53"], ["compound",
"Suppressor", "Super-tumor"], ["nsubj", "Reveals", "Suppressor"], ["det",
"p53-Ptpn14-Yap", "a"], ["compound", "p53-Ptpn14-Yap", "Tumor"], ["amod",
"p53-Ptpn14-Yap", "Suppressive"], ["nsubj", "Axis", "p53-Ptpn14-Yap"],
["xcomp", "Reveals", "Axis"], ["case", "Cancer", "in"], ["amod", "Cancer",
"Pancreatic"], ["nmod:in", "Axis", "Cancer"], ["punct", "Reveals", "."]],
"words": [["A", {"NamedEntityTag": "O", "CharacterOffsetEnd": "1", "Lemma":
"a", "PartOfSpeech": "DT", "CharacterOffsetBegin": "0"}]]}]}'

@bpatidar

I could only get this working with the CoreNLP 2014-08-27 release, not the newer version. I was also using the Java 11 JDK on Mac OS X, which required three more jars: javax.xml.bind, activation.jar, and jaxb-impl2.2.jar. I copied them into the unzipped stanford-corenlp folder and updated my corenlp.py to add these three jars as well. With that, the setup worked and parsed the custom entities.
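
For anyone hitting the same thing: the javax.xml.bind / JAXB classes were removed from the JDK in Java 11, so they have to be supplied as ordinary jars on the classpath. A minimal sketch of the corresponding corenlp.py change (the jar file names are assumptions -- match whatever versions you downloaded):

# corenlp.py (sketch): extend the classpath jar list with the JAXB jars that
# Java 11+ no longer bundles; file names are assumptions.
jars += ["jaxb-api.jar",       # javax.xml.bind API
         "activation.jar",     # javax.activation
         "jaxb-impl-2.2.jar"]  # JAXB reference implementation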
