TokensRegex or regexner annotators in corenlp Python #33

tanusrib opened this issue Oct 15, 2015 · 3 comments

@tanusrib

I am wondering whether there is any documentation on how to use the regexner and TokensRegex annotators in the Python wrapper for CoreNLP. Also, how can I use my own customised regular expressions?

@matthayes

This may be helpful: http://nlp.stanford.edu/pubs/tokensregex-tr-2014.pdf
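
For a concrete flavour of what that report covers, a minimal TokensRegex rules file might look like the sketch below: it assigns a custom NER tag to a matched token sequence. The rule syntax follows the CoreNLP TokensRegex documentation; the annotator name and property mentioned afterwards are assumptions to verify against your CoreNLP version.

ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

{ ruleType: "tokens",
  pattern: ( /p53/ /transcription/ /factor/ ),
  action: Annotate($0, ner, "PROTEIN") }

A rules file like this would typically be loaded by adding the tokensregex annotator to the pipeline and pointing a tokensregex.rules property at the file, analogous to the regexner.mapping setup shown further down in this thread.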

@victoriastuart commented Oct 27, 2017

Update (2020-01): this repo (stanford-corenlp-python) is old and appears to be unmaintained -- the last commit was in October 2014.

The stanfordnlp (Python) repo -- which is maintained by Stanford and provides Pythonic access to a CoreNLP server -- is more recent and well supported.

Superseding my older answer below, I just posted an issue at stanfordnlp that describes how to blend default CoreNLP NER tagging and RegexNER tagging in Python (with a link there describing how to accomplish the same task in Java, if that is your preference):

Can we call RegexNER in stanfordnlp?
https://github.com/stanfordnlp/stanfordnlp/issues/184
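
As a rough illustration of that route, here is a minimal sketch using the stanfordnlp client. Assumptions: the stanfordnlp package is installed, CORENLP_HOME points at an unzipped CoreNLP distribution, and entities.txt is a tab-delimited RegexNER mapping file like the one shown further down in this comment; the exact keyword arguments vary slightly between stanfordnlp versions and its successor, stanza.

# Sketch: RegexNER through the stanfordnlp CoreNLP client (see assumptions above).
from stanfordnlp.server import CoreNLPClient

text = ("A p53 Super-tumor Suppressor Reveals a Tumor Suppressive "
        "p53-Ptpn14-Yap Axis in Pancreatic Cancer.")

with CoreNLPClient(
        annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'regexner'],
        properties={'regexner.mapping': '/home/victoria/projects/ie/entities.txt'},
        timeout=30000, memory='4G') as client:
    ann = client.annotate(text)           # protobuf Document
    for sentence in ann.sentence:
        for token in sentence.token:
            # prints the custom tags (GENE, PROTEIN, ...) where the mapping matched
            print(token.word, token.ner)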


It is possible! :-D

I edited my corenlp.py file to work with the latest CoreNLP (3.7.0) -- a sketch of that change follows the properties below -- then edited the default.properties file, basically as shown here:

# Works:
# annotators = tokenize, ssplit, pos, lemma, ner, parse, dcoref, regexner
# All of these appear to be required for regexner to work:
annotators = tokenize, ssplit, pos, lemma, ner, parse, regexner

# A true-casing annotator is also available (see below)
#annotators = tokenize, ssplit, pos, lemma, truecase
# ----------------------------------------------------------------------------
# REGEXNER:
# A simple regex NER annotator is also available
# annotators = tokenize, ssplit, regexner
# Victoria -- regexner depends on tokenize + ssplit
# More:
#   https://nlp.stanford.edu/software/regexner.html
#   https://stanfordnlp.github.io/CoreNLP/regexner.html#description
regexner.mapping = /home/victoria/projects/ie/entities.txt
# ----------------------------------------------------------------------------
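
The corenlp.py edit is not reproduced above; in this wrapper it essentially amounts to updating the jar list that corenlp.py uses to build the Java classpath so that it matches the jars in the unzipped CoreNLP folder. A minimal sketch, with jar names assumed from a 3.7.0 distribution (check your own folder):

# corenlp.py (sketch): the classpath jars must match the CoreNLP release you installed
jars = ["stanford-corenlp-3.7.0.jar",
        "stanford-corenlp-3.7.0-models.jar",
        "joda-time.jar",
        "jollyday.jar",
        "xom.jar",
        "ejml-0.23.jar"]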

My tab-delimited entities.txt file (just for testing; the path is defined in default.properties, above) is shown below. Each line is a phrase followed by the NER tag to assign, and an optional third column lists existing NER tags that the mapping is allowed to overwrite (e.g. PERSON on the Yap line):

p53	GENE
super-tumor suppressor	PROTEIN
tumor	DISEASE
p53-ptpn14-yap	GENE_COMPLEX
pancreatic cancer	MOLECULAR_PROCESS
p53 transcription factor	PROTEIN
Ptpn14	GENE
Yap	GENE	PERSON
Yap oncoprotein	PROTEIN

Usage (Python 2.7 venv; Arch Linux):

(py27) [victoria@victoria stanford-corenlp-python]$ pwd
/mnt/Vancouver/apps/stanford-corenlp-python

(py27) [victoria@victoria stanford-corenlp-python]$ ls -l
total 204
-rw-r--r-- 1 victoria victoria   535 Oct 26 15:37  client.py
-rw-r--r-- 1 victoria victoria 11103 Oct 26 16:49  corenlp.py
-rw-r--r-- 1 victoria victoria  8263 Oct 26 16:49  corenlp.pyc
-rw-r--r-- 1 victoria victoria  3885 Oct 26 16:52  default.properties
drwxr-xr-x 3 victoria victoria  4096 Oct 26 15:38  docs
-rw-r--r-- 1 victoria victoria 43179 Oct 26 15:37  jsonrpc.py
-rw-r--r-- 1 victoria victoria 45801 Oct 26 15:45  jsonrpc.pyc
-rw-r--r-- 1 victoria victoria 18092 Oct 26 15:37  LICENSE
-rw-r--r-- 1 victoria victoria 13562 Oct 26 15:37  progressbar.py
-rw-r--r-- 1 victoria victoria 16945 Oct 26 15:45  progressbar.pyc
drwxr-xr-x 2 victoria victoria  4096 Oct 26 16:23  __pycache__
-rw-r--r-- 1 victoria victoria  9463 Oct 26 15:37  README.md
-rw-r--r-- 1 victoria victoria   662 Oct 26 15:41 '_readme - stanford-corenlp-python - Victoria.txt'

(py27) [victoria@victoria stanford-corenlp-python]$ P
[P: python]
Python 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>> from corenlp import *
>>> corenlp = StanfordCoreNLP()
Loading Models: 5/5                                                                                                                                                                               

>>> parse_test = corenlp.parse("A p53 Super-tumor Suppressor Reveals a Tumor Suppressive p53-Ptpn14-Yap Axis in Pancreatic Cancer.")

>>> parse_test
'{"sentences": [{"parsetree": "[Text=p53 CharacterOffsetBegin=2 CharacterOffsetEnd=5 PartOfSpeech=NN Lemma=p53 NamedEntityTag=GENE] [Text=Super-tumor CharacterOffsetBegin=6 CharacterOffsetEnd=17 PartOfSpeech=NN Lemma=super-tumor NamedEntityTag=O] [Text=Suppressor CharacterOffsetBegin=18 CharacterOffsetEnd=28 PartOfSpeech=NNP Lemma=Suppressor NamedEntityTag=O] [Text=Reveals CharacterOffsetBegin=29 CharacterOffsetEnd=36 PartOfSpeech=VBZ Lemma=reveal NamedEntityTag=O] [Text=a CharacterOffsetBegin=37 CharacterOffsetEnd=38 PartOfSpeech=DT Lemma=a NamedEntityTag=O] [Text=Tumor CharacterOffsetBegin=39 CharacterOffsetEnd=44 PartOfSpeech=NN Lemma=tumor NamedEntityTag=MISC] [Text=Suppressive CharacterOffsetBegin=45 CharacterOffsetEnd=56 PartOfSpeech=JJ Lemma=suppressive NamedEntityTag=MISC] [Text=p53-Ptpn14-Yap CharacterOffsetBegin=57 CharacterOffsetEnd=71 PartOfSpeech=NN Lemma=p53-ptpn14-yap NamedEntityTag=MISC] [Text=Axis CharacterOffsetBegin=72 CharacterOffsetEnd=76 PartOfSpeech=NNP Lemma=Axis NamedEntityTag=MISC] [Text=in CharacterOffsetBegin=77 CharacterOffsetEnd=79 PartOfSpeech=IN Lemma=in NamedEntityTag=O] [Text=Pancreatic CharacterOffsetBegin=80 CharacterOffsetEnd=90 PartOfSpeech=JJ Lemma=pancreatic NamedEntityTag=O] [Text=Cancer CharacterOffsetBegin=91 CharacterOffsetEnd=97 PartOfSpeech=NN Lemma=cancer NamedEntityTag=O] [Text=. CharacterOffsetBegin=97 CharacterOffsetEnd=98 PartOfSpeech=. Lemma=. NamedEntityTag=O] (ROOT (S (NP (DT A) (NN p53) (NN Super-tumor) (NNP Suppressor)) (VP (VBZ Reveals) (S (NP (DT a) (NN Tumor) (JJ Suppressive) (NN p53-Ptpn14-Yap)) (NP (NP (NNP Axis)) (PP (IN in) (NP (JJ Pancreatic) (NN Cancer)))))) (. .)))", "text": "A p53 Super-tumor Suppressor Reveals a Tumor Suppressive p53-Ptpn14-Yap Axis in Pancreatic Cancer.", "dependencies": [["root", "ROOT", "Reveals"], ["det", "Suppressor", "A"], ["compound", "Suppressor", "p53"], ["compound", "Suppressor", "Super-tumor"], ["nsubj", "Reveals", "Suppressor"], ["det", "p53-Ptpn14-Yap", "a"], ["compound", "p53-Ptpn14-Yap", "Tumor"], ["amod", "p53-Ptpn14-Yap", "Suppressive"], ["nsubj", "Axis", "p53-Ptpn14-Yap"], ["xcomp", "Reveals", "Axis"], ["case", "Cancer", "in"], ["amod", "Cancer", "Pancreatic"], ["nmod:in", "Axis", "Cancer"], ["punct", "Reveals", "."]], "words": [["A", {"NamedEntityTag": "O", "CharacterOffsetEnd": "1", "Lemma": "a", "PartOfSpeech": "DT", "CharacterOffsetBegin": "0"}]]}]}'
>>>

This is just a demo (I've only been trying it out today), but this repo (stanford-corenlp-python) is the only Pythonic way I've found to access the CoreNLP regexner class outside of Java!

P.S. Here is that output, in a more readable ("wrapped") format:

parse_test '{"sentences": [{"parsetree": "[Text=p53 CharacterOffsetBegin=2
CharacterOffsetEnd=5 PartOfSpeech=NN Lemma=p53 NamedEntityTag=GENE]
[Text=Super-tumor CharacterOffsetBegin=6 CharacterOffsetEnd=17 PartOfSpeech=NN
Lemma=super-tumor NamedEntityTag=O] [Text=Suppressor CharacterOffsetBegin=18
CharacterOffsetEnd=28 PartOfSpeech=NNP Lemma=Suppressor NamedEntityTag=O]
[Text=Reveals CharacterOffsetBegin=29 CharacterOffsetEnd=36 PartOfSpeech=VBZ
Lemma=reveal NamedEntityTag=O] [Text=a CharacterOffsetBegin=37
CharacterOffsetEnd=38 PartOfSpeech=DT Lemma=a NamedEntityTag=O] [Text=Tumor
CharacterOffsetBegin=39 CharacterOffsetEnd=44 PartOfSpeech=NN Lemma=tumor
NamedEntityTag=MISC] [Text=Suppressive CharacterOffsetBegin=45
CharacterOffsetEnd=56 PartOfSpeech=JJ Lemma=suppressive NamedEntityTag=MISC]
[Text=p53-Ptpn14-Yap CharacterOffsetBegin=57 CharacterOffsetEnd=71
PartOfSpeech=NN Lemma=p53-ptpn14-yap NamedEntityTag=MISC] [Text=Axis
CharacterOffsetBegin=72 CharacterOffsetEnd=76 PartOfSpeech=NNP Lemma=Axis
NamedEntityTag=MISC] [Text=in CharacterOffsetBegin=77 CharacterOffsetEnd=79
PartOfSpeech=IN Lemma=in NamedEntityTag=O] [Text=Pancreatic
CharacterOffsetBegin=80 CharacterOffsetEnd=90 PartOfSpeech=JJ Lemma=pancreatic
NamedEntityTag=O] [Text=Cancer CharacterOffsetBegin=91 CharacterOffsetEnd=97
PartOfSpeech=NN Lemma=cancer NamedEntityTag=O] [Text=. CharacterOffsetBegin=97
CharacterOffsetEnd=98 PartOfSpeech=. Lemma=. NamedEntityTag=O] (ROOT (S (NP
(DT A) (NN p53) (NN Super-tumor) (NNP Suppressor)) (VP (VBZ Reveals) (S (NP
(DT a) (NN Tumor) (JJ Suppressive) (NN p53-Ptpn14-Yap)) (NP (NP (NNP Axis))
(PP (IN in) (NP (JJ Pancreatic) (NN Cancer)))))) (. .)))", "text": "A p53
Super-tumor Suppressor Reveals a Tumor Suppressive p53-Ptpn14-Yap Axis in
Pancreatic Cancer.", "dependencies": [["root", "ROOT", "Reveals"], ["det",
"Suppressor", "A"], ["compound", "Suppressor", "p53"], ["compound",
"Suppressor", "Super-tumor"], ["nsubj", "Reveals", "Suppressor"], ["det",
"p53-Ptpn14-Yap", "a"], ["compound", "p53-Ptpn14-Yap", "Tumor"], ["amod",
"p53-Ptpn14-Yap", "Suppressive"], ["nsubj", "Axis", "p53-Ptpn14-Yap"],
["xcomp", "Reveals", "Axis"], ["case", "Cancer", "in"], ["amod", "Cancer",
"Pancreatic"], ["nmod:in", "Axis", "Cancer"], ["punct", "Reveals", "."]],
"words": [["A", {"NamedEntityTag": "O", "CharacterOffsetEnd": "1", "Lemma":
"a", "PartOfSpeech": "DT", "CharacterOffsetBegin": "0"}]]}]}'

@bpatidar

I could only get this working with the CoreNLP 2014-08-27 release, not the newer version. I was also using the Java 11 JDK on Mac OS X, which required three more jars: javax.xml.bind, activation.jar, and jaxb-impl2.2.jar. I copied them into the unzipped stanford-corenlp folder and updated my corenlp.py to add these three jars as well. With that, the setup worked and parsed the custom entities.
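
For anyone hitting the same thing: the javax.xml.bind / JAXB classes were removed from the JDK in Java 11, so they have to be supplied as ordinary jars on the classpath. A minimal sketch of the corresponding corenlp.py change (the jar file names are assumptions -- match whatever versions you downloaded):

# corenlp.py (sketch): extend the classpath jar list with the JAXB jars that
# Java 11+ no longer bundles; file names are assumptions.
jars += ["jaxb-api.jar",       # javax.xml.bind API
         "activation.jar",     # javax.activation
         "jaxb-impl-2.2.jar"]  # JAXB reference implementation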
