Additional un-escaping during training
nikitakit committed Jun 28, 2019
1 parent 6f9094a commit 589825e
Showing 1 changed file with 7 additions and 0 deletions.
7 changes: 7 additions & 0 deletions src/parse_nk.py
@@ -962,6 +962,13 @@ def parse_batch(self, sentences, golds=None, return_label_scores_charts=False):
 cleaned_words = []
 for _, word in sentence:
     word = BERT_TOKEN_MAPPING.get(word, word)
+    # This un-escaping for / and * was not yet added for the
+    # parser version in https://arxiv.org/abs/1812.11760v1
+    # and related model releases (e.g. benepar_en2)
+    word = word.replace('\\/', '/').replace('\\*', '*')
+    # Mid-token punctuation occurs in biomedical text
+    word = word.replace('-LSB-', '[').replace('-RSB-', ']')
+    word = word.replace('-LRB-', '(').replace('-RRB-', ')')
     if word == "n't" and cleaned_words:
         cleaned_words[-1] = cleaned_words[-1] + "n"
         word = "'t"
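
For reference, the effect of the added lines can be tried in isolation. The following is a minimal standalone sketch, not the code in src/parse_nk.py: the name `clean_words` and the small `TOKEN_MAPPING` dict are hypothetical stand-ins (the real BERT_TOKEN_MAPPING is not shown in this hunk and may hold different entries), and the final `append` is assumed to follow the context lines shown in the diff.

```python
# Hypothetical stand-in for BERT_TOKEN_MAPPING; the real table is not shown
# in this hunk and may contain more (or different) entries.
TOKEN_MAPPING = {"-LRB-": "(", "-RRB-": ")", "-LSB-": "[", "-RSB-": "]"}


def clean_words(words):
    """Un-escape treebank-style tokens before they are passed to BERT."""
    cleaned = []
    for word in words:
        # Whole-token lookup, as in the context line of the diff.
        word = TOKEN_MAPPING.get(word, word)
        # Un-escape "\/" and "\*" (escaped slash and asterisk).
        word = word.replace('\\/', '/').replace('\\*', '*')
        # Bracket escapes can also occur mid-token, where the whole-token
        # lookup above does not match.
        word = word.replace('-LSB-', '[').replace('-RSB-', ']')
        word = word.replace('-LRB-', '(').replace('-RRB-', ')')
        # Re-split "do" + "n't" as "don" + "'t".
        if word == "n't" and cleaned:
            cleaned[-1] = cleaned[-1] + "n"
            word = "'t"
        # Assumed: the original loop appends the cleaned word here.
        cleaned.append(word)
    return cleaned


if __name__ == "__main__":
    print(clean_words(["1\\/2", "IL-2-LRB-alpha-RRB-", "do", "n't"]))
```

The mid-token case is the one the whole-token lookup misses: a token such as `IL-2-LRB-alpha-RRB-` is not a key in the mapping, so only the substring replacements turn it into `IL-2(alpha)`.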
