To read this in Portuguese, click here: README in Portuguese
Chunking is a way of grouping sequential elements of a text (sentences) into phrases, such as noun phrases, verb phrases, and prepositional phrases, using their part-of-speech (POS) tags. This is unlike named entity recognition (NER), which finds and classifies relevant pieces of text into predefined categories.
In this work, we extract the noun phrases (phrases that have a noun as their head).
We use two methods to generate the POS tags of sentences:

- The spacy library, to tokenize the sentence and extract the POS tag of each word, using the pt_core_news_md model.
- A token-classification BERT model trained with the MacMorpho corpus, using as checkpoint the BioBERTpt model, which was trained with clinical and biomedical texts in Portuguese.
Next, we create a function that extracts all the nouns from the sentence, grouping each one with its complements (adjectives, adverbs, etc.).
Example:
---Original sentence:---
Data de Criação do Documento: 22/04/2014 Dispneia importante aos esforços + dor tipo peso no peito no esforço. Obeso, has, icc c # cintilografia miocardica para avaliar angina.
---Sentence's chunks:---
['Data de Criação do Documento 22/04/2014', 'Dispneia importante aos esforços', 'dor tipo peso no peito no esforço', 'Obeso', 'has', 'icc', 'cintilografia miocardica', 'angina']
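The grouping step can be sketched as follows. This is a simplified illustration over pre-tagged tokens, not the repository's actual implementation: the tag sets `NOUN_TAGS` and `MODIFIER_TAGS` are assumptions chosen for the example, and real chunking handles more cases (e.g. standalone adjectives like "Obeso" in the output above).

```python
# Simplified noun-phrase grouping over (token, MacMorpho-tag) pairs.
# NOTE: illustrative sketch only; tag sets are assumptions, not the
# repository's exact rules.

NOUN_TAGS = {"N", "NPROP"}
MODIFIER_TAGS = {"ADJ", "ADV", "ART", "NUM", "PREP", "PREP+ART", "PCP"}

def extract_chunks(tagged_tokens):
    """Collect runs of noun/modifier tokens; keep runs containing a noun."""
    chunk_tags = NOUN_TAGS | MODIFIER_TAGS
    chunks, current = [], []
    for token, tag in tagged_tokens:
        if tag in chunk_tags:
            current.append((token, tag))
        else:
            # A non-chunk tag (verb, conjunction, punctuation...) ends the run.
            if any(t in NOUN_TAGS for _, t in current):
                chunks.append(" ".join(w for w, _ in current))
            current = []
    if any(t in NOUN_TAGS for _, t in current):
        chunks.append(" ".join(w for w, _ in current))
    return chunks

tokens = [("Dispneia", "N"), ("importante", "ADJ"), ("aos", "PREP+ART"),
          ("esforços", "N"), ("e", "KC"), ("dor", "N"),
          ("no", "PREP+ART"), ("peito", "N")]
print(extract_chunks(tokens))
# → ['Dispneia importante aos esforços', 'dor no peito']
```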
In addition to the POS-tagger model provided by spacy, we also trained our own model by fine-tuning the BioBERTpt(all) language model on the Portuguese MacMorpho corpus for 10 epochs, reaching an overall F1-score of 0.9814.
Our model is in the official Hugging Face repository; you can access it at https://huggingface.co/pucpr-br/postagger-bio-portuguese.

If you appreciate our work, don't forget to like the model on Hugging Face ❤️
How to use the POS-tagger model (without the chunking part):
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("pucpr-br/postagger-bio-portuguese")
model = AutoModelForTokenClassification.from_pretrained("pucpr-br/postagger-bio-portuguese")
```
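For quick experiments, the model can also be wrapped in a transformers `pipeline` for token classification; this is a generic usage sketch, not an official snippet from the model card. Note that the model weights are downloaded from the Hugging Face Hub on first use.

```python
from transformers import pipeline

# Token-classification pipeline over the POS-tagger model from this repository.
pos_tagger = pipeline(
    "token-classification",
    model="pucpr-br/postagger-bio-portuguese",
)

# Print each (sub)token with its predicted MacMorpho tag.
for token in pos_tagger("Dispneia importante aos esforços"):
    print(token["word"], token["entity"])
```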
Here are the grammatical classes returned by the model:
| Acronym | Meaning |
|---|---|
| ADJ | Adjective |
| ADV | Adverb |
| ADV-KS | Subordinating conjunctive adverb |
| ADV-KS-REL | Subordinating relative adverb |
| ART | Article |
| CUR | Currency |
| IN | Interjection |
| KC | Coordinating conjunction |
| KS | Subordinating conjunction |
| N | Noun |
| NPROP | Proper noun |
| NUM | Number |
| PCP | Participle |
| PDEN | Denotative word |
| PREP | Preposition |
| PROADJ | Adjective pronoun |
| PRO-KS | Subordinating conjunctive pronoun |
| PRO-KS-REL | Subordinating relative connective pronoun |
| PROPESS | Personal pronoun |
| PROSUB | Noun pronoun |
| V | Verb |
| VAUX | Auxiliary verb |
More information and examples at: http://nilc.icmc.usp.br/macmorpho/macmorpho-manual.pdf
PS: In case you need other POS-taggers trained for the Portuguese language in the clinical or medical domain, you can also try these models trained with Flair.
To generate the chunks (noun phrases), you can run the code directly from these notebooks: with spacy and with POS-Tagger Bio Portuguese.

Or run a server and access it via a web interface, following the steps below (the examples below use the spacy library, as it is a lighter model to run, especially within containers).
- Clone this repository
- Install the necessary libraries (if you prefer, use Anaconda):

```shell
pip install flask==4.3.0
pip install spacy==2.3.7
```

or through the command:

```shell
pip install -r requirements.txt
```
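Note that the pt_core_news_md pipeline used by spacy is distributed separately from the library itself; assuming a standard spacy setup (this step is not listed in the original instructions), it is downloaded with:

```shell
python -m spacy download pt_core_news_md
```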
- Run app.py (it is configured to run on port 5000):

```shell
python app.py
```
- In the browser, go to http://localhost:5000/
- Write a clinical sentence or select one of the example sentences and click the search button. All the chunks identified in the input sentence will be returned highlighted in color.
To run the API inside a Docker container, where you don't need to worry about the environment and libraries, just follow these steps:

- If you don't have it, install Docker following these guidelines.
- Run the following commands (to run the container on port 5000):

```shell
docker build -t chunking .
docker run --name chunking_instance -p 0.0.0.0:5000:5000 -d chunking
```

You can also run it directly from our image on Docker Hub:

```shell
docker run --name chunking_instance -p 0.0.0.0:5000:5000 -d terumi/chunking:version1
```
- In the browser, go to http://localhost:5000/
**soon**