Skip to content

trinker/tagger

Repository files navigation

tagger Follow

Project Status: Active - The project has reached a stable, usable state and is being actively developed.Build Status Coverage Status Version

tagger wraps the NLP and openNLP packages for easier part of speech tagging. tagger uses the openNLP annotator to compute "Penn Treebank parse annotations using the Apache OpenNLP chunking parser for English."

The main functions and descriptions are listed in the table below.

Function Description
tag_pos Tag parts of speech
select_tags Select specific part of speech tags from tag_pos
count_tags Cross tabs of tags by grouping variable

Table of Contents

Installation

To download the development version of tagger:

Download the zip ball or tar ball, decompress and run R CMD INSTALL on it, or use the pacman package to install the development version:

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh(c(
    "trinker/termco", 
    "trinker/coreNLPsetup",        
    "trinker/tagger"
))

Contact

You are welcome to:

Examples

The following examples demonstrate some of the functionality of tagger.

Load the Tools/Data

library(dplyr); library(tagger)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following object is masked from 'package:qdap':
## 
##     %>%

## The following object is masked from 'package:qdapTools':
## 
##     id

## The following objects are masked from 'package:qdapRegex':
## 
##     escape, explain

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

data(presidential_debates_2012)
mwe <- data_frame(
    person = c("Tyler", "Norah", "Tyler"),
    talk = c(
        "I need $54 to go to the movies.",
        "They refuse to permit us to obtain the refuse permit",
        "This is the tagger package; like it?"
    )
)

Tagging

Let's begin with a minimal example.

tag_pos(mwe$talk)

## [1] "I/PRP need/VBP $/$ 54/CD to/TO go/VB to/TO the/DT movies/NNS ./."                     
## [2] "They/PRP refuse/VBP to/TO permit/VB us/PRP to/TO obtain/VB the/DT refuse/NN permit/NN"
## [3] "This/DT is/VBZ the/DT tagger/NN package/NN ;/: like/IN it/PRP ?/."

Note that the out put pretty pints but the underlying structure is simply a lst of named vectors, where the elements in the vectors are the tokens and the names are the part of speech tags. We can use c on the object to see it's true structure.

tag_pos(mwe$talk) %>%
    c()

## [[1]]
##      PRP      VBP        $       CD       TO       VB       TO       DT 
##      "I"   "need"      "$"     "54"     "to"     "go"     "to"    "the" 
##      NNS        . 
## "movies"      "." 
## 
## [[2]]
##      PRP      VBP       TO       VB      PRP       TO       VB       DT 
##   "They" "refuse"     "to" "permit"     "us"     "to" "obtain"    "the" 
##       NN       NN 
## "refuse" "permit" 
## 
## [[3]]
##        DT       VBZ        DT        NN        NN         :        IN 
##    "This"      "is"     "the"  "tagger" "package"       ";"    "like" 
##       PRP         . 
##      "it"       "?"

Let's try it on a larger example, the built in presidential_debates_2012 data set. It'll take 30 seconds or so to run, depending on the machine.

tag_pos(presidential_debates_2012$dialogue)

## 1.    We/PRP 'll/MD talk/VB about/IN specifically/RB ...
## 2.    But/CC what/WP do/VBP you/PRP support/VB the/DT ...
## 3.    What/WP I/PRP support/VBP is/VBZ no/DT change/NN ...
## 4.    And/CC the/DT president/NN supports/VBZ taking/VBG ...
## 5.    And/CC what/WP about/IN the/DT vouchers/NNS ?/.
## .
## .
## .
## 2908. Thank/VB you/PRP so/RB much/RB ./.
## 2909. Gentlemen/NNS ,/, thank/VB you/PRP both/DT so/RB ...
## 2910. That/DT brings/VBZ an/DT end/NN to/TO this/DT ...
## 2911. As/IN I/PRP always/RB do/VBP at/IN the/DT end/NN ...
## 2912. Good/JJ night/NN ./.

This output is built into tagger as the presidential_debates_2012_pos data set, which we'll use form this point on in the demo.

Note that the user may choose to use CoreNLP as a backend by setting engine = "coreNLP". To ensure that coreNLP is setup properly use check_setup.

Plotting

The user can generate a horizontal barplot of the used tags.

presidential_debates_2012_pos %>%
    plot()

Interpreting Tags

The tags generated by openNLP are from Penn Treebank. As such there are many tags, more than the few parts of speech we learned in grade school. Remembering the meaning of each tags may be difficult, therefore the penn_tags creates a left aligned data frame of the possible tags and their meaning.

penn_tags()

##    Tag  Description                                 
## 1  $    dollar                                      
## 2  ``   opening quotation mark                      
## 3  ''   closing quotation mark                      
## 4  (    opening parenthesis                         
## 5  )    closing parenthesis                         
## 6  ,    comma                                       
## 7  -    dash                                        
## 8  .    sentence terminator                         
## 9  :    colon or ellipsis                           
## 10 CC   conjunction, coordinating                   
## 11 CD   numeral, cardinal                           
## 12 DT   determiner                                  
## 13 EX   existential there                           
## 14 FW   foreign word                                
## 15 IN   preposition or conjunction, subordinating   
## 16 JJ   adjective or numeral, ordinal               
## 17 JJR  adjective, comparative                      
## 18 JJS  adjective, superlative                      
## 19 LS   list item marker                            
## 20 MD   modal auxiliary                             
## 21 NN   noun, common, singular or mass              
## 22 NNP  noun, proper, singular                      
## 23 NNPS noun, proper, plural                        
## 24 NNS  noun, common, plural                        
## 25 PDT  pre-determiner                              
## 26 POS  genitive marker                             
## 27 PRP  pronoun, personal                           
## 28 PRP$ pronoun, possessive                         
## 29 RB   adverb                                      
## 30 RBR  adverb, comparative                         
## 31 RBS  adverb, superlative                         
## 32 RP   particle                                    
## 33 SYM  symbol                                      
## 34 TO   "to" as preposition or infinitive marker    
## 35 UH   interjection                                
## 36 VB   verb, base form                             
## 37 VBD  verb, past tense                            
## 38 VBG  verb, present participle or gerund          
## 39 VBN  verb, past participle                       
## 40 VBP  verb, present tense, not 3rd person singular
## 41 VBZ  verb, present tense, 3rd person singular    
## 42 WDT  WH-determiner                               
## 43 WP   WH-pronoun                                  
## 44 WP$  WH-pronoun, possessive                      
## 45 WRB  Wh-adverb

Counts

The user can generate a count of the tags by grouping variable as well. The number of columns explodes quickly, even with this minimal example.

tag_pos(mwe$talk) %>%
    count_tags(mwe$person) 

##   person n.tokens       $        .       :      CD       DT      IN
## 1  Norah       10       0        0       0       0 1(10.0%)       0
## 2  Tyler       19 1(5.3%) 2(10.5%) 1(5.3%) 1(5.3%) 3(15.8%) 1(5.3%)
##         NN     NNS      PRP       TO       VB      VBP     VBZ
## 1 2(20.0%)       0 2(20.0%) 2(20.0%) 2(20.0%) 1(10.0%)       0
## 2 2(10.5%) 1(5.3%) 2(10.5%) 2(10.5%)  1(5.3%)  1(5.3%) 1(5.3%)

The default is a pretty printing (counts + proportions) that can be turned off to print raw counts only.

tag_pos(mwe$talk) %>%
    count_tags(mwe$person) %>%
    print(pretty = FALSE)

##    person n.tokens $ . : CD DT IN NN NNS PRP TO VB VBP VBZ
## 1:  Tyler       19 1 2 1  1  3  1  2   1   2  2  1   1   1
## 2:  Norah       10 0 0 0  0  1  0  2   0   2  2  2   1   0

Select Tags

The user may wish to select specific tags. The select_tags function enables selection of specific tags via element matching (which can be negated) or regular expression.

Here we select only the nouns.

presidential_debates_2012_pos %>%
    select_tags(c("NN", "NNP", "NNPS", "NNS"))

## 1.    health/NN care/NN moment/NN
## 2.    voucher/NN system/NN Governor/NNP
## 3.    change/NN retirees/NNS retirees/NNS Medicare/NNP
## 4.    president/NN dollar/NN program/NN
## 5.    vouchers/NNS
## .
## .
## .
## 2908. 
## 2909. Gentlemen/NNS
## 2910. end/NN year/NN debates/NNS Lynn/NNP University/NNP ...
## 2911. end/NN debates/NNS words/NNS mom/NN vote/NN
## 2912. night/NN

This could also have been accomplished with a simpler regex call by setting regex = TRUE.

presidential_debates_2012_pos %>%
    select_tags("NN", regex=TRUE)

## 1.    health/NN care/NN moment/NN
## 2.    voucher/NN system/NN Governor/NNP
## 3.    change/NN retirees/NNS retirees/NNS Medicare/NNP
## 4.    president/NN dollar/NN program/NN
## 5.    vouchers/NNS
## .
## .
## .
## 2908. 
## 2909. Gentlemen/NNS
## 2910. end/NN year/NN debates/NNS Lynn/NNP University/NNP ...
## 2911. end/NN debates/NNS words/NNS mom/NN vote/NN
## 2912. night/NN

In this way we could quickly select the nouns and verbs with the following call.

presidential_debates_2012_pos %>%
    select_tags("^(VB|NN)", regex=TRUE)

## 1.    talk/VB health/NN care/NN moment/NN
## 2.    do/VBP support/VB voucher/NN system/NN Governor/NNP
## 3.    support/VBP is/VBZ change/NN retirees/NNS ...
## 4.    president/NN supports/VBZ taking/VBG dollar/NN ...
## 5.    vouchers/NNS
## .
## .
## .
## 2908. Thank/VB
## 2909. Gentlemen/NNS thank/VB
## 2910. brings/VBZ end/NN year/NN debates/NNS want/VBP ...
## 2911. do/VBP end/NN debates/NNS leave/VBP words/NNS ...
## 2912. night/NN

Note that the output is a tag_pos class and the plotting, count_tags, and as_word_tag functions can be used on the result.

presidential_debates_2012_pos %>%
    select_tags("^(VB|NN)", regex=TRUE) %>%
    plot()

presidential_debates_2012_pos %>%
    select_tags("^(VB|NN)", regex=TRUE) %>%
    count_tags()

## # A tibble: 2,912 × 11
##    n.tokens        NN      NNP  NNPS       NNS       VB   VBD      VBG
##       <dbl>     <chr>    <chr> <chr>     <chr>    <chr> <chr>    <chr>
## 1         4  3(75.0%)        0     0         0 1(25.0%)     0        0
## 2         5  2(40.0%) 1(20.0%)     0         0 1(20.0%)     0        0
## 3         6  1(16.7%) 1(16.7%)     0  2(33.3%)        0     0        0
## 4         5  3(60.0%)        0     0         0        0     0 1(20.0%)
## 5         1         0        0     0 1(100.0%)        0     0        0
## 6         3  1(33.3%)        0     0         0        0     0        0
## 7        16  4(25.0%) 2(12.5%)     0   1(6.2%) 4(25.0%)     0  1(6.2%)
## 8         1 1(100.0%)        0     0         0        0     0        0
## 9         6  1(16.7%)        0     0  1(16.7%) 3(50.0%)     0        0
## 10        5  2(40.0%)        0     0         0 2(40.0%)     0        0
## # ... with 2,902 more rows, and 3 more variables: VBN <chr>, VBP <chr>,
## #   VBZ <chr>

Altering Tag Display

As Word Tags

The traditional way to display tags is to incorporate them into the sentence, placing them after/before their respective token, separated by a forward slash (e.g., talk/VB). This is the default printing style of tag_pos though not truly the structure of the output. The user can coerce the underlying structure with the as_word_tag function, converting the named list of vectors into a list of part of speech incorporated, unnamed vectors. Below I only print the first 6 elements of as_word_tag.

presidential_debates_2012_pos %>%
    as_word_tag() %>%
    head()

## [1] "We/PRP 'll/MD talk/VB about/IN specifically/RB about/IN health/NN care/NN in/IN a/DT moment/NN ./."                                        
## [2] "But/CC what/WP do/VBP you/PRP support/VB the/DT voucher/NN system/NN ,/, Governor/NNP ?/."                                                 
## [3] "What/WP I/PRP support/VBP is/VBZ no/DT change/NN for/IN current/JJ retirees/NNS and/CC near/IN retirees/NNS to/TO Medicare/NNP ./."        
## [4] "And/CC the/DT president/NN supports/VBZ taking/VBG dollar/NN seven/CD hundred/CD sixteen/CD billion/CD out/IN of/IN that/DT program/NN ./."
## [5] "And/CC what/WP about/IN the/DT vouchers/NNS ?/."                                                                                           
## [6] "So/IN that/DT 's/VBZ that/DT 's/VBZ number/NN one/CD ./."

As Tuples

Python uses a tuple construction of parts of speech to display tags. This can be a useful structure. Essentially the structure is a lists of lists of two element vectors. Each vector contains a word and a part of speech tag. as_tuple uses the following R structuring:

list(list(c("word", "tag"), c("word", "tag")), list(c("word", "tag")))

but prints to the console in the Python way. Using print(as_tuple(x), truncate=Inf, file="out.txt") allows the user to print to an external file.

tag_pos(mwe$talk) %>%
    as_tuple() %>%
    print(truncate=Inf)

## [[("I", "PRP"), ("need", "VBP"), ("$", "$"), ("54", "CD"), ("to", "TO"), ("go", "VB"), ("to", "TO"), ("the", "DT"), ("movies", "NNS"), (".", ".")], [("They", "PRP"), ("refuse", "VBP"), ("to", "TO"), ("permit", "VB"), ("us", "PRP"), ("to", "TO"), ("obtain", "VB"), ("the", "DT"), ("refuse", "NN"), ("permit", "NN")], [("This", "DT"), ("is", "VBZ"), ("the", "DT"), ("tagger", "NN"), ("package", "NN"), (";", ":"), ("like", "IN"), ("it", "PRP"), ("?", ".")]]

As Universal Tags

Petrov, Das, & McDonald (2011) provide a mapping to convert Penn Treebank tags into universal part of speech tags. The as_universal function harnesses this mapping.

tag_pos(mwe$talk) %>%
    as_universal()

## [1] "I/PRON need/VERB $/. 54/NUM to/PRT go/VERB to/PRT the/DET movies/NOUN ./."                          
## [2] "They/PRON refuse/VERB to/PRT permit/VERB us/PRON to/PRT obtain/VERB the/DET refuse/NOUN permit/NOUN"
## [3] "This/DET is/VERB the/DET tagger/NOUN package/NOUN ;/. like/ADP it/PRON ?/."

The out put is a tag_pos object and thus has a generic plot method.

tag_pos(mwe$talk) %>%
    as_universal() %>%
    plot()

tag_pos(mwe$talk) %>%
    as_universal() %>%
    count_tags()

##   n.tokens        .      ADP      DET     NOUN      NUM     PRON      PRT
## 1       10 2(20.0%)        0 1(10.0%) 1(10.0%) 1(10.0%) 1(10.0%) 2(20.0%)
## 2       10        0        0 1(10.0%) 2(20.0%)        0 2(20.0%) 2(20.0%)
## 3        9 2(22.2%) 1(11.1%) 2(22.2%) 2(22.2%)        0 1(11.1%)        0
##       VERB
## 1 2(20.0%)
## 2 3(30.0%)
## 3 1(11.1%)

As Basic Tags

as_basic provides an even more coarse tagset than as_universal. Basic tags include: (a) nouns, (b) adjectives, (c) prepositions, (d) articles, (e) verb, (f) pronouns, (g) adverbs, (h) interjections, & (i) conjunctions. The X and . tags are retained for punctuation and unclassified parts of speech.

tag_pos(mwe$talk) %>%
    as_basic()

## [1] "I/pronoun need/verb $/. 54/adjective to/preposition go/verb to/preposition the/article movies/noun ./."                       
## [2] "They/pronoun refuse/verb to/preposition permit/verb us/pronoun to/preposition obtain/verb the/article refuse/noun permit/noun"
## [3] "This/adjective is/verb the/article tagger/noun package/noun ;/. like/preposition it/pronoun ?/."

This tagset can be useful for more coarse purposes, including formality (Heylighen & Dewaele, 2002) scoring.

  • Heylighen, F., & Dewaele, J.M. (2002). Variation in the contextuality of language: An empirical measure. Context in Context, Special issue of Foundations of Science, 7 (3), 293-340.

The output is a tag_pos object and thus has a generic plot method.

tag_pos(mwe$talk) %>%
    as_basic() %>%
    plot()

tag_pos(mwe$talk) %>%
    as_basic() %>%
    count_tags()

##   n.tokens        . adjective  article     noun preposition  pronoun
## 1       10 2(20.0%)  1(10.0%) 1(10.0%) 1(10.0%)    2(20.0%) 1(10.0%)
## 2       10        0         0 1(10.0%) 2(20.0%)    2(20.0%) 2(20.0%)
## 3        9 2(22.2%)  1(11.1%) 1(11.1%) 2(22.2%)    1(11.1%) 1(11.1%)
##       verb
## 1 2(20.0%)
## 2 3(30.0%)
## 3 1(11.1%)

Releases

No releases published

Packages

No packages published

Languages