
Part of Speech Tagging


Parts-of-Speech

  • Also known as POS, word classes, or syntactic categories
  • POS tagging is the process of assigning a part-of-speech tag to each word in a sentence
  • POS tagging is also called a sequence labeling problem because the task is to map a sentence x1, x2, ..., xn to a tag sequence y1, y2, ..., yn, as in the sketch below
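
A minimal illustration of this sequence-labeling view in Python (the tags here are hand-assigned Penn Treebank tags, purely for illustration):

```python
# POS tagging maps a word sequence x1..xn to a tag sequence y1..yn.
words = ["I", "booked", "a", "flight"]   # x1, x2, ..., xn
tags = ["PRP", "VBD", "DT", "NN"]        # y1, y2, ..., yn (Penn Treebank tags)

for x, y in zip(words, tags):
    print(f"{x}/{y}", end=" ")           # I/PRP booked/VBD a/DT flight/NN
```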

Why Parts-of-Speech?

  • POS tags are helpful as they provide information about a word and its neighbors
  • For example, knowing that a word is a noun tells us about its neighbors, since nouns are often preceded by adjectives and determiners (articles are a sub-type); likewise, verbs are often preceded by nouns
  • Tags also give information about the syntactic structure around the word, since nouns are generally part of noun phrases
    • Therefore, parts-of-speech are one of the important components of syntactic parsing
  • Apart from this, POS tags are also useful for finding named entities such as people or locations in a given text
  • Also useful for text-to-speech (how to pronounce "lead"?)
    • "lead" is pronounced differently as a noun and as a verb

Word Classes

  • POS tags are divided into two categories:
    • Closed class
      • Called closed class because membership is relatively fixed
      • Determiners (articles are a sub-type): a, an, the, this, that, etc.
      • Pronouns: she, he, it, they, etc.
      • Prepositions: on, under, over, etc.
      • Conjunctions: and, or, etc.
      • Auxiliary verbs: can, could, might, are, is, etc.
      • Particles: up, down, on, off, by, etc.
      • Numerals: one, two, three, first, second, third
      • Interjections: oh, hey, alas, uh, etc.
      • New prepositions, determiners, and pronouns are not invented, unlike new nouns and verbs
      • Also called function words; words such as it, and, of, and that occur frequently and have structuring uses in grammar
    • Open class
      • Called open class because new members are introduced every day
      • Nouns, verbs, adjectives, adverbs
      • New nouns like iPhone or Samsung and new verbs like to fax keep being created

Penn Treebank Part-of-Speech Tagset

  • Modern language processing on English uses the Penn Treebank tagset, which has 45 POS tags (see the lookup sketch below)
  • It has been used to tag a variety of corpora, such as the Brown corpus and the Wall Street Journal corpus
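
As a quick way to explore the tagset, NLTK can print the definition and examples for each Penn Treebank tag; a small sketch, assuming NLTK is installed and the 'tagsets' resource has been downloaded:

```python
import nltk
# nltk.download('tagsets')  # one-time download of the tag documentation

nltk.help.upenn_tagset("NN")    # noun, common, singular or mass
nltk.help.upenn_tagset("VB.*")  # the argument is a regex: all verb tags
```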

[Figure: Penn Treebank POS tagset]

  • Image taken from Speech and Language Processing by Daniel Jurafsky & James H. Martin

Part-of-Speech Tagging

  • Tags are also applied to punctuation such as commas, left quotes, right quotes, colons (:), etc.
  • Therefore, word tokenization is performed on the given sentence before tagging
  • The input to a tagging algorithm is a sequence of words and a tagset, and the output is a sequence of tags (the single best tag for each word)
  • Tagging is a disambiguation task: words are ambiguous because a word can have more than one part-of-speech tag, and the goal is to find the tag that suits the context
    • For example, the word "book" can be a verb (book that flight) or a noun (hand me that book please)
  • The challenge is to resolve these ambiguities and choose the suitable tag for the context, as in the sketch below
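
A minimal sketch of this disambiguation with NLTK's off-the-shelf tagger (the exact output depends on the tagger model, and the 'punkt' and 'averaged_perceptron_tagger' resources are assumed to be downloaded; resource names can vary by NLTK version):

```python
import nltk
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

for sentence in ["Book that flight", "Hand me that book please"]:
    print(nltk.pos_tag(nltk.word_tokenize(sentence)))
# "book" should come out as a verb (VB) in the first sentence
# and as a noun (NN) in the second, based on the surrounding context.
```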

POS Performance

  • Performance is measured by accuracy (the percentage of tags the algorithm gets right)
  • Current POS taggers reach around 97% accuracy
  • But a simple baseline tagger already reaches about 90% accuracy
    • How the baseline works (see the sketch after this list):
      • Tag every word with its most common tag in the training data
      • Tag unknown/new words as nouns
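
A minimal sketch of this baseline, trained on NLTK's Penn Treebank sample (assuming nltk.download('treebank') has been run):

```python
from collections import Counter, defaultdict
import nltk

# Count how often each tag occurs with each word in the training data.
counts = defaultdict(Counter)
for word, tag in nltk.corpus.treebank.tagged_words():
    counts[word][tag] += 1

# Keep only the most frequent tag per word; unseen words default to NN.
most_frequent = {w: c.most_common(1)[0][0] for w, c in counts.items()}

def baseline_tag(words):
    return [(w, most_frequent.get(w, "NN")) for w in words]

print(baseline_tag(["The", "flight", "was", "delayed", "yesterday"]))
```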

Sources of Information for POS Tagging

  • First, knowledge of neighboring words is very important for POS tagging
    • Consider the possible tags for each word in "Bill saw that man yesterday":

          Bill   saw     that   man   yesterday
          NNP    NN      DT     NN    NN
          VB     VB(D)   IN     VB    NN

    • Here we have a lot of tagging ambiguity: "Bill" could be a proper noun or a verb, "saw" could be a noun or a (past-tense) verb, and so on; the neighboring words help resolve each choice
  • Another important source of evidence for POS tagging is knowledge of word probabilities
    • For example, "man" is rarely used as a verb, so even without looking at the context of the sentence we can assign a likely POS tag (see the sketch after this list)
  • Both sources of evidence are useful for assigning a POS tag to each word
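
A small sketch of the word-probability evidence, estimating tag frequencies for "man" from NLTK's Penn Treebank sample (assuming the 'treebank' resource is downloaded):

```python
import nltk

# Condition on the word, count its tags across the corpus.
cfd = nltk.ConditionalFreqDist(nltk.corpus.treebank.tagged_words())
print(cfd["man"].most_common())
# The counts should be dominated by NN: "man" is rarely a verb, so even
# without context, NN is a very safe guess for this word.
```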

POS Tagging Can Be Improved by Taking More Features from Words

  • If we extract more features from words, we can perform better (see the sketch after this list):
    • Word: the -> DT (the word "the" almost always has the POS tag DT)
    • Prefixes: unbelievable: un- -> JJ (words with the prefix un- are mostly adjectives, JJ)
    • Suffixes: slowly: -ly -> RB (words ending with the suffix -ly are mostly adverbs, RB)
    • Capitalization: Pacific: capitalized -> NNP (if a word is capitalized, it is most probably a proper noun)
    • Word shapes: 40-year: d-x (digits followed by letters) -> JJ (adjective)
  • A maxent POS tagger built from such individual word features reaches:
    • Maxent P(t|w): 93.7% overall / 82.6% on unknown words
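
A sketch of such word-level features as a Python feature-extraction function; the feature names and exact feature set are illustrative, not a fixed API:

```python
def word_features(word):
    """Per-word features a maxent (logistic regression) tagger might use."""
    return {
        "word": word.lower(),                    # the word itself: "the" -> DT
        "prefix2": word[:2],                     # "un-" often signals JJ
        "suffix2": word[-2:],                    # "-ly" often signals RB
        "is_capitalized": word[0].isupper(),     # "Pacific" -> NNP
        "has_digit": any(c.isdigit() for c in word),  # "40-year" -> JJ
        "has_hyphen": "-" in word,               # part of the word shape
    }

print(word_features("unbelievable"))
```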

Rough Accuracy of POS Tagger Models

  • Baseline tagger (assigns the most frequent tag) gives accuracy of ~90% overall / ~50% on unknown words
  • Maxent P(t|w) (t = tag, w = word) gives accuracy of 93.7% overall / 82.6% on unknown words; it uses word features to assign POS tags
  • Trigrams'n'Tags (TnT, an HMM-based tagger) gives accuracy of 96.2% overall / 86.0% on unknown words:
    • This model combines the idea of word features with a hidden Markov model, which yields higher accuracy
    • It adds another feature: wider tag context
    • It uses history to predict a word's tag; the probability of the current tag depends on the two previous tags in the sentence
    • It uses the Viterbi algorithm to find the most likely tag sequence (see the sketch at the end of this section)
    • To tag unknown words, it uses word suffixes: for example, words ending with -s are likely plural nouns, and words ending with -able are likely adjectives
    • Furthermore, this model is efficient and is trainable on different languages and on any tagset
  • Maximum Entropy Markov Model (MEMM) tagger gives accuracy of 96.9% overall / 86.9% on unknown words:
    • It uses more features to assign the POS tags of words, as shown in the figure:

[Figure: features used by an MEMM POS tagger]

    • Image taken from Speech and Language Processing by Daniel Jurafsky & James H. Martin
  • Bidirectional dependencies give accuracy of 97.2% overall / 90.0% on unknown words:
    • A limitation of MEMM and HMM models is that they run left-to-right only
    • Results can therefore improve if the decision about the tag of word w(i) can also use information from the future tags t(i+1) and t(i+2)
    • The Stanford tagger uses a bidirectional version of the Maximum Entropy Markov Model (MEMM)
    • Any tagger model can be converted into a bidirectional version by using multiple passes; in other words, the tagger is run twice, once left-to-right and once right-to-left
    • Greedy decoding then chooses, for each word, the highest-scoring tag from the left-to-right and right-to-left classifiers
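
As a closing illustration, here is a minimal sketch of Viterbi decoding for an HMM tagger; for simplicity it uses bigram transitions rather than TnT's trigrams, and the start/transition/emission probability tables are assumed to be already estimated (the toy numbers below are made up):

```python
import math

def viterbi(words, tags, start, trans, emit):
    """Most likely tag sequence for `words` under a bigram HMM.

    start[t]      = P(t at sentence start)
    trans[(p, t)] = P(t | previous tag p)
    emit[(t, w)]  = P(w | t)
    Unseen events get a tiny floor probability as crude smoothing.
    """
    def lp(table, key):  # floored log-probability
        return math.log(table.get(key, 1e-12))

    # Best log-score of any path ending in tag t after the first word.
    V = [{t: lp(start, t) + lp(emit, (t, words[0])) for t in tags}]
    back = []  # back[i][t] = best previous tag for t at word i+1

    for w in words[1:]:
        scores, ptrs = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: V[-1][p] + lp(trans, (p, t)))
            scores[t] = V[-1][prev] + lp(trans, (prev, t)) + lp(emit, (t, w))
            ptrs[t] = prev
        V.append(scores)
        back.append(ptrs)

    # Backtrack from the best final tag.
    best = max(tags, key=lambda t: V[-1][t])
    path = [best]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return list(reversed(path))

# Toy example with made-up probabilities:
tags = ["DT", "NN", "VB"]
start = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
trans = {("DT", "NN"): 0.9, ("NN", "VB"): 0.6, ("VB", "DT"): 0.7}
emit = {("DT", "the"): 0.7, ("NN", "dog"): 0.1, ("VB", "barks"): 0.05}
print(viterbi(["the", "dog", "barks"], tags, start, trans, emit))  # ['DT', 'NN', 'VB']
```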