
Part of Speech Tagging


Parts-of-Speech

  • Also known as POS, word classes, or syntactic categories
  • POS tagging is the process of assigning a part-of-speech tag to each word in a sentence
  • POS tagging is also called a sequence labeling problem because the task is to map a sentence x1, x2, ..., xn to a tag sequence y1, y2, ..., yn, as in the sketch below
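
A minimal illustration of this sequence-labeling view in Python (the tags here are hand-assigned Penn Treebank tags, purely for illustration):

```python
# POS tagging maps a word sequence x1..xn to a tag sequence y1..yn.
words = ["I", "booked", "a", "flight"]   # x1, x2, ..., xn
tags = ["PRP", "VBD", "DT", "NN"]        # y1, y2, ..., yn (Penn Treebank tags)

for x, y in zip(words, tags):
    print(f"{x}/{y}", end=" ")           # I/PRP booked/VBD a/DT flight/NN
```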

Why Parts-of-Speech?

  • POS tags are helpful as they provide information about a word and its neighbors
  • For example, knowing that a word is a noun tells us about its neighbors, since nouns are often preceded by adjectives and determiners (articles are a sub-type); likewise, verbs are often preceded by nouns
  • Tags also give information about the syntactic structure around the word, since nouns are generally part of noun phrases
    • Therefore, parts-of-speech are one of the important components of syntactic parsing
  • Apart from this, POS tags are also useful for finding named entities such as people or locations in a given text
  • Also useful for text-to-speech (how to pronounce "lead"?)
    • "lead" is pronounced differently as a noun and as a verb

Word Classes

  • POS tags are divided into two categories:
    • Closed class
      • Called closed class because membership is relatively fixed
      • Determiners (articles are a sub-type): a, an, the, this, that, etc.
      • Pronouns: she, he, it, they, etc.
      • Prepositions: on, under, over, etc.
      • Conjunctions: and, or, etc.
      • Auxiliary verbs: can, could, might, are, is, etc.
      • Particles: up, down, on, off, by, etc.
      • Numerals: one, two, three, first, second, third
      • Interjections: oh, hey, alas, uh, etc.
      • New prepositions, determiners, and pronouns are not invented, unlike new nouns and verbs
      • Also called function words; words such as it, and, of, and that occur frequently and have structuring uses in grammar
    • Open class
      • Called open class because new members are introduced every day
      • Nouns, verbs, adjectives, adverbs
      • New nouns like iPhone or Samsung and new verbs like to fax keep being created

Penn Treebank Part-of-Speech Tagset

  • Modern language processing on English uses the Penn Treebank tagset, which has 45 POS tags (see the lookup sketch below)
  • It has been used to tag a variety of corpora, such as the Brown corpus and the Wall Street Journal corpus
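
As a quick way to explore the tagset, NLTK can print the definition and examples for each Penn Treebank tag; a small sketch, assuming NLTK is installed and the 'tagsets' resource has been downloaded:

```python
import nltk
# nltk.download('tagsets')  # one-time download of the tag documentation

nltk.help.upenn_tagset("NN")    # noun, common, singular or mass
nltk.help.upenn_tagset("VB.*")  # the argument is a regex: all verb tags
```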

[Figure: Penn Treebank POS tagset]

  • Image taken from Speech and Language Processing by Daniel Jurafsky & James H. Martin

Part-of-Speech Tagging

  • Tags are also applied to punctuation such as commas, left quotes, right quotes, colons (:), etc.
  • Therefore, word tokenization is performed on the given sentence before tagging
  • The input to a tagging algorithm is a sequence of words and a tagset, and the output is a sequence of tags (the single best tag for each word)
  • Tagging is a disambiguation task: words are ambiguous because a word can have more than one part-of-speech tag, and the goal is to find the tag that suits the context
    • For example, the word "book" can be a verb (book that flight) or a noun (hand me that book please)
  • The challenge is to resolve these ambiguities and choose the suitable tag for the context, as in the sketch below
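
A minimal sketch of this disambiguation with NLTK's off-the-shelf tagger (the exact output depends on the tagger model, and the 'punkt' and 'averaged_perceptron_tagger' resources are assumed to be downloaded; resource names can vary by NLTK version):

```python
import nltk
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

for sentence in ["Book that flight", "Hand me that book please"]:
    print(nltk.pos_tag(nltk.word_tokenize(sentence)))
# "book" should come out as a verb (VB) in the first sentence
# and as a noun (NN) in the second, based on the surrounding context.
```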

POS Performance

  • Performance is measured by accuracy (the percentage of tags the algorithm gets right)
  • Current POS taggers reach around 97% accuracy
  • But a simple baseline tagger already reaches about 90% accuracy
    • How the baseline works (see the sketch after this list):
      • Tag every word with its most common tag in the training data
      • Tag unknown/new words as nouns
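
A minimal sketch of this baseline, trained on NLTK's Penn Treebank sample (assuming nltk.download('treebank') has been run):

```python
from collections import Counter, defaultdict
import nltk

# Count how often each tag occurs with each word in the training data.
counts = defaultdict(Counter)
for word, tag in nltk.corpus.treebank.tagged_words():
    counts[word][tag] += 1

# Keep only the most frequent tag per word; unseen words default to NN.
most_frequent = {w: c.most_common(1)[0][0] for w, c in counts.items()}

def baseline_tag(words):
    return [(w, most_frequent.get(w, "NN")) for w in words]

print(baseline_tag(["The", "flight", "was", "delayed", "yesterday"]))
```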

Sources of Information for POS Tagging

  • First, knowledge of neighboring words is very important for POS tagging
    • Consider the possible tags for each word in "Bill saw that man yesterday":

          Bill   saw     that   man   yesterday
          NNP    NN      DT     NN    NN
          VB     VB(D)   IN     VB    NN

    • Here we have a lot of tagging ambiguity: "Bill" could be a proper noun or a verb, "saw" could be a noun or a (past-tense) verb, and so on; the neighboring words help resolve each choice
  • Another important source of evidence for POS tagging is knowledge of word probabilities
    • For example, "man" is rarely used as a verb, so even without looking at the context of the sentence we can assign a likely POS tag (see the sketch after this list)
  • Both sources of evidence are useful for assigning a POS tag to each word
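
A small sketch of the word-probability evidence, estimating tag frequencies for "man" from NLTK's Penn Treebank sample (assuming the 'treebank' resource is downloaded):

```python
import nltk

# Condition on the word, count its tags across the corpus.
cfd = nltk.ConditionalFreqDist(nltk.corpus.treebank.tagged_words())
print(cfd["man"].most_common())
# The counts should be dominated by NN: "man" is rarely a verb, so even
# without context, NN is a very safe guess for this word.
```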

POS Tagging Can Be Improved by Taking More Features from Words

  • If we extract more features from words, we can perform better (see the sketch after this list):
    • Word: the -> DT (the word "the" almost always has the POS tag DT)
    • Prefixes: unbelievable: un- -> JJ (words with the prefix un- are mostly adjectives, JJ)
    • Suffixes: slowly: -ly -> RB (words ending with the suffix -ly are mostly adverbs, RB)
    • Capitalization: Pacific: capitalized -> NNP (if a word is capitalized, it is most probably a proper noun)
    • Word shapes: 40-year: d-x (digits followed by letters) -> JJ (adjective)
  • A maxent POS tagger built from such individual word features reaches:
    • Maxent P(t|w): 93.7% overall / 82.6% on unknown words
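
A sketch of such word-level features as a Python feature-extraction function; the feature names and exact feature set are illustrative, not a fixed API:

```python
def word_features(word):
    """Per-word features a maxent (logistic regression) tagger might use."""
    return {
        "word": word.lower(),                    # the word itself: "the" -> DT
        "prefix2": word[:2],                     # "un-" often signals JJ
        "suffix2": word[-2:],                    # "-ly" often signals RB
        "is_capitalized": word[0].isupper(),     # "Pacific" -> NNP
        "has_digit": any(c.isdigit() for c in word),  # "40-year" -> JJ
        "has_hyphen": "-" in word,               # part of the word shape
    }

print(word_features("unbelievable"))
```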

Rough Accuracy of POS Tagger Models

  • Baseline tagger (assigns the most frequent tag) gives accuracy of ~90% overall / ~50% on unknown words
  • Maxent P(t|w) (t = tag, w = word) gives accuracy of 93.7% overall / 82.6% on unknown words; it uses word features to assign POS tags
  • Trigrams'n'Tags (TnT, an HMM-based tagger) gives accuracy of 96.2% overall / 86.0% on unknown words:
    • This model combines the idea of word features with a hidden Markov model, which yields higher accuracy
    • It adds another feature: wider tag context
    • It uses history to predict a word's tag; the probability of the current tag depends on the two previous tags in the sentence
    • It uses the Viterbi algorithm to find the most likely tag sequence (see the sketch at the end of this section)
    • To tag unknown words, it uses word suffixes: for example, words ending with -s are likely plural nouns, and words ending with -able are likely adjectives
    • Furthermore, this model is efficient and is trainable on different languages and on any tagset
  • Maximum Entropy Markov Model (MEMM) tagger gives accuracy of 96.9% overall / 86.9% on unknown words:
    • It uses more features to assign the POS tags of words, as shown in the figure:

[Figure: features used by an MEMM POS tagger]

    • Image taken from Speech and Language Processing by Daniel Jurafsky & James H. Martin
  • Bidirectional dependencies give accuracy of 97.2% overall / 90.0% on unknown words:
    • A limitation of MEMM and HMM models is that they run left-to-right only
    • Results can therefore improve if the decision about the tag of word w(i) can also use information from the future tags t(i+1) and t(i+2)
    • The Stanford tagger uses a bidirectional version of the Maximum Entropy Markov Model (MEMM)
    • Any tagger model can be converted into a bidirectional version by using multiple passes; in other words, the tagger is run twice, once left-to-right and once right-to-left
    • Greedy decoding then chooses, for each word, the highest-scoring tag from the left-to-right and right-to-left classifiers
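
As a closing illustration, here is a minimal sketch of Viterbi decoding for an HMM tagger; for simplicity it uses bigram transitions rather than TnT's trigrams, and the start/transition/emission probability tables are assumed to be already estimated (the toy numbers below are made up):

```python
import math

def viterbi(words, tags, start, trans, emit):
    """Most likely tag sequence for `words` under a bigram HMM.

    start[t]      = P(t at sentence start)
    trans[(p, t)] = P(t | previous tag p)
    emit[(t, w)]  = P(w | t)
    Unseen events get a tiny floor probability as crude smoothing.
    """
    def lp(table, key):  # floored log-probability
        return math.log(table.get(key, 1e-12))

    # Best log-score of any path ending in tag t after the first word.
    V = [{t: lp(start, t) + lp(emit, (t, words[0])) for t in tags}]
    back = []  # back[i][t] = best previous tag for t at word i+1

    for w in words[1:]:
        scores, ptrs = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: V[-1][p] + lp(trans, (p, t)))
            scores[t] = V[-1][prev] + lp(trans, (prev, t)) + lp(emit, (t, w))
            ptrs[t] = prev
        V.append(scores)
        back.append(ptrs)

    # Backtrack from the best final tag.
    best = max(tags, key=lambda t: V[-1][t])
    path = [best]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return list(reversed(path))

# Toy example with made-up probabilities:
tags = ["DT", "NN", "VB"]
start = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
trans = {("DT", "NN"): 0.9, ("NN", "VB"): 0.6, ("VB", "DT"): 0.7}
emit = {("DT", "the"): 0.7, ("NN", "dog"): 0.1, ("VB", "barks"): 0.05}
print(viterbi(["the", "dog", "barks"], tags, start, trans, emit))  # ['DT', 'NN', 'VB']
```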