Part of Speech Tagging
- Also known as POS, word classes, or syntactic categories
- The procedure of assigning a part-of-speech tag to each word in a sentence
- POS tagging is also called a sequence labeling problem, because the task is to map a sentence x1, x2, ..., xn to a tag sequence y1, y2, ..., yn
- POS tags are helpful because they provide information about a word and its neighbors
- For example, if a word is a noun, that tells us about its neighboring words: nouns are typically preceded by adjectives and determiners (articles are a sub-type), while verbs are typically preceded by nouns
- POS tags also give information about the syntactic structure around the word, as nouns are generally part of noun phrases
  - Syntactic structure: "Principles by which words are used in phrases and sentences to construct meaningful combinations" (ref: http://www.thefreedictionary.com/Syntactic+structure)
- Therefore, parts of speech are one of the important components of syntactic parsing
- Apart from this, POS tags are also useful for finding named entities, such as people or locations, in a given text
- Also useful for text-to-speech (how do you pronounce "lead"?)
  - "lead" is pronounced differently as a noun and as a verb
- POS tags are divided into two categories:
  - Closed class
    - Called closed class because these classes have relatively fixed membership
    - Determiners (articles are a sub-type): a, an, the, this, that, etc.
    - Pronouns: she, he, it, they, etc.
    - Prepositions: on, under, over, etc.
    - Conjunctions: and, or, etc.
    - Auxiliary verbs: can, could, might, are, is, etc.
    - Particles: up, down, on, off, by, etc.
    - Numerals: one, two, three, first, second, third
    - Interjections: oh, hey, alas, uh, etc.
    - New prepositions, determiners, and pronouns are not invented, unlike new nouns and verbs
    - Closed-class words are also called function words (such as it, and, of, that); they occur frequently and have structuring uses in grammar
  - Open class
    - Called open class because new members are introduced every day
    - Nouns, verbs, adjectives, adverbs
    - For example, nouns like iPhone and Samsung or verbs like to fax are created
- Modern language processing on English uses the Penn Treebank tagset, which has 45 POS tags
- It is used on a variety of corpora, such as the Brown corpus, the Wall Street Journal corpus, etc.
- (Tagset table: image taken from Speech and Language Processing, Daniel Jurafsky & James H. Martin)
- Since tags are also applied to punctuation (comma, left quote, right quote, colon, etc.), word tokenization is performed on the given sentence first
- The input to a tagging algorithm is a sequence of words and a tagset, and the output is a sequence of tags (a single best tag for each word), as in the example below
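A minimal example of this input/output contract, using NLTK's off-the-shelf tagger (assuming NLTK is installed; the resource names below are those used by NLTK 3.x and may differ in newer releases):

```python
import nltk

# One-time model downloads (tokenizer and tagger)
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# Tokenize first, since tags are also applied to punctuation
words = nltk.word_tokenize("Hand me that book, please.")

# Output: a list of (word, tag) pairs, one Penn Treebank tag per token
print(nltk.pos_tag(words))
```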
- Tagging is a disambiguation task: words are ambiguous because a word can have more than one part-of-speech tag, and the goal is to find the tag that suits the context
  - For example, the word "book" can be a verb (book that flight) or a noun (hand me that book, please)
- The challenge is to resolve these ambiguities and choose the suitable tag for the context
- Performance is measured by accuracy (what percent of the tags the algorithm gets right)
- Current POS taggers reach around 97% accuracy
- But even a baseline POS tagger already reaches about 90% accuracy
  - The baseline works as follows (a minimal sketch follows this list):
    - Tag every word with its most common tag in the training data
    - Tag unknown/new words as nouns
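A minimal sketch of this baseline (the helper names are hypothetical; training data is assumed to be a list of sentences, each a list of (word, tag) pairs):

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Count tags per word and keep each word's most common tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def baseline_tag(words, most_common_tag):
    """Tag known words with their most common tag; unknown words as nouns (NN)."""
    return [(w, most_common_tag.get(w, "NN")) for w in words]
```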
- First, knowledge of neighboring words is very important for POS tagging

  | Bill | saw   | that | man | yesterday |
  |------|-------|------|-----|-----------|
  | NNP  | NN    | DT   | NN  | NN        |
  | VB   | VB(D) | IN   | VB  | NN        |

- Here we have a lot of tagging ambiguity: "Bill" can be a proper noun or a verb, "saw" can be a noun or a verb, and so on
- Another important source of evidence for POS tagging is knowledge of word probabilities
  - For example, "man" is rarely used as a verb, so even without looking at the context of the sentence we can assign the likely POS tag
- Both sources of evidence are useful for assigning a POS tag to each word
- If we take more features from words, we can perform even better (a feature-extraction sketch follows this list):
  - Word: the -> DT ("the" practically always has the POS tag DT)
  - Prefixes: unbelievable: un- -> JJ (words with the prefix un- are mostly adjectives)
  - Suffixes: slowly: -ly -> RB (words ending with -ly are mostly adverbs)
  - Capitalization: Pacific: capital P -> NNP (a capitalized word is most probably a proper noun)
  - Word shapes: 40-year: d-x (digits followed by letters) -> JJ (adjective)
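A minimal sketch of a per-word feature extractor along these lines (the feature names and fixed prefix/suffix lengths are illustrative choices, not taken from a specific tagger):

```python
import re

def word_features(word, is_first_in_sentence):
    """Map a token to the simple surface features listed above."""
    shape = re.sub(r"\d", "d", word)        # digits -> d
    shape = re.sub(r"[a-z]", "x", shape)    # lowercase letters -> x
    shape = re.sub(r"[A-Z]", "X", shape)    # uppercase letters -> X
    return {
        "word": word.lower(),               # e.g. "the" strongly signals DT
        "prefix": word[:2].lower(),         # e.g. "un-" hints at JJ
        "suffix": word[-2:].lower(),        # e.g. "-ly" hints at RB
        # Capitalization mid-sentence hints at NNP
        "capitalized": word[0].isupper() and not is_first_in_sentence,
        "shape": shape,                     # e.g. "40-year" -> "dd-xxxx"
    }
```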
- So, a Maxent POS tagger can be built from such individual features associated with words. Accuracies (overall / unknown words):
  - Baseline tagger (most frequent tag): ~90% overall / ~50% unknown words
  - Maxent P(t|w): 93.7% overall / 82.6% unknown words; it uses word features to assign POS tags
- Trigrams'n'Tags (TnT, HMM-based) gives an accuracy of 96.2% overall / 86.0% on unknown words:
  - This model extends the idea of extracting word features with a hidden Markov model, and it achieves higher accuracy
  - It considers another feature: wider tag context
  - It uses history to predict a word's tag; the probability of the current tag depends on the two previous tags in the sentence
  - It uses the Viterbi algorithm to find the most likely tag sequence (see the sketch after this list)
  - To get POS tags of unknown words, it uses word suffixes: e.g., words ending with -s are likely to be plural nouns, and words ending with -able are considered adjectives
  - Furthermore, this model is efficient and is trainable on different languages and on any tagset
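A minimal sketch of Viterbi decoding (a bigram version for brevity, whereas TnT conditions on the two previous tags; `trans` and `emit` are assumed to hold log-probabilities):

```python
import math

def viterbi(words, tags, trans, emit):
    """Find the most likely tag sequence under a bigram HMM.

    trans[(prev_tag, tag)] and emit[(tag, word)] are log-probabilities;
    "<s>" is the sentence-start marker."""
    NEG = -math.inf
    # Scores for the first word, transitioning from "<s>"
    scores = [{t: trans.get(("<s>", t), NEG) + emit.get((t, words[0]), NEG)
               for t in tags}]
    back = []
    for w in words[1:]:
        col, ptrs = {}, {}
        for t in tags:
            # Best previous tag to transition from
            prev = max(tags, key=lambda p: scores[-1][p] + trans.get((p, t), NEG))
            col[t] = scores[-1][prev] + trans.get((prev, t), NEG) + emit.get((t, w), NEG)
            ptrs[t] = prev
        scores.append(col)
        back.append(ptrs)
    # Follow back-pointers from the best final tag
    path = [max(tags, key=lambda t: scores[-1][t])]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return list(reversed(path))
```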
- The Maximum Entropy Markov Model (MEMM) tagger gives an accuracy of 96.9% overall / 86.9% on unknown words:
  - It uses more features to assign the POS tags of words, as shown in the figure
  - (Figure: image taken from Speech and Language Processing, Daniel Jurafsky & James H. Martin)
- Bidirectional dependencies give an accuracy of 97.2% overall / 90.0% on unknown words:
  - A problem with the MEMM and HMM models is that they run left-to-right only
  - Therefore, results might improve if the decision about the tag of word w(i) could also use information from future tags t(i+1) and t(i+2)
  - The Stanford tagger uses a bidirectional version of the Maximum Entropy Markov Model (MEMM)
  - Any tagger model can be converted into a bidirectional version by using multiple passes; in other words, the tagger can be run twice, once left-to-right and once right-to-left
  - Greedy decoding is then used: for each word, choose the tag with the highest score from the right-to-left and left-to-right classifiers (a sketch follows below)
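A minimal sketch of that final combination step, under the reading that each word takes whichever tag scores highest in either direction (`score_lr` and `score_rl` are hypothetical stand-ins for the two directional classifiers):

```python
def combine_bidirectional(words, tags, score_lr, score_rl):
    """Greedy per-word decoding over two directional classifiers.

    score_lr(words, i) and score_rl(words, i) are assumed to return
    dicts mapping each tag to a score for position i."""
    tagged = []
    for i, word in enumerate(words):
        lr = score_lr(words, i)  # scores using left context
        rl = score_rl(words, i)  # scores using right context
        # Take the tag with the highest score across both classifiers
        best = max(tags, key=lambda t: max(lr.get(t, float("-inf")),
                                           rl.get(t, float("-inf"))))
        tagged.append((word, best))
    return tagged
```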