Natural Language Processing

NLP is the study of the computational treatment of natural (human) language. It enables machines to understand human communication and extract information from it. Examples of NLP applications include the analysis of text in emails, human speech, and social media, and optical character recognition (OCR) of text scanned from physical documents.

NLP has its origins in machine translation from the 1950s. NLP advanced over the years by combining the power of artificial intelligence (AI), computational linguistics, and computer science.

Natural language processing applications

Some of the most popular NLP tasks:

  • Machine translation: Automatically translating one language to another
  • Information retrieval: Search engines, such as Google and Bing
  • Spell checkers
  • Natural language assistants, such as Siri and Alexa

Natural language processing challenges

  • Domains: NLP systems achieve higher accuracy in specific domains than in generic domains.
  • Language: English gets the most attention because it is an international language.
  • Medium: Processing speech is more difficult than processing text.

You can understand your NLP problem by focusing on the following areas:

  • Become familiar with your data.
  • Understand the challenges of your particular use case.
  • Review the state-of-the-art solutions and technologies for similar problems.

Information Extraction

  • Goal: Parse the input text to extract valuable output. Examples: Entity extraction, relation extraction, text summarization

  • Unstructured text: Dynamic structure (for example emails, newspaper articles, and user reviews).

  • Structured text: Defined structure (for example, a database table).

NLP technology can be used to extract information from unstructured text, such as emails, newspaper articles, and user reviews, into structured form.

Entity extraction refers to extracting entities from the text such as organizations, people, locations and so on. For example, the World Health Organization, IBM, Sara, John, Paris, US.

Relation extraction refers to identifying the relationship between entities, for example, “Abraham Lincoln was a US president”; “Ginni Rometty is the CEO of IBM”.

Text summarization refers to the technique of shortening long pieces of text. Automatic text summarization is a common use case in machine learning and natural language processing.

Structured text mostly takes the form of tables or values in a structured form. 

The goal of information extraction is to parse the incoming text, identify important mentions and their relations, and extract valuable output into structured text. Doing so can be used to automate the process of reading articles and passages to convert this information into a structured format. Computer systems can then manage this information and take proper actions.
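To make this concrete, here is a minimal entity-extraction sketch using NLTK's built-in named-entity chunker (NLTK appears in the open-source tools list later in these notes); the sentence reuses the IBM example above, and the chunk labels (PERSON, ORGANIZATION, GPE) come from NLTK, not from the text:

```python
import nltk

# One-time downloads for NLTK's tokenizer, POS tagger, and NE chunker.
for resource in ["punkt", "averaged_perceptron_tagger",
                 "maxent_ne_chunker", "words"]:
    nltk.download(resource)

sentence = "Ginni Rometty is the CEO of IBM in the US."
tokens = nltk.word_tokenize(sentence)   # split the sentence into tokens
tagged = nltk.pos_tag(tokens)           # Penn Treebank POS tags
tree = nltk.ne_chunk(tagged)            # chunker labels PERSON, ORGANIZATION, GPE, ...

# Print every labeled entity chunk found in the sentence.
for subtree in tree.subtrees():
    if subtree.label() != "S":
        print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))
```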

Sentiment analysis

  • Sentiment analysis is the process of identifying emotions or opinions that are expressed in user input.
  • It answers various questions, such as how people feel about your product or whether your customers are satisfied with your customer service.
  • It is used in marketing, retention plans, and emotional intelligence for chatbots, that is, it enables chatbots to direct the conversation.
  • Machine learning algorithms brought many advances to this field and are still improving.

Speech Recognition

  • Converts spoken language into text.
  • Retrieving answers from forums.
  • Building a Frequently Asked Questions (FAQs) system.
  • Training chatbots.

Speech recognition is another use case that helps advance the capabilities of many different applications. It converts spoken language into text. It can be used in many applications in several domains, such as holding an interactive conversation with a chatbot, and in Internet of Things (IoT) applications.

Natural language processing basic concepts and terminology


Synonyms are words that are written differently but are similar in meaning. For example:

  • Clever and smart
  • Begin and start
  • Beautiful and pretty
  • Sad and unhappy

Antonyms are words that have meanings that are opposite to each other. For example:

  • Clever and stupid
  • Begin and end
  • Beautiful and ugly
  • Sad and happy

Usage example: In information retrieval, you might want to expand the keywords search by retrieving the synonyms of the query words.
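As a hedged illustration of that usage example, here is a small query-expansion sketch that pulls synonyms from WordNet through NLTK (both tools are introduced in the open-source tools list below); the exact set returned depends on the WordNet version:

```python
from nltk.corpus import wordnet as wn  # requires a one-time nltk.download("wordnet")

def expand_query(word):
    """Collect WordNet synonyms of a query word across all of its senses."""
    return {lemma.name().replace("_", " ")
            for synset in wn.synsets(word)
            for lemma in synset.lemmas()}

# Expanding "begin" lets a search also match documents that say "start" or "commence".
print(expand_query("begin"))
```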

Homonyms: Words that have the same written form but have unrelated meanings. There are two types of homonyms:

  • Homographs: Words that have the same written form. For example, the word “right”:

    • This answer is right.
    • The building is on the right side of the river.
    • You have the right to remain silent.
    • Come here right now.

  • Homophones: Words that sound similar when spoken but have different meanings and spellings. For example:

    • “left” and “lift”.
    • “right” and “write”.

Homonym challenges: Homonyms introduce challenges into NLP operations such as machine translation and speech recognition.

  • How do you translate “right” so that it has the correct meaning?
  • How do you differentiate two words that sound similar when you convert speech to text?


  • Polysemy: Words that have the same written form and a related meaning. For example: You must face your fear. Her face is beautiful.

  • Hyponymy: A word is a hyponym of another word if it represents a subclass of the other word. For example: Orange is a hyponym of fruit. Yellow is a hyponym of color.

  • Hypernymy: One word is the hypernym of another word if it represents a superclass of the other word. For example: Fruit is a hypernym of orange. Color is a hypernym of yellow.

Usage example: Comparing the semantic similarity of two words.

Open-source NLP tools:

  • Apache OpenNLP: Provides tokenizers, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, coreference resolution, and more.
  • Stanford Core NLP: A suite of NLP tools that provide part-of-speech tagging, a named entity recognizer, a co-reference resolution system, sentiment analysis, and more.
  • Natural Language Toolkit (NLTK): A Python library that provides modules for processing text, classifying, tokenizing, stemming, tagging, parsing, and more.
  • WordNet: One of the most popular lexical databases for the English language. Supported by various API and programming languages.

There are many open source tools that you can use for NLP. For example:

  • Apache OpenNLP, which is based on Java. It provides many functions for text processing, such as tokenizers, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, co-reference resolution, and more. For more information, see https://opennlp.apache.org/.
  • Stanford Core NLP, which is written in Java. It is a suite of NLP tools that provide part-of-speech tagging, a named entity recognizer, a co-reference resolution system, sentiment analysis, and more. It supports many languages, such as English, German, French, Arabic, Spanish, and Chinese. For more information, see https://stanfordnlp.github.io/CoreNLP/.
  • NLTK provides the same processes as the other NLP suites, but in the Python language. For more information, see https://www.nltk.org/.
  • WordNet is a popular lexical database that is used in research. There are many APIs and languages that you can use to access WordNet. For example, you can make a call to retrieve a synonym of a word. WordNet is available online and as an offline version that you can download. For more information, see https://wordnet.princeton.edu/.
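The “retrieve a synonym” call mentioned above looks roughly like the following sketch through NLTK's WordNet interface; it also covers the antonym, hyponym, hypernym, and semantic-similarity concepts from the terminology section (the specific senses such as happy.a.01 are assumptions for the example):

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet")  # one-time download of the WordNet database

# Synonyms: lemma names that share a synset with "begin" (e.g., "start").
print([lemma.name() for lemma in wn.synsets("begin")[0].lemmas()])

# Antonyms are attached to a lemma of a specific sense ("happy" -> "unhappy").
print(wn.synset("happy.a.01").lemmas()[0].antonyms())

# Hyponyms (subclasses) and hypernyms (superclasses) of a synset.
fruit = wn.synset("fruit.n.01")
print(fruit.hyponyms()[:3])   # a few subclasses of fruit
print(fruit.hypernyms())      # the superclass of fruit

# Semantic similarity between two senses, on a [0, 1] path-based scale.
print(wn.synset("orange.n.01").path_similarity(wn.synset("lemon.n.01")))
```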

There are other libraries, such as Unstructured Information Management Architecture (UIMA). IBM Watson uses UIMA to analyze unstructured data. The Apache Clinical Text Analysis and Knowledge Extraction System (Apache cTAKES) is a UIMA-based system that is used to extract information from medical records.

Services:

  • Examples: IBM Cloud, Microsoft Cloud (Azure), and Google Cloud

  • IBM offers its AI services through IBM Cloud. The NLP services that are provided include the following ones (among others):

    • Watson Natural Language Classifier for text classification

    • Watson Natural Language Understanding for entity identification and relation extraction

Categories of NLP

There are two categories of NLP:

  • Natural language understanding (NLU): The NLP task of extracting insights from natural language inputs.
  • Natural language generation (NLG): The NLP task of building natural language outputs from non-linguistic inputs.

![[Pasted image 20230702140343.png]]

NLU applications

  • Unstructured to structured
  • Question and answer system
  • Sentiment analysis  
  • Machine translation
  • Text summarization
  • Weather forecasting system

NLU analyzes language to gain insights into the text. Examples of NLU applications include mapping a user’s unstructured input to a computer representation (structured data), relation extraction, question answering systems, and sentiment analysis.

Language Ambiguities

Building applications with NLP processes is not a trivial task, because natural language includes many ambiguities.

Lexical ambiguity is ambiguity at a primitive level, such as the word level. Two examples:

  • The word dance can be a verb (“We will dance all night.”) or a noun (“This is the salsa dance.”).
  • The word will can be a helping verb that indicates an action in the future (“John will go to work.”) or a noun (“His uncle left him millions in his will.”).

Syntactic-level ambiguity occurs when a sentence can be parsed in various ways. Example: “She pointed at the guy with the umbrella.” The phrase “with the umbrella” can attach to “pointed” (she used the umbrella to point) or to “the guy” (the guy is holding the umbrella).

Anaphora is an expression whose interpretation depends on another expression that was introduced previously. The referring term is called the anaphor, and it is usually a pronoun. The pronoun takes the place of a noun, but to avoid ambiguity, the pronoun must refer clearly to the noun that it replaces. Anaphora ambiguity occurs when more than one possible antecedent exists. For example: “When Mary invited Ann to play she did not know that she would be late.” Is the first “she” replacing Mary or Ann? Who did not know? Is the second “she” replacing Mary or Ann? Who would be late?

NLP Pipeline

When you work on an NLP task like machine translation, there is a set of processes and activities that you must perform. This set of processes is the NLP pipeline.

A pipeline is a way to design a program in which the output of one module feeds into the input of the next module. You use the pipeline to break down the complexity of an NLU task into a smaller set of less complicated tasks; there are no strict rules for which activities must be done in the pipeline. As a running example, consider using the pipeline to understand the following sentence: “Yes, I received your invitation, and I will happily attend your party.” (A code sketch of the first stages appears after the parse tree below.)

  1. Sentence segmentation: Finding the boundaries of sentences in text, that is, where each sentence starts and ends. This is not an easy task due to the possible ambiguity that is caused by punctuation marks. Example: “Yes, I received your invitation, and I will happily attend your party.”
  2. Tokenization: A basic process that breaks a sentence into a group of words, punctuation marks, numbers, and alphanumerics, which are called tokens. Tokenization can be done on multiple delimiters. Assume that you use white space as the delimiter and apply tokenization to the example: First sentence: “Yes” “,” “I” “received” “your” “invitation”. Second sentence: “I” “will” “happily” “attend” “your” “party”.
  3. POS tagging: Tagging each token with the part of speech that it corresponds to in a certain data set. Part-of-speech (POS) tagging helps the computer to understand language and grammar and derive meaning from the input sentence. One of the most famous data sets is the Penn Treebank project. In the example, the data set is English grammar, so according to the POS tags that are defined, each token is labeled with its proper grammatical value, that is, the corresponding part of speech.
  4. Morphological processing: Defines how morphemes are constructed. Word meanings change when you add or remove affixes, for example, certain and uncertain, or late and latest. Morphological parsing is the process of determining the morphemes of a word. Morphemes are the smallest grammatical units in a language and are not identical to words: a morpheme might or might not stand alone, whereas a word is freestanding. Morphology focuses on recognizing how base words are modified to form other words with similar meanings but often different syntactic categories.
  5. Word-level semantics: Deals with the meaning of words. Example: “yes/UH” “,/,” “I/PRP” “got/VBD” “your/PRP$” “invitation/NN” “,/,”. The “got” token can be replaced with its synonym “received”, giving “yes/UH” “,/,” “I/PRP” “received/VBD” “your/PRP$” “invitation/NN” “,/,”.
  6. Parsing: Parsing, or syntactic analysis, is the process of evaluating text in terms of grammatical correctness (if required). The input from the previous stage was: First sentence: “yes/UH” “,/,” “I/PRP” “received/VBD” “your/PRP$” “invitation/NN” “,/,”. Second sentence: “I/PRP” “will/MD” “happily/RB” “attend/VB” “your/PRP$” “party/NN”. Parsing this input produces the tree that follows.

(ROOT (S (S (INTJ (UH yes)) (, ,) (NP (PRP I)) (VP (VBD got) (NP (PRP$ your) (NN invitation)))) (, ,) (CC and) (S (NP (PRP I)) (VP (MD will) (ADVP (RB happily)) (VP (VB attend) (NP (PRP$ your) (NN party)))))))

![[Pasted image 20230704205154.png]]
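The following is a minimal sketch of the segmentation, tokenization, and POS-tagging stages using NLTK (one of the open-source tools listed earlier); the tag set is the Penn Treebank one mentioned above:

```python
import nltk

# One-time downloads for the sentence tokenizer and the POS tagger.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

text = "Yes, I received your invitation, and I will happily attend your party."

# Stage 1: sentence segmentation.
for sentence in nltk.sent_tokenize(text):
    # Stage 2: tokenization.
    tokens = nltk.word_tokenize(sentence)
    # Stage 3: POS tagging with Penn Treebank tags (PRP, VBD, NN, ...).
    print(nltk.pos_tag(tokens))
```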

Information Retrieval

Retrieve relevant information from a collection of information resources. Information retrieval is the foundation of many search engines.

Stemming is a task where algorithms reduce a word to its word stem by removing its affixes. For example, the word “unbelievable” may be stemmed into “believ” by removing the prefix “un” and the suffix “able”. The stem does not have to match the morphological root of the word, so the stem form can have no actual meaning: for example, the stem of the word “chance” is “chanc”, which is not an English word on its own. The most popular English stemming algorithm is Porter’s algorithm (Porter, 1980).

Porter stemmer

| Step | Rule          | Example                    |
|------|---------------|----------------------------|
| 1    | sses → ss     | glasses → glass            |
|      | ies → i       | parties → parti            |
|      | ss → ss       | loss → loss                |
|      | s → ø         | hats → hat                 |
| 2    | ing → ø       | talking → talk             |
|      | ed → ø        | discovered → discover      |
| 3    | ational → ate | operational → operate      |
|      | izer → ize    | recognizer → recognize     |
|      | ator → ate    | collaborator → collaborate |
| 4    | al → ø        | electrical → electric      |
|      | able → ø      | doable → do                |
|      | ate → ø       | investigate → investig     |
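A quick way to reproduce the table is NLTK's implementation of the Porter stemmer; a minimal sketch (note that Porter's algorithm strips suffixes only, so it does not remove the prefix “un” from “unbelievable”):

```python
from nltk.stem import PorterStemmer  # NLTK's implementation of Porter (1980)

stemmer = PorterStemmer()
for word in ["glasses", "parties", "talking", "operational",
             "recognizer", "chance", "unbelievable"]:
    # e.g. chance -> chanc: the stem need not be a real English word.
    print(word, "->", stemmer.stem(word))
```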

Normalization: Text normalization is the process of transforming text into a single form, which ensures consistency before operations are performed on it. For example, imagine that you have a dictionary that contains a set of words, and you also have query text that includes one of the words in your dictionary. Assume that the word in the query is “Child” with a capital C, but the equivalent word in the dictionary is “child” with a lowercase c. The two strings do not match. The solution is to ensure that both the query text and the set of words in the dictionary are “normalized” into lowercase so that the query and dictionary have consistent text.

Here are some examples of normalization:

  • Case folding: Child -> child
  • Duplication removal: Hiiiiii -> Hi
  • Acronyms processing: WHO -> World Health Organization
  • Format normalization: $100 -> 100 dollars
  • Value normalization: 2 July 1980 -> DATE
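A minimal sketch of the first three transformations in plain Python (the acronym map and the regular expressions here are illustrative assumptions, not a standard API):

```python
import re

# Hypothetical acronym dictionary for the example.
ACRONYMS = {"WHO": "World Health Organization"}

def normalize(text: str) -> str:
    # Acronym processing (done before case folding, which would hide "WHO").
    for acronym, expansion in ACRONYMS.items():
        text = re.sub(rf"\b{acronym}\b", expansion, text)
    # Duplication removal: collapse runs of 3+ repeated characters.
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    # Case folding.
    return text.lower()

print(normalize("Hiiiiii, the WHO protects every Child"))
# -> "hi, the world health organization protects every child"
```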

TF-IDF: TF-IDF is a combination of two weighting methods for information retrieval that determines how important a term is. TF (term frequency) measures how many times a term t occurs in a document d; this value is denoted by tf_{t,d}. IDF (inverse document frequency) measures how rare a term is: you must decrease the weight of frequent terms while increasing the weight of the exceptional ones.

idf_t = log(N / df_t)

where N is the total number of documents and df_t is the number of documents that contain the term t. The two weights are combined by multiplication: tf-idf_{t,d} = tf_{t,d} × idf_t.
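A from-scratch sketch of these formulas on a toy corpus (the three documents are made up for illustration):

```python
import math

# Toy corpus, invented for illustration.
docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats are pets"]
tokenized = [d.split() for d in docs]
N = len(tokenized)  # total number of documents

def tf(term, doc_tokens):
    return doc_tokens.count(term)               # tf_{t,d}

def idf(term):
    df = sum(term in doc for doc in tokenized)  # df_t
    return math.log(N / df)                     # idf_t = log(N / df_t)

# Frequent terms ("the") get low weights; rare terms ("pets") get high ones.
for term in ["the", "cat", "pets"]:
    print(term, [round(tf(term, d) * idf(term), 3) for d in tokenized])
```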

![[Pasted image 20230704210541.png]]

![[Pasted image 20230704210551.png]]

Information Extraction

Refers to the automatic extraction of structured information from unstructured or semi-structured text, for example, entity and relation identification.

![[Pasted image 20230712223037.png]]

In the example shown in the image, “Nobel Prize” and “Wilhelm Conrad Rontgen” are identified as entities. The relationship is identified as “awarded”, which indicates the relationship between Wilhelm and the Nobel Prize.

Sentiment analysis

Sentiment analysis is the process of identifying emotions or opinions that are expressed in user input. It is used heavily in the fields of chatbots and social media analysis. It is also used in marketing because it captures a user’s opinion about a particular product or service so that an organization can take corrective action to keep the user satisfied.

“Their service is amazing” is a positive sentiment. “The quality of food in this restaurant is terrible” is a negative sentiment. “I am going to school” is a neutral sentiment.
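As a hedged sketch, the three sentences can be scored with NLTK's VADER analyzer (a rule-based sentiment tool; the notes do not name a specific library for this task):

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon

sia = SentimentIntensityAnalyzer()
for sentence in ["Their service is amazing",
                 "The quality of food in this restaurant is terrible",
                 "I am going to school"]:
    # polarity_scores returns neg/neu/pos plus a compound score in [-1, 1].
    print(sentence, "->", sia.polarity_scores(sentence))
```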

System Evaluation

How can we measure the quality of a solution? Target: you developed a new search engine, and you must determine how well it works.

![[Pasted image 20230712225918.png]] Assume that you developed a search algorithm that helps you to retrieve related documents from a corpus that contains 1000 documents. From these 1000 documents, assume that 200 are relevant to the word “cat” and the other 800 documents are irrelevant. You ran a search test for the word “cat”. After the test ran, the search engine retrieved the documents that are shown here.

![[Pasted image 20230712230539.png]]

You test your solution by searching for the word “cat”. Your algorithm returns 250 documents, where 150 documents are relevant (which means your algorithm missed 50 relevant documents) and 100 documents are irrelevant (which means your algorithm correctly eliminated 700 of the irrelevant documents).

Confusion Matrix

A confusion matrix, also known as an error matrix, is a specific table layout that enables visualization of the performance of an algorithm.

![[Pasted image 20230712230922.png]]

How many relevant documents were retrieved by the algorithm? 150 documents: True positives (Tp).

How many irrelevant documents were retrieved by the algorithm? 100 documents: False positives (Fp) (250 total documents retrieved – 150 relevant documents).

How many relevant documents did the algorithm not retrieve? 50 documents: False negatives (Fn).

How many irrelevant documents did the algorithm not retrieve? 700 documents: True negatives (Tn).

The objective is to improve the algorithm to decrease the Fp and Fn values.
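Using the counts above, the standard retrieval metrics follow directly; a minimal worked sketch (precision, recall, and F1 are the usual confusion-matrix metrics, assumed here because the text stops at the raw counts):

```python
# Counts from the "cat" search example above.
tp, fp, fn, tn = 150, 100, 50, 700

precision = tp / (tp + fp)  # fraction of retrieved documents that are relevant
recall = tp / (tp + fn)     # fraction of relevant documents that were retrieved
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
# -> precision=0.60, recall=0.75, f1=0.67
```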

![[Pasted image 20230713110733.png]]