Skip to content

Latest commit

 

History

History
133 lines (108 loc) · 9.95 KB

16.Text with Language service.MD

File metadata and controls

133 lines (108 loc) · 9.95 KB

Analyze text with the Language service

  • Explore text mining and text analysis with the Language service's NLP features, which include sentiment analysis, key phrase extraction, named entity recognition, and language detection.

  • As an example, you might read some text and identify some key phrases that indicate the main talking points of the text. You might also recognize names of people or well-known landmarks such as the Eiffel Tower. Although difficult at times, you might also be able to get a sense for how the person was feeling when they wrote the text, also commonly known as sentiment.

Text Analytics Techniques

  • Text analytics is a process where an AI algorithm, running on a computer, evaluates these same attributes in text, to determine specific insights. A person will typically rely on their own experiences and knowledge to achieve the insights. A computer must be provided with similar knowledge to be able to perform the task. There are some commonly used techniques that can be used to build software to analyze text, including:

    1. Statistical analysis of terms used in the text. For example, removing common "stop words" (words like "the" or "a", which reveal little semantic information about the text), and performing frequency analysis of the remaining words (counting how often each word appears) can provide clues about the main subject of the text.
    2. Extending frequency analysis to multi-term phrases, commonly known as N-grams (a two-word phrase is a bi-gram, a three-word phrase is a tri-gram, and so on).
    3. Applying stemming or lemmatization algorithms to normalize words before counting them - for example, so that words like "power", "powered", and "powerful" are interpreted as being the same word.
    4. Applying linguistic structure rules to analyze sentences - for example, breaking down sentences into tree-like structures such as a noun phrase, which itself contains nouns, verbs, adjectives, and so on.
    5. Encoding words or terms as numeric features that can be used to train a machine learning model. For example, to classify a text document based on the terms it contains. This technique is often used to perform sentiment analysis, in which a document is classified as positive or negative.
    6. Creating vectorized models that capture semantic relationships between words by assigning them to locations in n-dimensional space. This modeling technique might, for example, assign values to the words "flower" and "plant" that locate them close to one another, while "skateboard" might be given a value that positions it much further away.
  • In MS Azure, the Language cognitive service can help simplify application development by using pre-trained models that can:

    1. Determine the language of a document or text (for example, French or English).
    2. Perform sentiment analysis on text to determine a positive or negative sentiment.
    3. Extract key phrases from text that might indicate its main talking points.
    4. Identify and categorize entities in the text. Entities can be people, places, organizations, or even everyday items such as dates, times, quantities, and so on.

Get started with text analysis

  • The Language service is a part of the Azure Cognitive Services offerings that can perform advanced natural language processing over raw text.

  • Azure resources for the Language service

    1. A Language resource - if you only plan to use NLP services, or if you want to manage access and billing for the resource separately from other services.
    2. A Cognitive Services resource - if you plan to use the Language service in combination with other cognitive services, and you want to manage access and billing for these services together.
  • Language detection - identify the language in which text is written.

    1. The language name (for example "English").
    2. The ISO 6391 language code (for example, "en").
    3. A score indicating a level of confidence in the language detection.
    4. For example, restaurant where customers can complete surveys: 1. Review 1: "A fantastic place for lunch. The soup was delicious." 2. Review 2: "Comida maravillosa y gran servicio." 3. Review 3: "The croque monsieur avec frites was terrific. Bon appetit!" 4. use the text analytics to detect the language:
          Document	Language Name	ISO 6391 Code	Score
          Review 1	English	en	1.0
          Review 2	Spanish	es	1.0
          Review 3	English	en	0.9
      
  • Ambiguous or mixed language content - These situations can present a challenge to the service. An ambiguous content example would be a case where the document contains limited text, or only punctuation. For example, using the service to analyze the text ":-)", results in a value of unknown for the language name and the language identifier, and a score of NaN (which is used to indicate not a number).

  • Sentiment analysis - can evaluate text and return sentiment scores and labels for each sentence. This capability is useful for detecting positive and negative sentiment in social media, customer reviews, discussion forums and more. Returns a sentiment score in the range of 0 to 1, with values closer to 1 being a positive sentiment. Scores that are close to the middle of the range (0.5) are considered neutral or indeterminate.

    1. For example, the following two restaurant reviews could be analyzed for sentiment: ``` mark "We had dinner at this restaurant last night and the first thing I noticed was how courteous the staff was. We were greeted in a friendly manner and taken to our table right away. The table was clean, the chairs were comfortable, and the food was amazing."

      and

      "Our dining experience at this restaurant was one of the worst I've ever had. The service was slow, and the food was awful. I'll never eat at this establishment again." ```

    2. 1st review might be around 0.9 (positive) sentiment; while 2nd might be closer to 0.1 (negative)

  • Indeterminate sentiment - A score of 0.5 might indicate that the sentiment of the text is indeterminate, and could result from text that does not have sufficient context to discern a sentiment or insufficient phrasing. For example, a list of words in a sentence that has no structure, could result in an indeterminate score. Another example where a score may be 0.5 is in the case where the wrong language code was used. A language code (such as "en" for English, or "fr" for French) is used to inform the service which language the text is in. If you pass text in French but tell the service the language code is en for English, the service will return a score of precisely 0.5.

  • Key phrase extraction - is the concept of evaluating the text of a document, or documents, and then identifying the main talking points of the document(s). Consider the restaurant scenario discussed previously. Depending on the volume of surveys that you have collected, it can take a long time to read through the reviews. Instead, you can use the key phrase extraction capabilities of the Language service to summarize the main points.

  • Entity recognition - Provide unstructured text and it will return a list of entities in the text that it recognizes. The service can also provide links to more information about that entity on the web. An entity is essentially an item of a particular type or a category; and in some cases, subtype, such as those as shown in the following table.

        Type	SubType	Example
        Person		"Bill Gates", "John"
        Location		"Paris", "New York"
        Organization		"Microsoft"
        Quantity	Number	"6" or "six"
        Quantity	Percentage	"25%" or "fifty percent"
        Quantity	Ordinal	"1st" or "first"
        Quantity	Age	"90 day old" or "30 years old"
        Quantity	Currency	"10.99"
        Quantity	Dimension	"10 miles", "40 cm"
        Quantity	Temperature	"45 degrees"
        DateTime		"6:30PM February 4, 2012"
        DateTime	Date	"May 2nd, 2017" or "05/02/2017"
        DateTime	Time	"8am" or "8:00"
        DateTime	DateRange	"May 2nd to May 5th"
        DateTime	TimeRange	"6pm to 7pm"
        DateTime	Duration	"1 minute and 45 seconds"
        DateTime	Set	"every Tuesday"
        URL		"https://www.bing.com"
        Email		"support@microsoft.com"
        US-based Phone Number		"(312) 555-0176"
        IP Address		"10.0.1.125"
    
    1. For example, to detect entities in the following restaurant review extract:
        ``` mark
          "I ate at the restaurant in Seattle last week."
    
          Entity	Type	SubType	Wikipedia URL
          Seattle	Location		https://en.wikipedia.org/wiki/Seattle
          last week	DateTime	DateRange	
        ```
    

Exercise - Explore text analytics

  • https://microsoftlearning.github.io/AI-900-AIFundamentals/instructions/04-module-04.html

  • Create a Cognitive Services resource

  • Run Cloud Shell PowerShell ``` mark

      git clone https://github.com/MicrosoftLearning/AI-900-AIFundamentals ai-900
    
      code .
    
      folder ai-900/analyze-text.ps1.
    
      update Keys and Endpoints, save 
    
       cd ai-900
      ./analyze-text.ps1 review1.txt
    
       ./analyze-text.ps1 review2.txt
    
       ./analyze-text.ps1 review3.txt
    
       ./analyze-text.ps1 review4.txt
    
       Review the output.
      ```
    

Quiz

  1. You want to use the Language service to determine the key talking points in a text document. Which feature of the service should you use?
  • Sentiment analysis
  • Key phrase extraction
  • Entity detection
  1. You use the Language service to perform sentiment analysis on a document, and a score of 0.99 is returned. What does this score indicate about the document sentiment?
  • The document is positive.
  • The document is neutral.
  • The document is negative.
  1. When might you see NaN returned for a score in Language Detection?
  • When the score calculated by the service is outside the range of 0 to 1
  • When the predominant language in the text is mixed with other languages
  • When the language is ambiguous