AI, Experience 2 - Yield quick insights from unstructured data with Knowledge Mining and Cognitive Search
- Data & AI Tech Immersion Workshop – Product Review Guide and Lab Instructions
- AI, Experience 2 - Yield quick insights from unstructured data with Knowledge Mining and Cognitive Search
- Technology overview
- Scenario overview
- Task 1: Populate Cosmos DB with tweets from a generator
- Task 2: Create a basic Cognitive Search pipeline using the Azure portal
- Task 3: Enhance the Cognitive Search pipeline
- Task 4: Query data in the Azure portal
- Task 5: Add a knowledge store
- Task 6: Visualize enriched documents in Power BI
- Task 7: Publish Function App for custom skills
- Task 8: Create Forms Recognizer Pipeline
- Task 9: Create an Anomaly Detection pipeline
- Wrap-up
- Additional resources and more information
Cognitive search is an AI feature in Azure Cognitive Search, used to extract text from images, blobs, and other unstructured data sources - enriching the content to make it more searchable in an Azure Cognitive Search index. Extraction and enrichment are implemented through cognitive skills attached to an indexing pipeline. AI enrichments are supported in the following ways:
- Natural language processing skills include entity recognition, language detection, key phrase extraction, text manipulation, and sentiment detection. With these skills, unstructured text can assume new forms, mapped as searchable and filterable fields in an index.
- Image processing skills include Optical Character Recognition (OCR) and identification of visual features, such as facial detection, image interpretation, image recognition (famous people and landmarks) or attributes like colors or image orientation. You can create text-representations of image content, searchable using all the query capabilities of Azure Cognitive Search.
Cognitive skills in Azure Cognitive Search are based on machine learning models in the Cognitive Services APIs: Computer Vision and Text Analytics.
Natural language and image processing are applied during the data ingestion phase, with results becoming part of a document's composition in a searchable index in Azure Cognitive Search. Data is sourced as an Azure data set and then pushed through an indexing pipeline using whichever built-in skills you need. The architecture is extensible, so if the built-in skills are not sufficient, you can create and attach custom skills to integrate custom processing. Examples might be a custom entity module or document classifier targeting a specific domain such as finance, scientific publications, or medicine. Other cognitive skills that could be used for this include:
- Form processing skills use the Form Recognizer cognitive service to extract key-value pairs and table data from form documents.
- Anomaly detection skills leverage the Anomaly Detector API, which enables you to monitor and detect abnormalities in your time series data with machine learning. Using your time series data, the API determines boundaries for anomaly detection, expected values, and which data points are anomalies (a request sketch follows this list).
- Translation skills can take advantage of the Translator Text API, which is a neural machine translation service that can be used to easily and accurately detect the language of any text string, simplifying development processes and allowing you to quickly send for translation or serve localized content.
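To illustrate the shape of an Anomaly Detector request, here is a minimal sketch (the host, granularity, and series values are placeholders, not data from this lab):

POST https://<your-anomaly-detector>.cognitiveservices.azure.com/anomalydetector/v1.0/timeseries/entire/detect
Ocp-Apim-Subscription-Key: <your key>
Content-Type: application/json

{
  "granularity": "hourly",
  "series": [
    { "timestamp": "2019-12-01T00:00:00Z", "value": 32.0 },
    { "timestamp": "2019-12-01T01:00:00Z", "value": 31.4 },
    { "timestamp": "2019-12-01T02:00:00Z", "value": 98.6 }
  ]
}

The response flags each point with isAnomaly, an expected value, and upper and lower margins.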
ContosoAuto is interested in leveraging their unstructured data to gain further insights into multiple business areas. First, they are interested in improving their understanding of how customers perceive their business, and the key things their customers are talking about. To accomplish this, they are looking for a pilot that would use tweets streamed from Twitter into a tweets container in their Cosmos DB instance to better understand what customers are saying about their organization on the platform. They are also looking to get a better understanding of whether the trend of messages is positive, negative, or neutral by performing sentiment analysis on the tweets. In addition, they are looking for options for using the information gained through this process to better target content and experiences to those users.
In this experience, you will learn the mechanics of using Cognitive Search and Knowledge Mining to yield rapid insights into unstructured data. Using a combination of pre-configured and custom cognitive skills in Azure Cognitive Search, you will create a series of Cognitive Search indexing pipelines that enrich source data en route to an index. Cognitive skills are natural language processing (NLP) and image analysis operations that extract text and text representations of an image, detect language, entities, key phrases, and more. The end result is rich additional content in an Azure Cognitive Search index, created by a cognitive search indexing pipeline. The output is a full-text searchable index on Azure Cognitive Search.
Before proceeding to the lab, we have to perform the following steps:
- Go to the storage account named techimmersionstrgxxxxxx (xxxxxx refers to the deployment id).
- Select Containers, and then select the forms container.
- Select Upload, select all of the files from C:\autofiles\forms, and upload them into the container.
For this experience, you will be using the tweets container in ContosoAuto's Cosmos DB as a data source for your Cognitive Search pipeline. In order to use Cosmos DB as a data source, documents must exist in the target container prior to creating the Data Source in Azure Cognitive Search. In this task, you will populate the tweets container in your Cosmos DB ContosoAuto database using a tweet generator application running in Visual Studio.
-
Open File Explorer and navigate to C:\lab-files\ai\2. Double-click on CognitiveSearch.sln to open the solution in Visual Studio. If you are prompted about how to open the file, choose Visual Studio 2019. If you are prompted by Visual Studio to log in, use the Azure credentials you are using for this workshop.

The Visual Studio solution contains the following projects:
- CosmosDb.Common: Common library containing models and classes used by other projects within the solution to communicate with Azure Cosmos DB.
- CustomSkillFunctions: Contains the Azure Functions that are used to perform actions behind custom cognitive skills, such as translating non-English tweets to English and recognizing form fields.
- DataGenerator: Console app that generates simulated tweets and vehicle telemetry data and sends it to Cosmos DB.
- PipelineEnhancer: Console app that interacts with Azure Cognitive Search Service SDK and REST APIs to enhance the Cognitive Search pipeline.
-
In the Solution Explorer on the left-hand side of Visual Studio, expand the DataGenerator project, and then locate and open the appsettings.json file.
-
Next, you need to retrieve your Cosmos DB connection string. This will be used to enable the DataGenerator to write data into your tweets container. In the Azure portal, select Resource groups from the left-hand menu, and then select the tech-immersion-XXXXX resource group (where XXXXX is the unique identifier assigned to you for this workshop).
-
Select the tech-immersionXXXXX Azure Cosmos DB account from the list of resources (where XXXXX is the unique identifier assigned to you for this workshop).
IMPORTANT: There may be two Cosmos DB accounts in your resource group. Select the Cosmos DB account named tech-immersionXXXXX, with no hyphen between immersion and XXXXX.
-
On your Cosmos DB blade, select Keys from the left-hand menu.
-
Copy the Primary Connection String value by selecting the copy button to the right of the field.
-
Return to the appsettings.json file in Visual Studio and paste the Primary Connection String into the value for the ConnectionString setting within the CosmosDb section. Your appsettings.json file should look similar to the following:
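As a rough sketch (the account endpoint and key shown are placeholders for the values you just copied), the CosmosDb section should resemble:

"CosmosDb": {
  "ConnectionString": "AccountEndpoint=https://<your-account>.documents.azure.com:443/;AccountKey=<your-key>;"
}
-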
Save appsettings.json.
-
Right-click on the DataGenerator project, and select Set as StartUp Project from the context menu.
-
Run the console app by selecting the button in the toolbar with the green arrow and text of DataGenerator.
-
In the console window, enter "1" at the prompt to start generating tweets. You will see statistics about tweets being streamed into Cosmos DB.
Leave the DataGenerator console app running in the background while you move on to the following tasks in this experience. The app will run for 10 minutes, sending random tweets into your Cosmos DB tweets container, so you have data to work with. In the next task, you will set up an Azure Cognitive Search Index which points to the tweets container in Cosmos DB, so as new tweets are added, they will be indexed.
With data now streaming into your Cosmos DB tweets container, you are ready to set up a basic Cognitive Search pipeline using the Azure portal. In this task, you will create an Azure Cognitive Search Index and configure an Azure Cognitive Search Indexer to read tweets from your Cosmos DB container. You will also include several pre-configured skills linked to your Cognitive Services account to extract more information out of the tweets being indexed.
-
Return to your Azure Cosmos DB account blade in the Azure portal, and select Data Explorer from the toolbar on the overview blade.
-
Under the ContosoAuto database, expand the tweets container and then select Items.
-
In the Items pane, select any of the documents listed and inspect a tweet document. Documents are stored in JSON (JavaScript Object Notation) format in Cosmos DB.
The text field, which contains the content of the tweet, is what you will be using as you begin building your Cognitive Search pipeline.
-
With a better understanding of the structure of the tweet documents stored in Cosmos DB, let's move on to creating a basic Cognitive Search pipeline. From your Cosmos DB blade in the Azure portal, select Add Azure Cognitive Search from the left-hand menu, select your tech-immersion search service, and then select Next: Connect to your data.
-
On the Connect to your data tab, enter the following:
- Data source: This should be pre-populated with Cosmos DB. The data source object tells Azure Cognitive Search how to retrieve external source data.
- Name: Enter tweets-cosmosdb.
- Cosmos DB account: This should be pre-populated with the connection string for your Cosmos DB account.
- Database: Select the ContosoAuto database.
- Collection: Select the tweets container.
- Query: Paste the SQL statement below into the field.
SELECT * FROM c WHERE c._ts > @HighWaterMark ORDER BY c._ts
- Query results ordered by _ts: Check this box.
-
Select Next: Add cognitive search (Optional).
Cognitive search is an AI feature in Azure Cognitive Search, used to extract text from images, blobs, and other unstructured data sources - enriching the content to make it more searchable in an Azure Cognitive Search index. Extraction and enrichment are implemented through cognitive skills attached to an indexing pipeline. Cognitive skills in Azure Cognitive Search are based on machine learning models in the Cognitive Services APIs: Computer Vision and Text Analytics. To learn more, read What is "cognitive search" in Azure Cognitive Search.
-
On the Add cognitive search (Optional) tab, do the following:
- Expand the Attach Cognitive Services section and select your tech-immersion-cog-services instance from the list. This will associate your Cognitive Services account with the Skillset you are creating.
- Expand the Add Enrichments section and set the following configuration:
- Enter tweet-skillset as the name.
- Select text as the source data field.
- Check the box next to the Text Cognitive Skills header to select all of the options except extract personally identifiable information.
In the section above, you added a set of enrichment steps to the data being ingested from Cosmos DB. In a Cognitive Search pipeline, individual enrichment steps are called skills, and the collection of enrichment steps is a skillset. The predefined skills available at this step through the UI use pre-trained models to extract additional information from the documents. The EntityRecognitionSkill extracts entities (people, organizations, locations) from the document. The KeyPhraseExtractionSkill detects important phrases based on term placement, linguistic rules, proximity to other terms, and how unusual the term is within the source data. The LanguageDetectionSkill is used to detect the primary language used in the document, and the TranslationSkill is used to translate input text into a variety of languages for normalization or localization. You can learn more by reading about the available predefined cognitive skills.
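For reference, an individual skill definition within the skillset JSON looks roughly like the following sketch of the EntityRecognitionSkill (the categories and output mappings shown are illustrative of what the portal generates, not copied from it):

{
  "@odata.type": "#Microsoft.Skills.Text.EntityRecognitionSkill",
  "categories": [ "Person", "Organization", "Location" ],
  "context": "/document",
  "inputs": [
    { "name": "text", "source": "/document/text" }
  ],
  "outputs": [
    { "name": "persons", "targetName": "people" },
    { "name": "organizations", "targetName": "organizations" },
    { "name": "locations", "targetName": "locations" }
  ]
}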
-
Select Next: Customize target index.
-
On the Customize target index tab, do the following:
- Name: Enter tweet-index.
- Key: Leave this set to rid.
- Suggester name: Leave this blank.
- Search mode: Leave this blank.
- Before setting the check boxes for each field, expand users and entities, and any sub-properties within each.
- Check the Retrievable, Filterable and Searchable boxes at the top, to check all fields under each category, as shown in the image below.
On the Index page, you are presented with a list of fields with a data type and a series of check boxes for setting index attributes. You can bulk-select attributes by clicking the checkbox at the top of an attribute column. Choose Retrievable and Searchable for every field that should be returned to a client app and subject to full text search processing. You'll notice that integers are not full text or fuzzy searchable (numbers are evaluated verbatim and are often useful in filters). Read the description of index attributes for more information.
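Behind the UI, each check box maps to a boolean attribute on a field definition in the index JSON. As a rough sketch, the text field with the attributes selected above corresponds to something like:

{
  "name": "text",
  "type": "Edm.String",
  "searchable": true,
  "filterable": true,
  "retrievable": true,
  "sortable": false,
  "facetable": false
}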
-
Select Next: Create an indexer.
-
On the Create an indexer tab, set the following:
- Name: Enter tweet-indexer.
- Schedule: Select Once.
An indexer in Azure Cognitive Search is a crawler that extracts searchable data and metadata from an external Azure data source and populates an index based on field-to-field mappings between the index and your data source. This approach is sometimes referred to as a 'pull model' because the service pulls data in without you having to write any code that adds data to an index. For this experience, we will be making multiple updates to the indexer, so we did not configure a schedule for the indexer. In production scenarios, you would want to select a schedule to allow new data entering your system to be indexed.
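For reference, the indexer being created corresponds to a JSON definition along these lines (the names match the values you entered above; the schedule is null because you selected Once):

{
  "name": "tweet-indexer",
  "dataSourceName": "tweets-cosmosdb",
  "targetIndexName": "tweet-index",
  "skillsetName": "tweet-skillset",
  "schedule": null
}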
-
Select Submit. You will receive a notification in the portal when the Azure Cognitive Search pipeline has been successfully configured.
-
Next, navigate to the techimmersionXXXXX Azure Cognitive Search Service (where XXXXX is the unique identifier assigned to you for this workshop) in the Azure portal by selecting it from the list of resources in the tech-immersion-XXXXX resource group.
-
On the Azure Cognitive Search service blade, select Indexers.
-
You specified the indexer should run once, so it should have automatically started upon creation. If your indexer has a status of No history, you can force the indexer to run by selecting the indexer, and then selecting Run on the Indexer blade.
-
Once your Indexer has run, select Search explorer on the Search Service toolbar.
-
On the Search explorer tab, select Search and observe the results.
-
Looking at the items in the search results, you will see that each result "value" resembles the following:
{ "@search.score": 1, "created_at": "2019-12-01T20:23:26.209Z", "id_str": "858845341", "text": "Quand le 2020 #Ford #Explorer sera-t-il disponible au Canada?", "id": "cc4c5916-1485-443f-976f-a75bb543f046", "rid": "Tk5zWUFKQ2tVZzRRQUFBQUFBQUFBQT090", "people": [], "organizations": [ "Ford" ], "locations": [ "Canada" ], "keyphrases": [ "Explorer", "PM", "Ford", "Canada" ], "language": "fr", "translated_text": "When will 2020 #Ford #Explorer be available in Canada?", "user": { "id": 557330088, "id_str": "522948157", "name": "Margarito Gantner", "screen_name": "MargaritoGantner", "location": "Madison, WI", "url": "", "description": "" }, "entities": { "hashtags": [ { "indices": null, "text": "Ford" }, { "indices": null, "text": "Explorer" } ], "user_mentions": [ { "id": 2244994945, "id_str": "2244994945", "indices": [ 0, 12 ], "name": "Contoso Auto", "screen_name": "ContosoAuto" } ] } }
As you inspect the result documents, take note of the various components that were added to enrich the tweet data using built-in cognitive skills. You can go back into Cosmos DB to see the base document structure, and compare that to the search result with the additional data. The fields created by the cognitive search enrichment pipeline are people, organizations, locations, keyphrases, language, and translated_text. These fields contain information extracted from the text field by the individual cognitive skills you selected on the Add cognitive search page above.
In the previous task, you created a basic Cognitive Search pipeline using the Azure portal user interface. This pipeline uses built-in Cognitive Skills for extracting people, organization, and location names and key phrases, and for detecting the primary language. In this task, you will enhance your Cognitive Search pipeline using functionality available only through the Azure Cognitive Search Service REST APIs.
-
You will use a console application running from Visual Studio to add skills not available through the Azure portal to your cognitive search pipeline. The console application calls the Azure Cognitive Search Service REST APIs to update various components of your cognitive search pipeline with these enhancements. In the steps below you will be updating the skillset, index, and indexer of your search pipeline with these enhancements.
-
To prepare the console application, you first need to add multiple values for various Azure services into the appsettings.json file. Return to Visual Studio, and open the appsettings.json file located under the PipelineEnhancer project.

The appsettings.json file will look like the following, and here you can see the values that you need to retrieve before moving on.
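As a skeleton, the file contains a section for each service you will configure in the steps that follow (section contents elided here):

{
  "Search": { ... },
  "CognitiveServices": { ... },
  "FormRecognizer": { ... },
  "BlobStorage": { ... },
  "CosmosDb": { ... },
  "FunctionApp": { ... }
}
-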
To retrieve values for the required settings for your Azure Cognitive Search Service, navigate to your Azure Cognitive Search Service in the Azure portal by selecting it from the list of resources in the tech-immersion-XXXXX resource group (where XXXXX is the unique identifier assigned to you for this workshop).
-
On the overview blade of your search service, copy the name of your Search Service.
-
Return to Visual Studio, and in the appsettings.json file, locate the Search section, which contains the settings you need to update to connect to your Search Service.

"Search": {
  "ServiceName": "<enter your Azure Cognitive Search Service name here>",
  "Key": "<enter your Azure Cognitive Search Service Key here>",
  "DataSourceName": "tweets-cosmosdb",
  "IndexName": "tweet-index",
  "IndexerName": "tweet-indexer",
  "SkillsetName": "tweet-skillset",
  "ApiVersion": "2019-05-06-Preview"
}
-
Within the Search section, locate the line that looks like the following:

"ServiceName": "<enter your Azure Cognitive Search Service name here>"
-
Replace the value of the ServiceName setting by pasting the copied name within double-quotes. The line should now look similar to this:

"ServiceName": "tech-immersion"
-
Return to your Search Service blade in the Azure portal, select Keys from the left-hand menu, and then select the Copy button for the Primary admin key value.
-
Return to Visual Studio and the appsettings.json file, and update the Key setting within the Search section. Paste the key you copied into the value for this setting. It should look similar to:

"Key": "4DB94C2CC80B42ACA459C839A0863A8A"
-
Next you will retrieve the values for your Cognitive Services account. Return to the Azure portal, and select the Cognitive Services account named tech-immersion-cogserv from the list of resources under the tech-immersion-XXXXX resource group (where XXXXX is the unique identifier assigned to you for this workshop).
NOTE: There are multiple Cognitive Services accounts in the resource group. You will want the one named tech-immersion-cogserv for this step.
-
On the Cognitive Services blade, select Properties from the left-hand menu, and then copy the value for the Resource ID field by selecting the copy button to the right of the field.
-
Return to Visual Studio and in the appsettings.json file, paste the Resource ID value into the ResourceId field value within the CognitiveServices section. It should look similar to the following:

"ResourceId": "/subscriptions/30fc406c-c745-44f0-be2d-63b1c860cde0/resourceGroups/tech-immersion/providers/Microsoft.CognitiveServices/accounts/tech-immersion-cogserv"
-
Return to your Cognitive Services account in the Azure portal, and select Keys and Endpoint from the left-hand menu. On the Keys blade, copy the Key 1 value by selecting the copy button to the right of the field.
-
Return to Visual Studio and in the appsettings.json file, paste the Cognitive Services Key 1 value into the value for the Key field. It will look similar to the following.

"Key": "872353ecac8d43a7bf5a60c3ece9ff4a"
-
Next, you will retrieve the endpoint and key for your Form Recognizer Cognitive Service. In the Azure portal, select the Cognitive Services account named tech-immersion-form-recog from the list of resources under the tech-immersion-XXXXX resource group (where XXXXX is the unique identifier assigned to you for this workshop).
-
Select Keys and Endpoint from the left-hand menu of the Form Recognizer Cognitive Services blade. From this screen, you will copy the Key1 and Endpoint values.
-
Return to Visual Studio and in the appsettings.json file, paste the Endpoint value into the Endpoint field value within the FormRecognizer section. It should look similar to the following:

"Endpoint": "https://tech-immersion-form-recog.cognitiveservices.azure.com/"
-
Next, copy the Key1 value and in the appsettings.json file, paste it into the value for the Key field within the FormRecognizer section. It will look similar to the following.

"Key": "9d1079dd70494ac3b366a8a91e363b5b"
-
To retrieve the required values for your Azure Blob Storage account, select the techimmersionstorageXXXXX Storage account resource from your resource group (where XXXXX is the unique identifier assigned to you for this workshop).
-
On the Storage account blade, select Access keys from the left-hand menu, and then copy the Storage account name.
-
Return to Visual Studio and in the appsettings.json file, paste the Name value into the value for the AccountName field within the BlobStorage section. It will look similar to the following.

"AccountName": "techimmersionstorage"
-
Return to the Access keys blade of your storage account in the Azure portal, and copy the key1 Connection string.
-
Return to Visual Studio and in the appsettings.json file, paste the Connection string value into the value for the ConnectionString field within the BlobStorage section. It will look similar to the following.

"ConnectionString": "DefaultEndpointsProtocol=https;AccountName=techimmersionstorage;AccountKey=4JBkkA1ot5bDZoLs4DvlH+7e5UXwrFxxrYb4taYMgkkrSdB8fan7E0coGlzvtzrlqPBzJg+DKpAFPoCHBIxlag==;EndpointSuffix=core.windows.net"
-
The final setting you need for your Blob storage account is a shared access signature, or SAS token. Return to your Blob storage account in the Azure portal and select Shared access signature from the left-hand menu. On the Shared access signature blade, ensure that all of the Allowed resource types boxes are checked, enter an End date a week or two in the future, select Generate SAS and connection string, and then copy the SAS token value.
-
Return to Visual Studio and in the appsettings.json file, paste the SAS token string value into the value for the SasToken field within the BlobStorage section. It will look similar to the following.

"SasToken": "?sv=2018-03-28&ss=bfqt&srt=sco&sp=rwdlacup&se=2019-07-02T01:20:49Z&st=2019-06-15T17:20:49Z&spr=https&sig=8LBK6113sDnaqp1X7A3nyXQL5l%2F5VgBsa5Ma6%2BYawuY%3D"
-
Next, you need to retrieve your Cosmos DB connection string. In the Azure portal, select the tech-immersion-XXXXX resource group (where XXXXX is the unique identifier assigned to you for this workshop), and then select the tech-immersionXXXXX Azure Cosmos DB account from the list of resources (where XXXXX is the unique identifier assigned to you for this workshop) within the resource group.
IMPORTANT: There may be two Cosmos DB accounts in your resource group. Select the Cosmos DB account named tech-immersionXXXXX, with no hyphen between immersion and XXXXX.
-
On your Cosmos DB blade, select Keys from the left-hand menu.
-
Copy the Primary Connection String value by selecting the copy button to the right of the field.
-
Return to the appsettings.json file in Visual Studio and paste the Primary Connection String into the value for the ConnectionString setting within the CosmosDb section. The settings should look similar to the following:

"ConnectionString": "AccountEndpoint=https://tech-immersion.documents.azure.com:443/;AccountKey=p9avU3FcaeffHI50SeenA6zfvUcoZEk3rYwg4FdrsRfqrj3AfobcKdacfBkHlGZ1eiBnMafwhjxdoEUSai8LLA==;"
-
The final settings you need to retrieve are those for your Azure Function App. In the Azure portal, navigate to your ti-function-day2-XXXXX Function App (where XXXXX is the unique identifier assigned to you for this workshop), and copy the URL on the Overview blade.
-
Return to the appsettings.json file for the PipelineEnhancer project in Visual Studio, and paste the value into the Url setting within the FunctionApp section. It should look similar to:

"Url": "https://tech-immersion-functions.azurewebsites.net"
-
In the Azure portal, navigate to your ti-function-day2-XXXXX Function App (where XXXXX is the unique identifier assigned to you for this workshop). In the Overview blade, select Switch to classic experience, Continue to classic experience, and then select Function app settings under Configured features.
-
On the Function app settings tab, select the Copy button next to the default Host Key.
-
Return to Visual Studio and the appsettings.json file for the PipelineEnhancer project, and paste the default host key value into the DefaultHostKey setting within the FunctionApp section. It will look similar to the following:

"DefaultHostKey": "h3CqiI4JFKMGaN2BHwtYwxmgfwtqW0kaWbpaEQkyAcR3Lle5fKs9rg=="
-
Save appsettings.json. All of the settings you updated above should now have values.
-
You are now ready to move on to adding the enhancements to your pipeline. The create skillset API uses the following endpoint:
PUT https://[servicename].search.windows.net/skillsets/[skillset name]?api-version=2019-05-06-Preview
api-key: [admin key]
Content-Type: application/json
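As a sketch, calling this endpoint with curl would look like the following, where skillset.json contains the full skillset definition (the service name and admin key are placeholders):

curl -X PUT "https://[servicename].search.windows.net/skillsets/tweet-skillset?api-version=2019-05-06-Preview" \
  -H "api-key: [admin key]" \
  -H "Content-Type: application/json" \
  -d @skillset.json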
-
To add the sentiment analysis pre-built skill to your search pipeline, the PipelineEnhancer app will append the following JSON to the body of the Skillset you created through the Azure portal UI.

{
  "@odata.type": "#Microsoft.Skills.Text.SentimentSkill",
  "inputs": [
    { "name": "text", "source": "/document/text" }
  ],
  "outputs": [
    { "name": "score", "targetName": "sentiment" }
  ]
}
In the above JSON, the inputs specify the field in the source data document to send for analysis. The outputs section dictates that the score value returned by the Text Analytics endpoint in Cognitive Services should be mapped to an output field named sentiment in the search results. This is sent to the REST API, along with the JSON from the previously built skillset, to update or create the skillset.
-
To add sentiment analysis to your pipeline, you will run the PipelineEnhancer project within the CognitiveSearch solution in Visual Studio. To run the project, right-click the PipelineEnhancer project in Visual Studio and select Set as StartUp Project.

Note: You have to exit or stop the DataGenerator before you can set PipelineEnhancer as your startup project.
-
Now, select the run button on the Visual Studio toolbar, which is the one with a green arrow followed by the text "PipelineEnhancer."
Note: If the program stops because of a breakpoint in the code, select Continue in the top menu.
-
At the command prompt for the console app, enter 1 to incorporate the Sentiment cognitive skill to your pipeline.
-
When the console app completes you will receive a message stating that the sentiment analysis skill was successfully added.
In addition to updating the Skillset JSON, the Indexer and Index were also updated to include a new field named sentiment.
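As a rough sketch of those updates (the attribute values shown are illustrative; the enhancer applies equivalent definitions through the REST APIs), the index gains a field such as:

{ "name": "sentiment", "type": "Edm.Double", "retrievable": true, "filterable": true, "sortable": true }

and the indexer gains an output field mapping that routes the skill output into that field:

{ "sourceFieldName": "/document/sentiment", "targetFieldName": "sentiment" }
-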
The process above deleted and recreated your Index, Indexer and Skillset, so you may need to select Indexers and the tweet-indexer on your Azure Cognitive Search Service blade, and then select Run to force the Indexer to run against your tweet data again before attempting to run a search against the index in the next step.
The Indexer Run screen does not refresh when the indexer has finished, so you can return to the overview blade of the Search service, and then select Indexers. Then, you can use the Refresh button on the Search service tool bar, next to Search explorer, to refresh the status. The tweet-indexer will display a status of Success when it finishes.
-
Return to your Azure Cognitive Search service in the Azure portal and select Search explorer on the Search Service toolbar.
-
On the Search explorer tab, select Search and inspect one of the records in the search results.
{ "@search.score": 1, "created_at": "2019-12-01T20:24:02.683Z", "id_str": "752561732", "text": "@ContosoAuto, can you make a #Chevrolet #Corvette that can go 300 mph please?", "id": "5d1fdf92-b7bc-4e0d-a10b-18058058ac41", "rid": "Tk5zWUFKQ2tVZzQ4QUFBQUFBQUFDQT090", "people": [], "organizations": [ "Chevrolet" ], "locations": [], "keyphrases": [ "Corvette", "Chevrolet", "mph", "ContosoAuto" ], "language": "en", "translated_text": "@ContosoAuto, can you make a #Chevrolet #Corvette that can go 300 mph please?", "sentiment": 0.5, "user": { "id": 476862525, "id_str": "971503000", "name": "Herma Dupaski", "screen_name": "HermaDupaski", "location": "San Jose, CA", "url": "", "description": "" }, "entities": { "symbols": [], "urls": [], "hashtags": [ { "indices": null, "text": "Chevrolet" }, { "indices": null, "text": "Corvette" } ], "user_mentions": [ { "id": 2244994945, "id_str": "2244994945", "indices": [ 0, 12 ], "name": "Contoso Auto", "screen_name": "ContosoAuto" } ] }, "extracted_entities": [ { "name": "Chevrolet", "wikipediaId": null, "wikipediaLanguage": null, "wikipediaUrl": null, "bingId": null, "type": "Organization", "subType": null, "matches": [ { "text": "Chevrolet", "offset": 30, "length": 9 } ] }, { "name": "300 mph", "wikipediaId": null, "wikipediaLanguage": null, "wikipediaUrl": null, "bingId": null, "type": "Quantity", "subType": "Dimension", "matches": [ { "text": "300 mph", "offset": 62, "length": 7 } ] }, { "name": "Chevrolet Corvette", "wikipediaId": "Chevrolet Corvette", "wikipediaLanguage": "en", "wikipediaUrl": "https://en.wikipedia.org/wiki/Chevrolet_Corvette", "bingId": "527f9e03-9f24-d571-1165-653ac75ac37d", "type": "Other", "subType": null, "matches": [ { "text": "Corvette", "offset": 41, "length": 8 } ] } ], "named_entities": [ { "category": "Organization", "value": "Chevrolet", "offset": 30, "confidence": 0.9223402738571168 }, { "category": "Quantity", "value": "300 mph", "offset": 62, "confidence": 0.8 } ] }
Notice the addition of the sentiment field in the results. The value contained in this field is a numeric prediction made by a machine learning model about the sentiment of the contents of the translated_text field in the tweet. The translated_text field was used to standardize the scores against a single language (English in this case). Scores range from 0 to 1. Scores close to 1 indicate positive sentiment, and scores close to 0 indicate negative sentiment. Scores in the middle are considered to be neutral in the expression of sentiment. In the record above, the sentiment was determined to be neutral, 0.5, by the ML model.

In addition to the sentiment field, you may have also noticed two new objects within the search documents, extracted_entities and named_entities. These properties are part of the EntityRecognitionSkill added through the Azure portal, but are not included in the search documents by default. We added these using the REST APIs to further enhance the enrichments available in the search index.
In this task, you will run various queries against your Search Index to explore a few of the search capabilities of Azure Cognitive Search.
-
In the Azure portal, navigate to your Search service resource and select Search explorer in the toolbar on the overview blade.
The last step of the previous task deleted and recreated your Index, Indexer and Skillset. If you don't see any search results, you may need to select Indexers on the Overview blade and then tweet-indexer. Select Run to force the Indexer to run against your tweet data again before attempting to run a search against the index. If the status is In progress, select Refresh in the toolbar, and wait for the status to change to Success.
-
On the Search explorer tab, select Search and observe the results.
-
You can now play around with the search functionality. Below, you will enter a few queries that simulate what an application user may enter. These queries tend to resemble natural language, so we will start there.
-
In the Query string field, enter "language is es", and observe the results. This will return records in the search results where the language was detected as Spanish.
{ "@search.score": 0.5382439, "created_at": "2019-12-01T20:27:35.286Z", "id_str": "827406160", "text": "¡El sistema de sonido en mi nuevo #Dodge #Charger es increíble!", "id": "e688dfa3-f240-4f8e-b1e3-e53a82133e0f", "rid": "Tk5zWUFKQ2tVZzRlQVFBQUFBQUFDQT090", "people": [], "organizations": [ "Dodge", "Charger" ], "locations": [], "keyphrases": [ "Dodge", "The sound system on my new", "Charger", "amazing" ], "language": "es", "translated_text": "The sound system on my new #Dodge #Charger is amazing!", "sentiment": 0.5, "user": { "id": 901610181, "id_str": "762411151", "name": "Mickey Kohm", "screen_name": "MickeyKohm", "location": "Denver, CO", "url": "", "description": "" }, "entities": { "symbols": [], "urls": [], "hashtags": [ { "indices": null, "text": "Dodge" }, { "indices": null, "text": "Charger" } ], "user_mentions": [ { "id": 2244994945, "id_str": "2244994945", "indices": [ 0, 12 ], "name": "Contoso Auto", "screen_name": "ContosoAuto" } ] }, "extracted_entities": [ { "name": "Dodge", "wikipediaId": "Dodge", "wikipediaLanguage": "es", "wikipediaUrl": "https://es.wikipedia.org/wiki/Dodge", "bingId": "f4c7e3f7-b855-e6b4-cf0c-bf4919141399", "type": "Organization", "subType": null, "matches": [ { "text": "Dodge", "offset": 28, "length": 5 } ] }, { "name": "Charger", "wikipediaId": null, "wikipediaLanguage": null, "wikipediaUrl": null, "bingId": null, "type": "Organization", "subType": null, "matches": [ { "text": "Charger", "offset": 35, "length": 7 } ] } ], "named_entities": [ { "category": "Organization", "value": "Dodge", "offset": 28, "confidence": 0.8 }, { "category": "Organization", "value": "Charger", "offset": 35, "confidence": 0.6400768756866455 } ] }
Notice the addition of the translated_text field to the results documents. This was added by the pre-built text translation skill you added above, and it contains the English translation of the contents of the text field in the tweet. You will also notice that the keyphrases field contains only English words and phrases. This is because the KeyPhraseExtractionSkill was pointed to the new translated_text field with the latest updates to the search pipeline.

In addition, notice that each record returned contains a @search.score field with a numeric value. This value indicates the confidence of the match with the search query. The higher the value, the more likely it is to be a match for the query. If you scroll down past the records where "Language"="es", you will see the search score values drop, as those records don't match the query.
-
Try another search, such as "cold battery" and observe the results.
-
Now, let's try a slightly more advanced search. We want to look for only records that mention a "corvette", and we only want to retrieve the text, user.location, and sentiment fields in our results. Paste the following into the Query string box, and select Search:

search=corvette&$select=text,user/location,sentiment&$count=true
Adding the $select parameter limits results to the explicitly named fields for more readable output in the Search explorer.
-
You can take that query even further by adding in the $filter parameter. Use the $filter parameter when you want to specify precise criteria rather than free text search. This example searches for sentiment less than 0.25, so we can target tweets with negative sentiment in the search results.

search=corvette&$select=text,user/location,sentiment&$filter=sentiment lt 0.25&$count=true
-
The final query we will run adds the $orderby parameter, which allows you to specify the sort order of your results. In this case, let's search for records where the sentiment is the highest, filtering for records where the sentiment is greater than 0.9, and ordering the results in descending order.

search=*&$select=text,user/location,sentiment&$filter=sentiment gt 0.9&$count=true&$orderby=sentiment desc
In the search results, observe that the sentiment values slowly decrease as you scroll down through the returned documents.
Above, you added a series of enrichments to your Azure Cognitive Search index using both pre-built and custom cognitive skills. These enrichments were used to add structure and metadata to documents within your Azure Cognitive Search index, making searching more effective. There are many scenarios, however, where access to the additional metadata inserted into your documents by the use of cognitive search can be useful outside of Azure Cognitive Search, such as for knowledge mining.
Azure Cognitive Search provides the ability to export enriched documents to a knowledge store. The knowledge store feature enables you to save enriched documents to either blobs or tables in an Azure Storage account by using projections. Projections are views of enriched documents, created using a Shaper Skill, that can be saved to physical storage for knowledge mining purposes. A projection lets you "project" your data into a shape that aligns with your needs, preserving relationships so that tools like Power BI can read the data with no additional effort. Using the knowledge store feature allows documents to be saved for subsequent evaluation, exploration, and to potentially become inputs to a downstream data science workload. Any tool or process that can connect to Azure Storage can consume the contents of a knowledge store.
The knowledge store supports two types of projections:
- Tables: For data that is best represented as rows and columns, table projections allow you to define a schematized shape or projection in Table storage.
- Objects: When you need a JSON representation of your data and enrichments, object projections are saved as blobs.
For this workshop, we will target table projections. In the steps below, you will use the PipelineEnhancer to add a knowledge store to your search pipeline and send the documents to Azure Table storage. The PipelineEnhancer application will add two items to the Skillset associated with your cognitive search pipeline: a ShaperSkill and a knowledgeStore definition.
The Shaper Cognitive Skill can be used to consolidate multiple inputs into a complex type, and output that as a projection, as you can see in the JSON below.
{
"@odata.type": "#Microsoft.Skills.Util.ShaperSkill",
"name": "ShaperSkill",
"description": "ShaperSkill for knowledge store",
"context": "/document",
"inputs": [
{
"name": "created_at",
"source": "/document/created_at",
"sourceContext": null,
"inputs": []
},
{
"name": "id_str",
"source": "/document/id_str",
"sourceContext": null,
"inputs": []
},
{
"name": "text",
"source": "/document/text",
"sourceContext": null,
"inputs": []
},
{
"name": "translated_text",
"source": "/document/translated_text",
"sourceContext": null,
"inputs": []
},
{
"name": "sentiment",
"source": "/document/sentiment",
"sourceContext": null,
"inputs": []
},
{
"name": "user",
"source": "/document/user",
"sourceContext": null,
"inputs": []
},
{
"name": "entities",
"source": "/document/entities",
"sourceContext": null,
"inputs": []
},
{
"name": "id",
"source": "/document/id",
"sourceContext": null,
"inputs": []
},
{
"name": "rid",
"source": "/document/rid",
"sourceContext": null,
"inputs": []
},
{
"name": "KeyPhrases",
"source": "/document/text/keyphrases/*",
"sourceContext": null,
"inputs": []
},
{
"name": "languageCode",
"source": "/document/language",
"sourceContext": null,
"inputs": []
},
{
"name": "Entities",
"source": null,
"sourceContext": "/document/text/extracted_entities/*",
"inputs": [
{
"name": "Name",
"source": "/document/text/extracted_entities/*/name",
"sourceContext": null,
"inputs": []
},
{
"name": "Type",
"source": "/document/text/extracted_entities/*/type",
"sourceContext": null,
"inputs": []
},
{
"name": "Url",
"source": "/document/text/extracted_entities/*/wikipediaUrl",
"sourceContext": null,
"inputs": []
}
]
}
],
"outputs": [
{
"name": "output",
"targetName": "tableprojection"
}
]
}
In the JSON above, fields from the documents are added to an output object named tableprojection. This output is used by the knowledge store to create tables that will be written to Azure Table storage.
The second object being added to the Skillset by the PipelineEnhancer is the definition of the knowledge store. This defines the path to the storage account hosting the knowledge store, along with definitions of the table projections that will be created there. The code below defines four tables (Documents, KeyPhrases, Entities, and Users) that will be created from the tableprojection object created by the ShaperSkill.
"knowledgeStore": {
"storageConnectionString": "[storage-connection-string]",
"projections": [
{
"tables": [
{
"tableName": "Documents",
"generatedKeyName": "DocumentId",
"source": "/document/tableprojection",
"sourceContext": null,
"inputs": []
},
{
"tableName": "KeyPhrases",
"generatedKeyName": "KeyPhraseId",
"source": null,
"sourceContext": "/document/tableprojection/KeyPhrases/*",
"inputs": [
{
"name": "Keyphrases",
"source": "/document/tableprojection/KeyPhrases/*",
"sourceContext": null,
"inputs": []
}
]
},
{
"tableName": "Entities",
"generatedKeyName": "EntityId",
"source": "/document/tableprojection/Entities/*",
"sourceContext": null,
"inputs": []
},
{
"tableName": "Users",
"generatedKeyName": "UserId",
"source": null,
"sourceContext": "/document/tableprojection/user",
"inputs": [
{
"name": "id",
"source": "/document/tableprojection/user/id",
"sourceContext": null,
"inputs": []
},
{
"name": "id_str",
"source": "/document/tableprojection/user/id_str",
"sourceContext": null,
"inputs": []
},
{
"name": "name",
"source": "/document/tableprojection/user/name",
"sourceContext": null,
"inputs": []
},
{
"name": "screen_name",
"source": "/document/tableprojection/user/screen_name",
"sourceContext": null,
"inputs": []
},
{
"name": "location",
"source": "/document/tableprojection/user/location",
"sourceContext": null,
"inputs": []
}
]
}
],
"objects": []
}
]
}
Now, let's add the knowledge store to your search pipeline.
-
Return to the PipelineEnhancer console application, enter 2 at the prompt, and press Enter. This will execute the steps to add a knowledge store.

The knowledge store feature is still in preview, and the .NET SDK currently does not provide objects for working with it, so the code executing for this step builds raw JSON and sends it to the Azure Cognitive Search REST API to update the pipeline. The JSON snippets above are added to the JSON retrieved for the existing pipeline and then sent back to the REST API to recreate the pipeline components. The storageConnectionString value is updated within code to add the connection string of your Azure Storage account, which you added to the appsettings.json in a previous step.
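A rough sketch of the kind of REST call the PipelineEnhancer makes for this step is shown below (the class, method, and variable names here are illustrative, not the actual application code):

using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class SkillsetUpdater
{
    // Illustrative only: PUT an updated skillset definition to the Azure Cognitive Search REST API.
    // serviceName, adminKey, and skillsetJson are assumed to come from appsettings.json.
    static async Task UpdateSkillsetAsync(string serviceName, string adminKey, string skillsetJson)
    {
        using var client = new HttpClient();
        client.DefaultRequestHeaders.Add("api-key", adminKey);

        var url = $"https://{serviceName}.search.windows.net/skillsets/tweet-skillset?api-version=2019-05-06-Preview";
        var content = new StringContent(skillsetJson, Encoding.UTF8, "application/json");

        // PUT creates the skillset if it does not exist, or replaces it if it does.
        var response = await client.PutAsync(url, content);
        response.EnsureSuccessStatusCode();
    }
}
-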
When the console app completes you will receive a message stating that the knowledge store was successfully added.
-
Return to your Azure Cognitive Search service in the Azure portal, and select Indexers.
The process above deleted and recreated your Index, Indexer and Skillset, so you may need to select the tweet-indexer on your Azure Cognitive Search Service blade, and then select Run to force the Indexer to run against your tweet data again before attempting to run a search against the index in the next step.
-
When the indexer finishes running you may notice a status of Warning.
-
To view the details of any warnings, select the tweet-indexer, then select the most recent run from the Execution history shown on the right-hand side, and finally select one of the messages with a warning. On the right-hand side of the screen, you will be able to view any errors and warnings in more detail.
The warning in this instance is caused by the fact that not every ExtractedEntity has its wikipediaUrl property populated. This warning can be safely ignored.
-
Projections are the mechanism you use to structure data in a knowledge store. Using the code above, the projection output was saved as a collection of related tables. In the Azure portal, an easy way to view knowledge store contents is through the built-in Storage Explorer for Azure storage.
-
In the Azure portal, navigate to the techimmersionstoreXXXXX storage account resource (where XXXXX is the unique identifier assigned to you for this workshop), select Storage Explorer (preview) from the left-hand menu and then select TABLES to expand it.
-
Select the Documents table and review the structure that was created by the knowledge store. Notice that new columns were added to provide identifiers and enable relationships between the tables. Fields from the documents in the search index are located at the end of each row.
-
Take a moment to explore the other tables that were created. You will be accessing the data within these tables from Power BI in the next task.
In this task, you will do some knowledge mining using Power BI and the data written to the knowledge store.
-
On your JumpBox VM, open Power BI Desktop from the Windows Start menu.
If you see the Power BI Desktop splash screen, skip to Step 4 below.
-
On the Welcome to Power BI Desktop dialog, select the Already have a Power BI account? Sign in link at the bottom.
-
On the Sign in screen that appears, select the X in the upper right-hand corner to close the dialog.
-
Next, on the Power BI Desktop splash screen, select Get data.
-
In the Get data dialog, select Azure on the left-hand list, and then select Azure Table Storage from the Azure options.
-
Select Connect.
-
On the Azure Table Storage dialog, enter the name of your Azure Storage account, which you can get from the Environment Details sheet provided to you for this lab. The storage account name should be techimmersionstoreXXXXX (where XXXXX is the unique identifier assigned to you for this workshop).
If you are asked to provide an account access key, navigate to the Access keys page of your Azure Storage account and use the Key under key1.
-
On the Navigator dialog, select all four tables by checking the box next to each one, and then select Load.
-
It will take a moment to load the data from each of the tables. A loading dialog will be displayed to provide a progress indicator for each table.
-
Once loaded, you will see the tables appear under Fields on the right-hand side of the Power BI Desktop window. Expand the Documents table and review the fields listed.
-
As you can see, the content fields from Azure Table Storage are not yet accessible. To access the fields in each table, we need to edit the queries that are used to pull the table data from your Storage account. Right-click Documents and select Edit query.
-
Right-click the Content column header, and in the context menu select Remove Other Columns. This will remove the PartitionKey, RowKey, and Timestamp columns from the table.
-
Next, select the Expand Content button within the Content column.
-
In the dialog that appears, ensure all columns are checked in the list. Uncheck Use original column name as a prefix, and then select OK.
-
This action will expand each item within the Content field into a new column in the table. On the right-hand side of the dialog, you can view the query transformation steps that have been applied. Should you wish to undo a change, you can select the X next to that step in the list.
-
Within the expanded columns for the Documents table, locate the sentiment field, select the Data Type icon for the column, and change it to Decimal Number.
-
Next, select Entities under Queries and repeat the steps above to remove the unnecessary columns and expand the contents of the Content column. There are no columns that need to have their data types changed, so skip that step for the remaining tables.
-
Repeat this process for the KeyPhrases and Users tables.
-
When the query edits have been made on all of the tables, select Close & Apply in the Power Query Editor window's toolbar.
-
When the updated queries finish loading, expand Documents under Fields on the right-hand side of the Power BI Desktop window, and confirm that you can now see the Content fields.
-
With the fields reflecting properly, you are now ready to add some visualizations for your enriched data. In the Visualizations pane, select the Gauge visual to add it to the report canvas.
-
In the properties pane for the Gauge visualization, drag the sentiment field under the Documents table into the Value box.
-
Next, select the drop down arrow next to sentiment in the Value field, and select Average.
-
Select the Format icon in the Visualizations properties pane.
-
On the Format pane of the Gauge visualization, expand Gauge axis and enter the following values:
- Min: 0
- Max: 1
- Target: 0.75
-
Now expand the Title section and enter Sentiment KPI as the Title text.
-
Click anywhere in the whitespace of the report canvas to deselect the Gauge visualization. The gauge should look similar to the following:
-
Before moving on to the next visualization, let's review the relationships between the tables. On the left-hand side of the Power BI report canvas, select the Model icon.
-
The relationships are represented as lines connecting the tables. Return to the report canvas by selecting the Report icon in the left-hand toolbar.
-
On the Report view, select the Map visualization to add that to the canvas.
-
Under Fields, expand the Users table and drag the location field into the Location box of the Map visualization. Then, drag the sentiment field from the Documents table into the Size field.
-
As you did previously, select the drop down arrow next to sentiment in the Size field and select Average from the context menu.
-
Finally, select the Format icon as you did above, expand the Title section, and set the Title text of the Map visualization to "Average sentiment by location".
-
Click anywhere in the whitespace of the report canvas to deselect the Map visualization.
-
Next, select a Treemap visualization and drag the Keyphrases field from the KeyPhrases table into the Group and Values fields on the Treemap visualization.
-
Select the Format icon, and set the visualization title to Keyphrase treemap.
-
Click anywhere in the whitespace of the report canvas to deselect the Treemap visualization.
-
The final visualization you will add is a Table visualization. Select it from the list of visualizations and set the following:
- Drag the Keyphrases field from the KeyPhrases table into the Values box.
- Drag the Keyphrases field into the Values box a second time, and on the second Keyphrases item select the drop down arrow and select Count from the context menu.
-
Click anywhere in the whitespace of the report canvas to deselect the Table visualization.
-
Resize and reposition the visualizations so your report dashboard looks similar to the following:
In the steps above, you created a knowledge store from your search index and then used that to consume the enriched documents using Power BI with Power Query. Reshaping performed by the Shaper Skill created multiple projections that allowed you to define tables that align with the intended use of the data while preserving relationships.
In addition to predefined Cognitive skills, you also have the ability to integrate custom skills into your Cognitive Search enrichment pipelines. In this task, you will update a few values in the Function code inside the CustomSkillFunctions project in Visual Studio, and then deploy the Function App to Azure. For this experience, we are using an Azure Function App to wrap the custom cognitive skills so that they implement the required custom skill interface. You will be implementing three different custom skills in the following tasks, so you will configure the Functions now so that you don't have to publish them within each exercise.
While this example uses an Azure Function to host a web API, you can use any approach as long as it meets the interface requirements for a cognitive skill. Azure Functions, however, make it very easy to create custom skills.
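Whatever host you choose, a custom skill is simply a web API that accepts and returns batches of records keyed by recordId. As a sketch, a request body follows this shape (the data field names here are illustrative):

{
  "values": [
    { "recordId": "1", "data": { "formUrl": "https://<storage-account>.blob.core.windows.net/forms/<file>.pdf" } }
  ]
}

and the skill must respond with matching recordIds:

{
  "values": [
    { "recordId": "1", "data": { "address": "<extracted value>" }, "errors": null, "warnings": null }
  ]
}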
-
The first custom skill you add to the pipeline will use the Form Recognizer API, so let's start by adding the Service Endpoint and API key to the AnalyzeForm Function. As you've done with previous resources, select your tech-immersion-form-recog Cognitive Services resource from the tech-immersion-XXXXX resource group (where XXXXX is the unique identifier assigned to you for this workshop) in the Azure portal.
-
On the Form Recognizer Cognitive Services blade, select Keys and Endpoint from the left-hand menu, and then select the Copy button next to the value for Key 1.
-
Return to the CustomSkillFunctions project in Visual Studio, and in the Solution Explorer on the right-hand side, open AnalyzeFormFunction.cs.
-
In the AnalyzeFormFunction.cs file, locate the lines of code (starting on line 20) that look like the following:

// TODO: Replace the service endpoint with the endpoint for your Forms Recognizer service.
private static readonly string serviceEndpoint = "<enter your service endpoint here>";

// TODO: Replace this example key with a valid subscription key.
private static readonly string key = "<enter your api key here>";
-
Replace the key value within double-quotes (<enter your api key here>) with the API key you copied for the Form Recognizer Cognitive Service. The line should now look similar to this:

private static readonly string key = "9d1079dd70494ac3b366a8a91e363b5b";
-
Return to the Keys and Endpoint blade in the Azure portal and copy the ENDPOINT value.
-
Back in Visual Studio, replace the serviceEndpoint value in the AnalyzeFormFunction.cs file with the endpoint of your Forms Recognizer service. The completed line will look similar to:

private static readonly string serviceEndpoint = "https://tech-immersion-form-recog.cognitiveservices.azure.com/";
IMPORTANT: Your Forms Recognizer pipeline will fail if the serviceEndpoint is pointing to the incorrect region.
-
Save AnalyzeFormFunction.cs.

Take a few minutes to look over the code in AnalyzeFormFunction.cs. This file defines a Function that will be deployed to your Azure Function App. The function code receives the URL reference for a form file stored in Blob Storage from the custom skill. The bytes of the form are extracted and then passed to the Form Recognizer, and the trained recognizer model is run against the form data.
-
Next, you will update another function that will be used for detecting anomalies in vehicle telemetry. Return to the Azure portal, and select the tech-immersion-anomaly-detector resource from the list of resources in the tech-immersion-XXXXX resource group (where XXXXX is the unique identifier assigned to you for this workshop).
-
On the Anomaly Detector Cognitive Services blade, select Keys and Endpoint from the left-hand menu, and then select the Copy button next to the value for Key 1.
-
Return to the CustomSkillFunctions project in Visual Studio, and in the Solution Explorer on the right-hand side, open DetectAnomaliesFunction.cs.
-
In the DetectAnomaliesFunction.cs file, locate the lines of code (starting on line 20) that look like the following:

// TODO: Replace the service endpoint with the endpoint for your Anomaly Detector service.
private static readonly string serviceEndpoint = "<enter your service endpoint here>";

// TODO: Replace the key with a valid service key.
private static readonly string key = "<enter your api key here>";
-
Replace the key value within double-quotes (<enter your api key here>) with the API key you copied for the Anomaly Detector Cognitive Service. The line should now look similar to this:

private static readonly string key = "e5fe6a9a9702447680864369d7a8965e";
- Return to the Anomaly Detector Keys and Endpoint blade in the Azure portal and copy the ENDPOINT value.
- Back in Visual Studio, replace the `serviceEndpoint` value in the `DetectAnomaliesFunction.cs` file with the endpoint of your Anomaly Detector service. The completed line will look similar to:

  ```csharp
  private static readonly string serviceEndpoint = "https://tech-immersion-anomaly-detector.cognitiveservices.azure.com/";
  ```

  IMPORTANT: Your Anomaly Detection pipeline will fail if the `serviceEndpoint` is pointing to the incorrect region.
- Save `DetectAnomaliesFunction.cs`.
- You are now ready to deploy the functions to your Azure Function App. Right-click the `CustomSkillFunctions` project, and select Publish from the context menu.
- On the Publish, Target dialog, select Azure and then select Next.

  If you are prompted to sign in, use the Azure credentials provided to you for this experience.
- On the Publish, Functions instance dialog, select your Subscription, and then select the Function App named ti-function-day2-XXXXX from the list of resources (where XXXXX is the unique identifier assigned to you for this workshop).

  You may need to enter the credentials of the account you are using for this workshop before you can see any resources for your subscription.
- Ensure that the Run from package file checkbox is selected and then select Finish.
- Select Publish to start the publish process. You will see an animated progress icon next to the Publish button while the deployment is in progress.
- When the publish is complete, you can open the Output window at the bottom left-hand corner of the Visual Studio window to observe the results. You should see messages stating that the build succeeded and the publish succeeded.

  If the Output window is not visible, you can display it by selecting the View menu in Visual Studio, and then selecting Output.
Now that you've had a chance to explore some of the cognitive search capabilities of Azure Cognitive Search, let's dive into some more advanced Cognitive Services that can be integrated into the cognitive search pipeline through the use of custom skills. Custom skills are added using the custom skill interface, which is accessed by adding a `WebApiSkill` to the skillset. In this task, you will add a custom skill to a new Forms Recognizer cognitive search pipeline.
The Function App you deployed to Azure contains two functions. One of them leverages the Form Recognizer service, which you will use in this task.
The Forms Recognizer is an AI-powered document extraction service, currently in preview, designed specifically to recognize and extract information from forms. Form Recognizer applies advanced machine learning to accurately extract text, key/value pairs, and tables from documents. With just a few samples, Form Recognizer tailors its understanding to your documents, both on-premises and in the cloud. It enables you to turn forms into usable data at a fraction of the time and cost, so you can focus more time acting on the information rather than compiling it.
To use the Form Recognizer, you will first call the Train Model API, passing in as few as five example forms to train the model to recognize forms of that type. The Train API accepts a source Azure Storage blob container URI where the training forms are located. Once the model is trained, you can create a custom cognitive skill, which will then be able to extract form fields, key-value pairs, and tables while indexing the forms storage location. A sketch of a training call appears below.
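For illustration, the following is a minimal sketch of calling the Train Model endpoint, assuming the v1.0-preview REST API; the endpoint, key, and SAS URL values are placeholders, and the workshop's `PipelineEnhancer` handles this step for you.

```csharp
// Minimal sketch (assumes the Form Recognizer v1.0-preview Train endpoint):
// train a custom model from sample forms stored in a Blob Storage container.
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class TrainModelSketch
{
    private static readonly HttpClient http = new HttpClient();

    public static async Task<string> TrainAsync(
        string serviceEndpoint, string key, string trainingContainerSasUrl)
    {
        var request = new HttpRequestMessage(
            HttpMethod.Post,
            $"{serviceEndpoint}formrecognizer/v1.0-preview/custom/train");
        request.Headers.Add("Ocp-Apim-Subscription-Key", key);

        // The request body points the service at the container of sample forms.
        string body = "{\"source\": \"" + trainingContainerSasUrl + "\"}";
        request.Content = new StringContent(body, Encoding.UTF8, "application/json");

        HttpResponseMessage response = await http.SendAsync(request);
        response.EnsureSuccessStatusCode();

        // The response includes the new modelId used by later Analyze calls.
        return await response.Content.ReadAsStringAsync();
    }
}
```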
ContosoAuto has provided you access to an Azure Blob storage account where they keep invoice forms. They have requested a demo of how those forms can be indexed, and how information about the structure, key-value pairs, and tables within those forms can be made searchable. To implement this functionality, we will create a new search index, which uses Azure Blob storage as the data source. Your storage account has been preloaded with forms that will be used to train the Form Recognizer model. The forms are invoices resembling the following:
- To get started, return to the `PipelineEnhancer` console application. The `PipelineEnhancer` uses the Azure Cognitive Search SDK to add the custom skill to your search pipeline using a `WebApiSkill`. This ultimately results in the following JSON being appended to the body of the skillset:

  ```json
  {
      "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
      "description": "Custom Form Recognizer skill",
      "context": "/document",
      "uri": "https://<your-function-app-name>.azurewebsites.net/api/AnalyzeForm?code=<your-function-app-default-key>",
      "httpMethod": "POST",
      "timeout": "PT30S",
      "batchSize": 1,
      "degreeOfParallelism": null,
      "inputs": [
          { "name": "contentType", "source": "/document/fileContentType", "sourceContext": null, "inputs": [] },
          { "name": "storageUri", "source": "/document/storageUri", "sourceContext": null, "inputs": [] },
          { "name": "storageSasToken", "source": "/document/sasToken", "sourceContext": null, "inputs": [] }
      ],
      "outputs": [
          { "name": "formHeight", "targetName": "formHeight" },
          { "name": "formWidth", "targetName": "formWidth" },
          { "name": "formKeyValuePairs", "targetName": "formKeyValuePairs" },
          { "name": "formColumns", "targetName": "formColumns" }
      ],
      "httpHeaders": {}
  }
  ```
  In the above JSON, the `inputs` specify the fields in the source data document to send for analysis. The `outputs` section dictates that the form field values returned by your Function App should be output into fields named `formHeight`, `formWidth`, `formKeyValuePairs`, and `formColumns` in the search results. This JSON is sent to the REST API, along with the JSON from the previously built skillset, to update or create the skillset. For reference, a sketch of how a skill like this is created with the .NET SDK appears below.
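  The following is a minimal sketch of constructing an equivalent skill object with the Microsoft.Azure.Search SDK; this is an assumption about how `PipelineEnhancer` builds the skill, and the function app name and key are placeholders, so the actual project code may differ.

  ```csharp
  // Minimal sketch (assumes the Microsoft.Azure.Search 9.x SDK): build a
  // WebApiSkill equivalent to the JSON above, ready to append to a skillset.
  using System;
  using System.Collections.Generic;
  using Microsoft.Azure.Search.Models;

  public static class FormSkillSketch
  {
      public static WebApiSkill BuildFormRecognizerSkill(string functionAppName, string functionKey)
      {
          // Source fields sent to the custom skill for each document.
          var inputs = new List<InputFieldMappingEntry>
          {
              new InputFieldMappingEntry(name: "contentType", source: "/document/fileContentType"),
              new InputFieldMappingEntry(name: "storageUri", source: "/document/storageUri"),
              new InputFieldMappingEntry(name: "storageSasToken", source: "/document/sasToken")
          };

          // Fields the Function App returns into the enriched document.
          var outputs = new List<OutputFieldMappingEntry>
          {
              new OutputFieldMappingEntry(name: "formHeight", targetName: "formHeight"),
              new OutputFieldMappingEntry(name: "formWidth", targetName: "formWidth"),
              new OutputFieldMappingEntry(name: "formKeyValuePairs", targetName: "formKeyValuePairs"),
              new OutputFieldMappingEntry(name: "formColumns", targetName: "formColumns")
          };

          return new WebApiSkill(
              inputs: inputs,
              outputs: outputs,
              uri: $"https://{functionAppName}.azurewebsites.net/api/AnalyzeForm?code={functionKey}",
              description: "Custom Form Recognizer skill",
              context: "/document",
              httpMethod: "POST",
              timeout: TimeSpan.FromSeconds(30),
              batchSize: 1);
      }
  }
  ```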
- Note that there are two values within the `uri` field of your custom skill that must be supplied so the skill can connect to your Function App: the Function App name and the default host key. You already added these values to the `PipelineEnhancer`'s `appsettings.json` file.
- Relaunch the `PipelineEnhancer` console app by selecting the Run button on the Visual Studio toolbar.
- At the prompt, enter 3 and press Enter.
- In the first task of this experience, you saw how to create an Azure Cognitive Search index through the Azure portal UI. In this task, you will not need to go through those steps, as the new search pipeline is created via code that accesses the Azure Cognitive Search Service REST API and the Azure Cognitive Search .NET SDK. Using the SDK and API, it is possible to quickly create new pipelines, as well as update and manage existing ones.
- Observe the output of the previous command in the `PipelineEnhancer` console app. In addition to creating a new search pipeline that uses a Blob Storage data source and includes a Form Recognizer custom skill, the application also passed in your Blob Storage account information so the Form Recognizer model could be trained with the sample forms.
- Now, let's take a look at the results of our new search index. Navigate to your Search Service in the Azure portal, and select Indexers. Under Indexers, notice the new forms-indexer. The indexer should have run upon creation, so verify that you see a status of Success, and then select Search explorer from the toolbar.

  NOTE: If the status is No history, you will need to select the forms-indexer and select Run on the indexer blade.
-
On the Search explorer blade, select Change index, and select forms-index from the list.
-
Select Search and observe the results. In addition to the built-in cognitive skills fields you reviewed previously in the tweet-index, the Form Recognizer custom skill has added fields details information extracted from the indexed forms. Specifically, the fields below were added to the index:
"formHeight": 792, "formWidth": 612, "formKeyValuePairs": [ "Microsoft: ", "Address:: 1111 8th st. Bellevue, WA 99501", "Invoice For:: Alpine Ski House 1025 Enterprise Way Sunnyvale, CA 94024", "Page: 1 of", "__Tokens__: Microsoft Page 1 of 1" ], "formColumns": [ "Invoice Number: 458176", "Invoice Date: 3/28/2018", "Invoice Due Date: 4/16/2018", "Charges: $89,024.34", "VAT ID: ET" ]
- The format of the output was specified in the custom skill, and it can be updated or changed depending on how you would like to search and use the extracted information. The same index can also be queried programmatically, as sketched below.
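As an aside, here is a minimal sketch of querying the new index with the .NET SDK, assuming the Microsoft.Azure.Search 9.x SDK; the service name, query key, and index name are placeholders for your own values.

```csharp
// Minimal sketch (assumes the Microsoft.Azure.Search 9.x SDK): query the
// forms-index and print the extracted key-value pairs from each document.
using System;
using Microsoft.Azure.Search;
using Microsoft.Azure.Search.Models;

public static class QueryFormsIndexSketch
{
    public static void Run()
    {
        // Placeholder service name, query key, and index name.
        var indexClient = new SearchIndexClient(
            "tech-immersion-search", "forms-index",
            new SearchCredentials("<your-query-key>"));

        // An empty (wildcard) search returns all indexed documents, paged.
        DocumentSearchResult<Document> results =
            indexClient.Documents.Search<Document>("*");

        foreach (SearchResult<Document> result in results.Results)
        {
            Console.WriteLine(result.Document["formKeyValuePairs"]);
        }
    }
}
```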
Azure Form Recognizer is a cognitive service that uses machine learning technology to identify and extract key-value pairs and table data from form documents. It then outputs structured data that includes the relationships in the original file. Unsupervised learning allows the model to understand the layout and relationships between fields and entries without manual data labeling or intensive coding and maintenance. This capability allows you to easily extract information about the structure and fields within your forms, making that information searchable with minimal development effort.
ContosoAuto has also asked if it would be possible to add incoming vehicle telemetry to a search index. They store this data in Cosmos DB, in a container named `vehicle-telemetry`. They have also asked whether a custom cognitive skill could be used to inspect some of this data for anomalies, and to add that information to the search index so they can easily find anomalous data.
To accomplish this, we will use the final function you deployed to your Function App, which was set up to use the Anomaly Detector API. With the Anomaly Detector API, you can monitor and detect abnormalities in your time series data using machine learning. The Anomaly Detector API adapts by automatically identifying and applying the best-fitting models to your data, regardless of industry, scenario, or data volume. Using your time series data, the API determines boundaries for anomaly detection, expected values, and which data points are anomalies. A sketch of a detection call appears below.
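For illustration, the following is a minimal sketch of calling the Anomaly Detector last-point detection endpoint, assuming the public v1.0 REST API; the engine-temperature series is made up, and the deployed function's actual code may differ.

```csharp
// Minimal sketch (assumes the Anomaly Detector v1.0 REST API): ask whether
// the latest point in an engine-temperature series is anomalous.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class AnomalySketch
{
    private static readonly HttpClient http = new HttpClient();

    public static async Task<string> DetectLastPointAsync(string serviceEndpoint, string key)
    {
        // Hypothetical hourly engine-temperature series: steady around 318,
        // with a final reading of 99 that should register as a negative anomaly.
        var start = new DateTime(2019, 6, 16, 0, 0, 0, DateTimeKind.Utc);
        List<string> points = Enumerable.Range(0, 23)
            .Select(i => $"{{ \"timestamp\": \"{start.AddHours(i):yyyy-MM-ddTHH:mm:ssZ}\", \"value\": 318 }}")
            .ToList();
        points.Add($"{{ \"timestamp\": \"{start.AddHours(23):yyyy-MM-ddTHH:mm:ssZ}\", \"value\": 99 }}");

        string body = "{ \"granularity\": \"hourly\", \"series\": [" + string.Join(",", points) + "] }";

        var request = new HttpRequestMessage(
            HttpMethod.Post,
            $"{serviceEndpoint}anomalydetector/v1.0/timeseries/last/detect");
        request.Headers.Add("Ocp-Apim-Subscription-Key", key);
        request.Content = new StringContent(body, Encoding.UTF8, "application/json");

        HttpResponseMessage response = await http.SendAsync(request);
        response.EnsureSuccessStatusCode();

        // The response reports isAnomaly, expectedValue, and the upper/lower margins.
        return await response.Content.ReadAsStringAsync();
    }
}
```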
- As with the Tweet data generated for the first task, to begin this task we need to send generated vehicle telemetry data into a Cosmos DB container. To get started, return to Visual Studio.
- In the Visual Studio Solution Explorer, right-click the `DataGenerator` project, and select Debug > Start new instance from the context menu.
- Debugging will launch the console app. In the `DataGenerator` console window, enter 2 at the prompt to start sending generated vehicle telemetry data into Cosmos DB. You will see statistics about telemetry data being streamed into Cosmos DB.

  Leave the `DataGenerator` console app running in the background while you move on to the following tasks in this experience. The app will run for 10 minutes, sending vehicle telemetry into your Cosmos DB `vehicle-telemetry` container, so you have data to work with. In this task, you will use the `PipelineEnhancer` console app to set up an Azure Cognitive Search index that points to the `vehicle-telemetry` container in Cosmos DB.
- Return to your Azure Cosmos DB account blade in the Azure portal, and select Data Explorer from the toolbar on the overview blade.

  IMPORTANT: There may be two Cosmos DB accounts in your resource group. Select the Cosmos DB account named tech-immersionXXXXX, with no hyphen between immersion and XXXXX.
- Under the `ContosoAuto` database, expand the `vehicle-telemetry` container and then select Items.
- In the Items pane, select any of the documents listed and inspect it. Each vehicle-telemetry document should look similar to the JSON below.

  ```json
  {
      "vin": "O62T8AMDH0XS7LW96",
      "city": "Madison",
      "region": null,
      "outsideTemperature": 42,
      "engineTemperature": 318,
      "speed": 69,
      "fuel": 6,
      "engineoil": 39,
      "tirepressure": 10,
      "odometer": 164256,
      "accelerator_pedal_position": 93,
      "parking_brake_status": true,
      "brake_pedal_status": true,
      "headlamp_status": true,
      "transmission_gear_position": "first",
      "ignition_status": true,
      "windshield_wiper_status": false,
      "abs": true,
      "timestamp": "2019-06-16T15:53:59.5555928Z",
      "collectionType": "Telemetry",
      "id": "0cdaa1f1-e46f-4a3b-9ce1-bb52e51039a4",
      "_rid": "e4txAIjuuI2BhB4AAAAAAA==",
      "_self": "dbs/e4txAA==/colls/e4txAIjuuI0=/docs/e4txAIjuuI2BhB4AAAAAAA==/",
      "_etag": "\"04006220-0000-0800-0000-5d0666180000\"",
      "_attachments": "attachments/",
      "_ts": 1560700440
  }
  ```
  The `engineTemperature` field will be used to demonstrate the capabilities of the Anomaly Detector API. The `engineTemperature` value for each document will be compared against time series data of engine temperatures to determine whether the value is outside of the normal range.
- To create the new anomaly detection search index, return to the open `PipelineEnhancer` console app, and enter 4 at the prompt.
- When you see output indicating that the anomaly detection pipeline was successfully created, navigate to your Search Service in the Azure portal, select Indexers, and observe the new telemetry-indexer.

  NOTE: The vehicle-telemetry collection contains a significant number of records, so you may not want to wait for a status of Success. You will be able to search the index for documents that have already been indexed while the status is still In progress.
- Select Search explorer from the toolbar.
- On the Search explorer blade, select Change index, select the telemetry-index, and then select Search.
- In each result, you can see the data pulled in from the Cosmos DB documents. In addition, you will see an `engineTemperatureAnalysis` section, which contains the results of the Anomaly Detector analysis.

  ```json
  {
      "@search.score": 1,
      "vin": "G5SWN08OAKCRZ0TWX",
      "city": "San Diego",
      "outsideTemperature": 91,
      "engineTemperature": 99,
      "speed": 7,
      "fuel": 14,
      "engineoil": 50,
      "tirepressure": 34,
      "odometer": 103943,
      "accelerator_pedal_position": 93,
      "parking_brake_status": false,
      "brake_pedal_status": false,
      "headlamp_status": false,
      "transmission_gear_position": "first",
      "ignition_status": false,
      "windshield_wiper_status": true,
      "abs": true,
      "timestamp": "2019-06-16T15:53:59.405Z",
      "collectionType": "Telemetry",
      "id": "f930fb63-3b0d-44ae-b3a1-49338f6c73cb",
      "rid": "ZTR0eEFJanV1STJCaEI0QUFBQUFCQT090",
      "engineTemperatureAnalysis": {
          "isAnomaly": true,
          "isPositiveAnomaly": false,
          "isNegativeAnomaly": true,
          "expectedValue": 318.1797,
          "upperMargin": 15.9089851,
          "lowerMargin": 15.9089851
      }
  }
  ```
- Enter X at the `PipelineEnhancer` prompt to close the application.
- If the `DataGenerator` app is still running, press CTRL+C to stop generating telemetry data, and then close the application.
Using the Anomaly Detector API, we were able to get information about anomalous data without needing any prior experience with machine learning. The Anomaly Detector API provides two methods of anomaly detection: you can either detect anomalies as a batch throughout your time series, or detect the anomaly status of the latest data point as your data is generated. The detection model returns anomaly results along with each data point's expected value and the upper and lower anomaly detection boundaries. You can use these values to visualize the range of normal values and the anomalies in the data.
In this experience, you learned how to leverage Azure Cognitive Search and Cognitive Services to perform knowledge mining on unstructured data stored in Cosmos DB. Using a combination of pre-configured and custom cognitive skills, you built a cognitive search pipeline to enrich the source data en route to an Azure Cognitive Search index.
Using pre-built cognitive skills, you were able to add language detection, sentiment analysis, and key phrase and entity extraction to your search pipeline. These skills enriched your search index with additional metadata about the tweets being indexed.
You then used an Azure Function App to create a custom cognitive skill, which used the Translator Text Cognitive Service to translate tweets into English. Using the Custom Web API skill, you integrated the custom skill into your cognitive search pipeline.
The end result is rich additional content in an Azure Cognitive Search index, created by a cognitive search indexing pipeline. The output is a full-text searchable index on Azure Cognitive Search.
To continue learning and expand your understanding of Knowledge Mining with Cognitive Search, use the links below.
- Introduction to Azure Cognitive Search
- Introduction to Cognitive Services
- Introduction to Cognitive Search
- Attach a Cognitive Services resource with a skillset in Azure Cognitive Search
- Azure Cognitive Search Service REST API
- Predefined Cognitive Search skills
- Custom Web API skill
- How to add a custom skill to a cognitive search pipeline
- Learn how to call cognitive search APIs
- Learn Cognitive Search
- Enterprise Knowledge Mining Bootcamp
- Azure Cognitive Search pricing
- Anomaly Detector API
- Form Recognizer
- Translator Text API