add cognitive services for big data sample
1 parent aa28f11 · commit 923628f
Showing 1 changed file with 313 additions and 0 deletions.
notebooks/samples/Cognitive Services for Big Data.ipynb
{
  "metadata": {
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": 3
    },
    "orig_nbformat": 2
  },
  "nbformat": 4,
  "nbformat_minor": 2,
  "cells": [
    {
      "source": [
        "## Prerequisites\n",
        "\n",
        "1. Follow the steps in [Getting started](getting-started.md) to set up your Azure Databricks and Cognitive Services environment. This tutorial shows you how to install MMLSpark and how to create your Spark cluster in Databricks.\n",
        "1. After you create a new notebook in Azure Databricks, copy the **Shared code** below and paste it into a new cell in your notebook.\n",
        "1. Choose a service sample below and copy and paste it into a second new cell in your notebook.\n",
        "1. Replace the service subscription key placeholders with your own keys.\n",
        "1. Choose the run button (triangle icon) in the upper right corner of the cell, then select **Run Cell**.\n",
        "1. View the results in a table below the cell."
      ],
      "cell_type": "markdown",
      "metadata": {}
    },
    {
      "source": [
        "## Shared code\n",
        "\n",
        "To get started, we'll need to add this code to the project:"
      ],
      "cell_type": "markdown",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from mmlspark.cognitive import *\n",
        "import os\n",
        "\n",
        "# A general Cognitive Services key for Text Analytics and Computer Vision (or use separate keys that belong to each service)\n",
        "service_key = os.environ[\"TEXT_API_KEY\"]\n",
        "# A Bing Search v7 subscription key\n",
        "bing_search_key = os.environ[\"BING_IMAGE_SEARCH_KEY\"]\n",
        "# An Anomaly Detector subscription key\n",
        "anomaly_key = os.environ[\"ANOMALY_DETECTION_KEY\"]\n",
        "\n",
        "# Validate that the key placeholder was replaced\n",
        "assert service_key != \"ADD_YOUR_SUBSCRIPTION_KEY\""
      ]
    },
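    {
      "source": [
        "Optionally, if you keep your keys in Azure Key Vault, you can read them through a Databricks secret scope instead of environment variables. This is a minimal sketch: the scope name and secret names below are placeholders, not part of the original sample.\n"
      ],
      "cell_type": "markdown",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Optional: read keys from a Databricks secret scope instead of environment variables.\n",
        "# The scope and key names here are placeholders -- replace them with your own.\n",
        "# service_key = dbutils.secrets.get(scope=\"my-scope\", key=\"text-api-key\")\n",
        "# bing_search_key = dbutils.secrets.get(scope=\"my-scope\", key=\"bing-image-search-key\")\n",
        "# anomaly_key = dbutils.secrets.get(scope=\"my-scope\", key=\"anomaly-detection-key\")"
      ]
    },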
    {
      "source": [
        "## Text Analytics sample\n",
        "\n",
        "The [Text Analytics](../text-analytics/index.yml) service provides several algorithms for extracting intelligent insights from text. For example, we can find the sentiment of given input text. The service returns a score between 0.0 and 1.0, where low scores indicate negative sentiment and high scores indicate positive sentiment. This sample uses three simple sentences and returns the sentiment for each."
      ],
      "cell_type": "markdown",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from pyspark.sql.functions import col\n",
        "\n",
        "# Create a dataframe that's tied to its column names\n",
        "df = spark.createDataFrame([\n",
        "  (\"I am so happy today, its sunny!\", \"en-US\"),\n",
        "  (\"I am frustrated by this rush hour traffic\", \"en-US\"),\n",
        "  (\"The cognitive services on spark aint bad\", \"en-US\"),\n",
        "], [\"text\", \"language\"])\n",
        "\n",
        "# Run the Text Analytics service with options\n",
        "sentiment = (TextSentiment()\n",
        "  .setTextCol(\"text\")\n",
        "  .setLocation(\"eastus\")\n",
        "  .setSubscriptionKey(service_key)\n",
        "  .setOutputCol(\"sentiment\")\n",
        "  .setErrorCol(\"error\")\n",
        "  .setLanguageCol(\"language\"))\n",
        "\n",
        "# Show the results of your text query in a table format\n",
        "display(sentiment.transform(df).select(\"text\", col(\"sentiment\")[0].getItem(\"sentiment\").alias(\"sentiment\")))"
      ]
    },
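    {
      "source": [
        "Because the transformer writes per-row failures to the `error` column configured above, you can surface any rows the service rejected. A minimal sketch, reusing the `sentiment` and `df` variables from the previous cell:\n"
      ],
      "cell_type": "markdown",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Surface failed rows; the error column is null for rows the service handled\n",
        "results = sentiment.transform(df)\n",
        "display(results.where(col(\"error\").isNotNull()).select(\"text\", \"error\"))"
      ]
    },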
    {
      "source": [
        "## Computer Vision sample\n",
        "\n",
        "[Computer Vision](../computer-vision/index.yml) analyzes images to identify structure such as faces, objects, and natural-language descriptions. In this sample, we tag a list of images. Tags are one-word descriptions of things in the image, such as recognizable objects, people, scenery, and actions."
      ],
      "cell_type": "markdown",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Create a dataframe with the image URLs\n",
        "df = spark.createDataFrame([\n",
        "  (\"https://raw.githubusercontent.com/Azure-Samples/cognitive-services-sample-data-files/master/ComputerVision/Images/objects.jpg\", ),\n",
        "  (\"https://raw.githubusercontent.com/Azure-Samples/cognitive-services-sample-data-files/master/ComputerVision/Images/dog.jpg\", ),\n",
        "  (\"https://raw.githubusercontent.com/Azure-Samples/cognitive-services-sample-data-files/master/ComputerVision/Images/house.jpg\", )\n",
        "], [\"image\", ])\n",
        "\n",
        "# Run the Computer Vision service. AnalyzeImage extracts information about the content of the images.\n",
        "analysis = (AnalyzeImage()\n",
        "  .setLocation(\"eastus\")\n",
        "  .setSubscriptionKey(service_key)\n",
        "  .setVisualFeatures([\"Categories\", \"Color\", \"Description\", \"Faces\", \"Objects\", \"Tags\"])\n",
        "  .setOutputCol(\"analysis_results\")\n",
        "  .setImageUrlCol(\"image\")\n",
        "  .setErrorCol(\"error\"))\n",
        "\n",
        "# Show the tags the service extracted from each image\n",
        "display(analysis.transform(df).select(\"image\", \"analysis_results.description.tags\"))"
      ]
    },
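    {
      "source": [
        "Tags are only one part of the response. As a further sketch (assuming the output schema mirrors the Computer Vision API, where `description.captions` is an array of structs with `text` and `confidence` fields), you can also pull the generated caption for each image:\n"
      ],
      "cell_type": "markdown",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from pyspark.sql.functions import col\n",
        "\n",
        "# Pull the top caption per image (assumes description.captions mirrors the Computer Vision API)\n",
        "display(analysis.transform(df)\n",
        "  .select(\"image\", col(\"analysis_results.description.captions\")[0].getItem(\"text\").alias(\"caption\")))"
      ]
    },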
    {
      "source": [
        "## Bing Image Search sample\n",
        "\n",
        "[Bing Image Search](../bing-image-search/overview.md) searches the web to retrieve images related to a user's natural-language query. In this sample, we use a text query that looks for images of quotes. The service returns a list of image URLs that contain photos related to our query."
      ],
      "cell_type": "markdown",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from pyspark.ml import PipelineModel\n",
        "\n",
        "# Number of images Bing will return per query\n",
        "imgsPerBatch = 10\n",
        "# A list of offsets, used to page into the search results\n",
        "offsets = [(i*imgsPerBatch,) for i in range(100)]\n",
        "# Since web content is our data, the input dataframe holds only the query options: one row per offset\n",
        "bingParameters = spark.createDataFrame(offsets, [\"offset\"])\n",
        "\n",
        "# Run the Bing Image Search service with our text query\n",
        "bingSearch = (BingImageSearch()\n",
        "  .setSubscriptionKey(bing_search_key)\n",
        "  .setOffsetCol(\"offset\")\n",
        "  .setQuery(\"Martin Luther King Jr. quotes\")\n",
        "  .setCount(imgsPerBatch)\n",
        "  .setOutputCol(\"images\"))\n",
        "\n",
        "# Transformer that extracts and flattens the richly structured output of Bing Image Search into a simple URL column\n",
        "getUrls = BingImageSearch.getUrlTransformer(\"images\", \"url\")\n",
        "\n",
        "# This displays the full results returned; uncomment to use\n",
        "# display(bingSearch.transform(bingParameters))\n",
        "\n",
        "# Since we have two stages, we chain them into a pipeline\n",
        "pipeline = PipelineModel(stages=[bingSearch, getUrls])\n",
        "\n",
        "# Show the results of your search: image URLs\n",
        "display(pipeline.transform(bingParameters))"
      ]
    },
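    {
      "source": [
        "Paging over many offsets can surface the same image more than once. A minimal sketch for keeping just the distinct URLs, reusing the `pipeline` and `bingParameters` variables above:\n"
      ],
      "cell_type": "markdown",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Deduplicate the flattened URL column before any downstream processing\n",
        "urls = pipeline.transform(bingParameters).select(\"url\").distinct()\n",
        "display(urls)"
      ]
    },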
    {
      "source": [
        "## Speech-to-Text sample\n",
        "The [Speech-to-text](../speech-service/index-speech-to-text.yml) service converts streams or files of spoken audio to text. In this sample, we transcribe one audio file."
      ],
      "cell_type": "markdown",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Create a dataframe with our audio URLs, tied to the column called \"url\"\n",
        "df = spark.createDataFrame([(\"https://mmlspark.blob.core.windows.net/datasets/Speech/audio2.wav\",)\n",
        "], [\"url\"])\n",
        "\n",
        "# Run the Speech-to-text service to transcribe the audio into text\n",
        "speech_to_text = (SpeechToTextSDK()\n",
        "  .setSubscriptionKey(service_key)\n",
        "  .setLocation(\"eastus\")\n",
        "  .setOutputCol(\"text\")\n",
        "  .setAudioDataCol(\"url\")\n",
        "  .setLanguage(\"en-US\")\n",
        "  .setProfanity(\"Masked\"))\n",
        "\n",
        "# Show the results of the transcription\n",
        "display(speech_to_text.transform(df).select(\"url\", \"text.DisplayText\"))"
      ]
    },
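    {
      "source": [
        "To transcribe several recordings at once, add one row per audio URL. A minimal sketch; the second URL below is a hypothetical placeholder for a .wav file of your own:\n"
      ],
      "cell_type": "markdown",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# One row per audio file; the second URL is a hypothetical placeholder\n",
        "df_many = spark.createDataFrame([\n",
        "  (\"https://mmlspark.blob.core.windows.net/datasets/Speech/audio2.wav\",),\n",
        "  (\"https://example.com/your-own-recording.wav\",)\n",
        "], [\"url\"])\n",
        "\n",
        "display(speech_to_text.transform(df_many).select(\"url\", \"text.DisplayText\"))"
      ]
    },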
    {
      "source": [
        "## Anomaly Detector sample\n",
        "\n",
        "[Anomaly Detector](../anomaly-detector/index.yml) is great for detecting irregularities in your time series data. In this sample, we use the service to find anomalies in the entire time series."
      ],
      "cell_type": "markdown",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from pyspark.sql.functions import lit\n",
        "\n",
        "# Create a dataframe with the point data that Anomaly Detector requires\n",
        "df = spark.createDataFrame([\n",
        "  (\"1972-01-01T00:00:00Z\", 826.0),\n",
        "  (\"1972-02-01T00:00:00Z\", 799.0),\n",
        "  (\"1972-03-01T00:00:00Z\", 890.0),\n",
        "  (\"1972-04-01T00:00:00Z\", 900.0),\n",
        "  (\"1972-05-01T00:00:00Z\", 766.0),\n",
        "  (\"1972-06-01T00:00:00Z\", 805.0),\n",
        "  (\"1972-07-01T00:00:00Z\", 821.0),\n",
        "  (\"1972-08-01T00:00:00Z\", 20000.0),\n",
        "  (\"1972-09-01T00:00:00Z\", 883.0),\n",
        "  (\"1972-10-01T00:00:00Z\", 898.0),\n",
        "  (\"1972-11-01T00:00:00Z\", 957.0),\n",
        "  (\"1972-12-01T00:00:00Z\", 924.0),\n",
        "  (\"1973-01-01T00:00:00Z\", 881.0),\n",
        "  (\"1973-02-01T00:00:00Z\", 837.0),\n",
        "  (\"1973-03-01T00:00:00Z\", 9000.0)\n",
        "], [\"timestamp\", \"value\"]).withColumn(\"group\", lit(\"series1\"))\n",
        "\n",
        "# Run the Anomaly Detector service to look for irregular data\n",
        "anomaly_detector = (SimpleDetectAnomalies()\n",
        "  .setSubscriptionKey(anomaly_key)\n",
        "  .setLocation(\"eastus\")\n",
        "  .setTimestampCol(\"timestamp\")\n",
        "  .setValueCol(\"value\")\n",
        "  .setOutputCol(\"anomalies\")\n",
        "  .setGroupbyCol(\"group\")\n",
        "  .setGranularity(\"monthly\"))\n",
        "\n",
        "# Show the full results of the analysis, with the anomalies marked as \"True\"\n",
        "display(anomaly_detector.transform(df).select(\"timestamp\", \"value\", \"anomalies.isAnomaly\"))"
      ]
    },
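    {
      "source": [
        "To list just the flagged points, filter on the boolean that the service returns. A minimal sketch reusing the `anomaly_detector` and `df` variables above:\n"
      ],
      "cell_type": "markdown",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from pyspark.sql.functions import col\n",
        "\n",
        "# Keep only the rows the service flagged as anomalous\n",
        "results = anomaly_detector.transform(df)\n",
        "display(results.where(col(\"anomalies.isAnomaly\")).select(\"timestamp\", \"value\"))"
      ]
    },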
    {
      "source": [
        "## Arbitrary web APIs\n",
        "\n",
        "With HTTP on Spark, any web service can be used in your big data pipeline. In this example, we use the [World Bank API](http://api.worldbank.org/v2/country/) to get information about various countries around the world."
      ],
      "cell_type": "markdown",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from requests import Request\n",
        "from mmlspark.io.http import HTTPTransformer, http_udf\n",
        "from pyspark.sql.functions import udf, col\n",
        "\n",
        "# Build any request with the Python requests library\n",
        "def world_bank_request(country):\n",
        "    return Request(\"GET\", \"http://api.worldbank.org/v2/country/{}?format=json\".format(country))\n",
        "\n",
        "# Create a dataframe specifying which countries we want data on\n",
        "df = (spark.createDataFrame([(\"br\",), (\"usa\",)], [\"country\"])\n",
        "  .withColumn(\"request\", http_udf(world_bank_request)(col(\"country\"))))\n",
        "\n",
        "# Much faster for big data because of the concurrency :)\n",
        "client = (HTTPTransformer()\n",
        "  .setConcurrency(3)\n",
        "  .setInputCol(\"request\")\n",
        "  .setOutputCol(\"response\"))\n",
        "\n",
        "# Get the body of the response\n",
        "def get_response_body(resp):\n",
        "    return resp.entity.content.decode()\n",
        "\n",
        "# Show the details of the country data returned\n",
        "display(client.transform(df).select(\"country\", udf(get_response_body)(col(\"response\")).alias(\"response\")))"
      ]
    },
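    {
      "source": [
        "The response body is a raw JSON string. As a final sketch, you can parse it with a UDF; the indexing below assumes the World Bank API's usual `[metadata, records]` response shape, which is worth verifying against the live API:\n"
      ],
      "cell_type": "markdown",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import json\n",
        "\n",
        "# Parse the JSON body and pull out the country name\n",
        "# (assumes the response is a [metadata, [records]] pair, as the World Bank API usually returns)\n",
        "def get_country_name(resp):\n",
        "    body = json.loads(resp.entity.content.decode())\n",
        "    return body[1][0][\"name\"]\n",
        "\n",
        "display(client.transform(df).select(\"country\", udf(get_country_name)(col(\"response\")).alias(\"name\")))"
      ]
    },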
    {
      "source": [
        "## See also\n",
        "\n",
        "* [Recipe: Anomaly Detection](./recipes/anomaly-detection.md)\n",
        "* [Recipe: Art Explorer](./recipes/art-explorer.md)"
      ],
      "cell_type": "markdown",
      "metadata": {}
    }
  ]
}