diff --git a/notebooks/samples/Cognitive Services - Overview.ipynb b/notebooks/samples/Cognitive Services - Overview.ipynb index bc90555cb6..3a586efe84 100644 --- a/notebooks/samples/Cognitive Services - Overview.ipynb +++ b/notebooks/samples/Cognitive Services - Overview.ipynb @@ -348,9 +348,9 @@ }, { "source": [ - "## Azure Cognitive search - Creating a searchable Art Database with The MET's open-access collection sample\n", + "## Azure Cognitive Search sample\n", "\n", - "In this example, we show how you can enrich data using Cognitive Skills and write to an Azure Search Index using MMLSpark. We use a subset of The MET's open-access collection and enrich it by passing it through 'Describe Image' and a custom 'Image Similarity' skill. The results are then written to a searchable index." + "In this example, we show how you can enrich data using Cognitive Skills and write to an Azure Search Index using MMLSpark." ], "cell_type": "markdown", "metadata": {} @@ -361,115 +361,37 @@ "metadata": {}, "outputs": [], "source": [ - "import os, sys, time, json, requests\n", - "from pyspark.ml import Transformer, Estimator, Pipeline\n", - "from pyspark.ml.feature import SQLTransformer\n", - "from pyspark.sql.functions import lit, udf, col, split" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ + "# import os, sys, time, json, requests\n", + "# from pyspark.ml import Transformer, Estimator, Pipeline\n", + "# from pyspark.ml.feature import SQLTransformer\n", + "# from pyspark.sql.functions import lit, udf, col, split\n", + "from mmlspark.cognitive import *\n", + "\n", "VISION_API_KEY = os.environ['VISION_API_KEY']\n", "AZURE_SEARCH_KEY = os.environ['AZURE_SEARCH_KEY']\n", "search_service = \"mmlspark-azure-search\"\n", - "search_index = \"test\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "data = spark.read\\\n", - " .format(\"csv\")\\\n", - " 
.option(\"header\", True)\\\n", - " .load(\"wasbs://publicwasb@mmlspark.blob.core.windows.net/metartworks_sample.csv\")\\\n", - " .withColumn(\"searchAction\", lit(\"upload\"))\\\n", - " .withColumn(\"Neighbors\", split(col(\"Neighbors\"), \",\").cast(\"array\"))\\\n", - " .withColumn(\"Tags\", split(col(\"Tags\"), \",\").cast(\"array\"))\\\n", - " .limit(25)" - ] - }, - { - "source": [ - "" - ], - "cell_type": "markdown", - "metadata": {} - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from mmlspark.cognitive import AnalyzeImage\n", - "from mmlspark.stages import SelectColumns\n", - "\n", - "#define pipeline\n", - "describeImage = (AnalyzeImage()\n", - " .setSubscriptionKey(VISION_API_KEY)\n", - " .setLocation(\"eastus\")\n", - " .setImageUrlCol(\"PrimaryImageUrl\")\n", - " .setOutputCol(\"RawImageDescription\")\n", - " .setErrorCol(\"Errors\")\n", - " .setVisualFeatures([\"Categories\", \"Tags\", \"Description\", \"Faces\", \"ImageType\", \"Color\", \"Adult\"])\n", - " .setConcurrency(5))\n", - "\n", - "df2 = describeImage.transform(data)\\\n", - " .select(\"*\", \"RawImageDescription.*\").drop(\"Errors\", \"RawImageDescription\")" - ] - }, - { - "source": [ - "" - ], - "cell_type": "markdown", - "metadata": {} - }, - { - "source": [ - "Before writing the results to a Search Index, you must define a schema which must specify the name, type, and attributes of each field in your index. Refer [Create a basic index in Azure Search](https://docs.microsoft.com/en-us/azure/search/search-what-is-an-index) for more information." 
- ], - "cell_type": "markdown", - "metadata": {} - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from mmlspark.cognitive import *\n", - "df2.writeToAzureSearch(\n", - " subscriptionKey=AZURE_SEARCH_KEY,\n", + "search_index = \"test-33467690\"\n", + "\n", + "df = spark.createDataFrame([(\"upload\", \"0\", \"https://mmlspark.blob.core.windows.net/datasets/DSIR/test1.jpg\"), \n", + " (\"upload\", \"1\", \"https://mmlspark.blob.core.windows.net/datasets/DSIR/test2.jpg\")], \n", + " [\"searchAction\", \"id\", \"url\"])\n", + "\n", + "tdf = AnalyzeImage()\\\n", + " .setSubscriptionKey(VISION_API_KEY)\\\n", + " .setLocation(\"eastus\")\\\n", + " .setImageUrlCol(\"url\")\\\n", + " .setOutputCol(\"analyzed\")\\\n", + " .setErrorCol(\"errors\")\\\n", + " .setVisualFeatures([\"Categories\", \"Tags\", \"Description\", \"Faces\", \"ImageType\", \"Color\", \"Adult\"])\\\n", + " .transform(df)\\\n", + " .select(\"*\", \"analyzed.*\")\\\n", + " .drop(\"errors\", \"analyzed\")\n", + "\n", + "tdf.writeToAzureSearch(subscriptionKey=AZURE_SEARCH_KEY,\n", " actionCol=\"searchAction\",\n", " serviceName=search_service,\n", " indexName=search_index,\n", - " keyCol=\"ObjectID\"\n", - ")" - ] - }, - { - "source": [ - "The Search Index can be queried using the [Azure Search REST API](https://docs.microsoft.com/rest/api/searchservice/) by sending GET or POST requests and specifying query parameters that give the criteria for selecting matching documents. 
For more information on querying refer [Query your Azure Search index using the REST API](https://docs.microsoft.com/en-us/rest/api/searchservice/Search-Documents)" - ], - "cell_type": "markdown", - "metadata": {} - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "url = 'https://{}.search.windows.net/indexes/{}/docs/search?api-version=2019-05-06'.format(search_service, search_index)\n", - "requests.post(url, json={\"search\": \"Glass\"}, headers = {\"api-key\": AZURE_SEARCH_KEY}).json()" + " keyCol=\"id\")" ] }, { diff --git a/notebooks/samples/LightGBM - Quantile Regression for Drug Discovery.ipynb b/notebooks/samples/LightGBM - Quantile Regression for Drug Discovery.ipynb deleted file mode 100644 index ceece82376..0000000000 --- a/notebooks/samples/LightGBM - Quantile Regression for Drug Discovery.ipynb +++ /dev/null @@ -1,178 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## LightGBM - Quantile Regression for Drug Discovery\n", - "\n", - "We will demonstrate how to use the LightGBM quantile regressor with\n", - "TrainRegressor and ComputeModelStatistics on the Triazines dataset.\n", - "\n", - "\n", - "This sample demonstrates how to use the following APIs:\n", - "- [`TrainRegressor`\n", - " ](http://mmlspark.azureedge.net/docs/pyspark/TrainRegressor.html)\n", - "- [`LightGBMRegressor`\n", - " ](http://mmlspark.azureedge.net/docs/pyspark/LightGBMRegressor.html)\n", - "- [`ComputeModelStatistics`\n", - " ](http://mmlspark.azureedge.net/docs/pyspark/ComputeModelStatistics.html)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "triazines = spark.read.format(\"libsvm\")\\\n", - " .load(\"wasbs://publicwasb@mmlspark.blob.core.windows.net/triazines.scale.svmlight\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - 
"# print some basic info\n", - "print(\"records read: \" + str(triazines.count()))\n", - "print(\"Schema: \")\n", - "triazines.printSchema()\n", - "triazines.limit(10).toPandas()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Split the dataset into train and test" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "train, test = triazines.randomSplit([0.85, 0.15], seed=1)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Train the quantile regressor on the training data." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from mmlspark.lightgbm import LightGBMRegressor\n", - "model = LightGBMRegressor(objective='quantile',\n", - " alpha=0.2,\n", - " learningRate=0.3,\n", - " numLeaves=31).fit(train)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can save and load LightGBM to a file using the LightGBM native representation" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from mmlspark.lightgbm import LightGBMRegressionModel\n", - "model.saveNativeModel(\"/mymodel\")\n", - "model = LightGBMRegressionModel.loadNativeModelFromFile(\"/mymodel\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "View the feature importances of the trained model." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(model.getFeatureImportances())" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Score the regressor on the test data." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "scoredData = model.transform(test)\n", - "scoredData.limit(10).toPandas()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Compute metrics using ComputeModelStatistics" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from mmlspark.train import ComputeModelStatistics\n", - "metrics = ComputeModelStatistics(evaluationMetric='regression',\n", - " labelCol='label',\n", - " scoresCol='prediction') \\\n", - " .transform(scoredData)\n", - "metrics.toPandas()" - ] - } - ], - "metadata": { - "anaconda-cloud": {}, - "kernelspec": { - "display_name": "Python [default]", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.3" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} \ No newline at end of file diff --git a/notebooks/samples/Vowpal Wabbit - Overview.ipynb b/notebooks/samples/Vowpal Wabbit - Overview.ipynb index 3bfaf98bae..d9180b303f 100644 --- a/notebooks/samples/Vowpal Wabbit - Overview.ipynb +++ b/notebooks/samples/Vowpal Wabbit - Overview.ipynb @@ -466,7 +466,7 @@ ")\n", "vw_train_data = vw_featurizer.transform(train_data)['target', 'features']\n", "vw_test_data = vw_featurizer.transform(test_data)['target', 'features']\n", - "display(vw_train_data.limit(10).toPandas())" + "display(vw_train_data)" ] }, { @@ -493,9 +493,7 @@ ")\n", "\n", "# To reduce number of partitions (which will effect performance), use `vw_train_data.repartition(1)`\n", - "vw_train_data_2 = vw_train_data.repartition(1)\n", - "print(vw_train_data_2.count())\n", - "vw_model = vwr.fit(vw_train_data_2.repartition(1))\n", + "vw_model = 
vwr.fit(vw_train_data.repartition(1))\n", "vw_predictions = vw_model.transform(vw_test_data)\n", "\n", "display(vw_predictions.limit(20).toPandas())" @@ -673,7 +671,7 @@ "metadata": {}, "outputs": [], "source": [ - "data = spark.read.format(\"json\").option(\"inferSchema\", True).load(\"wasbs://publicwasb@mmlspark.blob.core.windows.net/vwcb_input.dsjson\")" + "data = spark.read.format(\"json\").load(\"wasbs://publicwasb@mmlspark.blob.core.windows.net/vwcb_input.dsjson\")" ] }, { @@ -744,7 +742,7 @@ }, { "source": [ - "Buiild VowpalWabbit Contextual Bandit model and compute performance statistics." + "Build VowpalWabbit Contextual Bandit model and compute performance statistics." ], "cell_type": "markdown", "metadata": {} diff --git a/notebooks/samples/Vowpal Wabbit - Quantile Regression for Drug Discovery.ipynb b/notebooks/samples/Vowpal Wabbit - Quantile Regression for Drug Discovery.ipynb deleted file mode 100644 index 4ffafef149..0000000000 --- a/notebooks/samples/Vowpal Wabbit - Quantile Regression for Drug Discovery.ipynb +++ /dev/null @@ -1,146 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Vowpal Wabbit - Quantile Regression for Drug Discovery\n", - "\n", - "We will demonstrate how to use the VowpalWabbit quantile regressor with\n", - "TrainRegressor and ComputeModelStatistics on the Triazines dataset.\n", - "\n", - "\n", - "This sample demonstrates how to use the following APIs:\n", - "- [`TrainRegressor`\n", - " ](http://mmlspark.azureedge.net/docs/pyspark/TrainRegressor.html)\n", - "- [`VowpalWabbitRegressor`\n", - " ](http://mmlspark.azureedge.net/docs/pyspark/VowpalWabbitRegressor.html)\n", - "- [`ComputeModelStatistics`\n", - " ](http://mmlspark.azureedge.net/docs/pyspark/ComputeModelStatistics.html)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "triazines = spark.read.format(\"libsvm\")\\\n", - " 
.load(\"wasbs://publicwasb@mmlspark.blob.core.windows.net/triazines.scale.svmlight\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# print some basic info\n", - "print(\"records read: \" + str(triazines.count()))\n", - "print(\"Schema: \")\n", - "triazines.printSchema()\n", - "triazines.limit(10).toPandas()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Split the dataset into train and test" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "train, test = triazines.randomSplit([0.85, 0.15], seed=1)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Train the quantile regressor on the training data.\n", - "\n", - "Note: have a look at stderr for the task to see VW's output\n", - "\n", - "Full command line argument docs can be found [here](https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Command-Line-Arguments).\n", - "\n", - "Learning rate, numPasses and power_t are exposed to support grid search." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from mmlspark.vw import VowpalWabbitRegressor\n", - "model = (VowpalWabbitRegressor(numPasses=20, args=\"--holdout_off --loss_function quantile -q :: -l 0.1\")\n", - " .fit(train))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Score the regressor on the test data." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "scoredData = model.transform(test)\n", - "scoredData.limit(10).toPandas()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Compute metrics using ComputeModelStatistics" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from mmlspark.train import ComputeModelStatistics\n", - "metrics = ComputeModelStatistics(evaluationMetric='regression',\n", - " labelCol='label',\n", - " scoresCol='prediction') \\\n", - " .transform(scoredData)\n", - "metrics.toPandas()" - ] - } - ], - "metadata": { - "anaconda-cloud": {}, - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.5" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} \ No newline at end of file