- Please bring a wireless-enabled laptop.
- Download the free Postman app for API development.
- Make sure your machine has an ssh client with port-forwarding capability. On Mac or Linux, simply run the ssh command in a terminal window. On Windows, download plink.exe from here. Alternatively, see this page for details on the Windows shell options.
- Provision a Linux CentOS Data Science VM (DSVM) in the Azure Portal following these instructions.
- Make sure to provision the Standard DS12_V2 VM type.
- IMPORTANT: For the VM user name please use remoteuser!
We will provide Azure Data Science Virtual Machines (running Spark 2.0.2) for attendees to use during the tutorial. You will use your laptop to connect to your allocated virtual machine.
-
Connect to your DSVM
- Linux, Mac, or Windows Linux Shell: Command line to connect using ssh. Replace XXX with the public IP address of your Data Science Virtual Machine (e.g. remoteuser@13.64.107.209):
ssh -L localhost:8787:localhost:8787 remoteuser@XXX
- Windows: Command line to connect with plink.exe. Run the following commands in a Windows command prompt window, replacing XXX with the public IP address of your Data Science Virtual Machine (e.g. remoteuser@13.64.107.209):
cd directory-containing-plink.exe
.\plink.exe -L localhost:8787:localhost:8787 remoteuser@XXX
See this page for details on the Windows shell options. These commands create an SSH tunnel that forwards port 8787 on your laptop to port 8787 on the VM, which is the port RStudio Server listens on.
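The tunnel command can be wrapped in a small helper so that the public IP address is the only value you need to fill in. This is a minimal sketch, not part of the course material: the function name `dsvm_tunnel_cmd` is hypothetical, the user name and port default to the values used in this tutorial, and the function only prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of the course repository):
# print the ssh tunnel command for a given DSVM public IP.
dsvm_tunnel_cmd() {
    ip="$1"
    user="${2:-remoteuser}"   # tutorial default user name
    port="${3:-8787}"         # RStudio Server port used in this tutorial
    if [ -z "$ip" ]; then
        echo "usage: dsvm_tunnel_cmd PUBLIC_IP [USER] [PORT]" >&2
        return 1
    fi
    # -L forwards localhost:PORT on the laptop to localhost:PORT on the VM.
    echo "ssh -L localhost:${port}:localhost:${port} ${user}@${ip}"
}
```

For example, `dsvm_tunnel_cmd 13.64.107.209` prints `ssh -L localhost:8787:localhost:8787 remoteuser@13.64.107.209`, which you can then copy into your terminal.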
-
Once you are connected, become the root user on the VM. In the SSH session, use the following command:
sudo su -
-
Download the course material from the Git repository using the following command:
git clone https://github.com/vapaunic/mlads2017s-mrsdeploy.git
-
Change the permissions on the custom script file and run the script. Use the following commands.
cd mlads2017s-mrsdeploy
chmod +x DSVM_Customization_Script.sh
dos2unix ./DSVM_Customization_Script.sh
./DSVM_Customization_Script.sh
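If you want to guard the preparation steps against a missing file or a missing `dos2unix` binary, they can be sketched as a small function. This is an illustrative sketch only: the name `prepare_script` is hypothetical, and the `tr` fallback is an assumption about what to do when `dos2unix` is not installed (it deletes every carriage return in the file, which is fine for a shell script).

```shell
#!/bin/sh
# Hypothetical helper (not part of the course repository):
# strip Windows line endings from a script and mark it executable.
prepare_script() {
    script="$1"
    if [ ! -f "$script" ]; then
        echo "not found: $script" >&2
        return 1
    fi
    if command -v dos2unix >/dev/null 2>&1; then
        # Convert CRLF line endings in place.
        dos2unix "$script" >/dev/null 2>&1 || return 1
    else
        # Fallback when dos2unix is unavailable: delete all carriage returns.
        tmp="${script}.tmp.$$"
        tr -d '\r' < "$script" > "$tmp" && mv "$tmp" "$script" || return 1
    fi
    chmod +x "$script"
}
```

From inside the `mlads2017s-mrsdeploy` directory, usage would be `prepare_script DSVM_Customization_Script.sh && ./DSVM_Customization_Script.sh`.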
-
After connecting with the commands above, open a web browser and go to the following URL to access RStudio Server. You will be prompted to sign in with your credentials.
http://localhost:8787/
Microsoft R Server general information: https://msdn.microsoft.com/en-us/microsoft-r/rserver. Microsoft R Server is installed on both Azure Linux DSVMs and HDInsight clusters (see below), and will be used to run R code in the tutorial.
Microsoft R Server operationalization service general information: https://msdn.microsoft.com/en-us/microsoft-r/operationalize/about
Configuring operationalization: https://msdn.microsoft.com/en-us/microsoft-r/operationalize/configuration-initial
SparkR general information: http://spark.apache.org/docs/latest/sparkr.html
SparkR 2.0.2 functions: https://spark.apache.org/docs/2.0.2/api/R/index.html
RevoScaleR functions: https://msdn.microsoft.com/en-us/microsoft-r/scaler/scaler
Information on Linux DSVM: https://azuremarketplace.microsoft.com/en-us/marketplace/apps/microsoft-ads.linux-data-science-vm
The Linux DSVM has Spark (2.0.2) installed, along with YARN for job management and HDFS. You can therefore use the DSVM to run regular R code as well as code that runs on Spark (e.g. using the SparkR package). You will use the DSVM as a single-node Spark machine for the hands-on exercises. We will provision these machines and assign them to you at the beginning of the tutorial.