Add documentation for EMR #175

Merged (4 commits) on Dec 2, 2020
Binary file added docs/assets/images/emr/api_key.png
Binary file added docs/assets/images/emr/emr_bootstrap_action.png
Binary file added docs/assets/images/emr/emr_config.png
Binary file added docs/assets/images/emr/emr_instance_profile.png
Binary file added docs/assets/images/emr/emr_policy.png
Binary file added docs/assets/images/emr/emr_security_group.png
Binary file added docs/assets/images/emr/emr_vpc_0.png
Binary file added docs/assets/images/emr/emr_vpc_1.png
Binary file added docs/assets/images/emr/secrets_manager.png
48 changes: 23 additions & 25 deletions docs/integrations/databricks/api_key.md
# Hopsworks API key

In order for the Databricks cluster to be able to communicate with the Hopsworks Feature Store, the clients running on Databricks need to be able to access a Hopsworks API key.

## Generate an API key

In Hopsworks, click on your *username* in the top-right corner and select *Settings* to open the user settings. Select *API keys*. Give the key a name and select the job, featurestore and project scopes before creating the key. Copy the key into your clipboard for the next step.

!!! success "Scopes"
    The API key should contain at least the following scopes:

    1. featurestore
    2. project
    3. job

<p align="center">
<figure>
<img src="../../../assets/images/api-key.png" alt="Generating an API Key on Hopsworks">
<figcaption>API-Keys can be generated in the User Settings on Hopsworks</figcaption>
<img src="../../../assets/images/api-key.png" alt="Generating an API key on Hopsworks">
<figcaption>API keys can be created in the User Settings on Hopsworks</figcaption>
</figure>
</p>

!!! info
    You are only able to retrieve the API key once. If you forget to copy it to your clipboard, delete it and create a new one.

## Quickstart API key argument

!!! hint "Save API Key as File"
To get started quickly, without saving the Hopsworks API in a secret storage, you can simply create a file with the previously created Hopsworks API Key and place it on the environment from which you wish to connect to the Hopsworks Feature Store. That is either save it on the Databricks File System (DBFS) or in your Databricks workspace.

You can then connect by simply passing the path to the key file when instantiating a connection:
!!! hint "API key as Argument"
To get started quickly, without saving the Hopsworks API in a secret storage, you can simply supply it as an argument when instantiating a connection:
```python hl_lines="6"
import hsfs
conn = hsfs.connection(
Expand All @@ -41,13 +39,13 @@ In Hopsworks, click on your *username* in the top-right corner and select *Setti
fs = conn.get_feature_store() # Get the project's default feature store
```

## Store the API key

### AWS

#### Option 1: Using the AWS Systems Manager Parameter Store

**Store the API key in the AWS Systems Manager Parameter Store**

In the AWS Management Console, ensure that your active region is the region you use for Databricks.
Go to the *AWS Systems Manager*, choose *Parameter Store* and select *Create Parameter*.
As name, enter `/hopsworks/role/[MY_DATABRICKS_ROLE]/type/api-key`, replacing `[MY_DATABRICKS_ROLE]` with the AWS role used by your Databricks clusters, and store the API key created in the previous step as the parameter's value.
<p align="center">
<figure>
<a href="../../../assets/images/databricks/aws/databricks_parameter_store.png">
<img src="../../../assets/images/databricks/aws/databricks_parameter_store.png" alt="Storing the Feature Store API Key in the Parameter Store">
<img src="../../../assets/images/databricks/aws/databricks_parameter_store.png" alt="Storing the Feature Store API key in the Parameter Store">
</a>
<figcaption>Storing the Feature Store API Key in the Parameter Store</figcaption>
<figcaption>Storing the Feature Store API key in the Parameter Store</figcaption>
</figure>
</p>
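If you prefer to script this step, the same parameter can be created with the AWS SDK. A minimal sketch, assuming boto3 credentials with write access to the Parameter Store; the region and role name are placeholders:

```python
import boto3

# Create the API key parameter programmatically; region and role name are placeholders.
ssm = boto3.client("ssm", region_name="us-east-1")  # use your Databricks region
ssm.put_parameter(
    Name="/hopsworks/role/[MY_DATABRICKS_ROLE]/type/api-key",
    Value="<YOUR_HOPSWORKS_API_KEY>",
    Type="SecureString",  # keep the key encrypted at rest
    Overwrite=True,
)
```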

**Grant the Databricks notebook role access to the secret**

In the AWS Management Console, go to *IAM*, select *Roles* and then the role that is used when creating Databricks clusters.
Select *Add inline policy*. Choose *Systems Manager* as service, expand the *Read* access level and check *GetParameter*.
Click on *Review*, give the policy a name and click on *Create policy*.
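Once the parameter is in place and the notebook role can read it, HSFS can look the key up at connection time. A minimal sketch with placeholder host, project and region values:

```python
import hsfs

conn = hsfs.connection(
    host="my_instance.cloud.hopsworks.ai",  # placeholder DNS of your Feature Store instance
    project="my_project",                   # placeholder project name
    secrets_store="parameterstore",         # fetch the API key from the Parameter Store
    region_name="us-east-1",                # region the parameter was created in
)
fs = conn.get_feature_store()
```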

#### Option 2: Using the AWS Secrets Manager

**Store the API key in the AWS Secrets Manager**

In the AWS Management Console, ensure that your active region is the region you use for Databricks.
Go to the *AWS Secrets Manager* and select *Store new secret*. Select *Other type of secrets* and add *api-key*
as the key and paste the API key created in the previous step as the value. Click *Next*.
<p align="center">
<figure>
<a href="../../../assets/images/databricks/aws/databricks_secrets_manager_step_1.png">
<img src="../../../assets/images/databricks/aws/databricks_secrets_manager_step_1.png" alt="Storing a Feature Store API Key in the Secrets Manager Step 1">
<img src="../../../assets/images/databricks/aws/databricks_secrets_manager_step_1.png" alt="Storing a Feature Store API key in the Secrets Manager Step 1">
</a>
<figcaption>Storing a Feature Store API Key in the Secrets Manager Step 1</figcaption>
<figcaption>Storing a Feature Store API key in the Secrets Manager Step 1</figcaption>
</figure>
</p>

Then click on the secret in the secrets list and take note of the *Secret ARN*.
<p align="center">
<figure>
<a href="../../../assets/images/databricks/aws/databricks_secrets_manager_step_2.png">
<img src="../../../assets/images/databricks/aws/databricks_secrets_manager_step_2.png" alt="Storing a Feature Store API Key in the Secrets Manager Step 2">
<img src="../../../assets/images/databricks/aws/databricks_secrets_manager_step_2.png" alt="Storing a Feature Store API key in the Secrets Manager Step 2">
</a>
<figcaption>Storing a Feature Store API Key in the Secrets Manager Step 2</figcaption>
<figcaption>Storing a Feature Store API key in the Secrets Manager Step 2</figcaption>
</figure>
</p>
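The secret can also be created and its ARN retrieved programmatically. A sketch, assuming boto3 credentials with write access; the secret name below is a placeholder for whatever name you chose above:

```python
import json

import boto3

sm = boto3.client("secretsmanager", region_name="us-east-1")  # use your Databricks region
resp = sm.create_secret(
    Name="hopsworks/role/[MY_DATABRICKS_ROLE]",  # placeholder secret name
    SecretString=json.dumps({"api-key": "<YOUR_HOPSWORKS_API_KEY>"}),
)
print(resp["ARN"])  # the Secret ARN needed for the IAM policy below
```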

**Grant the Databricks notebook role access to the secret**

In the AWS Management Console, go to *IAM*, select *Roles* and then the role that is used when creating Databricks clusters.
Select *Add inline policy*. Choose *Secrets Manager* as service, expand the *Read* access level and check *GetSecretValue*.
Click on *Review*, give the policy a name and click on *Create policy*.
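As with the Parameter Store, HSFS can then resolve the key directly from the Secrets Manager when connecting. A minimal sketch with placeholder names:

```python
import hsfs

conn = hsfs.connection(
    host="my_instance.cloud.hopsworks.ai",  # placeholder DNS of your Feature Store instance
    project="my_project",                   # placeholder project name
    secrets_store="secretsmanager",         # fetch the API key from the AWS Secrets Manager
)
fs = conn.get_feature_store()
```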

### Azure

On Azure we currently do not support storing the API key in a secret storage. Instead, store the API key in a file in your Databricks workspace so you can access it when connecting to the Feature Store.
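A minimal sketch of connecting with such a key file, assuming it was uploaded to DBFS under a placeholder path:

```python
import hsfs

conn = hsfs.connection(
    host="my_instance.cloud.hopsworks.ai",       # placeholder DNS of your Feature Store instance
    project="my_project",                        # placeholder project name
    api_key_file="/dbfs/hopsworks/api_key.txt",  # placeholder path to the file holding the API key
)
fs = conn.get_feature_store()
```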

## Next Steps

24 changes: 12 additions & 12 deletions docs/integrations/databricks/configuration.md
In order to be able to configure a Databricks cluster to use the Feature Store of a Hopsworks instance, a few prerequisites need to be in place:

If you haven't done so already, follow the networking guides for either [AWS](networking.md#aws) or [Azure](networking.md#azure) for instructions on how to configure networking properly between Databricks' VPC (or Virtual Network on Azure) and the Hopsworks.ai VPC/VNet.

### Hopsworks API key

In order for the Feature Store API to be able to communicate with the user's Hopsworks instance, the client library (HSFS) needs to have access to a previously generated API key from Hopsworks. For ways to set up and store the Hopsworks API key, please refer to the [API key guide for Databricks](api_key.md).

## Databricks API key

Hopsworks uses the Databricks REST APIs to communicate with the Databricks instance and configure clusters on behalf of users.
To achieve that, the first step is to register an instance and a valid API key in Hopsworks.

Users can get a valid Databricks API key by following the [Databricks Documentation](https://docs.databricks.com/dev-tools/api/latest/authentication.html#generate-a-personal-access-token).

!!! warning "Cluster access control"

If users have enabled [Databricks Cluster access control](https://docs.databricks.com/security/access-control/cluster-acl.html#cluster-access-control), it is important that the users running the cluster configuration (i.e. the user generating the API Key) has `Can Manage` privileges on the cluster they are trying to configure.
If users have enabled [Databricks Cluster access control](https://docs.databricks.com/security/access-control/cluster-acl.html#cluster-access-control), it is important that the users running the cluster configuration (i.e. the user generating the API key) has `Can Manage` privileges on the cluster they are trying to configure.

## Register a new Databricks Instance

Users can register a new Databricks instance by navigating to the `Integrations` tab of a project Feature Store. Registering a Databricks instance requires adding the instance address and the API key.

The instance address should be in the format `[UUID].cloud.databricks.com` (or `adb-[UUID].19.azuredatabricks.net` for Databricks on Azure), essentially the same web address used to reach the Databricks instance from the browser.

<p align="center">
<figure>
<a href="../../../assets/images/databricks/databricks-integration.png">
<img src="../../../assets/images/databricks/databricks-integration.png" alt="Register a Databricks Instance along with a Databricks API Key">
<img src="../../../assets/images/databricks/databricks-integration.png" alt="Register a Databricks Instance along with a Databricks API key">
</a>
<figcaption>Register a Databricks Instance along with a Databricks API Key</figcaption>
<figcaption>Register a Databricks Instance along with a Databricks API key</figcaption>
</figure>
</p>

The API key will be stored in the Hopsworks secret store for the user and will be available only for that user. If multiple users need to configure Databricks clusters, each has to generate an API key and register an instance. The Databricks instance registration does not have a project scope, meaning that once registered, the user can configure clusters for all projects they are part of.

## Databricks Cluster

When a cluster is configured for a specific project user, all the operations with the Feature Store will be executed as that project user.
At the end of the configuration, Hopsworks will start the cluster.
Once the cluster is running, users can establish a connection to the Hopsworks Feature Store from Databricks:

!!! note "API Key on Azure"
Please note, for Azure it is necessary to store the Hopsworks API Key locally on the cluster as a file. As we currently do not support storing the API Key on an Azure Secret Management Service as we do for AWS. Consult the [API Key guide for Azure](api_key.md#azure), for more information.
!!! note "API key on Azure"
Please note, for Azure it is necessary to store the Hopsworks API key locally on the cluster as a file. As we currently do not support storing the API key on an Azure Secret Management Service as we do for AWS. Consult the [API key guide for Azure](api_key.md#azure), for more information.

=== "AWS"

33 changes: 18 additions & 15 deletions docs/integrations/databricks/networking.md
In order for Spark to communicate with the Feature Store from Databricks, networking needs to be set up correctly.

## AWS

### Step 1: Ensure network connectivity

The DataFrame API needs to be able to connect directly to the IP on which the Feature Store is listening.
This means that if you deploy the Feature Store on AWS, you will either need to deploy the Feature Store in the same VPC as your Databricks
cluster or to set up [VPC Peering](https://docs.databricks.com/administration-guide/cloud-configurations/aws/vpc-peering.html) between your Databricks VPC and the Feature Store VPC.

**Option 1: Deploy the Feature Store in the Databricks VPC**

When you deploy the Feature Store Hopsworks instance, select the Databricks *VPC* and *Availability Zone* as the VPC and Availability Zone of your Feature Store cluster.
Identify your Databricks VPC by searching for VPCs containing Databricks in their name in your Databricks AWS region in the AWS Management Console:

<p align="center">
<figure>
<a href="../../../assets/images/databricks/aws/databricks_vpc.png">
<img src="../../../assets/images/databricks/aws/databricks_vpc.png" alt="Identifying the Databricks VPC">
<img src="../../../assets/images/databricks/aws/databricks_vpc.png" alt="Identify the Databricks VPC">
</a>
<figcaption>Identifying the Databricks VPC</figcaption>
<figcaption>Identify the Databricks VPC</figcaption>
</figure>
</p>

!!! info "Hopsworks installer"
If you are performing an installation using the [Hopsworks installer script](https://hopsworks.readthedocs.io/en/stable/getting_started/installation_guide/platforms/hopsworks-installer.html), ensure that the machines you are going to install Hopsworks on are configured with the respective VPC.
If you are performing an installation using the [Hopsworks installer script](https://hopsworks.readthedocs.io/en/stable/getting_started/installation_guide/platforms/hopsworks-installer.html), ensure that the virtual machines you install Hopsworks on are deployed in the EMR VPC.

!!! info "Hopsworks.ai"
If you are working on **Hopsworks.ai**, you can directly deploy the Hopsworks instance to the Databricks VPC, by simply selecting it at the [VPC selection step during cluster creation](../../hopsworksai/aws/cluster_creation.md#step-6-vpc-selection).

**Option 2: Set up VPC peering**

Follow the guide [VPC Peering](https://docs.databricks.com/administration-guide/cloud-configurations/aws/vpc-peering.html) to set up VPC peering between the Feature Store cluster and Databricks. Get your Feature Store *VPC ID* and *CIDR* by searching for the Feature Store VPC in the AWS Management Console:

!!! info "Hopsworks.ai"
On **Hopsworks.ai**, the VPC is shown in the cluster details.

<p align="center">
<figure>
<a href="../../../assets/images/databricks/aws/hopsworks_vpc.png">
<img src="../../../assets/images/databricks/aws/hopsworks_vpc.png" alt="Identifying the Feature Store VPC">
<img src="../../../assets/images/databricks/aws/hopsworks_vpc.png" alt="Identify the Feature Store VPC">
</a>
<figcaption>Identifying the Feature Store VPC</figcaption>
<figcaption>Identify the Feature Store VPC</figcaption>
</figure>
</p>
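The VPC IDs and CIDRs can also be listed with the AWS SDK rather than the console. A sketch, assuming boto3 credentials with read access to EC2:

```python
import boto3

# List VPC IDs, CIDRs and name tags to identify the Databricks and Feature Store VPCs.
ec2 = boto3.client("ec2", region_name="us-east-1")  # use your Databricks region
for vpc in ec2.describe_vpcs()["Vpcs"]:
    name = next((t["Value"] for t in vpc.get("Tags", []) if t["Key"] == "Name"), "")
    print(vpc["VpcId"], vpc["CidrBlock"], name)
```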

### Step 2: Configure the Security Group

The Feature Store *Security Group* needs to be configured to allow traffic from your Databricks clusters so that they can connect to the Feature Store.

!!! note "Hopsworks.ai"
If you deployed your Hopsworks Feature Store instance with Hopsworks.ai, it suffices to enable [outside access of the Feature Store and Online Feature Store services](../../hopsworksai/aws/getting_started/#step-5-outside-access-to-the-feature-store).
If you deployed your Hopsworks Feature Store with Hopsworks.ai, you only need to enable [outside access of the Feature Store services](../../../hopsworksai/aws/getting_started/#step-5-outside-access-to-the-feature-store).

Open your Feature Store instance under EC2 in the AWS Management Console and ensure that ports *443*, *9083*, *9085*, *8020* and *50010* are reachable from the Databricks Security Group.
Connectivity from the Databricks Security Group can be allowed by opening the Security Group of the Feature Store instance accordingly.
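A sketch of opening those ports programmatically, assuming placeholder Security Group IDs for the Feature Store and Databricks:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # use your Databricks region
for port in (443, 9083, 9085, 8020, 50010):
    ec2.authorize_security_group_ingress(
        GroupId="sg-0feature0store0",  # placeholder Feature Store Security Group ID
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            # Allow ingress from the Databricks Security Group (placeholder ID).
            "UserIdGroupPairs": [{"GroupId": "sg-0databricks0"}],
        }],
    )
```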

## Azure

### Step 1: Set up VNet peering between Hopsworks and Databricks

VNet peering between the Hopsworks and the Databricks virtual networks is required to be able to connect to the Feature Store from Databricks.
Wait for the peering to show up as *Connected*. There should now be bi-directional connectivity between the two virtual networks.

### Step 2: Configure the Network Security Group

The *Network Security Group* of the Feature Store on Azure needs to be configured to allow traffic from your Databricks clusters so that they can connect to the Feature Store.

Ensure that ports *443*, *9083*, *9085*, *8020* and *50010* are reachable from the Databricks Network Security Group.
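A sketch of an equivalent inbound rule created with the Azure SDK for Python, assuming the `azure-mgmt-network` package and placeholder resource names; the Databricks VNet CIDR is also a placeholder:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")
poller = client.security_rules.begin_create_or_update(
    "my-resource-group",             # placeholder resource group of the Feature Store NSG
    "my-feature-store-nsg",          # placeholder Network Security Group name
    "allow-databricks-to-hopsworks",
    {
        "protocol": "Tcp",
        "access": "Allow",
        "direction": "Inbound",
        "priority": 200,
        "source_address_prefix": "10.0.0.0/16",  # placeholder Databricks VNet CIDR
        "source_port_range": "*",
        "destination_address_prefix": "*",
        "destination_port_ranges": ["443", "9083", "9085", "8020", "50010"],
    },
)
poller.result()
```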

## Next Steps

Continue with the [Hopsworks API key guide](api_key.md) to set up access to a Hopsworks API key from the Databricks cluster, in order to be able to use the Hopsworks Feature Store.