Expand Databricks Integration guide (#162)
* complete integration guide databricks

* fix links
moritzmeister authored Nov 19, 2020
1 parent 1e2b5c1 commit b37a313
Showing 34 changed files with 424 additions and 29 deletions.
117 changes: 117 additions & 0 deletions docs/integrations/databricks/api_key.md
@@ -0,0 +1,117 @@
# Hopsworks API Key

In order for the Databricks cluster to communicate with the Hopsworks Feature Store, the clients running on Databricks need access to a Hopsworks API Key.

## Generating an API Key

In Hopsworks, click on your *username* in the top-right corner and select *Settings* to open the user settings. Select *Api keys*. Give the key a name and select the *job*, *featurestore* and *project* scopes before creating the key. Copy the key to your clipboard for the next step.

!!! success "Scopes"
    The created API Key should at least have the following scopes:

    1. featurestore
    2. project
    3. job

<p align="center">
<figure>
<img src="../../../assets/images/api-key.png" alt="Generating an API Key on Hopsworks">
<figcaption>API-Keys can be generated in the User Settings on Hopsworks</figcaption>
</figure>
</p>

!!! info
    You are only able to retrieve the API Key once. If you did not copy it to your clipboard, delete it and create a new one.


## Storing the API Key

### AWS

#### Option 1: Using the AWS Systems Manager Parameter Store

**Storing the API Key in the AWS Systems Manager Parameter Store**

In the AWS Management Console, ensure that your active region is the region you use for Databricks.
Go to the *AWS Systems Manager*, choose *Parameter Store* and select *Create Parameter*.
As the name, enter `/hopsworks/role/[MY_DATABRICKS_ROLE]/type/api-key`, replacing `[MY_DATABRICKS_ROLE]` with the AWS role used by the Databricks cluster that should access the Feature Store. Select *SecureString* as the type, paste the API key created in the previous step as the value, and create the parameter.

<p align="center">
<figure>
<a href="../../../assets/images/databricks/aws/databricks_parameter_store.png">
<img src="../../../assets/images/databricks/aws/databricks_parameter_store.png" alt="Storing the Feature Store API Key in the Parameter Store">
</a>
<figcaption>Storing the Feature Store API Key in the Parameter Store</figcaption>
</figure>
</p>
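For reference, the same parameter can also be created programmatically. A minimal sketch using boto3 (the AWS SDK for Python); the region, role name and key value below are placeholders:

```python
import boto3

# Assumes AWS credentials are configured; use the region of your Databricks deployment.
ssm = boto3.client("ssm", region_name="eu-west-1")

ssm.put_parameter(
    Name="/hopsworks/role/MY_DATABRICKS_ROLE/type/api-key",  # replace MY_DATABRICKS_ROLE
    Value="<YOUR_HOPSWORKS_API_KEY>",
    Type="SecureString",
)
```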

**Granting access to the secret to the Databricks notebook role**

In the AWS Management Console, go to *IAM*, select *Roles* and then the role that is used when creating Databricks clusters.
Select *Add inline policy*. Choose *Systems Manager* as the service, expand the *Read* access level and check *GetParameter*.
Expand *Resources* and select *Add ARN*.
Enter the region of the *Systems Manager* as well as the name of the parameter **WITHOUT the leading slash**, e.g. *hopsworks/role/[MY_DATABRICKS_ROLE]/type/api-key*, and click *Add*.
Click on *Review*, give the policy a name and click on *Create policy*.

<p align="center">
<figure>
<a href="../../../assets/images/databricks/aws/databricks_parameter_store_policy.png">
<img src="../../../assets/images/databricks/aws/databricks_parameter_store_policy.png" alt="Configuring the access policy for the Parameter Store">
</a>
<figcaption>Configuring the access policy for the Parameter Store</figcaption>
</figure>
</p>
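The equivalent inline policy can also be attached with boto3; a sketch, assuming the account ID, region, role and policy names below are replaced with your own:

```python
import json
import boto3

iam = boto3.client("iam")

# Note: the resource ARN contains the parameter name WITHOUT a leading slash after "parameter".
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "ssm:GetParameter",
        "Resource": "arn:aws:ssm:eu-west-1:123456789012:parameter/hopsworks/role/MY_DATABRICKS_ROLE/type/api-key",
    }],
}

iam.put_role_policy(
    RoleName="MY_DATABRICKS_ROLE",
    PolicyName="HopsworksApiKeyRead",   # hypothetical policy name
    PolicyDocument=json.dumps(policy),
)
```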

#### Option 2: Using the AWS Secrets Manager

**Storing the API Key in the AWS Secrets Manager**

In the AWS Management Console, ensure that your active region is the region you use for Databricks.
Go to the *AWS Secrets Manager* and select *Store new secret*. Select *Other type of secrets*, add *api-key*
as the key and paste the API key created in the previous step as the value. Click *Next*.

<p align="center">
<figure>
<a href="../../../assets/images/databricks/aws/databricks_secrets_manager_step_1.png">
<img src="../../../assets/images/databricks/aws/databricks_secrets_manager_step_1.png" alt="Storing a Feature Store API Key in the Secrets Manager Step 1">
</a>
<figcaption>Storing a Feature Store API Key in the Secrets Manager Step 1</figcaption>
</figure>
</p>

As the secret name, enter *hopsworks/role/[MY_DATABRICKS_ROLE]*, replacing `[MY_DATABRICKS_ROLE]` with the AWS role used
by the Databricks instance that should access the Feature Store. Click *Next* twice and finally store the secret.
Then click on the secret in the secrets list and take note of the *Secret ARN*.

<p align="center">
<figure>
<a href="../../../assets/images/databricks/aws/databricks_secrets_manager_step_2.png">
<img src="../../../assets/images/databricks/aws/databricks_secrets_manager_step_2.png" alt="Storing a Feature Store API Key in the Secrets Manager Step 2">
</a>
<figcaption>Storing a Feature Store API Key in the Secrets Manager Step 2</figcaption>
</figure>
</p>
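Equivalently with boto3 — a sketch where the region and role name are placeholders; the returned ARN is the *Secret ARN* needed for the access policy below:

```python
import json
import boto3

sm = boto3.client("secretsmanager", region_name="eu-west-1")  # your Databricks region

response = sm.create_secret(
    Name="hopsworks/role/MY_DATABRICKS_ROLE",  # replace MY_DATABRICKS_ROLE
    SecretString=json.dumps({"api-key": "<YOUR_HOPSWORKS_API_KEY>"}),
)
print(response["ARN"])  # the Secret ARN used in the access policy below
```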

**Granting access to the secret to the Databricks notebook role**

In the AWS Management Console, go to *IAM*, select *Roles* and then the role that is used when creating Databricks clusters.
Select *Add inline policy*. Choose *Secrets Manager* as the service, expand the *Read* access level and check *GetSecretValue*.
Expand *Resources* and select *Add ARN*. Paste the ARN of the secret created in the previous step.
Click on *Review*, give the policy a name and click on *Create policy*.

<p align="center">
<figure>
<a href="../../../assets/images/databricks/aws/databricks_secrets_manager_policy.png">
<img src="../../../assets/images/databricks/aws/databricks_secrets_manager_policy.png" alt="Configuring the access policy for the Secrets Manager">
</a>
<figcaption>Configuring the access policy for the Secrets Manager</figcaption>
</figure>
</p>
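As with the Parameter Store variant, the inline policy can be attached programmatically; a sketch assuming `<SECRET_ARN>` is the ARN noted in the previous step:

```python
import json
import boto3

iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "secretsmanager:GetSecretValue",
        "Resource": "<SECRET_ARN>",  # the Secret ARN from the previous step
    }],
}

iam.put_role_policy(
    RoleName="MY_DATABRICKS_ROLE",      # the role used by the Databricks cluster
    PolicyName="HopsworksSecretRead",   # hypothetical policy name
    PolicyDocument=json.dumps(policy),
)
```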

### Azure

On Azure we currently do not support storing the API Key in a secrets store. Instead, store the API Key in a file in your Databricks workspace so you can access it when connecting to the Feature Store.
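As a sketch, the key can be written to the driver's local file system from a Databricks notebook; the path and file name are illustrative and must match the `api_key_file` argument passed to `hsfs.connection` later:

```python
# Run once from a Databricks notebook; dbutils is available there by default.
dbutils.fs.put(
    "file:/databricks/featurestore.key",  # illustrative local path on the driver
    "<YOUR_HOPSWORKS_API_KEY>",
    overwrite=True,
)
```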

## Next Steps

Continue with the [configuration guide](configuration.md) to finalize the configuration of the Databricks Cluster to communicate with the Hopsworks Feature Store.
@@ -1,15 +1,23 @@
# Databricks Integration

Users can configure their Databricks clusters to write the results of feature engineering pipelines in the Hopsworks Feature Store using HSFS.
Configuring a Databricks cluster can be done from the Hopsworks Feature Store UI. This guide explains each step.

## Prerequisites

In order to configure a Databricks cluster to use the Feature Store of your Hopsworks instance, it is necessary to ensure that networking is set up correctly between the instances and that the Databricks cluster has access to a Hopsworks API key to perform requests with HSFS from Databricks to Hopsworks.

### Networking

If you haven't done so already, follow the networking guides for either [AWS](networking.md#aws) or [Azure](networking.md#azure) for instructions on how to configure networking properly between Databricks' VPC (or Virtual Network on Azure) and the Hopsworks.ai VPC/VNet.

### Hopsworks API Key

In order for the Feature Store API to be able to communicate with the user's Hopsworks instance, the client library (HSFS) needs to have access to a previously generated API Key from Hopsworks. For ways to set up and store the Hopsworks API Key, please refer to the [API Key guide for Databricks](api_key.md).

## Databricks API Key

Hopsworks uses the Databricks REST APIs to communicate with the Databricks instance and configure clusters on behalf of users.
To achieve that, the first step is to register an instance and a valid API Key in Hopsworks.

Users can get a valid Databricks API Key by following the [Databricks Documentation](https://docs.databricks.com/dev-tools/api/latest/authentication.html#generate-a-personal-access-token).

@@ -22,50 +30,68 @@

Users can register a new Databricks instance by navigating to the `Integrations` tab of a project's Feature Store. Registering a Databricks instance requires adding the instance address and the API Key.

The instance address should be in the format `[UUID].cloud.databricks.com` (or `adb-[UUID].19.azuredatabricks.net` for Databricks on Azure), essentially the same web address used to reach the Databricks instance from the browser.

The API Key will be stored in the Hopsworks secret store for the user and will be available only for that user. If multiple users need to configure Databricks clusters, each has to generate an API Key and register an instance. The Databricks instance registration does not have a project scope, meaning that once registered, the user can configure clusters for all projects they are part of.

## Databricks Cluster

A cluster needs to exist before users can configure it using the Hopsworks UI. The cluster can be in any state prior to the configuration.

!!! warning "Runtime limitation"

    Currently, Runtime 6 is suggested to be able to use the full suite of Hopsworks Feature Store capabilities.

## Configure a cluster

Clusters are configured for a project user, which, in Hopsworks terms, means a user operating within the scope of a project.
To configure a cluster, click on the `Configure` button. By default the cluster will be configured for the user making the request. If the user doesn't have `Can Manage` privilege on the cluster, they can ask a project `Data Owner` to configure it for them. Hopsworks `Data Owners` are allowed to configure clusters for other project users, as long as they have the required Databricks privileges.

During the cluster configuration the following steps will be taken:

- Upload an archive to DBFS containing the necessary JARs for HSFS and HopsFS to be able to read and write from the Hopsworks Feature Store
- Add an init script to configure the JARs when the cluster is started
- Install the `hsfs` Python library
- Configure the necessary Spark properties to authenticate and communicate with the Feature Store

When a cluster is configured for a specific project user, all the operations with the Hopsworks Feature Store will be executed as that project user. If another user needs to re-use the same cluster, the cluster can be reconfigured by following the same steps above.

## Connecting to the Feature Store

At the end of the configuration, Hopsworks will start the cluster.
Once the cluster is running, users can establish a connection to the Hopsworks Feature Store from Databricks:

!!! note "API Key on Azure"
    Please note that on Azure it is necessary to store the Hopsworks API Key locally on the cluster as a file, as we currently do not support storing the API Key in an Azure secret management service as we do for AWS. Consult the [API Key guide for Azure](api_key.md#azure) for more information.

=== "AWS"

    ```python
    import hsfs
    conn = hsfs.connection(
        'my_instance',                    # DNS of your Feature Store instance
        443,                              # Port to reach your Hopsworks instance, defaults to 443
        'my_project',                     # Name of your Hopsworks Feature Store project
        secrets_store='secretsmanager',   # Either parameterstore or secretsmanager
        hostname_verification=True        # Disable for self-signed certificates
    )
    fs = conn.get_feature_store()         # Get the project's default feature store
    ```

=== "Azure"

    ```python
    import hsfs
    conn = hsfs.connection(
        'my_instance',                    # DNS of your Feature Store instance
        443,                              # Port to reach your Hopsworks instance, defaults to 443
        'my_project',                     # Name of your Hopsworks Feature Store project
        api_key_file='featurestore.key',  # For Azure, read the API key from a local file
        hostname_verification=True        # Disable for self-signed certificates
    )
    fs = conn.get_feature_store()         # Get the project's default feature store
    ```
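Once connected, the returned `fs` handle behaves as on any other HSFS client. A minimal sketch; the feature group name and version are illustrative and assume such a feature group already exists in your project:

```python
# Hypothetical feature group; replace name/version with one from your project.
fg = fs.get_feature_group("sales_features", version=1)
df = fg.read()   # read the feature data into a Spark DataFrame
df.show(5)
```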

## Next Steps

For more information about how to connect, see the [Connection](../../generated/project.md) guide. Or continue with the Data Source guide to import your own data to the Feature Store.