Allow to set engine in connection (#196)
Closes #186
moritzmeister committed Dec 18, 2020
1 parent 60efdd9 commit 655d1c2
Showing 7 changed files with 75 additions and 45 deletions.
6 changes: 3 additions & 3 deletions docs/integrations/databricks/configuration.md
@@ -91,7 +91,7 @@ Once the cluster is running users can establish a connection to the Hopsworks Fe
443, # Port to reach your Hopsworks instance, defaults to 443
'my_project', # Name of your Hopsworks Feature Store project
secrets_store='secretsmanager', # Either parameterstore or secretsmanager
hostname_verification=True) # Disable for self-signed certificates
hostname_verification=True # Disable for self-signed certificates
)
fs = conn.get_feature_store() # Get the project's default feature store
```
@@ -104,8 +104,8 @@ Once the cluster is running users can establish a connection to the Hopsworks Fe
'my_instance', # DNS of your Feature Store instance
443, # Port to reach your Hopsworks instance, defaults to 443
'my_project', # Name of your Hopsworks Feature Store project
api_key_file="featurestore.key" # For Azure, store the API key locally
hostname_verification=True) # Disable for self-signed certificates
api_key_file="featurestore.key", # For Azure, store the API key locally
hostname_verification=True # Disable for self-signed certificates
)
fs = conn.get_feature_store() # Get the project's default feature store
```
4 changes: 4 additions & 0 deletions docs/integrations/python.md
@@ -60,6 +60,10 @@ conn = hsfs.connection(
fs = conn.get_feature_store() # Get the project's default feature store
```

!!! note "Engine"

`HSFS` uses either Apache Spark or Apache Hive as an execution engine to perform queries against the feature store. The `engine` option of the connection lets you override the default behaviour by setting it to `"hive"` or `"spark"`. By default, `HSFS` will try to use Spark as the engine if PySpark is available, so if you have PySpark installed in your local Python environment but have not configured Spark, you will have to set `engine='hive'`. Please refer to the [Spark integration guide](spark.md) to configure your local Spark cluster to be able to connect to the Hopsworks Feature Store.
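
For illustration, here is a minimal sketch of a local Python connection that forces the Hive engine; the hostname, project name and API key file are placeholder values:

```python
import hsfs

# Placeholder values; replace with your own instance, project and API key file.
conn = hsfs.connection(
    host='my_instance',               # DNS of your Feature Store instance
    port=443,                         # Port to reach your Hopsworks instance
    project='my_project',             # Name of your Hopsworks Feature Store project
    api_key_file='featurestore.key',  # File containing the API key to authenticate with
    engine='hive'                     # Force Hive, e.g. PySpark installed but Spark not configured
)
fs = conn.get_feature_store()         # Get the project's default feature store
```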

!!! info "Ports"

If you have trouble connecting, please ensure that your Feature Store can receive incoming traffic from your Python environment on ports 443, 9083 and 9085.
8 changes: 7 additions & 1 deletion docs/integrations/sagemaker.md
@@ -168,11 +168,17 @@ conn = hsfs.connection(
443, # Port to reach your Hopsworks instance, defaults to 443
'my_project', # Name of your Hopsworks Feature Store project
secrets_store='secretsmanager', # Either parameterstore or secretsmanager
hostname_verification=True) # Disable for self-signed certificates
hostname_verification=True, # Disable for self-signed certificates
engine='hive' # Choose Hive as engine if you haven't set up AWS EMR
)
fs = conn.get_feature_store() # Get the project's default feature store
```

!!! note "Engine"

`HSFS` uses either Apache Spark or Apache Hive as an execution engine to perform queries against the feature store. Most AWS SageMaker kernels have PySpark installed but are not connected to AWS EMR by default; hence, the `engine` option of the connection lets you override the default behaviour. By default, `HSFS` will try to use Spark as the engine if PySpark is available; however, if Spark/EMR is not configured, you will have to set the engine manually to `"hive"`. Please refer to the [EMR integration guide](emr/emr_configuration.md) to set up EMR with the Hopsworks Feature Store.
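
As a sketch of the opposite case, once an EMR cluster is attached you can either drop the `engine` argument (Spark is then picked automatically) or pin it explicitly; the values below are placeholders:

```python
import hsfs

# Placeholder values; adjust for your own deployment.
conn = hsfs.connection(
    'my_instance',                   # DNS of your Feature Store instance
    443,                             # Port to reach your Hopsworks instance
    'my_project',                    # Name of your Hopsworks Feature Store project
    secrets_store='secretsmanager',  # Either parameterstore or secretsmanager
    hostname_verification=True,      # Disable for self-signed certificates
    engine='spark'                   # Explicitly select Spark once EMR is configured
)
fs = conn.get_feature_store()        # Get the project's default feature store
```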


!!! info "Ports"

If you have trouble connecting, please ensure that the Security Group of your Hopsworks instance on AWS is configured to allow incoming traffic from your SageMaker instance on ports 443, 9083 and 9085. See [VPC Security Groups](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_SecurityGroups.html) for more information. If your SageMaker instances are not in the same VPC as your Hopsworks instance, and the Hopsworks instance is not accessible from the internet, you will need to configure [VPC Peering on AWS](https://docs.aws.amazon.com/vpc/latest/peering/what-is-vpc-peering.html).
5 changes: 4 additions & 1 deletion docs/integrations/spark.md
@@ -89,12 +89,15 @@ conn = hsfs.connection(
host='my_instance', # DNS of your Feature Store instance
port=443, # Port to reach your Hopsworks instance, defaults to 443
project='my_project', # Name of your Hopsworks Feature Store project
api_key_value='api_key', # The API key to authenticate with the feature store
api_key_value='api_key', # The API key to authenticate with the feature store
hostname_verification=True # Disable for self-signed certificates
)
fs = conn.get_feature_store() # Get the project's default feature store
```

!!! note "Engine"

`HSFS` uses either Apache Spark or Apache Hive as an execution engine to perform queries against the feature store. The `engine` option of the connection lets you override the default behaviour by setting it to `"hive"` or `"spark"`. By default, `HSFS` will try to use Spark as the engine if PySpark is available, so no further action should be required if you set up Spark correctly as described above.
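
If you prefer to be explicit, the engine can also be pinned when creating the connection; a short sketch with placeholder credentials:

```python
import hsfs

# Placeholder values; adjust for your own deployment.
conn = hsfs.connection(
    host='my_instance',          # DNS of your Feature Store instance
    port=443,                    # Port to reach your Hopsworks instance
    project='my_project',        # Name of your Hopsworks Feature Store project
    api_key_value='api_key',     # The API key to authenticate with the feature store
    hostname_verification=True,  # Disable for self-signed certificates
    engine='spark'               # Pin Spark explicitly; use 'hive' to bypass a broken Spark setup
)
fs = conn.get_feature_store()    # Get the project's default feature store
```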

## Next Steps

2 changes: 2 additions & 0 deletions python/hsfs/client/__init__.py
@@ -24,6 +24,7 @@ def init(
host=None,
port=None,
project=None,
engine=None,
region_name=None,
secrets_store=None,
hostname_verification=None,
@@ -41,6 +42,7 @@
host,
port,
project,
engine,
region_name,
secrets_store,
hostname_verification,
5 changes: 4 additions & 1 deletion python/hsfs/client/external.py
@@ -34,6 +34,7 @@ def __init__(
host,
port,
project,
engine,
region_name,
secrets_store,
hostname_verification,
@@ -67,7 +68,9 @@ def __init__(
project_info = self._get_project_info(self._project_name)
self._project_id = str(project_info["projectId"])

if cert_folder:
self._cert_key = None

if engine == "hive":
# On external Spark clients (Databricks, Spark Cluster),
# certificates need to be provided before the Spark application starts.
self._cert_folder_base = cert_folder
90 changes: 51 additions & 39 deletions python/hsfs/connection.py
@@ -82,6 +82,11 @@ class Connection:
project: The name of the project to connect to. When running on Hopsworks, this
defaults to the project from where the client is run from.
Defaults to `None`.
engine: Which engine to use, `"spark"` or `"hive"`. Defaults to `None`, which
initializes the engine to Spark if the environment provides Spark, for
example on Hopsworks and Databricks, or falls back on Hive if Spark is not
available, e.g. on local Python environments or AWS SageMaker. This option
allows you to override this behaviour.
region_name: The name of the AWS region in which the required secrets are
stored, defaults to `"default"`.
secrets_store: The secrets storage to be used, either `"secretsmanager"`,
@@ -108,6 +113,7 @@ def __init__(
host: str = None,
port: int = HOPSWORKS_PORT_DEFAULT,
project: str = None,
engine: str = None,
region_name: str = AWS_DEFAULT_REGION,
secrets_store: str = SECRETS_STORE_DEFAULT,
hostname_verification: bool = HOSTNAME_VERIFICATION_DEFAULT,
@@ -119,6 +125,7 @@
self._host = host
self._port = port
self._project = project
self._engine = engine
self._region_name = region_name
self._secrets_store = secrets_store
self._hostname_verification = hostname_verification
@@ -168,48 +175,49 @@ def connect(self):
"""
self._connected = True
try:
# determine engine, needed to init client
if (self._engine is not None and self._engine.lower() == "spark") or (
self._engine is None and importlib.util.find_spec("pyspark")
):
self._engine = "spark"
elif (self._engine is not None and self._engine.lower() == "hive") or (
self._engine is None and not importlib.util.find_spec("pyspark")
):
self._engine = "hive"
else:
raise ConnectionError(
"Engine you are trying to initialize is unknown. "
"Supported engines are `'spark'` and `'hive'`."
)

# init client
if client.base.Client.REST_ENDPOINT not in os.environ:
if importlib.util.find_spec("pyspark"):
# databricks, emr, external spark clusters
client.init(
"external",
self._host,
self._port,
self._project,
self._region_name,
self._secrets_store,
self._hostname_verification,
self._trust_store_path,
None,
self._api_key_file,
self._api_key_value,
)
engine.init("spark")
else:
# aws
client.init(
"external",
self._host,
self._port,
self._project,
self._region_name,
self._secrets_store,
self._hostname_verification,
self._trust_store_path,
self._cert_folder,
self._api_key_file,
self._api_key_value,
)
engine.init(
"hive",
self._host,
self._cert_folder,
self._project,
client.get_instance()._cert_key,
)
client.init(
"external",
self._host,
self._port,
self._project,
self._engine,
self._region_name,
self._secrets_store,
self._hostname_verification,
self._trust_store_path,
self._cert_folder,
self._api_key_file,
self._api_key_value,
)
else:
client.init("hopsworks")
engine.init("spark")

# init engine
engine.init(
self._engine,
self._host,
self._cert_folder,
self._project,
client.get_instance()._cert_key,
)

self._feature_store_api = feature_store_api.FeatureStoreApi()
self._project_api = project_api.ProjectApi()
self._hosts_api = hosts_api.HostsApi()
@@ -239,6 +247,7 @@ def setup_databricks(
host: str = None,
port: int = HOPSWORKS_PORT_DEFAULT,
project: str = None,
engine: str = None,
region_name: str = AWS_DEFAULT_REGION,
secrets_store: str = SECRETS_STORE_DEFAULT,
hostname_verification: bool = HOSTNAME_VERIFICATION_DEFAULT,
@@ -259,6 +268,7 @@
host,
port,
project,
engine,
region_name,
secrets_store,
hostname_verification,
@@ -286,6 +296,7 @@ def connection(
host: str = None,
port: int = HOPSWORKS_PORT_DEFAULT,
project: str = None,
engine: str = None,
region_name: str = AWS_DEFAULT_REGION,
secrets_store: str = SECRETS_STORE_DEFAULT,
hostname_verification: bool = HOSTNAME_VERIFICATION_DEFAULT,
@@ -299,6 +310,7 @@
host,
port,
project,
engine,
region_name,
secrets_store,
hostname_verification,
