Adding documentation for azure.cosmos.spark package and managing spark libraries #1013

Merged · 8 commits · Jan 27, 2023

26 changes: 25 additions & 1 deletion docs/how-to-guides/jdbc-cosmos-notes.md
@@ -105,4 +105,28 @@ client.get_offline_features(..., output_path=sink)

## Using a SQL database as the online store

Same as the offline store, create a JDBC sink and add it to the `MaterializationSettings`, set the corresponding environment variables, then use it with `FeathrClient.materialize_features`.
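
For example — a minimal sketch mirroring the offline example above, assuming `JdbcSink` takes a name, JDBC url, table, and auth mode and is importable from the top-level `feathr` package; all values are placeholders:

```python
import os

from feathr import JdbcSink, MaterializationSettings

name = 'jdbc_output'
# Hypothetical sink mirroring the offline example; auth="USERPASS" means the
# username/password are picked up from env vars named after the sink
sink = JdbcSink(name, some_jdbc_url, some_dbtable, auth="USERPASS")
os.environ[f"{name.upper()}_USER"] = "some_user"
os.environ[f"{name.upper()}_PASSWORD"] = "some_password"
client.materialize_features(..., materialization_settings=MaterializationSettings(..., sinks=[sink]))
```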

## Using CosmosDb as the online store

To use CosmosDb as the online store, create a `CosmosDbSink` and add it to the `MaterializationSettings`, then use it with `FeathrClient.materialize_features`, e.g.:

```python
import os

from feathr import CosmosDbSink, MaterializationSettings

name = 'cosmosdb_output'
sink = CosmosDbSink(name, some_cosmosdb_url, some_cosmosdb_database, some_cosmosdb_collection)
# The API key is picked up from an environment variable named after the sink
os.environ[f"{name.upper()}_KEY"] = "some_cosmosdb_api_key"
client.materialize_features(..., materialization_settings=MaterializationSettings(..., sinks=[sink]))
```

The Feathr client doesn't support getting feature values back from CosmosDb; use the [official CosmosDb client](https://pypi.org/project/azure-cosmos/) to read them:

```python
from azure.cosmos import CosmosClient

client = CosmosClient(some_cosmosdb_url, some_cosmosdb_api_key)
db_client = client.get_database_client(some_cosmosdb_database)
container_client = db_client.get_container_client(some_cosmosdb_collection)
# read_item needs the document's partition key as well as its id
doc = container_client.read_item(some_key, partition_key=some_partition_key)
feature_value = doc['feature_name']
```
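
If you don't know a document's partition key, a cross-partition SQL query can look the item up by id instead — a hedged sketch, where the query shape and `feature_name` field are illustrative:

```python
# Cross-partition lookup by id when the partition key is unknown
items = container_client.query_items(
    query="SELECT * FROM c WHERE c.id = @id",
    parameters=[{"name": "@id", "value": some_key}],
    enable_cross_partition_query=True,
)
feature_value = next(iter(items))['feature_name']
```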

**Note: There is currently a known issue with the azure.cosmos.spark package on Databricks. The dependency does not resolve correctly when packaged through Gradle, which results in a `ClassNotFound` error. To mitigate this, install the azure.cosmos.spark package directly from [Maven Central](https://mvnrepository.com/artifact/com.azure.cosmos.spark/azure-cosmos-spark_3-1_2-12), following the steps [here](./manage-library-spark-platform.md).**
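
For reference, the Maven coordinate to install has the form below; the `<version>` placeholder is illustrative — pick the release matching your cluster's Spark and Scala versions:

```
com.azure.cosmos.spark:azure-cosmos-spark_3-1_2-12:<version>
```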
16 changes: 16 additions & 0 deletions docs/how-to-guides/manage-library-spark-platform.md
@@ -0,0 +1,16 @@
---
layout: default
title: Manage libraries for Azure Spark Compute (Azure Databricks and Azure Synapse)
parent: How-to Guides
---

# Manage libraries for Azure Spark Compute

## Manage libraries for Azure Synapse

Sometimes you might run into dependency issues where a particular dependency is missing on your Spark compute; you will most likely hit this while executing the sample notebooks in that environment.

If you want to install Maven, PyPi, or private packages on your Synapse cluster, follow the [official documentation](https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-azure-portal-add-libraries), which explains how to do so at the workspace, pool, and session level, and the differences between them.
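
For example, to pull a Maven package such as azure-cosmos-spark into a single session, a session-scoped configuration cell at the top of a Synapse notebook can set `spark.jars.packages` — a minimal sketch; the coordinate and `<version>` placeholder are illustrative:

```
%%configure -f
{
    "conf": {
        "spark.jars.packages": "com.azure.cosmos.spark:azure-cosmos-spark_3-1_2-12:<version>"
    }
}
```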


## Manage libraries for Azure Databricks

Similarly, to install an external library from PyPi, Maven, or a private package on Databricks, you can follow the official [Databricks documentation](https://learn.microsoft.com/en-us/azure/databricks/libraries/cluster-libraries).
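
For example, a cluster library can also be attached programmatically through the Databricks Libraries REST API — a hedged sketch using `requests`, where the host, token, cluster id, and version are all placeholders:

```python
import requests

# Install a Maven package on a running cluster via the Libraries API
resp = requests.post(
    "https://<databricks-instance>/api/2.0/libraries/install",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "cluster_id": "<cluster-id>",
        "libraries": [
            {"maven": {"coordinates": "com.azure.cosmos.spark:azure-cosmos-spark_3-1_2-12:<version>"}}
        ],
    },
)
resp.raise_for_status()
```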