feat: Feast Spark Offline Store (feast-dev#2349)
* State of feast

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Clean up changes

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Fix random incorrect changes

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Fix lint

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Fix build errors

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Fix lint

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Add spark offline store components to test against current integration tests

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Fix lint

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Rename to pass checks

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Fix issues

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Fix type checking issues

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Fix lint

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Clean up print statements for first review

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Fix lint

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Fix flake 8 lint tests

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Add warnings for alpha version release

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Format

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Address review

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Address review

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Fix lint

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Add file store functionality

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* lint

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Add example feature repo

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Update data source creator

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Make cli work for feast init with spark

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Update the docs

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Clean up code

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Clean up more code

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Uncomment repo configs

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Fix setup.py

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Update dependencies

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Fix ci dependencies

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Screwed up rebase

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Screwed up rebase

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Screwed up rebase

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Realign with master

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Fix accidental changes

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Make type map change cleaner

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Address review comments

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Fix tests accidentally broken

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Add comments

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Reformat

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Fix logger

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Remove unused imports

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Fix imports

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Fix CI dependencies

Signed-off-by: Danny Chiao <danny@tecton.ai>

* Prefix destinations with project name

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Update comment

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Fix 3.8

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* temporary fix

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* rollback

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* update

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Update ci?

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Move third party to contrib

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Fix imports

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Remove third_party refactor

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Revert ci requirements and update comment in type map

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

* Revert 3.8-requirements

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>

Co-authored-by: Danny Chiao <danny@tecton.ai>
kevjumba and adchia committed Mar 4, 2022
1 parent 74f887f commit 98b8d8d
Showing 21 changed files with 1,401 additions and 208 deletions.
2 changes: 2 additions & 0 deletions docs/reference/data-sources/README.md
@@ -9,3 +9,5 @@ Please see [Data Source](../../getting-started/concepts/feature-view.md#data-sou
{% page-ref page="bigquery.md" %}

{% page-ref page="redshift.md" %}

{% page-ref page="spark.md" %}
45 changes: 45 additions & 0 deletions docs/reference/data-sources/spark.md
@@ -0,0 +1,45 @@
# Spark

## Description

**NOTE**: The Spark data source API is currently in alpha development and is not completely stable; it may change in future releases.

The Spark data source API allows the retrieval of historical feature values from file or database sources, both for building training datasets and for materializing features into an online store.

* A table name, a SQL query, or a file path can be provided.

## Examples

Using a table reference from the SparkSession (for example, an in-memory temporary view or a table in the Hive metastore):

```python
from feast import SparkSource

my_spark_source = SparkSource(
    table="FEATURE_TABLE",
)
```

Using a SQL query:

```python
from feast import SparkSource

my_spark_source = SparkSource(
    query="SELECT timestamp as ts, created, f1, f2 "
    "FROM spark_table",
)
```

Using a file reference:

```python
from feast import SparkSource

my_spark_source = SparkSource(
    path=f"{CURRENT_DIR}/data/driver_hourly_stats",
    file_format="parquet",
    event_timestamp_column="event_timestamp",
    created_timestamp_column="created",
)
```
2 changes: 2 additions & 0 deletions docs/reference/offline-stores/README.md
@@ -9,3 +9,5 @@ Please see [Offline Store](../../getting-started/architecture-and-components/off
{% page-ref page="bigquery.md" %}

{% page-ref page="redshift.md" %}

{% page-ref page="spark.md" %}
38 changes: 38 additions & 0 deletions docs/reference/offline-stores/spark.md
@@ -0,0 +1,38 @@
# Spark

## Description

The Spark offline store is currently in alpha development. It supports reading [SparkSources](../data-sources/spark.md).

## Disclaimer

The Spark offline store does not yet have full test coverage and still fails some integration tests when run against the Feast universal test suite. Do NOT assume the API is stable.

* Spark tables and views can be used as sources, loaded from a Spark catalog (e.g. Hive) or registered in memory.
* Entity dataframes can be provided as a SQL query or as a Pandas dataframe. A Pandas dataframe is converted to a Spark dataframe and processed as a temporary view.
* A `SparkRetrievalJob` is returned when calling `get_historical_features()`. It allows you to call:
  * `to_df` to retrieve the result as a Pandas dataframe.
  * `to_arrow` to retrieve the result as a PyArrow table.
  * `to_spark_df` to retrieve the result as a Spark dataframe.
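The three conversion methods can be thought of as thin views over a single underlying result set. The class below is a hypothetical stand-in (not the real `SparkRetrievalJob`), using a plain list of rows in place of a Spark dataframe so the sketch runs without pyspark:

```python
# Illustrative stand-in for SparkRetrievalJob's conversion methods.
# `_rows` stands in for the underlying Spark dataframe; the real class
# delegates to pyspark and pandas instead. All names here are hypothetical.

class FakeSparkRetrievalJob:
    def __init__(self, rows):
        self._rows = rows  # list of dicts, one per record

    def to_spark_df(self):
        # The real implementation returns the pyspark DataFrame itself.
        return self._rows

    def to_df(self):
        # The real implementation would call the Spark dataframe's toPandas();
        # here we return a column-oriented dict, the shape a pandas
        # DataFrame would hold.
        columns = {}
        for row in self._rows:
            for key, value in row.items():
                columns.setdefault(key, []).append(value)
        return columns

    def to_arrow(self):
        # The real implementation builds a pyarrow.Table; a column dict like
        # this is exactly what pyarrow.Table.from_pydict would accept.
        return self.to_df()


job = FakeSparkRetrievalJob([
    {"driver_id": 1001, "conv_rate": 0.5},
    {"driver_id": 1002, "conv_rate": 0.7},
])
print(job.to_df())  # {'driver_id': [1001, 1002], 'conv_rate': [0.5, 0.7]}
```

The point is only that all three methods expose the same result in different containers; the real job materializes them from Spark.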

## Example

{% code title="feature_store.yaml" %}
```yaml
project: my_project
registry: data/registry.db
provider: local
offline_store:
  type: spark
  spark_conf:
    spark.master: "local[*]"
    spark.ui.enabled: "false"
    spark.eventLog.enabled: "false"
    spark.sql.catalogImplementation: "hive"
    spark.sql.parser.quotedRegexColumnNames: "true"
    spark.sql.session.timeZone: "UTC"
online_store:
  path: data/online_store.db
```
{% endcode %}
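The keys under `spark_conf` are ordinary Spark configuration properties. Below is a sketch of how an offline store might apply such a mapping through a SparkSession-style builder; `StubBuilder` is a hypothetical stand-in for `pyspark.sql.SparkSession.builder` so the example runs without pyspark:

```python
# Sketch: applying a feature_store.yaml `spark_conf` mapping to a
# SparkSession-style builder. StubBuilder is hypothetical, standing in
# for pyspark.sql.SparkSession.builder.

spark_conf = {
    "spark.master": "local[*]",
    "spark.ui.enabled": "false",
    "spark.sql.catalogImplementation": "hive",
    "spark.sql.session.timeZone": "UTC",
}

class StubBuilder:
    def __init__(self):
        self.applied = {}

    def config(self, key, value):
        # pyspark's builder.config(key, value) also returns the builder,
        # which is what makes the chaining pattern below work.
        self.applied[key] = value
        return self

builder = StubBuilder()
for key, value in spark_conf.items():
    builder = builder.config(key, value)

print(builder.applied["spark.master"])  # local[*]
```

With real pyspark, the loop would end with `builder.getOrCreate()` to obtain the session.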
4 changes: 4 additions & 0 deletions sdk/python/feast/__init__.py
@@ -3,6 +3,9 @@
from pkg_resources import DistributionNotFound, get_distribution

from feast.infra.offline_stores.bigquery_source import BigQuerySource
from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import (
    SparkSource,
)
from feast.infra.offline_stores.file_source import FileSource
from feast.infra.offline_stores.redshift_source import RedshiftSource
from feast.infra.offline_stores.snowflake_source import SnowflakeSource
@@ -47,4 +50,5 @@
    "RedshiftSource",
    "RequestFeatureView",
    "SnowflakeSource",
    "SparkSource",
]
4 changes: 3 additions & 1 deletion sdk/python/feast/cli.py
@@ -477,7 +477,9 @@ def materialize_incremental_command(ctx: click.Context, end_ts: str, views: List
@click.option(
    "--template",
    "-t",
    type=click.Choice(["local", "gcp", "aws", "snowflake"], case_sensitive=False),
    type=click.Choice(
        ["local", "gcp", "aws", "snowflake", "spark"], case_sensitive=False
    ),
    help="Specify a template for the created project",
    default="local",
)
7 changes: 5 additions & 2 deletions sdk/python/feast/inference.py
@@ -8,6 +8,7 @@
    FileSource,
    RedshiftSource,
    SnowflakeSource,
    SparkSource,
)
from feast.data_source import DataSource
from feast.errors import RegistryInferenceFailure
@@ -84,7 +85,9 @@ def update_data_sources_with_inferred_event_timestamp_col(
):
    # prepare right match pattern for data source
    ts_column_type_regex_pattern = ""
    if isinstance(data_source, FileSource):
    if isinstance(data_source, FileSource) or isinstance(
        data_source, SparkSource
    ):
        ts_column_type_regex_pattern = r"^timestamp"
    elif isinstance(data_source, BigQuerySource):
        ts_column_type_regex_pattern = "TIMESTAMP|DATETIME"
@@ -97,7 +100,7 @@
"DataSource",
"""
DataSource inferencing of event_timestamp_column is currently only supported
for FileSource and BigQuerySource.
for FileSource, SparkSource, BigQuerySource, RedshiftSource, and SnowflakeSource.
""",
)
# for informing the type checker
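The branch added in inference.py selects a regex per source type and matches it against each column's type string. Below is a minimal sketch of that matching logic; the type strings and the `source_kind` keys are made up for illustration, and the real code works against the data source's table schema:

```python
import re

# Per-source regex patterns for event timestamp *type strings*, mirroring
# the branch above: file/Spark sources use lowercase Arrow/Spark-style type
# names, while BigQuery reports TIMESTAMP or DATETIME.
PATTERNS = {
    "file_or_spark": r"^timestamp",
    "bigquery": "TIMESTAMP|DATETIME",
}

def is_event_timestamp_type(source_kind: str, type_string: str) -> bool:
    """Return True if the column type string matches the source's pattern."""
    return re.search(PATTERNS[source_kind], type_string) is not None

# Hypothetical type strings:
print(is_event_timestamp_type("file_or_spark", "timestamp[us, tz=UTC]"))  # True
print(is_event_timestamp_type("file_or_spark", "int64"))                  # False
print(is_event_timestamp_type("bigquery", "DATETIME"))                    # True
```

Because `r"^timestamp"` is anchored, only columns whose type string starts with `timestamp` qualify, which is how the inference narrows the candidate event timestamp columns.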
Empty file.
