
feat: Get Snowflake Query Output As Pyspark Dataframe (#2504) #3358

Merged 1 commit into feast-dev:master on Nov 23, 2022

Conversation

amithadiraju1694 (Contributor)
  1. Added a feature to offline_store -> snowflake.py to return the results of a Snowflake query as a PySpark DataFrame. This helps Spark-based users distribute data that often doesn't fit in driver nodes through the pandas output.

  2. Also added a relevant error class to notify the user of a missing Spark session, particular to this use case.

Signed-off-by: amithadiraju1694 amith.adiraju@gmail.com

What this PR does / why we need it:

This adds a feature to SnowflakeRetrievalJob to return the result of a Snowflake query execution as a PySpark DataFrame.

Which issue(s) this PR fixes:

Fixes #2504
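The flow described above can be sketched as follows. This is a minimal stand-in, not the merged code: the class name `SnowflakeRetrievalJobSketch`, the error name `SparkSessionNotFoundException`, and the pandas round-trip via `createDataFrame` are assumptions based on the PR description.

```python
from typing import Optional


class SparkSessionNotFoundException(Exception):
    """Hypothetical error class standing in for the one the PR adds to
    flag a missing Spark session (the real name may differ)."""


class SnowflakeRetrievalJobSketch:
    """Minimal stand-in for Feast's SnowflakeRetrievalJob, showing only
    the to_spark_df flow described in this PR."""

    def to_df(self):
        # Placeholder for the existing pandas path (runs the Snowflake
        # query and fetches the result into the driver as pandas).
        raise NotImplementedError

    def to_spark_df(self, spark_session: Optional[object] = None):
        # Fail fast with a descriptive error if no session was supplied,
        # rather than letting a later AttributeError surface.
        if spark_session is None:
            raise SparkSessionNotFoundException(
                "to_spark_df() requires an active SparkSession"
            )
        # Hand the fetched result to Spark so downstream work is
        # distributed across executors instead of pinned to the driver.
        return spark_session.createDataFrame(self.to_df())
```

Note that materializing through pandas still pulls the full result onto the driver once; the benefit is that subsequent transformations run distributed.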

@amithadiraju1694 (Contributor Author)

/assign @sfc-gh-madkins

/assign @adchia

@@ -447,6 +459,51 @@ def to_sql(self) -> str:
        with self._query_generator() as query:
            return query

    def to_spark_df(
        self, spark_session: Optional[SparkSession] = None
    ) -> pyspark_DataFrame:
Collaborator:
@amithadiraju1694 can you use just DataFrame here? Not the alias
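One way to follow this suggestion while keeping pyspark out of the runtime import path is a `TYPE_CHECKING` guard; the signature mirrors the diff, but the guard itself and the string-quoted annotations are an illustration, not necessarily what was merged.

```python
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Import under the class's normal name, per the review comment,
    # instead of aliasing it to pyspark_DataFrame.
    from pyspark.sql import DataFrame, SparkSession


def to_spark_df(self, spark_session: Optional["SparkSession"] = None) -> "DataFrame":
    """Signature from the diff, rewritten with un-aliased annotations
    (string-quoted so pyspark is only needed for type checking)."""
    ...
```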

        spark_df: A pyspark dataframe.
        """

        if spark_session == None:
Collaborator:
You should be able to combine these two if statements into one

@sfc-gh-madkins (Collaborator)

/ok-to-test

@sfc-gh-madkins (Collaborator)

@amithadiraju1694 can you run make lint-python

Signed-off-by: amithadiraju1694 <amith.adiraju@gmail.com>
@adchia (Collaborator) left a comment:
/lgtm

@feast-ci-bot (Collaborator)
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adchia, amithadiraju1694

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@feast-ci-bot feast-ci-bot merged commit 2f18957 into feast-dev:master Nov 23, 2022
@amithadiraju1694 amithadiraju1694 deleted the feat_snow_sparkdf branch November 24, 2022 02:50
kevjumba pushed a commit that referenced this pull request Dec 5, 2022
# [0.27.0](v0.26.0...v0.27.0) (2022-12-05)

### Bug Fixes

* Changing Snowflake template code to avoid query not implemented … ([#3319](#3319)) ([1590d6b](1590d6b))
* Dask zero division error if parquet dataset has only one partition ([#3236](#3236)) ([69e4a7d](69e4a7d))
* Enable Spark materialization on Yarn ([#3370](#3370)) ([0c20a4e](0c20a4e))
* Ensure that Snowflake accounts for number columns that overspecify precision ([#3306](#3306)) ([0ad0ace](0ad0ace))
* Fix memory leak from usage.py not properly cleaning up call stack ([#3371](#3371)) ([a0c6fde](a0c6fde))
* Fix workflow to contain env vars ([#3379](#3379)) ([548bed9](548bed9))
* Update bytewax materialization ([#3368](#3368)) ([4ebe00f](4ebe00f))
* Update the version counts ([#3378](#3378)) ([8112db5](8112db5))
* Updated AWS Athena template ([#3322](#3322)) ([5956981](5956981))
* Wrong UI data source type display ([#3276](#3276)) ([8f28062](8f28062))

### Features

* Cassandra online store, concurrency in bulk write operations ([#3367](#3367)) ([eaf354c](eaf354c))
* Cassandra online store, concurrent fetching for multiple entities ([#3356](#3356)) ([00fa21f](00fa21f))
* Get Snowflake Query Output As Pyspark Dataframe ([#2504](#2504)) ([#3358](#3358)) ([2f18957](2f18957))
Successfully merging this pull request may close these issues.

Get_historical_features() Does Not Have Option To Return Distributed Dataframe Like A Spark DF
4 participants