[FR] Separate out Snowflake Source #835

Closed
2 of 4 tasks
aabbasi-hbo opened this issue Nov 4, 2022 · 0 comments · Fixed by #836
Labels
feature New feature or request

Comments

@aabbasi-hbo
Collaborator

Willingness to contribute

Yes. I can contribute this feature independently.

Feature Request Proposal

Currently, Snowflake support is tied to HdfsSource and SimplePath. When users want to define a Snowflake source, they have to provide a JDBC URL to an HdfsSource, which translates to a SimplePath in the computation engine.

A couple of issues arise from this:

  • Defining a Snowflake source is confusing and error-prone. Every time users want to use Snowflake as a feature source or observation path, they have to build a JDBC URL that embeds SF_URL, SF_USER, and SF_ROLE, all of which are already specified in the config settings.
  • Because this HdfsSource translates to a SimplePath in the computation engine, WindowAggregation fails with Snowflake: SimplePath's isFileBasedLocation defaults to True, but Snowflake/JDBC locations are not file paths. The result is:
Caused by: URISyntaxException: Relative path in absolute URI:
2022-10-26 05:01:39.682 | ERROR    | feathr.spark_provider._databricks_submission:wait_for_completion:218 - at org.apache.hadoop.fs.Path.initialize(Path.java:263)
	at org.apache.hadoop.fs.Path.<init>(Path.java:221)
	at com.linkedin.feathr.offline.util.HdfsUtils$.exists(HdfsUtils.scala:453)
	at com.linkedin.feathr.offline.source.pathutil.HdfsPathChecker.exists(HdfsPathChecker.scala:11)
	at com.linkedin.feathr.offline.source.pathutil.TimeBasedHdfsPathAnalyzer.analyze(TimeBasedHdfsPathAnalyzer.scala:51)
	at com.linkedin.feathr.offline.transformation.AnchorToDataSourceMapper.getWindowAggAnchorDFMapForJoin(AnchorToDataSourceMapper.scala:102)
	at com.linkedin.feathr.offline.swa.SlidingWindowAggregationJoiner.$anonfun$joinWindowAggFeaturesAsDF$8(SlidingWindowAggregationJoiner.scala:142)
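To make the first pain point concrete, here is a minimal sketch of the URL-assembly boilerplate the current workflow pushes onto users. The exact JDBC URL layout and parameter names below are assumptions for illustration (the issue only names SF_URL, SF_USER, and SF_ROLE); the point is that every source definition repeats values already present in the client configuration.

```python
# Sketch of the JDBC URL a user currently assembles by hand for every
# Snowflake feature source or observation path. The URL layout and query
# parameter names are illustrative assumptions, not a documented format.
from urllib.parse import urlencode

def snowflake_jdbc_url(sf_url: str, sf_user: str, sf_role: str,
                       database: str, schema: str, dbtable: str) -> str:
    """Build the JDBC-style URL the current HdfsSource workflow requires.

    Note that sf_url, sf_user, and sf_role duplicate values the user has
    already supplied in the client config settings.
    """
    params = {
        "user": sf_user,
        "role": sf_role,
        "db": database,
        "schema": schema,
        "dbtable": dbtable,
    }
    return f"jdbc:snowflake://{sf_url}/?{urlencode(params)}"

url = snowflake_jdbc_url("myaccount.snowflakecomputing.com", "alice",
                         "ANALYST", "FEATURES_DB", "PUBLIC", "USER_FEATURES")
```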

Minor Addition:
Currently, the registry implementation (request functionality) is tied to Azure auth. We should add an AWS implementation that lets the user provide an AWSRequestsAuth object and authenticate to the registry.
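The shape of that change can be sketched with a pure-Python stand-in: rather than hard-wiring Azure AD token acquisition, the registry client accepts any callable that signs an outgoing request, such as an AWSRequestsAuth instance from the aws-requests-auth package. The class and parameter names below are hypothetical, not Feathr's actual API.

```python
# Minimal stand-in for a pluggable registry-auth hook. RegistryClient,
# request_auth, and fake_sigv4 are illustrative names, not Feathr's API;
# in practice the callable would be an AWSRequestsAuth signer.
from typing import Callable, Dict, Optional

class RegistryClient:
    def __init__(self, endpoint: str,
                 request_auth: Optional[Callable[[Dict[str, str]], Dict[str, str]]] = None):
        self.endpoint = endpoint
        # Default to a no-op so the existing Azure-auth flow is unchanged.
        self.request_auth = request_auth or (lambda headers: headers)

    def build_headers(self) -> Dict[str, str]:
        """Apply the user-supplied auth hook to the request headers."""
        return self.request_auth({"Accept": "application/json"})

# Toy signer standing in for AWSRequestsAuth's SigV4 request signing.
def fake_sigv4(headers: Dict[str, str]) -> Dict[str, str]:
    headers["Authorization"] = "AWS4-HMAC-SHA256 Credential=..."
    return headers

client = RegistryClient("https://registry.example.com", request_auth=fake_sigv4)
headers = client.build_headers()
```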

Motivation

What is the use case for this feature?

  • Users can define a Snowflake source for features and observations using the SnowflakeSource API, and can pass a custom query instead of a dbtable. This allows filtering/processing on the Snowflake cluster before data is loaded into memory.
  • Users can authenticate to the registry with AWS auth by creating an AWSRequestsAuth object and passing it in during client initialization.

Details

The goal of this feature is to separate out the Snowflake implementation into its own source and functionality. This includes:

  • Separate Source API (SnowflakeSource).
  • Translates to a Snowflake DataLocation in the computation engine (no longer tied to SimplePath).
  • Instead of providing a JDBC URL each time, users provide database, schema, and dbtable/query information; the rest of the Snowflake config is retrieved from the settings specified during client initialization.
  • Along with dbtable, users can now pass in a query instead, enabling predicate pushdown.
  • Add sfWarehouse to the required Snowflake config so users don't need to specify it each time.
  • Expose client functionality to generate a Snowflake URL given the same parameters as SnowflakeSource.
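The proposed API could look roughly like the sketch below: a source that carries only database/schema plus exactly one of dbtable or query, resolving the remaining Snowflake settings (account URL, user, role, warehouse) from client-level configuration. Class name, fields, and the resulting path format follow the issue's description but are assumptions, not a published interface.

```python
# Hedged sketch of the proposed SnowflakeSource. The to_path() output
# format and config key names (SF_URL, SF_USER, SF_ROLE, SF_WAREHOUSE)
# are illustrative assumptions based on the issue text.
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlencode

@dataclass
class SnowflakeSource:
    name: str
    database: str
    schema: str
    dbtable: Optional[str] = None
    query: Optional[str] = None

    def __post_init__(self) -> None:
        # Exactly one of dbtable/query must be set; a query enables
        # predicate pushdown on the Snowflake cluster.
        if bool(self.dbtable) == bool(self.query):
            raise ValueError("provide exactly one of dbtable or query")

    def to_path(self, config: dict) -> str:
        """Resolve a full Snowflake location from client-level config,
        so the user never hand-writes a JDBC URL."""
        params = {
            "sfUser": config["SF_USER"],
            "sfRole": config["SF_ROLE"],
            "sfWarehouse": config["SF_WAREHOUSE"],  # now required per this issue
            "sfDatabase": self.database,
            "sfSchema": self.schema,
        }
        params["query" if self.query else "dbtable"] = self.query or self.dbtable
        return f"snowflake://{config['SF_URL']}/?{urlencode(params)}"

src = SnowflakeSource(name="user_features", database="FEATURES_DB",
                      schema="PUBLIC",
                      query="SELECT user_id, score FROM USER_FEATURES WHERE score > 0")
path = src.to_path({"SF_URL": "myaccount.snowflakecomputing.com",
                    "SF_USER": "alice", "SF_ROLE": "ANALYST",
                    "SF_WAREHOUSE": "COMPUTE_WH"})
```

Keeping dbtable and query mutually exclusive mirrors the Snowflake Spark connector, which accepts one or the other as its table-or-query option.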

What component(s) does this feature request affect?

  • Python Client: the client users use to interact with most of our API. Mostly written in Python.
  • Computation Engine: the engine that executes the actual feature join and generation work. Mostly in Scala and Spark.
  • Feature Registry API: the backend API layer supporting SQL and Purview (Atlas) as storage. Written in Python (FastAPI).
  • Feature Registry Web UI: the web UI for the feature registry. Written in React.
@aabbasi-hbo aabbasi-hbo added the feature New feature or request label Nov 4, 2022