[FR] Separate out Snowflake Source #835

Closed
2 of 4 tasks
aabbasi-hbo opened this issue Nov 4, 2022 · 0 comments · Fixed by #836
Labels
feature New feature or request

Comments

@aabbasi-hbo
Collaborator

Willingness to contribute

Yes. I can contribute this feature independently.

Feature Request Proposal

Currently, Snowflake support is tied to HdfsSource and SimplePath. When users want to define a Snowflake source, they have to provide a JDBC URL to an HdfsSource, which translates to a SimplePath in the computation engine.

A couple of issues arise from this:

  • Defining a Snowflake source is confusing and error-prone. Every time users want to use Snowflake as a feature source or observation path, they have to build a JDBC URL that embeds SF_URL, SF_USER, and SF_ROLE, all of which are already specified in the config settings.
  • Because this HdfsSource translates to a SimplePath in the computation engine, WindowAggregation fails with Snowflake: SimplePath's isFileBasedLocation defaults to True, but Snowflake/JDBC locations are not file paths. The result is:
Caused by: URISyntaxException: Relative path in absolute URI:
2022-10-26 05:01:39.682 | ERROR    | feathr.spark_provider._databricks_submission:wait_for_completion:218 - at org.apache.hadoop.fs.Path.initialize(Path.java:263)
	at org.apache.hadoop.fs.Path.<init>(Path.java:221)
	at com.linkedin.feathr.offline.util.HdfsUtils$.exists(HdfsUtils.scala:453)
	at com.linkedin.feathr.offline.source.pathutil.HdfsPathChecker.exists(HdfsPathChecker.scala:11)
	at com.linkedin.feathr.offline.source.pathutil.TimeBasedHdfsPathAnalyzer.analyze(TimeBasedHdfsPathAnalyzer.scala:51)
	at com.linkedin.feathr.offline.transformation.AnchorToDataSourceMapper.getWindowAggAnchorDFMapForJoin(AnchorToDataSourceMapper.scala:102)
	at com.linkedin.feathr.offline.swa.SlidingWindowAggregationJoiner.$anonfun$joinWindowAggFeaturesAsDF$8(SlidingWindowAggregationJoiner.scala:142)
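To make the first pain point concrete, here is a minimal sketch of the URL-assembly boilerplate the current workflow pushes onto users. The exact JDBC URL layout and parameter names below are assumptions for illustration (the issue only names SF_URL, SF_USER, and SF_ROLE); the point is that every source definition repeats values already present in the client configuration.

```python
# Sketch of the JDBC URL a user currently assembles by hand for every
# Snowflake feature source or observation path. The URL layout and query
# parameter names are illustrative assumptions, not a documented format.
from urllib.parse import urlencode

def snowflake_jdbc_url(sf_url: str, sf_user: str, sf_role: str,
                       database: str, schema: str, dbtable: str) -> str:
    """Build the JDBC-style URL the current HdfsSource workflow requires.

    Note that sf_url, sf_user, and sf_role duplicate values the user has
    already supplied in the client config settings.
    """
    params = {
        "user": sf_user,
        "role": sf_role,
        "db": database,
        "schema": schema,
        "dbtable": dbtable,
    }
    return f"jdbc:snowflake://{sf_url}/?{urlencode(params)}"

url = snowflake_jdbc_url("myaccount.snowflakecomputing.com", "alice",
                         "ANALYST", "FEATURES_DB", "PUBLIC", "USER_FEATURES")
```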

Minor Addition:
Currently, the registry implementation (request functionality) is tied to Azure auth. We should add an AWS implementation that lets the user provide an AWSRequestsAuth object and authenticate to the registry.
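The shape of that change can be sketched with a pure-Python stand-in: rather than hard-wiring Azure AD token acquisition, the registry client accepts any callable that signs an outgoing request, such as an AWSRequestsAuth instance from the aws-requests-auth package. The class and parameter names below are hypothetical, not Feathr's actual API.

```python
# Minimal stand-in for a pluggable registry-auth hook. RegistryClient,
# request_auth, and fake_sigv4 are illustrative names, not Feathr's API;
# in practice the callable would be an AWSRequestsAuth signer.
from typing import Callable, Dict, Optional

class RegistryClient:
    def __init__(self, endpoint: str,
                 request_auth: Optional[Callable[[Dict[str, str]], Dict[str, str]]] = None):
        self.endpoint = endpoint
        # Default to a no-op so the existing Azure-auth flow is unchanged.
        self.request_auth = request_auth or (lambda headers: headers)

    def build_headers(self) -> Dict[str, str]:
        """Apply the user-supplied auth hook to the request headers."""
        return self.request_auth({"Accept": "application/json"})

# Toy signer standing in for AWSRequestsAuth's SigV4 request signing.
def fake_sigv4(headers: Dict[str, str]) -> Dict[str, str]:
    headers["Authorization"] = "AWS4-HMAC-SHA256 Credential=..."
    return headers

client = RegistryClient("https://registry.example.com", request_auth=fake_sigv4)
headers = client.build_headers()
```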

Motivation

What is the use case for this feature?

  • Users can define a Snowflake source for features and observations using the SnowflakeSource API, and can pass a custom query instead of a dbtable. This allows filtering/processing on the Snowflake cluster before data is loaded into memory.
  • Users can authenticate to the registry with AWS auth by creating an AWSRequestsAuth object and passing it in during client initialization.

Details

The goal of this feature is to separate out the Snowflake implementation into its own source and functionality. This includes:

  • Separate Source API (SnowflakeSource).
  • Translates to a Snowflake DataLocation in the computation engine (no longer tied to SimplePath).
  • Instead of providing a JDBC URL each time, users provide database, schema, and dbtable/query information; the rest of the Snowflake config is retrieved from the settings specified during client initialization.
  • Along with dbtable, users can now pass in a query instead, enabling predicate pushdown.
  • Add sfWarehouse to the required Snowflake config so users don't need to specify it each time.
  • Expose client functionality to generate a Snowflake URL given the same parameters as SnowflakeSource.
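The proposed API could look roughly like the sketch below: a source that carries only database/schema plus exactly one of dbtable or query, resolving the remaining Snowflake settings (account URL, user, role, warehouse) from client-level configuration. Class name, fields, and the resulting path format follow the issue's description but are assumptions, not a published interface.

```python
# Hedged sketch of the proposed SnowflakeSource. The to_path() output
# format and config key names (SF_URL, SF_USER, SF_ROLE, SF_WAREHOUSE)
# are illustrative assumptions based on the issue text.
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlencode

@dataclass
class SnowflakeSource:
    name: str
    database: str
    schema: str
    dbtable: Optional[str] = None
    query: Optional[str] = None

    def __post_init__(self) -> None:
        # Exactly one of dbtable/query must be set; a query enables
        # predicate pushdown on the Snowflake cluster.
        if bool(self.dbtable) == bool(self.query):
            raise ValueError("provide exactly one of dbtable or query")

    def to_path(self, config: dict) -> str:
        """Resolve a full Snowflake location from client-level config,
        so the user never hand-writes a JDBC URL."""
        params = {
            "sfUser": config["SF_USER"],
            "sfRole": config["SF_ROLE"],
            "sfWarehouse": config["SF_WAREHOUSE"],  # now required per this issue
            "sfDatabase": self.database,
            "sfSchema": self.schema,
        }
        params["query" if self.query else "dbtable"] = self.query or self.dbtable
        return f"snowflake://{config['SF_URL']}/?{urlencode(params)}"

src = SnowflakeSource(name="user_features", database="FEATURES_DB",
                      schema="PUBLIC",
                      query="SELECT user_id, score FROM USER_FEATURES WHERE score > 0")
path = src.to_path({"SF_URL": "myaccount.snowflakecomputing.com",
                    "SF_USER": "alice", "SF_ROLE": "ANALYST",
                    "SF_WAREHOUSE": "COMPUTE_WH"})
```

Keeping dbtable and query mutually exclusive mirrors the Snowflake Spark connector, which accepts one or the other as its table-or-query option.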

What component(s) does this feature request affect?

  • Python Client: the client users use to interact with most of our API. Mostly written in Python.
  • Computation Engine: the engine that executes the actual feature join and generation work. Mostly in Scala and Spark.
  • Feature Registry API: the backend API layer supporting SQL and Purview (Atlas) as storage. Written in Python (FastAPI).
  • Feature Registry Web UI: the web UI for the feature registry. Written in React.
@aabbasi-hbo aabbasi-hbo added the feature New feature or request label Nov 4, 2022