Adds future work section to pyspark udf readme
So that it's clear we have more work to do here.
skrawcz committed Feb 28, 2023
1 parent 8b89e55 commit 17ae5bd
Showing 1 changed file with 24 additions and 1 deletion: examples/spark/pyspark_udfs/README.md
@@ -71,7 +71,30 @@ passed in dataframe.
3. `@check_output` annotations are not currently supported for pyspark UDFs. But we're working on it - ping
us in Slack (or via issues) if you need this feature!

# Future work

## Auto vectorize UDFs to be pandas_udfs
We could, under the hood, translate vanilla Python UDF functions to use the pandas_udf route. This could be
enabled via a flag passed to the PySparkUDFGraphAdapter, via an annotation on the function, or both.
Let us know if this would be useful to you!
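To make the idea concrete, here is a rough sketch of what such a translation could look like. This is a hypothetical illustration, not current Hamilton behavior; the `pandas_udf` wrapping step is left as a comment so the sketch runs without a SparkSession:

```python
import pandas as pd

# A vanilla row-level UDF, as you might write it today:
def price_per_sqft(price: float, sqft: float) -> float:
    return price / sqft

# The vectorized form an auto-translation could emit. This body is what a
# pyspark pandas_udf would wrap, e.g.:
#   pyspark.sql.functions.pandas_udf(price_per_sqft_vectorized, "double")
def price_per_sqft_vectorized(price: pd.Series, sqft: pd.Series) -> pd.Series:
    return price / sqft
```

Because the row-level body is already expressed in terms of arithmetic that pandas Series support, the translation here is mechanical; the open question is how general such a rewrite can be made.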

## All the Pandas UDF signatures

1. Let us know what you need.
2. Implementation is a matter of (a) getting the API right, and (b) making sure it fits with the Hamilton way of thinking.
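For reference, pyspark's pandas UDFs come in several type-hint shapes. The two below are sketched as plain Python functions (no `pandas_udf` decorator, so they run without spark); supporting more of these shapes is what this section is about:

```python
from typing import Iterator

import pandas as pd

# Series -> Series: the basic scalar pandas UDF shape.
def add_one(s: pd.Series) -> pd.Series:
    return s + 1

# Iterator[Series] -> Iterator[Series]: pyspark also supports this batched
# shape, which is useful for amortizing expensive per-batch setup
# (e.g. loading a model once per partition).
def add_one_batched(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for batch in batches:
        yield batch + 1
```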

## Aggregation UDFs

We just need to determine what a good API for this would be. We're open to suggestions!
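One possible direction, sketched here as a hypothetical: pyspark's grouped-aggregate pandas UDFs take a `Series -> scalar` shape, so a Hamilton API for aggregations might accept plain functions like this (written without the `pandas_udf` decorator so it runs anywhere):

```python
import pandas as pd

# Hypothetical aggregation function: Series in, scalar out. This is the
# shape pyspark's grouped-aggregate pandas_udf expects, so a Hamilton
# adapter could plausibly map functions like this onto it.
def mean_spend(spend: pd.Series) -> float:
    return float(spend.mean())
```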

## Other dataframe operations

We could support other dataframe operations, like joins. We're open to suggestions! The main challenge is
designing a good API for this.
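As a purely hypothetical sketch of what a join-as-a-function API could look like, here is the shape using pandas so it runs without spark (the function name, columns, and signature are all illustrative, not a real Hamilton API):

```python
import pandas as pd

# Hypothetical: a join expressed as a plain Hamilton-style function, where
# the two upstream dataframes arrive as parameters. The open question is
# what the equivalent pyspark API and semantics should be.
def orders_with_users(orders: pd.DataFrame, users: pd.DataFrame) -> pd.DataFrame:
    return orders.merge(users, on="user_id", how="left")
```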

# Other questions

## Can't I just use pyspark dataframes directly with Hamilton functions?

Yes, with Hamilton you can write functions that define a named flow operating entirely over pyspark dataframes.
However, you lose a lot of Hamilton's flexibility doing things that way. We're open to suggestions,
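The "named flow over dataframes" style looks like this. pandas DataFrames stand in for pyspark DataFrames here so the sketch is runnable anywhere; with pyspark you would type the parameters as `pyspark.sql.DataFrame` and use spark operations instead (the function and column names are illustrative):

```python
import pandas as pd

# Each function names a step in the flow; a parameter name matching another
# function's name wires that function's output in as the input.
def filtered_events(raw_events: pd.DataFrame) -> pd.DataFrame:
    return raw_events[raw_events["amount"] > 0]

def daily_totals(filtered_events: pd.DataFrame) -> pd.DataFrame:
    return filtered_events.groupby("day", as_index=False)["amount"].sum()
```

Note the trade-off the text describes: each function consumes and produces a whole dataframe, so you give up the column-level lineage and reuse you get when functions operate on individual columns.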
