Full spark integration #249
Conversation
See the issue for more design discussion. The basic overview is:
1. We use `with_columns` to group together map transforms.
2. Map transforms can be UDFs (normal) or pandas UDFs.
3. These all get run and "linearized" -- this means that we have two sets of edges:
   - edges that form the "physical" dependency -- the dataframe getting passed through and consistently appended to
   - edges that represent logical dependencies -- these are the original edges in the `with_columns` group

While this muddles the edges, it allows us to visualize both the structure and execution of the DAG. We will likely be adding metadata to edges to help with visualization. A rough usage sketch follows.
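As an illustration (not code from the PR itself), here is a minimal sketch of the intended usage. Function and column names are invented, and I'm assuming the decorated function receives the upstream dataframe as its first parameter -- the parameter-passing details evolve over the course of this PR:

```python
import pandas as pd
from pyspark.sql import DataFrame

from hamilton.experimental import h_spark  # module path as of this PR

# Map transforms to group: pandas UDFs operate on pd.Series columns.
def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    return spend / signups

def spend_doubled(spend: pd.Series) -> pd.Series:
    return spend * 2

# The group gets "linearized": the dataframe is threaded through each transform
# (physical edges) while the original column dependencies remain as logical edges.
@h_spark.with_columns(
    spend_per_signup,
    spend_doubled,
    initial_schema=["spend", "signups"],  # columns read off the upstream dataframe
)
def final_df(upstream_df: DataFrame) -> DataFrame:
    return upstream_df
```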
This has basic documentation in the README, a notebook, and some simple hello_world code/a script. There are a few caveats that are noted in the code.
Dependencies within a `with_columns` group can come from one of three places:
1. Other columns inside that group
2. The upstream dataframe
3. External places in the DAG

Previously we required users to specify (2) with the `initial_schema` kwarg; we now also allow them to specify (3) with the `external_inputs` kwarg, as sketched below.
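A hypothetical sketch of (2) and (3) side by side, reusing the imports from the sketch above (names invented):

```python
def spend_mean() -> float:
    return 100.0  # (3) a node defined elsewhere in the DAG

def spend_zero_mean(spend: pd.Series, spend_mean: float) -> pd.Series:
    return spend - spend_mean  # depends on an upstream column and an external input

@h_spark.with_columns(
    spend_zero_mean,
    initial_schema=["spend"],        # (2) columns from the upstream dataframe
    external_inputs=["spend_mean"],  # (3) dependencies from outside the group
)
def final_df(upstream_df: DataFrame) -> DataFrame:
    return upstream_df
```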
`append` will just append the computed columns to that dataframe; `select` will just select the specified columns from that dataframe.
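For example, continuing the sketches above (I'm assuming `select` is passed as a kwarg on the decorator):

```python
@h_spark.with_columns(
    spend_per_signup,
    initial_schema=["spend", "signups"],
    select=["spend_per_signup"],  # "select": keep only the listed computed columns
    # default ("append") behavior: add all computed columns onto the incoming dataframe
)
def selected_df(upstream_df: DataFrame) -> DataFrame:
    return upstream_df
```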
This enables you to extract columns from a central dataframe. Note there are two available approaches:
1. Specify `initial_schema` to get a set of columns extracted already
2. Specify `dataframe_subdag_param` to extract them yourself
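A hedged sketch of approach (2) -- the central dataframe is injected into the subdag under the given parameter name, and you extract or derive columns yourself (names invented, exact semantics assumed):

```python
def derived(central_df: DataFrame) -> DataFrame:
    # extract/derive columns yourself from the injected dataframe
    return central_df.withColumn("spend_per_signup", central_df.spend / central_df.signups)

@h_spark.with_columns(
    derived,
    dataframe_subdag_param="central_df",  # inject the dataframe under this name
)
def final_df(upstream_df: DataFrame) -> DataFrame:
    return upstream_df
```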
This change:
1. Removes the `dataframe` parameter in favor of using the function's first parameter
2. Renames the `dataframe_subdag_param` parameter to `pass_dataframe_as`
3. Renames the `initial_schema` parameter to `columns_to_pass`

We also update the README to be a little easier to run.
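Putting the renames together, a usage sketch after this change (still hedged; reusing `spend_doubled` from the first sketch):

```python
@h_spark.with_columns(
    spend_doubled,
    columns_to_pass=["spend"],     # formerly initial_schema
    # pass_dataframe_as="raw_df",  # formerly dataframe_subdag_param -- the alternative approach
)
def final_df(upstream_df: DataFrame) -> DataFrame:
    # the upstream dataframe now arrives as the function's first parameter
    return upstream_df
```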
These are in an example. We don't have a notebook yet (getting the data is a pain, and I don't want to link our copy due to licensing), but it's a great demonstration of how it works.
hamilton/experimental/h_spark.py (outdated diff)
```python
if return_type in (int, float, bool, str, bytes):
    return python_to_spark_type(return_type)
elif return_type in (list[int], list[float], list[bool], list[str], list[bytes]):
```
This should reference `_list` above.
Otherwise this will break for Python < 3.9.
Yeah, this was lost in the merge.
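For reference, a hypothetical reconstruction of the fix being discussed -- defining `_list` with `typing.List` so the membership check works on Python < 3.9 (the exact contents of `_list` are assumed):

```python
import sys
from typing import List

# typing.List[...] is valid on all supported versions; list[int] (PEP 585)
# raises a TypeError at runtime before Python 3.9.
_list = (List[int], List[float], List[bool], List[str], List[bytes])
if sys.version_info >= (3, 9):
    _list = _list + (list[int], list[float], list[bool], list[str], list[bytes])
```

The `elif` in the diff above would then read `elif return_type in _list:`.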