Add spark pdf summarizer example #268

skrawcz · 2023-08-14T06:13:38Z

Adds an example showing how to run the PDF summarizer on Spark.

For example, given a table, it runs the summarizer as a series of row-based UDFs.
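To make the "row-based" idea concrete, here is a minimal plain-Python sketch of the pattern, with no Spark involved. Everything in it is an assumption made for illustration: `chunk_text`, `summarize_chunks`, `summarize_row`, and the column names `pdf_source` and `raw_text` are hypothetical, and the "summary" is a trivial stand-in for the LLM call. The example in this PR wires the real functions up as Spark UDFs instead.

```python
from typing import Dict, List


def chunk_text(raw_text: str, max_words: int = 5) -> List[str]:
    """Split the text into chunks of at most `max_words` words."""
    words = raw_text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_chunks(chunks: List[str]) -> str:
    """Stand-in for an LLM call: keep only the first word of each chunk."""
    return " ".join(chunk.split()[0] for chunk in chunks)


def summarize_row(row: Dict[str, str]) -> Dict[str, str]:
    """Run the whole flow for one row; rows never depend on each other."""
    chunks = chunk_text(row["raw_text"])
    return {**row, "summary": summarize_chunks(chunks)}


# A "table" with one row per document (hypothetical column names).
table = [
    {"pdf_source": "a.pdf", "raw_text": "one two three four five six seven"},
    {"pdf_source": "b.pdf", "raw_text": "alpha beta"},
]

# Because each row is processed independently, the flow is a plain map,
# which is the shape that ports naturally to per-row UDFs.
summarized = list(map(summarize_row, table))
```

Since no function looks at more than one row, the same logic can be applied row by row on a distributed table without changing the functions themselves.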

Changes

adds example
extends some minor typing support for spark

How I tested this

runs locally

Notes

Checklist

PR has an informative and human-readable title (this will be pulled into the release notes)
Changes are limited to a single goal (no scope creep)
Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
Any change in functionality is tested
New functions are documented (with a description, list of inputs, and expected output)
Placeholder code is flagged / future TODOs are captured in comments
Project documentation has been updated if adding/changing functionality.

This example shows that you can run the same code in a FastAPI backend and then turn around and run it on Spark as well. This works because flows like PDF summarization are essentially “map” transformations, which can be easily ported to and modeled in Spark. The magic to run on Spark happens in the SparkUDFGraphAdapter. Otherwise, this adds and adjusts code to make the summarization.py code work on Spark (one tweak was needed). We add everything someone needs to run the example, including a script and a notebook.

Squashed commit of:

Adds notebook to pdf spark example
So that people can use that to get started too.

(+4 squashed commits)
Squashed commits:
[7db102c] Removes unused code.
[ce2abec] Adds missing requirements.txt for spark PDF example
So that people know what to install.
[eae3edd] Adds README to spark PDF post
Renames directory to run_on_spark to make it clear. Adds a pointer from the main README to the Spark one.
[154e3ce] Adds spark code to run PDF Summarizer
We only needed one minor change to the original code:
* we have to ensure that "columns" and the values that can be bound are in the correct order, so we had to reorder one function's arguments for Spark to work.
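The argument-ordering tweak mentioned in this commit message can be illustrated with a small sketch. It assumes, purely for illustration, that bound values are attached by keyword (here with `functools.partial`) while column values are then passed positionally. That mechanism is an assumption, not a confirmed description of how SparkUDFGraphAdapter works, and the two functions are hypothetical.

```python
import functools


def summarize_good(chunk: str, prompt: str) -> str:
    """Column-like argument first, bindable value last."""
    return f"{prompt}: {chunk}"


def summarize_bad(prompt: str, chunk: str) -> str:
    """Same logic, but the bindable value comes first."""
    return f"{prompt}: {chunk}"


# Bind the constant by keyword, then pass the per-row value positionally.
bound_good = functools.partial(summarize_good, prompt="Summarize")
good_result = bound_good("some text")

# With the bad ordering, the positional value lands on `prompt`,
# which is already bound by keyword, so Python raises a TypeError.
bound_bad = functools.partial(summarize_bad, prompt="Summarize")
error_message = None
try:
    bound_bad("some text")
except TypeError as exc:
    error_message = str(exc)
```

Under that assumption, putting column-backed parameters before bindable ones is enough to avoid the clash, which matches the kind of reordering described above.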

Extends h_spark udf to handle optional and array types

062d080

We need to handle things like list[str] so that we can support the pdf_summarizer code. Also, the code didn't handle optional dependencies; now it does.
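As a rough sketch of what handling optional and array types can involve, the snippet below maps Python annotations to Spark DDL type strings such as `array<string>`. The helper `spark_ddl_type` and its lookup table are hypothetical and written for this illustration only; they are not the h_spark implementation, and the choice of `bigint` for `int` is an assumption.

```python
import types
from typing import List, Optional, Union, get_args, get_origin

# Assumed mapping from Python primitives to Spark DDL type names.
_PRIMITIVES = {str: "string", int: "bigint", float: "double", bool: "boolean"}

# `X | None` (Python 3.10+) reports types.UnionType as its origin on some versions.
_UNION_ORIGINS = (Union, getattr(types, "UnionType", Union))


def spark_ddl_type(annotation) -> str:
    """Translate a Python type annotation into a Spark DDL type string."""
    if annotation in _PRIMITIVES:
        return _PRIMITIVES[annotation]
    origin = get_origin(annotation)
    args = get_args(annotation)
    if origin in _UNION_ORIGINS:
        # Optional[X] is Union[X, None]: unwrap it and translate X.
        non_none = [a for a in args if a is not type(None)]
        if len(non_none) == 1:
            return spark_ddl_type(non_none[0])
    if origin is list and len(args) == 1:
        # List[X] (or list[X] on Python 3.9+) becomes an array of X.
        return f"array<{spark_ddl_type(args[0])}>"
    raise ValueError(f"Unsupported annotation: {annotation!r}")


list_type = spark_ddl_type(List[str])
optional_type = spark_ddl_type(Optional[int])
nested_type = spark_ddl_type(Optional[List[float]])
```

Unsupported annotations raise a `ValueError` rather than silently falling back to a default type, so a missing mapping shows up early.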

skrawcz changed the title ~~WIP: Add spark pdf summarizer~~ Add spark pdf summarizer example Aug 18, 2023

skrawcz marked this pull request as ready for review August 18, 2023 04:04

skrawcz force-pushed the add_spark_pdf_summarizer branch 3 times, most recently from 6eaedab to 1807e6d August 18, 2023 05:06

skrawcz requested a review from zilto August 18, 2023 05:12

skrawcz added 2 commits August 19, 2023 14:23

Adds unit tests to some missing functions in h_spark

f088dba

So that we get some coverage going in case we accidentally break something. Also makes sure that we handle Python <3.9 appropriately.
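One reading of the <3.9 concern, stated here as an assumption: built-in generics such as `list[str]` are only subscriptable from Python 3.9 onward, so a test that covers both spellings has to guard the newer one. A minimal sketch, with hypothetical helper names:

```python
import sys
from typing import List, get_args, get_origin


def list_of_str_annotations() -> list:
    """Return every spelling of "list of str" valid on this interpreter."""
    annotations = [List[str]]
    if sys.version_info >= (3, 9):
        # list[str] raises TypeError on older versions, so only build it here.
        annotations.append(list[str])
    return annotations


def is_list_of_str(annotation) -> bool:
    """Check the annotation structurally rather than by identity."""
    return get_origin(annotation) is list and get_args(annotation) == (str,)


results = [is_list_of_str(a) for a in list_of_str_annotations()]
```

Checking via `get_origin` and `get_args` treats `List[str]` and `list[str]` the same way, so a single code path can serve both older and newer interpreters.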

skrawcz force-pushed the add_spark_pdf_summarizer branch from 3297b3f to 6460b2d August 19, 2023 21:23

skrawcz merged commit fd49e9c into main Aug 19, 2023

skrawcz deleted the add_spark_pdf_summarizer branch August 19, 2023 21:26
