Add spark pdf summarizer example #268

Merged
skrawcz merged 3 commits into main from add_spark_pdf_summarizer
Aug 19, 2023

@skrawcz (Collaborator) commented Aug 14, 2023

Adds an example showing how to run the PDF summarizer on Spark.

E.g. given a table, run the summarizer as a series of row-based UDFs.
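
To make "a series of row-based UDFs" concrete, here is a minimal sketch in plain Python rather than Spark. It assumes each step depends only on the values of one row, so the chain can be applied row by row. The function names, the `raw_text` column, and the toy "summarizer" are hypothetical stand-ins, not the example's actual code.

```python
from typing import Dict, List


def chunk_text(raw_text: str, chunk_size: int) -> List[str]:
    """Split one document's text into fixed-size chunks."""
    return [raw_text[i : i + chunk_size] for i in range(0, len(raw_text), chunk_size)]


def summarize_chunks(chunks: List[str]) -> str:
    """Toy stand-in for an LLM call: keep the first word of each chunk."""
    return " ".join(chunk.split()[0] for chunk in chunks if chunk.split())


def summarize_row(row: Dict[str, str], chunk_size: int = 20) -> Dict[str, str]:
    """Apply the chain of per-row steps and return the row with a new column."""
    chunks = chunk_text(row["raw_text"], chunk_size)
    return {**row, "summary": summarize_chunks(chunks)}


# A "table" is modeled here as a list of rows (hypothetical data).
table = [
    {"pdf_name": "a.pdf", "raw_text": "Spark runs functions over rows of a table."},
    {"pdf_name": "b.pdf", "raw_text": "Short text."},
]
summarized = [summarize_row(row) for row in table]
```

Because no step looks at any other row, the same functions could in principle be applied to one request at a time or to every row of a table.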

Changes

  • adds example
  • extends some minor typing support for spark

How I tested this

  • runs locally

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

We need to handle things like list[str] so that we can
handle the pdf_summarizer code.

Also, the code didn't handle optional dependencies; now it does.
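
As a rough illustration of both points, the sketch below maps a Python annotation such as `list[str]` to a Spark DDL type string and guards an optional import. The helper `to_spark_ddl`, the `_PRIMITIVES` table, and the `HAS_PYSPARK` flag are hypothetical names chosen for this sketch; they are not the typing support added in this PR.

```python
import sys
import typing
from typing import Any, List

# Optional dependency: the module still imports if pyspark is not installed.
try:
    import pyspark  # noqa: F401

    HAS_PYSPARK = True
except ImportError:
    HAS_PYSPARK = False

# Hypothetical mapping from Python primitives to Spark DDL type names.
_PRIMITIVES = {str: "string", int: "bigint", float: "double", bool: "boolean"}


def to_spark_ddl(annotation: Any) -> str:
    """Translate a (possibly nested) list annotation into a DDL type string."""
    if typing.get_origin(annotation) is list:
        args = typing.get_args(annotation)
        if len(args) != 1:
            raise ValueError(f"List annotation needs one element type: {annotation!r}")
        return f"array<{to_spark_ddl(args[0])}>"
    if annotation in _PRIMITIVES:
        return _PRIMITIVES[annotation]
    raise ValueError(f"Unsupported annotation: {annotation!r}")


# typing.List[str] works on all supported versions; the built-in generic
# list[str] can only be evaluated on Python 3.9 and later.
annotations = [List[str]]
if sys.version_info >= (3, 9):
    annotations.append(list[str])

ddl_types = [to_spark_ddl(a) for a in annotations]
```

Both spellings resolve to the same origin type (`list`), which is why a single branch can cover them.
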
@skrawcz changed the title from "WIP: Add spark pdf summarizer" to "Add spark pdf summarizer example" Aug 18, 2023
@skrawcz marked this pull request as ready for review August 18, 2023 04:04
@skrawcz force-pushed the add_spark_pdf_summarizer branch 3 times, most recently from 6eaedab to 1807e6d August 18, 2023 05:06
@skrawcz requested a review from zilto August 18, 2023 05:12
So that we get some coverage going in case we accidentally break something.

Makes sure that we handle Python versions <3.9 appropriately too.
This example shows that you can run the same code in a FastAPI backend,
and then turn around and run it on Spark as well. This works because flows like
PDF summarization are essentially “map” transformations, so they can be easily
ported and modeled in Spark.

The magic to run on Spark happens in the SparkUDFGraphAdapter.

Otherwise, this adds and adjusts code to make the summarization.py code
work on Spark (one tweak was needed). We add everything someone needs to run the example, including a script and a notebook.

Squashed commit of:

Adds notebook to pdf spark example

So that people can use that to get started too. (+4 squashed commits)
Squashed commits:
[7db102c] Removes unused code.
[ce2abec] Adds missing requirements.txt for spark PDF example

So that people know what to install.
[eae3edd] Adds README to spark PDF post

Renames directory to run_on_spark to make it clear.

Adds pointer to main README to spark one.
[154e3ce] Adds spark code to run PDF Summarizer

We only needed one minor change to the original code:

* we have to ensure that "columns" and values that can be bound appear in the
correct order in a function's arguments. So we had to reorder one function's
arguments for Spark to work.
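
One way such an ordering constraint can arise is sketched below in plain Python. It assumes bound values are attached by name with `functools.partial` and column values are then passed positionally; that mechanism is an assumption made for illustration, not a description of the adapter's internals. All function names here are hypothetical.

```python
import functools


def summarize_bound_first(max_len: int, text: str) -> str:
    """Bound value first, column second: breaks under positional column passing."""
    return text[:max_len]


def summarize_column_first(text: str, max_len: int) -> str:
    """Column first, bound value second: works under positional column passing."""
    return text[:max_len]


def call_with_column(func, column_value: str) -> str:
    """Bind the non-column input by name, then pass the column value positionally."""
    bound = functools.partial(func, max_len=5)
    return bound(column_value)


ok = call_with_column(summarize_column_first, "hello world")

try:
    call_with_column(summarize_bound_first, "hello world")
    failed = False
except TypeError:
    # The column value lands in `max_len`, which was already bound by name.
    failed = True
```

Under these assumptions, putting column arguments before bindable ones avoids the clash.
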
@skrawcz force-pushed the add_spark_pdf_summarizer branch from 3297b3f to 6460b2d August 19, 2023 21:23
@skrawcz merged commit fd49e9c into main Aug 19, 2023
@skrawcz deleted the add_spark_pdf_summarizer branch August 19, 2023 21:26