Adds s3_parallel_dataframe_load example #570

Merged
merged 1 commit into from
Nov 29, 2023

elijahbenizzy (Collaborator) commented Nov 28, 2023

This downloads data from S3 in parallel. It has a few limitations, but overall it is easy to adapt/modify. This is a hub contribution.
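For readers unfamiliar with the pattern, here is a minimal sketch of loading many objects in parallel. It is not the code in this PR: it assumes a thread pool from the standard library, and `load_in_parallel`, `fetch_one`, and `fake_fetch` are hypothetical names used only for illustration. A real `fetch_one` would download one S3 object and parse it into a dataframe.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, TypeVar

T = TypeVar("T")


def load_in_parallel(
    keys: Iterable[str],
    fetch_one: Callable[[str], T],
    max_workers: int = 8,
) -> Dict[str, T]:
    """Fetch every key concurrently and return the results keyed by object key.

    `fetch_one` is a hypothetical callable supplied by the caller, e.g. one
    that downloads a single S3 object and parses it into a dataframe.
    """
    key_list = list(keys)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each key with its result.
        results = list(pool.map(fetch_one, key_list))
    return dict(zip(key_list, results))


def fake_fetch(key: str) -> int:
    """Stand-in for a real download so the sketch runs without S3 access."""
    return len(key)


loaded = load_in_parallel(["a.csv", "bb.csv"], fake_fetch)
```

Threads suit this kind of work because each download mostly waits on the network; the per-object results could then be combined into one dataframe by the caller.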

[Summary of contribution]

For new dataflows:

Do you have the following?

  • Added a directory mapping to my GitHub username in the contrib/hamilton/contrib/user directory.
    • If my author name contains hyphens, I have replaced them with underscores.
    • If my author name starts with a number, I have prefixed it with an underscore.
    • If your author name is a Python reserved keyword, reach out to the maintainers for help.
    • Added an author.md file under my username directory and filled it out.
    • Added an __init__.py file under my username directory.
  • Added a new folder for my dataflow under my username directory.
    • Added a README.md file under my dataflow directory that follows the standard headings and is filled out.
    • Added an __init__.py file under my dataflow directory that contains the Hamilton code.
    • Added a requirements.txt under my dataflow directory that contains the required packages outside of Hamilton.
    • Added tags.json under my dataflow directory to curate my dataflow.
    • Added valid_configs.jsonl under my dataflow directory to specify the valid configurations.
    • Added a dag.png that shows one possible configuration of my dataflow.
  • I hereby acknowledge that, to the best of my ability, the code I have contributed contains correct attribution
    and notices as appropriate.
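
The author-name rules in the checklist above can be sketched as a small helper. This is only an illustration of the three stated rules, not tooling that ships with the repository; `to_author_directory` is a hypothetical name.

```python
import keyword


def to_author_directory(github_username: str) -> str:
    """Turn a GitHub username into a directory name per the checklist rules."""
    # Rule 1: hyphens are replaced with underscores.
    name = github_username.replace("-", "_")
    # Rule 2: a name starting with a number gets an underscore prefix.
    if name[:1].isdigit():
        name = "_" + name
    # Rule 3: reserved keywords have no automatic fix; the checklist says
    # to contact the maintainers, so this sketch just raises an error.
    if keyword.iskeyword(name):
        raise ValueError(f"{name!r} is a Python reserved keyword; ask the maintainers")
    return name


directory = to_author_directory("jane-doe")
```

These rules exist because the directory has to be importable as a Python package, so its name must be a valid identifier.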

For existing dataflows -- what has changed?

N/A

How I tested this

Ran locally on a custom s3 bucket

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Dataflow documentation has been updated if adding/changing functionality.

sweep-ai bot (Contributor) commented Nov 28, 2023

Apply Sweep Rules to your PR?

  • Apply: All new business logic should have corresponding unit tests.
  • Apply: Refactor large functions to be more modular.
  • Apply: Add docstrings to all functions and file headers.

skrawcz merged commit b207db7 into main Nov 29, 2023
2 checks passed
skrawcz deleted the s3-parallel-dataframe-load branch November 29, 2023 01:40