Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds example on how to chunk Hamilton documentation #721

Merged
merged 4 commits into from
Mar 5, 2024

Conversation

skrawcz
Copy link
Collaborator

@skrawcz skrawcz commented Feb 29, 2024

This uses langchain's text splitting functionality, you could easily replace this with your own logic. The DAG though is run and defined in Hamilton -- and the processing of each URL is done in parallel. You could easily farm this out to Ray or Dask as well... See the documentation for more info.

Changes

  • adds example for pulling HTML, parsing, and then chunking it "into documents".

How I tested this

  • runs locally

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

Ellipsis 🚀 This PR description was created by Ellipsis for commit 08e9f28.

Summary:

This PR adds a new example to the Hamilton library demonstrating how to chunk Hamilton documentation using a parallel processing pipeline, which can run locally, as well as with Ray, Dask, and PySpark.

Key points:

  • Added a new example module in examples/LLM_Workflows/scraping_and_chunking.
  • The module demonstrates how to chunk Hamilton documentation using a parallel processing pipeline.
  • The pipeline can run locally, as well as with Ray, Dask, and PySpark.
  • The processing of each URL is done in parallel.
  • The code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested.
  • New functions are documented (with a description, list of inputs, and expected output).
  • Placeholder code is flagged / future TODOs are captured in comments.
  • Project documentation has been updated if adding/changing functionality.

Generated with ❤️ by ellipsis.dev

@skrawcz skrawcz force-pushed the hamilton_docs_chunking branch from 38887ba to 859e881 Compare February 29, 2024 02:31
@skrawcz skrawcz marked this pull request as ready for review March 4, 2024 00:58
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested.

  • Reviewed the entire pull request up to 7c5468d
  • Looked at 208 lines of code in 3 files
  • Took 1 minute and 8 seconds to review
More info
  • Skipped 2 files when reviewing.
  • Skipped posting 1 additional comments because they didn't meet confidence threshold of 50%.
1. /examples/LLM_Workflows/scraping_and_chunking/pipeline.py:45:
  • Assessed confidence : 70%
  • Grade: 0%
  • Comment:
    Consider adding a check to ensure that 'max_urls' is less than or equal to the length of 'urls_from_sitemap'. If 'max_urls' is greater than the length of the list, this could lead to unexpected behavior.
  • Reasoning:
    The 'url' function is defined to take a list of URLs and a maximum number of URLs to process. However, there is no check to ensure that 'max_urls' is less than or equal to the length of the list. If 'max_urls' is greater than the length of the list, this could lead to unexpected behavior.

Workflow ID: wflow_OEO6o5caT6ylp3uF


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. We'll respond in a few minutes. Learn more here.

examples/LLM_Workflows/scraping_and_chunking/pipeline.py Outdated Show resolved Hide resolved
examples/LLM_Workflows/scraping_and_chunking/pipeline.py Outdated Show resolved Hide resolved
skrawcz added 3 commits March 4, 2024 14:48
This uses langchain's text splitting functionality, you
could easily replace this with your own logic. The
DAG though is run and defined in Hamilton -- and the processing
of each URL is done in parallel. You could easily farm
this out to Ray or Dask as well... See the documentation
for more info.
Since that reflects what is happening,
rather than creating embeddings.

TODOs:
 - leave notes on how this connects with embeddings.
@skrawcz skrawcz force-pushed the hamilton_docs_chunking branch from 7c5468d to fd82bc0 Compare March 4, 2024 22:59
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me!

  • Performed an incremental review on fd82bc0
  • Looked at 13 lines of code in 1 files
  • Took 1 minute and 48 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 1 additional comments because they didn't meet confidence threshold of 50%.
1. /examples/LLM_Workflows/scraping_and_chunking/README.md:2:
  • Assessed confidence : 50%
  • Comment:
    Consider adding a brief explanation or a link to further information about 'RAG dataflow' for users who may not be familiar with this term.
  • Reasoning:
    The README file is concise and provides a clear overview of the purpose and usage of the new module. However, it would be helpful to include a brief explanation of what 'RAG dataflow' is, as not all users may be familiar with this term.

Workflow ID: wflow_jmScUJsssDYRhwty


Not what you expected? You can customize the content of the reviews using rules. Learn more here.

To make it clearer how things work and operate.
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me!

  • Performed an incremental review on e79193c
  • Looked at 461 lines of code in 7 files
  • Took 1 minute and 36 seconds to review
More info
  • Skipped 2 files when reviewing.
  • Skipped posting 1 additional comments because they didn't meet confidence threshold of 50%.
1. examples/LLM_Workflows/scraping_and_chunking/README.md:1:
  • Assessed confidence : 0%
  • Comment:
    The PR adds a new example to the Hamilton library that demonstrates how to chunk Hamilton documentation using a parallel processing pipeline. The pipeline can run locally, as well as with Ray, Dask, and PySpark. The code passed the pre-commit check & code is left cleaner/nicer than when first encountered. Any change in functionality is tested. New functions are documented (with a description, list of inputs, and expected output). Placeholder code is flagged / future TODOs are captured in comments. Project documentation has been updated if adding/changing functionality. The PR adheres to the principles and rules set for the review. No violations or bugs were found in the PR.
  • Reasoning:
    The PR adds a new example to the Hamilton library that demonstrates how to chunk Hamilton documentation using a parallel processing pipeline. The pipeline can run locally, as well as with Ray, Dask, and PySpark. The code passed the pre-commit check & code is left cleaner/nicer than when first encountered. Any change in functionality is tested. New functions are documented (with a description, list of inputs, and expected output). Placeholder code is flagged / future TODOs are captured in comments. Project documentation has been updated if adding/changing functionality. The PR adheres to the principles and rules set for the review. No violations or bugs were found in the PR.

Workflow ID: wflow_ZG9cycqL4iZeFfTk


Not what you expected? You can customize the content of the reviews using rules. Learn more here.

Copy link
Collaborator

@elijahbenizzy elijahbenizzy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great -- don't have the full time now for a review but glanced over a few times and let's ship.

@skrawcz skrawcz merged commit 08e9f28 into main Mar 5, 2024
23 checks passed
@skrawcz skrawcz deleted the hamilton_docs_chunking branch March 5, 2024 06:48
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me!

  • Performed an incremental review on 08e9f28
  • Looked at 531 lines of code in 7 files
  • Took 2 minutes and 0 seconds to review
More info
  • Skipped 2 files when reviewing.
  • Skipped posting 1 additional comments because they didn't meet confidence threshold of 50%.
1. examples/LLM_Workflows/scraping_and_chunking/doc_pipeline.py:1:
  • Assessed confidence : 0%
  • Comment:
    The code is well-written and follows best practices. The functions are well-documented and the logic is clear. Good job!
  • Reasoning:
    The PR adds a new example to the Hamilton library that demonstrates how to chunk Hamilton documentation using a parallel processing pipeline. The pipeline can run locally, as well as with Ray, Dask, and PySpark. The code passed the pre-commit check & code is left cleaner/nicer than when first encountered. Any change in functionality is tested. New functions are documented (with a description, list of inputs, and expected output). Placeholder code is flagged / future TODOs are captured in comments. Project documentation has been updated if adding/changing functionality. The PR seems to be following the best practices and there are no obvious bugs or issues. The code is clean, well-documented, and follows the DRY principle. There are no secrets or credentials in the code, and the code does not log sensitive data. The code follows the Single Responsibility Principle and the function and method naming follows consistent patterns. The PR does not have WIP in the title or body description. The PR also suggests places to add to the sphinx documentation under docs/ where appropriate for each pull request.

Workflow ID: wflow_yIFkZM4uRy2ZXD4W


Not what you expected? You can customize the content of the reviews using rules. Learn more here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants