Adds example on how to chunk Hamilton documentation #721
Conversation
Force-pushed from 38887ba to 859e881
❌ Changes requested.
- Reviewed the entire pull request up to 7c5468d
- Looked at 208 lines of code in 3 files
- Took 1 minute and 8 seconds to review
- Skipped 2 files when reviewing
- Skipped posting 1 additional comment because it didn't meet the confidence threshold of 50%
1. /examples/LLM_Workflows/scraping_and_chunking/pipeline.py:45
   - Assessed confidence: 70%
   - Grade: 0%
   - Comment: Consider adding a check to ensure that 'max_urls' is less than or equal to the length of 'urls_from_sitemap'. If 'max_urls' is greater than the length of the list, this could lead to unexpected behavior.
   - Reasoning: The 'url' function takes a list of URLs and a maximum number of URLs to process, but it never validates 'max_urls' against the length of that list.
Workflow ID: wflow_OEO6o5caT6ylp3uF
Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. We'll respond in a few minutes.
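A minimal sketch of the guard Ellipsis suggests, assuming the example's `url` function yields URLs for Hamilton's parallel execution (the function name and parameters follow the comment above; the body is illustrative, not the PR's actual code):

```python
from hamilton.htypes import Parallelizable


def url(urls_from_sitemap: list[str], max_urls: int) -> Parallelizable[str]:
    """Yield each URL to process, capping at max_urls."""
    # Clamp rather than raise: "at most max_urls" matches the intent,
    # and it makes the behavior explicit when max_urls exceeds the list length.
    count = min(max_urls, len(urls_from_sitemap))
    for u in urls_from_sitemap[:count]:
        yield u
```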
This uses langchain's text splitting functionality; you could easily replace it with your own logic. The DAG, though, is defined and run in Hamilton, and the processing of each URL is done in parallel. You could easily farm this out to Ray or Dask as well -- see the documentation for more info.
Since that reflects what is happening, rather than creating embeddings. TODOs:
- leave notes on how this connects with embeddings.
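For readers of the thread, a condensed sketch of the pattern described above: langchain does the splitting, Hamilton defines and runs the DAG, and each URL fans out into its own parallel branch. All names, parameters, and bodies below are illustrative assumptions, not the exact code in this PR:

```python
import requests
from langchain.text_splitter import RecursiveCharacterTextSplitter

from hamilton.htypes import Collect, Parallelizable


def url(urls_from_sitemap: list[str], max_urls: int) -> Parallelizable[str]:
    """Fan out: each yielded URL becomes its own parallel branch."""
    for u in urls_from_sitemap[: min(max_urls, len(urls_from_sitemap))]:
        yield u


def article_text(url: str) -> str:
    """Fetch one page; the PR's example does richer HTML parsing than this."""
    return requests.get(url).text


def chunked_text(article_text: str) -> list[str]:
    """Split one document -- swap in your own logic if you don't want langchain."""
    splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
    return splitter.split_text(article_text)


def collected_chunks(chunked_text: Collect[list[str]]) -> list[list[str]]:
    """Fan back in: gather every branch's chunks."""
    return list(chunked_text)
```

Because the fan-out/fan-in is expressed in the types (Parallelizable/Collect), the same module can run locally or on Ray/Dask by swapping the driver's executor.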
Force-pushed from 7c5468d to fd82bc0
👍 Looks good to me!
- Performed an incremental review on fd82bc0
- Looked at 13 lines of code in 1 file
- Took 1 minute and 48 seconds to review
- Skipped 0 files when reviewing
- Skipped posting 1 additional comment because it didn't meet the confidence threshold of 50%
1. /examples/LLM_Workflows/scraping_and_chunking/README.md:2
   - Assessed confidence: 50%
   - Comment: Consider adding a brief explanation of, or a link to more information about, 'RAG dataflow' for users who may not be familiar with the term.
   - Reasoning: The README is concise and gives a clear overview of the new module's purpose and usage, but not every reader will know what a 'RAG dataflow' is.
Workflow ID: wflow_jmScUJsssDYRhwty
Not what you expected? You can customize the content of the reviews using rules.
To make it clearer how things work.
👍 Looks good to me!
- Performed an incremental review on e79193c
- Looked at 461 lines of code in 7 files
- Took 1 minute and 36 seconds to review
- Skipped 2 files when reviewing
- Skipped posting 1 additional comment because it didn't meet the confidence threshold of 50%
1. examples/LLM_Workflows/scraping_and_chunking/README.md:1
   - Assessed confidence: 0%
   - Comment: The PR adds a new example to the Hamilton library that demonstrates how to chunk Hamilton documentation using a parallel processing pipeline. The pipeline can run locally as well as with Ray, Dask, and PySpark. The code passed the pre-commit check and is left cleaner than when first encountered; changes in functionality are tested; new functions are documented (description, inputs, expected output); placeholder code and future TODOs are captured in comments; and project documentation has been updated. The PR adheres to the principles and rules set for the review, and no violations or bugs were found.
   - Reasoning: Identical to the comment above.
Workflow ID: wflow_ZG9cycqL4iZeFfTk
Not what you expected? You can customize the content of the reviews using rules.
This looks great -- I don't have time for a full review right now, but I've glanced over it a few times; let's ship.
👍 Looks good to me!
- Performed an incremental review on 08e9f28
- Looked at 531 lines of code in 7 files
- Took 2 minutes and 0 seconds to review
- Skipped 2 files when reviewing
- Skipped posting 1 additional comment because it didn't meet the confidence threshold of 50%
1. examples/LLM_Workflows/scraping_and_chunking/doc_pipeline.py:1
   - Assessed confidence: 0%
   - Comment: The code is well written and follows best practices. The functions are well documented and the logic is clear. Good job!
   - Reasoning: The PR adds a new example demonstrating how to chunk Hamilton documentation with a parallel processing pipeline that can run locally as well as with Ray, Dask, and PySpark. The code passed the pre-commit check, is clean and well documented, and follows the DRY and Single Responsibility principles with consistent function and method naming. There are no secrets or credentials in the code, and it does not log sensitive data. The PR does not have WIP in the title or body, and it suggests places to add to the Sphinx documentation under docs/ where appropriate.
Workflow ID: wflow_yIFkZM4uRy2ZXD4W
Not what you expected? You can customize the content of the reviews using rules.
This uses langchain's text splitting functionality; you could easily replace it with your own logic. The DAG, though, is defined and run in Hamilton, and the processing of each URL is done in parallel. You could easily farm this out to Ray or Dask as well -- see the documentation for more info.
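A hedged sketch of what "farm this out" looks like with Hamilton's driver. The module name matches the doc_pipeline.py file this PR adds, the node and input names follow the illustrative sketch earlier in the thread, and the Builder/executor calls are Hamilton's standard API -- this is not necessarily the PR's run script:

```python
import doc_pipeline  # the example module this PR adds

from hamilton import driver
from hamilton.execution import executors

dr = (
    driver.Builder()
    .with_modules(doc_pipeline)
    # Parallelizable/Collect nodes require dynamic execution to be enabled.
    .enable_dynamic_execution(allow_experimental_mode=True)
    .with_remote_executor(executors.MultiThreadingExecutor(max_tasks=5))
    .build()
)
result = dr.execute(
    ["collected_chunks"],
    inputs={"urls_from_sitemap": ["https://example.com/docs/page"], "max_urls": 10},
)
```

Swapping the remote executor for Hamilton's Ray or Dask executor (under hamilton.plugins) moves the per-URL branches onto a cluster without changing the dataflow code.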
Changes
How I tested this
Notes
Checklist
Summary:
This PR adds a new example to the Hamilton library demonstrating how to chunk Hamilton documentation using a parallel processing pipeline, which can run locally, as well as with Ray, Dask, and PySpark.
Key points:
- examples/LLM_Workflows/scraping_and_chunking

Generated with ❤️ by ellipsis.dev