-
Notifications
You must be signed in to change notification settings - Fork 133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes bugs in pyspark @with_columns #730
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me!
- Reviewed the entire pull request up to 5fdc21c
- Looked at
319
lines of code in4
files - Took 1 minute and 10 seconds to review
More info
- Skipped
1
files when reviewing. - Skipped posting
1
additional comments because they didn't meet confidence threshold of50%
.
1. hamilton/plugins/h_spark.py:396
:
- Assessed confidence :
50%
- Comment:
The changes in this file seem to address the issues mentioned in the PR description. However, it would be beneficial to have tests to ensure these changes don't introduce new bugs. - Reasoning:
The changes in the h_spark.py file seem to be addressing the issues mentioned in the PR description. The changes include adding a parameter to specify required configuration, fixing a bug in the UDF logic, and changing the behavior of the 'select' in append mode to only append the columns specified in the 'select'. The changes seem logical and there are no clear violations of best practices. However, without running the code and tests, it's hard to be certain that these changes don't introduce new bugs.
Workflow ID: wflow_R8HypYCML8A27Dd9
Not what you expected? You can customize the content of the reviews using rules. Learn more here.
5fdc21c
to
ddd38df
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me!
- Performed an incremental review on ddd38df
- Looked at
359
lines of code in5
files - Took 2 minutes and 23 seconds to review
More info
- Skipped
0
files when reviewing. - Skipped posting
3
additional comments because they didn't meet confidence threshold of50%
.
1. plugin_tests/h_spark/test_h_spark.py:611
:
- Assessed confidence :
50%
- Comment:
Consider renaming the test case 'test_with_columns_generate_nodes_select_append_mode' to 'test_with_columns_generate_nodes_append_mode_with_select' for better clarity. - Reasoning:
The test case 'test_with_columns_generate_nodes_select_append_mode' is checking if the function 'with_columns' correctly generates nodes when the mode is 'append' and a 'select' parameter is provided. The test case is passing and the implementation seems correct. However, the test case name could be more descriptive to indicate that it's testing the 'append' mode with a 'select' parameter.
2. plugin_tests/h_spark/test_h_spark.py:634
:
- Assessed confidence :
50%
- Comment:
Consider renaming the test case 'test_with_columns_generate_nodes_select_mode_select' to 'test_with_columns_generate_nodes_select_mode_with_select' for better clarity. - Reasoning:
The test case 'test_with_columns_generate_nodes_select_mode_select' is checking if the function 'with_columns' correctly generates nodes when the mode is 'select' and a 'select' parameter is provided. The test case is passing and the implementation seems correct. However, the test case name could be more descriptive to indicate that it's testing the 'select' mode with a 'select' parameter.
3. hamilton/plugins/h_spark.py:396
:
- Assessed confidence :
80%
- Grade:
0%
- Comment:
Consider adding a test case for the function 'create_drop_node' to ensure its correct functionality. - Reasoning:
The function 'create_drop_node' is being used in 'generate_nodes' function in 'h_spark.py' file. However, there is no direct test case for this function. It would be good to have a separate test case for this function to ensure its correct functionality.
Workflow ID: wflow_BKH1gyTNBqwBmIaW
Not what you expected? You can customize the content of the reviews using rules. Learn more here.
1. Configuration linkage was not exposed. So adding a parameter for people to specify what configuration is required. 2. There was a bug in the UDF logic, because `all(empty)` evaluates to True. So fixed that. 3. The `select` in append mode appended all possible columns. It didn't only append what was in the `select`. This changes that behavior by dropping "outputs" from the subdag that aren't in the select. We do it this way because we don't know what's in the original dataframe. 4. Adds test to check for the new logic
ddd38df
to
81fefd8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❌ Changes requested.
- Performed an incremental review on 81fefd8
- Looked at
376
lines of code in6
files - Took 3 minutes and 39 seconds to review
More info
- Skipped
0
files when reviewing. - Skipped posting
3
additional comments because they didn't meet confidence threshold of50%
.
1. hamilton/plugins/h_spark.py:396
:
- Assessed confidence :
0%
- Comment:
The fix for theall(empty)
bug is correct. Theall
function returnsTrue
for an empty iterable in Python, which is not the desired behavior in this case. The added conditionlen(pandas_annotation.values()) > 0
ensures that the iterable is not empty before callingall
. - Reasoning:
In theh_spark.py
file, theall(empty)
bug fix is correct. Theall
function returnsTrue
for an empty iterable in Python, which is not the desired behavior in this case. The added conditionlen(pandas_annotation.values()) > 0
ensures that the iterable is not empty before callingall
.
2. examples/LLM_Workflows/pdf_summarizer/run_on_spark/README.md:1
:
- Assessed confidence :
0%
- Comment:
The changes in theREADME.md
file provide clear instructions on how to run the PDF summarizer on Spark using two approaches. The instructions are detailed and include steps for installing dependencies, running the code, and setting the OPENAI_API_KEY in the environment. The addition of diagrams to illustrate the dataflow is also a good practice. - Reasoning:
The changes in theREADME.md
file provide clear instructions on how to run the PDF summarizer on Spark using two approaches. The instructions are detailed and include steps for installing dependencies, running the code, and setting the OPENAI_API_KEY in the environment. The addition of diagrams to illustrate the dataflow is also a good practice.
3. tests/resources/spark/a_dag.py:9
:
- Assessed confidence :
80%
- Grade:
0%
- Comment:
The functionc
is supposed to return a float but the operationa * 3.5 + b
could potentially return an integer ifa
andb
are both 0. This could lead to unexpected behavior. Consider ensuring that the return type is always a float. For example:
return float(a * 3.5 + b)
- Reasoning:
The functionc
ina_dag.py
is supposed to return a float but the operationa * 3.5 + b
could potentially return an integer ifa
andb
are both 0. This could lead to unexpected behavior.
Workflow ID: wflow_XYKlW9pLFqa2xB81
Want Ellipsis to fix these issues? Tag @ellipsis-dev
in a comment. We'll respond in a few minutes. Learn more here.
examples/LLM_Workflows/pdf_summarizer/run_on_spark/spark_pdf_pipeline.py
Show resolved
Hide resolved
examples/LLM_Workflows/pdf_summarizer/run_on_spark/spark_pdf_pipeline.py
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good -- made a few suggestions.
Skipped PR review on 59c28ca because no changed files had a supported extension. If you think this was in error, please contact us and we'll fix it right away. Generated with ❤️ by ellipsis.dev |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me!
- Performed an incremental review on 2e06300
- Looked at
77
lines of code in2
files - Took 1 minute and 29 seconds to review
More info
- Skipped
0
files when reviewing. - Skipped posting
1
additional comments because they didn't meet confidence threshold of50%
.
1. examples/LLM_Workflows/scraping_and_chunking/spark/spark_pipeline.py:27
:
- Assessed confidence :
50%
- Comment:
The PR description mentions that a parameter was added to specify required configurations in the pyspark @with_columns decorator. However, I don't see any changes related to this in the diff. Please confirm if the changes are in a different file or if they were accidentally omitted. - Reasoning:
The PR description mentions that a parameter was added to specify required configurations in the pyspark @with_columns decorator. However, I don't see any changes related to this in the diff. It's possible that the changes are in a file that's not included in the diff. I'll need to check the entire codebase to confirm this.
Workflow ID: wflow_qGyTljvPQXrGawaZ
Not what you expected? You can customize the content of the reviews using rules. Learn more here.
TODO:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me!
- Performed an incremental review on f47f0f6
- Looked at
11
lines of code in1
files - Took 1 minute and 57 seconds to review
More info
- Skipped
0
files when reviewing. - Skipped posting
1
additional comments because they didn't meet confidence threshold of50%
.
1. /examples/LLM_Workflows/README.md:8
:
- Assessed confidence :
0%
- Comment:
The changes in the README file are clear and concise. They provide additional information about the examples and their alternative implementations. - Reasoning:
The changes in the README file are clear and concise. They provide additional information about the examples and their alternative implementations. No issues found here.
Workflow ID: wflow_BQH86UZHConQymdH
Not what you expected? You can customize the content of the reviews using rules. Learn more here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me!
- Performed an incremental review on 6ded5f1
- Looked at
68
lines of code in1
files - Took 2 minutes and 27 seconds to review
More info
- Skipped
0
files when reviewing. - Skipped posting
1
additional comments because they didn't meet confidence threshold of50%
.
1. /examples/LLM_Workflows/scraping_and_chunking/spark/README.md:57
:
- Assessed confidence :
0%
- Comment:
The updates to the README file are clear and informative. Good job! - Reasoning:
The README file has been updated to reflect the changes made in the PR. It provides a clear explanation of the changes, why they were made, and how to use the updated code. There are no issues with this file.
Workflow ID: wflow_iJIHQ6gZL7nFZCxx
Not what you expected? You can customize the content of the reviews using rules. Learn more here.
Skipped PR review on 61582b6 because no changed files had a supported extension. If you think this was in error, please contact us and we'll fix it right away. Generated with ❤️ by ellipsis.dev |
This is close to the original, but we need to specify the set of spark safe functions. This is okay-ish I think. It just also requires that we include the pipeline in the larger DAG too... (+1 squashed commit) Squashed commits: [fe08646] Adds code to do document chunking with pyspark with_columns makes some assumptions: (a) that all functions in the with_columns subdag take in the dataframe (b) that all intermediate functions in the subdag need to become columns in the dataframe (c) it doesn't allow you to wire through inputs to the subdag We'll need to improve upon some of these to make using spark a real drop in replacement. The most annoying part right now is not being able to wire through inputs to the subdag from other functions in the DAG. That's why the html_chunker and text_chunker are outside Hamilton right now.
Adds select decorator convenience function -- that way it's clearer if that mode is wanted. Adds some more comments to explain some options better.
That way it’s easier to figure out and determine what’s going on. Adds symlink in spark to llm example (+6 squashed commits) Squashed commits: [f47f0f6] Tells people to look into examples more for ray and pyspark. [6ded5f1] Adds more documentation to the spark chunking example So that it's clearer what and why. [2e06300] Pre-commit fix [59c28ca] Removes unneeded file [d3b9123] address comments left by @skrawcz on #732 (Adds code to do document chunking with pyspark); [be79aca] address comments left by @skrawcz on #732 (Adds code to do document chunking with pyspark);
61582b6
to
6a572fc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me!
- Performed an incremental review on 6a572fc
- Looked at
115
lines of code in4
files - Took 1 minute and 45 seconds to review
More info
- Skipped
2
files when reviewing. - Skipped posting
2
additional comments because they didn't meet confidence threshold of50%
.
1. /examples/LLM_Workflows/scraping_and_chunking/spark/doc_pipeline.py:25
:
- Assessed confidence :
50%
- Comment:
Consider including the original error message in the exception to help with debugging. For example:
raise Exception(f'Failed to get URL: {url}. Original error: {str(e)}')
- Reasoning:
The PR author has added exception handling for network requests, which is a good practice. However, the error message in the exception could be more informative. Instead of just saying 'Failed to get URL', it could include the original error message to help with debugging.
2. /examples/LLM_Workflows/scraping_and_chunking/spark/spark_pipeline.py:30
:
- Assessed confidence :
50%
- Comment:
Consider including the original error message in the exception to help with debugging. For example:
raise RuntimeError(f'Failed to fetch sitemap from {sitemap_url}. Original error: {str(e)}')
- Reasoning:
The PR author has added exception handling for network requests, which is a good practice. However, the error message in the exception could be more informative. Instead of just saying 'Failed to fetch sitemap', it could include the original error message to help with debugging.
Workflow ID: wflow_8Dtkx1sailRJdo0v
Not what you expected? You can customize the content of the reviews using rules. Learn more here.
To try to steer it to be more useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me!
- Performed an incremental review on 9341d31
- Looked at
26
lines of code in1
files - Took 1 minute and 56 seconds to review
More info
- Skipped
0
files when reviewing. - Skipped posting
1
additional comments because they didn't meet confidence threshold of50%
.
1. ellipsis.yaml:1
:
- Assessed confidence :
50%
- Comment:
This new file 'ellipsis.yaml' is not mentioned in the PR description. Could you please clarify if it's intended to be part of this PR? If so, please update the PR description to include information about this file and its purpose. - Reasoning:
The PR author has added a new file 'ellipsis.yaml'. This file seems to contain configuration for the Ellipsis tool, which is used for automated PR reviews. The rules defined in this file align with the principles provided for this review task. However, this file is not mentioned in the PR description, and it's unclear whether it's intended to be part of this PR or if it was added by mistake.
Workflow ID: wflow_XFF6KDS5tePXraMC
Not what you expected? You can customize the content of the reviews using rules. Learn more here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me!
- Performed an incremental review on 02120ae
- Looked at
26
lines of code in1
files - Took 1 minute and 35 seconds to review
More info
- Skipped
0
files when reviewing. - Skipped posting
1
additional comments because they didn't meet confidence threshold of50%
.
1. ellipsis.yaml:20
:
- Assessed confidence :
0%
- Comment:
The addition of theellipsis.yaml
file is a good practice as it provides a clear set of rules and guidelines for contributors to follow. This will help maintain the quality and consistency of the codebase. However, it's important to ensure that these rules are communicated and enforced effectively. - Reasoning:
The addition of the ellipsis.yaml file is a good practice as it provides a clear set of rules and guidelines for contributors to follow. This will help maintain the quality and consistency of the codebase. However, it's important to ensure that these rules are communicated and enforced effectively.
Workflow ID: wflow_E2C7owIdPQ8Lv9hg
Not what you expected? You can customize the content of the reviews using rules. Learn more here.
all(empty)
evaluates to True. So fixed that.select
in append mode appended all possible columns. It didn't only append what was in theselect
. This changes that behavior by dropping "outputs" from the subdag that aren't in the select. We do it this way because we don't know what's in the original dataframe.Changes
How I tested this
Notes
Checklist
Summary:
This PR enhances the pyspark @with_columns decorator, fixes a bug in the UDF logic, modifies the 'select' behavior in append mode, updates the example workflows, and adds new files in the
pdf_summarizer/run_on_spark
directory.Key points:
all(empty)
was incorrectly evaluating to True.run_with_columns.py
andspark_pdf_pipeline.py
in thepdf_summarizer/run_on_spark
directory.Generated with ❤️ by ellipsis.dev