Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mitigating the Impact of Pylint's Inherent Limitations on Functionality of format.sh #4212

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

root-hbx
Copy link

@root-hbx root-hbx commented Oct 30, 2024

This PR aims to enhance the functionality of format.sh to provide more comprehensive file checks.

The origin of this improvement is as follows:

  1. The initial functionality (which included machine checks in workflows) was:
# .github/workflows/pylint.yml
pylint --load-plugins pylint_quotes sky
# format.sh
echo 'Sky Pylint:'
pylint --load-plugins pylint_quotes sky

With sky as the only parameter, according to pylint’s check guidelines, it only checked folders within sky/ that contained an __init__.py file, along with their internal files.

  1. In PR4184, @cg505 proposed the idea of “limiting pylint to only files that have changed from master,” which led to the current version of format.sh as shown below:
# format.sh
if [[ "$1" == '--files' ]]; then
    # If --files is passed, filter to files within sky/ and pass to pylint.
    pylint "${PYLINT_FLAGS[@]}" "${@:2}"
elif [[ "$1" == '--all' ]]; then
    # Pylint entire sky directory.
    pylint "${PYLINT_FLAGS[@]}" sky
else
    # Pylint only files in sky/ that have changed in last commit.
    changed_files=$(git diff --name-only --diff-filter=ACM "$MERGEBASE" -- 'sky/*.py' 'sky/*.pyi')
    if [[ -n "$changed_files" ]]; then
        echo "$changed_files" | tr '\n' '\0' | xargs -0 pylint "${PYLINT_FLAGS[@]}"
    else
        echo 'Pylint skipped: no files changed in sky/.'
    fi
fi
  1. Based on the above, I believe the --all option should check all subdirectories and internal files within sky/. However, due to pylint’s limitations, it can only check subdirectories that contain an __init__.py file, which does not meet our requirements. Therefore, I implemented the following functionality in my PR:
echo 'Sky Pylint:'
if [[ "$1" == '--files' ]]; then
    # If --files is passed, filter to files within sky/ and pass to pylint.
    pylint "${PYLINT_FLAGS[@]}" "${@:2}"
elif [[ "$1" == '--all' ]]; then
    # Pylint entire sky directory.
    pylint "${PYLINT_FLAGS[@]}" $(git ls-files 'sky/*.py' 'sky/*.pyi')
elif [[ "$1" == '--legacy' ]]; then
    # Pylint entire sky directory, ignoring all dirs without a __init__.py file.
    pylint "${PYLINT_FLAGS[@]}" sky
else
    # Pylint only files in sky/ that have changed in last commit.
    changed_files=$(git diff --name-only --diff-filter=ACM "$MERGEBASE" -- 'sky/*.py' 'sky/*.pyi')
    if [[ -n "$changed_files" ]]; then
        echo "$changed_files" | tr '\n' '\0' | xargs -0 pylint "${PYLINT_FLAGS[@]}"
    else
        echo 'Pylint skipped: no files changed in sky/.'
    fi
fi

We use --legacy to signify the original checking method, and we introduce --all to check all files (using git ls-files 'sky/*.py' 'sky/*.pyi' as the target files, avoiding pylint’s inherent limitations).

TL;DR After this change, we can use format.sh like:

  • bash format --all: check all the files in sky/
  • bash format --legacy: perform checks as originally designed
  • bash format: check only files that have changed from master

Furthermore, if this PR is deemed appropriate, we may need to consider updating .github/workflows/pylint.yml to align with the enhanced functionality in format.sh.

Currently, it uses pylint --load-plugins pylint_quotes sky. If we wanna check all, we need to replace it with pylint --load-plugins pylint_quotes $(git ls-files 'sky/*.py' 'sky/*.pyi').

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@root-hbx
Copy link
Author

In fact, employing the --all option to check all files introduces certain drawbacks, as it increases the workload significantly, resulting in longer inspection times.

Consequently, we must strike a balance between "comprehensive checks" and "manageable workload." This is why I did not directly modify .github/workflows/pylint.yml in the code of this PR, but instead made a brief mention of it in the aforementioned discussion.

Looking forward to any insights and guidance provided :)

@root-hbx
Copy link
Author

The motivation for this PR arose tonight when I ran bash format.sh as usual and encountered errors reported by pylint:

Success: no issues found in 231 source files
Sky Pylint:
************* Module setup
sky/setup_files/setup.py:161:0: C0301: Line too long (88/80) (line-too-long)
sky/setup_files/setup.py:176:0: C0301: Line too long (100/80) (line-too-long)
sky/setup_files/setup.py:177:0: C0301: Line too long (96/80) (line-too-long)
sky/setup_files/setup.py:180:0: C0301: Line too long (112/80) (line-too-long)
sky/setup_files/setup.py:181:0: C0301: Line too long (113/80) (line-too-long)
sky/setup_files/setup.py:183:0: C0301: Line too long (112/80) (line-too-long)
sky/setup_files/setup.py:184:0: C0301: Line too long (113/80) (line-too-long)
sky/setup_files/setup.py:189:0: C0301: Line too long (86/80) (line-too-long)
sky/setup_files/setup.py:196:0: C0301: Line too long (107/80) (line-too-long)
sky/setup_files/setup.py:225:0: C0301: Line too long (119/80) (line-too-long)
sky/setup_files/setup.py:248:0: C0301: Line too long (109/80) (line-too-long)
sky/setup_files/setup.py:180:0: C4001: Invalid string quote ", should be ' (invalid-string-quote)
sky/setup_files/setup.py:181:0: C4001: Invalid string quote ", should be ' (invalid-string-quote)
sky/setup_files/setup.py:183:0: C4001: Invalid string quote ", should be ' (invalid-string-quote)
sky/setup_files/setup.py:184:0: C4001: Invalid string quote ", should be ' (invalid-string-quote)

These errors are not related to the code I modified, prompting me to examine the logic in format.sh.

I discovered that under the initial setting of pylint --load-plugins pylint_quotes sky, these issues were not detected. However, following the implementation of #4184, they became identifiable.

Upon further investigation, I found that pylint seems not to recurse into directories that do not have an __init__.py file (i.e., namespace packages) and are not included in sys.path or PYTHONPATH.

Thus, neither the original code nor the --all option in #4184 can check all files, which has been validated.

Therefore, we need to provide users with three distinct checking formats:

  1. Comprehensive format checking, introduced as the new --all in this PR.
  2. To prevent excessive workload, we can maintain the original check using --legacy, ensuring backward compatibility.
  3. By omitting any parameters, the script will automatically check files that differ from the remote repository, as suggested in [dev] restrict pylint to changed files #4184.

These are my preliminary thoughts, and I would truly appreciate your guidance.

@Michaelvll
Copy link
Collaborator

Instead of adding another option, how about we just fix those lint issues when sky/ is used, and make the CI use sky/ as well?

@cg505
Copy link
Collaborator

cg505 commented Oct 30, 2024

Yes, I think ideally we can use your ls-files approach for --all, and hopefully we don't need to use --legacy at all.
It does require fixing the existing lint issues that were not being caught before.

@root-hbx What is the performance of --all vs --legacy with your changes? You mentioned that it is slower, but by how much?

@cg505
Copy link
Collaborator

cg505 commented Oct 30, 2024

There are a couple of other issues with format.sh that I'm now noticing.

  1. If we are up-to-date with origin/master, then changed_files will be empty. But we probably just want to default to the --all behavior in this case.
  2. MERGEBASE is defined inside format_changed, but we assume it is defined outside of that function. We should clean this up somehow.

@root-hbx If you're interested in working on these, happy to support these fixes as well.

@root-hbx
Copy link
Author

Thank you for your suggestions. @Michaelvll

It appears that addressing lint issues solely when sky/ is used and integrating this into the CI might be challenging due to certain limitations within pylint itself, which may not align closely with our primary objectives at the moment. Maybe we could wait for pylint to resolve this issue and then make further optimizations. What do you think?

Additionally, another consideration is that running full file checks in the CI could potentially lead to some confusion for current developers. For example, although their code may previously have received a perfect score of 10/10 in local tests, they could now encounter new issues under the revised CI checks, as highlighted in my previous points.

@root-hbx
Copy link
Author

Thanks for the fix :) @cg505

Current reason for using --all being slower than --legacy is that the former inspects a broader range of files that the latter does not cover. For instance, regarding my current commit (be0e1a0), the data shows only a minimal performance discrepancy between the two approaches, which is listed as follows:

format.md

I believe your perspective aligns perfectly with @Michaelvll, and I think this is a super fancy idea. The optimal approach might be a manual revision of all files flagged by pylint under --all. Subsequently, we could adopt ls-files approach for --all, making --legacy obsolete.

However, a key drawback is that numerous files, previously unflagged by pylint, would require extensive manual updates. Given the substantial labor involved, do we need to proceed in this manner?

@root-hbx
Copy link
Author

There are a couple of other issues with format.sh that I'm now noticing.

  1. If we are up-to-date with origin/master, then changed_files will be empty. But we probably just want to default to the --all behavior in this case.
  2. MERGEBASE is defined inside format_changed, but we assume it is defined outside of that function. We should clean this up somehow.

@root-hbx If you're interested in working on these, happy to support these fixes as well.

LGTM! If you can resolve these issues, it would make this check much more robust (especially the first point you mentioned). Looking forward to your fix! If there’s anything I can help with, feel free to reach out :D

Copy link
Collaborator

@cg505 cg505 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One fix on this change, and then let's merge it.

Regarding the other fixes discussed, I'll make them in a separate PR.

@@ -129,6 +129,9 @@ if [[ "$1" == '--files' ]]; then
pylint "${PYLINT_FLAGS[@]}" "${@:2}"
elif [[ "$1" == '--all' ]]; then
# Pylint entire sky directory.
pylint "${PYLINT_FLAGS[@]}" $(git ls-files 'sky/*.py' 'sky/*.pyi')
elif [[ "$1" == '--legacy' ]]; then
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you make sure that this flag is handled in the yapf logic as well?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I forgot to handle the --legacy flag for yapf. If yapf can ensure full coverage, then the modification should look like this?

## This flag formats individual files. --files *must* be the first command line
## arg to use this option.
if [[ "$1" == '--files' ]]; then
   format "${@:2}"
   # If `--all` is passed, then any further arguments are ignored and the
   # entire python directory is formatted.
elif [[ "$1" == '--all' ]]; then
   format_all
elif [[ "$1" == '--legacy' ]]; then
   format_all
else
   # Format only the files that changed in last commit.
   format_changed
fi
echo 'SkyPilot yapf: Done'

The reason for this modification is that pylint previously used --all but couldn’t check all files due to its own limitations. If yapf can check all files when --all is specified originally, then in the new version, the behavior for --legacy and --all should be identical. Do you think that would work?

@root-hbx
Copy link
Author

@cg505 After my testing and verification, yapf does not raise any issues. Both --all and --legacy work well in current format tests(yapf / mypy / pylint). PTAL!

@Michaelvll
Copy link
Collaborator

Thank you for your suggestions. @Michaelvll

It appears that addressing lint issues solely when sky/ is used and integrating this into the CI might be challenging due to certain limitations within pylint itself, which may not align closely with our primary objectives at the moment. Maybe we could wait for pylint to resolve this issue and then make further optimizations. What do you think?

Additionally, another consideration is that running full file checks in the CI could potentially lead to some confusion for current developers. For example, although their code may previously have received a perfect score of 10/10 in local tests, they could now encounter new issues under the revised CI checks, as highlighted in my previous points.

I am confused. What issue are you referring to for pylint? Why can't we just fix those new errors showing by the new way we call pylint? I don't understand why we have to add a new argument --legacy.

@root-hbx
Copy link
Author

Thank you for your suggestions. @Michaelvll
It appears that addressing lint issues solely when sky/ is used and integrating this into the CI might be challenging due to certain limitations within pylint itself, which may not align closely with our primary objectives at the moment. Maybe we could wait for pylint to resolve this issue and then make further optimizations. What do you think?
Additionally, another consideration is that running full file checks in the CI could potentially lead to some confusion for current developers. For example, although their code may previously have received a perfect score of 10/10 in local tests, they could now encounter new issues under the revised CI checks, as highlighted in my previous points.

I am confused. What issue are you referring to for pylint? Why can't we just fix those new errors showing by the new way we call pylint? I don't understand why we have to add a new argument --legacy.

In the original check, we used pylint --load-plugins pylint_quotes sky/ with the --all option. However, this command only checks subfolders containing an __init__.py file, rather than examining all files within sky/. This limitation is due to pylint's fault, as documented here.

In this pull request, we address this by running pylint --load-plugins pylint_quotes $(git ls-files 'sky/*.py' 'sky/*.pyi') when the --all option is selected. This approach ensures that all files under sky/ are checked. Additionally, we provide a --legacy option, which has the same behavior like previous --all option.

BTW, we can also change the commands in CI from pylint --load-plugins pylint_quotes sky/ into pylint --load-plugins pylint_quotes $(git ls-files 'sky/*.py' 'sky/*.pyi'), which will truly check every files under sky/.

As your previous suggestion, we can remove the --legacy option and retain only the enhanced --all functionality if you agree with this approach :)

cg505
cg505 previously approved these changes Oct 31, 2024
Copy link
Collaborator

@cg505 cg505 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you @root-hbx !

@cg505 cg505 dismissed their stale review October 31, 2024 17:32

I think we will hold off merging this for now until there is a chance to clean up the all the pylint issues a bit more.

@root-hbx
Copy link
Author

Yes, I think this works @cg505. Once all pylint errors are resolved, we can merge everything together (--legacy option will be removed then).

This way, we address all historical issues within this PR and plan to introduce the new features you mentioned in a subsequent PR.

@root-hbx
Copy link
Author

root-hbx commented Nov 1, 2024

Here, we introduce an enhancement for the --all option:

With this update, the --all option allows users to specify file or directory paths for format.sh to ignore manually.

  • For the --all option, we can use EXCLUDE_PATHS in format.sh to specify files or directories to exclude.
  • For the --legacy option, exclusions will continue to be handled by ignore= in .pylintrc, the same as previous --all option.

Why is this enhancement necessary?

As previously highlighted by @cg505, a huge number of historical pylint issues remain unresolved, which are not pertinent to our primary objectives (last time they were modified was a looooooong time ago). I think manually addressing these nits would be time-consuming and offer little meaningful benefit.

This functionality will enable us to bypass these legacy issues temporarily, focusing the --all option on enforcing code quality in the current and upcoming development.

Simply put:

Let the past be the past; we’ll use --all to enforce code standards moving forward.

Do you think this way of "bypassing the existing excess issues" could work? I would greatly appreciate any suggestions or guidance you could offer. @Michaelvll @cg505

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants