Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update individual rules to take advantage of core rule processing changes #3041

Merged

Conversation

barrywhart
Copy link
Member

@barrywhart barrywhart commented Apr 7, 2022

  • Update L009 and L050 to only be called once (parse tree root)
  • Update rules 9, 10, 14, 30, 40, 50, 63 to run in the post linter phase.
  • Update docs for rule authors to cover the new iteration options

Brief summary of the change made

Fixes #3034, #3035, #3037

Are there any other side effects of this change that we should be aware of?

Pull Request checklist

  • Please confirm you have completed any of the necessary steps below.

  • Included test cases to demonstrate any code changes, which may be one or more of the following:

    • .yml rule test cases in test/fixtures/rules/std_rule_cases.
    • .sql/.yml parser test cases in test/fixtures/dialects (note YML files can be auto generated with tox -e generate-fixture-yml).
    • Full autofix test cases in test/fixtures/linter/autofix.
    • Other.
  • Added appropriate documentation for the change.

  • Created GitHub issues for any relevant followup/future enhancements if appropriate.

@barrywhart barrywhart marked this pull request as draft April 7, 2022 21:56
@codecov
Copy link

codecov bot commented Apr 8, 2022

Codecov Report

Merging #3041 (e28b033) into main (bbb8be7) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main     #3041   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          164       164           
  Lines        12106     12150   +44     
=========================================
+ Hits         12106     12150   +44     
Impacted Files Coverage Δ
src/sqlfluff/rules/L004.py 100.00% <ø> (ø)
src/sqlfluff/core/parser/segments/base.py 100.00% <100.00%> (ø)
src/sqlfluff/core/rules/base.py 100.00% <100.00%> (ø)
src/sqlfluff/rules/L001.py 100.00% <100.00%> (ø)
src/sqlfluff/rules/L002.py 100.00% <100.00%> (ø)
src/sqlfluff/rules/L003.py 100.00% <100.00%> (ø)
src/sqlfluff/rules/L005.py 100.00% <100.00%> (ø)
src/sqlfluff/rules/L009.py 100.00% <100.00%> (ø)
src/sqlfluff/rules/L010.py 100.00% <100.00%> (ø)
src/sqlfluff/rules/L011.py 100.00% <100.00%> (ø)
... and 10 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bbb8be7...e28b033. Read the comment docs.

@barrywhart barrywhart changed the title Refactor rule segment iteration for flexibility and speed Refactor core rule processing for flexibility and speed Apr 8, 2022
@@ -506,6 +507,10 @@ def lint_fix_parsed(
)

for crawler in progress_bar_crawler:
# Performance: After first loop pass, skip rules that don't do fixes.
if fix and loop > 0 and not is_fix_compatible(crawler):
continue
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About 1/4 of the rules aren't fix compatible. But we were executing them on every linter pass. Wasteful!

@@ -416,6 +416,48 @@ def raw_segments(self):
)


class CrawlBehavior:
Copy link
Member Author

@barrywhart barrywhart Apr 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted segment traversal code from BaseRule.crawl() into a separate class. This makes it easier to vary iteration strategies with less risk of breaking something else.

@tunetheweb
Copy link
Member

Nice improvement on "Rules critical errors tests" already! Maybe this fixes #3034 sufficiently that we can close that?

@barrywhart
Copy link
Member Author

@tunetheweb:

Nice improvement on "Rules critical errors tests" already! Maybe this fixes #3034 sufficiently that we can close that?

Oh, nice! I hadn't noticed. I updated the PR description to say "fixes" that issue. 👍

fname=fname,
fix=fix,
templated_file=templated_file,
for phase in ["main", "post"]:
Copy link
Member Author

@barrywhart barrywhart Apr 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use "Hide whitespace" to review the changes in this section, since it adds a new loop and an additional level of indentation.

@@ -372,7 +391,7 @@ class FunctionalRuleContext:
def __init__(self, context: RuleContext):
self.context = context

@cached_property
Copy link
Member Author

@barrywhart barrywhart Apr 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed caching of most properties now that we are reusing RuleContext objects


line_no = memory.line_no
target_line_no = cached_line_count + 1
for idx, elem in enumerate(raw_stack[memory.start_process_raw_idx :]):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some unrelated, specific tuning of L003 here. Previously, it was starting from the beginning of raw_stack every time, which caused O(n^2) performance. Now, it remembers the last line number and position in raw_stack, so it can resume scanning on the "current" line. This shows up as a lot fewer executions of these lines in the Python line profiler and a big reduction in "percent of overall function time".

Copy link
Contributor

@OTooleMichael OTooleMichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -408,7 +414,7 @@ def _eval(self, context: RuleContext) -> Optional[LintResult]:
# First non-whitespace element is our trigger
memory.trigger = segment

is_last = self.is_final_segment(context)
is_last = context.segment is context.final_segment
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Legend

@@ -43,8 +44,8 @@ def _eval(self, context: RuleContext) -> Optional[LintResult]:
We need at least one segment behind us for this to work.

"""
if len(context.raw_stack) >= 1:
cm1 = context.raw_stack[-1]
if context.raw_segment_pre is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switch to a guard here right ?

Suggested change
if context.raw_segment_pre is not None:
if context.raw_segment_pre is None:
return

@OTooleMichael
Copy link
Contributor

Im happy enough with everything going on here (especially having read your first split out first). It seems pretty "straightforward". (I could see some places to go further with the code splitting / Rule Flags) :) but all the change look good

anchor = cm1
return LintResult(anchor=anchor, fixes=[LintFix.delete(cm1)])
# Otherwise fine
"""Commas should not have whitespace directly before them."""
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The primary change to this file is switching from raw_stack to raw_segment_pre, which is faster. Also did some tidying, but no other meaningful changes. @OTooleMichael

@@ -503,12 +520,12 @@ class BaseRule:
# Lint loop / crawl behavior. When appropriate, rules can (and should)
# override these values to make linting faster.
recurse_into = True
needs_raw_stack = True
needs_raw_stack = False # False is faster & most rules don't need it
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we give more commentary as towhy a rule would need or not need raw_stack? I presume it's if you are looking at a segment's parents to find if it's nested in a particular statement? Remind me again what the difference is. between raw_stack and raw_segment_pre

Copy link
Member Author

@barrywhart barrywhart Apr 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add more comments. Briefly,raw_segment_pre is the one raw segment prior to the current one, while raw_stack is a list of every raw segment prior. The former is pretty lightweight to compute, while the latter is pretty heavyweight, as we're creating a new (and increasingly lengthy) tuple as we proceed through the file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I didn’t get that.

Is that just the parent Segment? Would that be a better name for it rather than raw_stack_prev?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's not the parent. Raw segments are (IIRC) those at the leaf of the tree, i.e. no children. Generally, these are a single keyword, operator, literal, or a meta segment like indent or dedent, like the "skin" of the tree as opposed to the "bones", I guess?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previous_raw_segement seems better as a name or raw_segment_previous.
at least prev

Comment on lines +21 to +30
def get_last_segment(segment: Segments) -> Tuple[List[BaseSegment], Segments]:
"""Returns rightmost & lowest descendant and its "parent stack"."""
parent_stack: List[BaseSegment] = []
while True:
children = segment.children()
if children:
parent_stack.append(segment[0])
segment = children.last()
else:
return parent_stack, segment
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why doesn't it use the new final_segment function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it also needs the parent stack, which is not provided by RuleContext.final_segment. I could've had RuleContext return both, but that'd be harder to document clearly, and AFAIK, this is the only rule that needs it.

@barrywhart
Copy link
Member Author

@tunetheweb: Ok, ready for final review. That is, I added the new docs to developingrules.rst. There's probably some duplication between that file and some of the code comments. Might be worth moving things around a bit? It always feels tricky to get this right.

Copy link
Member

@tunetheweb tunetheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks fine, but have some general comments for you to consider.

Comment on lines +30 to +36
``_works_on_unparsable``
^^^^^^^^^^^^^^^^^^^^^^^^
By default, `SQLFluff` calls ``_eval()`` for all segments, even "unparsable"
segments, i.e. segments that didn't match the parsing rules in the dialect.
This causes issues for some rules. If so, setting ``_works_on_unparsable``
to ``False`` tells SQLFluff not to call ``_eval()`` for unparsable segments and
their descendants.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this right? I thought default for this was false and you had to explicitly set it to true?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I thought, too, but I checked. I think maybe I (we) were confusing this with the --fix-even-unparsable thing added recently. We've got so many settings. 🤯

IIUC, the setting documented here controls whether the rule is called for unparsable segments, and --fix-even-unparsable controls whether any fixes would be applied. (Note that you can lint without fixing!)

docs/source/developingrules.rst Outdated Show resolved Hide resolved
@@ -503,12 +520,27 @@ class BaseRule:
# Lint loop / crawl behavior. When appropriate, rules can (and should)
# override these values to make linting faster.
recurse_into = True
needs_raw_stack = True
# False is faster & most rules don't need it. Rules that use it are usually
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me that the False refers to recurse_into or needs_raw_stack. I know it's the latter because of this PR but could be clearer for future.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update to mention the property name. 👍

@@ -408,7 +414,7 @@ def _eval(self, context: RuleContext) -> Optional[LintResult]:
# First non-whitespace element is our trigger
memory.trigger = segment

is_last = self.is_final_segment(context)
is_last = context.segment is context.final_segment
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we avoid rule writers using the wrong call in future? These seem very similarly named and therefore not crazy to think they are interchangeable.

@barrywhart barrywhart merged commit c039be5 into sqlfluff:main Apr 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Speed up Rules critical errors tests
3 participants