
Update individual rules to take advantage of core rule processing changes #3041

Merged

Changes from all commits (48 commits)
283649e
Refactor rule segment iteration for flexibility and speed
Apr 7, 2022
3aa5abd
Fix a bug that was clearing memory when _eval() returned None
Apr 8, 2022
ea80c56
Create CrawlBehavior class, fix broken test
Apr 8, 2022
d5fe681
Update rule L009 to set recurse_into = False
Apr 8, 2022
1488c5b
After first linter loop pass, skip rules that don't do fixes
Apr 8, 2022
de85ca7
Reduce number of RuleContext objects, compute siblings_pre/post on de…
Apr 8, 2022
d138701
Add "raw_segment_pre" as an alternative to "raw_stack"
Apr 8, 2022
abcca8a
Coverage
Apr 8, 2022
78e42e8
Rules declare whether they need context.raw_stack
Apr 8, 2022
7e37250
Merge branch 'main' into bhart-issue_3037_rule_iteration_refactor
Apr 8, 2022
b426845
Implement linter phases so post-processing rules only run once
Apr 8, 2022
4056496
Merge branch 'main' into bhart-issue_3037_rule_iteration_refactor
barrywhart Apr 8, 2022
72e3bcd
Tidy, simplify
Apr 8, 2022
a723852
Update L050 to use recurse_into = False
Apr 8, 2022
5bc2b3d
Tidying
Apr 8, 2022
12141a9
Merge branch 'main' into bhart-issue_3037_rule_iteration_refactor
Apr 8, 2022
2b11c08
Tune L003 a bit
Apr 8, 2022
3d11b6c
Refactor L003 for performance, add RuleContext.final_segment property
Apr 8, 2022
d3f342e
Merge branch 'main' into bhart-issue_3037_rule_iteration_refactor
barrywhart Apr 8, 2022
6111998
Comments, coverage
Apr 8, 2022
b9f9e74
Merge branch 'bhart-issue_3037_rule_iteration_refactor' of https://gi…
Apr 8, 2022
765408e
Merge branch 'main' into bhart-issue_3037_rule_iteration_refactor
barrywhart Apr 9, 2022
2408082
Refactor core rule processing for flexibility and speed
Apr 9, 2022
bf7b98f
Merge branch 'main' into bhart-issue_3037_rule_iteration_refactor
Apr 9, 2022
c70059d
Merge branch 'bhart-issue_3037_rule_iteration_refactor' of https://gi…
Apr 9, 2022
8fc64c7
Only "main" phase when linting (as opposed to fixing)
Apr 9, 2022
2b2cc1f
Merge branch 'main' into bhart-issues_3035_3037_core_changes
barrywhart Apr 9, 2022
d636e10
Move test fixes from big PR
Apr 9, 2022
168022c
Merge branch 'bhart-issues_3035_3037_core_changes' of https://github.…
Apr 9, 2022
7c32cd4
PR review
Apr 9, 2022
56102c8
Simplify how we detect first linter pass and track lint violations
Apr 9, 2022
dc37633
Coverage
Apr 9, 2022
cedbbe1
Apply suggestions from code review
barrywhart Apr 9, 2022
52eeebc
PR review
Apr 9, 2022
9b7deb9
Merge branch 'main' into bhart-issues_3035_3037_core_changes
Apr 9, 2022
c4f5f8b
Merge branch 'bhart-issues_3035_3037_core_changes' of https://github.…
Apr 9, 2022
23780ad
Merge branch 'bhart-issues_3035_3037_core_changes' into bhart-issue_3…
Apr 9, 2022
d93f4f2
PR review
Apr 9, 2022
571b209
Merge branch 'main' into bhart-issue_3037_rule_iteration_refactor
Apr 9, 2022
93efb1e
Comments, tidying
Apr 9, 2022
5f11322
Update rule development docs with some of the new (and old) options
Apr 9, 2022
a220053
Add more comments about needs_raw_stack
Apr 9, 2022
579ac9b
Merge branch 'main' into bhart-issue_3037_rule_iteration_refactor
barrywhart Apr 9, 2022
bfef962
Merge branch 'main' into bhart-issue_3037_rule_iteration_refactor
barrywhart Apr 10, 2022
66a0507
Merge branch 'main' into bhart-issue_3037_rule_iteration_refactor
Apr 11, 2022
e0d3af6
PR review
Apr 11, 2022
a51ff1b
Apply suggestions from code review
barrywhart Apr 11, 2022
e28b033
Merge branch 'bhart-issue_3037_rule_iteration_refactor' of https://gi…
Apr 11, 2022
69 changes: 64 additions & 5 deletions docs/source/developingrules.rst
@@ -3,11 +3,70 @@
Developing Rules
================

`Rules` in `SQLFluff` are implemented as `crawlers`. These are entities
which work their way through the parsed structure of a query to evaluate
a particular rule or set of rules. The intent is that the definition of
each specific rule should be really streamlined and only contain the logic
for the rule itself, with all the other mechanics abstracted away.
`Rules` in `SQLFluff` are implemented as classes inheriting from ``BaseRule``.
SQLFluff crawls through the parse tree of a SQL file, calling the rule's
``_eval()`` function for each segment in the tree. For many rules, this allows
the rule code to be really streamlined and only contain the logic for the rule
itself, with all the other mechanics abstracted away.

Traversal Options
-----------------

``recurse_into``
^^^^^^^^^^^^^^^^
Some rules are a poor fit for the simple traversal pattern described above.
Typical reasons include:

* The rule only looks at a small portion of the file (e.g. the beginning or
  end).
* The rule needs to traverse the parse tree in a non-standard way.

These rules can override ``BaseRule``'s ``recurse_into`` field, setting it to
``False``. For these rules, ``_eval()`` is only called *once*, with the root
segment of the tree. This can be much more efficient, especially on large
files. For example, see rules ``L050`` and ``L009``, which only look at the
beginning or end of the file, respectively.
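To illustrate the traversal contract, here is a minimal, self-contained sketch. This is plain Python for illustration only, not SQLFluff's actual classes (the real crawl logic lives in the core rule-processing code and differs in detail):

```python
# Sketch: the two traversal modes controlled by recurse_into.

class Segment:
    """Toy stand-in for a parse-tree segment."""
    def __init__(self, type_, children=()):
        self.type = type_
        self.segments = list(children)

def crawl(rule, root):
    """Call rule._eval() per segment, or once with the root."""
    if not rule.recurse_into:
        rule._eval(root)          # one call; the rule walks the tree itself
        return
    stack = [root]
    while stack:                  # default: every segment gets a call
        seg = stack.pop()
        rule._eval(seg)
        stack.extend(reversed(seg.segments))

class CountingRule:
    """Toy rule that just counts how often _eval() is invoked."""
    recurse_into = True
    def __init__(self):
        self.calls = 0
    def _eval(self, segment):
        self.calls += 1

tree = Segment("file", [Segment("statement", [Segment("keyword")])])

default_rule = CountingRule()
crawl(default_rule, tree)         # 3 segments -> 3 calls

root_only = CountingRule()
root_only.recurse_into = False
crawl(root_only, tree)            # -> 1 call
```

For a rule like ``L009`` that only cares about the end of the file, the single root-segment call avoids visiting thousands of irrelevant segments.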

``_works_on_unparsable``
^^^^^^^^^^^^^^^^^^^^^^^^
By default, `SQLFluff` calls ``_eval()`` for all segments, even "unparsable"
segments, i.e. segments that didn't match the parsing rules in the dialect.
This causes issues for some rules. For those rules, setting
``_works_on_unparsable`` to ``False`` tells SQLFluff not to call ``_eval()``
for unparsable segments and their descendants.
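The skipping behaviour can be sketched as follows (again a toy illustration, not SQLFluff's real crawler):

```python
# Sketch: skipping "unparsable" subtrees when _works_on_unparsable is False.

class Segment:
    """Toy stand-in for a parse-tree segment."""
    def __init__(self, type_, children=()):
        self.type = type_
        self.segments = list(children)

def crawl(root, works_on_unparsable):
    """Return the segment types visited during a depth-first walk."""
    visited = []
    def _walk(seg):
        if not works_on_unparsable and seg.type == "unparsable":
            return                  # skip the segment AND its descendants
        visited.append(seg.type)
        for child in seg.segments:
            _walk(child)
    _walk(root)
    return visited

tree = Segment("file", [
    Segment("statement", [Segment("keyword")]),
    Segment("unparsable", [Segment("raw")]),
])

full = crawl(tree, works_on_unparsable=True)
skipped = crawl(tree, works_on_unparsable=False)
```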
Comment on lines +30 to +36

Member:

Is this right? I thought default for this was false and you had to explicitly set it to true?

Member Author:

That's what I thought, too, but I checked. I think maybe I (we) were confusing this with the --fix-even-unparsable thing added recently. We've got so many settings. 🤯

IIUC, the setting documented here controls whether the rule is called for unparsable segments, and --fix-even-unparsable controls whether any fixes would be applied. (Note that you can lint without fixing!)


Performance-related Options
---------------------------
These are other fields on ``BaseRule``. Rules can override them.

``needs_raw_stack``
^^^^^^^^^^^^^^^^^^^
``needs_raw_stack`` defaults to ``False``. Some rules use the
``RuleContext.raw_stack`` property to access earlier segments in the
traversal. This can be useful, but it adds significant overhead to the
linting process. For this reason, it is disabled by default.
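The cost difference is easy to see in a sketch. The segment strings below are hypothetical; the point is that maintaining a full ``raw_stack`` copies a growing tuple on every step (quadratic total work), while ``raw_segment_pre`` only remembers the single prior segment:

```python
# Sketch: full raw_stack vs. the lightweight raw_segment_pre alternative.

raw_segments = ["SELECT", " ", "a", "\n", "FROM", " ", "t"]

# Heavyweight: a fresh, growing tuple per step -> O(n^2) total copying.
raw_stack = ()
stacks = []
for seg in raw_segments:
    stacks.append(raw_stack)      # what a rule sees as context.raw_stack
    raw_stack += (seg,)

# Lightweight: remember only the single prior segment -> O(n) total.
raw_segment_pre = None
pres = []
for seg in raw_segments:
    pres.append(raw_segment_pre)  # what a rule sees as context.raw_segment_pre
    raw_segment_pre = seg

# At every step, the one-segment view matches the top of the full stack.
```

Rules that only ever inspect ``raw_stack[-1]`` (e.g. "is there whitespace before this comma?") can usually switch to ``raw_segment_pre`` with no behaviour change, which is exactly what several rules in this PR do.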

``lint_phase``
^^^^^^^^^^^^^^
There are two phases of rule running.

1. The ``main`` phase is appropriate for most rules. These rules are assumed
   to interact and potentially cause a cascade of fixes requiring multiple
   passes. These rules run up to ``runaway_limit`` times (default: 10).

2. The ``post`` phase is for post-processing rules, not expected to trigger
   any downstream rules, e.g. capitalization fixes. They run in a
   post-processing loop at the end. This loop is identical to the ``main``
   loop, but runs only twice (once to fix, and once more to confirm no
   remaining issues).

The two phases add complexity, but they also improve performance by allowing
SQLFluff to run fewer rules during the ``main`` phase, which often runs several
times.

NOTE: ``post`` rules also run on the *first* pass of the ``main`` phase so that
any issues they find will be presented in the list of issues output by
``sqlfluff fix`` and ``sqlfluff lint``.
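The overall loop structure described above can be sketched like this. This is a simplified model, not the linter's actual implementation; rules are modelled as dicts with a hypothetical ``fixes_left`` counter standing in for "this rule still found something to fix on this pass":

```python
# Sketch: the main/post two-phase lint loop.

def lint(rules, runaway_limit=10):
    """Return (main_passes, post_passes) for a simulated fix run."""
    main_rules = [r for r in rules if r["lint_phase"] == "main"]
    post_rules = [r for r in rules if r["lint_phase"] == "post"]
    main_passes = post_passes = 0

    # Main phase: keep looping while any main rule still fixed something,
    # up to runaway_limit passes. Post rules join only the FIRST pass, so
    # their issues still show up in `sqlfluff lint` output.
    for i in range(runaway_limit):
        main_passes += 1
        active = main_rules + (post_rules if i == 0 else [])
        fixed = False
        for r in active:
            if r["lint_phase"] == "main" and r["fixes_left"] > 0:
                r["fixes_left"] -= 1
                fixed = True
        if not fixed:
            break

    # Post phase: one pass to fix, one more to confirm nothing remains.
    for _ in range(2):
        post_passes += 1
        for r in post_rules:
            r["fixes_left"] = 0
    return main_passes, post_passes

rules = [
    {"lint_phase": "main", "fixes_left": 3},   # cascading fixes
    {"lint_phase": "post", "fixes_left": 1},   # e.g. capitalization
]
result = lint(rules)   # three fixing passes, one clean pass, two post passes
```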

Base Rules
----------
10 changes: 7 additions & 3 deletions src/sqlfluff/core/parser/segments/base.py
@@ -885,11 +885,15 @@ def select_children(
buff.append(seg)
return buff

def recursive_crawl_all(self):
def recursive_crawl_all(self, reverse: bool = False):
"""Recursively crawl all descendant segments."""
if reverse:
for seg in reversed(self.segments):
yield from seg.recursive_crawl_all(reverse=reverse)
yield self
for seg in self.segments:
yield from seg.recursive_crawl_all()
if not reverse:
for seg in self.segments:
yield from seg.recursive_crawl_all(reverse=reverse)
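To see what the new ``reverse`` flag yields, here is a standalone re-implementation of the generator body above on a toy ``Node`` class (illustration only): reverse traversal emits exactly the forward pre-order sequence, reversed.

```python
# Sketch: forward vs. reverse recursive crawl over a toy tree.

class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.segments = list(children)

    def recursive_crawl_all(self, reverse=False):
        # Mirrors BaseSegment.recursive_crawl_all in this PR.
        if reverse:
            for seg in reversed(self.segments):
                yield from seg.recursive_crawl_all(reverse=True)
        yield self
        if not reverse:
            for seg in self.segments:
                yield from seg.recursive_crawl_all()

tree = Node("file", [
    Node("stmt", [Node("kw"), Node("id")]),
    Node("newline"),
])

forward = [n.name for n in tree.recursive_crawl_all()]
backward = [n.name for n in tree.recursive_crawl_all(reverse=True)]
```

This is what lets L009's ``get_trailing_newlines()`` walk the tree from the end of the file without materialising the whole segment sequence first.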

def recursive_crawl(self, *seg_type: str, recurse_into: bool = True):
"""Recursively crawl for segments of a given type.
39 changes: 36 additions & 3 deletions src/sqlfluff/core/rules/base.py
@@ -379,6 +379,23 @@ def siblings_post(self) -> Tuple[BaseSegment, ...]:
else:
return tuple()

@cached_property
def final_segment(self) -> BaseSegment:
Member Author: New property

"""Returns rightmost & lowest descendant.

Similar in spirit to BaseRule.is_final_segment(), but:
- Much faster
- Does not allow filtering out meta segments
"""
last_segment: BaseSegment = (
self.parent_stack[0] if self.parent_stack else self.segment
)
while True:
try:
last_segment = last_segment.segments[-1]
except IndexError:
return last_segment

@property
def functional(self):
"""Returns a Surrogates object that simplifies writing rules."""
@@ -503,12 +520,28 @@ class BaseRule:
# Lint loop / crawl behavior. When appropriate, rules can (and should)
# override these values to make linting faster.
recurse_into = True
needs_raw_stack = True
# "needs_raw_stack" defaults to False because rules run faster that way, and
# most rules don't need it. Rules that use it are usually those that look
# at the surroundings of a segment, e.g. "is there whitespace preceding this
# segment?" In the long run, it would be good to review rules that use
# raw_stack to try and eliminate its use. These rules will often be good
# candidates for one of the following:
# - Rewriting to use "RuleContext.raw_segment_pre", which is similar to
# "raw_stack", but it's only the ONE raw segment prior to the current
# one.
# - Rewriting to use "BaseRule.recurse_into = False" and traversing the
# parse tree directly.
# - Using "RuleContext.memory" to implement custom, lighter weight tracking
# of just the MINIMUM required state across calls to _eval(). Reason:
# "raw_stack" becomes very large for large files (thousands or more
# segments!). In practice, most rules only need to look at a few adjacent
# segments, e.g. others on the same line or in the same statement.
needs_raw_stack = False
# Rules can override this to specify "post". "Post" rules are those that are
# not expected to trigger any downstream rules, e.g. capitalization fixes.
# They run on two occasions:
# - On the first loop of the main phase
# - In a second linter loop after the main rules run
# - On the first pass of the main phase
# - In a second linter pass after the main phase
lint_phase = "main"

def __init__(self, code, description, **kwargs):
2 changes: 2 additions & 0 deletions src/sqlfluff/rules/L001.py
@@ -30,6 +30,8 @@ class Rule_L001(BaseRule):
FROM foo
"""

needs_raw_stack = True

def _eval(self, context: RuleContext) -> LintResult:
"""Unnecessary trailing whitespace.

2 changes: 1 addition & 1 deletion src/sqlfluff/rules/L002.py
@@ -54,7 +54,7 @@ def _eval(self, context: RuleContext) -> Optional[LintResult]:

if context.segment.is_type("whitespace"):
if " " in context.segment.raw and "\t" in context.segment.raw:
if len(context.raw_stack) == 0 or context.raw_stack[-1].is_type(
if context.raw_segment_pre is None or context.raw_segment_pre.is_type(
"newline"
):
# We've got a single whitespace at the beginning of a line.
3 changes: 2 additions & 1 deletion src/sqlfluff/rules/L003.py
@@ -193,6 +193,7 @@ class Rule_L003(BaseRule):

targets_templated = True
_works_on_unparsable = False
needs_raw_stack = True
_adjust_anchors = True
_ignore_types: List[str] = ["script_content"]
config_keywords = ["tab_space_size", "indent_unit"]
@@ -413,7 +414,7 @@ def _eval(self, context: RuleContext) -> Optional[LintResult]:
# First non-whitespace element is our trigger
memory.trigger = segment

is_last = self.is_final_segment(context)
is_last = context.segment is context.final_segment
Member Author:

is_final_segment() is very complex and slow. Switching to the similar, much faster RuleContext.final_segment.

Contributor:

Legend

Member:

How can we avoid rule writers using the wrong call in future? These seem very similarly named and therefore not crazy to think they are interchangeable.

Member Author:

Maybe rename the function? The property (RuleContext.final_segment) is the actual, final segment. The function supports optional filtering of meta segments. Currently, only one rule (L052) uses the function, so we could potentially move it into L052.py. Feel free to create an issue or PR for this. I'll probably go ahead and merge, since this doesn't seem like a blocker.

if not segment.is_type("newline") and not is_last:
# Process on line ends or file end
return LintResult(memory=memory)
7 changes: 4 additions & 3 deletions src/sqlfluff/rules/L004.py
@@ -77,8 +77,8 @@ def _eval(self, context: RuleContext) -> LintResult:
)
# Only attempt a fix at the start of a newline for now
and (
len(context.raw_stack) == 0
or context.raw_stack[-1].is_type("newline")
context.raw_segment_pre is None
or context.raw_segment_pre.is_type("newline")
Member Author:

Switching from raw_stack to raw_segment_pre so this rule doesn't require the heavyweight raw_stack anymore.

)
):
fixes = [
@@ -90,7 +90,8 @@
)
]
elif not (
len(context.raw_stack) == 0 or context.raw_stack[-1].is_type("newline")
context.raw_segment_pre is None
or context.raw_segment_pre.is_type("newline")
):
# give a helpful message if the wrong indent has been found and is not
# at the start of a newline
27 changes: 12 additions & 15 deletions src/sqlfluff/rules/L005.py
@@ -1,6 +1,7 @@
"""Implementation of Rule L005."""
from typing import Optional

from sqlfluff.core.parser import RawSegment
from sqlfluff.core.rules.base import BaseRule, LintResult, LintFix, RuleContext
from sqlfluff.core.rules.doc_decorators import document_fix_compatible

@@ -38,19 +39,15 @@ class Rule_L005(BaseRule):
"""

def _eval(self, context: RuleContext) -> Optional[LintResult]:
"""Commas should not have whitespace directly before them.

We need at least one segment behind us for this to work.

"""
if len(context.raw_stack) >= 1:
cm1 = context.raw_stack[-1]
if (
context.segment.is_type("comma")
and cm1.is_type("whitespace")
and cm1.pos_marker.line_pos > 1
):
anchor = cm1
return LintResult(anchor=anchor, fixes=[LintFix.delete(cm1)])
# Otherwise fine
"""Commas should not have whitespace directly before them."""
Member Author:
The primary change to this file is switching from raw_stack to raw_segment_pre, which is faster. Also did some tidying, but no other meaningful changes. @OTooleMichael

anchor: Optional[RawSegment] = context.raw_segment_pre
if (
# We need at least one previous segment for this to work.
anchor is not None
and context.segment.is_type("comma")
and anchor.is_type("whitespace")
and anchor.pos_marker.line_pos > 1
):
return LintResult(anchor=anchor, fixes=[LintFix.delete(anchor)])
# Otherwise fine.
return None
53 changes: 33 additions & 20 deletions src/sqlfluff/rules/L009.py
@@ -1,13 +1,35 @@
"""Implementation of Rule L009."""
from typing import Optional

from sqlfluff.core.parser import NewlineSegment
from typing import List, Optional, Tuple

from sqlfluff.core.parser import BaseSegment, NewlineSegment
from sqlfluff.core.rules.base import BaseRule, LintResult, LintFix, RuleContext
from sqlfluff.core.rules.doc_decorators import document_fix_compatible
from sqlfluff.core.rules.functional import Segments, sp, tsp


def get_trailing_newlines(segment: BaseSegment) -> List[BaseSegment]:
Member Author (barrywhart, Apr 8, 2022):
Reworked this rule so it is only called once per parse tree, with the root segment.

"""Returns list of trailing newlines in the tree."""
result = []
for seg in segment.recursive_crawl_all(reverse=True):
if seg.is_type("newline"):
result.append(seg)
if not seg.is_whitespace and not seg.is_type("dedent"):
break
return result


def get_last_segment(segment: Segments) -> Tuple[List[BaseSegment], Segments]:
"""Returns rightmost & lowest descendant and its "parent stack"."""
parent_stack: List[BaseSegment] = []
while True:
children = segment.children()
if children:
parent_stack.append(segment[0])
segment = children.last()
else:
return parent_stack, segment
Comment on lines +21 to +30
Member:

Why doesn't it use the new final_segment function?

Member Author:

Because it also needs the parent stack, which is not provided by RuleContext.final_segment. I could've had RuleContext return both, but that'd be harder to document clearly, and AFAIK, this is the only rule that needs it.



@document_fix_compatible
class Rule_L009(BaseRule):
"""Files must end with a single trailing newline.
@@ -82,6 +104,9 @@ class Rule_L009(BaseRule):
"""

targets_templated = True
# TRICKY: Tells linter to only call _eval() ONCE, with the root segment
recurse_into = False
lint_phase = "post"

def _eval(self, context: RuleContext) -> Optional[LintResult]:
"""Files must end with a single trailing newline.
@@ -91,21 +116,9 @@ def _eval(self, context: RuleContext) -> Optional[LintResult]:

"""
# We only care about the final segment of the parse tree.
if not self.is_final_segment(context, filter_meta=False):
return None
parent_stack, segment = get_last_segment(context.functional.segment)

# Include current segment for complete stack and reverse.
parent_stack: Segments = context.functional.parent_stack
complete_stack: Segments = (
context.functional.raw_stack + context.functional.segment
)
reversed_complete_stack = complete_stack.reversed()

# Find the trailing newline segments.
trailing_newlines = reversed_complete_stack.select(
select_if=sp.is_type("newline"),
loop_while=sp.or_(sp.is_whitespace(), sp.is_type("dedent")),
)
trailing_newlines = Segments(*get_trailing_newlines(context.segment))
trailing_literal_newlines = trailing_newlines
if context.templated_file:
trailing_literal_newlines = trailing_newlines.select(
@@ -116,12 +129,12 @@
if not trailing_literal_newlines:
# We make an edit to create this segment after the child of the FileSegment.
if len(parent_stack) == 1:
fix_anchor_segment = context.segment
fix_anchor_segment = segment[0]
else:
fix_anchor_segment = parent_stack[1]

return LintResult(
anchor=context.segment,
anchor=segment[0],
fixes=[
LintFix.create_after(
fix_anchor_segment,
Expand All @@ -132,7 +145,7 @@ def _eval(self, context: RuleContext) -> Optional[LintResult]:
elif len(trailing_literal_newlines) > 1:
# Delete extra newlines.
return LintResult(
anchor=context.segment,
anchor=segment[0],
fixes=[LintFix.delete(d) for d in trailing_literal_newlines[1:]],
)
else:
1 change: 1 addition & 0 deletions src/sqlfluff/rules/L010.py
@@ -43,6 +43,7 @@ class Rule_L010(BaseRule):
from foo
"""

lint_phase = "post"
# Binary operators behave like keywords too.
_target_elems: List[Tuple[str, str]] = [
("type", "keyword"),
14 changes: 6 additions & 8 deletions src/sqlfluff/rules/L011.py
@@ -52,9 +52,6 @@ def _eval(self, context: RuleContext) -> Optional[LintResult]:

We look for the alias segment, and then evaluate its parent and whether
it contains an AS keyword. This is the _eval function for both L011 and L012.

The use of `raw_stack` is just for working out how much whitespace to add.

"""
# Config type hints
self.aliasing: str
@@ -68,14 +65,14 @@

# Remove the AS as we're using implicit aliasing
fixes.append(LintFix.delete(context.segment.segments[0]))
anchor = context.raw_stack[-1]
Member Author:
Switching from raw_stack to raw_segment_pre so this rule doesn't require the heavyweight raw_stack anymore.

anchor = context.raw_segment_pre

# Remove whitespace before (if exists) or after (if not)
if (
len(context.raw_stack) > 0
and context.raw_stack[-1].type == "whitespace"
context.raw_segment_pre is not None
and context.raw_segment_pre.type == "whitespace"
):
fixes.append(LintFix.delete(context.raw_stack[-1]))
fixes.append(LintFix.delete(context.raw_segment_pre))
elif (
len(context.segment.segments) > 0
and context.segment.segments[1].type == "whitespace"
@@ -90,7 +87,8 @@ def _eval(self, context: RuleContext) -> Optional[LintResult]:
insert_buff: List[Union[WhitespaceSegment, KeywordSegment]] = []

# Add initial whitespace if we need to...
if context.raw_stack[-1].name not in ["whitespace", "newline"]:
assert context.raw_segment_pre
if context.raw_segment_pre.name not in ["whitespace", "newline"]:
insert_buff.append(WhitespaceSegment())

# Add an AS (Uppercase for now, but could be corrected later)
1 change: 1 addition & 0 deletions src/sqlfluff/rules/L014.py
@@ -67,6 +67,7 @@ class Rule_L014(Rule_L010):

"""

lint_phase = "post"
_target_elems: List[Tuple[str, str]] = [
("name", "naked_identifier"),
("name", "properties_naked_identifier"),