Escape user-provided text passed to regex #571

pitkley · 2019-10-06T14:53:17Z

Rather than using the user/document-provided values directly, we instead escape them to use them verbatim.

This fixes issue #568.

I only checked the models.py file for occurrences of unescaped regex-"injections". There might be more across the project.
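To illustrate the problem and the fix, here is a minimal sketch. It is not the project's code: matches_word is a hypothetical helper that only mirrors the r"\b{}\b".format(...) pattern used in models.py. Without escaping, a term containing regex metacharacters either fails to compile or matches text it should not.

```python
import re


def matches_word(word, text, escape=True):
    """Hypothetical helper: True if `word` occurs in `text` as a whole word."""
    # With escape=True the term is treated as literal text, not as a regex.
    pattern = r"\b{}\b".format(re.escape(word) if escape else word)
    return re.search(pattern, text) is not None


# Unescaped, "." means "any character", so "1.5" also matches "125".
print(matches_word("1.5", "total 125 EUR", escape=False))
# Escaped, only the literal "1.5" matches.
print(matches_word("1.5", "total 125 EUR"))
print(matches_word("1.5", "total 1.5 EUR"))

# Unescaped, a bare quantifier has nothing to repeat and fails to compile.
try:
    matches_word("*", "a*b", escape=False)
except re.error as exc:
    print("re.error:", exc)
```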

kj21

Works beautifully

MasterofJOKers · 2019-11-02T10:30:20Z

Took the liberty of fixing the pycodestyle errors shown in Travis. They were all line-length violations.

That whole matching section gives me shivers, though. Every if branch seems to be written in a different style, and I don't get why we have to re.compile() the match before passing it to re.search().
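On the re.compile() point: the standard library's re.search() accepts a pattern string directly, so compiling first is not required for a one-off search. A small sketch, with a made-up pattern and text that are not taken from models.py:

```python
import re

text = "Invoice from ACME Corp"  # illustrative input only
pattern = r"\bacme\b"

# Pass the string pattern straight to re.search(); flags go along with it.
direct = re.search(pattern, text, flags=re.IGNORECASE)

# Compiling first and searching on the compiled object finds the same match.
compiled = re.compile(pattern, flags=re.IGNORECASE)
via_compiled = compiled.search(text)

print(direct.group(0), via_compiled.group(0))
```

Precompiling mostly pays off when the same pattern object is kept around and reused, which is not what the matching code appears to do.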

MasterofJOKers · 2019-11-02T10:34:30Z

We still have test failures with these patches. Going to take a look.

MasterofJOKers · 2019-11-02T11:34:05Z

_split_match() returns its terms with spaces already replaced by \s+. re.escape() escapes spaces too, and applied to that output it would also escape the \s+ itself, so the whole _split_match() magic is a little hard to fix.

>>> print(re.escape('alpha charlie gamma'))
alpha\ charlie\ gamma
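A short sketch of why escaping the output of _split_match() breaks matching. The value of already_split is assumed from the description above and is not copied from a real run:

```python
import re

# What _split_match() is described as returning for the phrase "charlie gamma".
already_split = r"charlie\s+gamma"

# Escaping it again turns the \s+ into a literal backslash, "s" and "+".
double_escaped = re.escape(already_split)

# The unescaped term still matches any run of whitespace.
print(bool(re.search(already_split, "charlie   gamma")))
# The re-escaped term no longer matches whitespace at all.
print(bool(re.search(double_escaped, "charlie   gamma")))
# It only matches the literal characters of the regex itself.
print(bool(re.search(double_escaped, r"charlie\s+gamma")))
```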

MasterofJOKers · 2019-11-02T11:40:31Z

diff --git a/src/documents/models.py b/src/documents/models.py
index 7d12f91..f368f31 100644
--- a/src/documents/models.py
+++ b/src/documents/models.py
@@ -98,15 +98,14 @@ class MatchingModel(models.Model):
         if self.matching_algorithm == self.MATCH_ALL:
             for word in self._split_match():
                 search_result = re.search(
-                    r"\b{}\b".format(re.escape(word)), text, **search_kwargs)
+                    r"\b{}\b".format(word), text, **search_kwargs)
                 if not search_result:
                     return False
             return True
 
         if self.matching_algorithm == self.MATCH_ANY:
             for word in self._split_match():
-                if re.search(r"\b{}\b".format(re.escape(word)), text,
-                             **search_kwargs):
+                if re.search(r"\b{}\b".format(word), text, **search_kwargs):
                     return True
             return False
 
@@ -142,7 +141,7 @@ class MatchingModel(models.Model):
         findterms = re.compile(r'"([^"]+)"|(\S+)').findall
         normspace = re.compile(r"\s+").sub
         return [
-            normspace(" ", (t[0] or t[1]).strip()).replace(" ", r"\s+")
+            re.escape(normspace(" ", (t[0] or t[1]).strip())).replace(r"\ ", r"\s+")
             for t in findterms(self.match)
         ]

This makes it work but is pretty much unreadable.

MasterofJOKers · 2019-11-02T11:46:52Z

I'd prefer something like this. Opinions?

        findterms = re.compile(r'"([^"]+)"|(\S+)').findall
        normspace = re.compile(r"\s+").sub
        # find all terms and replace multiple spaces with a single one
        terms = [normspace(" ", (t[0] or t[1]).strip())
                    for t in findterms(self.match)]
        # escape each term so we don't have regexes where we don't want them.
        # This escapes spaces, too.
        terms = [re.escape(t).replace(r"\ ", r"\s+")
                    for t in terms]
        return terms
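For reference, a standalone sketch of that approach that can be run outside the model class. split_match is a hypothetical free function standing in for MatchingModel._split_match(), and the sample input is made up:

```python
import re


def split_match(match):
    """Split a match string into escaped regex terms (hypothetical stand-in)."""
    findterms = re.compile(r'"([^"]+)"|(\S+)').findall
    normspace = re.compile(r"\s+").sub
    # Find all terms and collapse runs of whitespace into a single space.
    terms = [normspace(" ", (t[0] or t[1]).strip())
             for t in findterms(match)]
    # Escape each term so it is matched literally. re.escape() escapes
    # spaces too, so the escaped space is what gets replaced with \s+.
    return [re.escape(t).replace(r"\ ", r"\s+") for t in terms]


terms = split_match('alpha "charlie   gamma" a+b')
print(terms)

# A quoted phrase still matches across arbitrary whitespace.
print(bool(re.search(r"\b{}\b".format(terms[1]), "charlie \n gamma")))
```

Escaping before replacing the spaces keeps the \s+ as the only intentional regex syntax left in each term.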

pitkley · 2020-02-23T15:58:54Z

@MasterofJOKers sorry I didn't react after your review, thank you very much for analyzing this. I have applied pretty much exactly your solution to the problem, it seems like the simplest fix while still being correct! 👍

MasterofJOKers

I really liked the comments in my version though ;)

pitkley mentioned this pull request Oct 6, 2019

re.error: nothing to repeat at position 2 #568

Closed

kj21 previously approved these changes Oct 6, 2019

sbrunner previously approved these changes Oct 7, 2019

MasterofJOKers dismissed stale reviews from sbrunner and kj21 via 422a785 November 2, 2019 10:28

MasterofJOKers force-pushed the pitkley-patch-1 branch from 837c2f7 to 422a785 Compare November 2, 2019 10:28

MasterofJOKers previously approved these changes Nov 2, 2019

pitkley dismissed MasterofJOKers’s stale review via 3e6d177 February 23, 2020 15:22

pitkley force-pushed the pitkley-patch-1 branch 2 times, most recently from 3e6d177 to 989af90 Compare February 23, 2020 15:51

Escape user-provided text passed to regex

359e236

Rather than using the user/document-provided values directly, we instead escape them to use them verbatim. This fixes issue #568.

pitkley force-pushed the pitkley-patch-1 branch from 989af90 to 359e236 Compare February 23, 2020 15:55

pitkley requested review from MasterofJOKers and a team February 23, 2020 15:59

MasterofJOKers approved these changes Feb 23, 2020

pitkley requested a review from a team February 23, 2020 18:11
