Fix catastrophic backtracking in BACKSLASH_URL_RE #56

Merged
merged 1 commit into from
Jan 6, 2023

@Synse Synse commented Dec 29, 2022

This fixes a catastrophic backtracking issue in BACKSLASH_URL_RE by updating the regex to match the format used by the bracket regex.

All current tests pass before and after the change.
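For readers unfamiliar with the failure mode, here is a minimal sketch of catastrophic backtracking in Python's `re` module. The two patterns are toy examples chosen for illustration and are NOT the library's BACKSLASH_URL_RE or bracket regex; `time_match` is a hypothetical helper. Both patterns accept the same strings, but the nested quantifier gives the engine many ways to split a run of characters, and all of them are tried before a non-matching input is rejected.

```python
import re
import time

# Toy patterns for illustration only (assumption: not the library's regexes).
AMBIGUOUS = re.compile(r'^(a+)+$')  # nested quantifiers: many ways to split a run of 'a'
LINEAR = re.compile(r'^a+$')        # same accepted strings, only one way to match


def time_match(pattern, text):
    """Return (match result, elapsed seconds) for pattern.match(text)."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


if __name__ == '__main__':
    for n in (12, 14, 16, 18):
        # The trailing '!' makes the match fail, so the engine has to
        # exhaust every possible split of the preceding 'a' characters.
        text = 'a' * n + '!'
        _, slow = time_match(AMBIGUOUS, text)
        _, fast = time_match(LINEAR, text)
        print(f'n={n}: ambiguous {slow:.6f}s, linear {fast:.6f}s')
```

Rewriting a pattern so that each part of the input can be matched in only one way is the usual remedy, and it leaves the set of accepted strings unchanged.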

Proof of concept script

#!/usr/bin/env python3
from time import time, strftime
from iocextract import extract_urls

text = """
[aaa](https://example.com/aaaaaa/aaaaaaaaa/aaaa/aaaa/aaaa/aaaaaa/aaaaa/aaaa.aaaaa.aaa-aaaaaaa-aaaaa.aaaaaaa_aaaaaa.aaa_aaaaa.aa):<br>**`aaaaaaa`**=`/aaa/aaaa/aaaa`<br>**`aaaaaa_aa`**=`11.11.111.11`<br>**`aaaaaa_aaaaaaaa`**=`11.11.111.11`<br>**`aaa_aaaaa`**=`11`<br>|[**`aaaaaa_aaaaaaaaaaa_aaaaaaaaa(1.1)`**](https://example.com/aaaaaa/aaaaaaaaa/aaaa/aaaa/aaaa/aaaaaa/aaaaa/aaaa.aaaaa.aaa-aaaaaaa-aaaaa.aaaaaaaaaaaaaaaa_aaaaaaaaaaa_aaaaaaaaa.aa):<br>**`aaaaaaa`**=`/aaaa/aaaaaaaa$ aaaaaa`<br>|[**`aaaaaaa_aaa(11)`**](https://example.com/aaaaaa/aaaaaaaaa/aaaa/aaaa/aaaa/aaaaaa/aaaaa/aaaa.aaaaa.aaa-aaaaaaa-aaaaa.aaaaaaa_aaaaaaaaaaaaaaaa_aaa.aa):<br>**`aaaaaaa`**=`/aaaa/aaaaaaaa$ aaa -aaaaaaaaaaaaaaaaaaaaaaa=aa -aaaaaaaaaaaaaaaaaaaaaa=aa aa-aaaaa-a1111a1.aaaaa-aaa1-11-aa1.aaaaaa.aaa -- aaaa -a \"aaaaaaa\\naa`<br>**`aaa_aaaaa`**=`1`<br>|[**`aaaa_aaaaaaaaaa_aaaaaa(1.1)`**](https://example.com/aaaaaa/aaaaaaaaa/aaaa/aaaa/aaaa/aaaaaa/aaaaa/aaaa.aaaaa.aaa-aaaaaaa-aaaaa.aaaaaaaaaaaa_aaaaaaaaaa_aaaaaa.aa):<br>**`aaaaaaa`**=`/aaaa/aaaaaaaa$ /aaa/aaa/aaaaa_aaaaaaa -a`<br> |[**`aaaaaaa_aaaaa(1.1)`**](https://example.com/aaaaaa/aaaaaaaaa/aaaa/aaaa/aaaa/aaaaaa/aaaaa/aaaa.aaaaa.aaa-aaaaaaa-aaaaa.aaaaaaaaa_aaaaaaaaaaaaaaaaaa_aaaaa.aa):<br>**`aaaaaaa`**=`/aaa/aaa/aaaa aaaa`<br>**`aaaaaaa_aaaaa`**=`11`<br>|[**`aaaaaaa_aaaa(11)`**](https://example.com/aaaaaa/aaaaaaaaa/aaaa/aaaa/aaaa/aaaaaa/aaaaa/aaaa.aaaaa.aaa-aaaaaaa-aaaaa.aaaaaaa_aaaaaaaaaaaaaaa_aaaa.aa):<br>**`aaaaaaa`**=`/aaaa/aaaaaaaa$ aaaa -a aaaaaaaaa aa-aaaaaaa aa-aaaa-aaaaaa_aa_aaaa`<br>|\n|[**`aaa_aaaaa(1.1)`**](https://example.com/aaaaaa/aaaaaaaaa/aaaa/aaaa/aaaa/aaaaaa/aaaaa/aaaa.aaaaa.aaa-aaaaaaa-aaaaa.aaaaaaa_aaaaaa.aaa_aaaaa.aa):<br>**`aaaaaaa`**=`/aaa/aaaa/aaaa`<br>**`aaaaaa_aa`**=`11.11.111.11`<br>**`aaaaaa_aaaaaaaa`**=`11.11.111.11`<br>**`aaa_aaaaa`**=`11`<br>| 
|[**`aaaaaaa_aaa(11)`**](https://example.com/aaaaaa/aaaaaaaaa/aaaa/aaaa/aaaa/aaaaaa/aaaaa/aaaa.aaaaa.aaa-aaaaaaa-aaaaa.aaaaaaa_aaaaaaaaaaaaaaaa_aaa.aa):<br>**`aaaaaaa`**=`/aaaa/aaaaaaaa$ aaaa -a aaa-1 -a/aaaa/aa-aaaaa-aaaaaaa/aaaaaa/aaaaa.aa /aaaa/aa-aaaaa-aaaaaaa/aaa/aa-aaaaa-aaaaaa-aaaaaaa --aaaaa-aaaa --aaaaa`<br>**`aaa_aaaaa`**=`11`<br> |[**`aaaaaaaaaaaaaaaa_aaaa_aaa(11)`**](https://example.com/aaaaaa/aaaaaaaaa/aaaa/aaaa/aaaa/aaaaaa/aaaaa/aaaa.aaaaa.aaa-aaaaaaa-aaaaa.aaaaaaaaaaaaaaaaaaaaaaaa_aaaa_aaa.aa):<br>**`aaaaaaa`**=`/aaaa/aaaaaaaa$ aaaaaaaaa aaaaa aa-aaaa`<br>|[**`aaaaaaa_aaaaa(1.1)`**](https://example.com/aaaaaa/aaaaaaaaa/aaaa/aaaa/aaaa/aaaaaa/aaaaa/aaaa.aaaaa.aaa-aaaaaaa-aaaaa.aaaaaaaaa_aaaaaaaaaaaaaaaaaa_aaaaa.aa):<br>**`aaaaaaa`**=`/aaa/aaa/aaaa aaaa`<br>**`aaaaaaa_aaaaa`**=`11`<br>|[**`aaaaaa_aaaaa_aaaaaaaa(111)`**](https://example.com/aaaaaa/aaaaaaaaa/aaaa/aaaa/aaaa/aaaaaa/aaaaa/aaaa.aaaaa.aaa-aaaaaaa-aaaaa.aaaaaaa_aaaaaaaaaaaaaa_aaaaa_aaaaaaaa.aa):<br>**`aaaaaaa`**=`aaaa -a aaaa aaaaaa aaaaa --aaaaaaa`<br>**`aaaaaaa_aaa`**=`/aaaa/aaaaaaaa`<br>**`aaaaaaa_aaa`**=`/aaa/aaaa`<br>**`aaaa`**=`aaaaaaa-aaa1`<br>**`aaaaa`**=`aaaaaaaa aaaaaaaa a`<br>**`aaaaaaaaaaa`**=`aaaaaaaaaa`<br>|\n|| |[**`aaaaaaa_aaa(11)`**](https://example.com/aaaaaa/aaaaaaaaa/aaaa/aaaa/aaaa/aaaaaa/aaaaa/aaaa.aaaaa.aaa-aaaaaaa-aaaaa.aaaaaaa_aaaaaaaaaaaaaaaa_aaa.aa):<br>**`aaaaaaa`**=`/aaaa/aaaaaaaa$ aaaa -a aaa-1 -a/aaaa/aa-aaaaa-aaaaaaa/aaaaaa/aaaaa.aa /aaaa/aa-aaaaa-aaaaaaa/aaa/aa-aaaaa-aaaaaa-aaaaaaa --aaaaa-aaaa --aaaaa`<br>**`aaa_aaaaa`**=`11`<br> |[**`aaaaaaaaaaaaaaaa_aaaa_aaa(11)`**](https://example.com/aaaaaa/aaaaaaaaa/aaaa/aaaa/aaaa/aaaaaa/aaaaa/aaaa.aaaaa.aaa-aaaaaaa-aaaaa.aaaaaaaaaaaaaaaaaaaaaaaa_aaaa_aaa.aa):<br>**`aaaaaaa`**=`/aaaa/aaaaaaaa$ -aaaa --aaaaa -a \\/aaaa\\/aa-aaaaa-aaaaaaa\\/aaa\\/aa-aaaaaaaaaa-aaaaaa -a \\/aaa\\/aaa\\/aaaaa\\/ -a https://aa-aaaaa-111a111\\.aaaaa-aaa1-11-aa1\\.example\\.com
"""

start = time()
urls = set()

print('Starting url extraction...')
for url in extract_urls(text):
    # print(f'[{strftime("%T")}] extracted "{url}"')
    urls.add(url)

end = time()

print(f'Extracted {len(urls)} unique urls in {end - start} seconds')

Before fix

./iocextract_catastrophic_backtracking_poc.py 
Starting url extraction...
Extracted 9 unique urls in 9.963629484176636 seconds

After fix

./iocextract_catastrophic_backtracking_poc.py 
Starting url extraction...
Extracted 9 unique urls in 0.009938955307006836 seconds

I wasn't able to pinpoint exactly what in the sample text triggers the backtracking, but extraction time grows much faster than the length of the text. The sample text above is 3.7k and takes <10 seconds; the original text I was having issues with was ~26k, and extraction took almost 3 minutes.
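One way to narrow this down is to time the extraction on growing prefixes of the text and see how steeply the time rises. The following is a rough sketch, not part of the PR: `measure_scaling` and `growth_exponent` are hypothetical helpers, and `func` stands for any callable that returns an iterable, such as `extract_urls` from the script above.

```python
import math
import time


def measure_scaling(func, text, fractions=(0.25, 0.5, 1.0)):
    """Time func on growing prefixes of text; return (length, seconds) pairs."""
    samples = []
    for fraction in fractions:
        prefix = text[:int(len(text) * fraction)]
        start = time.perf_counter()
        # Consume the result so lazy iterables (generators) actually do the work.
        for _ in func(prefix):
            pass
        samples.append((len(prefix), time.perf_counter() - start))
    return samples


def growth_exponent(small, large):
    """Estimate k in time ~ length**k from two (length, seconds) samples.

    Both timings must be greater than zero.
    """
    (n1, t1), (n2, t2) = small, large
    return math.log(t2 / t1) / math.log(n2 / n1)


if __name__ == '__main__':
    # Approximate figures quoted in this PR: 3.7k in just under 10 seconds,
    # ~26k in almost 3 minutes (taken here as 180 seconds).
    k = growth_exponent((3700, 9.96), (26000, 180.0))
    print(f'estimated growth exponent: {k:.2f}')
```

Two approximate data points give only an indicative estimate; timing several prefixes with `measure_scaling` gives a better picture, and the prefix length at which the time jumps also hints at which part of the text triggers the backtracking.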

Fixes #52

@battleoverflow battleoverflow left a comment

Hi, @Synse!

Great fix! Just tested and merged.

@battleoverflow battleoverflow merged commit 13721d1 into InQuest:master Jan 6, 2023
@Synse Synse deleted the fix-backslash-url-re branch January 6, 2023 21:02

Successfully merging this pull request may close these issues.

catastrophic backtracking in BACKSLASH_URL_RE