Merge pull request #69 from InQuest/rc
Release: v1.15.2
azazelm3dj3d authored Apr 18, 2023
2 parents f7ce83f + f24ead2 commit 0fcd139
Showing 7 changed files with 66 additions and 69 deletions.
10 changes: 5 additions & 5 deletions README.md
@@ -2,7 +2,7 @@ iocextract
==========

![Developed by InQuest](https://inquest.net/images/inquest-badge.svg)
-![Build Status](https://github.com/InQuest/python-iocextract/workflows/iocextract-build/badge.svg)
+![Build Status](https://github.com/InQuest/iocextract/workflows/iocextract-build/badge.svg)
[![Documentation Status](https://readthedocs.org/projects/iocextract/badge/?version=latest)](https://inquest.readthedocs.io/projects/iocextract/en/latest/?badge=latest)
![PyPI Version](https://img.shields.io/pypi/v/iocextract.svg)

@@ -152,7 +152,7 @@ http://example.com
http://example.com:8989/bad
"""

-for url in iocextract.extract_urls(content, defang_data=False):
+for url in iocextract.extract_urls(content, defang=False):
print(url)

# Output
@@ -234,7 +234,7 @@ Note: You will most likely end up with extra garbage at the end of URLs.
>> A. Maybe, but you should consider using the `--strip-urls` CLI flag (or the `strip=True` parameter in the library), and you may still get some extra garbage in your output. If you're extracting from HTML, consider using something like [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) to first isolate the text content, and then pass that to iocextract, [like this](https://gist.github.com/rshipp/d399491305c5d293357a800d5a51b0aa).
> Q. Extracting IOCs that have not been defanged, from binary data like executables, or very large inputs?
->> A. There is a very simplistic version of this available when running as a library, but it requires the `defang_data=False` parameter and could potentially miss some of the IOCs. The regex in iocextract is designed to be flexible to catch defanged IOCs. If you're unable to collect the information you need, consider using something like [Cacador](https://github.com/sroberts/cacador) instead.
+>> A. There is a very simplistic version of this available when running as a library, but it requires the `defang=False` parameter and could potentially miss some of the IOCs. The regex in iocextract is designed to be flexible to catch defanged IOCs. If you're unable to collect the information you need, consider using something like [Cacador](https://github.com/sroberts/cacador) instead.
More Details
------------
@@ -308,7 +308,7 @@ For URLs, the following defang techniques are supported:
| URL encoded | `http%3A%2F%2fexample%2Ecom%2Fpath` | `http://example.com/path` |
| Base64 encoded | `aHR0cDovL2V4YW1wbGUuY29tL3BhdGgK` | `http://example.com/path` |

-NOTE: The tables above are not exhaustive, and other URL/defang patterns may also be extracted correctly. If you notice something missing or not working correctly, feel free to let us know via the [GitHub Issues](https://github.com/inquest/python-iocextract/issues).
+NOTE: The tables above are not exhaustive, and other URL/defang patterns may also be extracted correctly. If you notice something missing or not working correctly, feel free to let us know via the [GitHub Issues](https://github.com/inquest/iocextract/issues).

The base64 regex was generated with [@deadpixi](https://github.com/deadpixi)'s [base64 regex tool](https://www.erlang-factory.com/upload/presentations/225/ErlangFactorySFBay2010-RobKing.pdf).
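Both encoded rows in the table above can be reversed with the Python standard library alone. The snippet below is a simplified sketch of the idea; its regexes are illustrative stand-ins, not iocextract's actual patterns:

```python
import base64
import re
from urllib.parse import unquote

# Percent-encoded URL from the table above
print(unquote('http%3A%2F%2fexample%2Ecom%2Fpath'))  # http://example.com/path

# Base64-embedded URL: find long base64 runs, decode them,
# then scan the decoded text for plain URLs.
text = 'payload: aHR0cDovL2V4YW1wbGUuY29tL3BhdGgK ...'
for run in re.findall(r'[A-Za-z0-9+/]{16,}={0,2}', text):
    decoded = base64.b64decode(run).decode('utf-8', errors='replace')
    for url in re.findall(r'https?://\S+', decoded):
        print(url)  # http://example.com/path
```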

@@ -391,7 +391,7 @@ If you're working with YARA rules, you may be interested in [plyara](https://git
Contributing
------------

-If you have a defang technique that doesn't make it through the extractor, or if you find any bugs, Pull Requests and Issues are always welcome. The library is released under a GPL-2.0 [license](https://github.com/InQuest/python-iocextract/blob/master/LICENSE).
+If you have a defang technique that doesn't make it through the extractor, or if you find any bugs, Pull Requests and Issues are always welcome. The library is released under a GPL-2.0 [license](https://github.com/InQuest/iocextract/blob/master/LICENSE).

Who's using iocextract?
-----------------------
10 changes: 5 additions & 5 deletions docs/README.rst
@@ -4,8 +4,8 @@ iocextract
.. image:: https://inquest.net/images/inquest-badge.svg
:target: https://inquest.net/
:alt: Developed by InQuest
-.. image:: https://github.com/InQuest/python-iocextract/workflows/iocextract-build/badge.svg
-   :target: https://github.com/InQuest/python-iocextract/workflows/iocextract-build/
+.. image:: https://github.com/InQuest/iocextract/workflows/iocextract-build/badge.svg
+   :target: https://github.com/InQuest/iocextract/workflows/iocextract-build/
:alt: Build Status
.. image:: https://readthedocs.org/projects/iocextract/badge/?version=latest
:target: https://inquest.readthedocs.io/projects/iocextract/en/latest/
@@ -123,7 +123,7 @@ If you don't want to defang the extracted IOCs at all during extraction, you can
http://example.com:8989/bad
"""

-for url in iocextract.extract_urls(content, defang_data=False):
+for url in iocextract.extract_urls(content, defang=False):
print(url)

# Output
@@ -205,7 +205,7 @@ Maybe, but you should consider using the ``--strip-urls`` CLI flag (or the ``str

**Extracting IOCs that have not been defanged, from binary data like executables, or very large inputs?**

-There is a very simplistic version of this available when running as a library, but it requires the ``defang_data=False`` parameter and could potentially miss some of the IOCs. The regex in iocextract is designed to be flexible to catch defanged IOCs. If you're unable to collect the information you need, consider using something like `Cacador`_ instead.
+There is a very simplistic version of this available when running as a library, but it requires the ``defang=False`` parameter and could potentially miss some of the IOCs. The regex in iocextract is designed to be flexible to catch defanged IOCs. If you're unable to collect the information you need, consider using something like `Cacador`_ instead.

More Details
------------
@@ -414,7 +414,7 @@ Who's using iocextract
Are you using it? Want to see your site listed here? Let us know!

.. _Indicator of Compromise: https://en.wikipedia.org/wiki/Indicator_of_compromise
-.. _Issues: https://github.com/inquest/python-iocextract/issues
+.. _Issues: https://github.com/inquest/iocextract/issues
.. _this tweet from @InQuest: https://twitter.com/InQuest/status/969469856931287041
.. _Cisco ESA: https://www.cisco.com/c/en/us/support/docs/security/email-security-appliance/118775-technote-esa-00.html
.. _appropriate wheel from PyPI: https://pypi.org/project/regex/#files
6 changes: 3 additions & 3 deletions docs/_templates/links.html
@@ -36,10 +36,10 @@ <h3>Other Projects</h3>

<h3>Useful Links</h3>
<ul>
-<li><a href="https://github.com/InQuest/python-iocextract">GitHub Repository</a></li>
+<li><a href="https://github.com/InQuest/iocextract">GitHub Repository</a></li>
<li><a href="https://pypi.org/project/iocextract">PyPI Package</a></li>
-<li><a href="https://github.com/InQuest/python-iocextract/issues">Issue Tracker</a></li>
-<li><a href="https://github.com/InQuest/python-iocextract/releases">Changelog</a></li>
+<li><a href="https://github.com/InQuest/iocextract/issues">Issue Tracker</a></li>
+<li><a href="https://github.com/InQuest/iocextract/releases">Changelog</a></li>
</ul>

<h3>Stay Informed</h3>
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -90,7 +90,7 @@
'logo_name': 'true',
'description': 'Advanced Indicator of Compromise (IOC) extractor.',
'github_user': 'InQuest',
-    'github_repo': 'python-iocextract',
+    'github_repo': 'iocextract',
'github_type': 'star',
'show_powered_by': 'false',
'page_width': 'auto',
63 changes: 30 additions & 33 deletions iocextract.py
Expand Up @@ -401,7 +401,7 @@ def extract_urls(
delimiter=False,
open_punc=False,
no_scheme=False,
-    defang_data=False,
+    defang=False,
):
"""
Extract URLs!
@@ -414,7 +414,7 @@
:param bool delimiter: Continue extracting even after whitespace is detected
:param bool open_punc: Disabled puncuation regex
:param bool no_scheme: Remove protocol (http, tcp, etc.) type in output
-    :param bool defang_data: Extract non-defanged IOCs
+    :param bool defang: Extract non-defanged IOCs
:rtype: :py:func:`itertools.chain`
"""

@@ -425,14 +425,14 @@
strip=strip,
open_punc=open_punc,
no_scheme=no_scheme,
-            defang_data=defang_data,
+            defang=defang,
),
extract_encoded_urls(data, refang=refang, strip=strip, delimiter=delimiter),
)


def extract_unencoded_urls(
-    data, refang=False, strip=False, open_punc=False, no_scheme=False, defang_data=False
+    data, refang=False, strip=False, open_punc=False, no_scheme=False, defang=False
):
"""
Extract only unencoded URLs!
@@ -442,40 +442,37 @@ def extract_unencoded_urls(
:param bool strip: Strip possible garbage from the end of URLs
:param bool open_punc: Disabled puncuation regex
:param bool no_scheme: Remove protocol (http, tcp, etc.) type in output
-    :param bool defang_data: Extract non-defanged IOCs
+    :param bool defang: Extract non-defanged IOCs
:rtype: Iterator[:class:`str`]
"""

-    if "[" not in data:
-        if defang_data:
-            data = str(data).replace(".", "[.]")
-
-        yield data
-
-    else:
-        unencoded_urls = itertools.chain(
-            url_re(open_punc).finditer(data),
-            BRACKET_URL_RE.finditer(data),
-            BACKSLASH_URL_RE.finditer(data),
-        )
+    unencoded_urls = itertools.chain(
+        url_re(open_punc).finditer(data),
+        BRACKET_URL_RE.finditer(data),
+        BACKSLASH_URL_RE.finditer(data),
+    )

-        for url in unencoded_urls:
-            if refang:
-                url = refang_url(url.group(1), no_scheme=no_scheme)
-            else:
-                url = url.group(1)
+    for url in unencoded_urls:
+        if refang or defang:
+            if refang:
+                url = refang_data(url.group(1), no_scheme=no_scheme)
+
+            if defang:
+                url = defang_data(url.group(1))
+        else:
+            url = url.group(1)

-            # Checks for whitespace in the string
-            def found_ws(s):
-                return True in [check_s in s for check_s in whitespace]
+        # Checks for whitespace in the string
+        def found_ws(s):
+            return True in [check_s in s for check_s in whitespace]

-            if strip:
-                if found_ws(url):
-                    url = re.split(WS_SYNTAX_RM, url)[0]
-                else:
-                    url = re.split(URL_SPLIT_STR, url)[0]
+        if strip:
+            if found_ws(url):
+                url = re.split(WS_SYNTAX_RM, url)[0]
+            else:
+                url = re.split(URL_SPLIT_STR, url)[0]

-            yield url
+        yield url
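The refang step in that loop boils down to undoing dot and scheme substitutions. A toy sketch of the idea, covering only a few of the defang styles the library's regexes actually handle:

```python
import re

def refang_sketch(url):
    """Toy refang: undo a few common defang substitutions.

    Illustrative only; iocextract's real logic covers far more cases.
    """
    url = url.replace('[.]', '.').replace('(.)', '.').replace('[dot]', '.')
    return re.sub(r'^hxxp', 'http', url, flags=re.IGNORECASE)

print(refang_sketch('hxxps://example[.]com/test'))  # https://example.com/test
```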


def extract_encoded_urls(
@@ -728,7 +725,7 @@ def extract_custom_iocs(data, regex_list):
"""
Extract using custom regex strings!
-    Need help? Check out the README: https://github.com/inquest/python-iocextract#custom-regex
+    Need help? Check out the README: https://github.com/inquest/iocextract#custom-regex
:param data: Input text
:param regex_list: List of strings to treat as regex and match against data
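A hedged sketch of what those two parameters drive: each string in `regex_list` is treated as a regex and matched against the data. The single-capture-group convention assumed below comes from the README section linked above, not from this diff:

```python
import re

def extract_custom_sketch(data, regex_list):
    """Sketch of custom-regex extraction: yield group(1) of every match.

    Assumes each pattern carries exactly one capture group.
    """
    for pattern in regex_list:
        for match in re.finditer(pattern, data):
            yield match.group(1)

# Hypothetical pattern for ticket-style identifiers
print(list(extract_custom_sketch('id=ABC-1234 id=XYZ-9999', [r'id=([A-Z]{3}-\d{4})'])))
# ['ABC-1234', 'XYZ-9999']
```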
@@ -814,7 +811,7 @@ def refang_email(email):
)


-def refang_url(url, no_scheme=False):
+def refang_data(url, no_scheme=False):
"""
Refang a URL!
@@ -923,7 +920,7 @@ def refang_ipv4(ip_address):
)


-def defang(ioc):
+def defang_data(ioc):
"""
Defang a URL, domain, or IPv4 address!
4 changes: 2 additions & 2 deletions setup.py
@@ -9,7 +9,7 @@

setup(
name='iocextract',
-    version='1.15.1',
+    version='1.15.2',
include_package_data=True,
py_modules=['iocextract',],
install_requires=['regex',],
@@ -27,7 +27,7 @@
description='Advanced Indicator of Compromise (IOC) extractor.',
long_description=README,
long_description_content_type = "text/markdown",
-    url='https://github.com/InQuest/python-iocextract',
+    url='https://github.com/InQuest/iocextract',
author='InQuest Labs',
author_email='labs@inquest.net',
classifiers=[
40 changes: 20 additions & 20 deletions tests.py
@@ -577,7 +577,7 @@ def test_refang_ipv4(self):
self.assertEqual(list(iocextract.extract_ipv4s(content, refang=True))[0], '111.111.111.111')
self.assertEqual(iocextract.refang_ipv4(content), '111.111.111.111')

-    def test_refang_url(self):
+    def test_refang_data(self):
content_list = [
'http://example.com/test',
'http:// example .com /test',
@@ -599,18 +599,18 @@ def test_refang_url(self):
]

for content in content_list:
-            self.assertEqual(iocextract.refang_url(content), 'http://example.com/test')
+            self.assertEqual(iocextract.refang_data(content), 'http://example.com/test')

-        self.assertEqual(iocextract.refang_url('ftx://example.com/test'), 'ftp://example.com/test')
+        self.assertEqual(iocextract.refang_data('ftx://example.com/test'), 'ftp://example.com/test')

# IPv6 works as expected
content = 'http://[2001:db8:85a3:0:0:8a2e:370:7334]:80/test'
-        self.assertEqual(iocextract.refang_url(content), content)
+        self.assertEqual(iocextract.refang_data(content), content)
self.assertEqual(list(iocextract.extract_urls(content, refang=True))[0], content)

# HXXPS
for content in ['hxxps://example[.]com/test', 'hXXps://example[dot]com/test']:
-            self.assertEqual(iocextract.refang_url(content), 'https://example.com/test')
+            self.assertEqual(iocextract.refang_data(content), 'https://example.com/test')

def test_url_extraction_handles_punctuation(self):
self.assertEqual(list(iocextract.extract_urls('example[.]com!'))[0], 'example[.]com')
@@ -657,8 +657,8 @@ def test_urlencoded_url_extraction(self):

def test_refang_never_excepts_from_urlparse(self):
try:
-            iocextract.refang_url('hxxp__test]')
-            iocextract.refang_url('CDATA[^h00ps://test.com/]]>')
+            iocextract.refang_data('hxxp__test]')
+            iocextract.refang_data('CDATA[^h00ps://test.com/]]>')
except ValueError as e:
self.fail('Unhandled parsing error in refang: {e}'.format(e=e))

@@ -670,8 +670,8 @@ def test_url_generic_regex_tight_edge_cases(self):
self.assertEqual(len(list(iocextract.extract_urls('https:// test /'))), 1)

def test_refang_removes_some_backslash_escaped_characters(self):
-        self.assertEqual(iocextract.refang_url('https://example\(.)com/'), 'https://example.com/')
-        self.assertEqual(iocextract.refang_url('https://example\(.\)com/test\.html'), 'https://example.com/test.html')
+        self.assertEqual(iocextract.refang_data('https://example\(.)com/'), 'https://example.com/')
+        self.assertEqual(iocextract.refang_data('https://example\(.\)com/test\.html'), 'https://example.com/test.html')

def test_ip_regex_allows_multiple_brackets(self):
self.assertEqual(list(iocextract.extract_ips('10.10.10.]]]10', refang=True))[0], '10.10.10.10')
@@ -688,16 +688,16 @@ def test_ip_regex_allows_backslash_escape(self):
self.assertEqual(list(iocextract.extract_ips('10[.]10(.10\.10', refang=True))[0], '10.10.10.10')

def test_defang(self):
-        self.assertEqual(iocextract.defang('http://example.com/some/lo.ng/path.ext/'), 'hxxp://example[.]com/some/lo.ng/path.ext/')
-        self.assertEqual(iocextract.defang('http://example.com/path.ext'), 'hxxp://example[.]com/path.ext')
-        self.assertEqual(iocextract.defang('http://127.0.0.1/path.ext'), 'hxxp://127[.]0[.]0[.]1/path.ext')
-        self.assertEqual(iocextract.defang('http://example.com/'), 'hxxp://example[.]com/')
-        self.assertEqual(iocextract.defang('https://example.com/'), 'hxxps://example[.]com/')
-        self.assertEqual(iocextract.defang('ftp://example.com/'), 'fxp://example[.]com/')
-        self.assertEqual(iocextract.defang('example.com'), 'example[.]com')
-        self.assertEqual(iocextract.defang('example.com/'), 'example[.]com/')
-        self.assertEqual(iocextract.defang('example.com/some/lo.ng/path.ext/'), 'example[.]com/some/lo.ng/path.ext/')
-        self.assertEqual(iocextract.defang('127.0.0.1'), '127[.]0[.]0[.]1')
+        self.assertEqual(iocextract.defang_data('http://example.com/some/lo.ng/path.ext/'), 'hxxp://example[.]com/some/lo.ng/path.ext/')
+        self.assertEqual(iocextract.defang_data('http://example.com/path.ext'), 'hxxp://example[.]com/path.ext')
+        self.assertEqual(iocextract.defang_data('http://127.0.0.1/path.ext'), 'hxxp://127[.]0[.]0[.]1/path.ext')
+        self.assertEqual(iocextract.defang_data('http://example.com/'), 'hxxp://example[.]com/')
+        self.assertEqual(iocextract.defang_data('https://example.com/'), 'hxxps://example[.]com/')
+        self.assertEqual(iocextract.defang_data('ftp://example.com/'), 'fxp://example[.]com/')
+        self.assertEqual(iocextract.defang_data('example.com'), 'example[.]com')
+        self.assertEqual(iocextract.defang_data('example.com/'), 'example[.]com/')
+        self.assertEqual(iocextract.defang_data('example.com/some/lo.ng/path.ext/'), 'example[.]com/some/lo.ng/path.ext/')
+        self.assertEqual(iocextract.defang_data('127.0.0.1'), '127[.]0[.]0[.]1')
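Those assertions pin down the defang behavior: the scheme is rewritten (`http` to `hxxp`, `ftp` to `fxp`) and only host dots are bracketed, never path dots. A minimal stdlib sketch reproducing just those cases, not the library's actual implementation:

```python
from urllib.parse import urlsplit

def defang_sketch(url):
    """Toy defang: bracket dots in the host and rewrite the scheme.

    Illustrative only; mirrors just the cases exercised above.
    """
    parts = urlsplit(url if '://' in url else '//' + url)
    host = parts.netloc.replace('.', '[.]')
    scheme = {'http': 'hxxp', 'https': 'hxxps', 'ftp': 'fxp'}.get(parts.scheme, parts.scheme)
    prefix = scheme + '://' if parts.scheme else ''
    return prefix + host + parts.path

print(defang_sketch('http://example.com/some/lo.ng/path.ext/'))  # hxxp://example[.]com/some/lo.ng/path.ext/
print(defang_sketch('127.0.0.1'))  # 127[.]0[.]0[.]1
```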

def test_email_refang(self):
content_list = [
@@ -732,7 +732,7 @@ def test_path_refang(self):

for content in content_list:
self.assertEqual(list(iocextract.extract_urls(content, refang=True))[0], 'http://example.com/test.htm')
-            self.assertEqual(iocextract.refang_url(content), 'http://example.com/test.htm')
+            self.assertEqual(iocextract.refang_data(content), 'http://example.com/test.htm')

def test_b64_url_extraction_just_url(self):
content_list = [
