Merge pull request #69 from InQuest/rc
Release: v1.15.2
azazelm3dj3d authored Apr 18, 2023
2 parents f7ce83f + f24ead2 commit 0fcd139
Showing 7 changed files with 66 additions and 69 deletions.
10 changes: 5 additions & 5 deletions README.md
@@ -2,7 +2,7 @@ iocextract
==========

![Developed by InQuest](https://inquest.net/images/inquest-badge.svg)
-![Build Status](https://github.com/InQuest/python-iocextract/workflows/iocextract-build/badge.svg)
+![Build Status](https://github.com/InQuest/iocextract/workflows/iocextract-build/badge.svg)
[![Documentation Status](https://readthedocs.org/projects/iocextract/badge/?version=latest)](https://inquest.readthedocs.io/projects/iocextract/en/latest/?badge=latest)
![PyPI Version](https://img.shields.io/pypi/v/iocextract.svg)

@@ -152,7 +152,7 @@ http://example.com
http://example.com:8989/bad
"""

-for url in iocextract.extract_urls(content, defang_data=False):
+for url in iocextract.extract_urls(content, defang=False):
print(url)

# Output
@@ -234,7 +234,7 @@ Note: You will most likely end up with extra garbage at the end of URLs.
>> A. Maybe, but you should consider using the `--strip-urls` CLI flag (or the `strip=True` parameter in the library), and you may still get some extra garbage in your output. If you're extracting from HTML, consider using something like [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) to first isolate the text content, and then pass that to iocextract, [like this](https://gist.github.com/rshipp/d399491305c5d293357a800d5a51b0aa).
> Q. Extracting IOCs that have not been defanged, from binary data like executables, or very large inputs?
->> A. There is a very simplistic version of this available when running as a library, but it requires the `defang_data=False` parameter and could potentially miss some of the IOCs. The regex in iocextract is designed to be flexible to catch defanged IOCs. If you're unable to collect the information you need, consider using something like [Cacador](https://github.com/sroberts/cacador) instead.
+>> A. There is a very simplistic version of this available when running as a library, but it requires the `defang=False` parameter and could potentially miss some of the IOCs. The regex in iocextract is designed to be flexible to catch defanged IOCs. If you're unable to collect the information you need, consider using something like [Cacador](https://github.com/sroberts/cacador) instead.
More Details
------------
@@ -308,7 +308,7 @@ For URLs, the following defang techniques are supported:
| URL encoded | `http%3A%2F%2fexample%2Ecom%2Fpath` | `http://example.com/path` |
| Base64 encoded | `aHR0cDovL2V4YW1wbGUuY29tL3BhdGgK` | `http://example.com/path` |

-NOTE: The tables above are not exhaustive, and other URL/defang patterns may also be extracted correctly. If you notice something missing or not working correctly, feel free to let us know via the [GitHub Issues](https://github.com/inquest/python-iocextract/issues).
+NOTE: The tables above are not exhaustive, and other URL/defang patterns may also be extracted correctly. If you notice something missing or not working correctly, feel free to let us know via the [GitHub Issues](https://github.com/inquest/iocextract/issues).

The base64 regex was generated with [@deadpixi](https://github.com/deadpixi)'s [base64 regex tool](https://www.erlang-factory.com/upload/presentations/225/ErlangFactorySFBay2010-RobKing.pdf).
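Both encoded rows in the table above can be reversed with the Python standard library alone. The snippet below is a simplified sketch of the idea; its regexes are illustrative stand-ins, not iocextract's actual patterns:

```python
import base64
import re
from urllib.parse import unquote

# Percent-encoded URL from the table above
print(unquote('http%3A%2F%2fexample%2Ecom%2Fpath'))  # http://example.com/path

# Base64-embedded URL: find long base64 runs, decode them,
# then scan the decoded text for plain URLs.
text = 'payload: aHR0cDovL2V4YW1wbGUuY29tL3BhdGgK ...'
for run in re.findall(r'[A-Za-z0-9+/]{16,}={0,2}', text):
    decoded = base64.b64decode(run).decode('utf-8', errors='replace')
    for url in re.findall(r'https?://\S+', decoded):
        print(url)  # http://example.com/path
```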

@@ -391,7 +391,7 @@ If you're working with YARA rules, you may be interested in [plyara](https://git
Contributing
------------

-If you have a defang technique that doesn't make it through the extractor, or if you find any bugs, Pull Requests and Issues are always welcome. The library is released under a GPL-2.0 [license](https://github.com/InQuest/python-iocextract/blob/master/LICENSE).
+If you have a defang technique that doesn't make it through the extractor, or if you find any bugs, Pull Requests and Issues are always welcome. The library is released under a GPL-2.0 [license](https://github.com/InQuest/iocextract/blob/master/LICENSE).

Who's using iocextract?
-----------------------
10 changes: 5 additions & 5 deletions docs/README.rst
@@ -4,8 +4,8 @@ iocextract
.. image:: https://inquest.net/images/inquest-badge.svg
:target: https://inquest.net/
:alt: Developed by InQuest
-.. image:: https://github.com/InQuest/python-iocextract/workflows/iocextract-build/badge.svg
-   :target: https://github.com/InQuest/python-iocextract/workflows/iocextract-build/
+.. image:: https://github.com/InQuest/iocextract/workflows/iocextract-build/badge.svg
+   :target: https://github.com/InQuest/iocextract/workflows/iocextract-build/
:alt: Build Status
.. image:: https://readthedocs.org/projects/iocextract/badge/?version=latest
:target: https://inquest.readthedocs.io/projects/iocextract/en/latest/
@@ -123,7 +123,7 @@ If you don't want to defang the extracted IOCs at all during extraction, you can
http://example.com:8989/bad
"""

-for url in iocextract.extract_urls(content, defang_data=False):
+for url in iocextract.extract_urls(content, defang=False):
print(url)

# Output
@@ -205,7 +205,7 @@ Maybe, but you should consider using the ``--strip-urls`` CLI flag (or the ``str

**Extracting IOCs that have not been defanged, from binary data like executables, or very large inputs?**

-There is a very simplistic version of this available when running as a library, but it requires the ``defang_data=False`` parameter and could potentially miss some of the IOCs. The regex in iocextract is designed to be flexible to catch defanged IOCs. If you're unable to collect the information you need, consider using something like `Cacador`_ instead.
+There is a very simplistic version of this available when running as a library, but it requires the ``defang=False`` parameter and could potentially miss some of the IOCs. The regex in iocextract is designed to be flexible to catch defanged IOCs. If you're unable to collect the information you need, consider using something like `Cacador`_ instead.

More Details
------------
@@ -414,7 +414,7 @@ Who's using iocextract
Are you using it? Want to see your site listed here? Let us know!

.. _Indicator of Compromise: https://en.wikipedia.org/wiki/Indicator_of_compromise
-.. _Issues: https://github.com/inquest/python-iocextract/issues
+.. _Issues: https://github.com/inquest/iocextract/issues
.. _this tweet from @InQuest: https://twitter.com/InQuest/status/969469856931287041
.. _Cisco ESA: https://www.cisco.com/c/en/us/support/docs/security/email-security-appliance/118775-technote-esa-00.html
.. _appropriate wheel from PyPI: https://pypi.org/project/regex/#files
6 changes: 3 additions & 3 deletions docs/_templates/links.html
@@ -36,10 +36,10 @@ <h3>Other Projects</h3>

<h3>Useful Links</h3>
<ul>
-<li><a href="https://github.com/InQuest/python-iocextract">GitHub Repository</a></li>
+<li><a href="https://github.com/InQuest/iocextract">GitHub Repository</a></li>
<li><a href="https://pypi.org/project/iocextract">PyPI Package</a></li>
-<li><a href="https://github.com/InQuest/python-iocextract/issues">Issue Tracker</a></li>
-<li><a href="https://github.com/InQuest/python-iocextract/releases">Changelog</a></li>
+<li><a href="https://github.com/InQuest/iocextract/issues">Issue Tracker</a></li>
+<li><a href="https://github.com/InQuest/iocextract/releases">Changelog</a></li>
</ul>

<h3>Stay Informed</h3>
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -90,7 +90,7 @@
'logo_name': 'true',
'description': 'Advanced Indicator of Compromise (IOC) extractor.',
'github_user': 'InQuest',
-    'github_repo': 'python-iocextract',
+    'github_repo': 'iocextract',
'github_type': 'star',
'show_powered_by': 'false',
'page_width': 'auto',
63 changes: 30 additions & 33 deletions iocextract.py
Expand Up @@ -401,7 +401,7 @@ def extract_urls(
delimiter=False,
open_punc=False,
no_scheme=False,
-    defang_data=False,
+    defang=False,
):
"""
Extract URLs!
@@ -414,7 +414,7 @@
:param bool delimiter: Continue extracting even after whitespace is detected
:param bool open_punc: Disabled puncuation regex
:param bool no_scheme: Remove protocol (http, tcp, etc.) type in output
-    :param bool defang_data: Extract non-defanged IOCs
+    :param bool defang: Extract non-defanged IOCs
:rtype: :py:func:`itertools.chain`
"""

@@ -425,14 +425,14 @@
strip=strip,
open_punc=open_punc,
no_scheme=no_scheme,
-            defang_data=defang_data,
+            defang=defang,
),
extract_encoded_urls(data, refang=refang, strip=strip, delimiter=delimiter),
)


def extract_unencoded_urls(
-    data, refang=False, strip=False, open_punc=False, no_scheme=False, defang_data=False
+    data, refang=False, strip=False, open_punc=False, no_scheme=False, defang=False
):
"""
Extract only unencoded URLs!
@@ -442,40 +442,37 @@ def extract_unencoded_urls(
:param bool strip: Strip possible garbage from the end of URLs
:param bool open_punc: Disabled puncuation regex
:param bool no_scheme: Remove protocol (http, tcp, etc.) type in output
-    :param bool defang_data: Extract non-defanged IOCs
+    :param bool defang: Extract non-defanged IOCs
:rtype: Iterator[:class:`str`]
"""

-    if "[" not in data:
-        if defang_data:
-            data = str(data).replace(".", "[.]")
-
-        yield data
-
-    else:
-        unencoded_urls = itertools.chain(
-            url_re(open_punc).finditer(data),
-            BRACKET_URL_RE.finditer(data),
-            BACKSLASH_URL_RE.finditer(data),
-        )
+    unencoded_urls = itertools.chain(
+        url_re(open_punc).finditer(data),
+        BRACKET_URL_RE.finditer(data),
+        BACKSLASH_URL_RE.finditer(data),
+    )

-        for url in unencoded_urls:
-            if refang:
-                url = refang_url(url.group(1), no_scheme=no_scheme)
-            else:
-                url = url.group(1)
+    for url in unencoded_urls:
+        if refang or defang:
+            if refang:
+                url = refang_data(url.group(1), no_scheme=no_scheme)
+
+            if defang:
+                url = defang_data(url.group(1))
+        else:
+            url = url.group(1)

-            # Checks for whitespace in the string
-            def found_ws(s):
-                return True in [check_s in s for check_s in whitespace]
+        # Checks for whitespace in the string
+        def found_ws(s):
+            return True in [check_s in s for check_s in whitespace]

-            if strip:
-                if found_ws(url):
-                    url = re.split(WS_SYNTAX_RM, url)[0]
-                else:
-                    url = re.split(URL_SPLIT_STR, url)[0]
+        if strip:
+            if found_ws(url):
+                url = re.split(WS_SYNTAX_RM, url)[0]
+            else:
+                url = re.split(URL_SPLIT_STR, url)[0]

-            yield url
+        yield url
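The refang step in that loop boils down to undoing dot and scheme substitutions. A toy sketch of the idea, covering only a few of the defang styles the library's regexes actually handle:

```python
import re

def refang_sketch(url):
    """Toy refang: undo a few common defang substitutions.

    Illustrative only; iocextract's real logic covers far more cases.
    """
    url = url.replace('[.]', '.').replace('(.)', '.').replace('[dot]', '.')
    return re.sub(r'^hxxp', 'http', url, flags=re.IGNORECASE)

print(refang_sketch('hxxps://example[.]com/test'))  # https://example.com/test
```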


def extract_encoded_urls(
@@ -728,7 +725,7 @@ def extract_custom_iocs(data, regex_list):
"""
Extract using custom regex strings!
-    Need help? Check out the README: https://github.com/inquest/python-iocextract#custom-regex
+    Need help? Check out the README: https://github.com/inquest/iocextract#custom-regex
:param data: Input text
:param regex_list: List of strings to treat as regex and match against data
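A hedged sketch of what those two parameters drive: each string in `regex_list` is treated as a regex and matched against the data. The single-capture-group convention assumed below comes from the README section linked above, not from this diff:

```python
import re

def extract_custom_sketch(data, regex_list):
    """Sketch of custom-regex extraction: yield group(1) of every match.

    Assumes each pattern carries exactly one capture group.
    """
    for pattern in regex_list:
        for match in re.finditer(pattern, data):
            yield match.group(1)

# Hypothetical pattern for ticket-style identifiers
print(list(extract_custom_sketch('id=ABC-1234 id=XYZ-9999', [r'id=([A-Z]{3}-\d{4})'])))
# ['ABC-1234', 'XYZ-9999']
```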
@@ -814,7 +811,7 @@ def refang_email(email):
)


-def refang_url(url, no_scheme=False):
+def refang_data(url, no_scheme=False):
"""
Refang a URL!
@@ -923,7 +920,7 @@ def refang_ipv4(ip_address):
)


-def defang(ioc):
+def defang_data(ioc):
"""
Defang a URL, domain, or IPv4 address!
4 changes: 2 additions & 2 deletions setup.py
@@ -9,7 +9,7 @@

setup(
name='iocextract',
-    version='1.15.1',
+    version='1.15.2',
include_package_data=True,
py_modules=['iocextract',],
install_requires=['regex',],
@@ -27,7 +27,7 @@
description='Advanced Indicator of Compromise (IOC) extractor.',
long_description=README,
long_description_content_type = "text/markdown",
-    url='https://github.com/InQuest/python-iocextract',
+    url='https://github.com/InQuest/iocextract',
author='InQuest Labs',
author_email='labs@inquest.net',
classifiers=[
40 changes: 20 additions & 20 deletions tests.py
@@ -577,7 +577,7 @@ def test_refang_ipv4(self):
self.assertEqual(list(iocextract.extract_ipv4s(content, refang=True))[0], '111.111.111.111')
self.assertEqual(iocextract.refang_ipv4(content), '111.111.111.111')

-    def test_refang_url(self):
+    def test_refang_data(self):
content_list = [
'http://example.com/test',
'http:// example .com /test',
@@ -599,18 +599,18 @@ def test_refang_url(self):
]

for content in content_list:
-            self.assertEqual(iocextract.refang_url(content), 'http://example.com/test')
+            self.assertEqual(iocextract.refang_data(content), 'http://example.com/test')

-        self.assertEqual(iocextract.refang_url('ftx://example.com/test'), 'ftp://example.com/test')
+        self.assertEqual(iocextract.refang_data('ftx://example.com/test'), 'ftp://example.com/test')

# IPv6 works as expected
content = 'http://[2001:db8:85a3:0:0:8a2e:370:7334]:80/test'
-        self.assertEqual(iocextract.refang_url(content), content)
+        self.assertEqual(iocextract.refang_data(content), content)
self.assertEqual(list(iocextract.extract_urls(content, refang=True))[0], content)

# HXXPS
for content in ['hxxps://example[.]com/test', 'hXXps://example[dot]com/test']:
-            self.assertEqual(iocextract.refang_url(content), 'https://example.com/test')
+            self.assertEqual(iocextract.refang_data(content), 'https://example.com/test')

def test_url_extraction_handles_punctuation(self):
self.assertEqual(list(iocextract.extract_urls('example[.]com!'))[0], 'example[.]com')
@@ -657,8 +657,8 @@ def test_urlencoded_url_extraction(self):

def test_refang_never_excepts_from_urlparse(self):
try:
-            iocextract.refang_url('hxxp__test]')
-            iocextract.refang_url('CDATA[^h00ps://test.com/]]>')
+            iocextract.refang_data('hxxp__test]')
+            iocextract.refang_data('CDATA[^h00ps://test.com/]]>')
except ValueError as e:
self.fail('Unhandled parsing error in refang: {e}'.format(e=e))

@@ -670,8 +670,8 @@ def test_url_generic_regex_tight_edge_cases(self):
self.assertEqual(len(list(iocextract.extract_urls('https:// test /'))), 1)

def test_refang_removes_some_backslash_escaped_characters(self):
-        self.assertEqual(iocextract.refang_url('https://example\(.)com/'), 'https://example.com/')
-        self.assertEqual(iocextract.refang_url('https://example\(.\)com/test\.html'), 'https://example.com/test.html')
+        self.assertEqual(iocextract.refang_data('https://example\(.)com/'), 'https://example.com/')
+        self.assertEqual(iocextract.refang_data('https://example\(.\)com/test\.html'), 'https://example.com/test.html')

def test_ip_regex_allows_multiple_brackets(self):
self.assertEqual(list(iocextract.extract_ips('10.10.10.]]]10', refang=True))[0], '10.10.10.10')
@@ -688,16 +688,16 @@ def test_ip_regex_allows_backslash_escape(self):
self.assertEqual(list(iocextract.extract_ips('10[.]10(.10\.10', refang=True))[0], '10.10.10.10')

def test_defang(self):
-        self.assertEqual(iocextract.defang('http://example.com/some/lo.ng/path.ext/'), 'hxxp://example[.]com/some/lo.ng/path.ext/')
-        self.assertEqual(iocextract.defang('http://example.com/path.ext'), 'hxxp://example[.]com/path.ext')
-        self.assertEqual(iocextract.defang('http://127.0.0.1/path.ext'), 'hxxp://127[.]0[.]0[.]1/path.ext')
-        self.assertEqual(iocextract.defang('http://example.com/'), 'hxxp://example[.]com/')
-        self.assertEqual(iocextract.defang('https://example.com/'), 'hxxps://example[.]com/')
-        self.assertEqual(iocextract.defang('ftp://example.com/'), 'fxp://example[.]com/')
-        self.assertEqual(iocextract.defang('example.com'), 'example[.]com')
-        self.assertEqual(iocextract.defang('example.com/'), 'example[.]com/')
-        self.assertEqual(iocextract.defang('example.com/some/lo.ng/path.ext/'), 'example[.]com/some/lo.ng/path.ext/')
-        self.assertEqual(iocextract.defang('127.0.0.1'), '127[.]0[.]0[.]1')
+        self.assertEqual(iocextract.defang_data('http://example.com/some/lo.ng/path.ext/'), 'hxxp://example[.]com/some/lo.ng/path.ext/')
+        self.assertEqual(iocextract.defang_data('http://example.com/path.ext'), 'hxxp://example[.]com/path.ext')
+        self.assertEqual(iocextract.defang_data('http://127.0.0.1/path.ext'), 'hxxp://127[.]0[.]0[.]1/path.ext')
+        self.assertEqual(iocextract.defang_data('http://example.com/'), 'hxxp://example[.]com/')
+        self.assertEqual(iocextract.defang_data('https://example.com/'), 'hxxps://example[.]com/')
+        self.assertEqual(iocextract.defang_data('ftp://example.com/'), 'fxp://example[.]com/')
+        self.assertEqual(iocextract.defang_data('example.com'), 'example[.]com')
+        self.assertEqual(iocextract.defang_data('example.com/'), 'example[.]com/')
+        self.assertEqual(iocextract.defang_data('example.com/some/lo.ng/path.ext/'), 'example[.]com/some/lo.ng/path.ext/')
+        self.assertEqual(iocextract.defang_data('127.0.0.1'), '127[.]0[.]0[.]1')
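Those assertions pin down the defang behavior: the scheme is rewritten (`http` to `hxxp`, `ftp` to `fxp`) and only host dots are bracketed, never path dots. A minimal stdlib sketch reproducing just those cases, not the library's actual implementation:

```python
from urllib.parse import urlsplit

def defang_sketch(url):
    """Toy defang: bracket dots in the host and rewrite the scheme.

    Illustrative only; mirrors just the cases exercised above.
    """
    parts = urlsplit(url if '://' in url else '//' + url)
    host = parts.netloc.replace('.', '[.]')
    scheme = {'http': 'hxxp', 'https': 'hxxps', 'ftp': 'fxp'}.get(parts.scheme, parts.scheme)
    prefix = scheme + '://' if parts.scheme else ''
    return prefix + host + parts.path

print(defang_sketch('http://example.com/some/lo.ng/path.ext/'))  # hxxp://example[.]com/some/lo.ng/path.ext/
print(defang_sketch('127.0.0.1'))  # 127[.]0[.]0[.]1
```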

def test_email_refang(self):
content_list = [
@@ -732,7 +732,7 @@ def test_path_refang(self):

for content in content_list:
self.assertEqual(list(iocextract.extract_urls(content, refang=True))[0], 'http://example.com/test.htm')
-            self.assertEqual(iocextract.refang_url(content), 'http://example.com/test.htm')
+            self.assertEqual(iocextract.refang_data(content), 'http://example.com/test.htm')

def test_b64_url_extraction_just_url(self):
content_list = [
