Malformed URL with a [ or ] character breaks conversion #2040

edugonza · 2024-01-15T14:35:25Z

I have an HTML document that contains a malformed link with a ] character:

<p><a href="http://sample.com] ">My link</a>

When converting to PDF, I get the following error:

weasyprint/__init__.py:257: in write_pdf
    self.render(font_config, counter_style, **options)
weasyprint/__init__.py:214: in render
    return Document._render(self, font_config, counter_style, options)
weasyprint/document.py:262: in _render
    [Page(page_box) for page_box in page_boxes],
weasyprint/document.py:262: in <listcomp>
    [Page(page_box) for page_box in page_boxes],
weasyprint/document.py:76: in __init__
    gather_anchors(
weasyprint/anchors.py:119: in gather_anchors
    gather_anchors(child, anchors, links, bookmarks, inputs, matrix)
weasyprint/anchors.py:119: in gather_anchors
    gather_anchors(child, anchors, links, bookmarks, inputs, matrix)
weasyprint/anchors.py:119: in gather_anchors
    gather_anchors(child, anchors, links, bookmarks, inputs, matrix)
weasyprint/anchors.py:119: in gather_anchors
    gather_anchors(child, anchors, links, bookmarks, inputs, matrix)
weasyprint/anchors.py:119: in gather_anchors
    gather_anchors(child, anchors, links, bookmarks, inputs, matrix)
weasyprint/anchors.py:83: in gather_anchors
    link = box.style['link']
weasyprint/css/__init__.py:792: in __missing__
    value = COMPUTER_FUNCTIONS[key](self, key, value)
weasyprint/css/computed_values.py:563: in link
    return get_link_attribute(style.element, value, style.base_url)
weasyprint/urls.py:154: in get_link_attribute
    parsed = urlsplit(uri)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

url = '', scheme = 'https', allow_fragments = True

    def urlsplit(url, scheme='', allow_fragments=True):
        """Parse a URL into 5 components:
        <scheme>://<netloc>/<path>?<query>#<fragment>
        Return a 5-tuple: (scheme, netloc, path, query, fragment).
        Note that we don't break the components up in smaller bits
        (e.g. netloc is a single string) and we don't expand % escapes."""
        url, scheme, _coerce_result = _coerce_args(url, scheme)
        url = _remove_unsafe_bytes_from_url(url)
        scheme = _remove_unsafe_bytes_from_url(scheme)
        allow_fragments = bool(allow_fragments)
        key = url, scheme, allow_fragments, type(url), type(scheme)
        cached = _parse_cache.get(key, None)
        if cached:
            return _coerce_result(cached)
        if len(_parse_cache) >= MAX_CACHE_SIZE: # avoid runaway growth
            clear_cache()
        netloc = query = fragment = ''
        i = url.find(':')
        if i > 0:
            if url[:i] == 'http': # optimize the common case
                url = url[i+1:]
                if url[:2] == '//':
                    netloc, url = _splitnetloc(url, 2)
                    if (('[' in netloc and ']' not in netloc) or
                            (']' in netloc and '[' not in netloc)):
                        raise ValueError("Invalid IPv6 URL")
                if allow_fragments and '#' in url:
                    url, fragment = url.split('#', 1)
                if '?' in url:
                    url, query = url.split('?', 1)
                _checknetloc(netloc)
                v = SplitResult('http', netloc, url, query, fragment)
                _parse_cache[key] = v
                return _coerce_result(v)
            for c in url[:i]:
                if c not in scheme_chars:
                    break
            else:
                # make sure "url" is not actually a port number (in which case
                # "scheme" is really part of the path)
                rest = url[i+1:]
                if not rest or any(c not in '0123456789' for c in rest):
                    # not a port number
                    scheme, url = url[:i].lower(), rest
    
        if url[:2] == '//':
            netloc, url = _splitnetloc(url, 2)
            if (('[' in netloc and ']' not in netloc) or
                    (']' in netloc and '[' not in netloc)):
>               raise ValueError("Invalid IPv6 URL")
E               ValueError: Invalid IPv6 URL

/usr/local/lib/python3.8/urllib/parse.py:474: ValueError

The conversion fails because urlsplit sees a ] without a matching [ in the host part of the URL, treats it as a malformed IPv6 URL, and raises a ValueError that nothing catches.
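
The failure can be reproduced outside of WeasyPrint with the standard library alone. The snippet below is a minimal sketch; `try_split` is a hypothetical helper written for this illustration and is not part of WeasyPrint or `urllib`.

```python
from urllib.parse import urlsplit


def try_split(url):
    """Return the urlsplit result, or the ValueError that urlsplit raised."""
    try:
        return urlsplit(url)
    except ValueError as exc:
        return exc


# The host part contains "]" without a matching "[", so urlsplit rejects it.
result = try_split("http://sample.com] ")
print(type(result).__name__, result)

# A well-formed URL is split normally.
print(try_split("http://sample.com/").netloc)
```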

I think that the library should still generate a file with the original URL, even if it is malformed, and possibly print a warning to the console.
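
The behaviour suggested above could look roughly like the sketch below. It is an illustration only, not WeasyPrint's actual code: `classify_link`, the `"external"`/`"internal"` labels, and the logger name are all assumptions made for this example.

```python
import logging
from urllib.parse import urlsplit

# Hypothetical logger name, chosen for this example only.
LOGGER = logging.getLogger("example.links")


def classify_link(uri):
    """Classify a link target without ever raising on a malformed URL.

    Returns ("internal", fragment) for pure "#fragment" links and
    ("external", uri) for everything else, including URLs that
    urlsplit cannot parse.
    """
    try:
        parsed = urlsplit(uri)
    except ValueError as exc:
        # Keep the original URL and warn instead of aborting the rendering.
        LOGGER.warning("Malformed URL %r: %s", uri, exc)
        return ("external", uri)
    only_fragment = not (
        parsed.scheme or parsed.netloc or parsed.path or parsed.query)
    if only_fragment and parsed.fragment:
        return ("internal", parsed.fragment)
    return ("external", uri)


print(classify_link("http://sample.com] "))
print(classify_link("#top"))
```

With this approach the malformed link is kept verbatim in the output, and the warning gives the user a hint about which URL to fix.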

edugonza added a commit to edugonza/WeasyPrint that referenced this issue Jan 15, 2024

Refs Kozea#2040 Wrapped urlsplit call in a try-except block

68f439c

to fall back to an external URL when the url cannot be split

edugonza mentioned this issue Jan 15, 2024

fixes #2040 Wrapped urlsplit call in a try-except block #2041

Merged

liZe closed this as completed in #2041 Jan 16, 2024

liZe added a commit that referenced this issue Jan 16, 2024

Merge pull request #2041 from edugonza/main

bf58cb3

fixes #2040 Wrapped urlsplit call in a try-except block

liZe added the "crash" label (Problems preventing documents from being rendered) Jan 16, 2024

liZe added this to the 61.0 milestone Jan 16, 2024
