Malformed URL with a [ or ] character breaks conversion #2040

Closed
edugonza opened this issue Jan 15, 2024 · 0 comments · Fixed by #2041
Labels
crash Problems preventing documents from being rendered

@edugonza

I have an HTML document that contains a malformed link with a stray ] character in the URL:

<p><a href="http://sample.com] ">My link</a>

When converting to PDF, I get the following error:

weasyprint/__init__.py:257: in write_pdf
    self.render(font_config, counter_style, **options)
weasyprint/__init__.py:214: in render
    return Document._render(self, font_config, counter_style, options)
weasyprint/document.py:262: in _render
    [Page(page_box) for page_box in page_boxes],
weasyprint/document.py:262: in <listcomp>
    [Page(page_box) for page_box in page_boxes],
weasyprint/document.py:76: in __init__
    gather_anchors(
weasyprint/anchors.py:119: in gather_anchors
    gather_anchors(child, anchors, links, bookmarks, inputs, matrix)
weasyprint/anchors.py:119: in gather_anchors
    gather_anchors(child, anchors, links, bookmarks, inputs, matrix)
weasyprint/anchors.py:119: in gather_anchors
    gather_anchors(child, anchors, links, bookmarks, inputs, matrix)
weasyprint/anchors.py:119: in gather_anchors
    gather_anchors(child, anchors, links, bookmarks, inputs, matrix)
weasyprint/anchors.py:119: in gather_anchors
    gather_anchors(child, anchors, links, bookmarks, inputs, matrix)
weasyprint/anchors.py:83: in gather_anchors
    link = box.style['link']
weasyprint/css/__init__.py:792: in __missing__
    value = COMPUTER_FUNCTIONS[key](self, key, value)
weasyprint/css/computed_values.py:563: in link
    return get_link_attribute(style.element, value, style.base_url)
weasyprint/urls.py:154: in get_link_attribute
    parsed = urlsplit(uri)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

url = '', scheme = 'https', allow_fragments = True

    def urlsplit(url, scheme='', allow_fragments=True):
        """Parse a URL into 5 components:
        <scheme>://<netloc>/<path>?<query>#<fragment>
        Return a 5-tuple: (scheme, netloc, path, query, fragment).
        Note that we don't break the components up in smaller bits
        (e.g. netloc is a single string) and we don't expand % escapes."""
        url, scheme, _coerce_result = _coerce_args(url, scheme)
        url = _remove_unsafe_bytes_from_url(url)
        scheme = _remove_unsafe_bytes_from_url(scheme)
        allow_fragments = bool(allow_fragments)
        key = url, scheme, allow_fragments, type(url), type(scheme)
        cached = _parse_cache.get(key, None)
        if cached:
            return _coerce_result(cached)
        if len(_parse_cache) >= MAX_CACHE_SIZE: # avoid runaway growth
            clear_cache()
        netloc = query = fragment = ''
        i = url.find(':')
        if i > 0:
            if url[:i] == 'http': # optimize the common case
                url = url[i+1:]
                if url[:2] == '//':
                    netloc, url = _splitnetloc(url, 2)
                    if (('[' in netloc and ']' not in netloc) or
                            (']' in netloc and '[' not in netloc)):
                        raise ValueError("Invalid IPv6 URL")
                if allow_fragments and '#' in url:
                    url, fragment = url.split('#', 1)
                if '?' in url:
                    url, query = url.split('?', 1)
                _checknetloc(netloc)
                v = SplitResult('http', netloc, url, query, fragment)
                _parse_cache[key] = v
                return _coerce_result(v)
            for c in url[:i]:
                if c not in scheme_chars:
                    break
            else:
                # make sure "url" is not actually a port number (in which case
                # "scheme" is really part of the path)
                rest = url[i+1:]
                if not rest or any(c not in '0123456789' for c in rest):
                    # not a port number
                    scheme, url = url[:i].lower(), rest
    
        if url[:2] == '//':
            netloc, url = _splitnetloc(url, 2)
            if (('[' in netloc and ']' not in netloc) or
                    (']' in netloc and '[' not in netloc)):
>               raise ValueError("Invalid IPv6 URL")
E               ValueError: Invalid IPv6 URL

/usr/local/lib/python3.8/urllib/parse.py:474: ValueError

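The conversion fails because urlsplit sees the unmatched ] in the host part of the URL, treats it as a malformed IPv6 address, and raises ValueError. WeasyPrint does not catch that exception, so it aborts the whole render.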
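The same error can be reproduced with the standard library alone, without WeasyPrint. The following is a minimal sketch, assuming a URL with an unmatched ] like the one in the HTML above; the helper name `split_error` is made up for this illustration.

```python
from urllib.parse import urlsplit


def split_error(url):
    """Return the ValueError message raised by urlsplit, or None if it parses."""
    try:
        urlsplit(url)
    except ValueError as exc:
        return str(exc)
    return None


# An unmatched "]" in the host part is rejected as a broken IPv6 literal.
print(split_error("https://sample.com] "))
# A well-formed URL parses without raising.
print(split_error("https://sample.com/"))
```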

I think that the library should still generate a file with the original URL, even if it is malformed, and possibly print a warning to the console.
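One way to get that behaviour would be to catch the ValueError around the urlsplit call and keep the raw string as an external link. This is only a rough sketch, not WeasyPrint's actual code: the function name `classify_link`, the logger name, and the `('external', uri)` return shape are all assumptions made for illustration.

```python
import logging
from urllib.parse import urlsplit

LOGGER = logging.getLogger("link-sketch")  # hypothetical logger name


def classify_link(uri):
    """Return ('internal', fragment) or ('external', uri) for a link target.

    If the URI cannot be split, log a warning and keep the original string
    as an external link instead of aborting the whole conversion.
    """
    try:
        parsed = urlsplit(uri)
    except ValueError:
        LOGGER.warning("Malformed URL kept as-is: %r", uri)
        return ("external", uri)
    # A bare "#name" has no scheme, host or path: it points inside the document.
    if parsed.fragment and not (parsed.scheme or parsed.netloc or parsed.path):
        return ("internal", parsed.fragment)
    return ("external", uri)
```

With this shape, a link such as `https://sample.com] ` would end up in the output unchanged, with a warning in the log, while well-formed links are handled as before.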

edugonza added a commit to edugonza/WeasyPrint that referenced this issue Jan 15, 2024
to fall back to an external URL when the url cannot be split
liZe added a commit that referenced this issue Jan 16, 2024
fixes #2040 Wrapped urlsplit call in a try-except block
liZe added the crash label (Problems preventing documents from being rendered) on Jan 16, 2024
liZe added this to the 61.0 milestone on Jan 16, 2024