Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wikipedia: query parameters cause URL parse failure #2412

Closed
SnoopJ opened this issue Feb 17, 2023 · 6 comments · Fixed by #2414
Closed

wikipedia: query parameters cause URL parse failure #2412

SnoopJ opened this issue Feb 17, 2023 · 6 comments · Fixed by #2414
Assignees
Labels
Bug Things to squish; generally used for issues
Milestone

Comments

@SnoopJ
Copy link
Contributor

SnoopJ commented Feb 17, 2023

Description

Wikipedia URLs that contain ?query=params can confuse Sopel, leading to a KeyError:

10:31 <+SnoopJ> Sopel, do you beef it on this URL? https://en.wikipedia.org/wiki/Gauss%27s_law?useskin=vector#Differential_form
10:31 <+Sopel> Unexpected KeyError ('parse') from SnoopJ at 2023-02-17 15:31:51.723938. Message was: https://en.wikipedia.org/wiki/Gauss%27s_law?useskin=vector#Differential_form

Reproduction steps

  1. Send a Wikipedia URL with an embedded query parameter to a Sopel instance with the wikipedia plugin active

Expected behavior

The same article-retrieval behavior as a Wikipedia URL without query parameters:

10:51 <+SnoopJ> Sopel, https://en.wikipedia.org/wiki/Gauss%27s_law#Differential_form
10:51 <+Sopel> [wikipedia] Gauss's law - Differential form | "By the divergence theorem, Gauss's law can alternatively be written in the differential form: ∇ ⋅ E = 
ρ ε 0 ε r {\displaystyle \nabla \cdot \mathbf {E} ={\frac {\rho }{\varepsilon _{0}\varepsilon _{r}}}} where ∇ · E is the divergence of the electric field, ε0 is the vacuum permittivity, ε r {\displaystyle \varepsilon _{r}} is the relative permittivity, and ρ is the volume charge density […]"

Relevant logs

No response

Notes

Exception details from the error log:

[2023-02-17 11:16:12,662] sopel.bot            ERROR    - Unexpected error ('parse') from SnoopJ at 2023-02-17 16:16:12.662537. Message was: https://en.wikipedia.org/wiki/Gauss%27s_law?useskin=vector#Differential_form                                                                                                                   
Traceback (most recent call last):                                                                                                                                    
  File "/home/snoopjedi/.sopel/venv/lib/python3.9/site-packages/sopel/bot.py", line 763, in call_rule                                                                 
    rule.execute(sopel, trigger)                                                                                                                                      
  File "/home/snoopjedi/.sopel/venv/lib/python3.9/site-packages/sopel/plugins/rules.py", line 1057, in execute                                                        
    exit_code = self._handler(bot, trigger)                                                                                                                           
  File "/home/snoopjedi/.sopel/venv/lib/python3.9/site-packages/sopel/plugins/rules.py", line 1608, in execute_handler                                                
    return handler(bot, trigger, match=trigger)                                                                                                                       
  File "/home/snoopjedi/.sopel/venv/lib/python3.9/site-packages/sopel/modules/wikipedia.py", line 300, in mw_info                                                     
    say_section(bot, trigger, match.group(1), unquote(match.group(2)), unquote(match.group(3)))                                                                       
  File "/home/snoopjedi/.sopel/venv/lib/python3.9/site-packages/sopel/modules/wikipedia.py", line 243, in say_section                                                 
    snippet = mw_section(server, query, section)                                                                                                                      
  File "/home/snoopjedi/.sopel/venv/lib/python3.9/site-packages/sopel/modules/wikipedia.py", line 264, in mw_section                                                  
    for entry in sections['parse']['sections']:                                                                                                                       
KeyError: 'parse'

Sopel version

7.1.9 (also happens on master)

Installation method

pip install

Python version

3.9.9

Operating system

Ubuntu 20.04

IRCd

No response

Relevant plugins

wikipedia

@SnoopJ SnoopJ added the Bug Things to squish; generally used for issues label Feb 17, 2023
@SnoopJ SnoopJ self-assigned this Feb 17, 2023
@SnoopJ
Copy link
Contributor Author

SnoopJ commented Feb 17, 2023

Note that this seems to happen for any query parameters, not just the key/val pair I noticed:

10:54 <SnoopJ> https://en.wikipedia.org/wiki/Gauss%27s_law?someparam=someval#Differential_form
10:54 <terribot> Unexpected error ('parse') from SnoopJ at 2023-02-17 15:54:59.015063. Message was: https://en.wikipedia.org/wiki/Gauss%27s_law?someparam=someval#Differential_form

This does not seem to occur if there is no anchor, so this may be an interaction between query parameters and anchors

10:54 <SnoopJ> https://en.wikipedia.org/wiki/Gauss%27s_law?someparam=someval
[no Sopel response]

@Exirel
Copy link
Contributor

Exirel commented Feb 18, 2023

On the other hand:

>>> import re
>>> p = re.compile(r'https?:\/\/([a-z]+(?:\.m)?\.wikipedia\.org)\/wiki\/((?!File\:)[^ #]+)#?([^ ]*)')
>>> m = p.match('https://en.wikipedia.org/wiki/Gauss%27s_law?useskin=vector#Differential_form')
>>> m.groups()
('en.wikipedia.org', 'Gauss%27s_law?useskin=vector', 'Differential_form')

This is the regex used in the plugin, you can see that the second group is now Gauss%27s_law?useskin=vector so maybe all that is needed is to remove the part after ?.

If I remove the anchor part, here is the result:

>>> m2 = p.match('https://en.wikipedia.org/wiki/Gauss%27s_law?useskin=vector')
>>> m2.groups()
('en.wikipedia.org', 'Gauss%27s_law?useskin=vector', '')

@dgw
Copy link
Member

dgw commented Feb 18, 2023

One option is to take only the part of the page title before any ? character (if it appears in the page title, it will be escaped to %3F, e.g. this species article).

The other option is to modify the regex such that the title portion matches [^ #?]+ with an optional (?:\?[^ #]+)? group between it and the fragment. I would prefer this one for code-smell reasons, but it's simultaneously the technically cleaner way and the more complicated method.

@Exirel
Copy link
Contributor

Exirel commented Feb 18, 2023

To be honest, the best way is probably to stop doing regex for that and use the good old urllib.parse.urlparse:

>>> from urllib.parse import urlparse
>>> urlparse('https://en.wikipedia.org/wiki/%3FOryzomys_pliocaenicus')
ParseResult(scheme='https', netloc='en.wikipedia.org', path='/wiki/%3FOryzomys_pliocaenicus', params='', query='', fragment='')
>>> urlparse('https://en.wikipedia.org/wiki/Gauss%27s_law?useskin=vector')
ParseResult(scheme='https', netloc='en.wikipedia.org', path='/wiki/Gauss%27s_law', params='', query='useskin=vector', fragment='')
>>> urlparse('https://en.wikipedia.org/wiki/Gauss%27s_law?useskin=vector#Differential_form')
ParseResult(scheme='https', netloc='en.wikipedia.org', path='/wiki/Gauss%27s_law', params='', query='useskin=vector', fragment='Differential_form')

And then use a regex to work on the path only. This way, we would not have to deal with using a regex to detect the anchor, the query parameter, or what have you, and only work on the path of the URL to remove the prefix (such as /wiki).

@dgw
Copy link
Member

dgw commented Feb 18, 2023

You know what? You're right. Our plugins (built-in and external) do not take enough advantage of urllib.parse.

Pattern for Wikipedia article links can be something simple, like just the first part that excludes File: URLs, and lean on urlparse for the rest. 👍

@SnoopJ
Copy link
Contributor Author

SnoopJ commented Feb 19, 2023

I agree re: using urlparse as the Real Fix for what's wrong here, the patch I came up with does this with relatively low impact on the site of the bug here, see #2414.

However, I agree with the idea that if a plugin wants to deal with a URL, it would be much more appropriate to hand over a ParseResult or an object wrapped around one. I've filed #2413 separately for that improvement which I think could be a powerful one, but definitely needs more thinking-about than the 8.0.0 milestone allows for.

SnoopJ added a commit to SnoopJ/sopel that referenced this issue Feb 20, 2023
…2412)

Co-authored-by: dgw <dgw@technobabbl.es>
Co-authored-by: Florian Strzelecki <florian.strzelecki@gmail.com>
@dgw dgw closed this as completed in #2414 Feb 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Things to squish; generally used for issues
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants