wikipedia: query parameters cause URL parse failure #2412

SnoopJ · 2023-02-17T15:54:21Z

Description

Wikipedia URLs that contain ?query=params can confuse Sopel, leading to a KeyError:

10:31 <+SnoopJ> Sopel, do you beef it on this URL? https://en.wikipedia.org/wiki/Gauss%27s_law?useskin=vector#Differential_form
10:31 <+Sopel> Unexpected KeyError ('parse') from SnoopJ at 2023-02-17 15:31:51.723938. Message was: https://en.wikipedia.org/wiki/Gauss%27s_law?useskin=vector#Differential_form

Reproduction steps

Send a Wikipedia URL with an embedded query parameter to a Sopel instance with the wikipedia plugin active

Expected behavior

The same article-retrieval behavior as a Wikipedia URL without query parameters:

10:51 <+SnoopJ> Sopel, https://en.wikipedia.org/wiki/Gauss%27s_law#Differential_form
10:51 <+Sopel> [wikipedia] Gauss's law - Differential form | "By the divergence theorem, Gauss's law can alternatively be written in the differential form: ∇ ⋅ E = 
ρ ε 0 ε r {\displaystyle \nabla \cdot \mathbf {E} ={\frac {\rho }{\varepsilon _{0}\varepsilon _{r}}}} where ∇ · E is the divergence of the electric field, ε0 is the vacuum permittivity, ε r {\displaystyle \varepsilon _{r}} is the relative permittivity, and ρ is the volume charge density […]"

Relevant logs

No response

Notes

Exception details from the error log:

[2023-02-17 11:16:12,662] sopel.bot            ERROR    - Unexpected error ('parse') from SnoopJ at 2023-02-17 16:16:12.662537. Message was: https://en.wikipedia.org/wiki/Gauss%27s_law?useskin=vector#Differential_form                                                                                                                   
Traceback (most recent call last):                                                                                                                                    
  File "/home/snoopjedi/.sopel/venv/lib/python3.9/site-packages/sopel/bot.py", line 763, in call_rule                                                                 
    rule.execute(sopel, trigger)                                                                                                                                      
  File "/home/snoopjedi/.sopel/venv/lib/python3.9/site-packages/sopel/plugins/rules.py", line 1057, in execute                                                        
    exit_code = self._handler(bot, trigger)                                                                                                                           
  File "/home/snoopjedi/.sopel/venv/lib/python3.9/site-packages/sopel/plugins/rules.py", line 1608, in execute_handler                                                
    return handler(bot, trigger, match=trigger)                                                                                                                       
  File "/home/snoopjedi/.sopel/venv/lib/python3.9/site-packages/sopel/modules/wikipedia.py", line 300, in mw_info                                                     
    say_section(bot, trigger, match.group(1), unquote(match.group(2)), unquote(match.group(3)))                                                                       
  File "/home/snoopjedi/.sopel/venv/lib/python3.9/site-packages/sopel/modules/wikipedia.py", line 243, in say_section                                                 
    snippet = mw_section(server, query, section)                                                                                                                      
  File "/home/snoopjedi/.sopel/venv/lib/python3.9/site-packages/sopel/modules/wikipedia.py", line 264, in mw_section                                                  
    for entry in sections['parse']['sections']:                                                                                                                       
KeyError: 'parse'

Sopel version

7.1.9 (also happens on master)

Installation method

pip install

Python version

3.9.9

Operating system

Ubuntu 20.04

IRCd

No response

Relevant plugins

wikipedia

The text was updated successfully, but these errors were encountered:

SnoopJ · 2023-02-17T15:56:40Z

Note that this seems to happen for any query parameters, not just the key/val pair I noticed:

10:54 <SnoopJ> https://en.wikipedia.org/wiki/Gauss%27s_law?someparam=someval#Differential_form
10:54 <terribot> Unexpected error ('parse') from SnoopJ at 2023-02-17 15:54:59.015063. Message was: https://en.wikipedia.org/wiki/Gauss%27s_law?someparam=someval#Differential_form

This does not seem to occur if there is no anchor, so this may be an interaction between query parameters and anchors

10:54 <SnoopJ> https://en.wikipedia.org/wiki/Gauss%27s_law?someparam=someval
[no Sopel response]

Exirel · 2023-02-18T12:04:20Z

On the other hand:

>>> import re
>>> p = re.compile(r'https?:\/\/([a-z]+(?:\.m)?\.wikipedia\.org)\/wiki\/((?!File\:)[^ #]+)#?([^ ]*)')
>>> m = p.match('https://en.wikipedia.org/wiki/Gauss%27s_law?useskin=vector#Differential_form')
>>> m.groups()
('en.wikipedia.org', 'Gauss%27s_law?useskin=vector', 'Differential_form')

This is the regex used in the plugin, you can see that the second group is now Gauss%27s_law?useskin=vector so maybe all that is needed is to remove the part after ?.

If I remove the anchor part, here is the result:

>>> m2 = p.match('https://en.wikipedia.org/wiki/Gauss%27s_law?useskin=vector')
>>> m2.groups()
('en.wikipedia.org', 'Gauss%27s_law?useskin=vector', '')

dgw · 2023-02-18T12:35:05Z

One option is to take only the part of the page title before any ? character (if it appears in the page title, it will be escaped to %3F, e.g. this species article).

The other option is to modify the regex such that the title portion matches [^ #?]+ with an optional (?:\?[^ #]+)? group between it and the fragment. I would prefer this one for code-smell reasons, but it's simultaneously the technically cleaner way and the more complicated method.

Exirel · 2023-02-18T12:52:38Z

To be honest, the best way is probably to stop doing regex for that and use the good old urllib.parse.urlparse:

>>> from urllib.parse import urlparse
>>> urlparse('https://en.wikipedia.org/wiki/%3FOryzomys_pliocaenicus')
ParseResult(scheme='https', netloc='en.wikipedia.org', path='/wiki/%3FOryzomys_pliocaenicus', params='', query='', fragment='')
>>> urlparse('https://en.wikipedia.org/wiki/Gauss%27s_law?useskin=vector')
ParseResult(scheme='https', netloc='en.wikipedia.org', path='/wiki/Gauss%27s_law', params='', query='useskin=vector', fragment='')
>>> urlparse('https://en.wikipedia.org/wiki/Gauss%27s_law?useskin=vector#Differential_form')
ParseResult(scheme='https', netloc='en.wikipedia.org', path='/wiki/Gauss%27s_law', params='', query='useskin=vector', fragment='Differential_form')

And then use a regex to work on the path only. This way, we would not have to deal with using a regex to detect the anchor, the query parameter, or what have you, and only work on the path of the URL to remove the prefix (such as /wiki).

dgw · 2023-02-18T13:15:41Z

You know what? You're right. Our plugins (built-in and external) do not take enough advantage of urllib.parse.

Pattern for Wikipedia article links can be something simple, like just the first part that excludes File: URLs, and lean on urlparse for the rest. 👍

…2412)

SnoopJ · 2023-02-19T04:20:25Z

I agree re: using urlparse as the Real Fix for what's wrong here, the patch I came up with does this with relatively low impact on the site of the bug here, see #2414.

However, I agree with the idea that if a plugin wants to deal with a URL, it would be much more appropriate to hand over a ParseResult or an object wrapped around one. I've filed #2413 separately for that improvement which I think could be a powerful one, but definitely needs more thinking-about than the 8.0.0 milestone allows for.

…2412) Co-authored-by: dgw <dgw@technobabbl.es> Co-authored-by: Florian Strzelecki <florian.strzelecki@gmail.com>

SnoopJ added the Bug Things to squish; generally used for issues label Feb 17, 2023

SnoopJ self-assigned this Feb 17, 2023

dgw added this to the 8.0.0 milestone Feb 18, 2023

SnoopJ mentioned this issue Feb 19, 2023

Provide urllib.parse.ParseResult to plugins with URL triggers #2413

Closed

SnoopJ added a commit to SnoopJ/sopel that referenced this issue Feb 19, 2023

wikipedia: support query strings in Wikipedia URLs (closes sopel-irc#…

4f0cfd3

…2412)

SnoopJ mentioned this issue Feb 19, 2023

wikipedia: support query strings in Wikipedia URLs #2414

Merged

4 tasks

SnoopJ added a commit to SnoopJ/sopel that referenced this issue Feb 20, 2023

wikipedia: support query strings in Wikipedia URLs (closes sopel-irc#…

9229c20

…2412) Co-authored-by: dgw <dgw@technobabbl.es> Co-authored-by: Florian Strzelecki <florian.strzelecki@gmail.com>

dgw closed this as completed in #2414 Feb 20, 2023

SnoopJ mentioned this issue Nov 29, 2023

wikipedia: results for namespaces/articles with colons in title are surprising #2573

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

wikipedia: query parameters cause URL parse failure #2412

wikipedia: query parameters cause URL parse failure #2412

SnoopJ commented Feb 17, 2023 •

edited

Loading

SnoopJ commented Feb 17, 2023

Exirel commented Feb 18, 2023

dgw commented Feb 18, 2023

Exirel commented Feb 18, 2023

dgw commented Feb 18, 2023

SnoopJ commented Feb 19, 2023

wikipedia: query parameters cause URL parse failure #2412

wikipedia: query parameters cause URL parse failure #2412

Comments

SnoopJ commented Feb 17, 2023 • edited Loading

Description

Reproduction steps

Expected behavior

Relevant logs

Notes

Sopel version

Installation method

Python version

Operating system

IRCd

Relevant plugins

SnoopJ commented Feb 17, 2023

Exirel commented Feb 18, 2023

dgw commented Feb 18, 2023

Exirel commented Feb 18, 2023

dgw commented Feb 18, 2023

SnoopJ commented Feb 19, 2023

SnoopJ commented Feb 17, 2023 •

edited

Loading