Are source URLs getting incorrectly URL-decoded? #359

Open
snarfed opened this issue Mar 30, 2023 · 18 comments
@snarfed
Contributor

snarfed commented Mar 30, 2023

Hi @dshanske @pfefferle! I'm seeing an odd issue with source URLs with URL-encoded # characters, eg https://fed.brid.gy/render?id=https%3A%2F%2Findieweb.social%2Fusers%2Fsnarfed%23likes%2F709275 . That page has a u-like-of with a full p-author h-card, with name and photo, but when WordPress receives it as a webmention source, Semantic-Linkbacks doesn't find that author at all.

However, if I double-URL-encode the # character, ie https://fed.brid.gy/render?id=https%3A%2F%2Findieweb.social%2Fusers%2Fsnarfed%2523likes%2F709275 , the webmention works fine and correctly shows the author name and image.

I know URLs with #s are awkward, even when URL-encoded, but the first source URL is working ok with other wm receivers, eg https://www.jvt.me/week-notes/2023/09/ (scroll down and expand Interactions with this post), so I suspect this is a bug in this plugin or Semantic-Linkbacks?

Thanks in advance!

snarfed changed the title from "Are source URLs getting URL-decoded?" to "Are source URLs getting incorrectly URL-decoded?" on Mar 30, 2023
@pfefferle
Owner

@snarfed this might be an issue with the Mf2 parser, since it supports fragment parsing.

@pfefferle
Owner

@snarfed is the author outside of the fragment?

@snarfed
Contributor Author

snarfed commented Mar 30, 2023

The source URL doesn't contain a fragment, it contains %23, which happens to be an encoded # character. I think the plugin(s) are decoding that part of the URL, but shouldn't be, since the form-encoded POST body shouldn't be URL-decoded. (I think?)

Ideally the plugins/parser would leave that %23 in the URL alone when fetching it and parsing mf2.

@pfefferle
Owner

This is a really good question!

@pfefferle
Owner

I would assume that they have to be URL-encoded, because otherwise an = might be misinterpreted as a separator between a form parameter and its value.

@pfefferle
Owner

pfefferle commented Mar 30, 2023

And the content type is application/x-www-form-urlencoded, so it literally mentions "urlencoded", but I will have a look at the spec.

@snarfed
Contributor Author

snarfed commented Mar 30, 2023

From @sknebel in chat:

for keys and values, percent-encode everything "except the ASCII alphanumeric, U+002A (*), U+002D (-), U+002E (.), and U+005F (_)."
URL spec: https://url.spec.whatwg.org/#concept-urlencoded-serializer (and the quote specifically from https://url.spec.whatwg.org/#application-x-www-form-urlencoded-percent-encode-set )

I've confirmed that browsers URL-encode, so a form-encoded POST with key url and value http://test/url%23fragment results in the raw request body url=http%3A%2F%2Ftest%2Furl%2523fragment.
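As a rough illustration of the encoding step described above, here is a minimal Python sketch using the standard library's `urllib.parse`. It is not what the browser or the plugins run; it only shows that form-encoding a value that already contains `%23` escapes the percent sign itself, and that decoding the body exactly once recovers the original value.

```python
from urllib.parse import urlencode, parse_qs

# Value that already contains a percent-encoded "#" (%23).
value = "http://test/url%23fragment"

# Build an application/x-www-form-urlencoded body; "%" itself becomes "%25".
body = urlencode({"url": value})
print(body)

# Decoding the body exactly once gives back the original value, %23 intact.
decoded = parse_qs(body)["url"][0]
print(decoded)
```

The `%2523` in the body is therefore expected, and a receiver that decodes once should end up with `%23`, not `#`.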

@snarfed
Contributor Author

snarfed commented Mar 30, 2023

I've also confirmed that my code is doing the same thing, ie the # is double-URL-encoded to %2523, so the raw webmention POST body looks like:

source=https%3A%2F%2Ffed.brid.gy%2Frender%3Fid%3Dhttps%253A%252F%252Findieweb.social%252Fusers%252Fsnarfed%2523likes%252F709275&target=https%3A%2F%2Fsnarfed.org%2F2023-03-28_49662

Note the %2523 in the source value. So @pfefferle you're absolutely right, the Webmention/Semantic Linkbacks plugins should URL-decode it once to get %23, but I think not twice, which they seem to be doing right now?
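To make the "once versus twice" distinction concrete, the sketch below decodes the raw body from this comment with Python's `urllib.parse`. It is only an illustration of the suspected failure mode, not the plugins' code: after one decode the source URL has no fragment, while a second decode turns `%23` into a real `#` and moves part of the `id` value into the fragment.

```python
from urllib.parse import parse_qs, unquote, urlsplit

# Raw webmention POST body as quoted in the comment above.
raw_body = (
    "source=https%3A%2F%2Ffed.brid.gy%2Frender%3Fid%3Dhttps%253A%252F%252F"
    "indieweb.social%252Fusers%252Fsnarfed%2523likes%252F709275"
    "&target=https%3A%2F%2Fsnarfed.org%2F2023-03-28_49662"
)

# Correct: decode the form body once.
once = parse_qs(raw_body)["source"][0]
# Suspected bug: decode the already-decoded value a second time.
twice = unquote(once)

print(once)
print(repr(urlsplit(once).fragment))   # no fragment after one decode
print(twice)
print(repr(urlsplit(twice).fragment))  # "#" now starts a real fragment
```

After the second decode, a fetch of that URL would request a different `id` than the sender intended, since fragments are not sent to the server.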

@pfefferle
Owner

pfefferle commented Mar 30, 2023

OK, that might be possible because of the interaction of both (Webmention & SL) plugins, I will re-check the latest version of the Webmention plugin.

@snarfed
Contributor Author

snarfed commented Apr 6, 2023

Looks like this isn't about the # character at all. I added custom encoding for #s, I'm now replacing them with ^^, and I'm still hitting this problem. Here's an example source URL:

https://fed.brid.gy/render?id=https%3A%2F%2Ftechhub.social%2Fusers%2Fdiazona^^likes%2F979471

If I send a webmention with this source, I get:

{"code":"resource_not_found","message":"Resource not found","data":{"status":400}}

Same if I %-encode the ^^, ie:

https://fed.brid.gy/render?id=https%3A%2F%2Ftechhub.social%2Fusers%2Fdiazona%5E%5Elikes%2F979471

However, if I double-encode those characters to %255E, giving the source URL below, it works.

https://fed.brid.gy/render?id=https%3A%2F%2Ftechhub.social%2Fusers%2Fdiazona%255E%255Elikes%2F979471

dshanske added this to the 5.1.0 milestone on Apr 7, 2023
@snarfed
Contributor Author

snarfed commented May 25, 2023

Here are example WP debug logs I see for a failed webmention with a source URL with ^^ in it:

[25-May-2023 02:21:48 UTC] REST request: /webmention/1.0/endpoint: {"source":"https:\/\/fed.brid.gy\/convert\/activitypub\/webmention\/https:\/mastodon.social\/users\/notblanklikes\/88327162","target":"https:\/\/snarfed.org\/2023-05-24_50288"}(Header Present)
[25-May-2023 02:21:48 UTC] REST result: /webmention/1.0/endpoint: {"code":"source_error","message":"Bad Gateway","data":{"status":400}}(400) - [](User ID: 0)

The full source URL was https://fed.brid.gy/convert/activitypub/webmention/https:/mastodon.social/users/notblank^^likes/88327162. Note that the logged source URL is missing the ^^. I get the same logs if I URL-encode the ^^ to %5E%5E.

Btw this is on pre-merge plugins, ie Webmention 4.0.9 and Semantic-Linkbacks 3.12.0.

@pfefferle
Owner

pfefferle commented May 25, 2023

Why do people put everything in URLs...??? (and please do not answer with: because they can ☺️ )

@snarfed
Contributor Author

snarfed commented May 25, 2023

Hah, fair point, maybe I'm being a bit difficult here. Sorry! This bug does seem unrelated to any individual characters though, since it happens when they're URL-encoded too, eg the examples here with both %23 and %5E%5E still break the plugin.

I'm open to other ideas! I need to be able to include arbitrary URLs, including ones with # fragments, but I can encode them however works best for you all.

@pfefferle
Owner

esc_url, esc_url_raw and sanitize_url seem to remove the ^^ special characters. That is not good, because these functions are highly recommended when dealing with URLs.
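For readers unfamiliar with this kind of sanitizer, here is a small Python sketch of an allowlist-based URL cleaner. The function name `sanitize_url_sketch` and the character set are assumptions made for illustration only; this is not WordPress's implementation. It shows how a literal `^^` can silently vanish, matching the `notblanklikes` seen in the debug log, while a percent-encoded `%5E%5E` survives unless something decodes it first.

```python
import re
from urllib.parse import unquote

# Hypothetical allowlist, NOT WordPress's actual code: any character outside
# this set is dropped from the URL. Note that "^" is not in the set.
_DISALLOWED = re.compile(r"[^A-Za-z0-9\-~+_.?#=!&;,/:%@$|*'()\[\]]")


def sanitize_url_sketch(url: str) -> str:
    """Strip every character that is not on the allowlist."""
    return _DISALLOWED.sub("", url)


literal = (
    "https://fed.brid.gy/convert/activitypub/webmention/"
    "https:/mastodon.social/users/notblank^^likes/88327162"
)
encoded = literal.replace("^", "%5E")

print(sanitize_url_sketch(literal))           # "^^" is removed
print(sanitize_url_sketch(encoded))           # "%5E%5E" is kept
print(sanitize_url_sketch(unquote(encoded)))  # decoded first, so removed again
```

In this sketch the encoded form only breaks if it is decoded before sanitizing. The thread reports that the encoded form fails too, but it does not confirm where such a decode would happen.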

@pfefferle
Owner

At least it is not double encoding or something similar.

@snarfed
Contributor Author

snarfed commented Jul 31, 2023

Odd: I switched back from ^^ to %23 recently, and now I'm seeing some of these source URLs work after all. Example: https://ap.brid.gy/convert/web/https:/bayes.club/users/zerology%23likes/32983 on https://snarfed.org/2023-07-10_50589

@pfefferle
Owner

pfefferle commented Aug 19, 2023

@snarfed that makes sense: if you check the HTML of the fed.brid.gy links (vs. the AP links), you find only an h-card without any context. That's why the plugin ignores them; it does not know how to handle them.
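The difference between a bare h-card and a like with context can be sketched against parsed microformats2 JSON. The helper `find_like_of` and both sample documents below are hypothetical and hand-written for illustration; they are not the plugin's logic or the real parsed output of the pages in this thread.

```python
def find_like_of(parsed: dict) -> list:
    """Return like-of targets from top-level h-entry items in parsed mf2 JSON."""
    targets = []
    for item in parsed.get("items", []):
        if "h-entry" in item.get("type", []):
            targets.extend(item.get("properties", {}).get("like-of", []))
    return targets


# Made-up example: a page exposing only an h-card, with no entry around it.
bare_hcard = {
    "items": [{"type": ["h-card"], "properties": {"name": ["Example Author"]}}]
}

# Made-up example: an h-entry with a like-of and a nested author h-card.
like_entry = {
    "items": [{
        "type": ["h-entry"],
        "properties": {
            "like-of": ["https://example.com/post"],
            "author": [{"type": ["h-card"],
                        "properties": {"name": ["Example Author"]}}],
        },
    }]
}

print(find_like_of(bare_hcard))
print(find_like_of(like_entry))
```

A receiver that looks for an h-entry before reading the author would find nothing to act on in the first document, which is consistent with the behavior described above.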

@snarfed
Contributor Author

snarfed commented Aug 19, 2023

Hmm! You're right about the top source URL in the original description, https://fed.brid.gy/render?id=https%3A%2F%2Findieweb.social%2Fusers%2Fsnarfed%23likes%2F709275 . Not sure what's going on there.

The rest of the source URLs here are valid u-like-ofs though, including the second one in the description, https://fed.brid.gy/render?id=https%3A%2F%2Findieweb.social%2Fusers%2Fsnarfed%2523likes%2F709275 .
