This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Youtube captions (link previews) are useless #9733

Closed
eras opened this issue Apr 2, 2021 · 45 comments
Labels
S-Minor: Blocks non-critical functionality, workarounds exist.
T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@eras

eras commented Apr 2, 2021

Description

At some point YouTube updated its site, and now all (?) captions generated by Synapse for it read:

Before you continue to YouTube
Sign in a Google company Before you continue to YouTube Google uses cookies and data to: Deliver and maintain services, like tracking outages and protecting against spam, fraud, and abuse Measure audience engagement and site statistics to understand how our services are used

This is basically useless considering the primary point of the function, in particular in the case of a very popular website.

Steps to reproduce

  • send an m.room.message containing a YouTube URL into a room, e.g. https://www.youtube.com/watch?v=RzJf02TIqxk
  • wait for Synapse to produce a caption for the link
  • observe that the caption contains no information about the actual link :)

Expected results:

  • A descriptive message about the contents, such as the one produced by an up-to-date youtube-dl --get-description:

Authentic recordings from inside Hetzner Online's data center park
Just like birds and insects, each server sings its own unique song.

Version information

  • Homeserver: matrix.org
@ShadowJonathan
Contributor

(FTR: This is about link previews)

This is not necessarily a problem with Synapse. Synapse is doing its job perfectly by previewing the URL exactly as fetched. Because matrix.org's server is located within the EU, Google has a tendency (heh) to present the cookie consent page, as required by law, before letting users access any part of the site.

@eras
Author

eras commented Apr 2, 2021

I agree that it's not particularly a bug in Synapse; however the only parties able to resolve this issue are Google and Synapse (or the 3rd party component it's using), and I have my doubts about Google doing anything about it :).

IIRC Slack, for example, doesn't have this issue, so it's resolvable, even if only with special handling.

@eras eras changed the title Youtube captions are useless Youtube captions (link previews) are useless Apr 2, 2021
@clokep clokep added S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Apr 2, 2021
@eras
Author

eras commented Apr 5, 2021

For one plausible solution consider the following session:

% curl -s -A Mozilla -I https://www.youtube.com/watch?v=RzJf02TIqxk | grep -e '^HTTP' -e '^location'
HTTP/2 302 
location: https://consent.youtube.com/m?continue=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DRzJf02TIqxk&gl=FI&m=0&pc=yt&uxe=23983172&hl=fi&src=1

% curl -s -I https://www.youtube.com/watch?v=RzJf02TIqxk | grep -e '^HTTP' -e '^location'        
HTTP/2 200 
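
The session above suggests that the redirect depends on the User-Agent header. As a minimal sketch, a fetcher could recognise the consent redirect from the status code and Location header and recover the original target. The helper names `is_consent_redirect` and `original_target` are hypothetical and not part of Synapse; the host name and the `continue` parameter are taken from the curl output above.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Host name as seen in the `location` header of the curl session above.
CONSENT_HOST = "consent.youtube.com"


def is_consent_redirect(status: int, location: Optional[str]) -> bool:
    """Return True if the response is a redirect to the consent page."""
    if status not in (301, 302, 303, 307, 308) or not location:
        return False
    return urlparse(location).hostname == CONSENT_HOST


def original_target(location: str) -> Optional[str]:
    """Recover the originally requested URL from the `continue` parameter."""
    values = parse_qs(urlparse(location).query).get("continue")
    return values[0] if values else None
```

Detecting the redirect only tells the caller that the preview will be useless; it does not by itself produce a better preview.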

@richvdh
Member

richvdh commented Apr 6, 2021

ohh bother. we had this with twitter (#7643).

It looks like we should do the same trick as we did with them (hardcode a mapping to the oembed api):

$ curl -A Mozilla 'https://www.youtube.com/oembed?url=https%3A//www.youtube.com/watch%3Fv%3DRzJf02TIqxk&format=json' 
{"title":"PURE RELAXATION - SERVER SOUNDS","author_name":"Hetzner","author_url":"https://www.youtube.com/c/HetznerOnline","type":"video","height":113,"width":200,"version":"1.0","provider_name":"YouTube","provider_url":"https://www.youtube.com/","thumbnail_height":360,"thumbnail_width":480,"thumbnail_url":"https://i.ytimg.com/vi/RzJf02TIqxk/hqdefault.jpg","html":"\u003ciframe width=\u0022200\u0022 height=\u0022113\u0022 src=\u0022https://www.youtube.com/embed/RzJf02TIqxk?feature=oembed\u0022 frameborder=\u00220\u0022 allow=\u0022accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture\u0022 allowfullscreen\u003e\u003c/iframe\u003e"}
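
For reference, the request URL used in the curl command above can be built programmatically. This is a small sketch using only the standard library; the function name `build_oembed_url` is an assumption for illustration, while the endpoint and the `url` and `format` parameters are the ones shown above.

```python
from urllib.parse import urlencode

# Endpoint taken from the curl command above.
YOUTUBE_OEMBED_ENDPOINT = "https://www.youtube.com/oembed"


def build_oembed_url(endpoint: str, page_url: str) -> str:
    """Build an oEmbed request URL that asks for a JSON response."""
    # urlencode percent-encodes the page URL so it survives as one parameter.
    return endpoint + "?" + urlencode({"url": page_url, "format": "json"})
```

Calling `build_oembed_url(YOUTUBE_OEMBED_ENDPOINT, "https://www.youtube.com/watch?v=RzJf02TIqxk")` yields a fully percent-encoded variant of the URL in the curl command.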

@alturiak

alturiak commented Apr 6, 2021

I guess this will affect an increasing number of (less high-profile) sites as well, such as https://www.golem.de (a German news portal). Hardcoding exceptions for YouTube is certainly warranted, but in the long run it might be nice to be able to specify custom hooks in Synapse's configuration, although I'm not sure if that's really worth the effort.

@clokep
Member

clokep commented Apr 6, 2021

it might be nice to be able to specify custom hooks in synapse's configuration, although I'm not sure if that's really worth the effort.

This shouldn't be too hard, it would also be nice to default to using the documented providers (https://oembed.com/providers.json).

@ShadowJonathan
Contributor

This shouldn't be too hard, it would also be nice to default to using the documented providers (https://oembed.com/providers.json).

Oooo, thanks for mentioning that, shouldn't that just be preloaded and used directly when URL previews are enabled?

@clokep
Member

clokep commented Apr 6, 2021

This shouldn't be too hard, it would also be nice to default to using the documented providers (oembed.com/providers.json).

Oooo, thanks for mentioning that, shouldn't that just be preloaded and used directly when URL previews are enabled?

It should probably be tried. I don't know if it will regress other previews. 🤷

@licentiapoetica

also on Hetzner, experiencing the same issue

@ItsCinnabar

If anyone wants a temporary user-side fix for themselves, I made this Tampermonkey script: https://gist.github.com/ItsCinnabar/ebcfe4f6b3ea7d224a8e1ef0783edeb2

Just edit the match URL to point at your site and load it into Tampermonkey/Greasemonkey/etc.

@licentiapoetica

I found a way to get it working again: you need to change the user agent to curl.

self.user_agent = hs.version_string

Replace it with something like this: self.user_agent = "curl/7.59.0"

Now YouTube previews are working again.

@alturiak

I found a way how to get it working again, you need to change your user agent to curl

self.user_agent = hs.version_string

replace to something like this: self.user_agent = "curl/7.59.0"
now youtube previews are working again

This works for YouTube (which is great, thanks!), but it's not a silver bullet, as it depends on how each site handles different user agents, so a more versatile approach might still be warranted.

@licentiapoetica

I found a way how to get it working again, you need to change your user agent to curl

self.user_agent = hs.version_string

replace to something like this: self.user_agent = "curl/7.59.0"
now youtube previews are working again

This works for youtube (which is great, thanks!), but it's not a silver bullet as it depends on how the sites handles different user-agents, so a more versatile approach might still be warranted.

Yeah, you are right, but for now it suits me personally very well and I haven't encountered any URL preview problems so far. I guess to make it youtube.com-specific you would need to implement an if check for YouTube and have everything else make requests through the Matrix user agent.
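
The YouTube-specific check suggested above could look roughly like the following. This is only a sketch: the override table and the function name `pick_user_agent` are hypothetical, and nothing here reflects how Synapse actually selects its user agent.

```python
from urllib.parse import urlparse

# Hypothetical override table. The curl value is the one quoted above.
USER_AGENT_OVERRIDES = {
    "youtube.com": "curl/7.59.0",
}


def pick_user_agent(url: str, default_user_agent: str) -> str:
    """Return an override for known hosts, else the default user agent."""
    host = (urlparse(url).hostname or "").lower()
    for domain, user_agent in USER_AGENT_OVERRIDES.items():
        # Match the domain itself and any subdomain such as www.youtube.com.
        if host == domain or host.endswith("." + domain):
            return user_agent
    return default_user_agent
```

All other sites keep receiving the default user agent, which limits the blast radius of the workaround but does not make it less brittle.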

@igeljaeger

I found a way how to get it working again, you need to change your user agent to curl

self.user_agent = hs.version_string

replace to something like this: self.user_agent = "curl/7.59.0"
now youtube previews are working again

This also fixes previews for sites like anilist.co, which only displayed a "please use a modern browser" error message before this edit.

@kuon

kuon commented May 7, 2021

Setting the user agent to curl can be a problem for some other sites; I remember it being blocked on some occasions.

Unfortunately, having worked on a framework like embed.ly in the past, it is easy to get to 90%, but the last 10% can be really difficult.

What we ended up doing was using our own user agent on the first try; if the returned content was blocked, we tried again with Googlebot and other crawler user agents (Facebook, Twitter...). But some websites can get really smart: I remember some validating the user agent against the TCP TTL (IIRC Windows is 128 and Linux is 64).

I don't know what the best fix would be for synapse. Maybe the user agent could be configurable? Also maybe it could be configurable to use some external API or external command line tool on the home server.

In the end, having nice inline previews is crucial to a good user experience, but it is really hard to get right.
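
The retry strategy described above (own user agent first, then crawler user agents) can be sketched as follows. The `fetch` callable and its signature are assumptions: the caller supplies the real HTTP client. The block-page check is deliberately crude and only looks for the consent text quoted in the issue description.

```python
from typing import Callable, Iterable, Optional, Tuple

# Marker text taken from the consent page quoted in the issue description.
CONSENT_MARKER = "Before you continue to YouTube"

# fetch(url, user_agent) -> (status_code, body). Hypothetical signature.
Fetch = Callable[[str, str], Tuple[int, str]]


def looks_blocked(body: str) -> bool:
    """Crude check for a consent or block page instead of real content."""
    return CONSENT_MARKER in body


def fetch_with_fallback(
    url: str, user_agents: Iterable[str], fetch: Fetch
) -> Optional[str]:
    """Try each user agent in order and return the first usable body."""
    for user_agent in user_agents:
        status, body = fetch(url, user_agent)
        if status == 200 and not looks_blocked(body):
            return body
    # Every user agent was blocked or failed.
    return None
```

A real implementation would need a far better block detector, which is where most of the "last 10%" effort tends to go.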

@richvdh
Member

richvdh commented May 7, 2021

I still think the best fix is to use the oembed api. Changing the useragent is a hack and is always going to be brittle.

@licentiapoetica

Well, this was labeled as S-Minor. It seems the devs don't give a damn since their instances are not in the EU, and if nobody gives a damn about implementing this oEmbed API for YouTube, there are two solutions: the user agent hack, or hosting Synapse somewhere where this "please sign in to YouTube" preview does not happen.

Also, I haven't had any trouble with curl as my user agent in Synapse; everything works perfectly fine so far.

@kuon

kuon commented May 7, 2021

well this was labeled as s-minor, it seems the devs dont give a damn since they are not in the eu with their instances and if nobody gives a damn about implementing this oembed api for youtube there are 2 solutions, the user agent hack or hosting the synapse somewhere where this please sign in to youtube preview does not happen.

also I havnt had any trouble with curl as my user agent in synapse, everything works perfectly fine so far

Well, I don't think this tone is helpful. We are all trying to make things better.

Anyway, I agree that the user agent hack is brittle, per my experience it is not really a solution. But I also know it requires a lot of work to generate good previews. OEmbed is part of the solution and should be supported at some point, but having a configurable user agent can be a quick fix that shouldn't harm anything.

But the work involved to support OEmbed shouldn't be that big, if we look at https://github.com/webrecorder/oembed.link it is not that huge.

@clokep
Member

clokep commented May 7, 2021

But the work involved to support OEmbed shouldn't be that big, if we look at webrecorder/oembed.link it is not that huge.

Maybe it wasn't explicit enough above, but OEmbed is already supported (see #7920). It currently hard-codes Twitter as the only supported service (see

# A map of globs to API endpoints.
_oembed_globs = {
    # Twitter.
    "https://publish.twitter.com/oembed": [
        "https://twitter.com/*/status/*",
        "https://*.twitter.com/*/status/*",
        "https://twitter.com/*/moments/*",
        "https://*.twitter.com/*/moments/*",
        # Include the HTTP versions too.
        "http://twitter.com/*/status/*",
        "http://*.twitter.com/*/status/*",
        "http://twitter.com/*/moments/*",
        "http://*.twitter.com/*/moments/*",
    ],
}
).

Options to solve this would be:

  1. Add YouTube as another hard-coded service (kind of meh, but if it is really broken this might be OK).
  2. Support pulling the list dynamically (or bundle the JSON list with the package and load it at run-time) -- this is the idea discussed in #9733 (comment).
  3. Allow for configuration of this list so people can do this themselves (also kind of meh since it requires each admin to fix this individually).
  4. Some combination of the above.

If someone is interested in working on this I'll gladly help work through any of the above with them, but that is likely a discussion for #synapse-dev:matrix.org.
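
To illustrate option 1, a glob map like the one quoted above can be matched against a URL with the standard library. This is a sketch, not Synapse's implementation: the YouTube entry and its glob patterns are hypothetical additions, and `find_oembed_endpoint` is an illustrative name.

```python
import fnmatch
from typing import Dict, List, Optional

OEMBED_GLOBS: Dict[str, List[str]] = {
    # Subset of the Twitter entry quoted above.
    "https://publish.twitter.com/oembed": [
        "https://twitter.com/*/status/*",
        "https://*.twitter.com/*/status/*",
    ],
    # Hypothetical YouTube entry, added for illustration only.
    "https://www.youtube.com/oembed": [
        "https://www.youtube.com/watch*",
        "https://youtu.be/*",
    ],
}


def find_oembed_endpoint(url: str) -> Optional[str]:
    """Return the first endpoint whose globs match the URL, else None."""
    for endpoint, globs in OEMBED_GLOBS.items():
        # fnmatchcase treats "*" as "any characters", including "/".
        if any(fnmatch.fnmatchcase(url, glob) for glob in globs):
            return endpoint
    return None
```

Note that `fnmatch` also gives `?` and `[` special meaning inside patterns, so globs containing those characters would need escaping or a hand-written glob-to-regex conversion.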

@kuon

kuon commented May 7, 2021

I think using the list mentioned in #9733 (comment) is the way to go, and maybe make it user-configurable (list URL).

So:

  • Have a URL configuration option for the list, defaulting to https://oembed.com/providers.json, and allow a local file
  • Pull the list dynamically if remote (maybe weekly update?)

This seems like a good approach.
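
A rough sketch of the approach above: turn a providers list into a glob-to-endpoint map and decide when a remote copy is due for a refresh. The sample JSON is a hand-written illustration of the assumed layout of providers.json (a list of providers, each with `endpoints` carrying `schemes` and a `url`), not a copy of the real file, and the function names are hypothetical.

```python
import json
import time
from typing import Dict, Optional

ONE_WEEK_SECONDS = 7 * 24 * 60 * 60

# Illustrative sample in the assumed providers.json layout.
SAMPLE_PROVIDERS_JSON = """
[
  {
    "provider_name": "YouTube",
    "provider_url": "https://www.youtube.com/",
    "endpoints": [
      {
        "schemes": ["https://*.youtube.com/watch*", "https://youtu.be/*"],
        "url": "https://www.youtube.com/oembed"
      }
    ]
  }
]
"""


def globs_from_providers(providers_json: str) -> Dict[str, str]:
    """Map each URL scheme glob to its oEmbed endpoint."""
    mapping: Dict[str, str] = {}
    for provider in json.loads(providers_json):
        for endpoint in provider.get("endpoints", []):
            # Assumption: some endpoint URLs carry a "{format}" placeholder.
            api_url = endpoint["url"].replace("{format}", "json")
            for scheme in endpoint.get("schemes", []):
                mapping[scheme] = api_url
    return mapping


def needs_refresh(fetched_at: float, now: Optional[float] = None) -> bool:
    """Return True once the cached list is at least a week old."""
    current = time.time() if now is None else now
    return current - fetched_at >= ONE_WEEK_SECONDS
```

Reading from a local file would just mean passing the file contents to `globs_from_providers`, so the same code path serves both the remote and the local case.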

@Bubu
Contributor

Bubu commented Jun 9, 2021

I just wanted to note that adding @tulir's "UrlPreviewBot" UA workaround fixed both twitter image previews as well as youtube previews for me. 🎉.

https://mau.dev/maunium/synapse/-/commit/55d926999cffee893cb4951890a33985beaf70ba

@t3chguy
Member

t3chguy commented Jul 9, 2021

I'm taking a quick stab at this, by putting the oembed_globs in config, later possibly defaulting the sample config to derive from https://oembed.com/providers.json

Edit: unfortunately this is not quite as trivial: YouTube's oEmbed response is an iframe, which we can't send over the preview_url API.

e.g.

{
  "title": "The Giant Comes to Life...(POWER LOADER: PART 14)",
  "author_name": "Hacksmith Industries",
  "author_url": "https://www.youtube.com/c/theHacksmith",
  "type": "video",
  "height": 113,
  "width": 200,
  "version": "1.0",
  "provider_name": "YouTube",
  "provider_url": "https://www.youtube.com/",
  "thumbnail_height": 360,
  "thumbnail_width": 480,
  "thumbnail_url": "https://i.ytimg.com/vi/62tPTgpmT1U/hqdefault.jpg",
  "html": "\u003ciframe width=\u0022200\u0022 height=\u0022113\u0022 src=\u0022https://www.youtube.com/embed/62tPTgpmT1U?feature=oembed\u0022 frameborder=\u00220\u0022 allow=\u0022accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture\u0022 allowfullscreen\u003e\u003c/iframe\u003e"
}

image

vs. Twitter, which has no title but sends a blockquote that we send over to the client:

{
  "url": "https:\/\/twitter.com\/CroydonCyclists\/status\/1147416388874768389",
  "author_name": "Croydon Cycling Campaign",
  "author_url": "https:\/\/twitter.com\/CroydonCyclists",
  "html": "\u003Cblockquote class=\"twitter-tweet\"\u003E\u003Cp lang=\"en\" dir=\"ltr\"\u003ETurns out that Lime bike will fine you for parking their bikes in parts of central Croydon where cycling is legal and there are parking racks. Beyond stupid. \u003Ca href=\"https:\/\/t.co\/EtDlbUSfog\"\u003Epic.twitter.com\/EtDlbUSfog\u003C\/a\u003E\u003C\/p\u003E— Croydon Cycling Campaign (@CroydonCyclists) \u003Ca href=\"https:\/\/twitter.com\/CroydonCyclists\/status\/1147416388874768389?ref_src=twsrc%5Etfw\"\u003EJuly 6, 2019\u003C\/a\u003E\u003C\/blockquote\u003E\n\u003Cscript async src=\"https:\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"\u003E\u003C\/script\u003E\n",
  "width": 550,
  "height": null,
  "type": "rich",
  "cache_age": "3153600000",
  "provider_name": "Twitter",
  "provider_url": "https:\/\/twitter.com",
  "version": "1.0"
}

image

Edit2:

With some tweaking, I can get some better results out of it, but the code needs a bit of refactoring: all the oEmbed results go through a media/file interface, and it's not appropriate.

image
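
The difference between the two responses above can be handled by branching on the oEmbed `type`. The sketch below extracts readable text from a "rich" HTML payload (such as Twitter's blockquote) and falls back to the `title` otherwise (such as YouTube's iframe). It is an illustration only, with hypothetical names, and is not the code from the PR.

```python
from html.parser import HTMLParser
from typing import Any, Dict, List, Optional


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside script or style tags."""

    def __init__(self) -> None:
        super().__init__()
        self._chunks: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        # Collapse all runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def describe_oembed(oembed: Dict[str, Any]) -> Optional[str]:
    """Return a text description for an oEmbed response, if one exists."""
    html = oembed.get("html")
    if oembed.get("type") == "rich" and html:
        parser = _TextExtractor()
        parser.feed(html)
        parser.close()
        if parser.text():
            return parser.text()
    # Video responses only carry an iframe, so use the title instead.
    return oembed.get("title")
```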

@nukeop

This comment has been minimized.

@ShadowJonathan
Contributor

ShadowJonathan commented Jul 22, 2021

Discord has some custom behaviour and design for YouTube specifically, FYI. It's intended to be invisible, but that kind of special treatment is a bit problematic for Element.

@nukeop

This comment has been minimized.

@Bubu

This comment has been minimized.

@nukeop

This comment has been minimized.

@t3chguy

This comment has been minimized.

@aaronraimist

This comment has been minimized.

@damentz

This comment has been minimized.

@nukeop

This comment has been minimized.

@richvdh
Member

richvdh commented Jul 28, 2021

I've removed the conspiracy theories, suggestions of workarounds that have already been discussed 5 times, and "me too!" comments. None of these are helpful; please stay on topic. Yes it's annoying, no it's not a conspiracy by the evil Synapse maintainers to make your life worse.

We know it's possible to work around the problem by changing the User-agent. Per #9733 (comment): I'd rather not do that as I think it will be brittle.

Props to @t3chguy who, rather than complaining about the problem, has started work on a PR to fix it.

@nukeop

This comment has been minimized.

@t3chguy
Member

t3chguy commented Jul 28, 2021

As a maintainer it is draining to see users spewing such garbage about something you put so much time into.

@nukeop

This comment has been minimized.

@matrix-org matrix-org locked as too heated and limited conversation to collaborators Jul 28, 2021
@matrix-org matrix-org unlocked this conversation Aug 6, 2021
@richvdh
Member

richvdh commented Aug 6, 2021

I'm going to take further discussion of the oembed implementation to #2752.

@richvdh
Member

richvdh commented Sep 1, 2021

#10714 has made good progress on this by changing the preview API to use a configurable list of oEmbed providers; however youtube previews are still somewhat useless as the default provider list doesn't include an entry for youtube.

@clokep are you aware of any reason we shouldn't include an entry for youtube in that file by default?

@clokep
Member

clokep commented Sep 1, 2021

@clokep are you aware of any reason we shouldn't include an entry for youtube in that file by default?

oEmbed for YouTube doesn't really give a good response right now. In the image below, the first preview is made without using oEmbed (but I'm in the US so I get a "real" description), while the second one is made with oEmbed:

image

I think the tweaks in #10392 were meant to make this preview better.

@richvdh
Member

richvdh commented Sep 1, 2021

oh I see. So really we need to land the remaining tweaks in #10392 before we can make more progress here?

@clokep
Member

clokep commented Sep 1, 2021

oh I see. So really we need to land the remaining tweaks in #10392 before we can make more progress here?

Yeah, pretty much. I'm not super thrilled with the flow right now of how we do previews when using oEmbed, but that's rather tough to crack apart. It could really use some documentation on where caches are and such.

I think the gist is that we need to pull more info out of the oEmbed response though, e.g. the provider_name and title don't seem to end up properly in the response right now.

Here's what we get from oEmbed:

{
   "author_name" : "Rick Astley",
   "author_url" : "https://www.youtube.com/c/RickastleyCoUkOfficial",
   "height" : 113,
   "html" : "<iframe width=\"200\" height=\"113\" src=\"https://www.youtube.com/embed/dQw4w9WgXcQ?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen></iframe>",
   "provider_name" : "YouTube",
   "provider_url" : "https://www.youtube.com/",
   "thumbnail_height" : 360,
   "thumbnail_url" : "https://i.ytimg.com/vi/dQw4w9WgXcQ/hqdefault.jpg",
   "thumbnail_width" : 480,
   "title" : "Rick Astley - Never Gonna Give You Up (Official Music Video)",
   "type" : "video",
   "version" : "1.0",
   "width" : 200
}

What we get from Synapse (when configured to use oEmbed for YouTube):

{
   "matrix:image:size" : 18498,
   "og:description" : null,
   "og:image" : "mxc://localhost:8480/2021-09-01_AfteoaZUTZOUJfoa",
   "og:image:height" : 360,
   "og:image:type" : "image/jpeg",
   "og:image:width" : 480
}

This is really only pulling the thumbnail_url properly right now.

For reference, this compares to what we get without using oEmbed:

{
   "matrix:image:size" : 65665,
   "og:description" : "Rick Astley's official music video for “Never Gonna Give You Up” Subscribe to the official Rick Astley YouTube channel: https://RickAstley.lnk.to/YTSubIDFoll...",
   "og:image" : "mxc://localhost:8480/2021-09-01_QwaVetzmVlEviNmK",
   "og:image:height" : 720,
   "og:image:type" : "image/jpeg",
   "og:image:width" : 1280,
   "og:site_name" : "YouTube",
   "og:title" : "Rick Astley - Never Gonna Give You Up (Official Music Video)",
   "og:type" : "video.other",
   "og:url" : "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
   "og:video:height" : "720",
   "og:video:secure_url" : "https://www.youtube.com/embed/dQw4w9WgXcQ",
   "og:video:tag" : "rick astley never gonna give you up lyrics",
   "og:video:type" : "text/html",
   "og:video:url" : "https://www.youtube.com/embed/dQw4w9WgXcQ",
   "og:video:width" : "1280"
}
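
To make the gap concrete, here is a minimal sketch of mapping the oEmbed fields shown above onto the Open Graph keys that appear in the non-oEmbed result. The field pairing is an assumption for illustration and the function name is hypothetical; Synapse would additionally download the thumbnail and replace its URL with an mxc:// URI, which this sketch does not do.

```python
from typing import Any, Dict

# Assumed pairing of oEmbed fields to Open Graph keys.
_FIELD_MAP = {
    "title": "og:title",
    "provider_name": "og:site_name",
    "thumbnail_url": "og:image",
    "thumbnail_width": "og:image:width",
    "thumbnail_height": "og:image:height",
}


def oembed_to_open_graph(oembed: Dict[str, Any], url: str) -> Dict[str, Any]:
    """Copy the oEmbed fields that have an Open Graph counterpart."""
    og: Dict[str, Any] = {"og:url": url}
    for oembed_key, og_key in _FIELD_MAP.items():
        value = oembed.get(oembed_key)
        if value is not None:
            og[og_key] = value
    return og
```

With the Rick Astley response above, this would at least carry the title and site name through, although the description is still missing because oEmbed does not provide one here.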

@clokep
Member

clokep commented Sep 22, 2021

I put up #10819 which should help with this, but it doesn't give quite as good of a preview as the current HTML parsing.

I've been unable to reproduce the blank / no preview for YouTube from US, UK, or France based servers. Are people still seeing issues with this?

@evoL

evoL commented Sep 22, 2021

I get URL previews for YouTube now.

I think YouTube rolled out a change where they don't auto-redirect to consent.youtube.com anymore. I remember that some weeks ago the redirect happened on and off for me, which looked to me like an A/B test on their part. Maybe it's fully rolled out now?

@asmaps

asmaps commented Sep 22, 2021

I get URL previews for YouTube now.

I think YouTube rolled out a change where they don't auto-redirect to consent.youtube.com anymore. I remember that some weeks ago the redirect happened on and off for me, which looked to me like an A/B test on their part. Maybe it's fully rolled out yet?

Same here, started working from Germany without updating synapse.

@clokep
Member

clokep commented Sep 22, 2021

Thank you @evoL and @asmaps! I'm going to close this for now then. If someone is seeing issues still, please shout!
