This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Support image previews for Twitter oEmbed URL Previews #8022

Closed
anoadragon453 opened this issue Aug 3, 2020 · 7 comments

@anoadragon453
Member

anoadragon453 commented Aug 3, 2020

To help fix twitter embedding issues, we've just added support for oEmbed in capturing URL previews. We now use this for previewing twitter links (by default), and we can now receive tweet text without problems. However, twitter does not return any image data for a tweet in its oEmbed response:

{
  "url": "https:\/\/twitter.com\/arnaudmez7\/status\/1284848614062338053",
  "author_name": "The Uncle Mez",
  "author_url": "https:\/\/twitter.com\/arnaudmez7",
  "html": "\u003Cblockquote class=\"twitter-tweet\"\u003E\u003Cp lang=\"en\" dir=\"ltr\"\u003EI Absolutely like the new \u003Ca href=\"https:\/\/twitter.com\/element_hq?ref_src=twsrc%5Etfw\"\u003E@element_hq\u003C\/a\u003E \u003Cbr\u003EBeautiful work !\u003Cbr\u003ERun very well on \u003Ca href=\"https:\/\/twitter.com\/SolusProject?ref_src=twsrc%5Etfw\"\u003E@SolusProject\u003C\/a\u003E \u003Ca href=\"https:\/\/t.co\/bLzhmuoFdy\"\u003Epic.twitter.com\/bLzhmuoFdy\u003C\/a\u003E\u003C\/p\u003E— The Uncle Mez (@arnaudmez7) \u003Ca href=\"https:\/\/twitter.com\/arnaudmez7\/status\/1284848614062338053?ref_src=twsrc%5Etfw\"\u003EJuly 19, 2020\u003C\/a\u003E\u003C\/blockquote\u003E\n\u003Cscript async src=\"https:\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"\u003E\u003C\/script\u003E\n",
  "width": 550,
  "height": null,
  "type": "rich",
  "cache_age": "3153600000",
  "provider_name": "Twitter",
  "provider_url": "https:\/\/twitter.com",
  "version": "1.0"
}

You'll notice that the html key has a pic.twitter.com URL in it. However, this just leads us to the tweet HTML, and extracting the image from that HTML is too Twitter-specific anyway.
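As a rough illustration (not Synapse's actual code), the sketch below parses a trimmed copy of the response above with Python's standard library and lists what a previewer could use from it. The names `LinkCollector` and `summarize` are made up for this example, and the `html` value is shortened to the part that carries the pic.twitter.com link.

```python
import json
from html.parser import HTMLParser

# Trimmed copy of the oEmbed response; the "html" value is shortened for readability.
OEMBED_JSON = r'''
{
  "url": "https:\/\/twitter.com\/arnaudmez7\/status\/1284848614062338053",
  "html": "\u003Cblockquote class=\"twitter-tweet\"\u003E\u003Ca href=\"https:\/\/t.co\/bLzhmuoFdy\"\u003Epic.twitter.com\/bLzhmuoFdy\u003C\/a\u003E\u003C\/blockquote\u003E",
  "type": "rich",
  "version": "1.0"
}
'''


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def summarize(oembed):
    """Report the response type, any thumbnail* keys, and links found in "html"."""
    collector = LinkCollector()
    collector.feed(oembed.get("html") or "")
    return {
        "type": oembed.get("type"),
        "thumbnail_keys": sorted(k for k in oembed if k.startswith("thumbnail")),
        "links": collector.links,
    }


summary = summarize(json.loads(OEMBED_JSON))
print(summary)
```

The response has no thumbnail* keys, and the only image-related link is a t.co redirect to the tweet page rather than an image file.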

However, the HTML returned here is exactly the same (apart from the encoding) as what's shown on publish.twitter.com for this tweet. You can see that this HTML renders into a nice little standardised preview of the tweet. Part of this HTML is a JS script (platform.twitter.com/widgets.js) that gets loaded and does most of the magic of rendering the tweet.

Theoretically, after rendering this HTML output locally, we can just run our standard URL preview code over it and extract an image!

Thus my proposal for supporting Twitter image embeds with oEmbed, in a way that is still generic, is to:

  1. Check if a response has image information (either the photo or video response type is used, or thumbnail* keys are provided).
  2. If an image isn't easily provided, check for an html key.
  3. If an html key exists, render it securely and run the URL preview code over the result.
  4. Attempt to extract an image.

At the moment this is all theory, I haven't tested it in code yet.
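A minimal sketch of the decision logic in steps 1 and 2, assuming the response has already been parsed into a dict. The function name `pick_image_source` and its return labels are hypothetical, and the secure rendering in step 3 is deliberately left out, since that is the hard part.

```python
from typing import Optional, Tuple


def pick_image_source(oembed: dict) -> Tuple[str, Optional[str]]:
    """Decide where a preview image could come from (hypothetical helper).

    Returns a (strategy, value) pair:
      ("thumbnail", url)    a thumbnail_url key was provided
      ("photo", url)        a photo response, whose "url" is the image itself
      ("render_html", html) no direct image; the html would need rendering
      ("none", None)        nothing usable in the response
    """
    # Step 1: image information provided directly by the response.
    thumbnail = oembed.get("thumbnail_url")
    if thumbnail:
        return ("thumbnail", thumbnail)
    if oembed.get("type") == "photo" and oembed.get("url"):
        return ("photo", oembed["url"])

    # Step 2: no image, so fall back to the html key. Video and rich
    # responses without a thumbnail end up here. Step 3 (rendering the
    # html securely) is not implemented in this sketch.
    html = oembed.get("html")
    if html:
        return ("render_html", html)

    return ("none", None)


# The Twitter response above is type "rich" with no thumbnail* keys,
# so it falls through to the html branch.
strategy, _ = pick_image_source({"type": "rich", "html": "<blockquote>...</blockquote>"})
print(strategy)
```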

@erikjohnston
Member

I'm not really in favour of this for two reasons:

  1. This requires running a JS engine in synapse (or forking out to one), and that scares me.
  2. This is working around what Twitter has intentionally provided, where they clearly intend for this to render client side.

The Twitter API suggests that clients include https://platform.twitter.com/widgets.js and run twttr.widgets.load() on new URL previews, but that is obviously Twitter-specific.

@erikjohnston
Member

I'm going to close this because I think we've agreed that this isn't the right approach 🙂

@aaronraimist
Contributor

It's actually easy to do and doesn't require all of that.

https://matrix.to/#/!XaqDhxuTIlvldquJaV:matrix.org/$bvBYxFl1vc1_FbDz-VxSb2Lqh1V0kFIPrgHD_KHMhog?via=sw1v.org&via=raim.ist&via=matrix.org

https://mau.dev/maunium/synapse/-/commit/fe01ce7cf786378f72f741c80b6183674aeada50

It seems that this has been decided against for some reason, but I'm adding a comment here so that at least it is mentioned somewhere on the repo.

@anoadragon453
Member Author

For those coming here in the future, Synapse already sends a User-Agent string of Synapse/x.xx.x during its URL preview fetching: #1859

It seems that the solution @aaronraimist linked works because Twitter allows previews by programs with "bot" in their user-agent string. We're not sure whether we want to add this to the user-agent string, especially if it's not standard practice and is Twitter-specific.

One may suggest allowing the URL preview UA to be configurable, but having to tell users to change this setting to get services like twitter working isn't a great situation to be in.

Given the above there's not an easy path forward here.
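To make the two options concrete, here is a small sketch using only Python's standard library. Everything in it is an assumption for illustration: the `(bot)` suffix, the `override` parameter standing in for a configurable setting, and the function names are not Synapse's real code or config, and the version number is a placeholder.

```python
import urllib.request
from typing import Optional


def preview_user_agent(version: str, override: Optional[str] = None) -> str:
    """Build the User-Agent for URL preview requests (hypothetical helper).

    `override` stands in for an admin-configured value; when it is unset,
    the default advertises that the request comes from a bot.
    """
    if override:
        return override
    return f"Synapse/{version} (bot)"


def build_preview_request(
    url: str, version: str, override: Optional[str] = None
) -> urllib.request.Request:
    """Prepare, but do not send, a preview fetch with the chosen User-Agent."""
    return urllib.request.Request(
        url, headers={"User-Agent": preview_user_agent(version, override)}
    )


# "1.18.0" is a placeholder version used only for this example.
req = build_preview_request(
    "https://twitter.com/arnaudmez7/status/1284848614062338053", "1.18.0"
)
# urllib stores header names capitalised as "User-agent".
print(req.get_header("User-agent"))
```

With a default like this, the configurable value only matters for admins who want something different, which avoids having to tell users to change a setting just to get Twitter previews.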

@aaronraimist
Contributor

aaronraimist commented Aug 18, 2020

Having "bot" in the user agent doesn't seem like that much of a hack to me. For example, #1859 was asking to put "bot" in the UA string back in 2017 just to show that it was in fact a bot making the requests.

The current situation will never work, so even if previews only worked temporarily after making this change, that's still an improvement. You don't have to guarantee that Twitter previews will continue to work after making this change. It can just happily work until, maybe, in the future they change something and it stops working.

@anoadragon453
Member Author

We don't want to modify the UA header for a Twitter-specific reason. However, if putting "bot" in the UA is something industry-wide, or, as you say, a way to indicate that a request originates from a bot, then that'd be a good reason to do so. What do other link-fetching services do?

After some discussion in #synapse-dev, I'm more favourable towards the configurable UA option, although I do realise that it wouldn't solve the problem for twitter by default.

@aaronraimist
Contributor

aaronraimist commented Aug 18, 2020

I don't know if it is a standard, but it doesn't seem uncommon. For example, most of Google's crawlers have the word "bot" in the user agent (https://support.google.com/webmasters/answer/1061943?hl=en), and the Wikipedia article on user agents says:

Automated web crawling tools can use a simplified form, where an important field is contact information in case of problems. By convention the word "bot" is included in the name of the agent.

The article's only reference for that is a blog post, but it does seem to be something that some people recommend.

https://en.wikipedia.org/wiki/User_agent#Format_for_automated_agents_(bots)
