This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

URL previewing support #688

Merged
merged 44 commits into from
Apr 11, 2016

ara4n
Member

ara4n commented Apr 3, 2016

  • add design sketch doc for a URL preview API
  • add a new SpiderHttpClient derived from SimpleHttpClient, which follows redirects and handles gzip CTE correctly
  • add get_file support to SimpleHttpClient, knowingly duplicated for now from matrixfederationclient.
  • add a preview_url_resource to implement the new media/r0/preview_url API. This:
    • spiders a given URL, extracting or synthesising OpenGraph metadata for it via lxml, and returns the metadata as a JSON blob
    • returns a cached copy (from disk or from memory) of the URL's metadata as of the requested point in time, if available.
    • deduplicates requests to spider a URL so that only one request per URL is in flight at any point.
  • adds support for thumbnailing SVGs (by just passing back the original image for now)
  • adds the local_media_repository_url_cache table to the DB for the on-disk URL cache
  • adds get_url_cache and store_url_cache to media_repository.py to wrap the new table
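
To illustrate the "extracting or synthesising OpenGraph metadata" step above, here is a minimal sketch. It swaps in the standard library's html.parser for lxml so that it runs without extra dependencies, and it is not the code from this PR. The names OpenGraphParser and extract_og are hypothetical, and the only synthesis shown is falling back to the page's <title> when og:title is missing.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects og:* <meta> properties, plus the <title> text as a fallback."""

    def __init__(self):
        super().__init__()
        self.og = {}
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta":
            prop = attr_map.get("property") or ""
            content = attr_map.get("content")
            if prop.startswith("og:") and content is not None:
                # Keep the first value seen for each property.
                self.og.setdefault(prop, content)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def extract_og(html_text):
    """Return a dict of OpenGraph properties found in (or synthesised from) html_text."""
    parser = OpenGraphParser()
    parser.feed(html_text)
    parser.close()
    og = dict(parser.og)
    if "og:title" not in og:
        # Synthesise a title from the <title> element when the page has no og:title.
        title = "".join(parser._title_parts).strip()
        if title:
            og["og:title"] = title
    return og
```

The resulting dict could then be serialised with json.dumps to produce the JSON blob the API returns.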

N.B. Following redirects will not work correctly until https://twistedmatrix.com/trac/ticket/8265 is merged. I'm unsure whether it's worth maintaining our own Twisted fork until that happens.

Given that I'm hardly a Python/Twisted expert, review would be particularly appreciated on:

  • Whether I should be doing anything with log contexts. I've seen some worrying warnings about log contexts not being preserved.
  • Whether I should be doing anything smarter with the types or flavours of exceptions I raise. In particular, some exceptions from the async_get method seem to lose their stack traces entirely, and I haven't spotted the pattern yet.

This is part of a set of PRs spanning vector-web, matrix-react-sdk, matrix-js-sdk and synapse.
See also element-hq/element-web#1343, matrix-org/matrix-react-sdk#260 and matrix-org/matrix-js-sdk#122.

…d, experimental, etc. just putting it here for safekeeping for now
def get_url_cache_txn(txn):
    # get the most recently cached result (relative to the given ts)
    sql = (
        "SELECT response_code, etag, expires, og, media_id, max(download_ts)"
Contributor

You probably want to be doing ORDER BY download_ts DESC LIMIT 1 rather than max(download_ts)
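
A minimal sketch of the suggested query shape, using an in-memory SQLite database. The table name and most columns come from the snippet above; the url column, the column types, the sample rows and the get_url_cache helper are assumptions for illustration only. One reason to prefer ORDER BY ... LIMIT 1 is that PostgreSQL rejects a SELECT that mixes plain columns with max() without a GROUP BY.

```python
import sqlite3


def get_url_cache(conn, url, ts):
    """Return the most recent cached row for url downloaded at or before ts, or None."""
    return conn.execute(
        "SELECT response_code, etag, expires, og, media_id, download_ts"
        " FROM local_media_repository_url_cache"
        " WHERE url = ? AND download_ts <= ?"
        " ORDER BY download_ts DESC LIMIT 1",
        (url, ts),
    ).fetchone()


# Assumed schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE local_media_repository_url_cache ("
    " url TEXT, response_code INTEGER, etag TEXT, expires INTEGER,"
    " og TEXT, media_id TEXT, download_ts BIGINT)"
)
conn.executemany(
    "INSERT INTO local_media_repository_url_cache VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("http://example.com", 200, "etag-a", 1000, "{}", "media-a", 100),
        ("http://example.com", 200, "etag-b", 2000, "{}", "media-b", 200),
    ],
)

row = get_url_cache(conn, "http://example.com", 150)
print(row)
```

With this shape, a lookup that matches nothing returns no row at all, whereas an aggregate query without GROUP BY always returns one row (of NULLs when nothing matches).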

@NegativeMjark
Contributor

I think you need to run apt-get install libxslt1-dev before you can install lxml on Debian, fwiw.

@NegativeMjark
Contributor

Can we make the entire thing optional somehow? We probably can't run it by default anyway given that it needs an IP blacklist.

# first check the memory cache - good to handle all the clients on this
# HS thundering away to preview the same URL at the same time.
try:
    og = self.cache[url]
Contributor

Use cache.get() rather than try: except:
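
For comparison, a small sketch of the two lookup styles. It uses a plain dict as a stand-in for the cache (an assumption; the real cache class in this PR may differ), and the function names are hypothetical.

```python
def lookup_with_try(cache, url):
    """Subscript lookup: a miss raises KeyError, which has to be caught."""
    try:
        return cache[url]
    except KeyError:
        return None


def lookup_with_get(cache, url):
    """get() lookup: a miss simply returns None, with no exception handling."""
    return cache.get(url)


# Stand-in cache with one hypothetical entry.
cache = {"http://example.com": {"og:title": "Example"}}
hit = lookup_with_get(cache, "http://example.com")
miss = lookup_with_get(cache, "http://example.org")
```

Both return the same results for a dict. If None could itself be a cached value, a dedicated sentinel default would be needed to tell a miss from a stored None.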

…oint. defaults to off.

Add url_preview_ip_range_blacklist to let admins specify internal IP ranges that must not be spidered.
Add url_preview_url_blacklist to let admins specify URL patterns that must not be spidered.
Implement a custom SpiderEndpoint and associated support classes to implement url_preview_ip_range_blacklist
Add commentary and generally address PR feedback
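
A rough sketch of the kind of check that url_preview_ip_range_blacklist implies, using the standard library's ipaddress module. This is not the SpiderEndpoint implementation from this PR; the example ranges and the is_ip_blacklisted helper are hypothetical.

```python
import ipaddress

# Hypothetical example ranges; a real deployment would list its own internal networks.
EXAMPLE_BLACKLIST = [
    ipaddress.ip_network(cidr)
    for cidr in ("127.0.0.0/8", "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_ip_blacklisted(ip, blacklist=EXAMPLE_BLACKLIST):
    """Return True if the IP address string falls inside any blacklisted network."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in blacklist)
```

A check like this presumably has to run against the address the connection actually resolves to, not the hostname in the URL, since a public hostname can resolve to an internal address.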
@ara4n
Member Author

ara4n commented Apr 8, 2016

Incorporated all the PR feedback. @NegativeMjark PTAL

isLeaf = True

def __init__(self, hs, filepaths):
    if not html:
Contributor

The "not html" check probably throws if lxml isn't installed.
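
One common way to avoid that failure is to bind the name to None when the import fails, so that a later truthiness check is always safe. The sketch below assumes that pattern; lxml_html, PreviewDisabledError and require_lxml are hypothetical names, not the ones used in this PR.

```python
try:
    import lxml.html as lxml_html
except ImportError:
    # Bind the name anyway so later checks don't raise NameError.
    lxml_html = None


class PreviewDisabledError(RuntimeError):
    """Raised when URL previewing is requested but lxml is unavailable."""


def require_lxml():
    """Return the lxml.html module, or raise a clear error if it isn't installed."""
    if lxml_html is None:
        raise PreviewDisabledError("lxml is not installed; URL previews are disabled")
    return lxml_html
```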

@ara4n
Member Author

ara4n commented Apr 8, 2016

@NegativeMjark I've addressed these too, and it now throws sensible exceptions. PTAL

"blacklist in url_preview_ip_range_blacklist for url previewing "
"to work"
)
raise RunTimeError(
Contributor

It's RuntimeError, not RunTimeError. This sort of typo can be picked up by running flake8 synapse, fwiw.

@NegativeMjark
Contributor

Other than fixing the typos and style warnings, it LGTM. I'm slightly concerned by the lack of tests for it though.
