feat: back-end implementation of ranked link search #210

JasonChong96 · 2020-06-19T15:13:39Z

Problem

"Members of the public (MOPs) view go.gov.sg as a central hub for accessing government resources. By providing a link search feature, we will be able to better direct MOPs to access the resources that they require."

Link Search #181

Solution

This PR continues the implementation by providing the back-end API for fetching links using plain text queries.

Features:

A new endpoint, api/search/urls, has been added for ranked plain-text search with support for pagination.
Similar to api/user/url, the response contains the total count of matching urls and the urls within the requested range.
The search takes into account both the shortUrl and the description of each entry. Text from the description is weighted lower than text from the shortUrl, under the assumption that the shortUrl acts as the title and hence contains the more important words.
The search uses PostgreSQL's ts_rank_cd, which takes into account how far apart the query terms are found in each entry. The further apart they are, the lower the ranking.
The raw query params are passed separately from the rest of the query using parameter binding. This ensures that the endpoint is not susceptible to SQL injection even though a raw query is used. (Tested using Postman)
The API supports the following ranking conditions: relevance, recency and popularity.
This new endpoint is rate limited to 20 requests/second/user.
Search ignores INACTIVE urls, and they are not included in the partial inverted index used for search.
Relevancy ranking is normalized by multiplying by 1 / log(doc length + 1). This rests on the assumption that if an entry has fewer words that do not match the query, the matching terms carry more weight in that entry, making it more likely to be relevant to the user's query.
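
To make the binding and ordering points above concrete, here is a minimal sketch of how such a query could be assembled. It is not the code in this PR: the table name, column names, the 'english' text search configuration and the normalization flag are all assumptions. The `$name` placeholders follow the style accepted by Sequelize's `bind` option. Only the search text and the paging values are bound; the sort order is picked from a fixed allowlist, since a sort expression cannot be supplied through a bind parameter.

```typescript
// Hypothetical schema: table "urls" with columns "shortUrl", "description",
// "clicks", "createdAt" and "state". None of these names are confirmed here.
type SearchOrder = 'relevance' | 'recency' | 'popularity'

interface SearchQuery {
  sql: string
  bind: { query: string; limit: number; offset: number }
}

// shortUrl gets weight A and description weight B, so title-like words rank higher.
const SEARCH_VECTOR =
  `(setweight(to_tsvector('english', "shortUrl"), 'A') || ` +
  `setweight(to_tsvector('english', "description"), 'B'))`

// $query is a named bind placeholder; the user's text never enters the SQL string.
const TS_QUERY = `plainto_tsquery('english', $query)`

// The trailing 1 requests length-based normalization (the flag value is assumed).
const TEXT_RANK = `ts_rank_cd(${SEARCH_VECTOR}, ${TS_QUERY}, 1)`

// Sort expressions come from this allowlist, never from the request itself.
const ORDER_BY: Record<SearchOrder, string> = {
  relevance: `${TEXT_RANK} * log("clicks" + 1) DESC`,
  recency: `"createdAt" DESC`,
  popularity: `"clicks" DESC`,
}

function isSearchOrder(value: string): value is SearchOrder {
  return Object.prototype.hasOwnProperty.call(ORDER_BY, value)
}

function buildSearchQuery(
  query: string,
  order: string,
  limit: number,
  offset: number,
): SearchQuery {
  if (!isSearchOrder(order)) {
    throw new Error(`Unsupported search order: ${order}`)
  }
  const sql = [
    `SELECT "shortUrl", "description", "createdAt"`,
    `FROM "urls"`,
    `WHERE "state" = 'ACTIVE' AND ${SEARCH_VECTOR} @@ ${TS_QUERY}`,
    `ORDER BY ${ORDER_BY[order]}`,
    `LIMIT $limit OFFSET $offset`,
  ].join('\n')
  return { sql, bind: { query, limit, offset } }
}
```

With this shape, a call such as `buildSearchQuery(req.query.query, req.query.order, 10, 0)` (hypothetical request fields) yields a constant SQL string plus a `bind` object, which is what makes the injection argument easy to check in review.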

The relevance ranking algorithm used is as follows:
(text ranking by PostgreSQL) * log(1 + clickCount)

Note: 1 is added to clickCount so that a clickCount of 0 does not cause an error (log(0) is undefined). The formula assumes that more popular links are more likely to be relevant to users' queries.
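
As a quick illustration of the formula above, the score can be written as a small pure function. This is only a sketch: the function name is hypothetical, and the base-10 logarithm is an assumption that mirrors PostgreSQL's `log()`.

```typescript
// Relevance score as described above: text rank scaled by log(1 + clickCount).
// Base 10 is an assumption here, chosen to mirror PostgreSQL's log() function.
function relevanceScore(textRank: number, clickCount: number): number {
  if (clickCount < 0) {
    throw new RangeError('clickCount must be non-negative')
  }
  return textRank * Math.log10(1 + clickCount)
}
```

One consequence worth noting, assuming the formula is applied exactly as written: a link with 0 clicks scores 0 for any text rank, since log(1) = 0.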

Additional notes

A request to the endpoint requires two separate database queries. This mimics the behavior of Sequelize's findAndCountAll which we use for api/user/url to support pagination.
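
The count-plus-page pattern can be sketched with an in-memory stand-in for the database. Everything here is hypothetical, including the field names `count` and `urls`, and substring matching only stands in for the real full-text search. The point is that the count ignores the page window while the page query applies it, and that both share the same filter.

```typescript
// In-memory stand-in for the two database queries; all names are hypothetical.
interface StoredUrl {
  shortUrl: string
  description: string
  state: 'ACTIVE' | 'INACTIVE'
  clicks: number
}

// Results expose neither clicks nor state.
interface SearchResultUrl {
  shortUrl: string
  description: string
}

interface SearchResponse {
  count: number
  urls: SearchResultUrl[]
}

// Shared filter: substring matching here is a placeholder for full-text search.
function matches(url: StoredUrl, term: string): boolean {
  const needle = term.toLowerCase()
  return (
    url.state === 'ACTIVE' &&
    (url.shortUrl.toLowerCase().includes(needle) ||
      url.description.toLowerCase().includes(needle))
  )
}

// "Query" 1: total number of matches, independent of the requested page.
function countMatches(table: StoredUrl[], term: string): number {
  return table.filter((url) => matches(url, term)).length
}

// "Query" 2: only the rows inside the requested range, with clicks stripped.
function fetchPage(
  table: StoredUrl[],
  term: string,
  limit: number,
  offset: number,
): SearchResultUrl[] {
  return table
    .filter((url) => matches(url, term))
    .slice(offset, offset + limit)
    .map(({ shortUrl, description }) => ({ shortUrl, description }))
}

function search(
  table: StoredUrl[],
  term: string,
  limit: number,
  offset: number,
): SearchResponse {
  return {
    count: countMatches(table, term),
    urls: fetchPage(table, term, limit, offset),
  }
}
```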

Deploy Notes

The migration file provided creates the index required for this search to run efficiently. Benchmarks done on a local machine show a speed increase of around 20x.
The migration file should only be run after the relevant columns are added from the phase 0 implementation.
The migration file does not create any breaking changes, so it can be run before this implementation is deployed.
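
For readers unfamiliar with partial inverted indexes, the following sketch shows roughly what such a migration could contain. It is not the migration file in this PR: the index name, table name, column names and the 'english' configuration are assumptions, and `SqlRunner` is a hypothetical stand-in for whatever executes raw SQL during the migration.

```typescript
// Hypothetical names: index "urls_search_idx", table "urls", columns as below.
const INDEX_NAME = 'urls_search_idx'

const INDEX_EXPRESSION =
  `(setweight(to_tsvector('english', "shortUrl"), 'A') || ` +
  `setweight(to_tsvector('english', "description"), 'B'))`

// GIN is PostgreSQL's inverted index type; the WHERE clause makes the index
// partial, so INACTIVE rows are never indexed.
const CREATE_INDEX_SQL =
  `CREATE INDEX ${INDEX_NAME} ON "urls" USING gin (${INDEX_EXPRESSION}) ` +
  `WHERE "state" = 'ACTIVE'`

const DROP_INDEX_SQL = `DROP INDEX IF EXISTS ${INDEX_NAME}`

// Minimal stand-in for the object that runs raw SQL in a migration.
interface SqlRunner {
  query(sql: string): Promise<unknown>
}

// Creating an index adds no columns and changes no data.
function up(runner: SqlRunner): Promise<unknown> {
  return runner.query(CREATE_INDEX_SQL)
}

function down(runner: SqlRunner): Promise<unknown> {
  return runner.query(DROP_INDEX_SQL)
}
```

PostgreSQL can only use an expression index when the query's expression matches the indexed one, and a partial index only when the query's WHERE clause implies the index predicate, so the search query and the migration would need to stay in step.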

Dependencies

express-rate-limit: rate limiter middleware for api endpoints
@types/express-rate-limit: type definitions for express-rate-limit
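
To show what a limit of 20 requests/second/user means in practice, here is a self-contained fixed-window counter. This is deliberately not express-rate-limit and makes no claim about how that library counts requests internally; the class name, the choice of a fixed window and the use of an IP address as the key are all assumptions for illustration.

```typescript
// Illustrative fixed-window limiter; not the express-rate-limit implementation.
interface WindowState {
  windowStart: number
  count: number
}

class FixedWindowLimiter {
  private readonly max: number
  private readonly windowMs: number
  private readonly windows = new Map<string, WindowState>()

  constructor(max: number, windowMs: number) {
    this.max = max
    this.windowMs = windowMs
  }

  // Returns true if the request identified by `key` is allowed at time `now` (ms).
  allow(key: string, now: number): boolean {
    const state = this.windows.get(key)
    // Start a fresh window on first sight of the key or once the old one expires.
    if (state === undefined || now - state.windowStart >= this.windowMs) {
      this.windows.set(key, { windowStart: now, count: 1 })
      return true
    }
    if (state.count >= this.max) {
      return false
    }
    state.count += 1
    return true
  }
}
```

A limiter built as `new FixedWindowLimiter(20, 1000)` and keyed by client IP would allow 20 calls per key in any one window and reject further calls until the window rolls over.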

TODO:

Implement endpoint
Create migration file for indexing
Wait for phase 0 to be merged into develop and rebase
Remove click count from API response
Support different sorting orders
Add rate limiting
Add tests

Full documentation of this feature will be added to the wiki asap.

src/server/api/search.ts

src/server/controllers/SearchController.ts

src/server/repositories/UrlRepository.ts

liangyuanruo

changes as commented - largely lgtm otherwise!

liangyuanruo

to address conflicts before merging

* feat: endpoint for url search
* fix: remove redundant log
* fix: inappropriate error message
* feat: rate limiting on search endpoint
* refactor: use table name from orm
* feat: hide link clicks from search response
* feat: support different search orders
* fix: update comments
* fix: imports
* fix: search order validation
* feat: add unit test for search controller
* fix: test request using wrong params
* feat: search ignores inactive links
* refactor: move stripping of clicks to service layer
* feat: additional tests for new methods
* feat: add more tests for textsearch
* refactor: remove redundant coalesce
* fix: error in sql statement for recency sort
* docs: add comment explaining ts_rank_cd normalization
* refactor: extract helper methods from search
* fix: packagelock
* fix: use more reasonable default limit
* fix: typo in documentation
* refactor: capitalize sql keywords
* fix: count including inactive urls and not using index
* feat: rate limit use real ip and logs when limit is reached
* fix: formatting

JasonChong96 changed the title ~~feat: back-end implementation of link search~~ feat: back-end implementation of ranked link search Jun 19, 2020

JasonChong96 force-pushed the search-data-collection branch from 8cd43b6 to 0cd76e2 Compare June 22, 2020 07:52

JasonChong96 force-pushed the search-phase-1 branch 2 times, most recently from 791722a to 1082347 Compare June 23, 2020 09:50

Base automatically changed from search-data-collection to develop June 24, 2020 06:20

JasonChong96 force-pushed the search-phase-1 branch from cea5a39 to a2b19d8 Compare June 24, 2020 07:28

JasonChong96 marked this pull request as ready for review June 24, 2020 10:47

JasonChong96 force-pushed the search-phase-1 branch from db763e4 to 8261d1c Compare June 25, 2020 04:59

JasonChong96 added 21 commits June 25, 2020 14:57

feat: endpoint for url search

f5b201f

fix: remove redundant log

5c2f2db

fix: inappropriate error message

d6b29fc

feat: rate limiting on search endpoint

e4b4d62

refactor: use table name from orm

af5fb15

feat: hide link clicks from search response

320a21d

feat: support different search orders

46aa9ae

fix: update comments

5153d2b

fix: imports

60c914d

fix: search order validation

7795e8d

feat: add unit test for search controller

8dade09

fix: test request using wrong params

c5501d0

feat: search ignores inactive links

aca20d2

refactor: move stripping of clicks to service layer

a44a9df

feat: additional tests for new methods

1b37abd

feat: add more tests for textsearch

1c78f3d

refactor: remove redundant coalesce

1878820

fix: error in sql statement for recency sort

73d1a18

docs: add comment explaining ts_rank_cd normalization

f74f59c

refactor: extract helper methods from search

e780b3d

fix: packagelock

ac87c21

JasonChong96 force-pushed the search-phase-1 branch from 3f232b1 to ac87c21 Compare June 25, 2020 07:20

JasonChong96 requested a review from liangyuanruo June 25, 2020 09:51