Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: back-end implementation of ranked link seach #210

Merged
merged 28 commits into from
Jun 29, 2020

Conversation

JasonChong96
Copy link
Contributor

@JasonChong96 JasonChong96 commented Jun 19, 2020

Problem

"Members of the public (MOPs) view go.gov.sg as a central hub for accessing government resources. By providing a link search feature, we will be able to better direct MOPs to access the resources that they require."

Solution

This PR continues the implementation by providing the back-end API for fetching links using plain text queries.

Features:

  • A new endpoint api/search/urls has been added for ranked plain text search with support for pagination.
  • Similar to api/user/url, the response contains the total count of matching urls and the urls within the requested range.
  • The search takes into account both the shortUrl and description of each entry. Text from the description has a lower weightage than those from the shortUrl under the assumption that we can take shortUrl as the title and hence contains words which are more important.
  • The search uses PostgreSQL's ts_rank_cd, which takes into account how far apart query terms are found in the urls. The further they are, the lower the ranking.
  • The raw query params are passed separately from the rest of the query using parameter binding. This ensures that the endpoint is not susceptible to SQL injection given that a raw query is used. (Tested using Postman)
  • The API supports the following ranking conditions: relevance, recency and popularity
  • This new endpoint is rate limited to 20 requests/second/user
  • Search ignores INACTIVE urls and they are not included in the partial inverted index used for search
  • Relevancy ranking is normalized by multiplying by 1 / log(doc length + 1). This is due to the assumption that if there are less words that do not match the query, then the terms are more important in the entry, making it more likely to be relevant to the user's query.

The relevance ranking algorithm used is as follows:
(text ranking by PostgreSQL) * log(1 + clickCount)

Notes: The 1 is added to click counts to prevent 0 clickCount from causing an error. There is an assumption made that more popular links are more likely to be relevant to users' queries.

Additional notes

A request to the endpoint requires two separate database queries. This mimics the behavior of Sequelize's findAndCountAll which we use for api/user/url to support pagination.

Deploy Notes

  • The migration file provided creates the index required for this search to be run efficiently. Benchmarks done on local machine shows a speed increase of around 20x.
  • The migration file should only be run after the relevant columns are added from the phase 0 implementation.
  • The migration file does create any breaking changes so it can be run before this implementation is deployed.

Dependencies

  • express-rate-limit: rate limiter middleware for api endpoints
  • @types/express-rate-limit: type definitions for express-rate-limit

TODO:

  • Implement endpoint
  • Create migration file for indexing
  • Wait for phase 0 to be merged into develop and rebase
  • Remove click count from API response
  • Support different sorting orders
  • Add rate limiting
  • Add tests

Full documentation of this feature will be done on the wiki asap

@JasonChong96 JasonChong96 changed the title feat: back-end implementation of link seach feat: back-end implementation of ranked link seach Jun 19, 2020
@JasonChong96 JasonChong96 force-pushed the search-data-collection branch from 8cd43b6 to 0cd76e2 Compare June 22, 2020 07:52
@JasonChong96 JasonChong96 force-pushed the search-phase-1 branch 2 times, most recently from 791722a to 1082347 Compare June 23, 2020 09:50
Base automatically changed from search-data-collection to develop June 24, 2020 06:20
@JasonChong96 JasonChong96 marked this pull request as ready for review June 24, 2020 10:47
Copy link
Contributor

@liangyuanruo liangyuanruo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes as commented - largely lgtm otherwise!

Copy link
Contributor

@liangyuanruo liangyuanruo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to address conflicts before merging

@JasonChong96 JasonChong96 merged this pull request into develop Jun 29, 2020
@JasonChong96 JasonChong96 deleted the search-phase-1 branch June 29, 2020 18:10
LoneRifle pushed a commit that referenced this pull request Jun 30, 2020
* feat: endpoint for url search

* fix: remove redundant log

* fix: inappropriate error message

* feat: rate limiting on search endpoint

* refactor: use table name from orm

* feat: hide link clicks from search response

* feat: support different search orders

* fix: update comments

* fix: imports

* fix: search order validation

* feat: add unit test for search controller

* fix: test request using wrong params

* feat: search ignores inactive links

* refactor: move stripping of clicks to service layer

* feat: additional tests for new methods

* feat: add more tests for textsearch

* refactor: remove redundant coalesce

* fix: error in sql statement for recency sort

* docs: add comment explaining ts_rank_cd normalization

* refactor: extract helper methods from search

* fix: packagelock

* fix: use more reasonable default limit

* fix: typo in documentation

* refactor: capitalize sql keywords

* fix: count including inactive urls and not using index

* feat: rate limit use real ip and logs when limit is reached

* fix: formatting
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants