
photo-sharing service

STEP 1) requirements

  1. Functional requirements
  • Users should be able to upload photos/videos
  • Users should be able to favourite/comment on photos
  • Users can follow/unfollow other users
  • Users should be able to see a "news feed" built from the most relevant posts of the users they follow
  • Users should be able to search photos/videos by name
  • Users should be able to tag photos (out of scope)
  2. Non-functional requirements
  • Highly available (especially while posting photos/videos)
  • Minimal latency: the "news feed" should be displayed within 200 ms
  • Reliability: posted data should never be lost
  • Eventually consistent: a newly posted photo may become visible only after a short while

STEP 2) scaling estimation

  • Read-heavy: photos/videos are read far more often than they are written
  • Total users: 500M
  • Daily active users: 1M
  • New photos/videos per day: 2M
  • Average photo/video size: 200 KB
  • Storage estimation
  • Space per day = 2M * 200 KB = 400 GB
  • Total space per year = space per day * 365 = 146,000 GB ≈ 146 TB
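
The same arithmetic as the bullets above, as a quick sanity check (using decimal units, 1 GB = 10^6 KB; the numbers are taken directly from the estimates):

```python
# Back-of-envelope storage check for the estimates above.
new_media_per_day = 2_000_000   # 2M photos/videos per day
avg_size_kb = 200               # 200 KB average media size

gb_per_day = new_media_per_day * avg_size_kb / 10**6
tb_per_year = gb_per_day * 365 / 10**3

print(f"{gb_per_day:.0f} GB/day, {tb_per_year:.0f} TB/year")  # 400 GB/day, 146 TB/year
```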

STEP 3) System interface definition

```
client  ~> POST   -H "Authorization: Digest " -d '{"data": "", "dataTitle": "", "dataType": "png"}' /user/${userId}/post
        ~> POST   -H "Authorization: Digest " -d '{"commentText": "", "dateTime": "2020-06-20T00:00:00-07:00"}' /user/${userId}/media/${mediaId}/comment
        ~> DELETE -H "Authorization: Digest " /user/${userId}/post/${mediaId}

        ~> POST   -H "Authorization: Digest " -d '{"followId": 123, "followedDateTime": "yyyy-MM-ddThh:mm:ssZ"}'   /user/${userId}/follow
        ~> POST   -H "Authorization: Digest " -d '{"followId": 123, "unfollowedDateTime": "yyyy-MM-ddThh:mm:ssZ"}' /user/${userId}/unfollow

        ~> GET    -H "Authorization: Digest " /user/${userId}/news-feed
```
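
For illustration, a minimal client-side sketch of the first and last calls above; the base URL, the Digest credentials, and the use of Python's requests library are all placeholder assumptions, not part of the design:

```python
import requests  # assumed available: pip install requests

BASE_URL = "https://photos.example.com"              # hypothetical host
HEADERS = {"Authorization": "Digest <credentials>"}  # placeholder credentials
user_id = 42

# Upload a photo post (mirrors POST /user/${userId}/post above)
resp = requests.post(
    f"{BASE_URL}/user/{user_id}/post",
    headers=HEADERS,
    json={"data": "<base64-encoded bytes>", "dataTitle": "sunset", "dataType": "png"},
)
resp.raise_for_status()

# Read the pre-computed news feed (mirrors GET /user/${userId}/news-feed above)
feed = requests.get(f"{BASE_URL}/user/{user_id}/news-feed", headers=HEADERS).json()
print(len(feed))
```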

STEP 4) data model definition/design

  • Each metadata record will store around 1,024 bytes (the media binary itself lives in object storage).
  • The database needs to support heavy reads (write once, read many).
  • The number of records will grow into the billions.
```sql
-- psql

-- "user" is a reserved word in PostgreSQL, so the table is named users
CREATE TABLE users (
  user_id           BIGSERIAL PRIMARY KEY,
  username          VARCHAR(16),
  password          VARCHAR(255),
  created_date_time TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
  is_active         BOOLEAN,
  deactivated_on    TIMESTAMPTZ -- should there be a reactivate feature? if so, CREATE TABLE user_activity
);

CREATE TABLE media (
  media_id            BIGSERIAL PRIMARY KEY,
  user_id             BIGINT REFERENCES users(user_id),
  media_type          VARCHAR(10),
  media_resource_path VARCHAR(255), -- object-storage link (Azure Blob Storage, GCP Cloud Storage, etc.)
  created_date_time   TIMESTAMPTZ,
  is_deleted          BOOLEAN,
  deleted_on          TIMESTAMPTZ
);

CREATE TABLE media_comment (
  comment_id         BIGSERIAL PRIMARY KEY,
  media_id           BIGINT REFERENCES media(media_id),
  comment_text       VARCHAR(255),
  commented_by       BIGINT REFERENCES users(user_id),
  created_date_time  TIMESTAMPTZ,
  modified_date_time TIMESTAMPTZ, -- create media_comment_history(??)
  is_deleted         BOOLEAN,
  deleted_on         TIMESTAMPTZ
);

-- followers
CREATE TABLE user_follower (
  id          BIGSERIAL PRIMARY KEY,
  user_id     BIGINT REFERENCES users(user_id),
  follower_id BIGINT REFERENCES users(user_id),
  followed_on TIMESTAMPTZ,
  is_deleted  BOOLEAN,
  deleted_on  TIMESTAMPTZ
);
```
  • The data is relational (e.g. a user has followers, so user_follower has to be joined with the users table itself; likewise a user has comments, which could be denormalized, but each comment still carries commented_by referencing a user)
  • NoSQL (e.g. Cassandra) could be used for better scaling, but a single user's record can grow very large
  • Data can be partitioned by user_id so that all related data for a given user goes to the same partition (user_id % num_of_partitions). Hot users (e.g. celebrities) need special handling so that their data does not overflow a single partition; see the sketch after this list
  • Purging / DB cleanup: do a lazy cleanup (e.g. a periodic job when traffic is low)
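
A minimal sketch of the partition-routing idea referenced above; the partition count, the hot-user list, and the hashing scheme for hot users are illustrative assumptions, not part of the original notes:

```python
from typing import Optional
import hashlib

NUM_PARTITIONS = 64          # assumption: fixed number of logical partitions
HOT_USER_IDS = {101, 202}    # hypothetical "celebrity" accounts

def partition_for(user_id: int, media_id: Optional[int] = None) -> int:
    """Pick the partition for a record.

    Regular users: user_id % NUM_PARTITIONS keeps all of a user's data together.
    Hot users: hash (user_id, media_id) so one account's media fans out across
    partitions instead of overflowing a single one.
    """
    if user_id in HOT_USER_IDS and media_id is not None:
        key = f"{user_id}:{media_id}".encode()
        return int(hashlib.sha1(key).hexdigest(), 16) % NUM_PARTITIONS
    return user_id % NUM_PARTITIONS

# A normal user's records always land on the same partition...
assert partition_for(12345) == partition_for(12345, media_id=99)
# ...while a hot user's media is spread across partitions.
print(partition_for(101, media_id=7), partition_for(101, media_id=8))
```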

STEP 6) Detailed design

  • Load Balancer (LB)

  • Cache: given the 80/20 rule (roughly 80% of read traffic hits 20% of the media), we can use a CDN (Content Delivery Network) to cache the most frequently accessed media posts

  • Telemetry

  • Security and Permissions
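
As a rough companion to the CDN note above, an application-side cache for media metadata could also be sized around the 80/20 observation; a minimal sketch, where the cache size and the stubbed database read are made-up assumptions:

```python
from functools import lru_cache

# Assumptions (not from the notes): ~2M new media/day, and the hottest ~20%
# of them account for most reads => roughly 400k cached metadata entries.
CACHE_SIZE = 400_000

def fetch_media_row(media_id: int) -> dict:
    """Stand-in for a lookup against the media table / object store."""
    return {"media_id": media_id,
            "media_resource_path": f"https://cdn.example.com/media/{media_id}.jpg"}

@lru_cache(maxsize=CACHE_SIZE)
def get_media_metadata(media_id: int) -> dict:
    # First request for a media_id hits the "database"; repeats come from memory.
    return fetch_media_row(media_id)

get_media_metadata(1)
get_media_metadata(1)
print(get_media_metadata.cache_info())   # hits=1, misses=1
```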

News feed generation

  • Generate the feed based on relevance to the user (previous interactions with the followed users, etc.); a weighted-scoring sketch follows this list
  • Total number of interactions on the media post
  • Total number of views on the media
  • Total time spent on the media
  • Check whether the media post has been reported (authentic communication)
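
A minimal sketch of how the signals above could be combined into a single relevance score; the weights, field names, and sample posts are illustrative assumptions:

```python
from dataclasses import dataclass

# Illustrative weights for the signals listed above.
WEIGHTS = {"interactions": 2.0, "views": 0.5, "watch_seconds": 0.1, "affinity": 3.0}

@dataclass
class Post:
    media_id: int
    interactions: int      # favourites + comments
    views: int
    watch_seconds: float   # total time spent on the media
    affinity: float        # 0..1, prior interaction with the author
    reported: bool         # reported posts are excluded

def relevance(post: Post) -> float:
    if post.reported:
        return float("-inf")           # drop reported posts to the bottom
    return (WEIGHTS["interactions"] * post.interactions
            + WEIGHTS["views"] * post.views
            + WEIGHTS["watch_seconds"] * post.watch_seconds
            + WEIGHTS["affinity"] * post.affinity)

posts = [
    Post(1, interactions=12, views=300, watch_seconds=900, affinity=0.8, reported=False),
    Post(2, interactions=40, views=100, watch_seconds=50,  affinity=0.1, reported=False),
    Post(3, interactions=99, views=999, watch_seconds=999, affinity=1.0, reported=True),
]
feed = sorted(posts, key=relevance, reverse=True)
print([p.media_id for p in feed])      # reported post 3 sinks to the end
```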

Other designs