Table of Contents
In this project, I worked with a team of engineers in designing a complex backend system for a legacy codebase to prepare the service for production level traffic. I worked on scaling the tours component of the service which enabled the user to view the most popular tours for their destination and sort through them with a variety of categories (see Front End Demo section below). As I was working with the backend, the areas I did the majority of the work in are with the database, data generation, server setup, and some work on the client folder as I made some small front-end refactors.
In order to scale the component, I began by performing multiple stress tests to simulate high user traffic using Loader.io and monitored my response information using New Relic. After recording the initial maximum load of the component, I proceeded to horizontally scale my service using an NGINX load balancer and also vertically scale my database. In the end, I was able to increase the servers maximum requests per minute (rpm) by 760% to 114,000 rpm. My methods for how I accomplished this are explained in greater detail in the Designing The Backend section.
Project Link: https://github.com/trips-ahoy/tours-service
In choosing a database for my component, scalability was very important and with this consideration in mind, I narrowed my choices to two databases a SQL database, Postgres, and a NOSQL database, Cassandra. In order to decide between the two, I proceeded to benchmark the two databases once they were seeded with the 10 million database entries and recorded the average non-cached response time at each of the 3 endpoints at a given listing ID. These listing IDs were distributed evenly throughout the dataset with 5 tested each at the first 10%, middle 10%, and last 10% portion of the dataset per endpoint for each database. Results are shown in the table below.
My results for the benchmark favored postgres and this was most likely due to relational nature of the queries I performed (describe in Dataset Breakdown). In postgres I was able to perform one complex query to get the data needed but this was not possible with a single query for cassandra and requried 2 separate queries to get all the information I needed. Looking at how both databases scale, Cassandra is a great option as it was built with scalability to be easily scalable but Postgres is a good option too with it's streaming replication feature which spreads queries to multiple read only replicas. This works well with my service as my endpoints only require reading from the database and do not need to insert into it. In the end, I choose to go with Postgres because of its better average response time and because it could scale well for what I needed it to do.
In the dataset, there are 10 million records that contain the tour information for the site. This information is spread out into 5 relational tables and is organized using the following schema.
The relation of the the data is as follows:
- There are 10 million listing IDs and each listing ID has 1 location ID
- There are 1000 location IDs and each location ID has 11-16 category IDs assigned to it
- There are 30 category ID's and each category ID has many tours associated to it
- There are 10 million tours and each tour has 1 location ID and 1 category ID
After I got my postgres database and service server deployed on EC2, I went on to hook my service server to a NGINX load balancer and began stress testing it while connected to a single service. I used a testing software called Loader io in order to simulate high traffic and increased traffic by 50 requests per second (rps) until my response time from the server exceeded 2s or my error rate passed 1%. Once that happened, I recorded the maximum rps that the service could handle before passing the minimum requirements and then added another service and began stress testing starting at the previous max rps and went on until no change/very minimal change was recorded across all endpoints. My results for the horizontal scaling of my database are shown in the table below and as you will see there was a large difference in the performance across the 3 endpoints with the best performing endpoint starting at 450 rps and then reaching 1250 rps when scaled to 4 services while my worst performing endpoint started at 250 rps and bottlenecked at 300 rps with only 2 services.
My next step was to investigate what was bottlenecking my worst endpoint. I noticed a trend in the performance across the endpoints where the faster the average response time the endpoint had the more it was able to increase the max rps when additional service servers were connected. This made me think that the database was being overwhelmed and was experiencing serious backlog when being flooded by requests that took longer to process. Due to a time constraint, I was not able to scale my database horizontally using streaming replication and instead went about scaling vertically. With the load balancer still connected to four services, I went about increasing the size of the EC2 instance type and effectively provided the database with more resources (additional vCPU's and RAM) and then stress tested the endpoints after, my results are shown in the table below. This change had immediate results and with each increase I was able to see a significant improvement across all endpoints. Most notably with my worst performing endpoint, initally maxed at 300 rps with a t2.micro now able to reach 1900 rps with a t2.xlarge.
In the end, I was able to successfully scale my component and surpassed my initial goal of 1000 rps across endpoints and nearly doubled that goal for each endpoint. If I had more time, I would have tried to continue increasing the maximum rps for the endpoint and would have scaled up whatever process was my current bottleneck. This would be great practice for scaling the service to handle extreme loads of traffic such as a Black Friday scenario. Additonally, I would rework my database to scale horizontally using streaming replication on EC2 t2.micro instances as the t2.xlarge that is currently used is already 16 times more expensive per hour than the t2.micro to deploy and would quickly rack up a large bill.