Based on this template

Basket API Runbook (Example based on a fictional service)

Overview

Business overview

  • Allows our customers to add tracks or releases to a basket to be later available for purchase via the 7digital web store.
  • Customers can pay for the contents of the basket, providing revenue for our D2C business.
  • The service should be highly available and process calls in a timely fashion so that the user experience does not discourage people from purchasing items.

Technical overview

  • HTTP-based API

Service Level Agreements (SLAs)

  • Internal SLOs:
    • 99.5% of calls in a monthly window return successful HTTP status codes.
    • 99.5% of calls in a monthly window complete in less than 500 ms.
  • No explicit client-facing or internal SLA.
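As a rough illustration of how the 99.5% objectives translate into an error budget, the sketch below checks a call count against the target. The function names and sample counts are hypothetical; only the 99.5% figure comes from the SLOs above, and the same check applies to either objective (successful status codes, or calls completing in under 500 ms).

```python
from fractions import Fraction
from math import floor

# 99.5%, taken from the internal SLOs above. Fraction avoids float rounding.
SLO_TARGET = Fraction(995, 1000)


def meets_slo(good_calls: int, total_calls: int, target: Fraction = SLO_TARGET) -> bool:
    """Return True if the share of good calls is at or above the target."""
    if total_calls == 0:
        return True  # no traffic, nothing violated
    return Fraction(good_calls, total_calls) >= target


def error_budget_remaining(good_calls: int, total_calls: int,
                           target: Fraction = SLO_TARGET) -> int:
    """Bad calls still allowed in the window (negative means the SLO is breached)."""
    allowed_bad = floor(total_calls * (1 - target))
    actual_bad = total_calls - good_calls
    return allowed_bad - actual_bad


if __name__ == "__main__":
    # Hypothetical monthly window: 10,000 calls, 40 of them failed or were slow.
    print(meets_slo(9_960, 10_000))
    print(error_budget_remaining(9_960, 10_000))
```

"Good" here means whatever the objective counts as success, so the availability and latency objectives would each be evaluated with their own counts.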

Dependencies

Consult the platform dependencies mapping diagram

  • Purchasing API
  • Locker API
  • AWS DynamoDB

Owner

Core Platform Team

System detail

Data and processing flows

Infrastructure and network design

Resilience

  • Deployed into 2 AZs via auto-scaling groups managed by AWS Elastic Beanstalk.
  • Circuit breakers & timeouts implemented on calls to 7digital API
  • Traffic is load balanced across multiple instances via an ELB.
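For readers unfamiliar with the circuit-breaker pattern mentioned above, here is a minimal sketch of the idea. It is not the service's actual implementation; the class name, thresholds, and reset interval are assumptions chosen for illustration. Timeouts would be set separately on whatever HTTP client makes the downstream call.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected without being tried."""


class CircuitBreaker:
    """Fail fast after repeated failures, then allow a trial call after a cool-off."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value, tune per dependency
        self.reset_after_s = reset_after_s          # assumed value
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-off elapsed: half-open, let this one trial call through.
            self._opened_at = None
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        self._failures = 0  # any success closes the circuit fully
        return result
```

A caller would wrap each downstream request, for example `breaker.call(fetch_basket, basket_id)` where `fetch_basket` is a hypothetical function that performs the HTTP call with its own timeout.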

Scalability

  • Auto scaling is triggered by excessive CPU usage.
  • Manual scaling can be accomplished by configuring the auto-scaling group "min" property.
  • Throughput is limited by the 7digital API in the DC.
  • Requests cannot be throttled.
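Manual scaling by raising the auto-scaling group "min" property could be scripted roughly as follows. This is a sketch under assumptions: the group name is hypothetical, and the `client` is expected to be a boto3 Auto Scaling client. Because the group is managed by Elastic Beanstalk, it is worth confirming that a change made directly on the group is not overwritten by a later environment update.

```python
def set_min_instances(client, group_name: str, min_size: int) -> None:
    """Raise or lower the auto-scaling group's minimum instance count.

    `client` is assumed to be boto3.client("autoscaling"); the API rejects
    the request if min_size exceeds the group's current maximum.
    """
    if min_size < 0:
        raise ValueError("min_size must be non-negative")
    client.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MinSize=min_size,
    )


# Example usage (requires boto3 and AWS credentials; group name is hypothetical):
#   import boto3
#   set_min_instances(boto3.client("autoscaling"), "basket-api-asg", 4)
```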

Monitoring and alerting

  • Events are logged to the app-error SumoLogic source category.
  • Metrics are logged to Datadog.
  • The application reports health via the /status endpoint, which the ELB uses to determine whether an instance should receive traffic.
  • Pingdom checks every minute that a specific basket can be retrieved. Two consecutive failures page the SRE on-call.
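The "fails twice in a row" paging rule can be expressed as a small function. This only illustrates the rule as described above; it is not how Pingdom is configured, and the names are hypothetical.

```python
from typing import Sequence


def consecutive_failures(results: Sequence[bool]) -> int:
    """Count trailing failed checks; each entry is True for pass, False for fail."""
    count = 0
    for passed in reversed(results):
        if passed:
            break
        count += 1
    return count


def should_page(results: Sequence[bool], threshold: int = 2) -> bool:
    """Page on-call once the most recent `threshold` checks have all failed."""
    return consecutive_failures(results) >= threshold
```

With one-minute checks, a single failed probe does not page anyone; the page fires on the second failure in a row, which is why a flapping check (fail, pass, fail, pass) stays silent.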

Expected traffic and load

  • ~5 RPS UK evenings
  • ~2 RPS UK daytime

CI/CD

  • Deployed via TeamCity
  • Rollback is accomplished by un-pinning the deployed "Build" and starting the "Deploy" build configuration.

Known issues

Pingdom check fails/flaps

  • The application occasionally deadlocks. A deadlocked EC2 instance shows no CPU usage, which is how you identify it. Restart that instance via the AWS EC2 console to fix it. Core Platform have a ticket open to investigate.
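Spotting the instance with no CPU usage could be automated along these lines. The sketch assumes recent CPU utilisation samples (in percent) have already been fetched per instance, for example from CloudWatch; the 1% threshold and the instance IDs are illustrative assumptions, not values from this service.

```python
from typing import List, Mapping, Sequence


def find_idle_instances(cpu_by_instance: Mapping[str, Sequence[float]],
                        threshold_pct: float = 1.0) -> List[str]:
    """Return instance IDs whose recent CPU samples are all below the threshold.

    Instances with no samples are skipped, since missing data is not
    evidence of a deadlock.
    """
    idle = []
    for instance_id, samples in cpu_by_instance.items():
        if samples and all(s < threshold_pct for s in samples):
            idle.append(instance_id)
    return sorted(idle)


if __name__ == "__main__":
    # Hypothetical samples: i-aaa has flatlined, i-bbb is serving traffic.
    samples = {"i-aaa": [0.0, 0.1, 0.0], "i-bbb": [35.0, 40.2, 38.9], "i-ccc": []}
    print(find_idle_instances(samples))
```

Any instance this flags is only a candidate; confirm in the EC2 console before restarting it.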