Run Book / System Operation Manual

Service or system overview

Service or system name:

Business overview

What business need is met by this service or system? What expectations do we have about availability and performance?

(e.g. Provides reliable automated reconciliation of logistics transactions from the previous 24 hours)

Technical overview

What kind of system is this? Web-connected order processing? Back-end batch system? Internal HTTP-based API? ETL control system?

(e.g. Internal API for order reconciliation based on Ruby and RabbitMQ, deployed in Docker containers on Kubernetes)

Service owner

Which team owns and runs this service or system?

(e.g. The Sneaky Sharks team (Bangalore) develops and runs this service: sneaky.sharks@company.com / #sneaky-sharks on Slack / Extension 9265)

Contributing applications, daemons, services, middleware

Which distinct software applications, daemons, services, etc. make up the service or system? What external dependencies does it have?

(e.g. Ruby app + RabbitMQ for source messages + PostgreSQL for reconciled transactions)

Security and access control

Password and PII security

What kind of security is in place for passwords and Personally Identifiable Information (PII)? Are the passwords hashed with a strong hash function and salted?

(e.g. Passwords are hashed with a 10-character salt and SHA-256)
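As an illustration of what "salted and hashed with a strong hash function" can look like in practice, here is a minimal Python sketch. It is not part of the template. It assumes PBKDF2-HMAC-SHA256 from the standard library, which is a substitution for the single SHA-256 pass in the example above, chosen because a deliberately slow key-derivation function is generally preferred for passwords. The names `hash_password` and `verify_password`, the salt length, and the iteration count are all assumptions for illustration.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

# Assumed parameters for illustration only; tune them for your own system.
SALT_BYTES = 16
ITERATIONS = 200_000


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest) for a password, generating a random salt if none is given."""
    if salt is None:
        salt = os.urandom(SALT_BYTES)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)
```

A run book entry based on a scheme like this would record the algorithm, the salt length, the iteration count, and where the salt and digest are stored.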

System configuration

Configuration management

How is configuration managed for the system?

(e.g. CloudInit bootstraps the installation of Puppet. Puppet then drives all system-level and application-level configuration, except for the XYZ service, which is configured via App.config files in Subversion)

Secrets management

How are configuration secrets managed?

(e.g. Secrets are managed with HashiCorp Vault with 3 shards for the master key)

System backup and restore

Backup requirements

Which parts of the system need to be backed up?

(e.g. Only the CoreTransactions database in PostgreSQL and the Puppet master database need to be backed up)

Backup procedures

How does backup happen? Is the service affected? Should the system be [partially] shut down first?

(e.g. Backup happens from the read replica - live service is not affected)

Restore procedures

How does restore happen? Is the service affected? Should the system be [partially] shut down first?

(e.g. The Booking service must be switched off before the restore happens; otherwise transactions will be lost)

Monitoring and alerting

Metrics

What significant metrics will be generated?

(e.g. Usual VM stats (CPU, disk, threads, etc.) + around 200 application technical metrics + around 400 user-level metrics)

Health checks

How is the health of dependencies (components and systems) assessed? How does the system report its own health?
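One possible shape for an answer, sketched in Python: the service runs a named check per dependency and reports an aggregate status alongside the per-dependency results. This is a hypothetical sketch, not something the template prescribes. The function name `run_health_checks`, the status strings, and the report layout are assumptions, and the checks in the usage lines are stand-ins for real probes such as a database ping or a queue connection test.

```python
from typing import Callable, Dict


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run each dependency check and build an aggregate health report."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:
            # A check that raises is reported as an error rather than crashing the report.
            results[name] = f"error: {exc}"
    healthy = all(status == "ok" for status in results.values())
    return {
        "status": "healthy" if healthy else "unhealthy",
        "dependencies": results,
    }


# Usage with stand-in checks (hypothetical dependency names).
report = run_health_checks({
    "postgresql": lambda: True,
    "rabbitmq": lambda: False,
})
print(report["status"])  # prints "unhealthy"
```

A report like this could be served from a health endpoint, so that both operators and monitoring tools read the same view of the system.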

Operational tasks

Deployment

How is the software deployed? How does roll-back happen?

(e.g. We use GoCD to coordinate deployments, triggering a Chef run pulling RPMs from the internal yum repo)

Troubleshooting

How should troubleshooting happen? What tools are available?

(e.g. Use a combination of the /health endpoint checks and the abc-*.sh scripts for diagnosing typical problems)