Service or system name:
What business need is met by this service or system? What expectations do we have about availability and performance?
(e.g. Provides reliable automated reconciliation of logistics transactions from the previous 24 hours)
What kind of system is this? Web-connected order processing? Back-end batch system? Internal HTTP-based API? ETL control system?
(e.g. Internal API for order reconciliation based on Ruby and RabbitMQ, deployed in Docker containers on Kubernetes)
Which team owns and runs this service or system?
(e.g. The Sneaky Sharks team (Bangalore) develops and runs this service: sneaky.sharks@company.com / #sneaky-sharks on Slack / Extension 9265)
Which distinct software applications, daemons, services, etc. make up the service or system? What external dependencies does it have?
(e.g. Ruby app + RabbitMQ for source messages + PostgreSQL for reconciled transactions)
What kind of security is in place for passwords and Personally Identifiable Information (PII)? Are the passwords hashed with a strong hash function and salted?
(e.g. Passwords are hashed with a 10-character salt and SHA-256)
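As one possible illustration of salted password hashing, the sketch below uses PBKDF2-HMAC-SHA256 from the Python standard library. This is a swapped-in technique rather than anything this template prescribes: the function names, the 16-byte salt and the iteration count are all assumptions for illustration, so record whatever scheme your system actually uses.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

# Assumed work factor for illustration only; tune it for your own hardware.
ITERATIONS = 600_000


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest) for the given password (hypothetical helper)."""
    if salt is None:
        # A fresh random salt per password, so equal passwords hash differently.
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-derive the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)
```

Both the salt and the digest would need to be stored alongside the user record so that verification can re-derive the hash later.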
How is configuration managed for the system?
(e.g. CloudInit bootstraps the installation of Puppet - Puppet then drives all system and application level configuration except for the XYZ service which is configured via App.config files in Subversion)
How are configuration secrets managed?
(e.g. Secrets are managed with Hashicorp Vault with 3 shards for the master key)
Which parts of the system need to be backed up?
(e.g. Only the CoreTransactions database in PostgreSQL and the Puppet master database need to be backed up)
How does backup happen? Is service affected? Should the system be [partially] shut down first?
(e.g. Backup happens from the read replica - live service is not affected)
How does restore happen? Is service affected? Should the system be [partially] shut down first?
(e.g. The Booking service must be switched off before Restore happens otherwise transactions will be lost)
What significant metrics will be generated?
(e.g. Usual VM stats (CPU, disk, threads, etc.) + around 200 application technical metrics + around 400 user-level metrics)
How is the health of dependencies (components and systems) assessed? How does the system report its own health?
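A rough sketch of how a service might roll the health of its dependencies up into its own health report is shown below. The function name, the status strings and the shape of the result are hypothetical and only meant to make the question concrete; the dependency checks themselves are supplied by the caller.

```python
from typing import Callable, Dict


def aggregate_health(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run each dependency check and summarise the results (hypothetical helper)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failing"
        except Exception as exc:
            # A check that raises is reported as an error rather than crashing the report.
            results[name] = f"error: {exc}"
    # The service reports itself as degraded if any dependency is not ok.
    overall = "ok" if all(state == "ok" for state in results.values()) else "degraded"
    return {"status": overall, "dependencies": results}
```

For example, a caller could pass one check per dependency, such as `aggregate_health({"rabbitmq": check_queue, "postgresql": check_db})`, where `check_queue` and `check_db` are placeholder functions returning True when the dependency responds.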
How is the software deployed? How does roll-back happen?
(e.g. We use GoCD to coordinate deployments, triggering a Chef run pulling RPMs from the internal yum repo)
How should troubleshooting happen? What tools are available?
(e.g. Use a combination of the /health endpoint checks and the abc-*.sh scripts for diagnosing typical problems)