Service Readiness Definitions
Last updated November 4, 2024
I have used service readiness definitions to communicate the state of a service to the business and to make clear the risks of not investing in further hardening. Below is a sample set of definitions I used with a past team.
In development
- Not for production use
- The API may change without notice
- No guarantees about uptime, responsiveness, or load capacity
  - The service may go down or become unresponsive without warning
Beta
- Not recommended for production use
- Discuss use cases and exceptions with the owning team
- Known and unknown bugs are expected
- API changes are versioned
- Functionality, for the happy path, generally works as expected
- Bug fixes will not be prioritized as production issues
Production Ready
- Service is ready for consumption by non-critical production systems
  - Critical/essential systems should still use discretion and evaluate the maturity of the service before consuming it
- Engage with the owning team for any P1 service usage
- API changes are versioned
- The service has some guarantee about uptime
  - Outages are considered an incident for the owning team and addressed with appropriate priority
- The service has some guarantee about the amount of load it can handle before degrading
- The service has some guarantee about response time
- The service has dashboards showing key application metrics
- The service has alerts
Battle Tested
- Service is ready for consumption by any production system
- The service has operated in production, handling notable load, for at least one calendar year
- The service has well-defined SLOs based on historical record, including the following stats:
  - Uptime characteristics (e.g. 99.9% uptime, or up to 8.77 hours of downtime per year)
  - Load capacity and documentation indicating behavior when close to or over the limit (e.g. can handle up to 20,000 rpm before degrading, and up to 25,000 rpm with degraded response time before falling over)
  - Response time (e.g. 50th, 90th, and 99th percentile response time)
- The service’s constrained resource is understood and documented, and instances are right-sized
- The service has robust alerts that have been proven to detect issues with appropriate timeliness
- The service has experienced multiple outages (either organic or artificial) that have exercised the service’s alerts, runbooks, and incident response procedures
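
When discussing uptime targets with the business, it can help to translate a percentage into the downtime it actually permits. The sketch below shows that arithmetic for the 99.9% example above. The function name `downtime_budget_hours` and the choice of a 365.25-day year are my own assumptions for illustration, not part of any standard definition; a team may prefer to measure over a month or a quarter instead.

```python
# Assumption: an average year of 365.25 days, i.e. 8,766 hours.
HOURS_PER_YEAR = 365.25 * 24


def downtime_budget_hours(uptime_percent: float,
                          period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the maximum downtime, in hours, that an uptime target allows.

    Hypothetical helper for illustration only.
    """
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_hours * (100 - uptime_percent) / 100


if __name__ == "__main__":
    # Print the yearly downtime budget for a few common targets.
    for target in (99.0, 99.9, 99.99):
        budget = downtime_budget_hours(target)
        print(f"{target}% uptime allows {budget:.2f} hours of downtime per year")
```

For a 99.9% target this gives roughly 8.77 hours per year, matching the figure in the SLO example. Passing a different `period_hours` (for example `30 * 24` for a 30-day month) gives the budget over a shorter window.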