2026-03-12
# 🚨 Runbook: Incident Response
> **Mode**: NEXUS-Micro | **Duration**: Minutes to hours | **Agents**: 3-8
---
## Scenario
Something is broken in production. Users are affected. Speed of response matters, but so does doing it right. This runbook covers detection through post-mortem.
## Severity Classification
| Level | Definition | Examples | Response Time |
|-------|-----------|----------|--------------|
| **P0 — Critical** | Service completely down, data loss, security breach | Database corruption, DDoS attack, auth system failure | Immediate (all hands) |
| **P1 — High** | Major feature broken, significant performance degradation | Payment processing down, 50%+ error rate, 10x latency | < 1 hour |
| **P2 — Medium** | Minor feature broken, workaround available | Search not working, non-critical API errors | < 4 hours |
| **P3 — Low** | Cosmetic issue, minor inconvenience | Styling bug, typo, minor UI glitch | Next sprint |
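
The classification above can be sketched as a small function. This is a minimal illustration, not part of the runbook itself: the `Signals` fields and the `classify` helper are hypothetical names, and the 50% error-rate and 10x latency cut-offs are taken from the table's P1 examples and assumed to work as thresholds.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    P0 = "Critical"
    P1 = "High"
    P2 = "Medium"
    P3 = "Low"


# Response-time targets copied from the severity table.
RESPONSE_TIME = {
    Severity.P0: "Immediate (all hands)",
    Severity.P1: "< 1 hour",
    Severity.P2: "< 4 hours",
    Severity.P3: "Next sprint",
}


@dataclass
class Signals:
    """Hypothetical triage inputs; the field names are illustrative."""
    service_down: bool = False
    data_loss: bool = False
    security_breach: bool = False
    major_feature_broken: bool = False
    minor_feature_broken: bool = False
    error_rate: float = 0.0          # fraction of failing requests, 0.0 to 1.0
    latency_multiplier: float = 1.0  # current latency divided by normal latency


def classify(s: Signals) -> Severity:
    """Return the highest severity whose definition matches the signals."""
    if s.service_down or s.data_loss or s.security_breach:
        return Severity.P0
    # Assumed cut-offs, borrowed from the table's P1 examples.
    if s.major_feature_broken or s.error_rate >= 0.5 or s.latency_multiplier >= 10:
        return Severity.P1
    if s.minor_feature_broken:
        return Severity.P2
    return Severity.P3
```

For example, `classify(Signals(error_rate=0.6))` yields `Severity.P1`, and `RESPONSE_TIME` then gives the matching response target.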
## Response Teams by Severity
### P0 — Critical Response Team
| Agent | Role | Action |
|-------|------|--------|
| **Infrastructure Maintainer** | Incident commander | Assess scope, coordinate response |
| **DevOps Automator** | Deployment/rollback | Execute rollback if needed |
| **Backend Architect** | Root cause investigation | Diagnose system issues |
| **Frontend Developer** | UI-side investigation | Diagnose client-side issues |
| **Support Responder** | User communication | Status page updates, user notifications |
| **Executive Summary Generator** | Stakeholder communication | Real-time executive updates |
### P1 — High Response Team
| Agent | Role |
|-------|------|
| **Infrastructure Maintainer** | Incident commander |
| **DevOps Automator** | Deployment support |
| **Relevant Developer Agent** | Fix implementation |
| **Support Responder** | User communication |
### P2 — Medium Response
| Agent | Role |
|-------|------|
| **Relevant Developer Agent** | Fix implementation |
| **Evidence Collector** | Verify fix |
### P3 — Low Response
| Agent | Role |
|-------|------|
| **Sprint Prioritizer** | Add to backlog |
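
One way to make the rosters above machine-readable is a plain lookup table. The sketch below copies the agent and role names from the tables; `RESPONSE_TEAMS`, `activate_team`, and `incident_commander` are hypothetical names chosen for illustration.

```python
from typing import Dict, List, Optional, Tuple

# (agent, role) pairs per severity, copied from the team tables.
RESPONSE_TEAMS: Dict[str, List[Tuple[str, str]]] = {
    "P0": [
        ("Infrastructure Maintainer", "Incident commander"),
        ("DevOps Automator", "Deployment/rollback"),
        ("Backend Architect", "Root cause investigation"),
        ("Frontend Developer", "UI-side investigation"),
        ("Support Responder", "User communication"),
        ("Executive Summary Generator", "Stakeholder communication"),
    ],
    "P1": [
        ("Infrastructure Maintainer", "Incident commander"),
        ("DevOps Automator", "Deployment support"),
        ("Relevant Developer Agent", "Fix implementation"),
        ("Support Responder", "User communication"),
    ],
    "P2": [
        ("Relevant Developer Agent", "Fix implementation"),
        ("Evidence Collector", "Verify fix"),
    ],
    "P3": [
        ("Sprint Prioritizer", "Add to backlog"),
    ],
}


def activate_team(severity: str) -> List[str]:
    """Return the agents to activate for a severity level."""
    return [agent for agent, _role in RESPONSE_TEAMS[severity]]


def incident_commander(severity: str) -> Optional[str]:
    """Return the incident commander, or None if the tier has none listed."""
    for agent, role in RESPONSE_TEAMS[severity]:
        if role == "Incident commander":
            return agent
    return None
```

Note that only the P0 and P1 tiers list an incident commander, so `incident_commander("P2")` returns `None`.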
## Incident Response Sequence
### Step 1: Detection & Triage (0-5 minutes)
```
TRIGGER: Alert from monitoring / User report / Agent detection

Infrastructure Maintainer:
  1. Acknowledge alert
  2. Assess scope and impact
     - How many users affected?
     - Which services are impacted?
     - Is data at risk?
  3. Classify severity (P0/P1/P2/P3)
  4. Activate appropriate response team
  5. Create incident channel/thread

Output: Incident classification + response team activated
```
### Step 2: Investigation (5-30 minutes)
```
PARALLEL INVESTIGATION:

Infrastructure Maintainer:
  ├── Check system metrics (CPU, memory, network, disk)
  ├── Review error logs
  ├── Check recent deployments
  └── Verify external dependencies

Backend Architect (if P0/P1):
  ├── Check database health
  ├── Review API error rates
  ├── Check service communication
  └── Identify failing component

DevOps Automator:
  ├── Review recent deployment history
  ├── Check CI/CD pipeline status
  ├── Prepare rollback if needed
  └── Verify infrastructure state

Output: Root cause identified (or narrowed to component)
```
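
The parallel structure of this step could be modelled with `asyncio`, as in the sketch below. It is only an illustration of running the three checklists concurrently: `investigate` is a hypothetical stand-in that does no real diagnostic work, and `CHECKS` and `run_investigation` are assumed names.

```python
import asyncio
from typing import Dict, List, Tuple

# Checklists per agent, copied from the Step 2 block.
CHECKS: Dict[str, List[str]] = {
    "Infrastructure Maintainer": [
        "Check system metrics (CPU, memory, network, disk)",
        "Review error logs",
        "Check recent deployments",
        "Verify external dependencies",
    ],
    "Backend Architect": [
        "Check database health",
        "Review API error rates",
        "Check service communication",
        "Identify failing component",
    ],
    "DevOps Automator": [
        "Review recent deployment history",
        "Check CI/CD pipeline status",
        "Prepare rollback if needed",
        "Verify infrastructure state",
    ],
}


async def investigate(agent: str, checks: List[str]) -> Tuple[str, List[str]]:
    """Hypothetical stand-in for an agent working through its checklist."""
    completed = []
    for check in checks:
        await asyncio.sleep(0)  # placeholder for real diagnostic work
        completed.append(check)
    return agent, completed


async def run_investigation(severity: str) -> Dict[str, List[str]]:
    """Run every applicable agent's checklist concurrently."""
    agents = {
        agent: checks
        for agent, checks in CHECKS.items()
        # The Backend Architect only joins for P0/P1 incidents.
        if agent != "Backend Architect" or severity in ("P0", "P1")
    }
    results = await asyncio.gather(
        *(investigate(agent, checks) for agent, checks in agents.items())
    )
    return dict(results)
```

Calling `asyncio.run(run_investigation("P1"))` returns a mapping from each participating agent to the checks it completed.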
### Step 3: Mitigation (15-60 minutes)
```
DECISION TREE:

IF caused by recent deployment:
  → DevOps Automator: Execute rollback
  → Infrastructure Maintainer: Verify recovery
  → Evidence Collector: Confirm fix

IF caused by infrastructure issue:
  → Infrastructure Maintainer: Scale/restart/failover
  → DevOps Automator: Support infrastructure changes
  → Verify recovery

IF caused by code bug:
  → Relevant Developer Agent: Implement hotfix
  → Evidence Collector: Verify fix
  → DevOps Automator: Deploy hotfix
  → Infrastructure Maintainer: Monitor recovery

IF caused by external dependency:
  → Infrastructure Maintainer: Activate fallback/cache
  → Support Responder: Communicate to users
  → Monitor for external recovery

THROUGHOUT:
  → Support Responder: Update status page every 15 minutes
  → Executive Summary Generator: Brief stakeholders (P0 only)
```
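
The decision tree above maps naturally onto a dictionary keyed by cause. The following is a sketch under that assumption: `Cause`, `PLAN`, and `mitigation_plan` are hypothetical names, and steps for which the runbook names no owner carry `None` rather than a guessed agent.

```python
from enum import Enum
from typing import Dict, List, Optional, Tuple

# (owner, action); owner is None where the runbook names no owner.
Step = Tuple[Optional[str], str]


class Cause(Enum):
    RECENT_DEPLOYMENT = "recent deployment"
    INFRASTRUCTURE = "infrastructure issue"
    CODE_BUG = "code bug"
    EXTERNAL_DEPENDENCY = "external dependency"


# Branches copied from the Step 3 decision tree, in order.
PLAN: Dict[Cause, List[Step]] = {
    Cause.RECENT_DEPLOYMENT: [
        ("DevOps Automator", "Execute rollback"),
        ("Infrastructure Maintainer", "Verify recovery"),
        ("Evidence Collector", "Confirm fix"),
    ],
    Cause.INFRASTRUCTURE: [
        ("Infrastructure Maintainer", "Scale/restart/failover"),
        ("DevOps Automator", "Support infrastructure changes"),
        (None, "Verify recovery"),
    ],
    Cause.CODE_BUG: [
        ("Relevant Developer Agent", "Implement hotfix"),
        ("Evidence Collector", "Verify fix"),
        ("DevOps Automator", "Deploy hotfix"),
        ("Infrastructure Maintainer", "Monitor recovery"),
    ],
    Cause.EXTERNAL_DEPENDENCY: [
        ("Infrastructure Maintainer", "Activate fallback/cache"),
        ("Support Responder", "Communicate to users"),
        (None, "Monitor for external recovery"),
    ],
}


def mitigation_plan(cause: Cause, severity: str) -> List[Step]:
    """Return the branch for a cause plus the THROUGHOUT duties."""
    steps = list(PLAN[cause])
    steps.append(("Support Responder", "Update status page every 15 minutes"))
    if severity == "P0":  # executive briefings are P0 only
        steps.append(("Executive Summary Generator", "Brief stakeholders"))
    return steps
```

Keeping the branches as data rather than nested `if` statements makes it easier to review the plan against the runbook text when either one changes.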
### Step 4: Resolution Verification (Post-fix)
```
Evidence Collector:
  1. Verify the fix resolves the issue
  2. Screenshot evidence of working state
  3. Confirm no new issues introduced

Infrastructure Maintainer:
  1. Verify all metrics returning to normal
  2. Confirm no cascading failures
  3. Monitor for 30 minutes post-fix

API Tester (if API-related):
  1. Run regression on affected endpoints
  2. Verify response times normalized
  3. Confirm error rates at baseline

Output: Incident resolved confirmation
```
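
As a rough illustration of the 30-minute monitoring check, the sketch below decides whether a metric has stayed near its baseline for the whole window. The function name `recovered`, the sample format, and the 10% tolerance are all assumptions; the runbook specifies only the 30-minute duration and a return to baseline.

```python
from typing import List, Tuple


def recovered(
    samples: List[Tuple[float, float]],
    baseline: float,
    tolerance: float = 0.10,      # assumed: allow 10% above baseline
    window_minutes: float = 30,   # monitoring window from Step 4
) -> bool:
    """Return True if the metric stayed near baseline for the full window.

    `samples` holds (minutes_since_fix, value) pairs for a metric where
    lower is better, such as error rate or response time.
    """
    if not samples:
        return False
    # The observations must reach the end of the monitoring window.
    covers_window = max(minute for minute, _ in samples) >= window_minutes
    limit = baseline * (1 + tolerance)
    within_limit = all(value <= limit for _, value in samples)
    return covers_window and within_limit
```

With a baseline of `1.0`, samples such as `[(0, 1.05), (15, 1.0), (30, 0.9)]` count as recovered, while a series that stops at minute 20 does not, because the window is incomplete.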
### Step 5: Post-Mortem (Within 48 hours)
```
Workflow Optimizer leads post-mortem:

  1. Timeline reconstruction
     - When was the issue introduced?
     - When was it detected?
     - When was it resolved?
     - Total user impact duration

  2. Root cause analysis
     - What failed?
     - Why did it fail?
     - Why wasn't it caught earlier?
     - 5 Whys analysis

  3. Impact assessment
     - Users affected
     - Revenue impact
     - Reputation impact
     - Data impact

  4. Prevention measures
     - What monitoring would have caught this sooner?
     - What testing would have prevented this?
     - What process changes are needed?
     - What infrastructure changes are needed?

  5. Action items
     - [Action] → [Owner] → [Deadline]
     - [Action] → [Owner] → [Deadline]
     - [Action] → [Owner] → [Deadline]

Output: Post-Mortem Report → Sprint Prioritizer adds prevention tasks to backlog
```
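
The timeline questions in item 1 reduce to three timestamps and their differences. A minimal sketch follows; the `Timeline` class and its property names are hypothetical, and it assumes users were affected from the moment the issue was introduced until it was resolved.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Timeline:
    """The three timestamps asked for in the timeline reconstruction."""
    introduced: datetime
    detected: datetime
    resolved: datetime

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected - self.introduced

    @property
    def time_to_resolve(self) -> timedelta:
        return self.resolved - self.detected

    @property
    def impact_duration(self) -> timedelta:
        # Assumes impact began at introduction, not at detection.
        return self.resolved - self.introduced


timeline = Timeline(
    introduced=datetime(2026, 1, 1, 10, 0),
    detected=datetime(2026, 1, 1, 10, 20),
    resolved=datetime(2026, 1, 1, 11, 5),
)
print(timeline.time_to_detect)   # 0:20:00
print(timeline.impact_duration)  # 1:05:00
```

The timestamps in the usage example are invented for illustration only.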
## Communication Templates
### Status Page Update (Support Responder)
```
[TIMESTAMP] — [SERVICE NAME] Incident
Status: [Investigating / Identified / Monitoring / Resolved]
Impact: [Description of user impact]
Current action: [What we're doing about it]
Next update: [When to expect the next update]
```
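
A small helper can fill in the same fields and compute the next update time from the 15-minute cadence in Step 3. This is a sketch only: `render_status_update` is a hypothetical name, the timestamp format is an assumption, and time zone handling is left to the caller.

```python
from datetime import datetime, timedelta

STATUSES = ("Investigating", "Identified", "Monitoring", "Resolved")
UPDATE_INTERVAL = timedelta(minutes=15)  # status page cadence from Step 3
TIME_FORMAT = "%Y-%m-%d %H:%M"           # assumed format


def render_status_update(
    now: datetime, service: str, status: str, impact: str, action: str
) -> str:
    """Fill in the status page fields and schedule the next update."""
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    next_update = now + UPDATE_INTERVAL
    return "\n".join(
        [
            f"{now.strftime(TIME_FORMAT)} - {service} Incident",
            f"Status: {status}",
            f"Impact: {impact}",
            f"Current action: {action}",
            f"Next update: {next_update.strftime(TIME_FORMAT)}",
        ]
    )
```

Rejecting unknown status values keeps published updates within the four states the template lists.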
### Executive Update (Executive Summary Generator — P0 only)
```
INCIDENT BRIEF — [TIMESTAMP]
SITUATION: [Service] is [down/degraded] affecting [N users/% of traffic]
CAUSE: [Known/Under investigation] — [Brief description if known]
ACTION: [What's being done] — ETA [time estimate]
IMPACT: [Business impact — revenue, users, reputation]
NEXT UPDATE: [Timestamp]
```
## Escalation Matrix
| Condition | Escalate To | Action |
|-----------|------------|--------|
| P0 not resolved in 30 min | Studio Producer | Additional resources, vendor escalation |
| P1 not resolved in 2 hours | Project Shepherd | Resource reallocation |
| Data breach suspected | Legal Compliance Checker | Regulatory notification assessment |
| User data affected | Legal Compliance Checker + Executive Summary Generator | GDPR/CCPA notification |
| Revenue impact > $X | Finance Tracker + Studio Producer | Business impact assessment |
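
The matrix above can be evaluated mechanically against the state of an open incident. The sketch below is one possible encoding: `IncidentState` and `escalations` are hypothetical names, and because the runbook leaves the revenue threshold as `$X`, the caller must supply it (the rule is skipped when it is `None`).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class IncidentState:
    """Hypothetical snapshot of an incident; field names are illustrative."""
    severity: str
    minutes_open: float
    resolved: bool = False
    data_breach_suspected: bool = False
    user_data_affected: bool = False
    revenue_impact: float = 0.0


def escalations(
    state: IncidentState, revenue_threshold: Optional[float] = None
) -> List[Tuple[str, str]]:
    """Return (escalate_to, action) pairs for every matching condition."""
    out: List[Tuple[str, str]] = []
    if not state.resolved:
        if state.severity == "P0" and state.minutes_open > 30:
            out.append(("Studio Producer", "Additional resources, vendor escalation"))
        if state.severity == "P1" and state.minutes_open > 120:
            out.append(("Project Shepherd", "Resource reallocation"))
    if state.data_breach_suspected:
        out.append(("Legal Compliance Checker", "Regulatory notification assessment"))
    if state.user_data_affected:
        out.append(
            (
                "Legal Compliance Checker + Executive Summary Generator",
                "GDPR/CCPA notification",
            )
        )
    # The runbook does not define $X, so the threshold is caller-supplied.
    if revenue_threshold is not None and state.revenue_impact > revenue_threshold:
        out.append(("Finance Tracker + Studio Producer", "Business impact assessment"))
    return out
```

Several conditions can match at once, so the function returns a list rather than a single escalation target.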