
📝 Context Summary

This runbook provides the operational procedure for responding to SIE incidents, implementing the Steady Presence protocol (A-03). It covers incident detection, severity classification, immediate containment, root cause investigation, corrective action, and closure criteria. The runbook enforces the principle that every incident must result in a committed system improvement — an incident cannot close without a linked protocol update, content fix, or automated test change.

Playbook: Incident Response Runbook

In the SIE, every failure is treated as a formal incident. There is no category of failure that is too small to investigate. This is not perfectionism — it is the Steady Presence protocol (A-03), which operates on the principle of Hormesis: the system gains strength from every stressor it encounters. A failure that is investigated and resolved makes the system immune to that failure mode. A failure that is dismissed will recur.

This runbook provides the operational procedure for responding to incidents from detection through resolution.

Incident Triggers

An incident is opened whenever any of the following occur:

  • Agent failure: An agent produces output that the Fleet Commander rejects as factually incorrect, off-brand, out-of-scope, or hallucinated
  • Pipeline failure: The Knowledge Pipeline fails to sync — embedding errors, WordPress API failures, validation rejections, or desynchronization between the vector database and source content
  • Content defect: A knowledge base article is discovered to contain inaccurate information that has been (or could have been) retrieved by agents
  • Human correction: The Fleet Commander or a Content Steward makes a substantive correction to an agent’s output (not stylistic editing — factual, structural, or strategic corrections)
  • Tool failure: An agent’s tool call fails due to misconfiguration, API changes, or permission errors
  • Security event: Unauthorized access, credential exposure, or any action that violates the access control model

Phase 1: Detection and Logging

Step 1.1 — Create the incident record. Open a new incident entry in the incident log. Record:
– Timestamp (ISO 8601)
– Incident source (which agent, pipeline stage, or human actor detected the issue)
– Brief description of what happened
– Affected content or systems
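Where the incident log is maintained in code, the record might look like the following minimal sketch (Python is assumed here; the field names are illustrative, not a mandated schema):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class IncidentRecord:
    """Illustrative incident log entry; field names are assumptions, not a fixed schema."""
    source: str                  # agent, pipeline stage, or human actor that detected the issue
    description: str             # brief description of what happened
    affected: list[str]          # affected content or systems
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()  # ISO 8601, as required
    )
    severity: str | None = None  # set in Step 1.2
    status: str = "Open"
```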

Step 1.2 — Classify severity.

| Severity | Criteria | Response Time |
|---|---|---|
| SEV-1: Critical | Incorrect content published to production. Agent acting outside defined boundaries. Security event. | Immediate — stop affected agent/pipeline, begin investigation within 1 hour |
| SEV-2: High | Agent output rejected by Fleet Commander. Pipeline sync failure affecting multiple documents. Content defect in high-retrieval document. | Begin investigation within 24 hours |
| SEV-3: Medium | Human correction to agent output. Pipeline warning (partial sync). Content defect in low-retrieval document. | Begin investigation within 72 hours |
| SEV-4: Low | Minor tool failure with automatic retry. Stylistic correction to agent output. | Log and batch — address in next maintenance cycle |
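The classification rules above can be encoded so they are applied consistently rather than re-judged for each incident. A minimal sketch, assuming incidents are tagged with characteristic flags at detection time (the tag names are illustrative):

```python
# Map incident characteristics to severity; the tag vocabulary is an assumption.
SEV_RULES = [
    ("SEV-1", {"published_defect", "boundary_violation", "security_event"}),
    ("SEV-2", {"output_rejected", "multi_doc_sync_failure", "high_retrieval_defect"}),
    ("SEV-3", {"human_correction", "partial_sync_warning", "low_retrieval_defect"}),
    ("SEV-4", {"tool_retry_failure", "stylistic_correction"}),
]

def classify_severity(tags: set[str]) -> str:
    """Return the highest applicable severity; unmatched tags default to SEV-3 for review."""
    for sev, triggers in SEV_RULES:
        if tags & triggers:
            return sev
    return "SEV-3"
```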

Step 1.3 — Notify. For SEV-1 and SEV-2 incidents, notify the Fleet Commander immediately via the configured notification channel (Discord webhook, email, or dashboard alert).
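For the Discord path, notification is a single webhook call. A sketch using the standard Discord webhook payload; the webhook URL itself lives in configuration:

```python
import requests

def notify_commander(incident_id: str, severity: str, description: str, webhook_url: str) -> None:
    """Post a SEV-1/SEV-2 alert to the configured Discord webhook."""
    if severity not in ("SEV-1", "SEV-2"):
        return  # lower severities are logged, not paged
    payload = {"content": f"🚨 {severity} incident {incident_id}: {description}"}
    resp = requests.post(webhook_url, json=payload, timeout=10)
    resp.raise_for_status()
```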

Phase 2: Containment

For SEV-1 and SEV-2 incidents, take immediate containment action before beginning investigation.

Step 2.1 — Stop the bleeding. Depending on the incident type:
– Agent failure: Pause the affected agent. It does not execute further tasks until the investigation is complete.
– Published defect: Revert the WordPress post to draft status or correct the specific error. If the defect was in a knowledge base article, set its status to Under Review to deprioritize it in agent retrieval.
– Pipeline failure: Halt the sync pipeline. Do not allow partial or corrupted data to propagate to the vector database.
– Security event: Revoke affected credentials immediately. Audit recent actions taken with those credentials.
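As one concrete containment example, reverting a published post to draft is a single WordPress REST API call. A sketch assuming application-password basic auth:

```python
import requests

def revert_to_draft(site_url: str, post_id: int, auth: tuple[str, str]) -> None:
    """Containment for a published defect: pull the post back to draft status."""
    resp = requests.post(
        f"{site_url}/wp-json/wp/v2/posts/{post_id}",
        json={"status": "draft"},
        auth=auth,  # (username, application_password)
        timeout=10,
    )
    resp.raise_for_status()
```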

Step 2.2 — Preserve evidence. Before making any changes to the system, capture the current state:
– The agent’s full output including Verification Ledger
– The relevant audit log entries
– The content of any Knowledge Core documents cited in the output
– The pipeline sync logs if applicable

This evidence is required for root cause analysis and prevents “fixing” the symptom while losing the ability to understand the cause.
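A sketch of the evidence-preservation step, writing each artifact to an incident-scoped directory before anything is remediated (the directory layout and artifact names are illustrative):

```python
import json
from pathlib import Path

def preserve_evidence(incident_id: str, artifacts: dict[str, object]) -> Path:
    """Write each evidence artifact (agent output, audit log entries, cited
    documents, sync logs) to a per-incident directory before the system changes."""
    evidence_dir = Path("incidents") / incident_id / "evidence"
    evidence_dir.mkdir(parents=True, exist_ok=True)
    for name, content in artifacts.items():
        (evidence_dir / f"{name}.json").write_text(json.dumps(content, indent=2))
    return evidence_dir
```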

Phase 3: Root Cause Investigation

Step 3.1 — Apply the blameless framework. The root cause is almost never “the LLM hallucinated.” LLMs hallucinate when they lack sufficient context, receive ambiguous instructions, or are given conflicting information. The investigation must identify the systemic cause, not stop at the proximate cause.

Common root cause categories:

| Category | Description | Example |
|---|---|---|
| Context gap | The Knowledge Core lacked the information the agent needed | Agent cited outdated pricing because the TOOLS article hadn’t been updated |
| Constraint failure | The agent’s boundaries or instructions were ambiguous | Agent interpreted “draft social content” as including ad copy, which was out of scope |
| Freshness decay | Source content was accurate when written but is now outdated | Market analysis from 8 months ago cited as current |
| Tool misconfiguration | The agent’s tool was not configured with the correct constraints | WordPress API wrapper allowed status values other than draft |
| Retrieval failure | The vector search returned irrelevant or insufficient context | Embedding quality was poor due to a Dual-Readability violation in the source document |
| Prompt ambiguity | The agent’s system prompt or Commander’s Intent was unclear | Two instructions in the prompt conflicted, and the agent chose the wrong one |

Step 3.2 — Trace the failure chain. Starting from the visible symptom, trace backward through the system to find where the chain broke:
1. What did the agent output? (Review the Verification Ledger)
2. What sources did it cite? (Check the sources_used field)
3. Were those sources accurate? (Read the cited Knowledge Core documents)
4. Was the retrieval correct? (Did the vector search return the right chunks?)
5. Was the agent’s interpretation correct? (Given the retrieved context, was the agent’s reasoning sound?)
6. Were the constraints clear? (Review the agent’s role definition and tool configuration)

The point where the chain breaks is the root cause.
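The chain can be walked as an ordered checklist so the first broken link is recorded explicitly. A minimal sketch; the step names mirror the six questions above, and the verdicts are supplied by the investigator:

```python
FAILURE_CHAIN = [
    "output",           # 1. What did the agent output?
    "sources_cited",    # 2. What sources did it cite?
    "source_accuracy",  # 3. Were those sources accurate?
    "retrieval",        # 4. Was the retrieval correct?
    "interpretation",   # 5. Was the agent's interpretation correct?
    "constraints",      # 6. Were the constraints clear?
]

def first_broken_link(verdicts: dict[str, bool]) -> str | None:
    """Return the first chain step marked as failed, i.e. the root cause location."""
    for step in FAILURE_CHAIN:
        if not verdicts.get(step, True):
            return step
    return None
```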

Step 3.3 — Document findings. Record the root cause category, the specific failure point, and the evidence supporting the diagnosis.

Phase 4: Corrective Action

Step 4.1 — Generate system immunity. Based on the root cause, implement a corrective action that prevents this specific failure mode from recurring:

| Root Cause | Corrective Action |
|---|---|
| Context gap | Create or update the missing Knowledge Core content. Flag for freshness audit. |
| Constraint failure | Update the agent’s role definition with explicit boundary clarification. Add the scenario as a negative example. |
| Freshness decay | Update the stale content. Adjust the freshness audit cadence for that domain if the decay was faster than expected. |
| Tool misconfiguration | Fix the tool configuration. Add a validation check that would have caught the misconfiguration. |
| Retrieval failure | Fix the source document’s Dual-Readability compliance. Re-embed. Consider adding synthetic questions to improve retrieval targeting. |
| Prompt ambiguity | Rewrite the conflicting instructions. Add a priority hierarchy if multiple directives could conflict. |

Step 4.2 — Commit the fix. All corrective actions must be committed to the relevant system: Git for Knowledge Core changes, agent configuration for role/prompt changes, tool code for tool fixes. The commit message must reference the incident ID.
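One way to enforce the incident-ID requirement mechanically is a commit-message hook. A sketch assuming incident IDs follow a pattern like INC-42 (the pattern is an assumption, not an established SIE convention):

```python
import re
import sys

# Assumed convention: incident IDs look like INC-<number>; adjust to the real scheme.
INCIDENT_REF = re.compile(r"\bINC-\d+\b")

def check_commit_message(message: str) -> bool:
    """Reject corrective-action commits that do not reference an incident ID."""
    return bool(INCIDENT_REF.search(message))

if __name__ == "__main__":
    # Usable as a commit-msg hook: git passes the message file path as argv[1].
    with open(sys.argv[1]) as f:
        if not check_commit_message(f.read()):
            sys.exit("Commit message must reference an incident ID (e.g. INC-42)")
```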

Step 4.3 — Verify the fix. Re-run the original failing scenario (or a close approximation) with the corrective action in place. Confirm that the agent now produces the correct output.
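Where the failing scenario can be scripted, verification becomes a permanent regression test. A pytest-style sketch; run_agent_task and the asserted values are hypothetical stand-ins for the actual test harness and the incident's specifics:

```python
# Regression test pinned to the incident that motivated it (e.g. test_incident_inc_42.py).
# run_agent_task is an assumed test-harness helper, not a documented SIE API.
from harness import run_agent_task  # hypothetical import

def test_inc_42_pricing_citation_is_current():
    """Re-run the originally failing scenario; the fix must hold on every run."""
    result = run_agent_task(
        agent="content_drafter",
        task="Summarize current tool pricing for the TOOLS article",
    )
    assert "$99/mo" not in result.output  # the stale price the agent originally cited
    assert any(s.endswith("tools-pricing.md") for s in result.sources_used)
```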

Phase 5: Closure

Step 5.1 — Confirm closure criteria. An incident cannot be closed until all of the following are true:
– Root cause is identified and documented
– Corrective action is committed to the relevant system
– The fix has been verified by re-running the failing scenario
– The incident record is complete with all findings, actions, and evidence
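These criteria can be checked mechanically before the status flips to Resolved. A sketch, with field names matching the illustrative record from Step 1.1:

```python
def can_close(incident: dict) -> bool:
    """All four closure criteria must hold before status may be set to Resolved."""
    return all([
        bool(incident.get("root_cause")),        # root cause identified and documented
        bool(incident.get("commit_ref")),        # corrective action committed
        incident.get("fix_verified") is True,    # failing scenario re-run successfully
        bool(incident.get("evidence_complete")), # findings, actions, evidence recorded
    ])
```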

Step 5.2 — Update the incident record. Record the resolution: root cause category, corrective action taken, commit reference, and verification result. Set the incident status to Resolved.

Step 5.3 — Review for patterns. After closing the incident, check for patterns:
– Is this the same root cause category as recent incidents? If so, there may be a systemic issue that individual fixes won’t resolve.
– Is this the same agent repeatedly failing? If so, the agent may need a more fundamental reconfiguration or additional graduated testing.
– Is this the same Knowledge Core domain repeatedly cited? If so, that domain may need a dedicated freshness audit or steward assignment.

Patterns identified here feed into the next quarterly freshness audit and fleet review.
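Pattern review lends itself to automation over the incident log. A sketch that flags recurring root cause categories, agents, and Knowledge Core domains (the threshold is illustrative):

```python
from collections import Counter

def find_patterns(incidents: list[dict], threshold: int = 3) -> list[str]:
    """Flag root cause categories, agents, or Knowledge Core domains that recur
    across recent incidents; repeats at or above the threshold suggest a systemic issue."""
    flags = []
    for key, label in [("root_cause", "root cause"), ("source", "agent"), ("domain", "domain")]:
        counts = Counter(i.get(key) for i in incidents if i.get(key))
        for value, n in counts.items():
            if n >= threshold:
                flags.append(f"Recurring {label}: {value} ({n} incidents)")
    return flags
```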

Key Concepts: Incident Response · Severity Classification · Root Cause Analysis · Blameless Post-Mortem · System Immunity · Closure Criteria

About the Author: Adam Bernard

Adam Bernard is a digital marketing strategist and SEO specialist building AI-powered business intelligence systems. He's the creator of the Strategic Intelligence Engine (SIE), a multi-agent framework that transforms business knowledge into autonomous, AI-driven competitive advantages.

Ready to Build Your Own Intelligence Engine?

If you’re ready to move from theory to implementation and build a Knowledge Core for your own business, I can help you design the engine to power it. Let’s discuss how these principles can be applied to your unique challenges and goals.