Data Quality Standards
The quality of every AI agent output is bounded by the quality of the knowledge it retrieves. A perfectly designed retrieval system querying poorly structured content will produce mediocre results. The SIE’s data quality standards exist to prevent this failure mode by establishing clear, enforceable requirements for all content entering the Knowledge Core.
These standards are not guidelines — they are gates. Content that does not meet the requirements does not enter the Knowledge Pipeline, does not get embedded in the vector database, and does not get published to WordPress.
The Three Quality Gates
Quality is enforced at three distinct points in the content lifecycle. Each gate catches a different category of defect.
Gate 1: Authoring Standards
Every markdown file in the Knowledge Core must meet structural requirements at the time of creation.
Frontmatter completeness. All required YAML frontmatter fields must be present and populated. The required fields are: title, id, version, steward, updated, status, doc_type, summary, tags, relations, semantic_summary, synthetic_questions, key_concepts, primary_keyword, seo_title, meta_description, and excerpt. Files with empty or missing required fields are flagged during validation and excluded from the sync pipeline until corrected.
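The completeness check above can be sketched as a small validator. This is a minimal sketch that assumes the YAML frontmatter has already been parsed into a dict (e.g., by an upstream parser); the treatment of empty strings, lists, and dicts as "missing" is an illustrative assumption.

```python
# Required fields listed in the authoring standard.
REQUIRED_FIELDS = [
    "title", "id", "version", "steward", "updated", "status", "doc_type",
    "summary", "tags", "relations", "semantic_summary", "synthetic_questions",
    "key_concepts", "primary_keyword", "seo_title", "meta_description", "excerpt",
]

def missing_fields(frontmatter: dict) -> list:
    """Return required fields that are absent or empty (None, "", [], {})."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = frontmatter.get(field)
        if value is None or value == "" or value == [] or value == {}:
            problems.append(field)
    return problems

# A file with a populated title and id but an empty tags list is flagged.
fm = {"title": "Data Quality Standards", "id": "kb-042", "tags": []}
print(missing_fields(fm))  # includes "tags" plus every absent field
```

A file is excluded from the sync cycle whenever this list is non-empty.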
Dual-Readability compliance. All body content must conform to the Dual-Readability standard. Every paragraph must be semantically self-contained — meaning it can be extracted by a vector search and understood without the surrounding paragraphs. Narrative-dependent prose, unresolved pronouns referring to prior sections, and paragraphs that begin with “As mentioned above” or similar references violate this standard.
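A simple lint can catch the most obvious violations. This sketch only flags paragraphs that open with a back-reference; the phrase list is an illustrative assumption, and the real standard covers far more than these surface markers.

```python
import re

# Hypothetical back-reference phrases; the actual standard is broader.
NARRATIVE_MARKERS = re.compile(
    r"^(as (mentioned|noted|discussed) (above|earlier|previously)|"
    r"as we saw|see above)",
    re.IGNORECASE,
)

def flag_narrative_paragraphs(body: str) -> list:
    """Return indices of paragraphs that open with a back-reference."""
    paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
    return [i for i, p in enumerate(paragraphs) if NARRATIVE_MARKERS.match(p)]

doc = "Chunking splits text.\n\nAs mentioned above, chunks are embedded."
print(flag_narrative_paragraphs(doc))  # → [1]
```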
Epistemic marking. Claims and assertions should carry epistemic markers where appropriate — Axiomatic (non-negotiable truth), Heuristic (best practice based on evidence), or Speculative (theory or prediction). These markers allow agents to weigh the certainty of retrieved information when constructing responses.
Structural formatting. Articles must use hierarchical headings (H1 for title, H2 for major sections, H3 for subsections). Lists, tables, and code blocks must be properly formatted in standard markdown. Heading hierarchy must not skip levels.
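The no-skipped-levels rule can be checked mechanically. A minimal sketch over ATX-style markdown headings:

```python
import re

HEADING = re.compile(r"^(#{1,6})\s", re.MULTILINE)

def heading_skips(markdown: str) -> list:
    """Return (previous_level, level) pairs where the hierarchy skips a level."""
    levels = [len(m.group(1)) for m in HEADING.finditer(markdown)]
    return [(a, b) for a, b in zip(levels, levels[1:]) if b > a + 1]

text = "# Title\n## Section\n#### Too deep\n"
print(heading_skips(text))  # → [(2, 4)]  (H2 followed directly by H4)
```

An empty result means the heading hierarchy is well-formed under this rule.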
Gate 2: Ingestion Validation
When content enters the Knowledge Pipeline for embedding and synchronization, automated validation checks enforce consistency.
Schema validation. The frontmatter is validated against the required schema. Missing fields, incorrect types (e.g., a string where an array is expected), and empty required fields trigger rejection. The file is logged as failed and excluded from the current sync cycle.
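Type checking can be sketched against a declared schema. The field names come from the authoring standard; the expected types shown here are illustrative assumptions, not the pipeline's actual schema definition.

```python
# Illustrative type schema for a subset of required fields.
SCHEMA = {
    "title": str,
    "version": str,
    "tags": list,
    "relations": list,
    "synthetic_questions": list,
}

def type_errors(frontmatter: dict) -> list:
    """Return messages for fields whose value has the wrong type."""
    errors = []
    for field, expected in SCHEMA.items():
        if field in frontmatter and not isinstance(frontmatter[field], expected):
            errors.append(f"{field}: expected {expected.__name__}, "
                          f"got {type(frontmatter[field]).__name__}")
    return errors

# A string where an array is expected triggers rejection.
print(type_errors({"title": "Standards", "tags": "retrieval"}))
# → ['tags: expected list, got str']
```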
Taxonomy mapping. Each file must resolve to a valid knowledge_topic term via the path pattern mapping. Files that fall outside the defined taxonomy tree are flagged for manual classification before they can be indexed.
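The path pattern mapping can be sketched as a prefix lookup. The prefix-to-term table below is a hypothetical example; the real taxonomy tree lives in the pipeline configuration.

```python
from pathlib import PurePosixPath

# Hypothetical path-prefix → knowledge_topic mapping.
TOPIC_MAP = {
    "pipeline": "knowledge-pipeline",
    "agents": "agent-design",
    "standards": "data-quality",
}

def resolve_topic(path: str):
    """Map a file path to a knowledge_topic term, or None if unmapped."""
    parts = PurePosixPath(path).parts
    return TOPIC_MAP.get(parts[0]) if parts else None

print(resolve_topic("standards/quality-gates.md"))  # → 'data-quality'
print(resolve_topic("misc/notes.md"))               # → None (manual classification)
```

A `None` result corresponds to the "flagged for manual classification" path.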
Duplicate detection. The pipeline checks for duplicate id fields and near-duplicate content (based on embedding similarity above a defined threshold). Duplicates fragment the Knowledge Core and degrade retrieval precision — when an agent retrieves two near-identical chunks, it wastes context window and introduces ambiguity.
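Near-duplicate detection by embedding similarity can be sketched with plain cosine similarity. The 0.95 threshold is an illustrative assumption; the real value is tuned per embedding model, and a production pipeline would use an approximate-nearest-neighbor index rather than the pairwise scan shown here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

DUPLICATE_THRESHOLD = 0.95  # illustrative; tuned per embedding model

def near_duplicates(embeddings: dict) -> list:
    """Return id pairs whose embedding similarity exceeds the threshold."""
    ids = sorted(embeddings)
    return [(a, b) for i, a in enumerate(ids) for b in ids[i + 1:]
            if cosine(embeddings[a], embeddings[b]) > DUPLICATE_THRESHOLD]

vecs = {"kb-1": [1.0, 0.0], "kb-2": [0.99, 0.05], "kb-3": [0.0, 1.0]}
print(near_duplicates(vecs))  # → [('kb-1', 'kb-2')]
```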
Link integrity. Internal links (Obsidian wikilinks and relations entries in frontmatter) are validated. Broken links to nonexistent files are flagged. Orphaned files — those with no inbound links from any other document — are surfaced for review, as they may indicate content that has drifted out of the knowledge graph.
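Orphan detection can be sketched by inverting the wikilink graph. This minimal version only parses `[[target]]` syntax and ignores aliases, headings, and frontmatter relations, which a full implementation would also walk.

```python
import re

# Captures the target of [[target]], stopping at alias (|) or heading (#) parts.
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

def find_orphans(docs: dict) -> set:
    """Return documents with no inbound wikilinks from any other document."""
    linked = set()
    for name, body in docs.items():
        for target in WIKILINK.findall(body):
            if target.strip() != name:  # self-links do not count as inbound
                linked.add(target.strip())
    return set(docs) - linked

docs = {
    "quality-gates": "See [[dual-readability]] for the standard.",
    "dual-readability": "Paragraphs must stand alone.",
    "old-notes": "No one links here.",
}
print(sorted(find_orphans(docs)))  # → ['old-notes', 'quality-gates']
```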
Gate 3: Ongoing Audit
Content that passes Gates 1 and 2 is not permanently exempt from review. The Knowledge Core is a living system, and content decays over time.
Freshness scoring. Every document carries an updated timestamp. Documents that have not been reviewed within a defined period (determined by the content category and rate of change in its domain) are flagged for freshness review. The freshness framework defines three tiers of review — automated, hybrid, and expert — based on the sensitivity and complexity of the content.
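Freshness flagging reduces to a date comparison per tier. The review windows below are illustrative assumptions; the actual periods come from the freshness framework's per-category configuration.

```python
from datetime import date

# Illustrative review windows in days per review tier.
REVIEW_DAYS = {"automated": 365, "hybrid": 180, "expert": 90}

def needs_review(updated: date, tier: str, today: date) -> bool:
    """Flag a document whose last update is older than its tier's window."""
    return (today - updated).days > REVIEW_DAYS[tier]

print(needs_review(date(2024, 1, 10), "expert", date(2024, 6, 1)))    # → True
print(needs_review(date(2024, 5, 1), "automated", date(2024, 6, 1)))  # → False
```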
Retrieval quality monitoring. Agent outputs that receive low confidence scores or trigger the Iron Word’s human_review_required flag are traced back to their source documents. If a pattern emerges — multiple low-confidence outputs citing the same source — that document is escalated for quality review regardless of its freshness score.
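The "pattern emerges" trigger can be sketched as a simple count over flagged outputs. The threshold of three is an illustrative assumption for when repeated low-confidence citations constitute a pattern.

```python
from collections import Counter

ESCALATION_THRESHOLD = 3  # illustrative: flagged outputs citing one source

def sources_to_escalate(low_confidence_events: list) -> list:
    """Return source ids cited by at least ESCALATION_THRESHOLD flagged outputs."""
    counts = Counter(low_confidence_events)
    return sorted(s for s, n in counts.items() if n >= ESCALATION_THRESHOLD)

# Each entry is the source document cited by one low-confidence agent output.
events = ["kb-7", "kb-2", "kb-7", "kb-9", "kb-7"]
print(sources_to_escalate(events))  # → ['kb-7']
```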
Post-mortem integration. When the Steady Presence Incident Loop identifies a content defect as the root cause of an agent failure, the affected document is immediately flagged and enters a mandatory review cycle. The incident cannot close until the content defect is corrected and the document passes Gate 2 validation again.
Content Status Lifecycle
Every document in the Knowledge Core carries a status field that governs its visibility in the pipeline:
- Draft — Content is in progress. Excluded from the vector database and WordPress sync. Visible only in Obsidian.
- Active — Content has passed Gates 1 and 2. Fully indexed in the vector database and eligible for WordPress publication.
- Under Review — Content has been flagged by Gate 3 (freshness, retrieval quality, or post-mortem). Remains in the vector database but is deprioritized in retrieval ranking until review is complete.
- Archived — Content is no longer current. Removed from the vector database and WordPress. Retained in Git history for audit purposes.
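The lifecycle rules above can be restated as a lookup table. The boolean fields mirror the list; the retrieval weight for Under Review is an illustrative assumption standing in for whatever deprioritization the ranking layer actually applies.

```python
# Visibility rules per status; "weight" for under_review is illustrative.
STATUS_RULES = {
    "draft":        {"vector_db": False, "wordpress": False, "weight": 0.0},
    "active":       {"vector_db": True,  "wordpress": True,  "weight": 1.0},
    "under_review": {"vector_db": True,  "wordpress": True,  "weight": 0.5},
    "archived":     {"vector_db": False, "wordpress": False, "weight": 0.0},
}

def syncs_to_vector_db(status: str) -> bool:
    """True if a document with this status belongs in the vector database."""
    return STATUS_RULES[status]["vector_db"]

print(syncs_to_vector_db("under_review"))  # → True (deprioritized, not removed)
print(syncs_to_vector_db("archived"))      # → False
```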
Why Standards Are Non-Negotiable
The temptation with any knowledge base is to prioritize volume over quality — to get content in first and clean it up later. The SIE rejects this approach. Every substandard document that enters the vector database becomes a potential source of hallucination, a drain on retrieval precision, and a contributor to the Human Correction Tax. The quality gates exist to ensure that the cost of maintaining high standards is paid at authoring time, where it is cheapest, rather than at agent output time, where it compounds.