A discipline that applies reliability engineering principles to AI systems — ensuring consistent, auditable, and recoverable behavior in production.
Definition
AI Reliability Engineering is an emerging engineering discipline that combines traditional site reliability engineering (SRE) practices with AI-specific concerns: model drift detection, constraint enforcement, hallucination prevention, and automated governance. It treats AI components as probabilistic systems that require different reliability patterns from those of deterministic software — including output validation, confidence monitoring, and graceful degradation when model behavior deviates from expectations.
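The three patterns named above can be combined in a single wrapper around a model call. The sketch below is illustrative only: `reliable_call`, `CONFIDENCE_FLOOR`, and the `(text, confidence)` model interface are assumptions for the example, not a real API.

```python
from typing import Callable, Tuple

# Assumed threshold below which an output is treated as unreliable (illustrative).
CONFIDENCE_FLOOR = 0.7

def validate_output(text: str) -> bool:
    """Output validation: enforce simple structural constraints on the response."""
    return bool(text.strip()) and len(text) < 1000

def reliable_call(
    prompt: str,
    model: Callable[[str], Tuple[str, float]],
    fallback: str = "Unable to answer reliably.",
) -> str:
    """Wrap a probabilistic model with validation, confidence monitoring,
    and graceful degradation to a fallback answer."""
    text, confidence = model(prompt)
    if confidence < CONFIDENCE_FLOOR:   # confidence monitoring
        return fallback                 # graceful degradation
    if not validate_output(text):       # output validation
        return fallback
    return text
```

In practice the fallback might be a cached answer, a simpler deterministic system, or an escalation to a human, rather than a fixed string; the point is that a low-confidence or constraint-violating output never reaches the caller unchecked.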
Origin
The term emerged from the intersection of site reliability engineering (pioneered by Google in the mid-2000s) and the increasing deployment of AI systems in production environments. As organizations moved AI from research prototypes to production services, they discovered that traditional SRE practices were necessary but insufficient. AI systems introduced failure modes — hallucination, drift, prompt injection, context confusion — that had no equivalent in conventional software. AI reliability engineering evolved to address these gaps, incorporating constraint enforcement, knowledge-based governance, and AI-specific observability into the SRE framework.
Applications
Ecosystem
Related
The discipline of building AI systems that work consistently in production — covering constraint enforcement, drift detection, and failure recovery.
Monitoring, tracing, and understanding AI agent behavior in production — from token usage to decision quality.
A taxonomy of how AI agents fail in production — from hallucinations and tool misuse to cascading failures in multi-agent systems.
Engineering practices for deploying and operating AI systems in production — beyond prototypes and demos.