
21 Sept 2025

Engineering Agents in Production: Building Autonomy With Adult Supervision

How to deploy software and operations agents safely in production using clear boundaries, evidence gates, and performance controls.

agents · ai · engineering · devops · governance


The first time a team demos an engineering agent, the reaction is predictable.

  • Someone is impressed.
  • Someone is worried.
  • Someone asks if we can replace half the backlog meetings.

All reasonable reactions.

By late 2025, most engineering organizations have moved past "Can agents do interesting things?" The practical question is now: can agents do useful things reliably inside production delivery systems?

That is a higher bar. It requires architecture, governance, and operational discipline.

An agent is not just a smarter autocomplete. It is a decision actor in your delivery loop. Once you treat it that way, the implementation pattern becomes clearer.

Define the control boundary first, not last

Many failed rollouts start with capability demos and only later ask, "What should this agent be allowed to do?"

Reverse that order.

For each agent workflow, define:

  • action scope,
  • data scope,
  • decision authority,
  • escalation path,
  • rollback authority.

If the boundary is unclear, incidents become debates about process rather than technical fixes.
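In practice the boundary can live as a small declarative schema checked into the repo. A minimal sketch follows; the field names and values are illustrative, not a standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentBoundary:
    """Control boundary for one agent workflow (illustrative schema)."""
    workflow: str
    action_scope: list[str]      # actions the agent may take, e.g. "open_pr"
    data_scope: list[str]        # data domains it may read, e.g. "source_code"
    decision_authority: str      # "suggest" | "execute_with_approval" | "autonomous"
    escalation_path: str         # who is engaged when the agent is blocked or uncertain
    rollback_authority: str      # who may revert the agent's changes

REFACTOR_BOT = AgentBoundary(
    workflow="refactoring-proposals",
    action_scope=["open_pr", "comment_on_pr"],
    data_scope=["source_code", "test_results"],
    decision_authority="suggest",
    escalation_path="platform-oncall",
    rollback_authority="owning-team-lead",
)
```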

Agent tasks that usually work early

Start with bounded, high-volume tasks where quality can be validated quickly:

  • test case expansion,
  • refactoring proposals,
  • dependency and risk summaries,
  • migration planning drafts,
  • runbook updates,
  • post-incident timeline synthesis.

These tasks generate tangible value without giving agents direct authority over critical production behavior.

Tasks that should stay human-led initially

Keep humans in the decisive loop for:

  • customer-facing pricing logic,
  • identity and access control changes,
  • compliance-sensitive workflows,
  • irreversible data migration decisions,
  • high-severity incident communication.

Autonomy can expand over time, but only after quality and controls are proven.

The three-loop operating model for production agents

I use a three-loop model.

Loop 1: Pre-Execution Qualification

Before an agent acts:

  • task is classified by risk,
  • required context is validated,
  • applicable policies are attached,
  • expected outputs are defined.

Without qualification, agents produce plausible output in the wrong context.
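A minimal sketch of the qualification step, assuming tasks arrive with a risk class and attached context (the names and risk classes are hypothetical):

```python
from dataclasses import dataclass

RISK_CLASSES = {"low", "medium", "high"}

@dataclass
class AgentTask:
    description: str
    risk_class: str               # assigned upstream by a human or classifier
    context_refs: list[str]       # repo paths, ticket IDs the agent must load
    policies: list[str]           # policy IDs that apply to this task
    expected_outputs: list[str]   # required artifacts, e.g. "diff", "test_report"

def qualify(task: AgentTask) -> None:
    """Refuse to dispatch a task that is not fully qualified."""
    if task.risk_class not in RISK_CLASSES:
        raise ValueError(f"unknown risk class: {task.risk_class!r}")
    if not task.context_refs:
        raise ValueError("required context is missing or unvalidated")
    if not task.policies:
        raise ValueError("no applicable policies attached")
    if not task.expected_outputs:
        raise ValueError("expected outputs are undefined")
```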

Loop 2: Evidence-Gated Execution

Agent output must pass objective checks:

  • static analysis and linting,
  • test results,
  • policy and dependency checks,
  • architecture rule conformance.

No evidence, no merge. "Looks fine" is not a control.
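The gate itself can be a dumb aggregator; the value is in refusing to merge without objective evidence. A sketch, assuming each check is an independent pipeline step that reports pass/fail:

```python
from typing import Callable

# Placeholder type for real pipeline steps: linters, test runners,
# policy engines, architecture-rule checkers.
EvidenceCheck = Callable[[str], tuple[bool, str]]

def evidence_gate(change_id: str, checks: dict[str, EvidenceCheck]) -> bool:
    """Allow merge only if every objective check passes."""
    failures = []
    for name, check in checks.items():
        passed, detail = check(change_id)
        if not passed:
            failures.append(f"{name}: {detail}")
    if failures:
        print(f"BLOCKED {change_id}: " + "; ".join(failures))
        return False
    return True
```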

Loop 3: Post-Release Monitoring

After deployment:

  • track latency/error deltas,
  • watch change failure signals,
  • enforce rollback thresholds,
  • log incident associations for agent-assisted changes.

If outcomes drift, tighten boundaries before scaling further.
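Rollback thresholds work best when they are declared per release rather than negotiated during the incident. A minimal sketch, with illustrative metric names and limits:

```python
from dataclasses import dataclass

@dataclass
class ReleaseGuardrail:
    max_error_rate_delta: float      # absolute increase tolerated, e.g. 0.005
    max_p95_latency_delta_ms: float  # tolerated p95 latency regression

def should_roll_back(baseline: dict, current: dict, guard: ReleaseGuardrail) -> bool:
    """Compare post-deploy telemetry against the pre-deploy baseline."""
    error_delta = current["error_rate"] - baseline["error_rate"]
    latency_delta = current["p95_latency_ms"] - baseline["p95_latency_ms"]
    return (error_delta > guard.max_error_rate_delta
            or latency_delta > guard.max_p95_latency_delta_ms)
```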

Why agent rollouts fail in mature organizations

In larger organizations, failures are often sociotechnical:

  • unclear ownership of agent behavior,
  • inconsistent review standards between teams,
  • missing integration with CI/CD controls,
  • weak telemetry linking output to outcomes.

The painful pattern: output volume rises, confidence falls. Teams feel fast and brittle at the same time.

The fix is not "more training" alone. It is operating model redesign.

Engineering quality in an agentic world

Traditional productivity metrics are no longer enough.

Add these:

  • defect-adjusted throughput,
  • review burden per merged change,
  • rollback rate for agent-assisted releases,
  • incident correlation by generation source,
  • time-to-understand during on-call handover.

High output with high clean-up cost is negative productivity.
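One way to make defect-adjusted throughput concrete. The weighting below is an assumption to calibrate against your own clean-up costs, not an industry constant:

```python
def defect_adjusted_throughput(merged_changes: int,
                               escaped_defects: int,
                               cleanup_weight: float = 3.0) -> float:
    """Throughput penalized by downstream defects.

    cleanup_weight says how many changes' worth of effort one escaped
    defect consumes; 3.0 is a placeholder, not a benchmark.
    """
    return merged_changes - cleanup_weight * escaped_defects

# Example: 40 merged changes with 5 escaped defects -> 25 effective units.
print(defect_adjusted_throughput(40, 5))
```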

Integrating agents into CI/CD without chaos

A practical CI/CD integration pattern:

  1. agent generates draft change,
  2. pipeline runs full quality and policy suite,
  3. human reviewer approves by risk class,
  4. staged rollout with guardrails,
  5. telemetry tags capture provenance.

Two design details matter:

  • provenance tagging for traceability,
  • consistent policy enforcement across human and agent changes.

Different rules for different actors always become loopholes.
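Provenance tagging can piggyback on commit trailers so origin survives all the way into telemetry. A sketch using custom Git trailers (the trailer keys are a convention you define, not a Git standard):

```python
import subprocess

def commit_with_provenance(message: str, actor: str, workflow: str) -> None:
    """Commit with trailers that let downstream tooling distinguish
    human, scripted, and agent-assisted changes."""
    trailers = f"\n\nChange-Actor: {actor}\nAgent-Workflow: {workflow}"
    subprocess.run(["git", "commit", "-m", message + trailers], check=True)

commit_with_provenance(
    message="Refactor retry logic in payment client",
    actor="agent-assisted",
    workflow="refactoring-proposals",
)
```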

Agent governance in regulated and trust-sensitive contexts

With DORA, NIS2, and phased AI Act obligations shaping expectations, engineering teams need evidence-ready controls. That does not require bureaucracy. It requires traceable decisions.

Minimum governance for production agents:

  • usage inventory,
  • policy mapping by workflow,
  • review obligations by risk class,
  • audit trail for approvals and overrides,
  • exception register with expiry.

If you cannot reconstruct the decision path behind a specific agent change, you are not ready for high-impact automation.
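The exception register is the artifact teams most often skip. A sketch of an entry with a mandatory expiry, using hypothetical field names:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class PolicyException:
    workflow: str
    policy_id: str
    justification: str
    approved_by: str
    expires_on: date  # no open-ended exceptions

    def is_active(self, today: date) -> bool:
        return today <= self.expires_on

exc = PolicyException(
    workflow="runbook-updates",
    policy_id="SEC-REVIEW-REQUIRED",
    justification="Docs-only changes; review waived for 90 days",
    approved_by="risk-lead",
    expires_on=date(2025, 12, 21),
)
```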

Story from scaled delivery programs

In complex delivery environments, consistency beats heroics. When standards were explicit and onboarding included practical certification, quality variance decreased and throughput became more predictable.

The same principle applies to agents:

  • shared standards,
  • repeatable validation,
  • clear ownership,
  • real post-release feedback.

Without those, each team invents its own workflow. That is innovation at first and fragmentation soon after.

Team roles that make agent adoption sustainable

You do not need a massive AI org chart. You do need explicit responsibilities:

  • Platform engineering: integration patterns, controls, tooling reliability.
  • Security/risk: policy boundaries and exception governance.
  • Domain engineering leads: task suitability and quality standards.
  • Product leadership: outcome alignment and risk trade-off clarity.

One owner should coordinate this operating model end-to-end.

The "autonomy maturity" ladder

Scale autonomy progressively.

Level 1: Suggest

Agent drafts, humans decide.

Level 2: Execute under supervision

Agent runs bounded actions with required approvals.

Level 3: Conditional autonomy

Agent can execute pre-approved changes when all checks pass.

Level 4: Assisted self-healing

Agent proposes and executes low-risk remediations with tight rollback.

Most organizations should spend meaningful time in Levels 1-2 before climbing higher.
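The ladder is easier to enforce when promotion is a data check rather than a judgment call in a meeting. A sketch with illustrative stability thresholds:

```python
from enum import IntEnum

class AutonomyLevel(IntEnum):
    SUGGEST = 1
    SUPERVISED_EXECUTE = 2
    CONDITIONAL_AUTONOMY = 3
    ASSISTED_SELF_HEALING = 4

def may_promote(current: AutonomyLevel,
                weeks_stable: int,
                rollback_rate: float,
                incidents: int) -> bool:
    """Promote one level only after the current level is boringly stable.

    The thresholds (8 stable weeks, <2% rollback rate, zero incidents)
    are illustrative; derive yours from baseline data.
    """
    return (current < AutonomyLevel.ASSISTED_SELF_HEALING
            and weeks_stable >= 8
            and rollback_rate < 0.02
            and incidents == 0)
```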

Avoiding agent theater

Agent theater is when demos look impressive but production impact is small. Symptoms include:

  • lots of generated code, little reliability gain,
  • pilot successes that never scale,
  • heavy dependence on one champion engineer,
  • no measurable impact on cycle time or defect rate.

To avoid this, treat each agent workflow as a product:

  • define users,
  • define value,
  • define controls,
  • define success metrics,
  • retire what does not work.

Humor break: "the agent wrote it"

During early adoption, I heard a phrase that should never appear in incident review: "The agent wrote it."

That sentence explains nothing. Accountability does not move from team to tool. If your ownership model allows that sentence to end a discussion, redesign the model.

A practical 60-day rollout plan

If you need momentum without chaos:

  1. Select 3 low-risk, high-volume workflows.
  2. Define risk classes and review obligations.
  3. Integrate evidence gates in CI/CD.
  4. Tag provenance for all agent-assisted changes.
  5. Review outcomes weekly and adjust boundaries.
  6. Publish a short standard and train teams.

This is enough to create repeatable gains quickly.

What success looks like after one quarter

You should see:

  • faster cycle time in targeted workflows,
  • stable or improved defect metrics,
  • lower onboarding friction,
  • clearer incident traceability,
  • better confidence in release decisions.

If only speed improves while quality degrades, slow down and tighten controls.

Final reflection

Engineering agents are powerful. They are not magical. Their value depends on the operating system around them: boundaries, validation, ownership, and feedback.

When teams design that system deliberately, agents reduce toil and increase capacity for higher-order engineering work. When teams skip the system design, agents amplify existing weaknesses.

In production, the rule is simple: autonomy must earn trust through evidence.

That is how you scale it professionally.

SRE and Agent Workflows: The Handshake Most Teams Forget

Engineering-agent conversations often happen in product and platform circles first, then SRE gets involved when the first odd incident appears. That sequence should be reversed. SRE thinking belongs at design time, not only at incident time.

For agent-enabled delivery, I recommend a formal SRE handshake with five concrete artifacts:

  1. Failure mode map for each agent workflow.
  2. Service-level impact assumptions (latency, error, dependency tolerance).
  3. Operational kill-switch design with explicit authority.
  4. Alert taxonomy distinguishing model-quality drift from infrastructure degradation.
  5. Post-incident attribution template that captures where the control model failed.

This sounds heavy, but in practice it can be implemented in a lightweight template and reviewed in under an hour per workflow.
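Of the five artifacts, the kill switch is worth prototyping first. A minimal sketch, assuming a flag store writable only by on-call (the file location and flag names are assumptions):

```python
import json
from pathlib import Path

FLAG_FILE = Path("/etc/agent-flags/kill_switch.json")  # illustrative location

def agent_enabled(workflow: str) -> bool:
    """Consult the kill switch before every agent action.

    Only on-call writes this file (explicit authority), so disabling a
    workflow never depends on the agent platform itself being healthy.
    """
    if not FLAG_FILE.exists():
        return False  # fail closed: no flag state means no autonomy
    flags = json.loads(FLAG_FILE.read_text())
    if flags.get("global_kill", False):
        return False
    return workflow not in flags.get("disabled", [])
```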

Another practical step is to include agent provenance tags in incident timelines. During response, teams should not waste 40 minutes figuring out whether a suspicious change came from a human workflow, a scripted automation, or an agent-assisted path. The timeline should answer that immediately.
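With provenance tags in place, that question becomes a simple grouping over the change log. A sketch, assuming ISO-8601 timestamps and an `actor` field populated from the provenance tags:

```python
def changes_by_origin(timeline: list[dict], start: str, end: str) -> dict:
    """Group changes in the incident window by provenance."""
    buckets: dict[str, list[dict]] = {}
    for change in timeline:
        if start <= change["timestamp"] <= end:  # ISO-8601 sorts lexically
            buckets.setdefault(change.get("actor", "unknown"), []).append(change)
    return buckets
```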

When SRE is integrated early, teams gain confidence to expand autonomy safely. Without SRE integration, organizations drift into either overconfidence or overrestriction. Neither scales.

The simple benchmark I use is this: if your on-call engineers can identify, contain, and explain an agent-related issue as quickly as a conventional deployment issue, your operating model is maturing. If not, keep autonomy boundaries tighter and improve observability before expanding scope.

Where Agent Programs Commonly Overreach

The most frequent overreach is giving agents too much scope before teams have reliable quality baselines. Expand autonomy only when the previous level is boringly stable for a sustained period. "It worked in the demo" is not a readiness signal. Consistent defect, incident, and rollback behavior is.

Maturity is intentionally uneventful: fewer surprises, faster diagnosis, and fewer heroics in incident channels. If your agent program feels exciting every week, your control boundaries are probably too loose.
