Software Delivery Autonomy Levels

Most organizations using AI for software development are stuck in the same place: AI writes code, humans review it. This feels like progress. Developers report productivity gains. Commit velocity increases. But the fundamental process remains unchanged: humans orchestrate every step, validate every output, and gate every deployment.

The real transformation begins when AI stops being a faster typewriter and starts changing how software delivery operates.

This article introduces the Software Delivery Autonomy Levels—a seven-level progression from traditional Agile teams to fully autonomous software delivery and operations. We focus specifically on Levels 4 through 7, where the process itself transforms, examining real-world implementations at AWS, Spotify, StrongDM, and the emerging frontier beyond them.

At Levels 6 and 7, we introduce ASDM (Autonomous Software Delivery Model)—a methodology with specific practices and artifacts for achieving autonomous coding, review, validation, and operations. Teams can adopt ASDM practices incrementally as they progress through the levels.

The shift from “AI-assisted” to “AI-operated” is not an incremental improvement. It is a fundamental change in how software gets built, validated, and shipped.

Software Delivery Autonomy Levels: The Model

The seven levels describe a progression in who—or what—executes software delivery work, and how humans participate in the process.

Level	Name	Human Role	Defining Characteristic
1	Agile/Scrum	Execute work	Team coordination optimization
2	DevOps	Build + operate	Pipeline automation (CI/CD)
3	AI-Assisted	Primary implementer	AI augments existing process
4	AI-Native Workflows	Orchestrator	Process redesigned around AI
5	Agentic Engineering	Batch reviewer	Extended autonomous sessions
6	Autonomous Coding & Review	Specification owner	No human coding or code review
7	Autonomous Operations	Governor	Autonomous production operations, exceptions only

A useful lens is the loop position of human involvement:

Levels 1-3: Human-in-the-loop (interactive, every step)
Level 4: Human-in-the-loop (but process is AI-native)
Level 5: Human-on-the-loop (batch review after autonomous sessions)
Level 6: Human-over-the-code-loop (humans define intent and scenarios, not code)
Level 7: Human-over-the-operations-loop (governance, incidents, and exceptions only)

Why We Focus on Levels 4-7

Levels 1 and 2 represent pre-AI organizational maturity. Agile optimized team coordination. DevOps automated delivery pipelines. These are established baselines that most software organizations have adopted in some form.

Level 3—AI-Assisted development—is where most organizations operate today. Developers use Copilot, Cursor, or similar tools to accelerate coding. AI generates boilerplate, suggests completions, and helps with documentation. Productivity improves, but the workflow remains fundamentally unchanged: humans drive, AI assists.

The inflection point occurs between Level 3 and Level 4. This is where AI stops augmenting existing processes and starts requiring new ones. The tooling changes. The team dynamics change. The validation systems change. Everything downstream of “how we build software” must be reconsidered.

Levels 4 through 7 represent this transformation—from AI as tool to AI as operator.

Validation Methods by Level

As autonomy increases, validation methods must evolve. Human judgment gives way to automated systems capable of operating without human attention.

Level	Primary Validation	Automated Validation	Human Gate
3	Human code review	Unit tests, CI checks	Every PR
4	Human code review	Unit tests, CI checks, prompt benchmarks	Every PR
5	Human batch review	Unit tests, CI checks, LLM-as-a-Judge for style/security	Before merge
6	Scenario satisfaction	Scenarios, DTU, LLM-as-a-Judge, probabilistic metrics	Before production
7	Operational satisfaction	Production telemetry, SLOs, canaries, rollback/remediation agents	Exceptions only

LLM-as-a-Judge emerges at Level 5 as a bridge technology. AI models evaluate code for style consistency, security vulnerabilities, and architectural compliance—tasks previously requiring human reviewers. At Level 6, LLM-as-a-Judge becomes one component of a broader autonomous validation stack. At Level 7, validation expands beyond code correctness into operational health: deployment safety, live telemetry, incident response, rollback, and continuous policy compliance.

Artifacts and Rituals by Level

Each level introduces new artifacts (documents, infrastructure) and rituals (practices, ceremonies) that enable increased autonomy.

Level	New Artifacts	New Rituals
4	Steering files (AGENTS.md), prompt libraries, context specs	Mob Elaboration, Mob Construction, prompt reviews
5	Overnight execution logs, batch review queues, agent traces	Queue-before-sleep, morning review sessions, high-bandwidth standups
6	Scenario libraries, Digital Twins, satisfaction dashboards, provenance records	Scenario authoring, DTU maintenance, policy governance, convergence monitoring
7	Operational policy engines, production guardrails, automated remediation playbooks	SLO governance, incident simulation, exception review, autonomous operations review

These artifacts and rituals accumulate—Level 7 organizations maintain everything from earlier levels while adding new capabilities.

Level Operating Profiles

The levels are easiest to distinguish by looking at what changes in behavior, output, metrics, and limitation:

Level	Behaviors	Primary Output	Useful Metrics	Key Shift	Limitation
1 — Agile/Scrum	Humans coordinate work through ceremonies, tickets, and planning rituals	Shipped increments from human teams	Sprint predictability, throughput, cycle time	Work becomes coordinated	Delivery remains human-executed
2 — DevOps	Teams automate build, test, release, and operations workflows	Reliable delivery pipeline	Deployment frequency, lead time, MTTR, change failure rate	Delivery becomes automated	Implementation remains human-executed
3 — AI-Assisted	Developers use copilots and chat tools inside existing workflows	Faster human-authored code	AI-assisted commits, time saved, PR throughput	AI accelerates individuals	Process and review model remain unchanged
4 — AI-Native Workflows	Teams redesign planning, construction, context, and feedback around AI participation	AI-native delivery workflow	Prompt reuse, context quality, prompt benchmark pass rates, PR quality	AI becomes part of the process	Humans still orchestrate every significant step
5 — Agentic Engineering	Agents execute for hours; humans review completed batches	Agent-generated PRs or features	Autonomous session duration, batch review volume, generated PRs, review rejection rate	Humans move from interactive driver to batch reviewer	Human review remains the bottleneck
6 — Autonomous Coding & Review	Agents write and validate code without human code review	Scenario-satisfied software candidate	Human-written code %, human code-review %, scenario volume, satisfaction rate, token spend	Validation replaces code review	Production operations still retain human gates
7 — Autonomous Operations	Systems deploy, monitor, remediate, and escalate under policy	Operating production system	Autonomous deployment %, rollback/remediation success, SLO adherence, exception rate	Policy-governed systems operate software	Requires exceptional trust, observability, and compliance controls

Level 4: AI-Native Workflows

Defining characteristic: The development process is redesigned around AI participation across the entire lifecycle.

Human role: Orchestrator—humans direct AI across planning, implementation, testing, and deployment rather than executing each step.

Example: AWS AI-Driven Development Lifecycle (AI-DLC)

The AWS AI-DLC Methodology

In early 2026, AWS published their AI-Driven Development Lifecycle (AI-DLC) framework, arguing that existing Agile and Scrum practices are fundamentally misaligned with AI-native development. Their core critique:

“Simply retrofitting AI as an assistant… reinforces outdated inefficiencies.”

AI-DLC proposes replacing traditional sprint ceremonies with AI-native workflows:

Mob Elaboration replaces sprint planning. Instead of humans decomposing work into tickets, AI participates in requirements analysis, generates implementation plans, and identifies technical risks. Humans guide and validate rather than author.

Mob Construction replaces pair programming. Multiple AI agents can work on related components simultaneously while humans orchestrate priorities and resolve conflicts.

Bolts instead of Sprints. The two-week sprint cadence, designed around human cognitive limits and coordination overhead, becomes less relevant when AI can iterate continuously. Work flows in smaller, faster increments.

What Changes at Level 4

The shift from Level 3 to Level 4 requires rethinking several assumptions:

Steering files become infrastructure. Documents like AGENTS.md, CLAUDE.md, and COPILOT.md emerge as machine-readable governance. These encode repository conventions, architecture constraints, dependency rules, and operational safety requirements. At Level 3, this knowledge lived as tribal knowledge or scattered documentation. At Level 4, it becomes a set of machine-readable contracts that AI systems consume directly.

Context management becomes critical. AI systems require explicit context that humans infer implicitly. Level 4 organizations invest in persistent memory systems, architectural decision records, and structured specifications that AI can consume across sessions.

Prompt quality becomes operationally important. Bad prompts at Level 3 waste time. Bad prompts at Level 4 can create insecure code, hallucinated refactors, or architectural drift at scale. Organizations begin treating prompts as versioned assets—reviewed, tested, and benchmarked.

Feedback loops tighten. AI-DLC emphasizes adaptive workflows where production telemetry directly influences development priorities. The boundary between “building” and “operating” software blurs further than DevOps achieved.
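
Treating prompts as versioned, benchmarked assets can be sketched roughly as follows. This is a minimal illustration rather than AWS tooling: PromptVersion, BenchmarkCase, run_benchmark, and the stub fake_model are hypothetical names, and a real harness would call an actual model where the stub sits.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PromptVersion:
    """A prompt treated as a versioned asset (hypothetical structure)."""
    name: str
    version: str
    template: str  # rendered with str.format placeholders


@dataclass(frozen=True)
class BenchmarkCase:
    """One benchmark input plus a check applied to the model's output."""
    inputs: dict
    check: Callable[[str], bool]


def run_benchmark(prompt: PromptVersion,
                  cases: list[BenchmarkCase],
                  model: Callable[[str], str]) -> float:
    """Return the fraction of benchmark cases whose output passes its check."""
    if not cases:
        raise ValueError("benchmark needs at least one case")
    passed = 0
    for case in cases:
        output = model(prompt.template.format(**case.inputs))
        if case.check(output):
            passed += 1
    return passed / len(cases)


def fake_model(rendered_prompt: str) -> str:
    """Stand-in for a real model call: echoes the prompt in upper case."""
    return rendered_prompt.upper()


prompt = PromptVersion("summarize-diff", "1.2.0", "Summarize the change to {path}.")
cases = [
    BenchmarkCase({"path": "auth.py"}, lambda out: "AUTH.PY" in out),
    BenchmarkCase({"path": "db.py"}, lambda out: "DB.PY" in out),
    # This check can never pass against the stub, so the rate is below 100%.
    BenchmarkCase({"path": "api.py"}, lambda out: "never-present" in out),
]
pass_rate = run_benchmark(prompt, cases, fake_model)
print(f"{prompt.name}@{prompt.version}: {pass_rate:.0%} of cases passed")
```

Recording the pass rate against each prompt version is one way to notice a regression when a prompt is edited, in the same way a failing test flags a code change.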

The Skill of Calibrated Prompting

One of the underappreciated requirements of Level 4 is developing intuition for AI interaction. Joe Magherimov, who led AWS’s Project Mantle, describes this as finding the “maximally supportable request”:

“One of the intuitions you build up… is that you build up that intuition of what is that maximally supportable [request]. Ask for too much and the model is likely to fail—it might run out of the context window or do weird stuff. Ask for too little and you’re not getting quite the speed-ups because you still have to be constantly in the loop.”

This calibration is not static. It varies by model, project, and domain. The only way to develop this intuition is practice—which means Level 4 requires investment in team skill development, not just tooling.

Ambiguity is another enemy of productivity. When multiple approaches could solve a problem, effective Level 4 practitioners brainstorm with the model before implementation. Magherimov describes starting with “here’s the problem I have, list solutions” to eliminate dead ends before committing to an approach.

Context Window Discipline

Level 4 also introduces new technical practices around context management. Magherimov describes clearing the context window between every request, using files as persistent memory for multi-step work:

“I would start with a file that describes a high-level approach. Then we break it down. I would start a prompt with: okay, here’s our design we agreed on, we’re now on step two out of seven, and here’s what we’re going to do in this step.”

When a step completes, the code is committed, the context is cleared, and the next session begins fresh with the durable spec file reloaded. This mirrors how humans naturally decompose large problems into manageable pieces—but makes the practice explicit and systematic.
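
The commit, clear, and reload cycle can be sketched as a small prompt builder. This is an assumption-laden illustration, not the Mantle team's tooling: build_step_prompt, the design.md filename, and the prompt wording are hypothetical. The point is that each step's prompt is rebuilt from the durable spec file alone, with no conversation history carried over.

```python
from pathlib import Path
import tempfile


def build_step_prompt(spec_path: Path, step: int, total_steps: int, step_goal: str) -> str:
    """Assemble a fresh prompt for one step from the durable spec file."""
    if not 1 <= step <= total_steps:
        raise ValueError("step must be between 1 and total_steps")
    design = spec_path.read_text(encoding="utf-8")
    return (
        "Here is the design we agreed on:\n"
        f"{design}\n"
        f"We are now on step {step} of {total_steps}.\n"
        f"In this step: {step_goal}\n"
    )


# Usage: the spec file is the only state that survives between sessions.
with tempfile.TemporaryDirectory() as tmp:
    spec = Path(tmp) / "design.md"  # hypothetical spec file
    spec.write_text("1. Parse config\n2. Validate schema\n", encoding="utf-8")
    step_prompt = build_step_prompt(spec, 2, 7, "validate the schema")
    print(step_prompt)
```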

The Level 4 Bottleneck

Despite AI participation across the lifecycle, Level 4 maintains human orchestration at each step. The developer’s job changes—from implementer to orchestrator—but humans remain in the loop for every significant decision.

This is intentional. AWS explicitly frames AI-DLC as “AI-powered execution with human oversight.” The methodology assumes humans validate AI outputs before they progress.

For many organizations, Level 4 represents a sustainable equilibrium. AI dramatically accelerates work while humans maintain quality control. But the constraint is clear: every output waits for human attention.

Level 5 breaks this constraint.

Level 5: Agentic Engineering

Defining characteristic: AI agents execute autonomously for extended periods (hours, not minutes). Humans review outputs in batches rather than interactively.

Human role: Batch reviewer—humans validate completed work rather than guiding each step.

Examples: AWS Project Mantle, Spotify Honk

AWS Project Mantle: Overnight Autonomous Coding

Project Mantle offers one of the most detailed public accounts of Level 5 development. In January 2026, Joe Magherimov—VP and Distinguished Engineer at AWS—described how a team of approximately nine senior engineers built a new inference engine for Amazon Bedrock using AI-first development, claiming 10x commit velocity.

The team composition matters: one Distinguished Engineer plus eight Principal Engineers, all from EC2 with decades of collective experience. This was not junior developers experimenting with AI. This was senior engineers deliberately restructuring how they work.

The overnight pattern. Magherimov describes queuing prompts before sleep and checking results in the morning:

“The thing that probably took me most by surprise is that asynchronous nature of software development. All of a sudden, I don’t actually need to be present or even paying attention for 80% to 90% of the cycle of making a software change.”

This represents the defining shift of Level 5: extended autonomous sessions. AI agents execute for eight or more hours without human guidance. The human checkpoint comes after the autonomous session completes, not during it.

Human accountability remains. Critically, Mantle maintained human responsibility for all code:

“At the end of the day, any line of code committed into the repository has a human name attached to it. A human is ultimately responsible for the quality of the code.”

This distinguishes Level 5 from Level 6. Agents execute autonomously, but humans still review everything before merge. The workflow is “fire, then review” rather than “review each step.”

The magnification effect. Magherimov’s colleague offered a key insight:

“AI-assisted development is gonna magnify well-functioning teams much more than it magnifies not-so-well-functioning teams.”

Teams with strong fundamentals—clear communication, fast feedback loops, solid testing—see those advantages amplified. Teams with dysfunction see that amplified too. Level 5 does not fix broken processes; it accelerates whatever exists.

Spotify Honk: Background Agents at Scale

Spotify’s Honk system provides another Level 5 implementation, optimized for fleet-wide changes across their large codebase.

Background execution. Rather than interactive coding sessions, Honk runs agents in the background. Engineers specify transformations in natural language, and agents execute autonomously—running formatting, linting, generating diffs, and capturing traces.

Ready-to-review outputs. Agents generate pull requests automatically opened against target repositories. The surrounding infrastructure handles “targeting repositories, opening pull requests, getting reviews, and merging into production.” Humans review before merge, but the creation is fully autonomous.

Scale metrics. Spotify reports over 1,500 merged AI-generated pull requests across migrations, with 60-90% time savings versus manual implementation. Hundreds of developers interact with the system for tasks ranging from language modernization to framework upgrades to UI component migrations.

The headline claim. In February 2026, TechCrunch reported that Spotify’s “best developers haven’t written a line of code since December, thanks to AI.” These engineers shifted entirely to specification, review, and governance—the natural endpoint of Level 5.

The Level 5 Operating Model

Level 5 organizations exhibit several common patterns:

Batch review replaces interactive review. Instead of reviewing each AI response, humans review completed PRs or feature implementations. This dramatically increases throughput but requires higher trust in AI outputs.

Testing investment increases. Both Mantle and Honk teams report significant investment in testing infrastructure. When code velocity increases 10x, bug volume increases proportionally unless validation systems scale.

Local integration testing becomes critical. Mantle emphasized running the full stack locally with fast feedback loops—minutes, not the 20-30 minutes typical of CI pipelines. This enables rapid validation of AI-generated code before it affects other engineers.

Co-location or high-bandwidth communication. Mantle co-located the entire team for high-frequency whiteboard discussions. When code changes happen overnight, shared understanding becomes the bottleneck. Miscommunication compounds faster.

The Quality Challenge at Velocity

Level 5 velocity creates new quality challenges. Magherimov observed:

“Even if the rate of bugs was lower than what would have been with a human, they still happen. And when you write a lot more code, you’re gonna have a lot more bugs.”

More critically, bugs at this velocity impact the entire team. A buggy check-in can break other engineers’ workflows, creating “tools-down” scenarios where everyone chases down what changed. The team’s response was to raise the bar on testing and verification—catching bugs before they’re checked in, before they impact other engineers.

This is why Mantle allocated approximately 25% of team effort to improving development practices rather than feature work. They worked with build system teams to speed up feedback loops—minutes instead of twenty to thirty minutes for builds and integration tests.

As Magherimov acknowledges, none of this is new—the industry has been paying attention to testing and verification for years. But at these rates of change, these practices become even more critical. Without a way to curtail chaos, it explodes.

The Level 5 Constraint

Level 5 removes human involvement from execution but maintains it for validation. Every PR still requires human review before merge. Every deployment still gates on human approval.

This constraint exists for good reason: AI-generated code can be subtly wrong in ways that compile and pass tests but violate business logic or security requirements. Human review catches these failures.

But this constraint also creates a bottleneck. As agent throughput increases, human review capacity becomes the limiting factor. Organizations either hire more reviewers, accept slower throughput, or consider Level 6.
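
A toy queue model makes the bottleneck concrete. The numbers below are hypothetical and chosen only for illustration; review_backlog is an assumed helper, not something described by AWS or Spotify.

```python
def review_backlog(prs_per_day: float, reviews_per_day: float, days: int) -> list[float]:
    """Return the unreviewed-PR backlog at the end of each day."""
    backlog = 0.0
    history = []
    for _ in range(days):
        # The backlog grows by the surplus of generated over reviewed PRs
        # and can never go below zero.
        backlog = max(0.0, backlog + prs_per_day - reviews_per_day)
        history.append(backlog)
    return history


# Hypothetical: agents open 40 PRs/day while reviewers clear 25/day.
growing = review_backlog(prs_per_day=40, reviews_per_day=25, days=5)
# Hypothetical: agents open 20 PRs/day, below reviewer capacity.
steady = review_backlog(prs_per_day=20, reviews_per_day=25, days=5)
print(growing)  # [15.0, 30.0, 45.0, 60.0, 75.0]
print(steady)   # [0.0, 0.0, 0.0, 0.0, 0.0]
```

Once agent output exceeds review capacity, the backlog grows linearly without bound, which is why the options reduce to adding reviewers, throttling agents, or changing the validation model.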

Level 6: Autonomous Coding & Review

Defining characteristic: Humans no longer write or review code. Validation shifts to scenario-based testing and probabilistic satisfaction metrics, but production operations still retain human gates or conventional operational controls.

Human role: Specification owner—humans define intent, scenarios, constraints, and acceptance thresholds. They do not write or review code.

Example: StrongDM Software Factory

The StrongDM Software Factory

StrongDM’s Software Factory represents the most aggressive public implementation of autonomous coding and review. Their operating principles are explicit:

“Code must not be written by humans.”

“Code must not be reviewed by humans.”

This is not “humans review less” or “humans review only high-risk changes.” This is a categorical exclusion of human coding and code review from the development process.

That is why StrongDM fits Level 6 rather than Level 7 in this model. The public material demonstrates non-interactive development: specifications and scenarios drive agents that write code, run harnesses, and converge without human review. It also describes Digital Twin clones of services such as Okta, Jira, Slack, Google Docs, Google Drive, and Google Sheets for high-volume validation. It does not, however, establish that StrongDM has delegated production operations for its enterprise offerings to autonomous systems without human operational gates. Level 6 is therefore the right classification: autonomous coding and review, not yet fully autonomous operations.

How does software get validated without human review?

Scenarios replace tests. Traditional unit tests are written by humans (or AI) alongside the code, where the implementing agent can see them. This creates a problem: the agent might game narrow test conditions with shortcuts like hardcoded return values.

StrongDM uses “scenarios”—end-to-end user stories stored outside the codebase. These function as holdout sets that the code-generating agents cannot see during implementation. The scenarios describe what users need to accomplish, not implementation details.

Probabilistic satisfaction replaces binary pass/fail. Instead of “did all tests pass?” the factory measures: “Of all the observed trajectories through all the scenarios, what fraction of them likely satisfy the user?”

This probabilistic framing acknowledges that software correctness is not binary. A system can satisfy 99% of user journeys while failing edge cases. Level 6 makes this tradeoff explicit and measurable before humans accept the system for operational use.
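
The satisfaction metric can be sketched as a simple ratio over observed trajectories. This is a minimal illustration under stated assumptions: the Trajectory record, the scenario names, and the 0.95 acceptance threshold are hypothetical, and how a trajectory is judged "satisfied" is left to whatever judge the factory uses.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    """One observed run through a scenario (hypothetical record shape)."""
    scenario_id: str
    satisfied: bool  # verdict supplied by an external judge


def satisfaction_rate(trajectories: list[Trajectory]) -> float:
    """Fraction of observed trajectories judged to satisfy the user."""
    if not trajectories:
        raise ValueError("no trajectories observed")
    return sum(t.satisfied for t in trajectories) / len(trajectories)


def per_scenario_rates(trajectories: list[Trajectory]) -> dict[str, float]:
    """Break the overall rate down by scenario to locate weak spots."""
    grouped: dict[str, list[Trajectory]] = defaultdict(list)
    for t in trajectories:
        grouped[t.scenario_id].append(t)
    return {sid: satisfaction_rate(ts) for sid, ts in grouped.items()}


observed = [
    Trajectory("onboard-user", True),
    Trajectory("onboard-user", True),
    Trajectory("onboard-user", True),
    Trajectory("revoke-access", True),
    Trajectory("revoke-access", False),
]
ACCEPTANCE_THRESHOLD = 0.95  # hypothetical threshold set by the specification owner

overall = satisfaction_rate(observed)
by_scenario = per_scenario_rates(observed)
accepted = overall >= ACCEPTANCE_THRESHOLD
print(overall, by_scenario, accepted)
```

The per-scenario breakdown matters because a high overall rate can hide one scenario that fails half the time.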

Digital Twin Universe (DTU) enables unlimited testing. Software rarely operates in isolation. Modern applications integrate with Okta, Jira, Slack, Google Workspace, and dozens of other services. Testing against production APIs has limits: rate limiting, API costs, inability to test failure modes.

StrongDM’s DTU creates behavioral clones of third-party services. This enables testing at volumes exceeding production limits, safe failure mode testing, and running thousands of scenarios per hour without rate limiting or API costs. The DTU makes possible validation that would be impractical against real services.

The DTU is perhaps the most significant infrastructure investment required for Level 6. Without it, scenario-based validation cannot achieve the coverage necessary to replace human code review.
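
The idea of a behavioral clone with failure injection can be sketched in a few lines. This is not StrongDM's DTU and does not mirror any vendor's real API: FakeChatService, RateLimited, and notify_with_retry are hypothetical names for an in-memory stand-in that lets a client's retry logic be exercised at volume.

```python
class RateLimited(Exception):
    """Raised by the twin to simulate a provider's rate-limit response."""


class FakeChatService:
    """In-memory stand-in for a third-party chat API (hypothetical interface)."""

    def __init__(self, fail_every: int = 0):
        self.fail_every = fail_every  # 0 disables failure injection
        self.calls = 0
        self.messages: dict[str, list[str]] = {}

    def post_message(self, channel: str, text: str) -> int:
        """Store a message, or fail on every Nth call when injection is on."""
        self.calls += 1
        if self.fail_every and self.calls % self.fail_every == 0:
            raise RateLimited(f"simulated rate limit on call {self.calls}")
        self.messages.setdefault(channel, []).append(text)
        return len(self.messages[channel])


def notify_with_retry(service: FakeChatService, channel: str, text: str,
                      attempts: int = 3) -> bool:
    """Client code under test: retry when the service reports a rate limit."""
    for _ in range(attempts):
        try:
            service.post_message(channel, text)
            return True
        except RateLimited:
            continue
    return False


# Every second call fails, a failure mode that is hard to provoke on demand
# against a real service. A thousand notifications run with no API cost.
twin = FakeChatService(fail_every=2)
results = [notify_with_retry(twin, "#alerts", f"event {i}") for i in range(1000)]
print(all(results), len(twin.messages["#alerts"]), twin.calls)
```

A production-grade twin would need to reproduce far more of the real service's behavior (authentication, pagination, error formats), which is where the engineering cost comes from.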

The Convergence Model

StrongDM describes their factory as a system where “specs + scenarios drive agents that write code, run harnesses, and converge without human review.” The key word is converge.

Unlike traditional development where code is written once and reviewed, the factory operates iteratively. Agents generate code, run it against scenarios, observe failures, and generate improved versions. This cycle repeats until the satisfaction metrics reach acceptable thresholds.
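
The generate, evaluate, repeat cycle can be sketched as a loop with a threshold and an iteration budget. This is a toy under stated assumptions, not StrongDM's implementation: converge, toy_generate, and toy_evaluate are hypothetical, the "agent" is a stub that fixes one reported failure per pass, and how much failure detail to feed back without leaking the holdout scenarios is a design choice this sketch does not address.

```python
from typing import Callable


def converge(generate: Callable[[list[str]], str],
             evaluate: Callable[[str], tuple[float, list[str]]],
             threshold: float,
             max_iterations: int) -> tuple[str, float, int]:
    """Iterate generate -> evaluate until satisfaction meets the threshold."""
    feedback: list[str] = []
    best_score = -1.0
    for iteration in range(1, max_iterations + 1):
        candidate = generate(feedback)
        score, feedback = evaluate(candidate)
        best_score = max(best_score, score)
        if score >= threshold:
            return candidate, score, iteration
    raise RuntimeError(f"did not converge; best score {best_score:.2f}")


# Toy stand-ins: a candidate is a comma-separated list of implemented features.
REQUIRED = ["login", "logout", "reset-password", "mfa"]
implemented: list[str] = ["login"]


def toy_generate(feedback: list[str]) -> str:
    """Stub agent: addresses the first reported failure, if any."""
    if feedback:
        implemented.append(feedback[0])
    return ",".join(implemented)


def toy_evaluate(candidate: str) -> tuple[float, list[str]]:
    """Stub harness: score is the fraction of required features present."""
    have = set(candidate.split(","))
    missing = [f for f in REQUIRED if f not in have]
    return 1 - len(missing) / len(REQUIRED), missing


final, score, iterations = converge(toy_generate, toy_evaluate,
                                    threshold=0.95, max_iterations=10)
print(final, score, iterations)
```

The iteration budget matters as much as the threshold: without it, a factory that cannot converge would spend tokens indefinitely.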

This convergence model is only possible because of recent advances in LLM reliability. StrongDM credits Claude 3.5 (October 2024) with achieving “compounding correctness rather than error” in long-horizon coding workflows. Earlier models would accumulate errors over extended sessions, making convergence impossible. Recent models can maintain coherence across thousands of iterations.

The combination of Cursor’s YOLO mode (continuous agent iteration without human confirmation) and reliable long-horizon models created the technical foundation for Level 6.

The Economics of Level 6

StrongDM suggests a provocative cost baseline:

“If you haven’t spent at least $1,000 on tokens today per human engineer, your software factory has room for improvement.”

At $1,000/day per engineer, a 10-person team spends about $200,000/month on tokens alone, assuming roughly 20 working days per month and before counting infrastructure, tooling, or salaries. This is dramatically higher than Level 3-5 token costs.
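
The arithmetic behind that figure is worth making explicit, because it depends on how many days per month the spend applies. The sketch below assumes 20 working days per month, which is the value that reproduces $200,000; monthly_token_cost is a hypothetical helper, and the 30-day variant is shown only for comparison.

```python
def monthly_token_cost(daily_spend_per_engineer: float,
                       engineers: int,
                       days_per_month: int = 20) -> float:
    """Monthly token spend for a team; 20 working days is an assumption."""
    return daily_spend_per_engineer * engineers * days_per_month


team_monthly = monthly_token_cost(1_000, 10)          # 1,000 * 10 * 20
per_engineer_monthly = team_monthly / 10
calendar_monthly = monthly_token_cost(1_000, 10, 30)  # if agents run every day

print(team_monthly)          # 200000
print(per_engineer_monthly)  # 20000.0
print(calendar_monthly)      # 300000
```

If agents run overnight and on weekends, the calendar-day figure may be the more realistic one.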

Simon Willison, in his analysis of the Software Factory, raises the obvious concern:

“If these patterns really do add $20,000/month per engineer to your budget they’re far less interesting to me.”

The economics only work if the productivity gains exceed the costs. StrongDM’s implicit claim is that autonomous coding and review delivers enough value to justify the investment. But this remains unproven at scale across industries.

What Level 6 Requires

Level 6 is not achievable by simply removing human review from Level 5. It requires fundamentally different validation infrastructure:

Scenario libraries. Comprehensive end-to-end user stories that define what “working software” means. These must be maintained separately from the codebase and cover edge cases that would traditionally rely on human judgment.

Digital Twin infrastructure. Behavioral clones of all external services the software interacts with. This is a significant engineering investment, potentially as complex as the software itself.

Satisfaction metrics. Systems for measuring probabilistic user satisfaction across scenario trajectories. This requires instrumentation, measurement frameworks, and thresholds that define “good enough.”

Governance policies. Clear rules for what requires human escalation before production exposure. Even Level 6 systems need escape hatches for truly novel situations, security incidents, or regulatory requirements.

Provenance and auditability. When no human reviews code, you need complete records of which agent made which change, which prompt caused it, which model version was used, and why. Compliance and debugging both depend on this.
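
A provenance record of the kind described above might look like the following. The field names, identifiers, and the choice to store a SHA-256 hash of the prompt are assumptions for illustration, not a schema published by StrongDM.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import hashlib
import json


@dataclass(frozen=True)
class ProvenanceRecord:
    """Who (which agent and model) changed what, from which spec, and why."""
    change_id: str
    spec_id: str
    agent_id: str
    model_version: str
    prompt_sha256: str
    rationale: str
    recorded_at: str


def record_change(change_id: str, spec_id: str, agent_id: str,
                  model_version: str, prompt: str, rationale: str) -> ProvenanceRecord:
    """Build an immutable record; the prompt is stored as a content hash."""
    return ProvenanceRecord(
        change_id=change_id,
        spec_id=spec_id,
        agent_id=agent_id,
        model_version=model_version,
        prompt_sha256=hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        rationale=rationale,
        recorded_at=datetime.now(timezone.utc).isoformat(),
    )


# All identifiers below are made up for the example.
record = record_change(
    change_id="chg-0042",
    spec_id="spec-access-review",
    agent_id="agent-7",
    model_version="example-model-2025-01",
    prompt="Implement step 3 of the access review spec.",
    rationale="Scenario 'revoke-access' failed on expired sessions.",
)
line = json.dumps(asdict(record), sort_keys=True)  # one line per change in an append-only log
print(line)
```

Hashing the prompt keeps the log compact and tamper-evident, but the full prompt text then has to be retained somewhere else for the hash to be useful in an audit.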

Level 7: Autonomous Operations

Defining characteristic: Autonomous systems not only generate and validate code, but also operate software in production: deployment, monitoring, rollback, remediation, incident response, and policy enforcement run without routine human gates.

Human role: Governor—humans define business objectives, risk policies, escalation thresholds, and regulatory constraints. Humans handle exceptions rather than normal delivery or operations.

Example: Emerging frontier; no broadly documented enterprise example yet.

Level 7 is the true endpoint of software delivery autonomy. It is what people often mean when they use the “dark factory” metaphor: a production system where software moves from intent to implementation to operation without humans in the normal execution path. We avoid that term here because it carries unnecessary negative connotations and centers the absence of humans rather than the governance model that makes autonomy acceptable.

The important distinction is operational responsibility. Level 6 can produce code without human coding or code review, but humans may still decide when it is exposed to customers, monitor production behavior, approve remediations, and own incident response. Level 7 moves those operational gates into policy-governed autonomous systems.

What Level 7 Adds Beyond Level 6

Autonomous deployment. Changes that satisfy scenario and policy thresholds can progress through environments, canaries, and rollout stages without human approval.

Production telemetry as validation. Scenario satisfaction is necessary but not sufficient. Level 7 systems continuously compare live behavior against SLOs, user outcomes, security signals, cost budgets, and policy constraints.

Autonomous rollback and remediation. When telemetry degrades, systems can pause rollout, revert, generate a fix, validate it, and redeploy without waiting for an operator.

Incident simulation and preparedness. Digital Twins expand from third-party service validation into operational rehearsal: outage scenarios, degraded dependencies, permission failures, regional failures, abuse patterns, and cost spikes.

Policy-governed escalation. Humans are paged for exceptions, not routine decisions. Escalation policies define what the system may do alone, what requires notification, and what requires human approval.
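
The three tiers of escalation (act alone, act and notify, wait for approval) can be sketched as a policy function over canary telemetry. Everything here is a hypothetical illustration: the Action values, the telemetry fields, and the thresholds are assumptions, and a real policy engine would weigh many more signals.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"                          # system acts alone
    ROLLBACK = "rollback"                        # system acts alone
    ROLLBACK_AND_NOTIFY = "rollback_and_notify"  # system acts, human is told
    REQUIRE_APPROVAL = "require_approval"        # human decides


@dataclass(frozen=True)
class CanaryTelemetry:
    error_rate: float             # fraction of failed requests in the canary
    p99_latency_ms: float
    touches_regulated_data: bool


@dataclass(frozen=True)
class RolloutPolicy:
    max_error_rate: float         # above this, roll back silently
    max_p99_latency_ms: float
    notify_error_rate: float      # above this, roll back and tell a human


def decide(t: CanaryTelemetry, p: RolloutPolicy) -> Action:
    """Map live telemetry to the action the policy permits."""
    if t.touches_regulated_data:
        return Action.REQUIRE_APPROVAL
    if t.error_rate > p.notify_error_rate:
        return Action.ROLLBACK_AND_NOTIFY
    if t.error_rate > p.max_error_rate or t.p99_latency_ms > p.max_p99_latency_ms:
        return Action.ROLLBACK
    return Action.PROCEED


policy = RolloutPolicy(max_error_rate=0.01, max_p99_latency_ms=500, notify_error_rate=0.05)
healthy = decide(CanaryTelemetry(0.002, 310, False), policy)
degraded = decide(CanaryTelemetry(0.02, 310, False), policy)
severe = decide(CanaryTelemetry(0.08, 310, False), policy)
regulated = decide(CanaryTelemetry(0.0, 100, True), policy)
print(healthy, degraded, severe, regulated)
```

Keeping the policy as data separate from the decision function means governors can tighten or loosen thresholds without touching the automation itself.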

Why Level 7 Is Rare

Autonomous operations requires trust across a wider surface than autonomous coding. Code correctness is only one part of production safety. Level 7 also needs observability maturity, deployment isolation, rollback confidence, incident playbooks, security policy automation, compliance evidence, cost controls, and organizational willingness to let systems act.

This is why StrongDM’s Software Factory is an important Level 6 signal without being sufficient evidence of Level 7. It shows that code creation and review can move out of the human loop. The next frontier is whether production operation can do the same.

Introducing ASDM: The Methodology for Levels 6 and 7

ASDM (Autonomous Software Delivery Model) is the methodology for achieving and operating at Levels 6 and 7. Just as Scrum provides artifacts and practices for Agile teams, and DevOps provides practices for automated delivery, ASDM provides the framework for autonomous coding, review, validation, and operations.

Core ASDM Practices

Specification-first development. Software is defined by machine-readable specifications that agents consume directly. Humans author intent; agents implement.

Scenario-based validation. End-to-end user stories replace unit tests as the primary validation mechanism. Scenarios are holdout sets that implementation agents cannot access.

Probabilistic satisfaction metrics. Success is measured by the fraction of user journeys satisfied, not binary test passage.

Digital Twin testing. External services are cloned for unlimited, safe validation. Production APIs are never the bottleneck.

Zero human code review. Code is not read by humans as a required approval step. Validation systems replace human code review.

Governance-by-policy. Human oversight operates through policy definition, not code review. Policies define escalation triggers, risk thresholds, prohibited patterns, and operational authority.

Full provenance. Every change is traceable to its originating specification, prompt, agent, and model version.

Autonomous operations. At Level 7, deployment, monitoring, rollback, remediation, and incident response are governed by policy and telemetry rather than routine human approval.

ASDM Practices at Earlier Levels

Not all ASDM practices require Level 6. Teams can adopt components incrementally:

Practice	Level 4	Level 5	Level 6	Level 7
Steering files (AGENTS.md)	Introduce	Mature	Required	Required
Extended autonomous sessions	—	Core practice	Core practice	Core practice
Scenario-based validation	Experiment	Partial adoption	Required	Required
Digital Twin infrastructure	—	Experiment	Required	Required
Probabilistic satisfaction	—	—	Required	Required
Zero human code review	—	—	Required	Required
Governance-by-policy	Introduce	Develop	Required	Required
Full provenance	Good practice	Important	Required	Required
Autonomous operations	—	—	Experiment	Required

This incremental adoption path allows teams to build toward Levels 6 and 7 without committing to the full transformation immediately.

The ASDM Mindset Shift

Adopting ASDM requires more than new tools and processes. It requires a fundamental shift in how teams think about software quality and accountability.

From “Did a human verify this?” to “Do the scenarios and policies pass?” Traditional quality assurance relies on human judgment. Someone reads the code, considers edge cases, and attests that it meets requirements. ASDM replaces this with systematic scenario coverage and explicit policy gates. At Level 6, this determines whether code can be accepted without human review. At Level 7, it also determines whether software can operate, remediate, and evolve in production without routine human approval.

From implementation ownership to specification ownership. Engineers stop owning code and start owning specifications and scenarios. The quality of your work is measured by how well you defined what the software should do, not how elegantly you implemented it.

From debugging code to debugging scenarios. When production fails, the first question is not “what’s wrong with the code?” but “what scenario did we miss?” Failures indicate gaps in scenario coverage, not gaps in human review.

This mindset shift is perhaps the hardest part of Level 6 and Level 7 adoption. Many engineers derive professional identity from code craftsmanship. ASDM asks them to derive identity from specification, scenario, and operational policy craftsmanship instead.
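The question “Do the scenarios and policies pass?” can be sketched as an acceptance gate. The example below is a minimal illustration under stated assumptions: satisfaction is taken to be the fraction of scenario runs judged satisfactory, the 0.95 threshold is arbitrary, and the names `ScenarioResult` and `acceptance_gate` are hypothetical rather than drawn from any published ASDM tooling.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ScenarioResult:
    name: str
    satisfied: int  # runs judged to satisfy the scenario
    runs: int       # total runs attempted


def satisfaction(results: List[ScenarioResult]) -> float:
    """Fraction of all scenario runs that were judged satisfactory."""
    total = sum(r.runs for r in results)
    return sum(r.satisfied for r in results) / total if total else 0.0


def acceptance_gate(
    results: List[ScenarioResult],
    policy_checks: Dict[str, bool],
    threshold: float = 0.95,  # assumed value, chosen only for illustration
) -> Tuple[bool, List[str]]:
    """Accept a change only if satisfaction meets the threshold and every policy passes."""
    reasons: List[str] = []
    score = satisfaction(results)
    if score < threshold:
        reasons.append(f"satisfaction {score:.3f} below threshold {threshold:.3f}")
    reasons.extend(
        f"policy failed: {name}" for name, ok in sorted(policy_checks.items()) if not ok
    )
    return (not reasons, reasons)
```

The gate returns its reasons alongside the verdict so that a rejection points to a missing scenario or a failed policy rather than to a reviewer's opinion, which is the substitution this section describes.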

The Critical Transitions

Moving between levels is not continuous improvement—each transition requires specific changes that can feel like regression before the benefits materialize.

Level 3 → Level 4: Accepting Process Change

The hardest part of moving from AI-Assisted to AI-Native is accepting that existing processes need replacement, not augmentation. Teams comfortable with Agile ceremonies resist replacing sprint planning with Mob Elaboration. Engineers skilled at code review resist restructuring around AI-generated code.

The transition requires leadership commitment to process experimentation and tolerance for a temporary productivity loss while new patterns take hold.

Level 4 → Level 5: Trusting Asynchronous Execution

Moving from interactive orchestration to overnight autonomous sessions requires trust that wasn’t necessary before. Engineers must accept that code will be generated while they sleep, without the ability to course-correct in real time.

This transition also requires infrastructure for extended agent execution—checkpointing, failure recovery, and reliable long-running processes that most organizations haven’t built.
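Checkpointing and failure recovery for a long-running session can be sketched as follows. This is a minimal illustration, not a description of any vendor's agent runtime: the `execute` callback stands in for whatever performs one unit of agent work, and the JSON checkpoint layout is an assumption made for the example.

```python
import json
import os
from typing import Callable, Dict, List


def _save(state: Dict, path: str) -> None:
    """Write the checkpoint atomically so a crash never leaves a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run_with_checkpoints(
    steps: List[str],
    execute: Callable[[str], str],
    path: str,
    max_retries: int = 2,
) -> Dict:
    """Run steps in order, skipping any already recorded in the checkpoint file."""
    state: Dict = {"done": {}}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)
        state.pop("failed", None)  # a resumed run retries the failed step

    for step in steps:
        if step in state["done"]:
            continue  # completed in an earlier run
        for attempt in range(max_retries + 1):
            try:
                result = execute(step)
                break
            except Exception as exc:
                if attempt == max_retries:
                    # Record the failure and stop; completed work is preserved.
                    state["failed"] = {"step": step, "error": str(exc)}
                    _save(state, path)
                    return state
        state["done"][step] = result
        _save(state, path)
    return state
```

Because progress is persisted after every step, an overnight session that dies at step 40 of 50 resumes at step 40 rather than starting over, which is the property that makes unattended execution tolerable.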

Level 5 → Level 6: Eliminating Human Code Review

Removing human code review feels like removing a safety net, because it is. Teams must build confidence that scenario-based validation catches what human review would catch, and accept the cases where it doesn’t.

This transition is not for every organization. Many will find Level 5 the appropriate equilibrium—significant productivity gains with maintained human oversight. Level 6 is for organizations where the economics and risk tolerance align with fully autonomous coding and review.

Level 6 → Level 7: Removing Routine Operational Gates

Level 7 is a different kind of trust boundary. A team may accept no human code review while still requiring human approval for deployment, rollout, rollback, customer exposure, and incident response. Moving to Level 7 requires confidence that operational policies, telemetry, SLOs, canaries, and remediation systems can make routine production decisions.
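A routine production decision of this kind can be sketched as a canary evaluation against SLO thresholds. The example is a hedged illustration: the thresholds, the minimum sample size, and the rule that near-limit results escalate to a human are all assumptions chosen for the sketch, not values from any of the organizations discussed.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROMOTE = "promote"
    ROLLBACK = "rollback"
    HOLD = "hold"          # keep observing, not enough traffic yet
    ESCALATE = "escalate"  # exception path: hand the decision to a human


@dataclass(frozen=True)
class Slo:
    max_error_rate: float
    max_p99_latency_ms: float


@dataclass(frozen=True)
class CanaryMetrics:
    requests: int
    errors: int
    p99_latency_ms: float


def decide(
    metrics: CanaryMetrics,
    slo: Slo,
    min_requests: int = 1000,    # assumed minimum sample before deciding
    caution_margin: float = 0.8,  # assumed: above 80% of a limit counts as ambiguous
) -> Decision:
    """Map canary telemetry to a rollout decision under a simple policy."""
    if metrics.requests < min_requests:
        return Decision.HOLD
    error_rate = metrics.errors / metrics.requests
    if error_rate > slo.max_error_rate or metrics.p99_latency_ms > slo.max_p99_latency_ms:
        return Decision.ROLLBACK
    near_limit = (
        error_rate > caution_margin * slo.max_error_rate
        or metrics.p99_latency_ms > caution_margin * slo.max_p99_latency_ms
    )
    return Decision.ESCALATE if near_limit else Decision.PROMOTE
```

The `ESCALATE` branch reflects the Level 7 human role of handling exceptions only: clear passes and clear failures are decided by policy, and only ambiguous cases reach a person.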

This transition is likely to lag Level 6 adoption. The blast radius is larger, the compliance questions are harder, and the failure modes are more visible to customers.

What Each Level Requires

The infrastructure and organizational requirements escalate significantly across Levels 4-7. Each level builds on the previous—Level 7 organizations need everything from Levels 4, 5, and 6, plus additional operational capabilities.

Requirement Category	Level 4	Level 5	Level 6	Level 7
Tooling	AI-native IDE, prompt management	Extended execution infrastructure	Convergence orchestration	Autonomous deployment and remediation
Documentation	Steering files (AGENTS.md)	Agent execution logs, traces	Scenario libraries, provenance records	Operational policies and incident playbooks
Testing	CI/CD optimized for velocity	Local integration tests, 25% effort allocation	Digital Twin Universe, satisfaction metrics	Incident simulation and production canaries
Validation	Prompt benchmarks	LLM-as-a-Judge, batch review tooling	Probabilistic satisfaction measurement	SLO and telemetry-based operational validation
Communication	Context specs, ADRs	High-bandwidth standups, co-location	Policy governance forums	Exception review and operational risk forums
Skills	Prompt calibration, context management	Trust in async execution, batch review	Scenario authoring, specification ownership	Operational policy design and systems governance
Governance	Prompt review practices	Escalation policies	Full governance-by-policy engine	Production authority and automated compliance evidence
Budget	Standard AI tooling costs	Moderate token increase	$1,000+/day per engineer in tokens	Level 6 plus continuous operations and simulation costs

The investment required for Level 6 is substantial. Level 7 adds production operations risk and cost on top. Organizations should honestly assess whether the productivity gains justify the infrastructure, tooling, token costs, and operational risk before committing.

Concerns and When Not to Advance

Higher autonomy is not universally better. Each level introduces new risks and may be inappropriate for certain contexts.

Cost Concerns

Level 6 token costs are an order of magnitude higher than at Levels 3-5. The $1,000/day per engineer baseline suggested by StrongDM works out to $240,000/year in tokens alone per engineer at roughly 240 working days, potentially exceeding salary costs. Level 7 may add continuous simulation, telemetry analysis, remediation, and incident rehearsal costs.

The economics only work if productivity gains justify the investment. Organizations should model expected throughput against token costs before committing.
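The arithmetic behind that modeling can be sketched in a few lines. The example assumes 240 working days per year, which is what makes $1,000/day come to $240,000/year. The $200,000 salary used in the usage note is a hypothetical figure, and the baseline engineer's own AI tooling costs are ignored for simplicity.

```python
def annual_token_cost(daily_usd: float, working_days: int = 240) -> float:
    """Yearly token spend per engineer, assuming spend occurs only on working days."""
    return daily_usd * working_days


def break_even_multiplier(
    salary_usd: float,
    daily_token_usd: float,
    working_days: int = 240,
) -> float:
    """Throughput multiple needed so that cost per unit of output matches
    a baseline engineer who incurs salary only (a simplifying assumption)."""
    total = salary_usd + annual_token_cost(daily_token_usd, working_days)
    return total / salary_usd
```

For example, `break_even_multiplier(200_000, 1_000)` evaluates to 2.2: under these assumptions, a Level 6 engineer must deliver 2.2 times the baseline output before the token spend pays for itself.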

Moat Erosion

Simon Willison raises a strategic concern:

“Could competitors clone your newest features with a few hours of coding agent work?”

If autonomous coding commoditizes implementation, competitive advantage shifts to other factors: specifications, scenarios, Digital Twin quality, data, and go-to-market. Organizations relying on implementation complexity as a moat should consider this carefully.

Regulatory and Compliance Constraints

Some industries require human review for compliance reasons:

Financial services with audit requirements
Healthcare with patient safety regulations
Defense and government contracting
Any domain with legal liability for software failures

For these contexts, Level 5 or Level 6 may be the maximum achievable autonomy. Governance-by-policy can document intent, but regulators may require human attestation before deployment or during production operations.

Quality and Maintainability Risks

AI-generated code can accumulate technical debt invisibly. Systems optimized for scenario satisfaction may degrade in maintainability over time. Without humans reading the code, architectural drift can go unnoticed until it manifests as system failures.

Level 6 requires exceptional observability and architectural governance to catch degradation that human reviewers would otherwise flag. Level 7 raises the bar further because the system must detect, decide, and act while software is already serving users.

Choosing Between Levels 5, 6, and 7

Many organizations will find Level 5 the optimal target. Some will adopt Level 6 for constrained domains where scenario coverage is strong. Level 7 requires substantial additional investment and tradeoffs that few contexts will justify early.

Factor	Level 5 Optimal	Level 6 Optimal	Level 7 Optimal
Productivity goal	Significant gains sufficient	Maximum coding/review throughput	Maximum delivery and operations throughput
Error tolerance	Human review catches subtle bugs	Scenario coverage sufficient before production	Telemetry and rollback sufficient in production
Infrastructure budget	Moderate investment	High investment (DTU, scenarios, provenance)	Level 6 plus autonomous ops infrastructure
Regulatory context	Compliance requires human code attestation	No human review mandate	No human operational approval mandate
Codebase understanding	Team wants to maintain code literacy	Acceptable tradeoff for velocity	Acceptable tradeoff for operational speed
Token budget	Standard AI tooling costs	$1,000+/day per engineer	Level 6 plus continuous operations spend
Domain characteristics	Complex, ambiguous requirements	High-volume, well-specified domains	High automation tolerance and strong rollback paths
Risk tolerance	Conservative, proven patterns	Willing to remove code review	Willing to remove routine production gates

Most organizations should target Level 5 first and evaluate Level 6 only after operating successfully at Level 5 for an extended period. Level 7 should come later, after Level 6 validation infrastructure has proven itself and the organization has operational guardrails strong enough to tolerate autonomous production decisions.

How ASDM Compares to Dan Shapiro’s Five Levels

Dan Shapiro’s “Five Levels: from Spicy Autocomplete to the Dark Factory” is one of the clearest public framings of the same general shift. His model starts with an individual developer experience: from manual coding, to autocomplete, to pairing with AI, to managing coding agents, to a black-box system that turns specifications into software.

ASDM uses a different unit of analysis. It is an organizational software delivery model, not primarily an individual developer adoption ladder. That is why ASDM starts with Agile and DevOps as pre-AI organizational baselines, then separates AI-native workflows, agentic engineering, autonomous coding and review, and autonomous operations.

The two models rhyme, but they answer different questions:

Question	Shapiro’s Five Levels	ASDM’s Seven Levels
Primary lens	Individual developer workflow	Organizational delivery capability
Starting point	Manual coding and AI assistance	Agile and DevOps baselines
Middle transition	Developer becomes manager/PM of agents	Organization redesigns process, validation, governance
Factory endpoint	Specs turn into software without normal developer involvement	Intent flows through coding, validation, deployment, and operations under policy
Key distinction	How a person works with AI	What an organization can safely delegate to AI

Both models end near the same frontier: software production with humans outside the normal implementation path. ASDM splits that frontier into two levels because code generation and production operation are different trust boundaries. StrongDM’s Software Factory is strong evidence for Level 6: no human coding or code review, scenario validation, and Digital Twin testing. Level 7 requires a further claim: autonomous systems operate production software without routine human gates.

Conclusion

Software delivery is evolving from human-executed to machine-operated. This is not speculation—it is observable in the practices of teams at AWS, Spotify, StrongDM, and others.

The seven Software Delivery Autonomy Levels provide a framework for understanding this progression:

Levels 1-2 established pre-AI baselines: team coordination and pipeline automation.
Level 3 augmented existing processes with AI tooling.
Level 4 redesigns processes around AI participation.
Level 5 enables extended autonomous execution with batch human review.
Level 6 eliminates human coding and code review, replacing it with scenario-based validation.
Level 7 removes routine operational gates, replacing them with policy-governed autonomous operations.

Most organizations today operate between Levels 3 and 4. Leaders are reaching Level 5. A few pioneers are experimenting with Level 6. Level 7 remains the frontier.

ASDM provides the methodology for Levels 6 and 7—the practices, artifacts, and infrastructure required for autonomous coding, review, validation, and operations. Teams can adopt ASDM components incrementally as they progress through the levels.

The implications for product builders are significant. As implementation becomes increasingly automated, value shifts to:

Clear specification of intent
Comprehensive scenario definition
Validation infrastructure
Governance and policy design
Operational telemetry and remediation design

The builders who thrive in this environment will be those who master not just AI tooling, but the organizational and infrastructural transformations that enable autonomous software delivery.

The factory is coming online. The question is not whether to engage with it, but at what level, and how fast.

References

Magherimov, Joe. “Amazon Bedrock Mantle and Developing at the Speed of AI.” AWS Podcast. Episode 753. AWS. January 2026. podcasts.apple.com/us/podcast/753-amazon-bedrock-mantle-and-developing-at-the-speed-of-ai/id1122785133
AWS. “AI-Driven Development Life Cycle.” Amazon Web Services. January 2026. aws.amazon.com/blogs/devops/ai-driven-development-life-cycle
StrongDM. “Software Factory.” factory.strongdm.ai
Willison, Simon. “The Software Factory.” Simon Willison’s Weblog. February 2026. simonwillison.net/2026/Feb/7/software-factory
Shapiro, Dan. “The Five Levels: From Spicy Autocomplete to the Dark Factory.” January 2026. www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory
Spotify Engineering. “Honk: Spotify’s Path to Agentic AI Coding.” Spotify Engineering Blog. February 2026. engineering.atspotify.com/2026/02/honk-spotifys-path-to-agentic-ai-coding
TechCrunch. “Spotify’s best developers haven’t written a line of code since December, thanks to AI.” TechCrunch. February 2026. techcrunch.com/2026/02/spotify-developers-ai-coding