What are the DORA Software Delivery Metrics?
Understanding how your engineering teams perform is critical to improving software delivery outcomes. DORA metrics have become the industry standard for measuring software delivery performance—and in 2025, they’re more relevant than ever.
What is DORA?
DORA (DevOps Research and Assessment) is a research program that studies software delivery performance and publishes the annual State of DevOps report.
Google Cloud currently operates DORA, building on foundational work by Puppet and DORA co-founders Dr. Nicole Forsgren, Gene Kim, and Jez Humble. From 2014 to 2017, this team produced the report annually, articulating valid and reliable ways to measure software delivery performance.
They tied that performance to predictors that drive business outcomes, and in 2018, published the book Accelerate to share their research with the world. The resulting metrics act as leading indicators of business and team well-being, and as lagging indicators of the underlying engineering practices.
Armed with a scientific way to measure modern software development and the capabilities that impact it, the team has continued work on the study, publishing a new State of DevOps report every year.
The DORA Metrics Framework: Throughput and Instability
DORA measures software delivery performance across two key dimensions: throughput and instability. Taken together, these factors give teams a high-level understanding of their software delivery performance. Measuring them over time provides insight into how software delivery performance is changing.
Throughput
Throughput is a measure of how many changes can move through the system over a period of time. Higher throughput means that the system can move more changes through to the production environment.
DORA uses three factors to measure software delivery throughput:
Lead Time for Changes
The amount of time it takes for a change to go from being committed to version control to being deployed in production. This metric reflects how quickly teams can respond to changing customer needs and unexpected events. In mature agile or DevOps processes, commits happen early and often, providing a consistent and easily measurable way to collect data.
Note that the strict DORA definition does not capture all time spent by a developer working on code—only the time between the first commit and when code is deployed. The agile “cycle time” metric tries to capture how long a task takes from start to finish, but often uses an earlier definition of “done” than production deployment.
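As a rough illustration of how this metric can be computed, the sketch below takes a list of changes with first-commit and production-deploy timestamps and reports the median commit-to-deploy duration. The ChangeRecord type and the sample timestamps are assumptions made for the example, not part of the DORA definition; real pipelines would pull these timestamps from version control and from the deployment system.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Sketch: lead time for changes as the median commit-to-deploy duration.
public class LeadTimeForChanges {

    // Hypothetical record pairing a change's first commit with its production deployment.
    record ChangeRecord(Instant firstCommit, Instant deployedToProduction) {
        Duration leadTime() {
            return Duration.between(firstCommit, deployedToProduction);
        }
    }

    static Duration medianLeadTime(List<ChangeRecord> changes) {
        List<Duration> sorted = changes.stream()
                .map(ChangeRecord::leadTime)
                .sorted()
                .toList();
        return sorted.get(sorted.size() / 2); // simple median; skips even-size interpolation
    }

    public static void main(String[] args) {
        List<ChangeRecord> changes = List.of(
                new ChangeRecord(Instant.parse("2025-01-06T09:00:00Z"), Instant.parse("2025-01-07T15:00:00Z")),
                new ChangeRecord(Instant.parse("2025-01-08T10:30:00Z"), Instant.parse("2025-01-08T16:00:00Z")),
                new ChangeRecord(Instant.parse("2025-01-09T11:00:00Z"), Instant.parse("2025-01-12T09:00:00Z")));
        System.out.println("Median lead time (hours): " + medianLeadTime(changes).toHours());
    }
}
```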
Deployment Frequency
The number of deployments over a given period or the time between deployments. This measures how many times development teams successfully deploy changes to production or release them to end users.
The definition of success is variable and depends on the system in question, but the crucial point is that this metric doesn’t just measure the volume of change: deployments that always break something, or that only reach a small percentage of users, might not count. While not every change makes a meaningful impact, deployment frequency provides a useful proxy for how quickly teams can deliver value to end users.
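A minimal sketch of how deployment frequency might be derived from a log of successful production deployments follows. The timestamps and the two-week observation window are illustrative assumptions; in practice the data would come from your CI/CD or release tooling.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Sketch: deployment frequency as successful production deployments per week.
public class DeploymentFrequency {

    static double deploysPerWeek(List<Instant> successfulDeploys, Instant windowStart, Instant windowEnd) {
        long count = successfulDeploys.stream()
                .filter(d -> !d.isBefore(windowStart) && d.isBefore(windowEnd))
                .count();
        double weeks = Duration.between(windowStart, windowEnd).toDays() / 7.0;
        return count / weeks;
    }

    public static void main(String[] args) {
        // Hypothetical deployment log covering a two-week window.
        List<Instant> deploys = List.of(
                Instant.parse("2025-01-06T12:00:00Z"),
                Instant.parse("2025-01-08T12:00:00Z"),
                Instant.parse("2025-01-10T12:00:00Z"));
        System.out.printf("Deployments per week: %.1f%n",
                deploysPerWeek(deploys,
                        Instant.parse("2025-01-06T00:00:00Z"),
                        Instant.parse("2025-01-20T00:00:00Z")));
    }
}
```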
Failed Deployment Recovery Time
The time it takes to recover from a deployment that fails and requires immediate intervention. This metric focuses on how long it takes to diagnose, develop, and deploy a fix when problems are detected.
Failed deployment recovery time is important because while slow deployment of change has an opportunity cost in terms of business value, change failures are likely to have an actively negative impact on business performance and customer satisfaction.
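One way to measure this, sketched below, is the average gap between detecting a failed deployment and restoring service with a fix or rollback. The FailedDeployment record and the sample incidents are hypothetical; real values would come from incident or monitoring tools.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Sketch: failed deployment recovery time, from failure detection to service restoration.
public class FailedDeploymentRecoveryTime {

    // Hypothetical record for a deployment failure that needed immediate intervention.
    record FailedDeployment(Instant failureDetected, Instant serviceRestored) {
        Duration recoveryTime() {
            return Duration.between(failureDetected, serviceRestored);
        }
    }

    static Duration averageRecoveryTime(List<FailedDeployment> failures) {
        long totalMinutes = failures.stream()
                .mapToLong(f -> f.recoveryTime().toMinutes())
                .sum();
        return Duration.ofMinutes(totalMinutes / failures.size());
    }

    public static void main(String[] args) {
        List<FailedDeployment> failures = List.of(
                new FailedDeployment(Instant.parse("2025-02-03T14:00:00Z"), Instant.parse("2025-02-03T14:45:00Z")),
                new FailedDeployment(Instant.parse("2025-02-10T09:15:00Z"), Instant.parse("2025-02-10T11:15:00Z")));
        System.out.println("Average recovery time (minutes): " + averageRecoveryTime(failures).toMinutes());
    }
}
```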
Instability
Instability is a measure of how often software deployments go wrong. When instability is low, deployments go smoothly: teams can confidently push more changes into production, and users are less likely to experience issues with the application immediately following a deployment.
DORA uses two factors to measure software delivery instability:
Change Failure Rate
The percentage of deployments that require immediate intervention after reaching production, likely resulting in a rollback of the changes or a “hotfix” to quickly remediate any issues.
Change failure rate provides a window into the amount of time spent by teams on rework rather than high-value new development. It can also be combined with other metrics to provide a view of the impact of change failures on customer satisfaction.
Teams new to agile or DevOps may fear that improving deployment frequency and lead time will result in a higher change failure rate. In robust processes, the opposite is true. Small deployments—however frequent—are typically better understood and carry less risk because they simply involve less change. DORA’s longitudinal studies show that speed and stability are positively correlated, not competing priorities.
Rework Rate
The ratio of deployments that are unplanned but happen as a result of an incident in production. This captures the reactive work that teams must do when production issues arise, diverting resources from planned feature development.
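Both instability metrics are simple ratios over a window of production deployments, as the sketch below shows. The DeploymentOutcome categories and the counts are illustrative assumptions; real figures would be derived from deployment records and incident tracking.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: change failure rate and rework rate as percentages of all deployments.
public class InstabilityMetrics {

    // Hypothetical outcome categories for each production deployment.
    enum DeploymentOutcome {
        PLANNED_SUCCESS,        // planned change, no intervention required
        REQUIRED_INTERVENTION,  // counts toward change failure rate
        UNPLANNED_INCIDENT_FIX  // unplanned deployment responding to an incident; counts toward rework rate
    }

    static double ratePercent(List<DeploymentOutcome> deployments, DeploymentOutcome outcome) {
        long matching = deployments.stream().filter(d -> d == outcome).count();
        return 100.0 * matching / deployments.size();
    }

    public static void main(String[] args) {
        List<DeploymentOutcome> deployments = new ArrayList<>();
        deployments.addAll(Collections.nCopies(40, DeploymentOutcome.PLANNED_SUCCESS));
        deployments.addAll(Collections.nCopies(6, DeploymentOutcome.REQUIRED_INTERVENTION));
        deployments.addAll(Collections.nCopies(4, DeploymentOutcome.UNPLANNED_INCIDENT_FIX));

        System.out.printf("Change failure rate: %.0f%%%n", ratePercent(deployments, DeploymentOutcome.REQUIRED_INTERVENTION));
        System.out.printf("Rework rate: %.0f%%%n", ratePercent(deployments, DeploymentOutcome.UNPLANNED_INCIDENT_FIX));
    }
}
```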
The Legacy Bottleneck: A Critical Team Archetype
The 2025 DORA research identified seven distinct team archetypes based on their software delivery patterns. One of the most concerning is Cluster 2: The Legacy Bottleneck.
What Is the Legacy Bottleneck?
Teams in the legacy bottleneck cluster are in a constant state of reaction, where unstable systems dictate their work and undermine their morale. According to the 2025 research, 11% of survey respondents fall into this cluster.
Key characteristics of legacy bottleneck teams:
- Low performance indicators: Key metrics for product performance are low. While the team delivers regular updates, the value realized is diminished by ongoing quality issues.
- Elevated burnout and friction: Team well-being data indicates a demanding work environment. Team members report elevated levels of friction and burnout.
- System instability: There are significant and frequent challenges with the stability of the software and its operational environment, leading to a high volume of unplanned, reactive work.
Why Unit Testing Is Critical for Escaping the Legacy Bottleneck
Teams trapped in the legacy bottleneck share a common technical root cause: insufficient automated testing. Without comprehensive test coverage, every code change becomes a risk that can introduce new instabilities—perpetuating the cycle of reactive work and burnout.
Unit testing breaks this cycle by:
- Catching regressions early: Automated unit tests identify breaking changes before they reach production, reducing the change failure rate and the need for emergency fixes (see the sketch after this list).
- Enabling confident refactoring: Teams can modernize legacy code safely when comprehensive tests document existing behavior and catch unintended changes.
- Reducing unplanned work: With fewer production incidents, teams can shift from reactive firefighting to proactive feature development.
- Improving team well-being: Reduced pressure from system instability directly addresses the elevated burnout and friction reported by legacy bottleneck teams.
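To make this concrete, here is a minimal characterization-style JUnit 5 sketch that pins down what a piece of legacy code currently does. The LegacyDiscountCalculator class is a hypothetical stand-in defined inline so the example compiles; in a real codebase it would be an existing production class with no tests.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;

import org.junit.jupiter.api.Test;

// Sketch: characterization tests that document existing behavior and catch regressions.
class LegacyDiscountCalculatorTest {

    // Hypothetical stand-in for an inherited legacy class.
    static class LegacyDiscountCalculator {
        double discountedTotal(double orderTotal, boolean loyalCustomer) {
            if (orderTotal < 0) {
                throw new IllegalArgumentException("Order total must not be negative");
            }
            return loyalCustomer ? orderTotal * 0.9 : orderTotal;
        }
    }

    @Test
    void loyalCustomerGetsTenPercentDiscount() {
        LegacyDiscountCalculator calculator = new LegacyDiscountCalculator();
        // Pins the current behavior: 10% off a 200.00 order for a loyal customer.
        assertEquals(180.00, calculator.discountedTotal(200.00, true), 0.001);
    }

    @Test
    void negativeOrderTotalIsRejected() {
        LegacyDiscountCalculator calculator = new LegacyDiscountCalculator();
        // The legacy code throws rather than returning a negative total; this test keeps it that way.
        assertThrows(IllegalArgumentException.class, () -> calculator.discountedTotal(-5.00, false));
    }
}
```

A suite of tests like this, once in place, turns every subsequent refactor or dependency upgrade into a change that can be verified in seconds rather than discovered in production.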
The challenge is that manually writing comprehensive unit test suites for large legacy applications takes considerable time and effort—research suggests at least 20% of developer time is spent on writing unit tests. This creates a catch-22: teams need tests to escape the legacy bottleneck, but they don’t have time to write tests because they’re trapped in reactive work.
Diffblue Cover solves this problem by using AI to write and maintain entire Java unit test suites completely autonomously. Cover can rapidly achieve comprehensive coverage on legacy codebases—exactly the kind of “unknown” inherited systems where legacy bottleneck teams struggle most.
AI Adoption Across the Software Development Lifecycle
The 2025 State of AI-Assisted Software Development research reveals significant AI adoption across development tasks, with important implications for how teams approach automation.
Where Developers Are Using AI
Among developers who perform specific tasks, AI assistance is now common across the software development lifecycle:
| Task | AI Usage Rate | Task Performance Rate (% of Respondents Performing the Task) |
|---|---|---|
| Writing new code | 71% | 60% |
| Modifying existing code | 66% | 55% |
| Writing documentation | 64% | 59% |
| Creating test cases | 62% | 36% |
| Debugging | 59% | 49% |
| Code review | 56% | 53% |
| Maintaining legacy code | 55% | 35% |
Critical Insight: The Testing Gap
While 62% of developers who create test cases use AI to assist, only 36% of all respondents perform this task. This reveals a significant gap: testing remains a specialized activity rather than a universal practice—even as AI tools become available.
This gap creates systemic risk:
- Teams that don’t prioritize testing become legacy bottleneck candidates
- AI coding assistants can increase code velocity without corresponding test coverage
- The result is faster accumulation of untested code—accelerating technical debt
The Difference Between AI-Assisted and Autonomous Testing
Most AI tools today offer AI-assisted capabilities—they help developers write tests faster but still require developer initiative, prompts, and oversight. This approach works well for new feature development where developers are actively engaged.
However, the 2025 research shows that maintaining legacy code has only a 35% task performance rate despite 55% AI usage among those who do it. This suggests that even with AI assistance, legacy code remains neglected because:
- Developers must understand code before prompting AI tools to test it
- Legacy systems often lack the context AI assistants need
- The cognitive burden of working with unfamiliar code remains high
Autonomous testing takes a fundamentally different approach. Diffblue Cover doesn’t require developers to understand code first—it analyzes compiled bytecode and reverse-engineers behavior to automatically generate comprehensive tests. This is particularly powerful for:
- Inherited codebases from M&A activity where documentation is fiction but code is truth
- Large-scale coverage goals where manual writing would take years
- Legacy modernization where teams need tests before they can safely refactor
AI for Speed vs. AI for Safety
The research points to a clear division in AI tool purposes:
AI for New Code Velocity (Coding Assistants like GitHub Copilot)
- Writing new code
- Assisting modification of existing code
- Best for: Greenfield development, feature velocity
AI for Safety (Autonomous Testing like Diffblue Cover)
- Creating comprehensive regression suites at scale
- Documenting actual behavior of unknown systems
- Best for: Legacy code, compliance requirements, production stability
Elite-performing teams use both—AI assistants to accelerate new development and autonomous testing to ensure that acceleration doesn’t compromise stability.
How Unit Testing Improves DORA Metrics
Effective unit testing is a powerful way to improve software delivery performance across all DORA dimensions.
Impact on Throughput
Faster lead time for changes: When developers can rely on automated tests to catch regressions, they spend less time on manual verification and can commit code with confidence. Tests that run in CI/CD pipelines provide immediate feedback, accelerating the path from commit to production.
Higher deployment frequency: Teams with comprehensive test coverage can deploy more frequently because each deployment carries less risk. Small, well-tested changes are safer than large, untested ones—enabling the “deploy early, deploy often” practices that DORA research shows lead to better outcomes.
Faster failed deployment recovery: When failures do occur, unit tests help isolate the cause quickly. A failing test immediately points to what broke, reducing mean time to diagnosis from hours to minutes.
Impact on Instability
Lower change fail rate: Unit tests catch breaking changes before they reach production. The early-warning system that comprehensive testing provides is the foundation of deployment stability.
Lower rework rate: When tests catch issues early, teams spend less time on unplanned production incidents. This shifts engineering capacity from reactive firefighting to proactive value delivery.
The Productivity Challenge
Few software development teams today are being asked to do less work. Backlogs continue to grow as applications become more complex and business-critical.
Assuming all those user stories really are necessary, that leaves two basic choices: add more developers to the team, or find ways to get more done with the people you already have.
Option 2 is the go-to for most, because finding the right people with the right skills is slow and expensive, even if you have the extra headcount budget. That’s where modern development techniques like agile and DevOps come in, along with automation tools, making developer productivity improvements essential.
But writing and maintaining comprehensive unit test suites for large applications takes considerable time—time that doesn’t directly deliver business value. This creates tension: testing is essential for sustainable velocity, but testing effort competes with feature development for limited engineering capacity.
Autonomous Testing: Solving the DORA Optimization Challenge
Diffblue Cover solves this problem for Java teams by using AI to write and maintain entire Java unit test suites completely autonomously. Cover operates at any scale, from method-level within your IDE to across an entire codebase as an integrated part of your automated CI/CD pipeline.
How Diffblue Cover Improves Each DORA Metric
Lead Time for Changes
Cover generates tests automatically as part of your CI/CD pipeline, eliminating the testing bottleneck that often delays deployments. Developers no longer need to pause feature work to write tests, because the tests are created automatically.
Deployment Frequency
With comprehensive test coverage maintained automatically, teams can deploy with confidence. The risk that typically limits deployment frequency—fear of unknown regressions—is eliminated by autonomous test coverage.
Failed Deployment Recovery Time
When issues are detected, Cover’s tests help isolate failures quickly. Comprehensive coverage means better observability into what changed and what broke.
Change Failure Rate
Autonomous test generation achieves 80%+ coverage across large codebases, catching the regressions that manual testing misses. Cover’s tests are deterministic and reliable, unlike LLM-generated suggestions, which are probabilistic.
Rework Rate
By catching issues before production, Cover reduces the unplanned work that drives rework. Teams can focus on planned feature development instead of reactive fixes.
Escaping the Legacy Bottleneck with Diffblue Cover
For teams trapped in the legacy bottleneck archetype, Cover provides a path out:
- Immediate coverage: Achieve comprehensive test coverage in weeks, not years, even on legacy codebases with no existing tests
- No code understanding required: Cover analyzes compiled bytecode, so developers don’t need to understand legacy code before testing it
- Behavioral documentation: Generated tests document what code actually does—critical for inherited systems where documentation is outdated or missing
- Safe modernization: With tests in place, teams can confidently refactor legacy code without fear of introducing regressions
Getting Started
Learn more about DORA research and metrics at dora.dev, or get in touch with our team to learn more about autonomous Java unit test writing.
Ready to improve your DORA metrics?
- Try Diffblue Cover to see autonomous test generation in action
- Calculate your ROI to understand the impact on your team