What are the DORA Software Delivery Metrics?
Understanding how your engineering teams perform is critical to improving software delivery outcomes. DORA metrics have become the industry standard for measuring software delivery performance—and in 2025, they’re more relevant than ever.
What is DORA?
DORA (DevOps Research and Assessment) is a research program that studies software delivery performance and publishes the annual State of DevOps report.
Google Cloud currently operates DORA, building on foundational work by Puppet and DORA co-founders Dr. Nicole Forsgren, Gene Kim, and Jez Humble. From 2014 to 2017, this team produced the report annually, articulating valid and reliable ways to measure software delivery performance.
They tied that performance to predictors that drive business outcomes, and in 2018, published the book Accelerate to share their research with the world. The resulting metrics act as leading indicators of business and team well-being, and as lagging indicators of the underlying engineering practices.
Armed with a scientific way to measure modern software development and the capabilities that impact it, the team has continued work on the study, publishing a new State of DevOps report every year.
The DORA Metrics Framework: Throughput and Instability
DORA measures software delivery performance across two key dimensions: throughput and instability. Taken together, these factors give teams a high-level understanding of their software delivery performance. Measuring them over time provides insight into how software delivery performance is changing.
Throughput
Throughput is a measure of how many changes can move through the system over a period of time. Higher throughput means that the system can move more changes through to the production environment.
DORA uses three factors to measure software delivery throughput:
Lead Time for Changes
The amount of time it takes for a change to go from being committed to version control to being deployed in production. This metric reflects how quickly teams can respond to changing customer needs and unexpected events. In mature agile or DevOps processes, commits happen early and often, providing a consistent and easily measurable way to collect data.
Note that the strict DORA definition does not capture all time spent by a developer working on code—only the time between the first commit and when code is deployed. The agile “cycle time” metric tries to capture how long a task takes from start to finish, but often uses an earlier definition of “done” than production deployment.
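As a rough illustration of how this metric can be computed, the sketch below takes a list of changes with first-commit and production-deploy timestamps and reports the median commit-to-deploy duration. The ChangeRecord type and the sample timestamps are assumptions made for the example, not part of the DORA definition; real pipelines would pull these timestamps from version control and from the deployment system.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Sketch: lead time for changes as the median commit-to-deploy duration.
public class LeadTimeForChanges {

    // Hypothetical record pairing a change's first commit with its production deployment.
    record ChangeRecord(Instant firstCommit, Instant deployedToProduction) {
        Duration leadTime() {
            return Duration.between(firstCommit, deployedToProduction);
        }
    }

    static Duration medianLeadTime(List<ChangeRecord> changes) {
        List<Duration> sorted = changes.stream()
                .map(ChangeRecord::leadTime)
                .sorted()
                .toList();
        return sorted.get(sorted.size() / 2); // simple median; skips even-size interpolation
    }

    public static void main(String[] args) {
        List<ChangeRecord> changes = List.of(
                new ChangeRecord(Instant.parse("2025-01-06T09:00:00Z"), Instant.parse("2025-01-07T15:00:00Z")),
                new ChangeRecord(Instant.parse("2025-01-08T10:30:00Z"), Instant.parse("2025-01-08T16:00:00Z")),
                new ChangeRecord(Instant.parse("2025-01-09T11:00:00Z"), Instant.parse("2025-01-12T09:00:00Z")));
        System.out.println("Median lead time (hours): " + medianLeadTime(changes).toHours());
    }
}
```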
Deployment Frequency
The number of deployments over a given period or the time between deployments. This measures how many times development teams successfully deploy changes to production or release them to end users.
The definition of success is variable and depends on the system in question, but the crucial point is that this metric doesn’t just measure the volume of change: deployments that always break something, or that only reach a small percentage of users, might not count. While not every change makes a meaningful impact, deployment frequency provides a useful proxy for how quickly teams can deliver value to end users.
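A minimal sketch of how deployment frequency might be derived from a log of successful production deployments follows. The timestamps and the two-week observation window are illustrative assumptions; in practice the data would come from your CI/CD or release tooling.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Sketch: deployment frequency as successful production deployments per week.
public class DeploymentFrequency {

    static double deploysPerWeek(List<Instant> successfulDeploys, Instant windowStart, Instant windowEnd) {
        long count = successfulDeploys.stream()
                .filter(d -> !d.isBefore(windowStart) && d.isBefore(windowEnd))
                .count();
        double weeks = Duration.between(windowStart, windowEnd).toDays() / 7.0;
        return count / weeks;
    }

    public static void main(String[] args) {
        // Hypothetical deployment log covering a two-week window.
        List<Instant> deploys = List.of(
                Instant.parse("2025-01-06T12:00:00Z"),
                Instant.parse("2025-01-08T12:00:00Z"),
                Instant.parse("2025-01-10T12:00:00Z"));
        System.out.printf("Deployments per week: %.1f%n",
                deploysPerWeek(deploys,
                        Instant.parse("2025-01-06T00:00:00Z"),
                        Instant.parse("2025-01-20T00:00:00Z")));
    }
}
```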
Failed Deployment Recovery Time
The time it takes to recover from a deployment that fails and requires immediate intervention. This metric focuses on how long it takes to diagnose, develop, and deploy a fix when problems are detected.
Failed deployment recovery time is important because while slow deployment of change has an opportunity cost in terms of business value, change failures are likely to have an actively negative impact on business performance and customer satisfaction.
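One way to measure this, sketched below, is the average gap between detecting a failed deployment and restoring service with a fix or rollback. The FailedDeployment record and the sample incidents are hypothetical; real values would come from incident or monitoring tools.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Sketch: failed deployment recovery time, from failure detection to service restoration.
public class FailedDeploymentRecoveryTime {

    // Hypothetical record for a deployment failure that needed immediate intervention.
    record FailedDeployment(Instant failureDetected, Instant serviceRestored) {
        Duration recoveryTime() {
            return Duration.between(failureDetected, serviceRestored);
        }
    }

    static Duration averageRecoveryTime(List<FailedDeployment> failures) {
        long totalMinutes = failures.stream()
                .mapToLong(f -> f.recoveryTime().toMinutes())
                .sum();
        return Duration.ofMinutes(totalMinutes / failures.size());
    }

    public static void main(String[] args) {
        List<FailedDeployment> failures = List.of(
                new FailedDeployment(Instant.parse("2025-02-03T14:00:00Z"), Instant.parse("2025-02-03T14:45:00Z")),
                new FailedDeployment(Instant.parse("2025-02-10T09:15:00Z"), Instant.parse("2025-02-10T11:15:00Z")));
        System.out.println("Average recovery time (minutes): " + averageRecoveryTime(failures).toMinutes());
    }
}
```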
Instability
Instability is a measure of how often software deployments go wrong. When instability is low, deployments go smoothly: teams can confidently push more changes into production, and users are less likely to experience issues with the application immediately following a deployment.
DORA uses two factors to measure software delivery instability:
Change Failure Rate
The percentage of deployments that require immediate intervention after reaching production, likely resulting in a rollback of the changes or a “hotfix” to quickly remediate any issues.
Change failure rate provides a window into the amount of time spent by teams on rework rather than high-value new development. It can also be combined with other metrics to provide a view of the impact of change failures on customer satisfaction.
Teams new to agile or DevOps may fear that improving deployment frequency and lead time will result in a higher change failure rate. In robust processes, the opposite is true. Small deployments—however frequent—are typically better understood and carry less risk because they simply involve less change. DORA’s longitudinal studies show that speed and stability are positively correlated, not competing priorities.
Rework Rate
The ratio of deployments that are unplanned but happen as a result of an incident in production. This captures the reactive work that teams must do when production issues arise, diverting resources from planned feature development.
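Both instability metrics are simple ratios over a window of production deployments, as the sketch below shows. The DeploymentOutcome categories and the counts are illustrative assumptions; real figures would be derived from deployment records and incident tracking.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: change failure rate and rework rate as percentages of all deployments.
public class InstabilityMetrics {

    // Hypothetical outcome categories for each production deployment.
    enum DeploymentOutcome {
        PLANNED_SUCCESS,        // planned change, no intervention required
        REQUIRED_INTERVENTION,  // counts toward change failure rate
        UNPLANNED_INCIDENT_FIX  // unplanned deployment responding to an incident; counts toward rework rate
    }

    static double ratePercent(List<DeploymentOutcome> deployments, DeploymentOutcome outcome) {
        long matching = deployments.stream().filter(d -> d == outcome).count();
        return 100.0 * matching / deployments.size();
    }

    public static void main(String[] args) {
        List<DeploymentOutcome> deployments = new ArrayList<>();
        deployments.addAll(Collections.nCopies(40, DeploymentOutcome.PLANNED_SUCCESS));
        deployments.addAll(Collections.nCopies(6, DeploymentOutcome.REQUIRED_INTERVENTION));
        deployments.addAll(Collections.nCopies(4, DeploymentOutcome.UNPLANNED_INCIDENT_FIX));

        System.out.printf("Change failure rate: %.0f%%%n", ratePercent(deployments, DeploymentOutcome.REQUIRED_INTERVENTION));
        System.out.printf("Rework rate: %.0f%%%n", ratePercent(deployments, DeploymentOutcome.UNPLANNED_INCIDENT_FIX));
    }
}
```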
The Legacy Bottleneck: A Critical Team Archetype
The 2025 DORA research identified seven distinct team archetypes based on their software delivery patterns. One of the most concerning is Cluster 2: The Legacy Bottleneck.
What Is the Legacy Bottleneck?
Teams in the legacy bottleneck cluster are in a constant state of reaction, where unstable systems dictate their work and undermine their morale. According to the 2025 research, 11% of survey respondents fall into this cluster.
Key characteristics of legacy bottleneck teams:
- Low performance indicators: Key metrics for product performance are low. While the team delivers regular updates, the value realized is diminished by ongoing quality issues.
- Elevated burnout and friction: Team well-being data indicates a demanding work environment. Team members report elevated levels of friction and burnout.
- System instability: There are significant and frequent challenges with the stability of the software and its operational environment, leading to a high volume of unplanned, reactive work.
Why Unit Testing Is Critical for Escaping the Legacy Bottleneck
Teams trapped in the legacy bottleneck share a common technical root cause: insufficient automated testing. Without comprehensive test coverage, every code change becomes a risk that can introduce new instabilities—perpetuating the cycle of reactive work and burnout.
Unit testing breaks this cycle by:
- Catching regressions early: Automated unit tests identify breaking changes before they reach production, reducing the change failure rate and the need for emergency fixes (see the sketch after this list).
- Enabling confident refactoring: Teams can modernize legacy code safely when comprehensive tests document existing behavior and catch unintended changes.
- Reducing unplanned work: With fewer production incidents, teams can shift from reactive firefighting to proactive feature development.
- Improving team well-being: Reduced pressure from system instability directly addresses the elevated burnout and friction reported by legacy bottleneck teams.
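To make this concrete, here is a minimal characterization-style JUnit 5 sketch that pins down what a piece of legacy code currently does. The LegacyDiscountCalculator class is a hypothetical stand-in defined inline so the example compiles; in a real codebase it would be an existing production class with no tests.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;

import org.junit.jupiter.api.Test;

// Sketch: characterization tests that document existing behavior and catch regressions.
class LegacyDiscountCalculatorTest {

    // Hypothetical stand-in for an inherited legacy class.
    static class LegacyDiscountCalculator {
        double discountedTotal(double orderTotal, boolean loyalCustomer) {
            if (orderTotal < 0) {
                throw new IllegalArgumentException("Order total must not be negative");
            }
            return loyalCustomer ? orderTotal * 0.9 : orderTotal;
        }
    }

    @Test
    void loyalCustomerGetsTenPercentDiscount() {
        LegacyDiscountCalculator calculator = new LegacyDiscountCalculator();
        // Pins the current behavior: 10% off a 200.00 order for a loyal customer.
        assertEquals(180.00, calculator.discountedTotal(200.00, true), 0.001);
    }

    @Test
    void negativeOrderTotalIsRejected() {
        LegacyDiscountCalculator calculator = new LegacyDiscountCalculator();
        // The legacy code throws rather than returning a negative total; this test keeps it that way.
        assertThrows(IllegalArgumentException.class, () -> calculator.discountedTotal(-5.00, false));
    }
}
```

A suite of tests like this, once in place, turns every subsequent refactor or dependency upgrade into a change that can be verified in seconds rather than discovered in production.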
The challenge is that manually writing comprehensive unit test suites for large legacy applications takes considerable time and effort—research suggests at least 20% of developer time is spent on writing unit tests. This creates a catch-22: teams need tests to escape the legacy bottleneck, but they don’t have time to write tests because they’re trapped in reactive work.
Diffblue Cover solves this problem by using AI to write and maintain entire Java unit test suites completely autonomously. Cover can rapidly achieve comprehensive coverage on legacy codebases—exactly the kind of “unknown” inherited systems where legacy bottleneck teams struggle most.
AI Adoption Across the Software Development Lifecycle
The 2025 State of AI-Assisted Software Development research reveals significant AI adoption across development tasks, with important implications for how teams approach automation.
Where Developers Are Using AI
Among developers who perform specific tasks, AI assistance is now common across the software development lifecycle:
| Task | AI Usage Rate | Task Performance Rate (% of Respondents Performing the Task) |
|---|---|---|
| Writing new code | 71% | 60% |
| Modifying existing code | 66% | 55% |
| Writing documentation | 64% | 59% |
| Creating test cases | 62% | 36% |
| Debugging | 59% | 49% |
| Code review | 56% | 53% |
| Maintaining legacy code | 55% | 35% |
Critical Insight: The Testing Gap
While 62% of developers who create test cases use AI to assist, only 36% of all respondents perform this task. This reveals a significant gap: testing remains a specialized activity rather than a universal practice—even as AI tools become available.
This gap creates systemic risk:
- Teams that don’t prioritize testing become legacy bottleneck candidates
- AI coding assistants can increase code velocity without corresponding test coverage
- The result is faster accumulation of untested code—accelerating technical debt
The Difference Between AI-Assisted and Autonomous Testing
Most AI tools today offer AI-assisted capabilities—they help developers write tests faster but still require developer initiative, prompts, and oversight. This approach works well for new feature development where developers are actively engaged.
However, the 2025 research shows that maintaining legacy code has only a 35% task performance rate despite 55% AI usage among those who do it. This suggests that even with AI assistance, legacy code remains neglected because:
- Developers must understand code before prompting AI tools to test it
- Legacy systems often lack the context AI assistants need
- The cognitive burden of working with unfamiliar code remains high
Autonomous testing takes a fundamentally different approach. Diffblue Cover doesn’t require developers to understand code first—it analyzes compiled bytecode and reverse-engineers behavior to automatically generate comprehensive tests. This is particularly powerful for:
- Inherited codebases from M&A activity where documentation is fiction but code is truth
- Large-scale coverage goals where manual writing would take years
- Legacy modernization where teams need tests before they can safely refactor
AI for Speed vs. AI for Safety
The research points to a clear division in AI tool purposes:
AI for New Code Velocity (Coding Assistants like GitHub Copilot)
- Writing new code
- Assisting modification of existing code
- Best for: Greenfield development, feature velocity
AI for Safety (Autonomous Testing like Diffblue Cover)
- Creating comprehensive regression suites at scale
- Documenting actual behavior of unknown systems
- Best for: Legacy code, compliance requirements, production stability
Elite-performing teams use both—AI assistants to accelerate new development and autonomous testing to ensure that acceleration doesn’t compromise stability.
How Unit Testing Improves DORA Metrics
Effective unit testing is a powerful way to improve software delivery performance across all DORA dimensions.
Impact on Throughput
Faster lead time for changes: When developers can rely on automated tests to catch regressions, they spend less time on manual verification and can commit code with confidence. Tests that run in CI/CD pipelines provide immediate feedback, accelerating the path from commit to production.
Higher deployment frequency: Teams with comprehensive test coverage can deploy more frequently because each deployment carries less risk. Small, well-tested changes are safer than large, untested ones—enabling the “deploy early, deploy often” practices that DORA research shows lead to better outcomes.
Faster failed deployment recovery: When failures do occur, unit tests help isolate the cause quickly. A failing test immediately points to what broke, reducing mean time to diagnosis from hours to minutes.
Impact on Instability
Lower change fail rate: Unit tests catch breaking changes before they reach production. The early-warning system that comprehensive testing provides is the foundation of deployment stability.
Lower rework rate: When tests catch issues early, teams spend less time on unplanned production incidents. This shifts engineering capacity from reactive firefighting to proactive value delivery.
The Productivity Challenge
Few software development teams today are being asked to do less work. Backlogs continue to grow as applications become more complex and business-critical.
Assuming all those user stories really are necessary, that leaves two basic choices: add more developers to the team, or find ways to get more done with the people you already have.
Option 2 is the go-to for most, because finding the right people with the right skills is slow and expensive, even if you have the extra headcount budget. That’s where modern development techniques like agile and DevOps come in, along with automation tools, making developer productivity improvements essential.
But writing and maintaining comprehensive unit test suites for large applications takes considerable time—time that doesn’t directly deliver business value. This creates tension: testing is essential for sustainable velocity, but testing effort competes with feature development for limited engineering capacity.
Autonomous Testing: Solving the DORA Optimization Challenge
Diffblue Cover solves this problem for Java teams by using AI to write and maintain entire Java unit test suites completely autonomously. Cover operates at any scale, from method-level within your IDE to across an entire codebase as an integrated part of your automated CI/CD pipeline.
How Diffblue Cover Improves Each DORA Metric
Lead Time for Changes
Cover generates tests automatically as part of your CI/CD pipeline, eliminating the testing bottleneck that often delays deployments. Developers no longer need to pause feature work to write tests, because the tests are created automatically.
Deployment Frequency
With comprehensive test coverage maintained automatically, teams can deploy with confidence. The risk that typically limits deployment frequency—fear of unknown regressions—is eliminated by autonomous test coverage.
Failed Deployment Recovery Time
When issues are detected, Cover’s tests help isolate failures quickly. Comprehensive coverage means better observability into what changed and what broke.
Change Failure Rate
Autonomous test generation achieves 80%+ coverage across large codebases, catching the regressions that manual testing misses. Cover’s tests are deterministic and reliable, unlike LLM-generated suggestions, which are probabilistic.
Rework Rate
By catching issues before production, Cover reduces the unplanned work that drives rework. Teams can focus on planned feature development instead of reactive fixes.
Escaping the Legacy Bottleneck with Diffblue Cover
For teams trapped in the legacy bottleneck archetype, Cover provides a path out:
- Immediate coverage: Achieve comprehensive test coverage in weeks, not years, even on legacy codebases with no existing tests
- No code understanding required: Cover analyzes compiled bytecode, so developers don’t need to understand legacy code before testing it
- Behavioral documentation: Generated tests document what code actually does—critical for inherited systems where documentation is outdated or missing
- Safe modernization: With tests in place, teams can confidently refactor legacy code without fear of introducing regressions
Getting Started
Learn more about DORA research and metrics at dora.dev, or get in touch with our team to learn more about autonomous Java unit test writing.
Ready to improve your DORA metrics?
- Try Diffblue Cover to see autonomous test generation in action
- Calculate your ROI to understand the impact on your team