Transcript

Adam Mackay:

Hi, everyone, and thanks for joining us today. Welcome to this Diffblue webinar on coverage, or more specifically, When Is Enough Enough?: Stop Setting Unrealistic Unit Test Coverage Goals. In industry, code coverage is often used as a proxy for the thoroughness of software testing, but does chasing higher coverage always add value? That’s what we’re here today to find out. I’m Adam Mackay, a Senior Product Manager at Diffblue, and I’ve got many years’ experience in managing, designing, and deploying software in the safety-critical domain. Today, I’m talking to Matthew Richards, the Head of Product at Diffblue. Matt spends a lot of his time talking to Java teams about their challenges and how automation can help. Prior to joining Diffblue, Matt spent 10 years at Cisco, leading international engineering teams and product teams. So, Matt, welcome to the webinar. Please introduce yourself and remind everyone what we do at Diffblue.

Matthew Richards:

Hi. Yes. Thank you very much for asking me to come and talk today. I’m Head of Product here at Diffblue. We’re a startup based in Oxford in the UK and our goal as a company is to use code to write code. So, writing code is one of these things that’s still really an art form. It’s a trade. There is a lot of hands-on activity. It’s something much more akin to art. But we want to revolutionize the world of software development, really a lot like the steam engine revolutionized the world of manufacturing and then transport, etc. So, we want to use AI to automate as much as possible of what we call the vital but tedious part of coding. So, really, this is the next generation of developer tools that everyone’s going to be using, and it will become the norm to let AI write as much of the code as possible so that you as a developer can focus on those really complex tasks, the things that are exciting and interesting and where you can use your skills the most.

Adam Mackay:

Yes. That’s brilliant. Thanks Matt. So, just for everyone in attendance, there will be an opportunity to put questions to Matt towards the end of the webinar. So, please use the chat functionality or the Q&A window in the Zoom webinar and we’ll pick those up as we go through. If we don’t have time to get to all of them at the end of the webinar, we will follow up with you directly. So, without further ado, let’s dive into today’s topic. When Is Enough Enough?: Stop Setting Unrealistic Unit Test Coverage Goals. I think that’s quite a provocative title, Matt, and we were discussing titles. Why did you come up with that?

Matthew Richards:

Well, it’s interesting. It is provocative because it’s something that I see again and again and again. People that I talk to have coverage goals, but they don’t know why they have that goal. Let’s remind everyone of what coverage is and go back to basics. So, code coverage is a percentage, which is a measure of how many lines of your total code are actually exercised by unit tests. So, when we say exercised, we mean really how many lines of code are executed by unit tests. So, if we’ve got a method with, say, a hundred lines and I’ve got unit tests for that method and those unit tests exercise 50 of those hundred lines, then I’d say I have 50% unit test coverage. So, that means the other 50% of the code that’s not being exercised by my unit tests, that 50% represents an unmitigated risk in my project. I can’t see if that code has regressed using unit tests.
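[Editor's illustration] The arithmetic Matt describes can be sketched in a few lines of Python. This is a minimal sketch, not how any particular coverage tool computes its figure; the function name `coverage_percent` is made up for illustration.

```python
def coverage_percent(executed_lines: int, total_lines: int) -> float:
    """Return line coverage as a percentage of the total lines."""
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    if not 0 <= executed_lines <= total_lines:
        raise ValueError("executed_lines must be between 0 and total_lines")
    return 100.0 * executed_lines / total_lines


# The example from the talk: unit tests exercise 50 lines of a 100-line method.
covered = coverage_percent(50, 100)
uncovered = 100.0 - covered  # the part where unit tests cannot reveal a regression
print(f"covered: {covered:.0f}%, not exercised by unit tests: {uncovered:.0f}%")
```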

Matthew Richards:

So, going back to the title, companies often set a coverage goal as a gate to release code. So, a developer is writing code and they’re not allowed to push that code further along the pipeline towards production until they’ve hit a particular percentage. A typical coverage goal is 70%. So, that developer wants to commit a hundred lines of new code. They’ve got to have written unit tests that cover 70 of those lines. So, what we’re going to dig into today is why that rather unscientific approach to choosing that rather arbitrary number occurs, and what is a better way of selecting a coverage target and really about being aware of why you’ve chosen it.
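[Editor's illustration] A coverage gate of the kind described above might look roughly like this. It is a sketch under assumptions: the function name, the default of 70%, and the rule that a commit with no new lines passes are all hypothetical choices, not a description of any real pipeline.

```python
def passes_coverage_gate(new_lines: int, covered_new_lines: int,
                         target_percent: float = 70.0) -> bool:
    """Return True if the new code meets the coverage target.

    Assumption: a commit that adds no lines has nothing to cover and passes.
    """
    if new_lines == 0:
        return True
    return 100.0 * covered_new_lines / new_lines >= target_percent


# A developer commits 100 new lines and needs tests covering 70 of them.
print(passes_coverage_gate(new_lines=100, covered_new_lines=70))
print(passes_coverage_gate(new_lines=100, covered_new_lines=69))
```

Note that the gate only checks the number. It says nothing about which 70 lines were covered, which is the point the discussion goes on to make.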

Adam Mackay:

Excellent. So, yeah. I absolutely agree with you there, Matt, and I think it’s something that I’ve seen personally many times over the last 20 years, particularly working in the software development industry and safety critical industry where teams will chase after some kind of arbitrary coverage number. Typically, this is below the 100% maximum, but it depends on context, I guess. So, how do you go about setting an appropriate percentage and is coverage really the best metric to use?

Matthew Richards:

So, there’s a lot in there. So, first, it’s a mixture of multiple things and there’s no one answer for every organization. But what we can say is that choosing an arbitrary number is certainly not the way to do it. But the number itself, coverage, is really the best proxy for most of us. So, coverage itself is a good proxy. It’s something that is a proxy for how much have I reduced the risk in my code? A very simplistic measure. The number is just a number. It’s very easy for both developers and managers to understand and to track. We like to believe 100% coverage is what we should be doing. It sounds obvious: of course we should get 100% coverage for this code. But that’s unrealistic unless you’re someone like NASA, and we’ll talk about some of the reasons.

Adam Mackay:

I’m not sure we’ve got any attendees from NASA on the call. I might be wrong. They might hide behind.

Matthew Richards:

Maybe not. But NASA’s quite famous because they checked their code again and again and again. Every single line of code was inspected multiple times and they have 100% code coverage, maybe 200% code coverage. Maybe they’re covering code repeatedly because of the context of that code. So, back to really the question of what is an appropriate percentage? So, there’s a lot of different dimensions to this. The effort required to reach a goal increases as we increase the coverage target. So, to get from zero to 50% coverage is relatively straightforward for most applications, whereas getting from 80 to 90% is infinitely harder. We’re really in diminishing returns here and we have to ask ourselves, is it worth that cost? Is it worth that value?

Matthew Richards:

For NASA, that cost would be huge, getting to 100% coverage. Perhaps 100% coverage isn’t actually necessary for them to mitigate their risk. Or one might even say it might be impossible to get to 100% coverage because of quirks of the language. So, it’s really when we’re writing and creating coverage. As a developer, we cover the things that are easiest to cover. We look at the code and we say, “I know it’s easiest to write coverage for this part of my code. I’m going to get the biggest bang for my buck there,” and it makes us feel good and everyone sees a nice percentage and says, “Hey, yeah, that’s great coverage.” We get a quick increase in coverage, but ultimately, we have unit tested the easier code, and is that easier code actually the code that we should consider important to unit test?

Adam Mackay:

Yeah. Sure. So, there’s lots of reasons teams might end up choosing a prominent target like 100% or maybe not so high, 70%. But that might not actually add value. It’s only one measure among many others that could be used. So, would you say in effect that coverage is often misplaced as a quality metric?

Matthew Richards:

Well, number one is it’s not a quality metric. There is not a direct relationship between coverage and quality. I can have an incredibly high quality piece of code that has no coverage where the risk of introducing problems and regressions is low because the actual code itself has been very well built. It’s gone through code review. We’re comfortable with it. It just doesn’t have unit test coverage. Conversely, we could have some quite terrible code. We could have code that was thrown together quickly, maybe written 20 years ago. No one really understands it, but we’ve got a lot of coverage for that code. It doesn’t mean the code is of high quality. So, people often perceive coverage as a measure of quality. For example, they assume that saying I have 70% coverage means that my code quality is higher than if I said I have 60% coverage. But back to the previous point, code coverage is a percentage that measures my ability to detect regressions in the code. So, the slightly more provocative question really is: is the percentage of coverage actually what’s important, or is it where that coverage is?

Adam Mackay:

Yeah. Yeah. Sure. I think in one of the earlier answers, you touched on the concept of effort, the amount of effort to test the code. So, is coverage not so much a quality metric as an effort metric?

Matthew Richards:

You could say it is a metric that is articulating how much effort I’ve put into mitigating the risk of regressions in my code. Not linear, as I’ll show you soon. But going back to that 70% figure, again, quite typical, one might say I have put in enough effort to detect the risk of regression in my code, but maybe putting in more effort is too expensive. So, maybe that’s why my coverage target isn’t 80, 90, or 100. That’s really important to acknowledge and actually for us as engineering leaders to say I have chosen to not go for a higher coverage target than, say, 70% because it’s too expensive. That’s a valid reason. But very rarely do we hear people able to articulate that as the reason they’re not going above 70. So, maybe there are other ways to mitigate that remaining 30%, either other forms of testing or maybe your organization is quite happy with that risk. We’re not NASA. Maybe saying yes, that latent risk is okay, and maybe I simply insure against it using other mechanisms.

Adam Mackay:

Yeah. Sure. So, I think we all agree coverage is a simple thing to measure and track. So, there must be value in doing that. It’s a readily available number and it can help put a label on how well complex software is understood by the developers, maybe. I guess documenting it and indicating how maintainable that code is likely to be, and therefore reducing the risk of a bug making it through to production code.

Matthew Richards:

Yeah. That’s very true. This is using coverage not for its primary purpose, as a measure of how I am going to detect a regression in the future, but as a measure of how good the code is that I am protecting. It’s kind of related to the effort metric. So, as well as being a percentage indicating effort, it’s potentially also indicating how maintainable that code is. So, one could say code with 70% coverage is more maintainable and therefore regressions are less likely to be introduced. So, one might infer that code that has high coverage is better quality because the developer must have written code that is testable because it has been tested to that high level. The developer must have been thinking about how to test the code when they wrote it. That mindset itself, to say that developer had the mindset of thinking about testing, I would say there’s probably a very strong correlation with actually the code being good. It’s not rushed. I’ve thought through it. I’ve designed the code and I’ve thought about these things.

Matthew Richards:

So, one might infer that actually, that means that the code is more maintainable. Even further than that, you could say it’s broken down so it can be unit tested. That code is being broken down into smaller units of functionality and therefore is easier to understand. So, it’s not that the code itself is just better written, it’s smaller units of functionality, easier to understand. You mentioned the word ‘documentation’ there. Adding lines and lines of documentation is not something we are doing. We don’t do that. We expect the code to explain to us what it is doing. But by adding unit tests to the code, we’re actually documenting the code. We’re giving a meta description of the behavior of the code. The unit tests are ultimately just a description of each individual behavior of a particular method.

Matthew Richards:

So, if you’ve got a method with three unit tests, those tests are documenting three different behaviors that the method can exhibit, one per test. If you have that documentation, when a different developer comes along maybe years later to edit that code, they’re actually able to read those unit tests as if they were documentation, understand what the current behavior is. So, when they make the code change, they’re making that code change intentionally knowing what the existing behavior was. We’ve all been there. We change code when we don’t really know what it does. So, there’s a lot there to unpack, but really, there’s some quite interesting byproducts that come out of having good test coverage.
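[Editor's illustration] The idea of tests as documentation can be sketched with a small example. The `withdraw` function and its rules are hypothetical, invented only to show one method with three tests, each named after the behavior it documents. The webinar is about Java; Python's built-in `unittest` is used here purely to keep the sketch self-contained.

```python
import unittest


def withdraw(balance: int, amount: int) -> int:
    """Hypothetical method under test: return the balance after a withdrawal."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    if amount > balance:
        raise ValueError("insufficient funds")
    return balance - amount


class WithdrawBehavior(unittest.TestCase):
    # Each test name reads as one line of documentation for the method.

    def test_reduces_balance_by_amount(self):
        self.assertEqual(withdraw(100, 30), 70)

    def test_rejects_non_positive_amount(self):
        with self.assertRaises(ValueError):
            withdraw(100, 0)

    def test_rejects_withdrawal_larger_than_balance(self):
        with self.assertRaises(ValueError):
            withdraw(100, 101)


# Run the three tests and report each behavior by name.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(WithdrawBehavior)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

A developer who later edits `withdraw` can read the three test names to learn the current behavior before changing it.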

Adam Mackay:

Yeah. Yeah. There definitely are. So, given the limitations with using coverage as a measure of quality, what alternative measures should we be using?

Matthew Richards:

So, first, coverage is a good metric. It just depends on whether we know where our coverage is. So, it’s not a perfect metric by any stretch of the imagination, and maybe it’s a little too simplistic. There are some force multipliers we need to think about. We need to look at that coverage figure through different lenses. So, really looking at it in the code’s context itself. I have a nice diagram that shows the distribution of code complexity. So, code complexity is a lens through which we can look at code. Code complexity is a metric that tells us how complex the code is. So, how many pathways are there through that code? So, we can see here a distribution of code. There’s going to be a large quantity of very trivial code. So, this flat part of the curve here is representing that a large quantity of that code is going to be, in the Java world, getters and setters and trivial things that are non-branching, maybe small methods with limited functionality.

Matthew Richards:

But the bulk of the application will be these small trivial non-branching methods. Then there’ll be a middle ground level of complexity where we see some branching methods and some error handling. For example, parts of your code that might call a database, they’ve got to negotiate some connection parameters, handle some basic exceptions, and there might be some minor business logic. Then there’s going to be some very complex code, a small proportion of the code being very complex. But this is the code that has got a lot of branching, a lot of potential state management. There might be a state machine in there, looping. These things are very hard to test and this is really the heart of the business logic. So, this metric to measure the complexity of the code, cyclomatic complexity being a measure of the pathways through the code, is really important when we look at our coverage.
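[Editor's illustration] Cyclomatic complexity can be approximated as one plus the number of decision points in a piece of code. The sketch below counts decision points in Python source using the standard `ast` module. It is a simplification (real tools handle more constructs, such as comprehensions and `match` statements), and the two sample snippets are hypothetical stand-ins for the "trivial getter" and "database call with error handling" categories mentioned above. The snippets are only parsed, never executed.

```python
import ast

# Node types treated as one decision point each (a simplifying assumption).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    decisions = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, BRANCH_NODES):
            decisions += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" has three operands, adding two extra paths.
            decisions += len(node.values) - 1
    return 1 + decisions


GETTER = """
def get_name(self):
    return self._name
"""

CONNECT = """
def connect(params, retries):
    for attempt in range(retries):
        try:
            return open_connection(params)
        except TimeoutError:
            if attempt == retries - 1:
                raise
    return None
"""

print("getter:", cyclomatic_complexity(GETTER))
print("connect:", cyclomatic_complexity(CONNECT))
```

The getter has a single path, while the connection helper has a loop, an exception handler, and a condition, so it scores higher and needs more tests to exercise every path.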

Matthew Richards:

One might say that as well as this curve showing us the increasing complexity of the code, this exact curve also maps the riskiness of that code. Trivial code is less risky because it’s easier to understand as a developer. I’m less likely to introduce faults because I can see what’s going on. It may be less important code, whereas this high complexity code is more likely to be very high risk code where it’s harder for a developer to see if it’s working or not. Is it behaving as I intended?

Matthew Richards:

So, we can really correlate risk to the complexity of code. So, it’s a case of how do I find other metrics, other lenses to look at the code? I love showing people how straightforward this can be. Let’s pick the 70% coverage figure. If we map on our coverage and say where is the coverage, what if it’s this? What if it’s that bottom 70% of the code that’s covered? We feel really warm and fuzzy. I’ve hit my 70% coverage target, yippee, and we think I don’t have to try any harder because my organization asked for 70%. But actually, I’ve only covered the 70% least complex code. Not really intentionally, but because that’s easier for me to test, and that’s really quite dangerous.

Adam Mackay:

Sure. So, I guess a better metric than raw coverage would be code coverage targeted at that complex code where the risk is likely to be?

Matthew Richards:

Yeah. So, picking your coverage intentionally and saying I am going to have a coverage target and I’m going to focus on that coverage target, though, on particular complexities of code. So, planning out really this red bit here and maybe saying you always have to cover that red bit, or I’ve got a nicer example of where 50% coverage is incredibly valuable and actually, 50% coverage is more valuable than 70 or 80% coverage because it’s the most complex 50%. So, I don’t really mind here if I’ve got a lower figure. I’ve done it intentionally, I’ve said it’s 50% for an excellent reason.
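[Editor's illustration] One way to make "where is the coverage?" visible is to weight each covered line by the complexity of the method it belongs to. This weighting is an assumption made for illustration, not a metric named in the webinar, and the three methods and their numbers are invented to mirror the 70% and 50% scenarios discussed above.

```python
from dataclasses import dataclass


@dataclass
class Method:
    name: str
    lines: int
    complexity: int      # for example, cyclomatic complexity
    covered_lines: int


def raw_coverage(methods):
    """Plain line coverage as a percentage."""
    return 100.0 * sum(m.covered_lines for m in methods) / sum(m.lines for m in methods)


def complexity_weighted_coverage(methods):
    """Line coverage where each line counts in proportion to its method's complexity."""
    covered = sum(m.covered_lines * m.complexity for m in methods)
    total = sum(m.lines * m.complexity for m in methods)
    return 100.0 * covered / total


# Scenario 1: 70% coverage, all of it on the least complex code.
easy_70 = [
    Method("getters_and_setters", 70, 1, 70),
    Method("database_access", 20, 5, 0),
    Method("state_machine", 10, 20, 0),
]

# Scenario 2: 50% coverage, chosen to include all of the most complex code.
hard_50 = [
    Method("getters_and_setters", 70, 1, 20),
    Method("database_access", 20, 5, 20),
    Method("state_machine", 10, 20, 10),
]

for label, methods in (("easy 70%", easy_70), ("hard 50%", hard_50)):
    print(f"{label}: raw {raw_coverage(methods):.1f}%, "
          f"weighted {complexity_weighted_coverage(methods):.1f}%")
```

Under this weighting the lower raw figure scores far higher, which matches the argument that an intentional 50% on the complex code can be worth more than 70% on the trivial code.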

Adam Mackay:

Yeah. That’s fascinating. I’ve come across scenarios where this sort of thought and insight into where the coverage actually sits in the code base would be quite valuable. Maybe you have a real world example yourself where this has played out.

Matthew Richards:

So, I have come prepared with a story. I’m full of anecdotes about this. This is based on a real code base. So, I’m going to use our analytics tool here to show you. So, this is what we call the sunburst diagram in our analytics tool. This represents the coverage of your code. This project actually had 90% coverage. So, as Diffy says here, hey, there’s nothing to worry about. There’s no risk in my code. I’m covered. 90% is incredible. But when you actually map out where that coverage is, what you can see from this diagram here is this red blob. This represents a piece of very complex code. It’s actually a group of methods that has no coverage. So, we should ask ourselves, what is this? Is this important? We only ask ourselves that question because we care about where the coverage is.

Matthew Richards:

So, we dug into Git; we went through the commit history, and we found out that this was first written back in 2010. Actually, this was written when the application was first written. It hadn’t then been changed for two years. So, in the last 10 years, nobody touched it. Then we looked at the complexity of this code and we found it was incredibly complex. The code was actually incredibly poor code. It was obviously written hastily, with large methods, etc., and we know it’s got no coverage. So, now alarm bells are ringing. It’s an ancient piece of code that hasn’t been touched in a long time. It’s a sizeable piece of code that’s got high complexity and no coverage.

Matthew Richards:

So, then we went and asked the developers and the architect, hey, what is this piece of code? It’s quite scary. They came back and they said that this is the backup and restore function. So, this was actually a server-based piece of software. So, this is something that was deployed into customer environments and they’d run it on servers and each of these servers had to back up every night. So, the backup job would run and it would produce a backup file, and if the system failed, if the server went down, they could bring in a new server and restore the backup. So, this code was the restore part of that.

Adam Mackay:

So, it’s a vital piece of functionality, I guess, and ensures the safe operation of the system, and it didn’t have a single test running against it. It was, I guess, in a part of the code base that didn’t have any unit tests.

Matthew Richards:

No. It didn’t, and we could infer why. We could infer, well, maybe it was never covered because no one ever complained about it. Maybe they got no bugs or support cases for this because it was rarely exercised. If we’re using it 365 days a year to back up, but we’re never using it to restore, we’re never going to see the support cases for it. So, number one is maybe they didn’t think there were any problems in this code and number two is it didn’t actually have to change very often. It’s pretty much not been touched in 10 years. So, that could show that, well, why touch something that’s not broken? But by actually digging through it, there were problems in there. So, by working on the code, by looking through those lenses, we’re able to understand that there is that risk there, even though we’ve got a lot of coverage.

Adam Mackay:

Yeah. Sure. So, definitely a lot of risk held in that single piece of code there. So, we’ve discussed how complexity is related to risk, and I think we can all relate to that. How about code churn? I guess the frequency of updates to a particular piece of code is also an indicator of risk. So, I guess we should measure the amount of change in a particular piece of code, alongside both its complexity and coverage.

Matthew Richards:

Definitely. So, churn is really interesting. We just saw it in that example, something that has a very low churn. So, churn is a metric that quantifies how often a piece of code changes. It’s often used as a comparative metric against other pieces of code. So, we might measure the churn of code in our application and we might say that this method has changed, say, once in the last year. So, the churn of that method compared to another method over here, which has changed seven times in the last year, we’d say that the second method has a churn seven times greater than the first method. So, in that example that I just gave, that piece of code hadn’t really changed for 10 years. That’s incredibly low churn, and incredibly low churn is a risk indicator because maybe it’s showing us that the code is unloved.

Matthew Richards:

But at the same time, if code is changing all the time, really, it’s telling us that there is a higher probability of us introducing a regression. If for that second method we’re changing it seven times a year, there are seven times the number of occasions every year to introduce a regression. So, how often it’s churning indicates the likelihood of a risk occurring. We also would want to look at why it is churning so often. Are we actually even aware that is occurring? Maybe it’s churning because that’s a feature that’s well loved by our customers and we’re constantly updating it, we’re constantly making it better, or maybe it’s a piece of code that’s just full of bugs and we update that and fix bugs seven times a year. Or maybe it’s code that is just so misunderstood that it’s churning because someone changed it, someone’s changing it back, and we’ve seen these things.
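[Editor's illustration] Relative churn, as described above, is a ratio of change counts between two pieces of code over the same period. The sketch below uses an invented change log (one entry per commit that touched a method) and hypothetical method names. It also lists methods with no recorded changes, since very low churn was called out as a risk indicator in its own right.

```python
from collections import Counter


def relative_churn(counts: Counter, method: str, baseline: str) -> float:
    """Return how many times more often `method` changed than `baseline`."""
    if counts[baseline] == 0:
        raise ValueError("baseline method has no recorded changes")
    return counts[method] / counts[baseline]


# Hypothetical change log for the last year: one entry per commit touching the method.
changes = ["parse_order"] + ["apply_discount"] * 7
all_methods = ["parse_order", "apply_discount", "restore_backup"]

counts = Counter(changes)
ratio = relative_churn(counts, "apply_discount", "parse_order")
untouched = [m for m in all_methods if counts[m] == 0]

print(f"apply_discount churns {ratio:.0f}x as often as parse_order")
print("no changes in the period:", untouched)
```

The numbers alone do not say why a method churns or sits untouched. As the discussion notes, that still needs a conversation with the people who own the code.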

Adam Mackay:

Mm-hmm. So, I guess another thing to consider is what risks exist in your system? If you’ve got a system where safety is critical, then obviously, you need to make sure that every line of code is exercised during testing because if there’re any bugs in your system, it might lead to maybe a life-threatening issue further down the line. I still don’t see anyone from NASA on the call, but there’s plenty of other industries where criticality of code is a genuine concern.

Matthew Richards:

Definitely. So, we talked about NASA already where we could say there’s 100% coverage target because it’s safety critical. The financial investment is enormous, as well as the human life investment. But also, once you’ve launched that vehicle into space, you can’t really change the code in space, and we see on films where the astronaut’s changing some code in real-time whilst they’re in space. I’m not sure how often that actually happens. But down here on Earth, the same happens. If you think about an engine control unit in a car, let’s go back 10 years. That engine control unit was probably not field updatable. So, once the car has shipped and that engine control unit has been configured, a safety critical component that might control many safety critical parameters of the vehicle, changing that’s difficult.

Matthew Richards:

What do we have to do? Do we have to recall all the cars to update the software? Are we shipping USB sticks out to our customers? What if they don’t use that USB stick? What if they don’t update? Are we liable as the manufacturer? We see this in cars. In my car now, every so often, it pops up, and it says that there’s a firmware update and I click yes. But I guess that’s probably not for the safety critical parts.

Matthew Richards:

You’d expect that to be even. It’s probably for the CD player. But it’s a case of how does my organization tolerate risk? How do my products deal with risk? Can I update the software from now on? What’s the financial risk to me? Maybe getting 100% coverage is really difficult, but maybe it’s actually the right thing to do. But I’m sure there are converse examples where there are other ways to mitigate the risk. So, maybe you have a much lower coverage. We see this a lot in the networking world where actually, a way to mitigate risk in large-scale networks is to just use two different manufacturers of devices. You will see two different routers, one from each manufacturer. So, it’s not the manufacturer mitigating the risk, but the customer is mitigating the risk by using two different manufacturers. So, there are other ways, and we have to consider all of this when we’re thinking about the code: how important is it to cover that risk and also, what other mitigations do we have?

Adam Mackay:

Yeah. Sure. So, I enjoy drawing on your experience, Matt, and you’ve given us some great examples already. I wondered if you might give us some illustration as to maybe where high test coverage has been misused. What figures do our customers aim for?

Matthew Richards:

That’s really interesting. I’m actually going to answer a slightly different question first. One of our customers doesn’t even know what their coverage is, and that’s intentional. They’ve chosen to not even measure coverage. They don’t see coverage as the metric to track and they take testing seriously. Seriously. But they’ve actually got their own metrics. But other users that we’ve worked with, and in my experience, I’ve seen people with 90% coverage targets where that’s where they’re aiming for, but they’re at zero today. That’s really admirable that they’re aiming for that, and that might be their long-term goal. That’s the KPI that they’re looking at.

Matthew Richards:

But we need to break that down and we need to say yes, but how are we getting there, not setting the goal in such a way that it’s seen as unachievable? So, setting that coverage goal where there’s very little coverage at all, we have to understand what the journey looks like to get us to 90%. So, they’re not misusing the goal. It’s just that they’ve set the goal at 90% and what they’re going to see is that delivering that value is challenging to do in a way where engineers can still release features and still do their day job without just aiming for the 90% all in.

Adam Mackay:

So, I guess it’s a case of avoiding maybe a false sense of security that the high coverage figure might bring in. Okay. So, how do we balance achieving acceptable levels of risk without sacrificing development speed? Where do we start and how do we agree on what’s acceptable within our particular organization?

Matthew Richards:

So, first, think of unit testing just like you think of any other feature. As a product manager, I think about delivering features in a particular sequence and I think about feature delivery as a journey. So, writing code is a journey. That’s what agile software delivery is about. It’s about delivering incremental pieces of value. As product people, we choose the value that gives us the biggest bang for our buck first, and I used those words, the biggest bang for our buck, earlier. It’s about the biggest risk mitigation first. So, over time, we work on things that have a lower return on investment. So, that’s how we think about features, and we should think about that with testing as well. We have customers come to us and say, “I have no test coverage, or maybe I want 85% coverage and I need to get there straight away.” But we wouldn’t do that with product features. We would say, “I want this suite of features but this one is going to give me the biggest bang for our buck.”

Matthew Richards:

So, it’s a dangerous approach, and it potentially drives panic behavior because we get focused on that 85% as an organization, as an engineering team, I should say, and it gives us an unrealistic sense of expectation. As a product guy, I’m biased. I’ve lived in both worlds. But as a product person, I’ve got a business to run. I’ve got features to ship. I’ve got customers to satisfy. We can’t just be testing all the time. We’ve got to be delivering new value and we’ve got to be fixing bugs in our software. Unit testing doesn’t fix the defects.

Matthew Richards:

So, we need to introduce this risk mitigation of unit testing just like any other feature we bring into our product. We should balance the risk and reward of that work, of our unit test coverage, just like we balance the risk and reward of features. So, just like with features, we need to analyze and understand why we’re doing it, understand what the value is. Just like with a feature, we’d understand what is our market? Where is our customer opportunity? Is my customer opportunity bigger here or here? What are the risks? We should do the same with coverage.

Adam Mackay:

Sure. So, we’ve discussed the different aspects of code coverage and how coverage figures alone really only make sense when taken into context with things like code complexity, churn, and the underlying functionality of a piece of code. There are many other benefits that come out of unit tests, and maybe some of them aren’t as obvious as others. So, what other benefits do you see with unit testing?

Matthew Richards:

So, the benefits are different if you are using unit testing from the very first line of code of an application vs. retrofitting unit tests to a multimillion line existing legacy application. Let’s look at them separately. So, in the first, well, in the case where we’re retrofitting unit tests to an existing application, the benefit of adding those unit tests is that it’s giving us understanding of the project and it’s allowing us to understand what the code is doing, and really, going back to what we talked about before, it’s that documenting the code.

Matthew Richards:

Having that documented code as unit tests means that when we then change our legacy application, we’re able to more likely make that code change in a way that will not break the code. So, we understand the code before we change it and then the unit test would also help us catch the regressions. So, it’s about actually writing the better code in the first place is one of the key things. A lot of our customers say to us, “I don’t want to find more bugs in my legacy code.” They say, “My legacy code is actually good. I’m quite happy with my legacy code. It works. The risk is introducing change to that legacy code. How do I introduce that change without actually introducing additional problems?” So, it’s that documentation, really.

Adam Mackay:

Yeah. So, that covers legacy code or existing code that may be complex. Tests add a lot of insight into that code, which is very useful for lots of reasons. So, maybe we’re at an organization that’s writing new code from scratch. What benefits does unit testing bring there?

Matthew Richards:

So, in the example where we’re starting from scratch, again, we’re writing that documentation, but from the beginning. Actually, by writing that documentation, we’re actually checking that the behavior is correct, right from day zero. I keep talking about behavior. It focuses us on value delivery. It focuses us on the behavior of the code. We can say we know what the code does. So, it’s really attaching me as a developer to the behavior rather than the way it’s implemented. So, starting from day one is making sure that the code is doing what I wanted from the beginning, and also then maintaining that over time is cheaper because it’s very expensive to maintain unit tests, but it’s meaning that I’m adding to that. Every time someone changes the code, I know that I’m changing that code safely and we have that confidence. We never have that sense of panic. It’s, yeah, now I know I’ve got good code here because I’ve had 70% coverage maybe, the right 70% from the start.

Adam Mackay:

I guess understanding someone else’s code, as we all know, is not a straightforward task. You need time, patience, and ideally documentation, or at the very least clearly commented code. For example, people working with a large legacy code base must see this all the time, and I can see that the unit tests around that legacy code can really help in that kind of situation.

Matthew Richards:

Definitely. Definitely. So, for legacy code, and I’ve done this multiple times myself, adding those unit tests is about really, you’ve got to get to know your code again. Maybe you’ve brought an application in house that you’ve not actually seen for a long time. It’s incredibly expensive to go through, understand what is this code? What is it doing? That’s a senior person. That’s a senior engineer going through and saying, “Oh, yeah. This is what this code is doing.” If I want to unit test this code, I need to understand those behaviors. So, as a developer adding those unit tests, I need to have a really intimate understanding of the application before I can even think about adding unit tests, and I need to be an expert in the language. So, in our case, Java. I really need to be an expert Java developer because I’m going to have to understand the concept of testability, understand I need to refactor this complex legacy code to actually add unit tests to it.

Matthew Richards:

So, we’re talking very experienced engineers spending a lot of time, and the cost of that is significant, and as a product person now, I’m spending money on a developer adding unit tests, not building features, and that’s an opportunity cost. I know that the time spent on features, I can account for that. I can say this is going to drive revenue. This is going to make my business more successful. An opportunity cost is infinitely more than developer salary. So, I’ve got to weigh up those two, and that’s the tension between engineering and product in this case of, yeah, product wants happy customers, definitely. But, there’s opportunity there and every time we spend time and effort on unit testing or any form of testing, I as a product person, lose out. I don’t want any time spent on testing, really. I want it to be good quality, but I want the minimum time possible spent on actually ensuring that because, as a product person, I want to spend my money on features.

Adam Mackay:

Yeah. Yeah. Sure. I guess when we’re building new features, unit tests force us to think about the code very intentionally. I guess almost supplementing or augmenting the code documentation.

Matthew Richards:

Yeah. It’s what I said before. It’s keeping me focused on what the code does rather than how it does it, and that’s a core principle behind user stories, for example. User stories are written in a way that’s focused on what the user can do and why, and unit testing keeps us focused on what is the behavior of my code. I love it. Users talking about the behavior of code is something I’d like to hear more of. I have people ask me, “What do you mean by behavior?” They’re not having these behavioral conversations. They’re talking about the way the code works rather than the behavior that a customer or user is going to see. It also stops gold plating because it means we’re not doing too much. We’re not perfecting something beyond where it needs to be. We’re writing the code to exhibit particular behaviors and then we stop and we move on to the next job.

Adam Mackay:

Sure. So, there are many experiences I’ve had where a better grasp of code functionality would’ve helped me out of a sticky situation, and I’m sure you’ve got similar experiences in that regard, Matt.

Matthew Richards:

Definitely. Definitely. It’s all about understanding behavior at the most granular level. If we look at a real world example of where that is really difficult to do, so a really fresh example, let’s use Log4j. Everyone understands the Log4j vulnerability. So, there are many ways to fix Log4j, but all the standard fixes for Log4j change not necessarily a lot of the code in your application, but Log4j is used everywhere in your application. So, how do I fix the Log4j vulnerability, introducing no nasty side effects that I’m not aware of? Maybe those side effects take years for me to actually realize. We can leap versions of Log4j. I might be on a version of Log4j from five or six years ago and I’m leaping to today’s version with the fix.

Matthew Richards:

So, this is where unit tests really help. So, if I have the unit tests that describe the behavior of my code before I upgrade Log4j or fix the code, I can then make that quite wholesale change and then I can rerun those unit tests afterwards. Technically, none of those unit tests should fail, which means the behavior of my application before I updated Log4j is the same as after. I’ve not introduced some tiny little side effect that I wasn’t aware of in this new version of Log4j that might take a long time to manifest itself further down the line in some area that takes months and months and months to discover.
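As a minimal sketch of the idea just described, the Java below records the outputs of a piece of code before a wholesale change and re-runs the same cases afterwards. Every name in it (BehaviorSnapshot, formatAuditLine, the sample cases) is hypothetical and chosen only for illustration; it is not Diffblue output, and a real project would more likely use a test framework such as JUnit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.BiFunction;

// Illustrative only: every name here is hypothetical and not taken from the webinar.
final class BehaviorSnapshot {

    // One observation recorded before the change: these inputs gave this output.
    static final class Case {
        final String user;
        final int failedLogins;
        final String expected;

        Case(String user, int failedLogins, String expected) {
            this.user = user;
            this.failedLogins = failedLogins;
            this.expected = expected;
        }
    }

    // Stand-in for application code whose behavior must survive a dependency upgrade.
    static String formatAuditLine(String user, int failedLogins) {
        String status = failedLogins >= 3 ? "LOCKED" : "OK";
        return user.trim().toLowerCase(Locale.ROOT) + ":" + status;
    }

    // Behavior captured while the old dependency version was still in place.
    static List<Case> recordedBeforeUpgrade() {
        return Arrays.asList(
                new Case("  Alice ", 0, "alice:OK"),
                new Case("bob", 3, "bob:LOCKED"),
                new Case("CAROL", 2, "carol:OK"));
    }

    // Re-runs every recorded case and describes each one whose output changed.
    static List<String> regressions(BiFunction<String, Integer, String> implementation) {
        List<String> failures = new ArrayList<>();
        for (Case c : recordedBeforeUpgrade()) {
            String actual = implementation.apply(c.user, c.failedLogins);
            if (!c.expected.equals(actual)) {
                failures.add(c.user + " -> expected " + c.expected + " but got " + actual);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // An empty list means the observable behavior is unchanged.
        System.out.println(regressions(BehaviorSnapshot::formatAuditLine));
    }
}
```

Running the same recorded cases after the upgrade and getting an empty list back is the "none of those unit tests should fail" outcome; any entry in the list points at a behavior that changed.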

Adam Mackay:

Yeah. So, I guess in effect, with good unit tests, we’re locking in the ability to detect the changing behavior of legacy code. Okay. So, we already know that Diffblue automates the writing of unit tests. So, how do our customers typically use this technology to deal with complexity?

Matthew Richards:

So, Diffblue provides two core capabilities. So, we provide the functionality to look deeply at your code and understand where that risk is. Knowing where you are mitigating risk is critical. We need to look through the lens of not just coverage, but complexity, churn, and other factors that help you understand what’s important. So, it’s giving those insights, and not just coverage tools like SonarQube that give you a very blunt hammer: you’ve not hit your 70% coverage, you are a bad developer. It’s more of the nuance of this is where risk is. Mitigate it in these places.
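To make the idea of combining coverage, complexity, and churn concrete, here is a small sketch in Java. The scoring formula, the class names, and the sample numbers are all assumptions made up for this illustration; they are not how Diffblue or SonarQube calculates anything.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: the formula and all names are assumptions, not a vendor's algorithm.
final class RiskRanking {

    static final class ClassStats {
        final String name;
        final double coverage;   // fraction of lines covered, 0.0 to 1.0
        final int complexity;    // for example, cyclomatic complexity
        final int recentCommits; // churn: how often the class changed recently

        ClassStats(String name, double coverage, int complexity, int recentCommits) {
            this.name = name;
            this.coverage = coverage;
            this.complexity = complexity;
            this.recentCommits = recentCommits;
        }
    }

    // Assumed score: untested fraction, scaled up by complexity and by churn.
    static double riskScore(ClassStats s) {
        return (1.0 - s.coverage) * s.complexity * (1 + s.recentCommits);
    }

    // Returns class names ordered from highest to lowest assumed risk.
    static List<String> rankByRisk(List<ClassStats> stats) {
        List<ClassStats> sorted = new ArrayList<>(stats);
        sorted.sort((a, b) -> Double.compare(riskScore(b), riskScore(a)));
        List<String> names = new ArrayList<>();
        for (ClassStats s : sorted) {
            names.add(s.name);
        }
        return names;
    }

    // Made-up sample data for the demonstration.
    static List<ClassStats> sampleStats() {
        return Arrays.asList(
                new ClassStats("VetFormatter", 0.90, 3, 1),
                new ClassStats("LegacyBilling", 0.10, 25, 0),
                new ClassStats("OwnerController", 0.40, 12, 9));
    }

    public static void main(String[] args) {
        System.out.println(rankByRisk(sampleStats()));
    }
}
```

In this made-up data the class with the lowest coverage is not the one ranked first: a moderately covered class that is complex and changes often scores higher, which is the kind of nuance a single coverage percentage hides.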

Adam Mackay:

Showing you where you should spend your time. So, valuable insights and everyone could use more valuable insights. What’s the second? I guess that’s more obvious. We help write tests.

Matthew Richards:

Yeah. We write the unit tests for you. So, going back to the Log4j example, Diffblue can automatically write hundreds of thousands of unit tests for your entire million line application that describe every single behavior at the most granular level. So, when you make that wholesale change, you’re able to do it with confidence. But also, you’re fixing bugs. I know I’ve not introduced a regression or I’ve introduced a new feature. How do I know that my code hasn’t regressed, but how do I know the new functionality is correct?

Adam Mackay:

Yeah. So, it’s the priceless combination, and it all adds up to giving you some time back.

Matthew Richards:

Definitely. So, visibility is key to know, where am I spending my time, and us actually writing those unit tests saves you even more time. So, typically, for a developer to produce 70% coverage, they’re spending 50% of their time working out how to write unit tests. We are increasing the developer velocity by writing the tests for you. So, you were spending 50% of your time writing unit tests. Diffblue can do a lot of that work for you. Maybe we give you 40% of your total time back. That’s a huge velocity increase.

Matthew Richards:

So, it’s about us doing the work and helping you target that work at the right areas of your code. To conclude the thought here, for us, it’s not just about the number itself that’s important. It’s not do I have the right coverage goal? The more important thing is to have chosen that number with intent, to actually sit down and work through the question, what coverage do I need and where do I need it to de-risk this specific application in this specific company at this specific point in time, and is that achievable to do? So, it’s more about the journey to get to the coverage target than the actual coverage figure that you use.

Adam Mackay:

Sure. Excellent. So, that’s all great. It’s been a great discussion up to this point, and we started talking about Diffblue. So, maybe you can go into a little more about the solution and how it works.

Matthew Richards:

Sure. So, I’ll give you a quick look at the core technology, which is the ability to actually write the unit tests. You can then download a trial from diffblue.com, which you can use to write unit tests. If you use IntelliJ, you can use our IntelliJ plugin. Otherwise, we have a CLI tool there that you can try out as well. So, I’m just going to show you IntelliJ here. Quick look at how do we actually work? So, I’m going to share my screen here. So, quickly, I’ve got Spring PetClinic here. This is a standard Spring Java demonstration project. I’ve got my code on the right-hand side here and I’ve got an API controller method here. So, this processFindForm is one of those medium to high complexity or higher complexity pieces of code.

Matthew Richards:

So, there are multiple pathways through this piece of code, a lot of different behaviors that I could test here. For a developer, as they’re writing this code we can write the test for you. So, we’re saving you the time of doing this. Simply with our plugin, this flask icon here with the green plus, you’re clicking on that and Diffblue analyzes your code. It’s actually running your code in the background. It’s actually analyzing what are the various pathways through that code and then producing a unit test for each behavior of that code. So, as I say here, there’s several different behaviors.

Matthew Richards:

So, what this does is we click that, and I’ve got one I prepared earlier, and we produce the unit test. So, you’ll see here that Diffblue’s produced a unit test for each of these methods, sorry, for each of the pathways. So, we have the testProcessFindForm tests and we’re testing the various scenarios that you can put this piece of code through. So, we’re using MockMvc to mock the actual API call. The four scenarios that we’ve tested here are testing what happens if I don’t search for anything, so it’s a bit of validation, what happens if I search for something that returns one search result, what happens if I search for something that returns multiple search results, and what happens if I search for something that has no search results? So, those are the four behaviors that my code can exhibit and Diffblue’s written those unit tests for me, saving me a ton of time.
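The four scenarios can be sketched in plain Java as follows. This is a simplified, hypothetical stand-in loosely modeled on a find-owners handler, not the actual Spring PetClinic code and not the tests Diffblue generated in the demo (those use MockMvc). The class name, the sample owners, and the choice to treat an empty search as the broadest possible search are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative only: a simplified stand-in, not the real PetClinic controller.
final class FindFormSketch {

    // Made-up data for the demonstration.
    static final List<String> SAMPLE_OWNERS = Arrays.asList("Davis", "Davies", "Franklin");

    // Returns the name of the view to show for a last-name search.
    static String processFindForm(String lastName, List<String> owners) {
        // Assumption: no search term means the broadest possible search.
        String prefix = lastName == null ? "" : lastName.trim().toLowerCase(Locale.ROOT);

        List<String> matches = new ArrayList<>();
        for (String owner : owners) {
            if (owner.toLowerCase(Locale.ROOT).startsWith(prefix)) {
                matches.add(owner);
            }
        }

        if (matches.isEmpty()) {
            return "owners/findOwners"; // nothing found: back to the search form
        }
        if (matches.size() == 1) {
            return "redirect:/owners/" + matches.get(0); // single hit: go straight to it
        }
        return "owners/ownersList"; // several hits: show the list
    }

    public static void main(String[] args) {
        // One call per behavior: empty search, one result, many results, no results.
        System.out.println(processFindForm(null, SAMPLE_OWNERS));
        System.out.println(processFindForm("Franklin", SAMPLE_OWNERS));
        System.out.println(processFindForm("Dav", SAMPLE_OWNERS));
        System.out.println(processFindForm("Zed", SAMPLE_OWNERS));
    }
}
```

Each call in main corresponds to one pathway through the method, which is why one unit test per pathway gives four tests here.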

Adam Mackay:

That’s really outstanding. Excellent. So, a really wonderful demonstration and fascinating conversation, and I’m sure our viewers found that as interesting as I did. So, we’ve got a stream of questions coming in. I think we’ve got eight minutes until the end of the hour. So, we can try to tackle a few of those if you’re okay with that, Matt.

Matthew Richards:

Sure.

Adam Mackay:

So, shall we start with the question of TDD? I know you’ve got some views yourself on TDD. The question is how are organizations that engage in TDD for fresh development faring in comparison to other organizations who are doing unit testing traditionally for recent development?

Matthew Richards:

Very interesting question and loaded with a lot of different aspects to it. So, TDD, test-driven development, is the idea of I write my unit test before I write my code. It’s quite an aspirational activity that several organizations have said, “Yeah, we’re going to do TDD.” In my experience, talking to the people I talk to, TDD is an incredibly expensive way to write code. Writing all the unit tests first, aiming for that 100% coverage, and then writing my code. People have tried it for several years and actually ended up saying, “We cannot measure the benefit of it in a way that is delivering the value that we expected originally.” The cost is just so high. Traditional unit testing and having an intentional coverage target and writing those unit tests afterwards, many people I’ve seen are actually falling back to doing that as a cheaper and actually more effective mechanism. But TDD has its place. There are certain occasions when it is a good thing to do.

Adam Mackay:

Yeah. Yeah. Of course. So, this one, a little more straightforward. I think I could even tackle this one. Is your solution just for Java code?

Matthew Richards:

Yes, it is. We support Java 8, 11, and 17.

Adam Mackay:

Excellent. I guess following on from that, are there any prerequisites on the flavor of Java that work better? I guess we’ve talked about versions, maybe frameworks and-

Matthew Richards:

So, most of our customers have a mixture of really large, gnarly old legacy code and ultra modern Spring applications written in the last 12 months. So, ultimately, we support Java as a language and we’re used to supporting a whole wide range of different flavors of coding styles, etc.

Adam Mackay:

Excellent. Okay. So, someone here is asking about our AI. So, you’re using AI. Does this run on your servers? I guess they’re drawing parallels to things that other companies are doing with Copilot and so on.

Matthew Richards:

So, we run nothing in Diffblue. We have no sort of cloud SaaS deployment at Diffblue. Everything that Diffblue’s doing is actually happening on your machine. So, when I used the plugin there, IntelliJ, your code is staying on your machine, the AI is running on your machine, it’s doing all the processing there, and we’ve done that because we know that sending your code off to the cloud is just not a viable option for most companies.

Adam Mackay:

Yeah. Sure. Well, personally, I think it’s very impressive that you’ve got such sophisticated code running on developer machines. There’s obviously a lot of work that’s gone into optimizing and building up that solution. So, there’s another one here about the tests themselves. Can I pass in my custom data to drive the tests?

Matthew Richards:

Yes. So, we have two different mechanisms for that, and really, this is useful when your application maybe has some specific data formats, some very organization-specific strings, for example. You can literally tell us what the strings are and where to use them. We have what’s called custom input rules. As well as that, we have a tool which actually automatically extracts that data, available in our enterprise edition, called Cover Replay. So, that pulls out the values that we see in that custom data and we pull that out by actually running your code, or you run your code in a production-like environment and we watch it running.

Adam Mackay:

Okay. So, I guess maybe if you’re running the integration tests, you can extract more value from those.

Matthew Richards:

Yeah. The way we extract that is you would run those integration tests. So, maybe you’ve got some Selenium UI tests that you’re running against the actual application. We can sit underneath your application, actually watch those tests run, and we take that data and we use that as input back into our AI.

Adam Mackay:

Yeah. Sure. So, that’s an outstanding way of getting more value out of higher levels of testing. So, I think we’re running short on time to answer questions. There’s a couple that we haven’t got to, so I’ll make sure that they get followed up after the webinar. If you want to get in touch, our website is diffblue.com. Please monitor the website for future sessions like this and maybe consider signing up to our monthly newsletter, and then you’ll get immediate notification when we run other events and talks. So, thanks for attending. Thank you, Matt. That was superb, really fascinating, and I found it interesting and learned a lot myself, so I’m sure all of our attendees did too. So, until next time, see you soon.

Matthew Richards:

Thanks very much.