Transcript

Eric Koslo:

So thank you everyone for coming to today’s webinar, called ChatGPT is Fun, But The Future is Fully Autonomous AI For Code. I’m Eric Koslo. I’m an editor with InfoQ’s Java Queue. I’m focused on Java and security. I’m looking forward to this webinar, and what I think is particularly cool about this one is that nobody really likes writing unit tests. I think we all like having them. So what Diffblue has done here is they’ve taken something that we don’t like doing, and then they’ve paired it up with something cool, which is AI and automation, and they’ve taken that cool thing and they use it to get rid of the boring thing.

And in the last webinar with them, we talked about how it’s an AI algorithm that rewards itself based on code coverage, so that’s kind of a tiny preview of what’s to come. Otherwise, if you have questions, there’s a section in the webinar controls, please post your questions in that section. I’m going to hold on to them until the end, and then we’ll answer as many as we can just live after the talk. Otherwise, just for the common question that everybody asks, yes, this is recorded. Yes, slides will be available. So any other questions that you have, just type them in there, and I’ll be on top of it. So, Mathew, take it away.

Mathew Lodge:

Hey, thanks, Eric. Right. So, in terms of what we’re going to learn today, we’re going to look at the generative AI for code landscape. So where does ChatGPT and all these technologies, where do they all fit? What’s available? We’re going to go do a fairly deep dive into large language models. That’s the underlying technology that’s used in ChatGPT, GPT-3, GPT-4, and coding suggestions technology. So, we’ll go through that. And then we’re going to take a look at reinforcement learning. And because it wouldn’t be a webinar without live demo, we’ll do a live demo of fully autonomous unit test writing using reinforcement learning.

So ChatGPT is cool. I asked it to write me a haiku about a software project that is very late, and very quickly, it gave me this, which I think is cool. And this really demonstrates what ChatGPT is the best at. It is best at making stuff up. It is a really good tool for language tasks, in particular for writing new content, for copywriting, and things like that. But of course, it could also write code.

And so when we look at generative AI for code, what we see today are essentially two groups of technologies. One is based around large language models, and that’s things like GitHub Copilot and Tabnine in addition to ChatGPT, so using language models to write code. And then for unit test writing, there are Diffblue and an open-source project called EvoSuite, which use reinforcement learning to write code. So two different approaches, and we’ll talk about the pros and cons of each.

The core technology that’s used in ChatGPT and other large language models is the transformer. Transformers were invented at Google a few years ago, and they originally were developed for things like language translation. So you can see an example here going from English to French, the transformer breaks the input down into what it calls tokens, and a token is more or less analogous to a word. It’s not exactly that, but that’s close enough for us to understand what it does. And the transformer essentially is trained on a corpus, a collection of text.

And essentially, in the case of a transformer for language translation, it’s given lots of examples, like here’s an English sentence, here’s a French sentence. And so it learns statistical text patterns, and it does that by understanding what role the word plays in the text. Is it an adjective, is it a verb, a noun, and so on? And it’s using that information, that attention to the context of each token, in order to do a better job of prediction. And it’s an iterative approach. So you can see that you feed the output so far back into the transformer along with the input, and it generates the next word. So essentially what you’re doing with a transformer is you’re asking it, what’s the next word? And you keep doing that until it has produced the final set of outputs. So those are transformers.
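
To make that “what’s the next word?” loop concrete, here is a minimal sketch in Java of greedy autoregressive decoding. The LanguageModel interface and its nextTokenDistribution method are hypothetical stand-ins for whatever a real transformer exposes; the point is only that the output generated so far is fed back in as part of the input on every step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical interface standing in for a trained transformer.
interface LanguageModel {
    // Given the prompt plus everything generated so far, return a probability
    // for each candidate next token.
    Map<String, Double> nextTokenDistribution(List<String> tokensSoFar);
}

public class GreedyDecoder {
    private static final String END_OF_TEXT = "<|endoftext|>";

    // Greedy autoregressive decoding: repeatedly ask the model "what's the
    // next token?", append the most likely one, and feed the whole sequence
    // back in until it predicts end-of-text or we hit a length limit.
    public static List<String> generate(LanguageModel model,
                                        List<String> promptTokens,
                                        int maxNewTokens) {
        List<String> tokens = new ArrayList<>(promptTokens);
        for (int i = 0; i < maxNewTokens; i++) {
            Map<String, Double> dist = model.nextTokenDistribution(tokens);
            String next = dist.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey)
                    .orElse(END_OF_TEXT);
            if (END_OF_TEXT.equals(next)) {
                break;
            }
            tokens.add(next);
        }
        return tokens;
    }
}
```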

Now, in the case of GPT, that stands for Generative Pre-trained Transformer, and this is a series of models from OpenAI, so we can see the history of the technology. GPT-2 was an open-source model back in February of 2019, so you could take this model, and it’s still available today: you can go take a look at the code and see what it does. And it was designed to summarize and translate and answer questions about text.

GPT-3 was a much, much larger model. You can see that we went from 1.5 billion parameters to 175 billion parameters. A parameter is basically a number used in the neural network to turn the input into the output, so the parameter count gives you a rough idea of the size of the statistical model. The model itself is built from multiple transformer layers stacked together, and those layers, aggregated, are what take the input and transform it into the output. We’ll get onto what that means in a practical sense in a second.

So GPT-3 was much, much larger than its predecessor, a hundred times bigger, and it was able to write new text from the input. This was the really interesting thing about GPT-3: you could have it do tasks that it had not been trained to do before. There’s also a derivative model of GPT-3 called Codex. It’s a smaller model, trained on code and text and designed for code suggestions, and it’s the core technology behind Copilot. And you’ll see a little bit later on, you can get an API token and try this API yourself from OpenAI, and that’s what we have done with some examples later. So, you’ll see that.

Just a note on size: 175 billion parameters, what does that mean? To give you some context, it takes roughly 3.14 times 10 to the 23 floating-point operations to train that model, which is an enormous amount of compute. If you ran this on a single GPU, an Nvidia V100, it would take 355 GPU-years to train. So, on a single GPU, it would take that long. On the Nvidia RTX 8000, it would take 665 years. If you bought that on reserved instances on Microsoft Azure, that would be $4.6 million. Thank you very much. Now obviously, they don’t train it on a single GPU, they train it on many GPUs. They use a service in Azure that connects those servers together using InfiniBand, because you have to segment the model in order to make it possible to train in a reasonable amount of time given its size.
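
As a back-of-the-envelope check on that figure, the arithmetic looks like this. The sustained throughput number below is an assumption chosen to be consistent with the 355-year estimate, not a quoted hardware spec.

```java
public class TrainingCostEstimate {
    public static void main(String[] args) {
        double totalFlops = 3.14e23;            // total floating-point operations to train GPT-3
        double sustainedFlopsPerSec = 2.8e13;   // assumed ~28 TFLOPS sustained on one V100
        double secondsPerYear = 365.25 * 24 * 3600;

        double gpuYears = totalFlops / sustainedFlopsPerSec / secondsPerYear;
        System.out.printf("Single-GPU training time: %.0f GPU-years%n", gpuYears);
        // Prints roughly 355 GPU-years, matching the figure on the slide.
    }
}
```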

And here’s what we know about GPT-4, which is basically nothing. When GPT-4 was announced last week, there is deliberately no information in there on the size of the model, how it was trained, any of that information. It’s most likely a derivative of GPT-3.5, but we don’t really know. So, let’s talk about the strengths of this model.

So the thing that was really breakthrough about GPT-3 was how accurate the model was for language tasks. It was able to do things like translation and summarization and writing new text with an accuracy that matched specific, explicitly-trained models. You could train a model to do one particular task; the real breakthrough of GPT-3 was that it could do language tasks that it had not been explicitly trained to do.

So, in the example I gave right at the beginning, I asked it to write a haiku. I could ask it to write an excuse for why the software project is late in the style of Charles Dickens. It’s never been trained to do that specifically, but it can do it. And that is the real power of a model like GPT-3. It’s the generality of that model, and that’s what makes it really interesting. And it’s made ChatGPT really accessible to a broad audience, because anybody can use it to write anything. You can have it make up a short story. It can do all kinds of fun things, and so that makes AI real for a very general audience.

If we switch over to code, though, what else can you do with it? Well, here’s an example for boilerplate. We’re creating a class in Java, and as you know, there’s lots of boilerplate typing that goes on when you create a class. And you can see right there on line 1 at the top, there’s a comment that says what it is we want: a Person class with name and age accessors, and equals and hashCode. You start typing public class Person {, and here’s the completion that you get, which is very good. So this is priming the model with that comment, right? And you’ll see this later: what goes into the prompt is incredibly important. It’s all about the text. But this is a really great example of saving you some time.
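
The slide isn’t reproduced here, but a completion along these lines is what that prompt typically produces; this is a hand-written approximation rather than Copilot’s literal output.

```java
import java.util.Objects;

// A Person class with name and age accessors, and equals and hashCode.
public class Person {
    private String name;
    private int age;

    public Person(String name, int age) {
        this.name = name;
        this.age = age;
    }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        Person person = (Person) o;
        return age == person.age && Objects.equals(name, person.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, age);
    }
}
```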

Here’s another example, this is in Go, but this is an example of a completion to call the Twitter API, and we don’t remember exactly what to do in APIs frequently, so we’re often googling for code examples, like, “How exactly do you call this API?” And maybe looking at documentation, looking at code samples. This is a really good way. It doesn’t give you something that completely works, but it’s very close. And crucially, it’s got an API example that you can take and you can modify and you can turn it into something that is quite useful for you. So that’s a really useful use case for this and would save you a lot of time.

So the original GPT-3 paper, when it first came out, was very honest about the weaknesses in GPT-3, and one of those was drawing factual conclusions. Others were multi-digit maths, repetition, lack of coherence, losing the plot. All of these problems have been common to large language models all along. You still run into them even as the model gets bigger, so this is not a problem that you solve by making the model bigger and bigger and bigger. These problems don’t go away.

And it’s interesting that drawing factual conclusions is one, because there’s been a lot of hype recently about using ChatGPT and GPT-4 as a way to answer general questions that people talk about as being a replacement for search, and there’s been all kinds of articles about how this is the end of the world for Google. But it’s not really a strength of this kind of model to accurately answer questions, because they have no model of the world, they have no model of truth, they have no model of maths, of logic, any of those things. They just have a model of language, and we’ll come back to that a little bit later on, but that’s why the mathematical ability is limited. So something equipped earlier on, I wouldn’t be trusting it to do my tax return just yet.

The other thing is that because these are very, very large statistical models, they are inherently unpredictable, and small prompt changes can make a big difference in the output. So what that means practically is it can be very difficult to figure out how to change your prompt to get the result that you’re looking for. And people start talking about prompt engineering sort of like it’s a science, it’s not a science, it’s trial and error for the most part, because it’s unpredictable.

You just have to remember that there’s been a lot of hype and people talking about the reasoning ability of GPT-4. GPT-4 does not reason. It doesn’t have a model of reasoning, doesn’t have a model of the world. It is just statistical text patterns that it has learned, and that is its amazing strength, and it’s also its biggest weakness, as you might expect. Our strengths are our weaknesses.

So, let’s take a look at what we can do with GPT for unit test writing. So, this is using the OpenAI API. And if I call it, and I give it a prompt, and say write a Java unit test for the following add method, and I give it some Java code, so I have a signature for a method called add. It’s empty, it doesn’t do anything. And what do we get back? We get a perfect unit test. If the code actually did an add operation, this would be a good unit test. It’s maybe a little verbose, but it works, and it’s clear enough what it’s doing.

It’s just wrong, and the reason it’s wrong is because it’s been led astray by the text. It thinks there’s an add method in there. It doesn’t know what the code is doing, it just knows knows what the code says. So, here’s another example of that. We have an add method that multiplies, and again, we get a really good unit test for adding that doesn’t work. This would be perfectly reasonable if this was more reasonable code, and you’d have a decent test.

Here’s an example of just changing the prompt slightly can get you an unpredictable response. This is a saturating add where you add two numbers together, and if it’s above 100, return 100, if it’s below -100, you return -100. So, this is often used in numerical methods. And we get a unit test that doesn’t work at all. It creates a calculator test by instantiating an object of date, time, and service, whatever that is. So here’s an example, it’s seen some code like this before, and it doesn’t know what to do for this particular example, and so it is what’s called hallucinating. So hallucination is when the model doesn’t know what to do, doesn’t have a good match or something it’s learned, there’s not enough in the prompt, it hallucinates. And so it writes an interesting set of assertions there, but this is a test that flat out doesn’t work.

I did want to talk just a little bit more about what happens with ChatGPT, because there’s an additional layer of reinforcement learning in ChatGPT. This is different to what’s going on with something like Copilot, where you’re just feeding things into GPT-3. With ChatGPT, what they wanted to do was steer the model, and they were worried about two things. They were worried about safety, where the model can go off the rails and start spewing racist nonsense, for example, but they were also concerned that it kept giving the wrong answer. And so the reinforcement layer is a way of correcting for that and essentially steering it towards more desirable outcomes, and they use humans to do that.

So, here’s an example. We can start a sentence, “Citizen Kane is …” and we can ask GPT-3 to complete that sentence. And there are four things here, all of which were said about Citizen Kane. One of the best movies ever made, a commercial failure. It was a commercial failure. It was the end of Orson Welles’ Hollywood career, and one of the critics who was loyal to William Randolph Hearst and the person being parodied in Citizen Kane said, “It was a vicious and irresponsible attack on a great man.” And essentially what they do is they get humans to rank these questions, which is their best answer?

And so this is part of the reinforcement layer that sits behind the neural network in ChatGPT, and it steers the algorithm towards desirable outcomes. This is also why, for things that ChatGPT got stuck on or stumbled over, GPT-4 does a better job: there are all of those examples it has been trained on, and there has been reinforcement learning where somebody has steered it towards the better answer.
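
As a rough illustration of what “humans rank the answers” turns into as data, here is a tiny sketch of converting one ranked list into pairwise preference examples that a reward model could be trained on. This is a schematic of the general technique, not OpenAI’s actual pipeline or data format, and the ordering used in the example is just for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class PreferencePairs {
    // One training example for a reward model: the human preferred
    // `chosen` over `rejected` as a completion of `prompt`.
    record Pair(String prompt, String chosen, String rejected) {}

    // A ranked list (best first) of N completions yields N*(N-1)/2 pairs.
    static List<Pair> fromRanking(String prompt, List<String> rankedBestFirst) {
        List<Pair> pairs = new ArrayList<>();
        for (int i = 0; i < rankedBestFirst.size(); i++) {
            for (int j = i + 1; j < rankedBestFirst.size(); j++) {
                pairs.add(new Pair(prompt, rankedBestFirst.get(i), rankedBestFirst.get(j)));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Example ordering only; a real labeler decides the ranking.
        List<Pair> pairs = fromRanking("Citizen Kane is ...", List.of(
                "one of the best movies ever made",
                "a commercial failure",
                "the end of Orson Welles' Hollywood career",
                "a vicious and irresponsible attack on a great man"));
        pairs.forEach(System.out::println); // 6 preference pairs from 4 ranked answers
    }
}
```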

So, we can use ChatGPT to write code. I asked it to write me a perceptron, which is a fundamental building block in neural networks, and said write this for me in Python, and the code comes back. It does qualify as a simple implementation, which is fine. It’s not particularly sophisticated, and it looks pretty good. No one is going to pay me to write code these days; my professional coding days are over. But I can spot two bugs in this perceptron. There might be more, actually.

The first one is that it zeros the weights, which means this perceptron will never learn: the fitting function multiplies the data by the weights, and if the weights are all zero, the answer is always going to be zero. The second is that it doesn’t normalize the input data for fitting, so features that are on different scales will skew the perceptron. So that gives you an idea. It looks good, but you really have to take a close look at code snippets that are non-trivial and make sure they’re doing the right thing.

And then finally we have an example where ChatGPT generates hallucinatory Python to do mobile phone geolocation. The idea is that you should be able to pass in a phone number and get a geographic location for that number, which of course is something that would break lots of privacy rules. The problem is that the company in question offers a geocoding service (geocoding is when you go from a street address to GPS coordinates, or the reverse, coordinates to an address), but they’ve never offered mobile phone geolocation. ChatGPT thought they had and gave people code to call the API. The first the vendor, OpenCage, knew about this was when they started to get bad reviews. They had a big spike in signups, and they started to get bad reviews, and they couldn’t figure out why. What was going on? They eventually tracked it down to ChatGPT. They still don’t know why it thinks that they offer this service.

So the main thing to remember about large language models is that it’s all about the text. They can be incredibly powerful, and that power comes out of the text and the training. But it’s important to remember that large language models don’t understand what your program does; they only understand what your program says. They don’t have a model of the programming language. And so the danger is you can get hallucinations when training is thin or there’s not enough text, and of course, it will never write code that it has not seen before. So if your company has a set of internal libraries that you use all the time, base libraries, common functions for your company, you’re never going to get code that calls those, because it’s not been trained on that code. It’s never seen it before. So it can be very powerful, but it comes with a set of limitations to go with that.

So, let’s just pause for a second. Geoff Hinton is one of the grandees of AI. Geoff helped to invent recurrent neural networks and back propagation in his time at Google. He’s the real deal when it comes to being an AI expert. And his point about large language models is that they just know language. And so, he said, “We learn to throw a basketball so it goes through the hoop. We don’t learn that using language. We learn it from trial and error.” And so the guy on the left is reading about basketball, and the people on the right are practicing basketball, and the people on the right are going to learn how to shoot hoops. The same approach is used in Diffblue Cover. Essentially that’s what reinforcement learning about. Reinforcement learning is a lot like learning to play basketball by trial and error, by shooting hoops.

So consistently accurate test writing is much harder than code completion, because you can’t rely on a human to correct your mistakes, and you need a lot more context. It’s actually quite difficult to do this for typical Java code. I’ve been using very simple examples to make my point and make it easy to understand, but as you know, real-world Java code is not that simple. So you have a lot of context that you need to take into account. You’ve got to have a semantic model of the program; otherwise you can’t, for example, optimize for unit test coverage, because you need to know what kind of coverage you have, and that means having a model of the program. And it’s most valuable when it’s 100% autonomous, because you need a lot of unit tests. If you are writing unit tests for 500,000 lines of Java, you’re going to have tens of thousands of unit tests as a result. Hand-crafting or hand-checking each of those isn’t really scalable. It’s much more valuable if we can make it autonomous.

And so what we do in Diffblue Cover is reinforcement learning, which you can think of as coding as search. Essentially, what we are doing is conducting a search for unit tests, and you can think of it as a Venn diagram: the set of all possible test programs, where we want tests that are valid and run, that have high coverage, and that developers find easy to read. That’s essentially what we’re doing in Diffblue Cover.

And so, our analogy to shooting basketball hoops here is that we write a test. We start out by guessing what a good test might be for some code, and we run it against the code. So, we run the test against the method and we see how it does. What kind of coverage does it get? There are other metrics that we measure; some of this is our secret sauce, if you like. We use the existing Java code to give us clues, and from that we can predict what a better test might look like, and we can write that test, run it, and see how it does. Does it do better, does it do worse? And we keep going, we iterate through this loop. So, we’re shooting lots of basketball hoops here, going around the loop, until we have the best set of tests that we can find for a particular method, and then we repeat that across the entire program.
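
In pseudocode-ish Java, the loop being described has roughly the shape below. Every type and method name here is a hypothetical placeholder; the real Diffblue Cover algorithm, including how it predicts better candidates, is not public. The sketch only shows the idea of test writing as an iterative search guided by a coverage-based reward.

```java
// All of these interfaces are hypothetical placeholders for illustration.
interface TestCandidate {}

interface TestGenerator {
    TestCandidate initialGuess(String methodUnderTest);
    // Predict a better candidate using feedback from the previous attempt
    // and clues from the existing Java code.
    TestCandidate improve(TestCandidate previous, double previousReward);
}

interface TestRunner {
    // Compile and run the candidate against the method; return a reward,
    // for example the coverage achieved (plus other metrics in a real tool).
    double run(TestCandidate candidate, String methodUnderTest);
}

public class TestSearchLoop {
    // "Shooting hoops": guess a test, run it, score it, predict a better one,
    // and keep the best-scoring test found within the iteration budget.
    public static TestCandidate searchForBestTest(String methodUnderTest,
                                                  TestGenerator generator,
                                                  TestRunner runner,
                                                  int maxIterations) {
        TestCandidate best = generator.initialGuess(methodUnderTest);
        double bestReward = runner.run(best, methodUnderTest);

        TestCandidate current = best;
        double currentReward = bestReward;
        for (int i = 0; i < maxIterations; i++) {
            current = generator.improve(current, currentReward);
            currentReward = runner.run(current, methodUnderTest);
            if (currentReward > bestReward) {
                best = current;
                bestReward = currentReward;
            }
        }
        return best;
    }
}
```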

So, if you think about where this fits into a sort of typical software development process, if I think about pull requests and GitHub style of doing things, your engineer writes the code, you’ve got your existing code base, you write your code change, you create a pull request. As part of that, typically CI will run your unit test for you. You might have to run them yourself. And based on what happens with the tests, you can figure out is the change correct? Do the tests pass or fail?

And so, what the unit tests are doing is helping you find regressions, right? Your unit tests passed on the baseline, you’ve written the change, and now a unit test fails, which means that something is wrong: either the test is wrong, or the code is wrong. So the engineer takes a look, figures out what they need to do, and updates the code until the tests pass, and off they go.

So, with Diffblue Cover, you’ve got a test baseline, so Diffblue can write a baseline for the entire program. So, you take the code base, take your main line, and we’ll give you an entire test suite that represents the current behavior of the program. And so, then when you have a pull test, a pull request rather, you can run those tests against the pull request. Now, we have a feature in the product that can speed this up for you where we can have CI run just the relevant test, because we know from the pull request how the code has been changed, and we know which tests will reach that code from the core graph.

And so, from that we can figure out the minimal set of tests that we have to run in order to fully test this particular code change. So you run those Diffblue tests and you look at what happens. Do they succeed, do they fail? The engineer updates the code, and when they’re happy with it, we regenerate the affected Diffblue unit tests. This is done incrementally, so you don’t rewrite the unit tests for the entire program, which would take too long. You just rewrite the tests that need to be changed as a result of this PR. So you get a baseline, and then you have Diffblue maintain your unit test suite as the main branch evolves, as your code base evolves.

All right, so presentation over for a second, and what I’m going to do here is I’m going to switch over, and I’m going to show you Google Chrome. So here is Google Chrome. What we’re going to do is we’re going to use Spring Pet Clinic, which is the example application for Spring Java. And so in our fictitious pet clinic, we have a way to find owners. So if I just click find owners, and I don’t type anything in, I get the list of all of the owners. If I put in a name where I get multiple results, I get a list of owners, and I can then click on those and look at the individual owners. If I put a name in there that doesn’t exist, my name, you get not found. And then if I put in a single owner, like McTavish, then we get different behavior again. What we get is we get directed to the page for this particular owner. So this is ID number five I can see from the URL, and that’s my find owner functionality in Pet Clinic.

So what we’re going to do is take a look at the code. I’m going to switch back to PowerPoint here because it’s just clearer to do it this way. So here we go. Here’s the code for that, find owner, that’s the logic that we just saw. So in this example, we’re not going to use a very simple made-up Java program. This is a real Java as you might get in a Spring NBC application, so a typical web-based application. And you can see here that if we get the empty field, then we look for all of the owners. If the owner is not found, then we go back to the find owners page, and we print a not found message. If we have exactly one result, we go straight to the page for that owner, so we don’t give people a list with one item in it. And if you’ve got two or more results, then you’re going to show a nice paginated list of owners. So pretty straightforward.

So I’m going to do that. I’m going to run Diffblue Cover, so I’m going to switch over into the terminal. I’m going to run dcover create. So just a couple of notes here while it gets started. All of this is running locally. So, I’m running this on MacBook Pro. It’s got Apple Silicon. This is an M1 MacBook Pro. So everything is running locally inside my existing Java development environment.

So this is very different to large language models, because they are so big, you really have to run them in the cloud in order to get any kind of speed out of them. You need the kind of processing power that does not sit on your desktop. Yes, you could run it on your desktop, but it would be incredibly slow. Most people don’t have the hardware necessary to run the inference for a 175 billion-parameter model. They don’t have that sitting on their desk.

However, reinforcement learning doesn’t have to use a 175 billion-parameter model. It is doing this iteration, and it is a much more efficient way. So the predictions that we’re making here are for better tests and I need to do that. So you can see that we’re already done on this, and then Diffblue Cover is just going to run the test, just to make sure that the test that it wrote runs properly. And there we go, we’ve done the whole thing, including checking that the test had run properly. We did that in one minute and 21 seconds. So, we’ve got our unit test now.

So what I’m going to do is switch over into IntelliJ, just because it’s a little easier to see what’s going on. You can see this is our owner controller code, and you can see that there’s this new file showed up, owner control Diffblue test. I’m going to open that. So here we go. This is the code that has been written by Diffblue Cover. I’m going to process find form. There it is.

So here are tests that have been written by Diffblue Cover for the code that I showed you before. The first test is for the empty result. How does Diffblue Cover do that? It sets up a mock, right? It’s using Spring’s mocking capability here to return an empty list, and we want to see that we end up back at the find owners page. The next thing is the single-owner result. What’s going on here is that we are creating an owner object to mock the return, because we’re not going to call the database here. This is a database-driven application, but we’re mocking out the database. So we create an owner and put it in a list so that exactly one item comes back, and there’s our mock to do that.

And then we have another mock down here, where we’re mocking the call into the controller, and we’re checking the results. So we want to see that, yes, that owner should be found, and we should be redirected to the owner’s page for ID number one up here. So that’s our second test, the single-owner example. And as you might expect, here’s what’s going on with multiple owners. We create two owners for the mock this time, add them to a list, and return that list with a when/thenReturn. Same thing: mock the call into the controller, and we see that we get redirected back to the owners list page. And the size of our response here is five, which represents all of the items in the model when you’ve got two owners. And then finally, we have another processFindForm test, and this is the case where the owner is not found.

And so, we’re going to put a name in that’s wrong, and we’re not going to find it. We are going to get a model size of one, and we’re going to redirected back to the find owner’s page. And so you can see that we have generated or we have found all of the tests for that particular Java code, and we did that completely autonomously without having to review any code.

So, what is Cover useful for? Cover is designed to solve one problem very well: unit test writing, and today it only supports Java. One of the things we’re giving up with this kind of approach is that it’s not as general as a large language model, but it is much more accurate, and that matters in unit testing. So today this is just for Java, and we’ll do other languages in the future.

It is 100% autonomous. It is not a code suggestion tool. It writes complete sets of tests, and it will write them for very, very large programs, completely unattended. And it ships with a dashboard. When you get to a large number of tests across the program, you need a way to visualize what’s going on with unit testing: what kind of coverage you’re getting, what Diffblue has done, what the tests your own developers have written are covering, where the highest risk is, and whether there’s untestable code. The dashboard tells you about all of those things.

So in summary, generative AI for code tools are real, and they can help eliminate tedious, error-prone coding. Reinforcement learning is much better than large language models when it comes to unit tests. And you can try this for free: you can try Diffblue for free, and you can try the other tools for free. Here’s a QR code, and there’s also bit.ly/aiforcode, so try it yourself. I’m going to stop there. I’ll leave those slides up, but let’s get into Q&A.

Eric Koslo:

So, you just answered the question, but I’ll ask it again. Will this be available for C# and Visual Studio?

Mathew Lodge:

We’d love to do C# in the future. That’s one of the languages that’s high on our list to do in the future, but that’s a future development for us.

Eric Koslo:

Awesome. So the next question there is will this generate identical code each time it’s run?

Mathew Lodge:

Yes, it’s deterministic. And part of the reason for that is that a lot of our customers integrate directly into CI, and what they show their developers is the delta between what were the tests before and what are the tests now. So, you’ve got your PR, we’ve generated tests for their PR, how do they differ? And so you don’t want to have a lot of noise in that, so that’s why it’s deterministic.

Eric Koslo:

All right. So another question is, I’m going to reorder the way this question was asked a little bit, take a program of about 15,000 lines. About how long does it take to train the model on that code base?

Mathew Lodge:

Yes. So, I can’t tell you exactly. What I can tell you is that we write a test roughly every one to two seconds. So if a 15,000 line program required 1,000 tests, you’d take you around 1,000 to 2,000 seconds.

Eric Koslo:

All right. There are two other questions in here that talk about test-driven development. So there’s two of them that say will this work with test-driven development, but I’m going to read this one a little more. Is GPT in a state that it could return valid and reliable code given a test provided by a human?

Mathew Lodge:

It’ll do the best it can, but it’s very difficult. It finds it difficult to go the other direction. So, it’s seen a lot of tests written for existing code in training. We haven’t tried to assess it going the other way, actually, so I don’t know, I guess, is the honest answer.

Eric Koslo:

But effectively, what does Diffblue do in test-driven development cultures where there’s really that focus on unit tests?

Mathew Lodge:

So, a lot of teams write the initial set of tests and the code to go with them, and then they can run Diffblue and get all of the tests that they missed. It’s quite easy to miss tests for things, especially negative cases; that’s pretty common. There’s a version of Diffblue that integrates with IntelliJ that’s designed for this case, where you can have it, for example, write skeleton tests, because one of the more boring parts of test-driven development is writing all the mocks. We can write all of those for you and give you a test outline that you can then fill in in test-driven development.

Eric Koslo:

All right. And there’s two questions here that are pretty similar. One is can this generate tests for a single-page application, referring to the front end, and another says how useful is this for UX/UI testing? So, what about the connection between testing front end and testing backend?

Mathew Lodge:

Yeah. So these are unit tests, so it’s not trying to do UI/UX testing, which is more functional testing: does the program do the right thing given these inputs? In unit testing, you’re isolating individual modules. In the case that I showed there, we are isolating what goes on in the owner controller, and we just test the owner controller; then there’s a separate set of tests for the pet controller and the visit controller, and we isolate those units and mock the dependencies. So it will work for a single-page application, in the sense that you’ll get a set of unit tests for it, but it is not trying to do what UI testing software does.

Eric Koslo:

So I think what you’re referring to is I’ll be able to do my functional testing on the UI, but I will have a reliable tested backend.

Mathew Lodge:

Yeah, you’ll have to, that’s right. Yep.

Eric Koslo:

Okay, another question here. If the code I write has a bug, will it generate faulty tests?

Mathew Lodge:

Yes, it will. We have no idea what your intent is with your code; we just know what your code does. We don’t know if it’s right or not, because we have no way of knowing. This question comes up quite a lot. But you’ve got to remember that unit tests are there to detect regressions. So when you think about a code base at a particular point in time, you have a code base and you have a set of tests. Unit tests that are written by humans reflect the current behavior of the program, bugs and all. That’s what unit tests are: by definition, unit tests pass on your program as it is, including all the bugs. And we are no different.

Eric Koslo:

Got it. People do tend to rely on certain bugs because they assume that those are normal behaviors. So yeah, if you fix a bug, you can actually cause a regression by fixing something.

Mathew Lodge:

That’s right. Well, you need to know if the behavior’s changing.

Eric Koslo:

All right. Since GPT and similar AI models require so much computational resource, will we always be dependent on the large cloud infrastructures to provide services, or do you see any possibility of open-source equivalents that we can run on premises?

Mathew Lodge:

Yeah, that’s a great question. So today, this is very much you have to be incredibly well funded to develop these models. However, some folks at Stanford did an interesting experiment where they trained a model based on the output of GPT-3. So they fed prompts into GPT-3, and they got responses, and they used that to generate training data for a much smaller model, the model’s called LLaMA. So if you google Stanford LLaMA, you’d probably find the writeup on this particular experiment. It was a much smaller model, and it was able to, it’s something that you could potentially run on a desktop with a decent GPU, and it did a pretty good job of replicating what GPT-3 could do, but on a much smaller basis. It wasn’t as accurate, had a couple of other flaws, but I think that’s a really interesting approach.

Eric Koslo:

All right. There’s a further question here: is this tool a step towards making testers redundant?

Mathew Lodge:

It’s not really. I don’t think it’s a move towards making testers redundant. I think it makes developers actually more productive. And this is tedious error-prone stuff.

Eric Koslo:

Yes, all right. Does it mock the objects which are direct dependencies of the tested class, or is it able to mock components deeper in the hierarchy that perform external interactions? For example, something reaches out to a database, or something reaches out to a REST service. How does Diffblue interact with those capabilities?

Mathew Lodge:

Essentially, we’re mocking the direct call, and one of the nice things about Spring framework is that all of that is built in, and so it’s much easier to do in Spring. We can also do it for regular Java code, and we also give you controls for things like static mocking, so you can tell Diffblue Cover how to mock certain things. And then the ultimate configurability is that we give you a way to do that using a custom test harness. So you can do that with Diffblue Cover, but you have to give it more information, you have to give it more to go on.

Eric Koslo:

All right. Will it be able to detect cybersecurity vulnerabilities to figure out malicious actions?

Mathew Lodge:

So, it doesn’t do that out of the box. The interesting thing, though, about this kind of approach, and one of the things we’ve seen with people who use the product, is they look at the test to figure out what the code does. And this is pretty common for applications where there’s poorly understood code. So maybe you’ve inherited an application or a module in an application, the person that wrote the code is long gone, nobody really understands this very well. When you run Diffblue against it, you find out what the code does, because the tests are asserting on the behavior of the code. So, it can be useful for finding side effects that might be security vulnerabilities, but it doesn’t do that out of the box.

Eric Koslo:

All right. Here we go. So: tests can be used to organize the developer’s thought process and catch unexpected behavior. If it’s not part of the development cycle, it can push developers back to writing code without structuring their process with tests. It looks like Cover is more for covering legacy code bases to maintain current quality, not for creating something new. Is that correct?

Mathew Lodge:

You can use it for new code as well as existing code, but a big challenge for … The number of times that you get to write code from scratch is pretty small compared to building on something that already exists. And so this definitely is a real help when you’ve got a poorly tested code base. I hesitate to use the word legacy. I know what you mean by that term, but for some organizations, what is legacy code? It’s code they shipped last week. If it’s in production, it’s legacy.

Eric Koslo:

I don’t call it legacy. I call it revenue-generating code. It sounds much better. All right. How well does it recognize corner cases, such as bounded addition in the slides?

Mathew Lodge:

Essentially what it does is iterate, and it does a good job of finding boundary conditions. I mentioned that one of the things we feed into our reinforcement learning loop is the existing code, and that’s how we can find corner cases.

Eric Koslo:

All right. So, there’s a number of questions in here about TDD, and this one, you cannot write unit tests for functionality which was not written, so call defects of omission. What could be used in combination with Cover to help developers understand all possible scenarios which need to be coded if you take unit test writing from them? So I think the way to summarize that is how do I differentiate between what Diffblue automates and what I as the person should be focusing on?

Mathew Lodge:

Yes. So, in TDD, the whole idea is you write the test first, and so you are writing tests for the happy path for the most part. And so the way that Cover can help there is it can write skeleton tests for you if you’ve got existing signatures for the things that you want to code, and if you have dependencies as well that need to be mocked. So that’s one part. And then the second thing you can do is when you’re done, you’ve written your tests, you’ve written your code, run Diffblue Cover and see what else it finds.

Eric Koslo:

All right. Here you go. Test-driven development in my understanding starts with writing tests and then proceeds to write code to match the test. Is there support for such a workflow? For example, deriving the tests from the API, probably interface only from the API docs and method names.

Mathew Lodge:

There are tools out there that will do that for you today. They will essentially write stubs from an API definition, because that’s a mechanical thing that you can do.

Eric Koslo:

All right. Does it generate comments to explain the complex blocks of generated code?

Mathew Lodge:

It generates a comment to explain what it’s testing. It does not generate a comment to explain what the code does, because it doesn’t know. All it knows is that this is a test that is important for establishing coverage. It is an area where we might be able to use large language models to generate the comment, so that’s something that we’re exploring right now. It could be an interesting way to go from code to text.

Eric Koslo:

All right. So just a clarification on the rest of the questions that came in, because a lot of them focus on TDD; I think most developers and architects are taught to write the tests first. How often they do that, I’m not really sure. Question here: can Cover be used in test-driven development to iteratively update tests while you iteratively update code? And do you have examples of how that works?

Mathew Lodge:

All of these TDD questions are essentially asking the same question about how you use the tool. We also have some examples on the blog, so if I direct people there, they can dig into that in more detail. We do a developer survey every year. In one we did, I think, two years ago, we asked how many people do TDD, and 20% of respondents said they did. Later on in the survey, we asked how many of you write the test first, and that was 8%. So TDD means different things to different people. For some people, TDD means I’m writing my tests at the same time as I’m writing my code, and I’m using good practices to think about unit testing when I write my code.

Eric Koslo:

Right. So that would explain the discrepancy. So, in that survey, about how much time are people spending writing their unit tests? Is it half the time? Is it a lot? Is it a tiny bit? What have you found in terms of the time impact without Diffblue, the hands-on keyboard typing of unit tests?

Mathew Lodge:

Yeah, that’s a great question. So we often ask our customers that, and what we get back is roughly somewhere between 20% and 30% of the time is their best estimate. And we also ask them how long on average does it take to write a unit test? And to a certain extent this is difficult because most developers don’t time themselves like that. And so you get sort of rule of thumb things, like 10 minutes, 15 minutes on average. Some are faster. If you’ve got lots of mocks, some are going to be slower. But that gives you an idea of how much time you’re saving.

Eric Koslo:

Okay. So, let’s assume the engineers are spending 20% to 30% of their time writing unit tests. Let’s say I do one day a week on unit tests, because that’s about 20%. If I have the capability to automate my regression unit testing on the code that I’m writing, about how much time will I spend on both reviewing the automated unit tests to just get a feel for them and writing my own tests?

Mathew Lodge:

Yeah, that’s a great question. So, what we find is that when people first try Diffblue Cover, they’re very intrigued, and they want to make sure that it’s writing good tests. And so they spend a lot of time looking at the tests that have been generated, and they try it in different kinds of code and different modules, and they spend a lot of time on test inspection until they get to the point where they can trust it. And at that point, they integrate it into CI, and they just get tests as a matter of course. And so they’re going to be looking at the test when they’re doing PR.

So the most common implementation is: what is the existing set of Diffblue tests, and how do they change as a result of this PR? Essentially, we take the branch the PR is on, run Diffblue Cover against it, and you get a new set of tests; you diff the two, and that tells you what has changed. So you get to look at the tests at that moment, but you don’t inspect every test that it writes.

Eric Koslo:

Okay, awesome. So in the organizations that adopt Diffblue, I know a lot of companies have a mandate that you have to maintain, say, 80% test coverage. So what’s the impact of implementing Diffblue on a code base that is either below that or needs to maintain a certain level of test coverage?

Mathew Lodge:

To a certain extent, this depends on the code itself, because some code is untestable: neither a human nor Cover can write a test for it, for a variety of reasons. It can be very simple stuff, like there are no observers that we can call, so we can’t get at the state of an object in order to write an assertion, for example. What we’ll do in that case is give you a partial test; we can’t write the assert statement because there’s no observer for this thing, but we will tell you what to do.
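
As a made-up illustration of the “no observers” case: in the first class below there is no way for a test to see the effect of applyDiscount, so only a partial test is possible; adding a simple getter, the kind of mechanical change refactoring for testability makes, is enough to let an assertion be written.

```java
// Hard to test: the field changes, but nothing exposes it, so a test
// can call applyDiscount() and then has nothing to assert on.
class OpaqueInvoice {
    private double total;

    OpaqueInvoice(double total) {
        this.total = total;
    }

    void applyDiscount(double percent) {
        this.total = total * (1.0 - percent / 100.0);
    }
}

// Testable: the same class with an observer (a getter) added.
class ObservableInvoice {
    private double total;

    ObservableInvoice(double total) {
        this.total = total;
    }

    void applyDiscount(double percent) {
        this.total = total * (1.0 - percent / 100.0);
    }

    double getTotal() {
        return total;
    }
}
```

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

class ObservableInvoiceTest {
    @Test
    void applyDiscountReducesTheTotal() {
        ObservableInvoice invoice = new ObservableInvoice(200.0);
        invoice.applyDiscount(10);
        assertEquals(180.0, invoice.getTotal(), 0.0001);
    }
}
```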

In the tool itself, we have automated refactoring for testability. Simple problems, like a lack of observers, we can just fix for you, because we already know they’re missing. So you can run Diffblue Cover in refactor mode, and it will automatically make all of those changes. You can check them in, rerun Diffblue Cover, and you’ll get more tests. Some customers fire it up and get 90%. On Pet Clinic, for example, we get 96% test coverage; Pet Clinic is a really nicely written Java application. So, to a certain extent, it depends on how the code is written, but it’s not uncommon to go from, in some cases, zero to 50%. And then with some tuning of the product and some customization, telling Cover more about your environment and the kind of data it’s likely to see, you can easily get up to that 80% figure.

Eric Koslo:

All right. So, we talked before about how a major point of unit tests is to identify regressions. There’s a question here about the challenge that test-first forces developers to understand expected behavior before coding, which is that 20%. So can you talk a little bit more about the idea of writing the unit test before I write the code, versus having automated tests appear after I write the code, where maybe I didn’t write a test?

Mathew Lodge:

So essentially, one of the things that the dashboard product will tell you is how much of the code coverage comes from Diffblue Cover, how much comes from tests you wrote, and what the intersection of the two is. We can show you where Diffblue Cover could have written the tests that you wrote, where both cover the same thing, and then what increment you got out of Diffblue Cover on top of that.

Eric Koslo:

All right. And then another question about what’s local versus what’s in the cloud when the AI uses a lot of parameters. When using Diffblue, will any code be sent to a server outside my organization?

Mathew Lodge:

No. It can run completely isolated. You don’t need Internet connectivity to run Diffblue Cover, and it doesn’t send any code. We never see your code. We don’t run a cloud service that [inaudible] your code.

Eric Koslo:

Right. So the example you did before was exclusively on your laptop, no cloud connectivity at all?

Mathew Lodge:

That’s correct, yes. Runs completely locally.

Eric Koslo:

Awesome. All right. Question: is there a free version of this? That’s actually linked right here, the Cover free trial. Anything people should know about that? Is it time-bound? Is it consumption-bound?

Mathew Lodge:

Yes, that’s a great question. So we have Cover Community Edition, which is free forever, and that is a plugin for IntelliJ. So if you’ve got IntelliJ, you use IntelliJ, you can go to the IntelliJ Marketplace, search for Diffblue, install that plugin, and certainly, you get limited number of tests per day, essentially, so it’s limited in how many tests it will write for you, and it’s free forever. So if you’re very patient, you can write all of your unit tests that way. The free trial doesn’t have that kind of limitation, and it gives you the command line version, the version I showed you, so you can also try out things like incremental mode, which you would use in CI. And also, if you don’t use IntelliJ, then you want to have the command line version. The trial is for 14 days. You plug in your email address, we send you a license in less than a minute, you download the software, and off you go.

Eric Koslo:

Awesome. All right. I think that hits a lot of the questions, and we are coming up … OK. Is it free or open-source software?

Mathew Lodge:

It’s free. It’s not open source.

Eric Koslo:

Got it. Oh, is it? Sorry, that doesn’t say free or OSS. It says: is it free for FOSS? So for example, someone creates their own open source project, can they use Diffblue for free on that open source project?

Mathew Lodge:

Yeah, they can use Community Edition for free. And if we’ve got any open-source maintainers out there, we’d be happy to give you something for CI integration for open source projects. We’re big open source fans ourselves. We also maintain a separate open source project called CBMC. [inaudible]

Eric Koslo:

Got it. So, whoever asked that, write a message, and I’m sorry I read it wrong initially. I read “or FOSS”, not “for FOSS”. So that was my mistake. That’s good to know. All right. So that covers it. I hope everybody enjoyed today’s webinar. Feel free to follow up on some of the links, and hopefully you’ll spend less time having to write tests while making better applications.

Mathew Lodge:

Great. Thanks, Eric.

Eric Koslo:

All right. Thanks a lot.

Mathew Lodge:

Thank you.
