Transcript
My name is Mathew Lodge, CEO of Diffblue, and this is a presentation I gave at the Quest for Quality conference in Dublin, Ireland. AI is eating software: Using AI to write tests.
Marc Andreessen wrote an article back in 2011 about why software is eating the world, essentially pointing to the fact that software is now in pretty much every industry. He said that in the future, every company would become a software company in order to differentiate and be successful.
The irony of this is that software has eaten almost everything except writing software itself. Software is the only industry that has not been automated by software.
Jensen Huang, the CEO of Nvidia, contends that AI is going to eat software. What he means by that is that AI techniques are able to produce programs that solve problems human programmers have been unable to tackle, in some cases for over 30 years.
What we see is that AI is eating code from all different angles. Jensen Huang is talking about the machine learning AI algorithms that you have on the left. So far, with certain tasks like image recognition, AI algorithms are much better than algorithms written by human programmers, and for image recognition, computers are in some cases even more accurate than humans at identifying images.
At the top end you also have new low-code/no-code AI platforms; things like H2O and DataRobot are in this space, and they are reducing the amount of manually written code that is necessary to build machine learning models. Then in the testing world, you have things like TestCraft and Mabl using artificial intelligence to emulate humans, so it looks like a human is using the product. Coming up from the bottom, there are things like TabNine, which is AI-based autocompletion; Microsoft has also introduced a similar technology; and then things like Diffblue, where we’re writing tests using AI. I’m going to talk about what Goldman Sachs have been doing with this and what it looks like.
So Goldman Sachs have been using AI to write Java unit tests. Their purpose is to automate their software development process, improve quality and facilitate agile adoption.
So what does this look like? What I’m showing you on this screen is an example of a test, on the right-hand side, that has been written by Diffblue’s AI. On the left-hand side you can see the original code; the tool analyzes the original code in order to write unit tests. The code on the left is noughts-and-crosses, or tic-tac-toe, gameplay code, and the method here checks to see if someone has won the game. Beginning at line 64 is the logic that checks whether a player has won by filling a column. On the right-hand side you can see the test that Diffblue Cover wrote in order to exercise this path. Essentially, the test is arranged in three sections. The arrange section puts together the data and the inputs that are going to be fed into the method: you can see it builds an array of integers for the board and creates the board object from those, and in this particular example player 2 has won. The test then runs the method itself, checks the result, and asserts; in this case, that player two should have won the game. The AI goes through each of the methods and, for a particular path of execution, generates a test that takes you through that path and asserts that you get the correct results.
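To make the arrange-act-assert structure described above concrete, here is a minimal, self-contained sketch of the kind of column-win check and matching test being discussed. The class and method names (`Board`, `columnWinner`) and the board encoding are illustrative assumptions, not Diffblue Cover’s actual output:

```java
// Hypothetical sketch of the code under test and an arrange-act-assert
// style test for it; names and structure are illustrative only.
public class TicTacToeTestSketch {

    // Simplified 3x3 board: 0 = empty, 1 = player 1, 2 = player 2.
    static class Board {
        final int[][] cells;

        Board(int[][] cells) { this.cells = cells; }

        // Returns the winning player (1 or 2) if any column is filled
        // entirely by one player, otherwise 0.
        int columnWinner() {
            for (int col = 0; col < 3; col++) {
                int p = cells[0][col];
                if (p != 0 && cells[1][col] == p && cells[2][col] == p) {
                    return p;
                }
            }
            return 0;
        }
    }

    public static void main(String[] args) {
        // Arrange: build the array of integers for the board;
        // player 2 has filled the middle column.
        int[][] cells = {
            {1, 2, 0},
            {1, 2, 0},
            {0, 2, 1},
        };
        Board board = new Board(cells);

        // Act: run the method under test.
        int winner = board.columnWinner();

        // Assert: player two should have won the game.
        if (winner != 2) {
            throw new AssertionError("expected player 2 to win, got " + winner);
        }
        System.out.println("winner = " + winner);
    }
}
```

In a real generated test the assertion would use a test framework such as JUnit; plain `throw new AssertionError(...)` is used here only to keep the sketch dependency-free.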
Goldman Sachs’ real challenge is that they know there is a relationship between the use of automated testing and high software delivery performance; this is something the authors of the Accelerate book were able to demonstrate. Unfortunately, a lot of Goldman Sachs’ software was written before unit testing was established practice, so they have a backlog of tests to be written and maintained that stands between them and their software delivery goals. Goldman has used AI to write unit tests and has been able to increase coverage from 36% to 72%. Goldman Sachs estimated that this is 10x faster than manual coding.
For comparison, suppose a human takes 30 minutes to write a test and works 6 hours a day. In one specific example, the AI wrote about 3,211 tests in an overnight run of about 8 hours; Goldman estimates that the same work would have taken 268 developer days. You can see that this technique is much faster than the manual alternative, so for organizations looking to adopt this kind of approach very quickly, having AI write the tests and then maintain them is very attractive.
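The 268-developer-day figure follows directly from the numbers quoted above (3,211 tests, 30 minutes each, 6 working hours per day), as this back-of-the-envelope check shows:

```java
// Sanity-check of the manual-effort estimate quoted in the talk.
public class EffortEstimate {
    public static void main(String[] args) {
        int tests = 3211;
        int minutesPerTest = 30;
        int workingMinutesPerDay = 6 * 60; // 6-hour working day

        // 3,211 tests * 30 min = 96,330 minutes of work.
        double developerDays = (double) tests * minutesPerTest / workingMinutesPerDay;

        // 96,330 / 360 = 267.58..., which rounds up to 268 developer days.
        System.out.printf("manual effort: %.0f developer days%n", Math.ceil(developerDays));
    }
}
```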
So here’s how this currently works in the Diffblue product. Essentially, we have an input Java program which we analyze, using a blend of static and dynamic analysis, and from that we get a behavioral model of the code. We conduct an AI-guided search through that model, over both control flow and data flow, and that gives us a testing strategy, which we can then use to synthesize Java code that will exercise those different paths, feed in the correct inputs and check for the correct outputs. That goes into the tests that are generated.
At Goldman, the deployment architecture is based on Kubernetes. Everything runs in containers so that the product can scale out, and the workload is memory-bound, so by adding additional instances you can speed up the generation of tests.
The current challenges with this kind of approach are well-understood problems in reachability analysis. Take one-way functions, for example, such as the hash functions used in cryptography and other applications: their entire purpose is that reversing them is extremely difficult and takes an effectively infeasible amount of time, so we have not solved any magic computer science problems here.
So what we can assert on the other side of a one-way function is basically that the result is not null. Since we can’t reverse the function, we can’t figure out what the correct input should be, and in general these reachability problems are computationally hard. On the other hand, this is a good indicator that a method should be refactored; if a method has that kind of complexity in it, then it’s a good candidate for refactoring.
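The one-way-function limit can be illustrated with SHA-256 from the standard `java.security.MessageDigest` API: since no tool can work backwards from a desired digest to an input, about all a generated test can assert is structural properties of the output, such as it being non-null and the right length. The input string here is an arbitrary assumption:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of the strongest assertions available for a one-way function:
// we cannot choose an input that produces a specific digest, so we can
// only check structural properties of the result.
public class OneWayFunctionTestSketch {
    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Arrange: any input will do, since we cannot reverse the function
        // to pick one that yields a particular output.
        byte[] input = "some arbitrary input".getBytes();

        // Act: apply the one-way function.
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(input);

        // Assert: non-null and the expected 32-byte length is about all
        // we can check without reversing SHA-256.
        if (digest == null) throw new AssertionError("digest should not be null");
        if (digest.length != 32) throw new AssertionError("SHA-256 digest is 32 bytes");
        System.out.println("digest length = " + digest.length);
    }
}
```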
The ultimate goal here is that you can get a set of unit tests that can run very quickly, right after a commit, and this kind of technology is really good for generating that kind of unit test. These tests are designed to be simple: they should run quickly, they should catch regressions in the code by failing when a regression is introduced, and they can be used very early in the process in order to get feedback to the developers. These are imperfect tests, and as Martin Fowler put it in his canonical 2006 article on continuous integration, ‘Imperfect tests, run frequently, are much better than tests that are never written at all.’ So you can think of this AI as writing imperfect tests, if you like.
A brief summary: AI is eating software in lots of different ways: top-down, bottom-up, algorithmic, test automation. AI-written tests are one example of the bottom-up approach, and based on Goldman Sachs’ experience they are about 10x faster than a human software developer. They run early, they run quickly, and they spot basic functional errors, so the developer gets quick feedback as they commit code. AI-based user emulation is better for functional tests: there has been a lot of work on emulating humans so that it looks like a human user is sitting in front of an application, and for applications with user interfaces, such as a browser, that approach is much better for functional testing. The kind of testing I have been describing is really about unit testing, and it happens earlier in the CI or development cycle.