New generative AI for code tools like Diffblue Cover and GitHub Copilot can help developers do their jobs more quickly, with less effort, by synthesizing code. ChatGPT has made this kind of generative AI accessible to a broad general audience and, in the process, made far more developers aware of it.
The biggest difference between Cover and Copilot is that Copilot is an interactive code suggestion tool for general code writing (Microsoft calls it an ‘AI pair programmer’), whereas Diffblue Cover autonomously writes entire suites of fully functioning unit tests without human help.
Copilot is designed to work alongside a human developer on tedious and repetitive coding tasks like writing boilerplate and calling into “foreign lands” (third-party APIs). While you can get Copilot to suggest a unit test for your code, you can’t point it at a million lines of code and come back a few hours later to tens of thousands of unit tests that all compile and run correctly – something you can do with Cover.
The other big difference is that Cover writes the entire suite of unit tests necessary to maximize coverage (a major advantage in big modernization projects, for example), whereas Copilot will write one test and then require some “prompt engineering” to coax it into writing a different one, plus the time to edit the resulting suggestions and get them to compile and run properly.
Under the covers, generative AI for code tools fall into two basic categories: transformer-based Large Language Models (e.g. Copilot, ChatGPT), and reinforcement learning-based systems (e.g. Diffblue Cover).
Here’s a quick summary:
| Feature | GitHub Copilot | Diffblue Cover |
| --- | --- | --- |
| AI approach | Large Language Model | Reinforcement learning |
| Learning approach | Supervised (pre-trained); human refinement of results | Unsupervised (no pre-training required) |
| Training source | The internet | N/A |
| How it writes code | Suggests completions according to a statistical model of coding patterns learned during training | Probabilistic search of the set of possible tests for a given program |
| Code-writing approach | Generates a list of auto-completion suggestions based on input; developer chooses and edits | Analyzes the program and then writes tests autonomously |
| Interactive usage | Describe the code (signature) or start coding, invoke Copilot, and choose the best completion | Select a method/class and click “Write tests” |
| Autonomous usage | Not possible | CLI to autonomously write tests for entire projects, plus fast incremental test generation for any PR (for use in CI) |
| Generated code guaranteed to compile and run? | No | Yes |
| Delivery model | Cloud-based SaaS service integrated with the IDE | Software installed into the development environment |
| Minimum developer machine resources | Enough to run the IDE and dev environment; the model runs in the cloud | 2 CPU cores and 16 GB memory |
| IP indemnification of generated code | No | Yes, in Teams and Enterprise licenses |
Copilot: Transformer-based code synthesis
Generative transformers started to emerge a few years ago, but they’ve been capturing headlines thanks to the work of OpenAI, Google, Microsoft and others, who have achieved uncannily good results in synthesizing text. Because code can be seen as “just text”, these generative models can also generate code like the code they saw in training. Tabnine produced an IDE autocompletion tool originally based on OpenAI’s open-source GPT-2 model, and GitHub then released its Copilot autocompletion tool based on a retrained and tuned GPT-3-based algorithm.
Transformers are deep learning models built around the idea of attention: weighting the significance of each part of the input data (e.g. which other words appear next to a given word, or whether it is acting as a verb, noun, adjective or another part of speech). They are very popular for natural language processing because they encode the significance of context for each part of the input text.
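As a rough illustration of what attention computes, here’s a minimal scaled dot-product sketch in Python. It’s a toy with hand-made vectors, not Copilot’s actual model: real transformers learn the query, key and value projections, and operate on thousands of high-dimensional token embeddings.

```python
import math

def attention(query, keys, values):
    # Score each key against the query (dot product), scaled by sqrt(d)
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Softmax turns the scores into weights that sum to 1
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The output is the weighted blend of the values
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

# A query resembling the first key pulls the output towards the first value
out = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [0.0, 10.0]])
```

The key property is that every value contributes to the output, weighted by how relevant its key is to the query – that is how context gets encoded into each position.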
Some of the power of GPT-3 was in its sheer size, monstrous by 2021 standards: it’s a 175-billion-parameter model that is estimated to have cost about $5m to train. Training it essentially computes 175 billion numbers (“parameters”) that together form a statistical model relating the input typed by the user to the output text. By comparison, Microsoft’s ResNet, a machine learning model that is better than humans at identifying images, is a mere 0.06-billion-parameter model.
Despite its size, GPT-3 wasn’t great at suggesting useful code, so GitHub Copilot originally used a similar mathematical model trained on a different, more specific training set: a blend of text and public repository source code from GitHub. OpenAI called this model Codex. Copilot used Codex to look at what the developer had typed, then used that input to assemble blocks of code it had seen on GitHub before into a likely useful completion (you could ask it to generate several probable completions and choose the best one).
Things are moving fast. A newer version of GPT-3, unimaginatively dubbed GPT-3.5, was released with the ChatGPT tool towards the end of 2022. It has proved so much better at code suggestions that it has replaced Codex as the foundation for GitHub Copilot (in March 2023 OpenAI announced that the Codex model would be officially deprecated). Since then GPT-4 has been released, though it’s not yet used by Copilot.
All of these transformers can do an extraordinary job of completion, but at times they also generate complete nonsense – and the algorithm can’t tell the difference. Users of the original GPT-3 were “simultaneously impressed by its coherence and amused by its brittleness.” GPT-3 does not produce an error when it can’t generate a good completion (it has no way to rank its own output), and tinkering with the model’s settings can produce wildly different results. It’s therefore incumbent on a human developer to check that what the tool proposes is logically correct, to reject erroneous completions, and to spot errors and potential vulnerabilities. GPT-3.5 and GPT-4 retain this inherent limitation of LLMs. There’s an important reason why GitHub calls the tool Copilot and continues to describe it as a pair programming tool: you have to use it interactively and with a critical eye – you can’t let it work autonomously.
The other challenge with this kind of ML is that the context you can feed into the input is limited. GPT-3 was limited to 2,049 tokens (words, more or less) of context; GPT-3.5 is a bit better at 4,096, but Copilot still doesn’t remember what came before and continues to struggle with things like multi-digit math. It was easy to get GPT-3 to give incorrect outputs (for example: https://www.nabla.com/blog/gpt-3/), and though things have improved, there’s still no guarantee you’ll get correct answers to mathematical problems – no small limitation when it comes to code!
When it comes to real-world code, we know from our own experience with Diffblue Cover that the context needed for a good test can be very large indeed. Complex methods that process large and/or complex objects need a lot of state to be “just right” if they’re to do anything useful. In short, writing tests is a much harder task than finding auto-completions. That’s a big challenge for an algorithm that can struggle with arithmetic and doesn’t remember much context.
Diffblue Cover: Reinforcement learning code synthesis
Reinforcement learning comes from a different branch of the AI tree: unsupervised learning. Instead of the model being pre-trained on example data, the learning happens in real time as the algorithm seeks the best answer by searching the space of possible answers, following promising trajectories and backtracking if things seem to be getting worse.
The algorithm is guided by a reward function that seeks to optimize the long-term reward. This is known as “probabilistic search” – a method where the space of potential solutions is sampled, and the algorithm then spends more time searching regions with a greater probability of a good solution.
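The general shape of a reward-guided probabilistic search can be sketched in a few lines. This is a toy illustration of the idea, not Diffblue’s algorithm; all function names and parameters here are invented for the example.

```python
import random

def probabilistic_search(reward, sample, neighbour, steps=200, seed=0):
    """Toy reward-guided search: sample the space, then spend most of the
    effort refining the most promising candidate found so far."""
    rng = random.Random(seed)
    best = sample(rng)
    best_reward = reward(best)
    for _ in range(steps):
        # Mostly explore near the best candidate; occasionally sample
        # a fresh region of the space.
        candidate = neighbour(best, rng) if rng.random() < 0.8 else sample(rng)
        r = reward(candidate)
        if r > best_reward:
            best, best_reward = candidate, r   # follow the promising trajectory
        # otherwise discard the candidate (i.e. backtrack to 'best')
    return best

# Example: maximise a simple reward over the interval [0, 10]
found = probabilistic_search(
    reward=lambda x: -(x - 7.0) ** 2,           # reward peaks at x = 7
    sample=lambda rng: rng.uniform(0.0, 10.0),  # broad sampling
    neighbour=lambda x, rng: x + rng.gauss(0.0, 0.5),  # local refinement
)
```

The search never enumerates every candidate; it simply concentrates its evaluation budget where the reward looks highest, which is what makes the approach tractable for enormous search spaces.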
Perhaps the best-known example of reinforcement learning is Google’s AlphaGo algorithm, where the search task is to find the best move in the game of Go. Reinforcement learning is a popular approach when the number of potential solutions (move sequences, in the case of Go) is so large that an exhaustive search is infeasible. In Go, there are more possible board positions than atoms in the known universe, so finding a good move requires a probabilistic approach.
In AlphaGo, two neural networks are used: one predicts the next best move, and the other predicts who will win the game. The win prediction serves as the reward function, and the move prediction is used to decide which moves to evaluate. The algorithm then repeats the process to maximize the probability of winning the game.
Diffblue Cover also uses probabilistic search, but it doesn’t need neural networks to make predictions. To write a unit test, the algorithm evaluates the existing code and from that guesses what a good test would be. It then runs the candidate test against the method under test, evaluates its line coverage (how much of the code the test exercises) and other qualities, and predicts which changes to the test will trigger additional branches and produce higher coverage. It repeats these steps until it has found the best test in the time available (the search is time-bound).
The return value(s) of the method and any side effects that were observed are used to write assertions on the method’s behavior – because it’s no good having good code coverage if the tests don’t check what the code does. The result is a test that is known to work (we ran it) and to produce specific coverage and assertions.
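As a toy illustration of this loop, the sketch below picks the candidate input that exercises the most code and then records the observed result as the generated test’s assertion. All helper names are hypothetical, and Python’s built-in line tracing stands in for a real coverage measurement; this is not Diffblue Cover’s engine.

```python
import sys

def lines_executed(fn, args):
    """Count the distinct source lines hit while running fn(*args)."""
    hit = set()
    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            hit.add(frame.f_lineno)
        return tracer
    sys.settrace(tracer)
    try:
        fn(*args)
    finally:
        sys.settrace(None)
    return len(hit)

def search_for_test(method, candidate_inputs):
    """Keep the candidate input with the highest coverage, then capture
    the observed result as the assertion of a 'generated test'."""
    best_args = max(candidate_inputs, key=lambda a: lines_executed(method, a))
    observed = method(*best_args)   # run again to record the behavior
    # The generated test replays the input and asserts the observed result.
    return lambda: method(*best_args) == observed

# A toy method under test with three branches
def classify(n):
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    return "positive"

generated_test = search_for_test(classify, [(-5,), (0,), (3,)])
```

The crucial point the sketch shares with the real thing: because the candidate test is actually executed, the generated assertion is guaranteed to hold, so the test compiles and passes by construction.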
Probabilistic searches cannot guarantee they will find the best solution – or any solution at all – because by design they don’t evaluate every possible case. But in contrast with generative transformers, you don’t get gibberish or an incorrect result in that situation: you know when the search has failed. This makes the approach suitable for fully automated test-writing, because it won’t write gibberish tests that cause your pipeline to fail.
Cover also won’t find a test when the code is untestable – when neither a developer nor a machine could write one. Perhaps the simplest examples are private methods that can’t be called, so they can’t be tested. In general, very complex code becomes untestable because it’s very hard to find the right collection of coherent state that will result in the code running normally. This is a challenge for a human too, and oftentimes such code doesn’t have good unit tests, instead relying on integration tests or end-to-end functional testing. Diffblue uses an application recording approach to feed the reinforcement learning loop in this situation.
Any seasoned development manager has seen developer productivity tools come and go. There are plenty of templating tools that claim to increase developer productivity by producing skeleton code, but the fact that they are not widely used suggests the improvement isn’t enough to justify the extra work they require. More focused autocompletion tooling that removes the drudgery of looking up method signatures (e.g. in IntelliJ and VS Code) is, by contrast, very widely used.
The net? Use AI-based dev tools for what they’re good at, rather than seeing them as general-purpose solutions. For Copilot, that is autocompletion based on good context for straightforward code in a variety of languages. Cover, meanwhile, is specifically designed to write unit tests at scale in a way that can be integrated directly into CI, dramatically reducing developer time spent writing unit tests. Why not give it a try?
This article was updated in 2023 to reflect developments in the generative AI for code space.