New AI-based tools like Diffblue Cover and GitHub Copilot seek to take on more complicated tasks that are not straightforward to automate. This is not process automation; it is software writing software. Gartner estimates that 70% of manual work by developers can be automated.

So what’s the difference between Copilot and Diffblue Cover? At the highest level, AI-based software-writing-software tools fall into two categories: generative transformer-based systems (e.g. Copilot) and reinforcement learning-based systems (e.g. Diffblue Cover).

Here’s a quick summary of the differences:

| Feature | GitHub Copilot | Diffblue Cover |
| --- | --- | --- |
| AI approach | Generative transformer | Reinforcement learning |
| Learning approach | Supervised (pre-trained) | Unsupervised (no pre-training) |
| Training source | Public GitHub repos | N/A |
| How it writes code | Attention-based transformer based on input and training | Probabilistic search of the set of possible tests |
| Code-writing approach | Generates a set of auto-completion options based on input; the developer chooses the best completion | Analyses the program and then writes tests autonomously |
| Interactive usage | Describe the code (signature) or start coding, click Copilot and choose the best completion | Select a method/class and click “Write tests” |
| IDE supported | VS Code | IntelliJ |
| Batch usage | N/A | CLI to autonomously write tests for entire projects |
| Languages supported | Wide range, including C, C++, Java, JavaScript, Go, Python | Java only |
| Generated code guaranteed to compile and run? | No | Yes |
| Delivery model | Cloud-based SaaS service integrated with Visual Studio Code | Software installed in the developer’s environment |
| Minimum developer machine resources | Enough to run VS Code and the dev environment; the model runs in the cloud | 2 CPU cores and 8 GB memory |
| IP indemnification of generated code | No | Yes, with Teams and Enterprise licenses |

Copilot: Transformer-based code synthesis

Generative Transformers have been capturing the headlines thanks to the work of OpenAI, Google, Microsoft and others who have achieved uncannily good results in synthesizing text. Because code can be seen as “just text”, these generative models can also generate code that they have seen in training. TabNine produced an IDE autocompletion tool based on GPT-2, and GitHub has released its Copilot autocompletion tool based on a retrained and tuned GPT-3 model from OpenAI.

Transformers are deep learning models that incorporate the idea of attention, weighing the significance of the input data (e.g. what other words are next to the word, or whether the word is a verb, noun, adjective or other part of speech). They are very popular for natural language processing because they encode the significance of context to each part of the input text.
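
To make the idea concrete, here’s a rough sketch of the core attention computation for a single query vector (illustrative Java, not how GPT-3 is actually implemented): every token is scored against the query, the scores are normalized with a softmax, and the output is a weighted blend of the value vectors, so the most relevant context contributes the most.

```java
public class AttentionSketch {

    // Scaled dot-product attention for a single query vector.
    // Purely illustrative: real transformers use learned projections and
    // many attention heads over large matrices, not hand-rolled arrays.
    static double[] attend(double[] query, double[][] keys, double[][] values) {
        int n = keys.length;
        double[] scores = new double[n];
        double scale = Math.sqrt(query.length);

        // Score each key against the query: how relevant is each other token?
        for (int i = 0; i < n; i++) {
            double dot = 0;
            for (int d = 0; d < query.length; d++) {
                dot += query[d] * keys[i][d];
            }
            scores[i] = dot / scale;
        }

        // Softmax turns the scores into weights that sum to 1.
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) max = Math.max(max, s);
        double sum = 0;
        double[] weights = new double[n];
        for (int i = 0; i < n; i++) {
            weights[i] = Math.exp(scores[i] - max);
            sum += weights[i];
        }
        for (int i = 0; i < n; i++) weights[i] /= sum;

        // The output is a weighted blend of the value vectors: tokens judged
        // more relevant contribute more to the result.
        double[] out = new double[values[0].length];
        for (int i = 0; i < n; i++) {
            for (int d = 0; d < out.length; d++) {
                out[d] += weights[i] * values[i][d];
            }
        }
        return out;
    }
}
```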

Some of the power of GPT-3 is in its monster size: it’s a 175-billion-parameter model that is estimated to cost $5-10m to train. Training it essentially computes 175 billion numbers (“parameters”), and these form a statistical model that relates the input typed by the user to the output text. By comparison, Microsoft’s ResNet, a machine learning model that is better than humans at identifying images, is a mere 0.06-billion-parameter model.

GitHub Copilot uses a similar mathematical model to GPT-3, but the training set is different: a blend of public repository source code on GitHub and text. OpenAI calls this model Codex. Copilot uses Codex to look at what the developer has typed and, based on that, assembles blocks of code that it has seen on GitHub before into a likely useful completion (you can ask it to generate several probable completions and choose the best one).
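
As a purely hypothetical illustration (not actual Copilot output): a developer types a comment and a method signature, and the tool proposes a plausible body assembled from patterns it has seen before.

```java
// What the developer types: a comment and a signature.
// Returns true if the given string is a palindrome, ignoring case.
public static boolean isPalindrome(String s) {
    // ...and the kind of body a completion might propose (illustrative only,
    // not actual Copilot output):
    String t = s.toLowerCase();
    int i = 0;
    int j = t.length() - 1;
    while (i < j) {
        if (t.charAt(i) != t.charAt(j)) {
            return false;
        }
        i++;
        j--;
    }
    return true;
}
```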

This might sound a bit like “cargo cult programming,” but it is far more fine-grained: the blocks (“tokens” in GPT-3 terminology) are much smaller and Copilot rarely flat-out quotes a big chunk of code it has seen before. It most often does this when it has very little to go on – e.g. when the developer has just started a new file and is typing things at the very top (see https://docs.github.com/en/github/copilot/research-recitation). Copilot synthesizes code: the output is unique 99.9% of the time per GitHub’s testing.

Transformers can do an extraordinary job of completions, but they also generate utter gibberish, and the algorithm can’t tell the difference. Users of GPT-3 were “simultaneously impressed by its coherence and amused by its brittleness.” GPT-3 does not produce an error code when it can’t produce a good completion, and tinkering with the model’s settings can produce wildly different results. It is therefore incumbent on a human developer to check that what the tool proposes is logically correct, to reject completions that don’t satisfy the developer’s intent, and to spot errors and potential vulnerabilities. There’s a reason why GitHub calls the tool Copilot and describes it as automated pair programming: you have to use it interactively and with a critical eye – you can’t let the tool work autonomously.

The other challenge with this kind of ML is that the context you can feed into the input is limited. GPT-3 as an algorithm is limited to about 1500 tokens (words, more or less) of context. It doesn’t remember what came before, and is better at association than it is at processing sequences, so you can easily get GPT-3 to give you incorrect calculations (for example: https://www.nabla.com/blog/gpt-3/).

When it comes to real-world code, we know from our own experience with Diffblue Cover that the context needed for a good test can be very large indeed. Complex methods that process large and/or complex objects need a lot of state to be “just right” if they’re to do anything useful. That often means calling a lot of object factories with carefully chosen inputs. In short, writing tests is a much harder task than finding auto-completions. That’s a big challenge for an algorithm that struggles with sequences and doesn’t remember much context.
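
A hypothetical sketch of what that looks like in practice (the domain types, factories and values below are made up for illustration): most of the test is spent arranging a coherent object graph before the method under test can even be called.

```java
// Hypothetical types (Order, Customer, LineItem, the factories and
// PricingService), shown only to illustrate how much "arrange" code a single
// unit test can need before the method under test does anything useful.
@Test
void appliesLoyaltyDiscountToLargeOrders() {
    Customer customer = CustomerFactory.create("alice@example.com", LoyaltyTier.GOLD);
    Order order = OrderFactory.emptyOrderFor(customer);
    order.addLine(new LineItem("SKU-1", 3, new BigDecimal("19.99")));
    order.addLine(new LineItem("SKU-2", 1, new BigDecimal("249.00")));
    order.setShippingAddress(AddressFactory.domestic());

    BigDecimal total = new PricingService().totalWithDiscounts(order);

    // The undiscounted total is 308.97; a gold-tier discount should reduce it.
    assertTrue(total.compareTo(new BigDecimal("308.97")) < 0);
}
```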

Diffblue Cover: Reinforcement learning code synthesis

Reinforcement learning comes from a different branch of the AI tree: unsupervised learning. Instead of being pre-trained on example data, the learning happens in real time: the model searches the space of possible answers for the best one, following promising trajectories and backtracking if things seem to be getting worse.

The algorithm is guided by a reward function that seeks to optimize the long-term reward. This is known as “probabilistic search” – a method where the space of potential solutions is sampled, and the algorithm then spends more time searching regions with a greater probability of a good solution.
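
Here’s a toy sketch of the idea in Java (a made-up one-dimensional reward function, nothing to do with Cover’s internals): sample candidates, keep the best one found so far, and concentrate further sampling around it.

```java
import java.util.Random;

// Toy illustration of reward-guided probabilistic search: maximise an
// arbitrary reward function by sampling, then sampling more densely
// around the most promising point found so far.
public class ProbabilisticSearchSketch {

    // Stand-in reward function; in a test-writing tool the "reward" would be
    // signals like coverage, not a formula like this.
    static double reward(double x) {
        return Math.sin(5 * x) * Math.exp(-x * x);
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        double best = rng.nextDouble() * 4 - 2;   // start anywhere in [-2, 2]
        double bestReward = reward(best);
        double radius = 2.0;                      // how widely we explore

        for (int step = 0; step < 1000; step++) {
            // Sample near the current best; occasionally jump far away so we
            // don't get stuck in one unpromising region.
            double candidate = (rng.nextDouble() < 0.1)
                    ? rng.nextDouble() * 4 - 2
                    : best + rng.nextGaussian() * radius;
            double r = reward(candidate);
            if (r > bestReward) {        // keep the promising trajectory
                best = candidate;
                bestReward = r;
                radius *= 0.95;          // focus the search as we improve
            }
        }
        System.out.printf("best x = %.4f, reward = %.4f%n", best, bestReward);
    }
}
```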

Perhaps the best-known example of reinforcement learning is Google’s AlphaGo algorithm, where the search task is to find the best move in the game of Go. Reinforcement learning is a popular approach when the number of potential solutions (move sequences, in the case of Go) is so large that an exhaustive search is infeasible. In Go, there are more possible board positions than atoms in the known universe, so finding a good move requires a probabilistic approach.

In AlphaGo, two neural networks are used: one predicts the next best move, and the other predicts who will win the game. The game-winning prediction feeds the reward function, and the prediction of the best move is used to decide which moves to evaluate. The algorithm then repeats the process to maximize the probability of winning the game.

Diffblue Cover also uses probabilistic search, but it doesn’t need neural networks to make predictions. To write a unit test, the algorithm looks at the code and from that guesses at what a good test would be. It then runs that candidate test against the method under test, evaluates the line coverage (how much of the code the test exercises) and other qualities of the test, and predicts which changes to the test will trigger additional branches and produce higher coverage. It repeats these steps until it has found the best test in the time available (the search is time-bound).
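
In sketch form it looks something like this (a heavily simplified toy, not Diffblue’s implementation: here a “test” is just an integer input, candidates are sampled at random rather than predicted, and coverage is a set of branch labels):

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Toy coverage-guided search: guess an input, run it, keep it if it reaches
// branches nothing else has reached, and stop when the time budget runs out.
public class CoverageSearchSketch {

    // Stand-in "method under test": records which branches a given input hits.
    static Set<String> branchesHit(int x) {
        Set<String> hit = new HashSet<>();
        if (x > 100) hit.add("large"); else hit.add("small");
        if (x % 2 == 0) hit.add("even"); else hit.add("odd");
        if (x == 42) hit.add("answer");
        return hit;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        Set<String> covered = new HashSet<>();
        Set<Integer> keptInputs = new HashSet<>();

        long deadline = System.currentTimeMillis() + 200;   // the search is time-bound
        while (System.currentTimeMillis() < deadline) {
            int candidate = rng.nextInt(400) - 100;          // guess a new "test"
            Set<String> hit = branchesHit(candidate);        // run it
            if (!covered.containsAll(hit)) {                 // did it reach anything new?
                covered.addAll(hit);
                keptInputs.add(candidate);                   // keep tests that add coverage
            }
        }
        System.out.println("kept inputs: " + keptInputs + ", branches covered: " + covered);
    }
}
```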

The return value(s) of the method and any side effects that were observed are used to write assertions on the code’s behavior – because it’s no good having good code coverage if the tests don’t check what the code did. The result is a test that is known to work (we ran it) and to produce specific coverage and assertions.
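
For example, a test in that style might look like this (hypothetical code, not actual Cover output): the observed return value and the observed side effect each become an assertion.

```java
// Account and withdraw(...) are hypothetical; the point is that the observed
// return value and the observed side effect each become an assertion.
@Test
void withdrawReturnsAmountAndReducesBalance() {
    Account account = new Account(new BigDecimal("100.00"));

    BigDecimal withdrawn = account.withdraw(new BigDecimal("30.00"));

    assertEquals(new BigDecimal("30.00"), withdrawn);             // return value
    assertEquals(new BigDecimal("70.00"), account.getBalance());  // side effect on state
}
```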

Probabilistic searches cannot guarantee they will find the best solution – or any solution – because by design they don’t evaluate every possible case. But in contrast with generative transformers, you don’t get gibberish or an incorrect result in that situation: the algorithm knows when it has failed. This makes the approach suitable for fully automated test-writing, because it won’t write gibberish tests that cause your pipeline to fail.

Cover also won’t find a test when the code is untestable – when neither a developer nor a machine could write one. Perhaps the simplest examples are private methods that can’t be called from a test, so they can’t be tested directly. In general, very complex code becomes untestable because it’s very hard to find the right collection of coherent state that will result in the code running normally. This is a challenge for a human too, and such code often doesn’t have good unit tests, relying instead on integration tests or end-to-end functional testing. Diffblue uses a trace-recording approach to feed the reinforcement learning loop in this situation.
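
A trivial, made-up illustration of untestable-by-construction code: a private method with no accessible caller simply can’t be reached from a unit test.

```java
public class LegacyBilling {

    // Nothing in this class (or anywhere else) calls this method, so neither
    // a developer nor a machine can exercise it from a unit test without
    // changing the code – e.g. raising its visibility or adding a public caller.
    private double applyLegacySurcharge(double amount) {
        return amount * 1.02;
    }
}
```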

Conclusion

Any seasoned development manager has seen developer productivity tools come and go. There are plenty of templating tools that claim to increase developer productivity by producing skeleton code, but the fact that they are not widely used suggests the improvement isn’t enough to justify the extra work they require. More focused autocompletion tooling that removes the drudgery of looking up method signatures (e.g. the autocompletion built into IntelliJ and VS Code) is very widely used.

The net? Use AI-based dev tools for what they’re good at rather than seeing them as general purpose solutions. In the case of Copilot, that is autocompletions based on good context for straightforward code in a variety of languages. For Cover, the problem solved is unit testing rather than general coding, and today it is just for Java.