Precision over hallucination: Why AI in software development needs accuracy

Introduction

AI has become a key part of software development workflows. According to a recent Stack Overflow survey, 70% of developers use or plan to use AI coding tools at work. The top benefits developers reported from AI tools include improvements in productivity, efficiency, and learning.

Because of the growing popularity of coding tools, the last two years have seen an explosion in AI applications aimed at developers – everything from AI pair programmers and code generators to code suggestion, test generation, and autonomous code writing tools.

However, despite the widespread use of AI coding tools, only 3% of respondents in Stack Overflow’s survey said they ‘highly trust’ these tools. A surprising 58% of developers were either undecided, distrustful, or highly distrustful of AI coding tools. That has led some to suggest that a second set of AI coding tools is needed to regression test the often error-ridden output of the first.

In this piece, we’ll take a deep dive into AI-driven development, or 'AI for Code', to help you parse the differences between tools and technologies – such as which are designed to be assistants and which are able to work autonomously.

We predict the future of AI for code is set to prioritise precision over hallucinations.

Current AI software development landscape

Pair programming tools get most of the buzz – but that’s changing

While large language model (LLM) systems like ChatGPT and GitHub Copilot have monopolised the media’s attention, other types of code generation tools are starting to attract more notice. CB Insights’ 2023 AI 100 Index, for example, highlighted two code generation companies specialising in other types of AI tools as part of their annual roundup of the most promising AI companies. That’s more than the number of image generation companies that made the list!

That shift in attention is likely grounded in a more nuanced understanding of how developers currently benefit from generative AI. In a 2023 report on AI and developer experience, Gartner found that tools like pair programmers currently only offer “incremental, quality-of-life improvements to developers rather than significant boosts in productivity.”

These benefits, the report claimed, could just as easily be obtained from search engines, development forums, or traditional code generation tools. The value these AI tools produced, then, came primarily from how they helped developers get the information they were searching for more quickly and with less context switching. Essentially, they acted as an ‘assistant’ in the development process.

Given the scope of their current impact, Gartner raised a critical question in the report: do the benefits outweigh the significant risks LLM-powered tools introduce around code quality, security, and performance?

The next big thing: Autonomous AI tools

Concerns about LLM reliability have prompted many devs to seek generative AI applications that deliver greater productivity gains than LLM-powered tools with fewer risks. But that means finding the right technology – and the right use case.

One great example of this is AlphaDev by Google DeepMind, a reinforcement learning system that discovers faster algorithms using novel approaches. For example, AlphaDev uncovered a sorting algorithm that is up to 70% faster for shorter sequences and about 1.7% faster for sequences of more than 250,000 elements. That has applications for everything from search results to how data is processed on computers. AlphaDev is able to find faster algorithms by starting from scratch rather than just refining existing algorithms.

Diffblue Cover, which was one of the code generation platforms included in CB Insights’ 2023 AI 100 Index, is another example of AI for coding that focuses on precision over hallucinations. An autonomous unit test writing platform for Java and Kotlin applications, Diffblue Cover uses reinforcement learning instead of deep learning-based LLMs to create a more efficient and reliable solution to one of developers’ biggest pain points: unit testing.

Solving developer pain points via autonomous AI coding tools offers far more than ‘incremental’ improvements. Our 2019 developer survey found that time spent writing unit tests, for example, costs companies an average of £14,287 per developer, per year. For a mid-sized company with 45 developers, that’s roughly £643,000 spent annually on unit tests.
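The arithmetic behind that figure is simple enough to rerun for a different team size. The snippet below is only a back-of-the-envelope sketch: the per-developer cost comes from the survey figure quoted above, while the function name is made up for illustration.

```python
# Survey figure quoted in the text: average annual cost of unit test writing per developer.
COST_PER_DEVELOPER_GBP = 14_287


def annual_unit_test_cost(developers: int,
                          cost_per_developer: int = COST_PER_DEVELOPER_GBP) -> int:
    """Estimated yearly spend on hand-written unit tests for a team."""
    return developers * cost_per_developer


print(annual_unit_test_cost(45))  # 642915, i.e. roughly £643,000
```

Swapping in your own headcount gives a quick first estimate, though real costs will depend on salaries and how much testing a team actually does.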

Unlike LLM-based code completion assistants, reinforcement learning tools like AlphaDev and Diffblue Cover work 100% autonomously and require no manual review. They’re able to do this because, unlike LLMs, they don’t rely on a model pre-trained on sample data to make coding suggestions. Reinforcement learning happens in real time: the model searches the space of possible answers, follows the most promising trajectories, and learns from how each attempt performs. In the case of Diffblue Cover, that means running a unit test, observing how well it works, updating the test with what was learned, and running the next iteration.
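To make the run, observe, update cycle concrete, here is a deliberately tiny sketch. It is not Diffblue Cover’s algorithm, and it is not a full reinforcement learning system; it only illustrates the feedback loop of executing a candidate, checking whether it revealed new behaviour, and building on the candidates that did. The function under test (`classify`), the mutation steps, and the run budget are all assumptions made up for this example.

```python
from collections import deque


def classify(n: int) -> str:
    """Toy function under test with four distinct outcomes."""
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    if n % 2 == 0:
        return "even"
    return "odd"


def generate_tests(func, seed_input=0, deltas=(-1, 1, 2), budget=50):
    """Run candidates, keep those that expose new behaviour, and mutate the keepers."""
    kept = {}                        # observed outcome -> input that produced it
    queue = deque([seed_input])      # candidates waiting to be executed
    runs = 0
    while queue and runs < budget:
        candidate = queue.popleft()
        outcome = func(candidate)    # run the candidate and observe the result
        runs += 1
        if outcome not in kept:      # new behaviour: keep it and explore nearby inputs
            kept[outcome] = candidate
            queue.extend(candidate + d for d in deltas)
    return kept


tests = generate_tests(classify)
for expected, value in tests.items():
    # Every kept test was produced by actually executing the code under test.
    assert classify(value) == expected
```

Because each kept test comes from executing the code rather than predicting its output, the resulting assertions match what the function does today by construction. That is the property that matters for regression tests.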

AlphaDev and Cover aren’t the only AI developer tools creating something outside the LLM box. There are many others, including these examples:

  • Testim: Uses proprietary ML methods for autonomous UI and functional testing.
  • Appvance: Uses 19 different ML methods to test both web and native apps.
  • CodeRL: Uses both LLMs and reinforcement learning unit testing to reduce hallucinations.

AI for code: the right tool for the right job

While large language models are an exciting technology that shows promise for many use cases, they’ve been incorrectly marketed as the best solution for every use case.

To understand why AI tools powered, entirely or in part, by other types of AI can be a better fit for many tasks in software development, it’s critical to understand what each type of tool does.

There are two main types of artificial intelligence for coding:

Assistive AI

Coding assistants or pair programmers work with developers to help them code more efficiently, often through code completion or suggestions. These AI tools require a developer to actively guide them and check their output. They are great assistants but can hallucinate.

Autonomous AI

Autonomous AI for coding tools can tackle complex coding tasks on their own. These types of AI act autonomously and can ensure the code produced is always right. They are focused on precision.

Auto-complete generative AI assistants

Tools like ChatGPT, GitHub Copilot, Amazon CodeWhisperer, Tabnine, Codiga, Askcodi, Replit, and the many other pair programmers on the market are interactive code suggestion tools for general code writing. Powered by very large deep learning models known as large language models, they’re trained on public source code repositories scraped from GitHub.

LLMs write code by using a statistical model of coding patterns learned during training to suggest completions. Essentially, they write code based on what they predict is the most likely way a human would write it. However, that code can be riddled with errors due to what’s known as AI ‘hallucinations,’ a term that refers to LLMs’ tendency to confidently and incorrectly improvise when their training data has poor coverage of the topic in the prompt.

When it comes to general chat, that might mean ChatGPT citing fictitious cases when asked to write a legal argument. When it comes to code, Vulcan Cyber researchers found ChatGPT will link to coding libraries that don’t exist and create questionable fixes to Common Vulnerabilities and Exposures (CVEs). For that reason, everything that an LLM creates has to be checked line by line by a developer. That’s why these products are marketed as ‘pair’ programming tools – they can’t work autonomously.

That’s also why ‘prompt engineering’ has become a buzzword.

Prompt Engineering

The belief is that with the right prompt you can predict and control the outcomes of an LLM. However, the fact that small changes in input prompts can have such an impact on the model’s outputs makes it clear that it’s impossible to fully predict and control LLM outcomes.
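Since prompting alone cannot rule out hallucinated dependencies such as the non-existent libraries described above, a mechanical check is one way to catch them before code is run. The sketch below uses only the Python standard library (`ast` and `importlib.util.find_spec`). The helper name `missing_modules` is hypothetical, and the check only tells you that a module cannot be found in the current environment, which could also mean it simply has not been installed.

```python
import ast
import importlib.util


def missing_modules(source: str) -> list[str]:
    """Return top-level modules imported by `source` that cannot be found locally."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from x import y" statements; relative imports are skipped.
            names.add(node.module.split(".")[0])
    return sorted(name for name in names if importlib.util.find_spec(name) is None)


suggested = "import json\nfrom os import path\nimport no_such_pkg_zzz_12345\n"
print(missing_modules(suggested))
```

A check like this catches only one class of hallucination. It says nothing about whether a real library is being called correctly, which is why line-by-line review is still needed.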

A great example of the downsides of LLMs can be found in this video of Levy Rozman, an international chess master, playing against ChatGPT. In a game that ChatGPT forfeits in fewer than 10 moves, the model makes a number of illegal or absurd plays – including capturing its own pieces. Dedicated chess engines, which are built around systematic search rather than next-word prediction, have beaten the best human players for decades – something that ChatGPT is unable to replicate.

The bottom line? Pair programmers like ChatGPT and Copilot work brilliantly for some tasks, like generating general-purpose coding suggestions and writing documentation – but for other tasks, other AI approaches, or a mix of approaches, are far better.

Autonomous generative AI

Thankfully, LLMs aren’t the only way to do AI for code. Reinforcement learning and ensemble models offer another approach.

Reinforcement learning can be used to create autonomous coding tools either in tandem with LLMs or on their own. It can do this because it seeks out the likeliest best answers and then tests them. Reinforcement learning algorithms are guided by what’s known as ‘probabilistic search.’ That means that the algorithm spends more time searching regions with a higher probability of a good solution. It then iterates to find the best answer it can within the time constraints it’s been allotted rather than simply a ‘good enough’ one. Reinforcement learning is therefore better for certain tasks where it’s critical that the correct outputs run and compile or where you want the AI to work autonomously. Reinforcement learning models don't require that developers check outputs line by line for accuracy because they don’t hallucinate.
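The idea of spending more time in regions with a higher probability of a good solution, within a fixed budget, can be sketched in a few lines. This is a toy illustration rather than the algorithm any particular product uses: the scoring function, the number of regions, and the iteration budget (standing in for a time limit) are all assumptions chosen for the example.

```python
import random


def example_score(x: float) -> float:
    """Toy objective with its peak at x = 7.3 (an arbitrary choice for illustration)."""
    return 10 - (x - 7.3) ** 2 / 10


def probabilistic_search(score, low, high, regions=10, budget=500, seed=0):
    """Anytime search: sample more often from regions that have scored well so far."""
    rng = random.Random(seed)
    width = (high - low) / regions
    weights = [1.0] * regions              # start with no preference between regions
    best_x, best_score = None, float("-inf")
    for _ in range(budget):                # the budget stands in for a time limit
        region = rng.choices(range(regions), weights=weights)[0]
        x = low + (region + rng.random()) * width
        s = score(x)                       # test the candidate rather than trusting a guess
        weights[region] = max(weights[region], s)  # promising regions get sampled more
        if s > best_score:
            best_x, best_score = x, s
    return best_x, best_score


best_x, best_score = probabilistic_search(example_score, 0.0, 10.0)
print(f"best candidate found: x={best_x:.2f}, score={best_score:.2f}")
```

The search can be stopped at any point and still return the best candidate it has actually evaluated, which is what separates this style of search from a single ‘one shot’ prediction.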

The most famous example of reinforcement learning is Google’s AlphaGo algorithm. Given the task to find the best move in a game of Go, AlphaGo’s two neural networks work together. One model uses probabilistic search to evaluate the moves most likely to be the best and the other predicts who is more likely to win the game with any potential move. The goal is to maximise the probability of winning with each move chosen. Unlike with ChatGPT trying to play a game of chess with ‘good enough’ or hallucinated predictions of the best possible moves, millions of possible moves are evaluated and rejected to optimise the model’s gameplay.

The process an AI uses to arrive at an answer greatly affects the accuracy and reliability of that answer. For use cases where autonomous coding or accuracy are key, the quality of reinforcement learning outputs wins over LLMs’ ability to produce ‘one shot’ or ‘few shot’ answers.

As a result, autonomous AI development tools provide significantly more value to developers than coding assistants for these use cases, since they search out a likely answer, test it, and then iterate until they’ve found the best test or code in the time available.

Considerations when choosing generative AI-driven solutions

When choosing the right AI tool for software development, there are a number of different factors that engineering teams should consider.

Level of autonomy

The stated goal of leveraging generative AI tools in the development process is often to augment the developer in order to make them more productive. For some use cases, that can be accomplished with an LLM pair programming tool that helps you by answering questions or generating simple code.

However, if you’re looking to drive significant time savings to achieve greater velocity, autonomous AI solutions that don’t require that level of human intervention are a better choice.

Scalability

LLM-based coding tools provide several code suggestions a developer has to manually review and choose between. That doesn’t include the time it would take to write additional prompts if you want additional suggestions. On top of that, every line of code an LLM writes has to be reviewed and corrected for accuracy – meaning that development can only move at the speed of a human developer.

Pair programmers are good at suggesting one section of code. But that’s not going to scale if you want to create several thousand unit tests, for example. The only way to scale that kind of work is through autonomous AI tools working independently.

Desired productivity gains

Any tools that necessitate humans-in-the-loop are limited in terms of productivity gains. You are still coding at the speed of humans – just somewhat augmented ones. While augmented workers still drive productivity gains that can move the needle, autonomous AI can generate transformative gains by autonomously completing the task.

Exposure to risks

LLMs have a number of known safety and security risks including inaccurate and inefficient code, cyber security risks, privacy risks, IP risks, and more. While OpenAI has stated that they don’t use data from their APIs to train their models, the company experienced a significant data privacy breach in March 2023 when data from chat histories became visible to other users.

One of the benefits of reinforcement learning-based dev tools is that they can sometimes operate on-prem and in your own code environment – something that would cost at least seven figures to do with an LLM. This can greatly reduce the risk of breaches and ensure compliance in heavily regulated industries. Autonomous tools’ high accuracy also eliminates the need for human review. For enterprises, those time savings and risk reductions are critical.

Cost

For years, LLMs have been in an arms race. AI companies have been focused on creating models with the most parameters, following OpenAI’s assumption that the more parameters a model has, the better it will perform. While DeepMind’s Chinchilla project showed that smaller models trained on more data performed better, ‘smaller’ is relative (70 billion parameters vs 280 billion) and the compute cost of model training remains in the billions. That translates into significant metered costs for the use of LLMs.

A recent blog post by well-known venture capitalist Tomasz Tunguz calculates the cost of a query with a 200 word response as varying from $0.03 using LLaMA 2 to $3.60 using GPT-4. But, in software development, context matters. If you need to provide the AI more context to get a usable coding suggestion, common queries might require significantly more tokens – and that would make individual queries cost far more.
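To see why extra context drives up the bill, it helps to write the metering out. The sketch below assumes a simple per-token pricing model with separate input and output rates. The prices are placeholders invented for the example, not any vendor’s actual rates, and `query_cost` is a hypothetical helper.

```python
def query_cost(prompt_tokens: int, completion_tokens: int,
               input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Metered cost of one request, in the same currency as the prices."""
    return ((prompt_tokens / 1000) * input_price_per_1k
            + (completion_tokens / 1000) * output_price_per_1k)


# Hypothetical prices, chosen only to show how cost scales with context size.
INPUT_PRICE, OUTPUT_PRICE = 0.01, 0.03

short_prompt = query_cost(500, 300, INPUT_PRICE, OUTPUT_PRICE)
with_context = query_cost(8_000, 300, INPUT_PRICE, OUTPUT_PRICE)
print(f"short prompt: {short_prompt:.3f}, with extra context: {with_context:.3f}")
```

With these placeholder rates the response is the same length in both cases, yet the request that carries several files of context costs several times more, and that multiplies across every query a team sends.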

In comparison, reinforcement learning methods don’t need to train a big model since they learn in real time. They can ingest the context they need and output significantly more code – all for a much lower price.

Time to value

Because most LLMs for coding are pair programmers, there’s a learning curve around using them. That often involves learning the right prompts to use to get the desired output and reduce hallucinations. But once that’s accomplished, there’s the question of return on investment. If LLMs are primarily providing developers with incremental value while charging $0.03 to $3.60 per query, how much value are businesses actually generating from their use?

Meanwhile, AI tools leveraging reinforcement learning don’t have the same learning curve. You just point the tool at the task and it completes it autonomously, delivering value from day one. The cost savings are huge – and immediate.

Importance of precision over hallucination

LLMs are seen as innovative because they provide a ‘good enough’ ‘one shot’ or ‘few shot’ answer, using models trained with deep learning techniques on very large datasets. This creates a statistical model that helps LLMs determine the most likely right answer to any coding question you ask.

This is a departure from other forms of AI like reinforcement learning that focus on examining millions of potential probable answers to determine which one is the best fit. In comparison to that comprehensive approach, LLMs can feel like magic – capable of ‘knowing’ and ‘thinking.’ But, in reality, they’re just very large statistical models of the text they’ve seen during training.

Are they a type of statistics well-suited to software development? It depends on the importance of the coding task. If the task requires reliability and precision, then an LLM’s ‘good enough’ answer (which might be a hallucination) often isn’t good enough. In such cases, you’re much better off going with a reinforcement learning model that will search for the most probable good answer and test it instead.

A great maxim to remember for why reinforcement learning-based AI tools are often a better choice is that:

Impact on software development & developers

AI tools with the capacity to code autonomously at scale are game changers in the development process – especially for use cases where scale is crucial.

For example, in a Forrester Total Economic Impact report, Snyk was found to save companies $377,000 over three years by reducing the time developers spent investigating and remediating security vulnerabilities at scale.

Similarly, Diffblue Cover’s ability to independently write thousands of unit tests that are guaranteed to compile and run can save up to 20% of developers’ time that’s currently spent writing simple unit tests. As mentioned, that represents £643,000 spent annually on unit tests. Having an autonomous AI coding platform do that work frees developers up to write unit tests focused on more complex business logic, which a human can often produce more accurately than AI-generated tests.

That doesn’t just save developers from the repetitive drudge work of writing a large number of simple unit tests; it also allows them to increase coverage to meet coverage targets or mandates.

Autonomous AI that’s capable of coding faster than humanly possible is set to increase developer velocity and speed up software delivery significantly. By giving developers the time they need to solve higher order problems, software won’t just ship faster – it will be better. Under-resourced or ambitious teams will also be able to more quickly get past P0 to P1 and P2, shipping more features and delivering more value and revenue.

Business value & impact

By saving a significant amount of developers’ time, increasing code quality, and ensuring the delivery of more accurate code, autonomous AI coding tools can transform the output of a team and supercharge velocity.

Here are some ways they deliver value to a business:

Time savings & velocity wins

While much of the discourse around AI is focused on cost savings, that isn’t the only or even the main value of AI. Augmenting developers by automating repetitive development work frees them up to do more valuable work. That allows them to ship faster, improving company velocity and revenue, or go home earlier, improving employee satisfaction. In highly competitive industries, having a deployment advantage and beating competitors in the war for talent can pay significant tactical dividends. Getting a product or feature to market before your competition can also lead to increased revenue and market share.

Better code

Autonomous AI automates repetitive tasks while ensuring they’re done accurately. This doesn’t just free developers up to pay more attention to other more critical coding work but some applications can also improve things like code coverage and testing to ensure devs find and address bugs earlier in the development cycle. That results in teams delivering better products, keeping customers satisfied, and reducing the time and cost of refactoring later on.

Improved developer experience

How do you measure your DevOps teams? The DevOps Research and Assessment (DORA) team identified four metrics focused on traditional measures of productivity in 2020: deployment frequency, lead time for changes, change failure rate, and time to restore service. In 2021, however, the SPACE framework introduced five dimensions for measuring DevOps effectiveness that focus more on the health of a team: satisfaction and well-being, performance, activity, communication and collaboration, and efficiency and flow. Autonomous AI for coding tools can help with both – allowing teams to deploy better code more frequently while also freeing team members from repetitive and time-consuming testing work that can delay deployment. That alone has a huge impact on employee experience and the efficiency of a team’s workflow.

Model Risk Management concerns are delaying GenAI deployment

Large language models have been the subject of scrutiny by the risk management functions at many enterprise organisations. So much so that many are lagging behind on realising the much-hyped gains expected to come from generative AI. A 2023 study by KPMG found that, while 65% of executives feel generative AI will have a big impact on their organisations, 60% don’t expect to adopt their first generative AI solution for one to two years.

Model Risk teams at many large organisations are trying to assess the risk levels of specific tools, but predicting what could go wrong when large language models are added to the software development process, and how to mitigate it, has been a challenge. LLMs are expected to affect data privacy and security and, unsurprisingly, KPMG found that those concerns were on the minds of 78% and 81% of the executives surveyed, respectively.

However, other generative AI tools present very low risk. Not only is the reliability of tools based on other AI methods much higher, but they also offer significant security benefits because they can potentially function ‘on-prem.’ That makes these types of tools a great way for companies in highly regulated industries or with risk averse leadership to take advantage of generative AI advances immediately.

Conclusion

While LLMs are an exciting and disruptive assistive technology, expect the future of AI for software development to also centre autonomous AI coding tools. Ultimately, finding the right tool for the right job is critical. LLMs are a great choice for many coding tasks but not for those that require things like precision or the production of autonomous code at scale.

Much of the excitement around LLMs for coding has hinged on the expectation that what ChatGPT and Copilot can do now is just a preview of what they will soon be capable of. But, while some are hopeful the next generation of LLMs will solve the hallucination problem and deliver more precise outputs thus making LLM coding tools autonomous, leading experts in LLMs disagree about how fixable those challenges are.

Ilya Sutskever, a co-founder of OpenAI, has argued that reinforcement learning from human feedback (RLHF) is key to eliminating hallucinations and improving accuracy. Meanwhile, other experts in the field like Yann LeCun, Meta’s Chief AI Scientist, known as one of the ‘Godfathers of AI’ for his role in the development of neural networks, believe hallucinations are part of a fundamental flaw inherent to LLMs. Given that in 2020 OpenAI and other AI companies incorrectly announced that model size was the most critical factor in LLM performance, only to be proven wrong by DeepMind’s Chinchilla research two years later, it’s clear that experts’ hypotheses about this still-developing technology can be wrong.

Whether LLMs will be able to someday produce flawless code autonomously on the first try is still extremely unclear. For that reason, companies building a generative AI strategy for software development shouldn’t put all their eggs in the assistive AI basket.

Instead, the future of generative AI involves leveraging multiple forms of AI that are good fits to the use cases developers are trying to solve for. That includes reinforcement learning-based AI dev tools like Diffblue Cover that can work autonomously and deliver greater accuracy and security at a lower cost.

Expect the future of AI for coding to be focused on precision over hallucinations.
