At Diffblue, we’ve been building a system that generates high-quality tests for reachability analysis of code. There are many techniques for doing this, including bounded model checking, abstract interpretation, symbolic execution, and fuzzing. Our research is dedicated to improving on the state of the art and investing in new techniques in all of these domains, which is why one of our teams is developing an advanced fuzzer for both managed and unmanaged languages. We are excited to share our early results: the fuzzer has already found many issues in a large Java code base.

Diffblue’s fuzz tester generates regression suites of component tests in a completely automated manner. Component tests are ideally suited to catching regressions in your service. For instance, suppose that you are working on an HR service and that you recently updated some functionality in the calendar module. You probably expect that your change does not affect, say, the part of your public API that deals with users. But how can you be sure? How do you know there are no unintended interactions between the user and calendar functionality? How do you ensure that all API requests related to users still return the same responses?

Our fuzzer generates regression test suites that exhaustively exercise the interesting logical paths of your code. The generation algorithms are extremely good at finding corner cases in your code logic. The best part is that you don't have to do anything: the process is fully automated. While this might look like magic to some, it's not. It's based on four decades of research in formal verification and in static and dynamic program analysis.

So, how does it work? Initially, you provide a small number of component tests. We analyze the logical paths they exercise. From that, we derive new test cases that cover new functions, as well as corner cases in the logic of your code.
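To make the idea of "start from a few seed tests, observe which paths they exercise, and keep new inputs that reach new paths" concrete, here is a minimal sketch of a generic coverage-guided loop. It is an illustration only and not Diffblue's algorithm: the class `SeedFuzzLoop`, the toy target, the branch names, and the mutation dictionary are all hypothetical stand-ins, and a real fuzzer would collect coverage by instrumenting the service rather than by inspecting the request string.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of a generic coverage-guided loop; not Diffblue's implementation.
class SeedFuzzLoop {

    // Toy stand-in for the service under test: returns the IDs of the
    // (made-up) branches that a request would exercise.
    static Set<String> toyTarget(String request) {
        Set<String> branches = new HashSet<>();
        branches.add("entry");
        if (request.contains("q=")) branches.add("query");
        if (request.contains("rows=")) branches.add("paging");
        if (request.contains("sort=")) branches.add("sorting");
        if (request.length() > 40) branches.add("long-request");
        return branches;
    }

    // Derive a new request by appending one parameter from a small dictionary.
    static String mutate(String request, Random rng) {
        String[] fragments = {"q=title:a", "rows=5", "sort=year desc", "fl=id"};
        String separator = request.contains("?") ? "&" : "?";
        return request + separator + fragments[rng.nextInt(fragments.length)];
    }

    // Returns the seeds plus every derived request that reached a new branch.
    static List<String> fuzz(List<String> seeds,
                             Function<String, Set<String>> target,
                             int budget, long rngSeed) {
        Random rng = new Random(rngSeed);
        Set<String> covered = new HashSet<>();
        List<String> corpus = new ArrayList<>();
        for (String seed : seeds) {
            covered.addAll(target.apply(seed)); // record what the seeds already exercise
            corpus.add(seed);
        }
        for (int i = 0; i < budget; i++) {
            String parent = corpus.get(rng.nextInt(corpus.size()));
            String candidate = mutate(parent, rng);
            // addAll returns true only if the candidate reached at least one new branch
            if (covered.addAll(target.apply(candidate))) {
                corpus.add(candidate);
            }
        }
        return corpus;
    }

    public static void main(String[] args) {
        List<String> corpus = fuzz(List.of("/select"), SeedFuzzLoop::toyTarget, 200, 1L);
        corpus.forEach(System.out::println); // each kept request is a candidate regression test
    }
}
```

In this sketch, every request kept in the corpus corresponds to one generated test case; inputs that exercise nothing new are discarded, which keeps the resulting suite small.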

To benchmark the fuzzer, we gave it 24 hours to generate tests for Apache Solr, a well-known enterprise search engine. Apache Solr is a quite complex code base, spanning 288k lines of Java code in 2287 source files. We configured Solr with a database of movies, following the quickstart tutorial. We also provided 31 initial test cases that we wrote by hand, mostly by taking URLs from the same tutorial.

Knowing almost nothing about the source code of Solr, we selected the following 7 Java packages to evaluate our success.

| Package | Lines of code | Function |
|---|---|---|
| org.apache.solr.core | 12k | Application core |
| org.apache.solr.parser | 4k | Database query parser |
| org.apache.solr.request | 4k | Request handler |
| org.apache.solr.response | 5k | Response formatter |
| org.apache.solr.search | 31k | Search request execution |
| org.apache.solr.servlet | 2k | Servlet container interface |
| org.apache.solr.util | 12k | Utility classes |
| In total | 70k | |

Our goal was to reach an average end-to-end coverage of at least 30% in these packages. In end-to-end execution, functions are normally called in a predetermined environment. In contrast, unit tests often use mocks, which can increase coverage compared to end-to-end tests.

Given all of this, we were quite impressed when our early, unoptimized version of the product produced 36% line coverage out of the box:

[Figure: code coverage results from an early, unoptimized version of Diffblue's fuzzer]

But we found something even more exciting than exceeding our coverage target. The system found hundreds of requests producing Java exceptions that trigger an HTTP 500 error response. The table below shows the number of different code locations found to throw those error-producing exceptions. Note that we only consider the line where the exception is thrown, not the entire stack trace.

| Exception | Code locations |
|---|---|
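Counting "code locations" by the throwing line alone, rather than by the whole stack trace, can be sketched as follows. This is a hedged illustration and not Diffblue's tooling: the class `ThrowSiteBuckets` and its methods are hypothetical, and grouping the counts by exception type is an assumption based on the table's column headers.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch: bucket failures by the line that threw them.
class ThrowSiteBuckets {

    // Identify a failure by the top stack frame only (where the exception
    // was thrown), ignoring every caller frame beneath it.
    static String throwSite(Throwable t) {
        StackTraceElement[] frames = t.getStackTrace();
        if (frames.length == 0) {
            return "unknown";
        }
        StackTraceElement top = frames[0];
        return top.getClassName() + ":" + top.getLineNumber();
    }

    // For each exception type, count the distinct throw sites observed.
    static Map<String, Integer> distinctSitesPerType(Iterable<? extends Throwable> failures) {
        Map<String, Set<String>> sites = new TreeMap<>();
        for (Throwable t : failures) {
            sites.computeIfAbsent(t.getClass().getName(), k -> new HashSet<>())
                 .add(throwSite(t));
        }
        Map<String, Integer> counts = new TreeMap<>();
        sites.forEach((type, locations) -> counts.put(type, locations.size()));
        return counts;
    }

    public static void main(String[] args) {
        // Two failures thrown from different lines of this method.
        Throwable first = new IllegalStateException("first");
        Throwable second = new IllegalStateException("second");
        System.out.println(distinctSitesPerType(List.of(first, second)));
    }
}
```

Two requests that fail at the same line through different call paths land in the same bucket under this scheme, so the count reflects distinct faulty locations rather than distinct ways of reaching them.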

Overall, we are very pleased with these initial results, especially given that this is an early version. We will be making our fuzzing tool even smarter at catching unintended behaviors, and we plan to offer multiple classification criteria for the generated tests.