Machine Learning for Code

Using learning for speedup and improved efficiency

Machine learning is used in many different areas to make sense out of large datasets. For us, the data comprises not only the code itself, but also the code history as stored in the versioning system. This helps us decide which parts of the code need most thorough analysis, or how to assign importance to the results when presenting them to users.

Our research tells us how to use dimension reduction in order to perform fast but clever initial sweep through the code before analysing it in detail. An approach based on Bayesian statistic to pinpoint the exact location of code that contains an error. In control systems applications, reinforcement learning helps us identify the most relevant space for analysis.

Read our published work

Probabilistic Fault Localisation by David Landsberg, Hana Chockler, and Daniel Kroening

Efficient fault localisation is becoming increasingly important as software grows in size and complexity. In this paper we present a new formal framework, denoted probabilistic fault localisation (pfl), and compare it to the established framework of spectrum based fault localisation (sbfl). We formally prove that pfl satisfies some desirable properties which sbfl does not, empirically demonstrate that pfl is significantly more effective at finding faults than all known sbfl measures in large scale experimentation, and show pfl has comparable efficiency. Results show that the user investigates 37% more code (and finds a fault immediately in 27% fewer cases) when using the best performing sbfl measures, compared to the pfl framework. Furthermore, we show that it is theoretically impossible to design strictly rational sbfl measures that outperform pfl techniques on a large set of benchmarks.

Automatically inferring loop invariants via algorithmic learning by Yungbum Jung, Soonho Kong, Cristina David, Bow-Yaw Wang, Kwangkeun Yi

By combining algorithmic learning, decision procedures, predicate abstraction and simple templates for quantified formulae, we present an automated technique for finding loop invariants. Theoretically, this technique can find arbitrary first-order invariants (modulo a fixed set of atomic propositions and an underlying satisfiability modulo theories solver) in the form of the given template and exploit the flexibility in invariants by a simple randomized mechanism. In our study, the proposed technique was able to find quantified invariants for loops from the Linux source and other realistic programs. Our contribution is a simpler technique than the previous works yet with a reasonable derivation power.

Verification of Markov Decision Processes using Learning Algorithms by Tomas Brazdil, Krishnendu Chatterjee, Martin Chmelik, Vojtech Forejt, Jan Kretinsky, Marta Kwiatkowska, David Parker and Mateusz Ujma

We present a general framework for applying machine-learning algorithms to the verification of Markov decision processes (MDPs). The primary goal of these techniques is to improve performance by avoiding an exhaustive exploration of the state space. Our framework focuses on probabilistic reachability, which is a core property for verification, and is illustrated through two distinct instantiations. The first assumes that full knowledge of the MDP is available, and performs a heuristic-driven partial exploration of the model, yielding precise lower and upper bounds on the required probability. The second tackles the case where we may only sample the MDP, and yields probabilistic guarantees, again in terms of both the lower and upper bounds, which provides efficient stopping criteria for the approximation. The latter is the first extension of statistical model checking for unbounded properties in MDPs. In contrast with other related techniques, our approach is not restricted to time-bounded (finite-horizon) or discounted properties, nor does it assume any particular properties of the MDP. We also show how our methods extend to LTL objectives. We present experimental results showing the performance of our framework on several examples.

Data-Efficient Bayesian Verification of Parametric Markov Chains by Elizabeth Polgreen, Viraj B. Wijesuriya, Sofie Haesaert, Alessandro Abate

Obtaining complete and accurate models for the formal verification of systems is often hard or impossible. We present a data-based verification approach, for properties expressed in a probabilistic logic, that addresses incomplete model knowledge. We obtain experimental data from a system that can be modelled as a parametric Markov chain. We propose a novel verification algorithm to quantify the confidence the underlying system satisfies a given property of interest by using this data. Given a parameterised model of the system, the procedure first generates a feasible set of parameters corresponding to model instances satisfying a given probabilistic property. Simultaneously, we use Bayesian inference to obtain a probability distribution over the model parameter set from data sampled from the underlying system. The results of both steps are combined to compute a confidence the underlying system satisfies the property. The amount of data required is minimised by exploiting partial knowledge of the system. Our approach offers a framework to integrate Bayesian inference and formal verification, and in our experiments our new approach requires one order of magnitude less data than standard statistical model checking to achieve the same confidence.