Sometimes the Most Definite Thing is a Maybe

Security testing automation is stranded by Gödel

7 min readNov 1, 2020

Everyone who has dealt with developing a comprehensive automated security testing tool knows that it is a challenging task. Staying on top of new security bugs and making the tool performant is one thing and keeping false alarms (false positives) and missed bugs (false negatives) as low as possible is something else. But there’s a one more and inherent limitation to any automated software security testing solution; SAST or DAST…

Sometimes, the most definite answer we can get out of a security testing tool is just a MAYBE

As a side note, we have written about soundness and precision in one of our earlier post and you can read about them in order to learn some fundamental terms about the quality of automated security testing.

The limitation which is always there

Automated application security tools whether they are dynamic or static suffer from Rice’s Theorem which goes like;

All non-trivial, semantic properties of programs are undecidable.

That means we can not create an algorithm that always produce YES or NO output for the question of whether a non-trivial property of a target program holds or not. Hmm.. That’s still mouthful.

How about this with a concrete example?

It’s impossible to write an algorithm that correctly checks the existence of SQL Injection weakness for all given programs.

Being SQL Injection-free seems to be a syntactic property of a program, however, that’s only the root cause, which is mixing code and data. In fact it relates more to the behaviour;

For a dynamic analysis, the behaviour of the program (against a SQL Injection attack) should be checked through the output.
For a static analysis, the behaviour of the program should be predicted through the flow of the data from a taint source to a dangerous SQL API sink.

Here’s another but this time more easily digestible property of a program that we can’t detect correctly all the time; whether a given program is free of infinite loops.

Rice’s Theorem improves upon and goes along with the theorems of Church & Turing’s in 1936 and ultimately of Gödel’s in 1930; proof for Halting Problem’s unsolvability and Incompleteness Theorems, respectively.

Kurt Gödel is the one who showed the inherent limitations of formal axiomatic systems (such as the Set Theory or Number Theory).

Gödel at the age of 25 provided a disproof for Hilbert’s Program.

What mathematics to Gödel was automated computation to Alan Turing. So, putting Turing Machines forward, he provided a solution to Entscheidungsproblem (Halting Problem) challenge of Hilbert;

We can’t build a machine that solves the problem of whether a program with a given input eventually stops or not.

You can watch a popular version of the proof if you haven’t seen it before.

So, this quite discomforting result begs the question of “how can we write a reliable complex security testing tool when we can’t even decide on whether a program stops or not for a given input”?

Let’s make this pessimistic curiosity more concrete at least for the topic of static code analysis on a specific question.

Undecidability through Control Flow Graphs

One of the most important techniques to use in order to analyze a program statically is constructing Control Flow Graphs (CFG). Coupled with Data Flow Analysis, CFGs allow us to follow tainted data to dangerous API calls, which mean a security weakness.

Let’s assume we have the following target program to analyze;

void Greet(string name)
{
   string msg = "Welcome"
   
   if(name is not Empty)
       msg.Append(", " + name)
    
   myLabel.SetText(msg)
}

It’s pretty straight-forward code but basically the action method Greet gets a user-provided input in name. If the name is not empty, the code appends it to a hard-coded message and return it to the user via a label called myLabel. Here is a simple CFG constructed from the above code block;

A simple CFG representation of the code block shown previously

The primitive question we have to ask to find a Cross Site Scripting (XSS) vulnerability is; Is there a path between the initial and the final node? In the above example, the initial node contains a user provided input and the final node contains a method that may cause an XSS. So, if the answer is YES a path exists, then there’s a possibility of a vulnerability.

A generic explanation of XSS: A weakness that occurs when the code outputs any user input to the client browser or any code rendering engine without any prior validation or neutralization .

I admit that the answer to this question alone won’t help much, however, it’s a required stepping stone towards further analysis. Here, shamelessly copying from Turing, I’ll try to show you that it’s not possible to create an algorithm that can always answer this reachability question for any given program and statement pairs inside.

A Basic disproof

Pursing reductio ad absurdum, let’s assume that it’s possible to write such a program which can answer reachability problem for any two nodes of any given CFG. Let’s call this R. Here’s the representation of R.

We assume there exists a program R solving the CFG reachability problem

P is the input program, plus stmt1 and stmt2 are two statements in P that we are seeking a reachability. If the stmt2 is reachable from stmt1, then R outputs YES, otherwise NO. Since the good developers of R would definitely make it a free or an open-source project, let’s now extend it a bit.

Let’s call this extended program H. It also takes the same input parameters and directs them right into R. And when it sees the output YES, we extend the code to add an infinite loop statement, which will make our H spin forever and never exit. Conversely, when we see the output NO, then we use the exit statement to, well, exit the application. Now, our H has an initial statement, that I call init and a single return statement called exit.

So far so good. What happens then, when we feed the H to itself as a target program and (init,exit) statement tuples as input. Let’s simulate;

If R outputs YES, H would spin forever. But that can’t be. Why? By outputting YES, R says that exit is reachable from init for H. But we know that that’s not possible due to the infinite loop.

If R outputs NO, H would exit. But that can’t be, neither. Why? By outputting NO, R says that exit is not reachable from init for H. But we know that it is, since H exited.

This means R can not exist. We can’t write a program that answer the question of statement reachability for any program and its any two statements.

It’s all incomplete, we just approximate

Albeit along similar lines, it’s also not possible to write a tool that is both consistent and complete at the first place. Let’s hear what Gödel’s Theorems say about this;

Any algorithm that produces statements about arithmetic either sometimes produces false statements, or else there are true statements about arithmetic that it never produces.

This is, too, a very strong result. Since Turing developed the idea of computation simulated by machinery and modern computers are nothing but fast Turing Machines only with finite memory, this holds for modern computers/programs, too.

So, what that means is that, any security testing tool is bound to produce some false positives or miss some real issues. We can’t nail both.

Coupling this with the Rice’s Theorem we mentioned previously, we have a heartbreaking truth in our hands. Since it’s certain that we can’t produce a %100 solution, all we can do is aiming the best we can get. That is approximation with dignity. :)

Every SAST or DAST solution approximates. That is they make assumptions about flows, behaviors, settings, APIs supported, input combinations, timings etc. Sure they can spot shiny and easy weaknesses with a good certainty or with validation, however, when they assume all they can output is a MAYBE instead of TRUE or FALSE. They then tell us that there maybe an XSS here or IDOR there but never %100 sure.

And as users of these tools this situation is usually fine with us. As long as we don’t get stupid looking or low number of false alarms. When it comes to missing alarms, we want tools to keep this number at the minimum. The quality of the assumptions determines this balance.

Conclusion

From time to time, we all hear marketing mambo jambo concerning the quality of security automated testing tools such as producing “zero false-alarms” or revealing “every existing bug”.

However, automated security testing methodologies are bound to be either incomplete or inconsistent. But let me assure you in practice, almost always both.

But that’s OK. We know by experience that these tools are invaluable. For example, complete manual code analysis of a moderate application takes an awful amount of time. So, automatic analysis, albeit they have limitations, provide a great leverage for finding both for decidable and undecidable program properties.

The key idea is, there are many criteria to contemplate on when choosing a SAST tool, however, for issue quality one of the most important thing that as a consumer we should be after is getting false positives and false negatives as less as possible. To that end we should definitely make use of benchmarking results to choose the best one for us.

At CodeThreat we provide such a list of test cases that you can use under an open-source project called FlowBlot.NET. You can always run your favorite SAST tools against it and compare the results at your heart’s content.