Sound as a Pound, as Precise as an Atomic Clock

CodeThreat
6 min read · Jun 10, 2020

Over a series of posts, we’ll try to explain some of the fundamentals of static code analysis. Armed with this information, security professionals and developers will find it easier to understand, and perhaps compare, the internals of such tools, CodeThreat being one of them.

This is Bedirhan, and in this post I’ll go through a few important attributes used to describe output quality in program analysis. Some of these attributes also capture how tools approach the static code analysis problem.

List of security bugs identified by running a static scan against the Enterprise-BackOffice application

Let’s start with this quadruple: True Positive, True Negative, False Negative, and False Positive. We may already be familiar with the last two, especially the last one. We’ll go over each of them in light of the drawing above.

Assume we ran a security scan against a software project called Enterprise-BackOffice, which yielded a list of security issues. The drawing shows the flagged bugs that our scan produced. That is to say, the scan claims that Enterprise-BackOffice contains all of these security bugs and that they should be fixed. Let’s encircle those claimed findings in our drawing, forming a finding area.

Classified list of identified and unidentified security bugs of the scan results

Quite naturally, some of these bugs are false alarms; in reality, Enterprise-BackOffice doesn’t have them. In more technical terms, they are False Positives. Let’s assume the left-hand side of the finding area denotes these wrong findings, which we’ll label FP for short.

And obviously, some of the findings are real alarms. Each of them poses a real risk and should be fixed on a schedule. These are called True Positives. Let’s assume the right-hand side of the finding area denotes these true findings, which we’ll label TP for short.

That’s great. Let’s now focus on the parts that fall outside the finding area. The part on the right-hand side, outside the TP area, is particularly interesting: it contains all the real bugs that our scan couldn’t locate! Our scan failed to find these actual risks. We label this part False Negatives. These missed findings are scary for any automated vulnerability solution, or even for manual inspection services. We’ll get back to this later on.

How about the part on the left-hand side, outside the FP area? This part holds bug types rather than findings, and I emphasize “bug types” for a reason: as opposed to the other areas, it doesn’t contain any findings at all. It covers the risks that don’t exist in Enterprise-BackOffice, and our scan correctly refrained from flagging them. For example, assume the developers of Enterprise-BackOffice use a secure SQL framework and APIs, so the code they wrote has no SQL Injection at runtime. In parallel, our static scan rightly didn’t flag any SQL Injection bugs. These are called True Negatives, and we’ll label this part TN.
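To make the SQL Injection example concrete, here is a minimal Java sketch of the kind of parameterized query a secure SQL API encourages. It isn’t taken from Enterprise-BackOffice (a made-up project), and the class and method names are illustrative. Because the user-supplied value is bound as a parameter rather than concatenated into the query string, there is simply no SQL Injection for a scan to flag:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class AccountDao {

    // The owner name is bound as a parameter, so it is treated strictly as data;
    // a scan that sees no tainted string reaching the query rightly stays silent.
    public ResultSet findByOwner(Connection conn, String ownerName) throws SQLException {
        PreparedStatement stmt =
            conn.prepareStatement("SELECT id, balance FROM accounts WHERE owner = ?");
        stmt.setString(1, ownerName);
        return stmt.executeQuery();
    }
}
```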

Let’s now focus on the following code snippet.

A code piece with a possible Null Pointer bug
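The original snippet was shared as an image; since that image isn’t reproduced here, below is a minimal Java reconstruction consistent with the description that follows. The identifiers accounts and checkSatisfied come from the text; everything else (accountRepository, customerId, reportTotal) is assumed for illustration:

```java
List<Account> accounts = accountRepository.findAccounts(customerId); // may return null
if (checkSatisfied) {
    int total = accounts.size(); // third line: NullPointerException if accounts is null
    reportTotal(total);
}
```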

Let’s manually analyze this piece. There’s a potential Null Pointer problem on the third line, because the accounts aggregate might be null. However, let’s assume that in reality the checkSatisfied boolean expression never becomes true in the first place. So, when analyzing this code manually, we won’t flag a Null Pointer bug here. For a tool doing it automatically, though, it’s nearly impossible to decide this unless the value of checkSatisfied is hard-coded, since it may depend on extremely complex pre-calculations.

Therefore, all static code analysis techniques and solutions approximate. For instance, they assume there might be a situation where checkSatisfied could be true, so they will, and should, flag a Null Pointer bug here. This will of course create a False Positive. But it’s generally better to have a False Positive than a False Negative. Well, let me rephrase that.

It’s generally better to have a manageable number of false alarms than to have critical security bugs that go unnoticed.

As a metaphor, think about COVID-19 tests. It might be better to have some false alarms than to have unnoticed COVID-19 patients.

While we are at it, let me state our goals at CodeThreat:

- We aim to produce as few False Negatives as possible, and
- We aim to produce as few False Positives as possible.

Most of the time, these two are related. How? Here’s an exaggerated example:

As a static analysis solution provider, assume we decide not to miss any SQL Injection bugs in a target application. So, we flag every method call that runs SQL commands, whether its arguments are dynamic or not. This is called approximation; in fact, over-approximation. With this approach we won’t miss any SQL Injection bugs (hence, zero False Negatives), and that will make our tool sound. However, it will also produce so many False Positives that the solution becomes useless.
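For illustration, here is a hedged Java sketch (the class and query are made up) of the kind of call such an over-approximating rule would still flag. The query string is a compile-time constant, so no attacker-controlled data can reach it, yet a rule that alarms on every SQL-executing call reports it anyway, producing a False Positive:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ReportJob {

    // The query is a constant string: no user input flows into it, so there is
    // no real SQL Injection risk. A rule that flags every call executing SQL,
    // regardless of whether its arguments are dynamic, still raises an alarm here.
    public ResultSet countActiveAccounts(Connection conn) throws SQLException {
        Statement stmt = conn.createStatement();
        return stmt.executeQuery("SELECT COUNT(*) FROM accounts WHERE active = 1");
    }
}
```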

To be more specific, an analysis that evaluates every possible execution of a program when producing alarms is sound. Soundness is a desirable attribute of a solution; however, it naturally requires over-approximation. So while we will have few to no False Negatives, we will definitely have many False Positives.

Is it possible to have fewer False Positives while still approximating? That’s the million-dollar question. Here are two more definitions:

Recall is a performance ratio that answers the question “how many of the relevant items did we find?”. It relates to the right-hand side of the drawing we had. Formulating it boils down to:

Recall = TP / (TP + FN)

When we have zero False Negatives, we get a perfect-looking Recall value of 1.

Precision, on the other hand, is also a performance ratio; it answers the question “how many of the items we found are relevant?”. It relates to the encircled area in the drawing we had. Formulating it boils down to:

Precision = TP / (TP + FP)

When we have zero False Positives, we get a perfect-looking Precision value of 1.
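For a quick illustration with made-up numbers: suppose a scan reports 80 findings, 60 of them real (TP = 60) and 20 of them false alarms (FP = 20), while 15 real bugs go unreported (FN = 15). Then Recall = 60 / (60 + 15) = 0.8 and Precision = 60 / (60 + 20) = 0.75. Neither is perfect, and pushing one toward 1 typically puts pressure on the other.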

It’s extremely hard to get both Recall and Precision perfect. In fact, it’s extremely hard to get even one of them perfect. So, if we map the goals listed above to these definitions, we want CodeThreat to be sound [1] and precise.

To recap, it’s important for a static code analysis tool to produce as few false alarms (FPs) AND missed alarms (FNs) as possible. In technical terms, these are evaluated by the precision and recall ratios explained above. While it’s hard to get both ratios perfect, they are among our ultimate goals. At CodeThreat, we have built benchmarking projects to measure our quality and our progress toward these goals. We’ll publish one or two of these projects in the near future.

[1] Well, soundish at least. This is another discussion for another post.


CodeThreat is a static application security testing (SAST) solution. Visit codethreat.com