Yesterday, I had to explain what I do to a group of people, and I had a really hard time expressing myself, puffing and sweating all the way through. The topic was program analysis. In my specific case, that means analyzing source code to find potential bugs without actually running the code.
Designing and implementing a program that can analyze other programs is not easy, but explaining the details to people with no prior interest in the subject is just as challenging.
This is Bedirhan, and here’s my second take, with an intuitive example that may explain the basic process.
Assume that a successful brain surgeon has the following breakfast routine. It may look almost robotic, but as you might guess, she is extremely busy. So, let’s accept her the way she is.
The question we ask is whether this simplified routine is correct, efficient and secure. She is saving lives, so we should do our part to help her, don’t you think? Before you keep reading, you can spend a few minutes trying to find problems in her breakfast routine.
And here is a list of possible problems:
- She fetches an extra cup which is unnecessary
- Actually, there’s no need for a cup at all, since there’s the bottle
- There’s no need to sit down again after refilling the bowl with more milk
- She doesn’t check the expiration date on the bottle of milk, which is risky for obvious reasons
- After eating, or upon an emergency call, she doesn’t put the milk back in the fridge or clear the table
You may find problems other than the ones on this list. By fixing the relevant ones, including those listed above, she can optimize and secure her routine for a better life.
What we did when trying to find problems is called manual analysis. Of course, in real life our routines are far more complex and come in many varieties. It’s nearly impossible to analyze each of them by hand and, more importantly, to reason about how all these little tasks combine into one whole life routine.
So, can we do the same thing automatically, by feeding routines like the breakfast one into a program? Well, this is called program analysis, or more specifically static code analysis. It is called static because we don’t have to run the routine in order to find the problems in it. We just have to feed it into our hypothetical program as text, and the program will output possible avenues for improving it.
Let’s now map our example to what we actually do. Here the breakfast routine is a piece of software. Someone wrote it. Moreover, someone runs it to make use of it. But there may be problems in the software. It might not be optimized, secure or even correct. It is hard to analyze by hand, since an average piece of software isn’t 10–15 lines long; it may contain tens of thousands to millions of lines of code. So, we aim to do the analysis automatically, by building another program that uses and improves on well-established program analysis techniques such as live variable analysis, reaching definitions analysis and control flow analysis.
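To make one of those techniques a bit more concrete, here is a minimal Python sketch of a live-variable-style pass that would flag the extra cup. Everything in it is an assumption made for illustration: the `Step` encoding, the item names and the steps themselves are hypothetical, not the surgeon’s actual routine and not how a real analyzer represents code. The idea is simply to walk the routine backwards, remember which items are still needed later, and report anything that is fetched but never needed.

```python
from typing import List, NamedTuple, Set, Tuple


class Step(NamedTuple):
    text: str          # human-readable description of the step
    defines: Set[str]  # items the step brings into play
    uses: Set[str]     # items the step actually needs


# Hypothetical encoding of a breakfast routine (illustrative only).
ROUTINE = [
    Step("take the milk bottle out of the fridge", {"milk_bottle"}, set()),
    Step("fetch a bowl", {"bowl"}, set()),
    Step("fetch a cup", {"cup_a"}, set()),
    Step("fetch another cup", {"cup_b"}, set()),
    Step("pour milk into the cup", set(), {"milk_bottle", "cup_a"}),
    Step("pour the cup into the bowl", set(), {"cup_a", "bowl"}),
    Step("eat from the bowl", set(), {"bowl"}),
]


def find_unused(routine: List[Step]) -> List[Tuple[str, str]]:
    """Walk the routine backwards and report items no later step needs."""
    live: Set[str] = set()  # items still needed by some later step
    unused = []
    for step in reversed(routine):
        for item in sorted(step.defines):
            if item not in live:
                unused.append((step.text, item))
        live -= step.defines  # before this step, the item did not exist yet
        live |= step.uses     # whatever this step needs must be available earlier
    unused.reverse()          # report findings in routine order
    return unused


print(find_unused(ROUTINE))  # [('fetch another cup', 'cup_b')]
```

This sketch only handles a straight-line routine. Real code has branches and loops, which is where control flow analysis comes in.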
Here’s how we essentially locate one of the problems we found in our surgeon’s breakfast routine: the risk of getting poisoned because she doesn’t check the expiration date on the bottle of milk. The problem might be easy to find manually. When doing it automatically, however, we have to follow a technique, generally called data flow analysis, in order to locate it.
The figure below tries to explain the technique in an abstract way. Implementing it efficiently and correctly is another challenge, but I won’t focus on that for now. Maybe in other posts, in more boring technical detail…
In the figure above, there’s a flow of milk from its bottle to a cup and then to the bowl. There might also be a spoon somewhere, but I left it out on purpose to keep the example concise and simple. Anyway, this flow ends up in our beloved surgeon’s digestive system. So, the milk has to be fresh; otherwise, she won’t be able to save lives for a while.
We have drawn the flow backwards in the figure, opposite to its natural direction, and that’s a deliberate choice. Our static analysis program behaves like a detective: it first finds the dangerous points (called sinks) in the routine, such as eating.
Before our program runs, it already knows that eating can create problems if what is eaten hasn’t been checked. Moreover, our program already knows that a bottle of milk contains something that might pose a risk to human life as well as something that is super healthy (such a starting point is usually called a source). Generally, this prior information is called the rules in static analysis. So, armed with this rule, all our program has to do is find out whether there’s any flow from the bottle of milk to our stomach. Once we locate the word eat, we track the content of what’s eaten. This is shown in the figure: from bowl to cup, and from cup to bottle of milk.
But that’s not enough, because the existence of such a flow doesn’t by itself mean that we have a problem. Maybe our cautious surgeon includes an action in the routine that checks the expiration date of the milk (she doesn’t). So, our program has to know about this action and account for it, too, in order not to produce false alarms. Too many false alarms turn everybody off.
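The detective work described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not a real analyzer: the `(action, target, source)` step encoding, the action names `pour`, `eat` and `check_expiration`, and the rule constants are all hypothetical names chosen for this example. The sketch starts at each sink, walks backwards through the pours to see what the eaten content came from, and raises an alarm only if it reaches a risky source that was never checked beforehand.

```python
from typing import List, Tuple

# Hypothetical step encoding: (action, target, source); source is "" when unused.
Step = Tuple[str, str, str]

# The "rules": what is risky, where it becomes dangerous, and what makes it safe.
SOURCES = {"milk_bottle"}       # content that might be spoiled
SINK = "eat"                    # the dangerous point
SANITIZER = "check_expiration"  # the action that removes the risk

ROUTINE: List[Step] = [
    ("pour", "cup", "milk_bottle"),
    ("pour", "bowl", "cup"),
    ("eat", "bowl", ""),
]


def find_unchecked_flows(routine: List[Step]) -> List[Tuple[int, List[str]]]:
    """Return (sink step index, trail from sink back to source) for risky flows."""
    alarms = []
    for i, (action, target, _) in enumerate(routine):
        if action != SINK:
            continue
        trail = [target]  # items whose content reaches the sink, sink first
        checked = set()   # items checked at some point before the sink
        for prev_action, prev_target, prev_source in reversed(routine[:i]):
            if prev_action == SANITIZER:
                checked.add(prev_target)
            elif (prev_action == "pour" and prev_target in trail
                  and prev_source not in trail):
                trail.append(prev_source)  # follow the content one step back
        for item in trail:
            if item in SOURCES and item not in checked:
                alarms.append((i, trail[: trail.index(item) + 1]))
    return alarms


print(find_unchecked_flows(ROUTINE))
# [(2, ['bowl', 'cup', 'milk_bottle'])]

# A more cautious surgeon checks the bottle first, so no alarm is raised.
SAFE_ROUTINE = [("check_expiration", "milk_bottle", "")] + ROUTINE
print(find_unchecked_flows(SAFE_ROUTINE))
# []
```

Note the simplification: any check that happens before eating counts, wherever it sits in the routine. A real tool has to be much more careful about which checks actually protect which flows, and that is exactly where both missed bugs and false alarms come from.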
Obviously, difficult problems and challenges remain, and as we go deeper I could make more analogies (language approximations, context-sensitivity, flow-sensitivity, points-to analysis, def-use chains, etc.). However, this is a good place to stop. I hope this post sheds some light on static code analysis for security.
Our team at CodeThreat is working towards producing easy-to-use, easy-to-understand and easy-to-deploy quality security code analysis solutions that will hopefully help secure your software.