Property Simulation – a new model for CDT’s Static Analysis (Part 2)

December 18, 2009

In the previous entry, I covered the details and motivation behind using Property Simulation for the CDT’s static analysis. Now we will move on to the implementation details. The first thing we needed was a control flow graph. Luckily, the Parallel Language Development Tools sub-project of the Parallel Tools Platform already provides a control flow graph implementation for the CDT. I modified it to model control flow edges explicitly, since they are crucial to the Property Simulation algorithm. The modified control flow graph therefore contains both vertices (blocks) and directed edges. The version of the Property Simulation algorithm I implemented is the simple intra-procedural case: the control flow of each function in the C/C++ project is processed separately.

The algorithm traverses the control flow graph in a breadth-first manner, populating state information in a dictionary that maps control flow edges to sets of Symbolic States (each is a set of Property States and an Execution State). Each block in the control flow graph represents a node (either an expression or a statement) in the AST. To model Execution States, I have created a structure that operates on Boolean formulas. The trick is then to break the conditions of if statements into atoms that are either true or false. I chose IVariable objects as the basis for these atoms. These objects are obtainable from the AST and have the advantage of maintaining a binding to their declarations, so all references to a variable have the same IVariable instance associated with them. An Execution State is therefore a conjunction of IVariable atoms and negations of atoms.
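As a rough illustration of the Execution State structure described above, here is a minimal sketch in Python rather than the plug-in's Java. The class name, the assume method and the use of plain strings in place of IVariable objects are all assumptions made for this example, not the actual implementation.

```python
from typing import Optional


class ExecutionState:
    """A conjunction of literals: each variable name maps to True or False.

    Plain strings stand in for IVariable objects (an assumption of this sketch).
    """

    def __init__(self, literals=None):
        self.literals = dict(literals or {})

    def assume(self, var: str, value: bool) -> Optional["ExecutionState"]:
        """Return a new state with the literal added, or None if it contradicts."""
        if self.literals.get(var, value) != value:
            return None  # e.g. x and !x together: this branch cannot occur
        extended = dict(self.literals)
        extended[var] = value
        return ExecutionState(extended)

    def __str__(self):
        if not self.literals:
            return "true"
        return " ".join(v if val else "!" + v
                        for v, val in sorted(self.literals.items()))


entry = ExecutionState()                      # no conditions yet
then_branch = entry.assume("x", True)         # inside if (x)
else_branch = entry.assume("x", False)        # the implicit else
nested = then_branch.assume("y", False)       # if (x) { if (!y) ...
infeasible = then_branch.assume("x", False)   # contradictory, so discarded
```

Returning None for a contradictory literal is one simple way to model discarding branches that cannot occur.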

The performance benefit of Property Simulation comes from the way it groups Symbolic States and discards branches that cannot occur. The grouping function partitions all Symbolic States by their Property State, then joins the Execution States of all Symbolic States that share a given Property State. For example, suppose at a merge point we have two Symbolic States [opened, xy] and [opened, x!y]. Since they have the same Property State, we group them into one Symbolic State. The Execution State is then the disjunction of xy and x!y. However, Execution States are conjunctions of variables, so each term of a disjunction must be kept as a distinct Execution State. The goal is to find a minimal set of terms that together cover the original Execution States. For our example, xy + x!y simplifies to just x, since x is logically equivalent to xy + x!y. This leaves us with one Symbolic State [opened, x]. Every Symbolic State must be processed for each block in the control flow graph, so the fewer Symbolic States we have, the faster the algorithm performs. To minimize a set of Execution States we use the Quine-McCluskey algorithm for Boolean minimization, which works much as I just demonstrated. We simplify further by substituting in any truth assignments we have learned from assignment statements.
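The grouping step can be sketched as follows, again in Python and with hypothetical names. Note that this is not a full Quine-McCluskey implementation: it only applies the pairwise combining step greedily (two terms that differ in the polarity of a single variable collapse into one), which is enough for the xy + x!y example but does not guarantee a minimal result in general. Each Symbolic State is simplified here to a (property state, execution state) pair, with the Execution State as a frozenset of (variable, polarity) literals.

```python
from collections import defaultdict
from itertools import combinations


def combine(a: frozenset, b: frozenset):
    """Merge two conjunctions that differ only in the polarity of one variable."""
    diff = a ^ b  # literals present in exactly one of the two terms
    if len(diff) == 2:
        (v1, p1), (v2, p2) = diff
        if v1 == v2 and p1 != p2:
            return a & b  # drop the variable both terms disagree on
    return None


def simplify(terms: set) -> set:
    """Apply the combining step repeatedly until no pair can be merged."""
    terms = set(terms)
    changed = True
    while changed:
        changed = False
        # Sort for a deterministic order; each term is a frozenset of tuples.
        for a, b in combinations(sorted(terms, key=sorted), 2):
            merged = combine(a, b)
            if merged is not None:
                terms -= {a, b}
                terms.add(merged)
                changed = True
                break
    return terms


def group(symbolic_states):
    """Partition (property, execution) pairs by property, then simplify each part."""
    by_property = defaultdict(set)
    for prop, execution in symbolic_states:
        by_property[prop].add(execution)
    return {(prop, e)
            for prop, terms in by_property.items()
            for e in simplify(terms)}


xy = frozenset({("x", True), ("y", True)})
x_not_y = frozenset({("x", True), ("y", False)})
not_x = frozenset({("x", False)})
merged = group([("opened", xy), ("opened", x_not_y), ("closed", not_x)])
```

In this example the two opened states collapse into [opened, x], while the closed state is left untouched because it has a different Property State.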

Property States are the extender’s customizable component. A checker tracks some temporal property that is encoded as a Finite State Machine of Property States. Property States are implemented as an abstract class where the subclass handles decisions about state transitions by examining a statement or expression. A Finite State Machine of Property States is defined via an interface with methods to get all Property States in the FSM, along with the special initial and error states. To define your own checker, use the org.eclipse.cdt.codan.core.checkers Extension Point and have your checker class extend AbstractPropSimChecker. Provide your checker with the finite state machine encoding of your temporal property and the framework handles the rest.
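To give a feel for the shape of such a checker, here is a small Python sketch of a finite state machine of Property States for the fopen/fclose property. It does not reproduce the real codan extension API: the class names, the transition method and the choice of states (closed, opened, error) are assumptions for illustration, and statements are reduced to the name of the called function.

```python
from abc import ABC, abstractmethod


class PropertyState(ABC):
    """One state of the checker's finite state machine (hypothetical API)."""

    @abstractmethod
    def transition(self, call: str) -> "PropertyState":
        """Return the next state after seeing a call to the named function."""


class Closed(PropertyState):
    def transition(self, call):
        if call == "fopen":
            return OPENED
        if call == "fclose":
            return ERROR  # closing a file that was never opened
        return self


class Opened(PropertyState):
    def transition(self, call):
        return CLOSED if call == "fclose" else self


class Error(PropertyState):
    def transition(self, call):
        return self  # the error state is absorbing


CLOSED, OPENED, ERROR = Closed(), Opened(), Error()


class OpenCloseMachine:
    """Exposes all states plus the special initial and error states."""
    states = (CLOSED, OPENED, ERROR)
    initial = CLOSED
    error = ERROR


def run(machine, calls):
    """Feed a straight-line sequence of calls through the machine."""
    state = machine.initial
    for call in calls:
        state = state.transition(call)
    return state
```

For instance, run(OpenCloseMachine, ["fclose"]) ends in the error state, while run(OpenCloseMachine, ["fopen", "fclose"]) returns to the initial state.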

We report errors via a callback by the algorithm with the AST node that triggered the problem, and the Execution State that causes the error condition to occur. The Execution State’s toString method displays the Boolean form in a formatted String. Here is an example checking for cases where fclose is called on an unopened file:

Checking for closing unopened files


The test program is a modified excerpt from the md5sum program. Notice in one branch the file is not opened if binary is false.

Of course, this implementation is not complete. The analysis needs to be extended to consider the global control flow of the program, rather than just individual functions. We can do this by generating summaries for each function, which contain mappings from input Symbolic States to where the function takes them upon its return. We consult the summaries at function call sites to simulate executing the function.
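One possible shape for such summaries, sketched in Python under heavy simplification: a summary here maps each input Property State to the set of Property States the function may return in, and the caller's Execution State is carried through unchanged. The names and the summary format are assumptions for illustration only; none of this is implemented yet.

```python
def apply_summary(summary, states):
    """Simulate a call: replace each caller state by the callee's possible outcomes."""
    return {(out, execution)
            for prop, execution in states
            for out in summary[prop]}


# Hypothetical summary of a helper that may or may not open the file.
maybe_open = {
    "closed": {"closed", "opened"},
    "opened": {"opened"},
}

condition = frozenset({("x", True)})
before_call = {("closed", condition)}
after_call = apply_summary(maybe_open, before_call)
```

A fuller version would also need to combine the caller's Execution State with the conditions under which the callee reaches each outcome.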

Unstructured code is another hurdle to consider. break/continue, goto, switch/case statements and multiple return statements all cause problems in analyzing the control flow. We currently assume that a branch block has two children and a merge block has two parents, an assumption that these constructs violate. Conditional and assignment parsing also need to be improved. Currently, we can only recognize conditionals that take the form: (var == literal), (var != literal), (var) or (!var). Similarly, pointer operations (dereferencing, arithmetic) cause parsing issues. Lastly, we need to be able to analyze more than one file handle at a time. This requires an implementation of a value flow graph to track file handles.

I have hosted my code in a Git repository on Fedora People. My Property Simulation implementation along with the open/close example checker are in the org.eclipse.cdt.codan.extension plug-in. My modified PLDT plug-ins are also in the repository. The stock static analysis framework is also included.

To check out the code:

git clone git://fedorapeople.org/~ebaron/codan.git

…or browse the Gitweb interface.

I have submitted a talk for EclipseCon that covers Property Simulation and the CDT. If you are interested, please comment. Thank you.

Property Simulation – a new model for CDT’s Static Analysis

December 16, 2009

Several months ago, I discovered the need for improvements to Eclipse CDT’s static analysis. Namely, we need an analysis model that correctly accounts for the control flow paths of a program. To test the abilities of the framework, I created a checker to detect instances where close was being called on a file descriptor that did not have a corresponding open call earlier in the control flow. Consider the following snippet of code:

int fd;
if (x) {
        fd = open("file", O_RDONLY);
}
close(fd);

If we simply traversed a function containing this code with an ASTVisitor, all we could do is detect that the open call is visited before the close call and assume there is no error condition. In reality, we know that close is called improperly if the branch is not executed. Thus we have a conditional error that will occur if x evaluates to false.

In my search for a solution, I came across a paper entitled “ESP: path-sensitive program verification in polynomial time” from Microsoft Research and the University of Washington. The paper explains and compares different static analysis techniques. A temporal property to check is expressed as a finite state machine. A finite state machine for our file open/close example could look like this [1]:

The finite state machine’s states are known to the algorithm as Property States. We also keep track of the Execution State of the program at a given instant. This includes necessary conditions to execute a given block in the control flow graph. In our code snippet above, the open call would have an Execution State of x, meaning x must be true in order to execute that block. Each Execution State takes the form of a Boolean formula and subsequent if statements append another clause to it. The basic structure that contains all of this information in the algorithm is called a Symbolic State. Each Symbolic State contains one or more Property States and an Execution State. Each edge in the control flow graph references a set of Symbolic States that describe the state of the program at that point. In our snippet, the out-edge of the open statement block would have the following Symbolic State associated with it: [opened, x].

The paper covers three static analysis methods, which differ in the way Symbolic States are grouped and simplified. Here is how they differ at merge points:

  1. Fully Path-sensitive Analysis (PSA) – No grouping is performed. All Symbolic States from either side of the branch are kept.
  2. Dataflow Analysis – All Symbolic States from both sides of the branch are consolidated into a union of the Property States and join of Execution States.
  3. Property Simulation – Symbolic States are grouped by their Property State, corresponding Execution States are joined.

This diagram shows how each method analyzes a control flow graph [1].

You can see that the Fully Path-sensitive Analysis (PSA) retains all information and performs no state grouping at merge points. Note that this means that each branch point doubles the number of states. Fully Path-sensitive Analysis is seen as intractable due to this exponential blowup in the number of states. The Dataflow Analysis does the very opposite. Whenever the Execution State is made more specific by a branch point, it is discarded at the merge point. We then mistakenly transition to an error state because we lost the Execution State context that caused the file to be opened.

Property Simulation meets the two approaches in the middle. It discards Execution State information only when the branch did not cause a Property State transition, since that information is then not meaningful to our analysis. Note that the Fully Path-sensitive Analysis, but not Property Simulation, stored Execution State information for p even though the branch that depends on p does not contain any file operations. In this way, Property Simulation avoids the inefficiency of the Fully Path-sensitive Analysis as well as the false error reports of the Dataflow Analysis.
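The three merge strategies can be compared side by side with a small Python sketch. This is an illustration under assumptions rather than the paper's formulation: a Symbolic State is a pair of a frozenset of Property States and an Execution State, an Execution State is a frozenset of (variable, polarity) literals, and the join of Execution States simply keeps the literals all of them agree on. The state names closed and opened are also assumed.

```python
from functools import reduce


def join(executions):
    """Join Execution States by keeping only the literals they all share."""
    return reduce(frozenset.__and__, executions)


def merge_path_sensitive(states):
    """Fully path-sensitive: keep every Symbolic State as-is."""
    return set(states)


def merge_dataflow(states):
    """Dataflow: one state with the union of Property States and joined executions."""
    props = frozenset().union(*(p for p, _ in states))
    return {(props, join([e for _, e in states]))}


def merge_property_simulation(states):
    """Property Simulation: join executions only within the same Property State."""
    groups = {}
    for props, execution in states:
        groups.setdefault(props, []).append(execution)
    return {(props, join(es)) for props, es in groups.items()}


def lits(**kw):
    return frozenset(kw.items())


closed, opened = frozenset({"closed"}), frozenset({"opened"})
# States reaching a merge point after a branch on p (no file operations)
# followed by: if (x) fd = open(...);
states = [
    (closed, lits(p=True, x=False)),
    (closed, lits(p=False, x=False)),
    (opened, lits(p=True, x=True)),
    (opened, lits(p=False, x=True)),
]
```

With these inputs the path-sensitive merge keeps all four states, the dataflow merge collapses them into a single state that has lost x, and Property Simulation keeps two states, [closed, !x] and [opened, x], having dropped only the irrelevant p.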

Details on my work implementing this algorithm for the CDT will follow shortly.


1. Image Credit: Das, M., Lerner, S., and Seigle, M. 2002. ESP: path-sensitive program verification in polynomial time. In Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation (Berlin, Germany, June 17 – 19, 2002). PLDI ’02. ACM, New York, NY, 57-68.

First Look at CDT Code Analysis

September 2, 2009

Recently, I have been looking at Static Code Analysis for the Eclipse CDT. This allows for quick detection of potential coding problems without requiring a potentially long compile or running the program through another external tool.

Analyzing C/C++ source code makes use of the abstract syntax tree generated by the CDT’s C/C++ parser. A “checker” detects one kind of coding problem using a Visitor pattern to visit elements of interest in the abstract syntax tree of a translation unit/source file. New checkers are contributed via Eclipse’s Extension Point mechanism.

I have begun work on a new checker to warn users if they attempt to close a file that they have not opened yet. The visitor visits expressions looking for an IASTFunctionCallExpression that calls open. Then it needs to find the variable that will store the file descriptor returned by open. The case of assigning the file descriptor to a variable in a declaration statement needs to be handled separately from assignment to an already declared variable, because the two kinds of statement are represented quite differently in the AST. The IASTName of the variable is stored in a list of opened file descriptor variables.

Close file descriptor checker in action

A close IASTFunctionCallExpression has its argument compared with the list of previously encountered opened file descriptor variables. If the IASTName for close’s argument does not match any in the list, then we report a warning to the user.
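The bookkeeping described above boils down to something like the following Python sketch. It is not the checker's Java code: calls are reduced to hypothetical (function, variable) pairs in visit order, standing in for IASTFunctionCallExpression nodes and IASTName bindings.

```python
def check_calls(calls):
    """Return the variables passed to close without an earlier open in visit order."""
    opened = []    # variables that received a file descriptor from open
    warnings = []  # variables closed without a matching open
    for function, variable in calls:
        if function == "open":
            opened.append(variable)
        elif function == "close" and variable not in opened:
            warnings.append(variable)
    return warnings


ok = check_calls([("open", "fd"), ("close", "fd")])
bad = check_calls([("close", "fd")])
other = check_calls([("open", "in"), ("close", "out")])
```

Because the check depends only on the order in which calls are visited, it says nothing about whether the open call is actually executed on every path that reaches the close call.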

I am hosting my CDT code analysis work in a Git repository on Fedora People. Please take a look!

git clone git://fedorapeople.org/~ebaron/codan.git

There is still much work to be done for this checker to be usable for an end-user.
Problems:

  1. Visit order — it is likely not the case that the visitor will visit code in the order of the control flow of a program.
  2. Secondary variable assignment — we need to handle the case that an opened file descriptor variable may be copied to another variable.
  3. Multiple translation units — we should be able to track file descriptors passed between different files.