Idioms formalize this intuition. An idiom is a rule that relates combinations of code patterns to conceptual operations. The analysis then searches for these code patterns---the idea is that if all the code patterns that appear in an idiom also appear in a function (or its callees, if the code pattern search algorithm is interprocedural), then the function is said to perform the conceptual operation.
Idioms are expressed in a Datalog-like language. Each idiom consists of a conceptual operation on the left-hand-side, and a conjunction of code patterns (or their negations) on the right-hand-side. The grammar of the idiom language is as shown below. Each conceptual operation can appear on the left-hand-side of more than one idiom. The kernel analysis matches on all the idioms for each function, and returns all the conceptual operations that match. Thus, in effect, the rules are all combined using disjunction.
Idiom ::== OP :- AND_i (Codepat_i | NOT(Codepat_i)) Codepat ::== SET AST | SET AST TO value | READ AST | CALL AST | INCREMENT AST | DECREMENT AST AST ::== (type->)*fieldname |
Here are some noteworthy features of the idiom language:
Writing idioms is an iterative procedure: we first write a first-cut set of idioms, run TAHOE's kernel analysis, and inspect the output of the analysis manually to determine whether the conceptual operations as determined by TAHOE are indeed performed by the kernel function. If not, we find the idiom that triggered the conceptual operation, and refine it by adding more conjuncts (i.e, manipulations of code patterns) to the idiom.
We are working towards a tool which will serve as an assistant for idiom writing (much like Houdini, which is an annotation assistant for ESC/Java). The goal of this tool is to ease the burden of idiom writing, and empower an ordinary end-user (i.e., not an expert) to write idioms.
While these idioms are far from perfect, they give a taste of what is involved in idiom writing, and show how idioms capture semantic information in kernel code.
FILE__READ :- READ inode->i_size |
Reading a file also typically involves reading the size of the inode corresponding to the file. |
FILE__READ :- SET inode->i_atime |
Reading a file is also accompanied by updating the access time of the inode corresponding to the file. |
FILE__READ :- CALL inode_operations->read() |
The read is performed by a call via a function pointer (read())
of a file-system-specific function to do the read (such as ext2_read
or ext3_read). Note that the three idioms above for FILE__READ are all
disjoint, i.e., the analysis algorithm treats them disjunctively. The first two
idioms, when applied alone, may result in false positives, because functions
other than those that perform the conceptual operation FILE__READ may also read
the size of, or set the access time of the inode corresponding to the
file.
Thus, to refine the idioms, we can combine the code patterns from the first two idioms with the code pattern in this idiom, and create a more precise description of the FILE__READ operation. |
FILE__WRITE :- SET inode->i_size AND CALL file_operations->write() |
Writing to a file typically involves a low-level call to perform the actual write, and an adjustment to the file size by modifying the field i_size of the inode associated with the file. |
FILE__WRITE :- SET inode->i_size AND SET inode->i_mtime AND NOT READ inode->i_atime |
File write typically involves modifying the size of the inode, changing the modification time stamp on the inode. The access time i_atime is often not read during file write. This idiom is less precise than the previous one for the file write operation, but can also capture cases where the file write is performed without a call via the function pointer (file_operations->write()). |
FILE__WRITE :- SET inode->i_size AND CALL address_space->address_space_operations->prepare_write() AND CALL address_space->address_space_operations->commit_write() |
File write can also be achieved by calls via the function pointers prepare_write() and commit_write(), which are fields of address_space_operations, which in turn is a field of address_space. As before, the i_size of the inode is also modified. |
FILE__LOCK :- SET file_lock->fl_type AND SET file_lock->fl_flags AND SET file_lock->fl_user |
Obtaining a lock on a file typically involves creating a lock and setting several fields of the lock (such as fl_type, fl_flags, and fl_user) |
FILE__CREATE :- SET inode_operations->create() |
Creating a file typically involves calling a file-system-specific function, such as ext2_create or ext3_create |
FILE__CREATE :- SET inode->i_sb AND SET inode->i_dev AND SET inode->i_blkbits |
File creation is typically associated with initializing several fields associated with the inode of the file. Note that the function pointer-based idiom (presented previously) is often more reliable than this idiom, because functions other the ones that perform file creation may also manipulate these fields. However, idioms based upon fields of the inode data structure often help in cases where the file is created via other techniques than the function-pointer-based call to a file-system-specific function. |
FILE__LINK :- INCREMENT inode->i_nlink |
Linking a file involves incrementing its link count. |
FILE__UNLINK :- DECREMENT inode->i_nlink AND NOT SET inode->i_size TO 0 |
Unlinking a file involves incrementing its link count. The second conjunct helps reduce false positives because other operations, such as DIR__RMDIR also do a FILE__UNLINK. In order to prevent functions which do a DIR__RMDIR to also be classified as doing FILE__UNLINK (which is subsumed by DIR__RMDIR), we use the second conjunct. Note that if a function that performs DIR__RMDIR is also classified as performing FILE__UNLINK, it is not a false positive, because removing a directory also involves unlinking. |
FILE__SETATTR :- CALL inode_operations->setxattr() |
Setting the extended attributes of a file involves calling a file-system-specific function to do so. |
FILE__SETATTR :- SET jfs_inode_info->ea |
Setting the extended attributes in the JFS file system involves setting the field ea of jfs_inode_info. This is an example of a file-system-specific idiom. While TAHOE currently attempts to place hooks at the virtual file-system layer in order to prevent the burden of placing hooks in each new file system, this example shows that, if necessary, one can also use TAHOE to place hooks within a physical file system. It just involves writing idioms specific to the file system. |
Here are some idioms related to the networking subsystem, and their interpretation. As opposed to the file system idioms, where we often delved into the details of physical file systems, we did not consult any specific network protocols (e.g., IPv4, IPv6, 802, etc.) for writing these idioms---they are currently expressed in terms of function pointers, which call the appropriate protocol-specific function.
These idioms are self-explanatory.
SOCKET__ACCEPT :- CALL socket->proto_ops->accept() |
SOCKET__BIND :- CALL socket->proto_ops->bind() |
SOCKET__CONNECT :- CALL socket->proto_ops->connect() |
SOCKET__CREATE :- CALL net_proto_family->create() |
SOCKET__GETATTR :- CALL socket->proto_ops->getname() |
SOCKET__LISTEN :- CALL socket->proto_ops->listen() |
SOCKET__RECV_MESG :- CALL socket->proto_ops->recvmsg() |
As a thumb rule, while idioms based upon calls via a function pointer (for instance, the idioms above for socket operations) are easy to write. Each data structure in the kernel, such as inode and socket have a set of operations defined on them, and in most cases, a mapping exists between the conceptual operations and the operations defined on these data structures. However, it is advisable to avoid writing idioms based upon function pointers.
We find that while idioms based upon calls via function pointers will result in fewer false positives, ultimately idioms based upon low-level operations (such as file-system-specific modifications to structure-member accesses and network-protocol-specific modifications to structure-member accesses) have higher fidelity in expressing the set of canonical events that must happen for a conceptual operation to be performed.
While idioms based upon such low-level modifications to data structures are time-consuming to write, and will result in a large number of false positives initially, by continuously refining the idioms, their quality can be improved. We believe that idioms written using low-level modifications to data structures have higher fidelity, and hence will reduce (or even eliminate) false negatives from the results of the analysis. Because false negatives in the analysis directly result in missing conceptual operations, and thus missing hooks, they are dangerous. Thus, in the long run, it is advisable to write idioms with low-level data structure modifications