Idioms in TAHOE

Reference

Automatic Placement of Authorization Hooks in the Linux Security Modules Framework
Vinod Ganapathy, Trent Jaeger, and Somesh Jha.
12th ACM Conference on Computer and Communications Security (CCS 2005), Alexandria, Virginia, November 2005.

Important note

This webpage is supposed to supplement our CCS'05 paper, not replace it. The contents of this webpage assume familiarity with TAHOE's analysis algorithms, which are presented in detail in our paper.

What are idioms?

TAHOE works by identifying the set of conceptual operations (Section 2 of the CCS'05 paper has the details on conceptual operations) that each kernel function performs. To do so, it analyzes the kernel to search for occurences of certain code patterns. The idea is that each conceptual operation is typically performed by the kernel in a small number of ways, each of which involves certain canonical actions. For example, writing to a file (conceptual operation FILE__WRITE) typically involves a low-level operation to add or remove blocks of data, followed by an update of the change time of the inode representing the file.

Idioms formalize this intuition. An idiom is a rule that relates combinations of code patterns to conceptual operations. The analysis then searches for these code patterns---the idea is that if all the code patterns that appear in an idiom also appear in a function (or its callees, if the code pattern search algorithm is interprocedural), then the function is said to perform the conceptual operation.

Idioms are expressed in a Datalog-like language. Each idiom consists of a conceptual operation on the left-hand-side, and a conjunction of code patterns (or their negations) on the right-hand-side. The grammar of the idiom language is as shown below. Each conceptual operation can appear on the left-hand-side of more than one idiom. The kernel analysis matches on all the idioms for each function, and returns all the conceptual operations that match. Thus, in effect, the rules are all combined using disjunction.

Idiom   ::== OP :- AND_i (Codepat_i | NOT(Codepat_i))
Codepat ::== SET AST | 
             SET AST TO value |
             READ AST |
             CALL AST |
             INCREMENT AST |
             DECREMENT AST 
AST     ::== (type->)*fieldname
Figure: Idiom grammar.

Here are some noteworthy features of the idiom language:

  1. Idioms are written in terms of Abstract Syntax Trees of data structures, not the data structure instantiations themselves. Thus, for instance, the read of a field i_ctime of a variable in of type struct inode will be written as READ inode->i_ctime. Function calls are also expressed in terms of ASTs so as to express calls through a function pointer.
  2. There are six basic statements that we currently use in our idioms: SET, denoting that a variable is being assigned a new value; SET TO, denoting that a variable is being assigned a specific value; CALL, denoting a function call; READ, denoting that the variable is being accessed; INCREMENT and DECREMENT, denoting the corresponding operations.
  3. The idiom language allows code patterns to be negated in conjunctions of code patterns. In such a case, the search algorithm will ensure that the code pattern does NOT appear in the source code of the function. We had to add negation to the language because there seems to be a "hierarchical" structure in the conceptual operations. For instance, the conceptual operation FILE__UNLINK always happens when DIR__RMDIR happens, but not vice-versa. Thus, to differentiate between these two conceptual operations, we have to find a code pattern that is canonical to DIR__RMDIR, but not so for FILE__UNLINK. If this code pattern appears in the body of a function (in addition to other code patterns), we say that DIR__RMDIR is being performed by the kernel function; otherwise, we say that FILE__UNLINK is being performed. The running example used in the kernel analysis section of the CCS'05 paper contains an example of negation used in the idiom for FILE__UNLINK.
  4. The AST does not have to represent the complete name of the variable. Just a part also suffices for matching. That is, an AST could be inode->inode_operations->read, or simply inode_operations->read. The matching algorithm matches suffixes of actual code patterns with the patterns that appear in the idioms.

Writing idioms

We view idioms as being similar to annotations in annotation-driven program analysis tools (such as CQUAL and ESC/Java) and similar to analysis rules used by the metacompilation project. In the current incarnation of TAHOE, idioms are written manually by an expert who is familiar with the way the kernel goes about performing a conceptual operation. Having said this, idioms are often easy to write, and a cursory code inspection often reveals the ways in which the kernel performs each conceptual operation. For instance, the 100 idioms we used in TAHOE, though still quite preliminary, were written in about a week.

Writing idioms is an iterative procedure: we first write a first-cut set of idioms, run TAHOE's kernel analysis, and inspect the output of the analysis manually to determine whether the conceptual operations as determined by TAHOE are indeed performed by the kernel function. If not, we find the idiom that triggered the conceptual operation, and refine it by adding more conjuncts (i.e, manipulations of code patterns) to the idiom.

We are working towards a tool which will serve as an assistant for idiom writing (much like Houdini, which is an annotation assistant for ESC/Java). The goal of this tool is to ease the burden of idiom writing, and empower an ordinary end-user (i.e., not an expert) to write idioms.


Some examples of idioms used by TAHOE, and their interpretation

Here are some idioms related to conceptual operations on files and directories, and their interpretations. We consulted several physical file systems (such as ext2, ext3, ReiserFS, JFS) to identify how they perform conceptual operations, and wrote the idioms based upon the information we could gather by studying the code in these file systems. Some of these idioms are very effective at identifying the kernel functions that perform the conceptual operation, while the others are less so. Identifying "good" and "bad" idioms is manual as presented in the CCS'05 paper, but we are currently working on techniques to automate idiom writing and refinement.

While these idioms are far from perfect, they give a taste of what is involved in idiom writing, and show how idioms capture semantic information in kernel code.

Idiom for file related conceptual operations
Explanation
 FILE__READ :- READ inode->i_size 
Reading a file also typically involves reading the size of the inode corresponding to the file.
 FILE__READ :- SET inode->i_atime 
Reading a file is also accompanied by updating the access time of the inode corresponding to the file.
 FILE__READ :- CALL inode_operations->read() 
The read is performed by a call via a function pointer (read()) of a file-system-specific function to do the read (such as ext2_read or ext3_read). Note that the three idioms above for FILE__READ are all disjoint, i.e., the analysis algorithm treats them disjunctively. The first two idioms, when applied alone, may result in false positives, because functions other than those that perform the conceptual operation FILE__READ may also read the size of, or set the access time of the inode corresponding to the file.

Thus, to refine the idioms, we can combine the code patterns from the first two idioms with the code pattern in this idiom, and create a more precise description of the FILE__READ operation.

FILE__WRITE :- SET inode->i_size AND
               CALL file_operations->write()
Writing to a file typically involves a low-level call to perform the actual write, and an adjustment to the file size by modifying the field i_size of the inode associated with the file.
FILE__WRITE :- SET inode->i_size AND
               SET inode->i_mtime AND
               NOT READ inode->i_atime
File write typically involves modifying the size of the inode, changing the modification time stamp on the inode. The access time i_atime is often not read during file write. This idiom is less precise than the previous one for the file write operation, but can also capture cases where the file write is performed without a call via the function pointer (file_operations->write()).
FILE__WRITE :- SET inode->i_size AND
               CALL address_space->address_space_operations->prepare_write() AND
               CALL address_space->address_space_operations->commit_write()
File write can also be achieved by calls via the function pointers prepare_write() and commit_write(), which are fields of address_space_operations, which in turn is a field of address_space. As before, the i_size of the inode is also modified.
FILE__LOCK :- SET file_lock->fl_type AND 
              SET file_lock->fl_flags AND
              SET file_lock->fl_user
Obtaining a lock on a file typically involves creating a lock and setting several fields of the lock (such as fl_type, fl_flags, and fl_user)
 
FILE__CREATE :- SET inode_operations->create() 
Creating a file typically involves calling a file-system-specific function, such as ext2_create or ext3_create
FILE__CREATE :- SET inode->i_sb AND
                SET inode->i_dev AND
                SET inode->i_blkbits
File creation is typically associated with initializing several fields associated with the inode of the file. Note that the function pointer-based idiom (presented previously) is often more reliable than this idiom, because functions other the ones that perform file creation may also manipulate these fields. However, idioms based upon fields of the inode data structure often help in cases where the file is created via other techniques than the function-pointer-based call to a file-system-specific function.
FILE__LINK :- INCREMENT inode->i_nlink
Linking a file involves incrementing its link count.
FILE__UNLINK :- DECREMENT inode->i_nlink AND 
                NOT SET inode->i_size TO 0
Unlinking a file involves incrementing its link count. The second conjunct helps reduce false positives because other operations, such as DIR__RMDIR also do a FILE__UNLINK. In order to prevent functions which do a DIR__RMDIR to also be classified as doing FILE__UNLINK (which is subsumed by DIR__RMDIR), we use the second conjunct. Note that if a function that performs DIR__RMDIR is also classified as performing FILE__UNLINK, it is not a false positive, because removing a directory also involves unlinking.
FILE__SETATTR :- CALL inode_operations->setxattr()
Setting the extended attributes of a file involves calling a file-system-specific function to do so.
FILE__SETATTR :- SET jfs_inode_info->ea
Setting the extended attributes in the JFS file system involves setting the field ea of jfs_inode_info. This is an example of a file-system-specific idiom. While TAHOE currently attempts to place hooks at the virtual file-system layer in order to prevent the burden of placing hooks in each new file system, this example shows that, if necessary, one can also use TAHOE to place hooks within a physical file system. It just involves writing idioms specific to the file system.

Here are some idioms related to the networking subsystem, and their interpretation. As opposed to the file system idioms, where we often delved into the details of physical file systems, we did not consult any specific network protocols (e.g., IPv4, IPv6, 802, etc.) for writing these idioms---they are currently expressed in terms of function pointers, which call the appropriate protocol-specific function.

These idioms are self-explanatory.

Idioms for socket related conceptual operations
SOCKET__ACCEPT :- CALL socket->proto_ops->accept()
SOCKET__BIND :- CALL socket->proto_ops->bind()
SOCKET__CONNECT :- CALL socket->proto_ops->connect()
SOCKET__CREATE :- CALL net_proto_family->create()
SOCKET__GETATTR :- CALL socket->proto_ops->getname()
SOCKET__LISTEN :- CALL socket->proto_ops->listen()
SOCKET__RECV_MESG :- CALL socket->proto_ops->recvmsg()

As a thumb rule, while idioms based upon calls via a function pointer (for instance, the idioms above for socket operations) are easy to write. Each data structure in the kernel, such as inode and socket have a set of operations defined on them, and in most cases, a mapping exists between the conceptual operations and the operations defined on these data structures. However, it is advisable to avoid writing idioms based upon function pointers.

We find that while idioms based upon calls via function pointers will result in fewer false positives, ultimately idioms based upon low-level operations (such as file-system-specific modifications to structure-member accesses and network-protocol-specific modifications to structure-member accesses) have higher fidelity in expressing the set of canonical events that must happen for a conceptual operation to be performed.

While idioms based upon such low-level modifications to data structures are time-consuming to write, and will result in a large number of false positives initially, by continuously refining the idioms, their quality can be improved. We believe that idioms written using low-level modifications to data structures have higher fidelity, and hence will reduce (or even eliminate) false negatives from the results of the analysis. Because false negatives in the analysis directly result in missing conceptual operations, and thus missing hooks, they are dangerous. Thus, in the long run, it is advisable to write idioms with low-level data structure modifications


Idiom writing versus hook placement in LSM

Because idiom writing is a manual activity, it may appear in first glance as though TAHOE simply shifts the burden of writing manually placing hooks to that of writing idioms. However, we have found that writing idioms has several advantages:
  1. Hook placement is often an informal procedure; for each new hook to be added, the placement is often decided by consensus. While this procedure often eliminates obvious errors that may arise in hook placement, it is not perfect. An analysis that decouples placement of policy-specific hooks from the kernel code, and automates the procedure is much less prone to errors. Writing idioms achieves exactly that by forcing the kernel experts to write down formally (in the idiom language) their intuition as to how the kernel performs each conceptual operation.
  2. Idiom writing is a one-time activity. It only involves knowledge of the kernel code, and is independent of policy-specific hook code. Once the idioms are written for the kernel, they can be used to place hooks for several different kernel modules implementing hooks. In addition, because the kernel often performs conceptual operations in a few standard ways, which often do not change even as the kernel evolves, the set of idioms will be fairly stable. Thus, idiom writing is a one time activity, and the results can be used over and over again.

Further Questions/Comments?

Do you have further questions? Please contact me, and I will try to address them. Any feedback or suggestions for improvement are also most welcome.
Last updated on Thursday, September 1, 2005 by Vinod Ganapathy