Testing and analysis of web applications using page models
Introduction
Motivation
Code-based analysis of web applications is challenging because:
- Client pages are generated dynamically on the server side. This means that their structure, as well as their content (e.g., the options in drop-down lists, the links present, etc.), is determined dynamically.
- Client-side pages are usually specified in a different language (e.g., an HTML-based scripting language) than the server-side code (e.g., Java).
In fact, as a result of these challenges, there are very few real tools or
even research approaches for end-to-end white-box analysis of web
applications. What we mean by an "end-to-end" analysis is analysis that:
- spans application paths that go via multiple pages (in sequence), and
- accounts for control- and data-flows due to page contents as well as
control- and data-flows due to server-side code.
It is easy to see that pages affect control-flow; for example, a user could
click different links on a page. More subtly, pages affect data-flow as
well. For instance, the server could populate the options in a drop-down
list using values from a server-side set S; if the user then selects
one of these options and transfers control back
to the server, a precise analysis should conclude that the option sent by
the user is an element of the set S and not an arbitrary value.
Our approach and tool
With the aim of enabling end-to-end white-box analysis of web applications,
we propose an approach that automatically translates each
page-specification in the web application (which is usually in an
HTML-based scripting language) into a "page model". A page model is a
method in the same language as the server-side code. The page model
conservatively over-approximates the control-flows and data-flows that are
possible due to the page under all possible server-side states. Links in a
page become calls to server-side request-processing routines, and the page
model includes code that randomly simulates the user's choices, such as
which option to choose from a drop-down list or which link to click.
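As an illustration, the following is a minimal sketch of what a generated page model might look like for a page containing a drop-down list (populated from a server-side set S) and two links. All class, method, and field names here are hypothetical; this is not actual output of our tool.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;
    import java.util.Set;

    // Illustrative sketch of a generated page model (all names here are
    // hypothetical; this is not actual output of our tool).
    public class ChooseItemPageModel {
        private static final Random rand = new Random();

        // A page model is an ordinary method in the server-side language (Java).
        // 'availableItems' plays the role of the server-side set S that was used
        // to populate the page's drop-down list.
        public static void render(Set<String> availableItems) {
            // Simulate the user's drop-down choice: the chosen value is always
            // some element of S, never an arbitrary string.
            List<String> options = new ArrayList<String>(availableItems);
            String chosen = options.get(rand.nextInt(options.size()));

            // Simulate the user's link click: each link on the page becomes a
            // direct call to a server-side request-processing routine.
            if (rand.nextBoolean()) {
                addToCart(chosen);   // stand-in for one servlet entry point
            } else {
                viewCart();          // stand-in for another servlet entry point
            }
        }

        // Stubs standing in for server-side request-processing routines.
        private static void addToCart(String item) { /* server-side code */ }
        private static void viewCart() { /* server-side code */ }
    }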
Certain modifications are also performed on the server-side code by our
approach. These are described in Sec. 4.1 of our paper.
We have implemented our approach as a tool. Our tool has been implemented
in the context of J2EE applications. JSP (Java Server Pages) is a prevalent
page-specification language, and is a part of the J2EE standard. Our tool
takes as input a web application, and translates the JSP page
specifications to page models in pure Java. The tool also performs the
modifications to the server-side code mentioned above. The generated page
models, in conjunction with the modified server-side code, result in a
standard non-web program in the language of the server-side code. Therefore,
this program is amenable to white-box analysis using standard,
off-the-shelf analysis tools that do not specifically address the
complexities of web applications.
Our tool handles a rich set of J2EE features such as scope and session
management, HTML forms and links, JSP tags, EL expressions and scriptlets.
However, certain aspects, e.g., adding the "import" statements necessary to
resolve package references, are not automated, either in the page models or
in the server-side code. These changes are relatively simple for a developer
to perform, but would require more effort to automate. For our experiments,
we made these changes manually in the programs emitted by our tool.
Experiments
We demonstrate the versatility and usefulness of our approach by applying
two off-the-shelf analysis tools on the (non-web) programs generated by our
tool.
Functional property checking using JPF
The first analysis tool we tried
is JPF (Java Path
Finder), which is a combined concrete- and symbolic-execution (i.e.,
concolic execution) based model-checker for checking functional properties
of programs. The page models generated by our approach enable JPF to
exhaustively explore different user choices, either concretely (links,
forms, drop-down lists, etc.) or symbolically (text inputs), and to
traverse application paths that involve visits to multiple pages in a
sequence. The functional properties we checked were challenging ones
that require traversing many pages in sequence; e.g., multiple
registrations using the same email ID should not be allowed, and
if the shopping cart is filled and then emptied,
the total value shown in the cart should become zero.
Fault localization using Zoltar
The second off-the-shelf tool we
consider is Zoltar, a dynamic-analysis
based fault localization tool. We used this tool to find (seeded) bugs in
web applications. We were able to run Zoltar in a completely off-the-shelf
manner, which would not have been possible on the web applications as-is,
due to their tiered architecture and due to the instrumentation that would
have been required in the server-side framework and libraries.
Static slicing
In our submitted paper we also report results from applying static slicing
on the programs generated by our approach using the slicing
tool Wala. These experiments were
the simplest for us to perform, and are easily reproducible. However, due to
the complexity of packaging and building all the packages that constitute
the Wala tool inside the VM, we omit this experiment from this artifact
submission.
Assumptions and limitations
- Our approach is primarily aimed at identifying functional errors in web
applications that are caused by coding errors. It is not aimed at identifying
security vulnerabilities. In order to identify functional errors with a low
rate of false positives, our approach is based on the assumption that users
are non-adversarial. That is, they start from the "main" page of the
application, and choose options and click links that appear on any visited
page, but do not copy and paste URLs of inner pages into the browser's
location bar, and do not modify outgoing requests to the server.
- A current limitation of our approach is that it
does not address Javascript or Ajax. Extending our approach to
handle these features is discussed briefly in the paper.
- Our tool is aimed
at the J2EE technology stack. We believe that in principle our core
approach can be extended to address other technology stacks that use
HTML-based scripting languages for page specifications, such as PHP, Ruby
on Rails, and Struts.
Summary of our contributions
- Compared to previous automated white-box analysis approaches, our
approach is the first one to generate a standard program on which any
off-the-shelf analysis tool -- whether based on static analysis or dynamic
analysis -- can be applied. Previous white-box approaches that analyzed both
client-page scripts and server-side code were restricted to solving
specific analysis problems, such as slicing [references 8, 10, 22 in our
paper], well-formedness checking of generated pages [7], taint analysis
[19, 24], property checking [2], etc.
- Several previous approaches did not address dataflows via request
parameters and via session attributes simultaneously [ref. 8, 10, 22 in the
paper]. We
address this scenario. The challenges in doing so are discussed later in
the paper.
- We have implemented our approach as a tool, and we demonstrate
the versatility and precision of our approach by applying
off-the-shelf analysis tools to the (non-web) programs generated by our
tool.
Note about reviewer information:
Our artifact does not collect or send out
any information about reviewers.
The rest of this README file is organized as follows. Section 2 describes
the web applications that we have selected as benchmarks.
Section 3 describes
how to run our tool on web applications to produce analyzable pure-Java
programs that incorporate the generated page models. Section 4 describes
how to run the JPF model checker to check functional properties on the
benchmark programs. Finally, Section 5 describes how to run Zoltar on the
benchmark programs to perform fault localization.
Selected benchmarks
The table above provides information about the five real web applications
that we have selected for our experiments. Our primary criteria for
selecting these benchmarks were:
- they use JSP
- they do not use too many complex frameworks or libraries in the
server-side code (which could become bottlenecks for off-the-shelf analysis
tools)
- they possess rich features, which allow interesting functional properties
to be checked
- they either do not use Javascript, or use it minimally.
TD, MS, and RO are e-commerce applications. HD is a help-desk application
for tasks such as entering complaint tickets, managing an address book,
etc. IT is a medical records management application.
Of these benchmarks, HD and IT are frequently downloaded, with 995 and 7292
downloads, respectively, since January 2010.
The first four benchmarks above could be translated by our tool with very
minimal code modifications. IT was different, because many of its pages use
Javascript, and because they use JSP expressions, which are now deprecated
in favor of the newer EL expression standard. Our tool handles only EL
expressions. Therefore, we had to manually remove the Javascript from the
pages, and manually translate JSP expressions to EL expressions. IT is also
very large (223 JSP pages and 365 Java classes). Therefore, to restrict the
scope of our manual changes, we chose a smaller module within this
application and analyzed only this module. The statistics given in the table
above pertain only to this module. The usage of Javascript in these pages
was minimal; still, its removal means that the precision and/or correctness
of some of our analyses could have been impacted to some extent for this
benchmark.
Running our translation tool
We have packaged our artifact as a VirtualBox VM image. The image is
available here for download:
ISSTA2017_artifact.ova.
The VM requires at least 12 GB of free disk space, and preferably a machine
with 8 GB of RAM. The user name for logging in after booting the VM is
"issta2017artifact", and the password is "abcd1234".
The programs generated by our tool cannot be analyzed directly by the
analysis tools, because of the manual changes required on top of the
generated code, as mentioned above. Therefore, we have placed the manually
modified source files directly in the projects where JPF and Zoltar are to
be run (see below).
In this section, as samples, we discuss how our translator can be run on
two of the benchmarks, namely, TD and IT.
cd to /home/issta2017artifact/translation, and run run.sh. The generated
page models and modified server-side code are placed in the sub-folders TD
and IT.
One can cd into each of these sub-folders and type ant to build the
generated code. There will be build errors, for reasons discussed above,
such as missing import statements. In general, these need to be fixed
manually; it is not necessary for the artifact evaluator to do this,
because we have already placed the manually modified code in the JPF
and Zoltar projects.
Experiments using JPF
Introduction
In this experiment, we run the concolic execution tool that comes with JPF
(i.e., Symbolic Path Finder) on the non-web programs generated by our
tool, to detect functional errors of the sort that manifest only when users
visit specific sequences of pages with specific kinds of inputs on each
page. The idea is to use JPF to visit page sequences exhaustively by
simulating link clicks, to exhaustively select all possible drop-down list
options in each page, and to enter symbolic values into text boxes to
simulate all possible inputs to text boxes. The objective then is to see if
the property is violated or not, up to a given bound bP
on the number of pages to visit sequentially in any single run of the
application. Note that JPF cannot be applied directly on web applications
in an end-to-end manner, unless a non-web program is produced from a given
web application as in our approach. This is because web applications have
complex features such as client-server communication, control-flow and
data-flow through generated HTML pages, and runtime support provided to the
server-side code by the web server framework.
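As an illustration of how the generated programs expose user choices to JPF, the random choices in a page model can be driven by JPF's non-deterministic choice primitives, so that the model checker backtracks over every option; text-box inputs are additionally marked symbolic via Symbolic PathFinder. The fragment below is our own sketch with hypothetical names, assuming JPF's gov.nasa.jpf.vm.Verify API; it is not code from the artifact.

    import gov.nasa.jpf.vm.Verify;
    import java.util.List;

    // Hypothetical helper showing how concrete user choices can be explored
    // exhaustively under JPF (sketch only; names are not from the artifact).
    public class ExhaustiveChoices {

        // Instead of picking a drop-down option at random, let JPF enumerate
        // every index; JPF backtracks and re-executes once per possible value.
        static String chooseOption(List<String> options) {
            int i = Verify.getInt(0, options.size() - 1);
            return options.get(i);
        }

        // Link clicks can be enumerated the same way:
        // 0 = first link, 1 = second link, and so on.
        static int chooseLink(int numLinks) {
            return Verify.getInt(0, numLinks - 1);
        }
    }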
We check 19 functional properties, across the five selected
benchmarks. The properties for benchmarks TD, RO, MS, and HD were
identified by graduate students who were not aware of our work, by actually
using the web applications and observing their behavior. The properties for
IT were identified by us directly from its user documentation. The breakdown
of the properties among the benchmarks is as follows: 5 properties for TD,
5 for MS, and 3 each for HD, RO, and IT. We have provided intuitive
descriptions of all the properties in the
file /home/issta2017artifact/jpf/property-descriptions.txt.
Due to the exhaustiveness of the search, and due to the symbolic values
given to text boxes, a key soundness guarantee of this experiment
is that if JPF does not find a violation of a property, then no actual
execution of the web application (that visits no more
than bP pages) can violate the property.
Baseline
To serve as a baseline, we also provide a simplified variant of our
approach, which simulates a previous approach. This variant also involves
running JPF on translated applications (i.e., non-web programs), to use
JPF to exhaustively traverse page sequences up to the
bound bP. The difference from our approach is that
the baseline ignores page contents, and focuses on concolic
execution of the server-side code alone. Whenever a page is visited, it is
modeled as simply sending symbolic values for all request parameters
from the page. That is, effectively, all drop-down lists are considered
to contain arbitrary options. This has the effect of increasing the
number of false positives among the reported property violations. This
baseline simulates previous approaches such as [ref. 14, 27] in our paper;
we call it Weave, after the approach of [ref. 14].
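To make the difference concrete, the following is a short sketch (our own, with hypothetical names) of how the baseline might model the drop-down page from the earlier page model sketch: instead of choosing an element of the server-side set, it submits an unconstrained value. The helper method below is a placeholder, not a real API.

    // Weave-style baseline model of the same drop-down page (hypothetical
    // sketch; not code from the artifact). Page contents are ignored, so the
    // submitted request parameter is an arbitrary value rather than an
    // element of the server-side set S.
    public class ChooseItemBaselineModel {
        public static void render() {
            // Placeholder for however the analysis introduces a symbolic string.
            String chosen = makeUnconstrainedString("itemParam");
            addToCart(chosen); // the server-side entry point receives an arbitrary value
        }

        static String makeUnconstrainedString(String name) { return ""; }
        static void addToCart(String item) { /* server-side code */ }
    }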
Steps to be followed by the reviewer to check all properties using
JPF
- Start a terminal window inside the VM, and cd to the
directory /home/issta2017artifact/jpf/.
- Run the script run.sh in this directory. This will
analyze all 19 properties (i.e., 5 properties for TD, 5 for MS, and 3
each for HD, RO, and IT), one by one, using our approach and using the
Weave baseline mentioned above. The on-screen messages that appear
while the script is running indicate how many properties have been
checked so far.
- Once the script run.sh completes, open the output
file summary-table.html in the same directory to view the
summarized results.
The generated summary html file contains two tables. The first table
(titled PageModel) contains results from our approach for all 19
properties. The second table contains results from the Weave
baseline for all 19 properties.
Both tables have the same format. A fragment of the first table (i.e.,
the one from our approach) is shown below.
Property | Violations Found? | Num. page sequences | Num. unique page sequences | Time (in s) | Page Bound
propHD1  | No                | 8                   | 8                          | 1.642       | 5
propHD2  | Yes               | 17                  | 5                          | 1.401       | 5
Each row in the table pertains to one of the 19 properties. The columns in
the table have the following meanings, respectively:
- The short-name of the property. (The letters "HD" in
the name indicate that the property is for the benchmark HD.)
- Whether JPF reports a property violation or not in any of
the page sequences that it traverses while checking this property.
- The total number of page sequences traversed (i.e., total number of
runs of the benchmark explored). Note that all these runs of the
benchmark occur as part of a single run of JPF, as enabled by
its backtracking approach.
- The total number of unique sequences of pages traversed,
across all runs.
- Time taken (in seconds) for all the runs.
- The value of the page-length bound bP that we used
while checking this property. Note that we chose this bound manually by
identifying the minimum length page sequence that would need to be
traversed in order to be able to check the property.
Note that multiple runs of the benchmark may correspond to the same
unique page sequence. This is because the drop-down list option selected in
two corresponding pages in the two runs could differ, or the symbolic
string entered in a text box in two corresponding pages in the two runs
could cause control to go through different paths in the server-side code.
The results
We summarize the results of functional property checking as follows. Eight
of the 19 properties were found to be violated by our approach. These are
propHD2, propIT2, propMS3, propMS4, propTD1, propTD2, propTD3, propTD4
(please see the
file /home/issta2017artifact/jpf/summary-table.html). Six of
these violations were reproducible by us (all except propIT2 and
propMS3). This is evidence for the low false positive rate of our
approach.
The Weave baseline reports five violations in addition to the 8 reported
by our approach. These are for properties propTD5,
propMS1, propMS5, propRO1, and propRO2. These five reports
are necessarily false positives, as explained in our paper.
Detailed output files and other artifacts
- For each of the 19 properties prop, there is a
folder /home/issta2017artifact/jpf/pageModel/properties/prop. For
instance, there are
folders /home/issta2017artifact/jpf/pageModel/properties/propMS1,
/home/issta2017artifact/jpf/pageModel/properties/propTD1, etc.
The folder /home/issta2017artifact/jpf/pageModel/properties/prop/output/
contains the following output files:
  - pageSequences.txt. Each line in this file pertains to one run of the
  benchmark. It lists the names of the pages visited in the run, along with
  the drop-down list option selected in each page (if the page has any
  drop-down lists). The number of lines in this file is the value in the
  "Num. page sequences" column in summary-table.html.
  - violatingSequences.txt. If property prop was violated, this file
  contains a non-empty list of numbers. For each number n in this file,
  line n of pageSequences.txt represents a run in which the property fails.
  - uniquePageSequences.txt. This contains the unique page sequences
  visited by the runs of the benchmark.
  - jpfSummary.txt. This contains summarized statistics emitted by JPF
  about all the runs explored while checking this property.
  As an example, consider the
  folder /home/issta2017artifact/jpf/pageModel/properties/propMS1/output/.
  Fifty-two runs of the benchmark were used to check this property, as can
  be seen from the number of lines in pageSequences.txt. These runs visited
  four unique page sequences (uniquePageSequences.txt). None of these runs
  violated the property (violatingSequences.txt).
  A similar set of output files is present for each of the 19 properties
  under the directory /home/issta2017artifact/jpf/weave. These represent
  the baseline experiments discussed earlier.
- Each of the property folders mentioned above (for our approach, as well as
for the Weave baseline) also contains a full copy of
the translated (non-web) program that is actually analyzed using JPF.
This copy is not identical to
the translated program produced by our tool. An important difference is that,
for each property, we have manually added code at an appropriate point in
the program to check at run time whether the property is satisfied or
violated. If any run of the benchmark fails the check at this
point, the property is reported as violated. As an example, consider the
file
/home/issta2017artifact/jpf/pageModel/properties/propMS1/src/music/cart/DisplayCartServlet.java.
The property check has been added right after the comment /*new property
1*/.
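For illustration only, a manually added run-time property check of this kind might look like the sketch below; the names and the exact condition are hypothetical, and need not match the actual check in DisplayCartServlet.java.

    // Illustrative sketch of a manually inserted property check (hypothetical
    // names; the actual check in the artifact may differ).
    final class PropertyCheckExample {
        // Property: once the shopping cart has been emptied, the total value
        // displayed in the cart must be zero.
        static void checkEmptyCartTotal(boolean cartIsEmpty, double displayedTotal) {
            if (cartIsEmpty && displayedTotal != 0.0) {
                // A failure here is what JPF reports as a property violation.
                throw new AssertionError("emptied cart shows non-zero total: " + displayedTotal);
            }
        }
    }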
We have also made other manual code changes on top of the translated
program produced by our tool. These (semantics-preserving) changes serve to
increase the effectiveness and scalability of JPF model checking. Such
changes are generally required whenever JPF is to be used on any
program -- even programs not produced by our approach. These changes are
described in more detail in Sections 5.2 and 6.1 of our submitted paper.
Experiments using Zoltar
Zoltar is a dynamic-analysis based fault localization tool. Our objective
in this experiment was to use Zoltar to identify bug locations in web
applications. Note that Zoltar is not applicable as-is on web
applications, for the reasons discussed earlier in Section 1. Therefore, we
applied Zoltar on the translated (non-web) programs produced by our tool.
We first set up the experiment as follows:
- Manually created a set of test cases for each benchmark. A test case
is simply a sequence of pages to visit, along with a tuple of input
choices for each visited page, and a final oracle check of a
desired property. All our test cases were passing
test cases. We devised our own text format to represent test cases.
- Manually modified the page models generated by our tool to follow
the provided test inputs instead of choosing links and drop-down list
options randomly.
- Seeded bugs in each benchmark program.
To do this in an unbiased manner, we used
the PIT mutation testing tool. This
tool uses a set of patterns to suggest mutations, and reports those
mutations that cause at least one test case to fail.
- From the mutations reported by PIT, we selected 20 mutations for
each benchmark program. Applying these mutations one at a time, we
created 20 buggy versions of each benchmark.
Steps to be followed by the reviewer to execute Zoltar on all buggy
versions of all benchmarks
- Start a terminal window inside the VM, and cd to the
directory /home/issta2017artifact/zoltar/.
- Run the script run.sh in this directory. This script
will apply Zoltar on each of the 20 buggy versions of each benchmark,
using the test cases provided for the respective benchmarks.
- Once the script run.sh completes, open the output
file summary-table.html in the same directory to view the
summarized results.
The table summary-table.html
contains 21 rows for each
benchmark. Each of the first 20 rows corresponds to a buggy version of a
benchmark. The first two columns depict the name of the benchmark, and the
buggy version's number. The third column depicts the name of the method
where the bug is seeded. The last column "rank of buggy method" is the important
one. Zoltar assigns a suspicion score to each method, indicating
how likely it is that the method contains the buggy statement. The last
column indicates the percentage of methods in the benchmark that have
suspicion scores that are greater than or equal to the suspicion score of
the method that contains the seeded bug. A lower percentage indicates that
the method that contains the seeded bug appears high in Zoltar's report,
which is a list of methods ranked in descending order of suspicion score,
so a developer can quickly zero in on the buggy method. The 21st
row gives the geometric mean of the ranks of the buggy methods across
the 20 buggy versions.
The results
It should be clear from the
table /home/issta2017artifact/zoltar/summary-table.html that
Zoltar is extremely effective in localizing faults when applied to our
generated non-web programs. The geometric mean of the rank of the buggy
method over the 20 buggy versions of each benchmark ranges from just 0.24%
(benchmark IT)
to around 5% (RO). This means that, on average, a developer would have to
examine just 0.24% to 5% of the methods in an application before being able
to zero in on the buggy method.
Detailed output files and other artifacts
For each benchmark bm, there is a
folder /home/issta2017artifact/zoltar/bm/ in the VM image. For
instance, there are folders /home/issta2017artifact/zoltar/MS,
/home/issta2017artifact/zoltar/TD, etc. Under each of these folders
there are folders named Mutation1, Mutation2, ..., Mutation20. Each of
these Mutation folders contains a buggy version of the benchmark, within
its src/ sub-folder. Each Mutation folder also contains
an output/ sub-folder, which contains the following detailed
output files:
- summary-table.html. This is like the output file with
the same name in the folder /home/issta2017artifact/zoltar/, but contains a
single row. This row indicates the position of the buggy method
among all methods in this (buggy) version of the benchmark, as per
Zoltar's ranked list of methods. This file is emitted by our script.
- individualScores.txt. This file is emitted by Zoltar. Each line in this
file contains the name of a method, followed by a hash sign, the line
number of a line of code within the method, and then the suspicion score
of that line of code (a value between 0 and 1, with 1 indicating the
highest suspicion). The suspicion score of a method is the maximum
suspicion score of any line of code within the method. (A small parsing
sketch is given after this list.)
- testCaseReports.txt. This file, also emitted by Zoltar, indicates for
each of the provided test cases whether the test case passes or fails on
this buggy version of the benchmark.
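If desired, the per-method suspicion scores can be recomputed from individualScores.txt with a short program such as the sketch below. This is our own convenience code, not part of the artifact, and it assumes that each line has the form method#lineNumber followed by whitespace and the score; the parsing may need adjusting if the actual format differs.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    // Recomputes per-method suspicion scores (the maximum over the method's
    // lines) from individualScores.txt. Assumes lines of the form
    // "method#lineNumber score"; not part of the artifact.
    public class MethodScores {
        public static void main(String[] args) throws IOException {
            Map<String, Double> methodScore = new HashMap<>();
            for (String line : Files.readAllLines(Paths.get(args[0]))) {
                line = line.trim();
                if (line.isEmpty() || !line.contains("#")) continue;
                String method = line.substring(0, line.indexOf('#'));
                String[] tokens = line.split("\\s+");
                double score = Double.parseDouble(tokens[tokens.length - 1]);
                methodScore.merge(method, score, Math::max); // keep the maximum per method
            }
            // Print methods in descending order of suspicion score.
            methodScore.entrySet().stream()
                .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
                .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
        }
    }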
For example, consider the
file /home/issta2017artifact/zoltar/MS/Mutation1/output/individualScores.txt.
There are 2705 lines in this detailed report. Three of these lines have the
highest suspicion score of 1.0. These 3 lines all happen to belong to
the method where the bug was seeded, which is the method
client.Index.jspService(). In fact, two of these lines are
exactly the location where the bug occurs. Therefore, the
file /home/issta2017artifact/zoltar/summary-table.html reports the rank
of the buggy method client.Index.jspService() in Mutation1 of
benchmark MS as 0.25% (1st method out of a total of 392 methods in
the application).
Reviewers can try their own mutations
We encourage the reviewer to try seeding their own errors and to see how
effective Zoltar is in identifying the bug locations. An arbitrary seeded
bug may not be a good candidate, because Zoltar can only find bugs that
cause at least one of the provided test cases to fail. To help with this,
using the tool PIT, we have identified about 22 candidate mutations for the
benchmark MS, each of which causes at least one test to fail. These
candidate mutations are listed in the
file /home/issta2017artifact/zoltar/MS/ZoltarList.txt. For
instance, the first mutation suggested in this file is as follows:
Mutation 1: com.client.run/MultiUserDriver.java 148
1.1 Location : exitTo Killed by : com.music.tests.EmailTest.test2(com.music.tests.EmailTest) removed call to client/Index::jspService is KILLED
This suggestion can be tried as follows:
- cd to /home/issta2017artifact/zoltar/MS/NoMutation. This is a
non-buggy version of the MS benchmark.
- Open the file src/com/client/run/MultiUserDriver.java,
go to line 148, and remove the call to the
method Index.jspService (this is the suggested mutation; see above).
- cd back to the directory /home/issta2017artifact/zoltar/MS/NoMutation, and
run ant and then run.sh in the same directory.
- cd to the output/ sub-folder, and look at the
file individualScores.txt. The lines of code in the
method exitTo in class MultiUserDriver (which
is the method that contains the just-seeded bug) should have high
suspicion scores (i.e., close to 1.0).