Testing and analysis of web applications using page models
Introduction
Motivation
Code-based analysis of web applications is challenging because:
- Client pages are generated dynamically on the server side. This means that their structure, as well as their content (e.g., the options in drop-down lists, the links present, etc.), is determined dynamically.
- Client-side pages are usually specified in a different language (e.g., an HTML-based scripting language) than the server-side code (e.g., Java).
In fact, as a result of these challenges, there are very few real tools or
even research approaches for end-to-end white-box analysis of web
applications. What we mean by an "end-to-end" analysis is analysis that:
- spans application paths that go via multiple pages (in sequence), and
- accounts for control- and data-flows due to page contents as well as
control- and data-flows due to server-side code.
It is easy to see that pages affect control-flow; for example, a user could
click different links on a page. More subtly, pages affect data-flow as
well. For instance, the server could populate the options in a drop-down
list using values from a server-side set S; if the user then selects
one of these options and transfers control back
to the server, a precise analysis should conclude that the option sent by
the user is an element of the set S and not an arbitrary value.
Our approach and tool
With the aim of enabling end-to-end white-box analysis of web applications,
we propose an approach that automatically translates each
page-specification in the web application (which is usually in an
HTML-based scripting language) into a "page model". A page model is a
method in the same language as the server-side code. The page model
conservatively over-approximates the control-flows and data-flows that are
possible due to the page under all possible server-side states. Links in a
page become calls to server-side request-processing routines, and the page
model includes code that randomly simulates the user's choices, such as
which option to choose from a drop-down list or which link to click.
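As an illustration, the following is a minimal sketch of what a generated page model might look like for a page containing a drop-down list (populated from a server-side set S) and two links. All class, method, and field names here are hypothetical; this is not actual output of our tool.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;
    import java.util.Set;

    // Illustrative sketch of a generated page model (all names here are
    // hypothetical; this is not actual output of our tool).
    public class ChooseItemPageModel {
        private static final Random rand = new Random();

        // A page model is an ordinary method in the server-side language (Java).
        // 'availableItems' plays the role of the server-side set S that was used
        // to populate the page's drop-down list.
        public static void render(Set<String> availableItems) {
            // Simulate the user's drop-down choice: the chosen value is always
            // some element of S, never an arbitrary string.
            List<String> options = new ArrayList<String>(availableItems);
            String chosen = options.get(rand.nextInt(options.size()));

            // Simulate the user's link click: each link on the page becomes a
            // direct call to a server-side request-processing routine.
            if (rand.nextBoolean()) {
                addToCart(chosen);   // stand-in for one servlet entry point
            } else {
                viewCart();          // stand-in for another servlet entry point
            }
        }

        // Stubs standing in for server-side request-processing routines.
        private static void addToCart(String item) { /* server-side code */ }
        private static void viewCart() { /* server-side code */ }
    }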
Certain modifications are also performed on the server-side code by our
approach. These are described in Sec. 4.1 of our paper.
We have implemented our approach as a tool. Our tool has been implemented
in the context of J2EE applications. JSP (Java Server Pages) is a prevalent
page-specification language, and is a part of the J2EE standard. Our tool
takes as input a web application, and translates the JSP page
specifications to page models in pure Java. The tool also performs the
modifications to the server-side code mentioned above. The generated page
models, in conjunction with the modified server-side code, result in a
standard non-web program in the language of the server-side code. Therefore,
this program is amenable to white-box analysis using standard,
off-the-shelf analysis tools that do not specifically address the
complexities of web applications.
Our tool handles a rich set of J2EE features such as scope and session
management, HTML forms and links, JSP tags, EL expressions and scriptlets.
However, certain aspects, e.g., adding the "import" statements necessary to
resolve package references, are not automated, either in the page models or
in the server-side code. These changes are relatively simple for a developer
to perform, but would require more effort to automate. For our experiments,
we made these changes manually in the programs emitted by our tool.
Experiments
We demonstrate the versatility and usefulness of our approach by applying
two off-the-shelf analysis tools on the (non-web) programs generated by our
tool.
Functional property checking using JPF
The first analysis tool we tried
is JPF (Java Path
Finder), which is a combined concrete- and symbolic-execution (i.e.,
concolic execution) based model-checker for checking functional properties
of programs. The page models generated by our approach enable JPF to
exhaustively explore different user choices, either concretely (links,
forms, drop-down lists, etc.) or symbolically (text inputs), and to
traverse application paths that involve visits to multiple pages in a
sequence. The functional properties we checked were challenging ones
that require traversing many pages in sequence; e.g., multiple
registrations using the same email ID should not be allowed, and
if the shopping cart is filled and then emptied,
the total value shown in the cart should become zero.
Fault localization using Zoltar
The second off-the-shelf tool we
consider is Zoltar, a dynamic-analysis
based fault localization tool. We used this tool to find (seeded) bugs in
web applications. We were able to run Zoltar in a completely off-the-shelf
manner, which would not have been possible on the web applications as-is,
due to their tiered architecture and due to the instrumentation that would
have been required in the server-side framework and libraries.
Static slicing
In our submitted paper we also report results from applying static slicing
on the programs generated by our approach using the slicing
tool Wala. These experiments were
the simplest for us to perform, and are easily reproducible. However, due to
the complexity of packaging and building all the packages that constitute
the Wala tool inside the VM, we omit this experiment from this artifact
submission.
Assumptions and limitations
- Our approach is primarily aimed at identifying functional errors in web
applications that are caused by coding errors. It is not aimed at identifying
security vulnerabilities. In order to identify functional errors with a low
rate of false positives, our approach is based on the assumption that users
are non-adversarial. That is, they start from the "main" page of the
application, and choose options and click links that appear on any visited
page, but do not copy and paste URLs of inner pages into the browser's
location bar, and do not modify outgoing requests to the server.
- A current limitation of our approach is that it
does not address Javascript or Ajax. Extending our approach to
handle these features is discussed briefly in the paper.
- Our tool is aimed
at the J2EE technology stack. We believe that in principle our core
approach can be extended to address other technology stacks that use
HTML-based scripting languages for page specifications, such as PHP, Ruby
on Rails, and Struts.
Summary of our contributions
- Compared to previous automated white-box analysis approaches, our
approach is the first one to generate a standard program on which any
off-the-shelf analysis tool -- whether based on static analysis or dynamic
analysis -- can be applied. Previous white-box approaches that analyzed both
client-page scripts and server-side code were restricted to solving
specific analysis problems, such as slicing [references 8, 10, 22 in our
paper], well-formedness checking of generated pages [7], taint analysis
[19, 24], property checking [2], etc.
- Several previous approaches did not address dataflows via request
parameters and via session attributes simultaneously [ref. 8, 10, 22 in the
paper]. We
address this scenario. The challenges in doing so are discussed later in
the paper.
- We have implemented our approach as a tool, and we demonstrate
the versatility and precision of our approach by applying
off-the-shelf analysis tools to the (non-web) programs generated by our
tool.
Note about reviewer information:
Our artifact does not collect or send out
any information about reviewers.
The rest of this README file is organized as follows. Section 2 describes
the web applications that we have selected as benchmarks.
Section 3 describes
how to run our tool on web applications to produce analyzable pure-Java
programs that incorporate the generated page models. Section 4 describes
how to run the JPF model checker to check functional properties on the
benchmark programs. Finally, Section 5 describes how to run Zoltar on the
benchmark programs to perform fault localization.
Selected benchmarks
The table above provides information about the five real web applications
that we have selected for our experiments. Our primary criteria for
selecting these benchmarks were:
- they use JSP
- they do not use too many complex frameworks or libraries in the
server-side code (which could become bottlenecks for off-the-shelf analysis
tools)
- they possess rich features, which allow interesting functional properties
to be checked
- they either do not use Javascript, or use it minimally.
TD, MS, and RO are e-commerce applications. HD is a help-desk application
for tasks such as entering complaint tickets, managing an address book,
etc. IT is a medical records management application.
Of these benchmarks, HD and IT are frequently downloaded, with 995 and 7292
downloads, respectively, since January 2010.
The first four benchmarks above could be translated by our tool with very
minimal code modifications. IT was different, because many of its pages use
Javascript, and because they use JSP expressions, which are now deprecated
in favor of the newer EL expression standard. Our tool handles only EL
expressions. Therefore, we had to manually remove the Javascript from the
pages, and manually translate JSP expressions to EL expressions. IT is also
very large (223 JSP pages and 365 Java classes). Therefore, to restrict the
scope of our manual changes, we chose a smaller module within this
application and analyzed only this module. The statistics given in the table
above pertain only to this module. The usage of Javascript in these pages
was minimal; still, its removal means that the precision and/or correctness
of some of our analyses could have been impacted to some extent for this
benchmark.
Running our translation tool
We have packaged our artifact as a VirtualBox VM image. The image is
available here for download:
ISSTA2017_artifact.ova.
The VM requires at least 12 GB of free disk space, and preferably a machine
with 8 GB of RAM. The user name for logging in after booting the VM is
"issta2017artifact", and the password is "abcd1234".
The programs generated by our tool cannot be analyzed directly by the
analysis tools, because of the manual changes required on top of the
generated code, as mentioned above. Therefore, we have placed the manually
modified source files directly in the projects where JPF and Zoltar are to
be run (see below).
In this section, as samples, we discuss how our translator can be run on
two of the benchmarks, namely, TD and IT.
cd to /home/issta2017artifact/translation, and run run.sh. The generated
page models and modified server-side code are placed in the sub-folders TD
and IT.
One can cd into each of these sub-folders and type ant to build the
generated code. There will be build errors, for reasons discussed above,
such as missing import statements. In general, these need to be fixed
manually; it is not necessary for the artifact evaluator to do this,
because we have already placed the manually modified code in the JPF
and Zoltar projects.
Experiments using JPF
Introduction
In this experiment, we run the concolic execution tool that comes with JPF
(i.e., Symbolic Path Finder) on the non-web programs generated by our
tool, to detect functional errors of the sort that manifest only when users
visit specific sequences of pages with specific kinds of inputs on each
page. The idea is to use JPF to visit page sequences exhaustively by
simulating link clicks, to exhaustively select all possible drop-down list
options in each page, and to enter symbolic values into text boxes to
simulate all possible inputs to text boxes. The objective then is to see if
the property is violated or not, up to a given bound bP
on the number of pages to visit sequentially in any single run of the
application. Note that JPF cannot be applied directly on web applications
in an end-to-end manner, unless a non-web program is produced from a given
web application as in our approach. This is because web applications have
complex features such as client-server communication, control-flow and
data-flow through generated HTML pages, and runtime support provided to the
server-side code by the web server framework.
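As an illustration of how the generated programs expose user choices to JPF, the random choices in a page model can be driven by JPF's non-deterministic choice primitives, so that the model checker backtracks over every option; text-box inputs are additionally marked symbolic via Symbolic PathFinder. The fragment below is our own sketch with hypothetical names, assuming JPF's gov.nasa.jpf.vm.Verify API; it is not code from the artifact.

    import gov.nasa.jpf.vm.Verify;
    import java.util.List;

    // Hypothetical helper showing how concrete user choices can be explored
    // exhaustively under JPF (sketch only; names are not from the artifact).
    public class ExhaustiveChoices {

        // Instead of picking a drop-down option at random, let JPF enumerate
        // every index; JPF backtracks and re-executes once per possible value.
        static String chooseOption(List<String> options) {
            int i = Verify.getInt(0, options.size() - 1);
            return options.get(i);
        }

        // Link clicks can be enumerated the same way:
        // 0 = first link, 1 = second link, and so on.
        static int chooseLink(int numLinks) {
            return Verify.getInt(0, numLinks - 1);
        }
    }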
We check 19 functional properties, across the five selected
benchmarks. The properties for benchmarks TD, RO, MS, and HD were
identified by graduate students who were not aware of our work, by actually
using the web applications and observing their behavior. The properties for
IT were identified by us directly from its user documentation. The breakdown
of the properties among the benchmarks is as follows: 5 properties for TD,
5 for MS, and 3 each for HD, RO, and IT. We have provided intuitive
descriptions of all the properties in the
file /home/issta2017artifact/jpf/property-descriptions.txt.
Due to the exhaustiveness of the search, and due to the symbolic values
given to text boxes, a key soundness guarantee of this experiment
is that if JPF does not find a violation of a property, then no actual
execution of the web application (that visits no more
than bP pages) can violate the property.
Baseline
To serve as a baseline, we also provide a simplified variant of our
approach, which simulates a previous approach. This variant also involves
running JPF on translated applications (i.e., non-web programs), to use
JPF to exhaustively traverse page sequences up to the
bound bP. The difference from our approach is that
the baseline ignores page contents, and focuses on concolic
execution of the server-side code alone. Whenever a page is visited, it is
modeled as simply sending symbolic values for all request parameters
from the page. That is, effectively, all drop-down lists are considered
to contain arbitrary options. This has the effect of increasing the
number of false positives among the reported property violations. This
baseline simulates previous approaches such as [ref. 14, 27] in our paper;
we call it Weave, after the approach of [ref. 14].
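To make the difference concrete, the following is a short sketch (our own, with hypothetical names) of how the baseline might model the drop-down page from the earlier page model sketch: instead of choosing an element of the server-side set, it submits an unconstrained value. The helper method below is a placeholder, not a real API.

    // Weave-style baseline model of the same drop-down page (hypothetical
    // sketch; not code from the artifact). Page contents are ignored, so the
    // submitted request parameter is an arbitrary value rather than an
    // element of the server-side set S.
    public class ChooseItemBaselineModel {
        public static void render() {
            // Placeholder for however the analysis introduces a symbolic string.
            String chosen = makeUnconstrainedString("itemParam");
            addToCart(chosen); // the server-side entry point receives an arbitrary value
        }

        static String makeUnconstrainedString(String name) { return ""; }
        static void addToCart(String item) { /* server-side code */ }
    }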
Steps to be followed by the reviewer to check all properties using
JPF
- Start a terminal window inside the VM, and cd to the
directory /home/issta2017artifact/jpf/.
- Run the script run.sh in this directory. This will
analyze all 19 properties (i.e., 5 properties for TD, 5 for MS, and 3
each for HD, RO, and IT), one by one, using our approach and using the
Weave baseline mentioned above. The on-screen messages that appear
while the script is running indicate how many properties have been
checked so far.
- Once the script run.sh completes, open the output
file summary-table.html in the same directory to view the
summarized results.
The generated summary html file contains two tables. The first table
(titled PageModel) contains results from our approach for all 19
properties. The second table contains results from the Weave
baseline for all 19 properties.
Both tables have the same format. A fragment of the first table (i.e.,
the one from our approach) is shown below.
Property | Violations Found? | Num. page sequences | Num. unique page sequences | Time (in s) | Page Bound
propHD1  | No                | 8                   | 8                          | 1.642       | 5
propHD2  | Yes               | 17                  | 5                          | 1.401       | 5
Each row in the table pertains to one of the 19 properties. The columns in
the table have the following meanings, respectively:
- The short-name of the property. (The letters "HD" in
the name indicate that the property is for the benchmark HD.)
- Whether JPF reports a property violation or not in any of
the page sequences that it traverses while checking this property.
- The total number of page sequences traversed (i.e., total number of
runs of the benchmark explored). Note that all these runs of the
benchmark occur as part of a single run of JPF, as enabled by
its backtracking approach.
- The total number of unique sequences of pages traversed,
across all runs.
- Time taken (in seconds) for all the runs.
- The value of the page-length bound bP that we used
while checking this property. Note that we chose this bound manually by
identifying the minimum length page sequence that would need to be
traversed in order to be able to check the property.
Note that multiple runs of the benchmark may correspond to the same
unique page sequence. This is because the drop-down list option selected in
two corresponding pages in the two runs could differ, or the symbolic
string entered in a text box in two corresponding pages in the two runs
could cause control to go through different paths in the server-side code.
The results
We summarize the results of functional property checking as follows. Eight
of the 19 properties were found to be violated by our approach. These are
propHD2, propIT2, propMS3, propMS4, propTD1, propTD2, propTD3, propTD4
(please see the
file /home/issta2017artifact/jpf/summary-table.html). Six of
these violations were reproducible by us (all except propIT2 and
propMS3). This is evidence for the low false positive rate of our
approach.
The Weave baseline reports five violations in addition to the 8 reported
by our approach. These are for properties propTD5,
propMS1, propMS5, propRO1, and propRO2. These five reports
are necessarily false positives, as explained in our paper.
Detailed output files and other artifacts
- For each of the 19 properties prop, there is a
folder /home/issta2017artifact/jpf/pageModel/properties/prop. For
instance, there are
folders /home/issta2017artifact/jpf/pageModel/properties/propMS1,
/home/issta2017artifact/jpf/pageModel/properties/propTD1, etc.
The folder /home/issta2017artifact/jpf/pageModel/properties/prop/output/
contains the following output files:
  - pageSequences.txt. Each line in this file pertains to one run of the
  benchmark. It lists the names of the pages visited in the run, along with
  the drop-down list option selected in each page (if the page has any
  drop-down lists). The number of lines in this file is the value in the
  "Num. page sequences" column in summary-table.html.
  - violatingSequences.txt. If property prop was violated, this file
  contains a non-empty list of numbers. For each number n in this file,
  line n of pageSequences.txt represents a run in which the property fails.
  - uniquePageSequences.txt. This contains the unique page sequences
  visited by the runs of the benchmark.
  - jpfSummary.txt. This contains summarized statistics emitted by JPF
  about all the runs explored while checking this property.
  As an example, consider the
  folder /home/issta2017artifact/jpf/pageModel/properties/propMS1/output/.
  Fifty-two runs of the benchmark were used to check this property, as can
  be seen from the number of lines in pageSequences.txt. These runs visited
  four unique page sequences (uniquePageSequences.txt). None of these runs
  violated the property (violatingSequences.txt).
  A similar set of output files is present for each of the 19 properties
  under the directory /home/issta2017artifact/jpf/weave. These represent
  the baseline experiments discussed earlier.
- Each of the property folders mentioned above (for our approach, as well as
for the Weave baseline) also contains a full copy of
the translated (non-web) program that is actually analyzed using JPF.
This copy is not identical to
the translated program produced by our tool. An important difference is that,
for each property, we have manually added code at an appropriate point in
the program to check at run time whether the property is satisfied or
violated. If any run of the benchmark fails the check at this
point, the property is reported as violated. As an example, consider the
file
/home/issta2017artifact/jpf/pageModel/properties/propMS1/src/music/cart/DisplayCartServlet.java.
The property check has been added right after the comment /*new property
1*/.
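For illustration only, a manually added run-time property check of this kind might look like the sketch below; the names and the exact condition are hypothetical, and need not match the actual check in DisplayCartServlet.java.

    // Illustrative sketch of a manually inserted property check (hypothetical
    // names; the actual check in the artifact may differ).
    final class PropertyCheckExample {
        // Property: once the shopping cart has been emptied, the total value
        // displayed in the cart must be zero.
        static void checkEmptyCartTotal(boolean cartIsEmpty, double displayedTotal) {
            if (cartIsEmpty && displayedTotal != 0.0) {
                // A failure here is what JPF reports as a property violation.
                throw new AssertionError("emptied cart shows non-zero total: " + displayedTotal);
            }
        }
    }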
We have also made other manual code changes on top of the translated
program produced by our tool. These (semantics-preserving) changes serve to
increase the effectiveness and scalability of JPF model checking. Such
changes are generally required whenever JPF is to be used on any
program -- even programs not produced by our approach. These changes are
described in more detail in Sections 5.2 and 6.1 of our submitted paper.
Experiments using Zoltar
Zoltar is a dynamic-analysis based fault localization tool. Our objective
in this experiment was to use Zoltar to identify bug locations in web
applications. Note that Zoltar is not applicable as-is on web
applications, for the reasons discussed earlier in Section 1. Therefore, we
applied Zoltar on the translated (non-web) programs produced by our tool.
We first set up the experiment as follows:
- Manually created a set of test cases for each benchmark. A test case
is simply a sequence of pages to visit, along with a tuple of input
choices for each visited page, and a final oracle check of a
desired property. All our test cases were passing
test cases. We devised our own text format to represent test cases.
- Manually modified the page models generated by our tool to follow
the provided test inputs instead of choosing links and drop-down list
options randomly.
- Seeded bugs in each benchmark program.
To do this in an unbiased manner, we used
the PIT mutation testing tool. This
tool uses a set of patterns to suggest mutations, and reports those
mutations that cause at least one test case to fail.
- From the mutations reported by PIT, we selected 20 mutations for
each benchmark program. Applying these mutations one at a time, we
created 20 buggy versions of each benchmark.
Steps to be followed by the reviewer to execute Zoltar on all buggy
versions of all benchmarks
- Start a terminal window inside the VM, and cd to the
directory /home/issta2017artifact/zoltar/.
- Run the script run.sh in this directory. This script
will apply Zoltar on each of the 20 buggy versions of each benchmark,
using the test cases provided for the respective benchmarks.
- Once the script run.sh completes, open the output
file summary-table.html in the same directory to view the
summarized results.
The table summary-table.html
contains 21 rows for each
benchmark. Each of the first 20 rows corresponds to a buggy version of a
benchmark. The first two columns depict the name of the benchmark, and the
buggy version's number. The third column depicts the name of the method
where the bug is seeded. The last column "rank of buggy method" is the important
one. Zoltar assigns a suspicion score to each method, indicating
how likely it is that the method contains the buggy statement. The last
column indicates the percentage of methods in the benchmark that have
suspicion scores that are greater than or equal to the suspicion score of
the method that contains the seeded bug. A lower percentage indicates that
the method that contains the seeded bug appears high in Zoltar's report,
which is a list of methods ranked in descending order of suspicion score,
so a developer can quickly zero in on the buggy method. The 21st
row gives the geometric mean of the ranks of the buggy methods across
the 20 buggy versions.
The results
It should be clear from the
table /home/issta2017artifact/zoltar/summary-table.html that
Zoltar is extremely effective in localizing faults when applied to our
generated non-web programs. The geometric mean of the rank of the buggy
method over the 20 buggy versions of each benchmark ranges from just 0.24%
(benchmark IT)
to around 5% (RO). This means that, on average, a developer would have to
examine just 0.24% to 5% of the methods in an application before being able
to zero in on the buggy method.
Detailed output files and other artifacts
For each benchmark bm, there is a
folder /home/issta2017artifact/zoltar/bm/ in the VM image. For
instance, there are folders /home/issta2017artifact/zoltar/MS,
/home/issta2017artifact/zoltar/TD, etc. Under each of these folders
there are folders named Mutation1, Mutation2, ..., Mutation20. Each of
these Mutation folders contains a buggy version of the benchmark, within
its src/ sub-folder. Each Mutation folder also contains
an output/ sub-folder, which contains the following detailed
output files:
- summary-table.html. This is like the output file with
the same name in the folder /home/issta2017artifact/zoltar/, but contains a
single row. This row indicates the position of the buggy method
among all methods in this (buggy) version of the benchmark, as per
Zoltar's ranked list of methods. This file is emitted by our script.
- individualScores.txt. This file is emitted by Zoltar. Each line in this
file contains the name of a method, followed by a hash sign, the line
number of a line of code within the method, and then the suspicion score
of that line of code (a value between 0 and 1, with 1 indicating the
highest suspicion). The suspicion score of a method is the maximum
suspicion score of any line of code within the method. (A small parsing
sketch is given after this list.)
- testCaseReports.txt. This file, also emitted by Zoltar, indicates for
each of the provided test cases whether the test case passes or fails on
this buggy version of the benchmark.
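If desired, the per-method suspicion scores can be recomputed from individualScores.txt with a short program such as the sketch below. This is our own convenience code, not part of the artifact, and it assumes that each line has the form method#lineNumber followed by whitespace and the score; the parsing may need adjusting if the actual format differs.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    // Recomputes per-method suspicion scores (the maximum over the method's
    // lines) from individualScores.txt. Assumes lines of the form
    // "method#lineNumber score"; not part of the artifact.
    public class MethodScores {
        public static void main(String[] args) throws IOException {
            Map<String, Double> methodScore = new HashMap<>();
            for (String line : Files.readAllLines(Paths.get(args[0]))) {
                line = line.trim();
                if (line.isEmpty() || !line.contains("#")) continue;
                String method = line.substring(0, line.indexOf('#'));
                String[] tokens = line.split("\\s+");
                double score = Double.parseDouble(tokens[tokens.length - 1]);
                methodScore.merge(method, score, Math::max); // keep the maximum per method
            }
            // Print methods in descending order of suspicion score.
            methodScore.entrySet().stream()
                .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
                .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
        }
    }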
For example, consider the
file /home/issta2017artifact/zoltar/MS/Mutation1/output/individualScores.txt.
There are 2705 lines in this detailed report. Three of these lines have the
highest suspicion score of 1.0. These 3 lines all happen to belong to
the method where the bug was seeded, which is the method
client.Index.jspService(). In fact, two of these lines are
exactly the location where the bug occurs. Therefore, the
file /home/issta2017artifact/zoltar/summary-table.html reports the rank
of the buggy method client.Index.jspService() in Mutation1 of
benchmark MS as 0.25% (1st method out of a total of 392 methods in
the application).
Reviewers can try their own mutations
We encourage the reviewer to try seeding their own errors and to see how
effective Zoltar is in identifying the bug locations. An arbitrary seeded
bug may not be a good candidate, because Zoltar can only find bugs that
cause at least one of the provided test cases to fail. To help with this,
using the tool PIT, we have identified about 22 candidate mutations for the
benchmark MS, each of which causes at least one test to fail. These
candidate mutations are listed in the
file /home/issta2017artifact/zoltar/MS/ZoltarList.txt. For
instance, the first mutation suggested in this file is as follows:
Mutation 1: com.client.run/MultiUserDriver.java 148
1.1 Location : exitTo Killed by : com.music.tests.EmailTest.test2(com.music.tests.EmailTest) removed call to client/Index::jspService is KILLED
This suggestion can be tried as follows:
- cd to /home/issta2017artifact/zoltar/MS/NoMutation. This is a
non-buggy version of the MS benchmark.
- Open the file src/com/client/run/MultiUserDriver.java,
go to line 148, and remove the call to the
method Index.jspService (this is the suggested mutation; see above).
- cd back to the directory /home/issta2017artifact/zoltar/MS/NoMutation, and
run ant and then run.sh in the same directory.
- cd to the output/ sub-folder, and look at the
file individualScores.txt. The lines of code in the
method exitTo in class MultiUserDriver (which
is the method that contains the just-seeded bug) should have high
suspicion scores (i.e., close to 1.0).