Modeling Statement Context to Surface even Rare Diffused Topics
Automatically
Suparna Bhattacharya, Mrinal Kanti Das, Chiranjib Bhattacharyya,
K. Gopinath
Statistical topic models infer topics from statistical information
contained in a dataset. Existing topic models can detect topics if they are
present *prominently* in some files or scattered widely across the files.
However, they fail if a topic is diffused across only a small percentage of
files, that is, if the topic is neither prominent inside any file nor
diffused *widely* across files. In this work we explore the problem of
detecting such *rare diffused* topics. We observe that the local context of
lines in a file plays a key role in surfacing these topics. We introduce
various mechanisms to control a topic model's sensitivity towards local
context. We propose CSTM (*Context Sensitive Topic Model*), a new model
that is capable of discovering *prominent*, *widely diffused* as well as
*rare diffused* topics by leveraging the context of individual lines within
each file.
Rare diffused topics are quite common in software code, particularly in
framework-based software. We evaluate our model on surfacing software
*concerns* automatically at the fine granularity of *individual program
statements*. CSTM's statement-level concern assignments agree 70% of the
time with typical programmer interpretation (as measured using
systematically gathered feedback from 35 programmers for four Java
applications). The ability to discover statement-level concerns
paves the way for a new class of automated analyses correlating latent
concerns with program properties that vary at statement granularity. As a
novel application, we demonstrate a completely unsupervised automatic
summarization of byte-code execution profiles in terms of latent concerns.