Modeling Statement Context to Surface even Rare Diffused Topics Automatically

Suparna Bhattacharya, Mrinal Kanti Das, Chiranjib Bhattacharyya, K. Gopinath

Statistical topic models infer topics from statistical information contained in a dataset. Existing topic models can detect a topic if it is present *prominently* in some files or scattered widely across the files. However, they fail if a topic is diffused across only a small percentage of files, in other words, if the topic is neither prominent inside any file nor diffused *widely* across files. In this work we explore the problem of detecting such *rare diffused* topics. We observe that the local context of lines in a file plays a key role in surfacing these topics. We introduce various mechanisms to control a topic model's sensitivity towards local context. We propose CSTM (*Context Sensitive Topic Model*), a new model that is capable of discovering *prominent*, *widely diffused*, and *rare diffused* topics by leveraging the context of individual lines within each file. Rare diffused topics are quite common in software code, particularly in framework-based software. We evaluate our model on surfacing software *concerns* automatically at the fine granularity of *individual program statements*. CSTM's statement-level concern assignments agree 70\% of the time with typical programmer interpretation (as measured using systematically gathered feedback from 35 programmers for four Java applications). The ability to discover statement-level concerns paves the way for a new class of automated analyses that correlate latent concerns with program properties that vary at statement granularity. As a novel application, we demonstrate a completely unsupervised automatic summarization of byte-code execution profiles in terms of latent concerns.