Pragmatic Data Mining: Novel Paradigms for Tackling Key Challenges

Abstract :
Over the last few decades, Data Mining has progressively evolved into an extremely significant field for active research. Accordingly, with a tremendous spurt in the amount of real data being generated, the attention has diverted from synthesis and accumulation of data to its analysis and application. Many of the well-established techniques in the literature, pertaining to some integral machine learning and pattern recognition areas such as Dimensionality Reduction, Clustering and Classification, have been rendered ineffective as a result of this paradigm shift in focus. In this work, we present a comprehensive overview of the key challenges facing these areas, and offer new insights into overcoming these challenges. In particular, we make the following contributions:
we (a) propose a generic dimension reduction technique for extracting significant information, especially in the context of image data depicting dynamic scenes,
(b) characterize the notion of order independence in incremental learning,
(c) propose improvements in the prototype Leader algorithm to obtain better quality of clustering,
(d) introduce an algorithm, RACK, based on height balanced trees, which significantly improves upon the time taken by the popular k-means algorithm, without compromising much on the quality of clustering,
(e) demonstrate how the integration of partition based clustering techniques can be achieved using an algorithm, EPIC, for elegantly incorporating the domain knowledge,
(f) show how an order independent algorithm based on Shapley value, SHARPC, views the problem of clustering as a natural manifestation of the interactions among the points in a convex game setting, and thereby improves the quality of clustering,
(g) introduce the Q-Optimal Disk Layout problem in the context of suffix trees, show it to be NP-Hard, and suggest an algorithm Approx. Q-OptDL to obtain a disk layout that is guaranteed to have a performance within twice of the optimal layout asymptotically, and
(h) introduce the alpha-MFC problem for addressing the `curse of dimensionality' in classification, and propose Feature Subspace SVMs( FS-SVMs) for an approximate solution to the alpha-MFC problem in the context of high dimensional handwritten digit recognition.
Our experimental results strongly corroborate the efficacy of our work.

pdf