Pragmatic Data Mining: Novel Paradigms for Tackling Key Challenges
Abstract :
Over the last few decades, Data Mining has progressively evolved into an extremely
significant field
for active research. Accordingly, with a tremendous spurt in the amount of real data
being generated,
the attention has diverted from synthesis and accumulation of data to its analysis
and application.
Many of the well-established techniques in the literature, pertaining to some
integral machine learning
and pattern recognition areas such as Dimensionality Reduction, Clustering and
Classification, have
been rendered ineffective as a result of this paradigm shift in focus. In this work, we
present a comprehensive
overview of the key challenges facing these areas, and offer new insights into
overcoming
these challenges. In particular, we make the following contributions:
we (a) propose
a generic dimension
reduction technique for extracting significant information, especially in the
context of image data
depicting dynamic scenes,
(b) characterize the notion of order independence in
incremental learning,
(c) propose improvements in the prototype Leader algorithm to obtain better quality
of clustering,
(d)
introduce an algorithm, RACK, based on height balanced trees, which significantly
improves upon the
time taken by the popular k-means algorithm, without compromising much on the
quality of clustering,
(e) demonstrate how the integration of partition based clustering techniques can be
achieved using an
algorithm, EPIC, for elegantly incorporating the domain knowledge,
(f) show how an
order independent
algorithm based on Shapley value, SHARPC, views the problem of clustering as a
natural manifestation
of the interactions among the points in a convex game setting, and thereby improves
the quality of
clustering,
(g) introduce the Q-Optimal Disk Layout problem in the context of suffix
trees, show it to
be NP-Hard, and suggest an algorithm Approx. Q-OptDL to obtain a disk layout that is
guaranteed
to have a performance within twice of the optimal layout asymptotically, and
(h)
introduce the alpha-MFC
problem for addressing the `curse of dimensionality' in classification, and propose
Feature Subspace
SVMs( FS-SVMs) for an approximate solution to the alpha-MFC problem in the context
of high dimensional
handwritten digit recognition.
Our experimental results strongly corroborate the
efficacy of our work.
pdf