Novel First-order Algorithms for Non-smooth Optimization Problems in Machine Learning

Series: Ph.D. Thesis Colloquium - ON-LINE

Speaker: Mr. Achintya Kundu, Ph.D. (Engg.) student, Dept. of CSA

Date/Time: Apr 27 09:30:00

Location: Microsoft Teams - ON-LINE

Faculty Advisor: Prof. Chiranjib Bhattacharyya

This thesis is devoted to the design of efficient optimization algorithms for machine learning (ML) problems in which the objective function to be optimized is convex but not necessarily differentiable. Such non-smooth objectives are ubiquitous in ML, mainly due to one or more of the following: (a) non-differentiable loss functions, (b) sparsity-promoting regularization terms, and (c) constraint sets that induce specific structure on the parameters to be learned. Motivated by the wide range of learning problems that can be cast as optimization of a non-smooth convex objective, we focus on developing first-order algorithms with non-asymptotic convergence rate guarantees for large-scale settings. Addressing shortcomings of existing research in this domain, the thesis makes the following specific contributions.
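As a minimal illustration of a first-order method on a non-smooth convex objective (not an algorithm from the thesis), the following sketch runs the classical subgradient method on an L1-regularized quadratic, whose non-differentiability comes from the L1 term; the function name and step-size schedule are illustrative choices:

```python
import numpy as np

def subgradient_descent(b, lam=0.1, steps=200):
    # Minimize f(x) = 0.5*||x - b||^2 + lam*||x||_1, non-smooth due to the L1 term.
    # np.sign(x) picks one valid subgradient of ||x||_1 at x.
    x = np.zeros_like(b)
    for t in range(1, steps + 1):
        g = (x - b) + lam * np.sign(x)  # a subgradient of f at x
        x = x - g / t                   # diminishing step size 1/t
    return x

# The closed-form minimizer is the soft-thresholding of b by lam.
x = subgradient_descent(np.array([1.0, -0.5, 0.05]))
```

Here the step size shrinks as 1/t, which suits the strongly convex quadratic part; with a fixed step the iterates would merely oscillate around the non-smooth kink.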
First, we consider the problem of learning a kernel matrix from m similarity matrices under a general convex loss. Existing algorithms do not apply with arbitrary loss functions and often cannot handle the case m > 1. Based on the Mirror Descent (MD) framework, we present several provably convergent iterative algorithms that exploit off-the-shelf support vector machine (SVM) solvers. A significant contribution is an extension of the well-known MD algorithm on the simplex to the Cartesian product of positive semidefinite (PSD) matrices, leading to a novel algorithm called Entropic Multiple Kernel Learning. We also report simulation results on protein structure classification involving several similarity matrices to demonstrate the proposed algorithms' efficacy.
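The classical entropic mirror descent update on the probability simplex, which the thesis generalizes to products of PSD matrices, can be sketched as follows on an assumed toy linear objective (step size and iteration count are illustrative, not from the thesis):

```python
import numpy as np

def entropic_mirror_descent(grad, x0, eta=0.1, steps=500):
    # Mirror descent with the entropy mirror map on the probability simplex:
    # the update is multiplicative, x <- x * exp(-eta * grad(x)) followed by
    # renormalization, so iterates stay in the simplex with no explicit projection.
    x = x0.copy()
    for _ in range(steps):
        w = x * np.exp(-eta * grad(x))
        x = w / w.sum()
    return x

# Toy objective f(x) = <c, x> over the simplex; the minimum concentrates
# all mass on the coordinate with the smallest cost.
c = np.array([0.3, 0.1, 0.5])
x = entropic_mirror_descent(lambda x: c, np.full(3, 1 / 3))
```

The matrix analogue replaces the exponential with a matrix exponential and the normalization with a trace constraint, which is the direction the extension above takes.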
Next, we focus on minimizing a convex function over a feasible set given by the intersection of finitely many simple sets, each equipped with a projection oracle. Constraint sets with this structure include the set of doubly stochastic matrices, the elliptope, and the intersection of the PSD cone with an L1-norm ball. The main difficulty lies in computing the projection of a point onto the feasible set. Exploiting the linear regularity property of the intersecting sets, we present an exact penalty approach that leads to first-order algorithms with explicit guarantees on the approximate solution's distance from the feasible set. Further, we show improved iteration complexity when the objective possesses structural smoothness or strong convexity. This is achieved through a saddle-point reformulation in which the proximal operators required by the primal-dual algorithms can be computed in closed form. We illustrate the benefits of our approach on graph transduction and graph matching problems.
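To make the role of per-set projection oracles concrete, here is the classical von Neumann alternating-projections scheme (not the thesis's exact penalty method) on an assumed toy pair of simple sets whose intersection is the probability simplex; it finds a feasible point, whereas Dykstra's correction would be needed to recover the exact projection:

```python
import numpy as np

def alternating_projections(y, proj_a, proj_b, steps=100):
    # von Neumann alternating projections: given projection oracles for two
    # closed convex sets A and B with nonempty intersection, alternately
    # applying them drives the iterate to a point in A ∩ B.
    x = y.copy()
    for _ in range(steps):
        x = proj_b(proj_a(x))
    return x

# Two simple sets with cheap oracles: the hyperplane {x : sum(x) = 1}
# and the nonnegative orthant; their intersection is the simplex.
proj_hyperplane = lambda x: x + (1 - x.sum()) / x.size
proj_orthant = lambda x: np.maximum(x, 0)

x = alternating_projections(np.array([2.0, -1.0, 0.5]),
                            proj_hyperplane, proj_orthant)
```

Linear regularity of the intersecting sets is exactly the condition under which such distance-to-feasibility arguments give explicit rates rather than mere asymptotic convergence.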
Third, we consider convex-concave saddle point problems with a bilinear interaction structure. This class encompasses most convex optimization problems arising in ML and includes, as a special case, minimizing the sum of many simple non-smooth convex functions; it thereby subsumes learning problems with complex regularization terms such as total-variation-based image denoising, overlapping group lasso, and isotonic regression. We first propose a primal-dual algorithm for this general class that achieves an O(1/T) convergence rate guarantee on the non-ergodic primal-dual iterate pair. Further, assuming strong convexity in the primal domain, we show an improved non-ergodic convergence rate of O(1/T^2). In contrast, existing primal-dual algorithms achieve such rates only in the ergodic or semi-ergodic setting.
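A standard primal-dual scheme for this bilinear structure is the Chambolle-Pock iteration; the sketch below (an assumed toy instance, not the thesis's algorithm) applies it to 1-D total-variation denoising, written as min_x max_{|y| <= lam} <Dx, y> + 0.5*||x - b||^2 with D the forward-difference operator, and step sizes chosen to satisfy the usual condition tau*sigma*||D||^2 < 1:

```python
import numpy as np

def chambolle_pock(b, lam=10.0, tau=0.4, sigma=0.4, steps=2000):
    # Primal-dual iterations for the bilinear saddle point
    #   min_x max_{|y| <= lam}  <D x, y> + 0.5 * ||x - b||^2,
    # i.e. 1-D total-variation denoising. ||D||^2 <= 4, so tau*sigma*4 < 1.
    n = b.size
    D = np.diff(np.eye(n), axis=0)  # (n-1) x n forward-difference operator
    x, y, x_bar = np.zeros(n), np.zeros(n - 1), np.zeros(n)
    for _ in range(steps):
        y = np.clip(y + sigma * (D @ x_bar), -lam, lam)      # dual ascent + box projection
        x_new = (x - tau * (D.T @ y) + tau * b) / (1 + tau)  # prox of 0.5*||. - b||^2
        x_bar = 2 * x_new - x                                # primal extrapolation
        x = x_new
    return x

# With lam large, the TV term forces a constant signal equal to mean(b).
x = chambolle_pock(np.array([0.0, 1.0, 2.0]))
```

Both proximal steps here are closed-form (a box projection and a scaled average), which is the property the saddle-point reformulations above rely on.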
Finally, we consider the classical setting of minimizing the sum of two convex functions: a smooth one (with Lipschitz continuous gradient) and a simple non-smooth one with an easy-to-compute proximal operator. The well-known FISTA algorithm (like Nesterov's accelerated gradient method) achieves the optimal O(1/T^2) non-ergodic convergence rate for this problem. A drawback of these fast gradient methods is that they compute gradients of the smooth function at points different from those to which the convergence rate guarantee applies. Inspired by the use of past gradients as momentum in training deep networks, we propose an accelerated gradient algorithm that rectifies this drawback while keeping the convergence rate intact. We achieve this through a judicious choice of momentum in both the primal and dual spaces. To the best of our knowledge, this is the first accelerated gradient algorithm with an O(1/T^2) convergence rate guarantee on the iterates at which gradients are evaluated. This fills a significant gap: Polyak's Heavy Ball method guarantees an accelerated convergence rate through gradient momentum only for a restrictive class of twice-differentiable, strongly convex objective functions.
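The drawback in question is visible in a standard FISTA sketch: the gradient is evaluated at the extrapolated point z, not at the iterate x for which the O(1/T^2) rate is stated. The toy problem and all names below are illustrative, not the thesis's corrected algorithm:

```python
import numpy as np

def fista(grad_f, prox_g, x0, L, steps=100):
    # FISTA: proximal gradient with Nesterov momentum. Note grad_f is called
    # at the extrapolated point z rather than at the returned iterate x.
    x, z, t = x0.copy(), x0.copy(), 1.0
    for _ in range(steps):
        x_new = prox_g(z - grad_f(z) / L)            # proximal gradient step at z
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2     # momentum schedule
        z = x_new + ((t - 1) / t_new) * (x_new - x)  # extrapolation
        x, t = x_new, t_new
    return x

# Toy problem: min_x 0.5*||x - b||^2 + lam*||x||_1,
# whose closed-form solution is soft-thresholding of b by lam.
b, lam = np.array([1.0, -0.5, 0.05]), 0.1
soft = lambda v, s: np.sign(v) * np.maximum(np.abs(v) - s, 0)
x = fista(lambda x: x - b, lambda v: soft(v, lam), b * 0, L=1.0)
```

An algorithm with guarantees directly on the points where gradients are taken, as described above, would remove the mismatch between the z-sequence and the x-sequence in this loop.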
The Microsoft Teams link:

Speaker Bio:

Host Faculty: