For the function f(x) = f(x1, x2) = a·x1^2 + x2^2, where x ∈ R^2 and a > 0, suppose we use Adagrad to optimize f with learning rate η = 1, starting from a fixed initial point x0 = (c, c) where c ≠ 0. Derive the ratio (as a function of a) of the entries of the pre-conditioner.
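A small numerical sketch of the setup (ours, not part of the problem), assuming the standard diagonal Adagrad update xi ← xi − η·gi/√(Gi), where Gi is the running sum of squared gradients (the pre-conditioner):

```python
import math

def adagrad(a, c, eta=1.0, steps=1):
    """Diagonal Adagrad on f(x1, x2) = a*x1**2 + x2**2, from x0 = (c, c)."""
    x = [c, c]
    G = [0.0, 0.0]  # accumulated squared gradients (the pre-conditioner)
    for _ in range(steps):
        g = [2 * a * x[0], 2 * x[1]]          # gradient of f at x
        for i in range(2):
            G[i] += g[i] ** 2
            x[i] -= eta * g[i] / math.sqrt(G[i])
    return x, G

# After one step from (c, c) the gradient is (2ac, 2c), so the
# pre-conditioner entries are sqrt(G1) = |2ac| and sqrt(G2) = |2c|,
# and their ratio is a.
```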

2 Convex optimization but the minimizer is at infinity

Li Hua claims that he has spotted a huge bug in the convergence proof of gradient descent for convex functions in the lecture. He claims that the proof cannot be applied to logistic regression, where the minimizer is at infinity; hence gradient descent might take forever to converge.

Now, you want to show him some math:

1. (5 points) For the logistic function h(x) = log(1 + e^x), viewed as a function h : R → R, show that the minimizer of h is indeed at infinity.
Solution
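A minimal sketch of one possible argument (ours): h is strictly positive everywhere yet tends to 0, so its infimum is not attained at any finite point:

```latex
h(x) = \log(1 + e^x) > \log 1 = 0 \quad \text{for all } x \in \mathbb{R},
\qquad
\lim_{x \to -\infty} \log(1 + e^x) = \log 1 = 0 .
```

Hence inf_x h(x) = 0, approached only in the limit x → −∞.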

2. (5 points) However, show that for every ϵ > 0, there is an x with |x| = O(log(1/ϵ)) such that h(x) ≤ ϵ.
Solution
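A quick numerical sanity check of this bound (our own sketch): taking x = −log(1/ϵ) gives h(x) = log(1 + ϵ) ≤ ϵ, with |x| = log(1/ϵ) exactly.

```python
import math

def h(x):
    """Logistic loss h(x) = log(1 + e^x), via log1p for accuracy."""
    return math.log1p(math.exp(x))

for eps in [1e-1, 1e-3, 1e-6]:
    x = -math.log(1.0 / eps)   # |x| = log(1/eps)
    assert h(x) <= eps         # h(x) = log(1 + eps) <= eps
```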

3. (30 points) Now consider a general function f : R^d → R. Suppose f is an L-smooth convex function, and that for some ϵ > 0 there is an x with ∥x∥2 ≤ R such that f(x) ≤ ϵ.
Show that starting from x0 = 0, the gradient descent update with η ≤ 1/L converges to f(xT) ≤ 2ϵ in T ≤ poly(R, L, 1/η, 1/ϵ) many iterations. (Hint: apply the Mirror Descent Lemma on x.)
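As a concrete sanity check of this claim on the one-dimensional function from part 1 (a sketch of ours, not the requested proof): h(x) = log(1 + e^x) is L-smooth with L = 1/4, so we may take η = 1/L = 4 and count the iterations until h(xT) ≤ 2ϵ.

```python
import math

def sigmoid(x):
    # numerically stable logistic sigmoid; note h'(x) = sigmoid(x)
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    return math.exp(x) / (1.0 + math.exp(x))

def gd_on_logistic(eps, eta=4.0, max_iter=10**6):
    """Gradient descent on h(x) = log(1 + e^x) from x0 = 0 until h(x) <= 2*eps."""
    x, t = 0.0, 0
    while math.log1p(math.exp(x)) > 2 * eps and t < max_iter:
        x -= eta * sigmoid(x)
        t += 1
    return x, t

# Empirically the iteration count grows roughly like 1/(8*eps),
# i.e. polynomial in 1/eps, consistent with the claimed bound.
```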

3 Random Query for Convex Optimization (20 points)

This problem studies the efficiency of the following optimization algorithm for a d-dimensional convex function f:

1. Fix a probability distribution D, which does not depend on f. You don’t know f, but let’s say you somehow know that the optimum x∗ has nonzero probability density under D.

2. Sample x1, x2, …, xN ∼ D

3. Return xm = argmin_{x ∈ {x1, …, xN}} f(x)

Let D be the uniform distribution over the cube Q = {x | −1 ≤ xi ≤ 1 for all i} = {x | ∥x∥∞ ≤ 1}.
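A direct sketch of steps 1–3 with this choice of D (the helper name `random_query_minimize` is ours):

```python
import random

def random_query_minimize(f, d, N, seed=0):
    """Sample N points uniformly from the cube Q = [-1, 1]^d and return the best one."""
    rng = random.Random(seed)
    best_x, best_val = None, float("inf")
    for _ in range(N):
        x = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        fx = f(x)
        if fx < best_val:
            best_x, best_val = x, fx
    return best_x, best_val

# Example: minimize f(x) = ||x||_2^2 over Q in d = 10 with N = 1000 samples.
```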
Show that for every fixed x0 ∈ R^d, if we sample N = d^{O(1)} points x1, x2, …, xN uniformly from Q, then with probability at least 1 − e^{−Ω(d)}, the following holds for all i ∈ [N]:
