Convergence Analysis for Block Coordinate Decent Algorithm and Powell's Examples

We mainly focus on the convergence of Block coordinate decent with exact minimization, whose block update strategy employs Gauss-Seidel manner. And then use Powell's example to see what will happen if some conditions are not met.

Reference: 1. Dimitri .P Bertsekas, Nonlinear Programming 2ed 2. Powell ,1973, ON SEARCH DIRECTIONS FOR MINIMIZATION ALGORITHMS

Problem description

Notations

We want to solve the problem:

minxXf(x)

where X is a Cartesian product of closed convex sets X1,...,Xm:X=i=1nXi

We assume that Xi is a closed convex subset of Rni and n=i=1mni. The vector is partitioned into m block(s) such that xiXni.

We denote if as the gradient of f with respect to component xi.

Assumption

We shall assume that for every xX and i=1,2,...m the optimization problem

minξXif(x1,...,xi1,ξ,xi+1,....,xm)

has at least one solution.

Algorithm

The Gauss-Seidel method, generates the next iterate xk+1=(x1k+1,...,xmk+1), given the current the iterate xk=(x1k,...,xmk), according to the iteration

xik+1=argminξXif(x1k+1,...,xi1k+1,ξ,xi+1k,...,xmk)

Convergence Analysis

Theorem Suppose that f is continuously differentiable over the set X defined as above. Furthermore, suppose that for each i and xX,

f(x1,...,xi1,ξ,xi+1,....,xm)

viewed as a function of ξ, attains a unique minimum x¯i over Xi and is monotonically non-increasing in the interval from xi to ξ¯. Let {xk} be the sequence generated by the block coordinate method with Gauss-Seidel manner. Then, every limit point of {xk} is a stationary point.

PROOF

Let

zik=(x1k+1,...,xik+1,xi+1k,...,xmk)

By the nature of this algorithm, for all k0, we have following inequality

f(xk)f(z1k)f(z2k)...f(zm1k)f(xk+1)()

Since {xk}inX, we can assume {xkj} is the subsequence that converges to x¯=(x¯1,..,x¯m).

Now we want prove that x¯ is the stationary point of f.

From (*), we know that

f(z1kj)f(x1,x2kj,...,xmkj)x1X1

Let j+, we derive

f(x¯)f(x1,x¯2,...,x¯m)=Δh(x1)x1X1

which implies that x¯i is the minima of h(x1) on X1. Using the optimality over a convex set, we conclude that

h(x¯1)(x¯1x1)0(x1x¯1)T1f(x¯1)0x1X1

At this stage, if we can prove that {z1kj} converges to x¯, we can show that

(x2x¯2)T2f(x¯2)0x2X2, since

f(z1kj)=f(x1kj+1,x2kj,x3kj,...,xmkj)f(x1kj+1,x2,x3kj,...,xmkj)x2X2

Let j+, we derive

f(x¯)f(x¯1,x¯2,x¯3,...,x¯m)x2X2

and

(x2x¯2)T2f(x¯2)0x2X2

(Note: Although x1kj+1 may not in the sequence {x1kt}t1 ,which convergences to x¯1, but {z1kj} converges to x¯, so its component x1kj+1 converges to x¯1).

Furthermore, if we prove that for i=1,2,...,m1,{zikj} convergences to x¯, then we have

(xix¯i)Tif(x¯i)0xiXi

And thus x¯ is a stationary point, since (xx¯)Tf(x¯)0

By far, it remains to prove that {zikj},i convergence to x¯. First,we try to prove that {z1k1} convergence to x¯.

Assume the contrary that rkj=||z1kjxkj|| doesn't convergence to 0. Let s1kj=(z1kjxkj)/rkj. Thus, z1kj=xkj+rkjs1kj , ||rkj||=1 and s1kj differs from 0 only along the first block-component. Since {s1kj} belong to a compact set and therefore without loss of generality, we assume s1kj convergences to s¯1.

Since rkj>0,we can find a ϵ(0,1), such that xkj+ϵs1kj lies on the segment joining xkj and xkj+s1kj=z1kj. Using the non-increasing property of f,we derive,

f(z1kj)f(xkj+ϵs1kj)f(xkj)

Again, using (*), we conclude

f(xkj+1)f(z1kj)f(xkj+ϵs1kj)f(xkj)

Let j+, we derive f(x¯)=f(x¯+ϵs¯1), which contradicts the hypothesis that f is uniquely minimized when viewed as a function of the first block component. This contradiction establishes that {z1k1} convergence to x¯.

Similarly, let rtkj=||ztkjzt1kj|| for t=2,3,...,m1 and using the same technique shown above, we finally prove that {zikj},i.

Powell's example

In ON SEARCH DIRECTIONS FOR MINIMIZATION ALGORITHMS, Power actually gives three examples that sequences generated by the algorithm discussed above do not convergence to stationary points once some hypothesis are not met.

  • The first example is straightforward, However, the remarkable properties of this example can be destroyed by making a small perturbation to the starting vector x0.

  • The second example is not sensitive to either small changes in the initial data or to small errors introduced during the iterative process, for example computer rounding errors.

  • The third example suggests that a function that is infinitely differentiable that also causes an endless loop in the iterative minimization method.

We here only presents the first example. Consider the following function

f(x,y,z)=(xy+yz+zx)+(x1)+2+(x1)+2+(y1)+2+(y1)+2+(z1)+2+(z1)+2

where

(xc)+2={0,xc<0(xc)2,xc0

Given the starting point x0=(1e,1+12e,114e) and use block coordinate decent algorithm,and we update the variable in a manner of xyzx... with

xk+1sign(yk+zk)[1+12|yk+zk|]

yk+1sign(xk+1+zk)[1+12|xk+1+zk|]

zk+1sign(xk+1+yk+1)[1+12|xk+1+yk+1|]

We here present the first six steps of this case

cycle/totall iteration x y z
1/1 1+18e 1+e -1-14e
1/2 1+18e -1-116e -1-14e
1/3 1+18e -1-116e 1+132e
2/4 -1-164e -1-116e 1+132e
2/5 -1-164e 1+1128e 1+132e
2/6 -1-164e 1+1128e -1-1256e
3/7 1+1512e 1+1128e -1-1256e
... ... ... ...

This result implies that the sequence obtained by this algorithm can not converge to one single point since xcoordinate change its sign as the even cycle and odd cycle alternate. Situations are similar for ycoordinate and zcoordinate.

But {xk} has six sub-sequences which convergence to (1,1,-1), (1,-1,-1), (1,-1,1), (-1,-1,1),(-1,-1,1),(-1,1,1),(-1,1,-1) respectively.

Remark

  1. A hint to derive the update formula:

    xsign(y+z)[1+12(y+z)]

    Indeed, derivates of (x1)+2 and (x1)+2 are as follows respecively

    d(x1)+2dx={2(x1),x10,x<1d(x1)+2dx={2(x1),x10,x>1

    So for the univariate optimization problem, setting the derivate of g(x)=f(x,y,z) to zero, we conclude

    f(x,y,x)x=0{x1:x=1+12(y+z)1<x<1:(y+z)=0x1:x=1+12(y+z)

  2. The gradient of f(x,y,z) on this cyclic path, is f(x,y,z)=(yz,xz,xy) and ||f(x,y,z)||1=2

  3. This example is unstable with respect to small perturbations. Small changes in the starting point x0=(1e,1+12e,114e) or smal errors in the numbers that are computed during the calculation will destroy the cyclic behavior.

    It's s clear the choice of perturbations e plays a key role. Say, x0=(1e1,1+e2,1e3) and we have ek=12(ek2ek1)

    cycle/totall iteration x y z
    1/1 1+e4 1+e2 -1-e3
    1/2 1+e4 -1-e5 -1-e3
    1/3 1+e4 -1-e5 1+e6
    2/4 -1-e7 -1-e5 1+e6
    2/5 -1-e7 1+e8 1+e6
    2/6 -1-e7 1+e8 -1-e9
    ... ... ... ...

    To preserve the cyclic behavior , we have to make sure that ek2>ek1

    And in practice, when we do some numerical tests, we shall find that, this theoretically-existed endless loop actual breaks down due to the rounding errors. A brief illustration is given below. In this experiment, loop ends at the 52 steps.

    Snip20161117_13
    Snip20161117_14 Snip20161117_15

  4. As
    f(x,y,x)x=0{x1:x=1+12(y+z)1<x<1:(y+z)=0x1:x=1+12(y+z)

    suggests that, when 1<x<1, the choice of x is arbitrary and we set x=0 in the case above. So the uniqueness requirement is violated. It turns out that the six vertices are even not the stationary points.

    For example, at point x¯=(1,1,1), f(x¯)=(0,0,2) and for any ponit x in the unit cubic (xx¯)Tf(x¯)0. Say, x=(0.9,0.9,0.9), (xx¯)Tf(x¯)=0.2<0

    Actually, as in the proof of Theorem, we prove that {z1kj} converges to x¯, where x¯ is the limit point of {xkj}. But in this example, the limit point of {z1kj} is (1,1,-1) while the limit point of {xkj} is either (-1,1,-1) or (1,-1,1). So the requirement of uniqueness is not met.

R codes for numerical experiments

####################
### Function for test ###
####################

PowellE1<-function(xstart,cycles,fig=T){
  #######function part ##############
  UpdateCycle<-function(x){
    Sign<-function(x){
      if (x>0){
        return(1)
      }else{
        if (x<0){
          return(-1)
        }else{
          return(0)
        }
      }
    }
    x.new<-c()
    x.new[1]<-Sign(x[2]+x[3])*(1+0.5*abs(x[2]+x[3]))
    x.new[2]<-Sign(x.new[1]+x[3])*(1+0.5*abs(x.new[1]+x[3]))
    x.new[3]<-Sign(x.new[1]+x.new[2])*(1+0.5*abs(x.new[1]+x.new[2]))
    cycle<-matrix(c(x.new[1],x[2],x[3],x.new[1],x.new[2],x[3],x.new[1],x.new[2],x.new[3]),
                  ncol=3,byrow=T)
    return(cycle)
  }
  
  fpowell<-function(x){
    
    PostivePart<-function(x){
      ifelse(x>=0,x,0)
    }
    
    fval<-(-(x[1]*x[2]+x[2]*x[3]+x[1]*x[3]))+
      PostivePart(x[1]-1)^2+PostivePart(-x[1]-1)^2+
      PostivePart(x[2]-1)^2+PostivePart(-x[2]-1)^2+
      PostivePart(x[3]-1)^2+PostivePart(-x[3]-1)^2
    return(fval)
  }
  ############ operation part ################
  x.store<-matrix(ncol=3,nrow=cycles*3+1)
  x.store[1,]<-xstart
  for (i in seq_len(cycles)){
    x.store[(3*i-1):(3*i+1),]<-UpdateCycle(x.store[3*i-2,])
  }
  x.store<-x.store[-1,]
  fval<-rep(0,cycles*3)
  
  for(i in seq_len(cycles*3)){
    fval[i]<-fpowell(x.store[i,])
  }
  fval<-as.matrix(fval)
  
  if (fig==T){
    plot(fval,ylim=c(min(fval)-1,max(fval)+1),type="l",xlab="Iterations",ylab = "F value")
  }
  r<-list()
  r$x.iterate<-x.store
  r$fval<-fval
  return(r)
}


##################
#### Test 1 ########
##################


perturb<-0.5
xstart<-c(-1-perturb,1+0.5*perturb,-1-0.25*perturb)
cycles<-20

r<-PowellE1(xstart,cycles,fig=T)

##################
#### Test 2 ########
##################

perturb<-0.5
xstart<-c(-1-perturb,1+0.5*perturb,-1-0.25*perturb)
cycles<-20

r<-PowellE1(xstart,cycles,fig=T)

##################
#### Test 3 ########
##################

xstart<-c(3,2,1)
cycles<-100

r<-PowellE1(xstart,cycles,fig=T)

Ph.D. student

My research interests lie at the intersection of statistical modeling and optimization.

Related