Convergence Analysis for the Block Coordinate Descent Algorithm and Powell's Examples

Last updated on Jun 16, 2021 optimization

Problem description
Convergence Analysis
Powell's example
R codes for numerical experiments

We focus on the convergence of block coordinate descent with exact minimization, where the blocks are updated in a Gauss-Seidel manner. We then use Powell's example to see what can happen when some of the conditions are not met.

References: 1. Dimitri P. Bertsekas, Nonlinear Programming, 2nd ed. 2. M. J. D. Powell (1973), On Search Directions for Minimization Algorithms

Problem description

Notations

We want to solve the problem:

$\underset{x \in X}{\min} f(x)$

where $X$ is a Cartesian product of closed convex sets $X_{1}, \ldots, X_{m}$: $X = \prod_{i=1}^{m} X_{i}$.

We assume that $X_{i}$ is a closed convex subset of $R^{n_{i}}$ and $n = \sum_{i=1}^{m} n_{i}$. The vector $x$ is partitioned into $m$ blocks, $x = (x_{1}, \ldots, x_{m})$, such that $x_{i} \in X_{i}$.

We denote $\nabla_{i} f$ as the gradient of $f$ with respect to component $x_{i}$ .

Assumption

We shall assume that for every $x \in X$ and $i = 1, 2, \ldots, m$ the optimization problem

$\underset{ξ \in X_{i}}{\min} f(x_{1}, \ldots, x_{i-1}, ξ, x_{i+1}, \ldots, x_{m})$

has at least one solution.

Algorithm

The Gauss-Seidel method generates the next iterate $x^{k+1} = (x_{1}^{k+1}, \ldots, x_{m}^{k+1})$, given the current iterate $x^{k} = (x_{1}^{k}, \ldots, x_{m}^{k})$, according to the iteration

$x_{i}^{k+1} = \underset{ξ \in X_{i}}{\arg\min} f(x_{1}^{k+1}, \ldots, x_{i-1}^{k+1}, ξ, x_{i+1}^{k}, \ldots, x_{m}^{k}), \quad i = 1, \ldots, m$
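The iteration can be sketched in R as follows. This is a minimal sketch under stated assumptions: every block is a scalar restricted to an interval, and each one-dimensional subproblem is solved numerically with `optimize` rather than in closed form. The names `bcd_gauss_seidel` and `f_quad`, the bounds, and the test objective are hypothetical and are not taken from the references.

```r
# Cyclic (Gauss-Seidel) block coordinate descent with scalar blocks.
# Assumption: block i is the interval [lower[i], upper[i]] and each
# one-dimensional subproblem is solved numerically by optimize().
bcd_gauss_seidel <- function(f, x0, lower, upper, sweeps = 50) {
  x <- x0
  for (k in seq_len(sweeps)) {
    for (i in seq_along(x)) {
      # Minimize over block i; blocks 1..i-1 already hold their new values.
      g <- function(xi) { y <- x; y[i] <- xi; f(y) }
      x[i] <- optimize(g, lower = lower[i], upper = upper[i], tol = 1e-10)$minimum
    }
  }
  x
}

# Hypothetical strictly convex test objective whose minimizer is (1, 2).
f_quad <- function(x) (x[1] - 1)^2 + (x[2] - 2)^2 + 0.5 * (x[1] - 1) * (x[2] - 2)
x_hat <- bcd_gauss_seidel(f_quad, x0 = c(-3, 3), lower = c(-5, -5), upper = c(5, 5))
print(x_hat)
```

For this strictly convex quadratic every one-dimensional subproblem has a unique minimizer, so the sketch stays within the setting of the theorem in the next section.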

Convergence Analysis

Theorem Suppose that $f$ is continuously differentiable over the set $X$ defined as above. Furthermore, suppose that for each $i$ and $x \in X$,

$f(x_{1}, \ldots, x_{i-1}, ξ, x_{i+1}, \ldots, x_{m})$

viewed as a function of $ξ$, attains a unique minimum $\bar{ξ}$ over $X_{i}$ and is monotonically non-increasing in the interval from $x_{i}$ to $\bar{ξ}$. Let $\{x^{k}\}$ be the sequence generated by the block coordinate descent method with Gauss-Seidel updates. Then every limit point of $\{x^{k}\}$ is a stationary point.

PROOF

Let

$z_{i}^{k} = (x_{1}^{k+1}, \ldots, x_{i}^{k+1}, x_{i+1}^{k}, \ldots, x_{m}^{k})$

By the nature of this algorithm, for all $k \geq 0$ we have the following chain of inequalities:

$f(x^{k}) \geq f(z_{1}^{k}) \geq f(z_{2}^{k}) \geq \ldots \geq f(z_{m-1}^{k}) \geq f(x^{k+1}) \quad (*)$

Let $\bar{x} = (\bar{x}_{1}, \ldots, \bar{x}_{m})$ be a limit point of $\{x^{k}\}$ and let $\{x^{k_{j}}\}$ be a subsequence that converges to it. Since $\{x^{k}\} \subset X$ and $X$ is closed, $\bar{x} \in X$.

Now we want to prove that $\bar{x}$ is a stationary point of $f$.

Since $x_{1}^{k_{j}+1}$ minimizes $f$ over the first block, we know that

$f(z_{1}^{k_{j}}) \leq f(x_{1}, x_{2}^{k_{j}}, \ldots, x_{m}^{k_{j}}) \quad \forall x_{1} \in X_{1}$

From (*), $f(x^{k})$ is non-increasing and, by continuity of $f$, converges to $f(\bar{x})$; since $f(x^{k_{j}+1}) \leq f(z_{1}^{k_{j}}) \leq f(x^{k_{j}})$, so does $f(z_{1}^{k_{j}})$. Letting $j \to +\infty$, we derive

$f(\bar{x}) \leq f(x_{1}, \bar{x}_{2}, \ldots, \bar{x}_{m}) \overset{Δ}{=} h(x_{1}) \quad \forall x_{1} \in X_{1}$

which implies that $\bar{x}_{1}$ is a minimizer of $h(x_{1})$ over $X_{1}$. Using the optimality condition over a convex set, we conclude that

$\nabla h(\bar{x}_{1})^{T} (x_{1} - \bar{x}_{1}) \geq 0 \Leftrightarrow (x_{1} - \bar{x}_{1})^{T} \nabla_{1} f(\bar{x}) \geq 0 \quad \forall x_{1} \in X_{1}$

At this stage, if we can prove that $\{z_{1}^{k_{j}}\}$ converges to $\bar{x}$, we can show that

$(x_{2} - \bar{x}_{2})^{T} \nabla_{2} f(\bar{x}) \geq 0 \quad \forall x_{2} \in X_{2}$, since $x_{2}^{k_{j}+1}$ minimizes $f$ over the second block:

$f(z_{2}^{k_{j}}) = f(x_{1}^{k_{j}+1}, x_{2}^{k_{j}+1}, x_{3}^{k_{j}}, \ldots, x_{m}^{k_{j}}) \leq f(x_{1}^{k_{j}+1}, x_{2}, x_{3}^{k_{j}}, \ldots, x_{m}^{k_{j}}) \quad \forall x_{2} \in X_{2}$

Letting $j \to +\infty$, we derive

$f(\bar{x}) \leq f(\bar{x}_{1}, x_{2}, \bar{x}_{3}, \ldots, \bar{x}_{m}) \quad \forall x_{2} \in X_{2}$

and

$(x_{2} - \bar{x}_{2})^{T} \nabla_{2} f(\bar{x}) \geq 0 \quad \forall x_{2} \in X_{2}$

(Note: although $x_{1}^{k_{j}+1}$ may not belong to the sequence $\{x_{1}^{k_{t}}\}_{t \geq 1}$, which converges to $\bar{x}_{1}$, $\{z_{1}^{k_{j}}\}$ converges to $\bar{x}$, so its component $x_{1}^{k_{j}+1}$ converges to $\bar{x}_{1}$.)

Furthermore, if we prove that for $i = 1, 2, \ldots, m-1$, $\{z_{i}^{k_{j}}\}$ converges to $\bar{x}$, then we have

$(x_{i} - \bar{x}_{i})^{T} \nabla_{i} f(\bar{x}) \geq 0 \quad \forall x_{i} \in X_{i}, \ i = 1, \ldots, m$

And thus $\bar{x}$ is a stationary point, since $(x - \bar{x})^{T} \nabla f(\bar{x}) = \sum_{i=1}^{m} (x_{i} - \bar{x}_{i})^{T} \nabla_{i} f(\bar{x}) \geq 0$ for all $x \in X$.

It remains to prove that $\{z_{i}^{k_{j}}\}$ converges to $\bar{x}$ for all $i$. First, we prove that $\{z_{1}^{k_{j}}\}$ converges to $\bar{x}$.

Assume the contrary, that $r^{k_{j}} = \| z_{1}^{k_{j}} - x^{k_{j}} \|$ does not converge to 0. By possibly restricting to a subsequence, we may assume that there is some $\bar{r} > 0$ such that $r^{k_{j}} \geq \bar{r}$ for all $j$. Let $s_{1}^{k_{j}} = (z_{1}^{k_{j}} - x^{k_{j}}) / r^{k_{j}}$. Thus $z_{1}^{k_{j}} = x^{k_{j}} + r^{k_{j}} s_{1}^{k_{j}}$, $\| s_{1}^{k_{j}} \| = 1$, and $s_{1}^{k_{j}}$ differs from 0 only along the first block-component. Since $\{s_{1}^{k_{j}}\}$ belongs to a compact set, without loss of generality we assume that $s_{1}^{k_{j}}$ converges to $\bar{s}_{1}$.

Fix any $ϵ \in (0, 1)$. Since $0 < ϵ \bar{r} \leq r^{k_{j}}$, the point $x^{k_{j}} + ϵ \bar{r} s_{1}^{k_{j}}$ lies on the segment joining $x^{k_{j}}$ and $x^{k_{j}} + r^{k_{j}} s_{1}^{k_{j}} = z_{1}^{k_{j}}$. Using the non-increasing property of $f$ on this segment, we derive

$f(z_{1}^{k_{j}}) \leq f(x^{k_{j}} + ϵ \bar{r} s_{1}^{k_{j}}) \leq f(x^{k_{j}})$

Again, using (*), we conclude

$f(x^{k_{j+1}}) \leq f(z_{1}^{k_{j}}) \leq f(x^{k_{j}} + ϵ \bar{r} s_{1}^{k_{j}}) \leq f(x^{k_{j}})$

Letting $j \to +\infty$, we derive $f(\bar{x}) = f(\bar{x} + ϵ \bar{r} \bar{s}_{1})$ for every $ϵ \in (0, 1)$. Since $\bar{r} \bar{s}_{1} \neq 0$, this contradicts the hypothesis that $f$ is uniquely minimized when viewed as a function of the first block component. This contradiction establishes that $\{z_{1}^{k_{j}}\}$ converges to $\bar{x}$.

Similarly, let $r_{t}^{k_{j}} = \| z_{t}^{k_{j}} - z_{t-1}^{k_{j}} \|$ for $t = 2, 3, \ldots, m-1$; using the same technique, we finally prove that $\{z_{i}^{k_{j}}\}$ converges to $\bar{x}$ for all $i$.

Powell's example

In On Search Directions for Minimization Algorithms, Powell actually gives three examples in which the sequences generated by the algorithm discussed above do not converge to stationary points when some of the hypotheses are not met.

The first example is straightforward. However, its remarkable properties can be destroyed by making a small perturbation to the starting vector $x^{0}$.
The second example is not sensitive to small changes in the initial data or to small errors introduced during the iterative process, for example computer rounding errors.
The third example shows that an infinitely differentiable function can also cause an endless loop in the iterative minimization method.

We present only the first example here. Consider the following function

$f (x, y, z) = - (x y + y z + z x) + (x - 1)_{+}^{2} + (- x - 1)_{+}^{2} + (y - 1)_{+}^{2} + (- y - 1)_{+}^{2} + (z - 1)_{+}^{2} + (- z - 1)_{+}^{2}$

where

$(x - c)_{+}^{2} = \begin{cases} 0, & x - c < 0 \\ (x - c)^{2}, & x - c \geq 0 \end{cases}$

Starting from the point $x_{0} = (-1 - e, 1 + \frac{1}{2} e, -1 - \frac{1}{4} e)$ with $e > 0$, we apply the block coordinate descent algorithm and update the variables in the order $x \to y \to z \to x \ldots$ with

$x_{k+1} \leftarrow sign(y_{k} + z_{k}) [1 + \frac{1}{2} | y_{k} + z_{k} |]$

$y_{k+1} \leftarrow sign(x_{k+1} + z_{k}) [1 + \frac{1}{2} | x_{k+1} + z_{k} |]$

$z_{k+1} \leftarrow sign(x_{k+1} + y_{k+1}) [1 + \frac{1}{2} | x_{k+1} + y_{k+1} |]$

The first seven steps of this case are:

cycle/total iteration	x	y	z
1/1	1+ $\frac{1}{8} e$	1+ $\frac{1}{2} e$	-1- $\frac{1}{4} e$
1/2	1+ $\frac{1}{8} e$	-1- $\frac{1}{16} e$	-1- $\frac{1}{4} e$
1/3	1+ $\frac{1}{8} e$	-1- $\frac{1}{16} e$	1+ $\frac{1}{32} e$
2/4	-1- $\frac{1}{64} e$	-1- $\frac{1}{16} e$	1+ $\frac{1}{32} e$
2/5	-1- $\frac{1}{64} e$	1+ $\frac{1}{128} e$	1+ $\frac{1}{32} e$
2/6	-1- $\frac{1}{64} e$	1+ $\frac{1}{128} e$	-1- $\frac{1}{256} e$
3/7	1+ $\frac{1}{512} e$	1+ $\frac{1}{128} e$	-1- $\frac{1}{256} e$
...	...	...	...
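The iterates at the end of each cycle can be checked with a few lines of R. This is only a sketch: the helper names `upd` and `powell_cycle` are hypothetical, and the value `e <- 1` is an assumed perturbation chosen so that every quantity is an exact binary fraction.

```r
# One full cycle x -> y -> z of the closed-form updates from the text.
powell_cycle <- function(p) {
  upd <- function(s) sign(s) * (1 + 0.5 * abs(s))
  p[1] <- upd(p[2] + p[3])
  p[2] <- upd(p[1] + p[3])
  p[3] <- upd(p[1] + p[2])
  p
}

e <- 1  # assumed perturbation size; a power of two keeps the arithmetic exact
p0 <- c(-1 - e, 1 + e / 2, -1 - e / 4)
p1 <- powell_cycle(p0)  # iterate after iteration 3 (end of cycle 1)
p2 <- powell_cycle(p1)  # iterate after iteration 6 (end of cycle 2)
print(rbind(p1, p2))
```

With `e <- 1` the two rows should agree with rows 1/3 and 2/6 of the table, that is, perturbations of 1/8, 1/16, 1/32 and then 1/64, 1/128, 1/256.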

This result implies that the sequence generated by the algorithm cannot converge to a single point, since the $x$-coordinate changes its sign as odd and even cycles alternate. The situation is similar for the $y$-coordinate and the $z$-coordinate.

But $\{x_{k}\}$ has six subsequences, which converge to (1,1,-1), (1,-1,-1), (1,-1,1), (-1,-1,1), (-1,1,1), (-1,1,-1) respectively.
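The six limit points correspond to six sign patterns that repeat with period six in the iteration count. A small sketch that records the sign pattern after every single coordinate update, again with an assumed perturbation `e <- 1` and hypothetical variable names:

```r
upd <- function(s) sign(s) * (1 + 0.5 * abs(s))

e <- 1  # assumed perturbation size
p <- c(-1 - e, 1 + e / 2, -1 - e / 4)
signs <- matrix(NA_real_, nrow = 12, ncol = 3,
                dimnames = list(NULL, c("x", "y", "z")))
for (t in 1:12) {
  i <- (t - 1) %% 3 + 1      # coordinate updated at iteration t
  p[i] <- upd(sum(p[-i]))    # exact minimization over coordinate i
  signs[t, ] <- sign(p)      # sign pattern = nearest vertex of the cube
}
print(signs)
```

Rows 1 to 6 of `signs` list the six vertices in the order in which they are visited, and rows 7 to 12 repeat them.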

Remark

A hint to derive the update formula:

$x \leftarrow sign(y + z) [1 + \frac{1}{2} | y + z |]$

Indeed, the derivatives of $(x - 1)_{+}^{2}$ and $(-x - 1)_{+}^{2}$ are, respectively,

$\frac{d (x - 1)_{+}^{2}}{d x} = \begin{cases} 2 (x - 1), & x \geq 1 \\ 0, & x < 1 \end{cases} \qquad \frac{d (-x - 1)_{+}^{2}}{d x} = \begin{cases} 2 (x + 1), & x \leq -1 \\ 0, & x > -1 \end{cases}$

So for the univariate optimization problem, setting the derivative of $g(x) = f(x, y, z)$ to zero, we conclude

$\frac{\partial f(x, y, z)}{\partial x} = 0 \Rightarrow \begin{cases} x \geq 1 : x = 1 + \frac{1}{2} (y + z) \\ -1 < x < 1 : -(y + z) = 0 \\ x \leq -1 : x = -1 + \frac{1}{2} (y + z) \end{cases}$

The first case requires $y + z \geq 0$ and the third requires $y + z \leq 0$, which gives the sign formula above.
In the limit along this cyclic path, that is, at the six vertices, the penalty terms vanish, so the gradient of $f$ is $\nabla f(x, y, z) = (-y - z, -x - z, -x - y)$ and $\| \nabla f(x, y, z) \|_{1} = 2$.
This example is unstable with respect to small perturbations. Small changes in the starting point $x_{0} = (-1 - e, 1 + \frac{1}{2} e, -1 - \frac{1}{4} e)$ or small errors in the numbers that are computed during the calculation will destroy the cyclic behavior.
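As a sanity check on the derivation, the closed-form update can be compared with a numerical one-dimensional minimization. The sketch below assumes arbitrary values of `y` and `z` with a nonzero sum and a search interval of [-10, 10]; the names `pos2` and `f_powell` are hypothetical.

```r
pos2 <- function(t) pmax(t, 0)^2   # squared positive part (t)_+^2

f_powell <- function(x, y, z) {
  -(x * y + y * z + z * x) +
    pos2(x - 1) + pos2(-x - 1) +
    pos2(y - 1) + pos2(-y - 1) +
    pos2(z - 1) + pos2(-z - 1)
}

y <- 1.5; z <- -1.25               # assumed values with y + z != 0
s <- y + z
x_closed  <- sign(s) * (1 + 0.5 * abs(s))
x_numeric <- optimize(function(x) f_powell(x, y, z),
                      lower = -10, upper = 10)$minimum
print(c(closed = x_closed, numeric = x_numeric))
```

The two values should agree up to the tolerance of `optimize`, since $f$ is convex in $x$ for fixed $y$ and $z$.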

It is clear that the choice of the perturbations plays a key role. Say $x_{0} = (-1 - e_{1}, 1 + e_{2}, -1 - e_{3})$; then the later perturbations satisfy $e_{k} = \frac{1}{2} (e_{k-2} - e_{k-1})$ for $k \geq 4$, and the iterates are

cycle/total iteration	x	y	z
1/1	1+ $e_{4}$	1+ $e_{2}$	-1- $e_{3}$
1/2	1+ $e_{4}$	-1- $e_{5}$	-1- $e_{3}$
1/3	1+ $e_{4}$	-1- $e_{5}$	1+ $e_{6}$
2/4	-1- $e_{7}$	-1- $e_{5}$	1+ $e_{6}$
2/5	-1- $e_{7}$	1+ $e_{8}$	1+ $e_{6}$
2/6	-1- $e_{7}$	1+ $e_{8}$	-1- $e_{9}$
...	...	...	...

To preserve the cyclic behavior, we have to make sure that $e_{k-2} > e_{k-1}$ for every $k \geq 4$, so that each $e_{k}$ stays positive.
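The recurrence is easy to iterate directly. The sketch below compares Powell's ratios $e_{1} : e_{2} : e_{3} = 1 : \frac{1}{2} : \frac{1}{4}$ with a hypothetical perturbed start in which only $e_{3}$ is changed; the function name `perturbation_sequence` and the value 3/8 are assumptions made for illustration.

```r
# Iterate e_k = (e_{k-2} - e_{k-1}) / 2 for k >= 4, given e_1, e_2, e_3.
perturbation_sequence <- function(e123, n = 12) {
  e <- numeric(n)
  e[1:3] <- e123
  for (k in 4:n) e[k] <- 0.5 * (e[k - 2] - e[k - 1])
  e
}

e_powell <- perturbation_sequence(c(1, 1 / 2, 1 / 4))  # Powell's ratios
e_other  <- perturbation_sequence(c(1, 1 / 2, 3 / 8))  # hypothetical perturbed start
first_bad <- which(e_other <= 0)[1]                    # first k with e_k <= 0
print(e_powell)
print(first_bad)
```

With Powell's ratios the sequence halves at every step and stays positive. For the perturbed start the condition $e_{k-2} > e_{k-1}$ fails within a few steps, after which the sign pattern in the table no longer holds.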

In practice, numerical tests show that this theoretically endless loop actually breaks down because of rounding errors. A brief illustration is given below. In this experiment, the loop ends at step 52.
As
$\frac{\partial f(x, y, z)}{\partial x} = 0 \Rightarrow \begin{cases} x \geq 1 : x = 1 + \frac{1}{2} (y + z) \\ -1 < x < 1 : -(y + z) = 0 \\ x \leq -1 : x = -1 + \frac{1}{2} (y + z) \end{cases}$

suggests, when $y + z = 0$ every $x$ with $-1 \leq x \leq 1$ minimizes $f$ over the first coordinate, so the choice of $x$ is arbitrary, and we set $x^{*} = 0$ in this case. So the uniqueness requirement is violated. It turns out that the six vertices are not even stationary points.

For example, at the point $\bar{x} = (1, 1, -1)$, $\nabla f(\bar{x}) = (0, 0, -2)$, and for any point $x$ in the cube $[-1, 1]^{3}$ we have $(x - \bar{x})^{T} \nabla f(\bar{x}) \leq 0$. Say $x = (0.9, 0.9, -0.9)$; then $(x - \bar{x})^{T} \nabla f(\bar{x}) = -0.2 < 0$.
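These numbers can be reproduced with the analytic gradient of $f$. A minimal sketch, with hypothetical helper names `pos` and `grad_powell`:

```r
pos <- function(t) pmax(t, 0)        # positive part (t)_+

# Gradient of Powell's function at p = (x, y, z).
grad_powell <- function(p) {
  s <- sum(p) - p                    # sum of the other two coordinates
  -s + 2 * pos(p - 1) - 2 * pos(-p - 1)
}

x_bar <- c(1, 1, -1)
g <- grad_powell(x_bar)
x <- c(0.9, 0.9, -0.9)
dir_deriv <- sum((x - x_bar) * g)    # (x - x_bar)^T grad f(x_bar)
print(g)
print(dir_deriv)
```

A strictly negative value of `dir_deriv` means that moving from $\bar{x}$ towards $x$ decreases $f$ to first order.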

In the proof of the Theorem, we showed that $\{z_{1}^{k_{j}}\}$ converges to $\bar{x}$, where $\bar{x}$ is the limit point of $\{x^{k_{j}}\}$. But in this example, taking the iterates at the end of the even cycles, the limit point of $\{x^{k_{j}}\}$ is (-1,1,-1) while the limit point of $\{z_{1}^{k_{j}}\}$ is (1,1,-1); for the odd cycles the two limits are (1,-1,1) and (-1,-1,1). The two limits differ because the requirement of uniqueness is not met.

R codes for numerical experiments

####################
### Function for test ###
####################

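# xstart: starting point (length 3); cycles: number of x -> y -> z cycles;
# fig: if TRUE, plot the objective value against the iteration count.
PowellE1<-function(xstart,cycles,fig=TRUE){
  #######function part ##############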
  # One full cycle x -> y -> z of the closed-form coordinate updates.
  # Returns a 3 x 3 matrix whose rows are the iterates after each update.
  UpdateCycle<-function(x){
    x.new<-numeric(3)
    # base sign() returns 0 at 0, so a zero sum sends the coordinate to 0
    x.new[1]<-sign(x[2]+x[3])*(1+0.5*abs(x[2]+x[3]))
    x.new[2]<-sign(x.new[1]+x[3])*(1+0.5*abs(x.new[1]+x[3]))
    x.new[3]<-sign(x.new[1]+x.new[2])*(1+0.5*abs(x.new[1]+x.new[2]))
    cycle<-matrix(c(x.new[1],x[2],x[3],x.new[1],x.new[2],x[3],x.new[1],x.new[2],x.new[3]),
                  ncol=3,byrow=TRUE)
    return(cycle)
  }
  
  # Powell's objective function evaluated at x = (x, y, z)
  fpowell<-function(x){
    
    # positive part (t)_+ = max(t, 0)
    PositivePart<-function(x){
      ifelse(x>=0,x,0)
    }
    
    fval<-(-(x[1]*x[2]+x[2]*x[3]+x[1]*x[3]))+
      PositivePart(x[1]-1)^2+PositivePart(-x[1]-1)^2+
      PositivePart(x[2]-1)^2+PositivePart(-x[2]-1)^2+
      PositivePart(x[3]-1)^2+PositivePart(-x[3]-1)^2
    return(fval)
  }
  ############ operation part ################
  x.store<-matrix(ncol=3,nrow=cycles*3+1)
  x.store[1,]<-xstart
  for (i in seq_len(cycles)){
    x.store[(3*i-1):(3*i+1),]<-UpdateCycle(x.store[3*i-2,])
  }
  x.store<-x.store[-1,]
  fval<-rep(0,cycles*3)
  
  for(i in seq_len(cycles*3)){
    fval[i]<-fpowell(x.store[i,])
  }
  fval<-as.matrix(fval)
  
  if (isTRUE(fig)){
    plot(fval,ylim=c(min(fval)-1,max(fval)+1),type="l",xlab="Iterations",ylab = "F value")
  }
  r<-list()
  r$x.iterate<-x.store
  r$fval<-fval
  return(r)
}


##################
#### Test 1 ########
##################


perturb<-0.5
xstart<-c(-1-perturb,1+0.5*perturb,-1-0.25*perturb)
cycles<-20

r<-PowellE1(xstart,cycles,fig=T)

##################
#### Test 2 ########
##################

perturb<-0.5
xstart<-c(-1-perturb,1+0.5*perturb,-1-0.25*perturb)
cycles<-20

r<-PowellE1(xstart,cycles,fig=T)

##################
#### Test 3 ########
##################

xstart<-c(3,2,1)
cycles<-100

r<-PowellE1(xstart,cycles,fig=T)

Powell-Example
