Gradient Descent in Logistic Regression
Problem Formulation
There are two common ways of formulating the logistic regression problem, depending on how we label the response variable.
First Formulation:
Consider restricting the response variable to $y_i \in \{-1, +1\}$. Given data $\{(a_i, y_i)\}_{i=1}^n$ with $a_i \in \mathbb{R}^d$, maximizing the likelihood is equivalent to minimizing

$$f(x) = \frac{1}{n} \sum_{i=1}^n \log\left(1 + \exp(-y_i a_i^T x)\right).$$
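For concreteness, here is a minimal Python sketch of this loss. The function name `logistic_loss` and the use of NumPy are my own choices, not taken from the code linked in the references.

```python
import numpy as np

def logistic_loss(x, A, y):
    """First-formulation loss: f(x) = (1/n) * sum_i log(1 + exp(-y_i * a_i^T x)).
    A is an (n, d) array whose rows are the feature vectors a_i;
    y is an (n,) array of labels in {-1, +1}."""
    z = y * (A @ x)                        # margins z_i = y_i * a_i^T x
    return np.mean(np.logaddexp(0.0, -z))  # log(1 + exp(-z_i)), computed stably
```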
Motivating Example
Consider two simulated datasets:
Dataset 1:

| $a_{i1}$ | $a_{i2}$ | $y_i$ |
|---|---|---|
| 0.3 | 0.9 | 1 |
| 0.5 | 1.5 | -1 |

Dataset 2:

| $a_{i1}$ | $a_{i2}$ | $y_i$ |
|---|---|---|
| 2 | 1 | 1 |
| -1 | -1 | -1 |
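Assuming, as Figure 2 suggests, a linear classifier through the origin (no intercept term), the two datasets could be encoded as follows; the array names `A1, y1, A2, y2` are mine.

```python
import numpy as np

# Dataset 1: the two feature vectors are positive multiples of each other,
# so no line through the origin separates the two labels (non-separable).
A1 = np.array([[0.3, 0.9],
               [0.5, 1.5]])
y1 = np.array([1.0, -1.0])

# Dataset 2: the two points lie on opposite sides of a line through
# the origin, so the labels are linearly separable.
A2 = np.array([[2.0, 1.0],
               [-1.0, -1.0]])
y2 = np.array([1.0, -1.0])
```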
Some Analysis:
- The objective function $f(x) = \frac{1}{n}\sum_{i=1}^n \log\left(1 + \exp(-y_i a_i^T x)\right)$ is strictly convex, as its Hessian is positive definite. However, it is not strongly convex.
- For a given dataset, the Hessian is upper bounded by $\frac{1}{4n}\sum_{i=1}^n a_i a_i^T$ (see Appendix).
- The stepsize can therefore be chosen as $\alpha = 1/L$ with $L = \lambda_{\max}\!\left(\frac{1}{4n}\sum_{i=1}^n a_i a_i^T\right)$, as sketched below.
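A sketch of how the smoothness constant and stepsize could be computed from the data, assuming the Hessian bound above and reusing the arrays defined earlier; `lipschitz_constant` is a hypothetical helper, not the author's code.

```python
def lipschitz_constant(A):
    """Smoothness constant L: the Hessian is bounded above by (1/(4n)) * A^T A,
    so its largest eigenvalue bounds the curvature everywhere."""
    n = A.shape[0]
    return np.linalg.eigvalsh((A.T @ A) / (4.0 * n)).max()

alpha1 = 1.0 / lipschitz_constant(A1)  # constant stepsize for Dataset 1
alpha2 = 1.0 / lipschitz_constant(A2)  # constant stepsize for Dataset 2
```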
Applying gradient descent with the constant stepsize $\alpha = 1/L$, the final iterate $x = (x_1, x_2)$ for each dataset is:
| Dataset | $x_1$ | $x_2$ |
|---|---|---|
| 1 | -0.12058225 | -0.36174676 |
| 2 | 3.59370507 | 3.04825501 |
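These runs could be reproduced with a plain gradient-descent loop like the sketch below, reusing `logistic_loss`, the data arrays, and the stepsizes from earlier. The initialization $x_0 = 0$ and the iteration count are my assumptions, so the exact final numbers may differ from the table.

```python
def gradient(x, A, y):
    """Gradient of the first-formulation loss (derived in the Appendix):
    grad f(x) = -(1/n) * sum_i sigma(-z_i) * y_i * a_i,  with z_i = y_i * a_i^T x."""
    n = A.shape[0]
    z = y * (A @ x)
    s = 1.0 / (1.0 + np.exp(z))     # sigma(-z_i)
    return -(A.T @ (s * y)) / n

def gd_constant_step(A, y, alpha, num_iters=1000):
    """Plain gradient descent with a constant stepsize, started from x_0 = 0."""
    x = np.zeros(A.shape[1])
    history = []                    # (f(x_k), ||grad f(x_k)||, ||x_k||) per iteration
    for _ in range(num_iters):
        g = gradient(x, A, y)
        history.append((logistic_loss(x, A, y), np.linalg.norm(g), np.linalg.norm(x)))
        x = x - alpha * g
    return x, history

x_final_1, hist1 = gd_constant_step(A1, y1, alpha1)
x_final_2, hist2 = gd_constant_step(A2, y2, alpha2)
```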
We also plot the following figures to check convergence. The top two figures describe the algorithm's performance on dataset 1, while the bottom two are for dataset 2.

Fig1: Applying GD with a constant stepsize on the two datasets. The blue curves depict how the norm of the gradient at the iterates changes, while the red curves show how the function value changes at each iteration.
Analysis: why does this happen?
First, suppose we want to minimize $f(x) = \frac{1}{n}\sum_{i=1}^n \log\left(1 + \exp(-y_i a_i^T x)\right)$ when the data is linearly separable.

- The global minimum is not attainable, i.e., there is no finite $x^*$ with $\nabla f(x^*) = 0$, though we can have $f(x) \to 0$ as $\|x\| \to \infty$ along a separating direction, which means $\inf_x f(x) = 0$; hence the iterates diverge (see the quick numerical check after this list).
- Indeed, $f(x_k)$ converges to $\inf_x f(x)$ by the monotone convergence theorem, as it is monotonically decreasing and lower bounded by $0$.
- The worst-case iteration complexity is $O(1/\epsilon)$, indicating a sublinear convergence rate.
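A quick numerical check of the first bullet, reusing `logistic_loss` and Dataset 2 from the earlier sketches; the separating direction $w = (1, 0)$ is just one convenient choice.

```python
# Scaling a separating direction drives the loss toward 0 without ever reaching it,
# while the norm of the point grows without bound.
w = np.array([1.0, 0.0])   # a_i^T w has the same sign as y_i for Dataset 2
for c in [1.0, 10.0, 100.0, 1000.0]:
    print(c, logistic_loss(c * w, A2, y2))
# The printed losses decrease toward 0, so inf f = 0 is not attained at any finite x.
```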
Now, let's go back to the example. Figure 2 shows the first and second datasets, which correspond to the non-separable and separable cases, respectively.

Fig2: (Left) First dataset. (Right) Second dataset. The fitted separating line is given by $\{a : a^T \hat{x} = 0\}$, where $\hat{x}$ is the final iterate.
We also plot the norm of the iterates at each iteration in Figure 3.

Fig3: The top figure shows the norm of the iterates for the first dataset, while the bottom one shows the case for the second dataset.
We can see that in the non-separable case the norm of the iterates stays bounded, while in the separable case it goes to infinity (if we increase the number of iterations).
In the non-separable case, the minimum is attained at a finite point, and the objective behaves like a strongly convex function on a range containing that minimizer, as Figure 4 and Figure 5 illustrate.

Fig4: (Left) Non-separable dataset (Dataset 1). The objective function (blue line) preserves strong convexity over a certain range, and the minimum stays within this range; the green dotted line marks the minimum, and the red point is the starting point $x_0$. (Right) Separable dataset (Dataset 2). Although the objective function is also endowed with strong convexity over a certain range, the global minimum lies outside of this range.

Fig5: Dataset 1 is shown in the top two pictures, with the right one zooming into a particular range. Dataset 2 is shown in the bottom pictures. The blue dots trace the progression of the iterates.
Questions
- Why does separability cause such a difference? From Fig4 and Fig5, we see that the data, which parameterizes the objective, can strongly influence its shape. Given a dataset, can we predict the behavior of gradient descent with a constant stepsize, i.e., whether it converges at a linear or sublinear rate? Can we extend the conclusion to higher dimensions?
- In real-world applications, the data is often only nearly separable, i.e., most data points can be split into two groups with a few exceptions. How does that influence the performance of the algorithm?
- Will the second formulation (see below) encounter a similar issue? My guess is yes.
Appendix
Second Formulation of Logistic Regression
Consider restricting the response variable to $y_i \in \{0, 1\}$ and modeling $\Pr(y_i = 1 \mid a_i) = \sigma(a_i^T x)$, where $\sigma(t) = \frac{1}{1 + \exp(-t)}$.
If we use the second formulation, then maximizing the likelihood is equivalent to minimizing the cross-entropy loss

$$f(x) = -\frac{1}{n} \sum_{i=1}^n \left[ y_i \log \sigma(a_i^T x) + (1 - y_i) \log\left(1 - \sigma(a_i^T x)\right) \right].$$
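A minimal sketch of this loss in the same style as before; `logistic_loss_01` is a hypothetical name, and the labels are assumed to be stored as 0/1 floats.

```python
def logistic_loss_01(x, A, y01):
    """Second-formulation (cross-entropy) loss with labels y_i in {0, 1}:
    f(x) = -(1/n) * sum_i [ y_i log sigma(a_i^T x) + (1 - y_i) log(1 - sigma(a_i^T x)) ]."""
    z = A @ x
    # log sigma(z) = -log(1 + exp(-z)) and log(1 - sigma(z)) = -log(1 + exp(z))
    return np.mean(y01 * np.logaddexp(0.0, -z) + (1.0 - y01) * np.logaddexp(0.0, z))
```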
Derivation of the gradient and Hessian of the loss function (first formulation)
Consider $f(x) = \frac{1}{n}\sum_{i=1}^n \log\left(1 + \exp(-y_i a_i^T x)\right)$ with $y_i \in \{-1, +1\}$, and write $z_i = y_i a_i^T x$. Differentiating term by term,

$$\nabla f(x) = -\frac{1}{n} \sum_{i=1}^n \sigma(-z_i)\, y_i a_i, \qquad
\nabla^2 f(x) = \frac{1}{n} \sum_{i=1}^n \sigma(z_i)\,\sigma(-z_i)\, a_i a_i^T,$$

where $\sigma(t) = \frac{1}{1+\exp(-t)}$ and we used $y_i^2 = 1$. Since $\sigma(t)\,\sigma(-t) \le \frac{1}{4}$ for all $t$, the Hessian satisfies $\nabla^2 f(x) \preceq \frac{1}{4n}\sum_{i=1}^n a_i a_i^T$, which gives the smoothness constant $L$ used for the stepsize.
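These formulas can be sanity-checked numerically. The sketch below compares the analytic gradient with central finite differences and verifies the Hessian upper bound on Dataset 2, reusing the earlier helpers; the name `hessian` and the test point are my own.

```python
def hessian(x, A, y):
    """Hessian of the first-formulation loss:
    (1/n) * sum_i sigma(z_i) * sigma(-z_i) * a_i a_i^T,  with z_i = y_i * a_i^T x."""
    n = A.shape[0]
    z = y * (A @ x)
    w = 1.0 / ((1.0 + np.exp(-z)) * (1.0 + np.exp(z)))   # sigma(z_i) * sigma(-z_i) <= 1/4
    return (A.T * w) @ A / n

x = np.array([0.3, -0.2])   # arbitrary test point
eps = 1e-6
num_grad = np.array([
    (logistic_loss(x + eps * e, A2, y2) - logistic_loss(x - eps * e, A2, y2)) / (2 * eps)
    for e in np.eye(2)
])
print(np.allclose(num_grad, gradient(x, A2, y2), atol=1e-6))    # expect True

# The Hessian is dominated by (1/(4n)) * A^T A, so the gap is positive semi-definite.
gap = (A2.T @ A2) / (4.0 * A2.shape[0]) - hessian(x, A2, y2)
print(np.linalg.eigvalsh(gap).min() >= -1e-12)                  # expect True
```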
Reference:
Lecture notes 9 and 10 presented on this course website.
The code for generating graphs can be found in my git repo.