I present the numerical optimization algorithm known as the Conjugate Gradient. The source code for all algorithms and plots can be found here.
I encourage you to do two things as you read this series of posts. First, understand and interact with the presented plots. If you understand the geometry of a topic, everything else usually comes naturally. The plots presented here allow for quite a bit of interaction, including zooming, rotating, and panning. Second, I encourage you to read, understand, and experiment with the source code.
Preliminaries
I will frequently refer to various definitions and results from the book Linear Algebra Done Right, Third Edition by Sheldon Axler.
I will frequently use the inner product and its properties. The inner product of vectors $u$ and $v$ will be denoted by $\innprd{u}{v}$ or occasionally by $u^Tv$. For a real (i.e. based on $\wR$) inner product space $V$, here are some useful properties of the inner product and its corresponding norm:
- $\innprd{u}{v}=\innprd{v}{u}$ for all $u,v$ in $V$
- $\innprd{\lambda u}{v}=\innprd{u}{\lambda v}=\lambda\innprd{u}{v}$ for all $\lambda\in\wR$ and all $u,v\in V$
- $\innprd{v}{v}\geq0$ for all $v\in V$
- $\innprd{v}{v}=0$ if and only if $v=0$
- $0=\innprd{u}{0}=\innprd{0}{u}$ for all $u\in V$
- $\innprd{u+v}{w}=\innprd{u}{w}+\innprd{v}{w}$ for all $u,v,w\in V$
- $\innprd{u}{v+w}=\innprd{u}{v}+\innprd{u}{w}$ for all $u,v,w\in V$
- $\innprd{Au}{v}=\innprd{u}{A^Tv}$ for any matrix $A$ on $V$ and for all $u,v\in V$
- $\dnorm{v} = \sqrt{\innprd{v}{v}}$ for all $v\in V$
- $\dnorm{v} = 0$ if and only if $v=0$
- $v = \sum_{j=1}^n\innprd{v}{v_j}v_j$ for all $v\in V$ and for any orthonormal basis $v_1,\dots,v_n$ of $V$
- $\sum_{j=1}^ra_j^2 = \dnorm{\sum_{j=1}^ra_jv_j}^2$ for any orthonormal list $v_1,\dots,v_r$ in $V$ and for any $a_1,\dots,a_r\in\wR$
I will frequently use the linearity property of a matrix. That is, for a matrix $A$ on $V$ and vectors $u,v\in V$ and scalars $\alpha,\beta\in\wR$, we have:
$$\align{ A(\alpha u+\beta v) =\alpha Au+\beta Av }$$
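To make these facts concrete, here is a quick NumPy sketch (my own, separate from the post's source code) that spot-checks a few of the inner product properties and the linearity of $A$ on random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
u, v = rng.standard_normal(n), rng.standard_normal(n)
A = rng.standard_normal((n, n))
alpha, beta = 2.0, -3.0

# Symmetry of the real inner product: <u, v> = <v, u>
assert np.isclose(np.inner(u, v), np.inner(v, u))

# Adjoint identity: <Au, v> = <u, A^T v>
assert np.isclose(np.inner(A @ u, v), np.inner(u, A.T @ v))

# Norm induced by the inner product: ||v|| = sqrt(<v, v>)
assert np.isclose(np.linalg.norm(v), np.sqrt(np.inner(v, v)))

# Linearity of A: A(alpha*u + beta*v) = alpha*Au + beta*Av
assert np.allclose(A @ (alpha * u + beta * v), alpha * (A @ u) + beta * (A @ v))
```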
Recall that a matrix $A$ is symmetric if $A=A^T$.
The Problem
Let $A$ be an $n\times n$ matrix and let $b$ be an $n\times1$ vector. We wish to find an $n\times1$ vector $x_{*}$ such that
$$ Ax_{*}=b\tag{TP.1} $$
Consider the quadratic form $f:\wR^n\to\wR$ defined by
$$\align{ f(x) &\equiv \frac12\innprd{Ax}{x}-\innprd{b}{x}+c\tag{TP.2} \\ &= \frac12x^TAx-b^Tx+c }$$
where $c\in\wR$ is some scalar. Note that the gradient of $f$ is
$$ f'(x) = \frac12A^Tx+\frac12Ax-b\tag{TP.3} $$
If $A$ is symmetric, this reduces to
$$ f'(x) = Ax-b\tag{TP.4} $$
Setting the gradient of $f$ to zero, we see that, when $A$ is symmetric, $x$ is a critical point of $f$ if and only if it is a solution to $Ax=b$. That is, we obtain equation TP.1, the linear system we wish to solve.
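As a quick sanity check on TP.3 and TP.4, here is a small NumPy sketch (not part of the linked source code) that compares the gradient formula against a central finite-difference approximation and confirms the symmetric-case simplification:

```python
import numpy as np

def f(x, A, b, c=0.0):
    # Quadratic form TP.2: f(x) = (1/2) x^T A x - b^T x + c
    return 0.5 * x @ A @ x - b @ x + c

def grad_f(x, A, b):
    # Gradient TP.3: f'(x) = (1/2) A^T x + (1/2) A x - b
    return 0.5 * A.T @ x + 0.5 * A @ x - b

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))  # not necessarily symmetric
b = rng.standard_normal(n)
x = rng.standard_normal(n)

# Central finite differences agree with TP.3
eps = 1e-6
fd = np.array([(f(x + eps * e, A, b) - f(x - eps * e, A, b)) / (2 * eps)
               for e in np.eye(n)])
assert np.allclose(fd, grad_f(x, A, b), atol=1e-5)

# For a symmetric matrix, TP.3 reduces to TP.4: f'(x) = Ax - b
S = 0.5 * (A + A.T)
assert np.allclose(grad_f(x, S, b), S @ x - b)
```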
Recall the definition of a positive definite matrix: a symmetric matrix $A$ is positive definite if and only if, for every nonzero vector $x$, we have
$$ \innprd{Ax}{x}\gt0\tag{TP.5} $$
Proposition TP.6 Suppose $A$ is positive definite (and hence symmetric) and define $f$ by TP.2. Then $x_{*}$ is a solution to $Ax_{*}=b$ if and only if it is the global minimum point of $f$.
Proof First suppose that $x_{*}$ is the global minimum point of $f$. Then $x_{*}$ is a critical point of $f$ and we have
$$ 0 = f'(x_{*}) = Ax_{*}-b $$
In the other direction, suppose that $x_{*}$ is a solution to $Ax_{*}=b$. Let $x\in\wR^n$ and put $e\equiv x-x_{*}$. Then
$$\align{ f(x) &= f(x_{*}+e) \\ &= \frac12\innprd{A(x_{*}+e)}{x_{*}+e}-\innprd{b}{x_{*}+e}+c\tag{by TP.2} \\ &= \frac12\innprd{Ax_{*}+Ae}{x_{*}+e}-\innprd{b}{e}-\innprd{b}{x_{*}}+c \\ &= \frac12\innprd{Ax_{*}+Ae}{x_{*}}+\frac12\innprd{Ax_{*}+Ae}{e}-\innprd{b}{e}-\innprd{b}{x_{*}}+c \\ &= \frac12\innprd{Ax_{*}}{x_{*}}+\frac12\innprd{Ae}{x_{*}}+\frac12\innprd{Ax_{*}}{e}+\frac12\innprd{Ae}{e}-\innprd{b}{e}-\innprd{b}{x_{*}}+c \\ &= \frac12\innprd{Ax_{*}}{x_{*}}+\frac12\innprd{e}{A^Tx_{*}}+\frac12\innprd{Ax_{*}}{e}+\frac12\innprd{Ae}{e}-\innprd{b}{e}-\innprd{b}{x_{*}}+c \\ &= \frac12\innprd{Ax_{*}}{x_{*}}+\frac12\innprd{e}{Ax_{*}}+\frac12\innprd{Ax_{*}}{e}+\frac12\innprd{Ae}{e}-\innprd{b}{e}-\innprd{b}{x_{*}}+c\tag{by symmetry of $A$} \\ &= \frac12\innprd{Ax_{*}}{x_{*}}+\innprd{Ax_{*}}{e}+\frac12\innprd{Ae}{e}-\innprd{b}{e}-\innprd{b}{x_{*}}+c \\ &= \frac12\innprd{Ax_{*}}{x_{*}}+\innprd{b}{e}+\frac12\innprd{Ae}{e}-\innprd{b}{e}-\innprd{b}{x_{*}}+c\tag{since $Ax_{*}=b$} \\ &= \frac12\innprd{Ax_{*}}{x_{*}}+\frac12\innprd{Ae}{e}-\innprd{b}{x_{*}}+c \\ &= \frac12\innprd{Ax_{*}}{x_{*}}-\innprd{b}{x_{*}}+c+\frac12\innprd{Ae}{e} \\ &= f(x_{*})+\frac12\innprd{Ae}{e} }$$
If $x\neq x_{*}$, then $e$ is nonzero and $f(x)\gt f(x_{*})$ by the positive definite definition TP.5. Hence $x_{*}$ is the global minimum point of $f$. For reference, we succinctly state
$$ f(x) = f(x_{*})+\frac12\innprd{A(x-x_{*})}{x-x_{*}}\tag{TP.7} \\ $$
$\wes$
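TP.7 is also easy to verify numerically. The sketch below (an illustration, not the post's source code) builds a matrix that is positive definite by construction, solves $Ax_{*}=b$ directly, and checks that $f(x)$ exceeds $f(x_{*})$ by exactly $\frac12\innprd{Ae}{e}$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)      # symmetric positive definite by construction
b = rng.standard_normal(n)
c = 1.5

f = lambda x: 0.5 * x @ A @ x - b @ x + c   # TP.2

x_star = np.linalg.solve(A, b)   # the solution of A x = b
x = rng.standard_normal(n)       # an arbitrary point
e = x - x_star

# TP.7: f(x) = f(x_star) + (1/2) <Ae, e>, and the extra term is positive
assert np.isclose(f(x), f(x_star) + 0.5 * (A @ e) @ e)
assert (A @ e) @ e > 0
```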
Unless stated otherwise, we will assume that $A$ is positive definite and hence symmetric. And we will make frequent use of the following fact: the eigenvalues of a positive definite matrix are positive. To see this, let $\lambda$ be an eigenvalue of $A$ corresponding to the eigenvector $v$. That is
$$ Av=\lambda v $$
Taking the inner product of both sides with $v$, we get
$$ 0\lt\innprd{Av}{v}=\innprd{\lambda v}{v}=\lambda\innprd{v}{v} $$
The inequality on the left follows from the positive definite definition TP.5. Since eigenvectors are nonzero, $\innprd{v}{v}\gt0$. Hence $\lambda\gt0$.
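The sketch below checks this fact numerically on a matrix that is positive definite by construction (again an illustrative assumption, not anything taken from the post's source code):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)            # symmetric positive definite by construction

# Every eigenvalue of a positive definite matrix is strictly positive
eigenvalues = np.linalg.eigvalsh(A)    # eigvalsh assumes a symmetric matrix
assert np.all(eigenvalues > 0)

# Equivalently, <Ax, x> > 0 for nonzero x (checked on random samples)
for _ in range(100):
    x = rng.standard_normal(n)
    assert (A @ x) @ x > 0
```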
Steepest Descent
Steepest Descent is an iterative optimization algorithm. We will see later that the Conjugate Gradient can be interpreted as an improvement of Steepest Descent. So we will quickly derive Steepest Descent and then identify its inefficiency.
Let $x_{(i)}$ denote the estimated value for $x_{*}$ after the $i^{th}$ iteration. We define the error $e_{(i)}\equiv x_{(i)}-x_{*}$ and the residual $r_{(i)}\equiv b-Ax_{(i)}$. Note that TP.4 gives
$$ r_{(i)} = b - Ax_{(i)} = -f'(x_{(i)}) $$
Recall that $-f'(x_{(i)})$ gives the direction of steepest descent from the point $x_{(i)}$. It's helpful to keep in mind that, after the $i^{th}$ iteration, the residual $r_{(i)}=-f'(x_{(i)})$ gives the direction that takes us to the bottom of $f$ at the fastest possible rate. Additionally, we see that
$$ r_{(i)} = b - Ax_{(i)} = Ax_{*} - Ax_{(i)} = A(x_{*} - x_{(i)}) = -Ae_{(i)} $$
So we can also think of the residual as being the error mapped by $A$ into the same space as $Ax_{*}=b$.
We start with an arbitrary point $x_{(0)}$ and compute the residual $r_{(0)}=b-Ax_{(0)}=-f'(x_{(0)})$. Since $r_{(0)}=-f'(x_{(0)})$ gives the direction of steepest descent, we perform a line search in that direction: we find the point $x_{(1)}\equiv x_{(0)}+\alpha_{(0)}r_{(0)}$ that minimizes $f$ along that line. Equivalently, we find the value of $\alpha_{(0)}\in\wR$ that minimizes $f(x_{(1)})=f(x_{(0)}+\alpha_{(0)}r_{(0)})$. This $\alpha_{(0)}$ satisfies
$$\align{ 0 &= \wderiv{}{\alpha_{(0)}}f(x_{(1)}) \\ &= \innprdBg{f'(x_{(1)})}{\wderiv{x_{(1)}}{\alpha_{(0)}}}\tag{by the Multivariable Chain Rule} \\ &= \innprdBg{-r_{(1)}}{\wderiv{(x_{(0)}+\alpha_{(0)}r_{(0)})}{\alpha_{(0)}}} \\ &= -\innprd{r_{(1)}}{r_{(0)}} }$$
That is, $\alpha_{(0)}$ minimizes $f(x_{(1)})=f(x_{(0)}+\alpha_{(0)}r_{(0)})$ if and only if $r_{(1)}$ and $r_{(0)}$ are orthogonal. Hence
$$\align{ 0 &= \innprd{r_{(1)}}{r_{(0)}} \\ &= \innprd{b-Ax_{(1)}}{r_{(0)}}\tag{because $r_{(i)}\equiv b-Ax_{(i)}$} \\ &= \innprd{b-A(x_{(0)}+\alpha_{(0)}r_{(0)})}{r_{(0)}}\tag{because $x_{(1)}\equiv x_{(0)}+\alpha_{(0)}r_{(0)}$} \\ &= \innprd{b-Ax_{(0)}-\alpha_{(0)}Ar_{(0)}}{r_{(0)}} \\ &= \innprd{r_{(0)}-\alpha_{(0)}Ar_{(0)}}{r_{(0)}}\tag{because $r_{(i)}\equiv b-Ax_{(i)}$} \\ &= \innprd{r_{(0)}}{r_{(0)}}-\alpha_{(0)}\innprd{Ar_{(0)}}{r_{(0)}} \\ }$$
Hence
$$\align{ \alpha_{(0)}=\frac{\innprd{r_{(0)}}{r_{(0)}}}{\innprd{r_{(0)}}{Ar_{(0)}}} \\\\ }$$
So we have computed $x_{(1)}=x_{(0)}+\alpha_{(0)}r_{(0)}$. Then we can similarly compute $x_{(2)}=x_{(1)}+\alpha_{(1)}r_{(1)}$ and so on. Hence our algorithm is
$$\gather{ r_{(i)}=b-Ax_{(i)}\tag{SD.1} \\\\ \alpha_{(i)}=\frac{\innprd{r_{(i)}}{r_{(i)}}}{\innprd{r_{(i)}}{Ar_{(i)}}}\tag{SD.2} \\\\ x_{(i+1)}=x_{(i)}+\alpha_{(i)}r_{(i)}\tag{SD.3} }$$
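Here is a minimal NumPy sketch of SD.1 through SD.3. It is not the post's source code; the stopping rule (a tolerance on $\dnorm{r_{(i)}}$) and the small example system are assumptions chosen for illustration:

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-11, max_iter=10_000):
    """Minimize f(x) = (1/2) x^T A x - b^T x via SD.1-SD.3.
    Assumes A is symmetric positive definite."""
    x = x0.astype(float).copy()
    for i in range(max_iter):
        r = b - A @ x                    # SD.1: residual = direction of steepest descent
        if np.linalg.norm(r) < tol:      # assumed stopping rule
            break
        alpha = (r @ r) / (r @ (A @ r))  # SD.2: exact line search step length
        x = x + alpha * r                # SD.3: step to the minimum along r
    return x, i

# Example on a small symmetric positive definite system
A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
x, iters = steepest_descent(A, b, x0=np.array([-2.0, -2.0]))
print(iters, x, np.linalg.solve(A, b))  # x should match the direct solution
```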
This algorithm requires two matrix-vector multiplications; we compute $Ax_{(i)}$ in SD.1 and $Ar_{(i)}$ in SD.2. Matrix-vector multiplications are expensive but we can eliminate one of them. Multiply SD.3 by $-A$ and then add $b$ to get:
$$ r_{(i+1)}=b-Ax_{(i+1)}=b-A(x_{(i)}+\alpha_{(i)}r_{(i)})=r_{(i)}-\alpha_{(i)}Ar_{(i)} $$
More succinctly, we have
$$ r_{(i+1)}=r_{(i)}-\alpha_{(i)}Ar_{(i)}\tag{SD.4} $$
So rather than iteratively computing SD.1, SD.2, and SD.3, we instead compute $r_{(0)}=b-Ax_{(0)}$ and then we iteratively compute SD.4, SD.2, and SD.3.
The disadvantage of using SD.4 is that the recurrence generates each residual without any feedback from the current value of $x_{(i)}$. The accumulation of floating point roundoff errors may therefore cause convergence to a point near, but not exactly at, $x_{*}$. This effect can be mitigated by periodically using SD.1 to recompute the exact residual.
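A sketch of this variant (again my own, with the refresh interval as a tunable assumption) replaces SD.1 with SD.4 inside the loop and periodically recomputes the exact residual:

```python
import numpy as np

def steepest_descent_recurrence(A, b, x0, tol=1e-11, max_iter=10_000, refresh_every=50):
    """Steepest Descent with the residual recurrence SD.4 and a periodic
    exact recomputation via SD.1 to limit roundoff drift.
    Assumes A is symmetric positive definite."""
    x = x0.astype(float).copy()
    r = b - A @ x                        # initial residual via SD.1
    for i in range(max_iter):
        if np.linalg.norm(r) < tol:      # assumed stopping rule
            break
        Ar = A @ r                       # the single matrix-vector product per iteration
        alpha = (r @ r) / (r @ Ar)       # SD.2
        x = x + alpha * r                # SD.3
        if (i + 1) % refresh_every == 0:
            r = b - A @ x                # SD.1: correct accumulated roundoff
        else:
            r = r - alpha * Ar           # SD.4: cheap residual update
    return x, i
```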
Inefficiency of Steepest Descent
The plot below gives us some intuition into the inefficiency of Steepest Descent. The point $x_{(0)}$ is an arbitrarily chosen starting point. The point $x_{(1)}$ is the point reached after a single step of Steepest Descent. And the point $x_{*}$ is the minimum of $f$. Notice that each step of Steepest Descent is orthogonal to the last.
The rotated ellipses are the contours of $f$. In the next section, I present a proof that the level curves of $f$ are ellipsoidal. This proof also shows that the orthonormal eigenvectors of $A$ ($v_1$ and $v_2$ in this plot) coincide with the axes of the ellipsoids (red lines).
In this example and from this particular starting point $x_{(0)}$, it takes $49$ iterations of Steepest Descent to reach the minimum point $x_{*}$ (with tolerance $10^{-11}$). You can visualize this by repeatedly zooming in at the point $x_{*}$. And you can programmatically verify this by running the function $\text{figure_ISD_1}$ from the source code and using your favorite debugger to check the shape of the steps variable.
Most importantly, we see that we repeat previous search directions. For example, when we reach the point $x_{(2)}$, we turn and go in the same search direction that we went when we started at $x_{(0)}$. And when we reach the point $x_{(3)}$, we turn and go in the same search direction that we went when we turned at $x_{(1)}$.
These repetitive search directions occur in $\wR^3$ (and $\wR^n$) but they’re not quite as easy to visualize. You’ll see this same inefficiency in the next plot if you zoom, pan, and rotate between $x_{(0)}$ and the center point $x_{*}$ of the ellipsoid. Of course we can definitively capture repetitive search directions in $\wR^3$ (and $\wR^n$) by checking for them programmatically. This is done in the function $\text{figure_ISD_2}$.
The Conjugate Gradient algorithm is more efficient than Steepest Descent because it precludes repeated search directions. But before we get to the Conjugate Gradient, we require some helpful tools.
We will not formally prove the time complexity of Steepest Descent here but we will quickly mention the result. Let $\largesteigenval$ denote the largest eigenvalue of $A$ and let $\smallesteigenval$ denote its smallest. Define $\kappa\equiv\frac{\largesteigenval}{\smallesteigenval}$. When $A$ is positive definite, we call $\kappa$ the condition number of $A$. It can be shown that the Steepest Descent algorithm has time complexity $\mcalO(h\kappa)$ where $h$ is the number of nonzero entries in $A$.
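For reference, computing $\kappa$ for a symmetric positive definite matrix is straightforward (a NumPy sketch, assuming the matrix fits in memory):

```python
import numpy as np

def condition_number_spd(A):
    # For symmetric positive definite A, kappa = lambda_max / lambda_min
    eigenvalues = np.linalg.eigvalsh(A)  # sorted in ascending order
    return eigenvalues[-1] / eigenvalues[0]

A = np.array([[3.0, 2.0], [2.0, 6.0]])
print(condition_number_spd(A))  # larger kappa generally means slower Steepest Descent
```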