Gradient of a scalar-valued function
We start with functions whose output is a scalar, since the gradient is defined only for such functions. For vector-valued functions, the gradient generalizes to the Jacobian.
0. What is a gradient?
Consider a scalar-valued function \( f(\underbrace{x}_{\rose{\in R^n}}) = \underbrace{y}_{\rose{\in R}} \). Then the first-order change in the output due to a small change \( dx \) in the input can be expressed as an inner product with \( dx \):
\[
\begin{align}
\overbrace{L_f[dx]}^{\rose{\text{scalar}}} &= \langle \ \overbrace{\text{?}}^{\rose{\in R^n}} \ , dx \rangle \\
&= \amber{\langle \ \underbrace{\nabla f}_{\rose{grad(f)}} \ , dx \rangle}
\end{align}
\]
This is because \( L_f[dx] \) is a linear function of \( dx \) and its output is a scalar; the representation as an inner product is possible only because the output is a scalar. \( \nabla f \) is the gradient of \( f \). It is a vector with the same dimension as the input vector \( x \).
Conceptually, the gradient of a scalar-valued function is a vector whose components are the partial derivatives of the function with respect to each of the coordinates of the input vector.$$
\langle
\underbrace{
\amber{
\begin{bmatrix}
\partial f / \partial x_1 \\
\partial f / \partial x_2 \\
\vdots \\
\partial f / \partial x_n
\end{bmatrix}
}
}_{
\amber{\nabla f}
}
,
\underbrace{
\begin{bmatrix}
dx_1 \\
dx_2 \\
\vdots \\
dx_n
\end{bmatrix}
}_{
dx
}
\rangle
= df
$$
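As a quick sanity check, here is a minimal NumPy sketch (the particular \( f \) below is a hypothetical example chosen for illustration) showing that the actual change in the output matches the inner product \( \langle \nabla f, dx \rangle \) to first order:

```python
import numpy as np

# Hypothetical example: f(x) = sin(x_1) + x_2^2, a scalar-valued function on R^2.
def f(x):
    return np.sin(x[0]) + x[1] ** 2

def grad_f(x):
    # Partial derivatives stacked into a vector with the same shape as x.
    return np.array([np.cos(x[0]), 2 * x[1]])

x = np.array([0.3, -1.2])
dx = 1e-6 * np.array([0.5, 2.0])   # a small change in the input

df_actual = f(x + dx) - f(x)       # actual change in the output
df_linear = grad_f(x) @ dx         # the inner product <grad f, dx>
print(df_actual, df_linear)        # agree to first order in dx
```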
0.1 Gradient is perpendicular to the tangent plane
In the above expression, setting \( df = 0 \) gives us \( \langle \nabla f, dx \rangle = 0 \). Since the output does not change, the vector \( dx \) lies in the tangent plane of the level set of \( f \) at the point \( x \). The gradient \( \nabla f \) is therefore perpendicular to this tangent plane.
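To see this numerically, take the hypothetical example \( f(x) = x^T x \), whose level set through \( x \) is a sphere centred at the origin. Any \( dx \) orthogonal to \( x \) stays on the level set to first order, and the gradient is orthogonal to every such \( dx \):

```python
import numpy as np

# Hypothetical example: f(x) = x.x, whose level set through x is a sphere.
x = np.array([1.0, 2.0, 2.0])
grad = 2 * x                       # gradient of f at x

dx = np.array([2.0, -1.0, 0.0])    # chosen so that x . dx = 0, hence df = 0
print(grad @ dx)                   # 0.0: the gradient is perpendicular to dx
```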
1. Gradient and the inner product
The idea of inner product between two vectors can be extended to matrices. This allows us to use matrix identities to compute the gradient of a scalar-valued function. The inner product of two matrices \( A \) and \( B \) (of the same shape) is the sum of their elementwise products. It is also known as the Frobenius inner product and denoted by \( \langle A, B \rangle_F \).
1.1 Frobenius inner product and Frobenius norm
The trace of a square matrix \( M \), denoted by \( \text{tr}(M) \), is the sum of its diagonal elements. It is connected to the Frobenius inner product like so:\(
\langle \ \underbrace{A,B}_{\rose{\text{m,n}}} \ \rangle_F = \text{tr}(\underbrace{A^T B}_{\rose{\text{n,n}}}) = \text{tr}(\underbrace{A B^T}_{\rose{\text{m,m}}})
\)One can also extend the idea of the norm of a vector to the norm of a matrix. The Frobenius norm of a matrix \( A \), denoted by \( \|A\|_F \), is the square root of the Frobenius inner product of \( A \) with itself.\(
\|A\|_F = \sqrt{\langle A,A \rangle_F} = \sqrt{\text{tr}(A^T A)}
\)
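Both facts are easy to verify in NumPy (the random matrices below are just illustrative inputs):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))    # illustrative m x n inputs
B = rng.standard_normal((3, 4))

# Three equivalent ways to compute the Frobenius inner product <A, B>_F.
print(np.sum(A * B))               # sum of elementwise products
print(np.trace(A.T @ B))           # trace of an n x n matrix
print(np.trace(A @ B.T))           # trace of an m x m matrix

# Frobenius norm: sqrt(<A, A>_F).
print(np.sqrt(np.trace(A.T @ A)), np.linalg.norm(A, 'fro'))
```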
2. Some useful identities
It is helpful to remember some identities that are useful in computing the gradient of a scalar-valued function.
2.1 Trace of a matrix
Linear: \( \text{tr}(pA + qB) = p\,\text{tr}(A) + q\,\text{tr}(B) \)
Invariant under transposition: \( \text{tr}(A) = \text{tr}(A^T) \)
Commutative: \( \text{tr}(AB) = \text{tr}(BA) \)
Invariant under circular shifts: \( \text{tr}(ABC) = \text{tr}(CAB) = \text{tr}(BCA) \)
Invariant under similarity: \( \text{tr}(P^{-1}AP) = \text{tr}(A) \)
And finally: \( d (\text{tr}(A)) = \text{tr}(dA) \)
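A quick numerical check of these identities (random test matrices; \( P \) is shifted by a multiple of the identity to keep it safely invertible):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((3, 3)) for _ in range(3))
P = rng.standard_normal((3, 3)) + 3 * np.eye(3)   # safely invertible
p, q = 2.0, -1.5

print(np.trace(p * A + q * B), p * np.trace(A) + q * np.trace(B))  # linearity
print(np.trace(A), np.trace(A.T))                                  # transposition
print(np.trace(A @ B), np.trace(B @ A))                            # commutativity
print(np.trace(A @ B @ C), np.trace(C @ A @ B))                    # circular shift
print(np.trace(np.linalg.inv(P) @ A @ P), np.trace(A))             # similarity
```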
2.2 Kronecker product
More details can be found in this article. Two key properties of the Kronecker product that will come in handy are:
\( vec(\ \underbrace{C}_{\rose{c_1,c_2}} \overbrace{M}^{\rose{c_2,b_2}} \underbrace{B^T}_{\rose{b_2,b_1}} \ ) = \underbrace{(B \otimes C)}_{\rose{b_1c_1,b_2c_2}} \overbrace{vec(M)}^{\rose{b_2c_2}} \)
\( \text{tr}(A \otimes B) = \text{tr}(A) \text{tr}(B) \)
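Both properties can be checked numerically. Note that the first identity assumes the column-stacking convention for \( \text{vec} \), which in NumPy means flattening in column-major (Fortran) order:

```python
import numpy as np

rng = np.random.default_rng(0)
c1, c2, b1, b2 = 2, 3, 4, 5
C = rng.standard_normal((c1, c2))
M = rng.standard_normal((c2, b2))
B = rng.standard_normal((b1, b2))

vec = lambda X: X.flatten(order='F')   # column-stacking vec

# vec(C M B^T) == (B kron C) vec(M)
print(np.allclose(vec(C @ M @ B.T), np.kron(B, C) @ vec(M)))

# tr(A kron B) == tr(A) tr(B) for square A and B
A2 = rng.standard_normal((3, 3))
B2 = rng.standard_normal((4, 4))
print(np.isclose(np.trace(np.kron(A2, B2)), np.trace(A2) * np.trace(B2)))
```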
3. Examples
Note that all the functions below are scalar-valued.
3.1 \( f(A) = \| A \|_F \)
The function can be rewritten as \( f = \|A\|_F = \sqrt{\text{tr}(A^T A)} \). Then:\(
df =
\overbrace{
\toggle
{f(A+dA) - f(A)}
{\frac{1}{2\|A\|} \ d(\text{tr}(A^TA))}
{\frac{1}{2\|A\|} \ \text{tr}(d(A^TA))}
{\frac{1}{2\|A\|} \ \text{tr}(A^TdA + dA^TA)}
{\frac{1}{2\|A\|} \ \text{tr}(A^TdA + A^TdA)}
{\frac{1}{2\|A\|} \ \text{tr}(2A^TdA)}
{\frac{1}{\|A\|} \ \text{tr}(A^TdA)}
{\text{Start again}}
\endtoggle{}
}^{\rose{\text{Click for next step}}}
= \langle \underbrace{\amber{\frac{A}{\|A\|}}}_{\amber{\nabla f}}, dA \rangle
\)
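A finite-difference check of this result (random \( A \) and \( dA \) as illustrative inputs):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
dA = 1e-6 * rng.standard_normal((3, 4))

f = lambda X: np.linalg.norm(X, 'fro')
grad = A / np.linalg.norm(A, 'fro')    # the claimed gradient A / ||A||_F

df_actual = f(A + dA) - f(A)
df_linear = np.sum(grad * dA)          # Frobenius inner product <grad f, dA>
print(df_actual, df_linear)            # agree to first order
```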
3.2 \( f(A) = x^T A y \)
Here we use the trick that the trace of a scalar is the scalar itself. Hence \( \text{tr}(c) = c, c \in R \).\(
df =
\overbrace{
\toggle
{f(A+dA) - f(A)}
{x^T \ dA \ y}
{\text{tr}(x^T \ dA \ y)}
{\text{tr}(yx^T \ dA)}
{\text{tr}((xy^T)^T \ dA)}
{\text{Start again}}
\endtoggle{}
}^{\rose{\text{Click for next step}}}
= \langle \underbrace{\amber{xy^T}}_{\amber{\nabla f}}, dA \rangle
\)Note that using the Kronecker identity of the bilinear form, we could also write:\(
df = x^T \ dA \ y = \langle \underbrace{\amber{y\otimes x}}_{\amber{\text{vec}(\nabla f)}}, \text{vec}(dA) \rangle
\)Indeed, one can verify that \( \text{vec}(xy^T) = y\otimes x \) under the column-stacking convention for \( \text{vec} \).
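Both forms of the gradient can be verified numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(3)
y = rng.standard_normal(4)
A = rng.standard_normal((3, 4))
dA = 1e-6 * rng.standard_normal((3, 4))

f = lambda X: x @ X @ y
grad = np.outer(x, y)                  # the claimed gradient x y^T

print(f(A + dA) - f(A), np.sum(grad * dA))   # df vs <grad f, dA>

# The Kronecker form: vec(x y^T) == y kron x under column-stacking vec.
vec = lambda X: X.flatten(order='F')
print(np.allclose(vec(np.outer(x, y)), np.kron(y, x)))
```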
3.3 \( f(A) = \text{det}(A) \)
Assume that \( A^{-1} \) exists. First we establish that, to first order, \( \text{det}(I + dX) = \text{det}(I) + \text{tr}(dX) \) for a small perturbation \( dX \), i.e. \( d(\text{det}(I + dX)) = \text{tr}(dX) \).\(
\text{det}(I + dX) =
\overbrace{
\toggle
{\prod_{i}(1 + dX_{ii}) + \cancel{\grey{\text{higher order terms}}}}
{1 + \sum_{i}dX_{ii} + \cancel{\grey{\text{higher order terms}}}}
{1 + \sum_{i}dX_{ii}}
{1 + \text{tr}(dX)}
{\text{det}(I) + \text{tr}(dX) \ \zinc{\blacksquare}}
{\grey{\text{(Start over)}}}
\endtoggle{}
}^{\rose{\text{Click for next step}}}
\)Next we perturb the matrix \( A \) by a small amount \( dA \) and rewrite it as \( A+dA = A(I + A^{-1}dA) \).\(
d(\text{det}(A + dA)) =
\overbrace{
\toggle
{d(\text{det}(A) \ \rose{\text{det}(I + A^{-1}dA)})}
{\text{det}(A) \ \underbrace{\rose{d(\text{det}(I + A^{-1}dA))}}_{\rose{d(\text{det}(I+dX)) = \text{tr}(dX)}}}
{\text{det}(A) \rose{\text{tr}(A^{-1}dA)}}
{\text{tr}(\text{det}(A) A^{-1}dA)}
{\text{tr}(\underbrace{\rose{\text{det}(A) A^{-1}}}_{= C_A^T} \ dA)}
{\text{tr}(C_A^T \ dA)}
{\langle \underbrace{\amber{C_A}}_{\amber{\nabla f}}, dA \rangle \ \zinc{\blacksquare}}
{\grey{\text{(Start over)}}}
\endtoggle{}
}^{\rose{\text{Click for next step}}}
\)Here \( C_A \) is the cofactor matrix of \( A \). Thus, the gradient of the determinant of a matrix is its cofactor matrix.
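A finite-difference check (the matrix below is shifted by a multiple of the identity to keep it safely invertible, since the derivation assumed \( A^{-1} \) exists):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # safely invertible
dA = 1e-7 * rng.standard_normal((4, 4))

# Cofactor matrix via the adjugate: C_A = (det(A) A^{-1})^T.
C_A = (np.linalg.det(A) * np.linalg.inv(A)).T

df_actual = np.linalg.det(A + dA) - np.linalg.det(A)
df_linear = np.sum(C_A * dA)           # <C_A, dA>
print(df_actual, df_linear)            # agree to first order
```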
Next we will generalize the idea of a gradient to functions whose output is a vector.