Series: Matrix Calculus
Let's start with the central idea of calculus: a small change in output \( dy \) given a small change in input \( dx \) for a function \( y = f(x) \), where \( x, y \) can be scalars or vectors: \( dy = f(x+dx) - f(x) \). By dropping the higher-order terms involving \( dx \), we get \( dy \) as a linear function of \( dx \). As an example, consider the function \( Y = X^2 = XX \) where \( X \) is a square matrix. Then we have:
\[
\begin{align}
dY &= f(X+dX) - f(X) \\
&= (X + dX)(X + dX) - X^2 \\
&= \underbrace{ \amber{X(dX) + (dX)X} }_{\rose{X(dX) \ne (dX)X}} + \grey{\cancel{(dX)(dX)}}
\end{align}
\]
A linear approximation of the change in output \( dY \) for a small change in input \( dX \)
While it may not be immediately obvious, \( dY \) is a linear function of \( dX \). In addition, for a fixed \( dX \), the value of \( dY \) is a (not necessarily linear) function of \( X \).
0. Linear operator
A linear operator is a linear map from one vector space to another. It differs subtly from a function: a function can map any set to any set, but here we constrain the inputs and outputs to be elements of vector spaces. The most common example of a linear operator is a matrix (ref: Matrix is a function). In fact, any linear operator can be represented as a matrix. However, it is often more efficient to write it implicitly (as in the example above). This will become clearer as we proceed.
Figure: the linear operator \( L_f \), a linear map from \( \mathbb{R}^n \) to \( \mathbb{R}^m \), induced by the function \( f \).
For a (smooth) function \( f: \mathbb{R}^n \rightarrow \mathbb{R}^m \), we can obtain a linear operator \( L_f \) that maps a small change in input \( dX \) to the corresponding small change in output \( dY \); the operator itself depends on the point \( X \) at which it is evaluated. For the above example we have:
\[
\begin{align}
f(X) &= X^2 \\ \\
L_f[dX] &= X(dX) + (dX)X \\
\end{align}
\]
import torch

def square(X):
    return X @ X

def d_square(dX):
    def operator(X):
        return X @ dX + dX @ X
    return operator

X = torch.randn(1000, 1000, dtype=torch.float64)
dX = torch.randn_like(X).mul(1e-8)

actual_dy = (X + dX) @ (X + dX) - X @ X
# The line below is the same as
# dy = X @ dX + dX @ X
dy = d_square(dX)(X)
torch.allclose(actual_dy, dy)
'''
Outputs:
True
'''
Note: \( L_f[dX] \) is linear in \( dX \) but may not be linear in \( X \). For example:
\[
\begin{align}
f(X) &= X^3 \\ \\
dY &= f(X+dX) - f(X) \\
&= (dX)X^2 + X(dX)X + X^2(dX) + \grey{\cancel{\text{higher-order terms}}} \\
L_f[dX] &= \underbrace{(dX)X^2 + X(dX)X + X^2(dX)}_{\rose{\text{Not linear in } X}} \\
\end{align}
\]
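As a quick numerical check, in the style of the snippet above (the name d_cube is my own for this operator):

import torch

def d_cube(dX):
    def operator(X):
        # dY = (dX) X^2 + X (dX) X + X^2 (dX)
        return dX @ X @ X + X @ dX @ X + X @ X @ dX
    return operator

X = torch.randn(1000, 1000, dtype=torch.float64)
dX = torch.randn_like(X).mul(1e-8)
actual_dy = (X + dX) @ (X + dX) @ (X + dX) - X @ X @ X
dy = d_cube(dX)(X)
torch.allclose(actual_dy, dy)  # True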
1. Product rule and chain rule
Consider two (possibly vector-valued) functions along with their (differential) linear operators:
\[
\begin{align}
f(x+dx) &= f(x) + L_f[dx] \\
g(x+dx) &= g(x) + L_g[dx]
\end{align}
\]
Then for a function \( h(x) = f(x) * g(x) \), we have:
\[
\begin{align}
dh &= h(x+dx) - h(x) \\
&= f(x+dx) * g(x+dx) - f(x) * g(x) \\
&= L_f[dx] * g(x) + f(x) * L_g[dx] + \grey{\cancel{L_f[dx] * L_g[dx]}} \\
&= \amber{ L_f[dx] * g(x) + f(x) * L_g[dx] }
\end{align}
\]
The product rule for vector-valued functions
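As a hedged numerical sketch (my own example, not from the original derivation), we can check the product rule for \( h(X) = XX^T \), where \( f(X) = X \), \( g(X) = X^T \), and so \( dh = (dX)X^T + X(dX)^T \):

import torch

X = torch.randn(500, 500, dtype=torch.float64)
dX = torch.randn_like(X).mul(1e-8)

actual_dh = (X + dX) @ (X + dX).T - X @ X.T
# L_f[dX] * g(X) + f(X) * L_g[dX], with L_f[dX] = dX and L_g[dX] = dX^T
dh = dX @ X.T + X @ dX.T
torch.allclose(actual_dh, dh)  # True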
Now consider another function \( p(x) = f(g(x)) \). Then we have:
\[
\begin{align}
dp &= p(x+dx) - p(x) \\
&= f(g(x+dx)) - f(g(x)) \\
&= f(g(x) + \underbrace{L_g[dx]}_{ \rose{dg} }) - f(g(x)) \\
&= \underbrace{L_f[dg]}_{\rose{\text{function of } g}} + \grey{\cancel{f(g(x)) - f(g(x))}} \\ \\
&= \amber{ \underbrace{ L_f[ \overbrace{L_g[dx]}^{\rose{\text{function of } x}} ] }_{\rose{\text{function of } g(x)}} }
\end{align}
\]
The chain rule for vector-valued functions
Note that \( L_g[dx] \) is evaluated at \( x \), while \( L_f[L_g[dx]] \) is evaluated at \( g(x) \). This will become clear with an example below.
2. Some examples
Below are some interesting examples of differential linear operators. Note that \( d(A^T) = (dA)^T \) for any matrix or vector \( A \).
2.1. \( f(x) = \lVert x \rVert ^2 \) where \( x \) is a vector
This function calculates the squared norm of a vector. It can be re-written as:
\( f(x) = \lVert x \rVert ^2 = x^Tx \)
Now we just use \( df \) instead of \( L_f \):
\[
\begin{align}
df &= f(x+dx) - f(x) \\
&= (x+dx)^T(x+dx) - x^Tx \\
&= (dx)^Tx + x^T(dx) + \grey{\cancel{(dx)^T(dx)}} \\
&= \amber{\underbrace{(dx)^Tx}_{\text{a scalar} \ = \ x^Tdx}} + x^T(dx) \\
&= \amber{2x^Tdx}
\end{align}
\]
Thus, if the change \( dx \) is orthogonal to the vector \( x \), then the change in output \( df \) is zero, since \( x^Tdx = 0 \). This makes sense because such a small change can be seen as an infinitesimal rotation of the vector \( x \), and a rotation does not change the length of \( x \).
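A minimal numerical sketch of both claims (the variable names are my own, following the earlier snippets):

import torch

x = torch.randn(1000, dtype=torch.float64)
dx = torch.randn_like(x).mul(1e-8)

actual_df = (x + dx) @ (x + dx) - x @ x
df = 2 * x @ dx
print(torch.allclose(actual_df, df))  # True

# A perturbation orthogonal to x leaves the norm unchanged to first order
dx_orth = dx - (x @ dx) / (x @ x) * x  # project out the component along x
print(torch.allclose(2 * x @ dx_orth, torch.zeros((), dtype=torch.float64)))  # True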
2.2. \( f(X) = X^n \)
\( df = \sum_{i=0}^{n-1} X^i(dX)X^{n-1-i} \)
We already saw this for the case of \( f(X) = X^2 \). To prove it, one can apply the product rule and use induction, with the base case \( d(X) = dX \). Assuming it holds for \( f(X) = X^{n-1} \), we have:
\[
\begin{align}
df &= \underbrace{d(XX^{n-1})}_{\rose{\text{apply product rule}}} \\
&= (dX)X^{n-1} + X\underbrace{\amber{d(X^{n-1})}}_{\rose{\text{induction hypothesis}}} \\
&= (dX)X^{n-1} + X \amber{\sum_{i=0}^{n-2} X^i(dX)X^{n-2-i}} \\
&= \amber{\sum_{i=0}^{n-1} X^i(dX)X^{n-1-i}}
\end{align}
\]
Note how similar this is to the scalar case \( d(x^n) = nx^{n-1}dx \). However, since matrix multiplication is not commutative, the order of \( dX \) and \( X \) is important.
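A hedged numerical sketch for a general power (here \( n = 5 \); the setup mirrors the earlier snippets):

import torch

n = 5
X = torch.randn(100, 100, dtype=torch.float64)
dX = torch.randn_like(X).mul(1e-8)

actual_df = torch.linalg.matrix_power(X + dX, n) - torch.linalg.matrix_power(X, n)
# df = sum over i of X^i (dX) X^(n-1-i)
df = sum(
    torch.linalg.matrix_power(X, i) @ dX @ torch.linalg.matrix_power(X, n - 1 - i)
    for i in range(n)
)
torch.allclose(actual_df, df)  # True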
2.3. \( f(X) = X^{-1} \)
One can use the identity \( XX^{-1} = I \) and then apply the product rule to it:
\[
\begin{align}
d(XX^{-1}) &= d(I) \\
(dX)X^{-1} + X \, d(X^{-1}) &= 0 \\
X \, d(X^{-1}) &= -(dX)X^{-1} \\
d(X^{-1}) &= \amber{-X^{-1}(dX)X^{-1}}
\end{align}
\]
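As a quick numerical check (the same operator reappears as d_invert in the closing example below):

import torch

X = torch.randn(500, 500, dtype=torch.float64)
dX = torch.randn_like(X).mul(1e-8)
X_inv = torch.linalg.inv(X)

actual_df = torch.linalg.inv(X + dX) - X_inv
df = -X_inv @ dX @ X_inv
torch.allclose(actual_df, df)  # True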
2.4. \( f(Q) = QQ^T \) where \( Q \) is an orthonormal matrix
Using the defining property of an orthonormal matrix, \( Q^TQ = I \), we have \( d(Q^TQ) = d(I) = 0 \).
\[
\begin{align}
d(Q^TQ) &= d(I) \\
(dQ)^TQ + Q^TdQ &= 0 \\
Q^TdQ &= -(dQ)^TQ \\
Q^TdQ &= \underbrace{\amber{-(Q^TdQ)^T}}_{Q^TdQ \ \rose{\text{is the negative of its own transpose}}} \\
\end{align}
\]
Thus \( Q^TdQ \) is a skew-symmetric matrix. This insight is a motivation for key ideas behind Lie algebras, a quick introduction to which can be found here.
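A hedged numerical sketch (my own construction, not from the original): perturb \( Q \) along the manifold of orthonormal matrices using a small skew-symmetric generator, then check that \( Q^TdQ \) is skew-symmetric:

import torch

# Build an orthonormal Q via a QR decomposition
Q, _ = torch.linalg.qr(torch.randn(500, 500, dtype=torch.float64))

# A small skew-symmetric generator S; matrix_exp(S) is orthonormal,
# so Q @ matrix_exp(S) stays on the manifold
A = torch.randn_like(Q).mul(1e-8)
S = A - A.T
dQ = Q @ torch.linalg.matrix_exp(S) - Q

M = Q.T @ dQ
torch.allclose(M, -M.T)  # True: Q^T dQ equals the negative of its transpose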
3. Closing example
We end this part with an example of the chain rule for a matrix-valued function. We have:
\[
\begin{align}
g(X) &= X^2 \\
f(X) &= X^{-1} \\
h(X) &= f(g(X)) \\
dh &= L_f[L_g[dX]] \\
\end{align}
\]
This can be numerically verified like so:
import torch

x = torch.rand(1000, 1000).double()
dx = torch.randn_like(x).mul(1e-8)

def square(x):
    return x @ x

def invert(x):
    return torch.linalg.inv(x)

def d_square(dx):
    def operator(x):
        return x @ dx + dx @ x
    return operator

def d_invert(dx):
    def operator(x):
        x_inv = torch.linalg.inv(x)
        return -x_inv @ dx @ x_inv
    return operator
We perturb the input by a small value and record the actual change in output \( dy \). Then we calculate the same change using the chain rule, \( df = L_f[L_g[dx]] \), and compare:
y = invert(square(x))
dy = invert(square(x + dx)) - y

g = square(x)
dg = d_square(dx)(x)   # L_g[dx], evaluated at x
df = d_invert(dg)(g)   # L_f[dg], evaluated at g(x)
torch.allclose(df, dy)
'''
Outputs:
True
'''