Series: Matrix Calculus
Let's start with the central idea of calculus: a small change in output \( dy \) given a small change in input \( dx \) for a function \( y = f(x) \), where \( x, y \) can be scalars or vectors: \( dy = f(x+dx) - f(x) \). By dropping the higher-order terms involving \( dx \), we get \( dy \) as a linear function of \( dx \). As an example, consider the function \( Y = X^2 = XX \) where \( X \) is a square matrix. Then we have:
\[
\begin{align}
dY &= f(X+dX) - f(X) \\
&= (X + dX)(X + dX) - X^2 \\
&= \underbrace{ \amber{X(dX) + (dX)X} }_{\rose{X(dX) \ne (dX)X}} + \grey{\cancel{(dX)(dX)}}
\end{align}
\]
A linear approximation of the change in output \( dY \) for a small change in input \( dX \)
While it may not be immediately obvious, \( dY \) is a linear function of \( dX \). In addition, for a fixed \( dX \), the value of \( dY \) is a (not necessarily linear) function of \( X \).
0. Linear operator
A linear operator is a linear map from one vector space to another. It differs subtly from a function: a function can map any set to any set, but here we constrain the inputs and outputs to be elements of vector spaces. The most common example of a linear operator is a matrix (ref: Matrix is a function). In fact, any linear operator can be represented as a matrix. However, it is often more efficient to write it implicitly (as in the example above). This will become clearer as we proceed.
Figure: the linear operator \( L_f \), a linear map from \( \mathbb{R}^n \) to \( \mathbb{R}^m \), induced by the function \( f \).
For a (smooth) function \( f: \mathbb{R}^n \rightarrow \mathbb{R}^m \), we can obtain a linear operator \( L_f \) that maps a small change in input \( dX \) to the corresponding small change in output \( dY \); the operator itself depends on the point \( X \) at which it is evaluated. For the above example we have:
\[
\begin{align}
f(X) &= X^2 \\ \\
L_f[dX] &= X(dX) + (dX)X \\
\end{align}
\]
import torch

def square(X):
    return X @ X

def d_square(dX):
    def operator(X):
        return X @ dX + dX @ X
    return operator

X = torch.randn(1000, 1000, dtype=torch.float64)
dX = torch.randn_like(X).mul(1e-8)

actual_dy = (X + dX) @ (X + dX) - X @ X
# The line below is the same as
# dy = X @ dX + dX @ X
dy = d_square(dX)(X)
torch.allclose(actual_dy, dy)
'''
Outputs:
True
'''
Note: \( L_f[dX] \) is linear in \( dX \) but may not be linear in \( X \). For example:
\[
\begin{align}
f(X) &= X^3 \\ \\
dY &= f(X+dX) - f(X) \\
&= (dX)X^2 + X(dX)X + X^2(dX) + \grey{\cancel{\text{higher-order terms}}} \\
L_f[dX] &= \underbrace{(dX)X^2 + X(dX)X + X^2(dX)}_{\rose{\text{Not linear in } X}} \\
\end{align}
\]
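As a quick numerical check, in the style of the snippet above (the name d_cube is my own for this operator):

import torch

def d_cube(dX):
    def operator(X):
        # dY = (dX) X^2 + X (dX) X + X^2 (dX)
        return dX @ X @ X + X @ dX @ X + X @ X @ dX
    return operator

X = torch.randn(1000, 1000, dtype=torch.float64)
dX = torch.randn_like(X).mul(1e-8)
actual_dy = (X + dX) @ (X + dX) @ (X + dX) - X @ X @ X
dy = d_cube(dX)(X)
torch.allclose(actual_dy, dy)  # True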
1. Product rule and chain rule
Consider two (possibly vector-valued) functions along with their (differential) linear operators:
\[
\begin{align}
f(x+dx) &= f(x) + L_f[dx] \\
g(x+dx) &= g(x) + L_g[dx]
\end{align}
\]
Then for a function \( h(x) = f(x) * g(x) \), we have:
\[
\begin{align}
dh &= h(x+dx) - h(x) \\
&= f(x+dx) * g(x+dx) - f(x) * g(x) \\
&= L_f[dx] * g(x) + f(x) * L_g[dx] + \grey{\cancel{L_f[dx] * L_g[dx]}} \\
&= \amber{ L_f[dx] * g(x) + f(x) * L_g[dx] }
\end{align}
\]
The product rule for vector-valued functions
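As a hedged numerical sketch (my own example, not from the original derivation), we can check the product rule for \( h(X) = XX^T \), where \( f(X) = X \), \( g(X) = X^T \), and so \( dh = (dX)X^T + X(dX)^T \):

import torch

X = torch.randn(500, 500, dtype=torch.float64)
dX = torch.randn_like(X).mul(1e-8)

actual_dh = (X + dX) @ (X + dX).T - X @ X.T
# L_f[dX] * g(X) + f(X) * L_g[dX], with L_f[dX] = dX and L_g[dX] = dX^T
dh = dX @ X.T + X @ dX.T
torch.allclose(actual_dh, dh)  # True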
Now consider another function \( p(x) = f(g(x)) \). Then we have:
\[
\begin{align}
dp &= p(x+dx) - p(x) \\
&= f(g(x+dx)) - f(g(x)) \\
&= f(g(x) + \underbrace{L_g[dx]}_{ \rose{dg} }) - f(g(x)) \\
&= \underbrace{L_f[dg]}_{\rose{\text{function of } g}} + \grey{\cancel{f(g(x)) - f(g(x))}} \\ \\
&= \amber{ \underbrace{ L_f[ \overbrace{L_g[dx]}^{\rose{\text{function of } x}} ] }_{\rose{\text{function of } g(x)}} }
\end{align}
\]
The chain rule for vector-valued functions
Note that \( L_g[dx] \) is evaluated at \( x \), while \( L_f[L_g[dx]] \) is evaluated at \( g(x) \). This will become clear with an example below.
2. Some examples
Below are some interesting examples of differential linear operators. Note that \( d(A^T) = (dA)^T \) for any matrix or vector \( A \).
2.1. \( f(x) = \lVert x \rVert ^2 \) where \( x \) is a vector
This function calculates the squared norm of a vector. It can be re-written as:
\( f(x) = \lVert x \rVert ^2 = x^Tx \)
Now we just use \( df \) instead of \( L_f \):
\[
\begin{align}
df &= f(x+dx) - f(x) \\
&= (x+dx)^T(x+dx) - x^Tx \\
&= (dx)^Tx + x^T(dx) + \grey{\cancel{(dx)^T(dx)}} \\
&= \amber{\underbrace{(dx)^Tx}_{\text{a scalar} \ = \ x^Tdx}} + x^T(dx) \\
&= \amber{2x^Tdx}
\end{align}
\]
Thus, if the change \( dx \) is orthogonal to the vector \( x \), then the change in output \( df \) is zero, since \( x^Tdx = 0 \). This makes sense because such a small change can be seen as an infinitesimal rotation of the vector \( x \), and a rotation does not change the length of \( x \).
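A minimal numerical sketch of both claims (the variable names are my own, following the earlier snippets):

import torch

x = torch.randn(1000, dtype=torch.float64)
dx = torch.randn_like(x).mul(1e-8)

actual_df = (x + dx) @ (x + dx) - x @ x
df = 2 * x @ dx
print(torch.allclose(actual_df, df))  # True

# A perturbation orthogonal to x leaves the norm unchanged to first order
dx_orth = dx - (x @ dx) / (x @ x) * x  # project out the component along x
print(torch.allclose(2 * x @ dx_orth, torch.zeros((), dtype=torch.float64)))  # True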
2.2. \( f(X) = X^n \)
\( df = \sum_{i=0}^{n-1} X^i(dX)X^{n-1-i} \)
We already saw this for the case of \( f(X) = X^2 \). To prove it, one can apply the product rule and use induction, with the base case \( d(X) = dX \). Assuming it holds for \( f(X) = X^{n-1} \), we have:
\[
\begin{align}
df &= \underbrace{d(XX^{n-1})}_{\rose{\text{apply product rule}}} \\
&= (dX)X^{n-1} + X\underbrace{\amber{d(X^{n-1})}}_{\rose{\text{induction hypothesis}}} \\
&= (dX)X^{n-1} + X \amber{\sum_{i=0}^{n-2} X^i(dX)X^{n-2-i}} \\
&= \amber{\sum_{i=0}^{n-1} X^i(dX)X^{n-1-i}}
\end{align}
\]
Note how similar this is to the scalar case \( d(x^n) = nx^{n-1}dx \). However, since matrix multiplication is not commutative, the order of \( dX \) and \( X \) is important.
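A hedged numerical sketch for a general power (here \( n = 5 \); the setup mirrors the earlier snippets):

import torch

n = 5
X = torch.randn(100, 100, dtype=torch.float64)
dX = torch.randn_like(X).mul(1e-8)

actual_df = torch.linalg.matrix_power(X + dX, n) - torch.linalg.matrix_power(X, n)
# df = sum over i of X^i (dX) X^(n-1-i)
df = sum(
    torch.linalg.matrix_power(X, i) @ dX @ torch.linalg.matrix_power(X, n - 1 - i)
    for i in range(n)
)
torch.allclose(actual_df, df)  # True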
2.3. \( f(X) = X^{-1} \)
One can use the identity \( XX^{-1} = I \) and then apply the product rule to it:
\[
\begin{align}
d(XX^{-1}) &= d(I) \\
(dX)X^{-1} + X \, d(X^{-1}) &= 0 \\
X \, d(X^{-1}) &= -(dX)X^{-1} \\
d(X^{-1}) &= \amber{-X^{-1}(dX)X^{-1}}
\end{align}
\]
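As a quick numerical check (the same operator reappears as d_invert in the closing example below):

import torch

X = torch.randn(500, 500, dtype=torch.float64)
dX = torch.randn_like(X).mul(1e-8)
X_inv = torch.linalg.inv(X)

actual_df = torch.linalg.inv(X + dX) - X_inv
df = -X_inv @ dX @ X_inv
torch.allclose(actual_df, df)  # True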
2.4. \( f(Q) = QQ^T \) where \( Q \) is an orthonormal matrix
Using the defining property of an orthonormal matrix, \( Q^TQ = I \), we have \( d(Q^TQ) = d(I) = 0 \).
\[
\begin{align}
d(Q^TQ) &= d(I) \\
(dQ)^TQ + Q^TdQ &= 0 \\
Q^TdQ &= -(dQ)^TQ \\
Q^TdQ &= \underbrace{\amber{-(Q^TdQ)^T}}_{Q^TdQ \ \rose{\text{is the negative of its own transpose}}} \\
\end{align}
\]
Thus \( Q^TdQ \) is a skew-symmetric matrix. This insight is a motivation for key ideas behind Lie algebras, a quick introduction to which can be found here.
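A hedged numerical sketch (my own construction, not from the original): perturb \( Q \) along the manifold of orthonormal matrices using a small skew-symmetric generator, then check that \( Q^TdQ \) is skew-symmetric:

import torch

# Build an orthonormal Q via a QR decomposition
Q, _ = torch.linalg.qr(torch.randn(500, 500, dtype=torch.float64))

# A small skew-symmetric generator S; matrix_exp(S) is orthonormal,
# so Q @ matrix_exp(S) stays on the manifold
A = torch.randn_like(Q).mul(1e-8)
S = A - A.T
dQ = Q @ torch.linalg.matrix_exp(S) - Q

M = Q.T @ dQ
torch.allclose(M, -M.T)  # True: Q^T dQ equals the negative of its transpose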
3. Closing example
We end this part with an example of the chain rule for a matrix-valued function. We have:
\[
\begin{align}
g(X) &= X^2 \\
f(X) &= X^{-1} \\
h(X) &= f(g(X)) \\
dh &= L_f[L_g[dX]] \\
\end{align}
\]
This can be numerically verified like so:
import torch

x = torch.rand(1000, 1000).double()
dx = torch.randn_like(x).mul(1e-8)

def square(x):
    return x @ x

def invert(x):
    return torch.linalg.inv(x)

def d_square(dx):
    def operator(x):
        return x @ dx + dx @ x
    return operator

def d_invert(dx):
    def operator(x):
        x_inv = torch.linalg.inv(x)
        return -x_inv @ dx @ x_inv
    return operator
We perturb the input by a small value and record the actual change in output \( dy \). Then we calculate the same change using the chain rule, \( df = L_f[L_g[dx]] \), and compare:
y = invert(square(x))
dy = invert(square(x + dx)) - y

g = square(x)
dg = d_square(dx)(x)   # L_g[dx], evaluated at x
df = d_invert(dg)(g)   # L_f[dg], evaluated at g(x)
torch.allclose(df, dy)
'''
Outputs:
True
'''