Derivatives of key functions

0. \( f(A) = \sigma_i(A) \), the i-th singular value of A
Assume all singular values are distinct. The trick is to find a closed-form expression for \( \sigma_i(A) \) in terms of \( A \); computing the gradient from that point is the easy part. Given the SVD \( A = UDV^T \), with \( u_i, v_i \) the i-th columns of \( U, V \) respectively, note that:$$ u_i^T A v_i = u_i^T U D V^T v_i = \underbrace{u_i^T U}_{e_i^T} \, D \, \underbrace{V^T v_i}_{e_i} = e_i^T D e_i = \sigma_i(A) $$This gives us (the terms involving \( du_i \) and \( dv_i \) vanish, since \( du_i^T A v_i = \sigma_i \, du_i^T u_i = \tfrac{\sigma_i}{2} \, d(u_i^T u_i) = 0 \) because \( u_i \) has unit norm, and similarly \( u_i^T A \, dv_i = 0 \)): \[ \begin{align} df &= u_i^T \ dA \ v_i \\ &= \text{tr}(u_i^T \ dA \ v_i) \ \text{(trace of a scalar)} \\ &= \text{tr}(v_iu_i^T \ dA ) \ \text{(cyclic property of trace)} \\ &= \langle \underbrace{u_iv_i^T}_{\nabla f}, dA \rangle \end{align} \] Thus the gradient of \( \sigma_i(A) \) is given by \( \nabla f = u_iv_i^T \).
import torch
x = torch.randn(1000,1000).double()
dx = torch.randn_like(x).mul(1e-7)

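# Note: torch.linalg.svd returns (U, S, Vh) with Vh = V^T, so v1 and v2 below hold V^T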
u1,d1,v1 = torch.linalg.svd(x)
u2,d2,v2 = torch.linalg.svd(x+dx)

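# Compare the finite-difference change in sigma_index with the predicted <u_i v_i^T, dx>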
index = 0
df_true = d2[index] - d1[index]
df_pred = ((u1[:,index].view(-1,1) @ v1.t()[:,index].view(1,-1)) * dx).sum()

torch.allclose(df_true, df_pred)

'''
Outputs:
True
'''
1. Jacobian of elementwise functions
The Jacobian of an elementwise function \( f: R^n \rightarrow R^n \) is a diagonal matrix of shape \( n \times n \). For example, consider the sigmoid function \( \sigma(x) = \frac{1}{1+e^{-x}} \). Its derivative is \( \sigma'(x) = \sigma(x) (1 - \sigma(x)) \). Applying it to each element of a vector we get: \begin{align} f(\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}) &= \begin{bmatrix} \sigma(x_1) \\ \sigma(x_2) \\ \sigma(x_3) \end{bmatrix} \\ \nabla f &= \begin{bmatrix} \sigma'(x_1) & 0 & 0 \\ 0 & \sigma'(x_2) & 0 \\ 0 & 0 & \sigma'(x_3) \end{bmatrix} \\ &= \begin{bmatrix} \sigma(x_1) (1 - \sigma(x_1)) & 0 & 0 \\ 0 & \sigma(x_2) (1 - \sigma(x_2)) & 0 \\ 0 & 0 & \sigma(x_3) (1 - \sigma(x_3)) \end{bmatrix} \end{align} In practice it is cheaper to store just the diagonal of this matrix as a vector.
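As a quick numerical check, here is a minimal sketch (not part of the original text) comparing the vector \( \sigma(x)(1-\sigma(x)) \) with the full Jacobian computed by torch.autograd.functional.jacobian:
import torch

x = torch.randn(3).double()

# Full (3, 3) Jacobian of the elementwise sigmoid, computed by autograd
J = torch.autograd.functional.jacobian(torch.sigmoid, x)

# Diagonal stored as a vector: sigma(x) * (1 - sigma(x))
s = torch.sigmoid(x)
torch.allclose(J, torch.diag(s * (1 - s)))

'''
Outputs:
True
'''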
2. Softmax function
The softmax function (\( \text{softmax}: R^n \rightarrow R^n \)) is defined as:$$ \text{softmax}(x) = \frac{e^x}{\sum_{i=1}^n e^{x_i}} := y $$We calculate the partial derivative:$$ \frac{\partial \ y_i} {\partial x_j} = \begin{cases} y_i - y_iy_j = y_i(1 - y_i) & i = j \\ 0 -y_iy_j = -y_iy_j & i \neq j \end{cases} $$This expression forms the entry \( (i,j) \) of the Jacobian matrix, and the Jacobian takes a succinct form:$$ \text{Jacobian}(y) = \text{diag}(y) - y y^T $$
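Again a minimal sketch (not part of the original text) to verify the closed form numerically:
import torch

x = torch.randn(5).double()
y = torch.softmax(x, dim=0)

# Jacobian from autograd, shape (5, 5)
J_auto = torch.autograd.functional.jacobian(lambda t: torch.softmax(t, dim=0), x)

# Closed form: diag(y) - y y^T
J_closed = torch.diag(y) - torch.outer(y, y)

torch.allclose(J_auto, J_closed)

'''
Outputs:
True
'''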
3. Row-wise vector functions
This can be seen as an extension of the idea of elementwise functions. If a function \( f: {R^\text{in}} \rightarrow {R^\text{out}} \) is applied to each row of a matrix of shape \( (n, \text{in}) \), the output is a matrix of shape \( (n, \text{out}) \). The Jacobian for each row is a matrix of shape \( (\text{out}, \text{in}) \) and there are \( n \) of them.
\( \begin{bmatrix} - & x_1^T & - \\ - & x_2^T & - \\ - & x_3^T & - \end{bmatrix} \xrightarrow{\ f\ } \begin{bmatrix} - & y_1^T & - \\ - & y_2^T & - \\ - & y_3^T & - \end{bmatrix} \)
\( f \) acts on each row of the matrix with shape \( (n, \text{in}) \).
Another point of view is to treat \( f \) as a function acting on the entire matrix. From that point of view, the Jacobian has shape \( (n \cdot \text{out}, \ n \cdot \text{in}) \). This matrix is block diagonal, where each block has shape \( (\text{out}, \text{in}) \) and there are \( n \) such blocks.
The Jacobian is a block diagonal matrix with each block of shape \( (\text{out}, \text{in}) \).
It is clear that storing the entire block diagonal matrix is far more expensive than storing the \( n \) individual Jacobian blocks. This idea is explored later on in the article on VJP and JVP.
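To make the block diagonal structure concrete, here is a minimal sketch (not from the original text) that uses row-wise softmax as the example \( f \) and checks the full Jacobian against torch.block_diag of the per-row Jacobians:
import torch

n, dim = 4, 3                            # n rows, in = out = dim for this example
X = torch.randn(n, dim).double()
f = lambda M: torch.softmax(M, dim=1)    # acts on each row independently

# Full Jacobian has shape (n, dim, n, dim); flatten it to (n*dim, n*dim)
J = torch.autograd.functional.jacobian(f, X).reshape(n * dim, n * dim)

# Expected block diagonal matrix built from the per-row softmax Jacobians
Y = f(X)
blocks = [torch.diag(y) - torch.outer(y, y) for y in Y]
J_blocks = torch.block_diag(*blocks)

torch.allclose(J, J_blocks)

'''
Outputs:
True
'''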