Derivatives of key functions
0. \( f(A) = \sigma_i(A) \), the i-th singular value of A
Assume all singular values are distinct. The trick is to find a closed-form expression for \( \sigma_i(A) \) in terms of \( A \); calculating the gradient from that point is the easy part. Given the SVD \( A = UDV^T \), with \( u_i, v_i \) the i-th columns of \( U, V \) respectively, note that:$$
u_i^TAv_i
= u_i^TUDV^Tv_i
= \underbrace{\rose{u_i^TU}}_{\rose{e_i^T}} D \underbrace{\rose{V^Tv_i}}_{\rose{e_i}}
= \rose{e_i^T} D \rose{e_i}
= \amber{\sigma_i(A)}
$$This gives us \( \sigma_i(A) = u_i^TAv_i \). Differentiating, the \( du_i \) and \( dv_i \) terms vanish (since \( Av_i = \sigma_iu_i \), \( u_i^TA = \sigma_iv_i^T \), and \( u_i^Tdu_i = v_i^Tdv_i = 0 \) for unit vectors), leaving:
\[
\begin{align}
df &= u_i^T \ dA \ v_i \\
&= \text{tr}(u_i^T \ dA \ v_i) \ \grey{\text{(trace of a scalar)}} \\
&= \text{tr}(v_iu_i^T \ dA ) \ \grey{\text{(cyclic property of trace)}} \\
&= \langle \underbrace{\amber{u_iv_i^T}}_{\amber{\nabla f}}, dA \rangle
\end{align}
\]
Thus the gradient of \( \sigma_i(A) \) is given by \( \nabla f = u_iv_i^T \). We can verify this numerically:
import torch

# Numerical check of d(sigma_i) = <u_i v_i^T, dA>
x = torch.randn(1000, 1000).double()
dx = torch.randn_like(x).mul(1e-7)   # small random perturbation
u1, d1, v1 = torch.linalg.svd(x)     # note: torch.linalg.svd returns U, S, V^T
u2, d2, v2 = torch.linalg.svd(x + dx)
index = 0
df_true = d2[index] - d1[index]      # actual change in the i-th singular value
# predicted change: <u_i v_i^T, dx>; v1 is V^T, so v1.t()[:, index] is v_i
df_pred = ((u1[:,index].view(-1,1) @ v1.t()[:,index].view(1,-1)) * dx).sum()
torch.allclose(df_true, df_pred)
'''
Outputs:
True
'''
1. Jacobian of elementwise functions
The Jacobian of an elementwise function \( f: R^n \rightarrow R^n \) is a diagonal matrix of shape \( n \times n \). For example, consider the sigmoid \( \sigma(x) = \frac{1}{1+e^{-x}} \). Its derivative is \( \sigma'(x) = \sigma(x) (1 - \sigma(x)) \). Applying it to each element of a vector we get:
\begin{align}
f(\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}) &= \begin{bmatrix} \sigma(x_1) \\ \sigma(x_2) \\ \sigma(x_3) \end{bmatrix} \\
\nabla f &= \begin{bmatrix} \sigma'(x_1) & 0 & 0 \\ 0 & \sigma'(x_2) & 0 \\ 0 & 0 & \sigma'(x_3) \end{bmatrix} \\
&= \begin{bmatrix} \sigma(x_1) (1 - \sigma(x_1)) & 0 & 0 \\ 0 & \sigma(x_2) (1 - \sigma(x_2)) & 0 \\ 0 & 0 & \sigma(x_3) (1 - \sigma(x_3)) \end{bmatrix}
\end{align}
In practice, it is easier to store the diagonal matrix as a vector, as in the sketch below.
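A minimal sketch (assuming PyTorch and double precision; not part of the derivation above) comparing the stored vector against the dense Jacobian returned by torch.autograd.functional.jacobian:
import torch

# The Jacobian of an elementwise sigmoid is diagonal, so the vector
# sigma(x) * (1 - sigma(x)) carries all the information.
x = torch.randn(3, dtype=torch.double)
s = torch.sigmoid(x)
jac_diag = s * (1 - s)   # stored as a length-3 vector

# Dense 3x3 Jacobian computed by autograd for comparison
jac_full = torch.autograd.functional.jacobian(torch.sigmoid, x)
print(torch.allclose(jac_full, torch.diag(jac_diag)))   # expected: True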
2. Softmax function
The softmax function (\( \text{softmax}: R^n \rightarrow R^n \)) is defined as:$$ \text{softmax}(x) = \frac{e^x}{\sum_{i=1}^n e^{x_i}} := y $$We calculate the partial derivative:$$
\frac{\partial \ y_i} {\partial x_j} =
\begin{cases}
y_i - y_iy_j = y_i(1 - y_i) & i = j \\
0 -y_iy_j = -y_iy_j & i \neq j
\end{cases}
$$This expression forms the entry \( (i,j) \) of the Jacobian matrix, and the Jacobian takes a succinct form:$$ \amber{\text{Jacobian}(y) = \text{diag}(y) - y y^T} $$
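The closed form can be sanity-checked with a small PyTorch sketch (dimensions chosen arbitrarily) against the Jacobian computed by autograd:
import torch

# Verify Jacobian(y) = diag(y) - y y^T for the softmax
x = torch.randn(5, dtype=torch.double)
y = torch.softmax(x, dim=0)
jac_closed_form = torch.diag(y) - torch.outer(y, y)

jac_autograd = torch.autograd.functional.jacobian(lambda t: torch.softmax(t, dim=0), x)
print(torch.allclose(jac_closed_form, jac_autograd))   # expected: True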
3. Row-wise vector functions
This can be seen as an extension of the idea of elementwise functions. If a function \( f: {R^\text{in}} \rightarrow {R^\text{out}} \) is applied to each row of a matrix of shape \( (n, \text{in}) \), the output is a matrix of shape \( (n, \text{out}) \). The Jacobian for each row is a matrix of shape \( (\text{out}, \text{in}) \) and there are \( n \) of them.
\( f \) acts on each row of the matrix with shape \( (n, \text{in}) \).
Another point of view is to treat \( f \) as a function acting on the entire matrix. From that perspective, the Jacobian has shape \( ({n \times \text{out}}, {n \times \text{in}}) \). This matrix is block diagonal, where each block has shape \( (\text{out}, \text{in}) \) and there are \( n \) of them.
The Jacobian is a block diagonal matrix with each block of shape \( (\text{out}, \text{in}) \).
It is clear that storing the entire block diagonal matrix is a lot more expensive than storing the individual Jacobian matrices. This idea is explored later in the article on VJP and JVP.
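The block diagonal structure can be made concrete with a small PyTorch sketch (softmax is used as the row-wise function purely for illustration):
import torch

# A row-wise function on an (n, in) matrix has a block diagonal
# (n*out, n*in) Jacobian with one (out, in) block per row.
n, d = 4, 3   # here out = in = d
X = torch.randn(n, d, dtype=torch.double)
f = lambda M: torch.softmax(M, dim=1)   # acts on each row independently

# Full Jacobian from autograd: shape (n, d, n, d), flattened to (n*d, n*d)
jac_full = torch.autograd.functional.jacobian(f, X).reshape(n * d, n * d)

# Per-row Jacobians: diag(y_k) - y_k y_k^T, one (d, d) block per row
Y = f(X)
blocks = [torch.diag(Y[k]) - torch.outer(Y[k], Y[k]) for k in range(n)]

# Assembling the blocks on the diagonal recovers the full Jacobian
print(torch.allclose(jac_full, torch.block_diag(*blocks)))   # expected: True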