Derivatives of key functions

0. \( f(A) = \sigma_i(A) \), the i-th singular value of A
Assume all singular values are distinct. The trick is to find a closed-form expression for \( \sigma_i(A) \) in terms of \( A \); computing the gradient from that point is the easy part. Given the SVD \( A = UDV^T \), with \( u_i, v_i \) the i-th columns of \( U, V \) respectively, note that:$$ u_i^T A v_i = u_i^T U D V^T v_i = \underbrace{u_i^T U}_{e_i^T} \, D \, \underbrace{V^T v_i}_{e_i} = e_i^T D e_i = \sigma_i(A) $$This gives us (the terms involving \( du_i \) and \( dv_i \) vanish, since \( du_i^T A v_i = \sigma_i \, du_i^T u_i = \tfrac{\sigma_i}{2} \, d(u_i^T u_i) = 0 \) because \( u_i \) has unit norm, and similarly \( u_i^T A \, dv_i = 0 \)): \[ \begin{align} df &= u_i^T \ dA \ v_i \\ &= \text{tr}(u_i^T \ dA \ v_i) \ \text{(trace of a scalar)} \\ &= \text{tr}(v_iu_i^T \ dA ) \ \text{(cyclic property of trace)} \\ &= \langle \underbrace{u_iv_i^T}_{\nabla f}, dA \rangle \end{align} \] Thus the gradient of \( \sigma_i(A) \) is given by \( \nabla f = u_iv_i^T \).
import torch
x = torch.randn(1000,1000).double()
dx = torch.randn_like(x).mul(1e-7)

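# Note: torch.linalg.svd returns (U, S, Vh) with Vh = V^T, so v1 and v2 below hold V^T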
u1,d1,v1 = torch.linalg.svd(x)
u2,d2,v2 = torch.linalg.svd(x+dx)

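# Compare the finite-difference change in sigma_index with the predicted <u_i v_i^T, dx>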
index = 0
df_true = d2[index] - d1[index]
df_pred = ((u1[:,index].view(-1,1) @ v1.t()[:,index].view(1,-1)) * dx).sum()

torch.allclose(df_true, df_pred)

'''
Outputs:
True
'''
1. Jacobian of elementwise functions
The Jacobian of an elementwise function \( f: R^n \rightarrow R^n \) is a diagonal matrix of shape \( n \times n \). For example, consider the sigmoid function \( \sigma(x) = \frac{1}{1+e^{-x}} \). Its derivative is \( \sigma'(x) = \sigma(x) (1 - \sigma(x)) \). Applying it to each element of a vector we get: \begin{align} f(\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}) &= \begin{bmatrix} \sigma(x_1) \\ \sigma(x_2) \\ \sigma(x_3) \end{bmatrix} \\ \nabla f &= \begin{bmatrix} \sigma'(x_1) & 0 & 0 \\ 0 & \sigma'(x_2) & 0 \\ 0 & 0 & \sigma'(x_3) \end{bmatrix} \\ &= \begin{bmatrix} \sigma(x_1) (1 - \sigma(x_1)) & 0 & 0 \\ 0 & \sigma(x_2) (1 - \sigma(x_2)) & 0 \\ 0 & 0 & \sigma(x_3) (1 - \sigma(x_3)) \end{bmatrix} \end{align} In practice it is cheaper to store just the diagonal of this matrix as a vector.
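As a quick numerical check, here is a minimal sketch (not part of the original text) comparing the vector \( \sigma(x)(1-\sigma(x)) \) with the full Jacobian computed by torch.autograd.functional.jacobian:
import torch

x = torch.randn(3).double()

# Full (3, 3) Jacobian of the elementwise sigmoid, computed by autograd
J = torch.autograd.functional.jacobian(torch.sigmoid, x)

# Diagonal stored as a vector: sigma(x) * (1 - sigma(x))
s = torch.sigmoid(x)
torch.allclose(J, torch.diag(s * (1 - s)))

'''
Outputs:
True
'''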
2. Softmax function
The softmax function (\( \text{softmax}: R^n \rightarrow R^n \)) is defined as:$$ \text{softmax}(x) = \frac{e^x}{\sum_{i=1}^n e^{x_i}} := y $$We calculate the partial derivative:$$ \frac{\partial \ y_i} {\partial x_j} = \begin{cases} y_i - y_iy_j = y_i(1 - y_i) & i = j \\ 0 -y_iy_j = -y_iy_j & i \neq j \end{cases} $$This expression forms the entry \( (i,j) \) of the Jacobian matrix, and the Jacobian takes a succinct form:$$ \text{Jacobian}(y) = \text{diag}(y) - y y^T $$
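Again a minimal sketch (not part of the original text) to verify the closed form numerically:
import torch

x = torch.randn(5).double()
y = torch.softmax(x, dim=0)

# Jacobian from autograd, shape (5, 5)
J_auto = torch.autograd.functional.jacobian(lambda t: torch.softmax(t, dim=0), x)

# Closed form: diag(y) - y y^T
J_closed = torch.diag(y) - torch.outer(y, y)

torch.allclose(J_auto, J_closed)

'''
Outputs:
True
'''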
3. Row-wise vector functions
This can be seen as an extension of the idea of elementwise functions. If a function \( f: {R^\text{in}} \rightarrow {R^\text{out}} \) is applied to each row of a matrix of shape \( (n, \text{in}) \), the output is a matrix of shape \( (n, \text{out}) \). The Jacobian for each row is a matrix of shape \( (\text{out}, \text{in}) \) and there are \( n \) of them.
\( \begin{bmatrix} - & x_1^T & - \\ - & x_2^T & - \\ - & x_3^T & - \end{bmatrix} \xrightarrow{\ f\ } \begin{bmatrix} - & y_1^T & - \\ - & y_2^T & - \\ - & y_3^T & - \end{bmatrix} \)
\( f \) acts on each row of the matrix with shape \( (n, \text{in}) \).
Another point of view is to treat \( f \) as a function acting on the entire matrix. From that point of view, the Jacobian has shape \( (n \cdot \text{out}, \ n \cdot \text{in}) \). This matrix is block diagonal, where each block has shape \( (\text{out}, \text{in}) \) and there are \( n \) such blocks.
The Jacobian is a block diagonal matrix with each block of shape \( (\text{out}, \text{in}) \).
It is clear that storing the entire block diagonal matrix is far more expensive than storing the \( n \) individual Jacobian blocks. This idea is explored later on in the article on VJP and JVP.
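To make the block diagonal structure concrete, here is a minimal sketch (not from the original text) that uses row-wise softmax as the example \( f \) and checks the full Jacobian against torch.block_diag of the per-row Jacobians:
import torch

n, dim = 4, 3                            # n rows, in = out = dim for this example
X = torch.randn(n, dim).double()
f = lambda M: torch.softmax(M, dim=1)    # acts on each row independently

# Full Jacobian has shape (n, dim, n, dim); flatten it to (n*dim, n*dim)
J = torch.autograd.functional.jacobian(f, X).reshape(n * dim, n * dim)

# Expected block diagonal matrix built from the per-row softmax Jacobians
Y = f(X)
blocks = [torch.diag(y) - torch.outer(y, y) for y in Y]
J_blocks = torch.block_diag(*blocks)

torch.allclose(J, J_blocks)

'''
Outputs:
True
'''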