The Kronecker Product Visualized

A while ago I came across a small but amusing problem - how to visualize the attention value of each image patch. To provide some context, a Vision Transformer (ViT) divides an image into non-overlapping patches. For example, an image of dimensions 224x224 is divided into 7x7 patches, each patch of dimension 32x32. Each patch is then converted into a vector and fed to transformer layers. We can extract the attention values from the last transformer layer. These attention values (each value is between 0 and 1) form a 7x7 grid, with each value corresponding to a 32x32 patch in the image. We can overlay these values on top of an image by multiplying each patch with the corresponding value. An example is shown below.

An image is divided into 7x7 patches. Each patch is assigned a probability by the attention layer.

Before I talk about its connection with the title of this article, think about how you will implement it. At the end of this article, I will mention its connection with Kronecker product.

1. Definition of Kronecker product

The Kronecker product of two matrices \( A \) of shape \( (r_A,c_A) \) and \( B \) of shape \( (r_B,c_B) \) is a matrix denoted by \( A⊗B \) of shape \( (r_A*r_B,c_A*c_B) \) where each element of A, \( a_{ij} \) is replaced by a block matrix \( a_{ij} * B \). Here \( a_{ij} * B \) is the elementwise product of \( a_{ij} \) with each element of \( B \).

Kronecker product can be seen as replacing each element c of the left matrix with the right matrix scaled by the value c.

An example is shown below:

\( \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix} \)\( \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} \)\(=\)\( \begin{bmatrix} a_{11}B & a_{12}B & a_{13}B \\ a_{21}B & a_{22}B & a_{23}B \end{bmatrix} \)\(=\)\( \begin{bmatrix} a_{11}b_{11} & a_{11}b_{21} & a_{12}b_{11} & a_{12}b_{21} & a_{13}b_{11} & a_{13}b_{21} \\ a_{11}b_{21} & a_{11}b_{22} & a_{12}b_{21} & a_{12}b_{22} & a_{13}b_{21} & a_{13}b_{22} \\ a_{21}b_{11} & a_{21}b_{21} & a_{22}b_{11} & a_{22}b_{21} & a_{23}b_{11} & a_{23}b_{21} \\ a_{21}b_{21} & a_{21}b_{22} & a_{22}b_{21} & a_{22}b_{22} & a_{23}b_{21} & a_{23}b_{22} \end{bmatrix} \)

2. Kronecker product of two vectors

Perhaps the most common Kronecker product of two vectors is the outer product - this is the product of a column vector of length \( m \) with a row vector of length \( n \). The outer product has the shape \( (m,n) \). Kronecker product of two column vectors is a column vector whose length is the product of individual column vectors. Similarly the Kronecker product of two row vectors is a row vector.

2.1 Some lesser known properties of Kronecker product of two vectors

2.1.1 Knonecker product of two unit vectors is a unit vector

If \( u \) and \( v \) are two unit vectors (i.e. they have a norm of 1), their Kronecker product \( u⊗v \) also has a norm of 1. If \( u⊗v \) is a matrix, you can view it as a vector by stacking its columns vertically. In general \( ||u⊗v||^2 = \Sigma_{i,j} u_i^2v_j^2 = (\Sigma_i u_i^2)(\Sigma_j v_j^2) = ||u||^2||v||^2 \). In other words, the norm of \( u⊗v \) is the product of norms of \( u \) and \( v \). Note that this is true even for products of matrices - you just need to view the matrix as a vector.

2.1.2 Kronecker product of two probability distributions is a probability distribution

If \( u \) and \( v \) are two (independent) probability distributions, their Kronecker product (viewed as a vector) is a (joint) probability distribution. More generally, Kronecker product of two stochastic matrices is itself a stochastic matrix. If these input matrices have steady states, the steady state of the output matrix is the Kronecker product of the steady states of the input matrices.

2.1.3 Connection with products of two polynomials

Given two polynomials \( p(x) = p_0 + p_1x + ... + p_mx^m \) and \( q(x) = q_0 + q_1x + ... + q_nx^n \), their product is \( p(x)q(x) = s_0 + s_1x + ... + s_{m+n}x^{m+n} \) where \( s_i \) is the sum of \( i-th \) diagonal of the product\( \begin{bmatrix} p_0 \\ p_1 \\ ... \\ p_m \end{bmatrix} \)\( ⊗ \)\( \begin{bmatrix} q_0 & q_1 & ... & q_n \end{bmatrix} \)For instance \( (a + bx + cx^2)(p + qx + rx^2 + sx^3) = s_0 + s_1x + s_2x^2 + s_3x^3 + s_4x^4 + s_5x^5 \) where \( s_i \) is the sum of entries along the \( i-th \) diagonal of the matrix \( \begin{bmatrix} a \\ b \\ c \end{bmatrix} \)\( ⊗ \)\( \begin{bmatrix} p & q & r & s \end{bmatrix} \):

The product of polynomials \( a + bx + cx^2 \) and \( p + qx + rx^2 + sx^3 \)

3. Bilinear form and Kronecker product: the most important takeaway

Note that we need to view a matrix as a vector (conventionally formed by stacking columns vertically) in order to interpret some properties of this product. This change in view is actually very helpful in solving equations of the form \( AX + XB = C \). The insight lies in viewing a bilinear form as a dot product.

Given column vectors \( b \) and \( c \) and a matrix \( M \), the bilinear form \( c^TMb \) can be seen as a dot product between \( b⊗c \) and \( vec(M) \) which is a vector formed by stacking columns of \( M \).

The bilinear form \( c^TMb \) can be seen as a dot product \( (b^T ⊗ c^T).vec(M) = (b⊗c)^T.vec(M) \)

3.1 Extending to products of the form \( CXB^T \)

This idea can be extended to products of the form \( CMB^T \). We have:\( vec(CMB^T) = (B⊗C).vec(M) \).

This is useful when writing linear operators as matrix-vector products in Matrix calculus. Eg\(dM^2 = (dM)M + M(dM) = I(dM)M + M(dM)I \Rightarrow vec(dM^2) = (M^T⊗I + I⊗M)vec(dM)\)where \( I \) is the identity matrix of the appropriate shape. This gives us:\( ∇M^2 = M^T⊗I + I⊗M \)

Similarly the Sylvester equation \( AX + XB = C \) can be solved like so:\( C = AX + XB = AXI + IXB \Rightarrow vec(C) = (I⊗A + B^T⊗I)vec(X) \)

4. Kronecker product preserves many properties of the matrices

It was mentioned earlier that if the input matrices are stochastic, their Kronecker product is also stochastic. Similarly, if the input matrices have a norm of 1, so does their Kronecker product. Similarly, if the input matrices are permutation matrices, their Kronecker product is also a permutation matrix. If the input matrices are unitary, so is their Kronecker product. For a complete list, check out the Wikipedia article .

5. Coming back to the original problem

Now let's revisit the problem mentioned at the beginning of this article, and how it relates to the Kronecker product. The Kronecker product of the 7x7 attention matrix with a matrix of 1s of shape 32x32 gives the desired matrix which, when multiplied elementwise to the input image, gives the final overlaid image.\( \begin{bmatrix} a_{00} & a_{01} & a_{02} & a_{03} & a_{04} & a_{05} & a_{06} \\ a_{10} & a_{11} & a_{12} & a_{13} & a_{14} & a_{15} & a_{16} \\ a_{20} & a_{21} & a_{22} & a_{23} & a_{24} & a_{25} & a_{26} \\ a_{30} & a_{31} & a_{32} & a_{33} & a_{34} & a_{35} & a_{36} \\ a_{40} & a_{41} & a_{42} & a_{43} & a_{44} & a_{45} & a_{46} \\ a_{50} & a_{51} & a_{52} & a_{53} & a_{54} & a_{55} & a_{56} \\ a_{60} & a_{61} & a_{62} & a_{63} & a_{64} & a_{65} & a_{66} \end{bmatrix} \)\( ⊗ [1] = \)\( \begin{bmatrix} a_{00}.[1] & a_{01}.[1] & a_{02}.[1] & a_{03}.[1] & a_{04}.[1] & a_{05}.[1] & a_{06}.[1] \\ a_{10}.[1] & a_{11}.[1] & a_{12}.[1] & a_{13}.[1] & a_{14}.[1] & a_{15}.[1] & a_{16}.[1] \\ a_{20}.[1] & a_{21}.[1] & a_{22}.[1] & a_{23}.[1] & a_{24}.[1] & a_{25}.[1] & a_{26}.[1] \\ a_{30}.[1] & a_{31}.[1] & a_{32}.[1] & a_{33}.[1] & a_{34}.[1] & a_{35}.[1] & a_{36}.[1] \\ a_{40}.[1] & a_{41}.[1] & a_{42}.[1] & a_{43}.[1] & a_{44}.[1] & a_{45}.[1] & a_{46}.[1] \\ a_{50}.[1] & a_{51}.[1] & a_{52}.[1] & a_{53}.[1] & a_{54}.[1] & a_{55}.[1] & a_{56}.[1] \\ a_{60}.[1] & a_{61}.[1] & a_{62}.[1] & a_{63}.[1] & a_{64}.[1] & a_{65}.[1] & a_{66}.[1] \end{bmatrix} \)

The Kronecker product of the attention values with a ones matrix \( [1] \) of shape\( (32,32) \) gives the desired output matrix.

Three interpretations of matrix products visualized