Image quality assessment with CLIP projection matrices

This is a neat little trick that uses i) multi-modality and ii) projection matrices for image quality assessment. The idea is that you can describe image quality in text and then compare the corresponding text vector with an image vector. For example, you can define a prompt like "a high quality image" and find the images with the highest cosine similarity to its text vector.
The problem with using text prompts
There are two issues with describing the image quality in English:
• The prompt has limited token length.
• If you provide multiple prompts, there is no clear way to combine the cosine similarities of an image with each prompt. E.g., the prompts "scenic" and "picturesque" are very similar to each other but not as similar to the prompt "good composition". You might want to minimize such strong correlations when selecting prompts.
The solution: projection matrix
Instead of combining the cosine similarities of an image vector with multiple text vectors, you can calculate the norm of the projection of the unit image vector onto the subspace spanned by the text vectors. This equals the absolute value of the cosine similarity when the subspace is one-dimensional.
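To see why the one-dimensional case reduces to cosine similarity, take a single text vector \( t \) and a unit image vector \( \hat{x} \) (symbols introduced here only for illustration). Projecting \( \hat{x} \) onto the line spanned by \( t \) and taking the norm gives:

\[ \left\| \frac{t^T \hat{x}}{t^T t}\, t \right\| = \frac{|t^T \hat{x}|}{\|t\|} = |\cos\theta| \]

where \( \theta \) is the angle between \( t \) and \( \hat{x} \).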
If the text vectors \( t_i \) are the columns of a matrix \( M = \begin{bmatrix} | & | & & | \\ t_1 & t_2 & \cdots & t_n \\ | & | & & | \end{bmatrix} \), the projection matrix is \( M(M^TM)^{-1}M^T \). This assumes the text vectors are linearly independent, so that \( M^TM \) is invertible. This can be implemented using torch like so:
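import torch

def create_projection_matrix(vectors):
    # vectors: (d, n) tensor whose columns are the text vectors t_1 ... t_n (the matrix M)
    # Projection onto their span: P = M (M^T M)^-1 M^T
    return vectors @ torch.linalg.inv(vectors.t() @ vectors) @ vectors.t()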
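As a minimal sketch of how such a matrix might be used for scoring, the snippet below builds a projection matrix from stand-in text vectors and computes the projection norm of unit-normalized image vectors. The random tensors are placeholders for real CLIP embeddings, and `projection_score` is a hypothetical helper name, not something taken from the demo repo.

```python
import torch


def create_projection_matrix(vectors):
    # vectors: (d, n) tensor whose n columns are the text vectors
    return vectors @ torch.linalg.inv(vectors.t() @ vectors) @ vectors.t()


def projection_score(image_vectors, projection):
    # image_vectors: (k, d) tensor with one image embedding per row.
    # Normalize each row to unit length, project it, and return the norms.
    unit = image_vectors / image_vectors.norm(dim=-1, keepdim=True)
    # P is symmetric, so `unit @ P` projects every row at once.
    return (unit @ projection).norm(dim=-1)


torch.manual_seed(0)
d, n, k = 64, 5, 8
# Random stand-ins for CLIP embeddings (assumption: real ones would come
# from a CLIP text encoder and image encoder). float64 keeps the inverse accurate.
text_vectors = torch.randn(d, n, dtype=torch.float64)
image_vectors = torch.randn(k, d, dtype=torch.float64)

P = create_projection_matrix(text_vectors)
scores = projection_score(image_vectors, P)  # shape (k,), each value in [0, 1]

# With a single text vector the score reduces to |cosine similarity|.
single = text_vectors[:, :1]
scores_1d = projection_score(image_vectors, create_projection_matrix(single))
cosine = torch.nn.functional.cosine_similarity(
    image_vectors, single.t().expand(k, d), dim=-1
)
```

Note that encoders typically return one embedding per row, so a matrix of text embeddings would likely need to be transposed before being passed to `create_projection_matrix`.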
You can even create a list of negative prompts and generate a projection matrix from them. The smaller the norm of the image vector's projection onto this subspace, the better the image quality!
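One way to combine the two subspaces, sketched below, is to subtract the negative projection norm from the positive one. The equal weighting, the `quality_score` name, the `neg_weight` parameter, and the random stand-in embeddings are all assumptions for illustration; the demo repo may combine the two differently.

```python
import torch


def create_projection_matrix(vectors):
    # vectors: (d, n) tensor whose columns are prompt embeddings
    return vectors @ torch.linalg.inv(vectors.t() @ vectors) @ vectors.t()


def quality_score(image_vectors, pos_projection, neg_projection, neg_weight=1.0):
    # Hypothetical scoring rule: reward closeness to the positive subspace
    # and penalize closeness to the negative one.
    unit = image_vectors / image_vectors.norm(dim=-1, keepdim=True)
    pos = (unit @ pos_projection).norm(dim=-1)
    neg = (unit @ neg_projection).norm(dim=-1)
    return pos - neg_weight * neg


torch.manual_seed(0)
d = 64
# Random stand-ins for the embeddings of 10 positive prompts, 10 negative
# prompts, and 100 images (assumption: real ones would come from CLIP).
pos_vectors = torch.randn(d, 10, dtype=torch.float64)
neg_vectors = torch.randn(d, 10, dtype=torch.float64)
image_vectors = torch.randn(100, d, dtype=torch.float64)

scores = quality_score(
    image_vectors,
    create_projection_matrix(pos_vectors),
    create_projection_matrix(neg_vectors),
)

# Indices of the ten highest-scoring and ten lowest-scoring images
top_ten = torch.topk(scores, k=10).indices
bottom_ten = torch.topk(scores, k=10, largest=False).indices
```

Since each projection norm lies in [0, 1] for a unit vector, this combined score lies in [-1, 1], which makes it easy to rank images from best to worst.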
An Example
I created a simple demo in this repo. Below are the arrays of positive and negative prompts that I used:
POSITIVE_PROMPTS = [
    "A high-quality portrait photo",
    "Good composition",
    "Good lighting",
    "Happy",
    "Cute",
    "Smiling",
    "Beautiful",
    "A person smiling",
    "Face clearly visible",
    "People celebrating",
]

NEGATIVE_PROMPTS = [
    "Bad-quality photo",
    "Blurred photo",
    "Random photo",
    "Sad or angry",
    "Disturbing, scary",
    "Face partially visible",
    "Face covered",
    "Out of focus",
    "Too bright",
    "Too dark",
]
    
I used the KonIQ-10k IQA Database for this demo. I gave equal weights to the positive and negative projection matrices. Below are the top ten images based on these prompts:
And here are the bottom ten images:
The entire code is available here: https://github.com/vinsis/clip-projection-matrices.