Using CLIP Score to evaluate images
CLIP Score is a widely recognized method for measuring the similarity between an AI-generated image and its corresponding text caption. It is a powerful tool for computer vision and language understanding tasks.
The goal of CLIP is to enable models to understand the relationship between visual and textual data and to use this understanding to perform various tasks, such as image captioning, visual question answering, and image retrieval.
Why evaluate an AI image?
The CLIP score is an established method to measure how closely an image matches a text. You need indicators to measure the quality of an image and to check that it matches a text caption. With the right indicators, this process can be automated, which might be relevant for storytelling or storyboard development, for example.
For a manual, more exploratory approach it is not that relevant, but if you want to maintain style and quality across a series of images, the CLIP score is helpful. The CLIP score returns values between -1 and +1.
CLIP is not always accurate, especially for details such as artistic styles or artist names. However, it is still helpful because CLIP is the industry standard for this task, and it also guides you in prompt development and engineering.
What is CLIP?
CLIP (Contrastive Language-Image Pretraining) is an OpenAI model that combines computer vision and natural language understanding capabilities; it is considered an image classification method.
It is trained on a large set of images with captions to learn representations for images and text in a joint embedding space. Images and their captions are close together in this space, while unrelated images and captions are further apart.
CLIP can extract an embedding from an image, and this embedding can be compared with the embedding of a given text.
Which problems does CLIP solve?
CLIP learns to understand and generate embeddings for images and text in a shared representation space. Converting an image to text means extracting an embedding of the image and looking up similar text embeddings in the CLIP model.
CLIP is capable of recognizing the meaning and context of words in the captions, and then finding the corresponding images in a database.
By leveraging powerful AI algorithms, CLIP can quickly and accurately map an image caption to the relevant image, allowing for faster and more accurate image searches.
CLIP enables machines to learn how to interpret natural language and understand the relationships between images and their associated captions in order to make accurate predictions.
This allows machines to answer questions about an image, identify objects in the image, and make other predictions.
By training on a large variety of images and associated captions, CLIP can be used to create more robust models that are able to generalize better to unseen data.
Ultimately, CLIP provides a powerful tool for understanding and generating embeddings for images and text in a shared representation space.
This allows the model to perform tasks requiring computer vision and natural language understanding capabilities.
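To make the image-search idea above concrete, here is a minimal sketch of text-to-image retrieval. It assumes the openai clip package that is installed in the Code Example section below; the image paths and the query text are placeholders:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Embed a small "database" of images (placeholder paths)
image_paths = ["cat.jpg", "dog.jpg", "car.jpg"]
with torch.no_grad():
    images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
    image_features = model.encode_image(images)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    # Embed the search query and rank the images by cosine similarity
    text_tokens = clip.tokenize(["a photo of a dog"]).to(device)
    text_features = model.encode_text(text_tokens)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

similarities = (image_features @ text_features.T).squeeze(1)
best = similarities.argmax().item()
print(f"Best match: {image_paths[best]} (similarity {similarities[best].item():.3f})")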
How to compute CLIP score
To compute a CLIP score, compare the similarity of an image and a text description within the same embedding space. Cosine similarity, a metric that determines the cosine of the angle between two vectors in a multidimensional space, can be used to measure the similarity. The cosine similarity scale is -1 to 1:
+1 indicates that the vectors are identical
0 indicates that they are orthogonal
-1 indicates that they are opposite
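As a quick illustration of the metric itself (a minimal sketch in PyTorch, independent of CLIP), the cosine similarity of two vectors depends only on their direction, not their length:

import torch
import torch.nn.functional as F

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([2.0, 4.0, 6.0])     # same direction as a, different length
c = torch.tensor([-1.0, -2.0, -3.0])  # opposite direction

print(F.cosine_similarity(a, b, dim=0))  # tensor(1.)  -> identical direction
print(F.cosine_similarity(a, c, dim=0))  # tensor(-1.) -> opposite direction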
High-level breakdown of how to compute a CLIP score
Get the pre-trained CLIP model from OpenAI’s GitHub repository or load it with a library like Hugging Face’s Transformers.
Prepare your image and text as follows: Tokenize your text input and process your image using the appropriate preprocessing steps (such as resizing and normalization).
Produce embeddings: Use the CLIP model to create embeddings (vector representations) for both the image and the text.
Calculate the cosine similarity: Compute the cosine similarity between the image and text embeddings. This will provide you with the CLIP score.
A higher CLIP score indicates that the image and text are more semantically related, whereas a lower score indicates that they are not. This score can be used to perform tasks such as zero-shot classification, image caption retrieval, and image ranking based on relevance to a given text query.
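Putting these four steps together, here is a minimal sketch using Hugging Face's Transformers, one of the loading options mentioned above. The model name openai/clip-vit-base-patch32, the image path, and the example caption are placeholders you would replace with your own:

from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Step 1: load a pre-trained CLIP model and its processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Step 2: prepare the image and tokenize the text
image = Image.open("path/to/your/image.jpg")
inputs = processor(text=["a photo of a dog"], images=image, return_tensors="pt", padding=True)

# Step 3: produce embeddings for the image and the text
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])

# Step 4: the cosine similarity of the normalized embeddings is the CLIP score
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print((image_emb @ text_emb.T).item())

The next section shows the same computation with OpenAI's clip package.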
Code Example
To calculate the CLIP score using Python, you can use the clip package provided by OpenAI. It also runs on Apple Silicon; on a Mac Studio, execution took less than 10 seconds.
First, you need to install the package and download the necessary model weights:
pip install -U torch torchvision
pip install -U git+https://github.com/openai/CLIP.git
After installing the required packages, you can use the following Python code to calculate the CLIP score:
import torch
import clip
from PIL import Image
def get_clip_score(image_path, text):
    # Load the pre-trained CLIP model and the image
    model, preprocess = clip.load('ViT-B/32')
    image = Image.open(image_path)

    # Preprocess the image and tokenize the text
    image_input = preprocess(image).unsqueeze(0)
    text_input = clip.tokenize([text])

    # Move the inputs to GPU if available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    image_input = image_input.to(device)
    text_input = text_input.to(device)
    model = model.to(device)

    # Generate embeddings for the image and text
    with torch.no_grad():
        image_features = model.encode_image(image_input)
        text_features = model.encode_text(text_input)

    # Normalize the features
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # Calculate the cosine similarity to get the CLIP score
    clip_score = torch.matmul(image_features, text_features.T).item()

    return clip_score
image_path = "path/to/your/image.jpg"
text = "your text description"
score = get_clip_score(image_path, text)
print(f"CLIP Score: {score}")
Replace “path/to/your/image.jpg” with the path to your image file and “your text description” with the text to which you want to compare the image. The get_clip_score function will return the CLIP score, the cosine similarity between the image and text embeddings.
Remember that this code assumes you have the necessary hardware and software to run the CLIP model, such as an NVIDIA GPU with CUDA and cuDNN installed. If you don’t have a GPU, the code will still work but may be slower on a CPU.
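As a small follow-up sketch, the score can also be used to rank a set of candidate captions for one image, as mentioned above for caption retrieval. It assumes the get_clip_score function defined earlier; the captions are placeholders. Note that get_clip_score reloads the model on every call, so for larger caption sets you would load the model once and reuse it:

image_path = "path/to/your/image.jpg"
candidate_captions = [
    "a watercolor painting of a lighthouse at sunset",
    "a black-and-white photo of a busy city street",
    "a close-up photo of a cat sleeping on a sofa",
]

# Compute one CLIP score per caption and sort from most to least similar
scores = {caption: get_clip_score(image_path, caption) for caption in candidate_captions}
for caption, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{score:.3f}  {caption}")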
CLIP Benchmarking
CLIP Benchmark is an API that allows you to compare different CLIP models and versions.
You can learn about these models’ performance, strengths, and weaknesses by testing them on various tasks and datasets. The benchmark can assist you in making informed decisions about which model or version to use for your particular application or project.
CLIP Benchmark typically evaluates models on a variety of criteria, including:
- Accuracy: The model’s performance on tasks such as image classification, object detection, and caption generation. This metric can help you understand how effectively the model makes accurate predictions or embeddings.
- Robustness: The model’s ability to generalize to new or out-of-distribution data. This can assist you in determining whether the model is likely to perform well in real-world scenarios where it may encounter data that differs from its training set.
- Efficiency: The rate at which the model processes inputs and generates outputs. This metric is handy when deploying models in production environments or working with large datasets.
- Transfer learning capability: A model’s ability to be fine-tuned or adapted to new tasks with little or no additional training. This is especially important when you only have a limited amount of labeled data for a particular job.
By comparing CLIP models and versions with the CLIP Benchmark, you can identify the best model for your application and make sure you are using the most recent and best-performing version.
Conclusions
Instead of dealing with two different types of unstructured content, an image and a text, you get a single number. That simplifies automated AI image evaluation.
It also allows you to evaluate the performance of generative AI models.
Different CLIP models and versions have different strengths and weaknesses; with CLIP Benchmark, they can be compared.