Harnessing the Power of LMM for Enhanced Image Evaluation and Selection
This reference page explains how images can be evaluated with a Large Multi-Modal Model (LMM), specifically Gemini Flash 1.5, in combination with Python. The evaluation covers three areas: the characteristics of the image, its technical quality, and an AI evaluation. The guide is designed to help content professionals understand and use LMMs in semi- or fully automatic content pipelines.
Relevance of LMM
Large Multi-Modal Models (LMMs) such as Gemini Flash 1.5 can describe content, including images, automatically. Content professionals who understand these media-understanding capabilities can make their content processes more productive. LMMs can help select images based on soft and hard criteria, which makes content pipelines more efficient and effective.
Characteristics
The LMM analyzes the characteristics of the image to provide a comprehensive evaluation.
Shot
The LMM describes the shot or scene, offering insights into the composition, framing, and overall context of the image. An LMM like Gemini Pro 1.5 or GPT-4o can also be used to analyse an image and derive a scene from it.
The shot description might differ from the original prompt, which means the LMM interprets the image differently from what was requested. The larger the difference between the shot description and the prompt, the less closely the AI image generator adhered to the prompt.
Aesthetic Score
The LMM rates the aesthetics of the image on a scale of 0 to 1, with 0 indicating low aesthetic quality and 1 indicating high aesthetic quality.
Mood
The LMM determines the mood of the image, categorizing it as epic, dramatic, nostalgic, or other relevant moods.
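The three characteristics above are easiest to use downstream if the LMM is asked to reply in a structured form. The following is a minimal sketch, assuming the model has been prompted to answer with a JSON object; the key names `shot`, `aesthetic_score`, and `mood`, the `Characteristics` class, and the sample reply are all hypothetical and not taken from any Gemini API.

```python
import json
from dataclasses import dataclass


@dataclass
class Characteristics:
    shot: str               # free-text description of the shot or scene
    aesthetic_score: float  # 0 = low aesthetic quality, 1 = high
    mood: str               # e.g. "epic", "dramatic", "nostalgic"


def parse_characteristics(raw: str) -> Characteristics:
    """Parse and validate a JSON reply with the (assumed) keys above."""
    data = json.loads(raw)
    score = float(data["aesthetic_score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"aesthetic_score out of range: {score}")
    return Characteristics(
        shot=str(data["shot"]).strip(),
        aesthetic_score=score,
        mood=str(data["mood"]).strip().lower(),
    )


# Sample reply text, invented for illustration only.
reply = (
    '{"shot": "Wide shot of a lighthouse at dusk", '
    '"aesthetic_score": 0.82, "mood": "Nostalgic"}'
)
result = parse_characteristics(reply)
print(result)
```

Validating the score range at parse time keeps malformed model replies from silently entering a content pipeline.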
Quality
The LMM assesses the quality of the image based on various factors.
Entropy
The entropy of the image is calculated using Python and ranges from 0 to 10. A higher entropy value indicates a more dynamic image, while a lower entropy value suggests a more static image.
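The page does not say how the entropy is computed. One common choice is the Shannon entropy of the grayscale pixel histogram, sketched below under that assumption; the function name `shannon_entropy` is hypothetical. The sketch uses only the standard library and expects the pixels as 8-bit grayscale bytes.

```python
import math
from collections import Counter


def shannon_entropy(pixels: bytes) -> float:
    """Shannon entropy, in bits, of a sequence of 8-bit grayscale values."""
    if not pixels:
        raise ValueError("pixels must not be empty")
    total = len(pixels)
    counts = Counter(pixels)
    # H = sum over observed values of p * log2(1 / p), with p = n / total
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# With Pillow installed, the bytes could be obtained (assumption) via:
#   Image.open(path).convert("L").tobytes()
flat = bytes([128] * 64)    # a single gray level: nothing varies
varied = bytes(range(256))  # every gray level exactly once

print(shannon_entropy(flat), shannon_entropy(varied))
```

With this definition an 8-bit grayscale image cannot exceed 8 bits of entropy, so the 0 to 10 scale mentioned above presumably reflects a different calculation or scaling that the page does not specify.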
Noise
The noise level is analysed using Python. It serves as an indicator of image quality: 0 means no noise, and higher values mean more noise. A higher noise level may indicate lower image quality, but depending on the use case it can also stem from a deliberate visual effect.
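The page does not name the noise estimator it uses. As one possible illustration, the sketch below follows the Laplacian-mask approach known from Immerkær's fast noise estimation: a 3x3 mask that cancels flat and linearly shaded regions is applied to the image, and the average absolute response is scaled to a noise standard deviation. The function name `estimate_noise` and the list-of-rows input format are assumptions.

```python
import math

# 3x3 mask that gives zero response on constant and linear image regions.
KERNEL = ((1, -2, 1), (-2, 4, -2), (1, -2, 1))


def estimate_noise(gray):
    """Estimate the noise standard deviation of a grayscale image.

    `gray` is a list of rows, each a list of pixel intensities.
    """
    height, width = len(gray), len(gray[0])
    if height < 3 or width < 3:
        raise ValueError("image must be at least 3x3 pixels")
    total = 0.0
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            response = sum(
                KERNEL[dy][dx] * gray[y - 1 + dy][x - 1 + dx]
                for dy in range(3)
                for dx in range(3)
            )
            total += abs(response)
    scale = math.sqrt(math.pi / 2) / (6 * (width - 2) * (height - 2))
    return scale * total


flat = [[128] * 5 for _ in range(5)]
checker = [[255 * ((x + y) % 2) for x in range(5)] for y in range(5)]
print(estimate_noise(flat), estimate_noise(checker))
```

Like any such estimator, this one also reacts to fine texture, so a high value is a hint to inspect the image rather than proof of poor quality.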
Prompt CLIP Score
The CLIP score is computed programmatically using the ViT-L/14 model, which is more accurate but slower than smaller CLIP variants. It measures the similarity between an AI-generated image and its corresponding text caption, here the prompt. The CLIP score ranges from -1 to +1: +1 represents perfect similarity, while values near 0 or below indicate that the image and the text have little in common.
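A score in the range -1 to +1 is what a cosine similarity between two embedding vectors produces. The sketch below shows only that final step, assuming the image and text embeddings have already been produced by a CLIP model such as ViT-L/14; the function name and the short toy vectors are hypothetical stand-ins for real embeddings, which have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors, in [-1, +1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


# Toy embeddings, invented for illustration only.
image_embedding = [0.2, 0.7, 0.1]
text_embedding = [0.25, 0.6, 0.2]
clip_score = cosine_similarity(image_embedding, text_embedding)
print(round(clip_score, 3))
```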
AI Evaluation
The LMM provides insights into the likelihood of AI generation and any image errors that may exist.
Likelihood of AI
The LMM estimates the likelihood that the image is AI-generated on a scale of 0 to 1, with 0 indicating that the image is very unlikely to be AI-generated and 1 indicating that it very likely is.
Image Errors
The LMM assesses any image errors that may exist, providing valuable insights for image correction and enhancement.
In most cases, however, LMMs will not detect or understand distortions or anatomical errors such as missing or extra fingers.
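Once the individual metrics are available, selecting images by hard and soft criteria can be as simple as a filter followed by a ranking. The sketch below is one possible arrangement, not a prescribed method: the `ImageReport` class, the choice of which metric counts as hard or soft, the threshold values, and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ImageReport:
    name: str
    aesthetic_score: float  # 0..1, from the LMM
    clip_score: float       # -1..+1, prompt similarity
    noise: float            # 0 = no noise, higher = more noise


def select_images(reports, min_clip=0.2, max_noise=5.0, top_k=2):
    """Apply hard criteria as filters, then rank by a soft criterion."""
    # Hard criteria: an image failing either threshold is rejected outright.
    passed = [
        r for r in reports
        if r.clip_score >= min_clip and r.noise <= max_noise
    ]
    # Soft criterion: prefer the higher aesthetic score among survivors.
    ranked = sorted(passed, key=lambda r: r.aesthetic_score, reverse=True)
    return ranked[:top_k]


# Sample reports, invented for illustration only.
reports = [
    ImageReport("a.png", aesthetic_score=0.90, clip_score=0.10, noise=1.0),
    ImageReport("b.png", aesthetic_score=0.70, clip_score=0.30, noise=2.0),
    ImageReport("c.png", aesthetic_score=0.80, clip_score=0.28, noise=1.5),
    ImageReport("d.png", aesthetic_score=0.95, clip_score=0.35, noise=9.0),
]
selected = select_images(reports)
print([r.name for r in selected])
```

Because LMMs tend to miss errors such as malformed hands, a pipeline built this way would likely still benefit from a human review step before publication.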
Conclusion
By leveraging the capabilities of LMMs for image evaluation, content professionals can make informed decisions about image selection and enhancement. This guide serves as a valuable resource for understanding the various evaluation factors, ultimately enhancing the effectiveness and efficiency of content pipelines.