Move AI image generation into a new direction with better image descriptions
Engineering is all about creativity and problem-solving, and what better way to flex those muscles than by reinterpreting images using cutting-edge AI technology?
Image interrogation, also called “image-to-text”, can be used to create text prompts from an image. MM-ReAct is a more accurate way to do this.
This article will explore how Midjourney, MM-ReAct, and ChatGPT can transform images into something entirely new and unexpected.
What is Midjourney?
Midjourney is an AI image generator that uses advanced machine learning algorithms and deep neural networks to create new images that are visually stunning and creatively unique.
By analyzing and understanding visual data, Midjourney can generate new realistic and imaginative images, allowing for endless possibilities for creative expression.
What is MM-ReAct?
MM-ReAct is a system from Microsoft that pairs ChatGPT with a pool of vision experts, so it can describe images in a more nuanced and detailed way than a single vision-language model like CLIP.
Using advanced natural language processing algorithms, MM-ReAct can accurately identify and describe objects, scenes, and emotions in an image, allowing for a more nuanced visual data interpretation.
MM-ReAct effectively addresses complicated visual understanding and can be applied in different scenarios.
What is Image interrogation?
Image interrogation analyses and extracts information from digital images using specialized software tools and techniques. It is also called “image-to-text”. With CLIP, image interrogation is treated much like an image classification task: candidate text phrases are ranked by how well they match the image.
CLIP or related models such as BLIP are used for image interrogation, for tasks like object recognition and image captioning. However, CLIP performs poorly when describing abstract art or computer-generated pictures.
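The ranking idea behind CLIP-style interrogation can be sketched in a few lines. This is a minimal illustration only: the three-dimensional vectors below are invented toy values standing in for real CLIP embeddings, and `rank_phrases` is a hypothetical helper, not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_phrases(image_embedding, phrase_embeddings, top_k=2):
    """Return the top_k phrases whose embeddings are closest to the image."""
    scored = sorted(
        phrase_embeddings.items(),
        key=lambda item: cosine_similarity(image_embedding, item[1]),
        reverse=True,
    )
    return [phrase for phrase, _ in scored[:top_k]]


# Toy vectors, made up for illustration (real embeddings have hundreds of dimensions).
image_embedding = [0.9, 0.1, 0.4]
phrase_embeddings = {
    "pink lights in a tunnel": [0.8, 0.2, 0.5],
    "a bowl of fruit": [0.1, 0.9, 0.1],
    "abstract digital art": [0.7, 0.0, 0.6],
}

best = rank_phrases(image_embedding, phrase_embeddings)
print(", ".join(best))
```

Because the result can only be assembled from the candidate phrases, an image that matches none of them well, such as abstract or generated art, tends to get a weak description.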
What is ChatGPT?
ChatGPT is a state-of-the-art natural language processing (NLP) model developed by OpenAI.
It is based on the transformer architecture and is trained on massive amounts of text data, allowing it to generate human-like responses to various prompts.
ChatGPT has demonstrated impressive performance on various language tasks, including language translation, text summarization, and question answering.
Combining Midjourney, MM-ReAct, and ChatGPT
Using MM-ReAct and Midjourney together, along with the natural language generation capabilities of ChatGPT, can lead to some awe-inspiring results.
Upload an image into MM-ReAct on Hugging Face to analyze and describe the image, providing a more detailed and nuanced interpretation than is possible with traditional computer vision models.
Use the resulting text directly in Midjourney or ask ChatGPT to enhance the prompt.
Examples
The sample image comes from another blog post and was created with Midjourney and its blend command. The point is that this image may not have been in any training set, which makes it especially suitable for testing the capabilities of MM-ReAct.
Interpret the sample image and enhance prompt
MM-ReAct Describing the sample image
This image is likely a screenshot of an art installation with pink lights in a tunnel. The image also contains close-ups of pink eyes, a blue and pink grid, and a blue and pink grid floor. The object tags for the image include art, colorfulness, screenshot, graphics, light, and abstract.
The description is spot-on.
Example 1: use the resulting text directly in Midjourney
Just enter the description as is into Midjourney.
Result
The result is less abstract than the sample image and missing the eyes.
Example 2: Enhance and reinterpret the result with ChatGPT
Add “More details, be creative:” to the result from the sample image, and use this as a prompt in ChatGPT.
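As a small sketch, the ChatGPT prompt can be assembled like this. It assumes the instruction is placed in front of the MM-ReAct description; `build_enhancement_prompt` is a hypothetical helper name, and the text is still pasted into ChatGPT by hand.

```python
ENHANCE_PREFIX = "More details, be creative:"


def build_enhancement_prompt(description, prefix=ENHANCE_PREFIX):
    """Put the instruction in front of the image description (assumed order)."""
    return f"{prefix} {description.strip()}"


# First sentence of the MM-ReAct description of the sample image.
mm_react_description = (
    "This image is likely a screenshot of an art installation "
    "with pink lights in a tunnel."
)

chatgpt_prompt = build_enhancement_prompt(mm_react_description)
print(chatgpt_prompt)
```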
ChatGPT Result
The image appears to be a mesmerizing and immersive art installation, drawing the viewer into a vivid and dreamlike world. The pink lights in the tunnel cast a soft and ethereal glow, creating a sense of warmth and comfort. The tunnel seems to stretch out endlessly, inviting the viewer to explore and discover what lies beyond.
As the viewer moves further into the installation, they are confronted with close-ups of pink eyes, which seem to stare back at them with a sense of curiosity and wonder. The eyes are both eerie and captivating, drawing the viewer deeper into the world of the installation.
The blue and pink grid adds a sense of structure and order to the chaotic and abstract surroundings. It creates a visual contrast between the fluid and organic forms of the installation and the rigid and geometric shapes of the grid.
Finally, the blue and pink grid floor serves as a visual anchor, grounding the viewer in the space and providing a sense of stability and balance. It also adds an additional layer of depth and complexity to the installation, as the viewer is forced to navigate the grid while exploring the other elements of the installation.
Overall, the image is a stunning example of abstract and immersive art, using color, light, and form to create a unique and captivating experience for the viewer. The object tags of art, colorfulness, screenshot, graphics, light, and abstract accurately capture the essence of the installation, conveying its beauty and complexity in a concise and evocative way.
Create image
Enter the result from ChatGPT into Midjourney and add parameters, for example --ar 16:9 --v 5; it should look like:
/imagine prompt: result_from_chat_gpt --ar 16:9 --v 5
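The command string can also be assembled programmatically. The sketch below only builds the text to paste into Midjourney; `build_imagine_command` is a hypothetical helper, and collapsing the multi-paragraph ChatGPT text into a single line is an assumption made here for convenience. Midjourney parameters are written with two hyphens.

```python
def build_imagine_command(prompt_text, aspect_ratio="16:9", version="5"):
    """Format a Midjourney /imagine command with --ar and --v parameters."""
    # Collapse line breaks and repeated spaces so the prompt is one line.
    single_line = " ".join(prompt_text.split())
    return f"/imagine prompt: {single_line} --ar {aspect_ratio} --v {version}"


result_from_chat_gpt = """The image appears to be a mesmerizing
and immersive art installation."""

command = build_imagine_command(result_from_chat_gpt)
print(command)
```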
Result
Compare description with BLIP-2
Like [CLIP]({{ref “latent-space-clip-score”}}) or MM-ReAct, BLIP captions an image, a task called image-to-text. The resulting text is much shorter and lacks the details provided by MM-ReAct.
BLIP
a 3d image of two eyes in a dark space
The image created using the BLIP text differs entirely from the sample or the images created from MM-ReAct.
Repeat
Of course, it is possible to repeat the process of interpretation and creation again and again. This is especially interesting when ChatGPT is used to refine or enhance the description created by MM-ReAct.
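The repeated cycle can be written down as a loop. In this article every step is done by hand in the respective tools, so the three callables below are placeholders for MM-ReAct, ChatGPT, and Midjourney; `reinterpret` and the stub functions are hypothetical names used only to show the flow of data.

```python
def reinterpret(image, describe, enhance, generate, rounds=2):
    """Run describe -> enhance -> generate repeatedly and collect the prompts."""
    prompts = []
    for _ in range(rounds):
        description = describe(image)   # stands in for MM-ReAct
        prompt = enhance(description)   # stands in for ChatGPT (or pass-through)
        image = generate(prompt)        # stands in for Midjourney
        prompts.append(prompt)
    return prompts


# Stub steps that only manipulate strings, so the data flow is visible.
def stub_describe(image):
    return f"description of {image}"


def stub_enhance(description):
    return f"{description} (enhanced)"


def stub_generate(prompt):
    return f"image<{prompt}>"


history = reinterpret("sample.png", stub_describe, stub_enhance, stub_generate)
for round_number, prompt in enumerate(history, start=1):
    print(round_number, prompt)
```

Passing a function that returns its input unchanged as `enhance` corresponds to using the MM-ReAct output without modification.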
Description:
This image is likely a screenshot of a person standing in a tunnel with pink and purple lights. The image also contains a person walking on a stage, a purple and pink light on a concrete floor, a purple fabric with a curved line, a purple floor with lights, a close-up of a red fabric, and a pink and black background. The object tags for the image include purple, magenta, violet, light, and art.
The description is not fully accurate: there is no stage in the image.
Result using MM-ReAct input without modification
Description from ChatGPT
ChatGPT enhanced description
The primary focus of the image is a person standing in a tunnel, which is illuminated with pink and purple lights. The tunnel appears to be made of concrete or some other industrial material, and the lighting creates a dramatic and surreal atmosphere. The person in the tunnel is not clearly visible, but they seem to be standing in the middle of the frame, with the light casting a pink and purple glow on their body. In addition to the tunnel, the image also contains several other elements. There is a person walking on a stage, which is lit with a purple and pink light. The stage appears to be made of wood, and there are some wires or cables visible in the background, suggesting that this may be a performance of some kind. There is also a purple fabric with a curved line visible in the image. It is unclear what this fabric is or how it relates to the rest of the image, but it appears to be a close-up shot, with the fabric filling the frame. Another element of the image is a purple floor with lights. It is not clear where this floor is located or how it relates to the rest of the image, but it appears to be a close-up shot, with the floor filling the frame. A close-up of a red fabric is also visible in the image. It is not clear what this fabric is or how it relates to the rest of the image, but it appears to be a close-up shot, with the fabric filling the frame. Finally, there is a pink and black background visible in the image. It is unclear what this background is or how it relates to the rest of the image, but it appears to be a solid color. The object tags for the image include purple, magenta, violet, light, and art. These tags suggest that the image is related to art or design, and that color is an important element of the image. The use of pink and purple lighting in the tunnel and on the stage creates a striking visual effect, and the other elements of the image may be intended to complement or contrast with this effect.
Result with ChatGPT enhanced description
Compared with the previous image, the resulting image is less abstract. This suggests that Midjourney produces more photorealistic images when the prompt contains more detail.
Interpretation
The first direct interpretation captures the original image rather well. Colors and shapes are matched, as is the abstract style.
Enhancing the result with ChatGPT adds more detail and reinterprets the original image. It also moves the style of the image in a new direction.
Repeating the process with ChatGPT changes the direction much faster, while sticking with the unmodified output from MM-ReAct changes it much more slowly.
Conclusion
The combination of Midjourney, MM-ReAct, and ChatGPT can be a powerful tool for creative professionals seeking to explore new image interpretation and generation possibilities.
In these examples, MM-ReAct is much more accurate than image interrogation with CLIP. MM-ReAct also describes unseen images rather well.
The approach can be used for creative prompt engineering.