Move AI image generation into a new direction with better image descriptions

Edited on: October 1, 2024 - published: March 28, 2023 - 9 minutes read - 1727 words

image from Creative prompt engineering with MM-ReAct, Midjourney, and ChatGPT

Engineering is all about creativity and problem-solving, and what better way to flex those muscles than by reinterpreting images using cutting-edge AI technology?

Image interrogation, also called “image-to-text”, can be used to create text prompts from images. MM-ReAct is a more accurate implementation of this idea.

This article will explore how Midjourney, MM-ReAct, and ChatGPT can transform images into something entirely new and unexpected.

What is Midjourney?

Midjourney is an AI image generator that uses deep neural networks to turn text prompts into new images that are visually striking and creatively unique.

By analyzing and understanding visual data, Midjourney can generate new realistic and imaginative images, allowing for endless possibilities for creative expression.

What is MM-ReAct?

MM-ReAct is an AI system that combines ChatGPT with a pool of specialized vision models. It can describe images in a more nuanced and detailed way than traditional computer vision models like CLIP.

Using advanced natural language processing algorithms, MM-ReAct can accurately identify and describe objects, scenes, and emotions in an image, allowing for a more nuanced visual data interpretation.

MM-ReAct handles complex visual understanding tasks and can be applied in a variety of scenarios.

What is Image interrogation?

Image interrogation analyses and extracts information from digital images using specialized software tools and techniques. It is also called “image-to-text” and is closely related to image classification and image captioning.

CLIP and related models such as BLIP are used for image interrogation, for example for object recognition and image captioning. However, CLIP performs poorly when describing abstract art or computer-generated pictures.
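To make the idea more concrete, here is a minimal sketch of how CLIP-style interrogation tools typically choose a prompt: the image and a list of candidate phrases are embedded into the same vector space, and the phrases closest to the image are kept. The three-dimensional vectors below are made-up stand-ins for real CLIP embeddings, and `rank_phrases` is a hypothetical helper, not part of any library.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def rank_phrases(image_vec, phrase_vecs, top_k=2):
    """Return the top_k phrases whose vectors are most similar to the image vector."""
    scored = sorted(
        phrase_vecs.items(),
        key=lambda item: cosine(image_vec, item[1]),
        reverse=True,
    )
    return [phrase for phrase, _ in scored[:top_k]]


# Toy embeddings (assumption: real CLIP vectors have hundreds of dimensions).
image_vec = [0.9, 0.1, 0.4]
phrase_vecs = {
    "pink lights in a tunnel": [0.8, 0.2, 0.5],
    "a bowl of fruit": [0.1, 0.9, 0.0],
    "abstract digital art": [0.7, 0.0, 0.6],
}

print(rank_phrases(image_vec, phrase_vecs))
```

Because the result can only be assembled from phrases in the candidate list, this kind of approach may struggle with images that do not match any of them well, which fits the weakness with abstract art described above.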

What is ChatGPT?

ChatGPT is a state-of-the-art natural language processing (NLP) model developed by OpenAI.

It is based on the transformer architecture and is trained on massive amounts of text data, allowing it to generate human-like responses to various prompts.

ChatGPT has demonstrated impressive performance on various language tasks, including language translation, text summarization, and question answering.

Combining Midjourney, MM-ReAct, and ChatGPT

Using MM-ReAct and Midjourney together, along with the natural language generation capabilities of ChatGPT, can lead to some awe-inspiring results.

Upload an image into MM-ReAct on Hugging Face to analyze and describe the image. This provides a more detailed and nuanced interpretation than is possible with traditional computer vision models.

Use the resulting text directly in Midjourney or ask ChatGPT to enhance the prompt.
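The workflow described here can be sketched as a small function. In this article the steps are carried out by hand in the respective web interfaces, so `describe` and `enhance` are placeholder callables (assumptions for illustration), not real API clients for MM-ReAct or ChatGPT.

```python
from typing import Callable, Optional


def image_to_prompt(
    image_path: str,
    describe: Callable[[str], str],
    enhance: Optional[Callable[[str], str]] = None,
) -> str:
    """Turn an image into a text prompt, optionally enhanced by a language model."""
    description = describe(image_path)  # e.g. paste the image into MM-ReAct
    if enhance is None:
        return description  # use the description directly as the prompt
    return enhance(description)  # e.g. ask ChatGPT to elaborate


# Stand-in functions so the sketch runs without any external service.
def fake_describe(path: str) -> str:
    return f"pink lights in a tunnel (from {path})"


def fake_enhance(text: str) -> str:
    return f"A mesmerizing installation: {text}"


print(image_to_prompt("sample.png", fake_describe))
print(image_to_prompt("sample.png", fake_describe, fake_enhance))
```

Keeping the enhancement step optional mirrors the two variants compared in the examples: the raw description and the ChatGPT-enhanced one.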

Examples

The sample image was created with Midjourney and its blend feature for another blog post. The point is that this image is unlikely to have been in any training set, which makes it especially suitable for testing the capabilities of MM-ReAct.

Sample Image

Interpret the sample image and enhance prompt

MM-ReAct Describing the sample image

This image is likely a screenshot of an art installation with pink lights in a tunnel. The image also contains close-ups of pink eyes, a blue and pink grid, and a blue and pink grid floor. The object tags for the image include art, colorfulness, screenshot, graphics, light, and abstract.

The description is spot-on.

Example 1: use the resulting text directly in Midjourney

Just enter the description as is into Midjourney.

Result

Result-0-0

The result is less abstract than the sample image and is missing the eyes.

Example 2: Enhance and reinterpret the result with ChatGPT

Add “More details, be creative:” to the description of the sample image, and use this as a prompt in ChatGPT.
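As a small sketch, the ChatGPT prompt can be assembled like this. The helper name `build_enhance_prompt` is hypothetical, and placing the instruction in front of the description is an assumption based on the trailing colon.

```python
def build_enhance_prompt(
    description: str,
    instruction: str = "More details, be creative:",
) -> str:
    """Prefix an image description with the enhancement instruction."""
    # Strip stray whitespace that often comes along when copying text from a web UI.
    return f"{instruction} {description.strip()}"


mm_react_output = "This image is likely a screenshot of an art installation with pink lights in a tunnel."
print(build_enhance_prompt(mm_react_output))
```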

ChatGPT Result

The image appears to be a mesmerizing and immersive art installation, drawing the viewer into a vivid and dreamlike world. The pink lights in the tunnel cast a soft and ethereal glow, creating a sense of warmth and comfort. The tunnel seems to stretch out endlessly, inviting the viewer to explore and discover what lies beyond.

As the viewer moves further into the installation, they are confronted with close-ups of pink eyes, which seem to stare back at them with a sense of curiosity and wonder. The eyes are both eerie and captivating, drawing the viewer deeper into the world of the installation.

The blue and pink grid adds a sense of structure and order to the chaotic and abstract surroundings. It creates a visual contrast between the fluid and organic forms of the installation and the rigid and geometric shapes of the grid.

Finally, the blue and pink grid floor serves as a visual anchor, grounding the viewer in the space and providing a sense of stability and balance. It also adds an additional layer of depth and complexity to the installation, as the viewer is forced to navigate the grid while exploring the other elements of the installation.

Overall, the image is a stunning example of abstract and immersive art, using color, light, and form to create a unique and captivating experience for the viewer. The object tags of art, colorfulness, screenshot, graphics, light, and abstract accurately capture the essence of the installation, conveying its beauty and complexity in a concise and evocative way.

Create image

Enter the result from ChatGPT into Midjourney and add, for example, --ar 16:9 --v 5; it should look like:

/imagine prompt: result_from_chat_gpt --ar 16:9 --v 5
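A minimal sketch for assembling that command from a longer text follows. Midjourney parameters are written with two plain hyphens; the helper name `build_imagine_command` and the whitespace clean-up are assumptions for illustration, since Midjourney itself is driven by pasting the command into its chat interface.

```python
def build_imagine_command(
    prompt: str,
    aspect_ratio: str = "16:9",
    version: str = "5",
) -> str:
    """Build a Midjourney /imagine command with aspect-ratio and version parameters."""
    # Collapse line breaks and repeated spaces so the whole prompt fits on one line.
    one_line = " ".join(prompt.split())
    return f"/imagine prompt: {one_line} --ar {aspect_ratio} --v {version}"


chatgpt_text = "The image appears to be a mesmerizing\nand immersive   art installation."
print(build_imagine_command(chatgpt_text))
```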

Result

Result-0-0-chatgpt

Compare description with BLIP-2

Like CLIP or MM-ReAct, BLIP captions an image, a task called image-to-text. The resulting text is much shorter and lacks the details provided by MM-ReAct.

BLIP

a 3d image of two eyes in a dark space

Result-blip

The image created using the BLIP text differs entirely from the sample or the images created from MM-ReAct.

Repeat

Of course, it is possible to repeat the process of interpretation and creation again and again. That is especially interesting in combination with using ChatGPT to refine or enhance the description created by MM-ReAct.
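The repetition can be sketched as a loop in which each generated image becomes the input of the next round. All three callables are placeholders for the manual steps (MM-ReAct, ChatGPT, Midjourney); none of them is a real API client, and the stubs below only build strings so that the flow is visible.

```python
from typing import Callable, List


def iterate_prompts(
    first_image: str,
    describe: Callable[[str], str],
    enhance: Callable[[str], str],
    generate: Callable[[str], str],
    rounds: int = 3,
) -> List[str]:
    """Repeat describe -> enhance -> generate and collect the prompt of each round."""
    image = first_image
    history = []
    for _ in range(rounds):
        prompt = enhance(describe(image))
        history.append(prompt)
        image = generate(prompt)  # the new image feeds the next round
    return history


# String-building stubs (assumptions) that make the data flow traceable.
def stub_describe(image: str) -> str:
    return f"desc({image})"


def stub_enhance(text: str) -> str:
    return f"enh({text})"


def stub_generate(prompt: str) -> str:
    return f"img:{prompt}"


for step, prompt in enumerate(iterate_prompts("sample.png", stub_describe, stub_enhance, stub_generate, rounds=2), start=1):
    print(step, prompt)
```

Passing a function that returns its input unchanged as `enhance` corresponds to sticking with the unmodified MM-ReAct description.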

Input-image-1

Description:

This image is likely a screenshot of a person standing in a tunnel with pink and purple lights. The image also contains a person walking on a stage, a purple and pink light on a concrete floor, a purple fabric with a curved line, a purple floor with lights, a close-up of a red fabric, and a pink and black background. The object tags for the image include purple, magenta, violet, light, and art.

The description is not 100% accurate because there is no stage.

Result using MM-ReAct input without modification

Result-1-0

Description from ChatGPT

ChatGPT enhanced description

The primary focus of the image is a person standing in a tunnel, which is illuminated with pink and purple lights. The tunnel appears to be made of concrete or some other industrial material, and the lighting creates a dramatic and surreal atmosphere. The person in the tunnel is not clearly visible, but they seem to be standing in the middle of the frame, with the light casting a pink and purple glow on their body.

In addition to the tunnel, the image also contains several other elements. There is a person walking on a stage, which is lit with a purple and pink light. The stage appears to be made of wood, and there are some wires or cables visible in the background, suggesting that this may be a performance of some kind.

There is also a purple fabric with a curved line visible in the image. It is unclear what this fabric is or how it relates to the rest of the image, but it appears to be a close-up shot, with the fabric filling the frame.

Another element of the image is a purple floor with lights. It is not clear where this floor is located or how it relates to the rest of the image, but it appears to be a close-up shot, with the floor filling the frame.

A close-up of a red fabric is also visible in the image. It is not clear what this fabric is or how it relates to the rest of the image, but it appears to be a close-up shot, with the fabric filling the frame.

Finally, there is a pink and black background visible in the image. It is unclear what this background is or how it relates to the rest of the image, but it appears to be a solid color.

The object tags for the image include purple, magenta, violet, light, and art. These tags suggest that the image is related to art or design, and that color is an important element of the image. The use of pink and purple lighting in the tunnel and on the stage creates a striking visual effect, and the other elements of the image may be intended to complement or contrast with this effect.

Result with ChatGPT enhanced description

Result-1-0-chatgpt

In comparison to the previous image, the resulting image is less abstract. This suggests that when given more details, Midjourney’s ability to produce photorealistic images improves.

Interpretation

The first direct interpretation captures the original image rather well. Colors and shapes are matched as well as the abstract style.

Enhancing the result with ChatGPT adds more detail and reinterprets the original image. It also moves the style of the image in a new direction.

Repeating the process with ChatGPT in the loop changes the direction much faster, while sticking with the unmodified output from MM-ReAct changes it much more slowly.

Conclusion

The combination of Midjourney, MM-ReAct, and ChatGPT can be a powerful tool for creative professionals seeking to explore new image interpretation and generation possibilities.

In these tests, MM-ReAct was much more accurate than CLIP-based image interrogation. It also described previously unseen images rather well.

The approach can be used for creative prompt engineering.
