Exploring Image-to-Text Tools for Generative AI: An In-Depth Analysis
This article provides an overview of image-to-text tools for generative AI, selected to cover a range of usage scenarios. Rather than listing every available option, it focuses on solutions tailored to specific use cases and examines each tool's features, capabilities, and limitations.
Image-to-Prompt by Use Case
The tools covered in this article were chosen to address different usage scenarios; the goal is an overview of solutions for specific use cases rather than a complete catalogue. Each tool has its own strengths and weaknesses.
- Multi-Modal LLM: The workhorse
- Midjourney Describe: The artist's best friend
- CLIP Interrogator Locally: Local Independence
- Image To Prompt via Replicate: Automation
- Scenex by Jina AI: Business Class
In-depth Analysis of Image-to-Prompt Methods
This section takes a closer look at the previously introduced image-to-text tools, examining each one's features, capabilities, and limitations to help you decide which tool best meets your generative AI needs.
Multi-Modal LLM
Since version 4o, ChatGPT can describe not just images but also videos and music: you simply upload an image, a video, or a music file and ask GPT to describe it. The model is also accessible via an API and can be combined with an AI-agent approach.
There are other multi-modal LLMs such as Llama 3, Mistral, and LLaVA (50+ models), and multi-modality is expected to become the norm for LLMs. Combining the capabilities of an LLM with media-recognition abilities allows for versatile and flexible media-powered workflows.
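As a concrete illustration, here is a minimal sketch of asking a multi-modal LLM to describe an image via the OpenAI Python SDK. The prompt wording and helper names are my own assumptions, not part of the article; adapt them to your workflow.

```python
# Sketch: asking a multi-modal LLM (here GPT-4o via the OpenAI Python SDK)
# to describe an image. Prompt wording and helper names are assumptions.

def build_describe_messages(image_url: str) -> list:
    """Build the chat payload: a text instruction plus the image to describe."""
    return [{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe this image as a text-to-image prompt."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }]

def describe_image(image_url: str) -> str:
    # Imported lazily; requires `pip install openai` and OPENAI_API_KEY set.
    from openai import OpenAI
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=build_describe_messages(image_url),
    )
    return response.choices[0].message.content
```

The returned description can then be fed directly into an image generator or post-processed by the same LLM, which is where this approach shines compared with the single-purpose tools below.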
Pros:
- API Support
- Combine LLM capabilities with media asset description
Cons:
- Fine-tuning effort
- Runtime
- Costs (depending on provider)
Midjourney Describe
Midjourney Describe was introduced in March 2023 and can be used via the Discord interface, like all other Midjourney commands.
The image is uploaded and analyzed by Midjourney, which returns four different prompts that can be used directly to create images. The "describe" command fits nicely into the usual Midjourney workflow.
The resulting images can be used with the "blend" command or as image prompts to create new ideas based on the descriptions. The descriptions generated by Midjourney "describe" can, of course, also be used with DALL-E 2, Stable Diffusion, and other AI image generators; how well the resulting images match depends on each generator's language understanding.
Photographs work well; abstract art and unusual images work less well. Midjourney "describe" can also recognize words and geometric shapes, and it usually identifies colors rather well.
Pros:
- Very Good understanding of visual concepts
- Recognizing Words
- Recognition of geometric shapes
- Very Fast
Cons:
- No API
CLIP Interrogator Locally (using ViT-H-14/laion2b_s32b_b79k)
One of the first image-to-prompt tools was the CLIP Interrogator; it is available on:
- https://github.com/pharmapsychotic/clip-interrogator
- https://replicate.com/pharmapsychotic/clip-interrogator
- https://colab.research.google.com/github/pharmapsychotic/clip-interrogator/blob/main/clip_interrogator.ipynb
- https://cloud.lambdalabs.com/demos/ml/CLIP-Interrogator
It is nearly omnipresent. Since the CLIP Interrogator is available as a GitHub repo, it can be run locally. This makes sense if you want to process a lot of images and you want to minimize your API costs.
Running the CLIP Interrogator locally demands a powerful GPU; in principle it also runs on a CPU, but very slowly, and it may not run on a Mac at all. Depending on your project's requirements, using the API might be the better choice, but as mentioned, that depends on the number of images you want to process.
The CLIP Interrogator works well with photographs but does not recognize text and struggles with abstract art. It can recognize shapes to some extent.
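For the batch scenario described above, a local run can look like this minimal sketch, using the `clip-interrogator` package (`pip install clip-interrogator`) with the model named in this section's heading. The folder layout and sidecar-file convention are my own assumptions.

```python
# Sketch: batch-captioning a folder of images with the CLIP Interrogator
# running locally. Requires a capable GPU for reasonable throughput.
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}

def is_image(path: Path) -> bool:
    """Filter for the image formats we want to process."""
    return path.suffix.lower() in IMAGE_EXTS

def caption_folder(folder: str) -> None:
    # Heavy imports kept local so the file-filtering helper stays lightweight.
    from PIL import Image
    from clip_interrogator import Config, Interrogator

    # Model name taken from this article's heading.
    ci = Interrogator(Config(clip_model_name="ViT-H-14/laion2b_s32b_b79k"))
    for path in sorted(Path(folder).iterdir()):
        if not is_image(path):
            continue
        prompt = ci.interrogate(Image.open(path).convert("RGB"))
        # Write the prompt next to the image as a sidecar .txt file.
        path.with_suffix(".txt").write_text(prompt, encoding="utf-8")
```

Because the model is loaded once and reused for every image, this amortizes the startup cost that makes per-image API calls expensive at scale.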
Pros:
- API
- Open Source/Runs locally (with powerful Hardware)
- Good understanding of visual concepts
- Supports multiple CLIP models
Cons:
- Struggles with abstract concepts
- Slow compared to Midjourney
Image To Prompt via Replicate
Image to Prompt is an alternative to Pharma's CLIP Interrogator, delivering similar performance; it is based on the original CLIP Interrogator. According to Replicate's statistics, img2prompt is used about three times as often as the original.
A relevant difference is that Image to Prompt supports only one CLIP model, while the original CLIP Interrogator supports at least two. As a result, the Image to Prompt API returns image descriptions that contain fewer details. For photographs the results are comparable, but not for illustrations such as logos.
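Calling the model through Replicate's Python client (`pip install replicate`) can be sketched as follows. The model identifier and input field name are assumptions on my part; check the model page on Replicate for the exact version hash and input schema before using this.

```python
# Sketch: calling an img2prompt model on Replicate. Model identifier and
# input schema are assumptions; verify them on the Replicate model page.
MODEL = "methexis-inc/img2prompt:<version-hash>"  # fill in the real version

def clean_prompt(raw: str) -> str:
    """Normalize the returned description for use as a text prompt."""
    return " ".join(raw.split()).strip().rstrip(",")

def image_to_prompt(image_path: str) -> str:
    # Imported lazily; requires REPLICATE_API_TOKEN in the environment.
    import replicate
    with open(image_path, "rb") as f:
        raw = replicate.run(MODEL, input={"image": f})
    return clean_prompt(raw)
```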
Pros:
- API
- Somewhat helpful understanding of visual concepts
Cons:
- Only supports a single, older CLIP model
- Slow compared to Midjourney
Scenex by Jina AI
SceneX (Scene Explain) by Jina AI is a commercial product, unlike the other solutions in this selection.
It offers a playground and an API. The image descriptions are rich in detail and the response time is fast, exactly what you would expect from a commercial service. Its visual understanding is good but does not approach the precision of MM-ReAct or Midjourney "describe" (comet and dune).
The service also offers options for faster and cheaper responses and more detailed image descriptions.
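A request against the SceneX API can be sketched roughly as below. The endpoint URL, header format, and response shape here are assumptions based on my reading of the SceneXplain documentation; verify them against the current docs before relying on this.

```python
# Sketch: querying the SceneX API. Endpoint, headers, and response shape
# are assumptions; check the official SceneXplain documentation.
import json

SCENEX_URL = "https://api.scenex.jina.ai/v1/describe"

def build_payload(image_url: str) -> dict:
    """One image per request; the same list can carry a batch of images."""
    return {"data": [{"image": image_url}]}

def describe(image_url: str, api_key: str) -> str:
    import requests  # pip install requests
    resp = requests.post(
        SCENEX_URL,
        headers={"x-api-key": f"token {api_key}",
                 "content-type": "application/json"},
        data=json.dumps(build_payload(image_url)),
        timeout=60,
    )
    resp.raise_for_status()
    # Response field names are an assumption; adjust to the actual schema.
    return resp.json()["result"][0]["text"]
```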
Pros:
- API Access
- Fast
- Good understanding of visual concepts
- Recognizes geometric shapes
Cons:
- Image recognition could offer a higher level of precision
Outlook: The Future of Image-to-Text in Generative AI
Research on image-to-prompt and multi-modal content analysis is not standing still. While MM-ReAct and the Midjourney "describe" command are fast and offer a high level of precision, the rise of conversational multi-modal AI (like ChatGPT) will demand even more performance and precision. Some alternatives (with code):
- EVA-CLIP
- GenerativeImage2Text
- MiniGPT-4 (Q&A with an image; the description might not be useful as a text prompt)
- LLaVA: Large Language and Vision Assistant (Q&A with an image; the description might not be useful as a text prompt)
Conclusions
There are several image-to-text tools for generative AI, each with its own set of advantages and disadvantages.
Midjourney Describe, CLIP Interrogator Locally, Image to Prompt via Replicate, MM-ReAct, and Scenex by Jina AI are all intriguing approaches. Before selecting a tool, users should consider their specific requirements.
Midjourney Describe provides quick results and a good understanding of visual concepts but does not have an API.
Local CLIP Interrogator has an API and understands visual concepts well but struggles with abstract concepts. CLIP Interrogator is similar to Image to Prompt via Replicate but only supports one model.
MM-ReAct understands visual concepts well, recognizes texts and abstract shapes well, and is fast but heavily reliant on external services.
Scenex by Jina AI is a commercial product that understands visual concepts well, recognizes geometric shapes, and responds quickly, but lacks the precision of MM-ReAct or Midjourney Describe.
As image-to-text research advances, the prominence of conversational multi-modal AI will necessitate even more outstanding performance and precision.
Alternatives such as EVA-CLIP and GenerativeImage2Text provide promising solutions to these problems.