Content Creation: AI Artworks from text input
Creating images and even videos from text input is made possible by various ML models. Midjourney, DALL-E mini, and AI notebooks allow the creation of pictures and even videos. AI notebooks running in Google Colab let you adjust the process in more detail. Image generation from text combines text and image machine learning models.
Midjourney
Midjourney describes itself as an independent research lab exploring humanity's creativity through machine learning. It is run and advised by an exciting group of people, with backgrounds ranging from the Max Planck Institute, Apple, GitHub, and Second Life to Avid. One of the advisers is Katherine Crowson, probably the same person who created the original version of VQGAN-CLIP (VQGAN uses convolutional neural networks).
Midjourney interfaces with users through a bot on Discord. You enter "/imagine prompt: your text", and the bot starts to create four images.
You can watch the diffusion process run. It creates a selection of four images, any of which you can reprocess to try for another result. Upscaling is also offered; the upscaling appears to be lossless.
Using a Discord channel as the user interface to the bot is an interesting approach, but it takes some time to get used to. You share the channel with other people and can see their experiments. It is slightly chaotic because your experiment may scroll up the timeline, but you see what other people are trying.
Text-to-image also requires some training on the user's side: you need to learn how to express your idea. A slightly different text can result in a much better picture, and the shared Discord channel helps you learn faster.
DALL-E 2 kickstarted the text-to-image race
DALL-E 2 can generate unique, realistic images and art from a text description. It is capable of combining concepts, attributes, and styles. DALL-E 2 can realistically edit existing images based on a natural-language caption: it can add and remove elements while accounting for shadows, reflections, and textures.
DALL-E 2 can take an image and generate multiple variations based on the original. The approach also considers the relationships between images and the text used to describe them. It employs a technique known as "diffusion," which begins with a pattern of random dots and gradually changes that pattern toward an image as it recognizes specific aspects of that image.
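A toy sketch of that idea (not a real implementation): in an actual diffusion model, a trained neural network predicts the noise from the current image, the timestep, and the text prompt. The `target` array below is only a hypothetical stand-in for that prediction.

```python
import numpy as np

# Toy illustration of diffusion-style sampling: start from pure noise and
# repeatedly remove a little of the estimated noise. A real model would
# estimate the noise with a trained network conditioned on the prompt;
# here a random "target" image stands in for the model's prediction.
rng = np.random.default_rng(0)
target = rng.random((256, 256, 3))          # hypothetical model prediction
image = rng.standard_normal((256, 256, 3))  # pure random noise

steps = 50
for t in range(steps):
    alpha = (t + 1) / steps                 # denoise more aggressively over time
    estimated_noise = image - target
    image = image - (alpha / steps) * estimated_noise

# The mean absolute gap to the target shrinks step by step.
print(float(np.abs(image - target).mean()))
```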
Its predecessor, DALL-E, was introduced by OpenAI in January 2021. A year later, DALL-E 2 produces more realistic and accurate images at four times the resolution.
DALL-E 2 is only accessible in beta and requires permission from OpenAI. OpenAI fears that it could be used to produce photorealistic fake images and that underlying biases in the training data could surface in the generated pictures. Google has created Imagen, which is not available to the public for the same reasons.
Craiyon AI
Text-to-image creation generated demand by showing what is possible while, at the same time, denying access to the technology. This motivated AI researchers to develop their own versions. DALL-E mini is one of them and the most popular.
DALL-E mini started at Hugging Face (an AI community). It became so successful that OpenAI asked its makers to rename it; it is now called Craiyon.
It is available as a service on Craiyon, where you can create images from text. It is also available as a Google Colab script.
Running it requires CPU and GPU resources, which Colab provides with limits on the free tier. The script may fail to execute if the allocated resources are insufficient; in that case, you can buy a Colab subscription (recommended) to run the DALL-E mini script reliably.
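Before running the script, a quick check whether Colab actually allocated a GPU can save time; a minimal sketch using PyTorch, which the Colab runtime ships with:

```python
import torch

# The free Colab tier does not guarantee a GPU; without one, generation
# is very slow or the script fails outright.
if torch.cuda.is_available():
    print("GPU allocated:", torch.cuda.get_device_name(0))
else:
    print("No GPU allocated - expect very slow generation or failures.")
```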
DALL-E mini uses a smaller model than OpenAI's because the resources needed to create a vast pre-trained model like OpenAI's are enormous. One limitation is that the generated images are only 256x256 pixels. They can be scaled up afterwards with little visible loss, using another AI script.
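For contrast, a naive non-AI upscale with Pillow only interpolates existing pixels and cannot add detail, which is why a separate AI super-resolution script is used. The file name output.png is a hypothetical example:

```python
from PIL import Image

# Bicubic interpolation of a 256x256 generation to 1024x1024: smooth,
# but no new detail - unlike an AI super-resolution model.
img = Image.open("output.png")               # assumed 256x256 DALL-E mini output
big = img.resize((1024, 1024), Image.BICUBIC)
big.save("output_4x.png")
```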
The most obvious reason to use the script on Google Colab is that you do not need to set up your own notebook infrastructure, including a GPU. An alternative to Google Colab is AWS SageMaker. Google Colab is good for getting started fast; SageMaker, on the other hand, is a complete data science and engineering platform. Its learning curve can be steeper, but it offers more options, including more powerful hardware.
DALL-E mini has limitations: it inherits bias from its training data, only works with English text input, and animals and humans might not come out realistic. Results can be disturbing or irritating; human faces and animal bodies are often rendered in a distorted way.
VQGAN-CLIP notebook
VQGAN-CLIP is a notebook that can be executed on Google Colab or locally. It is written in Python and was developed by Katherine Crowson, an AI/generative artist.
The default model is ImageNet 16384; the number 16384 refers to the size of the VQGAN codebook, i.e. the number of discrete visual codes the model can draw on. A larger codebook may produce more detailed images than a smaller one.
However, a larger codebook does not mean that the image will be more interesting from an artistic perspective. In other words, a smaller model might be sufficient, especially for experimentation. The download size of ImageNet 16384 is about 1 GB.
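To make the codebook idea concrete, here is a toy sketch of the quantization step: each encoder output vector is snapped to the nearest of the 16384 code vectors. The random values and the 256-dimensional code size are illustrative assumptions; a real codebook is learned during training.

```python
import numpy as np

rng = np.random.default_rng(1)
codebook = rng.standard_normal((16384, 256))  # 16384 codes, 256 dims each (assumed)
z = rng.standard_normal(256)                  # one encoder output vector

# Nearest-neighbour lookup: the code with the smallest L2 distance wins.
idx = int(np.argmin(np.linalg.norm(codebook - z, axis=1)))
quantized = codebook[idx]
print(idx, quantized.shape)
```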
Compared to DALL-E mini, you can also use target images: you supply one or more photos that the AI treats as a "target," serving the same purpose as a text prompt.
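A settings block as it might appear in such a notebook; the parameter names are illustrative assumptions, since they vary between notebook versions:

```python
# Hypothetical VQGAN-CLIP notebook settings - names vary by version.
settings = {
    "text_prompts": "a lighthouse in a storm, oil painting",
    "target_images": ["reference.jpg"],   # image prompt(s), same role as the text
    "width": 512,
    "height": 512,
    "max_iterations": 300,                # more steps, more refined output
    "model": "vqgan_imagenet_f16_16384",  # checkpoint with the 16384-entry codebook
}
```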
It also supports creating bigger images and writes out a series of intermediate images that can be turned into a video. The video helps you understand how the generator builds the image step by step; DALL-E mini delivers its images in one go, so you never see the process.
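If the notebook writes its intermediate steps to disk (the steps/*.png path below is a hypothetical example), Pillow can stitch them into an animation:

```python
import glob
from PIL import Image

# Collect the per-step frames in order and save them as an animated GIF.
frames = [Image.open(p) for p in sorted(glob.glob("steps/*.png"))]
assert frames, "no frames found - adjust the glob pattern"
frames[0].save("progress.gif", save_all=True,
               append_images=frames[1:], duration=80, loop=0)
```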
Interestingly, you can also scale up the images with another script. If you want to host your own, the notebook gives you a blueprint for a text-to-image processing pipeline. The documentation is minimalistic; it might be necessary to study the code to understand each parameter.
Conclusions
DALL-E and DALL-E 2 kicked off the competition among text-to-image services that generate images from a single line of text (a prompt). It is possible to run your own pipeline in a notebook, either locally or in a cloud environment. The resources and costs are manageable if you do not train a model yourself and instead use a pre-trained one.
Like NLU/NLP models, ML-based text-to-image models are constrained by potential bias and misuse. On Midjourney, you can see that some users create political content, which can be disturbing. So far, the generative art community consists of artists, ML engineers, and data scientists exploring creativity. This might change the moment the technology becomes more accessible.
It is realistic to expect that a broader user base will also shift the character of the content being created. The tools that are currently open and accessible produce art that is visibly AI-generated. This is not the case for Google's and OpenAI's offerings, which have far more resources and can build much more powerful models and higher-quality images.
It appears to be only a question of time until generative models allow AI-based video makers and video editors to render movies. However, the resources the machine learning algorithms will need for this are probably massive.