Generative AI is impressively convincing these days, as viral images like Balenciaga Pope suggest. The latest systems can conjure up scenes from city skylines to cafes, creating images that seem surprisingly realistic – at least at first glance.
But one of the long-standing weaknesses of text-to-image AI models is, ironically, text. Even the best models struggle to generate images with legible logos, let alone text, calligraphy or fonts.
But that could change.
Last week, DeepFloyd, a research group backed by Stability AI, unveiled DeepFloyd IF, a text-to-image model that can “smartly” integrate text into images. Trained on a dataset of more than a billion image–text pairs, DeepFloyd IF, which requires a GPU with at least 16 GB of VRAM to run, can create an image from a prompt like “a teddy bear wearing a shirt with the text ‘Deep Floyd’ ” – optionally in different styles.
DeepFloyd IF is available in open source, under a license that prohibits commercial use – for now. The limitation was likely dictated by the currently murky legal status of generative AI art models: several vendors of commercial models have come under fire from artists who allege the vendors profited from work scraped off the web without permission or compensation.
NightCafe, a generative art platform, got early access to DeepFloyd IF.
Angus Russell, CEO of NightCafe, spoke to AapkaDost about what sets DeepFloyd IF apart from other text-to-image models and why it could represent a major step forward for generative AI.
According to Russell, DeepFloyd IF’s design was heavily inspired by Google’s Imagen model, which was never made public. Unlike models such as OpenAI’s DALL-E 2 and Stable Diffusion, DeepFloyd IF uses multiple different processes stacked on top of each other in a modular architecture to generate images.
With a typical diffusion model, the model learns how to gradually subtract noise from a starting image that is almost entirely noise, bringing it closer to the target prompt step by step. DeepFloyd IF runs diffusion not once but several times: it first generates a 64x64px image, then upscales it to 256x256px and finally to 1024x1024px.
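The cascaded idea can be caricatured in a few lines of numpy. This is a toy sketch, not DeepFloyd's implementation: `denoise` and `upscale` are hypothetical stand-ins for the learned denoising and super-resolution networks, and the step counts are made up. Only the stage structure – generate small, then repeatedly upscale and re-denoise – mirrors the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise(noisy, steps):
    """Toy stand-in for a diffusion stage: each 'step' just shrinks the
    remaining noise. A real stage runs a learned network conditioned on
    the text prompt's embedding."""
    img = noisy
    for _ in range(steps):
        img = img * 0.5
    return img

def upscale(img, factor):
    """Nearest-neighbour upscale, standing in for a learned
    super-resolution stage."""
    return img.repeat(factor, axis=0).repeat(factor, axis=1)

# Stage I: start from pure noise at 64x64 and denoise it.
x = denoise(rng.standard_normal((64, 64, 3)), steps=50)
# Stage II: upscale to 256x256, add fresh noise, denoise again.
x = denoise(upscale(x, 4) + 0.1 * rng.standard_normal((256, 256, 3)), steps=30)
# Stage III: repeat once more at the final 1024x1024 resolution.
x = denoise(upscale(x, 4) + 0.1 * rng.standard_normal((1024, 1024, 3)), steps=20)
print(x.shape)  # (1024, 1024, 3)
```

The payoff of the design is that only the first stage has to solve the hard "what should this image contain" problem; the later stages just sharpen an already-committed layout.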
Why the need for multiple diffusion runs? Because DeepFloyd IF works directly with pixels, explains Russell. Most diffusion models are latent diffusion models, which essentially means they operate in a lower-dimensional space, where each value stands in for many pixels – a representation that is cheaper but less precise.
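The size gap is easy to make concrete. Stable Diffusion's autoencoder maps a 512x512 RGB image to a 64x64 latent with 4 channels, so its diffusion runs over roughly 48 times fewer values than pixel-space diffusion over the same image:

```python
# Stable Diffusion denoises a compressed latent; DeepFloyd IF denoises pixels.
pixel_values = 512 * 512 * 3    # values in a 512x512 RGB image (pixel space)
latent_values = 64 * 64 * 4     # SD's latent: 8x downsampled, 4 channels
print(pixel_values, latent_values, pixel_values // latent_values)
# 786432 16384 48
```

Working in pixel space forgoes that ~48x saving, which is why DeepFloyd IF amortizes the cost across a cascade of low-to-high resolution stages instead of running one diffusion at full resolution.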
The other main difference between DeepFloyd IF and models such as Stable Diffusion and DALL-E 2 is that the former uses a large language model to understand and represent prompts as a vector, a basic data structure. Given the size of the large language model embedded in DeepFloyd IF’s architecture, the model is particularly good at understanding complex prompts and even spatial relationships described in prompts (e.g., “a red cube on top of a pink sphere”).
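To see why the language model matters, here is a deliberately naive toy embedding – a normalized bag-of-characters vector, purely hypothetical and not anything DeepFloyd uses. It shows what "prompt as a vector" means, and also why a shallow representation fails at exactly the spatial relationships mentioned above:

```python
import numpy as np

def toy_embed(prompt, dim=16):
    """Hypothetical toy embedding: count characters into a fixed-size
    vector and normalize it. DeepFloyd IF instead runs the prompt
    through a large frozen language model to get a far richer vector."""
    v = np.zeros(dim)
    for ch in prompt.lower():
        v[ord(ch) % dim] += 1.0
    return v / np.linalg.norm(v)

a = toy_embed("a red cube on top of a pink sphere")
b = toy_embed("a pink sphere on top of a red cube")
# The two prompts describe opposite scenes but contain the same
# characters, so the order-blind toy embedding cannot tell them apart:
print(round(float(a @ b), 6))  # 1.0
```

A representation that ignores word order maps both prompts to the identical vector; capturing which object is on top of which is precisely what a full language model encoder adds.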
“It’s also very good at generating readable and correctly spelled text in images and can even understand prompts in multiple languages,” added Russell. “Of these capabilities, the ability to generate readable text in images is perhaps the biggest breakthrough that sets DeepFloyd IF apart from other algorithms.”
Since DeepFloyd IF can generate text in images quite proficiently, Russell expects it to unlock a wave of new generative art possibilities – think logo design, web design, posters, billboards and even memes. The model should also be much better at generating things like hands, he says, and because it can understand prompts in other languages, it should be able to create text in those languages as well.
“NightCafe users are excited about DeepFloyd IF largely because of the capabilities unlocked by generating text within images,” said Russell. “Stable Diffusion XL was the first open source algorithm to make progress on text generation – it can accurately generate one or two words some of the time – but it’s still not good enough for cases where text is important.”
That’s not to say that DeepFloyd IF is the holy grail of text-to-image models. Russell notes that the base model does not generate images that are as aesthetically pleasing as those from some other diffusion models, though he expects refinement will improve that.
But the bigger question to me is to what extent DeepFloyd IF suffers from the same shortcomings as its generative AI brethren.
A growing body of research has revealed racial, ethnic, gender and other forms of stereotyping in image-generating AI, including Stable Diffusion. This month, researchers from AI startup Hugging Face and the University of Leipzig published a tool showing that models such as Stable Diffusion and OpenAI’s DALL-E 2 tend to produce images of people appearing white and masculine, especially when asked to depict people in positions of authority.
The DeepFloyd team acknowledges the potential for bias in DeepFloyd IF’s fine print:
“Texts and images from communities and cultures using other languages are likely to be insufficiently accounted for. This affects the overall output of the model, as white and western cultures are often set as defaults.”
Aside from this, DeepFloyd IF, like other open source generative models, can be used for harm, such as generating pornographic deepfakes of celebrities and graphic images of violence. On DeepFloyd IF’s official webpage, the DeepFloyd team says they used “custom filters” to remove watermarks, “NSFW” and “other inappropriate content” from the training data.
But it’s unclear exactly what content was removed – and how much may have been missed. Ultimately, time will tell.