Fantasìa: AI Generated Art in Physics and Astrophysics

Introduction: Deep Generative Models

Letizia Pizzini, Luca Bottero

Machine Learning Journal Club



Deep learning has achieved remarkable results in emulating human intelligence on increasingly complex tasks. One challenge that has recently captivated researchers around the world is emulating artistic creativity.

Software such as DALL-E and the beta version of Midjourney makes it possible to create striking images from text descriptions, even for people without drawing skills. These tools rely on diffusion-based algorithms closely related to Stable Diffusion, an open-source model (meaning that anyone can access and modify the neural network parameters) released in August 2022.

Stable Diffusion is a deep generative model: a machine learning model that learns the underlying structure of a dataset and generates new, synthetic data samples similar to those in the original dataset. Such models are used in a wide range of applications, including audio synthesis and text generation, as in the popular ChatGPT.

Stable Diffusion is a system made up of several components and models. It includes a text-understanding component (the text encoder), which translates the text into a numeric representation that captures the ideas in the description. That information is then passed to the image generator, which is itself composed of two components.
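As a minimal sketch of what "translating text into a numeric representation" means, the toy encoder below splits a prompt into tokens and maps each token to a fixed-length vector of numbers. Everything here is an assumption made for illustration: the names `tokenize`, `embed_token`, `encode_text` and the size `EMBED_DIM` are hypothetical, and the hash-based vectors are only a stand-in. A real text encoder is a trained neural network whose vectors reflect meaning, which these do not.

```python
import hashlib

EMBED_DIM = 8  # hypothetical size; real text encoders use far larger vectors


def tokenize(text):
    """Split a prompt into lowercase word tokens (real models use subword tokenizers)."""
    return text.lower().split()


def embed_token(token, dim=EMBED_DIM):
    """Map a token to a deterministic pseudo-embedding with values in [-1, 1].

    Assumption: a hash is used only to obtain repeatable numbers; it carries no meaning.
    """
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [digest[i] / 127.5 - 1.0 for i in range(dim)]


def encode_text(text):
    """Return one embedding vector per token: the 'token embeddings' of the prompt."""
    return [embed_token(token) for token in tokenize(text)]


embeddings = encode_text("A spiral galaxy painted in oil")
print(len(embeddings), len(embeddings[0]))  # prints: 6 8
```

The output is an array with one row per token, which is the shape of information the image generator receives in place of the raw text.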

The image generator goes through two stages:

Image information creator


This component runs for multiple steps to generate image information. The image information creator works completely in the image information space (or latent space). In technical terms, this component is made up of a UNet neural network and a scheduling algorithm.


The word “diffusion” describes what happens in this component: the step-by-step processing of information that leads, in the end, to a high-quality image being generated (by the next component, the image decoder).

Image decoder


The image decoder paints a picture from the information it got from the information creator. It runs only once at the end of the process to produce the final pixel image.


The diffusion process takes place inside the image information creator component. Given the token embeddings that represent the input text and a random starting image information array (such arrays are also called latents), the process produces an information array that the image decoder uses to paint the final image. The process advances step by step: each step takes a latents array as input and produces another latents array that better matches the input text and the visual information the model picked up from the images it was trained on.
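The step-by-step loop described above can be sketched in a few lines. This is a cartoon under stated assumptions, not the actual Stable Diffusion algorithm: `predict_noise` stands in for the UNet, `scheduler_step` for the scheduling algorithm, and `text_target` for the influence of the text embeddings. All of these names are hypothetical. In particular, the stand-in predictor is handed the target directly, whereas a real UNet is a trained network that has to estimate the noise from the latents and the text alone.

```python
import random


def predict_noise(latents, text_target):
    """Stand-in for the UNet: estimate the 'noise' separating the latents from the target.

    Assumption: an oracle that sees the target; a real UNet is a trained network.
    """
    return [z - t for z, t in zip(latents, text_target)]


def scheduler_step(latents, noise, step, num_steps):
    """Stand-in scheduler: remove a growing fraction of the predicted noise each step."""
    fraction = 1.0 / (num_steps - step)
    return [z - fraction * n for z, n in zip(latents, noise)]


def generate_latents(text_target, num_steps=10, seed=0):
    """Start from random latents and refine them one step at a time."""
    rng = random.Random(seed)
    latents = [rng.gauss(0.0, 1.0) for _ in text_target]  # random starting latents
    history = [latents]
    for step in range(num_steps):
        noise = predict_noise(latents, text_target)
        latents = scheduler_step(latents, noise, step, num_steps)
        history.append(latents)  # each step yields a new latents array
    return latents, history


final_latents, history = generate_latents([0.5, -0.25, 1.0, 0.0])
print(len(history))  # prints: 11 (the starting latents plus one array per step)
```

Each iteration takes a latents array and returns another one that is closer to the target, mirroring the description above. The final array is what would be handed to the image decoder.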

The central idea of generating images with diffusion models relies on the fact that we have powerful computer vision models. Given a large enough dataset, these models can learn complex operations and generate high-quality, realistic data samples. However, diffusion models have limitations: because they require many iterations of the diffusion process, they can be computationally expensive.

To speed up image generation, the diffusion process can be run not on the pixel images themselves but on a compressed version of the image. This compression (and the later decompression, or painting) is done by an autoencoder. The autoencoder compresses the image into the latent space using its encoder, then reconstructs it from the compressed information alone using its decoder.

Midjourney is based on the class of algorithms just described, with fine-tuning applied (that is, retraining the neural network, or part of it, on data from the same domain, using the pre-trained weights as the initial condition), together with specific pre-processing techniques and post-processing of the resulting image.
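To make the compression idea concrete, here is a toy sketch of the encode and decode round trip performed by an autoencoder. It is a hand-written stand-in, not a trained model: the hypothetical `encode` averages small blocks of pixels and the hypothetical `decode` spreads each value back over its block, whereas a real autoencoder learns both operations from data. The sizes in the final lines (a 512 x 512 RGB image and a 64 x 64 x 4 latent array) are figures commonly quoted for the first Stable Diffusion release; they do not come from this text and should be read as an assumption.

```python
def encode(image, factor=2):
    """Toy encoder: average each factor x factor block into one latent value."""
    height, width = len(image), len(image[0])
    return [
        [
            sum(image[r + dr][c + dc] for dr in range(factor) for dc in range(factor))
            / factor**2
            for c in range(0, width, factor)
        ]
        for r in range(0, height, factor)
    ]


def decode(latent, factor=2):
    """Toy decoder: repeat each latent value over a factor x factor block."""
    return [
        [latent[r // factor][c // factor] for c in range(len(latent[0]) * factor)]
        for r in range(len(latent) * factor)
    ]


image = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.5, 0.5, 0.25, 0.75],
    [0.5, 0.5, 0.75, 0.25],
]
latent = encode(image)            # 2 x 2 array: four times fewer values
reconstruction = decode(latent)   # 4 x 4 again, but fine detail is lost

# Assumed sizes (not from this text): 512 x 512 RGB pixels vs 64 x 64 x 4 latents.
pixel_values = 512 * 512 * 3
latent_values = 64 * 64 * 4
compression_ratio = pixel_values / latent_values
print(compression_ratio)  # prints: 48.0
```

Under these assumed sizes, every diffusion step operates on 48 times fewer numbers than it would in pixel space, which is where the speed-up comes from. The toy round trip also shows the price of compression: the checkerboard detail in the bottom-right block is flattened to a uniform value in the reconstruction.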