How Image Generation Works · AI Engineer

1. How Image Generation Works

Making pictures from nothing

Here's something that would have sounded like science fiction a decade ago: you type a sentence, and a computer creates a brand new image that matches your description. Not a search result. Not a collage. A completely new picture that never existed before.

How?

To understand image generation, you first need to understand what an image actually is to a computer. It's a grid of pixels. Each pixel is a set of three numbers — red, green, and blue — ranging from 0 to 255. A 512x512 image is just 786,432 numbers arranged in a specific pattern.
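To make that arithmetic concrete, here is a minimal NumPy sketch. The variable name `image` is just an illustrative choice, and the height x width x channel layout is one common convention, not the only one.

```python
import numpy as np

# A 512x512 RGB image: height x width x 3 channels, one byte (0-255) per value.
image = np.zeros((512, 512, 3), dtype=np.uint8)

# Set the top-left pixel to pure red: (R, G, B) = (255, 0, 0).
image[0, 0] = (255, 0, 0)

print(image.shape)  # (512, 512, 3)
print(image.size)   # 786432 = 512 * 512 * 3
```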

Generating an image means choosing the right values for all those numbers. And the "right" values depend entirely on what the image is supposed to look like.

That's the core challenge. The space of possible 512x512 images is unimaginably huge. Almost all random combinations look like TV static. Only a tiny, tiny fraction look like actual photos, paintings, or anything recognizable.
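You can put a rough number on "unimaginably huge". The sketch below, which assumes 8-bit RGB values, counts the decimal digits in the number of possible 512x512 images (roughly 1.9 million digits) and then draws one image uniformly at random, which is what static is. All names are illustrative.

```python
import math
import numpy as np

num_values = 512 * 512 * 3          # 786,432 numbers per image
# Each value has 256 options, so there are 256 ** num_values possible images.
# Count the decimal digits of that number without ever building it.
digits = math.floor(num_values * math.log10(256)) + 1
print(f"256^{num_values} has about {digits:,} decimal digits")

# A uniformly random point in that space: every value drawn independently.
rng = np.random.default_rng(0)
static = rng.integers(0, 256, size=(512, 512, 3), dtype=np.uint8)

# Neighbouring pixels in static are uncorrelated, unlike in real photos,
# where a pixel usually looks a lot like the one next to it.
left = static[:, :-1].astype(float).ravel()
right = static[:, 1:].astype(float).ravel()
print(np.corrcoef(left, right)[0, 1])  # close to 0
```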

So the question becomes: how do you teach a model to only generate the combinations that look like real images?

Over the past decade, three major approaches have tackled this problem: autoencoders, GANs, and diffusion models. Each one built on the lessons of the last. Understanding the evolution explains why we ended up where we are.

The compression approach: autoencoders

One early idea was surprisingly intuitive. What if you could teach a network to compress images into a small set of numbers, and then reconstruct them back?

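That compress-and-rebuild network is called an autoencoder. The version used for generating images is the Variational Autoencoder, or VAE, which also trains the codes to follow a simple, known distribution so that random codes decode into plausible images.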

Think of it like this. You have a photo of a cat. The encoder squishes it down into maybe 100 numbers — a compact "code" that captures the important features. Pointy ears. Fur color. Eye shape. The decoder takes those 100 numbers and tries to rebuild the original image.

The clever part: once the model is trained, you can throw away the encoder entirely. Just sample random numbers for the latent code and run them through the decoder. Out comes a new image.
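Here is a rough sketch of that encode, decode, and sample flow. It is untrained and heavily simplified: each network is a single random linear layer (the weight names `W_enc` and `W_dec` are made up for this example), and a real VAE encoder would output a mean and variance per latent dimension and be trained with an extra regularization term. Treat it as a picture of the data flow, not a working generator.

```python
import numpy as np

rng = np.random.default_rng(0)

image_dim = 64 * 64 * 3   # a small flattened RGB image, to keep the sketch light
latent_dim = 100          # the compact "code" from the text

# Untrained stand-ins for the two networks: one linear layer each.
W_enc = rng.normal(0, 0.01, size=(latent_dim, image_dim))
W_dec = rng.normal(0, 0.01, size=(image_dim, latent_dim))

def encode(x):
    """Compress a flattened image into a latent code."""
    return W_enc @ x

def decode(z):
    """Rebuild a flattened image (values in 0..1) from a latent code."""
    return 1.0 / (1.0 + np.exp(-(W_dec @ z)))  # sigmoid keeps pixels in range

# Training would minimise the error between x and decode(encode(x)).
x = rng.random(image_dim)
reconstruction_error = np.mean((x - decode(encode(x))) ** 2)

# Generation: skip the encoder, sample a code from N(0, I), and decode it.
z = rng.standard_normal(latent_dim)
new_image = decode(z).reshape(64, 64, 3)
print(new_image.shape, reconstruction_error)
```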

The downside? VAE images tend to look blurry. The model plays it safe — instead of committing to sharp details, it hedges its bets and produces an average of what the image could be. Like asking a hundred painters to draw the same cat and then averaging all their paintings together — you'd get something vaguely cat-shaped but lacking any specific detail.

Why the blurriness?

The technical reason is interesting. The VAE has to pack every image into one smooth latent space, so the regions that belong to different images overlap. The decoder has to produce a single image for each latent code, and it is trained on pixel-by-pixel reconstruction error. When a code could plausibly correspond to several different images, the output that minimizes that error is their average, and an average of sharp images is a blurry one.
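The averaging effect is easy to reproduce in miniature. The toy example below assumes a mean-squared-error reconstruction loss and uses two 8-pixel "images" with a hard edge in different places. A decoder that must answer with one output for both does better, by the numbers, when it hedges.

```python
import numpy as np

# Two equally plausible "sharp" answers for the same latent code:
# a 1-D image with a hard edge at position 3, and one with the edge at 5.
edge_at_3 = np.array([0, 0, 0, 1, 1, 1, 1, 1], dtype=float)
edge_at_5 = np.array([0, 0, 0, 0, 0, 1, 1, 1], dtype=float)
targets = np.stack([edge_at_3, edge_at_5])

def expected_mse(prediction):
    """Average squared error of one fixed prediction against both targets."""
    return np.mean((targets - prediction) ** 2)

blurry = targets.mean(axis=0)   # [0, 0, 0, 0.5, 0.5, 1, 1, 1]

print(expected_mse(edge_at_3))  # committing to one sharp answer: 0.125
print(expected_mse(edge_at_5))  # committing to the other:        0.125
print(expected_mse(blurry))     # hedging with the average:       0.0625
```

The blurry answer scores twice as well as either sharp one, so that is what the loss pushes the decoder toward.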

This matters because it reveals a fundamental tradeoff. VAEs are great at learning the structure of images — what makes a face a face, what distinguishes a dog from a cat. But they struggle with sharpness. They understand the concept but fumble the details.

Still, the VAE idea of compressing images into a latent space turned out to be extremely important. Not for generating images directly, but as a building block inside more powerful systems. We'll see it again when we get to Stable Diffusion.

The counterfeiter and the detective: GANs

In 2014, Ian Goodfellow and his collaborators came up with a genuinely clever idea. What if you trained two neural networks against each other?

One network — the generator — tries to create fake images. The other — the discriminator — tries to tell fakes from real photos.

The analogy that stuck: it's a counterfeiter vs. a detective. The counterfeiter keeps making better fakes. The detective keeps getting better at spotting them. Over time, the counterfeiter gets so good that even the detective can't tell the difference.
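In code, the contest comes down to two loss functions computed from the same discriminator outputs. The sketch below uses made-up probabilities instead of real networks, and it assumes the commonly used "non-saturating" form of the generator loss. It only shows how each side is scored, not how either is trained.

```python
import numpy as np

def bce(predictions, labels):
    """Binary cross-entropy between predicted probabilities and 0/1 labels."""
    p = np.clip(predictions, 1e-7, 1 - 1e-7)
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

# Hypothetical discriminator outputs: probability that each image is real.
d_on_real = np.array([0.9, 0.8, 0.95])   # real photos
d_on_fake = np.array([0.2, 0.1, 0.3])    # generator samples

# Detective: wants real -> 1 and fake -> 0.
d_loss = bce(d_on_real, np.ones(3)) + bce(d_on_fake, np.zeros(3))

# Counterfeiter: wants the detective to say "real" (1) on its fakes.
g_loss = bce(d_on_fake, np.ones(3))

print(f"discriminator loss: {d_loss:.3f}")
print(f"generator loss:     {g_loss:.3f}")
```

With these numbers the detective is winning, so its loss is low and the counterfeiter's is high. Training alternates small updates to each network to push its own loss down.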

This was the Generative Adversarial Network, or GAN for short. And it worked. GAN-generated faces went from blobby messes in 2014 to photorealistic portraits by 2018 (remember the "This Person Does Not Exist" website?).

Why GANs were a big deal

Before GANs, generated images were obviously fake. Blurry, distorted, uncanny. GANs produced sharp, detailed images because the discriminator kept pushing the generator to improve. If the fake had any telltale sign — weird teeth, asymmetric ears, strange backgrounds — the discriminator would catch it, and the generator would fix it.

Why GANs were also a headache

Training a GAN is like balancing on a knife edge. If the discriminator gets too good too fast, the generator gives up — it can't fool the detective, so it stops learning. If the generator gets too good too fast, the discriminator can't provide useful feedback.
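The "generator gives up" half of that knife edge shows up directly in the gradient the generator receives. The sketch below assumes the discriminator ends in a sigmoid and compares two generator losses: the original minimax form, log(1 - D), and the non-saturating alternative, -log(D). The function names are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def minimax_grad(logit):
    """Slope of the generator loss log(1 - D) w.r.t. the logit: -D."""
    return -sigmoid(logit)

def non_saturating_grad(logit):
    """Slope of the generator loss -log(D) w.r.t. the logit: -(1 - D)."""
    return -(1.0 - sigmoid(logit))

# `logit` is the discriminator's raw score for a fake image, D = sigmoid(logit).
# A very negative logit means the detective is certain the image is fake.
for logit in [0.0, -2.0, -5.0, -10.0]:
    print(f"D={sigmoid(logit):.5f}  "
          f"minimax grad={minimax_grad(logit):.5f}  "
          f"non-saturating grad={non_saturating_grad(logit):.5f}")
```

As the discriminator grows more confident, the minimax gradient shrinks toward zero and the generator has almost nothing to learn from. The non-saturating form keeps a usable signal, which is one reason it is the usual choice in practice.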

The best-known failure is mode collapse. The generator finds one type of image that fools the discriminator and just keeps making variations of that one thing. You wanted diverse faces, but you got 10,000 slightly different versions of the same face.

Researchers spent years developing tricks to stabilize GAN training: progressive growing, spectral normalization, Wasserstein loss. They worked, sort of. But GANs always felt fragile. One wrong hyperparameter and the whole thing fell apart.

The GAN zoo

Despite the difficulties, GANs spawned an incredible variety of specialized models. StyleGAN could generate faces so realistic that they fooled most people. Pix2Pix could convert sketches to photos. CycleGAN could turn horses into zebras (seriously). Each new variant addressed a specific weakness or added a new capability.

By 2020, the research community had published hundreds of GAN variants. Someone even created a "GAN Zoo" document trying to catalog them all.

But the underlying fragility never went away. Every time you wanted to train a GAN on a new type of data, you'd spend weeks tuning hyperparameters, adjusting learning rates, and crossing your fingers. There had to be a better way.

The noise revolution: diffusion models

Around 2020, a completely different approach started gaining traction. It was based on an idea that sounds almost backwards.

What if, instead of learning to generate images from scratch, you learned to remove noise from corrupted images?

That's the core idea behind diffusion models. And it turned out to be one of the most important breakthroughs in generative AI.

The forward process: adding noise

Take a clean image. Add a tiny bit of random noise. The image looks almost the same — maybe slightly grainy. Add a little more noise. And more. And more.

After enough steps (typically 1,000), the image is completely destroyed. It's pure static. No trace of the original remains.

This is the forward process, and it's easy. You're just mixing a small amount of random (Gaussian) noise into the pixels, step by step, and nothing has to be learned.

Step 0:    [Clean photo of a dog]
Step 200:  [Slightly noisy, dog still visible]
Step 500:  [Very noisy, vague shape of dog]
Step 800:  [Mostly noise, hard to see anything]
Step 1000: [Pure random noise]
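The progression above can be sketched in a few lines. This version assumes the DDPM formulation: a linear noise schedule from 1e-4 to 0.02 over 1,000 steps, and a closed-form shortcut that jumps straight to any step instead of looping. The random array `x0` stands in for a real photo, so the exact numbers are only indicative.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
# Assumed linear noise schedule (the values used in the DDPM paper).
betas = np.linspace(1e-4, 0.02, T)
# alpha_bar[t-1] = share of the original signal's variance left after t steps.
alpha_bar = np.cumprod(1.0 - betas)

def add_noise(x0, t, rng):
    """Jump straight to step t (1..T): scale the image down, mix noise in."""
    noise = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise, noise

x0 = rng.uniform(-1.0, 1.0, size=(64, 64, 3))  # stand-in for a clean image

for t in [1, 200, 500, 800, 1000]:
    x_t, _ = add_noise(x0, t, rng)
    corr = np.corrcoef(x0.ravel(), x_t.ravel())[0, 1]
    print(f"step {t:4d}: signal kept = {alpha_bar[t - 1]:.4f}, "
          f"correlation with original = {corr:.3f}")
```

The correlation with the original starts near 1 and falls to roughly 0 by the last step, which is the numerical version of "no trace of the original remains".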

The reverse process: learning to denoise

Here's where the magic happens. You train a neural network to do the opposite. Given a noisy image and a number telling it how much noise was added, the network predicts what the noise looks like.

If it knows what the noise looks like, it can subtract it. Remove the noise, and you get a slightly cleaner image. Do that 1,000 times, and you go from pure static to a clean image.
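Here is a sketch of that loop using the DDPM update rule, again assuming a linear noise schedule. There is no trained network here: `predict_noise` is an oracle that cheats by looking at the clean image, purely so the loop can run and you can watch it land on a clean result. Steps are indexed 0 to 999 in this snippet.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # assumed linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

x0 = rng.uniform(-1.0, 1.0, size=(16, 16, 3))  # the "clean image" to recover

def predict_noise(x_t, t):
    """Stand-in for the trained network. This oracle cheats by using x0;
    a real model would be a neural network that has never seen x0."""
    a = alpha_bar[t]
    return (x_t - np.sqrt(a) * x0) / np.sqrt(1.0 - a)

x = rng.standard_normal(x0.shape)        # start from pure static
for t in reversed(range(T)):             # t = 999, 998, ..., 0
    eps = predict_noise(x, t)
    # Subtract the predicted noise for this step and rescale.
    x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
    if t > 0:
        # Every step but the last adds back a little fresh noise.
        x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)

print(np.abs(x - x0).max())  # tiny: the oracle leads straight back to x0
```

With a real network, the prediction comes from what the model learned about images in general, so the loop ends at a new image rather than a known one.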

The training objective

During training, you show the model millions of examples. Each example is: a clean image, a randomly chosen noise step, and the noisy version of the image at that step. The model has to predict the noise that was added.

That's it. The loss function is just: how well did you predict the noise?
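A single training step can be sketched like this, assuming a mean-squared-error loss and a linear noise schedule. The `model` function is a hypothetical placeholder that always predicts zeros. A real one would be a neural network whose weights get updated to push this loss down.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))  # assumed schedule

def model(x_t, t):
    """Hypothetical placeholder for the denoising network (predicts zeros)."""
    return np.zeros_like(x_t)

def training_loss(x0_batch, rng):
    batch = x0_batch.shape[0]
    t = rng.integers(0, T, size=batch)               # random step per image
    noise = rng.standard_normal(x0_batch.shape)      # the noise to be predicted
    a = alpha_bar[t].reshape(batch, 1, 1, 1)
    x_t = np.sqrt(a) * x0_batch + np.sqrt(1.0 - a) * noise  # noisy version
    predicted = np.stack([model(x, s) for x, s in zip(x_t, t)])
    return np.mean((predicted - noise) ** 2)         # how well was it predicted?

x0_batch = rng.uniform(-1.0, 1.0, size=(8, 32, 32, 3))
loss = training_loss(x0_batch, rng)
print(loss)  # near 1.0 for this placeholder, since the noise has unit variance
```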

No adversarial training. No balancing two networks. No mode collapse. Just a single model learning to predict noise. It sounds too simple to work this well.

But it does.

The name "diffusion"

Why is it called diffusion? Think of a drop of ink in water. The ink diffuses — it spreads out randomly until the water is uniformly colored. You can't tell where the original drop was.

The forward process is like that diffusion. The structured image (the ink drop) spreads into random noise (uniform color). The reverse process is like magically un-diffusing the ink — going from uniform randomness back to a structured drop.

In physics, diffusion is a one-way process. You can't un-mix ink from water. But a neural network can learn to reverse it, because it has seen millions of examples of what the "ink drop" (clean images) looked like before diffusion happened. It knows what structure looks like, so it can reconstruct it from noise.

Why diffusion won

By 2022, diffusion models had overtaken GANs on virtually every image quality benchmark. A few reasons:

The biggest win was stable training. No more adversarial balancing act — you train one network with a straightforward loss function, and it converges reliably. Researchers could finally scale up models without the whole thing collapsing.

Then there was diversity. GANs tend to get stuck generating similar-looking outputs. Diffusion models explore the full range of what images can look like. You get genuine variety — not 10,000 versions of the same face.

Controllability turned out to be the killer feature for practical applications. The step-by-step denoising process gives you natural knobs to turn — you can guide the process at each step, conditioning on text, other images, or any signal you want. This is what made text-to-image generation possible.
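One widely used way to turn that knob is classifier-free guidance. It is named here as an example technique, not as the method this article has in mind. At each denoising step you run the model twice, with and without the prompt, and exaggerate the difference. The arrays below are made-up noise predictions for a single step.

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: push the prediction toward the condition."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Hypothetical noise predictions for one denoising step.
eps_uncond = np.array([0.10, -0.20, 0.30])   # model run without the prompt
eps_cond = np.array([0.30, -0.10, 0.10])     # model run with the prompt

for scale in [0.0, 1.0, 7.5]:
    print(scale, guided_noise(eps_uncond, eps_cond, scale))
```

A scale of 0 ignores the prompt, 1 uses the conditioned prediction as is, and larger values follow the prompt more aggressively at some cost to variety.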

And the images just look better. Sharper details, more coherent compositions, fewer artifacts. On top of all that, the math behind diffusion is clean and well-understood. GANs were more of a dark art — you'd try things and hope they worked. Diffusion gave the field rigorous foundations to build on.

This combination of practical quality and theoretical soundness is rare in deep learning. Diffusion took over quickly not just because of better results, but because researchers finally had a framework they could reason about and extend systematically.

A quick comparison

Property	VAE	GAN	Diffusion
Training	Stable	Fragile	Stable
Image quality	Blurry	Sharp	Sharp
Diversity	Good	Mode collapse risk	Excellent
Generation speed	Fast	Fast	Slow
Controllability	Limited	Limited	Excellent

The main tradeoff? Speed. A GAN generates an image in one forward pass. A diffusion model needs hundreds or thousands of denoising steps. That's a lot of compute.

But researchers have found ways to reduce the number of steps (from 1,000 down to 20-50), and to do the diffusion in a compressed space rather than on full-resolution pixels. That's actually the key insight behind Stable Diffusion — but that's the next article.
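Both savings are easy to put numbers on. The sketch below picks 50 evenly spaced steps out of 1,000 (actually skipping steps also needs a sampler designed for it, such as DDIM, which is not shown) and compares pixel count to latent count. The latent size is an assumption based on the 64x64x4 latents that Stable Diffusion v1 uses for 512x512 images.

```python
import numpy as np

T = 1000          # steps used during training
num_steps = 50    # steps actually run at generation time

# Visit an evenly spaced subset of the training steps, from noisy to clean.
timesteps = np.linspace(T - 1, 0, num_steps).round().astype(int)
print(timesteps[:5], "...", timesteps[-3:])

# Compressed space: sizes are an assumption, based on the 8x-downsampled,
# 4-channel latent used by Stable Diffusion v1 for 512x512 images.
pixel_values = 512 * 512 * 3
latent_values = 64 * 64 * 4
print(pixel_values / latent_values)  # 48.0
```

Twenty times fewer steps, each working on 48 times fewer numbers, is what moves diffusion from "a lot of compute" to something practical on a single GPU.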

Where things stand

GANs aren't dead. They're still used for specific tasks where speed matters — real-time image enhancement, video super-resolution, style transfer. But for general-purpose image generation, diffusion has become the default.

The field moved fast. From blurry VAE outputs in 2013, to photorealistic GAN faces in 2018, to text-guided diffusion art in 2022. Each generation solved the previous one's biggest weakness.

And the story doesn't stop at still images. The same diffusion framework — start with noise, learn to remove it — turns out to work for video, audio, 3D models, and more. The principle is surprisingly general.

What's next?

Now you know how a model generates an image from noise. But we skipped something important: how do you tell it what to generate? How does a text prompt turn into a specific picture?

That's exactly what the next article covers — Text-to-Image and Stable Diffusion — where CLIP, latent spaces, and U-Nets come together to turn your words into pixels.