The World's Weirdest Pictionary Game
You're playing Pictionary. Your teammate draws a card that says "a cat riding a skateboard on the moon." They grab a marker and start drawing. First, rough shapes — a circle for the moon, a blob for the cat. Then details — whiskers, wheels, craters. Then shading and refinement until you can actually tell what it is.
They started with nothing and built up a picture, step by step, guided by the words on the card.
Now imagine that same process, but in reverse. Instead of building up from a blank canvas, what if you started with a TV screen full of static — pure random noise — and removed the noise, step by step, until a picture of a cat on a skateboard on the moon appeared?
That's basically how AI image generators like DALL-E, Midjourney, and Stable Diffusion work. And it's one of the most mind-bending ideas in modern AI.
Starting with Static
Seriously. The AI starts with random noise — a grid of completely random pixel colors. It looks like the static on an old TV.
Then it asks itself: "If I take a tiny bit of noise away, what should the image look like?" It removes a small amount of randomness, and the image gets slightly less chaotic. Then it removes a bit more. And a bit more.
After hundreds of these tiny denoising steps, the random static has been sculpted into a coherent image. Like Michelangelo chipping away at a block of marble and saying "the statue was always in there — I just removed the extra stone."
The technology behind this is called a diffusion model, and it works in two phases:
Training (learning what images look like): Take millions of real images. For each one, gradually add noise until it becomes pure static. The AI learns to reverse this process — given a noisy image, predict what it looked like one step less noisy.
Generation (creating new images): Start with pure static. Apply the learned denoising process step by step. The AI doesn't copy any training image — it generates something entirely new by applying everything it learned about what images look like.
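Here's a toy sketch of those two phases in code. Everything in it is a stand-in: the "image" is just four numbers, and where a real model uses a trained neural network to predict the denoised image, this sketch cheats and hands the loop the answer, purely to show the step-by-step mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny stand-in "image": four pixel brightness values.
clean = np.array([0.2, 0.8, 0.5, 0.9])

# Phase 1 (training-time forward process): bury the image in noise.
# Real training does this to millions of images so the network can
# learn to undo each small corruption.
steps = 100
noisy = clean.copy()
for _ in range(steps):
    noisy += rng.normal(0.0, 0.1, size=clean.shape)

# Phase 2 (generation): start from pure static and denoise step by step.
# A real model uses its trained network to guess the clean image; here
# `predicted_clean` is a fake perfect predictor, just to show the loop.
x = rng.normal(0.0, 1.0, size=clean.shape)   # pure random noise
for _ in range(steps):
    predicted_clean = clean                   # stand-in for the network
    x += 0.05 * (predicted_clean - x)         # remove a little noise

# After many small steps, x has been sculpted close to `clean`.
```

Each pass through the second loop shaves off a fraction of the remaining randomness, which is why the image "emerges" gradually rather than appearing all at once.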
Where Do the Words Come In?
Here's where it gets really cool. The denoising process doesn't happen in a vacuum — it's guided by your text prompt.
When you type "a golden retriever giving a TED talk about quantum physics," the AI converts your words into a numerical representation (similar to how language models work in Article #3). This representation acts like a compass during denoising, pulling the random noise toward an image that matches your description.
At each denoising step, the AI doesn't just ask "what should this look like?" It asks "what should this look like, given that it's supposed to be a golden retriever giving a TED talk about quantum physics?"
The words steer the noise removal. "Golden retriever" pulls toward dog-shaped features with golden fur. "TED talk" pulls toward a stage and podium. "Quantum physics" might pull toward equations on a screen behind the dog.
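In real systems like Stable Diffusion, this steering is commonly done with a trick called classifier-free guidance: at each step the model predicts the noise to remove twice, once ignoring the prompt and once conditioned on it, then exaggerates the difference between the two. A minimal sketch, with made-up numbers standing in for the model's two predictions:

```python
import numpy as np

# Stand-ins for the model's two noise predictions at one denoising step:
# one made without looking at the prompt, one conditioned on it.
noise_unconditional = np.array([0.1, 0.3, 0.2])
noise_given_prompt = np.array([0.4, 0.1, 0.5])

# Classifier-free guidance: amplify whatever difference the prompt makes.
# A guidance scale above 1 pushes the image harder toward the text.
guidance_scale = 7.5
guided_noise = noise_unconditional + guidance_scale * (
    noise_given_prompt - noise_unconditional
)
```

A higher guidance scale makes the result follow the prompt more literally, usually at the cost of variety; a scale of 1 ignores the amplification entirely.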
The result: an image that never existed before, created entirely from random noise and guided by your words.
Why the Results Are So Good (and So Weird)
Think about what the AI learned during training. It saw millions of photos of dogs, stages, whiteboards, people giving presentations. It learned what each of those things looks like from hundreds of angles, in thousands of lighting conditions.
When you combine concepts that never appeared together in training — like a dog at a podium — the AI blends what it knows about dogs with what it knows about podiums. Usually, the blend looks surprisingly good. Sometimes, it looks bizarre.
This is why AI-generated images often have weird hands. The AI has seen hands in millions of photos, but hands are incredibly complex — five fingers, multiple joints, different angles, overlapping with objects. The AI learned approximate patterns for hands, but the details often don't quite work. It's gotten much better, but hands remain one of the hardest things for AI to generate correctly.
It's the same reason you might be great at drawing faces but terrible at drawing hands. Some visual patterns are just harder to learn than others.
The Deepfake Question
The same technology that creates cute AI art can also create convincing fake photos and videos of real people. These are called deepfakes, and they raise serious concerns.
If AI can generate a photorealistic image of anyone doing anything, how do you trust what you see online? The answer is: you can't always. And that's a problem society is still figuring out.
Some approaches to fighting deepfakes:
- Watermarking: AI companies embed invisible markers in generated images that detection tools can find
- Detection AI: Other AI systems trained specifically to spot AI-generated images (fighting AI with AI)
- Digital provenance: New standards that track where an image came from and whether it's been modified
It's an arms race — generators get better, detectors get better, and the cycle continues. Media literacy — knowing that any image could be fake — is becoming as important as reading comprehension.
Beyond Images: Music, Video, and 3D
Text-to-image was just the beginning. The same diffusion approach is now being applied to:
- Video: Generate short video clips from text descriptions
- Music: Create original songs from a text prompt describing the style and mood
- 3D models: Generate 3D objects from text for games, VR, and design
- Code: Generate software from natural language descriptions (most AI coding assistants actually use the language-model approach from Article #3, but diffusion-based code generation is an active research area)
The core idea is always the same: start with noise, denoise guided by a prompt, end up with something new. The "noise" just takes different forms for different types of content.
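Concretely, "the noise takes different forms" just means the random starting tensor has a different shape for each medium. A sketch with assumed, illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# The starting "static" is just a random array; only its shape changes.
image_noise = rng.normal(size=(64, 64, 3))      # height x width x RGB
audio_noise = rng.normal(size=(16000,))         # 1 second of 16 kHz audio
video_noise = rng.normal(size=(8, 64, 64, 3))   # 8 video frames
```

The denoising loop itself is the same idea in every case; only the network that does the predicting and the shape of the data change.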
Try It Yourself
Think of the most ridiculous image you can describe in one sentence. "An astronaut riding a horse on Mars." "A Victorian-era portrait of a corgi wearing a monocle." "A Renaissance painting of someone eating a burrito."
Now go to any free AI image generator and type it in. Watch how the image emerges from nothing in seconds. Then try slight variations of your prompt and see how the results change. Add "photorealistic" or "watercolor painting" or "pixel art" to the end and watch the style shift.
You're controlling a diffusion model — steering noise removal with your words. That's all the magic is.
The Big Takeaway
Generative AI creates images by starting with pure random noise and removing it step by step, guided by your text description. It doesn't copy existing images — it generates new ones by applying patterns learned from millions of training images.
The same core technology powers image generation, video creation, music composition, and more. It's arguably the most visible form of AI in daily life right now, and it works by learning to run the process of adding noise to images in reverse.
What's Next
In Article #7, we'll tackle one of the most important questions in AI: why can it be unfair? If an AI learns from data, and the data reflects real-world biases, the AI inherits those biases. We'll explore how this happens, why it matters, and what's being done about it.
This is part of the AI from Scratch series — making AI and machine learning understandable for everyone, no PhD required. Follow along on Medium or at netcausal.ai/blog.