
AI from Scratch #4: How AI Learned to Beat You at Video Games

You trained your dog with treats. You learned to skateboard by eating pavement. AI learns the exact same way — by trying stuff, failing, and chasing rewards.

Raghu Mudumbai

CEO & Chief Scientist, netcausal.ai

Training a Puppy, Basically

You just got a puppy. You want it to sit on command. Do you hand it a textbook? Show it a PowerPoint? Obviously not. You say "sit," and when it accidentally sits, you give it a treat. When it doesn't sit, no treat.

After enough repetitions — sit, treat, sit, treat — the puppy connects the dots: that sound + this action = snack. Nobody explained sitting. The puppy figured it out by trial and error, driven by one thing: the reward.

That's reinforcement learning. And it's exactly how AI learned to play chess, dominate video games, and beat the best human Go player on the planet.

The Three Ingredients

Every reinforcement learning system has three simple pieces:

  • Agent — The thing making decisions (the puppy, or the AI)
  • Environment — The world it's operating in (your living room, or a video game)
  • Reward signal — The feedback that says "good job" or "bad move" (treats, or points)

The agent takes an action in the environment. The environment responds. The agent gets a reward (or a penalty). Then it adjusts its strategy and tries again.

That's the entire loop: act → observe → reward → adjust → repeat.
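
If you like seeing things in code, here is roughly what that loop looks like. This is a minimal sketch in Python: the `Environment` and `Agent` classes are made-up stand-ins rather than any real library, and the "game" is a one-step toy where action 1 happens to be the rewarding one.

```python
import random

# Minimal sketch of the act -> observe -> reward -> adjust -> repeat loop.
# Environment and Agent are hypothetical stand-ins, not a real library.

class Environment:
    def reset(self):
        """Start a new episode and return the first observation."""
        return 0

    def step(self, action):
        """Apply the action; return (next_observation, reward, done)."""
        reward = 1 if action == 1 else 0  # toy rule: action 1 earns a treat
        return 0, reward, True            # one-step episodes keep it tiny

class Agent:
    def act(self, observation):
        """Pick an action (here: just guess)."""
        return random.choice([0, 1])

    def learn(self, observation, action, reward):
        """Adjust the strategy based on the reward (omitted in this sketch)."""
        pass

env, agent = Environment(), Agent()
for episode in range(5):
    obs, done = env.reset(), False
    while not done:
        action = agent.act(obs)                    # act
        next_obs, reward, done = env.step(action)  # observe + reward
        agent.learn(obs, action, reward)           # adjust
        obs = next_obs                             # repeat
```

Every reinforcement learning system, from the puppy to AlphaGo, is some version of this loop; the interesting part is what happens inside `learn`.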

Sound familiar? It should. In Article #1, neural networks learned by adjusting weights after making mistakes. Reinforcement learning is the same idea, except the AI isn't just classifying cats — it's making sequences of decisions over time.

How AI Beat Atari (Without Reading the Instructions)

In 2013, a company called DeepMind (now part of Google) built an AI that could play Atari games — Breakout, Pong, Space Invaders — better than humans. Here's the wild part: nobody told it the rules.

The AI saw only two things:

  1. The pixels on the screen (raw image data)
  2. The score (reward signal)

That's it. No manual. No tutorial. No "the paddle hits the ball." It just started pressing random buttons and watching what happened to the score.

At first, it was terrible. Random button mashing. Score: basically zero.

But every time the score went up, even by a little, the AI thought: "Whatever I just did, do more of that." Every time the score dropped: "Do less of that."
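
In code, "do more of that" usually means nudging a numerical estimate of how good each action seems. Here is a toy sketch of that idea. To be clear, this is a simplified action-value update, not DeepMind's actual system (which used a deep neural network, DQN, fed with raw screen pixels); the action names and reward rule here are invented for illustration.

```python
import random

# Estimated "goodness" of each action, all starting at zero.
values = {"left": 0.0, "right": 0.0, "fire": 0.0}
learning_rate = 0.1

def update(action, score_change):
    """Nudge the action's value toward the reward it just produced."""
    values[action] += learning_rate * (score_change - values[action])

# Pretend gameplay: in this toy world, only "fire" ever raises the score.
for _ in range(1000):
    action = random.choice(list(values))
    score_change = 1.0 if action == "fire" and random.random() < 0.5 else 0.0
    update(action, score_change)

print(values)  # "fire" ends up with the highest estimated value
```

After enough pretend games, the value for "fire" drifts up toward its average payoff while the others stay near zero. That is the code version of "whatever I just did, do more of that."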

After millions of games — playing way faster than any human could — something incredible happened. In Breakout, the AI discovered a strategy that even the game's designers didn't anticipate: tunnel through one side of the wall and bounce the ball behind the bricks. Maximum points with minimum effort.

Nobody taught it that. It figured it out from pure trial and error.

The Explore vs. Exploit Problem

Here's a dilemma you face every day without realizing it.

It's Friday night. You could go to your favorite restaurant — the one you know is great. Or you could try that new place that just opened, which might be amazing... or might be terrible.

Going to the favorite = exploiting what you already know works. Trying the new place = exploring to potentially find something better.

If you always exploit, you'll never discover anything new. If you always explore, you'll waste time on bad options and never enjoy the good ones.

Reinforcement learning AI faces the exact same trade-off. It has to balance:

  • Exploit: Repeat actions that gave high rewards before
  • Explore: Try random new actions that might lead to even higher rewards

Getting this balance right is one of the hardest problems in AI. Too much exploitation = the AI gets stuck in a mediocre strategy. Too much exploration = it never settles on anything good.
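
One simple, widely used recipe for this balance is called epsilon-greedy: most of the time, pick the action with the best track record (exploit), but with a small probability epsilon, pick a random one instead (explore). A minimal sketch, with made-up action values:

```python
import random

def choose_action(values, epsilon=0.1):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(list(values))  # explore: try anything
    return max(values, key=values.get)      # exploit: best-known action

action_values = {"left": 0.2, "right": 0.9, "fire": 0.5}
print(choose_action(action_values))  # usually "right", occasionally a surprise
```

Tuning epsilon is exactly the trade-off described above: a larger epsilon means more time spent exploring, a smaller one means the agent settles faster on whatever it already believes is best.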

The AlphaGo Moment

In 2016, Google's AlphaGo defeated Lee Sedol, one of the greatest Go players in history. Go is an ancient board game with more possible positions than atoms in the universe. Chess computers had existed for decades, but Go was considered too complex for AI.

AlphaGo started by studying a large database of human expert games, then improved through reinforcement learning: it played millions of games against itself, literally training by being its own opponent. Each game produced a winner and a loser, and that was the reward signal: win = good, lose = bad.
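
To make "win = good, lose = bad" concrete, here is a toy sketch of how one finished self-play game can be turned into a reward for every move that led to it. This only illustrates the idea of crediting moves from the final result; it is not AlphaGo's actual training code.

```python
def rewards_from_outcome(num_moves, won):
    """Turn a finished self-play game into a per-move reward signal:
    every move in a winning game gets +1, every move in a losing game gets -1."""
    return [1 if won else -1] * num_moves

# Example: a 150-move self-play game that ended in a win.
print(rewards_from_outcome(150, won=True)[:5])  # [1, 1, 1, 1, 1]
```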

In Game 2, AlphaGo played Move 37 — a move so unusual that human experts thought it was a mistake. No human would play it. But it turned out to be brilliant, and AlphaGo won the game. The AI had discovered a strategy that thousands of years of human gameplay had missed.

Lee Sedol later said the experience was "beautiful." The AI hadn't just memorized human strategies — it had invented new ones.

Why Not Use Reinforcement Learning for Everything?

If RL is so powerful, why don't we use it for all AI? Because it has a major weakness: it needs a clear reward signal, and it needs a lot of practice.

Training a puppy to sit? Clear reward. Training an AI to maximize a game score? Clear reward.

But what reward do you give an AI for "write a good essay"? Or "have a helpful conversation"? Or "be creative"? These are fuzzy, subjective goals — and defining the right reward is often harder than building the AI itself.

This is called the reward design problem, and getting it wrong can lead to bizarre behavior. An AI playing a boat racing game once figured out that spinning in circles and collecting bonus items scored more points than actually finishing the race. Technically, it maximized the reward. Just... not the way anyone intended.

Try It Yourself

Think about the last time you learned something without anyone giving you explicit instructions. Maybe a new video game, a skateboard trick, or cooking a meal for the first time.

You probably followed this exact loop:

  1. Tried something
  2. Saw what happened
  3. Adjusted your approach
  4. Tried again

You were doing reinforcement learning. The "reward" was the game working, the trick landing, or the food tasting decent. The "penalty" was losing, falling, or burning dinner.

Now imagine doing that loop a million times faster, with perfect memory of every attempt. That's AI.

The Big Takeaway

Reinforcement learning is how AI learns from experience instead of instructions. It tries actions, gets feedback, and gradually discovers strategies that maximize rewards — just like a puppy learning to sit, a skateboarder learning kickflips, or a gamer mastering a new level.

The most impressive part? It can discover strategies that humans never thought of. Not because it's smarter, but because it can try billions of possibilities that no human has the time or patience to explore.

What's Next

In Article #5, we'll explore how AI sees — how your phone recognizes your face even when you're wearing sunglasses, how self-driving cars spot pedestrians, and how Instagram filters know exactly where your nose is. It all starts with a simple question: how do you turn a photo into numbers?


This is part of the AI from Scratch series — making AI and machine learning understandable for everyone, no PhD required. Follow along on Medium or at netcausal.ai/blog.
