
AI from Scratch #5: How Your Phone Recognizes Your Face (Even with Sunglasses)

You can spot your best friend across a packed cafeteria in half a second. Your phone does the same thing — here's how computer vision actually works.


Raghu Mudumbai

CEO & Chief Scientist, netcausal.ai

Spotting Your Friend in a Crowd

You're in a packed school hallway between classes. Hundreds of people, all moving fast. But somehow, in half a second, you spot your best friend three rows deep. You didn't scan every face. You recognized their height, their hair, the way they walk, that jacket they always wear.

Now think about how weird that is. Your eyes took in a chaotic mess of shapes, colors, and movement — and your brain instantly said "that's Maya" or "that's Jordan." You didn't think about it. It just happened.

That's computer vision — teaching AI to look at pixels and understand what it's seeing. It's how your phone unlocks with your face, how self-driving cars spot stop signs, and how Instagram knows exactly where to put dog ears on your selfie.

A Photo Is Just Numbers

Here's the first thing to understand: a computer doesn't see images the way you do. It sees numbers.

Every digital photo is a grid of tiny dots called pixels. Each pixel has a color value — a number for how much red, green, and blue it contains. A typical phone photo might be 4,000 × 3,000 pixels. That's 12 million pixels, each holding three color numbers: 36 million tiny numbers in all.

So when your phone "looks" at your face, it's not seeing eyes, a nose, and a mouth. It's seeing something like:

  • Pixel (1042, 756): Red 210, Green 180, Blue 165
  • Pixel (1043, 756): Red 212, Green 181, Blue 166
  • ...12 million more of these

The challenge: how do you get from all those numbers to "that's a human face"?
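You can see this grid of numbers yourself. Here's a minimal sketch with NumPy: a fake 2 × 2 "photo" using color values like the ones in the bullets above (the pixel values are invented for illustration):

```python
import numpy as np

# A photo is just a 3-D array: height x width x 3 color channels (R, G, B).
photo = np.array([
    [[210, 180, 165], [212, 181, 166]],
    [[208, 179, 164], [211, 180, 165]],
], dtype=np.uint8)

print(photo.shape)   # (2, 2, 3): 2 rows, 2 columns, 3 color numbers per pixel
r, g, b = photo[0, 0]
print(r, g, b)       # 210 180 165 -- the red, green, blue of one pixel
print(photo.size)    # 12 numbers here; a 4000x3000 photo holds 36,000,000
```

Swap the fake array for a real image (e.g. loaded with a library like Pillow) and the shape simply grows to something like `(3000, 4000, 3)` — the same idea, just bigger.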

Building Up From Edges

Your brain doesn't process a photo all at once either. Neuroscientists discovered that your visual cortex works in layers — strikingly similar to how AI does it.

Layer 1: Edges. The first thing your brain (and an AI) detects is edges — where one color stops and another begins. The boundary between your hair and the wall behind you. The outline of your jaw. The edge of your eyebrow.

Layer 2: Shapes. Edges combine into shapes. A curved edge becomes an arc. Two arcs near each other become a circle. A circle with a dark center becomes... an eye.

Layer 3: Features. Shapes combine into features. Two eyes above a nose above a mouth = a face layout. Pointy shapes on top of a round head = ears (or a cat).

Layer 4: Objects. Features combine into recognition. "This arrangement of eyes, nose, mouth, with these proportions = my friend Maya."

This layered approach is exactly how a Convolutional Neural Network (CNN) works — the type of AI that powers most computer vision today. Each layer builds on the one before it, going from simple patterns to complex understanding.

How Filters Work (Like Instagram, but Smarter)

You know how Instagram filters can sharpen a photo, blur the background, or detect edges? CNNs use similar filters, but they learn which filters to use on their own.

Imagine sliding a tiny magnifying glass across a photo, checking one small patch at a time. Each filter asks a specific question:

  • "Is there a vertical edge here?"
  • "Is there a curve here?"
  • "Is there a skin-tone color here?"

Early layers use simple filters: edge detectors, corner detectors, color patterns. Deeper layers combine those into complex filters: "eye detector," "nose detector," "wheel detector."

The magic is that nobody programs these filters manually. The CNN learns them during training — the same trial-and-error process from Article #1. Show it millions of labeled photos ("this is a face," "this is a car," "this is a dog"), and it figures out which filters are useful for telling them apart.
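The sliding-filter idea fits in a few lines of NumPy. This sketch uses a hand-made vertical-edge filter and a made-up grayscale image; in a real CNN, the filter's numbers would be learned during training rather than written by hand:

```python
import numpy as np

# A tiny grayscale "image": dark on the left, bright on the right.
img = np.array([
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
], dtype=float)

# A hand-made vertical-edge filter. A CNN would *learn* values like these.
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

# Slide the 3x3 filter across the image, one patch at a time.
h, w = img.shape
out = np.zeros((h - 2, w - 2))
for y in range(h - 2):
    for x in range(w - 2):
        patch = img[y:y + 3, x:x + 3]
        out[y, x] = (patch * kernel).sum()

print(out)
# Big responses appear only where dark meets bright -- the vertical edge.
```

In flat regions the filter's positive and negative halves cancel to zero; only where the patch straddles the dark-to-bright boundary does it respond strongly. That response map is exactly what gets passed on to the next layer.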

Face Recognition: It's Personal

Detecting "a face" is one thing. Knowing whose face it is? That's harder.

Face recognition systems work in two steps:

Step 1: Find the face. Scan the image and locate rectangular regions that contain faces. This uses those learned filters — eye-shaped features + nose-shaped features + mouth-shaped features in the right arrangement = probably a face.

Step 2: Create a "face fingerprint." The CNN converts each detected face into a list of numbers — called an embedding — that captures what makes that face unique. The distance between your eyes, the shape of your jawline, the proportions of your features — all encoded as numbers.

Your phone doesn't store a photo of your face. It stores this numerical fingerprint. When you try to unlock, it computes a new fingerprint from the camera image and checks: does this match the stored one? If the numbers are close enough — even with sunglasses, a new haircut, or different lighting — it unlocks.
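Here's a toy sketch of that "close enough" check. The four-number fingerprints and the threshold are invented for illustration; real systems use embeddings of 128 or more numbers and carefully tuned thresholds:

```python
import numpy as np

# Hypothetical "face fingerprints" (embeddings), four numbers each.
stored = np.array([0.12, 0.80, 0.45, 0.33])    # saved when you enrolled your face
tonight = np.array([0.14, 0.78, 0.47, 0.30])   # you, in dim light with sunglasses
stranger = np.array([0.90, 0.10, 0.65, 0.72])  # someone else entirely

def matches(a, b, threshold=0.2):
    """Unlock if the fingerprints are close enough (Euclidean distance)."""
    return np.linalg.norm(a - b) < threshold

print(matches(stored, tonight))   # True: small differences, still you
print(matches(stored, stranger))  # False: far away in fingerprint space
```

Sunglasses or a haircut nudge a few of the numbers slightly; a different person lands somewhere else in fingerprint space entirely. The threshold is the dividing line between "same face, different day" and "different face."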

That's why it works with sunglasses but not with a mask covering your nose and mouth. The AI relies on certain facial proportions, and covering too much removes the data it needs.

Self-Driving Cars: Vision at 60 MPH

The same technology, scaled up massively, powers self-driving cars. But instead of recognizing one face, the car needs to recognize everything: pedestrians, other cars, lane markings, traffic lights, stop signs, cyclists, construction cones, that random plastic bag blowing across the road.

And it needs to do it 30 times per second, in real time, in rain, snow, and darkness.

Self-driving vision systems use multiple cameras (plus radar and lidar), and each frame goes through CNNs that simultaneously:

  • Detect objects ("there's a pedestrian")
  • Classify them ("it's a child")
  • Track them ("they're moving left at 3 mph")
  • Predict what they'll do ("they're about to cross the street")

All in about 50 milliseconds. That's faster than you can blink.

The Limits: When AI Vision Fails

AI vision isn't perfect. It can be fooled in ways that would never trick a human:

  • A few stickers placed on a stop sign can make an AI read it as a speed limit sign, even though a human would still clearly see "STOP"
  • Photos slightly modified with invisible noise (called adversarial examples) can make an AI think a panda is a gibbon with 99% confidence
  • AI trained mostly on photos from certain countries may struggle to recognize objects common in other cultures

These failures happen because AI doesn't understand what it sees the way you do. It recognizes patterns in pixels. If the pixels are disrupted in just the right way, the pattern breaks — even if the image looks perfectly normal to you.
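Real adversarial attacks exploit the network's own gradients, but the geometric idea — a tiny nudge pushing a borderline pattern across a decision boundary — can be shown with a toy nearest-centroid "classifier." Every number here is invented:

```python
import numpy as np

# Toy classifier: whichever class center is nearer wins.
panda = np.array([1.0, 1.0])
gibbon = np.array([3.0, 3.0])

def classify(x):
    if np.linalg.norm(x - panda) < np.linalg.norm(x - gibbon):
        return "panda"
    return "gibbon"

image = np.array([1.9, 1.9])   # clearly closer to the "panda" center
noise = np.array([0.2, 0.2])   # a tiny nudge, invisible in a real photo

print(classify(image))          # panda
print(classify(image + noise))  # gibbon: the pattern crossed the boundary
```

The image barely changed, but it sat close to the boundary, so a nudge too small to notice flipped the answer. Deep networks have vastly more complicated boundaries, which is why carefully crafted invisible noise can find a crossing point.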

Try It Yourself

Hold your phone at arm's length and slowly cover your face with your hand, starting from the chin going up. Watch when Face ID stops working. You'll find there's a threshold — cover too much of the critical region (eyes, nose, cheekbones) and it can't match the pattern anymore.

Now try it with sunglasses, a hat, different lighting. Notice what the AI can handle and what breaks it. You're probing the limits of a CNN — finding out which features it considers essential.

The Big Takeaway

Computer vision works by building up understanding in layers — from pixels to edges, from edges to shapes, from shapes to features, from features to objects. It's remarkably similar to how your own visual cortex processes information.

The same core technology that unlocks your phone also powers Instagram filters, medical image analysis (detecting tumors in X-rays), quality control in factories, and the eyes of self-driving cars. The difference is what it's trained to look for.

What's Next

In Article #6, we'll flip the script. Instead of AI recognizing images, we'll look at how AI creates them. How does DALL-E turn the sentence "a cat riding a skateboard on Mars" into a photorealistic image? The answer involves starting with pure static and slowly sculpting it into art.


This is part of the AI from Scratch series — making AI and machine learning understandable for everyone, no PhD required. Follow along on Medium or at netcausal.ai/blog.

