๐Ÿ“ก
๐ŸŽฒ
๐Ÿง 

Understanding Information

What we really learn when we learn something โ€” and how to measure surprise, uncertainty, and prediction

โ†“
Chapter 1

When Does a Message Actually Tell You Something?

You wake up and check your phone. Your friend texts: "It's sunny today." How much have you actually learned? If you live in Seattle in November, this message is genuinely surprising โ€” you've learned quite a bit. If you live in Phoenix in July, you've learned almost nothing. You already knew it would be sunny.

Both messages contain the same words, the same "facts." But they don't give you the same amount of information. The technical definition connects information with surprise โ€” it's about how much uncertainty the message removes.

โ˜€๏ธ Same Message, Different Information
Click a city to see how much information "It's sunny today" carries

Information measures how much uncertainty gets removed, not just what facts get communicated.

Chapter 2

The Uncertainty You Started With

The amount you learn depends on how uncertain you were to begin with. If you live somewhere with completely unpredictable weather โ€” it rains about half the days and is sunny the other half โ€” then checking the forecast genuinely tells you something.

But if it's sunny 99 days out of 100, checking the forecast barely helps. You already had a pretty good idea what it would say.

โ˜• The Coffee Shop Scenario
Click each scenario to compare how much you learn from "Let's meet at Java Junction"

Same message โ€” "Let's meet at Java Junction" โ€” but completely different amounts of information. In Scenario 1, you learned nothing. In Scenario 2, you went from complete uncertainty to certainty.

Chapter 3

Why Rare Events Feel More Informative

If your normally punctual coworker shows up late, that tells you something โ€” maybe they hit traffic, maybe something's wrong. You pay attention. But if your chronically late coworker shows up late again, you barely notice. The unusual event is surprising because it contradicts your expectations.

๐Ÿšจ Fire Drill Surprise
Click the building to arrive at work and see what happens
Fire drill probability: 1 in 100 days

Rare events carry more information than common events. When something with a 1% chance actually happens, it's way more surprising โ€” more informative โ€” than when something with a 99% chance happens.

Chapter 4

Measuring Surprise with Numbers

Your friend is going to text you a single letter, and you need to figure out what it is by asking yes/no questions. The number of questions you need tells you the information content.

๐Ÿ”ค The Letter Guessing Game
Choose a set size, then click "Ask a Question" to narrow down the answer
With 2 equally likely options you need 1 question. With 4 you need 2 questions. With 8 you need 3 questions. The pattern? It's the logarithm base 2 of the number of options.
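
The pattern is easy to verify in a few lines of Python. The set sizes here are illustrative; 26, the full alphabet, shows the formula also handles sizes that aren't powers of two:

import math

# Questions needed to pin down one of n equally likely options
for n in [2, 4, 8, 26]:
    print(f"{n} options -> {math.log2(n):.2f} yes/no questions")
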
Chapter 5

When Options Aren't Equal

Real life isn't usually about equally likely options. When you wake up and see it's sunny (50% likely), you've learned some information. When you see it's snowy (5% likely), you've learned a lot. We want our measure to capture this difference.

Self-Information Formula
Self-information = โˆ’logโ‚‚(probability)
The rarer the event (smaller p), the higher the self-information.
๐Ÿ“Š Weather Self-Information
Compare how much information each weather outcome gives you
Sunny 50% → 1.0 bit
Cloudy 25% → 2.0 bits
Rainy 20% → 2.3 bits
Snowy 5% → 4.3 bits

The unit bit comes from "binary digit" โ€” it relates back to those yes/no questions. One bit is the information you get from one yes/no answer. The logarithm ensures that if two independent things happen, the total information is just the sum.
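
As a short Python sketch, here is the self-information of each weather outcome from the table above, plus a quick check of that additivity property (the two independent probabilities at the end are chosen purely for illustration):

import math

def self_information(p):
    """Surprise of an event with probability p, in bits."""
    return -math.log2(p)

weather = {"Sunny": 0.50, "Cloudy": 0.25, "Rainy": 0.20, "Snowy": 0.05}
for outcome, p in weather.items():
    print(f"{outcome}: {self_information(p):.1f} bits")

# Additivity: two independent events, p = 0.5 and p = 0.25.
# Their joint probability is 0.125, and the bits simply add: 1 + 2 = 3.
assert self_information(0.5 * 0.25) == self_information(0.5) + self_information(0.25)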

Chapter 6

The Average Surprise: Entropy

Entropy is the average self-information across all possible outcomes, weighted by how often each outcome occurs. It tells you how unpredictable a system is overall โ€” not just how surprising one event is, but how much uncertainty you face on average.

๐ŸŒก๏ธ Entropy Comparison: Two Cities
Click each city to see its entropy calculation

City A โ€” Varied Weather

Sunny 50% ยท Cloudy 25% ยท Rainy 20% ยท Snowy 5%

City B โ€” Predictable Weather

Sunny 90% ยท Rainy 10%
Entropy Formula
H = ฮฃ p(x) ร— (โˆ’logโ‚‚(p(x)))
City A: (0.5ร—1) + (0.25ร—2) + (0.20ร—2.3) + (0.05ร—4.3) โ‰ˆ 1.68 bits
City B: (0.9ร—0.15) + (0.1ร—3.32) โ‰ˆ 0.47 bits

High entropy = high unpredictability = lots of information when you observe the outcome. Low entropy = high predictability = little information when you observe the outcome.

Three ways to think about entropy: (1) The Guessing Game โ€” how hard it is to guess what will happen. (2) The Surprise Meter โ€” your average surprise level. (3) The Information Content โ€” how much each observation teaches you.
Chapter 7

Real-World Application: Text Messages

When you type a message, your phone tries to predict the next letter. Some predictions are easy: after "QU," the next letter is almost certainly a vowel. That's low entropy. But after "TH," the next letter could be E, A, I, O, R, or several others. That's higher entropy.

๐Ÿ“ฑ Predictive Text Entropy
Click a letter pair to see how predictable the next letter is

Your phone's predictive text works better when entropy is low. It's information theory in action โ€” the phone minimizes the information you need to transmit by exploiting patterns in language.
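
To make the contrast concrete, here is a sketch comparing the entropy of two next-letter distributions. The probabilities are made up for illustration; real values would come from counting letter pairs in an English corpus:

import math

def entropy(probs):
    return sum(p * -math.log2(p) for p in probs.values() if p > 0)

# Hypothetical next-letter distributions (illustrative numbers only)
after_qu = {"E": 0.55, "A": 0.25, "I": 0.15, "O": 0.05}
after_th = {"E": 0.35, "A": 0.20, "I": 0.15, "O": 0.10, "R": 0.10, "other": 0.10}

print(f'After "QU": {entropy(after_qu):.2f} bits')  # lower: easier to predict
print(f'After "TH": {entropy(after_th):.2f} bits')  # higher: harder to predict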

Chapter 8

When Your Model Doesn't Match Reality

What if you have a guess about how something works, but your guess is wrong? Imagine you predict your bus will be on time 70% of the time and late 30%, but in reality it's a 50/50 split. You're going to be surprised more often than you expect.

Cross-entropy measures the average surprise you'll experience if you use your predicted probabilities but reality follows the actual probabilities.

๐ŸšŒ Bus Prediction: Model vs Reality
Adjust your predicted "on-time" probability with the slider and see how cross-entropy changes
Reality: Bus is on time 50%, late 50%
Slider range: 5% to 95% (shown here at the 70% prediction from above)

True Entropy: 1.00 bits
Cross-Entropy: 1.13 bits
KL Divergence (Extra Surprise): 0.13 bits

KL divergence (Kullback-Leibler divergence) measures the "extra surprise" you experience from using the wrong model. If your model perfectly matched reality, the KL divergence would be zero.
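
The widget's numbers can be reproduced in a few lines of Python, using the 70/30 prediction against the 50/50 reality:

import math

def cross_entropy(reality, model):
    """Average surprise when you believe `model` but outcomes follow `reality`."""
    return sum(p * -math.log2(q) for p, q in zip(reality, model))

reality = [0.5, 0.5]  # bus on time, bus late
model   = [0.7, 0.3]  # your prediction

h  = cross_entropy(reality, reality)  # true entropy: 1.00 bits
ce = cross_entropy(reality, model)    # cross-entropy: ≈ 1.13 bits
print(f"KL divergence: {ce - h:.2f} bits")  # extra surprise: ≈ 0.13 bits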

Chapter 9

Comparing Two Models

Suppose you're predicting tomorrow's temperature: Cold, Mild, or Hot. You have two competing models, and reality follows its own distribution. Which model is better?

โš–๏ธ Model Showdown
Click each model to see its cross-entropy against reality
Reality: Cold 50% ยท Mild 35% ยท Hot 15%

This is exactly how we evaluate machine learning models. When training a spam filter, a recommendation system, or a weather predictor, we calculate cross-entropy between predictions and reality. Lower cross-entropy means better predictions.
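
In Python, the showdown is one function call per model. The reality distribution matches the widget; the two models below are hypothetical stand-ins, just to show that lower cross-entropy wins:

import math

def cross_entropy(reality, model):
    return sum(p * -math.log2(q) for p, q in zip(reality, model))

reality = [0.50, 0.35, 0.15]  # Cold, Mild, Hot
model_1 = [0.45, 0.40, 0.15]  # hypothetical: close to reality
model_2 = [0.20, 0.30, 0.50]  # hypothetical: badly miscalibrated

for name, model in [("Model 1", model_1), ("Model 2", model_2)]:
    print(f"{name}: {cross_entropy(reality, model):.2f} bits")
# Model 1 scores lower, so it is the better predictor.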

Chapter 10

What Two Things Tell Us About Each Other

Mutual information quantifies how much knowing one variable tells you about another. Before you know about a student's study habits, you have some uncertainty about their test score. After learning they studied a lot, you have less uncertainty. That reduction is the mutual information.

๐Ÿ“š Study Habits & Test Scores
Click a condition to see how it reduces uncertainty about test scores

If two variables are completely independent — like your test score and the weather in Tokyo — the mutual information is zero. If they're perfectly related — like temperature in Celsius and Fahrenheit — the mutual information is as large as it can get: it equals the entropy of either variable, because knowing one pins down the other exactly.
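
A minimal sketch of the computation, using the definition I(X;Y) = Σ p(x,y) log₂[ p(x,y) / (p(x)·p(y)) ]. The joint probabilities for study habits and test results below are made up for illustration:

import math

# Hypothetical joint distribution P(habit, result), illustrative only
joint = {
    ("studied", "pass"): 0.40, ("studied", "fail"): 0.10,
    ("skipped", "pass"): 0.15, ("skipped", "fail"): 0.35,
}

# Marginal distributions P(habit) and P(result)
p_habit, p_result = {}, {}
for (habit, result), p in joint.items():
    p_habit[habit] = p_habit.get(habit, 0.0) + p
    p_result[result] = p_result.get(result, 0.0) + p

mi = sum(p * math.log2(p / (p_habit[h] * p_result[r]))
         for (h, r), p in joint.items())
print(f"Mutual information: {mi:.2f} bits")  # > 0: habits reveal something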

Chapter 11

Putting It All Together: Spam Detection

Email spam detection shows how all these concepts work together in a real system. Your email program faces a decision under uncertainty: is this message spam or not?

๐Ÿ“ง Spam Detection Pipeline
Click each step to see how information theory concepts apply
The full picture: Entropy tells us how unpredictable spam/not-spam is. Mutual information tells us which email features are most useful. Cross-entropy tells us how well our model predicts. KL divergence tells us how much worse our model is than perfect.
Chapter 12

Why This Matters

Information theory gives us a precise way to talk about uncertainty and learning. Instead of vague statements like "this seems pretty random," we can measure exactly how random something is and exactly how much we learned.

๐ŸŒ Where Information Theory Shows Up
๐Ÿ“ก
Communication
How much can we compress a file? How much bandwidth do we need?
๐Ÿค–
Machine Learning
Which model is better? Which features matter most?
๐Ÿ”ฌ
Science
Is this pattern real or just noise?
๐Ÿ’ก
Everyday Thinking
Rare events are more informative. Learning depends on prior uncertainty.

Information is fundamentally about reducing uncertainty. When we receive a message, learn a fact, or observe an outcome, we're not just accumulating data โ€” we're reducing our uncertainty about the world. The amount we learn depends on how uncertain we were before, how surprising the observation is, and how well our mental model matches reality.

Your Toolkit

๐Ÿ“ฉ
Information
How much uncertainty a message removes
๐ŸŽฏ
Self-Information
โˆ’logโ‚‚(p) โ€” surprise of a single event
๐ŸŒก๏ธ
Entropy
Average surprise across all outcomes
๐Ÿ”€
Cross-Entropy
Surprise when your model mismatches reality
๐Ÿ“
KL Divergence
Extra surprise from a wrong model
๐Ÿ”—
Mutual Information
How much one variable reveals about another

Cite this post

Rizvi, I. (2026). Information Theory โ€” A Visual Guide. theog.dev.
DOI: 10.5281/zenodo.19413297

@article{rizvi2026informationtheory,
  title   = {Information Theory โ€” A Visual Guide},
  author  = {Rizvi, Imran},
  year    = {2026},
  url     = {https://theog.dev/ml-ds/information-theory-visual.html},
  doi     = {10.5281/zenodo.19413297}
}