๐Ÿ“ก
๐ŸŽฒ
๐Ÿง 

Understanding Information

What we really learn when we learn something โ€” and how to measure surprise, uncertainty, and prediction

โ†“
Chapter 1

When Does a Message Actually Tell You Something?

You wake up and check your phone. Your friend texts: "It's sunny today." How much have you actually learned? If you live in Seattle in November, this message is genuinely surprising โ€” you've learned quite a bit. If you live in Phoenix in July, you've learned almost nothing. You already knew it would be sunny.

Both messages contain the same words, the same "facts." But they don't give you the same amount of information. The technical definition connects information with surprise โ€” it's about how much uncertainty the message removes.

โ˜€๏ธ Same Message, Different Information
Click a city to see how much information "It's sunny today" carries

Information measures how much uncertainty gets removed, not just what facts get communicated.

Chapter 2

The Uncertainty You Started With

The amount you learn depends on how uncertain you were to begin with. If you live somewhere with completely unpredictable weather โ€” it rains about half the days and is sunny the other half โ€” then checking the forecast genuinely tells you something.

But if it's sunny 99 days out of 100, checking the forecast barely helps. You already had a pretty good idea what it would say.

โ˜• The Coffee Shop Scenario
Click each scenario to compare how much you learn from "Let's meet at Java Junction"

Same message โ€” "Let's meet at Java Junction" โ€” but completely different amounts of information. In Scenario 1, you learned nothing. In Scenario 2, you went from complete uncertainty to certainty.

Chapter 3

Why Rare Events Feel More Informative

If your normally punctual coworker shows up late, that tells you something โ€” maybe they hit traffic, maybe something's wrong. You pay attention. But if your chronically late coworker shows up late again, you barely notice. The unusual event is surprising because it contradicts your expectations.

๐Ÿšจ Fire Drill Surprise
Click the building to arrive at work and see what happens
Fire drill probability: 1 in 100 days

Rare events carry more information than common events. When something with a 1% chance actually happens, it's way more surprising โ€” more informative โ€” than when something with a 99% chance happens.

Chapter 4

Measuring Surprise with Numbers

Your friend is going to text you a single letter, and you need to figure out what it is by asking yes/no questions. The number of questions you need tells you the information content.

๐Ÿ”ค The Letter Guessing Game
Choose a set size, then click "Ask a Question" to narrow down the answer
With 2 equally likely options you need 1 question. With 4 you need 2 questions. With 8 you need 3 questions. The pattern? It's the logarithm base 2 of the number of options.
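
The pattern is easy to verify in a few lines of Python. The set sizes here are illustrative; 26, the full alphabet, shows the formula also handles sizes that aren't powers of two:

import math

# Questions needed to pin down one of n equally likely options
for n in [2, 4, 8, 26]:
    print(f"{n} options -> {math.log2(n):.2f} yes/no questions")
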
Chapter 5

When Options Aren't Equal

Real life isn't usually about equally likely options. When you wake up and see it's sunny (50% likely), you've learned some information. When you see it's snowy (5% likely), you've learned a lot. We want our measure to capture this difference.

Self-Information Formula
Self-information = โˆ’logโ‚‚(probability)
The rarer the event (smaller p), the higher the self-information.
๐Ÿ“Š Weather Self-Information
Compare how much information each weather outcome gives you
Sunny 50% → 1.0 bit
Cloudy 25% → 2.0 bits
Rainy 20% → 2.3 bits
Snowy 5% → 4.3 bits

The unit bit comes from "binary digit" โ€” it relates back to those yes/no questions. One bit is the information you get from one yes/no answer. The logarithm ensures that if two independent things happen, the total information is just the sum.
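
As a short Python sketch, here is the self-information of each weather outcome from the table above, plus a quick check of that additivity property (the two independent probabilities at the end are chosen purely for illustration):

import math

def self_information(p):
    """Surprise of an event with probability p, in bits."""
    return -math.log2(p)

weather = {"Sunny": 0.50, "Cloudy": 0.25, "Rainy": 0.20, "Snowy": 0.05}
for outcome, p in weather.items():
    print(f"{outcome}: {self_information(p):.1f} bits")

# Additivity: two independent events, p = 0.5 and p = 0.25.
# Their joint probability is 0.125, and the bits simply add: 1 + 2 = 3.
assert self_information(0.5 * 0.25) == self_information(0.5) + self_information(0.25)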

Chapter 6

The Average Surprise: Entropy

Entropy is the average self-information across all possible outcomes, weighted by how often each outcome occurs. It tells you how unpredictable a system is overall โ€” not just how surprising one event is, but how much uncertainty you face on average.

๐ŸŒก๏ธ Entropy Comparison: Two Cities
Click each city to see its entropy calculation

City A โ€” Varied Weather

Sunny 50% ยท Cloudy 25% ยท Rainy 20% ยท Snowy 5%

City B โ€” Predictable Weather

Sunny 90% ยท Rainy 10%
Entropy Formula
H = ฮฃ p(x) ร— (โˆ’logโ‚‚(p(x)))
City A: (0.5ร—1) + (0.25ร—2) + (0.20ร—2.3) + (0.05ร—4.3) โ‰ˆ 1.68 bits
City B: (0.9ร—0.15) + (0.1ร—3.32) โ‰ˆ 0.47 bits

High entropy = high unpredictability = lots of information when you observe the outcome. Low entropy = high predictability = little information when you observe the outcome.

Three ways to think about entropy: (1) The Guessing Game โ€” how hard it is to guess what will happen. (2) The Surprise Meter โ€” your average surprise level. (3) The Information Content โ€” how much each observation teaches you.
Chapter 7

Real-World Application: Text Messages

When you type a message, your phone tries to predict the next letter. Some predictions are easy: after "QU," the next letter is almost certainly a vowel. That's low entropy. But after "TH," the next letter could be E, A, I, O, R, or several others. That's higher entropy.

๐Ÿ“ฑ Predictive Text Entropy
Click a letter pair to see how predictable the next letter is

Your phone's predictive text works better when entropy is low. It's information theory in action โ€” the phone minimizes the information you need to transmit by exploiting patterns in language.
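
To make the contrast concrete, here is a sketch comparing the entropy of two next-letter distributions. The probabilities are made up for illustration; real values would come from counting letter pairs in an English corpus:

import math

def entropy(probs):
    return sum(p * -math.log2(p) for p in probs.values() if p > 0)

# Hypothetical next-letter distributions (illustrative numbers only)
after_qu = {"E": 0.55, "A": 0.25, "I": 0.15, "O": 0.05}
after_th = {"E": 0.35, "A": 0.20, "I": 0.15, "O": 0.10, "R": 0.10, "other": 0.10}

print(f'After "QU": {entropy(after_qu):.2f} bits')  # lower: easier to predict
print(f'After "TH": {entropy(after_th):.2f} bits')  # higher: harder to predict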

Chapter 8

When Your Model Doesn't Match Reality

What if you have a guess about how something works, but your guess is wrong? Imagine you predict your bus will be on time 70% of the time and late 30%, but in reality it's a 50/50 split. You're going to be surprised more often than you expect.

Cross-entropy measures the average surprise you'll experience if you use your predicted probabilities but reality follows the actual probabilities.

๐ŸšŒ Bus Prediction: Model vs Reality
Adjust your predicted "on-time" probability with the slider and see how cross-entropy changes
Reality: Bus is on time 50%, late 50%
Slider range: 5% to 95% (shown here at the 70% prediction from above)

True Entropy: 1.00 bits
Cross-Entropy: 1.13 bits
KL Divergence (Extra Surprise): 0.13 bits

KL divergence (Kullback-Leibler divergence) measures the "extra surprise" you experience from using the wrong model. If your model perfectly matched reality, the KL divergence would be zero.
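
The widget's numbers can be reproduced in a few lines of Python, using the 70/30 prediction against the 50/50 reality:

import math

def cross_entropy(reality, model):
    """Average surprise when you believe `model` but outcomes follow `reality`."""
    return sum(p * -math.log2(q) for p, q in zip(reality, model))

reality = [0.5, 0.5]  # bus on time, bus late
model   = [0.7, 0.3]  # your prediction

h  = cross_entropy(reality, reality)  # true entropy: 1.00 bits
ce = cross_entropy(reality, model)    # cross-entropy: ≈ 1.13 bits
print(f"KL divergence: {ce - h:.2f} bits")  # extra surprise: ≈ 0.13 bits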

Chapter 9

Comparing Two Models

Suppose you're predicting tomorrow's temperature: Cold, Mild, or Hot. You have two competing models, and reality follows its own distribution. Which model is better?

โš–๏ธ Model Showdown
Click each model to see its cross-entropy against reality
Reality: Cold 50% ยท Mild 35% ยท Hot 15%

This is exactly how we evaluate machine learning models. When training a spam filter, a recommendation system, or a weather predictor, we calculate cross-entropy between predictions and reality. Lower cross-entropy means better predictions.
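
In Python, the showdown is one function call per model. The reality distribution matches the widget; the two models below are hypothetical stand-ins, just to show that lower cross-entropy wins:

import math

def cross_entropy(reality, model):
    return sum(p * -math.log2(q) for p, q in zip(reality, model))

reality = [0.50, 0.35, 0.15]  # Cold, Mild, Hot
model_1 = [0.45, 0.40, 0.15]  # hypothetical: close to reality
model_2 = [0.20, 0.30, 0.50]  # hypothetical: badly miscalibrated

for name, model in [("Model 1", model_1), ("Model 2", model_2)]:
    print(f"{name}: {cross_entropy(reality, model):.2f} bits")
# Model 1 scores lower, so it is the better predictor.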

Chapter 10

What Two Things Tell Us About Each Other

Mutual information quantifies how much knowing one variable tells you about another. Before you know about a student's study habits, you have some uncertainty about their test score. After learning they studied a lot, you have less uncertainty. That reduction is the mutual information.

๐Ÿ“š Study Habits & Test Scores
Click a condition to see how it reduces uncertainty about test scores

If two variables are completely independent — like your test score and the weather in Tokyo — the mutual information is zero. If they're perfectly related — like temperature in Celsius and Fahrenheit — the mutual information is as large as it can get: it equals the entropy of either variable, because knowing one pins down the other exactly.
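
A minimal sketch of the computation, using the definition I(X;Y) = Σ p(x,y) log₂[ p(x,y) / (p(x)·p(y)) ]. The joint probabilities for study habits and test results below are made up for illustration:

import math

# Hypothetical joint distribution P(habit, result), illustrative only
joint = {
    ("studied", "pass"): 0.40, ("studied", "fail"): 0.10,
    ("skipped", "pass"): 0.15, ("skipped", "fail"): 0.35,
}

# Marginal distributions P(habit) and P(result)
p_habit, p_result = {}, {}
for (habit, result), p in joint.items():
    p_habit[habit] = p_habit.get(habit, 0.0) + p
    p_result[result] = p_result.get(result, 0.0) + p

mi = sum(p * math.log2(p / (p_habit[h] * p_result[r]))
         for (h, r), p in joint.items())
print(f"Mutual information: {mi:.2f} bits")  # > 0: habits reveal something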

Chapter 11

Putting It All Together: Spam Detection

Email spam detection shows how all these concepts work together in a real system. Your email program faces a decision under uncertainty: is this message spam or not?

๐Ÿ“ง Spam Detection Pipeline
Click each step to see how information theory concepts apply
The full picture: Entropy tells us how unpredictable spam/not-spam is. Mutual information tells us which email features are most useful. Cross-entropy tells us how well our model predicts. KL divergence tells us how much worse our model is than perfect.
Chapter 12

Why This Matters

Information theory gives us a precise way to talk about uncertainty and learning. Instead of vague statements like "this seems pretty random," we can measure exactly how random something is and exactly how much we learned.

๐ŸŒ Where Information Theory Shows Up
๐Ÿ“ก
Communication
How much can we compress a file? How much bandwidth do we need?
๐Ÿค–
Machine Learning
Which model is better? Which features matter most?
๐Ÿ”ฌ
Science
Is this pattern real or just noise?
๐Ÿ’ก
Everyday Thinking
Rare events are more informative. Learning depends on prior uncertainty.

Information is fundamentally about reducing uncertainty. When we receive a message, learn a fact, or observe an outcome, we're not just accumulating data โ€” we're reducing our uncertainty about the world. The amount we learn depends on how uncertain we were before, how surprising the observation is, and how well our mental model matches reality.

Your Toolkit

๐Ÿ“ฉ
Information
How much uncertainty a message removes
๐ŸŽฏ
Self-Information
โˆ’logโ‚‚(p) โ€” surprise of a single event
๐ŸŒก๏ธ
Entropy
Average surprise across all outcomes
๐Ÿ”€
Cross-Entropy
Surprise when your model mismatches reality
๐Ÿ“
KL Divergence
Extra surprise from a wrong model
๐Ÿ”—
Mutual Information
How much one variable reveals about another

Cite this post

Rizvi, I. (2026). Information Theory โ€” A Visual Guide. theog.dev.
DOI: 10.5281/zenodo.19413297

@article{rizvi2026informationtheory,
  title   = {Information Theory โ€” A Visual Guide},
  author  = {Rizvi, Imran},
  year    = {2026},
  url     = {https://theog.dev/ml-ds/information-theory-visual.html},
  doi     = {10.5281/zenodo.19413297}
}