Probability Theory Basics
Sample spaces, measurable events, and probability as measure
“At a purely formal level, one could call probability theory the study of measure spaces with total measure one, but that would be like calling number theory the study of strings of digits which terminate.”
—Terence Tao, Australian-American mathematician and Fields medalist
Let’s walk through the formal structure of probability theory that Kolmogorov set down in 1933. This is the machinery that makes everything work, and once you see it, a lot of probability suddenly makes more sense.
Sample Spaces and Events
We start with the sample space 𝛺, which is just the set of all possible outcomes of whatever random experiment we’re thinking about. If you flip a coin, your sample space is 𝛺 = {H, T}. Roll a standard die and you get 𝛺 = {⚀, ⚁, ⚂, ⚃, ⚄, ⚅}. Modeling something continuous like tomorrow’s temperature or next week’s stock price? Then you’re looking at 𝛺 = ℝ or maybe 𝛺 = ℝ+ if you want to restrict to positive values.
Now, an event is any subset of your sample space. The event “rolling an even number” is the subset {⚁, ⚃, ⚅}. The event “getting heads” is just {H}. Simple enough. You might think we can just assign probabilities to any subset we want, and for finite sample spaces that’s basically true. But here’s where things get interesting.
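To make this concrete, here is a minimal sketch in Python (the names are purely illustrative) representing a die’s sample space and a couple of events as plain sets, with set operations building new events:

```python
# A finite sample space and some events, represented as Python sets.
omega = {1, 2, 3, 4, 5, 6}     # outcomes of one die roll
even = {2, 4, 6}               # the event "rolling an even number"
at_least_five = {5, 6}         # the event "rolling at least five"

# Events are subsets of the sample space.
assert even <= omega and at_least_five <= omega

# Set operations build new events: union ("or"), intersection ("and"),
# and complement ("not").
print(even | at_least_five)   # {2, 4, 5, 6}
print(even & at_least_five)   # {6}
print(omega - even)           # {1, 3, 5}
```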
𝜎-Algebras: Deciding What’s Measurable
When your sample space is infinite, you run into trouble if you try to assign probabilities to arbitrary subsets. There are genuine mathematical contradictions lurking there. The Vitali set is a classic example: using the axiom of choice, you can construct a subset of [0,1] that cannot be assigned any measure consistent with countable additivity and translation invariance. Any attempt to assign a well-behaved probability to every subset breaks down.
So we need to be selective about which subsets get to be events that we can assign probabilities to. That’s what a 𝜎-algebra does. A 𝜎-algebra, denoted ℱ (this fancy-looking script F, though sometimes you’ll see capital Sigma 𝛴), is a collection of subsets of 𝛺 that tells us exactly which events are measurable. To be a 𝜎-algebra, ℱ has to satisfy three requirements.
First, the whole sample space 𝛺 has to be in ℱ. That’s just saying we can assign a probability to “something happens,” which better be the case.
Second, if some event A is in ℱ, then its complement Aᶜ has to be in ℱ too. This means if we can measure the probability that it rains, we can also measure the probability that it doesn’t rain. Complements have to work.
Third, if you have a countable sequence of events A₁, A₂, A₃, … that are all in ℱ, then their union

A₁ ∪ A₂ ∪ A₃ ∪ ⋯

has to be in ℱ as well. The union A ∪ B of two events means “A or B happens,” which is everything in A, everything in B, or both. So if you can measure each event individually, you can measure “at least one of these happens.” Notice this is countable unions, not arbitrary unions. That restriction is what saves us from the paradoxes while still being general enough for everything we actually need.
For finite sample spaces, we typically just use the power set, meaning ℱ = 2^𝛺, which includes every possible subset. For the real line, we use the Borel 𝜎-algebra ℬ, which is generated by taking all the open intervals and then closing up under complements and countable unions. This gives you all the intervals, all the unions of intervals, and a whole lot more, but it excludes the weird pathological sets.
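For a finite 𝛺 you can check the three requirements directly. Here is a rough sketch (the function name is mine, not standard terminology) that tests whether a collection of subsets is a 𝜎-algebra; on a finite space, closure under countable unions reduces to closure under pairwise unions:

```python
from itertools import chain, combinations

def is_sigma_algebra(omega, F):
    """Check the sigma-algebra requirements for a collection F of subsets of a finite omega."""
    F = {frozenset(a) for a in F}
    omega = frozenset(omega)
    if omega not in F:                       # must contain the whole space
        return False
    if any(omega - a not in F for a in F):   # closed under complements
        return False
    # On a finite space, closure under countable unions follows from pairwise unions.
    return all(a | b in F for a in F for b in F)

omega = {1, 2, 3, 4, 5, 6}

# The power set of omega is always a sigma-algebra.
power_set = {frozenset(s) for s in chain.from_iterable(
    combinations(omega, r) for r in range(len(omega) + 1))}
print(is_sigma_algebra(omega, power_set))   # True

# The coarse sigma-algebra generated by "even vs odd" is also valid.
coarse = {frozenset(), frozenset({2, 4, 6}), frozenset({1, 3, 5}), frozenset(omega)}
print(is_sigma_algebra(omega, coarse))      # True

# Dropping a complement breaks the requirements.
print(is_sigma_algebra(omega, {frozenset(), frozenset({2, 4, 6}), frozenset(omega)}))  # False
```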
Measurable Spaces
A measurable space is the pair (𝛺, ℱ). That’s it. You’ve got a sample space and you’ve specified which subsets are measurable. This is the stage on which probability operates. You need this structure before you can start assigning probabilities to things.
Probability Measures
Once you have your measurable space (𝛺, ℱ), you can define a probability measure P. This is a function that takes any measurable event A ∈ ℱ and gives you a number P(A) between 0 and 1. That number is the probability of A.
Kolmogorov said a probability measure has to satisfy three axioms, and these are beautifully minimal.
First axiom
P(A) ≥ 0 for every event A ∈ ℱ. Probabilities can’t be negative.
Second axiom
P(𝛺) = 1. The probability that something happens is 1.
Third axiom
…and this is the key one: if you have a countable collection of pairwise disjoint events A₁, A₂, A₃, … then

P(A₁ ∪ A₂ ∪ A₃ ∪ ⋯) = P(A₁) + P(A₂) + P(A₃) + ⋯

This is countable additivity. If events don’t overlap, their probabilities add up.
That’s it. Those three axioms generate everything else in probability theory. Conditional probability, independence, Bayes’ theorem: they all follow from these.
The complete structure (𝛺, ℱ, P) is called a probability space. Sample space, 𝜎-algebra, probability measure: that’s your foundation.
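On a finite sample space with the power set as ℱ, a probability measure is determined by the weight it puts on each outcome, and the axioms are easy to verify directly. A minimal sketch (assuming a fair die; the names are illustrative):

```python
# A probability measure on a finite sample space, given by weights on outcomes.
weights = {w: 1 / 6 for w in range(1, 7)}   # fair die: each outcome gets mass 1/6

def P(event):
    """Probability of an event (any subset of the sample space)."""
    return sum(weights[w] for w in event)

# Axiom 1: non-negativity holds because every weight is >= 0.
assert all(p >= 0 for p in weights.values())

# Axiom 2: the whole sample space has probability 1.
assert abs(P(set(weights)) - 1) < 1e-12

# Axiom 3 (additivity): disjoint events add.
low, high = {1, 2, 3}, {5, 6}
assert abs(P(low | high) - (P(low) + P(high))) < 1e-12

print(P({2, 4, 6}))   # 0.5
```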
Why We Call It a Measure
You might wonder why we call P a “measure.” It’s because probability theory is really just a special case of measure theory, which is the branch of math that studies how to assign sizes to sets. When you measure the length of an interval [a, b], you compute b - a. When you measure the area of a region in the plane, you integrate. A probability measure is doing the same thing — it’s assigning a “size” to events in your sample space. The only difference is that it’s a measure with total mass 1, since P(𝛺) = 1.
This perspective explains a lot about why probabilities behave the way they do. Measures are additive over disjoint sets, which is why

P(A ∪ B) = P(A) + P(B)

when A and B don’t overlap. Measures can’t double-count. When events do overlap, you have to correct for their intersection:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
The intersection A ∩ B means “both A and B happen,” which is everything that’s in both sets. That’s measure theory at work.
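A quick worked check with the fair die: let A = {2, 4, 6} be “even” and B = {4, 5, 6} be “at least four.” Then P(A) = P(B) = 1/2 and P(A ∩ B) = P({4, 6}) = 1/3, so P(A ∪ B) = 1/2 + 1/2 − 1/3 = 2/3, which matches counting the four outcomes in {2, 4, 5, 6} directly.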
And once you recognize that expectation is integration with respect to a probability measure, all the tools of integration theory become available. The dominated convergence theorem, Fubini’s theorem, the change of variables formula: these all apply to probability because probability is measure theory.
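To see “expectation is integration” in the simplest setting, here is a rough sketch (standard-library Python; the uniform distribution is just an illustrative choice) comparing the exact integral 𝔼[X²] for X uniform on [0, 1] with a Monte Carlo estimate, which is nothing more than integrating x² against the empirical measure of a sample:

```python
import random

# E[X^2] for X uniform on [0, 1] is the integral of x^2 dx over [0, 1] = 1/3.
exact = 1 / 3

# Monte Carlo: sample from the probability measure and average,
# i.e. integrate x^2 against the empirical measure of the sample.
random.seed(0)
n = 200_000
estimate = sum(random.random() ** 2 for _ in range(n)) / n

print(exact, round(estimate, 4))   # 0.3333... and an estimate close to 0.33
```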
What Kolmogorov Did
Before 1933, probability theory didn’t really have rigorous foundations. People argued about what probability meant. Was it long-run frequency? Degree of belief? Logical relationship between propositions? There were all these competing interpretations and no agreement on the mathematical structure underneath.
Kolmogorov cut through all of that by showing that whatever you think probability means philosophically, the math is the same. The formal theory follows from the structure (𝛺, ℱ, P). You can be a frequentist or a Bayesian or whatever you want: the math doesn’t care. He separated the formal mathematical theory from questions about interpretation.
This was huge for the development of statistics and applied probability. It made possible rigorous theorems about continuous random variables, justified limiting results like the law of large numbers and central limit theorem, and gave us the foundation for stochastic process theory. It also made clear which probability questions actually make sense mathematically and which don’t.
Why This Matters for Applied Work
So why should you care about 𝜎-algebras if you’re doing applied statistics or economics? Because information structures are 𝜎-algebras. When you’re modeling decision-making under uncertainty or information revelation over time (which is basically all of financial economics), the 𝜎-algebra ℱₜ represents the information available at time t. A smaller 𝜎-algebra means coarser information. A larger 𝜎-algebra means finer information. The mathematical framework gives you precise language for talking about who knows what when.
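Here is a small sketch of the “coarser versus finer information” idea on two coin tosses (the partition-based representation is my simplification; on a finite space, a 𝜎-algebra corresponds to a partition of outcomes you can distinguish):

```python
from itertools import groupby

# Sample space: outcomes of two coin tosses.
omega = ["HH", "HT", "TH", "TT"]

def partition(key):
    """Group outcomes that are indistinguishable given the information in `key`."""
    return [set(g) for _, g in groupby(sorted(omega, key=key), key=key)]

# At time 1 we have seen only the first toss: a coarse partition (2 blocks).
F1 = partition(lambda w: w[0])
# At time 2 we have seen both tosses: the finest partition (4 blocks).
F2 = partition(lambda w: w)

print(F1)   # [{'HH', 'HT'}, {'TH', 'TT'}]
print(F2)   # [{'HH'}, {'HT'}, {'TH'}, {'TT'}]

# Every block of F2 sits inside a block of F1: information only gets finer over time.
assert all(any(b2 <= b1 for b1 in F1) for b2 in F2)
```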
Measurability also clarifies what conditioning means. When you write 𝔼[Y | X], you’re computing expectation with respect to the 𝜎-algebra generated by X. That’s the coarsest 𝜎-algebra that makes X measurable. This makes conditional expectation less mysterious: it’s a projection onto a smaller information space.
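On a finite space, that projection is literally block-averaging: within each set where X is constant, replace Y by its probability-weighted average. A minimal sketch of 𝔼[Y | X] computed this way (the setup, two dice with Y the total and X the first die, is just an illustrative choice):

```python
from itertools import product
from collections import defaultdict

# Two fair dice; X = first die, Y = total. All 36 outcomes are equally likely.
omega = list(product(range(1, 7), repeat=2))
p = {w: 1 / 36 for w in omega}
X = lambda w: w[0]
Y = lambda w: w[0] + w[1]

# E[Y | X] averages Y over each block {X = x}, weighted by the measure.
blocks = defaultdict(list)
for w in omega:
    blocks[X(w)].append(w)

cond_exp = {x: sum(Y(w) * p[w] for w in ws) / sum(p[w] for w in ws)
            for x, ws in blocks.items()}

print(cond_exp)   # {1: 4.5, 2: 5.5, ..., 6: 9.5}, i.e. E[Y | X = x] = x + 3.5
```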
The measure-theoretic foundation also justifies operations you do routinely. Why can you integrate a probability density function to get probabilities? Because probability is a measure and integration is how you compute measures of sets. Why does the law of total probability work? Because measures are additive over partitions. Why does the change-of-variables formula work when you transform random variables? Because it’s just the change-of-variables theorem from measure theory.
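As a concrete instance of “integrate the density to get a probability,” here is a rough sketch (standard-library only, using a simple midpoint rule; the interval is just the familiar one) that integrates the standard normal density over [−1.96, 1.96] and recovers approximately 0.95:

```python
import math

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def prob(a, b, n=10_000):
    """P(a <= X <= b) for X ~ N(0, 1), by midpoint-rule integration of the density."""
    h = (b - a) / n
    return sum(normal_pdf(a + (i + 0.5) * h) for i in range(n)) * h

print(round(prob(-1.96, 1.96), 4))   # about 0.95
```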
Most of the time this formal machinery stays in the background. But understanding it clarifies why probability methods work the way they do and helps you avoid conceptual errors when things get more advanced.