# Terminology Recap: Random Variable / Distribution / PMF / PDF / Independence / Marginal Distribution / Joint Distribution / Conditional Random Variable

Yao Yao on February 26, 2019


## Prerequisite #1 : $\sigma$-algebra

Definition: In mathematical analysis and in probability theory, a $\sigma$-algebra on a set $S$ is a collection $\Sigma \subseteq 2^S$ of subsets of $S$ that includes $S$ itself and is closed under complement and countable unions.

• Since $S \in \Sigma$ and $\Sigma$ is closed under complement, we have $\varnothing \in \Sigma$
• $\sigma$-algebra, $\sigma$-ring, and $\sigma$-field are related notions, but we will not go into that here
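
On a finite set the axioms can be checked by brute force, which makes the definition concrete. The following is a minimal sketch, not part of the original notes; the helper names `power_set` and `is_sigma_algebra` and the example families are made up for illustration. On a finite set, countable unions reduce to finite unions, so checking pairwise unions is enough.

```python
from itertools import chain, combinations


def power_set(s):
    """Return all subsets of s as a set of frozensets."""
    items = list(s)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return {frozenset(c) for c in subsets}


def is_sigma_algebra(S, Sigma):
    """Check the sigma-algebra axioms for a family Sigma of subsets of a finite set S."""
    S = frozenset(S)
    Sigma = {frozenset(E) for E in Sigma}
    # Sigma must contain S itself
    if S not in Sigma:
        return False
    # closed under complement
    if any(S - E not in Sigma for E in Sigma):
        return False
    # closed under unions (pairwise closure suffices for a finite family)
    return all(A | B in Sigma for A in Sigma for B in Sigma)


S = {1, 2, 3, 4}
trivial = [set(), S]                          # the smallest sigma-algebra on S
generated_by_12 = [set(), {1, 2}, {3, 4}, S]  # generated by the set {1, 2}
not_closed = [set(), {1}, S]                  # the complement {2, 3, 4} is missing

print(is_sigma_algebra(S, trivial))
print(is_sigma_algebra(S, generated_by_12))
print(is_sigma_algebra(S, not_closed))
print(is_sigma_algebra(S, power_set(S)))
```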

## Prerequisite #2 : Borel Set / Borel $\sigma$-algebra

In mathematics, a Borel set is any set in a topological space that can be formed from open sets (or, equivalently, from closed sets) through the operations of countable union, countable intersection, and relative complement.

• The relative complement of $A$ in $B$ is $B - A$, i.e. the elements of $B$ that are not in $A$
• The relative complement of $B$ in $A$ is $A - B$

For a topological space $X$, the collection of all Borel sets on $X$ forms a $\sigma$-algebra $\mathcal{B}$, known as the Borel algebra or Borel $\sigma$-algebra. The Borel $\sigma$-algebra on $X$ is the smallest $\sigma$-algebra containing all open sets (or, equivalently, all closed sets).

• A set $S$ is said to be countable if it’s finite or $\mathbf{card}(S) = \mathbf{card}(\mathbb{N})$
• $\mathbf{card}(\mathbb{R}) > \mathbf{card}(\mathbb{N})$ (Cantor Diagonal Argument)
• If $\mathcal{B}$ is a Borel algebra in $\mathbb{R}$, then $\mathbf{card}(\mathcal{B}) = \mathbf{card}(\mathbb{R})$
• Conclusion: $\mathcal{B}$ is uncountable

## Prerequisite #3 : Measurable Function / Measurable Space

Definition: A measurable space is a tuple of $(S, \Sigma)$ where $S$ is a set and $\Sigma$ is a $\sigma$-algebra over $S$.

• A measurable space is also called a Borel space

Definition: Let $(X, \Sigma_X)$ and $(Y, \Sigma_Y)$ be measurable spaces. Function $f:X \to Y$ is called a measurable function if $\forall E_Y \in \Sigma_Y, f^{-1}(E_Y) \in \Sigma_X$

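• $f^{-1}$ is the notation for the inverse function, but here it is applied to sets and $f$ need not be invertible
• We extend the definition of $f^{-1}$ to sets (the preimage): $f^{-1}(E_Y) := \lbrace x \in X \vert f(x) \in E_Y \rbrace$
• In other words: $\forall E_Y \in \Sigma_Y, \exists E_X \in \Sigma_X$ that consists of exactly the points mapped into $E_Y$, so $f(E_X) \subseteq E_Y$ (with equality when $f$ is surjective)
• This $E_X$ is $f^{-1}(E_Y)$
• To emphasize that $f$ is a measurable function, we can also write it as $f: (X, \Sigma_X) \to (Y, \Sigma_Y)$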
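
To see the preimage condition at work, here is a small sketch on finite spaces. The sets, the two $\sigma$-algebras, and the functions `f` and `g` are hypothetical examples chosen for illustration, not taken from the notes. A function is measurable here exactly when every preimage of a set in $\Sigma_Y$ lands in $\Sigma_X$.

```python
def preimage(f, X, E_Y):
    """f^{-1}(E_Y) = {x in X | f(x) in E_Y}."""
    return frozenset(x for x in X if f(x) in E_Y)


def is_measurable(f, X, Sigma_X, Sigma_Y):
    """Check that the preimage of every set in Sigma_Y belongs to Sigma_X."""
    Sigma_X = {frozenset(E) for E in Sigma_X}
    return all(preimage(f, X, E_Y) in Sigma_X for E_Y in Sigma_Y)


X = {1, 2, 3, 4}
Sigma_X = [set(), {1, 2}, {3, 4}, X]  # a coarse sigma-algebra on X
Y = {0, 1}
Sigma_Y = [set(), {0}, {1}, Y]        # the power set of Y


def f(x):
    # constant on the blocks {1, 2} and {3, 4}
    return 0 if x <= 2 else 1


def g(x):
    # parity splits the blocks, so g^{-1}({0}) = {2, 4} is not in Sigma_X
    return x % 2


print(is_measurable(f, X, Sigma_X, Sigma_Y))
print(is_measurable(g, X, Sigma_X, Sigma_Y))
```

The same `g` becomes measurable once $\Sigma_X$ is replaced by the power set of $X$, which shows that measurability is a property of the function together with the two $\sigma$-algebras.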

## Prerequisite #4 : Measure / Measure Space

Definition: Let $(S, \Sigma)$ be a measurable space. Function $\mu: \Sigma \to \mathbb{R} \cup \lbrace -\infty, \infty \rbrace$ is called a measure if it satisfies the following properties:

1. Non-negativity: $\forall E \in \Sigma, \mu(E) \geq 0$
• Note: there are generalized measures that do not satisfy this condition, e.g. signed measures
2. Null empty set: $\mu(\varnothing) = 0$
3. Countable additivity (or $\sigma$-additivity): for every countable collection $\lbrace E_i \rbrace^{\infty}_{i=1}$ where $E_i \in \Sigma, \forall i$ and $E_i \cap E_j = \varnothing, \forall i \neq j$: $\mu \big( \bigcup_{i=1}^{\infty} E_i \big) = \sum_{i=1}^{\infty} \mu(E_i)$

Definition: A measure space is such a triple of $(S, \Sigma, \mu)$

## Prerequisite #5 : Probability Measure / Probability Space

Definition: A measure $\mu$ is a probability measure if $\mu(S) = 1$.

• $S$ denotes the whole set

Definition: A probability space is a measure space with a probability measure, denoted by $(\Omega, \mathcal{F}, \mathbb{P})$ where:

• $\omega \in \Omega$ is called an outcome
• $E \in \mathcal{F}$ is called an event
• $\mathbb{P}: \mathcal{F} \to [0,1]$ is a probability measure
• $\mathbb{P}(E)$ is the probability of $E$

## Prerequisite #3/#4/#5 Summary

• A measurable function $f$ is defined on a measurable space $(S, \Sigma)$
• A measurable function $f$ has the potential to give rise to a measure $\mu$
• measure $\mu$ + measurable space $(S, \Sigma)$ = measure space $(S, \Sigma, \mu)$
• A probability measure $\mathbb{P}$ is a special kind of measure
• A measure space equipped with a probability measure is a probability space $(\Omega, \mathcal{F}, \mathbb{P})$

• $f: S \to \mathbb{R}$
• $\mu: \Sigma \to \mathbb{R}$
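• For example, we can define $\mu(\lbrace x \rbrace) = f(x)$, and then by $\sigma$-additivity, for any countable $E \in \Sigma$: $\mu(E) = \sum_{x \in E} f(x)$
• Note what I mean here: we can do this, but nothing says we must. $\mu$ does not have to be defined through $f$, and $f$ does not necessarily meet the requirements for evolving into $\mu$ (for instance, $f$ must be non-negative)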
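
The construction of $\mu$ from $f$ via $\mu(\lbrace x \rbrace) = f(x)$ can be sketched in a few lines. This is only an illustration under assumptions: the weight function `f` below is a hypothetical non-negative function on the integers, events are finite sets, and exact fractions are used so that additivity can be checked without rounding error.

```python
from fractions import Fraction


def measure_from_weights(f):
    """Return mu with mu(E) = sum of f(x) over x in E, for a finite set E."""
    def mu(E):
        return sum((f(x) for x in E), Fraction(0))
    return mu


def f(x):
    # hypothetical non-negative point weights
    return Fraction(x, 2)


mu = measure_from_weights(f)

A, B = {1, 2}, {5}          # two disjoint sets
print(mu(set()))            # null empty set
print(mu(A), mu(B))
print(mu(A | B) == mu(A) + mu(B))  # additivity on disjoint sets
```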

## 1. Random Variable

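Definition: A random variable $X$ is a measurable function $X: (\Omega, \mathcal{F}) \to (\mathbb{R}, \mathcal{B})$, i.e. a function such that $\forall$ Borel set $B\in \mathcal{B}$,

$$ X^{-1}(B) = \lbrace \omega \in \Omega \vert X(\omega) \in B \rbrace \in \mathcal{F} $$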

• Strictly speaking, the codomain should be $\mathbb{R} \cup \lbrace -\infty, \infty \rbrace$ rather than just $\mathbb{R}$
• $\mathcal{B}$ is the Borel $\sigma$-algebra on $\mathbb{R}$

• We say that $X$ is $\mathcal{F}$-measurable. We define $\mathcal{F}(X)$ to be the smallest $\sigma$-algebra on $\Omega$ for which $X$ is measurable.
• Compare $X$ with $\mathbb{P}$:
• First, note the domains:
• $X: \Omega \to \mathbb{R}$ (a random variable takes an outcome)
• $\mathbb{P}: \mathcal{F} \to [0, 1]$ (a probability measure takes an event)
• $X$ is a measurable function and $\mathbb{P}$ is a probability measure. Just as $f$ $\overset{\text{evolves}}{\Rightarrow}$ $\mu$ above, we could define an $X$ such that $X$ $\overset{\text{evolves}}{\Rightarrow}$ $\mathbb{P}$, but there is no need to. The distribution section below elaborates.

Example: rolling a die once.

• $\Omega = \lbrace 1,2,3,4,5,6 \rbrace$
• $\mathcal{F}$ includes, but is not limited to, $\Omega$, $\lbrace 1 \rbrace$, $\lbrace 2 \rbrace$, $\lbrace 3 \rbrace$, $\lbrace 4 \rbrace$, $\lbrace 5 \rbrace$, $\lbrace 6 \rbrace$
• Suppose $\mathbb{P}(\lbrace 1 \rbrace) = \mathbb{P}(\lbrace 2 \rbrace) = \mathbb{P}(\lbrace 3 \rbrace) = \mathbb{P}(\lbrace 4 \rbrace) = \mathbb{P}(\lbrace 5 \rbrace) = \mathbb{P}(\lbrace 6 \rbrace) = \frac{1}{6}$
• Note that the event $\lbrace 1,3 \rbrace$ means “rolling a 1 or a 3”, not “rolling twice and getting a 1 and a 3”
• The event “rolling twice and getting a 1 and a 3” would be $\big \lbrace \lbrace 1,3 \rbrace \big \rbrace$, which lives in a different sample space whose outcomes are pairs of rolls
• So $\mathbb{P}(\lbrace 1,3 \rbrace) = \mathbb{P}(\lbrace 1 \rbrace) + \mathbb{P}(\lbrace 3 \rbrace) = \frac{1}{3}$, and likewise $\mathbb{P}(\Omega) = 1$
• “Rolling a 1 and a 3 at once” is the impossible event, i.e. $\varnothing$, and by the definition of a measure $\mathbb{P}(\varnothing) = 0$
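
The die example can be reproduced with exact arithmetic. This is a minimal sketch that assumes $\mathcal{F}$ is the full power set of $\Omega$ and that the die is fair; the function name `P` simply mirrors the notation above.

```python
from fractions import Fraction

Omega = frozenset({1, 2, 3, 4, 5, 6})


def P(E):
    """Uniform probability measure on the power set of Omega."""
    E = frozenset(E)
    if not E <= Omega:
        raise ValueError("an event must be a subset of Omega")
    return Fraction(len(E), len(Omega))


p_1_or_3 = P({1, 3})         # the event "rolling a 1 or a 3"
p_1_and_3 = P({1} & {3})     # the intersection is the empty set

print(p_1_or_3, P({1}) + P({3}))
print(p_1_and_3)
print(P(Omega))
```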

## 2. Distribution of a Random Variable

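Definition: The push-forward measure of $\mathbb{P}$ by $X$ is a function $\mathbb{P}_{X}: \mathcal{B} \to \mathbb{R}$ such that $\forall B \in \mathcal{B}$,

$$ \mathbb{P}_{X}(B) = \mathbb{P}(X^{-1}(B)) = \mathbb{P}(\lbrace \omega \in \Omega \vert X(\omega) \in B \rbrace) $$

$\mathbb{P}_{X}$ is called the distribution of $X$.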

• Note that by the definition of a random variable, $\forall B \in \mathcal{B}, X^{-1}(B) \in \mathcal{F}$, so $X^{-1}(B)$ lies in the domain of $\mathbb{P}$
• $\mathbb{P}_{X}$ is always a probability measure, so $(\mathbb{R}, \mathcal{B}, \mathbb{P}_{X})$ forms a probability space
• If $X = I$, i.e. $X(\omega) = \omega$, then $\mathbb{P}_{X} = \mathbb{P}$

• $\mathbb{P}: \mathcal{F} \to [0, 1]$
• $X: (\Omega, \mathcal{F}) \to (\mathbb{R}, \mathcal{B})$
• Strictly speaking, $X^{-1}$ should be $X^{-1}: \mathbb{R} \to \Omega$, but through the definition of $X^{-1}(B)$ we have extended it to $X^{-1}: \mathcal{B} \to \mathcal{F}$
• So $\mathbb{P}_{X} = \mathbb{P} \circ X^{-1}$ becomes a function $\mathcal{B} \to \mathcal{F} \to [0, 1]$
• So $X: \mathcal{F} \to \mathcal{B}$ can be viewed as an “event encoder” that maps each event $E \in \mathcal{F}$ to a Borel set $B \in \mathcal{B}$
• Likewise, $X^{-1}: \mathcal{B} \to \mathcal{F}$ can be viewed as an “event decoder” that maps each Borel set $B \in \mathcal{B}$ back to the original event $E \in \mathcal{F}$
• The point of event encoding is that it turns all kinds of different, concrete $(\Omega, \mathcal{F})$ into the unified, abstract $(\mathbb{R}, \mathcal{B})$
• For example, the two experiments “rolling a die” and “drawing one ball from a black box that holds 6 balls of different colors” have different events, yet they are clearly the same in essence. That essence shows up in the fact that, after encoding by $X$, they yield the same Borel sets (or, equivalently, the same function $\mathbb{P}_X$)
• The point of event decoding is computation, because $\mathbb{P}_X$ needs $\mathbb{P}$ to produce concrete values
• We usually never notice this event encoding/decoding because it feels so natural. In the “rolling a die” example above we directly wrote $\Omega = \lbrace 1,2,3,4,5,6 \rbrace$, so we can have $E = B$, i.e. $X = I$, which amounts to no event encoding/decoding at all. Consequently we did not distinguish $\mathbb{P}_{X}$ from $\mathbb{P}$, because $\mathbb{P}_{X} = \mathbb{P}$
• But I could also define $\Omega = \lbrace \text{I}, \text{II}, \text{III}, \text{IV}, \text{V}, \text{VI}\rbrace$, in which case you need to encode, which gives:
• $X(\text{I}) = 1$
• $\dots$
• $X(\text{VI}) = 6$
• So $\mathbb{P}_{X}(\lbrace 3 \rbrace) = \mathbb{P}(X^{-1}(\lbrace 3 \rbrace)) = \mathbb{P}(\lbrace \text{III} \rbrace)$
• Of course, your definition of $X$ does not have to match the semantics of the events. For example, defining $X(\text{I}) = 100, \dots, X(\text{VI}) = 600$ is also fine
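
The encoder/decoder picture for the Roman-numeral die can be sketched as follows. This assumes a fair die and represents a Borel set $B$ by a finite Python set of numbers, which is a simplification for illustration; `X_inv` is a made-up name for the preimage map $X^{-1}$.

```python
from fractions import Fraction

Omega = ["I", "II", "III", "IV", "V", "VI"]

# the "event encoder": outcome -> real number
X = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6}


def P(E):
    """Uniform probability measure on subsets of Omega (fair die assumed)."""
    return Fraction(len(set(E) & set(Omega)), len(Omega))


def X_inv(B):
    """The "event decoder": X^{-1}(B) = {omega | X(omega) in B}."""
    return {omega for omega in Omega if X[omega] in B}


def P_X(B):
    """Push-forward measure: P_X = P composed with X^{-1}."""
    return P(X_inv(B))


print(X_inv({3}))     # the event {"III"}
print(P_X({3}))       # equals P({"III"})
print(P_X({1, 3}))
print(P_X({7}))       # no outcome is encoded as 7
```

Swapping in the encoding $X(\text{I}) = 100, \dots, X(\text{VI}) = 600$ only changes the dictionary `X`; `P` is untouched, which is exactly the separation between $\mathbb{P}$ and $\mathbb{P}_X$ described above.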

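• The conclusion first: the notation $\mathbb{P}(X = 3)$ is a rather aggressive shorthand
• First, $\mathbb{P}(X = 3)$ should be $\mathbb{P}(\lbrace X = 3 \rbrace)$ ($\mathbb{P}$ takes an event)
• Second, $X = 3$ should be understood as $X \in \lbrace 3 \rbrace$
• So, letting $B = \lbrace 3 \rbrace$ and plugging into $\mathbb{P}_X(B) = \mathbb{P}(X^{-1}(B))$, we get: $\mathbb{P}(X = 3) = \mathbb{P}(\lbrace \omega \in \Omega \vert X(\omega) \in \lbrace 3 \rbrace \rbrace) = \mathbb{P}(X^{-1}(\lbrace 3 \rbrace)) = \mathbb{P}_X(\lbrace 3 \rbrace)$
• So $X = 3$ as a whole is an event $E \in \mathcal{F}$ (informal), whereas $\lbrace 3 \rbrace$ is a Borel set $B \in \mathcal{B}$
• If $X = I$, then $E = B$, $\mathbb{P}_{X} = \mathbb{P}$, and hence $\mathbb{P}(X = 3) \overset{\text{informal}}{=} \mathbb{P}_X(\lbrace 3 \rbrace) = \mathbb{P}(\lbrace 3 \rbrace)$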

Further properties of $\mathbb{P}_{X}$:

• If $\mathbb{P}_{X}$ gives measure one to a countable set of reals, then $X$ is called a discrete random variable.
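• $\mathbb{P}_{X}: \mathcal{B} \to [0, 1]$, and $\mathcal{B}$ is uncountable
• But $\mathbb{P}_{X}$ may be fully determined by its values on a countable subcollection of $\mathcal{B}$, namely the singletons of a countable set of reals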
• If $\mathbb{P}_{X}$ gives zero measure to every singleton set, and hence to every countable set, $X$ is called a continuous random variable.
• The distribution of every random variable can be written as a mixture (convex combination) of a discrete distribution and a continuous distribution.
• All random variables defined on a discrete probability space are discrete

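Definition: For any (locally finite) measure $\mu$ on $\mathbb{R}$, we define the distribution function of $\mu$ as

$$ F_{\mu}(x) = \mu \big( (-\infty, x] \big) $$

For the distribution $\mathbb{P}_X$ of a random variable $X$, we abbreviate $F_X = F_{\mathbb{P}_X}$, so that $F_X(x) = \mathbb{P}_X \big( (-\infty, x] \big) = \mathbb{P}(X \leq x)$.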

## 3. Probability Mass Functions (for the discrete), and Probability Density Functions (for the continuous)

Definition: Probability mass function for discrete random variable $X$, $p_X: \mathbb{R} \to [0, 1]$, can be defined as:

$$ p_X(x) = \mathbb{P}_X(\lbrace x \rbrace) = \mathbb{P}(X = x) $$

Definition: Probability density function for continuous random variable $X$, $f_X: \mathbb{R} \to [0, \infty)$, is one satisfying: $\forall B \in \mathcal{B}$,

$$ \mathbb{P}_X(B) = \int_{B} f_X(x) \mathrm{d}x $$

• Strictly speaking, $f_X$ should be called “the density or Radon–Nikodym derivative with respect to Lebesgue measure of random variable $X$”

• We can write $F_X(x) = \int_{-\infty}^{x} f_X(t) \mathrm{d}t$
• If $f_X$ is continuous at $x$, then $f_X(x) = F_X'(x)$
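
The relation $f_X(x) = F_X'(x)$ can be checked numerically. The sketch below assumes a standard normal $X$ (an illustrative choice, not something fixed by the text above), uses the closed-form distribution function via `math.erf`, and approximates the derivative with a central difference; the step size `h` is an arbitrary small value.

```python
import math


def F(x):
    """Distribution function of a standard normal random variable."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def f(x):
    """Density of a standard normal random variable."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def derivative(g, x, h=1e-5):
    """Central-difference approximation of g'(x)."""
    return (g(x + h) - g(x - h)) / (2.0 * h)


for x in (-2.0, -0.5, 0.0, 1.0, 2.5):
    # the two columns should agree up to the finite-difference error
    print(x, derivative(F, x), f(x))
```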

## 4. Tilde $\sim$ / i.i.d.

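• Since $\sim$ relates two random variables ($X \sim Y$ means that $X$ and $Y$ have the same distribution, $\mathbb{P}_X = \mathbb{P}_Y$), $\mathcal{N}(0, 1)$ in $X \sim \mathcal{N}(0, 1)$ is not a distribution but a random variable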
• If $X \sim \mathcal{N}(0, 1)$, then $f_X(x) = \mathcal{N}(x; 0, 1)$, where $\mathcal{N}(x; 0, 1)$ denotes the standard normal density evaluated at $x$
• If $\mu, \sigma^2$ are not fixed, $\mathcal{N}(\mu, \sigma^2)$ can be viewed as a parametric random variable
• Note that whenever we write $X \sim \mathcal{N}(\mu, \sigma^2)$, $\mathcal{N}(\mu, \sigma^2)$ must denote one specific random variable (once $\mu, \sigma^2$ are fixed); it cannot be read as a family of random variables

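• If $X \sim Y$ are both discrete random variables, then $\mathbb{P}_{X}$ is clearly the more direct object, so we usually work with $\mathbb{P}_{X} = \mathbb{P}_{Y}$
• It follows that $p_X = p_Y$
• If $X \sim Y$ are both continuous random variables, then $F_X$ is clearly the meaningful object, so we usually work with $F_X = F_Y$
• It follows that $f_X = f_Y$ (almost everywhere)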

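• Identically distributed: $X_1 \sim X_2 \sim \dots \sim X_{n-1} \sim X_n$ (oddly enough, in all these years I have never seen a textbook use this expression to describe i.i.d.)
• Independent: $X_1, \dots , X_n$ are mutually independent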

## 5. Independence / Marginal Distribution / Joint Distribution

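Definition: (1) Two events $E_1, E_2$ are called independent if

$$ \mathbb{P}(E_1 \cap E_2) = \mathbb{P}(E_1) \mathbb{P}(E_2) $$

(2) A collection of events $\lbrace E_i \rbrace$ is called independent if for every finite choice of distinct $E_1, \dots, E_n$ from the collection,

$$ \mathbb{P} \Big( \bigcap_{i=1}^{n} E_i \Big) = \prod_{i=1}^{n} \mathbb{P}(E_i) $$

(3) A collection of events $\lbrace E_i \rbrace$ is called pairwise independent if $\forall$ distinct $E_i, E_j$,

$$ \mathbb{P}(E_i \cap E_j) = \mathbb{P}(E_i) \mathbb{P}(E_j) $$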

(4) A finite collection of $\sigma$-algebras $\mathcal{F}_1, \dots, \mathcal{F}_n$ is called independent if $\forall$ $E_1 \in \mathcal{F}_1, \dots, E_n \in \mathcal{F}_n$, $\lbrace E_1, \dots, E_n \rbrace$ is independent.

(5) An infinite collection of $\sigma$-algebras is called independent if every finite subcollection is independent.
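
The gap between pairwise independence (3) and independence (2) is easy to miss, so here is a small sketch of the classic two-coin counterexample. The sample space, the events `A`, `B`, `C`, and the helper names are illustrative assumptions: two fair coin flips, with `C` the event that both flips agree.

```python
from fractions import Fraction
from itertools import combinations
from math import prod

Omega = frozenset({"HH", "HT", "TH", "TT"})  # two fair coin flips


def P(E):
    """Uniform probability measure on subsets of Omega."""
    return Fraction(len(frozenset(E) & Omega), len(Omega))


def intersect(events):
    """Intersection of a collection of events."""
    result = Omega
    for E in events:
        result = result & E
    return result


def is_pairwise_independent(events):
    return all(P(A & B) == P(A) * P(B) for A, B in combinations(events, 2))


def is_independent(events):
    """Every subcollection of size >= 2 must factorize."""
    return all(
        P(intersect(sub)) == prod(P(E) for E in sub)
        for r in range(2, len(events) + 1)
        for sub in combinations(events, r)
    )


A = frozenset({"HH", "HT"})  # first flip is heads
B = frozenset({"HH", "TH"})  # second flip is heads
C = frozenset({"HH", "TT"})  # both flips agree

print(is_pairwise_independent([A, B, C]))
print(is_independent([A, B, C]))
print(P(A & B & C), P(A) * P(B) * P(C))
```

Every pair factorizes, but the triple intersection $\lbrace \text{HH} \rbrace$ has probability $\frac{1}{4}$ rather than $\frac{1}{8}$, so the collection is pairwise independent without being independent.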

If $X_1, \dots , X_n$ are random variables, we can consider them as a random vector $(X_1, \dots , X_n)$ and hence as ONE random variable $X_{1:n}: (\Omega, \mathcal{F}) \to (\mathbb{R}^n, \mathcal{B(\mathbb{R}^n)})$

• Let $\mathcal{T}(\mathbb{R}^n) \subset \mathcal{P}(\mathbb{R}^n)$ denote the standard topology on $\mathbb{R}^n$ consisting of all open sets
• $\mathcal{P}(S) = 2^S$
• $\mathcal{B(\mathbb{R}^n)}$ is the $\sigma$-algebra generated by all the open sets, i.e. $\mathcal{B(\mathbb{R}^n)} = \sigma \big ( \mathcal{T}(\mathbb{R}^n) \big )$

• $\Omega_1 \times \Omega_2$ is the Cartesian product of the two sets
• $\mathcal{F}_1 \otimes \mathcal{F}_2$ is the $\sigma$-algebra on $\Omega_1 \times \Omega_2$, generated by subsets of the form $E_{1} \times E_{2}$ where $E_{1} \in \mathcal{F}_{1}$ and $E_{2} \in \mathcal{F}_{2}$
• A product measure $\mathbb{P}_1 \times \mathbb{P}_2$ is defined to be a measure on the measurable space $(\Omega_1 \times \Omega_2, \mathcal{F}_1 \otimes \mathcal{F}_2)$ satisfying $\forall E_{1} \in \mathcal{F}_{1}, \forall E_{2} \in \mathcal{F}_{2}$: $(\mathbb{P}_1 \times \mathbb{P}_2)(E_1 \times E_2) = \mathbb{P}_1(E_1) \mathbb{P}_2(E_2)$
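
For finite spaces the product measure can be built directly from point masses. The sketch below is an illustration under assumptions: the two spaces are a fair coin and a fair die (hypothetical choices), both $\sigma$-algebras are power sets, and `measure` is a made-up helper that sums point masses over an event.

```python
from fractions import Fraction
from itertools import product

# point masses of two hypothetical probability measures
P1 = {"H": Fraction(1, 2), "T": Fraction(1, 2)}   # fair coin on Omega_1
P2 = {k: Fraction(1, 6) for k in range(1, 7)}     # fair die on Omega_2

# point masses of the product measure on Omega_1 x Omega_2
P12 = {(w1, w2): P1[w1] * P2[w2] for w1, w2 in product(P1, P2)}


def measure(point_masses, E):
    """Measure of a finite event E as the sum of its point masses."""
    return sum((point_masses[w] for w in E), Fraction(0))


E1 = {"H"}                       # event in F_1
E2 = {2, 4, 6}                   # event in F_2
rectangle = set(product(E1, E2)) # the set E1 x E2

print(measure(P12, rectangle))
print(measure(P1, E1) * measure(P2, E2))
print(measure(P12, P12.keys()))  # total mass of the product space
```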

• distribution: $\mathbb{P}_{X_1}, \dots, \mathbb{P}_{X_n}$
• distribution function of distribution: $F_{X_1}, \dots, F_{X_n}$
• PMF: $p_{X_1}, \dots, p_{X_n}$
• PDF: $f_{X_1}, \dots, f_{X_n}$

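Definition: For random variable $X_{1:n} = (X_1, \dots , X_n)$, its joint distribution $\mathbb{P}_{X_{1:n}}: \mathcal{B(\mathbb{R}^n)} \to \mathbb{R}$ can be defined as: $\forall B_{1:n} = B_1 \times \dots \times B_n$, $B_{1:n} \in \mathcal{B(\mathbb{R}^n)}$,

$$ \mathbb{P}_{X_{1:n}}(B_{1:n}) = \mathbb{P}(X_1 \in B_1, \dots, X_n \in B_n) $$

Definition: For joint distribution $\mathbb{P}_{X_{1:n}}$, its joint distribution function $F_{X_{1:n}} \overset{\text{abbrev.}}{=} F_{\mathbb{P}_{X_{1:n}}}: \mathbb{R}^n \to [0, 1]$ can be defined as: $\forall t_i \in \mathbb{R}$,

$$ F_{X_{1:n}}(t_1, \dots, t_n) = \mathbb{P}_{X_{1:n}} \big( (-\infty, t_1] \times \dots \times (-\infty, t_n] \big) = \mathbb{P}(X_1 \leq t_1, \dots, X_n \leq t_n) $$

Definition: For random variable $X_{1:n} = (X_1, \dots , X_n)$, its joint probability mass function $p_{X_{1:n}}: \mathbb{R}^n \to [0, 1]$ can be defined as:

$$ p_{X_{1:n}}(x_1, \dots, x_n) = \mathbb{P}_{X_{1:n}} \big( \lbrace (x_1, \dots, x_n) \rbrace \big) = \mathbb{P}(X_1 = x_1, \dots, X_n = x_n) $$

Definition: For random variable $X_{1:n} = (X_1, \dots , X_n)$, its joint probability density function $f_{X_{1:n}}: \mathbb{R}^n \to [0, \infty)$ is one satisfying: $\forall B_{1:n} = B_1 \times \dots \times B_n, B_{1:n} \in \mathcal{B(\mathbb{R}^n)}$,

$$ \mathbb{P}_{X_{1:n}}(B_{1:n}) = \int_{B_{1:n}} f_{X_{1:n}}(x_1, \dots, x_n) \mathrm{d}x_1 \dots \mathrm{d}x_n $$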

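Definition: Random variables $X_1, \dots, X_n$ are said to be independent if any of these (equivalent) conditions holds:

(1) Joint distribution is the product of all marginal distributions: $\forall B_1, \dots, B_n \in \mathcal{B}$,

$$ \mathbb{P}_{X_{1:n}}(B_1 \times \dots \times B_n) = \prod_{i=1}^{n} \mathbb{P}_{X_i}(B_i) $$

• This is equivalent to saying “joint distribution is the product measure of all marginal distributions”: $\mathbb{P}_{X_{1:n}} = \mathbb{P}_{X_1} \times \dots \times \mathbb{P}_{X_n}$
• The marginal distribution of $X$ is really just $X$’s individual distribution; the term only makes sense in the context of a joint distribution. It comes from the two-dimensional discrete joint distribution table, where the individual distributions are the row and column totals written in the margins.

(2) Joint distribution function is the product of all marginal distribution functions: $\forall t_1, \dots, t_n \in \mathbb{R}$,

$$ F_{X_{1:n}}(t_1, \dots, t_n) = \prod_{i=1}^{n} F_{X_i}(t_i) $$

(3) Joint PMF is the product of all individual PMFs: $\forall x_1, \dots, x_n \in \mathbb{R}$,

$$ p_{X_{1:n}}(x_1, \dots, x_n) = \prod_{i=1}^{n} p_{X_i}(x_i) $$

(4) Joint PDF (if it exists) is the product of all individual PDFs: $\forall x_1, \dots, x_n \in \mathbb{R}$,

$$ f_{X_{1:n}}(x_1, \dots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i) $$

(5) The $\sigma$-algebras $\mathcal{F}(X_1), \dots, \mathcal{F}(X_n)$ are independent.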
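
For two discrete random variables, the marginal distributions and the PMF form of independence can be read off a joint table. The tables below are hypothetical numbers made up for illustration; `marginals` computes the row and column totals, and `is_independent` tests whether the joint PMF equals the product of the two marginal PMFs.

```python
from fractions import Fraction

# hypothetical joint PMF of (X, Y), keyed by (x, y)
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
    (1, 0): Fraction(1, 8), (1, 1): Fraction(3, 8),
}

# a second hypothetical table in which X and Y always agree
dependent = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 2)}


def marginals(joint):
    """Row totals give p_X, column totals give p_Y."""
    p_X, p_Y = {}, {}
    for (x, y), p in joint.items():
        p_X[x] = p_X.get(x, Fraction(0)) + p
        p_Y[y] = p_Y.get(y, Fraction(0)) + p
    return p_X, p_Y


def is_independent(joint):
    """Check p_{X,Y}(x, y) == p_X(x) * p_Y(y) for all pairs of values."""
    p_X, p_Y = marginals(joint)
    return all(
        joint.get((x, y), Fraction(0)) == p_X[x] * p_Y[y]
        for x in p_X
        for y in p_Y
    )


p_X, p_Y = marginals(joint)
print(p_X, p_Y)
print(is_independent(joint))
print(is_independent(dependent))
```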

## 6. Conditional Random Variable

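Suppose $X, Y$ are discrete random variables over $(\Omega, \mathcal{F}, \mathbb{P})$. If $\mathbb{P}(Y = y) \neq 0$, then we can define the conditional probability (measure): $\forall E \in \mathcal{F}$,

$$ \mathbb{P}(E \mid Y = y) = \frac{\mathbb{P}(E \cap \lbrace Y = y \rbrace)}{\mathbb{P}(Y = y)} $$

Definition: The discrete conditional random variable $X \mid Y = y$, read “$X$ given $Y = y$”, has PMF

$$ p_{X \mid Y = y}(x) = \mathbb{P}(X = x \mid Y = y) = \frac{p_{X,Y}(x, y)}{p_Y(y)} $$

Similarly, we can have

Definition: The continuous conditional random variable $X \mid Y = y$, has PDF

$$ f_{X \mid Y = y}(x) = \frac{f_{X,Y}(x, y)}{f_Y(y)} $$

provided $f_Y(y) > 0$, where $f_{X,Y}$ is the joint PDF of $(X, Y)$.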
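
For the discrete case, conditioning amounts to picking one column of the joint table and renormalizing it. The sketch below uses a hypothetical joint PMF; `conditional_pmf` is a made-up helper name, and it refuses to condition on a value of $Y$ that has probability zero, matching the requirement $\mathbb{P}(Y = y) \neq 0$.

```python
from fractions import Fraction

# hypothetical joint PMF of (X, Y), keyed by (x, y)
joint = {
    (0, 0): Fraction(1, 10), (0, 1): Fraction(3, 10),
    (1, 0): Fraction(2, 10), (1, 1): Fraction(4, 10),
}


def conditional_pmf(joint, y):
    """Return the PMF of X given Y = y as a dict {x: probability}."""
    # P(Y = y) is the total mass of the column Y = y
    p_y = sum((p for (_, y2), p in joint.items() if y2 == y), Fraction(0))
    if p_y == 0:
        raise ValueError("cannot condition on an event of probability zero")
    # renormalize the column so that it sums to one
    return {x: p / p_y for (x, y2), p in joint.items() if y2 == y}


p_given_1 = conditional_pmf(joint, 1)
print(p_given_1)
print(sum(p_given_1.values()))
```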