5.4.3 Visualizing the Central Limit Theorem

What Is the Central Limit Theorem? A Visual Explanation

Author: Mike Freeman

Compiled by: Bot

Editor's note: The central limit theorem is a group of important theorems in probability theory. Its central idea is that no matter how the underlying data are distributed, if we draw mutually independent random samples and take enough of them, the distribution of the sample means will converge to a normal distribution. To help more students grasp this concept, Mike Freeman, an instructor at the UW iSchool, created a set of intuitive visualizations, so intuitive that more than a few statistics professors declared they would use them in class.

This article aims to explain, as intuitively as possible, the core ideas of the central limit theorem, one of the foundational results of statistics. Through the series of animated figures below, readers should come away with a genuine understanding of the theorem. (The piece draws inspiration from other visual explanations, such as one on decision trees.)

Note that we will not go through the formal derivation here, so this is not a rigorous explanation of the theorem.

The Central Limit Theorem, Textbook Version

Before turning to the visualizations, let's review how a statistics course typically states the central limit theorem.

[Figure: textbook statements of the central limit theorem; n > 30 is the usual cutoff for a "large" sample. Source: LthID]
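For reference, the classical Lindeberg-Levy form of the statement behind those textbook figures:

```latex
% Lindeberg-Levy CLT: X_1, X_2, \dots are i.i.d. with mean \mu and
% finite variance \sigma^2; \bar{X}_n is the mean of the first n draws.
\[
  \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i ,
  \qquad
  \sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma}
  \;\xrightarrow{d}\;
  \mathcal{N}(0,\,1)
  \quad \text{as } n \to \infty .
\]
```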

A Simple Example

To make the theorem easier to grasp, let's start with a simple example. Suppose there is a group of 100 people whose opinions on some issue fall on a scale from 0 to 100. Plotting each person's opinion score along a horizontal axis gives the figure below; the dark vertical line marks the mean opinion score of the whole group.

Suppose you are a social scientist who wants to characterize where this group stands, summarizing it with a statistic such as the mean opinion score above. Unfortunately, limited time and funding make it impossible to interview everyone, so you have to sample. Say your budget allows you to randomly draw 10 people from the group (n = 10) and ask each of them about the issue:

Randomly drawing a sample of 10
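The same experiment is easy to run in code. Here is a minimal sketch; the right-skewed Beta-shaped population is an assumption, since the article never says how its 100 opinions were generated, only that they are not normally distributed:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: 100 opinion scores on a 0-100 scale.
# The skewed Beta shape is an assumption made for illustration.
population = 100 * rng.beta(2, 5, size=100)

# One random sample of n = 10 people, drawn without replacement.
sample_size = 10
sample = rng.choice(population, size=sample_size, replace=False)

print(f"population mean: {population.mean():.2f}")
print(f"sample mean:     {sample.mean():.2f}")  # can differ noticeably
```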

As you can see, the mean of a single sample can differ substantially from the mean of the whole group. So how can sampling ever be made reliable?

Considering Multiple Samples

Suppose we could draw multiple samples from the group. Although this does happen in practice (especially in political polling), here we use it mainly as an explanatory device (when you actually sample repeatedly, other considerations come into play that we will not cover). For each sample, we keep track of how the sample mean compares with the population mean each time we draw.

Repeating this process many times gives us a distribution of sample means, often called the sampling distribution of the mean, or (more simply) the sampling distribution. Here is how the sample means evolve as we repeatedly sample from the group of 100, ten people at a time:

After the first sample, the sample mean deviates noticeably from the population mean

After many samples, the gap between the mean of the sample means and the population mean shrinks
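A sketch of the repeated-sampling loop, continuing the assumed setup from the previous snippet (the choice of 1,000 repetitions is illustrative; the article just says "many times"):

```python
# Draw many samples of n = 10 and record each sample mean.
n_repeats = 1000

# One mean per repetition: the empirical sampling distribution.
sample_means = np.array([
    rng.choice(population, size=sample_size, replace=False).mean()
    for _ in range(n_repeats)
])

print(f"population mean:      {population.mean():.2f}")
print(f"mean of sample means: {sample_means.mean():.2f}")  # gap shrinks
```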

Notice that as the number of samples grows, the gap between the population mean and the mean of the sample means keeps shrinking. This makes sense, because the whole process amounts to drawing ever more observations from the group of 100. But as noted earlier, money and time are limited: repeated sampling does not solve the resource problem, and a single sample by itself still does not tell us where the whole group stands on the issue.

To judge how good any single sample mean is as an estimate, we first need to see what the distribution of sample means looks like.

Understanding the Distributions

Since the visualization above does not make the shape of the distribution obvious, here we turn each circle representing an opinion into a square and show the population distribution as a histogram:
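In code, the same view is just a histogram of the population from the earlier snippet (matplotlib assumed):

```python
import matplotlib.pyplot as plt

# Histogram of the population's opinion scores, with the mean marked.
plt.hist(population, bins=20, edgecolor="black")
plt.axvline(population.mean(), color="black", linewidth=2,
            label="population mean")
plt.xlabel("opinion score (0-100)")
plt.ylabel("count")
plt.legend()
plt.show()
```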

Clearly, our data are not normally distributed. While many naturally occurring phenomena happen to follow a normal distribution, many do not, and the shape of the population's distribution is not what lets us make inferences about these 100 people. Instead, let's look at the distribution of the sample means and watch how the sampling distribution takes shape:

As the number of samples grows, the sampling distribution changes shape

With still more samples, the shape of the sampling distribution stabilizes

As more and more samples are taken, the sampling distribution traces out a bell curve in the visualization: it is approximately normal. This holds regardless of the shape of the population distribution. And, as noted above, the sampling mean (the mean of the sampling distribution) becomes an increasingly accurate estimate of the population mean as the number of repetitions grows.
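One way to check this numerically, again continuing the assumed setup above: the CLT predicts the sampling distribution should be roughly normal with mean mu and standard deviation close to sigma / sqrt(n). One caveat: because we draw without replacement from only 100 people, a finite-population correction applies, so the observed spread will run slightly below sigma / sqrt(n):

```python
# CLT prediction: sampling distribution ~ Normal(mu, sigma / sqrt(n)).
mu, sigma = population.mean(), population.std()

print(f"mean of sample means: {sample_means.mean():.2f}  (mu = {mu:.2f})")
print(f"sd of sample means:   {sample_means.std():.2f}")
print(f"sigma / sqrt(n):      {sigma / np.sqrt(sample_size):.2f}")
```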

Why It Matters

As the number of samples approaches infinity, the distribution of the sample means approximates a normal distribution. This foundational result in statistics is what allows us to make inferences about a population from a single sample. Combined with what we know about the normal distribution, we can easily compute the probability of observing a given value around a given mean; conversely, we can estimate how plausible a population mean is given an observed sample mean. This not only lets us provide reliable estimates of population values, but also lets us quantify our confidence in those estimates.
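As a sketch of what such inference looks like, here is the standard normal-approximation 95% interval built from a single sample (1.96 is the two-sided 95% normal quantile; using the sample's own standard deviation as a plug-in for sigma is the usual approximation):

```python
# Normal-approximation 95% interval from a single sample of n = 10.
one_sample = rng.choice(population, size=sample_size, replace=False)
se = one_sample.std(ddof=1) / np.sqrt(sample_size)  # estimated std. error
lo = one_sample.mean() - 1.96 * se
hi = one_sample.mean() + 1.96 * se

print(f"sample mean: {one_sample.mean():.2f}")
print(f"95% CI:      [{lo:.2f}, {hi:.2f}]  (true mean: {population.mean():.2f})")
```

With n as small as 10, a t-based interval (multiplier about 2.26 rather than 1.96) would be more defensible in practice; the normal version is shown only to connect directly with the approximation the theorem provides.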

Wikipedia's definition of the central limit theorem: the central limit theorem is a group of theorems in probability theory. It states that, under appropriate conditions, the mean of a large number of mutually independent random variables, after suitable standardization, converges in distribution to a normal distribution.

In the comments, Frank Harrell, professor of biostatistics at the Vanderbilt University School of Medicine in Tennessee, left a wry remark: "But of all the theorems, the central limit theorem is the last thing I would want to teach students. I think they first need to do well in a first course that covers some design, the meaning of data, robustness of data, the bootstrap, some Bayes, high-precision data graphics, and so on."

Reading his comment, don't you feel that even after understanding this theorem, there is still plenty left to learn?

Original article (with interactive visualizations, well worth a look): mfviz.com/central-limit/

GitHub (source code for the visualization components): github.com/mkfreeman/central-limit

