Foundational Knowledge of Statistical Concepts and Relationships
1. Sample and Population
- Population: In statistics, the population is the entire set of individuals, events, or data points of interest. For example, in a national opinion poll, the population might be all adult citizens.
- Sample: A sample is a subset of the population that is selected for analysis and used to infer characteristics of the whole population. Because collecting data on an entire population is often impractical or unnecessary, researchers usually study a sample instead.
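The relationship between a population and a sample can be sketched in a few lines of Python. This is a minimal illustration, not a survey methodology: the population below is synthetic (randomly generated ages), and the names `population`, `sample`, and the sample size of 1,000 are assumptions made for the example.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: ages of 100,000 adults (synthetic, not real survey data).
population = [random.randint(18, 90) for _ in range(100_000)]

# Simple random sample drawn without replacement.
sample = random.sample(population, k=1_000)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The two means should be close but not identical; the gap is sampling error, which tends to shrink as the sample grows.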
2. Descriptive Statistics
- Descriptive statistics are a set of methods for describing and summarizing the basic features of data. They typically include the following:
- Central Tendency:
  - Mean: The sum of all values in the data set divided by the number of values.
  - Median: The middle value of the sorted data set. With an even number of values, it is the average of the two middle values.
  - Mode: The value that appears most frequently in the data set.
- Dispersion:
  - Variance: The average of the squared differences from the mean. The population variance divides by n; the sample variance divides by n - 1.
  - Standard Deviation: The square root of the variance. It measures spread in the same units as the data.
  - Range: The difference between the maximum and minimum values in the data set.
  - Interquartile Range (IQR): The difference between the third quartile (Q3) and the first quartile (Q1). It measures the spread of the middle half of the data.
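The measures above can be computed with Python's standard `statistics` module. The data values are made up for illustration. Note that several conventions exist for computing quartiles; `statistics.quantiles` uses its default "exclusive" method here, so other tools may report slightly different Q1 and Q3 values for the same data.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11]  # illustrative values only

# Central tendency
mean = statistics.mean(data)      # 42 / 7 = 6
median = statistics.median(data)  # middle of the 7 sorted values = 5
mode = statistics.mode(data)      # 4 appears twice

# Dispersion
population_variance = statistics.pvariance(data)  # divides by n
sample_variance = statistics.variance(data)       # divides by n - 1
population_std = statistics.pstdev(data)          # square root of the population variance
data_range = max(data) - min(data)

# Quartiles: three cut points that split the sorted data into four groups.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(mean, median, mode)
print(population_variance, sample_variance, population_std)
print(data_range, iqr)
```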
3. Probability
- Probability measures how likely an event is to occur, expressed as a number between 0 and 1. A foundational understanding of probability is crucial for statistical inference.
- Random Variable: A variable whose value is determined by the outcome of a random experiment.
- Probability Distribution: Describes the possible values of a random variable and the probability of each.
- Common Distributions:
  - Normal Distribution: A common continuous probability distribution that is symmetric, centered on its mean, and bell-shaped.
  - Binomial Distribution: The distribution of the number of successes in a fixed number of independent trials, each with the same probability of success.
  - Poisson Distribution: The distribution of the number of events occurring in a fixed interval of time or space.
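As a rough sketch, the three distributions listed above can be evaluated with the Python standard library alone. The helper names `binomial_pmf` and `poisson_pmf` are introduced here for illustration; `NormalDist` is the standard library's normal distribution class.

```python
import math
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """P(X = k) for k successes in n independent trials with success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with average rate lam per interval."""
    return math.exp(-lam) * lam**k / math.factorial(k)


standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Probability of exactly 3 heads in 10 fair coin flips: 120 / 1024.
print(binomial_pmf(3, 10, 0.5))

# Probability of exactly 2 events when 3 are expected on average.
print(poisson_pmf(2, 3.0))

# Density at the mean, and the probability mass within 1.96 standard deviations (about 0.95).
print(standard_normal.pdf(0.0))
print(standard_normal.cdf(1.96) - standard_normal.cdf(-1.96))
```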
4. Statistical Inference
- Statistical inference is the process of drawing conclusions about population characteristics from sample data. It includes parameter estimation and hypothesis testing.
- Parameter Estimation:
  - Point Estimation: Using a sample statistic as a single-value estimate of a population parameter (e.g., the sample mean as an estimate of the population mean).
  - Interval Estimation: Calculating a range from sample data that is likely to contain the population parameter (e.g., a confidence interval).
- Hypothesis Testing:
  - Null Hypothesis (\( H_0 \)): The hypothesis that there is no effect or no difference.
  - Alternative Hypothesis (\( H_1 \)): The hypothesis that there is an effect or a difference, opposing the null hypothesis.
  - p-value: The probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually observed. The smaller the p-value, the stronger the evidence against the null hypothesis.
  - Significance Level (\( \alpha \)): A threshold chosen before the analysis, commonly 0.05, used to decide whether to reject the null hypothesis.
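The pieces above (point estimate, interval estimate, p-value, significance level) fit together roughly as follows. This is a sketch under stated assumptions: the measurements and the null value `mu_0` are hypothetical, and the interval and test use a normal approximation. With a sample this small, a t-distribution would usually be preferred, but the Python standard library does not provide one.

```python
import math
import statistics
from statistics import NormalDist

sample = [5.1, 4.9, 5.6, 5.8, 5.2, 5.4, 5.0, 5.7, 5.3, 5.5]  # hypothetical measurements
mu_0 = 5.0    # population mean under H0 (assumed for the example)
alpha = 0.05  # significance level chosen in advance

n = len(sample)
point_estimate = statistics.mean(sample)             # point estimate of the population mean
std_error = statistics.stdev(sample) / math.sqrt(n)  # estimated standard error of the mean

# Interval estimate: approximate 95% confidence interval (normal approximation).
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
ci = (point_estimate - z_crit * std_error, point_estimate + z_crit * std_error)

# Two-sided test of H0: mean == mu_0 against H1: mean != mu_0.
z = (point_estimate - mu_0) / std_error
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject_h0 = p_value < alpha

print(f"point estimate: {point_estimate:.3f}")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"p-value: {p_value:.5f}, reject H0: {reject_h0}")
```

The two views agree: the interval excludes `mu_0` exactly when the two-sided p-value falls below `alpha`.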
5. Correlation vs. Causation
- Correlation: Measures the strength and direction of the linear relationship between two variables. The Pearson correlation coefficient, which ranges from -1 to 1, is a commonly used measure.
  - Positive Correlation: As one variable increases, the other tends to increase.
  - Negative Correlation: As one variable increases, the other tends to decrease.
  - Zero Correlation: There is no linear relationship between the two variables. A nonlinear relationship may still exist.
- Causation: A change in one variable directly produces a change in another. Correlation does not imply causation; establishing causation typically requires a designed experiment or more sophisticated analysis.
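The three cases can be illustrated with a small hand-written Pearson coefficient. The function name `pearson_r` and the data are made up for the example. The last data set follows y = (x - 3)^2 exactly, which shows that a coefficient of zero only rules out a linear relationship.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_x * ss_y)


x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))  # positive: y rises with x
print(pearson_r(x, [10, 8, 6, 4, 2]))  # negative: y falls as x rises
print(pearson_r(x, [4, 1, 0, 1, 4]))   # zero: y = (x - 3)^2, dependent but not linearly
```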
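6. Regression Analysis (continued)
- Regression Coefficient: Represents the magnitude and direction of the effect of an independent variable on the dependent variable. In a linear model, it is the expected change in the dependent variable for a one-unit increase in the independent variable, with any other variables held fixed.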
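For the simplest case, one independent variable fitted by ordinary least squares, the coefficient can be computed directly. The function name and the data (study hours versus test scores) are hypothetical and serve only to show how the slope is read.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


hours = [1, 2, 3, 4, 5]         # hypothetical independent variable
scores = [52, 55, 61, 64, 68]   # hypothetical dependent variable

slope, intercept = fit_simple_linear_regression(hours, scores)
print(f"slope: {slope:.2f}, intercept: {intercept:.2f}")
```

Here the slope is positive, so each additional hour is associated with a higher fitted score; a negative slope would indicate the opposite direction.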
7. Errors in Hypothesis Testing
- Type I Error: Rejecting a null hypothesis that is actually true (a "false positive"). Its probability is the significance level \( \alpha \).
- Type II Error: Failing to reject a null hypothesis that is actually false (a "false negative").
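Both error rates can be estimated by simulation. The sketch below assumes a two-sided z-test with known standard deviation; the sample size, effect size of 0.5, number of trials, and function names are all choices made for the example.

```python
import math
import random
from statistics import NormalDist


def z_test_p_value(sample, mu_0, sigma):
    """Two-sided p-value for H0: mean == mu_0, with known standard deviation sigma."""
    n = len(sample)
    z = (sum(sample) / n - mu_0) / (sigma / math.sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rejection_rate(true_mean, mu_0=0.0, sigma=1.0, n=30, alpha=0.05, trials=2000, seed=0):
    """Fraction of simulated experiments in which H0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        if z_test_p_value(sample, mu_0, sigma) < alpha:
            rejections += 1
    return rejections / trials


# H0 is true (true mean equals mu_0): every rejection is a Type I error.
type_i_rate = rejection_rate(true_mean=0.0)

# H0 is false (true mean is 0.5): every failure to reject is a Type II error.
type_ii_rate = 1 - rejection_rate(true_mean=0.5)

print(f"estimated Type I error rate:  {type_i_rate:.3f}")
print(f"estimated Type II error rate: {type_ii_rate:.3f}")
```

The estimated Type I rate should land near the chosen `alpha`, while the Type II rate depends on the sample size and on how far the true mean is from `mu_0`.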
8. Bayesian Statistics
- Bayesian statistics uses Bayes' theorem to update probability estimates by combining prior knowledge with new data.
- Bayes' Theorem: Gives the conditional probability of an event given observed evidence, expressed in terms of the prior probability of the event and the probability of the evidence.
- Prior Probability: The estimated probability of an event before new evidence is considered.
- Posterior Probability: The updated probability of an event after the new evidence is taken into account.
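For a hypothesis H and evidence E, Bayes' theorem reads P(H | E) = P(E | H) P(H) / P(E). A worked example makes the prior-to-posterior update concrete. The numbers below (a condition with 1% prevalence and a test with 95% sensitivity and a 5% false positive rate) are hypothetical, as is the function name.

```python
def posterior_probability(prior, likelihood, false_positive_rate):
    """P(H | E) for a binary hypothesis H, using Bayes' theorem.

    prior               -- P(H), belief before seeing the evidence
    likelihood          -- P(E | H), probability of the evidence if H is true
    false_positive_rate -- P(E | not H), probability of the evidence if H is false
    """
    evidence = likelihood * prior + false_positive_rate * (1 - prior)  # P(E)
    return likelihood * prior / evidence


prior = 0.01                # hypothetical prevalence of the condition
sensitivity = 0.95          # hypothetical P(positive test | condition)
false_positive_rate = 0.05  # hypothetical P(positive test | no condition)

posterior = posterior_probability(prior, sensitivity, false_positive_rate)
print(f"prior: {prior:.2f}, posterior after a positive test: {posterior:.3f}")
```

Even with a fairly accurate test, the posterior here is only about 0.16, because the condition is rare to begin with.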
9. Monte Carlo Simulation
- Monte Carlo simulation is a technique that uses repeated random sampling to estimate the value of a mathematical function or the output of a model. It is particularly useful for complex systems and for problems with high uncertainty.
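A classic small example is estimating pi by sampling random points in the unit square and counting how many fall inside the quarter circle of radius 1. The function name, seed, and sample size are arbitrary choices for this sketch.

```python
import random


def estimate_pi(num_samples, seed=0):
    """Estimate pi from the fraction of random points that land inside a quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()  # uniform point in the unit square
        if x * x + y * y <= 1.0:           # inside the quarter circle of radius 1
            inside += 1
    # Quarter-circle area / square area = pi / 4.
    return 4 * inside / num_samples


print(estimate_pi(100_000))
```

The estimate is noisy, and its error shrinks roughly in proportion to one over the square root of the number of samples.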
10. Parametric vs. Non-parametric Statistics
- Parametric Statistics: Assumes that the data come from a specific family of distributions (e.g., the normal distribution) and bases the analysis on that assumption.
- Non-parametric Statistics: Does not rely on a specific distributional assumption. It is suitable when the distribution is unknown or when the data do not meet the conditions required by parametric methods.
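One simple non-parametric method is the sign test, sketched below as an illustration. It tests a hypothesized median using only whether each value lies above or below it, so it needs no normality assumption and is not distorted by outliers. The function name and the data (which include one extreme value) are hypothetical.

```python
import math


def sign_test_p_value(sample, hypothesized_median):
    """Two-sided sign test p-value for H0: the population median equals hypothesized_median."""
    # Values equal to the hypothesized median carry no sign information and are dropped.
    diffs = [x - hypothesized_median for x in sample if x != hypothesized_median]
    n = len(diffs)
    if n == 0:
        raise ValueError("all values equal the hypothesized median")
    positives = sum(d > 0 for d in diffs)
    k = min(positives, n - positives)
    # Under H0 the number of positive signs follows Binomial(n, 0.5).
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


data = [12, 15, 14, 90, 13, 16, 17, 11, 18, 19]  # hypothetical, with one outlier (90)
p_value = sign_test_p_value(data, hypothesized_median=10)
print(f"sign test p-value: {p_value:.5f}")
```

Replacing the outlier 90 with any other value above 10 leaves the p-value unchanged, which is the kind of robustness that motivates non-parametric methods.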