Statistics with Python: Statistical Theory

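This article walks through a slightly more advanced version of high-school statistics (really only a little harder) and explains the principles behind some of the methods Python uses when processing data.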

1. Inferential Statistics
Draw sample data from a population and use it to make inferences about that population.

Hypothesis Testing: This involves formulating a hypothesis about a population parameter and using sample data to assess the validity of the hypothesis. Common methods include t-tests, chi-square tests, and ANOVA.
Confidence Intervals: Provide a range of values within which a population parameter is likely to fall.
Regression Analysis: Regression analysis explores the relationship between one or more independent variables and a dependent variable.
Analysis of Variance (ANOVA): ANOVA is a statistical method used to analyze the differences among group means in a sample. It is often employed to compare means across multiple groups.
Probability Distributions: Probability distributions, such as the normal distribution, play a crucial role in inferential statistics.
Sampling Distributions: The Central Limit Theorem is a key concept related to sampling distributions.
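To make the Central Limit Theorem a little more concrete, here is a minimal simulation sketch using only the standard library. The exponential population, the sample size of 50, and the seed are illustrative assumptions, not values from this article. It draws many samples from a skewed population and checks that the sample means cluster around the population mean, with a spread close to the theoretical standard error.

```python
import random
import statistics

def sample_means(population, sample_size, n_samples, rng):
    """Draw repeated samples (with replacement) and return the mean of each."""
    return [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]

rng = random.Random(0)  # fixed seed so the run is reproducible
n = 50                  # assumed sample size, chosen for illustration

# A clearly non-normal (right-skewed) population: exponential draws with mean 1.
population = [rng.expovariate(1.0) for _ in range(20_000)]

means = sample_means(population, sample_size=n, n_samples=2_000, rng=rng)

pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)
theoretical_se = pop_sd / n ** 0.5  # standard error predicted by the theorem

print(f"population mean      : {pop_mean:.3f}")
print(f"mean of sample means : {statistics.fmean(means):.3f}")
print(f"theoretical std error: {theoretical_se:.3f}")
print(f"observed std error   : {statistics.stdev(means):.3f}")
```

Even though the population is skewed, a histogram of `means` would look roughly bell-shaped, which is what the theorem predicts for reasonably large samples.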

2. Selecting Random Samples
Simple random sampling: all possible samples of a given size are equally likely.
Larger, complex samples.
Key features:

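Stratification: The population is divided into strata, and part of the sample is allocated to each stratum. This ensures sample representation from each stratum and reduces the variance of survey estimates.
Clustering: Clusters of population units (e.g., counties) are randomly sampled first (with known probability) within strata, to save data collection costs (data are collected from cases that are geographically close to each other). (If you wanted to survey people from many countries, you could go straight to a country of immigrants such as the United States, which saves effort.)
A simple random sample is one form of probability sample, in which every individual has an equal probability of being selected. Whenever units are selected at random, the sample is a probability sample.
Sample size does not explain why a simple random sample might fail to represent the views of all U.S. adults.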
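The difference between simple random sampling and stratification can be sketched as follows. The population of 1,000 people in three regions is hypothetical, and proportionate allocation is only one of several possible allocation rules.

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed for reproducibility

# Hypothetical population: (person_id, region) pairs with unequal strata.
population = (
    [(i, "north") for i in range(600)]
    + [(i, "south") for i in range(600, 900)]
    + [(i, "west") for i in range(900, 1000)]
)

def simple_random_sample(units, n, rng):
    """Every subset of size n is equally likely (sampling without replacement)."""
    return rng.sample(units, n)

def stratified_sample(units, n, key, rng):
    """Proportionate allocation: each stratum's share of n matches its size.

    Rounding can make the total differ slightly from n when the shares
    are not whole numbers; they are whole numbers in this example.
    """
    strata = {}
    for unit in units:
        strata.setdefault(key(unit), []).append(unit)
    sample = []
    for members in strata.values():
        n_stratum = round(n * len(members) / len(units))
        sample.extend(rng.sample(members, n_stratum))
    return sample

srs = simple_random_sample(population, 100, rng)
strat = stratified_sample(population, 100, key=lambda u: u[1], rng=rng)

print("SRS counts       :", Counter(region for _, region in srs))
print("Stratified counts:", Counter(region for _, region in strat))
```

The simple random sample's regional counts vary from run to run, while the stratified sample always contains 60, 30, and 10 units from the three regions, which is where the guaranteed representation comes from.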

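Non-probability sampling: samples are not based on a known probability of selection. (Used when probability sampling is challenging or impractical.)
(1) The selection is decided in advance by judgment; (2) who ends up being surveyed is a matter of luck.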

Convenience Sampling: Individuals or elements are selected based on their easy accessibility or availability. This can lead to a biased sample.
Purposive Sampling: Individuals or elements are selected based on specific characteristics or qualities that are relevant to the research objectives.
Snowball Sampling: Snowball sampling starts with an initial set of participants, who then refer or introduce the researcher to additional potential participants. This method is often used when the population is hard to reach (recruitment by referral).
Quota Sampling: Quota sampling involves dividing the population into subgroups (strata) based on certain characteristics and then setting quotas for each subgroup. Participants are then conveniently selected to meet these quotas.
Judgmental Sampling: The researcher uses their judgment or expertise to select individuals who are believed to be representative of the population. This method relies heavily on the researcher's subjective judgment.
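As an illustration of why quota sampling is not a probability method, the sketch below fills quotas in simple arrival order. The list of passers-by, the age groups, and the quotas are all made up for this example, and `quota_sample` is a hypothetical helper rather than a library function.

```python
def quota_sample(arrivals, quotas, key):
    """Accept units in arrival order until each subgroup's quota is filled."""
    remaining = dict(quotas)
    chosen = []
    for unit in arrivals:
        group = key(unit)
        if remaining.get(group, 0) > 0:
            chosen.append(unit)
            remaining[group] -= 1
        if not any(remaining.values()):
            break  # every quota is full, stop recruiting
    return chosen

# Hypothetical stream of passers-by: (name, age_group), in the order met.
arrivals = [
    ("a", "18-34"), ("b", "18-34"), ("c", "35-54"), ("d", "18-34"),
    ("e", "55+"), ("f", "35-54"), ("g", "35-54"), ("h", "55+"),
]
quotas = {"18-34": 2, "35-54": 2, "55+": 1}

sample = quota_sample(arrivals, quotas, key=lambda u: u[1])
print(sample)
```

Who gets in depends entirely on who happens to show up first, so no selection probability can be attached to any individual, even though the subgroup counts match the quotas.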

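1. Pseudo-Randomization
A method that imitates random assignment in an experiment or study without using true randomness. Subjects are usually assigned to treatment groups by some rule or algorithm rather than by a genuinely random process.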

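Examples:
- Systematic randomization. Subjects are assigned to groups according to a predetermined rule or systematic method rather than a purely random process, for example by order of enrollment or by some specific characteristic.
- Grouping with pseudo-random numbers generated by a computer algorithm. Computer-generated random numbers are actually deterministic, but a well-designed pseudo-random number generator can produce an effectively random allocation in some cases.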
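Both examples can be sketched in a few lines. The participant IDs, the group names, and the seed are assumptions made for illustration; the point is only that the first function uses a fixed rule and the second uses a deterministic, seeded generator.

```python
import random

# Hypothetical participant IDs, listed in enrollment order.
participants = [f"P{i:02d}" for i in range(1, 9)]

def systematic_assignment(ids, groups=("treatment", "control")):
    """Rule-based allocation: alternate groups by enrollment order (no randomness)."""
    return {pid: groups[i % len(groups)] for i, pid in enumerate(ids)}

def seeded_assignment(ids, seed, groups=("treatment", "control")):
    """Shuffle with a seeded PRNG, then alternate groups over the shuffled order."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    return {pid: groups[i % len(groups)] for i, pid in enumerate(shuffled)}

by_rule = systematic_assignment(participants)
by_prng = seeded_assignment(participants, seed=2024)

print(by_rule)
print(by_prng)
# Deterministic: the same seed reproduces the assignment exactly.
print(by_prng == seeded_assignment(participants, seed=2024))
```

Reproducibility is useful for auditing an allocation, but it also shows why the result is only pseudo-random: anyone who knows the rule or the seed can predict every assignment.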

2. Calibration
Usually refers to adjusting the sample or the sampling method so that it reflects the population more accurately.
(1) Quota Sampling:

(2) Re-weighting After Initial Sampling:

(3) Improving the Sampling Frame:
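Re-weighting after initial sampling can be illustrated with a small post-stratification sketch. The population shares and the survey responses below are hypothetical, and weighting each group by population share divided by sample share is one common choice rather than the only one.

```python
# Hypothetical known population shares (for example, from a census).
population_share = {"female": 0.5, "male": 0.5}

# Hypothetical sample that over-represents one group: (group, response) pairs,
# where response is 1 for "yes" and 0 for "no".
sample = (
    [("female", 1)] * 50 + [("female", 0)] * 30
    + [("male", 1)] * 5 + [("male", 0)] * 15
)

def poststratification_weights(sample, population_share):
    """Weight each group by (population share) / (sample share)."""
    n = len(sample)
    counts = {}
    for group, _ in sample:
        counts[group] = counts.get(group, 0) + 1
    return {g: population_share[g] / (counts[g] / n) for g in counts}

weights = poststratification_weights(sample, population_share)

unweighted = sum(y for _, y in sample) / len(sample)
weighted = (
    sum(weights[g] * y for g, y in sample)
    / sum(weights[g] for g, _ in sample)
)

print(f"weights            : {weights}")
print(f"unweighted estimate: {unweighted:.4f}")
print(f"weighted estimate  : {weighted:.4f}")
```

Here the raw sample is 80% female, so the unweighted "yes" rate leans toward the female rate. After re-weighting, each group counts for half of the estimate, matching the assumed population shares.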

3. Limitations:

1. Sampling Theory

2. Sampling Distributions

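Simpson's paradox: the trend observed within separate subgroups is the opposite of the trend observed in the population as a whole. The paradox usually arises from interactions between variables and from unbalanced subgroup sizes.
Example: apparent gender bias in university admissions. The overall admission rates for men and women may suggest gender discrimination, yet when the data are broken down by department, the rates for men and women within each department may be roughly equal or even reversed. This happens because admission standards or applicant characteristics can differ greatly between departments, so the overall trend contradicts the per-department trends.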
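The reversal is easier to see with numbers. The admissions figures below are invented purely for illustration (they are not real data): women have the higher admission rate in each department, yet the lower rate overall, because most women applied to the more selective department.

```python
# Hypothetical data: department -> gender -> (applicants, admitted).
admissions = {
    "A": {"men": (80, 48), "women": (20, 14)},  # less selective department
    "B": {"men": (20, 2), "women": (80, 16)},   # more selective department
}

def rate(applicants, admitted):
    """Admission rate as a fraction."""
    return admitted / applicants

def overall_rate(data, gender):
    """Pool applicants and admits across all departments for one gender."""
    applicants = sum(dept[gender][0] for dept in data.values())
    admitted = sum(dept[gender][1] for dept in data.values())
    return rate(applicants, admitted)

for dept, groups in admissions.items():
    print(dept, {g: f"{rate(*v):.0%}" for g, v in groups.items()})

print("overall", {g: f"{overall_rate(admissions, g):.0%}" for g in ("men", "women")})
```

Within department A the rates are 60% for men versus 70% for women, and within B they are 10% versus 20%, but the pooled rates come out as 50% for men versus 30% for women.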
