Inferential Statistics
- Sampling and Estimation
- Hypothesis Testing
- Common Tests
- Regression Analysis
- Estimation Methods
Population vs Sample
Concept | Description | Pros | Cons | Example |
---|---|---|---|---|
Population | Entire group of interest (all individuals/objects) | Complete source of truth | Usually too large/infinite to study fully | All voters in a country |
Sample | Subset of population chosen for study | Practical, manageable, enables inference | May mislead if non-representative | 1,000 voters surveyed |
Probability Sampling
Method | Description | Pros | Cons | Example |
---|---|---|---|---|
Simple Random Sampling (SRS) | Every individual has equal chance of selection | Easy, unbiased | Impractical for large groups; may miss subgroups | Randomly choosing 100 IDs |
Stratified Sampling | Population divided into homogeneous subgroups (strata); random sampling within each | Ensures subgroup representation; more precise | Requires population info; complex | Sampling students by grade level |
Systematic Sampling | Select every k-th element after random start | Simple, efficient | Bias risk if population has patterns | Every 10th store customer |
Cluster Sampling | Divide into clusters, randomly select clusters, then sample all in them | Cost-effective; useful for dispersed groups | Higher error if clusters heterogeneous | Survey all students in 10 selected schools |
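As a sketch, three of the schemes above can be implemented with the standard library alone (cluster sampling would instead pick whole regions and survey everyone in them). The 1,000-customer population, its `region` field, and the sample sizes are assumptions for illustration:

```python
import random

random.seed(0)  # reproducibility

# Hypothetical population: 1,000 customers, each tagged with a region
population = [{"id": i, "region": random.choice(["north", "south", "east", "west"])}
              for i in range(1000)]

# Simple random sampling: every individual has an equal chance of selection
srs = random.sample(population, k=100)

# Systematic sampling: every k-th element after a random start
k = len(population) // 100
start = random.randrange(k)
systematic = population[start::k]

# Stratified sampling: group by region (stratum), then sample randomly within each
strata = {}
for person in population:
    strata.setdefault(person["region"], []).append(person)
stratified = [p for group in strata.values()
              for p in random.sample(group, k=25)]  # 25 per stratum, 4 strata = 100
```

Note the bias risk named in the table: if the population list had a repeating pattern with period `k`, the systematic sample would systematically hit the same phase of it.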
Non-Probability Sampling
Method | Description | Pros | Cons | Example |
---|---|---|---|---|
Convenience Sampling | Use easily accessible participants | Fast, cheap | Strong bias risk, not representative | Surveying people in one street |
Purposive Sampling | Researcher selects based on judgment/criteria | Good for niche cases or experts | Bias-prone, not generalizable | Interviewing only industry experts |
Quota Sampling | Fill quotas for subgroups without randomization | Ensures subgroup presence | Still biased; no random selection | 50 men, 50 women chosen by interviewer |
Sampling Distribution & Estimation
Concept | Description | Key Points |
---|---|---|
Sampling Distribution | Distribution of a statistic across all possible samples | Central Limit Theorem ensures approximate normality for large n. Standard Error measures precision |
Point Estimate | Single sample statistic used to estimate population parameter | Simple, quick, but no reliability measure |
Interval Estimate (Confidence Intervals) | Range around a point estimate with a stated confidence level | Captures uncertainty; widely used in decision-making. The confidence level is not the probability that one computed interval contains the parameter; it reflects the long-run accuracy of the procedure |
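A short simulation makes these ideas concrete: even for a skewed population, the sample means cluster normally around the true mean, with spread equal to the standard error. The exponential population, sample size, and replicate count below are arbitrary choices for illustration (standard library only):

```python
import math
import random
import statistics

random.seed(1)

# A deliberately skewed "population" (exponential)
population = [random.expovariate(1.0) for _ in range(100_000)]
mu = statistics.mean(population)
sigma = statistics.pstdev(population)

# Sampling distribution of the mean: many samples of size n, one mean per sample
n = 50
sample_means = [statistics.mean(random.sample(population, n)) for _ in range(2000)]

# CLT: the means are approximately normal around mu; their spread is the standard error
se_theory = sigma / math.sqrt(n)
se_observed = statistics.stdev(sample_means)

# A 95% confidence interval from a single sample (z ~ 1.96 for large n)
sample = random.sample(population, n)
x_bar = statistics.mean(sample)
half_width = 1.96 * statistics.stdev(sample) / math.sqrt(n)
ci = (x_bar - half_width, x_bar + half_width)
```

In the long run about 95% of intervals built this way cover `mu`; any single interval either does or does not.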
Hypothesis Testing
Concept | Definition | Key Characteristics | Examples |
---|---|---|---|
Null Hypothesis (H₀) | Statement of no effect, no difference, or the status quo. Always involves equality (=, ≤, ≥) | Assumed true until evidence suggests otherwise; refers to a population parameter; target of skepticism | H₀: μ = 30; H₀: p = 0.5; H₀: μ₁ = μ₂ |
Alternative Hypothesis (H₁ or Hₐ) | Statement conflicting with H₀; represents the effect/researcher's claim. Uses ≠, >, or < | What we conclude if evidence is sufficient; also refers to a population parameter; usually the hypothesis we want to support | H₁: μ ≠ 30 (two-tailed); H₁: μ > 30 (right-tailed); H₁: μ < 30 (left-tailed) |
Type I Error (α) | Rejecting H₀ when H₀ is true (false positive) | Probability = α; controlled via the significance level | Convicting an innocent person; concluding a campaign increased conversions when it did not |
Type II Error (β) | Failing to reject H₀ when H₀ is false (false negative) | Probability = β; occurs when the test lacks sensitivity | Letting a guilty person go free; missing that a campaign actually increased conversions |
Significance Level (α) | Maximum acceptable probability of a Type I error, set before the test | Common levels: 0.05, 0.01, 0.10; smaller α reduces false positives but increases false negatives | If p ≤ α, reject H₀ |
P-value | Probability of observing data as extreme or more extreme, given that H₀ is true | Small p: reject H₀; large p: fail to reject H₀. Not the probability that H₀ is true | p = 0.03 < α = 0.05 → reject H₀ |
Power (1 − β) | Probability of correctly rejecting a false H₀ | Desired ≥ 80%; increases with sample size, higher α, larger effect size, lower variability | Ensures the study design is sensitive enough to detect meaningful effects; linked to resource planning |
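One way to internalize α: simulate a z-test many times in a world where H₀ is actually true, so every rejection is by definition a Type I error. The rejection rate should converge to α. The population parameters (μ = 100, σ = 15) and sample size are arbitrary:

```python
import math
import random
import statistics

random.seed(2)

def z_test_p(sample, mu0, sigma):
    """Two-tailed p-value for H0: mu = mu0, with sigma known."""
    z = (statistics.mean(sample) - mu0) / (sigma / math.sqrt(len(sample)))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # 2 * P(Z > |z|)

# H0 is TRUE here (mu really is 100), so every rejection is a false positive
alpha, trials = 0.05, 4000
rejections = sum(
    z_test_p([random.gauss(100, 15) for _ in range(30)], mu0=100, sigma=15) < alpha
    for _ in range(trials)
)
type_i_rate = rejections / trials  # converges to alpha by construction
```

The same simulation with a shifted true mean (H₀ false) would instead estimate power, 1 − β.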
Common Tests
Test | Purpose | When to Use | Assumptions | Use Case |
---|---|---|---|---|
Z-test | Compares a sample mean to a population mean, or two sample means, when σ is known | Population standard deviation known; large n ≥ 30 | Data is interval level; normal distribution (or CLT applies) | Test if average delivery time differs from 30 minutes when σ is known |
t-test (One-sample) | Compares sample mean to a hypothesized population mean when σ is unknown | σ unknown; small to moderate sample size | Approx. normal distribution; independent observations | Check if average product weight differs from 100g |
t-test (Independent samples) | Compares means of two independent groups | σ unknown; two independent groups | Normality; independence; equal variances (if using pooled) | Compare sales from two ad campaigns |
t-test (Paired samples) | Compares means of two related groups (before - after, matched) | σ unknown; paired observations | Differences approx. normally distributed | Compare satisfaction before and after service improvement |
ANOVA (One-way) | Tests if 3+ group means differ significantly (one factor) | Comparing ≥3 group means | Independence; normality; equal variances | Test if spending differs across customer segments |
ANOVA (Two-way) | Tests effect of two categorical factors on a quantitative outcome (plus interaction) | Two factors, multiple groups | Same as above | Compare sales across marketing channels and regions |
Chi-squared Goodness-of-Fit | Tests if observed categorical distribution matches expected | Categorical count data; expected distribution known | Expected counts ≥5 per cell (approx.); independence | Test if website traffic matches equal distribution across pages |
Chi-squared Test of Independence | Tests association between two categorical variables | Contingency tables for two categorical variables | Large enough expected counts; independence | Test if gender and product preference are associated |
Mann-Whitney U (Non-parametric) | Alternative to independent t-test (ranks) | Two independent samples; non-normal or ordinal data | Independence; ordinal/continuous ranked | Compare two groups with skewed data |
Wilcoxon Signed-Rank (Non-parametric) | Alternative to paired t-test (ranks differences) | Paired samples; non-normal data | Symmetry of differences (less strict than normality) | Before/after ratings with skewed scores |
Kruskal-Wallis (Non-parametric) | Alternative to one-way ANOVA (ranks) | 3+ independent groups; non-normal | Independence; ordinal/continuous ranked | Compare ranks of satisfaction across multiple regions |
Spearman's Rank Correlation | Measures monotonic relationship between variables (non-parametric) | Ranked or ordinal data; non-linear monotonic | Independence; ordinal/continuous | Correlation between income rank and lifestyle scores |
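Most tests in the table are a single call in SciPy (assuming `scipy` is installed). The "campaign" data below is synthetic and exists only to show the call signatures:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(100, 10, 40)   # e.g., daily sales under campaign A (synthetic)
group_b = rng.normal(105, 10, 40)   # campaign B
group_c = rng.normal(110, 10, 40)   # campaign C

# Independent-samples t-test: do campaigns A and B have different means?
t_stat, p = stats.ttest_ind(group_a, group_b)

# Mann-Whitney U: rank-based alternative when the data are skewed or ordinal
u_stat, p_u = stats.mannwhitneyu(group_a, group_b)

# One-way ANOVA: do the three campaign means differ?
f_stat, p_f = stats.f_oneway(group_a, group_b, group_c)

# Chi-squared test of independence on a 2x2 contingency table
table = np.array([[30, 10],
                  [20, 25]])
chi2, p_chi, dof, expected = stats.chi2_contingency(table)
```

A significant ANOVA says only that at least one mean differs; identifying which pairs differ requires a post-hoc test.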
Regression Analysis
Aspect | Simple Linear Regression | Multiple Linear Regression | Logistic Regression |
---|---|---|---|
Purpose | Models the relationship between one dependent variable (Y) and one independent variable (X) | Models the relationship between one dependent variable (Y) and multiple independent variables (X₁, X₂, …, Xₖ) | Models the probability of a binary outcome (e.g., yes/no, success/failure) |
Equation | Y = β₀ + β₁X + ε | Y = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ + ε | P(Y=1) = 1 / (1 + e^(−(β₀ + β₁X₁ + … + βₖXₖ))) |
Interpretation of Coefficients | β₁: Expected change in Y for a one-unit increase in X | βᵢ: Expected change in Y for a one-unit increase in Xᵢ, holding all other predictors constant | Coefficients affect the log-odds of Y=1; positive values increase the probability, negative values decrease it |
Estimation Method | Ordinary Least Squares (minimizing SSR) | Ordinary Least Squares (extended to multiple predictors) | Maximum Likelihood Estimation (MLE) |
Assumptions | Linearity, independence of errors, homoscedasticity, normality of errors, no measurement error in X | All simple assumptions, plus no multicollinearity and no endogeneity | Assumes linear relationship between predictors and log-odds, independence of observations |
Goodness of Fit | R² (proportion of variance explained) | Adjusted R² (penalizes adding predictors) | Pseudo-R², accuracy, AUC (Area Under the ROC Curve) |
Hypothesis Tests | Slope test: H₀: β₁ = 0 | Slope tests for each predictor: H₀: βᵢ = 0 | Tests for significance of predictors on the log-odds scale (H₀: βᵢ = 0) |
Challenges | Nonlinear relationships, violation of assumptions | Multicollinearity, overfitting, omitted variable bias | Probability calibration, handling class imbalance |
Applications | Predicting sales from advertising spend, predicting height from age | Predicting house price from square footage, bedrooms, and location | Fraud detection, medical diagnosis, churn prediction, spam email detection |
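A minimal sketch of both estimation ideas using only NumPy: least squares via `np.linalg.lstsq` for the linear model, and gradient ascent on the log-likelihood (MLE) for the logistic model. All data and "true" coefficients are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Multiple linear regression via least squares ---
# Synthetic housing data; true coefficients (50, 1.5, 10) are invented
n = 200
size = rng.uniform(50, 200, n)
beds = rng.integers(1, 5, n)
price = 50 + 1.5 * size + 10 * beds + rng.normal(0, 20, n)

X = np.column_stack([np.ones(n), size, beds])      # intercept + predictors
beta, *_ = np.linalg.lstsq(X, price, rcond=None)   # minimizes the sum of squared residuals

pred = X @ beta
r2 = 1 - np.sum((price - pred) ** 2) / np.sum((price - price.mean()) ** 2)

# --- Logistic regression via MLE (gradient ascent on the log-likelihood) ---
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = rng.normal(0, 1, 500)
y = (rng.uniform(size=500) < sigmoid(0.5 + 2.0 * x)).astype(float)  # true log-odds: 0.5 + 2x
Xl = np.column_stack([np.ones(500), x])

b = np.zeros(2)
for _ in range(2000):
    p_hat = sigmoid(Xl @ b)
    b += 1.0 * Xl.T @ (y - p_hat) / 500            # gradient of the mean log-likelihood
```

In practice one would use `statsmodels` or `scikit-learn` rather than hand-rolled gradient ascent; the loop is here only to show that logistic coefficients come from maximizing a likelihood, not from minimizing squared error.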
Estimation Methods
Method | Definition | Key Concepts | Applications |
---|---|---|---|
Maximum Likelihood Estimation (MLE) | Finds parameter values that maximize the likelihood of observing the given data | Likelihood and log-likelihood functions; estimates solve the score equations (derivative = 0); asymptotically efficient | Logistic regression; fitting distribution parameters (e.g., normal mean/variance, Poisson rate) |
Maximum A Posteriori (MAP) | Bayesian estimation that maximizes the posterior probability, combining the likelihood with prior beliefs | Posterior ∝ likelihood × prior; reduces to MLE under a uniform prior; the prior acts as regularization | Regularized models (ridge ≈ Gaussian prior, lasso ≈ Laplace prior); estimation with small samples and strong domain knowledge |
Method of Moments | Equates population moments with sample moments to estimate parameters | Match E[X], E[X²], … to the sample moments; simple closed-form solutions; generally less efficient than MLE | Quick initial estimates; distributions where MLE has no closed form |
Least Squares Estimation | Minimizes the sum of squared residuals between observed and predicted values | Residuals; normal equations; equivalent to MLE when errors are i.i.d. Gaussian | Linear regression; curve fitting; trend estimation |
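For a normal model the methods can be compared directly: MLE and method of moments give identical closed-form answers, while MAP shrinks the estimate toward the prior. A standard-library sketch with made-up data (true μ = 10, σ = 2) and a made-up prior:

```python
import random

random.seed(3)

# Data assumed drawn from Normal(mu=10, sigma=2)
data = [random.gauss(10, 2) for _ in range(500)]
n = len(data)

# MLE for a normal: mu_hat = sample mean, var_hat = (1/n) * sum of squared deviations
mu_mle = sum(data) / n
var_mle = sum((x - mu_mle) ** 2 for x in data) / n   # note: biased (divides by n, not n-1)

# Method of moments: match E[X] and E[X^2] to the first two sample moments
m1 = sum(data) / n
m2 = sum(x * x for x in data) / n
mu_mom, var_mom = m1, m2 - m1 ** 2                   # identical to MLE for the normal

# MAP for mu with a Normal(mu0, tau^2) prior and known sigma^2:
# the posterior mean shrinks the sample mean toward the prior mean
mu0, tau2, sigma2 = 0.0, 1.0, 4.0
w = (n / sigma2) / (n / sigma2 + 1 / tau2)           # weight on the data
mu_map = w * mu_mle + (1 - w) * mu0
```

With n = 500 the data weight `w` is close to 1, so the prior barely moves the estimate; with small n the shrinkage toward `mu0` would be substantial.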