Descriptive Statistics

Quantitative vs. Qualitative Data

Type	Definition	Characteristics	Subtypes	Examples	Use Cases
Quantitative (Numerical)	Data expressed as numbers representing counts or measurements	Measurable, ordered, meaningful differences, arithmetic operations possible	Discrete (countable) and Continuous (measurable)	Discrete: number of children, dice rolls. Continuous: height, weight, temperature, revenue	Statistical tests (e.g., regression), visualizations like histograms and scatter plots, predictive modeling
Qualitative (Categorical)	Data representing characteristics or categories	Descriptive, grouped into categories, non-numerical, limited math operations	Nominal and Ordinal (part of levels of measurement)	Gender, marital status, satisfaction levels, product type	Chi-squared tests, bar charts, pie charts, encoding for machine learning models

Levels of Measurement

Level	Categories	Order	Equal Intervals	True Zero	Examples	Permissible Operations
Nominal	Yes	No	No	No	Gender, hair color, product type, country	Frequencies, mode, Chi-squared tests
Ordinal	Yes	Yes	No	No	Satisfaction ratings, education level, Likert scale, income brackets	Frequencies, mode, median, rank correlations
Interval	Yes	Yes	Yes	No	Temperature (°C/°F), IQ scores, years/dates	Addition, subtraction, mean, SD, correlations (Pearson)
Ratio	Yes	Yes	Yes	Yes	Height, weight, income, age, time, Kelvin temperature	All statistical methods, multiplication, division, ratios

Measures of Central Tendency: Finding the "Average"

Measure	Definition	Formula	Example	Best Use Cases	Sensitivity
Arithmetic Mean	Sum of all values divided by the number of values	$\bar{x} = \frac{\sum x_i}{n}$	Sales: $$$$ → 110	Symmetric data without extreme outliers	Highly sensitive to outliers
Weighted Mean	Mean where each value has a weight representing importance/frequency	$\bar{x}_w = \frac{\sum (x_i \cdot w_i)}{\sum w_i}$	Grade = 89.5 (with homework, midterm, final weights)	When values contribute unequally (e.g., grades, weighted averages)	Sensitive if extreme values have high weights
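Geometric Mean	n-th root of the product of values	$GM = (\prod x_i)^{1/n}$	Growth factors: $[1.10, 1.05, 1.12]$ → $GM \approx 1.09$ (about 9% average growth per period)	Growth rates, ratios, percentages	Requires positive values (a single zero forces the result to zero; negative values are not supported); less affected by large outliers than the arithmetic mean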
Harmonic Mean	Reciprocal of the arithmetic mean of reciprocals	$HM = \frac{n}{\sum \frac{1}{x_i}}$	Speeds: 60 & 40 mph → 48 mph	Rates, speeds, efficiency ratios	Sensitive to very small values
Median	Middle value after ordering data	If $n$ is odd: the value at position $(n+1)/2$ ; if $n$ is even: the mean of the two middle values (positions $n/2$ and $n/2+1$ )	$$$$ → Median = 30	Skewed data, outlier-heavy data, ordinal data	Robust to outliers
Mode	Most frequent value(s) in dataset	Count frequencies	$$$$ → Mode = 4	Nominal/categorical data, identifying most common class/value	Not affected by extreme values
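The measures in the table above can be sketched with Python's standard `statistics` module. This is a minimal illustration, not the original author's code: the `scores`, `weights`, and `sample` values are hypothetical, and `weighted_mean` is a helper written for this example. The growth factors and speeds are the ones quoted in the table.

```python
import statistics

# Hypothetical assessment scores and weights (illustration only).
scores = [80, 90, 70]
weights = [0.2, 0.3, 0.5]


def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


arithmetic = statistics.mean(scores)
weighted = weighted_mean(scores, weights)

# Growth factors from the table: the geometric mean is the average factor.
geometric = statistics.geometric_mean([1.10, 1.05, 1.12])

# Speeds from the table: equal distances driven at 60 and 40 mph.
harmonic = statistics.harmonic_mean([60, 40])

# Hypothetical small sample for the median and the mode.
sample = [2, 4, 4, 5, 9]
median = statistics.median(sample)
modes = statistics.multimode(sample)  # list, because ties are possible

print(arithmetic, weighted, round(geometric, 4), harmonic, median, modes)
```

`statistics.multimode` is used instead of `statistics.mode` so that a dataset with several equally frequent values returns all of them.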

Distribution Insights

Measure	Formula	Sensitivity to Outliers	Example Result (Dataset: $[6,7,8,9,10]$ )	Interpretation	Use Cases
Range	$\text{Range} = \max(x) - \min(x)$	Highly sensitive	$10 - 6 = 4$	Quick measure of spread	Good for rough checks but distorted by extreme values
Interquartile Range	$\text{IQR} = Q3 - Q1$	Not sensitive	From example set → $Q1=7, Q3=9, IQR=2$	Describes spread of middle 50%	Robust for skewed data and outlier detection
Variance (Sample)	$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$	Sensitive	$s^2 = 2.5$	Uses all data points; squared units limit interpretability	Key in advanced statistics
Variance (Population)	$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$	Sensitive	Depends on full population	Same as above but for entire population	Exact measure of spread in total dataset
Standard Deviation	$s = \sqrt{s^2}$ , $\sigma = \sqrt{\sigma^2}$	Sensitive	$\sqrt{2.5} \approx 1.58$	Typical (root-mean-square) deviation from the mean, in the same units as the data	Most widely used spread measure
Mean Absolute Deviation (MAD)	$MAD = \frac{\sum |x_i - \bar{x}|}{n}$	Less sensitive	$1.2$	Average distance from mean without squaring	Easier to interpret, less used in theory

Symmetric (Normal): Mean ≈ Median ≈ Mode: All 3 measures are approximately equal.
Positively Skewed (Right): Mode < Median < Mean: Mean is pulled right by high outliers.
Negatively Skewed (Left): Mean < Median < Mode: Mean is pulled left by low outliers.
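A quick way to see the positive-skew ordering is to compute all three measures on a small sample with one large value. The `incomes` list below is hypothetical data chosen for illustration; real datasets do not always follow the ordering exactly, so treat it as a rule of thumb.

```python
import statistics

# Hypothetical right-skewed sample: one large value stretches the right tail.
incomes = [30, 30, 35, 40, 45, 50, 200]

mode = statistics.mode(incomes)
median = statistics.median(incomes)
mean = statistics.mean(incomes)

# For this sample the mean sits above the median, which sits above the mode.
print(mode, median, round(mean, 2))
```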

Measures of Dispersion (Variability): Quantifying Data Spread

Measure	Formula	Example	Sensitivity to Outliers	Advantages	Limitations	Use Cases
Range	$\text{Range} = \text{Max} - \text{Min}$	$10 - 6 = 4$	Highly sensitive	Very simple, quick estimate of spread	Ignores distribution between extremes, distorted by outliers	Quick checks (e.g., quality control)
Interquartile Range (IQR)	$IQR = Q3 - Q1$	For [5,7,8,8,10,12,13,15,17,20], $Q1=8, Q3=15, IQR=7$	Low sensitivity (robust)	Focuses on middle 50%, robust against skew and outliers	Ignores data outside middle 50%	Outlier detection, box plots, skewed data
Variance	Population: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$ Sample: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$	For [6,7,8,9,10], $s^2 = 2.5$	Sensitive	Uses all data, key for advanced statistics	Units squared (harder to interpret)	ANOVA, regression, PCA, machine learning
Standard Deviation	$\sigma = \sqrt{\sigma^2}$ , $s = \sqrt{s^2}$	For variance = 2.5, $s = 1.581$	Sensitive	Same units as data, widely interpretable, foundational in stats	Still sensitive to outliers	Z-scores, hypothesis testing, confidence intervals
Mean Absolute Deviation (MAD)	$MAD = \frac{\sum |x_i - \bar{x}|}{n}$	For [6,7,8,9,10], $MAD = 1.2$	Less sensitive	Easy to understand, less influenced by extremes	Less mathematically flexible than variance/SD	Forecasting errors, robust spread measure
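The worked values in the table can be checked with a short sketch using Python's standard `statistics` module on the example dataset [6, 7, 8, 9, 10]. The variable names are chosen for this illustration only.

```python
import statistics

data = [6, 7, 8, 9, 10]  # example dataset used in the table

data_range = max(data) - min(data)

sample_var = statistics.variance(data)       # divides by n - 1
population_var = statistics.pvariance(data)  # divides by N
sample_sd = statistics.stdev(data)           # square root of sample variance

# Mean absolute deviation: average distance from the mean, no squaring.
mean = statistics.mean(data)
mad = sum(abs(x - mean) for x in data) / len(data)

print(data_range, sample_var, population_var, round(sample_sd, 3), mad)
```

Note the difference between `variance` and `pvariance`: the same five numbers give 2.5 when treated as a sample and 2 when treated as a full population.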

Measures of Position: Relative Standing

Measure	Definition	Method	Example	Use Cases
Percentiles	Value below which a certain percentage of observations fall	Sort the data, then compute the position $L = (P/100) \times N$ . If $L$ is whole, average the values at positions $L$ and $L+1$ . If not, round $L$ up to the next whole position and take the value there	70th percentile of 10 ordered scores: $L=(70/100)\times10=7$ . Avg of 7th (85) & 8th (88) = 86.5	Ranking, benchmarking, and segmenting datasets (e.g., identifying top 10% performers)
Quartiles	Divide data into four equal parts (Q1=25th, Q2=50th/median, Q3=75th)	Same as percentile method, with P=25, 50, 75	Scores: Q1=72, Q2=81, Q3=88	Used in box plots, detect skewness, compute IQR ( $Q3-Q1$ ) for spread and outlier detection
Deciles	Divide data into ten equal parts	Same as percentile method, with P=10, 20, …, 90	D5 = 50th percentile = Median = 81	Provides more granular segmentation than quartiles, often used in income/wealth distribution analysis
Z-scores	Standardized score showing how many SDs a value is from mean	$Z = \frac{x - \mu}{\sigma}$ (population) or $Z = \frac{x - \bar{x}}{s}$ (sample)	Score = 85, Mean=70, SD=10 → $Z = (85-70)/10 = 1.5$	Standardizes across distributions, detects outliers ( $|Z|>2$ or 3), enables probability-based comparisons
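The percentile rule in the table can be written as a small function. This is a sketch under stated assumptions: `percentile` and `z_score` are hypothetical helper names, `p` is assumed to be an integer strictly between 0 and 100, and integer arithmetic is used to decide whether $L$ is whole so that floating-point rounding cannot change the branch. Library routines such as `statistics.quantiles` use different interpolation rules and may return slightly different values.

```python
def percentile(values, p):
    """p-th percentile using the position rule L = (P/100) * N.

    Assumes p is an integer with 0 < p < 100.
    """
    ordered = sorted(values)
    n = len(ordered)
    whole, remainder = divmod(p * n, 100)
    if remainder == 0:
        # L is whole: average the values at positions L and L + 1 (1-based).
        return (ordered[whole - 1] + ordered[whole]) / 2
    # L is fractional: round up and take the value at that position.
    return ordered[whole]


def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


# Dataset from the IQR example in the dispersion table.
data = [5, 7, 8, 8, 10, 12, 13, 15, 17, 20]
q1 = percentile(data, 25)
q3 = percentile(data, 75)
z = z_score(85, 70, 10)  # worked example from the Z-scores row

print(q1, q3, q3 - q1, z)
```

Deciles follow from the same function by passing `p` = 10, 20, and so on up to 90.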

Outlier Detection via Measures of Position

Method	Formula	Use Cases
IQR Method	Outliers if values lie outside $[Q1 - 1.5 \times IQR, \, Q3 + 1.5 \times IQR]$	Skewed datasets, box plot analysis
Z-score Method	Outliers if $|Z|>2$ (or 3)	Approximately normal distributions
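Both rules can be sketched in a few lines. The dataset below is the ten-value IQR example with one extra value (60) appended as a hypothetical outlier. The helper names `percentile`, `iqr_outliers`, and `z_outliers` are assumptions for this illustration, and quartiles are computed with the position rule $L = (P/100) \times N$ described earlier.

```python
import statistics


def percentile(values, p):
    """p-th percentile via L = (P/100) * N; assumes integer 0 < p < 100."""
    ordered = sorted(values)
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder == 0:
        return (ordered[whole - 1] + ordered[whole]) / 2
    return ordered[whole]


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k * IQR, Q3 + k * IQR]."""
    q1, q3 = percentile(values, 25), percentile(values, 75)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]


def z_outliers(values, threshold=2.0):
    """Values whose sample z-score exceeds the threshold in absolute value."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [x for x in values if abs((x - mean) / sd) > threshold]


# Ten-value example dataset plus one hypothetical extreme value.
data = [5, 7, 8, 8, 10, 12, 13, 15, 17, 20, 60]

print(iqr_outliers(data))        # IQR fences
print(z_outliers(data, 2.0))     # looser z-score threshold
print(z_outliers(data, 3.0))     # stricter z-score threshold
```

In this sample the stricter threshold of 3 flags nothing, because the extreme value inflates the standard deviation used to score it. That is one reason the IQR method is often preferred for small or skewed datasets.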

Concept	Definition	Key Features	What It Shows	Relevance
Frequency Distributions	A table or graph showing the frequency of outcomes in a sample	Constructed with bins or categories, tally, relative and cumulative frequencies	Pattern of data frequencies across values/intervals	Foundation for histograms, highlights anomalies, aids data exploration
Histogram	Graphical representation of numerical data distribution	Continuous bins on x-axis; frequencies/relative frequencies on y-axis; no gaps	Shape, central tendency, spread, modes, outliers	Tests normality assumption, assesses symmetry/skewness
Box Plot	Visual summary using five-number summary (min, Q1, median, Q3, max)	Box (Q1 - Q3), median line, whiskers, outliers marked separately	Median, spread (IQR), symmetry, skewness, outliers	Useful for group comparison; robust to outliers
Bar Chart	Chart for categorical data showing frequencies or proportions	Categories on x-axis; separated bars (gaps between categories)	Prevalence/popularity of categories	Ideal for qualitative data distributions
Pie Chart	Circular graphic with slices proportional to categories	Arc length/area represents share	Proportion of categories relative to whole	Good for simple proportions (e.g., market share); weaker for precise comparisons
Scatter Plot	Graph of two variables as points on Cartesian plane	Each point shows one observation	Correlation, relationships, clusters, outliers	Basis for regression, critical for bivariate analysis
Skewness	Measure of asymmetry in distribution	Positive: right tail longer; Negative: left tail longer; Zero: symmetric	Highlights direction and degree of skew	Informs model assumptions, guides use of mean/median, reveals data limits
Kurtosis	Measure of tail heaviness (often loosely described as peakedness)	Mesokurtic: normal-like tails; Leptokurtic: heavy tails, often with a sharp peak; Platykurtic: light tails, often with a flat top	Probability of extreme values in distribution	Critical in risk assessment, model assumptions, influence of outliers
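The table gives no formulas for skewness and kurtosis, so the sketch below assumes the common moment-based (population) definitions: skewness $= m_3 / m_2^{3/2}$ and excess kurtosis $= m_4 / m_2^2 - 3$, where $m_k$ is the $k$-th central moment. Statistical packages often apply sample-size corrections, so their results can differ slightly. The function names and the second dataset are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment: average of (x - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)


def skewness(values):
    """Moment-based skewness; zero for perfectly symmetric data."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


symmetric = [6, 7, 8, 9, 10]     # evenly spaced, symmetric around 8
right_skewed = [1, 2, 2, 3, 10]  # hypothetical sample with a long right tail

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed))
```

A negative excess kurtosis, as for the evenly spaced sample, corresponds to the platykurtic (light-tailed) case in the table.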

Correlation: Measuring Relationships Between Variables

Measure	Definition	Formula	Range	Interpretation	Limitations	Use Cases
Covariance	Measures how two variables change together (direction of relationship)	Population: $\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$ Sample: $s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1}$	$-\infty$ to $+\infty$	Positive → variables increase/decrease together. Negative → one increases while other decreases. Zero → no linear relationship	Scale-dependent, magnitude not directly interpretable	Good first step to check direction of relationship
Pearson Correlation (r)	Normalized measure of strength & direction of linear relationship between quantitative variables	Population: $\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$ Sample: $r = \frac{s_{xy}}{s_x s_y}$	$[-1, +1]$	+1 = perfect positive linear, -1 = perfect negative linear, 0 = no linear correlation. Strength guidelines: weak (±0.1-0.3), moderate (±0.3-0.7), strong (±0.7-1)	Sensitive to outliers, only captures linear relationships	Requires quantitative variables, linearity, no extreme outliers, homoscedasticity. Common in EDA, regression, feature selection
Spearman's Rank Correlation ( $\rho$ or $r_s$ )	Non-parametric measure of strength & direction of monotonic (possibly non-linear) relationship. Uses ranks instead of raw data	$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$ where $d_i$ = difference between the two ranks of observation $i$ (exact only when there are no tied ranks; with ties, compute Pearson's $r$ on the ranks)	$[-1, +1]$	+1 = perfect positive monotonic, -1 = perfect negative monotonic, 0 = no monotonic relationship	Does not measure strength of linear relationship, only ranked/monotonic consistency	Works with ordinal data, monotonic but non-linear relationships, non-normal data; less sensitive to outliers
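The three measures can be sketched directly from the sample formulas in the table. The function names and the data are hypothetical, and `ranks` assumes there are no tied values, which is the condition under which the $d_i^2$ shortcut for Spearman's coefficient is exact.

```python
import math


def sample_covariance(x, y):
    """Sum of paired deviations from the means, divided by n - 1."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


def pearson(x, y):
    """Sample covariance divided by the product of sample SDs."""
    sd_x = math.sqrt(sample_covariance(x, x))
    sd_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (sd_x * sd_y)


def ranks(values):
    """1-based ranks in ascending order; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical data: y grows monotonically with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

print(sample_covariance(x, y), round(pearson(x, y), 4), spearman(x, y))
```

Because `y` is a monotonic but non-linear function of `x`, Spearman's coefficient is exactly 1 while Pearson's `r` falls short of 1, which illustrates the distinction drawn in the table.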