Descriptive Statistics
- Data Types
- Measures of Central Tendency
- Measures of Dispersion
- Measures of Position
- Data Distribution and Shape
- Correlation
Quantitative vs. Qualitative Data
Type | Definition | Characteristics | Subtypes | Examples | Use Cases |
---|---|---|---|---|---|
Quantitative (Numerical) | Data expressed as numbers representing counts or measurements | Measurable, ordered, meaningful differences, arithmetic operations possible | Discrete (countable) and Continuous (measurable) | Discrete: number of children, dice rolls. Continuous: height, weight, temperature, revenue | Statistical tests (e.g., regression), visualizations like histograms and scatter plots, predictive modeling |
Qualitative (Categorical) | Data representing characteristics or categories | Descriptive, grouped into categories, non-numerical, limited math operations | Nominal and Ordinal (part of levels of measurement) | Gender, marital status, satisfaction levels, product type | Chi-squared tests, bar charts, pie charts, encoding for machine learning models |
Levels of Measurement
Level | Categories | Order | Equal Intervals | True Zero | Examples | Permissible Operations |
---|---|---|---|---|---|---|
Nominal | Yes | No | No | No | Gender, hair color, product type, country | Frequencies, mode, Chi-squared tests |
Ordinal | Yes | Yes | No | No | Satisfaction ratings, education level, Likert scale, income brackets | Frequencies, mode, median, rank correlations |
Interval | Yes | Yes | Yes | No | Temperature (°C/°F), IQ scores, years/dates | Addition, subtraction, mean, SD, correlations (Pearson) |
Ratio | Yes | Yes | Yes | Yes | Height, weight, income, age, time, Kelvin temperature | All statistical methods, multiplication, division, ratios |
Measures of Central Tendency: Finding the "Average"
Measure | Definition | Formula | Example | Best Use Cases | Sensitivity |
---|---|---|---|---|---|
Arithmetic Mean | Sum of all values divided by the number of values | $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ | Sales figures → 110 | Symmetric data without extreme outliers | Highly sensitive to outliers |
Weighted Mean | Mean where each value has a weight representing importance/frequency | $\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$ | Grade = 89.5 (with homework, midterm, final weights) | When values contribute unequally (e.g., course grades) | Sensitive if extreme values have high weights |
Geometric Mean | n-th root of the product of values | $GM = \left(\prod_{i=1}^{n} x_i\right)^{1/n}$ | Returns → 9% | Growth rates, ratios, percentages | Requires positive values (a zero forces the result to 0; negative values are not supported); less affected by large outliers than the arithmetic mean |
Harmonic Mean | Reciprocal of the arithmetic mean of reciprocals | $HM = \frac{n}{\sum_{i=1}^{n} 1/x_i}$ | Speeds: 60 & 40 mph → 48 mph | Rates, speeds, efficiency ratios | Sensitive to very small values |
Median | Middle value after ordering data | If $n$ is odd: the value at position $\frac{n+1}{2}$; if $n$ is even: the mean of the two middle values (positions $\frac{n}{2}$ and $\frac{n}{2}+1$) | Ordered example data → Median = 30 | Skewed data, outlier-heavy data, ordinal data | Robust to outliers |
Mode | Most frequent value(s) in dataset | Count frequencies | Example data → Mode = 4 | Nominal/categorical data, identifying most common class/value | Not affected by extreme values |
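The measures in the table can be sketched with Python's standard `statistics` module. Only the speed example (60 and 40 mph) comes from the table; the `scores`, grade weights, and `growth_factors` values are hypothetical, and `weighted_mean` is an illustrative helper rather than a library function.

```python
import statistics

# Harmonic mean: the speed example from the table (60 and 40 mph).
speeds = [60, 40]
harmonic = statistics.harmonic_mean(speeds)  # 2 / (1/60 + 1/40) = 48.0

# Hypothetical scores, used only to illustrate mean, median, and mode.
scores = [2, 4, 4, 5, 10]
mean = statistics.mean(scores)      # 25 / 5 = 5
median = statistics.median(scores)  # middle of the ordered values = 4
mode = statistics.mode(scores)      # most frequent value = 4


def weighted_mean(values, weights):
    """Illustrative helper: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Hypothetical grade components: homework, midterm, final.
grade = weighted_mean([90, 80, 70], [0.2, 0.3, 0.5])  # 18 + 24 + 35 = 77

# Hypothetical growth factors: +50% followed by -50%.
growth_factors = [1.5, 0.5]
geometric = statistics.geometric_mean(growth_factors)  # sqrt(0.75), about 0.866
arithmetic = statistics.mean(growth_factors)           # 1.0

print(harmonic, mean, median, mode, grade, geometric, arithmetic)
```

The last two lines show why the geometric mean suits growth rates: the arithmetic mean of the factors is 1.0 (no change), while the geometric mean is below 1, matching the fact that a 50% gain followed by a 50% loss leaves less than the starting amount.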
Distribution Insights
Measure | Formula | Sensitivity to Outliers | Example Result | Interpretation | Use Cases |
---|---|---|---|---|---|
Range | $\max(x) - \min(x)$ | Highly sensitive | | Quick measure of spread | Good for rough checks but distorted by extreme values |
Interquartile Range | $IQR = Q_3 - Q_1$ | Not sensitive | Computed from the example set | Describes spread of middle 50% | Robust for skewed data and outlier detection |
Variance (Sample) | $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ | Sensitive | | Uses all data points; squared units limit interpretability | Key in advanced statistics |
Variance (Population) | $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$ | Sensitive | Depends on full population | Same as above but for entire population | Exact measure of spread in total dataset |
Standard Deviation | $\sigma = \sqrt{\sigma^2}$, $s = \sqrt{s^2}$ | Sensitive | | Typical deviation from the mean, in the same units as the data | Most widely used spread measure |
Mean Absolute Deviation (MAD) | $MAD = \frac{1}{n}\sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert$ | Less sensitive | | Average distance from mean without squaring | Easier to interpret, less used in theory |
- Symmetric (Normal): Mean ≈ Median ≈ Mode. All three measures are approximately equal.
- Positively Skewed (Right): Mode < Median < Mean. The mean is pulled right by high outliers.
- Negatively Skewed (Left): Mean < Median < Mode. The mean is pulled left by low outliers.
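The ordering of mode, median, and mean under skew can be checked on small samples. This is a minimal sketch: both datasets are hypothetical and were chosen so that a single extreme value creates the skew, and `summarize` is an illustrative helper name.

```python
import statistics


def summarize(values):
    """Return (mode, median, mean) for a list of numbers."""
    return (
        statistics.mode(values),
        statistics.median(values),
        statistics.mean(values),
    )


# Hypothetical right-skewed sample: one high outlier (21).
right_skewed = [1, 2, 2, 2, 3, 4, 5, 21]
r_mode, r_median, r_mean = summarize(right_skewed)  # 2, 2.5, 5

# Hypothetical left-skewed sample: one low outlier (1).
left_skewed = [1, 17, 18, 19, 20, 20, 20, 21]
l_mode, l_median, l_mean = summarize(left_skewed)   # 20, 19.5, 17

print("right:", r_mode, r_median, r_mean)
print("left: ", l_mode, l_median, l_mean)
```

In both samples the median moves only slightly while the mean shifts toward the outlier, which is why the median is usually preferred for skewed data.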
Measures of Dispersion (Variability): Quantifying Data Spread
Measure | Formula | Example | Sensitivity to Outliers | Advantages | Limitations | Use Cases |
---|---|---|---|---|---|---|
Range | $\max(x) - \min(x)$ | For [6,7,8,9,10], $10 - 6 = 4$ | Highly sensitive | Very simple, quick estimate of spread | Ignores distribution between extremes, distorted by outliers | Quick checks (e.g., quality control) |
Interquartile Range (IQR) | $IQR = Q_3 - Q_1$ | For [5,7,8,8,10,12,13,15,17,20], $Q_1 = 8$, $Q_3 = 15$, $IQR = 7$ | Low sensitivity (robust) | Focuses on middle 50%, robust against skew and outliers | Ignores data outside middle 50% | Outlier detection, box plots, skewed data |
Variance | Population: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$. Sample: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ | For [6,7,8,9,10], $\bar{x} = 8$ and $s^2 = 10/4 = 2.5$ | Sensitive | Uses all data, key for advanced statistics | Units squared (harder to interpret) | ANOVA, regression, PCA, machine learning |
Standard Deviation | $\sigma = \sqrt{\sigma^2}$, $s = \sqrt{s^2}$ | For variance = 2.5, $s = \sqrt{2.5} \approx 1.58$ | Sensitive | Same units as data, widely interpretable, foundational in stats | Still sensitive to outliers | Z-scores, hypothesis testing, confidence intervals |
Mean Absolute Deviation (MAD) | $MAD = \frac{1}{n}\sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert$ | For [6,7,8,9,10], $MAD = (2+1+0+1+2)/5 = 1.2$ | Less sensitive | Easy to understand, less influenced by extremes | Less mathematically flexible than variance/SD | Forecasting errors, robust spread measure |
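The dispersion measures can be computed for the table's dataset [6,7,8,9,10] with the standard `statistics` module. A minimal sketch; the variable names are illustrative, and the MAD is computed by hand because the module has no dedicated function for it.

```python
import statistics

data = [6, 7, 8, 9, 10]  # dataset used in the table above

data_range = max(data) - min(data)           # 10 - 6 = 4
mean = statistics.mean(data)                 # 8
population_var = statistics.pvariance(data)  # 10 / 5 = 2 (divides by N)
sample_var = statistics.variance(data)       # 10 / 4 = 2.5 (divides by n - 1)
sample_sd = statistics.stdev(data)           # sqrt(2.5), about 1.58

# Mean absolute deviation: average distance from the mean, no squaring.
mad = sum(abs(x - mean) for x in data) / len(data)  # (2+1+0+1+2) / 5 = 1.2

print(data_range, population_var, sample_var, round(sample_sd, 2), mad)
```

The sample variance is larger than the population variance for the same numbers because it divides by n - 1 instead of N.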
Measures of Position: Relative Standing
Measure | Definition | Method | Example | Use Cases |
---|---|---|---|---|
Percentiles | Value below which a certain percentage of observations fall | Compute the position $i = \frac{P}{100} \cdot n$. If $i$ is whole, average the values at positions $i$ and $i+1$. If not, round $i$ up and take the value at that position | 70th percentile of 10 ordered scores: $i = 0.70 \times 10 = 7$. Avg of 7th (85) & 8th (88) = 86.5 | Ranking, benchmarking, and segmenting datasets (e.g., identifying top 10% performers) |
Quartiles | Divide data into four equal parts (Q1=25th, Q2=50th/median, Q3=75th) | Same as percentile method, with P=25, 50, 75 | Scores: Q1=72, Q2=81, Q3=88 | Used in box plots, detect skewness, compute IQR ($Q_3 - Q_1$) for spread and outlier detection |
Deciles | Divide data into ten equal parts | Same as percentile method, with P=10, 20, …, 90 | D5 = 50th percentile = Median = 81 | Provides more granular segmentation than quartiles, often used in income/wealth distribution analysis |
Z-scores | Standardized score showing how many SDs a value is from mean | $z = \frac{x - \mu}{\sigma}$ (population) or $z = \frac{x - \bar{x}}{s}$ (sample) | Score = 85, Mean=70, SD=10 → $z = \frac{85 - 70}{10} = 1.5$ | Standardizes across distributions, detects outliers ($\lvert z \rvert > 2$ or 3), enables probability-based comparisons |
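The percentile rule in the table (compute the position, then average or round up) can be sketched directly. The `scores` list below is hypothetical: it was chosen so that the results match the table's examples (86.5, 72, 81, 88), and is not the original dataset. The function names are illustrative. Statistical libraries often use interpolation rules instead, so their quartiles can differ slightly from this method.

```python
import math


def percentile(sorted_values, p):
    """Percentile via the position rule i = (P / 100) * n.

    If i is whole, average the values at positions i and i + 1 (1-based);
    otherwise round i up and take the value at that position.
    """
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    n = len(sorted_values)
    i = p * n / 100
    if i == int(i):
        i = int(i)
        return (sorted_values[i - 1] + sorted_values[i]) / 2
    return sorted_values[math.ceil(i) - 1]


def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


# Hypothetical ordered scores, chosen to reproduce the table's results.
scores = [60, 65, 72, 75, 80, 82, 85, 88, 92, 95]

p70 = percentile(scores, 70)  # i = 7 is whole: (85 + 88) / 2 = 86.5
q1 = percentile(scores, 25)   # i = 2.5, round up to 3: 72
q2 = percentile(scores, 50)   # i = 5 is whole: (80 + 82) / 2 = 81.0
q3 = percentile(scores, 75)   # i = 7.5, round up to 8: 88
z = z_score(85, 70, 10)       # (85 - 70) / 10 = 1.5

print(p70, q1, q2, q3, z)
```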
Outlier Detection via Measures of Position
Method | Formula | Use Cases |
---|---|---|
IQR Method | Outliers if values lie outside $[Q_1 - 1.5 \cdot IQR,\ Q_3 + 1.5 \cdot IQR]$ | Skewed datasets, box plot analysis |
Z-score Method | Outliers if $\lvert z \rvert > 2$ (or 3) | Approximately normal distributions |
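Both outlier rules can be sketched with the standard `statistics` module. The dataset is hypothetical (ten ordinary values plus one extreme value, 60), and the function names are illustrative. One assumption to note: `statistics.quantiles` uses its default interpolation method here, so its quartiles can differ slightly from other quartile conventions.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # default 'exclusive' method
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]


def z_outliers(values, threshold=2.0):
    """Values whose absolute z-score (sample SD) exceeds the threshold."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [x for x in values if abs((x - mean) / sd) > threshold]


# Hypothetical data with one extreme value.
values = [5, 7, 8, 8, 10, 12, 13, 15, 17, 20, 60]

by_iqr = iqr_outliers(values)    # fences are [-5.5, 30.5], so 60 is flagged
by_z2 = z_outliers(values, 2.0)  # z(60) is about 2.88, so 60 is flagged
by_z3 = z_outliers(values, 3.0)  # about 2.88 is below 3, so nothing is flagged

print(by_iqr, by_z2, by_z3)
```

The stricter threshold of 3 misses the extreme value because the outlier itself inflates the mean and standard deviation. This is one reason the IQR method is preferred for small or skewed samples.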
Data Distribution and Shape
Concept | Definition | Key Features | What It Shows | Relevance |
---|---|---|---|---|
Frequency Distributions | A table or graph showing the frequency of outcomes in a sample | Constructed with bins or categories, tally, relative and cumulative frequencies | Pattern of data frequencies across values/intervals | Foundation for histograms, highlights anomalies, aids data exploration |
Histogram | Graphical representation of numerical data distribution | Continuous bins on x-axis; frequencies/relative frequencies on y-axis; no gaps between bars | Shape, central tendency, spread, modes, outliers | Visual check of the normality assumption; assesses symmetry/skewness |
Box Plot | Visual summary using five-number summary (min, Q1, median, Q3, max) | Box (Q1 to Q3), median line, whiskers, outliers marked separately | Median, spread (IQR), symmetry, skewness, outliers | Useful for group comparison; robust to outliers |
Bar Chart | Chart for categorical data showing frequencies or proportions | Categories on x-axis; separated bars (gaps between categories) | Prevalence/popularity of categories | Ideal for qualitative data distributions |
Pie Chart | Circular graphic with slices proportional to categories | Arc length/area represents share | Proportion of categories relative to whole | Good for simple proportions (e.g., market share); weaker for precise comparisons |
Scatter Plot | Graph of two variables as points on Cartesian plane | Each point shows one observation | Correlation, relationships, clusters, outliers | Basis for regression, critical for bivariate analysis |
Skewness | Measure of asymmetry in distribution | Positive: right tail longer; Negative: left tail longer; Zero: symmetric | Highlights direction and degree of skew | Informs model assumptions, guides use of mean/median, reveals data limits |
Kurtosis | Measure of tail heaviness and peakedness | Mesokurtic: normal; Leptokurtic: sharp peak, heavy tails; Platykurtic: flat, light tails | Probability of extreme values in distribution | Critical in risk assessment, model assumptions, influence of outliers |
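Skewness and kurtosis can be computed from central moments. The sketch below assumes the population (moment-based) definitions, skewness $= m_3 / m_2^{3/2}$ and excess kurtosis $= m_4 / m_2^2 - 3$, which are one common convention; the table does not state which formula it uses. Both datasets are hypothetical and the function names are illustrative.

```python
def central_moment(values, k):
    """k-th central moment: the mean of (x - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)


def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2 ** 2 - 3 (0 for a normal curve)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


# Hypothetical samples.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 2, 3, 4, 5, 21]

skew_sym = skewness(symmetric)          # 0.0: deviations cancel out
skew_right = skewness(right_skewed)     # positive: long right tail
kurt_sym = excess_kurtosis(symmetric)   # 6.8 / 4 - 3 = -1.3 (platykurtic)

print(skew_sym, skew_right, kurt_sym)
```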
Correlation: Measuring Relationships Between Variables
Measure | Definition | Formula | Range | Interpretation | Limitations | Use Cases |
---|---|---|---|---|---|---|
Covariance | Measures how two variables change together (direction of relationship) | Population: $\sigma_{XY} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_X)(y_i - \mu_Y)$. Sample: $s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$ | $-\infty$ to $+\infty$ | Positive → variables increase/decrease together. Negative → one increases while other decreases. Zero → no linear relationship | Scale-dependent, magnitude not directly interpretable | Good first step to check direction of relationship |
Pearson Correlation (r) | Normalized measure of strength & direction of linear relationship between quantitative variables | Population: $\rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$. Sample: $r = \frac{s_{xy}}{s_x s_y}$ | $-1$ to $+1$ | +1 = perfect positive linear, -1 = perfect negative linear, 0 = no linear correlation. Strength guidelines: weak (±0.1-0.3), moderate (±0.3-0.7), strong (±0.7-1) | Sensitive to outliers, only captures linear relationships | Requires quantitative variables, linearity, no extreme outliers, homoscedasticity. Common in EDA, regression, feature selection |
Spearman's Rank Correlation ($\rho$ or $r_s$) | Non-parametric measure of strength & direction of monotonic (possibly non-linear) relationship. Uses ranks instead of raw data | $r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$, where $d_i$ = rank differences (valid when there are no tied ranks) | $-1$ to $+1$ | +1 = perfect positive monotonic, -1 = perfect negative monotonic, 0 = no monotonic relationship | Does not measure strength of linear relationship, only ranked/monotonic consistency | Works with ordinal data, monotonic but non-linear relationships, non-normal data; less sensitive to outliers |
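The three measures can be compared on a relationship that is monotonic but not linear. This is a minimal pure-Python sketch: the data (y = x squared) is hypothetical, the function names are illustrative, and the rank helper assumes there are no tied values, which is the case where the rank-difference formula for Spearman's coefficient applies.

```python
import math


def sample_covariance(x, y):
    """Sample covariance: sum of (x - mean_x)(y - mean_y) over n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def pearson_r(x, y):
    """Pearson r: covariance divided by the product of the sample SDs."""
    sd_x = math.sqrt(sample_covariance(x, x))
    sd_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (sd_x * sd_y)


def ranks(values):
    """1-based ranks; assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_rho(x, y):
    """Spearman coefficient via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); no ties."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical data: y = x ** 2, monotonic but not linear.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

cov = sample_covariance(x, y)  # 60 / 4 = 15.0
r = pearson_r(x, y)            # about 0.981: strong, but not perfectly linear
rho = spearman_rho(x, y)       # 1.0: the ranks agree exactly

print(cov, round(r, 3), rho)
```

Spearman's coefficient reaches 1.0 because y always increases with x, while Pearson's r stays below 1 because the points do not lie on a straight line.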