Statistics: Basics
Overview
- Definition
- Use Cases
- Terminology
Statistics is a branch of mathematics that deals with collecting, analyzing, interpreting, and presenting data. It provides methods for making inferences and decisions in the face of uncertainty, making it an indispensable tool for data-driven decision-making in various fields, including software development.
Role of Statistics in Data-Driven Decision-Making:
- Data Collection: Statistics provides methods for collecting data systematically and efficiently, ensuring that the data gathered is representative and reliable
- Data Analysis: Once data is collected, statistics offers a range of techniques for analyzing and summarizing the data, including descriptive statistics, inferential statistics, and data visualization
- Inference: Statistical inference allows developers to draw conclusions or make predictions about a population based on a sample of data. This is essential for understanding patterns, trends, and relationships within the data
- Decision Making: Statistics provides a framework for making decisions in the presence of uncertainty. By quantifying uncertainty through measures like confidence intervals and hypothesis testing, developers can make informed decisions that minimize risk and maximize opportunities
Importance of Statistical Concepts in Software Development and Data Analysis:
- Performance Optimization: Statistical techniques are used to analyze the performance of algorithms and data structures, identify bottlenecks, and optimize code for speed and efficiency
- Data Analysis and Machine Learning: Statistical methods underpin many machine learning algorithms, including regression, classification, clustering, and anomaly detection. These algorithms are used for tasks such as predicting user behavior, recommending products, and detecting fraud
- Quality Assurance: Statistics plays a vital role in quality assurance and testing processes. Techniques like hypothesis testing and statistical process control are used to ensure the reliability and stability of software systems
- User Experience (UX) Design: Statistical analysis of user interactions and feedback is essential for improving user experience. A deep understanding of user behavior, preferences, and satisfaction levels helps developers design intuitive and user-friendly interfaces
- Population vs. Sample: data analysis can involve populations, which consist of all individuals or events of interest, but analyzing entire populations is often impractical, leading to the use of samples, representative subsets of populations
- Descriptive vs. Inferential Statistics: descriptive statistics summarize data, aiding in understanding and interpretation, while inferential statistics enable predictions or inferences about populations based on sample data, extending analysis beyond observed data
- Measures of Central Tendency (Mean, Median, Mode): measures of central tendency describe the central or typical value of a dataset. They provide insight into where the data tends to cluster
- Mean: also known as the average, is calculated by summing all the values in the dataset and dividing by the total number of values. It's sensitive to extreme values and is commonly used for normally distributed data
- Median: is the middle value of a dataset when arranged in ascending or descending order. It's less affected by outliers compared to the mean and provides a better representation of the central tendency for skewed distributions
- Mode: is the value that appears most frequently in the dataset. It's useful for categorical or nominal data and can indicate the most common outcome
- Measures of Dispersion (Variance, Standard Deviation, Range): quantify the spread or variability of data points around the central tendency. They provide insight into the consistency and variability of the dataset
- Variance: measures the average squared deviation of each data point from the mean. It gives an indication of how spread out the data points are from the mean
- Standard Deviation: is the square root of the variance. It's often used as a measure of the dispersion of data points around the mean. A higher standard deviation indicates greater variability in the dataset
- Range: is the difference between the maximum and minimum values in the dataset. It provides a simple measure of the spread of the data but is sensitive to outliers
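A minimal sketch of how these summary statistics might be computed with Python's standard library `statistics` module; the response-time values are made up purely for illustration:

```python
import statistics

# Illustrative response times in milliseconds (made-up sample data)
response_times = [120, 135, 118, 120, 250, 122, 130, 119, 121, 120]

# Measures of central tendency
mean = statistics.mean(response_times)      # sensitive to the 250 ms outlier
median = statistics.median(response_times)  # robust to the outlier
mode = statistics.mode(response_times)      # most frequent value (120)

# Measures of dispersion
variance = statistics.variance(response_times)  # sample variance (n - 1 denominator)
std_dev = statistics.stdev(response_times)      # square root of the variance
value_range = max(response_times) - min(response_times)

print(f"mean={mean:.1f}, median={median}, mode={mode}")
print(f"variance={variance:.1f}, std dev={std_dev:.1f}, range={value_range}")
```

Note how the single 250 ms outlier pulls the mean above the median, which is exactly why the median is often preferred for skewed data.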
Statistical Inference and Hypothesis Testing
- Sampling Techniques
- Estimation
- Hypothesis Testing
Sampling techniques are methods used to select a subset of individuals or items from a larger population for analysis. Understanding different sampling methods is crucial for ensuring the representativeness of the sample and the validity of statistical conclusions.
- Simple random sampling: involves randomly selecting individuals or items from the population, where each member has an equal chance of being chosen. It ensures unbiased representation and is often used when the population is homogeneous and easily accessible
- Stratified sampling: involves dividing the population into distinct subgroups or strata based on certain characteristics and then randomly selecting samples from each stratum. It ensures representation from different segments of the population and is useful when there are significant variations within the population
- Cluster sampling: involves dividing the population into clusters or groups and then randomly selecting entire clusters to form the sample. It is practical when it's difficult or costly to obtain a complete list of individuals in the population and is commonly used in geographical or organizational studies
- Systematic sampling: involves selecting every k-th individual from a list or sequence of the population. It is straightforward to implement and ensures equal probability of selection, making it suitable for large populations when a random starting point is chosen
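As a rough illustration, the sketch below draws simple random, stratified, and systematic samples from a hypothetical population of (user_id, region) records using only the Python standard library; the data and the region strata are assumptions made for the example:

```python
import random
from collections import defaultdict

# Hypothetical population: (user_id, region) pairs
population = [(i, random.choice(["eu", "us", "apac"])) for i in range(1000)]

# Simple random sampling: every member has an equal chance of selection
simple_sample = random.sample(population, k=50)

# Stratified sampling: sample proportionally from each region (stratum)
strata = defaultdict(list)
for user in population:
    strata[user[1]].append(user)
stratified_sample = []
for region, members in strata.items():
    n = max(1, round(50 * len(members) / len(population)))
    stratified_sample.extend(random.sample(members, k=n))

# Systematic sampling: every k-th member after a random starting point
k = len(population) // 50
start = random.randrange(k)
systematic_sample = population[start::k]
```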
Estimation involves using sample data to make inferences or predictions about population parameters. Developers often use point estimation and interval estimation to estimate unknown population parameters.
- Point estimation: involves using a single value, typically the sample mean or proportion, to estimate the population parameter. It provides a point estimate that serves as the best guess for the true value of the parameter based on the sample data
- Interval Estimation (Confidence Intervals): involves constructing a range of values, known as a confidence interval, that is likely to contain the true population parameter with a specified level of confidence. Confidence intervals provide a measure of uncertainty around the point estimate and help assess the precision of the estimate
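A minimal sketch of point and interval estimation for a population mean, assuming SciPy is available; the latency sample is invented for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical sample of request latencies in milliseconds
sample = np.array([102, 98, 110, 95, 107, 101, 99, 104, 96, 108])

# Point estimate: the sample mean is the best single guess for the population mean
point_estimate = sample.mean()

# Interval estimate: 95% confidence interval based on the t distribution
sem = stats.sem(sample)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1,
                                   loc=point_estimate, scale=sem)

print(f"point estimate: {point_estimate:.1f} ms")
print(f"95% CI: ({ci_low:.1f}, {ci_high:.1f}) ms")
```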
Hypothesis testing is a statistical method used to make decisions or draw conclusions about population parameters based on sample data. It involves formulating null and alternative hypotheses, conducting statistical tests, and interpreting the results.
- Null and Alternative Hypotheses: the null hypothesis represents the status quo or the assumption to be tested, while the alternative hypothesis represents the claim that contradicts the null hypothesis. Hypothesis testing assesses the evidence against the null hypothesis and determines whether there is sufficient evidence to reject it in favor of the alternative hypothesis
- Types of Errors (Type I and Type II): two types of errors can occur. A Type I error occurs when the null hypothesis is rejected even though it is true, leading to a false positive conclusion. A Type II error occurs when the null hypothesis is not rejected even though it is false, leading to a false negative conclusion
- Statistical Tests (t-tests, Chi-square tests, ANOVA): various statistical tests are available for hypothesis testing, depending on the nature of the data and the research question. Commonly used tests include t-tests for comparing means, chi-square tests for testing independence or goodness of fit, and analysis of variance (ANOVA) for comparing multiple group means
- P-values and Significance Levels: the p-value represents the probability of observing the sample data, or more extreme data, under the assumption that the null hypothesis is true. The significance level (α) is the threshold used to determine whether the p-value is sufficiently low to reject the null hypothesis. A common significance level is 0.05, indicating a 5% chance of making a Type I error
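The sketch below runs a two-sample t-test with SciPy to compare response times before and after a hypothetical code change; the data and the 0.05 significance level are assumptions chosen for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical response times (ms) before and after an optimization
before = np.array([120, 125, 130, 118, 127, 122, 129, 124])
after = np.array([115, 117, 121, 113, 119, 116, 120, 114])

# Null hypothesis: the two groups have equal means
# Alternative hypothesis: the means differ
t_stat, p_value = stats.ttest_ind(before, after)

alpha = 0.05  # significance level: the Type I error rate we are willing to accept
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject the null hypothesis")
```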
Regression Analysis and Predictive Modeling
- Simple Linear Regression
- Multiple Linear Regression
- Applications of Regression Analysis in Software Development
Simple linear regression is the most basic form of regression analysis, involving a single independent variable and a dependent variable. It aims to model the relationship between the independent variable X and the dependent variable Y using a linear equation of the form Y = β₀ + β₁X + ε, where β₀ is the intercept, β₁ is the slope, and ε is the error term.
- Regression Line: represents the best-fitting straight line through the data points, minimizing the sum of squared differences between the observed and predicted values. The coefficients β₀ and β₁ represent the intercept and slope of the line, respectively
- Calculating Regression Coefficients: the regression coefficients β₀ (intercept) and β₁ (slope) are estimated using the method of least squares, which minimizes the sum of squared residuals (the differences between observed and predicted values)
- Assessing Model Fit: model fit is assessed using various metrics, including R² (the coefficient of determination), which measures the proportion of variance explained by the model, and residual analysis, which examines the distribution of residuals to ensure model assumptions are met
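A minimal sketch of simple linear regression fitted by least squares with NumPy, including the R² model-fit metric; the data points are invented for the example:

```python
import numpy as np

# Hypothetical data: request payload size (KB) vs. processing time (ms)
x = np.array([10, 20, 30, 40, 50, 60, 70, 80], dtype=float)
y = np.array([15, 24, 33, 44, 52, 61, 72, 79], dtype=float)

# Least-squares estimates of the slope (b1) and intercept (b0)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Predicted values and R^2 (proportion of variance explained)
y_hat = b0 + b1 * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"y = {b0:.2f} + {b1:.2f} * x, R^2 = {r_squared:.3f}")
```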
Multiple linear regression extends simple linear regression to include multiple independent variables. It aims to model the relationship between the dependent variable and two or more independent variables using a linear equation of the form Y = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ + ε.
- Extending Simple Regression to Multiple Predictors: Multiple linear regression allows for the inclusion of multiple predictors in the model, enabling more complex relationships to be captured. Each coefficient represents the change in the dependent variable for a one-unit change in the corresponding independent variable, holding other variables constant
- Interpretation of Coefficients: interpreting coefficients in multiple regression involves understanding the direction and magnitude of the relationship between each independent variable and the dependent variable. Positive coefficients indicate a positive relationship, while negative coefficients indicate a negative relationship
- Multicollinearity and Model Assumptions: multicollinearity occurs when independent variables are highly correlated with each other, leading to unstable coefficient estimates and inflated standard errors. Model assumptions, including linearity, independence of errors, and homoscedasticity, should be assessed to ensure the validity of the regression model
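A sketch of multiple linear regression using NumPy's least-squares solver; the predictors (payload size and concurrent users) and the response values are assumptions made for illustration:

```python
import numpy as np

# Hypothetical predictors: payload size (KB) and concurrent users
X = np.array([[10, 5], [20, 8], [30, 4], [40, 12], [50, 7], [60, 15]], dtype=float)
y = np.array([18, 30, 35, 60, 58, 85], dtype=float)  # response time (ms)

# Add a column of ones so the first coefficient is the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# Solve the least-squares problem for [b0, b1, b2]
coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
b0, b1, b2 = coeffs

# Each slope is the expected change in y for a one-unit change in that
# predictor, holding the other predictor constant
print(f"y = {b0:.2f} + {b1:.2f} * payload + {b2:.2f} * users")
```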
Regression analysis has various applications in software development, including predictive modeling for software performance and regression testing for quality assurance.
- Predictive Modeling for Software Performance: regression models can be used to predict software performance metrics, such as response times, memory usage, or system throughput, based on input variables
- Regression Testing and Quality Assurance: regression analysis is also used in regression testing, a software testing technique that verifies whether recent changes to the codebase have affected the behavior of the software
Data Visualization
- Principles
- Graphical Techniques
- Exploratory Data Analysis (EDA)
According to Edward Tufte, a pioneer in the field of data visualization, there are four main goals of data graphics:
- Show the data
- Induce the viewer to think about the substance rather than the methodology, graphic design, or technology
- Avoid distorting what the data have to say
- Present many numbers in a small space
General guidelines
- Maximize the data-ink ratio: use ink only to display data, not for decoration or redundancy
- Minimize chartjunk: avoid unnecessary elements that distract from the data, such as grid lines, borders, backgrounds, or 3D effects
- Use appropriate scales: choose scales that are proportional to the data and avoid misleading distortions, such as truncated axes or non-linear transformations
- Use appropriate colors: use colors to highlight important features or categories, not for decoration or emphasis; avoid using too many colors or colors that are hard to distinguish
- Use appropriate labels: provide clear and concise labels for axes, legends, titles, and annotations; avoid cluttering or overlapping labels
- Use appropriate symbols: use symbols that are easy to recognize and interpret; avoid using too many symbols or symbols that are ambiguous or confusing
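As a rough illustration of these guidelines, the Matplotlib sketch below strips non-essential decoration (spines, background clutter) and keeps the labels clear; the deployment counts are made up:

```python
import matplotlib.pyplot as plt

# Hypothetical monthly deployment counts
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
deployments = [12, 18, 15, 22, 27, 25]

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(months, deployments, color="steelblue")

# Maximize the data-ink ratio: drop decorative spines
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)

# Clear, concise labels; no redundant legend for a single series
ax.set_title("Deployments per month")
ax.set_ylabel("Deployments")

plt.tight_layout()
plt.show()
```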
Charts and graphs
- Bar Charts: show categorical or numerical data using horizontal or vertical bars; useful for comparing values across categories or groups
- Pie & Donut Charts: show categorical or numerical data using circular sectors; useful for showing proportions or percentages of a whole
- Line & Area Charts: show numerical data using points connected by lines; useful for showing trends or changes over time
- Scatter Plots: show numerical data using points on a Cartesian plane; useful for showing relationships or correlations between two variables
- Histograms: show numerical data using bars whose heights represent frequencies or densities; useful for showing distributions or ranges of values
- Box Plots: show numerical data using boxes whose edges represent quartiles and whiskers that represent outliers; useful for showing summary statistics or comparing distributions across groups
- Heat Maps: show numerical data using colors on a grid; useful for showing patterns or variations across two dimensions
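A short sketch showing a few of these chart types side by side with Matplotlib; the datasets are randomly generated placeholders:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
values = rng.normal(loc=50, scale=10, size=500)  # numerical data for the histogram/box plot
x = rng.uniform(0, 100, size=100)
y = 0.5 * x + rng.normal(scale=10, size=100)     # correlated data for the scatter plot

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].hist(values, bins=20)       # distribution of values
axes[0].set_title("Histogram")

axes[1].boxplot(values)             # quartiles and outliers
axes[1].set_title("Box plot")

axes[2].scatter(x, y, s=10)         # relationship between two variables
axes[2].set_title("Scatter plot")

plt.tight_layout()
plt.show()
```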
Choosing the right visualization
One way is to consider the characteristics of our data and our goals for displaying it. For example:
- What kind of data do we have? Is it categorical or numerical? Is it discrete or continuous? Is it univariate, bivariate, or multivariate?
- What kind of information do we want to show? Do we want to show comparisons, proportions, trends, relationships, distributions, or patterns?
- Who is our audience? What is their level of familiarity with the data and the type of chart or graph? What is their level of interest and attention span?
- Histograms and Frequency Distributions: display the distribution of numerical data by dividing it into intervals (bins) and representing the frequency of observations in each bin with bars. They provide a visual summary of the data's central tendency, spread, and shape
- Boxplots and Violin Plots: Boxplots (also known as box-and-whisker plots) summarize the distribution of numerical data by displaying the median, quartiles, and outliers. Violin plots combine boxplots with kernel density estimation to provide a more detailed view of the data's distribution, especially for multimodal distributions
- Scatterplots and Correlation Analysis: visualize the relationship between two numerical variables by plotting their values on a two-dimensional graph. They help identify patterns, trends, and correlations between variables. Correlation analysis quantifies the strength and direction of the linear relationship between two variables using correlation coefficients such as Pearson's correlation coefficient
- Summary Statistics and Data Summarization: summary statistics, such as mean, median, standard deviation, and percentiles, provide a concise overview of the data's central tendency, spread, and variability. Data summarization techniques, including pivot tables and cross-tabulations, help explore relationships between variables and identify trends or patterns
- Identifying Outliers and Anomalies: outliers are data points that deviate significantly from the rest of the data and may indicate errors, anomalies, or interesting phenomena. Visual inspection, statistical methods (e.g., z-scores), and machine learning algorithms (e.g., isolation forests) can be used to detect outliers and investigate their causes
- Data Cleaning and Preprocessing Techniques: involves identifying and correcting errors, missing values, and inconsistencies in the data to ensure its quality and reliability. Preprocessing techniques, such as normalization, scaling, and feature engineering, prepare the data for analysis and modeling by addressing issues such as varying scales, skewness, and multicollinearity
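A minimal EDA sketch with pandas covering summary statistics, a correlation check, and z-score-based outlier flagging; the column names and data are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset of request metrics (column names are illustrative)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "latency_ms": rng.normal(100, 15, size=200),
    "payload_kb": rng.normal(40, 8, size=200),
})
df.loc[0, "latency_ms"] = 400  # inject an obvious outlier

# Summary statistics: central tendency, spread, percentiles
print(df.describe())

# Correlation between numerical variables (Pearson by default)
print(df.corr())

# Flag outliers whose z-score exceeds 3 standard deviations
z_scores = (df["latency_ms"] - df["latency_ms"].mean()) / df["latency_ms"].std()
outliers = df[np.abs(z_scores) > 3]
print(outliers)
```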