Statistics: Basics

Overview

Statistics is a branch of mathematics that deals with collecting, analyzing, interpreting, and presenting data. It provides methods for drawing inferences and making decisions under uncertainty, which makes it an essential tool for data-driven decision-making in many fields, including software development.

Statistical Inference and Hypothesis Testing

Statistical inference draws conclusions about a population from a sample, so it starts with how that sample is selected. Sampling techniques are methods for selecting a subset of individuals or items from a larger population for analysis. The choice of method determines how representative the sample is and, in turn, how valid the statistical conclusions are.

  • Simple random sampling: involves randomly selecting individuals or items from the population, where each member has an equal chance of being chosen. It avoids selection bias and is often used when the population is fairly homogeneous and a complete list of its members is available
  • Stratified sampling: involves dividing the population into distinct subgroups or strata based on certain characteristics and then randomly selecting samples from each stratum. It guarantees representation from every segment of the population and is useful when the strata differ significantly from one another
  • Cluster sampling: involves dividing the population into clusters or groups and then randomly selecting entire clusters to form the sample. It is practical when it's difficult or costly to obtain a complete list of individuals in the population and is commonly used in geographical or organizational studies
  • Systematic sampling: involves selecting every kth individual from a list or sequence of the population, starting from a randomly chosen point. It is straightforward to implement and, with a random start, gives each member an equal probability of selection, which makes it suitable for large populations; it can be biased if the list has a periodic pattern that lines up with k
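
The four methods above can be sketched with Python's standard `random` module. This is a minimal illustration, not a survey-grade implementation: the population of integer IDs, the even/odd strata, the ten-item clusters, and all function names are hypothetical choices made for the example.

```python
import random

def simple_random_sample(population, n, rng):
    """Draw n items without replacement; every item is equally likely."""
    return rng.sample(population, n)

def stratified_sample(population, stratum_of, n_per_stratum, rng):
    """Group items by stratum, then draw a random sample inside each group."""
    strata = {}
    for item in population:
        strata.setdefault(stratum_of(item), []).append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep every member of the chosen ones."""
    chosen = rng.sample(clusters, n_clusters)
    return [item for cluster in chosen for item in cluster]

def systematic_sample(population, k, rng):
    """Take every k-th item, starting from a random offset in [0, k)."""
    start = rng.randrange(k)
    return population[start::k]

rng = random.Random(42)        # fixed seed so the sketch is repeatable
population = list(range(100))  # hypothetical population: IDs 0..99

srs = simple_random_sample(population, 10, rng)
strat = stratified_sample(population, lambda x: x % 2, 5, rng)  # strata: even / odd
clusters = [population[i:i + 10] for i in range(0, 100, 10)]    # ten clusters of ten
clus = cluster_sample(clusters, 2, rng)
syst = systematic_sample(population, 10, rng)
```

Note that the stratified version draws the same number of items from every stratum; sampling in proportion to stratum size is an equally common design and would only change how `n_per_stratum` is computed.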

Regression Analysis and Predictive Modeling

Simple linear regression is the most basic form of regression analysis, involving a single independent variable and a dependent variable. It aims to model the relationship between the independent variable X and the dependent variable Y using a linear equation of the form:

$$Y = \beta_0 + \beta_1 X + \epsilon$$

  • Regression Line: represents the best-fitting straight line through the data points, minimizing the sum of squared differences between the observed and predicted values. The coefficients $\beta_0$ and $\beta_1$ represent the intercept and slope of the line, respectively
  • Calculating Regression Coefficients: regression coefficients $\beta_0$ (intercept) and $\beta_1$ (slope) are estimated using the method of least squares, which minimizes the sum of squared residuals (the differences between observed and predicted values)
  • Assessing Model Fit: Model fit is assessed using various metrics, including $R^2$ (coefficient of determination), which measures the proportion of variance explained by the model, and residual analysis, which examines the distribution of residuals to ensure model assumptions are met
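
As a rough sketch of the least-squares steps above, the slope can be computed as the sum of cross-deviations of X and Y divided by the sum of squared deviations of X, and the intercept as the mean of Y minus the slope times the mean of X. The data points and function names below are hypothetical and chosen only for illustration.

```python
from statistics import mean

def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def r_squared(xs, ys, intercept, slope):
    """Proportion of the variance in ys explained by the fitted line."""
    y_bar = mean(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical observations, roughly following y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_linear_regression(xs, ys)
r2 = r_squared(xs, ys, b0, b1)

# Residuals (observed minus predicted) are the input to residual analysis
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
```

With an intercept in the model, the least-squares residuals sum to zero up to floating-point error, which is a quick sanity check on the fit before looking at their distribution.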

Data Visualization

According to Edward Tufte, a pioneer in the field of data visualization, data graphics should, among other things:

  • Show the data
  • Induce the viewer to think about the substance rather than the methodology, graphic design, or technology
  • Avoid distorting what the data have to say
  • Present many numbers in a small space

General guidelines

  • Maximize the data-ink ratio: use ink only to display data, not for decoration or redundancy
  • Minimize chartjunk: avoid unnecessary elements that distract from the data, such as heavy grid lines, borders, backgrounds, or 3D effects
  • Use appropriate scales: choose scales that are proportional to the data and avoid misleading distortions, such as truncated axes on bar charts or non-linear (for example, logarithmic) scales that are not clearly labeled
  • Use appropriate colors: use colors to highlight important features or distinguish categories, not for decoration; avoid using too many colors or colors that are hard to distinguish
  • Use appropriate labels: provide clear and concise labels for axes, legends, titles, and annotations; avoid cluttering or overlapping labels
  • Use appropriate symbols: use symbols that are easy to recognize and interpret; avoid using too many symbols or symbols that are ambiguous or confusing
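
To see why a truncated axis can mislead, it may help to compare the ratio of two values with the ratio of the bar heights a reader actually sees. The sketch below uses two hypothetical values and a hypothetical axis minimum of 90; `drawn_height` is an illustrative helper, not part of any charting library.

```python
def drawn_height(value, axis_min):
    """Height of a bar, in data units, when the value axis starts at axis_min."""
    return value - axis_min

a, b = 95, 100  # hypothetical values to compare

true_ratio = b / a                                            # about 1.05
zero_based_ratio = drawn_height(b, 0) / drawn_height(a, 0)    # matches the data
truncated_ratio = drawn_height(b, 90) / drawn_height(a, 90)   # 10 / 5 = 2.0
```

With the axis starting at zero, the bars differ by the same roughly 5% as the data. With the axis starting at 90, the second bar is drawn twice as tall as the first.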

Charts and graphs

  • Bar Charts: show categorical or numerical data using horizontal or vertical bars; useful for comparing values across categories or groups
  • Pie & Donut Charts: show categorical or numerical data using circular sectors; useful for showing proportions or percentages of a whole
  • Line & Area Charts: show numerical data using points connected by lines, with the region beneath the line filled in for area charts; useful for showing trends or changes over time
  • Scatter Plots: show numerical data using points on a Cartesian plane; useful for showing relationships or correlations between two variables
  • Histograms: show numerical data using bars whose heights represent frequencies or densities; useful for showing distributions or ranges of values
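  • Box Plots: show numerical data using a box whose edges represent the first and third quartiles, a line at the median, and whiskers that extend to the most extreme values not classed as outliers (commonly those within 1.5 times the interquartile range of the box), with outliers drawn as individual points; useful for showing summary statistics or comparing distributions across groups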
  • Heat Maps: show numerical data using colors on a grid; useful for showing patterns or variations across two dimensions
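
As an illustration of what a box plot summarizes, the sketch below computes the quartiles, whisker ends, and outliers for a small hypothetical data set. It assumes the common Tukey convention (whiskers reach the most extreme points within 1.5 times the interquartile range of the box; anything beyond is an outlier) and the "inclusive" quantile method from Python's `statistics` module. Plotting libraries may use slightly different quartile definitions.

```python
from statistics import quantiles

def box_plot_stats(values):
    """Summary statistics behind a box plot, using the 1.5 * IQR whisker rule."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),    # most extreme point above the lower fence
        "whisker_high": max(inside),   # most extreme point below the upper fence
        "outliers": outliers,          # drawn as individual points
    }

data = [2, 4, 4, 5, 6, 7, 8, 9, 30]  # hypothetical measurements
stats = box_plot_stats(data)
```

For this data set, 30 lies above the upper fence and is reported as an outlier, while the whiskers stop at 2 and 9.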

Choose the right visualization

One way to choose a chart is to consider the characteristics of the data and the goals for displaying it. For example:

  • What kind of data do we have? Is it categorical or numerical? Is it discrete or continuous? Is it univariate, bivariate, or multivariate?
  • What kind of information do we want to show? Do we want to show comparisons, proportions, trends, relationships, distributions, or patterns?
  • Who is our audience? What is their level of familiarity with the data and the type of chart or graph? What is their level of interest and attention span?
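
The second question can be turned into a simple lookup. The sketch below pairs each goal with the chart types this section describes as useful for it; the dictionary keys and the `suggest_chart` function are hypothetical names, and the result is only a starting point, since the data type and the audience still matter.

```python
# Goal -> chart type, following the descriptions in the charts list above
CHART_FOR_GOAL = {
    "comparison": "bar chart",
    "proportion": "pie or donut chart",
    "trend": "line or area chart",
    "relationship": "scatter plot",
    "distribution": "histogram or box plot",
    "pattern": "heat map",
}

def suggest_chart(goal):
    """Return a chart suggestion for a display goal, or raise for unknown goals."""
    try:
        return CHART_FOR_GOAL[goal]
    except KeyError:
        known = ", ".join(sorted(CHART_FOR_GOAL))
        raise ValueError(f"unknown goal {goal!r}; expected one of: {known}") from None

suggestion = suggest_chart("trend")
```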