
ML Math

| Name | Description | Formula | Use Cases |
| --- | --- | --- | --- |
| Gradient Descent | Iterative optimization method to minimize a loss function | $\theta_{j+1} = \theta_j - \alpha \nabla J(\theta_j)$ | Training neural networks, logistic regression, and general parameter estimation |
| Normal (Gaussian) Distribution | Continuous probability distribution defined by mean and variance | $f(x\mid\mu,\sigma^2)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$ | Modeling noise, Gaussian priors, feature normalization, probabilistic models |
| Z-score | Standardizes a value relative to a distribution (mean, std) | $z=\frac{x-\mu}{\sigma}$ | Feature scaling, outlier detection, standardization before training |
| Sigmoid (Logistic) Function | S-shaped activation mapping real-valued input to $(0,1)$ | $\sigma(x)=\frac{1}{1+e^{-x}}$ | Binary classification outputs, logistic regression, binary neuron activation |
| Pearson Correlation Coefficient | Measure of linear correlation between two variables | $\mathrm{Corr}(X,Y)=\frac{\mathrm{Cov}(X,Y)}{\mathrm{Std}(X)\,\mathrm{Std}(Y)}$ | Feature selection, exploratory data analysis, detecting multicollinearity |
| Cosine Similarity | Angle-based similarity between two vectors | $\mathrm{sim}(A,B)=\frac{A\cdot B}{\lVert A\rVert\,\lVert B\rVert}$ | Text similarity, nearest neighbors in embedding spaces, recommendation systems |
| Naive Bayes (posterior with conditional independence) | Probabilistic classifier assuming feature independence | $P(y\mid x_1,\dots,x_n)=\frac{P(y)\prod_{i=1}^n P(x_i\mid y)}{P(x_1,\dots,x_n)}$ | Text classification, spam detection, quick baseline probabilistic models |
| Maximum Likelihood Estimation (MLE) | Parameter estimation by maximizing the data likelihood | $\hat{\theta}_{\mathrm{MLE}}=\operatorname{arg\,max}_{\theta}\prod_{i=1}^n P(x_i\mid\theta)$ | Estimating model parameters for many statistical models (e.g., Gaussian, logistic) |
| Ordinary Least Squares (OLS) Solution | Closed-form linear regression coefficients minimizing squared error | $\hat{\beta}=(X^{\top}X)^{-1}X^{\top}y$ | Linear regression fitting, baseline regression analysis, quick parameter estimates |
| F1 Score | Harmonic mean of precision and recall for classification | $F_1=\frac{2\cdot P\cdot R}{P+R}$ | Evaluating imbalanced classification tasks (e.g., information retrieval) |
| ReLU (Rectified Linear Unit) | Piecewise linear activation that is zero for negative inputs | $\mathrm{ReLU}(x)=\max(0,x)$ | Activation in deep neural networks; helps with sparse activations and gradient flow |
| Softmax (class probability) | Converts logits to a probability distribution over classes | $P(y=j\mid x)=\frac{\exp(x^{\top}w_j)}{\sum_{k=1}^K\exp(x^{\top}w_k)}$ | Multi-class classification outputs, final layer in classifiers, cross-entropy loss |
| Coefficient of Determination ($R^2$) | Fraction of variance explained by a regression model | $R^2=1-\frac{\sum_i (y_i-\hat{y}_i)^2}{\sum_i (y_i-\bar{y})^2}$ | Assessing goodness-of-fit for regression models |
| Mean Squared Error (MSE) | Average squared difference between predictions and targets | $\mathrm{MSE}=\frac{1}{n}\sum_{i=1}^n (y_i-\hat{y}_i)^2$ | Regression loss for training and model comparison |
| MSE with L2 Regularization (Ridge-style) | MSE augmented with an L2 penalty to shrink coefficients | $\mathrm{MSE}_{\text{reg}}=\frac{1}{n}\sum_{i=1}^n (y_i-\hat{y}_i)^2 + \lambda\sum_{j=1}^p \beta_j^2$ | Preventing overfitting, ridge regression, regularized linear models |
| Eigenvalue / Eigenvector Equation | Characterizes the directions a linear transformation only scales, and the corresponding scale factors | $Av=\lambda v$ | PCA, spectral clustering, analyzing linear operators and covariance matrices |
| (Shannon) Entropy | Measure of uncertainty or information content in a distribution | $H(X)=-\sum_i p_i\log_2 p_i$ | Feature selection, decision tree splitting, information-theoretic model comparisons |
| K-Means Objective | Sum of squared distances used to define cluster assignments | $\underset{S}{\operatorname{arg\,min}}\sum_{i=1}^k\sum_{x\in S_i}\lVert x-\mu_i\rVert^2$ | Unsupervised clustering to find centroids; pre-processing and segmentation |
| Kullback-Leibler (KL) Divergence | Asymmetric measure of difference between two probability distributions | $D_{\mathrm{KL}}(P\parallel Q)=\sum_x P(x)\log\frac{P(x)}{Q(x)}$ | Variational inference, training generative models, measuring distribution shifts |
| Log-Loss (Binary Cross-Entropy) | Negative log-likelihood for binary classification predictions | $\ell_{\text{log}}=-\frac{1}{N}\sum_{i=1}^N \bigl[y_i\log(\hat{y}_i)+(1-y_i)\log(1-\hat{y}_i)\bigr]$ | Loss for binary classifiers, logistic regression, and neural nets with sigmoid outputs |
| Support Vector Machine (hinge loss, primal) | Margin-based objective with hinge loss and regularization | $\min_{w,b}\;\frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^n \max\bigl(0,\,1-y_i(w^{\top}x_i - b)\bigr)$ | Classification with large-margin objectives; SVM training and kernel methods |
| Linear Regression (model) | Linear model expressing the target as a weighted sum of inputs | $y=\beta_0+\beta_1 x_1+\beta_2 x_2+\dots+\beta_n x_n + \varepsilon$ | Predictive modeling for continuous targets, baseline models, interpretability |
| Singular Value Decomposition (SVD) | Factorizes a matrix into singular vectors and singular values | $A=U\Sigma V^{\top}$ | Dimensionality reduction, low-rank approximations, recommender systems (matrix factorization) |
| Lagrange Multiplier (constrained optimization) | Method to optimize with equality constraints using multipliers | For the constraint $g(x)=0$: $\mathcal{L}(x,\lambda)=f(x)-\lambda\,g(x)$ | Constrained optimization in model training, dual formulations, constrained EM or SVM derivations |
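
The short Python sketches below illustrate several of the formulas in the table; the datasets, constants, and helper names they use are illustrative assumptions, not part of the reference. First, the gradient descent update applied to a toy quadratic loss $J(\theta)=\lVert\theta-t\rVert^2$, whose gradient is $2(\theta-t)$.

```python
import numpy as np

# Minimal gradient descent on a toy quadratic loss J(theta) = ||theta - t||^2,
# whose gradient is 2 * (theta - t).  The target t and the learning rate alpha
# are illustrative choices, not values from the table.
t = np.array([3.0, -1.0])

def grad_J(theta):
    return 2.0 * (theta - t)

theta = np.zeros(2)   # initial parameters theta_0
alpha = 0.1           # learning rate

for _ in range(100):
    theta = theta - alpha * grad_J(theta)   # theta_{j+1} = theta_j - alpha * grad J(theta_j)

print(theta)          # converges toward t = [3, -1]
```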
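
A minimal sketch of the Gaussian density, z-score, sigmoid, and ReLU rows as plain NumPy functions; the test values printed at the end are arbitrary.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Normal density f(x | mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def z_score(x, mu, sigma):
    """Standardized value z = (x - mu) / sigma."""
    return (x - mu) / sigma

def sigmoid(x):
    """Logistic function: 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Rectified linear unit: max(0, x)."""
    return np.maximum(0.0, x)

print(gaussian_pdf(0.0, 0.0, 1.0))   # ~0.3989, standard normal density at 0
print(z_score(2.0, 0.0, 1.0))        # 2.0
print(sigmoid(0.0), relu(-3.0))      # 0.5, 0.0
```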
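
Pearson correlation and cosine similarity written directly from their definitions; the sample vectors are made up, and population (ddof = 0) standard deviations are assumed for the correlation.

```python
import numpy as np

def pearson_corr(x, y):
    """Corr(X, Y) = Cov(X, Y) / (Std(X) * Std(Y)), population estimates."""
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

def cosine_sim(a, b):
    """sim(A, B) = (A . B) / (||A|| * ||B||)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 6.2, 7.9])
print(pearson_corr(x, y))                                       # close to 1: strong linear relationship
print(cosine_sim(np.array([1.0, 0.0]), np.array([1.0, 1.0])))   # cos(45 deg) ~ 0.707
```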
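
A toy Naive Bayes posterior for two classes and two binary features; every probability below is invented purely to show the prior-times-likelihood-over-evidence arithmetic under the independence assumption.

```python
# Toy Naive Bayes posterior: P(y | x) = P(y) * prod_i P(x_i | y) / P(x).
# All probabilities here are made up for illustration only.
prior = {"spam": 0.4, "ham": 0.6}
# P(feature_i = 1 | class), one entry per feature
likelihood = {"spam": [0.8, 0.6], "ham": [0.1, 0.3]}

x = [1, 1]  # observed binary feature values

unnormalized = {}
for c in prior:
    p = prior[c]
    for xi, p1 in zip(x, likelihood[c]):
        p *= p1 if xi == 1 else (1.0 - p1)   # P(x_i | y), independence assumption
    unnormalized[c] = p

evidence = sum(unnormalized.values())        # P(x_1, ..., x_n)
posterior = {c: v / evidence for c, v in unnormalized.items()}
print(posterior)                             # spam dominates for x = [1, 1]
```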
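
For the MLE row, a sketch using the Gaussian case, where the likelihood maximizer is known in closed form (sample mean and the 1/n variance); the simulated data just checks that the estimates recover the generating parameters.

```python
import numpy as np

# Gaussian MLE: mu_hat is the sample mean, sigma2_hat is the (biased, 1/n)
# sample variance.  The generating parameters below are arbitrary.
rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.5, size=10_000)

mu_hat = data.mean()
sigma2_hat = np.mean((data - mu_hat) ** 2)   # equivalently data.var(ddof=0)
print(mu_hat, sigma2_hat)                    # close to 2.0 and 1.5**2 = 2.25
```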
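
The OLS closed form, the plain MSE, and the L2-regularized MSE on a small synthetic dataset; the data, the lambda value, and the choice to leave the intercept out of the penalty are all illustrative assumptions.

```python
import numpy as np

# OLS closed form beta_hat = (X^T X)^{-1} X^T y, then MSE and MSE + L2 penalty.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])   # intercept + one feature
true_beta = np.array([1.0, 2.0])
y = X @ true_beta + 0.1 * rng.normal(size=50)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solves (X^T X) beta = X^T y
y_hat = X @ beta_hat

mse = np.mean((y - y_hat) ** 2)
lam = 0.1
mse_reg = mse + lam * np.sum(beta_hat[1:] ** 2)   # penalize the slope, not the intercept

print(beta_hat)        # close to [1.0, 2.0]
print(mse, mse_reg)
```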
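
Softmax over a logit vector and binary cross-entropy; the max-shift and the probability clipping are standard numerical-stability details that the table's formulas omit.

```python
import numpy as np

def softmax(logits):
    """P(y = j | x) = exp(z_j) / sum_k exp(z_k), with a max-shift for stability."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def log_loss(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy, clipping probabilities to avoid log(0)."""
    p = np.clip(y_prob, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

print(softmax(np.array([2.0, 1.0, 0.1])))                        # sums to 1
print(log_loss(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))  # small loss for good predictions
```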
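
F1 score and $R^2$ computed from their definitions on tiny hand-made label and prediction arrays; no zero-division guards are included, so the inputs are chosen to avoid empty denominators.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 = 2 * P * R / (P + R) for binary labels in {0, 1}."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

print(f1_score(np.array([1, 0, 1, 1, 0]), np.array([1, 0, 0, 1, 0])))   # 0.8
print(r_squared(np.array([3.0, 5.0, 7.0]), np.array([2.9, 5.2, 6.8])))  # close to 1
```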
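
Shannon entropy (in bits, matching the log base 2 in the table) and KL divergence (natural log here); both skip zero-probability entries, and the example distributions are arbitrary.

```python
import numpy as np

def entropy_bits(p):
    """H(X) = -sum_i p_i log2 p_i, ignoring zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x)); assumes q > 0 wherever p > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
print(entropy_bits(p))        # 1.0 bit for a fair coin
print(kl_divergence(p, q))    # > 0, and not equal to D_KL(Q || P)
```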
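
The K-means objective evaluated for a fixed assignment: given points, labels, and centroids, it is just the within-cluster sum of squared distances. The two-blob data and its labels are hand-picked.

```python
import numpy as np

def kmeans_objective(X, labels, centroids):
    """Within-cluster sum of squares: sum_i sum_{x in S_i} ||x - mu_i||^2."""
    diffs = X - centroids[labels]   # each point minus its assigned centroid
    return np.sum(diffs ** 2)

# Toy data: two obvious blobs with the assignments and centroids a K-means run might produce.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([X[labels == 0].mean(axis=0), X[labels == 1].mean(axis=0)])

print(kmeans_objective(X, labels, centroids))   # small, since the clusters are tight
```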
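
The SVM primal objective evaluated at a candidate $(w, b)$; labels are in $\{-1, +1\}$, and the weight vector, offset, and $C$ are arbitrary choices, so this only shows how the hinge term and the regularizer combine.

```python
import numpy as np

def svm_primal_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i - b)), labels in {-1, +1}."""
    margins = y * (X @ w - b)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * np.dot(w, w) + C * hinge.sum()

X = np.array([[2.0, 2.0], [-2.0, -1.0], [0.2, 0.1]])
y = np.array([1.0, -1.0, 1.0])
w = np.array([1.0, 1.0])
print(svm_primal_objective(w, 0.0, X, y, C=1.0))   # 1.7: only the third point violates the margin
```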
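
Finally, the eigenvalue equation and the SVD checked numerically with NumPy's linear-algebra routines on a small symmetric matrix.

```python
import numpy as np

# Verify the defining identities A v = lambda v and A = U Sigma V^T numerically.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

eigvals, eigvecs = np.linalg.eigh(A)     # eigh: symmetric/Hermitian eigensolver
v, lam = eigvecs[:, 0], eigvals[0]
print(np.allclose(A @ v, lam * v))       # True: A v = lambda v

U, s, Vt = np.linalg.svd(A)
print(np.allclose(A, U @ np.diag(s) @ Vt))   # True: A = U Sigma V^T
```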