
ML Math

| Name | Description | Formula | Use Cases |
| --- | --- | --- | --- |
| Gradient Descent | Iterative optimization method to minimize a loss function | $\theta_{j+1} = \theta_j - \alpha \nabla J(\theta_j)$ | Training neural networks, logistic regression, and general parameter estimation |
| Normal (Gaussian) Distribution | Continuous probability distribution defined by mean and variance | $f(x\mid\mu,\sigma^2)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$ | Modeling noise, Gaussian priors, feature normalization, probabilistic models |
| Z-score | Standardizes a value relative to a distribution (mean, std) | $z=\frac{x-\mu}{\sigma}$ | Feature scaling, outlier detection, standardization before training |
| Sigmoid (Logistic) Function | S-shaped activation mapping real-valued input to $(0,1)$ | $\sigma(x)=\frac{1}{1+e^{-x}}$ | Binary classification outputs, logistic regression, binary neuron activation |
| Pearson Correlation Coefficient | Measure of linear correlation between two variables | $\mathrm{Corr}(X,Y)=\frac{\mathrm{Cov}(X,Y)}{\mathrm{Std}(X)\,\mathrm{Std}(Y)}$ | Feature selection, exploratory data analysis, detecting multicollinearity |
| Cosine Similarity | Angle-based similarity between two vectors | $\mathrm{sim}(A,B)=\frac{A\cdot B}{\lVert A\rVert\,\lVert B\rVert}$ | Text similarity, nearest neighbors in embedding spaces, recommendation systems |
| Naive Bayes (posterior with conditional independence) | Probabilistic classifier assuming feature independence | $P(y\mid x_1,\dots,x_n)=\frac{P(y)\prod_{i=1}^n P(x_i\mid y)}{P(x_1,\dots,x_n)}$ | Text classification, spam detection, quick baseline probabilistic models |
| Maximum Likelihood Estimation (MLE) | Parameter estimation by maximizing the data likelihood | $\hat{\theta}_{\mathrm{MLE}}=\operatorname{arg\,max}_{\theta}\prod_{i=1}^n P(x_i\mid\theta)$ | Estimating model parameters for many statistical models (e.g., Gaussian, logistic) |
| Ordinary Least Squares (OLS) Solution | Closed-form linear regression coefficients minimizing squared error | $\hat{\beta}=(X^{\top}X)^{-1}X^{\top}y$ | Linear regression fitting, baseline regression analysis, quick parameter estimates |
| F1 Score | Harmonic mean of precision and recall for classification | $F_1=\frac{2\cdot P\cdot R}{P+R}$ | Evaluating imbalanced classification tasks (e.g., information retrieval) |
| ReLU (Rectified Linear Unit) | Piecewise linear activation that is zero for negative inputs | $\mathrm{ReLU}(x)=\max(0,x)$ | Activation in deep neural networks; helps with sparse activations and gradient flow |
| Softmax (class probability) | Converts logits to a probability distribution over classes | $P(y=j\mid x)=\frac{\exp(x^{\top}w_j)}{\sum_{k=1}^K\exp(x^{\top}w_k)}$ | Multi-class classification outputs, final layer in classifiers, cross-entropy loss |
| Coefficient of Determination ($R^2$) | Fraction of variance explained by a regression model | $R^2=1-\frac{\sum_i (y_i-\hat{y}_i)^2}{\sum_i (y_i-\bar{y})^2}$ | Assessing goodness-of-fit for regression models |
| Mean Squared Error (MSE) | Average squared difference between predictions and targets | $\mathrm{MSE}=\frac{1}{n}\sum_{i=1}^n (y_i-\hat{y}_i)^2$ | Regression loss for training and model comparison |
| MSE with L2 Regularization (Ridge-style) | MSE augmented with an L2 penalty to shrink coefficients | $\mathrm{MSE}_{\text{reg}}=\frac{1}{n}\sum_{i=1}^n (y_i-\hat{y}_i)^2 + \lambda\sum_{j=1}^p \beta_j^2$ | Preventing overfitting, ridge regression, regularized linear models |
| Eigenvalue / Eigenvector Equation | Characterizes the directions a linear transformation only scales, and the corresponding scale factors | $Av=\lambda v$ | PCA, spectral clustering, analyzing linear operators and covariance matrices |
| (Shannon) Entropy | Measure of uncertainty or information content in a distribution | $H(X)=-\sum_i p_i\log_2 p_i$ | Feature selection, decision tree splitting, information-theoretic model comparisons |
| K-Means Objective | Sum of squared distances used to define cluster assignments | $\underset{S}{\operatorname{arg\,min}}\sum_{i=1}^k\sum_{x\in S_i}\lVert x-\mu_i\rVert^2$ | Unsupervised clustering to find centroids; pre-processing and segmentation |
| Kullback-Leibler (KL) Divergence | Asymmetric measure of difference between two probability distributions | $D_{\mathrm{KL}}(P\parallel Q)=\sum_x P(x)\log\frac{P(x)}{Q(x)}$ | Variational inference, training generative models, measuring distribution shifts |
| Log-Loss (Binary Cross-Entropy) | Negative log-likelihood for binary classification predictions | $\ell_{\text{log}}=-\frac{1}{N}\sum_{i=1}^N \bigl[y_i\log(\hat{y}_i)+(1-y_i)\log(1-\hat{y}_i)\bigr]$ | Loss for binary classifiers, logistic regression, and neural nets with sigmoid outputs |
| Support Vector Machine (hinge loss, primal) | Margin-based objective with hinge loss and regularization | $\min_{w,b}\;\frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^n \max\bigl(0,\,1-y_i(w^{\top}x_i - b)\bigr)$ | Classification with large-margin objectives; SVM training and kernel methods |
| Linear Regression (model) | Linear model expressing the target as a weighted sum of inputs | $y=\beta_0+\beta_1 x_1+\beta_2 x_2+\dots+\beta_n x_n + \varepsilon$ | Predictive modeling for continuous targets, baseline models, interpretability |
| Singular Value Decomposition (SVD) | Factorizes a matrix into singular vectors and singular values | $A=U\Sigma V^{\top}$ | Dimensionality reduction, low-rank approximations, recommender systems (matrix factorization) |
| Lagrange Multiplier (constrained optimization) | Method to optimize with equality constraints using multipliers | For the constraint $g(x)=0$: $\mathcal{L}(x,\lambda)=f(x)-\lambda\,g(x)$ | Constrained optimization in model training, dual formulations, constrained EM or SVM derivations |
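
The short Python sketches below illustrate several of the formulas in the table; the datasets, constants, and helper names they use are illustrative assumptions, not part of the reference. First, the gradient descent update applied to a toy quadratic loss $J(\theta)=\lVert\theta-t\rVert^2$, whose gradient is $2(\theta-t)$.

```python
import numpy as np

# Minimal gradient descent on a toy quadratic loss J(theta) = ||theta - t||^2,
# whose gradient is 2 * (theta - t).  The target t and the learning rate alpha
# are illustrative choices, not values from the table.
t = np.array([3.0, -1.0])

def grad_J(theta):
    return 2.0 * (theta - t)

theta = np.zeros(2)   # initial parameters theta_0
alpha = 0.1           # learning rate

for _ in range(100):
    theta = theta - alpha * grad_J(theta)   # theta_{j+1} = theta_j - alpha * grad J(theta_j)

print(theta)          # converges toward t = [3, -1]
```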
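
A minimal sketch of the Gaussian density, z-score, sigmoid, and ReLU rows as plain NumPy functions; the test values printed at the end are arbitrary.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Normal density f(x | mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def z_score(x, mu, sigma):
    """Standardized value z = (x - mu) / sigma."""
    return (x - mu) / sigma

def sigmoid(x):
    """Logistic function: 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Rectified linear unit: max(0, x)."""
    return np.maximum(0.0, x)

print(gaussian_pdf(0.0, 0.0, 1.0))   # ~0.3989, standard normal density at 0
print(z_score(2.0, 0.0, 1.0))        # 2.0
print(sigmoid(0.0), relu(-3.0))      # 0.5, 0.0
```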
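
Pearson correlation and cosine similarity written directly from their definitions; the sample vectors are made up, and population (ddof = 0) standard deviations are assumed for the correlation.

```python
import numpy as np

def pearson_corr(x, y):
    """Corr(X, Y) = Cov(X, Y) / (Std(X) * Std(Y)), population estimates."""
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

def cosine_sim(a, b):
    """sim(A, B) = (A . B) / (||A|| * ||B||)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 6.2, 7.9])
print(pearson_corr(x, y))                                       # close to 1: strong linear relationship
print(cosine_sim(np.array([1.0, 0.0]), np.array([1.0, 1.0])))   # cos(45 deg) ~ 0.707
```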
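
A toy Naive Bayes posterior for two classes and two binary features; every probability below is invented purely to show the prior-times-likelihood-over-evidence arithmetic under the independence assumption.

```python
# Toy Naive Bayes posterior: P(y | x) = P(y) * prod_i P(x_i | y) / P(x).
# All probabilities here are made up for illustration only.
prior = {"spam": 0.4, "ham": 0.6}
# P(feature_i = 1 | class), one entry per feature
likelihood = {"spam": [0.8, 0.6], "ham": [0.1, 0.3]}

x = [1, 1]  # observed binary feature values

unnormalized = {}
for c in prior:
    p = prior[c]
    for xi, p1 in zip(x, likelihood[c]):
        p *= p1 if xi == 1 else (1.0 - p1)   # P(x_i | y), independence assumption
    unnormalized[c] = p

evidence = sum(unnormalized.values())        # P(x_1, ..., x_n)
posterior = {c: v / evidence for c, v in unnormalized.items()}
print(posterior)                             # spam dominates for x = [1, 1]
```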
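
For the MLE row, a sketch using the Gaussian case, where the likelihood maximizer is known in closed form (sample mean and the 1/n variance); the simulated data just checks that the estimates recover the generating parameters.

```python
import numpy as np

# Gaussian MLE: mu_hat is the sample mean, sigma2_hat is the (biased, 1/n)
# sample variance.  The generating parameters below are arbitrary.
rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.5, size=10_000)

mu_hat = data.mean()
sigma2_hat = np.mean((data - mu_hat) ** 2)   # equivalently data.var(ddof=0)
print(mu_hat, sigma2_hat)                    # close to 2.0 and 1.5**2 = 2.25
```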
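
The OLS closed form, the plain MSE, and the L2-regularized MSE on a small synthetic dataset; the data, the lambda value, and the choice to leave the intercept out of the penalty are all illustrative assumptions.

```python
import numpy as np

# OLS closed form beta_hat = (X^T X)^{-1} X^T y, then MSE and MSE + L2 penalty.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])   # intercept + one feature
true_beta = np.array([1.0, 2.0])
y = X @ true_beta + 0.1 * rng.normal(size=50)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solves (X^T X) beta = X^T y
y_hat = X @ beta_hat

mse = np.mean((y - y_hat) ** 2)
lam = 0.1
mse_reg = mse + lam * np.sum(beta_hat[1:] ** 2)   # penalize the slope, not the intercept

print(beta_hat)        # close to [1.0, 2.0]
print(mse, mse_reg)
```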
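
Softmax over a logit vector and binary cross-entropy; the max-shift and the probability clipping are standard numerical-stability details that the table's formulas omit.

```python
import numpy as np

def softmax(logits):
    """P(y = j | x) = exp(z_j) / sum_k exp(z_k), with a max-shift for stability."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def log_loss(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy, clipping probabilities to avoid log(0)."""
    p = np.clip(y_prob, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

print(softmax(np.array([2.0, 1.0, 0.1])))                        # sums to 1
print(log_loss(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))  # small loss for good predictions
```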
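
F1 score and $R^2$ computed from their definitions on tiny hand-made label and prediction arrays; no zero-division guards are included, so the inputs are chosen to avoid empty denominators.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 = 2 * P * R / (P + R) for binary labels in {0, 1}."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

print(f1_score(np.array([1, 0, 1, 1, 0]), np.array([1, 0, 0, 1, 0])))   # 0.8
print(r_squared(np.array([3.0, 5.0, 7.0]), np.array([2.9, 5.2, 6.8])))  # close to 1
```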
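
Shannon entropy (in bits, matching the log base 2 in the table) and KL divergence (natural log here); both skip zero-probability entries, and the example distributions are arbitrary.

```python
import numpy as np

def entropy_bits(p):
    """H(X) = -sum_i p_i log2 p_i, ignoring zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x)); assumes q > 0 wherever p > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
print(entropy_bits(p))        # 1.0 bit for a fair coin
print(kl_divergence(p, q))    # > 0, and not equal to D_KL(Q || P)
```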
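
The K-means objective evaluated for a fixed assignment: given points, labels, and centroids, it is just the within-cluster sum of squared distances. The two-blob data and its labels are hand-picked.

```python
import numpy as np

def kmeans_objective(X, labels, centroids):
    """Within-cluster sum of squares: sum_i sum_{x in S_i} ||x - mu_i||^2."""
    diffs = X - centroids[labels]   # each point minus its assigned centroid
    return np.sum(diffs ** 2)

# Toy data: two obvious blobs with the assignments and centroids a K-means run might produce.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([X[labels == 0].mean(axis=0), X[labels == 1].mean(axis=0)])

print(kmeans_objective(X, labels, centroids))   # small, since the clusters are tight
```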
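
The SVM primal objective evaluated at a candidate $(w, b)$; labels are in $\{-1, +1\}$, and the weight vector, offset, and $C$ are arbitrary choices, so this only shows how the hinge term and the regularizer combine.

```python
import numpy as np

def svm_primal_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i - b)), labels in {-1, +1}."""
    margins = y * (X @ w - b)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * np.dot(w, w) + C * hinge.sum()

X = np.array([[2.0, 2.0], [-2.0, -1.0], [0.2, 0.1]])
y = np.array([1.0, -1.0, 1.0])
w = np.array([1.0, 1.0])
print(svm_primal_objective(w, 0.0, X, y, C=1.0))   # 1.7: only the third point violates the margin
```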
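
Finally, the eigenvalue equation and the SVD checked numerically with NumPy's linear-algebra routines on a small symmetric matrix.

```python
import numpy as np

# Verify the defining identities A v = lambda v and A = U Sigma V^T numerically.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

eigvals, eigvecs = np.linalg.eigh(A)     # eigh: symmetric/Hermitian eigensolver
v, lam = eigvecs[:, 0], eigvals[0]
print(np.allclose(A @ v, lam * v))       # True: A v = lambda v

U, s, Vt = np.linalg.svd(A)
print(np.allclose(A, U @ np.diag(s) @ Vt))   # True: A = U Sigma V^T
```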