Optimization
- Optimization Theory
- Cost / Loss Functions
- Gradient Descent
- Convexity
- Time Series
- Bayesian Statistics
- Information Theory
- Numerical Methods
Type | Definition | Key Components | Examples |
---|---|---|---|
Unconstrained Optimization | Optimization without explicit constraints on variables | Objective function and free decision variables; solved by setting gradients to zero or by iterative methods such as gradient descent | Most machine learning training algorithms |
Constrained Optimization | Optimization with explicit constraints on variables | Objective function plus equality/inequality constraints defining a feasible region; handled with Lagrange multipliers, KKT conditions, or penalty methods | Resource allocation, portfolio optimization |
Convex Optimization | Optimization over convex sets with convex objective functions | Convex objective and convex feasible set, so any local minimum is the global minimum | Support Vector Machines, L1/L2 regularization |
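As a concrete illustration of the first two rows, the sketch below (assuming NumPy and SciPy are available) minimizes the same quadratic objective once without constraints and once with an inequality constraint; the objective and constraint are invented for illustration, not taken from the table.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative objective: f(x, y) = (x - 1)^2 + (y - 2.5)^2
def objective(v):
    x, y = v
    return (x - 1) ** 2 + (y - 2.5) ** 2

x0 = np.zeros(2)

# Unconstrained: the minimizer is simply (1, 2.5).
unconstrained = minimize(objective, x0)

# Constrained: require x + y <= 2 (written as 2 - x - y >= 0 for SLSQP).
constraints = [{"type": "ineq", "fun": lambda v: 2 - v[0] - v[1]}]
constrained = minimize(objective, x0, method="SLSQP", constraints=constraints)

print("unconstrained:", unconstrained.x)  # ~ [1.0, 2.5]
print("constrained:  ", constrained.x)    # pushed onto the boundary x + y = 2
```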
Function | Purpose | Formula | Key Characteristics | Use Cases |
---|---|---|---|---|
Mean Squared Error (MSE) / L2 Loss | Regression: minimize squared differences between predicted and actual values | $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ | Differentiable, convex (for linear regression), sensitive to outliers | Linear regression, continuous value prediction |
Mean Absolute Error (MAE) / L1 Loss | Regression: minimize absolute differences between predicted and actual values | $\frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert$ | Robust to outliers, not differentiable at zero | Regression tasks with outliers, robust error measurement |
Binary Cross-Entropy / Log Loss | Binary classification: measure prediction accuracy when the output is a probability between 0 and 1 | $-\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$ | Penalizes confident wrong predictions, differentiable, pairs with sigmoid activation | Logistic regression, spam detection, medical diagnosis |
Categorical Cross-Entropy (Softmax Loss) | Multi-class classification: evaluate predictions across multiple classes | $-\frac{1}{n}\sum_{i=1}^{n}\sum_{c=1}^{C} y_{i,c}\log \hat{y}_{i,c}$ | Extension of the binary version, works with softmax, differentiable | Image recognition, NLP tasks, multi-class classification |
Hinge Loss | Classification (esp. SVM): maximize the margin between classes | $\frac{1}{n}\sum_{i=1}^{n}\max(0,\, 1 - y_i f(x_i))$ with $y_i \in \{-1, +1\}$ | Zero-loss region for confident predictions, not differentiable at all points | Support Vector Machines, margin-based classifiers |
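A minimal NumPy sketch of three of these losses (MSE, MAE, binary cross-entropy); the input arrays and the clipping constant are illustrative choices, not part of the table above.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean absolute error: average of absolute residuals."""
    return np.mean(np.abs(y_true - y_pred))

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Log loss for predicted probabilities; clip to avoid log(0)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0, 1.0])
y_prob = np.array([0.9, 0.2, 0.7, 0.4])   # illustrative predicted probabilities
print(mse(y_true, y_prob), mae(y_true, y_prob), binary_cross_entropy(y_true, y_prob))
```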
Variant | Concept | Pros | Cons | Update Rule |
---|---|---|---|---|
Batch Gradient Descent (BGD) | Uses the entire dataset to compute the gradient before each update | Stable, exact gradient direction; deterministic convergence | Slow and memory-hungry on large datasets; only one update per full pass | $\theta \leftarrow \theta - \eta\,\nabla_\theta J(\theta)$ computed with all data |
Stochastic Gradient Descent (SGD) | Updates parameters after computing the gradient for each data point | Fast, cheap updates; noise can help escape shallow local minima | Noisy, fluctuating convergence; sensitive to the learning rate | $\theta \leftarrow \theta - \eta\,\nabla_\theta J(\theta; x^{(i)}, y^{(i)})$ with one example |
Mini-Batch Gradient Descent (MBGD) | Updates based on small batches of examples | Balances speed and stability; exploits vectorized hardware | Batch size becomes an additional hyperparameter | $\theta \leftarrow \theta - \eta\,\nabla_\theta J(\theta; x^{(i:i+b)}, y^{(i:i+b)})$ with a batch |
Momentum | Adds a fraction of the previous update to the current update (like rolling a ball downhill) | Accelerates convergence along consistent directions; damps oscillations | Extra hyperparameter $\gamma$; can overshoot minima | $v \leftarrow \gamma v + \eta\,\nabla_\theta J(\theta)$, $\theta \leftarrow \theta - v$ (gradient descent with an accumulated momentum term) |
RMSprop | Divides the learning rate by the root mean square of historical gradients | Adaptive per-parameter learning rate; handles non-stationary objectives | Decay rate needs tuning; no bias correction | $E[g^2] \leftarrow \rho\,E[g^2] + (1-\rho)\,g^2$, $\theta \leftarrow \theta - \frac{\eta}{\sqrt{E[g^2]} + \epsilon}\, g$ |
Adam | Combines Momentum + RMSprop, using first (mean) and second (variance) moments of gradients | Works well with default settings on many problems; adaptive and fast | More optimizer state per parameter; can generalize slightly worse than well-tuned SGD | $m \leftarrow \beta_1 m + (1-\beta_1) g$, $v \leftarrow \beta_2 v + (1-\beta_2) g^2$, then $\theta \leftarrow \theta - \frac{\eta\,\hat m}{\sqrt{\hat v} + \epsilon}$ with bias-corrected $\hat m, \hat v$ |
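A from-scratch NumPy sketch contrasting plain gradient descent with the Adam update rule on an illustrative quadratic objective; the objective, learning rates, and hyperparameter values are example choices, not prescriptions.

```python
import numpy as np

def grad(theta):
    # Gradient of the illustrative objective J(theta) = ||theta - 3||^2 / 2
    return theta - 3.0

def gradient_descent(theta, lr=0.1, steps=200):
    for _ in range(steps):
        theta = theta - lr * grad(theta)          # plain update rule
    return theta

def adam(theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    m = np.zeros_like(theta)                      # first moment (mean of gradients)
    v = np.zeros_like(theta)                      # second moment (uncentered variance)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)              # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

theta0 = np.array([10.0, -4.0])
print(gradient_descent(theta0))   # both runs should approach [3, 3]
print(adam(theta0))
```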
Aspect | Convex Functions | Non-Convex Functions |
---|---|---|
Shape | Bowl-shaped; "cups upwards" | Complex landscapes with hills, valleys, and plateaus |
Mathematical Condition | Satisfies $f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda) f(y)$ for all $x, y$ and $\lambda \in [0, 1]$ | Does not satisfy the convexity inequality for all points
Local vs Global Minima | Any local minimum is also a global minimum | Multiple local minima, saddle points, and flat regions |
Optimization Guarantee | Gradient descent (with proper learning rate) always converges to the global minimum | Gradient descent may get stuck in local minima or saddle points |
Sensitivity to Initialization | Low; starting point less critical as all paths lead to same minimum | High; results depend heavily on initial parameter values |
Convergence | Deterministic and reliable | Can be slow, with risk of poor solutions |
Optimization Methods | Simple methods like gradient descent or closed-form solutions often sufficient | Require advanced optimizers (Adam, RMSprop, SGD with momentum), multiple restarts, or careful initialization |
Theoretical Guarantees | Strong; ensures uniqueness and optimality of solution | Weak; solutions may be suboptimal and vary between runs |
Examples | Mean Squared Error (linear regression), Cross-Entropy (logistic regression), L2-regularized loss | Deep neural networks, complex non-linear models |
Use in Data Analysis | Enables reliable model interpretation and parameter estimates (e.g., regression coefficients) | Powerful for complex models, but less interpretable and harder to optimize |
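As a quick illustration of the convexity inequality above, the sketch below numerically spot-checks it for a convex function ($x^2$) and a non-convex one ($\sin x$). The sampling range, trial count, and seed are arbitrary illustrative choices, and a sampled check is only evidence, not a proof.

```python
import numpy as np

def violates_convexity(f, trials=10_000, lo=-5.0, hi=5.0, seed=0):
    """True if f(lam*x + (1-lam)*y) > lam*f(x) + (1-lam)*f(y) for some sampled point."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi, trials)
    y = rng.uniform(lo, hi, trials)
    lam = rng.uniform(0.0, 1.0, trials)
    lhs = f(lam * x + (1 - lam) * y)
    rhs = lam * f(x) + (1 - lam) * f(y)
    return bool(np.any(lhs > rhs + 1e-12))

print("x^2 violates convexity:", violates_convexity(lambda t: t ** 2))   # False
print("sin violates convexity:", violates_convexity(np.sin))             # True
```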
Concept | Definition | Purpose | Indicators & Techniques | Use Cases |
---|---|---|---|---|
Stationarity | Statistical properties (mean, variance, autocorrelation) remain constant over time | Ensures model validity, reliable inference, stable forecasting | Visual inspection, ADF/KPSS tests; corrected via differencing, detrending, or variance-stabilizing transformations | Core assumption for AR, MA, ARIMA; non-stationarity (trends, seasonality, heteroscedasticity) must be corrected |
ACF (Autocorrelation Function) | Correlation of a series with its lagged versions | Reveals lag dependencies; detects seasonality or trends | Correlogram plots; significant spikes at seasonal lags; slow decay signals non-stationarity | Crucial for identifying the MA order $q$ in ARIMA models |
PACF (Partial Autocorrelation Function) | Correlation between the series and lagged values with intermediate lags removed | Measures direct lag effects | Partial correlogram; a cutoff after lag $p$ suggests an AR($p$) process | Identifies the AR order $p$ in ARIMA models |
ARIMA Models | Combines AR (past values), I (differencing), and MA (errors) | Forecasting sequential data | Orders $(p, d, q)$ chosen via ACF/PACF and AIC/BIC; Box-Jenkins methodology; SARIMA for seasonality | Widely used for short/medium-term forecasts, benchmark models, trend/seasonality decomposition |
Spectral Analysis | Represents data in the frequency domain via sinusoidal components | Detects hidden cycles, dominant frequencies | Periodogram and power spectral density estimates | Identifies seasonality; key for signal processing and feature extraction |
Fourier Transform | Converts a time-domain signal into a frequency-domain representation | Reveals underlying frequencies; enables reconstruction and filtering | Discrete/Fast Fourier Transform (DFT/FFT); inverse transform for reconstruction | Used in decomposition, noise filtering, audio/image processing, and advanced time series models |
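A NumPy-only sketch tying the ACF and the Fourier transform together on a synthetic seasonal signal; the period, noise level, lag count, and seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)
n, period = 400, 25
t = np.arange(n)
series = np.sin(2 * np.pi * t / period) + 0.2 * rng.standard_normal(n)  # synthetic seasonal signal

def acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    x = x - x.mean()
    m = len(x)
    denom = np.sum(x ** 2)
    return np.array([np.sum(x[: m - k] * x[k:]) / denom for k in range(max_lag + 1)])

lags = acf(series, 2 * period)
print("ACF at seasonal lag", period, "->", round(lags[period], 2))            # large again: seasonality
print("ACF at half period", period // 2, "->", round(lags[period // 2], 2))   # strongly negative

# Spectral view: the dominant frequency from the real FFT recovers the period.
spectrum = np.abs(np.fft.rfft(series - series.mean()))
freqs = np.fft.rfftfreq(n, d=1.0)
print("dominant period ~", round(1.0 / freqs[np.argmax(spectrum)], 1))        # ~25
```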
Aspect | Bayesian Statistics | Frequentist Statistics |
---|---|---|
Core Idea | Updates beliefs with data; probabilities represent degrees of belief | Focuses on probability of data given a fixed hypothesis; probabilities represent long-run frequencies |
Foundation | Bayes' Theorem: $P(\theta \mid D) = \dfrac{P(D \mid \theta)\,P(\theta)}{P(D)}$ | Hypothesis testing and confidence intervals
Parameters | Treated as random variables with probability distributions | Treated as fixed, unknown quantities |
Priors | Incorporates prior knowledge (informative, non-informative, conjugate) | No role for prior probabilities |
Inference Result | Full posterior distribution summarizing uncertainty | Point estimates and p-values; intervals are confidence intervals |
Hypothesis Testing | Uses posterior probabilities and Bayes factors for model comparison | Uses significance tests and p-values |
Uncertainty Quantification | Credible intervals: direct probability statements about parameters | Confidence intervals: indirect interpretation across repeated samples |
Computation | Often requires advanced techniques like MCMC (Metropolis-Hastings, Gibbs, HMC) | Closed-form or asymptotic approximations more common |
Flexibility | Naturally handles small datasets, expert knowledge, hierarchical models, complex structures | Stronger with large datasets; less flexible for complex prior information |
Use Cases | A/B testing, diagnostics, machine learning (Bayesian optimization, networks), risk assessment | Classical hypothesis testing, large-scale statistical inference |
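A small sketch of the Bayesian update in practice: a Beta-Binomial conjugate model for a conversion rate, as might appear in A/B testing. The prior and the data are invented for illustration, and SciPy is assumed for the credible interval.

```python
from scipy import stats

# Illustrative data: 120 trials, 30 conversions.
successes, trials = 30, 120

# Prior belief about the conversion rate: Beta(2, 8), weakly informative with mean 0.2.
prior_a, prior_b = 2, 8

# Conjugacy: Beta prior + Binomial likelihood -> Beta posterior.
post_a = prior_a + successes
post_b = prior_b + (trials - successes)
posterior = stats.beta(post_a, post_b)

print("posterior mean:", round(posterior.mean(), 3))
print("95% credible interval:", [round(q, 3) for q in posterior.interval(0.95)])
```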
Concept | Formula | What it Measures | Key Interpretations | Use Cases |
---|---|---|---|---|
Entropy | $H(p) = -\sum_x p(x)\log p(x)$ | Uncertainty/randomness in a probability distribution | Higher entropy means more uncertainty; maximized by the uniform distribution, zero for a deterministic outcome | Decision trees (information gain), data compression, measuring impurity |
Cross-Entropy | $H(p, q) = -\sum_x p(x)\log q(x)$ | Difference between the true distribution $p$ and the predicted distribution $q$ | Equals $H(p) + D_{KL}(p \,\|\, q)$; lower values mean predictions are closer to the true labels | Classification loss functions (log loss, softmax loss) |
KL Divergence | $D_{KL}(p \,\|\, q) = \sum_x p(x)\log\frac{p(x)}{q(x)}$ | Information loss when approximating distribution $p$ with $q$ | Non-negative, zero only when $p = q$, asymmetric (not a true distance) | Variational inference, model comparison, regularizing distributions (e.g., VAEs) |
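A NumPy sketch computing the three quantities for two small discrete distributions (the distributions themselves are arbitrary examples), illustrating the identity cross-entropy = entropy + KL divergence.

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_x p(x) log p(x), in nats (0 log 0 taken as 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) log(p(x) / q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.7, 0.2, 0.1]          # "true" distribution (illustrative)
q = [0.5, 0.3, 0.2]          # "predicted" distribution (illustrative)

print("H(p)      =", round(entropy(p), 4))
print("H(p, q)   =", round(cross_entropy(p, q), 4))
print("KL(p||q)  =", round(kl_divergence(p, q), 4))
print("H(p) + KL =", round(entropy(p) + kl_divergence(p, q), 4))  # matches H(p, q)
```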
Aspect | Key Concepts | Challenges | Use Cases |
---|---|---|---|
Floating-Point Arithmetic | Finite-precision representation (IEEE 754), machine epsilon, rounding of every arithmetic result | Round-off error, catastrophic cancellation, overflow/underflow, non-associativity of addition | All numerical computation; mixed-precision (float16/float32) deep learning training |
Numerical Stability | Conditioning of the problem vs stability of the algorithm; error propagation; stable reformulations such as log-sum-exp | Ill-conditioned problems amplify small input errors; unstable formulas lose precision (e.g., naive softmax or variance computations) | Stable softmax/log-likelihood computation, matrix factorizations, gradient computations |
Newton's Method | Second-order optimization: update $\theta \leftarrow \theta - H^{-1}\nabla J(\theta)$ using the gradient and the Hessian $H$; quadratic convergence near the optimum | Computing and inverting the Hessian is expensive ($O(n^3)$); may diverge or stall at saddle points far from the optimum | Logistic regression (IRLS), root finding, small-to-medium smooth problems |
Quasi-Newton Methods (BFGS, L-BFGS) | Approximate the (inverse) Hessian from successive gradient differences; L-BFGS keeps only a limited history | Full BFGS needs $O(n^2)$ memory; relies on line searches; assumes a smooth objective | Medium/large-scale smooth optimization; classical ML models such as logistic regression and CRFs |
Conjugate Gradient Method | Builds mutually conjugate search directions; solves symmetric positive-definite linear systems without forming a matrix inverse | Requires (for linear CG) a symmetric positive-definite system; convergence depends on conditioning, so preconditioning is often needed | Large sparse linear systems, least squares, Hessian-free / truncated-Newton optimization |
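A minimal sketch of Newton's method in one dimension; the test function, starting point, and fixed iteration count are illustrative stand-ins for a real problem and a proper convergence check.

```python
# Illustrative objective: f(x) = x^4 - 3x^3 + 2, minimized where f'(x) = 0 and f''(x) > 0.
def f_prime(x):
    return 4 * x ** 3 - 9 * x ** 2

def f_double_prime(x):
    return 12 * x ** 2 - 18 * x

def newton_minimize(x, iters=20):
    for _ in range(iters):
        x = x - f_prime(x) / f_double_prime(x)   # Newton step: x <- x - f'(x) / f''(x)
    return x

x_star = newton_minimize(3.0)
print("Newton's method:", round(x_star, 6))      # expected 9/4 = 2.25, where 4x^3 = 9x^2
print("f'(x*) ~ 0:", abs(f_prime(x_star)) < 1e-8)
```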