Fundamentals
- Glossary
- DS Lifecycle
- Ethics
- Stats & Probability
- Project Structure
- Data: Raw facts, figures, and statistics collected for analysis. It can take many forms, such as numbers, text, images, or video.
- Data Science: An interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It combines statistics, computer science, and domain expertise to analyze and interpret complex data sets, with the goal of turning data into actionable insights that inform decision-making and drive business value. Key components include data collection, data cleaning, data analysis, machine learning, and data visualization.
Data Science Lifecycle
- Problem Definition: Understand the business problem and define objectives
- Data Mining: Collect and explore data to identify patterns and insights
- Data Preparation: Clean and preprocess data for analysis
- Data Exploration: Analyze data to uncover trends and relationships
- Feature Engineering: Create and select relevant features for modeling
- Predictive Modeling: Train machine learning models, evaluate their performance, and use them to make predictions
- Data Visualization: Communicate the findings with key stakeholders using plots and interactive visualizations
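Tied together, these stages form a single pass from raw data to communicated results. Below is a minimal sketch of that flow in Python, using scikit-learn's built-in diabetes dataset so it runs anywhere; the chosen feature subset (`bmi`, `s5`, `bp`) is illustrative, not prescriptive.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Data mining / collection: load a built-in dataset as a pandas DataFrame.
df = load_diabetes(as_frame=True).frame

# Data preparation: drop incomplete records (none here, but typical).
df = df.dropna()

# Data exploration: summary statistics and correlation with the target.
print(df.describe())
print(df.corr()["target"].sort_values(ascending=False))

# Feature engineering: keep a few features that correlate with the target.
features = ["bmi", "s5", "bp"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["target"], test_size=0.2, random_state=42
)

# Predictive modeling: train, then evaluate on held-out data.
model = LinearRegression().fit(X_train, y_train)
print("R^2:", r2_score(y_test, model.predict(X_test)))
```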
Data Preparation
- Collection: Gathering data sources and measuring the accuracy of each. Questions worth asking at this stage:
  - Has this problem been approached before? What was discovered?
  - Is the purpose and goal understood by all involved?
  - Is there ambiguity, and how can it be reduced?
  - What are the constraints?
  - What will the end result potentially look like?
  - What resources (time, people, computational) are available?
  - What data is already available to me?
  - Who owns this data?
  - What are the privacy concerns?
  - Do I have enough data to solve this problem?
  - Is the data of acceptable quality for this problem?
  - If I discover additional information through this data, should we consider changing or redefining the goals?
- Cleaning: Detecting and removing incomplete and inaccurate data records
- Classification: Organizing data into categories for more efficient use
- Clustering: Grouping data into similar groups
- Regression: Determining the relationships between variables to predict or forecast values
- Transformation: Reformatting data into proper formats and structures
- Reduction: Condensing redundant data down to its meaningful parts
- Consolidation: Combining and storing varied data into a single place
- Storage: Storing the data in a storage medium for future use
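Most of these operations map directly onto pandas. The sketch below walks through cleaning, transformation, reduction, and consolidation on a tiny invented table; all column names are hypothetical.

```python
import pandas as pd

raw = pd.DataFrame({
    "city": ["Boston", "boston", "NYC", None, "NYC"],
    "temp_f": [61.0, 61.0, 75.2, 58.1, None],
})

# Cleaning: drop incomplete records.
clean = raw.dropna()

# Transformation: reformat values into consistent formats and structures.
clean = clean.assign(
    city=clean["city"].str.upper(),
    temp_c=(clean["temp_f"] - 32) * 5 / 9,
)

# Reduction: drop exact duplicates, keeping only the meaningful parts.
clean = clean.drop_duplicates()

# Consolidation: combine with another source into a single table.
population = pd.DataFrame({"city": ["BOSTON", "NYC"], "pop": [650_000, 8_300_000]})
merged = clean.merge(population, on="city", how="left")
print(merged)
```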
Ethics Principles
Aspect | Definition |
---|---|
Accountability | Makes practitioners responsible for their data & AI operations, ensuring compliance with ethical principles |
Transparency | Ensures data and AI decisions are understandable and interpretable, explaining the "what" and "why" |
Fairness | Aims to treat all people fairly by addressing systemic or implicit socio-technical biases in data and systems |
Reliability & Safety | Ensures AI behaves consistently with defined values while minimizing harms or unintended consequences |
Privacy & Security | Protects user privacy and identities by understanding data lineage and providing security safeguards |
Inclusiveness | Designs AI solutions to meet a broad range of human needs and capabilities, ensuring accessibility |
Ethics Challenges
Aspect | Definition |
---|---|
Data Ownership | Who owns the data? What rights do subjects and organizations have regarding control, access, and erasure? |
Informed Consent | Did users give explicit permission for data capture and understand its purpose, risks, and alternatives? |
Intellectual Property | Does collected data have economic value? Do users or organizations hold IP rights, and how are they protected? |
Data Privacy | Is personal data secured, anonymized, and accessible only to authorized contexts without risk of leaks? |
Right to Be Forgotten | Do systems allow users to request erasure of personal data, complying with privacy regulations? |
Dataset Bias | Was data representative? Were biases tested, mitigated, or removed to prevent unfair outcomes? |
Data Quality | Is data valid, accurate, consistent, and complete enough to support reliable AI model development? |
Algorithm Fairness | Does the algorithm discriminate against groups? Were model accuracy and potential harms evaluated? |
Misrepresentation | Are insights derived from honest data, or are methods/statistics being selectively used to mislead? |
Free Choice | Does system design nudge users with hidden biases? Can users truly understand, choose, and reverse decisions? |
Probability and Random Variables
Aspect | Definition | Concept | Examples |
---|---|---|---|
Probability | Likelihood of an event, between 0 and 1 | $P(E) = \frac{\text{favorable outcomes}}{\text{total outcomes}}$ | Even number on a die: $P = \frac{3}{6} = 0.5$ |
Random Variables | Functions assigning values to outcomes | Discrete (countable), Continuous (real range) | Dice roll (1-6), bus arrival time |
Sample Space | Set of all possible outcomes | Set notation | $S = \{1, 2, 3, 4, 5, 6\}$ for a die |
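A quick simulation makes these definitions concrete. The sketch below, assuming only NumPy, estimates the probability of rolling an even number and compares it with the exact value:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
sample_space = np.arange(1, 7)                  # S = {1, 2, 3, 4, 5, 6}
rolls = rng.choice(sample_space, size=100_000)  # a discrete random variable

estimate = np.mean(rolls % 2 == 0)              # empirical P(even)
print(f"simulated: {estimate:.3f}, exact: {3/6:.3f}")
```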
Distributions
Aspect | Type | Definition | Key Points | Examples |
---|---|---|---|---|
Discrete Distribution | Discrete random variable | Defines a probability for each outcome; probabilities sum to 1 | Uniform distribution: equal probabilities | Die roll: uniform with $p_i = \frac{1}{6}$ |
Continuous Distribution | Continuous random variable | Probability density function (PDF), probabilities over intervals | Exact value probability = 0 | Normal distribution, uniform continuous |
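The contrast is easy to see in code. The following sketch (NumPy only, parameters invented) samples from a discrete uniform distribution and a normal distribution, and shows why an exact value has probability zero in the continuous case:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Discrete uniform: each die face has probability 1/6; estimates sum to 1.
rolls = rng.integers(1, 7, size=60_000)
faces, counts = np.unique(rolls, return_counts=True)
print(dict(zip(faces, np.round(counts / len(rolls), 3))))

# Continuous (normal): probabilities live on intervals, not exact points.
heights = rng.normal(loc=170, scale=10, size=60_000)
print("P(165 <= X <= 175) ~", np.mean((heights >= 165) & (heights <= 175)))
print("P(X == 170) ~", np.mean(heights == 170))  # effectively 0
```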
Statistical Measures
Aspect | Definition | Formula | Examples |
---|---|---|---|
Mean (Expectation) | Average value | $\mu = E[X] = \frac{1}{n}\sum_{i=1}^{n} x_i$ | Average height, weight, etc. |
Variance | Spread of data around the mean | $\sigma^2 = E[(X - \mu)^2]$ | Variability in test scores |
Standard Deviation | Square root of variance | $\sigma = \sqrt{\sigma^2}$ | Spread measurement |
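For a concrete check, here is a short sketch computing all three measures on an invented sample of heights:

```python
import numpy as np

heights = np.array([162.0, 170.5, 168.0, 181.2, 175.3])

mean = heights.mean()      # mu = (1/n) * sum(x_i)
variance = heights.var()   # sigma^2 = mean of (x_i - mu)^2
std = heights.std()        # sigma = sqrt(variance)
print(f"mean={mean:.2f}, variance={variance:.2f}, std={std:.2f}")
```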
Central Tendency Measures
Aspect | Description | Notes | Use Case |
---|---|---|---|
Median | Middle value, robust to outliers | Splits data in half | Income data with outliers |
Mode | Most frequent value | Useful for categorical data | Most common shoe size |
Quartiles & IQR | Q1 = 25th percentile, Q3 = 75th percentile, IQR = Q3 − Q1 | Detect outliers with a box plot | Box plot for exam scores
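These measures, plus the common 1.5 × IQR rule for flagging outliers, fit in a few lines (scores invented for illustration):

```python
import numpy as np
from statistics import mode

scores = [55, 61, 64, 64, 70, 72, 75, 98]  # invented exam scores

print("median:", np.median(scores))  # robust to the outlier 98
print("mode:", mode(scores))         # most frequent value

q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print("outliers:", [x for x in scores if x < low or x > high])
```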
Key Theorems
Aspect | Description | Notes | Examples |
---|---|---|---|
Law of Large Numbers | Sample mean converges to population mean as sample size increases | Basis for reliability in statistics | Average height from large sample |
Central Limit Theorem | Distribution of sample means approaches normal regardless of original distribution | Justifies normal approximation in many cases | Average test scores of students |
Confidence Intervals | Estimating population parameter range from sample | Width increases with confidence level | Mean height with 95% confidence |
Hypothesis Testing | Tests claims about populations, comparing means or distributions | Uses t-test, p-value indicates evidence strength | Comparing heights of groups |
Covariance & Correlation | Measures relationship between two variables | Correlation normalized between -1 and 1 | Weight vs height correlation |
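The first three rows can be demonstrated by simulation. A minimal sketch, assuming NumPy and an invented exponential "population"; the 1.96 factor is the normal-approximation critical value for 95% confidence:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
population = rng.exponential(scale=10, size=1_000_000)  # very non-normal

# CLT: means of repeated samples look approximately normal,
# and (LLN) they cluster around the population mean.
sample_means = rng.choice(population, size=(5_000, 50)).mean(axis=1)
print("population mean:", population.mean().round(2))
print("mean of sample means:", sample_means.mean().round(2))

# 95% confidence interval for the mean from a single sample of n=50.
sample = rng.choice(population, size=50)
se = sample.std(ddof=1) / np.sqrt(len(sample))
lo, hi = sample.mean() - 1.96 * se, sample.mean() + 1.96 * se
print(f"95% CI for the mean: [{lo:.2f}, {hi:.2f}]")
```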
Project Structure
```
├── LICENSE            <- Open-source license if one is chosen
├── Makefile           <- Makefile with convenience commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources
│   ├── interim        <- Intermediate data that has been transformed
│   ├── processed      <- The final, canonical data sets for modeling
│   └── raw            <- The original, immutable data dump
├── docs               <- Static documentation files (e.g. mkdocs, Sphinx)
├── models             <- Trained and serialized models, model predictions, or model summaries
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering), the creator's initials, and a short `-` delimited description, e.g. `1.0-jqp-initial-data-exploration`
├── pyproject.toml     <- Project configuration file with package metadata for {{module_name}} and configuration for tools like ruff
├── references         <- Data dictionaries, manuals, and all other explanatory materials
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g. generated with `pip freeze > requirements.txt`
├── setup.cfg          <- Configuration file for flake8
└── {{module_name}}    <- Source code for use in this project
    ├── __init__.py    <- Makes {{module_name}} a Python module
    ├── config.py      <- Store useful variables and configuration
    ├── dataset.py     <- Scripts to download or generate data
    ├── features.py    <- Code to create features for modeling
    ├── modeling
    │   ├── __init__.py
    │   ├── predict.py <- Code to run model inference with trained models
    │   └── train.py   <- Code to train models
    └── plots.py       <- Code to create visualizations
```