Fundamentals
- Glossary
- DS Lifecycle
- Ethics
- Stats & Probability
- Project Structure
- Data: Raw facts, figures, and statistics collected for analysis. It can take many forms, such as numbers, text, images, or video.
- Data Science: An interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It combines statistics, computer science, and domain expertise to analyze and interpret complex data sets, with the goal of turning data into actionable insights that inform decision-making and drive business value. Key components include data collection, data cleaning, data analysis, machine learning, and data visualization.
Data Science Lifecycle
- Problem Definition: Understand the business problem and define objectives
- Data Mining: Collect and explore data to identify patterns and insights
- Data Preparation: Clean and preprocess data for analysis
- Data Exploration: Analyze data to uncover trends and relationships
- Feature Engineering: Create and select relevant features for modeling
- Predictive Modeling: Train machine learning models, evaluate their performance, and use them to make predictions
- Data Visualization: Communicate the findings with key stakeholders using plots and interactive visualizations
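Tied together, these stages form a single pass from raw data to communicated results. Below is a minimal sketch of that flow in Python, using scikit-learn's built-in diabetes dataset so it runs anywhere; the chosen feature subset (`bmi`, `s5`, `bp`) is illustrative, not prescriptive.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Data mining / collection: load a built-in dataset as a pandas DataFrame.
df = load_diabetes(as_frame=True).frame

# Data preparation: drop incomplete records (none here, but typical).
df = df.dropna()

# Data exploration: summary statistics and correlation with the target.
print(df.describe())
print(df.corr()["target"].sort_values(ascending=False))

# Feature engineering: keep a few features that correlate with the target.
features = ["bmi", "s5", "bp"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["target"], test_size=0.2, random_state=42
)

# Predictive modeling: train, then evaluate on held-out data.
model = LinearRegression().fit(X_train, y_train)
print("R^2:", r2_score(y_test, model.predict(X_test)))
```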
Data Preparation
- Collection: Gathering data sources and measuring the accuracy of each. Questions worth asking at this stage:
  - Has this problem been approached before? What was discovered?
  - Is the purpose and goal understood by all involved?
  - Is there ambiguity, and how can it be reduced?
  - What are the constraints?
  - What will the end result potentially look like?
  - What resources (time, people, computational) are available?
  - What data is already available to me?
  - Who owns this data?
  - What are the privacy concerns?
  - Do I have enough data to solve this problem?
  - Is the data of acceptable quality for this problem?
  - If I discover additional information through this data, should we consider changing or redefining the goals?
- Cleaning: Detecting and removing incomplete and inaccurate data records
- Classification: Organizing data into categories for more efficient use
- Clustering: Grouping data into similar groups
- Regression: Determining the relationships between variables to predict or forecast values
- Transformation: Reformatting data into proper formats and structures
- Reduction: Condensing redundant data down to its meaningful parts
- Consolidation: Combining and storing varied data into a single place
- Storage: Storing the data in a storage medium for future use
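Most of these operations map directly onto pandas. The sketch below walks through cleaning, transformation, reduction, and consolidation on a tiny invented table; all column names are hypothetical.

```python
import pandas as pd

raw = pd.DataFrame({
    "city": ["Boston", "boston", "NYC", None, "NYC"],
    "temp_f": [61.0, 61.0, 75.2, 58.1, None],
})

# Cleaning: drop incomplete records.
clean = raw.dropna()

# Transformation: reformat values into consistent formats and structures.
clean = clean.assign(
    city=clean["city"].str.upper(),
    temp_c=(clean["temp_f"] - 32) * 5 / 9,
)

# Reduction: drop exact duplicates, keeping only the meaningful parts.
clean = clean.drop_duplicates()

# Consolidation: combine with another source into a single table.
population = pd.DataFrame({"city": ["BOSTON", "NYC"], "pop": [650_000, 8_300_000]})
merged = clean.merge(population, on="city", how="left")
print(merged)
```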
Ethics Principles
Aspect | Definition |
---|---|
Accountability | Makes practitioners responsible for their data & AI operations, ensuring compliance with ethical principles |
Transparency | Ensures data and AI decisions are understandable and interpretable, explaining the "what" and "why" |
Fairness | Aims to treat all people fairly by addressing systemic or implicit socio-technical biases in data and systems |
Reliability & Safety | Ensures AI behaves consistently with defined values while minimizing harms or unintended consequences |
Privacy & Security | Protects user privacy and identities by understanding data lineage and providing security safeguards |
Inclusiveness | Designs AI solutions to meet a broad range of human needs and capabilities, ensuring accessibility |
Ethics Challenges
Aspect | Definition |
---|---|
Data Ownership | Who owns the data? What rights do subjects and organizations have regarding control, access, and erasure? |
Informed Consent | Did users give explicit permission for data capture and understand its purpose, risks, and alternatives? |
Intellectual Property | Does collected data have economic value? Do users or organizations hold IP rights, and how are they protected? |
Data Privacy | Is personal data secured, anonymized, and accessible only to authorized contexts without risk of leaks? |
Right to Be Forgotten | Do systems allow users to request erasure of personal data, complying with privacy regulations? |
Dataset Bias | Was data representative? Were biases tested, mitigated, or removed to prevent unfair outcomes? |
Data Quality | Is data valid, accurate, consistent, and complete enough to support reliable AI model development? |
Algorithm Fairness | Does the algorithm discriminate against groups? Were model accuracy and potential harms evaluated? |
Misrepresentation | Are insights derived from honest data, or are methods/statistics being selectively used to mislead? |
Free Choice | Does system design nudge users with hidden biases? Can users truly understand, choose, and reverse decisions? |
Probability and Random Variables
Aspect | Definition | Concept | Examples |
---|---|---|---|
Probability | Likelihood of an event, between 0 and 1 | $P(E) = \frac{\text{favorable outcomes}}{\text{total outcomes}}$ | Even number on a die: $P = \frac{3}{6} = 0.5$ |
Random Variables | Functions assigning values to outcomes | Discrete (countable), Continuous (real range) | Dice roll (1-6), bus arrival time |
Sample Space | Set of all possible outcomes | Set notation | $S = \{1, 2, 3, 4, 5, 6\}$ for a die |
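A quick simulation makes these definitions concrete. The sketch below, assuming only NumPy, estimates the probability of rolling an even number and compares it with the exact value:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
sample_space = np.arange(1, 7)                  # S = {1, 2, 3, 4, 5, 6}
rolls = rng.choice(sample_space, size=100_000)  # a discrete random variable

estimate = np.mean(rolls % 2 == 0)              # empirical P(even)
print(f"simulated: {estimate:.3f}, exact: {3/6:.3f}")
```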
Distributions
Aspect | Type | Definition | Key Points | Examples |
---|---|---|---|---|
Discrete Distribution | Discrete random variable | Defines a probability for each outcome; probabilities sum to 1 | Uniform distribution: equal probabilities | Die roll: uniform with $p_i = \frac{1}{6}$ |
Continuous Distribution | Continuous random variable | Probability density function (PDF), probabilities over intervals | Exact value probability = 0 | Normal distribution, uniform continuous |
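The contrast is easy to see in code. The following sketch (NumPy only, parameters invented) samples from a discrete uniform distribution and a normal distribution, and shows why an exact value has probability zero in the continuous case:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Discrete uniform: each die face has probability 1/6; estimates sum to 1.
rolls = rng.integers(1, 7, size=60_000)
faces, counts = np.unique(rolls, return_counts=True)
print(dict(zip(faces, np.round(counts / len(rolls), 3))))

# Continuous (normal): probabilities live on intervals, not exact points.
heights = rng.normal(loc=170, scale=10, size=60_000)
print("P(165 <= X <= 175) ~", np.mean((heights >= 165) & (heights <= 175)))
print("P(X == 170) ~", np.mean(heights == 170))  # effectively 0
```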
Statistical Measures
Aspect | Definition | Formula | Examples |
---|---|---|---|
Mean (Expectation) | Average value | $\mu = E[X] = \frac{1}{n}\sum_{i=1}^{n} x_i$ | Average height, weight, etc. |
Variance | Spread of data around the mean | $\sigma^2 = E[(X - \mu)^2]$ | Variability in test scores |
Standard Deviation | Square root of variance | $\sigma = \sqrt{\sigma^2}$ | Spread measurement |
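For a concrete check, here is a short sketch computing all three measures on an invented sample of heights:

```python
import numpy as np

heights = np.array([162.0, 170.5, 168.0, 181.2, 175.3])

mean = heights.mean()      # mu = (1/n) * sum(x_i)
variance = heights.var()   # sigma^2 = mean of (x_i - mu)^2
std = heights.std()        # sigma = sqrt(variance)
print(f"mean={mean:.2f}, variance={variance:.2f}, std={std:.2f}")
```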
Central Tendency Measures
Aspect | Description | Notes | Use Case |
---|---|---|---|
Median | Middle value, robust to outliers | Splits data in half | Income data with outliers |
Mode | Most frequent value | Useful for categorical data | Most common shoe size |
Quartiles & IQR | Q1 = 25th percentile, Q3 = 75th percentile, IQR = Q3 − Q1 | Detect outliers with a box plot | Box plot for exam scores
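These measures, plus the common 1.5 × IQR rule for flagging outliers, fit in a few lines (scores invented for illustration):

```python
import numpy as np
from statistics import mode

scores = [55, 61, 64, 64, 70, 72, 75, 98]  # invented exam scores

print("median:", np.median(scores))  # robust to the outlier 98
print("mode:", mode(scores))         # most frequent value

q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print("outliers:", [x for x in scores if x < low or x > high])
```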
Key Theorems
Aspect | Description | Notes | Examples |
---|---|---|---|
Law of Large Numbers | Sample mean converges to population mean as sample size increases | Basis for reliability in statistics | Average height from large sample |
Central Limit Theorem | Distribution of sample means approaches normal regardless of original distribution | Justifies normal approximation in many cases | Average test scores of students |
Confidence Intervals | Estimating population parameter range from sample | Width increases with confidence level | Mean height with 95% confidence |
Hypothesis Testing | Tests claims about populations, comparing means or distributions | Uses t-test, p-value indicates evidence strength | Comparing heights of groups |
Covariance & Correlation | Measures relationship between two variables | Correlation normalized between -1 and 1 | Weight vs height correlation |
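The first three rows can be demonstrated by simulation. A minimal sketch, assuming NumPy and an invented exponential "population"; the 1.96 factor is the normal-approximation critical value for 95% confidence:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
population = rng.exponential(scale=10, size=1_000_000)  # very non-normal

# CLT: means of repeated samples look approximately normal,
# and (LLN) they cluster around the population mean.
sample_means = rng.choice(population, size=(5_000, 50)).mean(axis=1)
print("population mean:", population.mean().round(2))
print("mean of sample means:", sample_means.mean().round(2))

# 95% confidence interval for the mean from a single sample of n=50.
sample = rng.choice(population, size=50)
se = sample.std(ddof=1) / np.sqrt(len(sample))
lo, hi = sample.mean() - 1.96 * se, sample.mean() + 1.96 * se
print(f"95% CI for the mean: [{lo:.2f}, {hi:.2f}]")
```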
Project Structure
```
├── LICENSE            <- Open-source license if one is chosen
├── Makefile           <- Makefile with convenience commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources
│   ├── interim        <- Intermediate data that has been transformed
│   ├── processed      <- The final, canonical data sets for modeling
│   └── raw            <- The original, immutable data dump
├── docs               <- Static documentation files (e.g. mkdocs, Sphinx)
├── models             <- Trained and serialized models, model predictions, or model summaries
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering), the creator's initials, and a short `-` delimited description, e.g. `1.0-jqp-initial-data-exploration`
├── pyproject.toml     <- Project configuration file with package metadata for {{module_name}} and configuration for tools like ruff
├── references         <- Data dictionaries, manuals, and all other explanatory materials
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g. generated with `pip freeze > requirements.txt`
├── setup.cfg          <- Configuration file for flake8
└── {{module_name}}    <- Source code for use in this project
    ├── __init__.py    <- Makes {{module_name}} a Python module
    ├── config.py      <- Store useful variables and configuration
    ├── dataset.py     <- Scripts to download or generate data
    ├── features.py    <- Code to create features for modeling
    ├── modeling
    │   ├── __init__.py
    │   ├── predict.py <- Code to run model inference with trained models
    │   └── train.py   <- Code to train models
    └── plots.py       <- Code to create visualizations
```