Fundamentals
- Overview
- Selection Process
- Transactions
- Locking
- Theorems
DB Type | Features | Use Cases |
---|---|---|
Columnar | Stores data by column rather than by row, enabling high compression and fast scans over selected columns | Analytics and data warehousing (e.g., Amazon Redshift, ClickHouse) |
Document | Stores semi-structured, JSON-like documents with flexible schemas | Content management, catalogs, user profiles (e.g., MongoDB, Couchbase) |
Graph | Models data as nodes and edges, optimized for traversing relationships | Social networks, fraud detection, recommendations (e.g., Neo4j) |
Key-Value | Simple key-to-value lookups with very low latency | Caching, session storage, feature flags (e.g., Redis, DynamoDB) |
NewSQL | Distributed SQL databases that keep ACID guarantees while scaling horizontally | Globally distributed transactional workloads (e.g., CockroachDB, Google Spanner) |
NoSQL | Umbrella term for non-relational stores (document, key-value, columnar, graph) favoring flexible schemas and horizontal scaling | Large-scale or rapidly evolving data models |
Object-Oriented | Stores data as objects, aligning with object-oriented programming languages | Applications with complex object graphs (e.g., ObjectDB) |
Relational / SQL | Structured tables with fixed schemas, joins, and ACID transactions | General-purpose transactional systems (e.g., PostgreSQL, MySQL) |
Spatial | Indexes and queries geometric and geographic data | Mapping, geofencing, location-based services (e.g., PostGIS) |
Time-Series | Optimized for timestamped data, high ingest rates, and time-window queries | Monitoring, IoT telemetry, financial ticks (e.g., InfluxDB, TimescaleDB) |
Vector DB | Stores high-dimensional embeddings and supports similarity (nearest-neighbor) search | Semantic search, recommendations, retrieval for LLMs (e.g., Pinecone, Milvus) |
- Selection
- Choice
Type | Definition | Priority | Transaction Failure | Data Consistency | Use Cases |
---|---|---|---|---|---|
ACID | Atomicity, Consistency, Isolation, Durability | Consistency | Entire transaction rolls back | Guaranteed immediately | Financial Transactions (Banking, Stock Trading) |
BASE | Basically Available, Soft state, Eventual consistency | Availability | May proceed with eventual consistency | Eventual, but not always immediate | Social Media Platforms |
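A minimal sketch of an ACID-style transfer; the accounts table, columns, and amounts are illustrative:

```sql
-- Both updates succeed together or not at all.
BEGIN;

UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;

-- If either statement fails (e.g., a CHECK constraint rejects a negative balance),
-- issue ROLLBACK instead so no partial transfer ever becomes visible.
COMMIT;
```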
- Locking Hierarchy
- Types
Aspect | Intent Shared (IS) | Exclusive (X) | Intent Exclusive (IX) | Shared with Intent Exclusive (SIX) |
---|---|---|---|---|
Definition | Signals intent to acquire shared (read) locks on lower-level resources (rows or pages); compatible with other readers | Prevents all other transactions from reading or writing the locked resource | Signals intent to acquire exclusive locks on lower-level resources; compatible with other intent locks on the same table | Holds a shared lock on the whole resource plus intent to acquire exclusive locks on some rows; allows other readers but no other writers |
Usage | Taken on a table before reading individual rows with shared locks | Taken on a row or table being modified (INSERT, UPDATE, DELETE) | Taken on a table before modifying individual rows | Taken when scanning an entire table while updating only some of its rows |
Impact on Concurrency | Improves read concurrency | Blocks all other access to the locked resource | Reduces write concurrency while still allowing other intent locks | Improves read concurrency initially, reduces write concurrency later |
Scalability | Scales well with read-heavy workloads | May impact performance due to blocking all access | May impact performance with high write concurrency | Can provide a balance between read and write concurrency |
Feature | Pessimistic Locking | Optimistic Locking |
---|---|---|
Visualization | ||
Locking Mechanism | Acquires locks on database records before any read/write operation | No explicit locks; relies on versioning or timestamps |
Transaction Isolation | Guarantees serializability (transactions appear to execute one after another) | Relies on conflict detection during commit |
Concurrency | Lower concurrency due to exclusive access | Higher concurrency as multiple transactions can read data concurrently |
Data Integrity | High; ensures only one transaction modifies data at a time | Lower; potential for "lost updates" if conflicts occur |
Implementation | Database-managed; different lock types (shared, exclusive) available | Application-level; relies on versioning mechanisms (e.g., version numbers, timestamps) in the database |
Error Handling | Rollback transactions that encounter locked records | Retry transactions that encounter conflicts during commit |
Use Cases | High-contention workloads where conflicts are likely and correctness is critical (e.g., banking, inventory reservation) | Low-contention, read-heavy workloads where conflicts are rare and retries are acceptable (e.g., typical web applications) |
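A rough sketch of the two strategies; the accounts table and its version column are illustrative, and the FOR UPDATE syntax follows PostgreSQL/MySQL conventions:

```sql
-- Pessimistic: lock the row before reading so no other transaction can change it mid-flight.
BEGIN;
SELECT balance FROM accounts WHERE account_id = 1 FOR UPDATE;  -- row is now locked
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
COMMIT;

-- Optimistic: no lock is taken; a version column detects concurrent changes at write time.
UPDATE accounts
SET    balance = balance - 100,
       version = version + 1
WHERE  account_id = 1
  AND  version = 7;   -- 7 = version read earlier; 0 rows updated means a conflict, so retry
```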
- CAP
- PACELC
- Consistency (C): Ensures that all nodes in the system have the same data at the same time
- Availability (A): Ensures that every request gets a response about whether it was successful or failed
- Partition Tolerance (P): Ensures that the system continues to operate despite network partitions or communication failures
Aspect | AP (Availability & Partition Tolerance) | CA (Consistency & Availability) | CP (Consistency & Partition Tolerance) |
---|---|---|---|
Visualization | |||
Definition | Stays available during a partition, but some reads may return stale (inconsistent) data | Consistent and available only while the network is healthy; cannot tolerate partitions | Stays consistent during a partition, but some requests may fail (unavailable) |
Use Cases | Social networks, real-time analytics, recommendation systems | Financial applications, e-commerce | Multi-datacenter deployments |
Examples | Cassandra, DynamoDB, Riak | Google Spanner, RDBMS with high availability configurations | MongoDB with replica sets, BigTable |
Theorem | Scope | Consistency Model | Latency Consideration |
---|---|---|---|
CAP | Focuses on impact of network partitions on consistency and availability | Binary choice between strong consistency and availability | Doesn't explicitly consider latency |
PACELC | Broader view, acknowledging trade-offs present even under normal operation | Consistency is treated as a spectrum, offering more nuanced options | Recognizes latency as a critical factor alongside consistency and availability (data replication can impact latency) |
Data Storage Solutions
- Overview
- Processing Types
- Data Processing
- Data Repositories
- File Storage
Key Points for Consideration
- Data Volume and Growth: How much data do you currently store, and what's the anticipated growth rate?
- Performance Requirements: How critical is fast access and retrieval of data for your operations?
- Data Security and Compliance: What security measures are necessary to safeguard sensitive data? Are there industry regulations to adhere to?
- Accessibility and Sharing Needs: Do you require remote access to data or collaboration features?
- Budgetary Constraints: What is your allocated budget for data storage solutions?
Feature | Online Analytical Processing (OLAP) | Online Transaction Processing (OLTP) |
---|---|---|
Purpose | Analytical processing for decision-making | Transaction processing for day-to-day operations |
Data Usage | Aggregates historical data for reporting and analysis | Processes real-time transactions and updates |
Data Schema | Star or snowflake schemas, denormalized | Normalized schemas, reducing redundancy |
Query Complexity | Complex queries with aggregations and joins | Simple queries with frequent read/write operations |
Data Granularity | Summarized, aggregated data | Detailed, individual transactions |
Data Size | Huge data volumes, typically in terabytes or more | Smaller data volumes, typically in gigabytes or less |
Performance | Designed for high throughput, slower write speeds | Optimized for fast write speeds and low latency reads |
Examples | Amazon Redshift, Google BigQuery, Snowflake | MySQL, PostgreSQL |
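A rough illustration of the difference using a hypothetical orders table (DATE_TRUNC follows PostgreSQL-style syntax):

```sql
-- OLTP-style query: fetch or change a handful of rows by key, as fast as possible.
SELECT order_id, status FROM orders WHERE order_id = 42;

-- OLAP-style query: scan and aggregate large amounts of history for analysis.
SELECT region,
       DATE_TRUNC('month', order_date) AS month,
       SUM(amount) AS revenue
FROM   orders
GROUP  BY region, DATE_TRUNC('month', order_date)
ORDER  BY month, region;
```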
Aspect | ETL | ELT | Reverse ETL | Medallion Architecture |
---|---|---|---|---|
Visualization | ||||
Process Flow | Extract data first, then transform and load into the target system | Extract data first, load into the target system, then transform within the target system | Extract data from the data warehouse or data lake, transform it as needed, and load it into operational systems (e.g., CRM, ERP) | Ingest raw data into a bronze layer, clean and conform it into a silver layer, then aggregate it into a gold layer within the lakehouse |
Data Transformation | Transformation occurs before loading into the target system | Transformation occurs after loading into the target system | Transformation occurs before loading into operational systems | Transformation occurs in multiple stages: raw (bronze), cleaned (silver), and aggregated (gold) |
Performance | Typically slower due to data transformation overhead during the ETL process | Generally faster because loading raw data is faster than transforming it during the ETL process | Depends on the complexity of transformations and the performance of the target operational systems | Optimized for performance at each stage, with raw data ingestion being fast and transformations being incremental |
Scalability | Limited scalability due to the need for substantial transformation before loading data | Highly scalable as it can leverage the processing power of the target system for transformations | Scalability depends on the capabilities of both the data warehouse/lake and the target operational systems | Highly scalable, designed to handle large volumes of data through incremental processing and layering |
Storage Requirements | Higher storage requirements as both raw and transformed data need to be stored | Lower storage requirements as only raw data needs to be stored initially, and transformation occurs within the target system | Requires storage in both the data warehouse/lake and the target operational systems | Requires storage for multiple layers (bronze, silver, gold), but optimizes storage through incremental processing |
Data Integrity | Higher data integrity as data is cleaned and transformed before loading into the target system | May require additional checks and controls within the target system to ensure data integrity post-transformation | Data integrity depends on the transformation logic and the capabilities of the target operational systems | High data integrity through staged processing and validation at each layer |
Complexity | Typically more complex due to the need for designing and managing transformation logic | Generally less complex as it leverages the capabilities of the target system for transformations | Can be complex due to the need to integrate with multiple operational systems and ensure data consistency | Can be complex due to the need to manage multiple layers and ensure data quality at each stage |
Flexibility | May be less flexible as transformation logic is predefined and applied uniformly to all data | More flexible as transformations can be tailored to specific use cases within the target system | Flexibility depends on the integration capabilities of the target operational systems | More flexible as it allows for different transformation logic at each layer |
Real-time Processing | Less suitable for real-time processing due to batch-oriented nature | More suitable for real-time processing as data can be loaded into the target system immediately and transformed on-the-fly | Can support real-time processing depending on the capabilities of the target operational systems | Can support real-time processing through incremental data ingestion and processing |
Examples | Apache Spark, Apache Flink, Google Cloud Dataproc | GCS → Spark jobs, Dataflow → BigQuery | Hightouch, Census, Grouparoo | Databricks Delta Lake (bronze/silver/gold tables), Delta Live Tables |
Use Cases | For well-defined data models and reporting needs | For big data, data lakes, and agile analytics environments | For syncing data from data warehouses/lakes to operational systems for improved decision-making and customer engagement | For building robust data pipelines that support data quality and governance |
Medallion Architecture
Aspect | Bronze Layer | Silver Layer | Gold Layer | Platinum Layer (Optional) |
---|---|---|---|---|
Purpose | Raw data ingestion and storage | Cleaned, validated, and conformed data | Business-level aggregated and curated data | Advanced business KPIs, machine learning-ready refined data |
Data Quality | Contains raw, unprocessed, high-fidelity data | Cleansed, deduplicated, and schema-applied data | Highly curated, denormalized, and enriched data | Highly refined, integrated, and often aggregated data for analytics or ML |
Processing | Minimal or no transformation; just ingestion | Data cleaning, validation, schema evolution, basic transformation | Business logic, aggregation, denormalization | Advanced transformations, integration, optimization |
Typical Data Types | Raw logs, sensor data, third-party sources, transactional data as-is | Structured and semi-structured data with errors removed and schema applied | Aggregates, summaries, reports, business metrics datasets | Project-specific or organizational-wide refined datasets usable across teams |
Storage | Immutable raw storage, often in data lake/file system (e.g., AWS S3, Azure ADLS, GCS) | Cleansed data layer usually on managed tables in lakehouse or data warehouse | Highly accessible curated datasets in optimized formats | Special schemas or databases optimized for fast access and analytics |
Schema | Schema-on-read or no schema (raw, semi-structured or unstructured) | Schema-on-write with enforcement, evolving schema | Fixed, optimized schema for analytics and reporting | Optimized and standardized schema for comprehensive analytics solutions |
Governance & Ownership | Typically ingestion teams or raw data owners | Data engineers and data analysts for quality and consistency | Business analysts and data consumers for performance and insights | Cross-functional governance, compliance, and enterprise data teams |
Latency | High latency acceptable | Lower latency, often near-real-time or batch processed | Low-latency, ready for fast analytics | Typically lowest latency, optimized for real-time or near real-time use cases |
Examples of Activities | Collect raw events, incremental loads, ingest logs or source dumps | Remove duplicates, correct errors, unify formats, join related datasets | Build customer 360 views, sales summaries, financial aggregations | Train ML models, deliver business-wide dashboards, advanced metrics |
Use Cases | Historical archive, full fidelity backups, source of truth for raw data | Basis for BI reporting, data science experiments, downstream transformations | End-user business analytics, dashboards, operational reporting | Cross-functional analytics, ML feature stores, enterprise-wide KPIs |
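A simplified sketch of moving data through the layers; the bronze_orders, silver_orders, and gold_daily_revenue tables are hypothetical:

```sql
-- Bronze: raw events landed as-is (e.g., bronze_orders).
-- Silver: cleaned, typed, and deduplicated version of bronze.
CREATE TABLE silver_orders AS
SELECT DISTINCT
       order_id,
       customer_id,
       CAST(amount AS DECIMAL(12, 2)) AS amount,
       CAST(order_date AS DATE)       AS order_date
FROM   bronze_orders
WHERE  order_id IS NOT NULL;

-- Gold: business-level aggregate built from silver.
CREATE TABLE gold_daily_revenue AS
SELECT order_date, SUM(amount) AS revenue
FROM   silver_orders
GROUP  BY order_date;
```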
Feature | Data Mart | Data Warehouse | Data Lake | Data Lakehouse | Data Fabric | Data Mesh |
---|---|---|---|---|---|---|
Visualization | ||||||
Definition | Subset of a Data Warehouse containing specific data focused on a particular business function or department | Central repository for structured, organized, and processed data, optimized for querying and analysis | Vast repository of raw, unstructured, or semi-structured data stored in its native format | Architecture combining the features of a Data Lake and a Data Warehouse, providing unified analytics on both raw and processed data | Technology-centric architecture that provides a unified, automated data integration and management layer across diverse data sources. Utilizes AI/ML and metadata-driven automation to enable seamless data access, discovery, and governance across the enterprise | Organizational and socio-technical approach emphasizing decentralization of data ownership to domain teams. Treats data as a product with domain-specific teams responsible for managing their data lifecycle, quality, and accessibility |
Data Type | Structured data tailored for specific business needs | Structured data, typically from operational systems | Raw, unstructured, semi-structured data | Raw, semi-structured, and structured data | Supports diverse data types from structured, semi-structured, and unstructured sources unified through virtualization and metadata management. Data fabric abstracts underlying formats | Data types depend on domain-specific needs; each domain decides appropriate data types and formats. Typically supports relational, NoSQL, real-time streams, and other domain-tailored formats |
Data Storage | Similar to Data Warehouse, stored in relational databases or columnar stores | Stored in a structured format like relational databases (e.g., SQL Server, PostgreSQL) or columnar stores (e.g., Redshift, BigQuery) | Typically stored in distributed file systems like HDFS, AWS S3, or Azure Data Lake Storage | Usually stored in a combination of Data Lake storage and structured data formats like Parquet, Delta Lake, or Apache Iceberg | Centralized or logically unified data storage architecture employing data lakes, data warehouses, multi-cloud, and edge storage, often integrated via a virtual data layer. Focus on centralized control and unified view | Decentralized data storage managed by domain teams. Storage technology choices vary per domain, including traditional databases, NoSQL, and streaming platforms. No single point of control; storage is distributed across domains |
Data Processing | Similar to Data Warehouse, batch processing is prevalent, with ETL tools used for data transformation | Primarily batch processing. Data transformation and ETL processes are well-defined and structured. Tools like Informatica, Talend, or Apache Airflow are commonly used | Supports batch and real-time processing. Processing is done on raw data, often using technologies like Hadoop, Spark, or Apache Flink | Combines batch and real-time processing capabilities. Data is transformed and unified using tools like Apache Spark, Databricks, or Delta Lake | Centralized or federated processing combined with intelligent pipeline orchestration and automation. Supports batch and real-time analytics, ETL/ELT, machine learning workflows. AI-driven automation optimizes data processing | Decentralized data processing tailored to domain-specific requirements. Domains implement their own processing pipelines using tools that fit their context (e.g., Apache Kafka, Apache Spark). Emphasis on domain autonomy and agility |
Use Cases | Tailored for specific business functions or departments requiring localized analytics and reporting. Commonly used in finance, sales, or marketing | Best suited for structured reporting, business intelligence, and historical analysis. Ideal for organizations with well-defined data requirements | Suitable for exploratory analytics, machine learning, and big data processing where flexibility and scalability are critical | Ideal for organizations looking to combine the benefits of Data Lakes and Data Warehouses for unified analytics on both raw and processed data | Enterprises needing seamless integration and governance across disparate data sources, requiring a unified data access layer for analytics, compliance, and AI/ML applications. Suitable for organizations favoring centralized data control | Large complex organizations with multiple data domains requiring domain autonomy, faster innovation, and domain-aligned data products. Ideal where decentralized management drives agility, domain ownership, and product thinking for data |
Examples | Sales Data Mart, Finance Data Mart, HR Data Mart | Amazon Redshift, Google BigQuery, Snowflake | Hadoop Distributed File System (HDFS), Amazon S3, Azure Data Lake Storage | Databricks Delta Lake, AWS Glue, Google BigQuery Omni | Platforms and solutions leveraging metadata-driven automation and data virtualization, e.g., Informatica Intelligent Data Platform, IBM Data Fabric, TIBCO Data Fabric, HPE Ezmeral Data Fabric | Organizational implementation of domain-oriented data products, e.g., Netflix's domain-driven data ownership, LinkedIn's decentralized data pipelines, and the use of tools like Apache Kafka and data catalogs supporting federated governance |
Feature | Parquet | Avro |
---|---|---|
Data Storage | Columnar | Row-based |
Schema | Self-describing, stored in file metadata | Stored with data, language-independent |
Compression | Highly compressed, supports multiple compression codecs (Snappy, Gzip, LZO, etc.) | Compressed, supports Deflate and Snappy codecs |
Performance | Optimized for analytical (OLAP) workloads, fast data retrieval and processing | Optimized for write-intensive, big data operations, efficient for accessing all fields |
Language Support | Language-agnostic, supported by many big data frameworks (Spark, Hive, Impala) | Language-independent, can be used across different programming languages |
Advantages | Excellent compression and encoding for columnar data; reads only the columns a query needs, making analytical scans fast and cheap | Compact row-based serialization with strong schema-evolution support; efficient for writes and for reading whole records |
Disadvantages | Slower for write-heavy or record-at-a-time workloads; not well suited to streaming small messages | Less efficient for analytical queries that touch only a few columns, since whole rows must be read |
Use Cases | Data lakes and warehouses, Spark/Hive/Presto analytics, long-term columnar storage | Event streaming (e.g., Kafka message serialization), write-intensive pipelines, data interchange between services |
Database Federation
Definition | Visualization | Features | Pros | Cons | Use Cases | Examples |
---|---|---|---|---|---|---|
Data Federation is a technique used to integrate data from disparate sources and provide a unified, coherent view of data to the user. Often used in enterprise applications where data is distributed across multiple databases or systems | | Data stays in the source systems; a federation layer routes queries and combines results virtually at query time | Unified access without copying or migrating data; near-real-time view of the sources | Query performance is bounded by the slowest source; limited cross-source optimization and transaction support | Cross-system reporting, unified views across departments or after mergers | PostgreSQL foreign data wrappers, Trino/Presto, Denodo |
Relational Database
- Overview
- Detailed
- Data Structures
- Syntax
- SQL Query Techniques
- Normalization
- Joins
- Relations
- Common Issues
Structured Query Language (SQL) is a programming language used for managing and manipulating relational databases.
Query Flow
Key Concepts
- Attribute: A property or characteristic of an entity
- Column: A vertical data element in a table
- Constraint: A rule that values in a column or table must satisfy, such as NOT NULL or UNIQUE, enforced whenever rows are inserted or updated (column/table constraints)
- Database Management System (DBMS): The software that manages and controls access to a database
- Default: A value automatically assigned to a column when an inserted row does not supply one
- Primary Key: Non-null unique identifier for a row
- Foreign Key: Field or combination of fields that establishes a link between two tables. It enforces referential integrity by ensuring that values in one table's key match with values in another table's key. This relationship allows for data consistency and facilitates queries across related tables
- Record: A row in a table
- Schema: The structure or blueprint of the database, defining the tables, columns, and relationships
- Table: A collection of related data
- View: A virtual table defined by a query over one or more physical or "real" tables
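A small schema sketch tying several of these concepts together; the departments and employees tables are made up for illustration:

```sql
CREATE TABLE departments (
    department_id INT PRIMARY KEY,                -- primary key: non-null, unique row identifier
    name          VARCHAR(100) NOT NULL UNIQUE    -- column-level constraints
);

CREATE TABLE employees (
    employee_id   INT PRIMARY KEY,
    department_id INT REFERENCES departments (department_id),  -- foreign key linking the tables
    hired_on      DATE DEFAULT CURRENT_DATE,      -- default applied when no value is supplied
    salary        DECIMAL(10, 2) CHECK (salary >= 0)
);

-- A view: a virtual table defined by a query.
CREATE VIEW department_headcount AS
SELECT d.name AS department, COUNT(*) AS headcount
FROM   employees e
JOIN   departments d ON d.department_id = e.department_id
GROUP  BY d.name;
```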
- Subsets of SQL
- WHERE vs HAVING
- TRUNCATE vs DELETE
- Table Keys
- Aggregation vs Window Function
- Set Theory
Language | Purpose | Examples | Features |
---|---|---|---|
DCL (Data Control Language) | Manages access and permissions to data | GRANT , REVOKE | Controls who can access, manipulate, or delete data |
DDL (Data Definition Language) | Defines and manages database structure | CREATE , ALTER , DROP | Used to define tables, indexes, constraints, etc |
DML (Data Manipulation Language) | Manipulates data within the database | INSERT , UPDATE , DELETE | Allows adding, modifying, and removing data |
DQL (Data Query Language) | Retrieves data from the database | SELECT | Primarily used for querying data from tables |
Aspect | WHERE Clause | HAVING Clause |
---|---|---|
Purpose | Filters individual rows based on specific conditions | Filters groups of rows after aggregation (using GROUP BY ) |
Data Considered | All rows in the table | Groups created by the GROUP BY clause |
Condition Type | Can use any comparison operators, logical operators | Typically uses aggregate functions (SUM, COUNT, AVG, MIN, MAX), grouping columns, or expressions involving them |
Execution Timing | Applied before rows are grouped (more efficient) | Applied after rows are grouped (less efficient) |
Requirement | Can be used with or without GROUP BY | Requires a GROUP BY clause |
Compatibility | Can be used with SELECT , UPDATE , and DELETE statements | Can only be used with SELECT statements |
Filtering Logic | Filters rows that meet the condition, excluding others | Filters groups that meet the condition, excluding others |
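A short example combining both clauses on a hypothetical employees table:

```sql
-- WHERE filters individual rows before grouping; HAVING filters the groups afterwards.
SELECT   department_id, AVG(salary) AS avg_salary
FROM     employees
WHERE    hired_on >= DATE '2020-01-01'      -- row-level filter (applied first)
GROUP BY department_id
HAVING   AVG(salary) > 50000;               -- group-level filter (applied after aggregation)
```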
Feature | TRUNCATE | DELETE |
---|---|---|
Category | DDL (Data Definition Language) | DML (Data Manipulation Language) |
Function | Removes all rows from a table | Removes specific rows based on conditions (WHERE clause) |
Filtering | Not possible | Possible using WHERE clause |
Transaction | Implicitly commits in most engines, so it generally cannot be part of a larger transaction | Can be used within a transaction |
Constraints | Typically rejected (or requires CASCADE) when the table is referenced by foreign keys; constraints are not checked row by row | Checks foreign key and other constraints for each affected row |
Transaction Log | Records only deallocated data pages | Records each deleted row |
Rollback | Not possible | Possible (if transaction is active) |
Speed | Faster | Slower |
Locking | Exclusive lock on the table | Locks individual rows being deleted |
Identity Columns | Resets auto-incrementing values | Preserves existing values |
Triggers | Does not fire triggers | May fire triggers (depending on definition) |
Permissions Required | ALTER on the table | DELETE on the table |
Use Cases | Quickly emptying staging or temporary tables, resetting test data | Removing specific rows, or deletes that must fire triggers or remain reversible within a transaction |
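A quick sketch of the difference on a hypothetical staging_orders table (TRUNCATE rollback behavior varies by engine):

```sql
-- DELETE: row-level, filterable, and transactional.
BEGIN;
DELETE FROM staging_orders WHERE loaded_at < DATE '2024-01-01';
ROLLBACK;   -- the deleted rows are restored

-- TRUNCATE: empties the whole table quickly; in most engines it resets identity counters
-- and cannot be filtered or rolled back.
TRUNCATE TABLE staging_orders;
```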
Feature | Primary Key | Unique Key | Foreign Key |
---|---|---|---|
Purpose | Uniquely identifies each row | Ensures uniqueness for a set of columns | Links data between two tables |
Uniqueness | Mandatory (one and only one) | Enforced (no duplicates) | Enforced (references unique values) |
Null Values | Not allowed | Allowed (one NULL in some DBMSs, multiple in others) | Allowed unless the column is declared NOT NULL |
Number per Table | One | Multiple | Multiple (each references one parent table) |
Indexes | Typically creates a clustered index | Creates a unique (usually non-clustered) index | Typically does not create an index automatically |
Auto Increment | Supported | Not supported | Not supported |
Foreign Key Reference | Can be referenced by foreign keys | Can be referenced by foreign keys | References a primary or unique key |
Feature | Aggregation | Window Function |
---|---|---|
Functionality | Applies to entire groups of data | Applies to each row within a group |
Purpose | Summarizes data by reducing it to a single value per group | Calculates running totals, subtotals, ranking, percentiles, on a defined partition of data |
Common Functions | SUM, COUNT, AVG, MIN, MAX | SUM() OVER, ROW_NUMBER, RANK, DENSE_RANK, PERCENT_RANK, LAG, LEAD |
Output Format | Reduced result set with one row per group | Preserves the original result set structure with the same number of rows |
Example | SELECT department, COUNT(*) FROM employees GROUP BY department | SELECT name, salary, AVG(salary) OVER (PARTITION BY department ORDER BY hire_date ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) FROM employees |
Use Cases | Suitable for summarizing data across groups, such as calculating total sales per region or average order value per customer segment | Ideal for analytical tasks requiring comparisons between rows, like calculating moving averages, cumulative sums, or identifying top N values within groups |
Feature | UNION | UNION ALL | INTERSECT | EXCEPT |
---|---|---|---|---|
Purpose | Combines the results of two or more sets and removes duplicates | Combines the results of two or more sets without removing duplicates | Returns elements common to both sets | Returns elements present in the first set but not in the second |
Syntax | SELECT id FROM A UNION SELECT id FROM B | SELECT id FROM A UNION ALL SELECT id FROM B | SELECT id FROM A INTERSECT SELECT id FROM B | SELECT id FROM A EXCEPT SELECT id FROM B |
Example | A = {1, 2, 3}, B = {2, 4, 5} → {1, 2, 3, 4, 5} (w/o duplicates) | A = {1, 2, 3}, B = {2, 4, 5} → {1, 2, 2, 3, 4, 5} (w/ duplicates) | A = {1, 2, 3}, B = {2, 4, 5} → {2} (elements common to both sets) | A = {1, 2, 3}, B = {2, 4, 5} → {1, 3} (elements in A but not B) |
Type | Visualization | Definition | Features | Use Cases |
---|---|---|---|---|
Skip List | | Probabilistic data structure for ordered sets or maps, offering O(log n) average-case search and insertion with a simpler implementation than balanced trees | Used in Redis sorted sets | In-memory |
Hash Index | | Uses a hash function to map keys directly to their locations, ideal for fast point lookups | Most common in-memory index solution | In-memory |
SSTable | | File format for storing key-value data in sorted order on disk, enabling efficient retrieval | Immutable data structure; seldom used alone | Disk-based |
LSM Tree | | Combines an in-memory structure with on-disk SSTables, optimizing write performance with background merging (compaction) for efficient reads | High write throughput; disk compaction may impact performance | Memory + disk |
B-Tree | | Self-balancing tree data structure for sorted data, allowing efficient search, insertion, and deletion operations | Most popular database index implementation | Disk-based |
Inverted Index | | Maps words to the documents they appear in, facilitating fast full-text searches | Used in document search engines such as Lucene | Full-text document search |
Suffix Tree | | Stores the suffixes of a text, enabling efficient searches for patterns and substrings | Used in string search, such as suffix matching | String/substring search |
R-Tree | | Stores spatial data like points, rectangles, or polygons, allowing efficient searches for objects within a specific area | Supports nearest-neighbor queries | Multi-dimensional/spatial search |
Type | Definition | Example |
---|---|---|
SELECT | Retrieves data from a database | SELECT column1, column2 FROM table_name |
INSERT | Adds new records to a table | INSERT INTO table_name (column1, column2) VALUES (value1, value2) |
UPDATE | Modifies existing records in a table | UPDATE table_name SET column1 = value1, column2 = value2 WHERE condition |
DELETE | Deletes records from a table | DELETE FROM table_name WHERE condition |
JOIN | Combines rows from two or more tables based on a related column | SELECT * FROM table1 INNER JOIN table2 ON table1.column = table2.column |
GROUP BY | Groups rows with identical values into summary rows | SELECT column1, COUNT(*) FROM table_name GROUP BY column1 |
HAVING | Filters groups produced by the GROUP BY clause | SELECT column1, COUNT(*) FROM table_name GROUP BY column1 HAVING COUNT(*) > 10 |
ORDER BY | Sorts the result set in ascending or descending order | SELECT * FROM table_name ORDER BY column1 DESC |
WHERE | Filters records based on specified conditions | SELECT * FROM table_name WHERE condition |
DISTINCT | Returns only distinct (different) values | SELECT DISTINCT column1 FROM table_name |
UNION | Combines the result sets of two or more SELECT statements | SELECT column1 FROM table1 UNION SELECT column1 FROM table2 |
TRANSACTION | Groups a set of SQL statements into a single unit of work | BEGIN TRANSACTION; ... COMMIT; or ROLLBACK; |
INDEX | Creates an index on a table | CREATE INDEX index_name ON table_name (column1) |
VIEW | Virtual table derived from one or more tables | CREATE VIEW view_name AS SELECT column1, column2 FROM table_name WHERE condition |
TRIGGER | Executes a set of actions when a certain event occurs on a table | CREATE TRIGGER trigger_name BEFORE INSERT ON table_name FOR EACH ROW BEGIN ... END |
SUBQUERY | Nested query inside another query | SELECT column1 FROM table_name WHERE column1 IN (SELECT column1 FROM table2 WHERE condition) |
CASE | Provides conditional logic within a query | SELECT column1, CASE WHEN condition THEN result1 ELSE result2 END AS result FROM table_name |
Aspect | Common Table Expression (CTE) | Partition Over | Windowing | Materialized View |
---|---|---|---|---|
Definition | Temporary named result set within the scope of a single SQL statement | SQL clause to divide query result sets into partitions | Functions defined over a window/partition of rows in a query | Precomputed and stored result of a query as a physical table |
Scope | Exists only during the execution of a single query | Applies within a query's window function | Used within SQL queries to perform calculations over sets | Persisted in database until refreshed or dropped |
Purpose | Improve query readability, enable stepwise data transformation | Define subset of rows to apply window functions | Perform calculations like ranking, running totals, moving averages | Improve query performance by caching results |
Data Persistence | None - ephemeral and transient | None - part of query execution | None - part of query execution | Yes - physical table stored on disk |
Usage | Create intermediate result sets for reuse in complex queries | Used with window functions to partition data | Calculate aggregates and analytics over row windows | Store complex or aggregated query results |
Performance Impact | May or may not improve depending on DB optimizer and usage | Overhead varies, used to optimize window function execution | Allows efficient in-query analytics without multiple scans | Speeds up query execution by avoiding repeated computations |
Recursiveness | Supports recursion in some DBMS (e.g., PostgreSQL) | Not recursive | Not recursive | Not recursive |
Reusability | Within single query only | Within window function context | Within window function context | Across multiple queries until refreshed |
Refresh/Update | Not applicable | Not applicable | Not applicable | Can be refreshed on schedule, manually, or incrementally |
Storage Requirement | No physical storage | No physical storage | No physical storage | Requires storage space for materialized data |
Comparison to Subqueries/Temp Tables | More readable than nested subqueries; ephemeral unlike temp tables | Not comparable as standalone; part of windowing | Not comparable standalone; used for row-wise calculations | Persistent, unlike views or CTEs |
Pros | Improves query clarity and debugging, enables recursion | Enables granular application of window functions | Powerful analytics over ordered subsets | Significant query performance improvement via caching |
Cons | Depending on DB engine, may have unknown performance tradeoffs | Adds overhead to window function processing | Complexity in understanding window frame boundaries | Requires maintenance and storage, possible staleness of data |
Use Cases | Simplify complex query logic, prepare reusable intermediate results | Partition data for window function analytics | Analytics like rankings, running totals, moving averages | Speed up expensive or repeated analytical queries |
Examples | Breaking multi-step queries into readable named parts | PARTITION BY clause in window functions | ROW_NUMBER() OVER (PARTITION BY col ORDER BY col) | Precomputed aggregation tables, snapshot of query results |
Syntax Example | WITH cte AS (SELECT ...) SELECT * FROM cte | OVER (PARTITION BY col) | ROW_NUMBER() OVER (PARTITION BY col ORDER BY col) | CREATE MATERIALIZED VIEW mv AS SELECT ... |
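A rough sketch of a CTE feeding a window function, plus a PostgreSQL-style materialized view, over a hypothetical orders table:

```sql
-- CTE + window function: a named, query-scoped result set ranked within each month.
WITH monthly_sales AS (
    SELECT customer_id,
           DATE_TRUNC('month', order_date) AS month,
           SUM(amount) AS total
    FROM   orders
    GROUP  BY customer_id, DATE_TRUNC('month', order_date)
)
SELECT customer_id,
       month,
       total,
       RANK() OVER (PARTITION BY month ORDER BY total DESC) AS rank_in_month
FROM   monthly_sales;

-- Materialized view: the same result persisted on disk and refreshed on demand.
CREATE MATERIALIZED VIEW monthly_sales_mv AS
SELECT customer_id, DATE_TRUNC('month', order_date) AS month, SUM(amount) AS total
FROM   orders
GROUP  BY customer_id, DATE_TRUNC('month', order_date);

REFRESH MATERIALIZED VIEW monthly_sales_mv;
```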
Process of organizing data in a relational database to minimize redundancy and improve data integrity.
- Reduce data duplication: This saves storage space and minimizes maintenance headaches
- Enhance data consistency: When a piece of data changes, it only needs to be updated in one place, ensuring consistency across the database
- Improve data retrieval efficiency: Normalized databases allow for faster and more efficient querying of data
Normalization is achieved through a series of steps, each referred to as a Normal Form (NF). Each subsequent form builds upon the previous one, progressively reducing redundancy
Normal Form | Key Points | Steps | Example |
---|---|---|---|
0NF | Non-normalized data with repeating groups and redundancy | | |
1NF | Each column holds atomic (indivisible) values; no repeating groups; each row is uniquely identifiable | Split multi-valued columns into separate rows or columns; define a primary key | |
2NF | In 1NF, and every non-key column depends on the whole primary key (no partial dependencies) | Move columns that depend on only part of a composite key into their own table | |
3NF | In 2NF, and no non-key column depends on another non-key column (no transitive dependencies) | Move transitively dependent columns into their own table | |
Boyce-Codd Normal Form (BCNF) | Stricter 3NF: every determinant is a candidate key | | |
4NF | In BCNF, and no multi-valued dependencies | | |
5NF | In 4NF, and no join dependencies; the table cannot be losslessly decomposed further | | |
6NF | Decomposes tables into irreducible relations; mainly used in temporal databases | | |
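An illustrative decomposition toward 3NF; the customers, orders, and order_items tables are hypothetical:

```sql
-- Unnormalized: one wide table repeats customer details on every order row.
-- orders_flat(order_id, customer_name, customer_city, product, quantity, ...)

-- 3NF-style decomposition: each fact is stored once and linked by keys.
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name        VARCHAR(100),
    city        VARCHAR(100)
);

CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT REFERENCES customers (customer_id),
    order_date  DATE
);

CREATE TABLE order_items (
    order_id   INT REFERENCES orders (order_id),
    product_id INT,
    quantity   INT,
    PRIMARY KEY (order_id, product_id)
);
```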
Normalization vs. Denormalization
Aspect | Normalization | Denormalization |
---|---|---|
Purpose | Eliminate data redundancy, maintain data integrity | Introduce controlled redundancy to improve read performance |
Data Structure | Data is split into multiple related tables | Data is combined into fewer tables with redundancy |
Redundancy | Minimal or no redundancy | High redundancy due to duplicated data |
Query Complexity | Complex queries requiring multiple JOINs | Simplified queries with fewer or no JOINs |
Storage Usage | Optimizes storage by avoiding duplication | Requires more storage due to duplicated data |
Performance Optimization | Optimized for write operations (inserts, updates) | Optimized for read operations and faster query retrieval |
Data Integrity & Consistency | Easier to enforce data integrity and consistency | Higher risk of data inconsistencies and anomalies |
Maintenance Effort | Easier to maintain and update due to no duplicated data | More complex maintenance and potential update anomalies |
Storage Cost | Lower, due to reduced duplication | Higher, due to redundant data storage |
Write Operation Cost | Lower, fewer duplicate updates needed | Higher, updates must be propagated across duplicates |
Read Operation Cost | Higher, due to required joins | Lower, data is pre-joined or duplicated |
Referential Integrity | Enforced via foreign keys and constraints | Often reduced or absent, making integrity harder to enforce |
Scalability | Suits relational databases | Often better for distributed NoSQL or denormalized stores |
Impact on Indexing | Indexes can be spread across multiple tables | Indexing might be faster since related data is stored together |
Risk of Anomalies | Low (insertion, update, deletion anomalies minimized) | Higher risk due to duplicated data |
Development Complexity | Requires more thoughtful schema design and normalization steps | Simpler schema but requires careful management of duplicates |
Use Cases | Systems with frequent updates where accuracy is critical | Read-heavy systems where fast data retrieval is crucial |
Examples | Social media profiles, inventory systems | Reporting systems, dashboards, analytical queries |
Example Scenario | Separate Customer and Order tables linked by CustomerID | Customer data embedded directly in Order table to avoid joins |
Type | Visualization | Definition | Syntax | Example |
---|---|---|---|---|
Left Join / Left Outer Join | | Everything on the left + everything on the right that matches | SELECT * FROM table1 LEFT JOIN table2 ON table1.key = table2.key | |
Anti Left Join | | Everything on the left that is NOT on the right | SELECT * FROM table1 LEFT JOIN table2 ON table1.key = table2.key WHERE table2.key IS NULL | |
Right Join / Right Outer Join | | Everything on the right + everything on the left that matches | SELECT * FROM table1 RIGHT JOIN table2 ON table1.key = table2.key | |
Anti Right Join | | Everything on the right that is NOT on the left | SELECT * FROM table1 RIGHT JOIN table2 ON table1.key = table2.key WHERE table1.key IS NULL | |
Full Outer Join | | Everything on the left + everything on the right | SELECT * FROM table1 FULL OUTER JOIN table2 ON table1.key = table2.key | |
Anti Outer Join | | Everything on the left and right that is unique to each other | SELECT * FROM table1 FULL OUTER JOIN table2 ON table1.key = table2.key WHERE table1.key IS NULL OR table2.key IS NULL | |
Join / Inner Join | | Only the rows that are common to both tables | SELECT * FROM table1 INNER JOIN table2 ON table1.key = table2.key | |
Cross Join | | All combinations of rows from both tables (Cartesian product) | SELECT * FROM table1 CROSS JOIN table2 | |
Self Join | | The table is joined with itself | SELECT * FROM table1 AS t1 INNER JOIN table1 AS t2 ON t1.key = t2.key | |
Feature | One-to-One | One-to-Many | Many-to-Many |
---|---|---|---|
Visualization | |||
Definition | Each record in one table is associated with exactly one record in another | Each record in one table can be associated with one or more records in another | Records in both tables can be associated with multiple records in the other |
Example | Person → Passport | Department → Employee | Student → Course |
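A many-to-many relationship is usually modeled with a junction (bridge) table; the table names below are illustrative:

```sql
CREATE TABLE students (
    student_id INT PRIMARY KEY,
    name       VARCHAR(100)
);

CREATE TABLE courses (
    course_id INT PRIMARY KEY,
    title     VARCHAR(100)
);

-- Junction table: one row per student-course pair.
CREATE TABLE enrollments (
    student_id INT REFERENCES students (student_id),
    course_id  INT REFERENCES courses (course_id),
    PRIMARY KEY (student_id, course_id)
);
```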
Issue | Definition | Impact | Solutions |
---|---|---|---|
Cartesian Products | Occurs when 2 or more tables are joined without a proper join condition or WHERE clause, resulting in huge datasets | Massive intermediate results, slow queries, high memory and I/O usage | Always specify join conditions; review query plans for unintended cross joins |
Connections overhead | Opening many separate connections to the same database instead of reusing them | Exhausted connection limits, added latency, wasted server resources | Use connection pooling and reuse connections; tune pool sizes |
Data Inconsistency | Same data exists in different forms across the database | Conflicting query results, unreliable reports, hard-to-trace bugs | Normalize the schema, enforce constraints, maintain a single source of truth |
Deadlocks | 2 or more transactions are waiting for each other to release locks | Blocked or aborted transactions, application errors and retries | Acquire locks in a consistent order, keep transactions short, detect and retry deadlocked transactions |
Full Table Scans | Scanning the entire table to satisfy a query instead of using indexes | Slow queries and high I/O, especially on large tables | Add appropriate indexes; rewrite predicates so existing indexes can be used |
Missing Indexes | Indexes that are required to satisfy a query are missing | Queries fall back to full table scans, degrading performance as data grows | Analyze slow queries and execution plans; create indexes on frequently filtered and joined columns |
N+1 | Initial query fetches N rows, and then N additional queries are executed to fetch related data for each row | Many unnecessary round trips, poor latency under load | Fetch related data with JOINs, batched queries, or ORM eager loading |
Over/Under fetching | Overfetching occurs when unnecessary data is fetched from the database, while underfetching occurs when necessary data is not fetched | Wasted bandwidth and memory, or extra follow-up queries | Select only the columns needed; design queries and endpoints around actual access patterns |
Poorly Designed Schema | Lacks normalization, proper data types, and relationships, leading to inefficient data access | Redundant data, anomalies, slow and convoluted queries | Normalize where appropriate, choose correct data types, define keys and relationships |
Subquery Inefficiency | Subqueries are used inefficiently, resulting in poor query performance | Repeated subquery execution and poor plan choices | Rewrite correlated subqueries as JOINs or CTEs; use EXISTS where appropriate |
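For example, the N+1 issue can often be avoided by fetching parents and children in a single round trip; the departments and employees tables here are hypothetical:

```sql
-- N+1 pattern: one query for the parent rows, then one extra query per row for its children.
-- SELECT * FROM departments;                             -- 1 query
-- SELECT * FROM employees WHERE department_id = :id;     -- repeated N times

-- Single-query alternative: a JOIN returns parents and children together.
SELECT d.department_id,
       d.name       AS department,
       e.employee_id,
       e.name       AS employee
FROM   departments d
LEFT JOIN employees e ON e.department_id = d.department_id;
```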