Fundamentals
- Overview
- Selection Process
- Transactions
- Locking
- Theorems
DB Type | Features | Use Cases |
---|---|---|
Columnar | Stores data by column rather than by row, enabling high compression and fast scans over selected columns | Analytics and data warehousing (e.g., Amazon Redshift, ClickHouse) |
Document | Stores semi-structured, JSON-like documents with flexible schemas | Content management, catalogs, user profiles (e.g., MongoDB, Couchbase) |
Graph | Models data as nodes and edges, optimized for traversing relationships | Social networks, fraud detection, recommendations (e.g., Neo4j) |
Key-Value | Simple key-to-value lookups with very low latency | Caching, session storage, feature flags (e.g., Redis, DynamoDB) |
NewSQL | Distributed SQL databases that keep ACID guarantees while scaling horizontally | Globally distributed transactional workloads (e.g., CockroachDB, Google Spanner) |
NoSQL | Umbrella term for non-relational stores (document, key-value, columnar, graph) favoring flexible schemas and horizontal scaling | Large-scale or rapidly evolving data models |
Object-Oriented | Stores data as objects, aligning with object-oriented programming languages | Applications with complex object graphs (e.g., ObjectDB) |
Relational / SQL | Structured tables with fixed schemas, joins, and ACID transactions | General-purpose transactional systems (e.g., PostgreSQL, MySQL) |
Spatial | Indexes and queries geometric and geographic data | Mapping, geofencing, location-based services (e.g., PostGIS) |
Time-Series | Optimized for timestamped data, high ingest rates, and time-window queries | Monitoring, IoT telemetry, financial ticks (e.g., InfluxDB, TimescaleDB) |
Vector DB | Stores high-dimensional embeddings and supports similarity (nearest-neighbor) search | Semantic search, recommendations, retrieval for LLMs (e.g., Pinecone, Milvus) |
- Selection
- Choice
Type | Definition | Priority | Transaction Failure | Data Consistency | Use Cases |
---|---|---|---|---|---|
ACID | Atomicity, Consistency, Isolation, Durability | Consistency | Entire transaction rolls back | Guaranteed immediately | Financial Transactions (Banking, Stock Trading) |
BASE | Basically Available, Soft state, Eventual consistency | Availability | May proceed with eventual consistency | Eventual, but not always immediate | Social Media Platforms |
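A minimal sketch of an ACID-style transfer; the accounts table, columns, and amounts are illustrative:

```sql
-- Both updates succeed together or not at all.
BEGIN;

UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;

-- If either statement fails (e.g., a CHECK constraint rejects a negative balance),
-- issue ROLLBACK instead so no partial transfer ever becomes visible.
COMMIT;
```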
- Locking Hierarchy
- Types
Aspect | Intent Shared (IS) | Exclusive (X) | Intent Exclusive (IX) | Shared with Intent Exclusive (SIX) |
---|---|---|---|---|
Definition | Signals intent to acquire shared (read) locks on lower-level resources (rows or pages); compatible with other readers | Prevents all other transactions from reading or writing the locked resource | Signals intent to acquire exclusive locks on lower-level resources; compatible with other intent locks on the same table | Holds a shared lock on the whole resource plus intent to acquire exclusive locks on some rows; allows other readers but no other writers |
Usage | Taken on a table before reading individual rows with shared locks | Taken on a row or table being modified (INSERT, UPDATE, DELETE) | Taken on a table before modifying individual rows | Taken when scanning an entire table while updating only some of its rows |
Impact on Concurrency | Improves read concurrency | Blocks all other access to the locked resource | Reduces write concurrency while still allowing other intent locks | Improves read concurrency initially, reduces write concurrency later |
Scalability | Scales well with read-heavy workloads | May impact performance due to blocking all access | May impact performance with high write concurrency | Can provide a balance between read and write concurrency |
Feature | Pessimistic Locking | Optimistic Locking |
---|---|---|
Visualization | ||
Locking Mechanism | Acquires locks on database records before any read/write operation | No explicit locks; relies on versioning or timestamps |
Transaction Isolation | Guarantees serializability (transactions appear to execute one after another) | Relies on conflict detection during commit |
Concurrency | Lower concurrency due to exclusive access | Higher concurrency as multiple transactions can read data concurrently |
Data Integrity | High; ensures only one transaction modifies data at a time | Lower; potential for "lost updates" if conflicts occur |
Implementation | Database-managed; different lock types (shared, exclusive) available | Application-level; relies on versioning mechanisms (e.g., version numbers, timestamps) in the database |
Error Handling | Rollback transactions that encounter locked records | Retry transactions that encounter conflicts during commit |
Use Cases | High-contention workloads where conflicts are likely and correctness is critical (e.g., banking, inventory reservation) | Low-contention, read-heavy workloads where conflicts are rare and retries are acceptable (e.g., typical web applications) |
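A rough sketch of the two strategies; the accounts table and its version column are illustrative, and the FOR UPDATE syntax follows PostgreSQL/MySQL conventions:

```sql
-- Pessimistic: lock the row before reading so no other transaction can change it mid-flight.
BEGIN;
SELECT balance FROM accounts WHERE account_id = 1 FOR UPDATE;  -- row is now locked
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
COMMIT;

-- Optimistic: no lock is taken; a version column detects concurrent changes at write time.
UPDATE accounts
SET    balance = balance - 100,
       version = version + 1
WHERE  account_id = 1
  AND  version = 7;   -- 7 = version read earlier; 0 rows updated means a conflict, so retry
```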
- CAP
- PACELC
- Consistency (C): Ensures that all nodes in the system have the same data at the same time
- Availability (A): Ensures that every request gets a response about whether it was successful or failed
- Partition Tolerance (P): Ensures that the system continues to operate despite network partitions or communication failures
Aspect | AP (Availability & Partition Tolerance) | CA (Consistency & Availability) | CP (Consistency & Partition Tolerance) |
---|---|---|---|
Visualization | |||
Definition | Stays available during a partition, but some reads may return stale (inconsistent) data | Consistent and available only while the network is healthy; cannot tolerate partitions | Stays consistent during a partition, but some requests may fail (unavailable) |
Use Cases | Social networks, real-time analytics, recommendation systems | Financial applications, e-commerce | Multi-datacenter deployments |
Examples | Cassandra, DynamoDB, Riak | Google Spanner, RDBMS with high availability configurations | MongoDB with replica sets, BigTable |
Theorem | Scope | Consistency Model | Latency Consideration |
---|---|---|---|
CAP | Focuses on impact of network partitions on consistency and availability | Binary choice between strong consistency and availability | Doesn't explicitly consider latency |
PACELC | Broader view, acknowledging trade-offs present even under normal operation | Consistency is treated as a spectrum, offering more nuanced options | Recognizes latency as a critical factor alongside consistency and availability (data replication can impact latency) |
Data Storage Solutions
- Overview
- Processing Types
- Data Processing
- Data Repositories
- File Storage
Key Points for Consideration
- Data Volume and Growth: How much data do you currently store, and what's the anticipated growth rate?
- Performance Requirements: How critical is fast access and retrieval of data for your operations?
- Data Security and Compliance: What security measures are necessary to safeguard sensitive data? Are there industry regulations to adhere to?
- Accessibility and Sharing Needs: Do you require remote access to data or collaboration features?
- Budgetary Constraints: What is your allocated budget for data storage solutions?
Feature | Online Analytical Processing (OLAP) | Online Transaction Processing (OLTP) |
---|---|---|
Purpose | Analytical processing for decision-making | Transaction processing for day-to-day operations |
Data Usage | Aggregates historical data for reporting and analysis | Processes real-time transactions and updates |
Data Schema | Star or snowflake schemas, denormalized | Normalized schemas, reducing redundancy |
Query Complexity | Complex queries with aggregations and joins | Simple queries with frequent read/write operations |
Data Granularity | Summarized, aggregated data | Detailed, individual transactions |
Data Size | Huge data volumes, typically in terabytes or more | Smaller data volumes, typically in gigabytes or less |
Performance | Designed for high throughput, slower write speeds | Optimized for fast write speeds and low latency reads |
Examples | Amazon Redshift, Google BigQuery, Snowflake | MySQL, PostgreSQL |
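A rough illustration of the difference using a hypothetical orders table (DATE_TRUNC follows PostgreSQL-style syntax):

```sql
-- OLTP-style query: fetch or change a handful of rows by key, as fast as possible.
SELECT order_id, status FROM orders WHERE order_id = 42;

-- OLAP-style query: scan and aggregate large amounts of history for analysis.
SELECT region,
       DATE_TRUNC('month', order_date) AS month,
       SUM(amount) AS revenue
FROM   orders
GROUP  BY region, DATE_TRUNC('month', order_date)
ORDER  BY month, region;
```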
Aspect | ETL | ELT | Reverse ETL | Medallion Architecture |
---|---|---|---|---|
Visualization | ||||
Process Flow | Extract data first, then transform and load into the target system | Extract data first, load into the target system, then transform within the target system | Extract data from the data warehouse or data lake, transform it as needed, and load it into operational systems (e.g., CRM, ERP) | Ingest raw data into a bronze layer, clean and conform it into a silver layer, then aggregate it into a gold layer within the lakehouse |
Data Transformation | Transformation occurs before loading into the target system | Transformation occurs after loading into the target system | Transformation occurs before loading into operational systems | Transformation occurs in multiple stages: raw (bronze), cleaned (silver), and aggregated (gold) |
Performance | Typically slower due to data transformation overhead during the ETL process | Generally faster because loading raw data is faster than transforming it during the ETL process | Depends on the complexity of transformations and the performance of the target operational systems | Optimized for performance at each stage, with raw data ingestion being fast and transformations being incremental |
Scalability | Limited scalability due to the need for substantial transformation before loading data | Highly scalable as it can leverage the processing power of the target system for transformations | Scalability depends on the capabilities of both the data warehouse/lake and the target operational systems | Highly scalable, designed to handle large volumes of data through incremental processing and layering |
Storage Requirements | Higher storage requirements as both raw and transformed data need to be stored | Lower storage requirements as only raw data needs to be stored initially, and transformation occurs within the target system | Requires storage in both the data warehouse/lake and the target operational systems | Requires storage for multiple layers (bronze, silver, gold), but optimizes storage through incremental processing |
Data Integrity | Higher data integrity as data is cleaned and transformed before loading into the target system | May require additional checks and controls within the target system to ensure data integrity post-transformation | Data integrity depends on the transformation logic and the capabilities of the target operational systems | High data integrity through staged processing and validation at each layer |
Complexity | Typically more complex due to the need for designing and managing transformation logic | Generally less complex as it leverages the capabilities of the target system for transformations | Can be complex due to the need to integrate with multiple operational systems and ensure data consistency | Can be complex due to the need to manage multiple layers and ensure data quality at each stage |
Flexibility | May be less flexible as transformation logic is predefined and applied uniformly to all data | More flexible as transformations can be tailored to specific use cases within the target system | Flexibility depends on the integration capabilities of the target operational systems | More flexible as it allows for different transformation logic at each layer |
Real-time Processing | Less suitable for real-time processing due to batch-oriented nature | More suitable for real-time processing as data can be loaded into the target system immediately and transformed on-the-fly | Can support real-time processing depending on the capabilities of the target operational systems | Can support real-time processing through incremental data ingestion and processing |
Examples | Apache Spark, Apache Flink, Google Cloud Dataproc | GCS → Spark jobs, Dataflow → BigQuery | Hightouch, Census, Grouparoo | Databricks Delta Lake (bronze/silver/gold tables), Delta Live Tables |
Use Cases | For well-defined data models and reporting needs | For big data, data lakes, and agile analytics environments | For syncing data from data warehouses/lakes to operational systems for improved decision-making and customer engagement | For building robust data pipelines that support data quality and governance |
Medallion Architecture
Aspect | Bronze Layer | Silver Layer | Gold Layer | Platinum Layer (Optional) |
---|---|---|---|---|
Purpose | Raw data ingestion and storage | Cleaned, validated, and conformed data | Business-level aggregated and curated data | Advanced business KPIs, machine learning-ready refined data |
Data Quality | Contains raw, unprocessed, high-fidelity data | Cleansed, deduplicated, and schema-applied data | Highly curated, denormalized, and enriched data | Highly refined, integrated, and often aggregated data for analytics or ML |
Processing | Minimal or no transformation; just ingestion | Data cleaning, validation, schema evolution, basic transformation | Business logic, aggregation, denormalization | Advanced transformations, integration, optimization |
Typical Data Types | Raw logs, sensor data, third-party sources, transactional data as-is | Structured and semi-structured data with errors removed and schema applied | Aggregates, summaries, reports, business metrics datasets | Project-specific or organizational-wide refined datasets usable across teams |
Storage | Immutable raw storage, often in data lake/file system (e.g., AWS S3, Azure ADLS, GCS) | Cleansed data layer usually on managed tables in lakehouse or data warehouse | Highly accessible curated datasets in optimized formats | Special schemas or databases optimized for fast access and analytics |
Schema | Schema-on-read or no schema (raw, semi-structured or unstructured) | Schema-on-write with enforcement, evolving schema | Fixed, optimized schema for analytics and reporting | Optimized and standardized schema for comprehensive analytics solutions |
Governance & Ownership | Typically ingestion teams or raw data owners | Data engineers and data analysts for quality and consistency | Business analysts and data consumers for performance and insights | Cross-functional governance, compliance, and enterprise data teams |
Latency | High latency acceptable | Lower latency, often near-real-time or batch processed | Low-latency, ready for fast analytics | Typically lowest latency, optimized for real-time or near real-time use cases |
Examples of Activities | Collect raw events, incremental loads, ingest logs or source dumps | Remove duplicates, correct errors, unify formats, join related datasets | Build customer 360 views, sales summaries, financial aggregations | Train ML models, deliver business-wide dashboards, advanced metrics |
Use Cases | Historical archive, full fidelity backups, source of truth for raw data | Basis for BI reporting, data science experiments, downstream transformations | End-user business analytics, dashboards, operational reporting | Cross-functional analytics, ML feature stores, enterprise-wide KPIs |
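A simplified sketch of moving data through the layers; the bronze_orders, silver_orders, and gold_daily_revenue tables are hypothetical:

```sql
-- Bronze: raw events landed as-is (e.g., bronze_orders).
-- Silver: cleaned, typed, and deduplicated version of bronze.
CREATE TABLE silver_orders AS
SELECT DISTINCT
       order_id,
       customer_id,
       CAST(amount AS DECIMAL(12, 2)) AS amount,
       CAST(order_date AS DATE)       AS order_date
FROM   bronze_orders
WHERE  order_id IS NOT NULL;

-- Gold: business-level aggregate built from silver.
CREATE TABLE gold_daily_revenue AS
SELECT order_date, SUM(amount) AS revenue
FROM   silver_orders
GROUP  BY order_date;
```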
Feature | Data Mart | Data Warehouse | Data Lake | Data Lakehouse | Data Fabric | Data Mesh |
---|---|---|---|---|---|---|
Visualization | ||||||
Definition | Subset of a Data Warehouse containing specific data focused on a particular business function or department | Central repository for structured, organized, and processed data, optimized for querying and analysis | Vast repository of raw, unstructured, or semi-structured data stored in its native format | Architecture combining the features of a Data Lake and a Data Warehouse, providing unified analytics on both raw and processed data | Technology-centric architecture that provides a unified, automated data integration and management layer across diverse data sources. Utilizes AI/ML and metadata-driven automation to enable seamless data access, discovery, and governance across the enterprise | Organizational and socio-technical approach emphasizing decentralization of data ownership to domain teams. Treats data as a product with domain-specific teams responsible for managing their data lifecycle, quality, and accessibility |
Data Type | Structured data tailored for specific business needs | Structured data, typically from operational systems | Raw, unstructured, semi-structured data | Raw, semi-structured, and structured data | Supports diverse data types from structured, semi-structured, and unstructured sources unified through virtualization and metadata management. Data fabric abstracts underlying formats | Data types depend on domain-specific needs; each domain decides appropriate data types and formats. Typically supports relational, NoSQL, real-time streams, and other domain-tailored formats |
Data Storage | Similar to Data Warehouse, stored in relational databases or columnar stores | Stored in a structured format like relational databases (e.g., SQL Server, PostgreSQL) or columnar stores (e.g., Redshift, BigQuery) | Typically stored in distributed file systems like HDFS, AWS S3, or Azure Data Lake Storage | Usually stored in a combination of Data Lake storage and structured data formats like Parquet, Delta Lake, or Apache Iceberg | Centralized or logically unified data storage architecture employing data lakes, data warehouses, multi-cloud, and edge storage, often integrated via a virtual data layer. Focus on centralized control and unified view | Decentralized data storage managed by domain teams. Storage technology choices vary per domain, including traditional databases, NoSQL, and streaming platforms. No single point of control; storage is distributed across domains |
Data Processing | Similar to Data Warehouse, batch processing is prevalent, with ETL tools used for data transformation | Primarily batch processing. Data transformation and ETL processes are well-defined and structured. Tools like Informatica, Talend, or Apache Airflow are commonly used | Supports batch and real-time processing. Processing is done on raw data, often using technologies like Hadoop, Spark, or Apache Flink | Combines batch and real-time processing capabilities. Data is transformed and unified using tools like Apache Spark, Databricks, or Delta Lake | Centralized or federated processing combined with intelligent pipeline orchestration and automation. Supports batch and real-time analytics, ETL/ELT, machine learning workflows. AI-driven automation optimizes data processing | Decentralized data processing tailored to domain-specific requirements. Domains implement their own processing pipelines using tools that fit their context (e.g., Apache Kafka, Apache Spark). Emphasis on domain autonomy and agility |
Use Cases | Tailored for specific business functions or departments requiring localized analytics and reporting. Commonly used in finance, sales, or marketing | Best suited for structured reporting, business intelligence, and historical analysis. Ideal for organizations with well-defined data requirements | Suitable for exploratory analytics, machine learning, and big data processing where flexibility and scalability are critical | Ideal for organizations looking to combine the benefits of Data Lakes and Data Warehouses for unified analytics on both raw and processed data | Enterprises needing seamless integration and governance across disparate data sources, requiring a unified data access layer for analytics, compliance, and AI/ML applications. Suitable for organizations favoring centralized data control | Large complex organizations with multiple data domains requiring domain autonomy, faster innovation, and domain-aligned data products. Ideal where decentralized management drives agility, domain ownership, and product thinking for data |
Examples | Sales Data Mart, Finance Data Mart, HR Data Mart | Amazon Redshift, Google BigQuery, Snowflake | Hadoop Distributed File System (HDFS), Amazon S3, Azure Data Lake Storage | Databricks Delta Lake, AWS Glue, Google BigQuery Omni | Platforms and solutions leveraging metadata-driven automation and data virtualization, e.g., Informatica Intelligent Data Platform, IBM Data Fabric, TIBCO Data Fabric, HPE Ezmeral Data Fabric | Organizational implementation of domain-oriented data products, e.g., Netflix's domain-driven data ownership, LinkedIn's decentralized data pipelines, and the use of tools like Apache Kafka and data catalogs supporting federated governance |
Feature | Parquet | Avro |
---|---|---|
Data Storage | Columnar | Row-based |
Schema | Self-describing, stored in file metadata | Stored with data, language-independent |
Compression | Highly compressed, supports multiple compression codecs (Snappy, Gzip, LZO, etc.) | Compressed, supports Deflate and Snappy codecs |
Performance | Optimized for analytical (OLAP) workloads, fast data retrieval and processing | Optimized for write-intensive, big data operations, efficient for accessing all fields |
Language Support | Language-agnostic, supported by many big data frameworks (Spark, Hive, Impala) | Language-independent, can be used across different programming languages |
Advantages | Excellent compression and encoding for columnar data; reads only the columns a query needs, making analytical scans fast and cheap | Compact row-based serialization with strong schema-evolution support; efficient for writes and for reading whole records |
Disadvantages | Slower for write-heavy or record-at-a-time workloads; not well suited to streaming small messages | Less efficient for analytical queries that touch only a few columns, since whole rows must be read |
Use Cases | Data lakes and warehouses, Spark/Hive/Presto analytics, long-term columnar storage | Event streaming (e.g., Kafka message serialization), write-intensive pipelines, data interchange between services |
Database Federation
Definition | Visualization | Features | Pros | Cons | Use Cases | Examples |
---|---|---|---|---|---|---|
Data Federation is a technique used to integrate data from disparate sources and provide a unified, coherent view of data to the user. Often used in enterprise applications where data is distributed across multiple databases or systems | | Data stays in the source systems; a federation layer routes queries and combines results virtually at query time | Unified access without copying or migrating data; near-real-time view of the sources | Query performance is bounded by the slowest source; limited cross-source optimization and transaction support | Cross-system reporting, unified views across departments or after mergers | PostgreSQL foreign data wrappers, Trino/Presto, Denodo |
Relational Database
- Overview
- Detailed
- Data Structures
- Syntax
- SQL Query Techniques
- Normalization
- Joins
- Relations
- Common Issues
Structured Query Language (SQL) is a programming language used for managing and manipulating relational databases.
Query Flow
Key Concepts
- Attribute: A property or characteristic of an entity
- Column: A vertical data element in a table
- Constraint: A rule that values in a column or table must satisfy, such as NOT NULL or UNIQUE, enforced whenever rows are inserted or updated (column/table constraints)
- Database Management System (DBMS): The software that manages and controls access to a database
- Default: A value automatically assigned to a column when an inserted row does not supply one
- Primary Key: Non-null unique identifier for a row
- Foreign Key: Field or combination of fields that establishes a link between two tables. It enforces referential integrity by ensuring that values in one table's key match with values in another table's key. This relationship allows for data consistency and facilitates queries across related tables
- Record: A row in a table
- Schema: The structure or blueprint of the database, defining the tables, columns, and relationships
- Table: A collection of related data
- View: A virtual table defined by a query over one or more physical or "real" tables
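A small schema sketch tying several of these concepts together; the departments and employees tables are made up for illustration:

```sql
CREATE TABLE departments (
    department_id INT PRIMARY KEY,                -- primary key: non-null, unique row identifier
    name          VARCHAR(100) NOT NULL UNIQUE    -- column-level constraints
);

CREATE TABLE employees (
    employee_id   INT PRIMARY KEY,
    department_id INT REFERENCES departments (department_id),  -- foreign key linking the tables
    hired_on      DATE DEFAULT CURRENT_DATE,      -- default applied when no value is supplied
    salary        DECIMAL(10, 2) CHECK (salary >= 0)
);

-- A view: a virtual table defined by a query.
CREATE VIEW department_headcount AS
SELECT d.name AS department, COUNT(*) AS headcount
FROM   employees e
JOIN   departments d ON d.department_id = e.department_id
GROUP  BY d.name;
```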
- Subsets of SQL
- WHERE vs HAVING
- TRUNCATE vs DELETE
- Table Keys
- Aggregation vs Window Function
- Set Theory
Language | Purpose | Examples | Features |
---|---|---|---|
DCL (Data Control Language) | Manages access and permissions to data | GRANT , REVOKE | Controls who can access, manipulate, or delete data |
DDL (Data Definition Language) | Defines and manages database structure | CREATE , ALTER , DROP | Used to define tables, indexes, constraints, etc |
DML (Data Manipulation Language) | Manipulates data within the database | INSERT , UPDATE , DELETE | Allows adding, modifying, and removing data |
DQL (Data Query Language) | Retrieves data from the database | SELECT | Primarily used for querying data from tables |
Aspect | WHERE Clause | HAVING Clause |
---|---|---|
Purpose | Filters individual rows based on specific conditions | Filters groups of rows after aggregation (using GROUP BY ) |
Data Considered | All rows in the table | Groups created by the GROUP BY clause |
Condition Type | Can use any comparison operators, logical operators | Typically uses aggregate functions (SUM, COUNT, AVG, MIN, MAX), grouping columns, or expressions involving them |
Execution Timing | Applied before rows are grouped (more efficient) | Applied after rows are grouped (less efficient) |
Requirement | Can be used with or without GROUP BY | Requires a GROUP BY clause |
Compatibility | Can be used with SELECT , UPDATE , and DELETE statements | Can only be used with SELECT statements |
Filtering Logic | Filters rows that meet the condition, excluding others | Filters groups that meet the condition, excluding others |
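A short example combining both clauses on a hypothetical employees table:

```sql
-- WHERE filters individual rows before grouping; HAVING filters the groups afterwards.
SELECT   department_id, AVG(salary) AS avg_salary
FROM     employees
WHERE    hired_on >= DATE '2020-01-01'      -- row-level filter (applied first)
GROUP BY department_id
HAVING   AVG(salary) > 50000;               -- group-level filter (applied after aggregation)
```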
Feature | TRUNCATE | DELETE |
---|---|---|
Category | DDL (Data Definition Language) | DML (Data Manipulation Language) |
Function | Removes all rows from a table | Removes specific rows based on conditions (WHERE clause) |
Filtering | Not possible | Possible using WHERE clause |
Transaction | Implicitly commits in most engines, so it generally cannot be part of a larger transaction | Can be used within a transaction |
Constraints | Typically rejected (or requires CASCADE) when the table is referenced by foreign keys; constraints are not checked row by row | Checks foreign key and other constraints for each affected row |
Transaction Log | Records only deallocated data pages | Records each deleted row |
Rollback | Not possible | Possible (if transaction is active) |
Speed | Faster | Slower |
Locking | Exclusive lock on the table | Locks individual rows being deleted |
Identity Columns | Resets auto-incrementing values | Preserves existing values |
Triggers | Does not fire triggers | May fire triggers (depending on definition) |
Permissions Required | ALTER on the table | DELETE on the table |
Use Cases | Quickly emptying staging or temporary tables, resetting test data | Removing specific rows, or deletes that must fire triggers or remain reversible within a transaction |
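A quick sketch of the difference on a hypothetical staging_orders table (TRUNCATE rollback behavior varies by engine):

```sql
-- DELETE: row-level, filterable, and transactional.
BEGIN;
DELETE FROM staging_orders WHERE loaded_at < DATE '2024-01-01';
ROLLBACK;   -- the deleted rows are restored

-- TRUNCATE: empties the whole table quickly; in most engines it resets identity counters
-- and cannot be filtered or rolled back.
TRUNCATE TABLE staging_orders;
```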
Feature | Primary Key | Unique Key | Foreign Key |
---|---|---|---|
Purpose | Uniquely identifies each row | Ensures uniqueness for a set of columns | Links data between two tables |
Uniqueness | Mandatory (one and only one) | Enforced (no duplicates) | Enforced (references unique values) |
Null Values | Not allowed | Allowed (one NULL in some DBMSs, multiple in others) | Allowed unless the column is declared NOT NULL |
Number per Table | One | Multiple | Multiple (each references one parent table) |
Indexes | Typically creates a clustered index | Creates a unique (usually non-clustered) index | Typically does not create an index automatically |
Auto Increment | Supported | Not supported | Not supported |
Foreign Key Reference | Can be referenced by foreign keys | Can be referenced by foreign keys | References a primary or unique key |
Feature | Aggregation | Window Function |
---|---|---|
Functionality | Applies to entire groups of data | Applies to each row within a group |
Purpose | Summarizes data by reducing it to a single value per group | Calculates running totals, subtotals, ranking, percentiles, on a defined partition of data |
Common Functions | SUM, COUNT, AVG, MIN, MAX | SUM() OVER, ROW_NUMBER, RANK, DENSE_RANK, PERCENT_RANK, LAG, LEAD |
Output Format | Reduced result set with one row per group | Preserves the original result set structure with the same number of rows |
Example | SELECT department, COUNT(*) FROM employees GROUP BY department | SELECT name, salary, AVG(salary) OVER (PARTITION BY department ORDER BY hire_date ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) FROM employees |
Use Cases | Suitable for summarizing data across groups, such as calculating total sales per region or average order value per customer segment | Ideal for analytical tasks requiring comparisons between rows, like calculating moving averages, cumulative sums, or identifying top N values within groups |
Feature | UNION | UNION ALL | INTERSECT | EXCEPT |
---|---|---|---|---|
Purpose | Combines the results of two or more sets and removes duplicates | Combines the results of two or more sets without removing duplicates | Returns elements common to both sets | Returns elements present in the first set but not in the second |
Syntax | SELECT id FROM A UNION SELECT id FROM B | SELECT id FROM A UNION ALL SELECT id FROM B | SELECT id FROM A INTERSECT SELECT id FROM B | SELECT id FROM A EXCEPT SELECT id FROM B |
Example | A = {1, 2, 3}, B = {2, 4, 5} → {1, 2, 3, 4, 5} (w/o duplicates) | A = {1, 2, 3}, B = {2, 4, 5} → {1, 2, 2, 3, 4, 5} (w/ duplicates) | A = {1, 2, 3}, B = {2, 4, 5} → {2} (elements common to both sets) | A = {1, 2, 3}, B = {2, 4, 5} → {1, 3} (elements in A but not B) |
Type | Visualization | Definition | Features | Use Cases |
---|---|---|---|---|
Skip List | | Probabilistic data structure for ordered sets or maps, offering O(log n) average-case search and insertion with a simpler implementation than balanced trees | Used in Redis sorted sets | In-memory |
Hash Index | | Uses a hash function to map keys directly to their locations, ideal for fast point lookups | Most common in-memory index solution | In-memory |
SSTable | | File format for storing key-value data in sorted order on disk, enabling efficient retrieval | Immutable data structure; seldom used alone | Disk-based |
LSM Tree | | Combines an in-memory structure with on-disk SSTables, optimizing write performance with background merging (compaction) for efficient reads | High write throughput; disk compaction may impact performance | Memory + disk |
B-Tree | | Self-balancing tree data structure for sorted data, allowing efficient search, insertion, and deletion operations | Most popular database index implementation | Disk-based |
Inverted Index | | Maps words to the documents they appear in, facilitating fast full-text searches | Used in document search engines such as Lucene | Full-text document search |
Suffix Tree | | Stores the suffixes of a text, enabling efficient searches for patterns and substrings | Used in string search, such as suffix matching | String/substring search |
R-Tree | | Stores spatial data like points, rectangles, or polygons, allowing efficient searches for objects within a specific area | Supports nearest-neighbor queries | Multi-dimensional/spatial search |
Type | Definition | Example |
---|---|---|
SELECT | Retrieves data from a database | SELECT column1, column2 FROM table_name |
INSERT | Adds new records to a table | INSERT INTO table_name (column1, column2) VALUES (value1, value2) |
UPDATE | Modifies existing records in a table | UPDATE table_name SET column1 = value1, column2 = value2 WHERE condition |
DELETE | Deletes records from a table | DELETE FROM table_name WHERE condition |
JOIN | Combines rows from two or more tables based on a related column | SELECT * FROM table1 INNER JOIN table2 ON table1.column = table2.column |
GROUP BY | Groups rows with identical values into summary rows | SELECT column1, COUNT(*) FROM table_name GROUP BY column1 |
HAVING | Filters groups produced by the GROUP BY clause | SELECT column1, COUNT(*) FROM table_name GROUP BY column1 HAVING COUNT(*) > 10 |
ORDER BY | Sorts the result set in ascending or descending order | SELECT * FROM table_name ORDER BY column1 DESC |
WHERE | Filters records based on specified conditions | SELECT * FROM table_name WHERE condition |
DISTINCT | Returns only distinct (different) values | SELECT DISTINCT column1 FROM table_name |
UNION | Combines the result sets of two or more SELECT statements | SELECT column1 FROM table1 UNION SELECT column1 FROM table2 |
TRANSACTION | Groups a set of SQL statements into a single unit of work | BEGIN TRANSACTION; ... COMMIT; or ROLLBACK; |
INDEX | Creates an index on a table | CREATE INDEX index_name ON table_name (column1) |
VIEW | Virtual table derived from one or more tables | CREATE VIEW view_name AS SELECT column1, column2 FROM table_name WHERE condition |
TRIGGER | Executes a set of actions when a certain event occurs on a table | CREATE TRIGGER trigger_name BEFORE INSERT ON table_name FOR EACH ROW BEGIN ... END |
SUBQUERY | Nested query inside another query | SELECT column1 FROM table_name WHERE column1 IN (SELECT column1 FROM table2 WHERE condition) |
CASE | Provides conditional logic within a query | SELECT column1, CASE WHEN condition THEN result1 ELSE result2 END AS result FROM table_name |
Aspect | Common Table Expression (CTE) | Partition Over | Windowing | Materialized View |
---|---|---|---|---|
Definition | Temporary named result set within the scope of a single SQL statement | SQL clause to divide query result sets into partitions | Functions defined over a window/partition of rows in a query | Precomputed and stored result of a query as a physical table |
Scope | Exists only during the execution of a single query | Applies within a query's window function | Used within SQL queries to perform calculations over sets | Persisted in database until refreshed or dropped |
Purpose | Improve query readability, enable stepwise data transformation | Define subset of rows to apply window functions | Perform calculations like ranking, running totals, moving averages | Improve query performance by caching results |
Data Persistence | None - ephemeral and transient | None - part of query execution | None - part of query execution | Yes - physical table stored on disk |
Usage | Create intermediate result sets for reuse in complex queries | Used with window functions to partition data | Calculate aggregates and analytics over row windows | Store complex or aggregated query results |
Performance Impact | May or may not improve depending on DB optimizer and usage | Overhead varies, used to optimize window function execution | Allows efficient in-query analytics without multiple scans | Speeds up query execution by avoiding repeated computations |
Recursiveness | Supports recursion in some DBMS (e.g., PostgreSQL) | Not recursive | Not recursive | Not recursive |
Reusability | Within single query only | Within window function context | Within window function context | Across multiple queries until refreshed |
Refresh/Update | Not applicable | Not applicable | Not applicable | Can be refreshed on schedule, manually, or incrementally |
Storage Requirement | No physical storage | No physical storage | No physical storage | Requires storage space for materialized data |
Comparison to Subqueries/Temp Tables | More readable than nested subqueries; ephemeral unlike temp tables | Not comparable as standalone; part of windowing | Not comparable standalone; used for row-wise calculations | Persistent, unlike views or CTEs |
Pros | Improves query clarity and debugging, enables recursion | Enables granular application of window functions | Powerful analytics over ordered subsets | Significant query performance improvement via caching |
Cons | Depending on DB engine, may have unknown performance tradeoffs | Adds overhead to window function processing | Complexity in understanding window frame boundaries | Requires maintenance and storage, possible staleness of data |
Use Cases | Simplify complex query logic, prepare reusable intermediate results | Partition data for window function analytics | Analytics like rankings, running totals, moving averages | Speed up expensive or repeated analytical queries |
Examples | Breaking multi-step queries into readable named parts | PARTITION BY clause in window functions | ROW_NUMBER() OVER (PARTITION BY col ORDER BY col) | Precomputed aggregation tables, snapshot of query results |
Syntax Example | WITH cte AS (SELECT ...) SELECT * FROM cte | OVER (PARTITION BY col) | ROW_NUMBER() OVER (PARTITION BY col ORDER BY col) | CREATE MATERIALIZED VIEW mv AS SELECT ... |
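A rough sketch of a CTE feeding a window function, plus a PostgreSQL-style materialized view, over a hypothetical orders table:

```sql
-- CTE + window function: a named, query-scoped result set ranked within each month.
WITH monthly_sales AS (
    SELECT customer_id,
           DATE_TRUNC('month', order_date) AS month,
           SUM(amount) AS total
    FROM   orders
    GROUP  BY customer_id, DATE_TRUNC('month', order_date)
)
SELECT customer_id,
       month,
       total,
       RANK() OVER (PARTITION BY month ORDER BY total DESC) AS rank_in_month
FROM   monthly_sales;

-- Materialized view: the same result persisted on disk and refreshed on demand.
CREATE MATERIALIZED VIEW monthly_sales_mv AS
SELECT customer_id, DATE_TRUNC('month', order_date) AS month, SUM(amount) AS total
FROM   orders
GROUP  BY customer_id, DATE_TRUNC('month', order_date);

REFRESH MATERIALIZED VIEW monthly_sales_mv;
```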
Process of organizing data in a relational database to minimize redundancy and improve data integrity.
- Reduce data duplication: This saves storage space and minimizes maintenance headaches
- Enhance data consistency: When a piece of data changes, it only needs to be updated in one place, ensuring consistency across the database
- Improve data retrieval efficiency: Normalized databases allow for faster and more efficient querying of data
Normalization is achieved through a series of steps, each referred to as a Normal Form (NF). Each subsequent form builds upon the previous one, progressively reducing redundancy
Normal Form | Key Points | Steps | Example |
---|---|---|---|
0NF | Non-normalized data with repeating groups and redundancy | | |
1NF | Each column holds atomic (indivisible) values; no repeating groups; each row is uniquely identifiable | Split multi-valued columns into separate rows or columns; define a primary key | |
2NF | In 1NF, and every non-key column depends on the whole primary key (no partial dependencies) | Move columns that depend on only part of a composite key into their own table | |
3NF | In 2NF, and no non-key column depends on another non-key column (no transitive dependencies) | Move transitively dependent columns into their own table | |
Boyce-Codd Normal Form (BCNF) | Stricter 3NF: every determinant is a candidate key | | |
4NF | In BCNF, and no multi-valued dependencies | | |
5NF | In 4NF, and no join dependencies; the table cannot be losslessly decomposed further | | |
6NF | Decomposes tables into irreducible relations; mainly used in temporal databases | | |
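An illustrative decomposition toward 3NF; the customers, orders, and order_items tables are hypothetical:

```sql
-- Unnormalized: one wide table repeats customer details on every order row.
-- orders_flat(order_id, customer_name, customer_city, product, quantity, ...)

-- 3NF-style decomposition: each fact is stored once and linked by keys.
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name        VARCHAR(100),
    city        VARCHAR(100)
);

CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT REFERENCES customers (customer_id),
    order_date  DATE
);

CREATE TABLE order_items (
    order_id   INT REFERENCES orders (order_id),
    product_id INT,
    quantity   INT,
    PRIMARY KEY (order_id, product_id)
);
```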
Normalization vs. Denormalization
Aspect | Normalization | Denormalization |
---|---|---|
Purpose | Eliminate data redundancy, maintain data integrity | Introduce controlled redundancy to improve read performance |
Data Structure | Data is split into multiple related tables | Data is combined into fewer tables with redundancy |
Redundancy | Minimal or no redundancy | High redundancy due to duplicated data |
Query Complexity | Complex queries requiring multiple JOINs | Simplified queries with fewer or no JOINs |
Storage Usage | Optimizes storage by avoiding duplication | Requires more storage due to duplicated data |
Performance Optimization | Optimized for write operations (inserts, updates) | Optimized for read operations and faster query retrieval |
Data Integrity & Consistency | Easier to enforce data integrity and consistency | Higher risk of data inconsistencies and anomalies |
Maintenance Effort | Easier to maintain and update due to no duplicated data | More complex maintenance and potential update anomalies |
Storage Cost | Lower, due to reduced duplication | Higher, due to redundant data storage |
Write Operation Cost | Lower, fewer duplicate updates needed | Higher, updates must be propagated across duplicates |
Read Operation Cost | Higher, due to required joins | Lower, data is pre-joined or duplicated |
Referential Integrity | Enforced via foreign keys and constraints | Often reduced or absent, making integrity harder to enforce |
Scalability | Suits relational databases | Often better for distributed NoSQL or denormalized stores |
Impact on Indexing | Indexes can be spread across multiple tables | Indexing might be faster since related data is stored together |
Risk of Anomalies | Low (insertion, update, deletion anomalies minimized) | Higher risk due to duplicated data |
Development Complexity | Requires more thoughtful schema design and normalization steps | Simpler schema but requires careful management of duplicates |
Use Cases | Systems with frequent updates where accuracy is critical | Read-heavy systems where fast data retrieval is crucial |
Examples | Social media profiles, inventory systems | Reporting systems, dashboards, analytical queries |
Example Scenario | Separate Customer and Order tables linked by CustomerID | Customer data embedded directly in Order table to avoid joins |
Type | Visualization | Definition | Syntax | Example |
---|---|---|---|---|
Left Join / Left Outer Join | | Everything on the left + everything on the right that matches | SELECT * FROM table1 LEFT JOIN table2 ON table1.key = table2.key | |
Anti Left Join | | Everything on the left that is NOT on the right | SELECT * FROM table1 LEFT JOIN table2 ON table1.key = table2.key WHERE table2.key IS NULL | |
Right Join / Right Outer Join | | Everything on the right + everything on the left that matches | SELECT * FROM table1 RIGHT JOIN table2 ON table1.key = table2.key | |
Anti Right Join | | Everything on the right that is NOT on the left | SELECT * FROM table1 RIGHT JOIN table2 ON table1.key = table2.key WHERE table1.key IS NULL | |
Full Outer Join | | Everything on the left + everything on the right | SELECT * FROM table1 FULL OUTER JOIN table2 ON table1.key = table2.key | |
Anti Outer Join | | Everything on the left and right that is unique to each other | SELECT * FROM table1 FULL OUTER JOIN table2 ON table1.key = table2.key WHERE table1.key IS NULL OR table2.key IS NULL | |
Join / Inner Join | | Only the rows that are common to both tables | SELECT * FROM table1 INNER JOIN table2 ON table1.key = table2.key | |
Cross Join | | All combinations of rows from both tables (Cartesian product) | SELECT * FROM table1 CROSS JOIN table2 | |
Self Join | | The table is joined with itself | SELECT * FROM table1 AS t1 INNER JOIN table1 AS t2 ON t1.key = t2.key | |
Feature | One-to-One | One-to-Many | Many-to-Many |
---|---|---|---|
Visualization | |||
Definition | Each record in one table is associated with exactly one record in another | Each record in one table can be associated with one or more records in another | Records in both tables can be associated with multiple records in the other |
Example | Person → Passport | Department → Employee | Student → Course |
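A many-to-many relationship is usually modeled with a junction (bridge) table; the table names below are illustrative:

```sql
CREATE TABLE students (
    student_id INT PRIMARY KEY,
    name       VARCHAR(100)
);

CREATE TABLE courses (
    course_id INT PRIMARY KEY,
    title     VARCHAR(100)
);

-- Junction table: one row per student-course pair.
CREATE TABLE enrollments (
    student_id INT REFERENCES students (student_id),
    course_id  INT REFERENCES courses (course_id),
    PRIMARY KEY (student_id, course_id)
);
```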
Issue | Definition | Impact | Solutions |
---|---|---|---|
Cartesian Products | Occurs when 2 or more tables are joined without a proper join condition or WHERE clause, resulting in huge datasets | Massive intermediate results, slow queries, high memory and I/O usage | Always specify join conditions; review query plans for unintended cross joins |
Connections overhead | Opening many separate connections to the same database instead of reusing them | Exhausted connection limits, added latency, wasted server resources | Use connection pooling and reuse connections; tune pool sizes |
Data Inconsistency | Same data exists in different forms across the database | Conflicting query results, unreliable reports, hard-to-trace bugs | Normalize the schema, enforce constraints, maintain a single source of truth |
Deadlocks | 2 or more transactions are waiting for each other to release locks | Blocked or aborted transactions, application errors and retries | Acquire locks in a consistent order, keep transactions short, detect and retry deadlocked transactions |
Full Table Scans | Scanning the entire table to satisfy a query instead of using indexes | Slow queries and high I/O, especially on large tables | Add appropriate indexes; rewrite predicates so existing indexes can be used |
Missing Indexes | Indexes that are required to satisfy a query are missing | Queries fall back to full table scans, degrading performance as data grows | Analyze slow queries and execution plans; create indexes on frequently filtered and joined columns |
N+1 | Initial query fetches N rows, and then N additional queries are executed to fetch related data for each row | Many unnecessary round trips, poor latency under load | Fetch related data with JOINs, batched queries, or ORM eager loading |
Over/Under fetching | Overfetching occurs when unnecessary data is fetched from the database, while underfetching occurs when necessary data is not fetched | Wasted bandwidth and memory, or extra follow-up queries | Select only the columns needed; design queries and endpoints around actual access patterns |
Poorly Designed Schema | Lacks normalization, proper data types, and relationships, leading to inefficient data access | Redundant data, anomalies, slow and convoluted queries | Normalize where appropriate, choose correct data types, define keys and relationships |
Subquery Inefficiency | Subqueries are used inefficiently, resulting in poor query performance | Repeated subquery execution and poor plan choices | Rewrite correlated subqueries as JOINs or CTEs; use EXISTS where appropriate |
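For example, the N+1 issue can often be avoided by fetching parents and children in a single round trip; the departments and employees tables here are hypothetical:

```sql
-- N+1 pattern: one query for the parent rows, then one extra query per row for its children.
-- SELECT * FROM departments;                             -- 1 query
-- SELECT * FROM employees WHERE department_id = :id;     -- repeated N times

-- Single-query alternative: a JOIN returns parents and children together.
SELECT d.department_id,
       d.name       AS department,
       e.employee_id,
       e.name       AS employee
FROM   departments d
LEFT JOIN employees e ON e.department_id = d.department_id;
```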