Database
- Overview
- Selection Process
- Transactions
- Locking
DB Type | Features | Use Cases |
---|---|---|
Columnar | Stores each column contiguously; high compression; fast scans over a few columns | Analytics, data warehousing (e.g., Redshift, ClickHouse) |
Document | Flexible, JSON-like documents; schema-free; rich queries over nested fields | Content management, product catalogs (e.g., MongoDB, CouchDB) |
Graph | Models data as nodes and edges; efficient traversal of relationships | Social networks, recommendations, fraud detection (e.g., Neo4j) |
Key-Value | Simple key-based lookups; very low latency; easy horizontal scaling | Caching, session storage (e.g., Redis, DynamoDB) |
NewSQL | Relational model and ACID guarantees combined with NoSQL-style horizontal scalability | Globally distributed OLTP (e.g., CockroachDB, Google Spanner) |
NoSQL | Umbrella term for non-relational stores; flexible schemas; built to scale out | High-volume workloads with rapidly evolving data models |
Object-Oriented | Stores objects directly, preserving classes and inheritance | CAD, multimedia, applications with complex object models |
Relational / SQL | Tables with rows and columns; ACID transactions; joins and strong consistency | OLTP, financial systems (e.g., PostgreSQL, MySQL) |
Spatial | Geometry types and spatial indexes (e.g., R-Trees) for location queries | GIS, mapping, location-based services (e.g., PostGIS) |
Time-Series | Optimized for time-stamped data; retention policies and downsampling | Monitoring, metrics, IoT telemetry (e.g., InfluxDB, TimescaleDB) |
Vector DB | Stores embeddings; approximate nearest-neighbor similarity search | Semantic search, recommendations, RAG pipelines (e.g., Milvus, Pinecone) |
- Selection
- Choice
Type | Definition | Priority | Transaction Failure | Data Consistency | Use Cases |
---|---|---|---|---|---|
ACID | Atomicity, Consistency, Isolation, Durability | Consistency | Entire transaction rolls back | Guaranteed immediately | Financial transactions (banking, stock trading) |
BASE | Basically Available, Soft state, Eventually consistent | Availability | May proceed with eventual consistency | Eventual, but not always immediate | Social media platforms |
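The "entire transaction rolls back" behavior can be observed with any ACID-compliant store; a minimal sketch using Python's built-in sqlite3 (the table and amounts are invented for illustration):

```python
import sqlite3

# In-memory database; any ACID-compliant store behaves the same way.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, "
             "balance INTEGER NOT NULL CHECK (balance >= 0))")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 200 "
                     "WHERE name = 'alice'")  # violates the CHECK constraint
        conn.execute("UPDATE accounts SET balance = balance + 200 "
                     "WHERE name = 'bob'")
except sqlite3.IntegrityError:
    pass  # the whole transfer is rolled back: atomicity

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 50}  (unchanged)
```

Both updates succeed or neither does; the half-finished transfer never becomes visible.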
- Locking Hierarchy
- Types
Aspect | Intent Shared (IS) | Exclusive (X) | Intent Exclusive (IX) | Shared with Intent Exclusive (SIX) |
---|---|---|---|---|
Definition | Signals intent to acquire shared (read) locks on individual rows lower in the hierarchy; allows concurrent readers | Prevents all other transactions from accessing the data (reading or writing) | Signals intent to acquire exclusive (write) locks on individual rows lower in the hierarchy | A shared lock on the whole resource combined with intent to acquire exclusive locks on some of its rows |
Usage | A transaction that will read a few rows of a table | A transaction modifying a resource (e.g., UPDATE, DELETE) | A transaction that will modify a few rows of a table | A transaction scanning a whole table while updating a subset of its rows |
Impact on Concurrency | High; compatible with everything except exclusive locks | Low; blocks all other access to the resource | Moderate; compatible with other intent locks, but blocks shared and exclusive table locks | Moderate; other readers may still take IS locks, but all other writers are blocked |
Scalability | Scales well with read-heavy workloads | May become a bottleneck under high write concurrency | Scales well when writers touch disjoint rows | Balances table-wide reads with row-level writes |
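The interaction between these modes is usually summarized as a compatibility matrix; a small pure-Python sketch (the matrix follows the standard multi-granularity locking scheme and also includes the plain Shared (S) mode for completeness):

```python
# Standard multi-granularity lock compatibility matrix:
# True = the two modes can be held on the same resource concurrently.
COMPATIBLE = {
    "IS":  {"IS": True,  "S": True,  "IX": True,  "SIX": True,  "X": False},
    "S":   {"IS": True,  "S": True,  "IX": False, "SIX": False, "X": False},
    "IX":  {"IS": True,  "S": False, "IX": True,  "SIX": False, "X": False},
    "SIX": {"IS": True,  "S": False, "IX": False, "SIX": False, "X": False},
    "X":   {"IS": False, "S": False, "IX": False, "SIX": False, "X": False},
}

def can_grant(requested, held):
    """A lock is granted only if it is compatible with every lock already held."""
    return all(COMPATIBLE[requested][h] for h in held)

print(can_grant("IS", ["IX"]))   # True:  intent modes coexist
print(can_grant("X",  ["IS"]))   # False: exclusive blocks everyone
print(can_grant("IX", ["SIX"]))  # False: SIX already covers reads plus row writes
```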
Feature | Pessimistic Locking | Optimistic Locking |
---|---|---|
Locking Mechanism | Acquires locks on database records before any read/write operation | No explicit locks; relies on versioning or timestamps |
Transaction Isolation | Guarantees serializability (transactions appear to execute one after another) | Relies on conflict detection during commit |
Concurrency | Lower concurrency due to exclusive access | Higher concurrency, as multiple transactions can read data concurrently |
Data Integrity | High; ensures only one transaction modifies data at a time | Conflicts are detected at write time and must be retried; "lost updates" are possible only if the version check is skipped |
Implementation | Database-managed; different lock types (shared, exclusive) available | Application-level; relies on versioning mechanisms (e.g., version numbers, timestamps) in the database |
Error Handling | Rollback transactions that encounter locked records | Retry transactions that encounter conflicts during commit |
Use Cases | High-contention workloads where conflicts are frequent (banking, inventory management) | Low-contention, read-heavy workloads (web applications, reporting) |
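Optimistic locking's version-number scheme can be sketched with sqlite3; the table, column names, and stock values are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, "
             "stock INTEGER, version INTEGER)")
conn.execute("INSERT INTO items VALUES (1, 10, 0)")
conn.commit()

def update_stock(conn, item_id, new_stock, expected_version):
    """Write only if nobody changed the row since we read it (optimistic check)."""
    cur = conn.execute(
        "UPDATE items SET stock = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_stock, item_id, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows touched => someone else won; retry

# Two writers both read the row at version 0...
ok_first  = update_stock(conn, 1, 9, expected_version=0)  # succeeds, version -> 1
ok_second = update_stock(conn, 1, 8, expected_version=0)  # conflict; must re-read
print(ok_first, ok_second)  # True False
```

The second writer detects the conflict instead of silently overwriting the first update.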
Data Storage Solutions
- Overview
- Processing Types
- Data Processing
- Data Repositories
- File Storage
Key Considerations
- Data Volume and Growth: How much data do you currently store, and what's the anticipated growth rate?
- Performance Requirements: How critical is fast access and retrieval of data for your operations?
- Data Security and Compliance: What security measures are necessary to safeguard sensitive data? Are there industry regulations to adhere to?
- Accessibility and Sharing Needs: Do you require remote access to data or collaboration features?
- Budgetary Constraints: What is your allocated budget for data storage solutions?
Feature | Online Analytical Processing (OLAP) | Online Transaction Processing (OLTP) |
---|---|---|
Purpose | Analytical processing for decision-making | Transaction processing for day-to-day operations |
Data Usage | Aggregates historical data for reporting and analysis | Processes real-time transactions and updates |
Data Schema | Star or snowflake schemas, denormalized | Normalized schemas, reducing redundancy |
Query Complexity | Complex queries with aggregations and joins | Simple queries with frequent read/write operations |
Data Granularity | Summarized, aggregated data | Detailed, individual transactions |
Data Size | Huge data volumes, typically in terabytes or more | Smaller data volumes, typically in gigabytes or less |
Performance | Designed for high throughput, slower write speeds | Optimized for fast write speeds and low latency reads |
Examples | Amazon Redshift, Google BigQuery, Snowflake | MySQL, PostgreSQL |
Aspect | ETL | ELT |
---|---|---|
Process Flow | Extract data first, then transform and load into the target system | Extract data first, load into the target system, then transform within the target system |
Data Transformation | Transformation occurs before loading into the target system | Transformation occurs after loading into the target system |
Performance | Typically slower, due to the transformation overhead before loading | Loading is fast since raw data is ingested as-is; transformation cost shifts to the target system |
Scalability | Limited scalability due to the need for substantial transformation before loading data | Highly scalable as it can leverage the processing power of the target system for transformations |
Storage Requirements | Higher storage requirements as both raw and transformed data need to be stored | Lower storage requirements as only raw data needs to be stored initially, and transformation occurs within the target system |
Data Integrity | Higher data integrity as data is cleaned and transformed before loading into the target system | May require additional checks and controls within the target system to ensure data integrity post-transformation |
Complexity | Typically more complex due to the need for designing and managing transformation logic | Generally less complex as it leverages the capabilities of the target system for transformations |
Flexibility | May be less flexible as transformation logic is predefined and applied uniformly to all data | More flexible as transformations can be tailored to specific use cases within the target system |
Real-time Processing | Less suitable for real-time processing due to batch-oriented nature | More suitable for real-time processing as data can be loaded into the target system immediately and transformed on-the-fly |
Examples | Apache Spark, Apache Flink, Google Cloud Dataproc | GCS → Spark jobs, Dataflow → BigQuery |
Use Cases | For well-defined data models and reporting needs | For big data, data lakes, and agile analytics environments |
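The ordering difference can be sketched in a few lines of Python; the records and the transform are made up, and sqlite3 stands in for the target system:

```python
import sqlite3

raw = [("alice", " 42 "), ("bob", "17")]  # extracted records, untidy strings

# ETL: transform in the pipeline, then load the clean result.
etl_target = sqlite3.connect(":memory:")
etl_target.execute("CREATE TABLE users (name TEXT, age INTEGER)")
transformed = [(name.title(), int(age.strip())) for name, age in raw]  # transform first
etl_target.executemany("INSERT INTO users VALUES (?, ?)", transformed)  # then load

# ELT: load the raw data as-is, then transform inside the target with SQL.
elt_target = sqlite3.connect(":memory:")
elt_target.execute("CREATE TABLE staging (name TEXT, age TEXT)")
elt_target.executemany("INSERT INTO staging VALUES (?, ?)", raw)        # load first
elt_target.execute(
    "CREATE TABLE users AS "
    "SELECT upper(substr(name,1,1)) || substr(name,2) AS name, "
    "       CAST(trim(age) AS INTEGER) AS age "
    "FROM staging"                                                      # transform in-database
)

print(list(etl_target.execute("SELECT * FROM users")))  # [('Alice', 42), ('Bob', 17)]
print(list(elt_target.execute("SELECT * FROM users")))  # [('Alice', 42), ('Bob', 17)]
```

Both paths end at the same clean table; what differs is where the transformation runs and whether the raw data lands in the target first.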
Feature | Data Mart | Data Warehouse | Data Lake | Data Lakehouse |
---|---|---|---|---|
Definition | A subset of a Data Warehouse containing specific data focused on a particular business function or department | A central repository for structured, organized, and processed data, optimized for querying and analysis | A vast repository of raw, unstructured, or semi-structured data stored in its native format | An architecture combining the features of a Data Lake and a Data Warehouse, providing unified analytics on both raw and processed data |
Data Type | Structured data tailored for specific business needs | Structured data, typically from operational systems | Raw, unstructured, semi-structured data | Raw, semi-structured, and structured data |
Data Storage | Similar to Data Warehouse, stored in relational databases or columnar stores | Stored in a structured format like relational databases (e.g., SQL Server, PostgreSQL) or columnar stores (e.g., Redshift, BigQuery) | Typically stored in distributed file systems like HDFS, AWS S3, or Azure Data Lake Storage | Usually stored in a combination of Data Lake storage and structured data formats like Parquet, Delta Lake, or Apache Iceberg |
Data Processing | Similar to Data Warehouse, batch processing is prevalent, with ETL tools used for data transformation | Primarily batch processing. Data transformation and ETL processes are well-defined and structured. Tools like Informatica, Talend, or Apache Airflow are commonly used | Supports batch and real-time processing. Processing is done on raw data, often using technologies like Hadoop, Spark, or Apache Flink | Combines batch and real-time processing capabilities. Data is transformed and unified using tools like Apache Spark, Databricks, or Delta Lake |
Use Cases | Tailored for specific business functions or departments requiring localized analytics and reporting. Commonly used in finance, sales, or marketing | Best suited for structured reporting, business intelligence, and historical analysis. Ideal for organizations with well-defined data requirements | Suitable for exploratory analytics, machine learning, and big data processing where flexibility and scalability are critical | Ideal for organizations looking to combine the benefits of Data Lakes and Data Warehouses for unified analytics on both raw and processed data |
Examples | Sales Data Mart, Finance Data Mart, HR Data Mart | Amazon Redshift, Google BigQuery, Snowflake | Hadoop Distributed File System (HDFS), Amazon S3, Azure Data Lake Storage | Databricks Delta Lake, AWS Glue, Google BigQuery Omni |
Feature | Parquet | Avro |
---|---|---|
Data Storage | Columnar | Row-based |
Schema | Self-describing, stored in file metadata | Stored with data, language-independent |
Compression | Highly compressed, supports multiple compression codecs (Snappy, Gzip, LZO, etc.) | Compressed, supports Deflate and Snappy codecs |
Performance | Optimized for analytical (OLAP) workloads, fast data retrieval and processing | Optimized for write-intensive, big data operations, efficient for accessing all fields |
Language Support | Language-agnostic, supported by many big data frameworks (Spark, Hive, Impala) | Language-independent, can be used across different programming languages |
Advantages | Column pruning and predicate pushdown; similar values stored together compress well; queries read only the columns they need | Strong schema evolution support; compact, fast row serialization; well suited to streaming |
Disadvantages | Slower writes; poor fit for row-level updates or frequent small appends | Inefficient for analytics touching only a few columns, since whole records must be read |
Use Cases | Data lakes and OLAP queries (Spark, Hive, BigQuery) | Event streaming (Kafka), data interchange, write-heavy ingestion pipelines |
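The row-based vs columnar distinction behind Avro and Parquet can be illustrated in plain Python; the records are made up:

```python
# The same three records, laid out two ways.
records = [
    {"name": "alice", "age": 30, "city": "Oslo"},
    {"name": "bob",   "age": 25, "city": "Lima"},
    {"name": "carol", "age": 41, "city": "Pune"},
]

# Row-based (Avro-like): each record is stored contiguously.
# Reading one full record is cheap; reading one field touches every record.
row_store = [tuple(r.values()) for r in records]

# Columnar (Parquet-like): each column is stored contiguously.
# Reading one column is cheap and compresses well (similar values together);
# reconstructing a full record touches every column.
column_store = {
    "name": [r["name"] for r in records],
    "age":  [r["age"] for r in records],
    "city": [r["city"] for r in records],
}

print(row_store[1])         # one whole record: ('bob', 25, 'Lima')
print(column_store["age"])  # one whole column: [30, 25, 41]
```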
Database Federation
- Overview
- Features
- Pros & Cons
Data Federation is a technique that integrates data from disparate sources and presents the user with a unified, coherent view. It is often used in enterprise applications where data is distributed across multiple databases or systems.
- Transparent: Users see one database, not separate sources
- Heterogeneous: Handles different data formats and systems
- Extensible: Easily add new data sources as needed
- Autonomous: Leaves existing databases unchanged
- Integrates data: Combines data from various sources
Pros
- Flexible data sharing
- Autonomy among the database components
- Access heterogeneous data in a unified way
- No tight coupling of applications with legacy databases
Cons
- Adds more hardware and additional complexity
- Joining data from two databases is complex
- Federation leaves data in the source systems rather than consolidating it; for companies that need regular, centrally managed updates this is a drawback, and the lack of a central copy can complicate disaster recovery
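A miniature form of the unified-view idea can be shown with SQLite's ATTACH DATABASE, which lets one query span separate databases through a single connection. Real federation engines (e.g., Trino, PostgreSQL foreign data wrappers) do this across heterogeneous remote systems; this sketch, with invented tables, only mimics the unified-view aspect locally:

```python
import sqlite3

# Two independent databases, each left unchanged (autonomy),
# queried through one connection as a single logical view.
conn = sqlite3.connect(":memory:")                 # "sales" database
conn.execute("ATTACH DATABASE ':memory:' AS hr")   # separate "hr" database

conn.execute("CREATE TABLE orders (emp_id INTEGER, amount INTEGER)")
conn.execute("INSERT INTO orders VALUES (1, 100), (2, 250)")
conn.execute("CREATE TABLE hr.employees (id INTEGER, name TEXT)")
conn.execute("INSERT INTO hr.employees VALUES (1, 'alice'), (2, 'bob')")

# A federated-style query joining across both sources:
rows = list(conn.execute(
    "SELECT e.name, o.amount FROM hr.employees e "
    "JOIN orders o ON o.emp_id = e.id ORDER BY e.id"
))
print(rows)  # [('alice', 100), ('bob', 250)]
```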
Relational Database
- Overview
- Detailed
- Data Structures
- Syntax
- Normalization
- Joins
- Relations
- Common Issues
Structured Query Language (SQL) is a programming language used for managing and manipulating relational databases.
Query Flow
Key Concepts
- Attribute: A property or characteristic of an entity
- Column: A vertical data element in a table
- Constraint: A rule that inserted rows must satisfy, such as a column being non-null or unique (defined at the column or table level)
- Database Management System (DBMS): The software that manages and controls access to a database
- Default: Supplies a predefined value for a column when an inserted row does not set one
- Primary Key: Non-null unique identifier for a row
- Foreign Key: Field or combination of fields that establishes a link between two tables. It enforces referential integrity by ensuring that values in one table's key match with values in another table's key. This relationship allows for data consistency and facilitates queries across related tables
- Record: A row in a table
- Schema: The structure or blueprint of the database, defining the tables, columns, and relationships
- Table: A collection of related data
- View: Virtual table that is made up of elements of multiple physical or "real" tables
- Subsets of SQL
- WHERE vs HAVING
- TRUNCATE vs DELETE
- Table Keys
- Aggregation vs Window Function
- Set Theory
Language | Purpose | Examples | Features |
---|---|---|---|
DCL (Data Control Language) | Manages access and permissions to data | GRANT , REVOKE | Controls who can access, manipulate, or delete data |
DDL (Data Definition Language) | Defines and manages database structure | CREATE , ALTER , DROP | Used to define tables, indexes, constraints, etc |
DML (Data Manipulation Language) | Manipulates data within the database | INSERT , UPDATE , DELETE | Allows adding, modifying, and removing data |
DQL (Data Query Language) | Retrieves data from the database | SELECT | Primarily used for querying data from tables |
Aspect | WHERE Clause | HAVING Clause |
---|---|---|
Purpose | Filters individual rows based on specific conditions | Filters groups of rows after aggregation (using GROUP BY ) |
Data Considered | All rows in the table | Groups created by the GROUP BY clause |
Condition Type | Can use any comparison operators, logical operators | Typically uses aggregate functions (SUM , COUNT , AVG , MIN , MAX ), expressions involving them, or grouping columns |
Execution Timing | Applied before rows are grouped (more efficient) | Applied after rows are grouped (less efficient) |
Requirement | Can be used with or without GROUP BY | Requires a GROUP BY clause |
Compatibility | Can be used with SELECT , UPDATE , and DELETE statements | Can only be used with SELECT statements |
Filtering Logic | Filters rows that meet the condition, excluding others | Filters groups that meet the condition, excluding others |
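The before/after-grouping distinction is easiest to see in one query; a sketch via sqlite3, with an invented sales table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100), ("east", 300), ("west", 50),
                  ("west", 60), ("north", 500)])

# WHERE filters rows BEFORE grouping; HAVING filters groups AFTER aggregation.
rows = list(conn.execute(
    "SELECT region, SUM(amount) AS total "
    "FROM sales "
    "WHERE amount > 50 "          # drops the (west, 50) row first
    "GROUP BY region "
    "HAVING SUM(amount) > 100 "   # then drops groups with small totals
    "ORDER BY region"
))
print(rows)  # [('east', 400), ('north', 500)]
```

The west group survives the WHERE step (one row remains) but its total of 60 fails the HAVING condition.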
Feature | TRUNCATE | DELETE |
---|---|---|
Category | DDL (Data Definition Language) | DML (Data Manipulation Language) |
Function | Removes all rows from a table | Removes specific rows based on conditions (WHERE clause) |
Filtering | Not possible | Possible using WHERE clause |
Transaction | Implicitly commits in most DBMSs (some, e.g., SQL Server, allow it inside a transaction) | Can be used within a transaction |
Constraints | Disables foreign key constraints temporarily | Triggers foreign key constraints and other constraints |
Transaction Log | Records only deallocated data pages | Records each deleted row |
Rollback | Generally not possible (DBMS-dependent) | Possible (if transaction is active) |
Speed | Faster | Slower |
Locking | Exclusive lock on the table | Locks individual rows being deleted |
Identity Columns | Resets auto-incrementing values | Preserves existing values |
Triggers | Does not fire triggers | May fire triggers (depending on definition) |
Permissions Required | ALTER on the table | DELETE on the table |
Use Cases | Quickly emptying staging or temporary tables; resetting test data | Removing specific rows; deletions that must be filtered, audited, or rolled back |
Feature | Primary Key | Unique Key | Foreign Key |
---|---|---|---|
Purpose | Uniquely identifies each row | Ensures uniqueness for a set of columns | Links data between two tables |
Uniqueness | Mandatory (one and only one) | Enforced (no duplicates) | Enforced (references unique values) |
Null Values | Not allowed | Allowed (DBMS-dependent: one or many NULLs) | Allowed (unless the column is declared NOT NULL) |
Number per Table | One | Multiple | Multiple (each referencing a parent table) |
Indexes | Typically creates a clustered index | May or may not create an index | Typically does not create an index |
Auto Increment | Supported | Not supported | Not supported |
Foreign Key Reference | Can be referenced by foreign keys | Can be referenced by foreign keys | References a primary or unique key |
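The three key types can be exercised with sqlite3; the schema is invented, and note that SQLite enforces foreign keys only when the pragma is enabled:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: FKs are off by default
conn.execute("CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE employees ("
    "  id INTEGER PRIMARY KEY,"                     # primary key: unique, non-null
    "  email TEXT UNIQUE,"                          # unique key: no duplicates, NULL allowed
    "  dept_id INTEGER REFERENCES departments(id)"  # foreign key: must match a parent row
    ")"
)
conn.execute("INSERT INTO departments VALUES (1, 'engineering')")
conn.execute("INSERT INTO employees VALUES (1, 'a@x.io', 1)")  # OK: department 1 exists
conn.execute("INSERT INTO employees VALUES (2, NULL, NULL)")   # OK: unique key and FK allow NULL

try:
    conn.execute("INSERT INTO employees VALUES (3, 'c@x.io', 99)")  # no department 99
except sqlite3.IntegrityError as e:
    print(e)  # referential integrity rejects the orphan row
```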
Feature | Aggregation | Window Function |
---|---|---|
Functionality | Applies to entire groups of data | Applies to each row within a group |
Purpose | Summarizes data by reducing it to a single value per group | Calculates running totals, subtotals, ranking, percentiles, on a defined partition of data |
Common Functions | SUM , COUNT , AVG , MIN , MAX | ROW_NUMBER , DENSE_RANK , PERCENT_RANK , LAG , LEAD , and aggregates used with OVER (e.g., a cumulative SUM ) |
Output Format | Reduced result set with one row per group | Maintains the original result set structure with the same number of rows |
Example | SELECT department, COUNT(*) FROM employees GROUP BY department | SELECT name, salary, AVG(salary) OVER (PARTITION BY department ORDER BY hire_date ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) FROM employees |
Use Cases | Suitable for summarizing data across groups, such as calculating total sales per region or average order value per customer segment | Ideal for analytical tasks requiring comparisons between rows, like calculating moving averages, cumulative sums, or identifying top N values within groups |
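The "one row per group" vs "every row kept" contrast, sketched via sqlite3 with an invented employees table (window functions require SQLite >= 3.25, bundled with Python 3.8+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [("ann", "eng", 100), ("ben", "eng", 80), ("cat", "ops", 60)])

# Aggregation: the result collapses to one row per group.
agg = list(conn.execute(
    "SELECT department, COUNT(*) FROM employees "
    "GROUP BY department ORDER BY department"))
print(agg)  # [('eng', 2), ('ops', 1)]

# Window function: every row survives, annotated with its partition's value.
win = list(conn.execute(
    "SELECT name, salary, AVG(salary) OVER (PARTITION BY department) "
    "FROM employees ORDER BY name"))
print(win)  # [('ann', 100, 90.0), ('ben', 80, 90.0), ('cat', 60, 60.0)]
```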
Feature | UNION | UNION ALL | INTERSECT | EXCEPT |
---|---|---|---|---|
Purpose | Combines the results of two or more sets and removes duplicates | Combines the results of two or more sets without removing duplicates | Returns elements common to both sets | Returns elements present in the first set but not in the second |
Syntax | SELECT id FROM A UNION SELECT id FROM B | SELECT id FROM A UNION ALL SELECT id FROM B | SELECT id FROM A INTERSECT SELECT id FROM B | SELECT id FROM A EXCEPT SELECT id FROM B |
Example | A = {1, 2, 3}, B = {2, 4, 5} → {1, 2, 3, 4, 5} (w/o duplicates) | A = {1, 2, 3}, B = {2, 4, 5} → {1, 2, 2, 3, 4, 5} (w/ duplicates) | A = {1, 2, 3}, B = {2, 4, 5} → {2} (elements common to both sets) | A = {1, 2, 3}, B = {2, 4, 5} → {1, 3} (elements in A but not B) |
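The A = {1, 2, 3}, B = {2, 4, 5} example from the table, run through sqlite3 (which supports all four operators):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);  INSERT INTO a VALUES (1), (2), (3);
    CREATE TABLE b (id INTEGER);  INSERT INTO b VALUES (2), (4), (5);
""")

q = lambda sql: [r[0] for r in conn.execute(sql)]
print(q("SELECT id FROM a UNION SELECT id FROM b ORDER BY id"))      # [1, 2, 3, 4, 5]
print(q("SELECT id FROM a UNION ALL SELECT id FROM b"))              # duplicates kept
print(q("SELECT id FROM a INTERSECT SELECT id FROM b ORDER BY id"))  # [2]
print(q("SELECT id FROM a EXCEPT SELECT id FROM b ORDER BY id"))     # [1, 3]
```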
Type | Definition | Features | Use Cases |
---|---|---|---|
Skip List | Probabilistic data structure for ordered sets or maps, offering efficient search and insertion with average-case complexity comparable to balanced trees | Used in Redis | In-memory |
Hash Index | Uses a hash function to quickly map data keys to their locations, ideal for fast lookups | Most common in-memory index solution | In-memory |
SSTable | File format for storing data in sorted order on disk, enabling efficient retrieval operations | Immutable data structure; seldom used alone | Disk-based |
LSM Tree | Combines in-memory and disk-based storage for ordered data, optimizing write performance with later merging for efficient reads | High write throughput; disk compaction may impact performance | Memory + disk |
B-Tree | Self-balancing tree data structure for sorted data, allowing efficient search, insertion, and deletion operations | Most popular database index implementation | Disk-based |
Inverted Index | Maps words to the documents they appear in, facilitating fast full-text searches | Used in search engines such as Lucene | Document search |
Suffix Tree | Stores suffixes of words, enabling efficient searches for patterns and substrings within a text | Used in string search, such as suffix matching | String search |
R-Tree | Stores spatial data like points, rectangles, or polygons, allowing efficient searches for objects within a specific area | Nearest-neighbor search | Multi-dimensional / spatial search |
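The effect of a (B-tree) index can be seen in a query plan; a sketch with sqlite3, whose EXPLAIN QUERY PLAN output shows whether the planner scans the whole table or searches an index (the table and index names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")

# Without an index the planner must scan the whole table.
plan = conn.execute("EXPLAIN QUERY PLAN "
                    "SELECT * FROM users WHERE email = 'a@x.io'").fetchall()
print(plan)  # detail column says SCAN (full table scan)

# SQLite indexes, like those of most relational databases, are B-trees.
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan = conn.execute("EXPLAIN QUERY PLAN "
                    "SELECT * FROM users WHERE email = 'a@x.io'").fetchall()
print(plan)  # detail column now says SEARCH ... USING INDEX idx_users_email
```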
Type | Definition | Example |
---|---|---|
SELECT | Retrieves data from a database | SELECT column1, column2 FROM table_name |
INSERT | Adds new records to a table | INSERT INTO table_name (column1, column2) VALUES (value1, value2) |
UPDATE | Modifies existing records in a table | UPDATE table_name SET column1 = value1, column2 = value2 WHERE condition |
DELETE | Deletes records from a table | DELETE FROM table_name WHERE condition |
JOIN | Combines rows from two or more tables based on a related column | SELECT * FROM table1 INNER JOIN table2 ON table1.column = table2.column |
GROUP BY | Groups rows with identical values into summary rows | SELECT column1, COUNT(*) FROM table_name GROUP BY column1 |
HAVING | Filters records grouped by GROUP BY clause | SELECT column1, COUNT(*) FROM table_name GROUP BY column1 HAVING COUNT(*) > 10 |
ORDER BY | Sorts the result set in ascending or descending order | SELECT * FROM table_name ORDER BY column1 DESC |
WHERE | Filters records based on specified conditions | SELECT * FROM table_name WHERE condition |
DISTINCT | Returns only distinct (different) values | SELECT DISTINCT column1 FROM table_name |
UNION | Combines the result sets of two or more SELECT statements | SELECT column1 FROM table1 UNION SELECT column1 FROM table2 |
TRANSACTION | Groups a set of SQL statements into a single unit of work | BEGIN TRANSACTION; ... COMMIT; or ROLLBACK; |
INDEX | Creates an index on a table | CREATE INDEX index_name ON table_name (column1) |
VIEW | Virtual table derived from one or more tables | CREATE VIEW view_name AS SELECT column1, column2 FROM table_name WHERE condition |
TRIGGER | Executes a set of actions when a certain event occurs on a table | CREATE TRIGGER trigger_name BEFORE INSERT ON table_name FOR EACH ROW BEGIN ... END |
SUBQUERY | Nested query inside another query | SELECT column1 FROM table_name WHERE column1 IN (SELECT column1 FROM table2 WHERE condition) |
CASE | Provides conditional logic within a query | SELECT column1, CASE WHEN condition THEN result1 ELSE result2 END AS result FROM table_name |
Process of organizing data in a relational database to minimize redundancy and improve data integrity.
- Reduce data duplication: This saves storage space and minimizes maintenance headaches
- Enhance data consistency: When a piece of data changes, it only needs to be updated in one place, ensuring consistency across the database
- Improve data retrieval efficiency: Normalized databases allow for faster and more efficient querying of data
Normalization is achieved through a series of steps, each referred to as a Normal Form (NF). Each subsequent form builds upon the previous one, progressively reducing redundancy.
Normal Form | Key Points |
---|---|
0NF | Non-normalized data; may contain repeating groups and redundant values |
1NF | Each column holds atomic (indivisible) values; no repeating groups; each row is unique |
2NF | In 1NF, and every non-key attribute depends on the whole primary key (no partial dependencies) |
3NF | In 2NF, and non-key attributes depend only on the key (no transitive dependencies) |
Boyce-Codd Normal Form (BCNF) | Stricter 3NF: every determinant is a candidate key |
4NF | In BCNF, and free of multi-valued dependencies |
5NF | In 4NF, and free of join dependencies (cannot be decomposed further without losing information) |
6NF | Relations are irreducible; no non-trivial join dependencies remain |
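The 1NF-to-3NF progression can be shown by decomposing a small denormalized table; the schema and data are invented, with sqlite3 as the database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Denormalized table: customer details repeat on every order (redundancy,
# and a transitive dependency: city depends on the customer, not the order).
conn.executescript("""
    CREATE TABLE orders_flat (order_id INTEGER, customer TEXT,
                              city TEXT, amount INTEGER);
    INSERT INTO orders_flat VALUES
        (1, 'alice', 'Oslo', 100),
        (2, 'alice', 'Oslo', 250),
        (3, 'bob',   'Lima', 75);
""")

# 3NF decomposition: each customer fact lives in exactly one row.
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id),
                         amount INTEGER);
    INSERT INTO customers VALUES (1, 'alice', 'Oslo'), (2, 'bob', 'Lima');
    INSERT INTO orders VALUES (1, 1, 100), (2, 1, 250), (3, 2, 75);
""")

# If alice moves, the flat table needs two updates; the normalized one needs one.
conn.execute("UPDATE customers SET city = 'Bergen' WHERE name = 'alice'")
rows = list(conn.execute(
    "SELECT o.order_id, c.city FROM orders o "
    "JOIN customers c ON c.id = o.customer_id ORDER BY o.order_id"))
print(rows)  # [(1, 'Bergen'), (2, 'Bergen'), (3, 'Lima')]
```

One update is reflected everywhere: the consistency benefit normalization is after.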
Type | Definition | Syntax |
---|---|---|
Left Join / Left Outer Join | Everything on the left + everything on the right that matches | SELECT * FROM table1 LEFT JOIN table2 ON table1.key = table2.key |
Anti Left Join | Everything on the left that is NOT on the right | SELECT * FROM table1 LEFT JOIN table2 ON table1.key = table2.key WHERE table2.key IS NULL |
Right Join / Right Outer Join | Everything on the right + everything on the left that matches | SELECT * FROM table1 RIGHT JOIN table2 ON table1.key = table2.key |
Anti Right Join | Everything on the right that is NOT on the left | SELECT * FROM table1 RIGHT JOIN table2 ON table1.key = table2.key WHERE table1.key IS NULL |
Full Outer Join | Everything on the left + everything on the right | SELECT * FROM table1 FULL OUTER JOIN table2 ON table1.key = table2.key |
Anti Outer Join | Everything on the left and right that is unique to each side | SELECT * FROM table1 FULL OUTER JOIN table2 ON table1.key = table2.key WHERE table1.key IS NULL OR table2.key IS NULL |
Join / Inner Join | Only the rows that are common to both | SELECT * FROM table1 INNER JOIN table2 ON table1.key = table2.key |
Cross Join | All combinations of rows from both tables (Cartesian product) | SELECT * FROM table1 CROSS JOIN table2 |
Self Join | The table is joined with itself | SELECT * FROM table1 AS t1 INNER JOIN table1 AS t2 ON t1.key = t2.key |
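A few of these joins, run through sqlite3 on two invented single-column tables (the sketch sticks to left-side variants, since RIGHT and FULL OUTER JOIN require SQLite >= 3.39):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (key INTEGER);  INSERT INTO t1 VALUES (1), (2), (3);
    CREATE TABLE t2 (key INTEGER);  INSERT INTO t2 VALUES (2), (3), (4);
""")
q = lambda sql: list(conn.execute(sql))

# Inner join: only keys present on both sides.
inner = q("SELECT t1.key FROM t1 INNER JOIN t2 ON t1.key = t2.key ORDER BY t1.key")
# Left join: every t1 row, NULL-padded where t2 has no match.
left = q("SELECT t1.key, t2.key FROM t1 LEFT JOIN t2 ON t1.key = t2.key ORDER BY t1.key")
# Anti left join: t1 rows with no counterpart in t2.
anti = q("SELECT t1.key FROM t1 LEFT JOIN t2 ON t1.key = t2.key WHERE t2.key IS NULL")

print(inner)  # [(2,), (3,)]
print(left)   # [(1, None), (2, 2), (3, 3)]
print(anti)   # [(1,)]
```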
Feature | One-to-One | One-to-Many | Many-to-Many |
---|---|---|---|
Definition | Each record in one table is associated with exactly one record in another | Each record in one table can be associated with one or more records in another | Records in both tables can be associated with multiple records in the other |
Example | Person → Passport | Department → Employee | Student → Course |
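A many-to-many relationship is conventionally modeled with a junction table holding pairs of foreign keys; a sketch of the Student → Course example with sqlite3 (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE courses  (id INTEGER PRIMARY KEY, title TEXT);
    -- Junction table: one row per (student, course) pairing.
    CREATE TABLE enrollments (
        student_id INTEGER REFERENCES students(id),
        course_id  INTEGER REFERENCES courses(id),
        PRIMARY KEY (student_id, course_id)
    );
    INSERT INTO students VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO courses  VALUES (10, 'SQL'), (20, 'Networking');
    INSERT INTO enrollments VALUES (1, 10), (1, 20), (2, 10);
""")

rows = list(conn.execute("""
    SELECT s.name, c.title FROM enrollments e
    JOIN students s ON s.id = e.student_id
    JOIN courses  c ON c.id = e.course_id
    ORDER BY s.name, c.title
"""))
print(rows)  # [('alice', 'Networking'), ('alice', 'SQL'), ('bob', 'SQL')]
```

Each student can take many courses and each course can have many students, yet every fact is stored once.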
Issue | Definition | Impact | Solutions |
---|---|---|---|
Cartesian Products | 2 or more tables are joined without a proper join condition or WHERE clause, producing every combination of rows | Huge intermediate result sets; slow queries; high memory use | Always specify join conditions; review execution plans |
Connections Overhead | Establishing multiple connections to the same database | Added latency per request; exhausted connection limits | Use connection pooling; reuse connections |
Data Inconsistency | Same data exists in different forms across the database | Conflicting query results; unreliable reports | Normalize; enforce constraints; keep a single source of truth |
Deadlocks | 2 or more transactions are waiting for each other to release locks | Aborted transactions; stalled throughput | Acquire locks in a consistent order; keep transactions short; use timeouts and retries |
Full Table Scans | Scanning the entire table to satisfy a query instead of using indexes | Slow queries; high I/O and CPU load | Add appropriate indexes; rewrite queries so predicates can use them |
Missing Indexes | Indexes that are required to satisfy a query are missing | Slow lookups and joins that degrade as data grows | Inspect execution plans; create indexes on filtered and joined columns |
N+1 | Initial query fetches N rows, and then N additional queries are executed to fetch related data for each row | N extra round trips; latency grows linearly with result size | Eager loading; JOINs; batching related lookups |
Over/Under Fetching | Overfetching occurs when unnecessary data is fetched from the database, while underfetching occurs when necessary data is not fetched | Wasted bandwidth and memory, or extra follow-up queries | Select only needed columns; design queries and endpoints around actual access patterns |
Poorly Designed Schema | Lacks normalization, proper data types, and relationships, leading to inefficient data access | Redundancy, update anomalies, and slow or convoluted queries | Normalize; choose correct data types; define keys and relationships |
Subquery Inefficiency | Subqueries are used inefficiently (e.g., correlated subqueries executed once per row) | Repeated execution inflates query time | Rewrite as JOINs or CTEs; use EXISTS where appropriate |
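The N+1 pattern and its JOIN-based fix can be sketched with sqlite3; the authors/books schema is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'ann'), (2, 'ben'), (3, 'cat');
    INSERT INTO books VALUES (1, 1, 'A'), (2, 2, 'B'), (3, 3, 'C');
""")

# N+1 pattern: 1 query for the authors, then 1 more query per author.
queries = 0
authors = list(conn.execute("SELECT id, name FROM authors")); queries += 1
for author_id, _name in authors:
    conn.execute("SELECT title FROM books WHERE author_id = ?",
                 (author_id,)).fetchall()
    queries += 1
print(queries)  # 4 queries for 3 authors (N+1)

# Fix: one JOIN fetches everything in a single round trip.
rows = list(conn.execute(
    "SELECT a.name, b.title FROM authors a JOIN books b ON b.author_id = a.id"))
print(len(rows))  # 3 rows from 1 query
```

Against a remote database each query is a network round trip, so the JOIN (or an ORM's eager loading) turns O(N) round trips into O(1).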