Data Management
Theorems
- CAP
- PACELC
- Consistency (C): Ensures that all nodes see the same data at the same time; every read returns the most recent write
- Availability (A): Ensures that every request receives a response indicating success or failure, even if it does not reflect the most recent write
- Partition Tolerance (P): Ensures that the system continues to operate despite network partitions or communication failures between nodes
Aspect | AP (Availability & Partition Tolerance) | CA (Consistency & Availability) | CP (Consistency & Partition Tolerance) |
---|---|---|---|
Definition | Stays available during a partition, but some data may be stale or inconsistent | Consistent and available only while the network is reliable; a partition can stop the system | Stays consistent during a partition, but some data may be unavailable until the failure is resolved |
Use Cases | Social networks, real-time analytics, recommendation systems | Financial applications, e-commerce | Multi-datacenter deployments |
Examples | Cassandra, DynamoDB, Riak | Google Spanner, RDBMS with high availability configurations | MongoDB with replica sets, BigTable |
PACELC extends CAP: if a Partition occurs, the system trades off Availability versus Consistency; Else (during normal operation), it trades off Latency versus Consistency.
Theorem | Scope | Consistency Model | Latency Consideration |
---|---|---|---|
CAP | Focuses on impact of network partitions on consistency and availability | Binary choice between strong consistency and availability | Doesn't explicitly consider latency |
PACELC | Broader view, acknowledging trade-offs present even under normal operation | Consistency is treated as a spectrum, offering more nuanced options | Recognizes latency as a critical factor alongside consistency and availability (data replication can impact latency) |
Distributed Unique Identifiers
Type | Structure | Example | Features |
---|---|---|---|
MongoDB ObjectID | 12 bytes: 4-byte timestamp, 5-byte random value, 3-byte incrementing counter | 6522bfc8-6abf1a160a-16a83e | Roughly time-ordered; generated client-side without central coordination |
Nano ID | 21 characters by default, drawn from A-Za-z0-9_- | TZOb75IqNux-DuSLisVDp | URL-friendly; generated client-side; not time-ordered |
Sequence | Auto-incrementing integer issued by the database | 88 | Compact and strictly ordered; requires a single point of coordination |
Sonyflake | 63 bits: 39-bit timestamp (10 ms units), 8-bit sequence, 16-bit machine ID | 0-000011010110000100110111000001110001101-00000000-0000000000000001 | Time-sortable; supports more machines than Snowflake at a lower per-machine generation rate |
Twitter Snowflake | 64 bits: 1 sign bit, 41-bit timestamp (ms), 5-bit datacenter ID, 5-bit worker ID, 12-bit sequence | 0-00101111011111011010110100111101110001111-00000-00001-000000000000 | Time-sortable; up to 4,096 IDs per millisecond per worker |
Universally Unique Identifier (UUID) | 128 bits, rendered as 32 hexadecimal digits in 5 groups | d6e9ec10-65e2-11ee-97a0-3eb31bb9ccfe | Globally unique without coordination; common versions are time-based (v1) and random (v4) |
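As a rough illustration of the Snowflake layout above, the sketch below packs a millisecond timestamp, datacenter ID, worker ID, and per-millisecond sequence into one 64-bit integer. The custom epoch, class name, and exact behavior are illustrative assumptions, not any particular library's implementation.

```python
import time
import threading

# Illustrative custom epoch (ms); real deployments pick a fixed start date.
CUSTOM_EPOCH_MS = 1_600_000_000_000

class SnowflakeGenerator:
    """Minimal Snowflake-style generator: 41-bit timestamp, 5-bit datacenter,
    5-bit worker, 12-bit per-millisecond sequence."""

    def __init__(self, datacenter_id: int, worker_id: int):
        assert 0 <= datacenter_id < 32 and 0 <= worker_id < 32
        self.datacenter_id = datacenter_id
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ts = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            ts = int(time.time() * 1000) - CUSTOM_EPOCH_MS
            if ts == self.last_ts:
                self.sequence = (self.sequence + 1) & 0xFFF   # 12-bit sequence
                if self.sequence == 0:                        # exhausted: wait for next ms
                    while ts <= self.last_ts:
                        ts = int(time.time() * 1000) - CUSTOM_EPOCH_MS
            else:
                self.sequence = 0
            self.last_ts = ts
            # Pack the fields: timestamp | datacenter | worker | sequence.
            return (ts << 22) | (self.datacenter_id << 17) | (self.worker_id << 12) | self.sequence

gen = SnowflakeGenerator(datacenter_id=1, worker_id=1)
print(gen.next_id())   # 64-bit, roughly time-sortable ID
```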
Data Synchronization & Distribution Mechanisms
- Sharding/Partitioning
- Replication
Database sharding splits a large database across machines for better handling of massive datasets.
Benefits
- Improve response time
- Avoid total service outage
- Scale efficiently
Type | Definition | Use Cases |
---|---|---|
Consistent Hashing | Distributes data across a dynamic number of partitions using a hash function | Distributed databases, Content Delivery Networks (CDNs) |
Directory Based Sharding | A central directory maps data to specific shards based on predefined rules | Strong consistency and moderate scalability |
Geo Sharding | Divides data based on geographic regions to localize data access | Applications requiring regional data localization or geo-distributed databases |
Horizontal Partitioning (Sharding) | Data is partitioned across multiple databases or shards based on a criterion such as user ID or timestamp | High data volume and scalability requirements |
Key-Based Sharding | Data is distributed across shards based on a predefined key | Predictable access patterns and high scalability requirements |
Range-Based Sharding | Divides data into ranges (numeric, alphabetical) and assigns each range to a shard | Range-based queries and moderate scalability requirements; time-series or other sequential data |
Vertical Partitioning | Segregates data by attributes or columns | Specific data access patterns and less dynamic schemas |
Materialized Views | Precomputed views stored for faster query performance | Read-heavy workloads with expensive or repeated queries |
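A minimal sketch of consistent hashing as described above: nodes and keys are hashed onto the same ring, and each key is routed to the next node clockwise. The shard names, virtual-node count, and use of MD5 are illustrative choices.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Keys map to the first node clockwise on the hash ring; adding or
    removing a node only remaps the keys adjacent to it."""

    def __init__(self, nodes, vnodes: int = 100):
        self.ring = []                       # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):          # virtual nodes smooth the distribution
                bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)   # wrap around
        return self.ring[idx][1]

ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
print(ring.node_for("user:42"))   # the same key always routes to the same shard
```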
- Types
- Methods
- CDC
- Strategies
- Conflict Resolutions
Database replication is the process of duplicating data from one database to others, ensuring that multiple copies of the same data are available across different locations or systems. This redundancy enhances data availability, fault tolerance, and scalability.
Benefits
- Durability
  - Replication enhances durability, preventing catastrophic data loss
  - It ensures data preservation across multiple servers
  - Replication, alongside backups, minimizes data loss windows and downtime
- Availability
  - Replication boosts system availability and resilience
  - It enables seamless failover to standby servers
  - Without replication, server outages could cause prolonged downtime
- Increasing Throughput
  - Replication spreads load across nodes, boosting throughput
  - Additional replicas can be added for further scalability
  - Proper management avoids replication overhead bottlenecks
- Reducing Latency
  - Replication brings data closer to users, reducing latency
  - Shorter network distance leads to faster response times
  - Multi-region replication improves user experience and productivity
Aspect | Full Table Replication | Key-based Incremental Replication | Log-based Incremental Replication | Trigger-based Replication | Snapshot Replication |
---|---|---|---|---|---|
Overview | Replicates entire tables | Replicates only changed rows based on key values | Replicates changes based on transaction logs | Replicates changes based on triggers | Replicates a point-in-time copy of data |
Data Volume / Network Bandwidth Usage | High | Moderate | Low | Moderate | High |
Use Cases | Data Warehousing, Reporting | Synchronizing specific datasets between databases | Replicating changes from a primary to secondary database | Replicating changes between databases with complex business logic | Creating backups for disaster recovery |
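A minimal sketch of key-based incremental replication: only rows whose monotonically increasing replication key (here `updated_at`) is newer than the last replicated value are copied. The `orders` table, its columns, and the use of SQLite are illustrative assumptions.

```python
import sqlite3

# Two in-memory databases stand in for the source and target systems.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for db in (source, target):
    db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, payload TEXT, updated_at REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [("o-1", "new", 1.0), ("o-2", "paid", 2.0)])

def replicate_increment(last_replicated_at: float) -> float:
    """Copy rows changed since the watermark and return the new watermark."""
    rows = source.execute(
        "SELECT id, payload, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (last_replicated_at,)).fetchall()
    for row_id, payload, updated_at in rows:
        target.execute("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)",
                       (row_id, payload, updated_at))
        last_replicated_at = updated_at
    target.commit()
    return last_replicated_at          # persist this watermark for the next run

watermark = replicate_increment(0.0)        # first run copies both rows
watermark = replicate_increment(watermark)  # second run finds nothing new
```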
Method | Definition |
---|---|
Bi-Directional | Data flows in both directions between source and target databases, allowing updates on either side |
Broadcast | Data from a single source is replicated to multiple targets simultaneously |
Cascading | Replication is chained in a cascade: changes propagate sequentially through multiple tiers of databases |
Consolidation | Data from multiple sources is consolidated into a single target database |
Peer-to-Peer | All databases are peers and can act as both source and target; data can flow between any pair of databases |
Unidirectional | Data flows in one direction, from the source to the target databases |
Change Data Capture (CDC): Captures changes in real-time as they occur at the source database
Aspect | Transactional CDC | Batch-Optimized CDC | Data Warehouse Ingest-Merge | Message-Encoded CDC |
---|---|---|---|---|
Definition | Captures changes in real-time as they occur at the source database | Captures changes in bulk at specific intervals, rather than in real-time | Ingests and merges data from multiple sources into a data warehouse | Encodes changes into messages for asynchronous processing and consumption |
Performance | Real-time, minimal latency for data replication | High throughput, reduced impact on source systems due to batch processing | Typically batch-oriented, suitable for large-scale data movement | Depends on message broker performance; can be asynchronous, may introduce latency |
Data Consistency | Ensures consistency between source and target systems in near real-time | Data consistency may lag behind real-time due to batch processing | May require additional checks to maintain consistency during merge process | Consistency depends on message delivery guarantees and processing logic |
Use Cases | Real-time data synchronization (financial transactions, inventory management) | Daily reporting, data warehousing | Commonly used for data warehousing, analytics, and reporting purposes | Useful for event-driven architectures, microservices, and distributed systems |
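A minimal sketch of consuming message-encoded CDC events: each event describes an insert, update, or delete, and applying it as an upsert or delete keeps the handler idempotent. The event shape (`op`, `key`, `after`) is an assumed, Debezium-like format rather than any specific tool's contract.

```python
import json

target = {}   # key -> row; stands in for the target table

def apply_change_event(raw_event: str) -> None:
    """Apply one CDC message to the target store idempotently."""
    event = json.loads(raw_event)
    key = event["key"]
    if event["op"] in ("insert", "update"):
        target[key] = event["after"]      # upsert: replaying the event is harmless
    elif event["op"] == "delete":
        target.pop(key, None)

apply_change_event('{"op": "insert", "key": "42", "after": {"status": "NEW"}}')
apply_change_event('{"op": "update", "key": "42", "after": {"status": "PAID"}}')
print(target)   # {'42': {'status': 'PAID'}}
```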
Replication Strategy | Description |
---|---|
Leader-Follower Replication (Source-Replica / Master-Slave / Primary-Secondary) | A single primary instance accepts write operations, while one or more replicas copy data from the leader; replicas typically handle read operations |
Active/Active Replication | Multiple database instances accept both read and write operations simultaneously; each instance can serve read and write requests independently |
Multi-Leader Replication (Master-Master / Primary-Primary) | Multiple database instances accept write operations independently, and changes are asynchronously replicated between them |
Leaderless Replication | No designated leader; each node in the cluster can accept both read and write operations, and data is replicated across all nodes |
Quorum Writes and Reads | A write or read is considered successful only once a certain number (quorum) of nodes agrees on it; often used in distributed databases to balance consistency and availability |
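A minimal sketch of quorum writes and reads over N in-memory replicas, assuming W + R > N so that read and write quorums overlap. Real systems issue these requests over the network and also handle replica failures and read repair; the values of N, W, and R here are illustrative.

```python
N, W, R = 3, 2, 2               # W + R > N: read and write quorums overlap
replicas = [dict() for _ in range(N)]   # each dict stands in for one replica

def quorum_write(key, value, version) -> bool:
    """Write to replicas until at least W acknowledge the new version."""
    acks = 0
    for replica in replicas:
        replica[key] = (version, value)
        acks += 1
        if acks >= W:
            return True
    return False

def quorum_read(key):
    """Query R replicas and return the value with the highest version."""
    responses = [replica[key] for replica in replicas[:R] if key in replica]
    if not responses:
        return None
    return max(responses, key=lambda r: r[0])[1]   # newest version wins

quorum_write("user:42", {"name": "Ada"}, version=1)
print(quorum_read("user:42"))   # {'name': 'Ada'}
```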
Criteria | Last Write Wins (LWW) | Conflict-free Replicated Data Types (CRDTs) | Operational Transformation | Application-specific Resolution |
---|---|---|---|---|
Principle | The latest update overwrites previous ones | Concurrent updates merge seamlessly | Transformations are applied to resolve conflicts | Custom logic defines resolution rules |
Concurrency Control | Often based on timestamps | Built-in, ensures eventual consistency | Complex, requires careful design | Depends on implementation approach |
Conflict Detection | Timestamps or version vectors | Built-in mechanisms handle concurrent updates | Requires tracking dependencies | Custom logic or metadata tracking |
Use Cases | Simple applications with low concurrency where data loss is acceptable | Collaborative editing systems, real-time communication systems | Collaborative editing, version control systems, distributed databases | Application-specific needs, such as financial transactions |
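As an example of the CRDT column above, here is a minimal grow-only counter (G-Counter): each node increments only its own slot, and merging takes the element-wise maximum, so concurrent updates converge without coordination. The node names are illustrative.

```python
class GCounter:
    """Grow-only counter CRDT: per-node counts merged by element-wise max."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {node_id: 0}

    def increment(self, amount: int = 1) -> None:
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())

a, b = GCounter("node-a"), GCounter("node-b")
a.increment()
b.increment(2)
a.merge(b)
b.merge(a)
print(a.value(), b.value())   # both converge to 3
```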
Communication Patterns
Distributed databases offer scalability and fault tolerance, but introduce communication challenges.
Core Concepts
- Data Distribution: Understanding how data is sharded or replicated across nodes is crucial for choosing communication patterns
- Synchronous vs. Asynchronous: Synchronous communication waits for a response before proceeding, while asynchronous allows independent execution. Selection depends on real-time response needs and fault tolerance requirements
- Consistency Models: Different consistency models (eventual consistency, strong consistency) define how quickly updates propagate across nodes, impacting communication frequency
- Choreography
- CQRS
- Distributed Query Processing
- Event Sourcing
- Modular
- Orchestration
- Outbox
- Parallel Pipelines
- Phase Commit
Choreography: decentralized coordination between services; each service reacts to events emitted by other services, with no central controller.
CQRS (Command Query Responsibility Segregation): separates read operations (queries) from write operations (commands), often with distinct models or stores for each side.
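A minimal CQRS sketch, assuming an in-memory write store and a denormalized read model projected from it; in practice the projection is often asynchronous and the two sides may use different databases. All names are illustrative.

```python
write_store = {}                        # order_id -> order row (write side)
read_model = {"orders_by_status": {}}   # denormalized view optimized for queries

def handle_place_order(order_id: str, status: str) -> None:
    """Command handler: mutate the write model, then project into the read model."""
    write_store[order_id] = {"status": status}
    read_model["orders_by_status"].setdefault(status, []).append(order_id)

def query_orders_by_status(status: str):
    """Query handler: reads only the read model."""
    return read_model["orders_by_status"].get(status, [])

handle_place_order("o-1", "NEW")
print(query_orders_by_status("NEW"))   # ['o-1']
```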
Distributed Query Processing: executes a query across multiple distributed nodes and combines the partial results.
Event Sourcing: stores each data change as an event in append-only storage instead of storing only the current state; replaying the events rebuilds any state or snapshot.
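A minimal event-sourcing sketch: state is never stored directly, only an append-only log of events, and replaying the log rebuilds the current state. The event names and account example are illustrative.

```python
event_log = []   # append-only list of (event_type, payload)

def append_event(event_type: str, payload: dict) -> None:
    event_log.append((event_type, payload))

def rebuild_balance(account_id: str) -> int:
    """Replay the full event log to derive the current balance."""
    balance = 0
    for event_type, payload in event_log:
        if payload["account_id"] != account_id:
            continue
        if event_type == "Deposited":
            balance += payload["amount"]
        elif event_type == "Withdrawn":
            balance -= payload["amount"]
    return balance

append_event("Deposited", {"account_id": "acc-1", "amount": 100})
append_event("Withdrawn", {"account_id": "acc-1", "amount": 30})
print(rebuild_balance("acc-1"))   # 70
```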
Modular: data and transactions are divided into modules, each handled independently.
Orchestration: a central orchestrator sequences the steps of a distributed transaction into a saga, invoking compensating actions when a step fails.
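A minimal saga-orchestration sketch: the orchestrator runs each local step in order and, when one fails, runs the compensations of the steps that already succeeded in reverse order. The step names are illustrative.

```python
def run_saga(steps) -> bool:
    """Run (action, compensation) pairs; on failure, undo completed steps."""
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            for undo in reversed(completed):   # roll back what already ran
                undo()
            return False
    return True

ok = run_saga([
    (lambda: print("reserve inventory"), lambda: print("release inventory")),
    (lambda: print("charge payment"),    lambda: print("refund payment")),
    (lambda: print("create shipment"),   lambda: print("cancel shipment")),
])
print("saga committed" if ok else "saga rolled back")
```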
Outbox: uses an outbox table, written in the same local transaction as the data change, to guarantee message delivery.
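A minimal outbox sketch using SQLite: the business row and the outgoing message are inserted in one local transaction, and a separate relay later publishes any unpublished outbox rows. The table names and the `publish` callback are illustrative assumptions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         payload TEXT, published INTEGER DEFAULT 0);
""")

def place_order(order_id: str) -> None:
    # Single atomic transaction: either both rows exist or neither does.
    with db:
        db.execute("INSERT INTO orders VALUES (?, 'NEW')", (order_id,))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (f'{{"order_id": "{order_id}"}}',))

def relay_outbox(publish) -> None:
    """Publish pending outbox rows, then mark them as delivered."""
    rows = db.execute("SELECT id, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, payload in rows:
        publish(payload)                       # e.g. send to a message broker
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()

place_order("o-1")
relay_outbox(publish=print)
```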
Parallel Pipelines: divides data processing into pipelines that run in parallel.
Aspect | Two-Phase Commit (2PC) | Three-Phase Commit (3PC) |
---|---|---|
Definition | Ensures all participants commit or abort together in 2 phases | Extension of 2PC adding a "prepare to abort" phase for increased fault tolerance |
Steps | 1. Prepare (voting) phase 2. Commit or abort phase | 1. CanCommit phase 2. PreCommit phase 3. DoCommit phase |
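A minimal two-phase commit sketch: the coordinator collects prepare votes from every participant and commits only if all vote yes, otherwise aborts. Participants are plain in-process objects here; a real protocol also has to persist decisions and handle timeouts and coordinator failure.

```python
class Participant:
    def __init__(self, name: str, can_commit: bool = True):
        self.name, self.can_commit = name, can_commit

    def prepare(self) -> bool:        # phase 1: vote yes/no
        return self.can_commit

    def commit(self) -> None:
        print(f"{self.name}: commit")

    def abort(self) -> None:
        print(f"{self.name}: abort")

def two_phase_commit(participants) -> bool:
    votes = [p.prepare() for p in participants]   # phase 1: prepare/voting
    if all(votes):
        for p in participants:
            p.commit()                            # phase 2: commit everywhere
        return True
    for p in participants:
        p.abort()                                 # phase 2: abort everywhere
    return False

two_phase_commit([Participant("inventory"), Participant("payments")])
```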
Cache
Traditional cache stores frequently accessed data in a faster-to-access location (usually RAM) compared to the primary data source (typically a database).
Distributed cache extends this concept by spreading the cached data across multiple machines (nodes) within a network.
Strategy | Definition |
---|---|
Read Cache-Aside (Lazy-Loading) | The application checks the cache first; if the data is not there, it fetches it from the data source and then caches it for subsequent accesses |
Read-Through | The cache itself loads missing data from the main storage on a miss, so the application always reads via the cache and the cache reflects the most up-to-date information available |
Write-Around | Data is written directly to the main storage, bypassing the cache; subsequent reads of that data can still be cached for faster access |
Write-Back | Data is written to the cache first and flushed to the main storage later, reducing the number of storage writes by batching multiple updates before writing back |
Write-Through | Data is written simultaneously to both the cache and the underlying storage, ensuring consistency between the two at all times |
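A minimal cache-aside (lazy-loading) sketch: the application checks the cache first and only falls back to the data source on a miss, then populates the cache with a TTL. The in-process dict stands in for a real cache such as Redis; the TTL and loader are illustrative.

```python
import time

cache = {}          # key -> (value, expires_at)
TTL_SECONDS = 60

def load_from_db(key: str) -> str:
    """Placeholder for the real data source (database, API, ...)."""
    return f"row-for-{key}"

def get(key: str) -> str:
    entry = cache.get(key)
    if entry and entry[1] > time.time():
        return entry[0]                              # cache hit
    value = load_from_db(key)                        # cache miss: go to the database
    cache[key] = (value, time.time() + TTL_SECONDS)  # populate for later reads
    return value

print(get("user:42"))   # miss: loads from the data source and caches it
print(get("user:42"))   # hit: served from the cache
```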