Data Management
Theorems
- CAP
- PACELC
- Consistency (C): Ensures that all nodes see the same data at the same time; every read returns the most recent write
- Availability (A): Ensures that every request receives a response indicating success or failure, even if it does not reflect the most recent write
- Partition Tolerance (P): Ensures that the system continues to operate despite network partitions or communication failures between nodes
Aspect | AP (Availability & Partition Tolerance) | CA (Consistency & Availability) | CP (Consistency & Partition Tolerance) |
---|---|---|---|
Definition | Stays available during a partition, but some data may be stale or inconsistent | Consistent and available only while the network is reliable; a partition can stop the system | Stays consistent during a partition, but some data may be unavailable until the failure is resolved |
Use Cases | Social networks, real-time analytics, recommendation systems | Financial applications, e-commerce | Multi-datacenter deployments |
Examples | Cassandra, DynamoDB, Riak | Google Spanner, RDBMS with high availability configurations | MongoDB with replica sets, BigTable |
PACELC extends CAP: if a Partition occurs, the system trades off Availability versus Consistency; Else (during normal operation), it trades off Latency versus Consistency.
Theorem | Scope | Consistency Model | Latency Consideration |
---|---|---|---|
CAP | Focuses on impact of network partitions on consistency and availability | Binary choice between strong consistency and availability | Doesn't explicitly consider latency |
PACELC | Broader view, acknowledging trade-offs present even under normal operation | Consistency is treated as a spectrum, offering more nuanced options | Recognizes latency as a critical factor alongside consistency and availability (data replication can impact latency) |
Distributed Unique Identifiers
Type | Structure | Example | Features |
---|---|---|---|
MongoDB ObjectID | 12 bytes: 4-byte timestamp, 5-byte random value, 3-byte incrementing counter | 6522bfc8-6abf1a160a-16a83e | Roughly time-ordered; generated client-side without central coordination |
Nano ID | 21 characters by default, drawn from A-Za-z0-9_- | TZOb75IqNux-DuSLisVDp | URL-friendly; generated client-side; not time-ordered |
Sequence | Auto-incrementing integer issued by the database | 88 | Compact and strictly ordered; requires a single point of coordination |
Sonyflake | 63 bits: 39-bit timestamp (10 ms units), 8-bit sequence, 16-bit machine ID | 0-000011010110000100110111000001110001101-00000000-0000000000000001 | Time-sortable; supports more machines than Snowflake at a lower per-machine generation rate |
Twitter Snowflake | 64 bits: 1 sign bit, 41-bit timestamp (ms), 5-bit datacenter ID, 5-bit worker ID, 12-bit sequence | 0-00101111011111011010110100111101110001111-00000-00001-000000000000 | Time-sortable; up to 4,096 IDs per millisecond per worker |
Universally Unique Identifier (UUID) | 128 bits, rendered as 32 hexadecimal digits in 5 groups | d6e9ec10-65e2-11ee-97a0-3eb31bb9ccfe | Globally unique without coordination; common versions are time-based (v1) and random (v4) |
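As a rough illustration of the Snowflake layout above, the sketch below packs a millisecond timestamp, datacenter ID, worker ID, and per-millisecond sequence into one 64-bit integer. The custom epoch, class name, and exact behavior are illustrative assumptions, not any particular library's implementation.

```python
import time
import threading

# Illustrative custom epoch (ms); real deployments pick a fixed start date.
CUSTOM_EPOCH_MS = 1_600_000_000_000

class SnowflakeGenerator:
    """Minimal Snowflake-style generator: 41-bit timestamp, 5-bit datacenter,
    5-bit worker, 12-bit per-millisecond sequence."""

    def __init__(self, datacenter_id: int, worker_id: int):
        assert 0 <= datacenter_id < 32 and 0 <= worker_id < 32
        self.datacenter_id = datacenter_id
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ts = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            ts = int(time.time() * 1000) - CUSTOM_EPOCH_MS
            if ts == self.last_ts:
                self.sequence = (self.sequence + 1) & 0xFFF   # 12-bit sequence
                if self.sequence == 0:                        # exhausted: wait for next ms
                    while ts <= self.last_ts:
                        ts = int(time.time() * 1000) - CUSTOM_EPOCH_MS
            else:
                self.sequence = 0
            self.last_ts = ts
            # Pack the fields: timestamp | datacenter | worker | sequence.
            return (ts << 22) | (self.datacenter_id << 17) | (self.worker_id << 12) | self.sequence

gen = SnowflakeGenerator(datacenter_id=1, worker_id=1)
print(gen.next_id())   # 64-bit, roughly time-sortable ID
```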
Data Synchronization & Distribution Mechanisms
- Sharding/Partitioning
- Replication
Database sharding splits a large database across machines for better handling of massive datasets.
Benefits
- Improve response time
- Avoid total service outage
- Scale efficiently
Type | Definition | Use Cases |
---|---|---|
Consistent Hashing | Distributes data across a dynamic number of partitions using a hash function | Distributed databases, Content Delivery Networks (CDNs) |
Directory Based Sharding | A central directory maps data to specific shards based on predefined rules | Strong consistency and moderate scalability |
Geo Sharding | Divides data based on geographic regions to localize data access | Applications requiring regional data localization or geo-distributed databases |
Horizontal Partitioning (Sharding) | Data is partitioned across multiple databases or shards based on a criterion such as user ID or timestamp | High data volume and scalability requirements |
Key-Based Sharding | Data is distributed across shards based on a predefined key | Predictable access patterns and high scalability requirements |
Range-Based Sharding | Divides data into ranges (numeric, alphabetical) and assigns each range to a shard | Range-based queries and moderate scalability requirements; time-series or other sequential data |
Vertical Partitioning | Segregates data by attributes or columns | Specific data access patterns and less dynamic schemas |
Materialized Views | Precomputed views stored for faster query performance | Read-heavy workloads with expensive or repeated queries |
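A minimal sketch of consistent hashing as described above: nodes and keys are hashed onto the same ring, and each key is routed to the next node clockwise. The shard names, virtual-node count, and use of MD5 are illustrative choices.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Keys map to the first node clockwise on the hash ring; adding or
    removing a node only remaps the keys adjacent to it."""

    def __init__(self, nodes, vnodes: int = 100):
        self.ring = []                       # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):          # virtual nodes smooth the distribution
                bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)   # wrap around
        return self.ring[idx][1]

ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
print(ring.node_for("user:42"))   # the same key always routes to the same shard
```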
- Types
- Methods
- CDC
- Strategies
- Conflict Resolutions
Database replication is the process of duplicating data from one database to others, ensuring that multiple copies of the same data are available across different locations or systems. This redundancy enhances data availability, fault tolerance, and scalability.
Benefits
- Durability
  - Replication enhances durability, preventing catastrophic data loss
  - It ensures data preservation across multiple servers
  - Replication, alongside backups, minimizes data loss windows and downtime
- Availability
  - Replication boosts system availability and resilience
  - It enables seamless failover to standby servers
  - Without replication, server outages could cause prolonged downtime
- Increasing Throughput
  - Replication spreads load across nodes, boosting throughput
  - Additional replicas can be added for further scalability
  - Proper management avoids replication overhead bottlenecks
- Reducing Latency
  - Replication brings data closer to users, reducing latency
  - Shorter network distance leads to faster response times
  - Multi-region replication improves user experience and productivity
Aspect | Full Table Replication | Key-based Incremental Replication | Log-based Incremental Replication | Trigger-based Replication | Snapshot Replication |
---|---|---|---|---|---|
Overview | Replicates entire tables | Replicates only changed rows based on key values | Replicates changes based on transaction logs | Replicates changes based on triggers | Replicates a point-in-time copy of data |
Data Volume / Network Bandwidth Usage | High | Moderate | Low | Moderate | High |
Use Cases | Data Warehousing, Reporting | Synchronizing specific datasets between databases | Replicating changes from a primary to secondary database | Replicating changes between databases with complex business logic | Creating backups for disaster recovery |
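A minimal sketch of key-based incremental replication: only rows whose monotonically increasing replication key (here `updated_at`) is newer than the last replicated value are copied. The `orders` table, its columns, and the use of SQLite are illustrative assumptions.

```python
import sqlite3

# Two in-memory databases stand in for the source and target systems.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for db in (source, target):
    db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, payload TEXT, updated_at REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [("o-1", "new", 1.0), ("o-2", "paid", 2.0)])

def replicate_increment(last_replicated_at: float) -> float:
    """Copy rows changed since the watermark and return the new watermark."""
    rows = source.execute(
        "SELECT id, payload, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (last_replicated_at,)).fetchall()
    for row_id, payload, updated_at in rows:
        target.execute("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)",
                       (row_id, payload, updated_at))
        last_replicated_at = updated_at
    target.commit()
    return last_replicated_at          # persist this watermark for the next run

watermark = replicate_increment(0.0)        # first run copies both rows
watermark = replicate_increment(watermark)  # second run finds nothing new
```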
Method | Definition |
---|---|
Bi-Directional | Data flows in both directions between source and target databases, allowing updates on either side |
Broadcast | Data from a single source is replicated to multiple targets simultaneously |
Cascading | Replication is chained in a cascade: changes propagate sequentially through multiple tiers of databases |
Consolidation | Data from multiple sources is consolidated into a single target database |
Peer-to-Peer | All databases are peers and can act as both source and target; data can flow between any pair of databases |
Unidirectional | Data flows in one direction, from the source to the target databases |
Change Data Capture (CDC): Captures changes in real-time as they occur at the source database
Aspect | Transactional CDC | Batch-Optimized CDC | Data Warehouse Ingest-Merge | Message-Encoded CDC |
---|---|---|---|---|
Definition | Captures changes in real-time as they occur at the source database | Captures changes in bulk at specific intervals, rather than in real-time | Ingests and merges data from multiple sources into a data warehouse | Encodes changes into messages for asynchronous processing and consumption |
Performance | Real-time, minimal latency for data replication | High throughput, reduced impact on source systems due to batch processing | Typically batch-oriented, suitable for large-scale data movement | Depends on message broker performance; can be asynchronous, may introduce latency |
Data Consistency | Ensures consistency between source and target systems in near real-time | Data consistency may lag behind real-time due to batch processing | May require additional checks to maintain consistency during merge process | Consistency depends on message delivery guarantees and processing logic |
Use Cases | Real-time data synchronization (financial transactions, inventory management) | Daily reporting, data warehousing | Commonly used for data warehousing, analytics, and reporting purposes | Useful for event-driven architectures, microservices, and distributed systems |
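A minimal sketch of consuming message-encoded CDC events: each event describes an insert, update, or delete, and applying it as an upsert or delete keeps the handler idempotent. The event shape (`op`, `key`, `after`) is an assumed, Debezium-like format rather than any specific tool's contract.

```python
import json

target = {}   # key -> row; stands in for the target table

def apply_change_event(raw_event: str) -> None:
    """Apply one CDC message to the target store idempotently."""
    event = json.loads(raw_event)
    key = event["key"]
    if event["op"] in ("insert", "update"):
        target[key] = event["after"]      # upsert: replaying the event is harmless
    elif event["op"] == "delete":
        target.pop(key, None)

apply_change_event('{"op": "insert", "key": "42", "after": {"status": "NEW"}}')
apply_change_event('{"op": "update", "key": "42", "after": {"status": "PAID"}}')
print(target)   # {'42': {'status': 'PAID'}}
```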
Replication Strategy | Description |
---|---|
Leader-Follower Replication (Source-Replica / Master-Slave / Primary-Secondary) | A single primary instance accepts write operations, while one or more replicas copy data from the leader; replicas typically handle read operations |
Active/Active Replication | Multiple database instances accept both read and write operations simultaneously; each instance can serve read and write requests independently |
Multi-Leader Replication (Master-Master / Primary-Primary) | Multiple database instances accept write operations independently, and changes are asynchronously replicated between them |
Leaderless Replication | No designated leader; each node in the cluster can accept both read and write operations, and data is replicated across all nodes |
Quorum Writes and Reads | A write or read is considered successful only once a certain number (quorum) of nodes agrees on it; often used in distributed databases to balance consistency and availability |
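A minimal sketch of quorum writes and reads over N in-memory replicas, assuming W + R > N so that read and write quorums overlap. Real systems issue these requests over the network and also handle replica failures and read repair; the values of N, W, and R here are illustrative.

```python
N, W, R = 3, 2, 2               # W + R > N: read and write quorums overlap
replicas = [dict() for _ in range(N)]   # each dict stands in for one replica

def quorum_write(key, value, version) -> bool:
    """Write to replicas until at least W acknowledge the new version."""
    acks = 0
    for replica in replicas:
        replica[key] = (version, value)
        acks += 1
        if acks >= W:
            return True
    return False

def quorum_read(key):
    """Query R replicas and return the value with the highest version."""
    responses = [replica[key] for replica in replicas[:R] if key in replica]
    if not responses:
        return None
    return max(responses, key=lambda r: r[0])[1]   # newest version wins

quorum_write("user:42", {"name": "Ada"}, version=1)
print(quorum_read("user:42"))   # {'name': 'Ada'}
```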
Criteria | Last Write Wins (LWW) | Conflict-free Replicated Data Types (CRDTs) | Operational Transformation | Application-specific Resolution |
---|---|---|---|---|
Principle | The latest update overwrites previous ones | Concurrent updates merge seamlessly | Transformations are applied to resolve conflicts | Custom logic defines resolution rules |
Concurrency Control | Often based on timestamps | Built-in, ensures eventual consistency | Complex, requires careful design | Depends on implementation approach |
Conflict Detection | Timestamps or version vectors | Built-in mechanisms handle concurrent updates | Requires tracking dependencies | Custom logic or metadata tracking |
Use Cases | Simple applications with low concurrency where data loss is acceptable | Collaborative editing systems, real-time communication systems | Collaborative editing, version control systems, distributed databases | Application-specific needs, such as financial transactions |
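As an example of the CRDT column above, here is a minimal grow-only counter (G-Counter): each node increments only its own slot, and merging takes the element-wise maximum, so concurrent updates converge without coordination. The node names are illustrative.

```python
class GCounter:
    """Grow-only counter CRDT: per-node counts merged by element-wise max."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {node_id: 0}

    def increment(self, amount: int = 1) -> None:
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())

a, b = GCounter("node-a"), GCounter("node-b")
a.increment()
b.increment(2)
a.merge(b)
b.merge(a)
print(a.value(), b.value())   # both converge to 3
```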
Communication Patterns
Distributed databases offer scalability and fault tolerance, but introduce communication challenges.
Core Concepts
- Data Distribution: Understanding how data is sharded or replicated across nodes is crucial for choosing communication patterns
- Synchronous vs. Asynchronous: Synchronous communication waits for a response before proceeding, while asynchronous allows independent execution. Selection depends on real-time response needs and fault tolerance requirements
- Consistency Models: Different consistency models (eventual consistency, strong consistency) define how quickly updates propagate across nodes, impacting communication frequency
- Choreography
- CQRS
- Distributed Query Processing
- Event Sourcing
- Modular
- Orchestration
- Outbox
- Parallel Pipelines
- Phase Commit
Choreography: decentralized coordination between services; each service reacts to events emitted by other services, with no central controller.
CQRS (Command Query Responsibility Segregation): separates read operations (queries) from write operations (commands), often with distinct models or stores for each side.
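A minimal CQRS sketch, assuming an in-memory write store and a denormalized read model projected from it; in practice the projection is often asynchronous and the two sides may use different databases. All names are illustrative.

```python
write_store = {}                        # order_id -> order row (write side)
read_model = {"orders_by_status": {}}   # denormalized view optimized for queries

def handle_place_order(order_id: str, status: str) -> None:
    """Command handler: mutate the write model, then project into the read model."""
    write_store[order_id] = {"status": status}
    read_model["orders_by_status"].setdefault(status, []).append(order_id)

def query_orders_by_status(status: str):
    """Query handler: reads only the read model."""
    return read_model["orders_by_status"].get(status, [])

handle_place_order("o-1", "NEW")
print(query_orders_by_status("NEW"))   # ['o-1']
```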
Distributed Query Processing: executes a query across multiple distributed nodes and combines the partial results.
Event Sourcing: stores each data change as an event in append-only storage instead of storing only the current state; replaying the events rebuilds any state or snapshot.
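A minimal event-sourcing sketch: state is never stored directly, only an append-only log of events, and replaying the log rebuilds the current state. The event names and account example are illustrative.

```python
event_log = []   # append-only list of (event_type, payload)

def append_event(event_type: str, payload: dict) -> None:
    event_log.append((event_type, payload))

def rebuild_balance(account_id: str) -> int:
    """Replay the full event log to derive the current balance."""
    balance = 0
    for event_type, payload in event_log:
        if payload["account_id"] != account_id:
            continue
        if event_type == "Deposited":
            balance += payload["amount"]
        elif event_type == "Withdrawn":
            balance -= payload["amount"]
    return balance

append_event("Deposited", {"account_id": "acc-1", "amount": 100})
append_event("Withdrawn", {"account_id": "acc-1", "amount": 30})
print(rebuild_balance("acc-1"))   # 70
```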
Modular: data and transactions are divided into modules, each handled independently.
Orchestration: a central orchestrator sequences the steps of a distributed transaction into a saga, invoking compensating actions when a step fails.
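A minimal saga-orchestration sketch: the orchestrator runs each local step in order and, when one fails, runs the compensations of the steps that already succeeded in reverse order. The step names are illustrative.

```python
def run_saga(steps) -> bool:
    """Run (action, compensation) pairs; on failure, undo completed steps."""
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            for undo in reversed(completed):   # roll back what already ran
                undo()
            return False
    return True

ok = run_saga([
    (lambda: print("reserve inventory"), lambda: print("release inventory")),
    (lambda: print("charge payment"),    lambda: print("refund payment")),
    (lambda: print("create shipment"),   lambda: print("cancel shipment")),
])
print("saga committed" if ok else "saga rolled back")
```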
Outbox: uses an outbox table, written in the same local transaction as the data change, to guarantee message delivery.
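A minimal outbox sketch using SQLite: the business row and the outgoing message are inserted in one local transaction, and a separate relay later publishes any unpublished outbox rows. The table names and the `publish` callback are illustrative assumptions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         payload TEXT, published INTEGER DEFAULT 0);
""")

def place_order(order_id: str) -> None:
    # Single atomic transaction: either both rows exist or neither does.
    with db:
        db.execute("INSERT INTO orders VALUES (?, 'NEW')", (order_id,))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (f'{{"order_id": "{order_id}"}}',))

def relay_outbox(publish) -> None:
    """Publish pending outbox rows, then mark them as delivered."""
    rows = db.execute("SELECT id, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, payload in rows:
        publish(payload)                       # e.g. send to a message broker
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()

place_order("o-1")
relay_outbox(publish=print)
```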
Parallel Pipelines: divides data processing into pipelines that run in parallel.
Aspect | Two-Phase Commit (2PC) | Three-Phase Commit (3PC) |
---|---|---|
Definition | Ensures all participants commit or abort together in 2 phases | Extension of 2PC adding a "prepare to abort" phase for increased fault tolerance |
Steps | 1. Prepare (voting) phase 2. Commit or abort phase | 1. CanCommit phase 2. PreCommit phase 3. DoCommit phase |
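A minimal two-phase commit sketch: the coordinator collects prepare votes from every participant and commits only if all vote yes, otherwise aborts. Participants are plain in-process objects here; a real protocol also has to persist decisions and handle timeouts and coordinator failure.

```python
class Participant:
    def __init__(self, name: str, can_commit: bool = True):
        self.name, self.can_commit = name, can_commit

    def prepare(self) -> bool:        # phase 1: vote yes/no
        return self.can_commit

    def commit(self) -> None:
        print(f"{self.name}: commit")

    def abort(self) -> None:
        print(f"{self.name}: abort")

def two_phase_commit(participants) -> bool:
    votes = [p.prepare() for p in participants]   # phase 1: prepare/voting
    if all(votes):
        for p in participants:
            p.commit()                            # phase 2: commit everywhere
        return True
    for p in participants:
        p.abort()                                 # phase 2: abort everywhere
    return False

two_phase_commit([Participant("inventory"), Participant("payments")])
```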
Cache
Traditional cache stores frequently accessed data in a faster-to-access location (usually RAM) compared to the primary data source (typically a database).
Distributed cache extends this concept by spreading the cached data across multiple machines (nodes) within a network.
Strategy | Definition |
---|---|
Read Cache-Aside (Lazy-Loading) | The application checks the cache first; if the data is not there, it fetches it from the data source and then caches it for subsequent accesses |
Read-Through | The cache itself loads missing data from the main storage on a miss, so the application always reads via the cache and the cache reflects the most up-to-date information available |
Write-Around | Data is written directly to the main storage, bypassing the cache; subsequent reads of that data can still be cached for faster access |
Write-Back | Data is written to the cache first and flushed to the main storage later, reducing the number of storage writes by batching multiple updates before writing back |
Write-Through | Data is written simultaneously to both the cache and the underlying storage, ensuring consistency between the two at all times |
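A minimal cache-aside (lazy-loading) sketch: the application checks the cache first and only falls back to the data source on a miss, then populates the cache with a TTL. The in-process dict stands in for a real cache such as Redis; the TTL and loader are illustrative.

```python
import time

cache = {}          # key -> (value, expires_at)
TTL_SECONDS = 60

def load_from_db(key: str) -> str:
    """Placeholder for the real data source (database, API, ...)."""
    return f"row-for-{key}"

def get(key: str) -> str:
    entry = cache.get(key)
    if entry and entry[1] > time.time():
        return entry[0]                              # cache hit
    value = load_from_db(key)                        # cache miss: go to the database
    cache[key] = (value, time.time() + TTL_SECONDS)  # populate for later reads
    return value

print(get("user:42"))   # miss: loads from the data source and caches it
print(get("user:42"))   # hit: served from the cache
```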