
Data Management

CAP Theorem

  • Consistency (C): Every node in the system sees the same data at the same time
  • Availability (A): Every request receives a response indicating success or failure, even if it may not reflect the most recent write
  • Partition Tolerance (P): The system continues to operate despite network partitions or communication failures
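
During a network partition, a node must choose between consistency and availability. The sketch below is a toy model (the `Node` class and its behavior are illustrative, not a real database): a CP node refuses to answer when it cannot confirm it holds the latest data, while an AP node answers with whatever it has locally.

```python
# Toy model of the C-vs-A choice during a partition (illustrative only).

class Node:
    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP"
        self.data = {"x": 1}      # last locally known value
        self.partitioned = False  # can this node reach its peers?

    def read(self, key):
        if self.partitioned and self.mode == "CP":
            # CP: refuse to answer rather than risk returning stale data
            raise RuntimeError("unavailable: cannot confirm consistency")
        # AP: always respond, possibly with a stale value
        return self.data.get(key)

node = Node("AP")
node.partitioned = True
print(node.read("x"))  # AP node still answers, even while partitioned
```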

Distributed Unique Identifiers

Each identifier type below is described by its structure, an example, its key features, and typical use cases.

MongoDB ObjectID

  • Structure:
    • Timestamp in seconds (4 bytes)
    • Process ID (5 bytes)
    • Incrementing counter (3 bytes)
  • Example: 6522bfc8-6abf1a160a-16a83e
  • Features:
    • Total: 96 bits (16.7M IDs/sec)
    • Hexadecimal
    • Unix timestamp
    • Ordered
  • Use cases: Internal document identifiers in MongoDB

Nano ID

  • Structure: Characters drawn from the alphabet A-Za-z0-9_-
  • Example: TZOb75IqNux-DuSLisVDp
  • Features:
    • Total: 126 bits
    • Customizable
    • URL friendly
    • Random
  • Use cases: Short, URL-friendly identifiers

Sequence

  • Structure: Incrementing integer
  • Example: 88
  • Features:
    • Incremental
    • Decimal
    • Human readable
  • Use cases:
    • Simple counters
    • Primary keys

Sonyflake

  • Structure:
    • Sign bit, unused (1 bit)
    • Timestamp in units of 10 ms (39 bits)
    • Counter (8 bits)
    • Machine ID (16 bits)
  • Example: 0-000011010110000100110111000001110001101-00000000-0000000000000001
  • Features:
    • Total: 64 bits (256 IDs/10 ms)
    • ~174 years of timestamp range
    • Decimal
    • Distributed
  • Use cases: Unique identifiers across distributed systems

Twitter Snowflake

  • Structure:
    • Sign bit, unused (1 bit)
    • Timestamp in milliseconds (41 bits)
    • Data center ID (5 bits)
    • Machine ID (5 bits)
    • Incrementing counter (12 bits)
  • Example: 0-00101111011111011010110100111101110001111-00000-00001-000000000000
  • Features:
    • Total: 64 bits (4096 IDs/ms)
    • ~70 years of timestamp range
    • Decimal
    • Distributed
  • Use cases: Unique identifiers across distributed systems

Universally Unique Identifier (UUID)

  • Structure:
    • time_low (4 bytes)
    • time_mid (2 bytes)
    • version & time_hi (2 bytes)
    • clock_seq & reserved (2 bytes)
    • node / MAC address (6 bytes)
  • Example: d6e9ec10-65e2-11ee-97a0-3eb31bb9ccfe
  • Features:
    • Total: 128 bits
    • Timestamp (depends on version)
    • Low collision probability
    • Versions:
      • v1: current time + MAC address
      • v2: v1 + extra info (such as user/group ID)
      • v3: MD5 hash of namespace + name
      • v4: purely random
      • v5: SHA-1 hash of namespace + name
  • Use cases:
    • Suitable for a wide range of systems and for interoperability
    • May not be optimized for distributed systems
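
As a concrete illustration of the Snowflake bit layout, here is a minimal generator sketch in Python. The class name and the custom epoch are choices of this sketch (the epoch is configurable per deployment), not a reference implementation.

```python
import threading
import time

EPOCH = 1288834974657  # custom epoch in ms (here: Nov 2010); deployments pick their own

class Snowflake:
    """Sketch of a 64-bit Snowflake-style ID generator:
    41-bit ms timestamp | 5-bit data center | 5-bit machine | 12-bit counter."""

    def __init__(self, datacenter_id, machine_id):
        assert 0 <= datacenter_id < 32 and 0 <= machine_id < 32
        self.datacenter_id = datacenter_id
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit counter
                if self.sequence == 0:
                    # counter exhausted within this millisecond: wait for the next one
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - EPOCH) << 22) | (self.datacenter_id << 17) \
                   | (self.machine_id << 12) | self.sequence

gen = Snowflake(datacenter_id=1, machine_id=1)
ids = [gen.next_id() for _ in range(5)]  # strictly increasing 64-bit IDs
```

Because the timestamp occupies the high bits, IDs sort by creation time, which is why Snowflake-style IDs work well as distributed primary keys.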

Data Synchronization & Distribution Mechanisms

Database sharding splits a large database into smaller pieces (shards) spread across multiple machines, so massive datasets can be stored and queried efficiently.

Benefits

  • Improve response time
  • Avoid total service outage
  • Scale efficiently
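
A minimal sketch of how a router might pick a shard for a key (the shard names and the choice of hash are illustrative; a stable hash guarantees the same key always lands on the same shard):

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]  # illustrative names

def shard_for(key: str) -> str:
    # Stable hash so the same key always routes to the same shard
    digest = hashlib.md5(key.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(SHARDS)
    return SHARDS[index]

print(shard_for("user:42"))  # always the same shard for this key
```

Note that this simple modulo scheme reshuffles most keys when the shard count changes; consistent hashing is the usual remedy.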

Communication Patterns

Distributed databases offer scalability and fault tolerance, but introduce communication challenges.

Core Concepts

  • Data Distribution: Understanding how data is sharded or replicated across nodes is crucial for choosing communication patterns
  • Synchronous vs. Asynchronous: Synchronous communication waits for a response before proceeding, while asynchronous allows independent execution. Selection depends on real-time response needs and fault tolerance requirements
  • Consistency Models: Different consistency models (eventual consistency, strong consistency) define how quickly updates propagate across nodes, impacting communication frequency
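
The synchronous/asynchronous distinction can be sketched with simulated replicas (plain dicts stand in for nodes and local function calls stand in for the network, so this only illustrates the pattern):

```python
import queue

replicas = [{}, {}, {}]   # replica 0 acts as the primary
pending = queue.Queue()   # replication backlog for the async path

def write_sync(key, value):
    # Synchronous: the caller waits until every replica has applied the write
    for r in replicas:
        r[key] = value
    return "committed"

def write_async(key, value):
    # Asynchronous: acknowledge once the primary applies the write;
    # followers catch up later, so reads there may be stale for a while
    replicas[0][key] = value
    pending.put((key, value))
    return "acknowledged"

def replicate_pending():
    # Background step (eventual consistency): drain the backlog to followers
    while not pending.empty():
        key, value = pending.get()
        for r in replicas[1:]:
            r[key] = value
```

The synchronous path gives strong consistency at the cost of waiting on every node; the asynchronous path responds faster but tolerates a window of staleness.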

Cache

Traditional cache stores frequently accessed data in a faster-to-access location (usually RAM) compared to the primary data source (typically a database).

Distributed cache extends this concept by spreading the cached data across multiple machines (nodes) within a network.
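
A toy sketch of the distributed variant, combining key-to-node hashing with the cache-aside pattern (the node names and the stand-in "database" dict are invented for illustration):

```python
import hashlib

NODES = {"cache-a": {}, "cache-b": {}, "cache-c": {}}       # dicts stand in for machines
DATABASE = {"user:1": "Ada", "user:2": "Grace"}             # stand-in primary store

def node_for(key: str) -> dict:
    # Hash the key to pick which cache node holds it
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    name = sorted(NODES)[h % len(NODES)]
    return NODES[name]

def get(key: str):
    node = node_for(key)
    if key in node:               # cache hit: served from one node's RAM
        return node[key]
    value = DATABASE.get(key)     # cache miss: fall back to the primary store
    node[key] = value             # populate the cache (cache-aside)
    return value
```

Spreading keys across nodes lets the cache grow beyond a single machine's RAM, at the price of the routing step and of invalidation becoming a distributed problem.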