Fundamentals
- DE Lifecycle
- Data Access
- Data Storage
- Architecture Patterns
- Testing
- Infrastructure as Code
- Security
- Data Mesh
The data engineering (DE) lifecycle describes the stages involved in taking raw data from its origin to a usable format for analytics, reporting, and machine learning. The typical stages are Generation, where data is created; Storage, where it's held; Ingestion, where it's brought into a system; Transformation, where it's cleaned and processed; and Serving, where it's made available to users and applications. This structured process ensures the consistent delivery of high-quality data products and helps data engineers build reliable data pipelines.
Stages
Stage | Description | Key Activities |
---|---|---|
Generation | Data originates from source systems like databases, apps, IoT devices, APIs, files, and web services. Data engineers must understand their formats, generation velocity, and integration protocols | Understanding data formats, generation velocity, integration protocols, schema analysis, connectivity, and business logic |
Evaluating Source Systems | Data engineers must understand how source systems generate data, including their quirks, behaviors, and limitations, to design effective ingestion pipelines | Managing schemas, handling inconsistencies, and ensuring reliable data extraction |
Ingestion | Ingestion refers to the process of moving data from generating sources into a centralized processing system (data lake, warehouse, stream processor), either in batch or real-time (streaming) modes. Source systems and ingestion are critical chokepoints - a single data hiccup can disrupt the entire pipeline, breaking downstream processes and creating ripple effects | Selecting ingestion patterns (batch vs. streaming), validating and monitoring pipeline flows, handling schema drift, initial data quality checks |
Data Storage | Data at every stage - raw, cleaned, modeled - may be persistently stored for reliability, auditability, and downstream processing. Storage architectures include data lakes (raw staging), data warehouses (structured, analytics-ready), and hybrid lakehouse solutions | Choosing storage types, optimizing for scalability and cost, enforcing security and backup protocols, supporting data versioning and lineage |
Transformation | Converts raw ingested data into cleaned, standardized, enriched formats suitable for analytics and ML. Transformations can be orchestrated via ETL/ELT tools, SQL scripts, or data workflow managers. The Medallion Architecture often structures this into Bronze (raw), Silver (cleaned), and Gold (aggregated) layers | Cleansing, data normalization and format conversion, business logic and enrichment, aggregations, modeling, statistical summarization, validation and data quality testing |
Serving | Transformed data must be delivered to stakeholders or applications for actual use. This can involve feeding BI dashboards, analytics platforms, ML models, or external systems via APIs or reverse ETL for operational analytics | Providing data to BI dashboards, analytics platforms, or reporting tools; feeding machine learning models; supplying external systems via APIs or reverse ETL; ensuring reliability, freshness, and security for all consumers |
Undercurrents | Several critical themes run through all stages of the data engineering lifecycle | Security, data management, DataOps, data architecture, orchestration, and software engineering best practices |
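A minimal, self-contained sketch may help connect these stages in code; the source events, table names, and aggregate below are illustrative assumptions, using only the Python standard library and an in-memory SQLite database.

```python
import json
import sqlite3

def generate():
    # Generation: pretend a source system emits order events as JSON strings
    return ['{"order_id": 1, "amount": 120.0}', '{"order_id": 2, "amount": 75.5}']

def ingest_and_store(conn, events):
    # Ingestion + storage: land raw events in a "bronze" table as-is
    conn.execute("CREATE TABLE IF NOT EXISTS raw_orders (payload TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?)", [(e,) for e in events])

def transform(conn):
    # Transformation: parse and clean raw payloads into a typed "silver" table
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL)")
    for (payload,) in conn.execute("SELECT payload FROM raw_orders"):
        row = json.loads(payload)
        conn.execute("INSERT INTO orders VALUES (?, ?)", (row["order_id"], row["amount"]))

def serve(conn):
    # Serving: expose an aggregate that a dashboard or API could consume
    return conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()

conn = sqlite3.connect(":memory:")
ingest_and_store(conn, generate())
transform(conn)
print(serve(conn))  # (2, 195.5)
```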
- Access Frequency
- Data Types
Aspect | Hot Data | Lukewarm (Warm) Data | Cold Data |
---|---|---|---|
Definition | Frequently accessed, high-value, real-time or near-real-time data | Moderately accessed, regularly needed but not instant | Infrequently accessed, usually retained for archival purposes |
Access Frequency | Constant, immediate access | Scheduled access, hours to days apart | Rare access, weeks to years apart
Access Latency | Sub-second or millisecond | Seconds to minutes | Minutes to hours |
Storage Media | RAM, in-memory database, SSDs, high-performance NAS | Mid-tier SSDs, high-speed HDDs, cloud object storage | Low-cost HDDs, archival cloud storage |
Retention Policy | Short-term, transactional | Weeks to months, operational | Long-term, years (or indefinitely for compliance) |
Data Volume | Typically smaller, volume managed for speed | Medium | Very large, bulk data |
Data Value | Immediate, high business impact | Useful, moderate business impact | Historical, regulatory, analytical |
Security Requirements | Highest, critical for business operations | Moderate, standard access protection | Encryption, integrity, regulatory compliance |
Scalability | Vertical scaling for speed | Horizontal scaling, cost-performance balance | Massive horizontal scaling, low access needs |
Challenges | High cost, data lifecycle, scalability | Balancing cost and access | Retrieval speed, data integrity, long-term maintenance |
Use Cases | Fraud detection, real-time stock trading, network monitoring | Monthly business reporting, operational data | Legal audits, disaster recovery, regulatory reporting |
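As a rough illustration of how such tiers might be applied programmatically, the sketch below routes records by access recency; the thresholds and tier names are assumptions, not standards.

```python
from datetime import datetime, timedelta

# Illustrative tiering rule: pick a storage tier based on how recently the
# data was accessed. Thresholds are assumptions for demonstration only.
def choose_tier(last_accessed, now=None):
    age = (now or datetime.utcnow()) - last_accessed
    if age <= timedelta(days=1):
        return "hot"    # in-memory store or SSD-backed database
    if age <= timedelta(days=90):
        return "warm"   # mid-tier SSD or standard cloud object storage
    return "cold"       # low-cost archival storage

print(choose_tier(datetime.utcnow() - timedelta(days=200)))  # cold
```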
Data Type | Definition | Characteristics | Examples | Use Cases | Notes |
---|---|---|---|---|---|
Raw Data | Unprocessed, unfiltered data collected directly from sources without any manipulation or analysis | Original, uncleaned, may contain errors or noise; very granular; inflexible until processed | Sensor readings, transaction logs, survey responses in original form, unedited images or video files | Foundation for all analysis; requires cleaning, transformation, and validation | Also called primary data or source data; critical for unbiased analysis |
Quantitative Data | Numerical data representing measurable quantities that can be counted or measured and subjected to math operations | Numeric, measurable, often continuous or discrete; suitable for statistical and mathematical modeling | Heights, weights, sales numbers, temperatures, survey ratings (on scale) | Statistical analysis, predictive modeling, machine learning, numerical summarization | Often stored as structured data but can appear semi-structured or raw |
Qualitative Data | Non-numerical, descriptive data representing categories, concepts, or subjective qualities | Categorical or textual data; may be structured (coded categories) or unstructured (open text) | Interview transcripts, social media posts, survey open responses, observations | Content analysis, thematic coding, sentiment analysis, NLP applications | Includes both structured categorical data and unstructured textual/media data |
Structured Data | Data organized into predefined models or schemas, typically tabular with rows and columns | Highly organized; easy to store, search, and analyze; conforms to relational databases or spreadsheets | Databases with customer info, spreadsheets with sales data, financial records | Used in relational databases, data warehouses, and analytical reporting | Often quantitative and categorical (qualitative) data; easiest to handle computationally |
Unstructured Data | Data without a predefined schema or organization, often qualitative and complex | No fixed format; text-heavy or multimedia; requires advanced techniques for parsing and analysis | Emails, videos, audio recordings, social media feeds, documents, images | Text mining, image/video analysis, sentiment analysis, AI-driven extraction | Increasingly important with big data, but challenging to manage and analyze |
Semi-structured Data | Data that does not fit rigid schemas but contains tags or markers to separate elements for easier processing | Hybrid of structured and unstructured; contains metadata or tags alongside free-form content | JSON, XML, CSV files, HTML documents, tagged multimedia files | Easier to process than unstructured; used for web data, APIs, and metadata extraction | Provides flexibility of unstructured with some model-driven processing advantages |
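A short sketch of what these shapes look like in practice, using only the standard library; the sample payloads are invented for illustration.

```python
import csv
import io
import json

# Structured data: tabular, schema implied by the header row.
csv_text = "customer_id,amount\n1,120.0\n2,75.5\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["amount"])            # "120.0" (still a string; types must be enforced)

# Semi-structured data: tagged fields, nesting and optional elements allowed.
json_text = '{"customer_id": 1, "tags": ["vip"], "address": {"city": "Oslo"}}'
record = json.loads(json_text)
print(record["address"]["city"])    # "Oslo"

# Unstructured data (free text, images, audio) has no such markers and needs
# NLP or other specialized techniques rather than simple parsing.
```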
- Data Collection
- Data Modeling
- Slowly Changing Dimension
- Schemas
Bounded vs. Unbounded Data
Aspect | Bounded Data | Unbounded Data |
---|---|---|
Definition | Finite data set with a known start and end point | Infinite or continuously growing data with no predefined end |
Data Characteristics | Fixed size, complete, and unchanging once fully collected | Potentially infinite, dynamic, and continuously generated |
Examples | Historical sales data, completed dataset for a specific period (e.g., last quarter sales) | Streaming logs, real-time sensor data, social media feeds |
Processing Model | Batch processing - data processed as a whole after collection | Stream processing - data processed incrementally as it arrives |
Data Ordering | Typically sequential and complete, allowing deterministic processing | May be out-of-order, delayed, or non-sequential due to latency and distributed sources |
Timing | Processed after data collection, often with latency (days, hours) | Processed in real-time or near real-time with minimal delay |
System Architectures | Traditional Data Warehouses, ETL pipelines, batch-oriented systems | Streaming platforms like Apache Kafka, Apache Flink, Apache Beam, Spark Streaming |
Storage Requirements | Larger upfront storage for the entire dataset | Continuous storage needs with potential for state management or windowing to handle data volume |
Computation Model | Deterministic and re-runnable computations on fixed data sets | Incremental, stateful computations with approximate processing or windowing to manage infinite data |
System Complexity | Lower complexity in handling data consistency and completeness | Higher complexity to handle out-of-order events, late data, and exactly-once processing guarantees |
Error Handling | Errors can be corrected in batch runs before analysis | Needs continuous monitoring and corrective mechanisms to handle anomalies in a live stream |
Scalability Challenges | Scalability mainly in storage and batch job execution | Requires scalable infrastructure to handle continuous high-throughput data ingestion and processing |
Latency | Higher latency acceptable due to batch processing nature | Low latency required to provide timely insights or actions |
Architectural Patterns | ETL, Lambda Architecture (batch layer dominant) | Kappa Architecture, unified stream processing approach combining batch and stream |
Data Completeness | Complete view of the dataset after processing | Incomplete snapshots at any point, with evolving data as stream progresses |
Examples from Real World | Financial reports for closed fiscal year; archived web logs | Network packet captures; social media mentions; real-time transaction feeds |
Use Cases | Historical analytics, reporting, compliance auditing | Real-time analytics, alerting, fraud detection, IoT monitoring |
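The practical difference shows up in how aggregation is written. The sketch below contrasts a one-pass batch aggregate over a finite list with incremental, windowed results over a never-ending generator; the generator stands in for a real stream processor and makes no claims about any specific framework.

```python
from itertools import islice

# Bounded data: the whole dataset is known, so we aggregate it in one pass.
bounded = [12, 7, 30, 5]
print(sum(bounded) / len(bounded))  # deterministic, re-runnable

# Unbounded data: an endless generator stands in for a stream; we can only
# compute incremental results over windows as events arrive.
def sensor_stream():
    i = 0
    while True:            # never terminates - there is no "end" to wait for
        yield i % 10
        i += 1

def windowed_averages(stream, window_size=5):
    while True:
        window = list(islice(stream, window_size))
        yield sum(window) / window_size

averages = windowed_averages(sensor_stream())
print(next(averages), next(averages))  # rolling results, one window at a time
```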
Batch vs. Micro-Batch vs. Real-Time Processing
Aspect | Batch Processing | Micro-Batch Processing | Real-Time Processing |
---|---|---|---|
Definition | Collects and stores data over a period, processing all at once | Processes data in small batches at short, regular intervals | Processes data immediately as it arrives, near-instantaneous |
Frequency | Low frequency (e.g., hourly, daily, monthly) | Medium frequency (seconds to minutes intervals) | High frequency (sub-second to real-time continuous) |
Latency | High latency due to waiting for batch completion | Moderate latency, quicker than batch but not instantaneous | Very low latency, near-instant results |
Data Volume | Large volumes of accumulated data | Smaller chunks of data per batch | Continuous streams of individual events |
Complexity | Simple to implement and manage | Moderate complexity, combines batch and streaming elements | High complexity requiring advanced architecture and tooling |
Resource Utilization | Efficient resource use, runs during off-peak times | More frequent resource use than batch, less than streaming | Resource-intensive, requires horizontal scaling |
Processing Model | Triggered by schedule (time or data volume) | Triggered by time interval or data size threshold | Constant event-driven processing |
Stateful Processing Support | Yes, often requires stateful operations | Supports small state, similar to batch | Usually stateless or manages small state due to speed demand |
Data Freshness | Lower data freshness, data available after processing batch | Near real-time freshness, data is updated every few minutes | Highest data freshness, updates data as it arrives |
Fault Tolerance | Easier to handle failures with retries during next batch | Moderate fault tolerance | Requires robust fault tolerance mechanisms, checkpointing |
Typical Technologies | Apache Spark batch jobs, Hadoop MapReduce | Apache Spark Streaming, Fluentd, Logstash | Apache Kafka, Apache Flink, Apache Pulsar |
Cost | Lower operational cost due to infrequency | Moderate operational costs | Higher cost due to continuous processing and infrastructure |
Use Cases | End-of-day reports, billing, historical analytics | Incremental dashboard updates, near real-time user behavior | Fraud detection, monitoring, live analytics |
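A minimal sketch of the micro-batch pattern, assuming a simple in-process event source: events are buffered and flushed when either a size threshold or a time interval is reached. The flush target is just `print` here; a real pipeline would write to storage or a downstream topic.

```python
import time

# Hypothetical micro-batch loop: buffer incoming events and flush them when
# the buffer reaches a size threshold or an interval elapses.
def micro_batch(events, max_size=3, max_wait_s=1.0, flush=print):
    buffer, last_flush = [], time.monotonic()
    for event in events:
        buffer.append(event)
        if len(buffer) >= max_size or time.monotonic() - last_flush >= max_wait_s:
            flush(buffer)            # in practice: write to storage / downstream topic
            buffer, last_flush = [], time.monotonic()
    if buffer:
        flush(buffer)                # flush the final partial batch

micro_batch(range(7))  # -> [0, 1, 2], [3, 4, 5], [6]
```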
Pull vs. Push
Aspect | Pull | Push |
---|---|---|
Initiation | Data target pulls data from the source by requesting it explicitly, as needed or at scheduled intervals | Data source initiates and sends data to the target automatically when data is available |
Control of Flow | Target controls when and how much data to ingest (e.g., batch size, frequency) | Source controls the data flow; target has little control over rate or timing |
Real-time Capability | Can be near-real-time but often involves periodic polling. Typically higher latency than push | Immediate; real-time delivery as soon as new data is generated |
Scalability | Highly scalable; multiple consumers can independently fetch data at their own pace, easier replication, supports distributed scaling | May overwhelm consumers if the source produces more data than the targets can handle; hard to optimize for multiple consumers |
Replayability/Recovery | Easier to recover or reprocess missed data, as the consumer can retry requests or fetch from specific offsets | Replay is challenging - if a consumer misses data, it's hard to get the missing pieces back unless a buffer or queue is used |
Latency | May introduce latency depending on polling frequency and network delays | Low latency; pushes changes to consumers as soon as available |
Efficiency | May require more bandwidth for frequent polling; less efficient for frequent changes unless optimized | Efficient for sources with frequent changes or high update rates - useful for event-driven architectures |
Security | Target must connect to source, requiring bidirectional communication and strong security layers on the source | More secure for sources; the source doesn't listen for network connections, reducing attack surfaces |
Operational Complexity | Source must allow for external requests, potentially more firewall and authentication setup; simpler consumer-side error handling and scaling | Potentially less operational overhead if rate limiting and buffering are handled; but flow control and backpressure management are harder |
Data Ownership | Consumer chooses what, when, and how much to ingest, offering flexibility for diverse requirements | Source knows and manages its own data, ensuring accurate and robust delivery |
Implementation Details | Requires periodic scheduler or polling mechanism; easier integration with existing APIs and systems | Requires consumers to implement logic for handling unsolicited data, queuing, or buffering; flow and rate limiting complex |
Hybrid Patterns | Hybrid approaches leverage strengths of both, such as push for immediate updates and pull for detailed/batch data | Often combined - e.g., system pushes notifications but clients pull detailed data as needed |
Consistency Guarantees | Easier to achieve exactly-once or at-least-once semantics with systems like Kafka | Needs careful orchestration for strong consistency, especially in distributed setups |
Common Technologies | REST APIs, scheduled ETL, database dumps, Kafka Connect, batch queries, polling | Webhooks, real-time streaming (e.g., MQTT), proprietary push APIs, some ETL tools |
Use Cases | Reporting, batch processing, periodic data sync, data lakes, less time-sensitive operations | Time-sensitive, high-frequency events, IoT devices, notification systems, or real-time analytics |
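The sketch below contrasts the two initiation styles with a toy in-process source: the consumer either registers a callback (push, webhook-like) or polls for new records and tracks its own offset (pull). Class and event names are illustrative.

```python
# A toy source that accumulates events; names and structure are illustrative.
class Source:
    def __init__(self):
        self.events, self.subscribers = [], []
    def emit(self, event):
        self.events.append(event)
        for callback in self.subscribers:   # push: the source calls the consumer
            callback(event)
    def fetch(self, since):
        return self.events[since:]          # pull: the consumer asks for new data

source = Source()

# Push: the consumer registers a callback (like a webhook) and waits.
source.subscribers.append(lambda e: print("pushed:", e))

# Pull: the consumer polls on its own schedule and controls the batch size.
offset = 0
source.emit("order_created")
new = source.fetch(offset)
offset += len(new)
print("pulled:", new)
```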
Data Modeling Technique | Definition | Characteristics | Advantages | Disadvantages | Use Cases | Examples |
---|---|---|---|---|---|---|
Conceptual Data Modeling | High-level, abstract model focusing on business entities and relationships without technical detail | Entities and relationships shown typically via ER diagrams or UML class diagrams | Platform-neutral, easy communication with business stakeholders | Lacks technical detail for implementation | Business planning, stakeholder alignment, early project stages | ER Diagrams, UML |
Logical Data Modeling | Detailed model defining data elements, attributes, relationships, keys, and rules without platform dependency | Normalized tables, keys, constraints, and relationships; focus on data integrity and structure | Provides clean, normalized design; aids data quality and governance | Does not address physical storage, indexing, or performance optimization | Schema design, data governance, preparing for physical modeling | 3NF, Data Vault modeling (hubs-links-satellites) |
Physical Data Modeling | Model optimized for database implementation specifying tables, columns, indexes, partitions, etc | Denormalized/normalized tables, platform-specific constructs like indexes, partitions, cluster keys | Optimizes storage, retrieval, and query performance | Tightly coupled to specific technologies, less flexible | Performance tuning and optimization for specific DB engines | Star schema, Snowflake schema, Anchor Modeling |
Dimensional Modeling | Simplifies data structures for analytical queries grouping data into facts and dimensions | Fact tables (numeric measures) linked to dimension tables (contextual descriptors) like star or snowflake schema | Intuitive for analysts, improves query speed for OLAP workloads | Less suitable for transactional systems, more redundancy | BI, data warehouses, dashboards, self-service analytics | Star Schema, Snowflake Schema, Slowly Changing Dimensions (SCDs) |
Relational Data Modeling | Organizes data in normalized tables ensuring minimal redundancy and data consistency | Tables with rows and columns, defined primary and foreign keys, normalized forms | Strong data integrity, widely supported, good for complex relationships | Can be complex to query for analytical workloads, performance overhead for joins | OLTP systems, master data management, transactional apps | 2NF, 3NF, Boyce-Codd Normal Form (BCNF), ER diagrams |
Entity-Relationship (ER) Model | Represents entities, attributes, and their relationships for database design | Entities represented as objects/tables; attributes as columns; relationships with cardinality and optionality | Clear visualization of data relationships, promotes normalization | May become complex for large systems, design only | Database design, relational databases | Chen ER Model, Crow's Foot notation |
Object-Oriented Data Modeling | Combines data with behavior, encapsulates data and operations together representing objects | Objects with attributes and methods; supports inheritance, classes, and polymorphism | Closer to real-world modeling, reusable components | Complexity, less common in traditional DBs | Object databases, applications using OOP principles | Classes, inheritance hierarchies |
Hierarchical Data Modeling | Organizes data in tree-like parent-child relationships | Strict one-to-many parent-child relationships; records organized in a hierarchy | Simple and fast navigation in one-to-many data | Inflexible with many-to-many or complex relationships | Legacy systems, XML/JSON document stores, file systems | IMS database, XML schemas |
Network Data Modeling | Extends hierarchical to allow many-to-many relationships | Graph-like structures, records with multiple owners or parents | More flexible than hierarchical, models complex relationships | More complex design and management than relational | Complex interconnected data like telecommunications, logistics | CODASYL, Graph databases (Neo4j, Amazon Neptune) |
Temporal/Historical Modeling | Tracks data changes over time for auditing, historical analysis | Stores multiple data versions with timestamps for valid and transaction time | Supports full data history and versioning, improves auditability | Increases data storage and complexity | Compliance, time-series, audit trails, customer lifecycle | Bitemporal modeling, Slowly Changing Dimensions (SCDs), Anchor modeling |
Agile Data Modeling | Enables iterative and flexible modeling adapting to evolving business needs | Combines techniques, emphasizes collaboration and incremental updates | Highly adaptable, incorporates feedback quickly | Can lack initial rigor, may lead to inconsistent models | Rapid development environments, evolving business domains | Often combined with other models in Agile projects |
Big Data Modeling | Tailored to handle volume, velocity, and variety of big data | May use NoSQL schema-on-read, data lakes, schemas for semi-structured data | Scales for huge data volumes, flexible schema | Less mature standards, potential for data inconsistency | Big data platforms, streaming analytics | Schema-on-read, Hadoop, NoSQL, Data lakehouse |
Inmon | Corporate Information Factory (CIF). Enterprise-wide data architecture integrating various data sources into a centralized warehouse. Flow: Sources → Staging (ETL) → Enterprise Data Warehouse (Data stored in 3NF) → Data Marts → Consumption | Top-down approach, normalized data warehouse, data marts for specific domains | Comprehensive, consistent enterprise view | Complex, time-consuming implementation | Large enterprises needing integrated data | Normalized data warehouse, data marts |
Kimball | Bus Architecture. Dimensional modeling approach focusing on ease of use and performance. Flow: Sources → Staging (ETL) → Enterprise Data Warehouse (star schema) → Data Marts → Consumption | Bottom-up approach, data marts for specific business areas | Fast query performance, user-friendly data structures | Can lead to data silos, less comprehensive view | Mid-sized to large enterprises with specific reporting needs | Star schema, snowflake schema, data marts |
Data Vault (Linstedt) | Hybrid approach combining elements of 3NF and dimensional modeling. Flow: Sources → Staging (ETL) → Raw Data Vault → Business Data Vault → Data Marts → Consumption | Focuses on agility and scalability, accommodating changes easily | Supports historical tracking and auditability | Can be complex to implement and manage | Organizations needing flexibility and rapid change adaptation | Data vault model, hubs, links, satellites |
Slowly Changing Dimensions (SCDs) are dimension tables in data warehouses where attribute values change slowly over time. Unlike frequently changing fact data, dimension data (e.g., customer details, product attributes) requires historical tracking for accurate reporting. SCDs manage these changes in various ways to meet different business and analysis needs.
Aspect | Type 0: Retain Original | Type 1: Overwrite | Type 2: Add New Row | Type 3: Add New Attribute | Type 4: Add History Table | Type 5: Add Mini-Dimension | Type 6: Combined Approach | Type 7: Hybrid Approach |
---|---|---|---|---|---|---|---|---|
Description | Attribute never changes; always original value | Overwrite old data; no history kept | Insert new row for each change; full history tracked | Add new column(s) to track limited previous value | Store historical data in separate history table | Create a mini-dimension for frequently changing attributes | Combines Types 1, 2, 3 in one model (overwrite, add row, add attribute) | Combines various SCD techniques beyond Type 6 for adaptive needs |
Visualization | The attribute never changes, so the entity design simply holds the original column with no alteration | Changes overwrite old values, and there is no history kept | New row is added for each change, keeping full history. Start and end dates track consistency | One or more additional columns retain limited history (e.g., previous value) | Separate history table is created to maintain full change history, keeping the current state in the main dimension | Mini-dimension stores rapidly changing attributes separately, referenced by the main dimension | Combines Types 1, 2, and 3. Maintains both current values (Type 1) and full history (Type 2) and a previous attribute (Type 3) | Flexible hybrid approach, often combining multiple SCD strategies for different columns depending on business requirements |
Change Handling Method | No update | Overwrite existing values | Add new record per change | Add new column to track previous value | Use separate history table for old data | Extract frequently changing attributes into separate mini-dim table | Current and historical columns plus version column | Flexible, combines multiple change management techniques |
Historical Data Tracking | No | No | Yes | Limited (only one previous value) | Yes | Partial history through mini-dims | Yes | Yes |
Storage Impact | Minimal | Minimal | High (multiple rows per entity) | Moderate (additional columns) | Moderate to High (two tables) | Moderate (extra mini-dim tables) | High (due to multiple approaches combined) | Variable, depends on component types used |
Query Complexity | Very simple | Simple | More complex due to multiple rows | Simple for limited history | Moderate due to joins with history table | Moderate (joins with mini-dim) | Moderate to complex | Complex, depending on combination used |
Pros | Simple; fast queries | Easy implementation, fast update | Full historical data tracking | Easy access to current and prior value | Clear separation of current and historical data | Improves query performance for frequent small changes | Flexible; combines best of types 1, 2, 3 | Highly adaptable to complex scenarios |
Cons | No history, no ability to analyze change | History lost | Adds storage; may impact performance | Only tracks limited history, not scalable | Extra complexity with multiple tables | Additional ETL and dimensional complexity | Complexity; maintenance overhead | High complexity; requires sophisticated design |
Implementation Complexity | Low | Low | Moderate to high | Low to moderate | Moderate to high | Moderate to high | High | Very high |
Impact on Performance | Minimal | Minimal | Can degrade with large historical data | Moderate | Moderate | Moderate | Can be performance intensive | Depends on implemented hybrid techniques |
Dimension Table Action | No change to attribute value | Overwrite attribute value | Add new dimension row for profile with new attribute value | Add new column to preserve attribute's current and prior values | Move changed attribute values to a separate history table; keep current values in the main dimension | Add mini-dimension table containing rapidly changing attributes, referenced by a key in the base dimension | Add type 1 overwritten attributes to type 2 dimension row, and overwrite all prior dimension rows | Add type 2 dimension row with new attribute value, plus view limited to current rows and/or attribute values |
Impact on Fact Analysis | Facts associated with attribute's original value | Facts associated with attribute's current value | Facts associated with attribute value in effect when fact occurred | Facts associated with both current and prior attribute alternative values | Facts joined to the history table recover attribute values in effect when the fact occurred | Facts associated with rapidly changing attributes in effect when fact occurred, plus current rapidly changing attribute values | Facts associated with attribute value in effect when fact occurred, plus current values | Facts associated with attribute value in effect when fact occurred, plus current values |
Use Cases | Static attributes like SSN, zip codes | Correcting typos, non-critical updates e.g. email, phone | Track full history of customer address, employee job changes | Track current and previous salary, status | Maintain full historical pricing, employment data | Track attributes like customer segmentation that change frequently | Employee role and department tracking with full change history | Complex enterprise needs, combining multiple SCD styles |
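A minimal SCD Type 2 sketch in plain Python: each change closes the current row with an end date and appends a new current row. Column names (`is_current`, `start_date`, `end_date`) are illustrative conventions rather than requirements.

```python
from datetime import date

# Minimal SCD Type 2: close the current version and insert a new one,
# preserving full history. Column names are illustrative.
def apply_scd2(dimension, customer_id, new_address, today=None):
    today = today or date.today()
    for row in dimension:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["address"] == new_address:
                return dimension                      # no change, nothing to do
            row["is_current"] = False
            row["end_date"] = today                   # close the old version
    dimension.append({
        "customer_id": customer_id, "address": new_address,
        "start_date": today, "end_date": None, "is_current": True,
    })
    return dimension

dim = [{"customer_id": 1, "address": "Old St 1",
        "start_date": date(2020, 1, 1), "end_date": None, "is_current": True}]
apply_scd2(dim, 1, "New Ave 2")
print(len(dim), dim[0]["is_current"], dim[1]["is_current"])  # 2 False True
```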
Schema Types
Aspect | Physical Schema | Logical Schema | Evolving Schema | Contractual Schema (API) | Metadata Schema |
---|---|---|---|---|---|
Definition | Describes how data is physically stored and arranged (files, indices, partitions) on storage media or DBMS | Defines the logical (human-readable) structure: tables, fields, relationships, constraints | Captures the actual schema changes (add/remove fields) over time, typically in dynamic or pipeline-driven systems | Schema defining fields and their validation between systems via an API contract (e.g., JSON, GraphQL) | Schema that describes the data about data, such as lineage, column descriptions, and governance |
Level of Abstraction | Lowest: hardware, file system, storage block level | Higher: data model, independent of storage | Variable: follows either physical or logical but adapts to change | Variable: can be logical or physical depending on API implementation | Varies: may refer to logical, physical, or conceptual layers |
Focus | Performance, storage efficiency, physical locations | Data organization, integrity, relationships, constraints | Handling schema drift, flexibility for changes | Interface definition, data validation, compatibility | Data governance, lineage, quality, observability |
Typical Stakeholders | DBAs, infrastructure engineers | Data modelers, analysts, architects | Data engineers, analytics teams | Backend engineers, API consumers/producers | Data governance, compliance, data stewards |
Benefits | Maximizes storage & query performance; supports tuning, scaling | Ensures consistency, maintainability, integrity of business logic | Enables rapid evolution, tracks change, minimizes disruption | Allows machine interoperability, enforces standards, prevents breakage | Aids data discovery, quality, lineage, and regulatory compliance |
Limitations | Complex to change, tightly-coupled to hardware/DBMS | May hide physical inefficiencies, less relevant for storage choices | Risk of data loss or incompatibility if not managed well | Tight coupling can hinder API flexibility, requires documentation | Can become outdated or incomplete without good processes |
Use Cases | DBMS optimization, partitions, indexes, backup/recovery strategies | ER diagrams, database normalization, data modeling | ELT pipelines, analytics, SaaS product changes | API design, system integration, microservices communication | Data catalogs, pipeline documentation, lineage tracking |
Examples | Parquet files with partitioning; index files for tables; disk layouts | Star schema; ERD; relational database definitions | Adding new analytics events; updating field names in ELT | REST/GraphQL/OpenAPI schema definitions; JSON schema | dbt sources.yml; OpenMetadata; catalog records; lineage graphs |
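As an illustration of a contractual schema, the sketch below validates an API payload against a hand-written field/type contract; real systems would typically use JSON Schema, Avro, or Protobuf definitions instead, and the field names here are assumptions.

```python
# Minimal contractual-schema check for an API payload; field names are illustrative.
CONTRACT = {"order_id": int, "amount": float, "currency": str}

def validate(payload: dict, contract: dict = CONTRACT) -> list[str]:
    errors = []
    for field, expected_type in contract.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    return errors

print(validate({"order_id": 7, "amount": 19.99, "currency": "EUR"}))  # []
print(validate({"order_id": "7", "amount": 19.99}))  # type error + missing field
```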
Star vs. Snowflake vs. Galaxy Schema
Aspect | Star Schema | Snowflake Schema | Galaxy Schema |
---|---|---|---|
Structure | Central fact table linked to denormalized dimension tables | Fact table linked to normalized dimension tables, split hierarchically | Multiple fact tables sharing dimension tables, can be a mix of star and snowflake |
Data Normalization | Dimension tables are denormalized (flat structure, redundancy present) | Dimension tables are normalized (data split into sub-tables, minimal redundancy) | Typically involves normalized or partially normalized dimension tables to reduce data redundancy. Dimensions are often conformed (shared across fact tables). Normalization level can vary depending on design goals |
Query Performance | Faster query execution due to fewer joins | Slower query execution due to multiple joins required | Performance can vary; may benefit from fewer joins but could be impacted by complexity |
Query Complexity | Simpler queries, fewer joins, easy to write and understand | More complex queries, requires deeper understanding and multiple joins | Queries can be complex due to multiple fact tables and shared dimensions; requires good understanding of schema |
Storage Requirements | Higher storage use due to redundant and denormalized data | More storage efficient; reduced duplication through normalization | Storage efficiency varies; can be optimized through shared dimensions but may still have redundancy depending on design |
Data Redundancy | Higher - dimensions repeat attribute values in multiple rows | Lower - most redundant data is eliminated | Varies - some redundancy may remain depending on design |
Space Usage | More storage space required for large datasets | Less storage space through normalization | Varies - can be optimized but may still require significant space depending on data volume and design |
Foreign Keys | Fewer foreign keys (simple design) | More foreign keys due to multiple related tables | Multiple foreign keys due to shared dimensions; complexity depends on design |
Data Integrity | Lower: Denormalization risks inconsistency due to data being updated in many places | Higher: Normalization enforces referential integrity and consistency | Varies - can be managed but may require more effort to maintain consistency |
Updates and Modifications | Harder to update - redundant data increases risk of inconsistent modifications | Easier for updates - changes in an attribute only affect one table | Varies - updates may be easier due to shared dimensions but can be complex depending on relationships |
Dimension Table Structure | Flat structure - each dimension is a single table, no sub-tables | Multi-layered - each dimension may be decomposed into sub-dimensions | Varies - dimensions can be flat or multi-layered depending on design |
BI & Reporting Suitability | Best for BI tools, dashboards, and quick ad hoc queries | Better for complex analytical queries, detailed reporting, and multidimensional analysis | Suitable for complex reporting needs involving multiple fact tables; requires good understanding of schema |
Maintainability | Easier to maintain, intuitive design | More difficult to maintain, complex design | Varies - can be complex to maintain due to multiple fact tables and shared dimensions |
Design Complexity | Easier and faster to design and implement | Requires careful design due to hierarchical splitting | Varies - can be complex to design depending on relationships and shared dimensions |
Scalability | Scalable for typical analytic workloads, though can suffer performance issues at extreme scale due to redundancy | Good scalability, especially for complex and large-scale data with multiple hierarchies | Scalability varies; can handle complex data but may require careful design to avoid performance bottlenecks |
ETL/ELT Complexity | Simpler ETL/ELT pipelines - fewer tables to populate and maintain | More complex ETL/ELT - hierarchical normalization requires careful loading and management | ETL/ELT complexity varies; may require more sophisticated pipelines to manage multiple fact tables and shared dimensions |
Drawbacks | Data redundancy, storage waste, potential for inconsistencies, not suited for high-cardinality or complex hierarchies | Query slowness for basic analytics, complexity in query construction and ETL, harder for non-technical users to understand and navigate | Complexity in design and maintenance, potential performance issues if not well-optimized |
Use Cases | Retail sales analysis with simple product/geography/time/customer dimensions | Data warehouses with complex product/customer/location hierarchies, and systems requiring fine-grained data integrity | Enterprise data warehouses with multiple business processes, complex reporting needs, and shared dimensions across fact tables |
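A tiny star schema sketched with SQLite: one fact table, one denormalized dimension, and the kind of join-and-aggregate query BI tools generate. Table and column names are illustrative.

```python
import sqlite3

# A tiny star schema: one fact table plus a denormalized date dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales (date_key INTEGER REFERENCES dim_date(date_key), amount REAL);
INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
INSERT INTO fact_sales VALUES (20240101, 100.0), (20240101, 50.0), (20240201, 80.0);
""")

# Typical star-schema query: join the fact to a dimension and aggregate.
query = """
SELECT d.year, d.month, SUM(f.amount) AS revenue
FROM fact_sales f JOIN dim_date d ON f.date_key = d.date_key
GROUP BY d.year, d.month
"""
print(conn.execute(query).fetchall())  # [(2024, 1, 150.0), (2024, 2, 80.0)]
```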
Lambda vs. Kappa
Aspect | Lambda | Kappa |
---|---|---|
Processing Model | Combines batch processing and real-time stream processing in separate layers | Uses a single, unified stream processing pipeline for both real-time and reprocessing |
Processing Layers | Three layers: Batch Layer (large-scale processing), Speed Layer (real-time), Serving Layer (query) | Single pipeline for all data, eliminating the batch layer |
Complexity | High complexity; requires maintaining and synchronizing two separate codebases and pipelines | Simpler architecture; only one processing pipeline to maintain |
Latency | Batch layer processing introduces higher latency; speed layer offers low latency for real-time data | Low latency overall due to continuous stream processing |
Fault Tolerance | Fault tolerant: batch layer can recompute results if speed layer fails or produces errors | Fault tolerant depending on stream processing reliability; relies on log replay for reprocessing errors |
Data Reprocessing Capability | Batch layer enables accurate reprocessing of historical data to fix errors or recompute results | Reprocessing done via replaying events from the log through the stream processor |
Accuracy | High accuracy due to batch layer with complete data; speed layer may produce approximate results | Consistent real-time results but may lack batch-layer level accuracy for complex computations |
Scalability | Scales horizontally but more complex scaling due to separate batch and speed layers | Easier to scale stream processing horizontally; simpler operational model |
Historical Data Handling | Excellent, supports deep historical batch analytics and corrections | Less suited for complex historical data analysis, designed mainly for streaming real-time data |
Implementation Complexity | High development and maintenance effort due to dual pipelines and serving layer integration | Lower implementation and maintenance overhead |
Consistency Between Layers | Requires careful coordination to keep batch and speed outputs consistent | Single pipeline avoids consistency issues inherent in Lambda dual-layer design |
Real-Time Analytics | Provides real-time insights via speed layer but with possible eventual consistency lag | Provides immediate real-time analytics with no separate batch delay |
Support for Complex Analytics | Good support since batch layer handles heavy, complex queries and aggregations | Limited complex analytics, as everything must be handled in stream processing |
Reprocessing Complexity | Batch layer reprocessing is separate and managed independently | Reprocessing simply involves re-consuming the event stream, simplifying error correction |
Data Duplication Risk | Potential for duplication or mismatch between batch and speed layer results if not carefully managed | Minimal duplication risk since there is only one data processing pipeline |
Use Cases | Suitable for systems needing both comprehensive historical analysis and real-time insights | Best for real-time focused applications with simpler operational needs (e.g., IoT, user activity tracking) |
Examples | Recommendation engines, financial modeling, large-scale analytics | Real-time monitoring, IoT analytics, clickstream processing, social media analytics |
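The serving-layer idea behind Lambda can be sketched very simply: merge a complete-but-stale batch view with a fresh-but-partial speed view at query time. The dictionaries below stand in for real materialized views.

```python
# Lambda-style serving: combine a precomputed batch view with a speed-layer
# view that covers only the most recent, not-yet-batched events.
batch_view = {"user_1": 120, "user_2": 45}      # complete but hours old
speed_view = {"user_1": 3, "user_3": 7}         # approximate, last few minutes

def serve(key):
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(serve("user_1"), serve("user_3"))  # 123 7

# In a Kappa architecture there is no batch view: the same totals are
# maintained by one stream job, and corrections are made by replaying the log.
```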
Type | Purpose | Scope | When Performed | Key Techniques | Considerations | Relevance |
---|---|---|---|---|---|---|
Data Quality Testing | Validate accuracy, completeness, consistency, validity, timeliness, and uniqueness of data | Data at rest (tables, datasets) and in-motion (streams) | Often ongoing, triggered by data load or refresh | Profiling, validation rules, anomaly detection, null checks, deduplication | Identifying subtle quality issues, evolving data schemas | Crucial for trustworthy analytics; foundation to all downstream processes |
Data Integrity Testing | Ensure accuracy, completeness, retrievability, verifiability, truthfulness, consistency, and reliability of data throughout its lifecycle | Data storage, processing, transmission, and updates | Routine and triggered by data changes, migrations | Validation rules, checksums, version control, continuous monitoring, domain and entity integrity tests | Managing volume & complexity, real-time validation, compliance, security | Critical to maintain trustworthiness of data; prevents corruption and errors across all data states and systems
Integration Testing | Verify interactions and data flow between integrated components or systems | Endpoints, APIs, data sources, ETL components | After component/unit testing, pre-system integration | API calls, contract validation, mock testing | Managing dependencies, environment setup, flaky tests | Ensure data flows cleanly between systems without loss or corruption |
Performance Testing | Assess system responsiveness, throughput, stability under load | Entire pipeline throughput, resource usage, latency | Pre-release or after significant changes | Load testing, stress testing, volume testing | Simulating realistic load, environment parity | Essential to meet SLAs for batch and streaming jobs, avoid bottlenecks |
Regression Testing | Ensure new code/changes do not break existing data workflows or features | Entire data pipeline or specific modules | After any change or update | Automated retesting, test case prioritization | Test suite maintenance, execution time | Maintain pipeline stability; detect silent errors after changes |
End-to-End Testing | Validate complete workflows from ingestion through transformations to consumption | Across all pipeline stages and downstream applications | Before major releases or deployment | Full process simulation, real user scenario emulation | High complexity, environment parity | Confirm entire data lifecycle works as expected from source to consumer |
Functional Testing | Validates specific functions or business rules within data transformations | Specific ETL jobs, SQL functions, or data logic blocks | During development and after changes | Unit tests, SQL assertions, black-box testing | Test data setup, mock dependencies | Validate correctness of data transformations and business logic |
Compliance Testing | Verify data adherence to legal, regulatory, and internal policies | Data privacy, retention, access controls, audit trails | Scheduled or triggered by regulation changes | Policy validation, audit log review | Dynamic rules, audits, cross-system consistency | Ensure data governance and regulatory compliance requirements are met |
Contract Testing | Verify that communication contracts/interfaces between services remain consistent | API schemas, data contracts, message formats | Before and during integration releases | Schema validation, consumer-driven contract testing | Coordinating consumer/provider contracts | Prevent integration breakage due to incompatible schema changes |
Data Processes Testing | Validate ETL/ELT logic, correctness of data transformation and processing | Extract, Transform, Load stages individually and combined | During development, scheduled after pipeline changes | Unit tests, integration tests, system-wide data validation | Complex dependencies, state handling | Ensure processing steps handle data correctly and produce expected results |
Pipeline Testing | End-to-end and targeted tests validating pipeline orchestration, error handling, and data flow | Orchestrator workflows, triggers, retries, alerts | Continuous, after pipeline deployments or fixes | Workflow simulations, failure scenario testing | Environment parity, handling intermittent failures | Verify pipeline robustness, alerting, and data delivery completeness |
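A minimal sketch of data quality checks (nulls, duplicates, ranges) written as plain functions over row dictionaries; in practice these would run as pytest cases, dbt tests, or a validation step in the pipeline, and the column names are illustrative.

```python
# Minimal data quality checks over a list of row dicts.
rows = [
    {"order_id": 1, "amount": 120.0},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": 75.5},
]

def check_not_null(rows, column):
    return [r for r in rows if r[column] is None]

def check_unique(rows, column):
    seen, dupes = set(), []
    for r in rows:
        if r[column] in seen:
            dupes.append(r)
        seen.add(r[column])
    return dupes

def check_range(rows, column, low, high):
    return [r for r in rows if r[column] is not None and not (low <= r[column] <= high)]

print("nulls:", check_not_null(rows, "amount"))               # one violating row
print("duplicates:", check_unique(rows, "order_id"))          # one violating row
print("out of range:", check_range(rows, "amount", 0, 1000))  # none
```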
- Imperative/Declarative
- Idempotency
Infrastructure as Code (IaC) provisions and manages computing infrastructure with code instead of manual processes. By defining the desired state and automating deployment, it cuts down on slow, error-prone manual work, especially at scale. This frees developers to focus on applications, while organizations gain cost control, risk reduction, and faster responses to opportunities.
Aspect | Imperative Programming | Declarative Programming |
---|---|---|
Definition | Specifies how to perform tasks step-by-step through explicit instructions | Specifies what the desired outcome or goal is, without detailing how to achieve it |
Programming Approach | The developer writes detailed instructions explicitly controlling each step to change the program state | Describes the desired end state; the system figures out the instructions to reach that state automatically |
Control Flow | Explicit; the developer manages the exact order of operations and flow | Implicit; controlled by the system or runtime |
State Management | Explicit and manual; the developer must maintain and update system state | Abstracted away and handled automatically by the system |
Level of Abstraction | Lower-level, deals with detailed procedural steps and direct system operations | Higher-level, more abstract, focuses on logic and outcomes |
Error Handling | Must be explicitly handled by the programmer; easier to introduce state inconsistency or errors | Often more robust due to abstraction; the system validates state before applying changes |
Flexibility/Control | More control over performance and optimization by managing each operation exactly | Less fine-grained control over execution details, focus is on describing end results |
Maintainability | Can become complex and harder to maintain with scaling due to detailed step management | Typically easier to maintain and extend as logic is expressed declaratively |
Adaptability to State | Rigid; instructions may fail if the initial state differs from assumptions | Adaptive; compares current state with desired state and adjusts actions dynamically |
Performance | Potentially faster for low-level tasks when optimized by expert programmers | May add overhead from abstraction or compilation; optimized by underlying engine |
Error-Prone | More prone to errors due to manual state & control flow management | Generally less error-prone since system manages steps and state consistency |
Debugging | Easier for step-by-step tracing but can get complicated in large codebases | Debugging declarative code may be harder due to abstraction, requires understanding system internals |
Tools | Chef (procedural recipes), shell scripts | Terraform, Puppet, CloudFormation |
Use Cases | Writing detailed data processing pipelines, manual orchestration of ETL steps, data cleaning scripts | Defining database schemas, data transformations (dbt models), infrastructure as code (Terraform), SQL queries |
Example: Creating Table (SQL) | Write explicit commands to create table, add columns, alter structure; may fail if structure exists | Define the desired table structure and let the system handle creation or alteration dynamically |
Example Analogy | Giving step-by-step instructions on how to make the sandwich starting from scratch | Showing a picture of the final sandwich and having a competent chef make it |
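The contrast can be sketched in a few lines: the imperative version issues explicit steps and must handle pre-existing state itself, while the declarative version states the desired end state and lets a (toy) reconciler work out the steps. The "infrastructure" here is just a dictionary, and the resource names are hypothetical.

```python
# Imperative: spell out each step and manage state transitions yourself.
infra = {}
def create_bucket(name):
    if name in infra:
        raise RuntimeError("already exists")   # the caller must handle this case
    infra[name] = {"versioning": False}
def enable_versioning(name):
    infra[name]["versioning"] = True

create_bucket("raw-data")
enable_versioning("raw-data")

# Declarative: describe the desired end state; a reconciler computes the steps.
desired = {"raw-data": {"versioning": True}, "curated-data": {"versioning": True}}
def reconcile(current, desired):
    for name, config in desired.items():
        current[name] = dict(config)           # create or update to match the spec
    for name in set(current) - set(desired):
        del current[name]                      # remove anything not declared

reconcile(infra, desired)
print(infra)  # matches `desired` regardless of the starting state
```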
Idempotency means an operation can be applied multiple times without changing the result beyond the initial application.
Importance
- Prevents duplicate data processing and corruption during retries
- Simplifies error handling by making retries safe
- Ensures consistent and deterministic pipeline outputs
- Enables scalable, concurrent processing without complex locking
- Facilitates easier debugging and auditing
- Helps meet strict regulatory requirements for transactional data
Guidelines
- Use Idempotency Keys:
- Assign unique identifiers to each operation or data item
- Use composite keys (e.g., source + timestamp) to detect duplicates
- Store these keys to recognize repeated operation attempts and avoid reprocessing
- Employ Atomic Transactions:
- Group operations into atomic units that either complete fully or rollback entirely
- Use transactional ACID-compliant storage systems where possible
- Deduplication Techniques:
- Implement deduplication at multiple levels (data ingestion, processing, storage)
- Utilize probabilistic data structures (Bloom filters) and sliding window algorithms for efficient duplicate detection
- Checkpointing and State Management:
- Maintain and persist checkpoints/states for recovery and partial processing resumption
- Enable pipeline to restart safely from the last consistent state after failures
- Use Contextual Uniqueness:
- Incorporate business logic attributes in idempotency checks to catch logical duplicates
- Concurrency Control:
- Design systems that handle concurrent writes gracefully using idempotency
- Leverage modern concurrency control patterns like non-blocking concurrency
- Choose Idempotent Storage Backends:
- Leverage storage systems that support conditional updates or compare-and-swap semantics (e.g., Delta Lake, Apache Hudi, distributed NoSQL with ACID features)
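A minimal sketch of the idempotency-key guideline above: each operation derives a key, and retries of an already-processed key are skipped. The key format (source plus event id) and the in-memory stores are illustrative; a real system would persist the key and the write atomically.

```python
# Idempotency-key sketch: record each processed key so that retries of the
# same operation do not double-apply.
processed_keys = set()
balances = {"acct_1": 0}

def apply_deposit(event):
    key = f"{event['source']}:{event['event_id']}"
    if key in processed_keys:
        return "skipped (duplicate)"
    balances[event["account"]] = balances.get(event["account"], 0) + event["amount"]
    processed_keys.add(key)          # in production: persist atomically with the write
    return "applied"

event = {"source": "payments", "event_id": 42, "account": "acct_1", "amount": 100}
print(apply_deposit(event))   # applied
print(apply_deposit(event))   # skipped (duplicate) - the retry is safe
print(balances)               # {'acct_1': 100}
```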
Testing and Validation
- Testing Methodologies
- Repeated Execution Testing: Re-run operations multiple times and verify the same state
- Fault Injection Testing: Simulate failures (network, crashes) to observe idempotent behavior
- Concurrent Operation Testing: Run identical operations simultaneously to test race conditions
- State Transition Validation: Confirm system transitions remain consistent regardless of operation frequency
- Time-Window Testing: Retry operations across time spans to ensure idempotency holds over time
- Validation Techniques
- Range Checking: Validate data values fall within acceptable limits
- Type Checking: Verify data types conform to expectations
- Format Checking: Ensure compliance with required data formats (e.g., emails, phone numbers)
- Consistency Checks: Confirm relational integrity across fields and datasets
- Automated Testing
- Property-based testing to generate varied and edge-case scenarios
- Chaos engineering tools to introduce faults in production-like environments
- Integration and regression tests to maintain idempotency guarantees as systems evolve
- Performance monitoring to assess idempotency overhead
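A repeated-execution test can be as simple as applying the same operation twice and asserting the final state matches a single application; the `upsert` below is a hypothetical idempotent operation keyed by primary key.

```python
import copy

# Repeated-execution test: apply the same operation twice and assert the final
# state equals the state after applying it once.
def upsert(table, row, key="id"):
    table[row[key]] = row
    return table

def test_upsert_is_idempotent():
    once = upsert({}, {"id": 1, "name": "Ada"})
    twice = upsert(copy.deepcopy(once), {"id": 1, "name": "Ada"})
    assert once == twice, "re-applying the operation changed the state"

test_upsert_is_idempotent()
print("idempotency check passed")
```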
- Overview
- Authentication
- Authorization
Aspect | Authentication | Authorization | Encryption | Tokenization | Data Masking | Data Obfuscation |
---|---|---|---|---|---|---|
Definition | Verifying identity of a user or system | Granting or denying access rights to resources | Transforming data into unreadable format to protect it | Replacing sensitive data with non-sensitive tokens | Replacing sensitive data with fictitious but realistic data | Hiding data through transformation to prevent understanding |
Purpose | Confirming who is accessing the system | Controlling what authenticated users can do/access | Protecting data confidentiality during storage/transit | Safeguarding sensitive data by replacing it with tokens | Protecting sensitive info while keeping data useful | Preventing data exposure while often preserving format |
Scope | Identity level (user, device, service) | Permission level (file, operation, service) | Data at rest, in transit | Specific sensitive data fields/elements | Databases, tables, fields, datasets for testing/sharing | Various data forms, often to resist reverse engineering |
Reversibility | N/A (identity verification) | N/A (access control) | Reversible if decryption key is held | Usually reversible via token vault, some are irreversible | Usually irreversible; aim is to prevent data recovery | Usually irreversible or complex to reverse |
Security Focus | Identity assurance | Access control enforcement | Confidentiality, data leakage prevention | Strong data security with minimal data exposure | Privacy compliance, risk reduction | Anti-reverse engineering, protecting intellectual property |
Data Format Preservation | N/A | N/A | Does not preserve original data format visibly | Can preserve format (format-preserving tokenization) | Preserves data usability and format | Often preserves structure/format for usability |
Performance Impact | Low to medium, depends on method | Low to medium, depends on complexity of policies | Can be high, especially with strong encryption and large data | Medium, due to token vault and lookups | Low to medium, depending on masking method (static/dynamic) | Low to medium, depends on obfuscation technique |
Complexity | Can be complex (multi-factor, adaptive) | Can be complex with fine-grained policies and delegation | Complex key management and cryptographic implementation | Complex token vault/database management | Intermediate; requires design of masking policies | Intermediate; requires custom transformation/logics |
Regulatory Compliance | Supports compliance by preventing unauthorized access | Supports compliance by enforcing access control | Strong support for data privacy and protection laws | Helps meet PCI DSS, GDPR by masking real data | Ensures compliance with GDPR, HIPAA, CCPA in testing/sharing | Assists compliance by protecting sensitive info exposure |
Key Limitation | Doesn't control resource access beyond identity verification | Authz policies can be bypassed if authN is weak | Key management critical; if keys lost, data unrecoverable | Reliance on token vault security; complexity | May reduce realism or break referential integrity | Can be reverse-engineered if weak transformations used |
Use Cases | Logins, multi-factor auth, biometric verification | Role-based access control, attribute-based access control | Securing emails, files, network traffic, databases | Payment card processing, PII protection, API token usage | Test/dev environments, analytics with safe data, compliance | Protecting source code, data export, secure telemetry |
Example Techniques | Passwords, biometrics, OTP, SSO | RBAC, ABAC, ACLs, policy engines | AES, RSA, TLS/SSL, hashing | Format-preserving tokenization, stateless/stateful tokens | Substitution, shuffling, scrambling, nulling, encryption-based masking | Character substitution, ciphering, noise addition |
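A rough sketch of masking versus tokenization using only the standard library; the hard-coded salt and truncated hash are for illustration only, whereas production tokenization relies on a token vault or format-preserving encryption with proper key management.

```python
import hashlib

SALT = b"demo-salt"  # illustration only; never hard-code secrets in real systems

def mask_card(card_number: str) -> str:
    # Data masking: hide most digits but keep the format and the last four.
    return "*" * (len(card_number) - 4) + card_number[-4:]

def tokenize(value: str) -> str:
    # One-way token: deterministic, so the same input maps to the same token,
    # preserving joinability without exposing the raw value.
    return hashlib.sha256(SALT + value.encode()).hexdigest()[:16]

print(mask_card("4111111111111111"))   # ************1111
print(tokenize("4111111111111111"))    # stable 16-character token
```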
Evolution of Authentication Methods
Credentials (Base64)
JSON Web Token (JWT)
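A sketch of the two mechanisms above: Basic credentials are merely Base64-encoded (reversible, not encrypted), and a JWT carries readable claims whose integrity is protected by a signature. The HMAC-signed token below is hand-rolled for illustration only; real services should use a maintained library such as PyJWT and proper key handling.

```python
import base64
import hashlib
import hmac
import json

# Basic credentials: Base64 is encoding, not encryption - trivially reversible.
header_value = "Basic " + base64.b64encode(b"alice:s3cret").decode()
print(header_value)
print(base64.b64decode(header_value.split()[1]))  # b'alice:s3cret'

# JWT-style token signed with HMAC-SHA256, simplified for illustration.
SECRET = b"server-side-secret"

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
payload = b64url(json.dumps({"sub": "alice", "role": "analyst"}).encode())
signature = b64url(hmac.new(SECRET, f"{header}.{payload}".encode(), hashlib.sha256).digest())
token = f"{header}.{payload}.{signature}"
print(token)  # claims are readable by anyone; the signature only proves integrity
```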
OAuth 2.0
SSH Keys
SSL Certificates
2FA (Two-Factor Authentication)
2SA (Two-Step Authentication)
Aspect | Role-Based Access Control (RBAC) | Attribute-Based Access Control (ABAC) | Access Control List (ACL) |
---|---|---|---|
Concept | Assigns permissions to users based on their roles within an organization | Grants access based on attributes of the user, resource, environment, and context | A list of specific rules defining who can access an object and what actions are allowed |
Main Focus | Roles and their associated permissions | Attributes and policies combining them | Explicit rules tied to individual resources |
Key Components | Users, Roles, Permissions, Sessions | Subjects (users), Objects (resources), Actions, Environment, Policies | Resources, Access Control Entries (ACEs) specifying users/groups and their permissions |
Access Control Model | Role-centric, static binding of permissions | Policy-centric, dynamic evaluation of attributes at request time | Rule-centric, access defined by explicit rules for users or groups per resource |
Flexibility | Moderate. Roles predefined; less adaptable to context changes | High. Can consider dynamic and contextual information (time, location, device, etc.) | Low to moderate. Rules usually static and manually maintained |
Granularity | Coarse to moderate, depends on number and granularity of roles | Fine-grained; policies can combine multiple attributes for precise decisions | Fine-grained at resource level, specifying detailed permissions per user/object |
Scalability | Scales well with a manageable number of roles; risk of role explosion if too many roles created | Can become complex and computationally heavy with many attributes and policies | Can be complex to manage at scale if many resources and users require rules |
Administration | Centralized administration through role assignments; easier for compliance audits | Complex policy administration requiring careful attribute and policy design | Decentralized - resource owners or admins define ACLs; can be cumbersome |
Policy Evaluation | At user-login or session creation, roles assigned then used throughout session | Real-time evaluation of attributes at each access request | Each access request evaluated against ordered ACL rules sequentially |
Security Strength | Good for static deterministic control but vulnerable if roles have excessive privileges | Potentially stronger due to fine-grained, context-aware policies | Strong when rules are well managed; can be prone to errors if rules overlap |
Policy Complexity | Simpler conceptually and easier to implement for basic needs | More complex, requiring detailed attribute and policy management | Simple for small sets of resources but can become complex |
Typical Policy Components | Roles, permissions, users, sessions | Attributes (user, resource, environment), policies, rules combining attributes | Access Control Entries (ACEs) specifying users/groups and their permissions |
Errors and Conflicts | Role explosion can create overlap or excessive permissions | Policy conflicts can be complex to detect and resolve | Rule ordering is critical; earlier rules take precedence, leading to conflicts if mismanaged |
Management Overhead | Moderate; fewer roles mean simpler management, but overhead grows as roles become more complex | Higher due to attribute and policy complexity | High if many resources/users require individualized ACLs |
User Control | No direct control by end users; all managed by administrators | No direct user control; policy-driven access | Owners may control ACLs on their resources (discretionary control) |
Compliance and Auditing | Easier to audit due to defined roles and permissions | More complex auditing due to dynamic policies but more precise logging possible | Auditable if ACLs are properly logged and maintained |
Hybrid Use | Often combined with ABAC for context-aware refinements | Can include role as an attribute or integrate with RBAC | ACLs often used alongside RBAC or ABAC for network or low-level access control layers |
Example Permissions | "HR Manager" role can approve leave requests and view payroll data | User accessing resource only during business hours and from corporate device | IP-based allow/deny rules on network devices or file read/write permissions per user |
Use Cases | Enterprises with clearly defined job functions and structured hierarchies | Environments needing fine-grained, dynamic, context-aware access decisions | Network devices (routers, firewalls), file systems, and simple resource-based control |
Common Implementations | Microsoft Active Directory, Oracle RBAC, databases, enterprise IT systems | Healthcare, finance, government systems with strict compliance needs | Router/firewall rules, Windows/Linux file system permissions, some databases |
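To make the RBAC/ABAC contrast in the table concrete, a toy sketch: RBAC resolves a static role-to-permission mapping, while ABAC evaluates attributes of the user, resource, and environment at request time. All role names, attributes, and policy rules below are illustrative.

```python
# --- RBAC: permissions bound to roles ahead of time ---
ROLE_PERMISSIONS = {
    "hr_manager": {"approve_leave", "view_payroll"},   # illustrative roles
    "analyst": {"read_reports"},
}

def rbac_allowed(role: str, action: str) -> bool:
    return action in ROLE_PERMISSIONS.get(role, set())

# --- ABAC: policy evaluated against request attributes at access time ---
def abac_allowed(user: dict, resource: dict, context: dict) -> bool:
    return (
        user["department"] == resource["owning_department"]   # user vs resource attribute
        and context["device"] == "corporate"                   # environment attribute
        and 9 <= context["hour"] < 18                          # business-hours condition
    )

print(rbac_allowed("hr_manager", "view_payroll"))              # True (static role binding)
print(abac_allowed(                                            # True only in this context
    {"department": "finance"},
    {"owning_department": "finance"},
    {"device": "corporate", "hour": 10},
))
```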
- Overview
- Governance Topologies
Data mesh is a decentralized data architecture where teams own and manage their data. It assigns ownership to business domains (e.g., finance, marketing, sales), providing a self-serve platform and federated governance. This enables autonomous development of tailored data services while ensuring a unified data experience across the organization.
Aspect | Domain Ownership | Data as a Product | Self-Serve Data Platform | Federated Governance |
---|---|---|---|---|
Strategic Domain-Driven Design | Domain Bounded Context | Product Thinking | Domain-Agnostic | Context-Mapping
Socio-technical Perspective | Domain Teams | Data Product by Domain Team | Data Platform Team | Guild |
Technology | Operational & Analytical Data | Interoperability Interfaces | Self-Serve Data Platform | Data Governance & Automation |
Core Principles​
- Domain-oriented decentralized ownership: Business domains (e.g., customer service, marketing) own and manage their analytical and operational data services, tailoring data models to their needs
- Data as a product: domain teams treat other domains as consumers, providing high-quality, secure, and up-to-date data
- Self-service data infrastructure as a platform: dedicated team provides tools for domains to autonomously consume, develop, deploy, and manage interoperable data products
- Federated computational governance: global policies are defined by a federated governance body and embedded (automated) in each domain's processes, enabling autonomy while ensuring compliance and interoperability
Data Mesh Architecture​
Data Product​
High-level Platform Design and Governance​
Example​
Aspect | Fine-grained Fully Federated Mesh | Fine-grained Fully Governed Mesh | Hybrid Federated Mesh | Value Chain-Aligned Mesh | Coarse-grained Aligned Mesh | Coarse-grained and Governed Mesh |
---|---|---|---|---|---|---|
Description | Pure data mesh model with many small, independent deployable components, peer-to-peer data distribution, logically centralized governance metadata | Adds a central data distribution layer to fine-grained federated mesh for stronger governance and centralized data distribution | Combines federation and centralization. Central platform hosts/maintains data products; domain autonomy mainly in data consumption | Domains aligned along business value chains, working in close groups with autonomy but sharing central standards for cross-domain data | Large, coarse-grained domains, often as a result of mergers; domains contain many applications, organic growth leads to complexity | Similar to coarse-grained aligned mesh but with stronger governance features like addressing time-variant and non-volatile data concerns |
Granularity | Fine-grained data products, many small independent units | Fine-grained data products with centralized distribution layer | Hybrid: fine to moderate granularity; central platform more involved | Fine to moderate granularity aligned by value chains | Coarse-grained domains containing many applications | Coarse-grained domains with governed attributes |
Governance Approach | Federated with logically centralized metadata governance but mostly domain autonomy | Fully governed with central control over distribution and conformance | Governed but with domain autonomy in consumption; central platform manages creation/maintenance | Central standards for cross-domain data; requires architectural guidance | Strong governance policies necessary due to complexity | Fully governed with relaxed controls in large domains |
Data Distribution | Peer-to-peer between domains; domains share data directly | Centralized data distribution via shared storage layer (domain-specific containers) | Domains create/manage data via central platform; consumes data autonomously | Aligned along value chains; domains share as needed with governance | Centralized/shared to manage complexity across coarse domains | Centralized/shared with governance controls for data quality |
Ownership | Domain owns, manages, shares data independently | Clear boundaries with domain ownership but central distribution | Domain teams or platform team may own/manage data products depending on capability | Domains collaborate with autonomy within their value chain | Domain ownership but domains large and complex | Domains own data but comply with governance for consistency |
Complexity / Management | High complexity managing many small data products; needs conformance agreement across domains | Higher complexity with governance and central controls; may slow time-to-market | Moderate complexity; need supporting platform and governance team to manage hybrid roles | Requires architectural coordination to define boundaries and standards clearly | High complexity due to coarse domains and multiple applications | High complexity with additional governance overhead |
Scalability | Scales well horizontally but can be costly and resource-intensive due to duplication | Scales with strong conformance, but centralization may introduce coupling delays and cost overhead | Scales with centralized platform efficiency and local domain agility | Scales by value chains, enabling domain-group specialization | Suited to large enterprises with many legacy systems and apps | Similar to coarse-grained aligned mesh, but governance improves consistency at scale |
Network / Infrastructure Impact | Potential for heavy network utilization and infrastructure duplication | More efficient central infrastructure with shared storage and compute pools | Some reduction in duplication with central platform; moderate overhead | Balanced infrastructure demands due to group alignment | Infrastructure complexity due to large domain size and app count | Higher infrastructure cost but managed for compliance and quality |
Challenges & Risks | Requires consensus on standards; potential data gravity vs decentralization conflict; costly infrastructure | Longer time to market, potential domain coupling; challenge in multi-cloud seamless governance | Management overhead with mixed governance; complex rules for data distribution | Need strong architectural guidance; boundaries may be fluid and require attention | Data alignment issues with domain boundaries; capability duplication | Balancing autonomy with strong governance may slow flexibility |
Governed Data Characteristics | Metadata governance centralized, data governance mostly at domain level | Stronger data quality, compliance, and governance enforced centrally | Governance mixed: central for product creation, federated for consumption | Governance focuses on cross-domain data product standards | Governance policies critical due to scale and complexity | Governance addresses time-variant, compliance, and quality controls |
Use Cases | Cloud-native, multi-cloud companies with many skilled engineers and high autonomy | Financial institutions, governments valuing compliance over agility | Organizations with legacy systems or lacking fully skilled teams; partial mesh | Organizations needing stream-alignment or hyper-specialized domain cooperation (e.g., supply chain) | Large enterprises with complex merged systems & applications | Large enterprises needing governance and compliance in complex domains |