Fundamentals
- DE Lifecycle
- Data Access
- Data Storage
- Architecture Patterns
- Testing
- Infrastructure as Code
- Security
- Data Mesh
The data engineering (DE) lifecycle describes the stages involved in taking raw data from its origin to a usable format for analytics, reporting, and machine learning. The typical stages are Generation, where data is created; Storage, where it's held; Ingestion, where it's brought into a system; Transformation, where it's cleaned and processed; and Serving, where it's made available to users and applications. This structured process ensures the consistent delivery of high-quality data products and helps data engineers build reliable data pipelines.
Stages
Stage | Description | Key Activities |
---|---|---|
Generation | Data originates from source systems like databases, apps, IoT devices, APIs, files, and web services. Data engineers must understand their formats, generation velocity, and integration protocols | Understanding data formats, generation velocity, integration protocols, schema analysis, connectivity, and business logic |
Evaluating Source Systems | Data engineers must understand how source systems generate data, including their quirks, behaviors, and limitations, to design effective ingestion pipelines | Managing schemas, handling inconsistencies, and ensuring reliable data extraction |
Ingestion | Ingestion refers to the process of moving data from generating sources into a centralized processing system (data lake, warehouse, stream processor), either in batch or real-time (streaming) modes. Source systems and ingestion are critical chokepoints - a single data hiccup can disrupt the entire pipeline, breaking downstream processes and creating ripple effects | Selecting ingestion patterns (batch vs. streaming), validating and monitoring pipeline flows, handling schema drift, initial data quality checks |
Data Storage | Data at every stage - raw, cleaned, modeled - may be persistently stored for reliability, auditability, and downstream processing. Storage architectures include data lakes (raw staging), data warehouses (structured, analytics-ready), and hybrid lakehouse solutions | Choosing storage types, optimizing for scalability and cost, enforcing security and backup protocols, supporting data versioning and lineage |
Transformation | Converts raw ingested data into cleaned, standardized, enriched formats suitable for analytics and ML. Transformations can be orchestrated via ETL/ELT tools, SQL scripts, or data workflow managers. The Medallion Architecture often structures this into Bronze (raw), Silver (cleaned), and Gold (aggregated) layers | Cleansing, data normalization and format conversion, business logic and enrichment, aggregations, modeling, statistical summarization, validation and data quality testing |
Serving | Transformed data must be delivered to stakeholders or applications for actual use. This can involve feeding BI dashboards, analytics platforms, ML models, or external systems via APIs or reverse ETL for operational analytics | Providing data to BI dashboards, analytics platforms, or reporting tools; feeding machine learning models; supplying external systems via APIs or reverse ETL; ensuring reliability, freshness, and security for all consumers |
Undercurrents | Several critical themes run through all stages of the data engineering lifecycle | Security, data management, DataOps, data architecture, orchestration, and software engineering best practices |
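A minimal, self-contained sketch may help connect these stages in code; the source events, table names, and aggregate below are illustrative assumptions, using only the Python standard library and an in-memory SQLite database.

```python
import json
import sqlite3

def generate():
    # Generation: pretend a source system emits order events as JSON strings
    return ['{"order_id": 1, "amount": 120.0}', '{"order_id": 2, "amount": 75.5}']

def ingest_and_store(conn, events):
    # Ingestion + storage: land raw events in a "bronze" table as-is
    conn.execute("CREATE TABLE IF NOT EXISTS raw_orders (payload TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?)", [(e,) for e in events])

def transform(conn):
    # Transformation: parse and clean raw payloads into a typed "silver" table
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL)")
    for (payload,) in conn.execute("SELECT payload FROM raw_orders"):
        row = json.loads(payload)
        conn.execute("INSERT INTO orders VALUES (?, ?)", (row["order_id"], row["amount"]))

def serve(conn):
    # Serving: expose an aggregate that a dashboard or API could consume
    return conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()

conn = sqlite3.connect(":memory:")
ingest_and_store(conn, generate())
transform(conn)
print(serve(conn))  # (2, 195.5)
```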
- Access Frequency
- Data Types
Aspect | Hot Data | Lukewarm (Warm) Data | Cold Data |
---|---|---|---|
Definition | Frequently accessed, high-value, real-time or near-real-time data | Moderately accessed, regularly needed but not instant | Infrequently accessed, usually retained for archival purposes |
Access Frequency | Constant, immediate access | Scheduled access, hours to days apart | Rare access, weeks to years apart
Access Latency | Sub-second or millisecond | Seconds to minutes | Minutes to hours |
Storage Media | RAM, in-memory database, SSDs, high-performance NAS | Mid-tier SSDs, high-speed HDDs, cloud object storage | Low-cost HDDs, archival cloud storage |
Retention Policy | Short-term, transactional | Weeks to months, operational | Long-term, years (or indefinitely for compliance) |
Data Volume | Typically smaller, volume managed for speed | Medium | Very large, bulk data |
Data Value | Immediate, high business impact | Useful, moderate business impact | Historical, regulatory, analytical |
Security Requirements | Highest, critical for business operations | Moderate, standard access protection | Encryption, integrity, regulatory compliance |
Scalability | Vertical scaling for speed | Horizontal scaling, cost-performance balance | Massive horizontal scaling, low access needs |
Challenges | High cost, data lifecycle, scalability | Balancing cost and access | Retrieval speed, data integrity, long-term maintenance |
Use Cases | Fraud detection, real-time stock trading, network monitoring | Monthly business reporting, operational data | Legal audits, disaster recovery, regulatory reporting |
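As a rough illustration of how such tiers might be applied programmatically, the sketch below routes records by access recency; the thresholds and tier names are assumptions, not standards.

```python
from datetime import datetime, timedelta

# Illustrative tiering rule: pick a storage tier based on how recently the
# data was accessed. Thresholds are assumptions for demonstration only.
def choose_tier(last_accessed, now=None):
    age = (now or datetime.utcnow()) - last_accessed
    if age <= timedelta(days=1):
        return "hot"    # in-memory store or SSD-backed database
    if age <= timedelta(days=90):
        return "warm"   # mid-tier SSD or standard cloud object storage
    return "cold"       # low-cost archival storage

print(choose_tier(datetime.utcnow() - timedelta(days=200)))  # cold
```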
Data Type | Definition | Characteristics | Examples | Use Cases | Notes |
---|---|---|---|---|---|
Raw Data | Unprocessed, unfiltered data collected directly from sources without any manipulation or analysis | Original, uncleaned, may contain errors or noise; very granular; inflexible until processed | Sensor readings, transaction logs, survey responses in original form, unedited images or video files | Foundation for all analysis; requires cleaning, transformation, and validation | Also called primary data or source data; critical for unbiased analysis |
Quantitative Data | Numerical data representing measurable quantities that can be counted or measured and subjected to math operations | Numeric, measurable, often continuous or discrete; suitable for statistical and mathematical modeling | Heights, weights, sales numbers, temperatures, survey ratings (on scale) | Statistical analysis, predictive modeling, machine learning, numerical summarization | Often stored as structured data but can appear semi-structured or raw |
Qualitative Data | Non-numerical, descriptive data representing categories, concepts, or subjective qualities | Categorical or textual data; may be structured (coded categories) or unstructured (open text) | Interview transcripts, social media posts, survey open responses, observations | Content analysis, thematic coding, sentiment analysis, NLP applications | Includes both structured categorical data and unstructured textual/media data |
Structured Data | Data organized into predefined models or schemas, typically tabular with rows and columns | Highly organized; easy to store, search, and analyze; conforms to relational databases or spreadsheets | Databases with customer info, spreadsheets with sales data, financial records | Used in relational databases, data warehouses, and analytical reporting | Often quantitative and categorical (qualitative) data; easiest to handle computationally |
Unstructured Data | Data without a predefined schema or organization, often qualitative and complex | No fixed format; text-heavy or multimedia; requires advanced techniques for parsing and analysis | Emails, videos, audio recordings, social media feeds, documents, images | Text mining, image/video analysis, sentiment analysis, AI-driven extraction | Increasingly important with big data, but challenging to manage and analyze |
Semi-structured Data | Data that does not fit rigid schemas but contains tags or markers to separate elements for easier processing | Hybrid of structured and unstructured; contains metadata or tags alongside free-form content | JSON, XML, CSV files, HTML documents, tagged multimedia files | Easier to process than unstructured; used for web data, APIs, and metadata extraction | Provides flexibility of unstructured with some model-driven processing advantages |
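A short sketch of what these shapes look like in practice, using only the standard library; the sample payloads are invented for illustration.

```python
import csv
import io
import json

# Structured data: tabular, schema implied by the header row.
csv_text = "customer_id,amount\n1,120.0\n2,75.5\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["amount"])            # "120.0" (still a string; types must be enforced)

# Semi-structured data: tagged fields, nesting and optional elements allowed.
json_text = '{"customer_id": 1, "tags": ["vip"], "address": {"city": "Oslo"}}'
record = json.loads(json_text)
print(record["address"]["city"])    # "Oslo"

# Unstructured data (free text, images, audio) has no such markers and needs
# NLP or other specialized techniques rather than simple parsing.
```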
- Data Collection
- Data Modeling
- Slowly Changing Dimension
- Schemas
Bounded vs. Unbounded Data
Aspect | Bounded Data | Unbounded Data |
---|---|---|
Definition | Finite data set with a known start and end point | Infinite or continuously growing data with no predefined end |
Data Characteristics | Fixed size, complete, and unchanging once fully collected | Potentially infinite, dynamic, and continuously generated |
Examples | Historical sales data, completed dataset for a specific period (e.g., last quarter sales) | Streaming logs, real-time sensor data, social media feeds |
Processing Model | Batch processing - data processed as a whole after collection | Stream processing - data processed incrementally as it arrives |
Data Ordering | Typically sequential and complete, allowing deterministic processing | May be out-of-order, delayed, or non-sequential due to latency and distributed sources |
Timing | Processed after data collection, often with latency (days, hours) | Processed in real-time or near real-time with minimal delay |
System Architectures | Traditional Data Warehouses, ETL pipelines, batch-oriented systems | Streaming platforms like Apache Kafka, Apache Flink, Apache Beam, Spark Streaming |
Storage Requirements | Larger upfront storage for the entire dataset | Continuous storage needs with potential for state management or windowing to handle data volume |
Computation Model | Deterministic and re-runnable computations on fixed data sets | Incremental, stateful computations with approximate processing or windowing to manage infinite data |
System Complexity | Lower complexity in handling data consistency and completeness | Higher complexity to handle out-of-order events, late data, and exactly-once processing guarantees |
Error Handling | Errors can be corrected in batch runs before analysis | Needs continuous monitoring and corrective mechanisms to handle anomalies in a live stream |
Scalability Challenges | Scalability mainly in storage and batch job execution | Requires scalable infrastructure to handle continuous high-throughput data ingestion and processing |
Latency | Higher latency acceptable due to batch processing nature | Low latency required to provide timely insights or actions |
Architectural Patterns | ETL, Lambda Architecture (batch layer dominant) | Kappa Architecture, unified stream processing approach combining batch and stream |
Data Completeness | Complete view of the dataset after processing | Incomplete snapshots at any point, with evolving data as stream progresses |
Examples from Real World | Financial reports for closed fiscal year; archived web logs | Network packet captures; social media mentions; real-time transaction feeds |
Use Cases | Historical analytics, reporting, compliance auditing | Real-time analytics, alerting, fraud detection, IoT monitoring |
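The practical difference shows up in how aggregation is written. The sketch below contrasts a one-pass batch aggregate over a finite list with incremental, windowed results over a never-ending generator; the generator stands in for a real stream processor and makes no claims about any specific framework.

```python
from itertools import islice

# Bounded data: the whole dataset is known, so we aggregate it in one pass.
bounded = [12, 7, 30, 5]
print(sum(bounded) / len(bounded))  # deterministic, re-runnable

# Unbounded data: an endless generator stands in for a stream; we can only
# compute incremental results over windows as events arrive.
def sensor_stream():
    i = 0
    while True:            # never terminates - there is no "end" to wait for
        yield i % 10
        i += 1

def windowed_averages(stream, window_size=5):
    while True:
        window = list(islice(stream, window_size))
        yield sum(window) / window_size

averages = windowed_averages(sensor_stream())
print(next(averages), next(averages))  # rolling results, one window at a time
```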
Batch vs. Micro-Batch vs. Real-Time Processing
Aspect | Batch Processing | Micro-Batch Processing | Real-Time Processing |
---|---|---|---|
Definition | Collects and stores data over a period, processing all at once | Processes data in small batches at short, regular intervals | Processes data immediately as it arrives, near-instantaneous |
Frequency | Low frequency (e.g., hourly, daily, monthly) | Medium frequency (seconds to minutes intervals) | High frequency (sub-second to real-time continuous) |
Latency | High latency due to waiting for batch completion | Moderate latency, quicker than batch but not instantaneous | Very low latency, near-instant results |
Data Volume | Large volumes of accumulated data | Smaller chunks of data per batch | Continuous streams of individual events |
Complexity | Simple to implement and manage | Moderate complexity, combines batch and streaming elements | High complexity requiring advanced architecture and tooling |
Resource Utilization | Efficient resource use, runs during off-peak times | More frequent resource use than batch, less than streaming | Resource-intensive, requires horizontal scaling |
Processing Model | Triggered by schedule (time or data volume) | Triggered by time interval or data size threshold | Constant event-driven processing |
Stateful Processing Support | Yes, often requires stateful operations | Supports small state, similar to batch | Usually stateless or manages small state due to speed demand |
Data Freshness | Lower data freshness, data available after processing batch | Near real-time freshness, data is updated every few minutes | Highest data freshness, updates data as it arrives |
Fault Tolerance | Easier to handle failures with retries during next batch | Moderate fault tolerance | Requires robust fault tolerance mechanisms, checkpointing |
Typical Technologies | Apache Spark batch jobs, Hadoop MapReduce | Apache Spark Streaming, Fluentd, Logstash | Apache Kafka, Apache Flink, Apache Pulsar |
Cost | Lower operational cost due to infrequency | Moderate operational costs | Higher cost due to continuous processing and infrastructure |
Use Cases | End-of-day reports, billing, historical analytics | Incremental dashboard updates, near real-time user behavior | Fraud detection, monitoring, live analytics |
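A minimal sketch of the micro-batch pattern, assuming a simple in-process event source: events are buffered and flushed when either a size threshold or a time interval is reached. The flush target is just `print` here; a real pipeline would write to storage or a downstream topic.

```python
import time

# Hypothetical micro-batch loop: buffer incoming events and flush them when
# the buffer reaches a size threshold or an interval elapses.
def micro_batch(events, max_size=3, max_wait_s=1.0, flush=print):
    buffer, last_flush = [], time.monotonic()
    for event in events:
        buffer.append(event)
        if len(buffer) >= max_size or time.monotonic() - last_flush >= max_wait_s:
            flush(buffer)            # in practice: write to storage / downstream topic
            buffer, last_flush = [], time.monotonic()
    if buffer:
        flush(buffer)                # flush the final partial batch

micro_batch(range(7))  # -> [0, 1, 2], [3, 4, 5], [6]
```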
Pull vs. Push
Aspect | Pull | Push |
---|---|---|
Initiation | Data target pulls data from the source by requesting it explicitly, as needed or at scheduled intervals | Data source initiates and sends data to the target automatically when data is available |
Control of Flow | Target controls when and how much data to ingest (e.g., batch size, frequency) | Source controls the data flow; target has little control over rate or timing |
Real-time Capability | Can be near-real-time but often involves periodic polling. Typically higher latency than push | Immediate; real-time delivery as soon as new data is generated |
Scalability | Highly scalable; multiple consumers can independently fetch data at their own pace, easier replication, supports distributed scaling | May overwhelm consumers if the source produces more data than the targets can handle; hard to optimize for multiple consumers |
Replayability/Recovery | Easier to recover or reprocess missed data, as the consumer can retry requests or fetch from specific offsets | Replay is challenging - if a consumer misses data, it's hard to get the missing pieces back unless a buffer or queue is used |
Latency | May introduce latency depending on polling frequency and network delays | Low latency; pushes changes to consumers as soon as available |
Efficiency | May require more bandwidth for frequent polling; less efficient for frequent changes unless optimized | Efficient for sources with frequent changes or high update rates - useful for event-driven architectures |
Security | Target must connect to source, requiring bidirectional communication and strong security layers on the source | More secure for sources; the source doesn't listen for network connections, reducing attack surfaces |
Operational Complexity | Source must allow for external requests, potentially more firewall and authentication setup; simpler consumer-side error handling and scaling | Potentially less operational overhead if rate limiting and buffering are handled; but flow control and backpressure management are harder |
Data Ownership | Consumer chooses what, when, and how much to ingest, offering flexibility for diverse requirements | Source knows and manages its own data, ensuring accurate and robust delivery |
Implementation Details | Requires periodic scheduler or polling mechanism; easier integration with existing APIs and systems | Requires consumers to implement logic for handling unsolicited data, queuing, or buffering; flow and rate limiting complex |
Hybrid Patterns | Hybrid approaches leverage strengths of both, such as push for immediate updates and pull for detailed/batch data | Often combined - e.g., system pushes notifications but clients pull detailed data as needed |
Consistency Guarantees | Easier to achieve exactly-once or at-least-once semantics with systems like Kafka | Needs careful orchestration for strong consistency, especially in distributed setups |
Common Technologies | REST APIs, scheduled ETL, database dumps, Kafka Connect, batch queries, polling | Webhooks, real-time streaming (e.g., MQTT), proprietary push APIs, some ETL tools |
Use Cases | Reporting, batch processing, periodic data sync, data lakes, less time-sensitive operations | Time-sensitive, high-frequency events, IoT devices, notification systems, or real-time analytics |
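The sketch below contrasts the two initiation styles with a toy in-process source: the consumer either registers a callback (push, webhook-like) or polls for new records and tracks its own offset (pull). Class and event names are illustrative.

```python
# A toy source that accumulates events; names and structure are illustrative.
class Source:
    def __init__(self):
        self.events, self.subscribers = [], []
    def emit(self, event):
        self.events.append(event)
        for callback in self.subscribers:   # push: the source calls the consumer
            callback(event)
    def fetch(self, since):
        return self.events[since:]          # pull: the consumer asks for new data

source = Source()

# Push: the consumer registers a callback (like a webhook) and waits.
source.subscribers.append(lambda e: print("pushed:", e))

# Pull: the consumer polls on its own schedule and controls the batch size.
offset = 0
source.emit("order_created")
new = source.fetch(offset)
offset += len(new)
print("pulled:", new)
```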
Data Modeling Technique | Definition | Characteristics | Advantages | Disadvantages | Use Cases | Examples |
---|---|---|---|---|---|---|
Conceptual Data Modeling | High-level, abstract model focusing on business entities and relationships without technical detail | Entities and relationships shown typically via ER diagrams or UML class diagrams | Platform-neutral, easy communication with business stakeholders | Lacks technical detail for implementation | Business planning, stakeholder alignment, early project stages | ER Diagrams, UML |
Logical Data Modeling | Detailed model defining data elements, attributes, relationships, keys, and rules without platform dependency | Normalized tables, keys, constraints, and relationships; focus on data integrity and structure | Provides clean, normalized design; aids data quality and governance | Does not address physical storage, indexing, or performance optimization | Schema design, data governance, preparing for physical modeling | 3NF, Data Vault modeling (hubs-links-satellites) |
Physical Data Modeling | Model optimized for database implementation specifying tables, columns, indexes, partitions, etc | Denormalized/normalized tables, platform-specific constructs like indexes, partitions, cluster keys | Optimizes storage, retrieval, and query performance | Tightly coupled to specific technologies, less flexible | Performance tuning and optimization for specific DB engines | Star schema, Snowflake schema, Anchor Modeling |
Dimensional Modeling | Simplifies data structures for analytical queries grouping data into facts and dimensions | Fact tables (numeric measures) linked to dimension tables (contextual descriptors) like star or snowflake schema | Intuitive for analysts, improves query speed for OLAP workloads | Less suitable for transactional systems, more redundancy | BI, data warehouses, dashboards, self-service analytics | Star Schema, Snowflake Schema, Slowly Changing Dimensions (SCDs) |
Relational Data Modeling | Organizes data in normalized tables ensuring minimal redundancy and data consistency | Tables with rows and columns, defined primary and foreign keys, normalized forms | Strong data integrity, widely supported, good for complex relationships | Can be complex to query for analytical workloads, performance overhead for joins | OLTP systems, master data management, transactional apps | 2NF, 3NF, Boyce-Codd Normal Form (BCNF), ER diagrams |
Entity-Relationship (ER) Model | Represents entities, attributes, and their relationships for database design | Entities represented as objects/tables; attributes as columns; relationships with cardinality and optionality | Clear visualization of data relationships, promotes normalization | May become complex for large systems, design only | Database design, relational databases | Chen ER Model, Crow's Foot notation |
Object-Oriented Data Modeling | Combines data with behavior, encapsulates data and operations together representing objects | Objects with attributes and methods; supports inheritance, classes, and polymorphism | Closer to real-world modeling, reusable components | Complexity, less common in traditional DBs | Object databases, applications using OOP principles | Classes, inheritance hierarchies |
Hierarchical Data Modeling | Organizes data in tree-like parent-child relationships | Strict one-to-many parent-child relationships; records organized in a hierarchy | Simple and fast navigation in one-to-many data | Inflexible with many-to-many or complex relationships | Legacy systems, XML/JSON document stores, file systems | IMS database, XML schemas |
Network Data Modeling | Extends hierarchical to allow many-to-many relationships | Graph-like structures, records with multiple owners or parents | More flexible than hierarchical, models complex relationships | More complex design and management than relational | Complex interconnected data like telecommunications, logistics | CODASYL, Graph databases (Neo4j, Amazon Neptune) |
Temporal/Historical Modeling | Tracks data changes over time for auditing, historical analysis | Stores multiple data versions with timestamps for valid and transaction time | Supports full data history and versioning, improves auditability | Increases data storage and complexity | Compliance, time-series, audit trails, customer lifecycle | Bitemporal modeling, Slowly Changing Dimensions (SCDs), Anchor modeling |
Agile Data Modeling | Enables iterative and flexible modeling adapting to evolving business needs | Combines techniques, emphasizes collaboration and incremental updates | Highly adaptable, incorporates feedback quickly | Can lack initial rigor, may lead to inconsistent models | Rapid development environments, evolving business domains | Often combined with other models in Agile projects |
Big Data Modeling | Tailored to handle volume, velocity, and variety of big data | May use NoSQL schema-on-read, data lakes, schemas for semi-structured data | Scales for huge data volumes, flexible schema | Less mature standards, potential for data inconsistency | Big data platforms, streaming analytics | Schema-on-read, Hadoop, NoSQL, Data lakehouse |
Inmon | Corporate Information Factory (CIF). Enterprise-wide data architecture integrating various data sources into a centralized warehouse. Flow: Sources → Staging (ETL) → Enterprise Data Warehouse (Data stored in 3NF) → Data Marts → Consumption | Top-down approach, normalized data warehouse, data marts for specific domains | Comprehensive, consistent enterprise view | Complex, time-consuming implementation | Large enterprises needing integrated data | Normalized data warehouse, data marts |
Kimball | Bus Architecture. Dimensional modeling approach focusing on ease of use and performance. Flow: Sources → Staging (ETL) → Enterprise Data Warehouse (star schema) → Data Marts → Consumption | Bottom-up approach, data marts for specific business areas | Fast query performance, user-friendly data structures | Can lead to data silos, less comprehensive view | Mid-sized to large enterprises with specific reporting needs | Star schema, snowflake schema, data marts |
Data Vault (Linstedt) | Hybrid approach combining elements of 3NF and dimensional modeling. Flow: Sources → Staging (ETL) → Raw Data Vault → Business Data Vault → Data Marts → Consumption | Focuses on agility and scalability, accommodating changes easily | Supports historical tracking and auditability | Can be complex to implement and manage | Organizations needing flexibility and rapid change adaptation | Data vault model, hubs, links, satellites |
Slowly Changing Dimensions (SCDs) are dimension tables in data warehouses where attribute values change slowly over time. Unlike frequently changing fact data, dimension data (e.g., customer details, product attributes) requires historical tracking for accurate reporting. SCDs manage these changes in various ways to meet different business and analysis needs.
Aspect | Type 0: Retain Original | Type 1: Overwrite | Type 2: Add New Row | Type 3: Add New Attribute | Type 4: Add History Table | Type 5: Add Mini-Dimension | Type 6: Combined Approach | Type 7: Hybrid Approach |
---|---|---|---|---|---|---|---|---|
Description | Attribute never changes; always original value | Overwrite old data; no history kept | Insert new row for each change; full history tracked | Add new column(s) to track limited previous value | Store historical data in separate history table | Create a mini-dimension for frequently changing attributes | Combines Types 1, 2, 3 in one model (overwrite, add row, add attribute) | Combines various SCD techniques beyond Type 6 for adaptive needs |
Visualization | The attribute never changes, so the entity design simply holds the original column with no alteration | Changes overwrite old values, and there is no history kept | New row is added for each change, keeping full history. Start and end dates track consistency | One or more additional columns retain limited history (e.g., previous value) | Separate history table is created to maintain full change history, keeping the current state in the main dimension | Mini-dimension stores rapidly changing attributes separately, referenced by the main dimension | Combines Types 1, 2, and 3. Maintains both current values (Type 1) and full history (Type 2) and a previous attribute (Type 3) | Flexible hybrid approach, often combining multiple SCD strategies for different columns depending on business requirements |
Change Handling Method | No update | Overwrite existing values | Add new record per change | Add new column to track previous value | Use separate history table for old data | Extract frequently changing attributes into separate mini-dim table | Current and historical columns plus version column | Flexible, combines multiple change management techniques |
Historical Data Tracking | No | No | Yes | Limited (only one previous value) | Yes | Partial history through mini-dims | Yes | Yes |
Storage Impact | Minimal | Minimal | High (multiple rows per entity) | Moderate (additional columns) | Moderate to High (two tables) | Moderate (extra mini-dim tables) | High (due to multiple approaches combined) | Variable, depends on component types used |
Query Complexity | Very simple | Simple | More complex due to multiple rows | Simple for limited history | Moderate due to joins with history table | Moderate (joins with mini-dim) | Moderate to complex | Complex, depending on combination used |
Pros | Simple; fast queries | Easy implementation, fast update | Full historical data tracking | Easy access to current and prior value | Clear separation of current and historical data | Improves query performance for frequent small changes | Flexible; combines best of types 1, 2, 3 | Highly adaptable to complex scenarios |
Cons | No history, no ability to analyze change | History lost | Adds storage; may impact performance | Only tracks limited history, not scalable | Extra complexity with multiple tables | Additional ETL and dimensional complexity | Complexity; maintenance overhead | High complexity; requires sophisticated design |
Implementation Complexity | Low | Low | Moderate to high | Low to moderate | Moderate to high | Moderate to high | High | Very high |
Impact on Performance | Minimal | Minimal | Can degrade with large historical data | Moderate | Moderate | Moderate | Can be performance intensive | Depends on implemented hybrid techniques |
Dimension Table Action | No change to attribute value | Overwrite attribute value | Add new dimension row for profile with new attribute value | Add new column to preserve attribute's current and prior values | Move changed attribute values to a separate history table; keep current values in the main dimension | Add mini-dimension table containing rapidly changing attributes, referenced by a key in the base dimension | Add type 1 overwritten attributes to type 2 dimension row, and overwrite all prior dimension rows | Add type 2 dimension row with new attribute value, plus view limited to current rows and/or attribute values |
Impact on Fact Analysis | Facts associated with attribute's original value | Facts associated with attribute's current value | Facts associated with attribute value in effect when fact occurred | Facts associated with both current and prior attribute alternative values | Facts joined to the history table recover attribute values in effect when the fact occurred | Facts associated with rapidly changing attributes in effect when fact occurred, plus current rapidly changing attribute values | Facts associated with attribute value in effect when fact occurred, plus current values | Facts associated with attribute value in effect when fact occurred, plus current values |
Use Cases | Static attributes like SSN, zip codes | Correcting typos, non-critical updates e.g. email, phone | Track full history of customer address, employee job changes | Track current and previous salary, status | Maintain full historical pricing, employment data | Track attributes like customer segmentation that change frequently | Employee role and department tracking with full change history | Complex enterprise needs, combining multiple SCD styles |
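A minimal SCD Type 2 sketch in plain Python: each change closes the current row with an end date and appends a new current row. Column names (`is_current`, `start_date`, `end_date`) are illustrative conventions rather than requirements.

```python
from datetime import date

# Minimal SCD Type 2: close the current version and insert a new one,
# preserving full history. Column names are illustrative.
def apply_scd2(dimension, customer_id, new_address, today=None):
    today = today or date.today()
    for row in dimension:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["address"] == new_address:
                return dimension                      # no change, nothing to do
            row["is_current"] = False
            row["end_date"] = today                   # close the old version
    dimension.append({
        "customer_id": customer_id, "address": new_address,
        "start_date": today, "end_date": None, "is_current": True,
    })
    return dimension

dim = [{"customer_id": 1, "address": "Old St 1",
        "start_date": date(2020, 1, 1), "end_date": None, "is_current": True}]
apply_scd2(dim, 1, "New Ave 2")
print(len(dim), dim[0]["is_current"], dim[1]["is_current"])  # 2 False True
```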
Schema Types
Aspect | Physical Schema | Logical Schema | Evolving Schema | Contractual Schema (API) | Metadata Schema |
---|---|---|---|---|---|
Definition | Describes how data is physically stored and arranged (files, indices, partitions) on storage media or DBMS | Defines the logical (human-readable) structure: tables, fields, relationships, constraints | Captures the actual schema changes (add/remove fields) over time, typically in dynamic or pipeline-driven systems | Schema defining fields and their validation between systems via an API contract (e.g., JSON, GraphQL) | Schema that describes the data about data, such as lineage, column descriptions, and governance |
Level of Abstraction | Lowest: hardware, file system, storage block level | Higher: data model, independent of storage | Variable: follows either physical or logical but adapts to change | Variable: can be logical or physical depending on API implementation | Varies: may refer to logical, physical, or conceptual layers |
Focus | Performance, storage efficiency, physical locations | Data organization, integrity, relationships, constraints | Handling schema drift, flexibility for changes | Interface definition, data validation, compatibility | Data governance, lineage, quality, observability |
Typical Stakeholders | DBAs, infrastructure engineers | Data modelers, analysts, architects | Data engineers, analytics teams | Backend engineers, API consumers/producers | Data governance, compliance, data stewards |
Benefits | Maximizes storage & query performance; supports tuning, scaling | Ensures consistency, maintainability, integrity of business logic | Enables rapid evolution, tracks change, minimizes disruption | Allows machine interoperability, enforces standards, prevents breakage | Aids data discovery, quality, lineage, and regulatory compliance |
Limitations | Complex to change, tightly-coupled to hardware/DBMS | May hide physical inefficiencies, less relevant for storage choices | Risk of data loss or incompatibility if not managed well | Tight coupling can hinder API flexibility, requires documentation | Can become outdated or incomplete without good processes |
Use Cases | DBMS optimization, partitions, indexes, backup/recovery strategies | ER diagrams, database normalization, data modeling | ELT pipelines, analytics, SaaS product changes | API design, system integration, microservices communication | Data catalogs, pipeline documentation, lineage tracking |
Examples | Parquet files with partitioning; index files for tables; disk layouts | Star schema; ERD; relational database definitions | Adding new analytics events; updating field names in ELT | REST/GraphQL/OpenAPI schema definitions; JSON schema | dbt sources.yml; OpenMetadata; catalog records; lineage graphs |
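As an illustration of a contractual schema, the sketch below validates an API payload against a hand-written field/type contract; real systems would typically use JSON Schema, Avro, or Protobuf definitions instead, and the field names here are assumptions.

```python
# Minimal contractual-schema check for an API payload; field names are illustrative.
CONTRACT = {"order_id": int, "amount": float, "currency": str}

def validate(payload: dict, contract: dict = CONTRACT) -> list[str]:
    errors = []
    for field, expected_type in contract.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    return errors

print(validate({"order_id": 7, "amount": 19.99, "currency": "EUR"}))  # []
print(validate({"order_id": "7", "amount": 19.99}))  # type error + missing field
```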
Star vs. Snowflake vs. Galaxy Schema
Aspect | Star Schema | Snowflake Schema | Galaxy Schema |
---|---|---|---|
Structure | Central fact table linked to denormalized dimension tables | Fact table linked to normalized dimension tables, split hierarchically | Multiple fact tables sharing dimension tables, can be a mix of star and snowflake |
Data Normalization | Dimension tables are denormalized (flat structure, redundancy present) | Dimension tables are normalized (data split into sub-tables, minimal redundancy) | Typically involves normalized or partially normalized dimension tables to reduce data redundancy. Dimensions are often conformed (shared across fact tables). Normalization level can vary depending on design goals |
Query Performance | Faster query execution due to fewer joins | Slower query execution due to multiple joins required | Performance can vary; may benefit from fewer joins but could be impacted by complexity |
Query Complexity | Simpler queries, fewer joins, easy to write and understand | More complex queries, requires deeper understanding and multiple joins | Queries can be complex due to multiple fact tables and shared dimensions; requires good understanding of schema |
Storage Requirements | Higher storage use due to redundant and denormalized data | More storage efficient; reduced duplication through normalization | Storage efficiency varies; can be optimized through shared dimensions but may still have redundancy depending on design |
Data Redundancy | Higher - dimensions repeat attribute values in multiple rows | Lower - most redundant data is eliminated | Varies - some redundancy may remain depending on design |
Space Usage | More storage space required for large datasets | Less storage space through normalization | Varies - can be optimized but may still require significant space depending on data volume and design |
Foreign Keys | Fewer foreign keys (simple design) | More foreign keys due to multiple related tables | Multiple foreign keys due to shared dimensions; complexity depends on design |
Data Integrity | Lower: Denormalization risks inconsistency due to data being updated in many places | Higher: Normalization enforces referential integrity and consistency | Varies - can be managed but may require more effort to maintain consistency |
Updates and Modifications | Harder to update - redundant data increases risk of inconsistent modifications | Easier for updates - changes in an attribute only affect one table | Varies - updates may be easier due to shared dimensions but can be complex depending on relationships |
Dimension Table Structure | Flat structure - each dimension is a single table, no sub-tables | Multi-layered - each dimension may be decomposed into sub-dimensions | Varies - dimensions can be flat or multi-layered depending on design |
BI & Reporting Suitability | Best for BI tools, dashboards, and quick ad hoc queries | Better for complex analytical queries, detailed reporting, and multidimensional analysis | Suitable for complex reporting needs involving multiple fact tables; requires good understanding of schema |
Maintainability | Easier to maintain, intuitive design | More difficult to maintain, complex design | Varies - can be complex to maintain due to multiple fact tables and shared dimensions |
Design Complexity | Easier and faster to design and implement | Requires careful design due to hierarchical splitting | Varies - can be complex to design depending on relationships and shared dimensions |
Scalability | Scalable for typical analytic workloads, though can suffer performance issues at extreme scale due to redundancy | Good scalability, especially for complex and large-scale data with multiple hierarchies | Scalability varies; can handle complex data but may require careful design to avoid performance bottlenecks |
ETL/ELT Complexity | Simpler ETL/ELT pipelines - fewer tables to populate and maintain | More complex ETL/ELT - hierarchical normalization requires careful loading and management | ETL/ELT complexity varies; may require more sophisticated pipelines to manage multiple fact tables and shared dimensions |
Drawbacks | Data redundancy, storage waste, potential for inconsistencies, not suited for high-cardinality or complex hierarchies | Query slowness for basic analytics, complexity in query construction and ETL, harder for non-technical users to understand and navigate | Complexity in design and maintenance, potential performance issues if not well-optimized |
Use Cases | Retail sales analysis with simple product/geography/time/customer dimensions | Data warehouses with complex product/customer/location hierarchies, and systems requiring fine-grained data integrity | Enterprise data warehouses with multiple business processes, complex reporting needs, and shared dimensions across fact tables |
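A tiny star schema sketched with SQLite: one fact table, one denormalized dimension, and the kind of join-and-aggregate query BI tools generate. Table and column names are illustrative.

```python
import sqlite3

# A tiny star schema: one fact table plus a denormalized date dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales (date_key INTEGER REFERENCES dim_date(date_key), amount REAL);
INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
INSERT INTO fact_sales VALUES (20240101, 100.0), (20240101, 50.0), (20240201, 80.0);
""")

# Typical star-schema query: join the fact to a dimension and aggregate.
query = """
SELECT d.year, d.month, SUM(f.amount) AS revenue
FROM fact_sales f JOIN dim_date d ON f.date_key = d.date_key
GROUP BY d.year, d.month
"""
print(conn.execute(query).fetchall())  # [(2024, 1, 150.0), (2024, 2, 80.0)]
```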
Lambda vs. Kappa
Aspect | Lambda | Kappa |
---|---|---|
Processing Model | Combines batch processing and real-time stream processing in separate layers | Uses a single, unified stream processing pipeline for both real-time and reprocessing |
Processing Layers | Three layers: Batch Layer (large-scale processing), Speed Layer (real-time), Serving Layer (query) | Single pipeline for all data, eliminating the batch layer |
Complexity | High complexity; requires maintaining and synchronizing two separate codebases and pipelines | Simpler architecture; only one processing pipeline to maintain |
Latency | Batch layer processing introduces higher latency; speed layer offers low latency for real-time data | Low latency overall due to continuous stream processing |
Fault Tolerance | Fault tolerant: batch layer can recompute results if speed layer fails or produces errors | Fault tolerant depending on stream processing reliability; relies on log replay for reprocessing errors |
Data Reprocessing Capability | Batch layer enables accurate reprocessing of historical data to fix errors or recompute results | Reprocessing done via replaying events from the log through the stream processor |
Accuracy | High accuracy due to batch layer with complete data; speed layer may produce approximate results | Consistent real-time results but may lack batch-layer level accuracy for complex computations |
Scalability | Scales horizontally but more complex scaling due to separate batch and speed layers | Easier to scale stream processing horizontally; simpler operational model |
Historical Data Handling | Excellent, supports deep historical batch analytics and corrections | Less suited for complex historical data analysis, designed mainly for streaming real-time data |
Implementation Complexity | High development and maintenance effort due to dual pipelines and serving layer integration | Lower implementation and maintenance overhead |
Consistency Between Layers | Requires careful coordination to keep batch and speed outputs consistent | Single pipeline avoids consistency issues inherent in Lambda dual-layer design |
Real-Time Analytics | Provides real-time insights via speed layer but with possible eventual consistency lag | Provides immediate real-time analytics with no separate batch delay |
Support for Complex Analytics | Good support since batch layer handles heavy, complex queries and aggregations | Limited complex analytics, as everything must be handled in stream processing |
Reprocessing Complexity | Batch layer reprocessing is separate and managed independently | Reprocessing simply involves re-consuming the event stream, simplifying error correction |
Data Duplication Risk | Potential for duplication or mismatch between batch and speed layer results if not carefully managed | Minimal duplication risk since there is only one data processing pipeline |
Use Cases | Suitable for systems needing both comprehensive historical analysis and real-time insights | Best for real-time focused applications with simpler operational needs (e.g., IoT, user activity tracking) |
Examples | Recommendation engines, financial modeling, large-scale analytics | Real-time monitoring, IoT analytics, clickstream processing, social media analytics |
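The serving-layer idea behind Lambda can be sketched very simply: merge a complete-but-stale batch view with a fresh-but-partial speed view at query time. The dictionaries below stand in for real materialized views.

```python
# Lambda-style serving: combine a precomputed batch view with a speed-layer
# view that covers only the most recent, not-yet-batched events.
batch_view = {"user_1": 120, "user_2": 45}      # complete but hours old
speed_view = {"user_1": 3, "user_3": 7}         # approximate, last few minutes

def serve(key):
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(serve("user_1"), serve("user_3"))  # 123 7

# In a Kappa architecture there is no batch view: the same totals are
# maintained by one stream job, and corrections are made by replaying the log.
```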
Type | Purpose | Scope | When Performed | Key Techniques | Considerations | Relevance |
---|---|---|---|---|---|---|
Data Quality Testing | Validate accuracy, completeness, consistency, validity, timeliness, and uniqueness of data | Data at rest (tables, datasets) and in-motion (streams) | Often ongoing, triggered by data load or refresh | Profiling, validation rules, anomaly detection, null checks, deduplication | Identifying subtle quality issues, evolving data schemas | Crucial for trustworthy analytics; foundation to all downstream processes |
Data Integrity Testing | Ensure accuracy, completeness, retrievability, verifiability, truthfulness, consistency, and reliability of data throughout its lifecycle | Data storage, processing, transmission, and updates | Routine and triggered by data changes, migrations | Validation rules, checksums, version control, continuous monitoring, domain and entity integrity tests | Managing volume & complexity, real-time validation, compliance, security | Critical to maintain trustworthiness of data; prevents corruption and errors across all data states and systems
Integration Testing | Verify interactions and data flow between integrated components or systems | Endpoints, APIs, data sources, ETL components | After component/unit testing, pre-system integration | API calls, contract validation, mock testing | Managing dependencies, environment setup, flaky tests | Ensure data flows cleanly between systems without loss or corruption |
Performance Testing | Assess system responsiveness, throughput, stability under load | Entire pipeline throughput, resource usage, latency | Pre-release or after significant changes | Load testing, stress testing, volume testing | Simulating realistic load, environment parity | Essential to meet SLAs for batch and streaming jobs, avoid bottlenecks |
Regression Testing | Ensure new code/changes do not break existing data workflows or features | Entire data pipeline or specific modules | After any change or update | Automated retesting, test case prioritization | Test suite maintenance, execution time | Maintain pipeline stability; detect silent errors after changes |
End-to-End Testing | Validate complete workflows from ingestion through transformations to consumption | Across all pipeline stages and downstream applications | Before major releases or deployment | Full process simulation, real user scenario emulation | High complexity, environment parity | Confirm entire data lifecycle works as expected from source to consumer |
Functional Testing | Validates specific functions or business rules within data transformations | Specific ETL jobs, SQL functions, or data logic blocks | During development and after changes | Unit tests, SQL assertions, black-box testing | Test data setup, mock dependencies | Validate correctness of data transformations and business logic |
Compliance Testing | Verify data adherence to legal, regulatory, and internal policies | Data privacy, retention, access controls, audit trails | Scheduled or triggered by regulation changes | Policy validation, audit log review | Dynamic rules, audits, cross-system consistency | Ensure data governance and regulatory compliance requirements are met |
Contract Testing | Verify that communication contracts/interfaces between services remain consistent | API schemas, data contracts, message formats | Before and during integration releases | Schema validation, consumer-driven contract testing | Coordinating consumer/provider contracts | Prevent integration breakage due to incompatible schema changes |
Data Processes Testing | Validate ETL/ELT logic, correctness of data transformation and processing | Extract, Transform, Load stages individually and combined | During development, scheduled after pipeline changes | Unit tests, integration tests, system-wide data validation | Complex dependencies, state handling | Ensure processing steps handle data correctly and produce expected results |
Pipeline Testing | End-to-end and targeted tests validating pipeline orchestration, error handling, and data flow | Orchestrator workflows, triggers, retries, alerts | Continuous, after pipeline deployments or fixes | Workflow simulations, failure scenario testing | Environment parity, handling intermittent failures | Verify pipeline robustness, alerting, and data delivery completeness |
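A minimal sketch of data quality checks (nulls, duplicates, ranges) written as plain functions over row dictionaries; in practice these would run as pytest cases, dbt tests, or a validation step in the pipeline, and the column names are illustrative.

```python
# Minimal data quality checks over a list of row dicts.
rows = [
    {"order_id": 1, "amount": 120.0},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": 75.5},
]

def check_not_null(rows, column):
    return [r for r in rows if r[column] is None]

def check_unique(rows, column):
    seen, dupes = set(), []
    for r in rows:
        if r[column] in seen:
            dupes.append(r)
        seen.add(r[column])
    return dupes

def check_range(rows, column, low, high):
    return [r for r in rows if r[column] is not None and not (low <= r[column] <= high)]

print("nulls:", check_not_null(rows, "amount"))               # one violating row
print("duplicates:", check_unique(rows, "order_id"))          # one violating row
print("out of range:", check_range(rows, "amount", 0, 1000))  # none
```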
- Imperative/Declarative
- Idempotency
Infrastructure as Code (IaC) provisions and manages computing infrastructure with code instead of manual processes. By defining the desired state and automating deployment, it cuts down on slow, error-prone manual work, especially at scale. This frees developers to focus on applications, while organizations gain cost control, risk reduction, and faster responses to opportunities.
Aspect | Imperative Programming | Declarative Programming |
---|---|---|
Definition | Specifies how to perform tasks step-by-step through explicit instructions | Specifies what the desired outcome or goal is, without detailing how to achieve it |
Programming Approach | The developer writes detailed instructions explicitly controlling each step to change the program state | Describes the desired end state; the system figures out the instructions to reach that state automatically |
Control Flow | Explicit; the developer manages the exact order of operations and flow | Implicit; controlled by the system or runtime |
State Management | Explicit and manual; the developer must maintain and update system state | Abstracted away and handled automatically by the system |
Level of Abstraction | Lower-level, deals with detailed procedural steps and direct system operations | Higher-level, more abstract, focuses on logic and outcomes |
Error Handling | Must be explicitly handled by the programmer; easier to introduce state inconsistency or errors | Often more robust due to abstraction; the system validates state before applying changes |
Flexibility/Control | More control over performance and optimization by managing each operation exactly | Less fine-grained control over execution details, focus is on describing end results |
Maintainability | Can become complex and harder to maintain with scaling due to detailed step management | Typically easier to maintain and extend as logic is expressed declaratively |
Adaptability to State | Rigid; instructions may fail if the initial state differs from assumptions | Adaptive; compares current state with desired state and adjusts actions dynamically |
Performance | Potentially faster for low-level tasks when optimized by expert programmers | May add overhead from abstraction or compilation; optimized by underlying engine |
Error-Prone | More prone to errors due to manual state & control flow management | Generally less error-prone since system manages steps and state consistency |
Debugging | Easier for step-by-step tracing but can get complicated in large codebases | Debugging declarative code may be harder due to abstraction, requires understanding system internals |
Tools | Chef (procedural recipes), shell scripts | Terraform, Puppet, CloudFormation |
Use Cases | Writing detailed data processing pipelines, manual orchestration of ETL steps, data cleaning scripts | Defining database schemas, data transformations (dbt models), infrastructure as code (Terraform), SQL queries |
Example: Creating Table (SQL) | Write explicit commands to create table, add columns, alter structure; may fail if structure exists | Define the desired table structure and let the system handle creation or alteration dynamically |
Example Analogy | Giving step-by-step instructions on how to make the sandwich starting from scratch | Showing a picture of the final sandwich and having a competent chef make it |
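The contrast can be sketched in a few lines: the imperative version issues explicit steps and must handle pre-existing state itself, while the declarative version states the desired end state and lets a (toy) reconciler work out the steps. The "infrastructure" here is just a dictionary, and the resource names are hypothetical.

```python
# Imperative: spell out each step and manage state transitions yourself.
infra = {}
def create_bucket(name):
    if name in infra:
        raise RuntimeError("already exists")   # the caller must handle this case
    infra[name] = {"versioning": False}
def enable_versioning(name):
    infra[name]["versioning"] = True

create_bucket("raw-data")
enable_versioning("raw-data")

# Declarative: describe the desired end state; a reconciler computes the steps.
desired = {"raw-data": {"versioning": True}, "curated-data": {"versioning": True}}
def reconcile(current, desired):
    for name, config in desired.items():
        current[name] = dict(config)           # create or update to match the spec
    for name in set(current) - set(desired):
        del current[name]                      # remove anything not declared

reconcile(infra, desired)
print(infra)  # matches `desired` regardless of the starting state
```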
Idempotency means an operation can be applied multiple times without changing the result beyond the initial application.
Importance
- Prevents duplicate data processing and corruption during retries
- Simplifies error handling by making retries safe
- Ensures consistent and deterministic pipeline outputs
- Enables scalable, concurrent processing without complex locking
- Facilitates easier debugging and auditing
- Helps meet strict regulatory requirements for transactional data
Guidelines
- Use Idempotency Keys:
- Assign unique identifiers to each operation or data item
- Use composite keys (e.g., source + timestamp) to detect duplicates
- Store these keys to recognize repeated operation attempts and avoid reprocessing
- Employ Atomic Transactions:
- Group operations into atomic units that either complete fully or rollback entirely
- Use transactional ACID-compliant storage systems where possible
- Deduplication Techniques:
- Implement deduplication at multiple levels (data ingestion, processing, storage)
- Utilize probabilistic data structures (Bloom filters) and sliding window algorithms for efficient duplicate detection
- Checkpointing and State Management:
- Maintain and persist checkpoints/states for recovery and partial processing resumption
- Enable pipeline to restart safely from the last consistent state after failures
- Use Contextual Uniqueness:
- Incorporate business logic attributes in idempotency checks to catch logical duplicates
- Concurrency Control:
- Design systems that handle concurrent writes gracefully using idempotency
- Leverage modern concurrency control patterns like non-blocking concurrency
- Choose Idempotent Storage Backends:
- Leverage storage systems that support conditional updates or compare-and-swap semantics (e.g., Delta Lake, Apache Hudi, distributed NoSQL with ACID features)
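A minimal sketch of the idempotency-key guideline above: each operation derives a key, and retries of an already-processed key are skipped. The key format (source plus event id) and the in-memory stores are illustrative; a real system would persist the key and the write atomically.

```python
# Idempotency-key sketch: record each processed key so that retries of the
# same operation do not double-apply.
processed_keys = set()
balances = {"acct_1": 0}

def apply_deposit(event):
    key = f"{event['source']}:{event['event_id']}"
    if key in processed_keys:
        return "skipped (duplicate)"
    balances[event["account"]] = balances.get(event["account"], 0) + event["amount"]
    processed_keys.add(key)          # in production: persist atomically with the write
    return "applied"

event = {"source": "payments", "event_id": 42, "account": "acct_1", "amount": 100}
print(apply_deposit(event))   # applied
print(apply_deposit(event))   # skipped (duplicate) - the retry is safe
print(balances)               # {'acct_1': 100}
```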
Testing and Validation
- Testing Methodologies
- Repeated Execution Testing: Re-run operations multiple times and verify the same state
- Fault Injection Testing: Simulate failures (network, crashes) to observe idempotent behavior
- Concurrent Operation Testing: Run identical operations simultaneously to test race conditions
- State Transition Validation: Confirm system transitions remain consistent regardless of operation frequency
- Time-Window Testing: Retry operations across time spans to ensure idempotency holds over time
- Validation Techniques
- Range Checking: Validate data values fall within acceptable limits
- Type Checking: Verify data types conform to expectations
- Format Checking: Ensure compliance with required data formats (e.g., emails, phone numbers)
- Consistency Checks: Confirm relational integrity across fields and datasets
- Automated Testing
- Property-based testing to generate varied and edge-case scenarios
- Chaos engineering tools to introduce faults in production-like environments
- Integration and regression tests to maintain idempotency guarantees as systems evolve
- Performance monitoring to assess idempotency overhead
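A repeated-execution test can be as simple as applying the same operation twice and asserting the final state matches a single application; the `upsert` below is a hypothetical idempotent operation keyed by primary key.

```python
import copy

# Repeated-execution test: apply the same operation twice and assert the final
# state equals the state after applying it once.
def upsert(table, row, key="id"):
    table[row[key]] = row
    return table

def test_upsert_is_idempotent():
    once = upsert({}, {"id": 1, "name": "Ada"})
    twice = upsert(copy.deepcopy(once), {"id": 1, "name": "Ada"})
    assert once == twice, "re-applying the operation changed the state"

test_upsert_is_idempotent()
print("idempotency check passed")
```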
- Overview
- Authentication
- Authorization
Aspect | Authentication | Authorization | Encryption | Tokenization | Data Masking | Data Obfuscation |
---|---|---|---|---|---|---|
Definition | Verifying identity of a user or system | Granting or denying access rights to resources | Transforming data into unreadable format to protect it | Replacing sensitive data with non-sensitive tokens | Replacing sensitive data with fictitious but realistic data | Hiding data through transformation to prevent understanding |
Purpose | Confirming who is accessing the system | Controlling what authenticated users can do/access | Protecting data confidentiality during storage/transit | Safeguarding sensitive data by replacing it with tokens | Protecting sensitive info while keeping data useful | Preventing data exposure while often preserving format |
Scope | Identity level (user, device, service) | Permission level (file, operation, service) | Data at rest, in transit | Specific sensitive data fields/elements | Databases, tables, fields, datasets for testing/sharing | Various data forms, often to resist reverse engineering |
Reversibility | N/A (identity verification) | N/A (access control) | Reversible if decryption key is held | Usually reversible via token vault, some are irreversible | Usually irreversible; aim is to prevent data recovery | Usually irreversible or complex to reverse |
Security Focus | Identity assurance | Access control enforcement | Confidentiality, data leakage prevention | Strong data security with minimal data exposure | Privacy compliance, risk reduction | Anti-reverse engineering, protecting intellectual property |
Data Format Preservation | N/A | N/A | Does not preserve original data format visibly | Can preserve format (format-preserving tokenization) | Preserves data usability and format | Often preserves structure/format for usability |
Performance Impact | Low to medium, depends on method | Low to medium, depends on complexity of policies | Can be high, especially with strong encryption and large data | Medium, due to token vault and lookups | Low to medium, depending on masking method (static/dynamic) | Low to medium, depends on obfuscation technique |
Complexity | Can be complex (multi-factor, adaptive) | Can be complex with fine-grained policies and delegation | Complex key management and cryptographic implementation | Complex token vault/database management | Intermediate; requires design of masking policies | Intermediate; requires custom transformation/logics |
Regulatory Compliance | Supports compliance by preventing unauthorized access | Supports compliance by enforcing access control | Strong support for data privacy and protection laws | Helps meet PCI DSS, GDPR by masking real data | Ensures compliance with GDPR, HIPAA, CCPA in testing/sharing | Assists compliance by protecting sensitive info exposure |
Key Limitation | Doesn't control resource access beyond identity verification | Authz policies can be bypassed if authN is weak | Key management critical; if keys lost, data unrecoverable | Reliance on token vault security; complexity | May reduce realism or break referential integrity | Can be reverse-engineered if weak transformations used |
Use Cases | Logins, multi-factor auth, biometric verification | Role-based access control, attribute-based access control | Securing emails, files, network traffic, databases | Payment card processing, PII protection, API token usage | Test/dev environments, analytics with safe data, compliance | Protecting source code, data export, secure telemetry |
Example Techniques | Passwords, biometrics, OTP, SSO | RBAC, ABAC, ACLs, policy engines | AES, RSA, TLS/SSL, hashing | Format-preserving tokenization, stateless/stateful tokens | Substitution, shuffling, scrambling, nulling, encryption-based masking | Character substitution, ciphering, noise addition |
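A rough sketch of masking versus tokenization using only the standard library; the hard-coded salt and truncated hash are for illustration only, whereas production tokenization relies on a token vault or format-preserving encryption with proper key management.

```python
import hashlib

SALT = b"demo-salt"  # illustration only; never hard-code secrets in real systems

def mask_card(card_number: str) -> str:
    # Data masking: hide most digits but keep the format and the last four.
    return "*" * (len(card_number) - 4) + card_number[-4:]

def tokenize(value: str) -> str:
    # One-way token: deterministic, so the same input maps to the same token,
    # preserving joinability without exposing the raw value.
    return hashlib.sha256(SALT + value.encode()).hexdigest()[:16]

print(mask_card("4111111111111111"))   # ************1111
print(tokenize("4111111111111111"))    # stable 16-character token
```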
Evolution of Authentication Methods
Credentials (Base64)
JSON Web Token (JWT)
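A sketch of the two mechanisms above: Basic credentials are merely Base64-encoded (reversible, not encrypted), and a JWT carries readable claims whose integrity is protected by a signature. The HMAC-signed token below is hand-rolled for illustration only; real services should use a maintained library such as PyJWT and proper key handling.

```python
import base64
import hashlib
import hmac
import json

# Basic credentials: Base64 is encoding, not encryption - trivially reversible.
header_value = "Basic " + base64.b64encode(b"alice:s3cret").decode()
print(header_value)
print(base64.b64decode(header_value.split()[1]))  # b'alice:s3cret'

# JWT-style token signed with HMAC-SHA256, simplified for illustration.
SECRET = b"server-side-secret"

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
payload = b64url(json.dumps({"sub": "alice", "role": "analyst"}).encode())
signature = b64url(hmac.new(SECRET, f"{header}.{payload}".encode(), hashlib.sha256).digest())
token = f"{header}.{payload}.{signature}"
print(token)  # claims are readable by anyone; the signature only proves integrity
```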
OAuth 2.0
SSH Keys
SSL Certificates
2FA (Two-Factor Authentication)
2SA (Two-Step Authentication)
Aspect | Role-Based Access Control (RBAC) | Attribute-Based Access Control (ABAC) | Access Control List (ACL) |
---|---|---|---|
Concept | Assigns permissions to users based on their roles within an organization | Grants access based on attributes of the user, resource, environment, and context | A list of specific rules defining who can access an object and what actions are allowed |
Main Focus | Roles and their associated permissions | Attributes and policies combining them | Explicit rules tied to individual resources |
Key Components | Users, Roles, Permissions, Sessions | Subjects (users), Objects (resources), Actions, Environment, Policies | Resources, Access Control Entries (ACEs) specifying users/groups and their permissions |
Access Control Model | Role-centric, static binding of permissions | Policy-centric, dynamic evaluation of attributes at request time | Rule-centric, access defined by explicit rules for users or groups per resource |
Flexibility | Moderate. Roles predefined; less adaptable to context changes | High. Can consider dynamic and contextual information (time, location, device, etc.) | Low to moderate. Rules usually static and manually maintained |
Granularity | Coarse to moderate, depends on number and granularity of roles | Fine-grained; policies can combine multiple attributes for precise decisions | Fine-grained at resource level, specifying detailed permissions per user/object |
Scalability | Scales well with a manageable number of roles; risk of role explosion if too many roles created | Can become complex and computationally heavy with many attributes and policies | Can be complex to manage at scale if many resources and users require rules |
Administration | Centralized administration through role assignments; easier for compliance audits | Complex policy administration requiring careful attribute and policy design | Decentralized - resource owners or admins define ACLs; can be cumbersome |
Policy Evaluation | At user-login or session creation, roles assigned then used throughout session | Real-time evaluation of attributes at each access request | Each access request evaluated against ordered ACL rules sequentially |
Security Strength | Good for static deterministic control but vulnerable if roles have excessive privileges | Potentially stronger due to fine-grained, context-aware policies | Strong when rules are well managed; can be prone to errors if rules overlap |
Policy Complexity | Simpler conceptually and easier to implement for basic needs | More complex, requiring detailed attribute and policy management | Simple for small sets of resources but can become complex |
Typical Policy Components | Roles, permissions, users, sessions | Attributes (user, resource, environment), policies, rules combining attributes | Access Control Entries (ACEs) specifying users/groups and their permissions |
Errors and Conflicts | Role explosion can create overlap or excessive permissions | Policy conflicts can be complex to detect and resolve | Rule ordering is critical; earlier rules take precedence, leading to conflicts if mismanaged |
Management Overhead | Moderate; fewer roles mean simpler management, but overhead grows as roles become more complex | Higher due to attribute and policy complexity | High if many resources/users require individualized ACLs |
User Control | No direct control by end users; all managed by administrators | No direct user control; policy-driven access | Owners may control ACLs on their resources (discretionary control) |
Compliance and Auditing | Easier to audit due to defined roles and permissions | More complex auditing due to dynamic policies but more precise logging possible | Auditable if ACLs are properly logged and maintained |
Hybrid Use | Often combined with ABAC for context-aware refinements | Can include role as an attribute or integrate with RBAC | ACLs often used alongside RBAC or ABAC for network or low-level access control layers |
Example Permissions | "HR Manager" role can approve leave requests and view payroll data | User accessing resource only during business hours and from corporate device | IP-based allow/deny rules on network devices or file read/write permissions per user |
Use Cases | Enterprises with clearly defined job functions and structured hierarchies | Environments needing fine-grained, dynamic, context-aware access decisions | Network devices (routers, firewalls), file systems, and simple resource-based control |
Common Implementations | Microsoft Active Directory, Oracle RBAC, databases, enterprise IT systems | Healthcare, finance, government systems with strict compliance needs | Router/firewall rules, Windows/Linux file system permissions, some databases |
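To make the RBAC/ABAC contrast in the table concrete, a toy sketch: RBAC resolves a static role-to-permission mapping, while ABAC evaluates attributes of the user, resource, and environment at request time. All role names, attributes, and policy rules below are illustrative.

```python
# --- RBAC: permissions bound to roles ahead of time ---
ROLE_PERMISSIONS = {
    "hr_manager": {"approve_leave", "view_payroll"},   # illustrative roles
    "analyst": {"read_reports"},
}

def rbac_allowed(role: str, action: str) -> bool:
    return action in ROLE_PERMISSIONS.get(role, set())

# --- ABAC: policy evaluated against request attributes at access time ---
def abac_allowed(user: dict, resource: dict, context: dict) -> bool:
    return (
        user["department"] == resource["owning_department"]   # user vs resource attribute
        and context["device"] == "corporate"                   # environment attribute
        and 9 <= context["hour"] < 18                          # business-hours condition
    )

print(rbac_allowed("hr_manager", "view_payroll"))              # True (static role binding)
print(abac_allowed(                                            # True only in this context
    {"department": "finance"},
    {"owning_department": "finance"},
    {"device": "corporate", "hour": 10},
))
```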
- Overview
- Governance Topologies
Data mesh is a decentralized data architecture where teams own and manage their data. It assigns ownership to business domains (e.g., finance, marketing, sales), providing a self-serve platform and federated governance. This enables autonomous development of tailored data services while ensuring a unified data experience across the organization.
Aspect | Domain Ownership | Data as a Product | Self-Serve Data Platform | Federated Governance |
---|---|---|---|---|
Strategic Domain-Driven Design | Domain Bounded Context | Product Thinking | Domain-Agnostic | Context-Mapping
Socio-technical Perspective | Domain Teams | Data Product by Domain Team | Data Platform Team | Guild |
Technology | Operational & Analytical Data | Interoperability Interfaces | Self-Serve Data Platform | Data Governance & Automation |
Core Principles​
- Domain-oriented decentralized ownership: Business domains (e.g., customer service, marketing) own and manage their analytical and operational data services, tailoring data models to their needs
- Data as a product: domain teams treat other domains as consumers, providing high-quality, secure, and up-to-date data
- Self-service data infrastructure as a platform: dedicated team provides tools for domains to autonomously consume, develop, deploy, and manage interoperable data products
- Federated computational governance: global policies are defined by a federated governance body and embedded (automated) in each domain's processes, enabling autonomy while ensuring compliance and interoperability
Data Mesh Architecture​
Data Product​
High-level Platform Design and Governance​
Example​
Aspect | Fine-grained Fully Federated Mesh | Fine-grained Fully Governed Mesh | Hybrid Federated Mesh | Value Chain-Aligned Mesh | Coarse-grained Aligned Mesh | Coarse-grained and Governed Mesh |
---|---|---|---|---|---|---|
Description | Pure data mesh model with many small, independent deployable components, peer-to-peer data distribution, logically centralized governance metadata | Adds a central data distribution layer to fine-grained federated mesh for stronger governance and centralized data distribution | Combines federation and centralization. Central platform hosts/maintains data products; domain autonomy mainly in data consumption | Domains aligned along business value chains, working in close groups with autonomy but sharing central standards for cross-domain data | Large, coarse-grained domains, often as a result of mergers; domains contain many applications, organic growth leads to complexity | Similar to coarse-grained aligned mesh but with stronger governance features like addressing time-variant and non-volatile data concerns |
Granularity | Fine-grained data products, many small independent units | Fine-grained data products with centralized distribution layer | Hybrid: fine to moderate granularity; central platform more involved | Fine to moderate granularity aligned by value chains | Coarse-grained domains containing many applications | Coarse-grained domains with governed attributes |
Governance Approach | Federated with logically centralized metadata governance but mostly domain autonomy | Fully governed with central control over distribution and conformance | Governed but with domain autonomy in consumption; central platform manages creation/maintenance | Central standards for cross-domain data; requires architectural guidance | Strong governance policies necessary due to complexity | Fully governed with relaxed controls in large domains |
Data Distribution | Peer-to-peer between domains; domains share data directly | Centralized data distribution via shared storage layer (domain-specific containers) | Domains create/manage data via central platform; consumes data autonomously | Aligned along value chains; domains share as needed with governance | Centralized/shared to manage complexity across coarse domains | Centralized/shared with governance controls for data quality |
Ownership | Domain owns, manages, shares data independently | Clear boundaries with domain ownership but central distribution | Domain teams or platform team may own/manage data products depending on capability | Domains collaborate with autonomy within their value chain | Domain ownership but domains large and complex | Domains own data but comply with governance for consistency |
Complexity / Management | High complexity managing many small data products; needs conformance agreement across domains | Higher complexity with governance and central controls; may slow time-to-market | Moderate complexity; need supporting platform and governance team to manage hybrid roles | Requires architectural coordination to define boundaries and standards clearly | High complexity due to coarse domains and multiple applications | High complexity with additional governance overhead |
Scalability | Scales well horizontally but can be costly and resource-intensive due to duplication | Scales with strong conformance, but centralization may introduce coupling delays and cost overhead | Scales with centralized platform efficiency and local domain agility | Scales by value chains, enabling domain-group specialization | Suited to large enterprises with many legacy systems and apps | Similar to coarse-grained aligned mesh, but governance improves consistency at scale |
Network / Infrastructure Impact | Potential for heavy network utilization and infrastructure duplication | More efficient central infrastructure with shared storage and compute pools | Some reduction in duplication with central platform; moderate overhead | Balanced infrastructure demands due to group alignment | Infrastructure complexity due to large domain size and app count | Higher infrastructure cost but managed for compliance and quality |
Challenges & Risks | Requires consensus on standards; potential data gravity vs decentralization conflict; costly infrastructure | Longer time to market, potential domain coupling; challenge in multi-cloud seamless governance | Management overhead with mixed governance; complex rules for data distribution | Need strong architectural guidance; boundaries may be fluid and require attention | Data alignment issues with domain boundaries; capability duplication | Balancing autonomy with strong governance may slow flexibility |
Governed Data Characteristics | Metadata governance centralized, data governance mostly at domain level | Stronger data quality, compliance, and governance enforced centrally | Governance mixed: central for product creation, federated for consumption | Governance focuses on cross-domain data product standards | Governance policies critical due to scale and complexity | Governance addresses time-variant, compliance, and quality controls |
Use Cases | Cloud-native, multi-cloud companies with many skilled engineers and high autonomy | Financial institutions, governments valuing compliance over agility | Organizations with legacy systems or lacking fully skilled teams; partial mesh | Organizations needing stream-alignment or hyper-specialized domain cooperation (e.g., supply chain) | Large enterprises with complex merged systems & applications | Large enterprises needing governance and compliance in complex domains |