The Data Engineering Playbook: 15 Foundational Concepts Explained



This content originally appeared on DEV Community and was authored by Kemboijebby

Introduction
In today’s data-driven world, organizations are collecting, processing, and analyzing information at unprecedented scale and speed. Behind the scenes, data engineers build the systems and pipelines that make this possible—transforming raw data into reliable, usable assets for analytics, machine learning, and decision-making.

While tools and technologies change rapidly, the core principles of data engineering remain constant. Understanding these concepts is essential for designing robust architectures, ensuring data quality, and meeting the demands of modern businesses.

In this article, we’ll explore 15 foundational concepts every aspiring or practicing data engineer should master.

1. Batch vs Stream Processing

  • Batch Processing involves collecting data over a set period (e.g., hourly, daily) and processing it in bulk. This method is ideal when immediate availability isn’t critical and allows for cost-efficient, large-scale transformations. For example, an e-commerce company might run a nightly batch job to consolidate daily sales data into a data warehouse for next-day reporting.

  • Stream Processing processes data continuously as it arrives, enabling near real-time analytics. This is essential for scenarios where timely insights drive action—such as monitoring credit card transactions for fraud or updating live dashboards for ride-sharing demand (see the sketch below).
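
To make the difference concrete, here is a minimal, framework-free Python sketch (the `product` and `amount` fields are invented for illustration): the batch function aggregates a bounded dataset in one pass, while the streaming function updates running totals as each event arrives.

```python
from collections import defaultdict

# --- Batch: process a full day's worth of records in one pass ---
def batch_daily_sales(records):
    """Aggregate total sales per product from a bounded dataset."""
    totals = defaultdict(float)
    for r in records:
        totals[r["product"]] += r["amount"]
    return dict(totals)

# --- Stream: update running totals as each event arrives ---
def stream_sales(event_source):
    """Consume an (unbounded) iterator of events, emitting running totals."""
    totals = defaultdict(float)
    for event in event_source:          # could be a Kafka consumer, a socket, etc.
        totals[event["product"]] += event["amount"]
        yield event["product"], totals[event["product"]]

# Example usage with an in-memory "source"
events = [{"product": "book", "amount": 12.5}, {"product": "pen", "amount": 2.0}]
print(batch_daily_sales(events))
for product, running_total in stream_sales(iter(events)):
    print(product, running_total)
```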

2. Change Data Capture (CDC)
Change data capture (CDC) refers to the process of identifying and capturing changes made to data in a database and then delivering those changes in real-time to a downstream process or system.

2.1 Why it Matters
Capturing every change from transactions in a source database and moving it to the target in real time keeps the two systems in sync, enabling reliable data replication and zero-downtime cloud migrations.
CDC is perfect for modern cloud architectures since it’s a highly efficient way to move data across a wide area network.

2.2 Change Data Capture in ETL
Change data capture fits into ETL (Extract, Transform, Load), where data is extracted from a source, transformed, and then loaded into a target repository such as a data lake or data warehouse.
Extract. Historically, data was extracted in bulk using batch-based database queries. The challenge is that data in the source tables is continuously updated, and completely refreshing a replica of the source on every run is impractical, so those updates are not reliably reflected in the target repository.

Change data capture solves this challenge by extracting data in real time or near real time, providing a reliable stream of change data.
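
As a rough illustration, here is a minimal query-based CDC sketch in Python: it pulls only the rows modified since the last watermark and then advances the watermark. The table, column names, and in-memory "source" are invented for illustration; production log-based CDC tools (e.g., Debezium) read the database's transaction log rather than polling a timestamp column.

```python
from datetime import datetime

# Toy "source table" and watermark; a real pipeline would query the source
# database and persist the watermark in durable storage.
source_orders = [
    {"id": 1, "status": "paid",    "updated_at": datetime(2024, 1, 1, 8, 0)},
    {"id": 2, "status": "shipped", "updated_at": datetime(2024, 1, 1, 9, 30)},
]
watermark = datetime(2024, 1, 1, 9, 0)      # last change already replicated

def capture_changes(rows, since):
    """Query-based CDC: return only rows modified after the watermark."""
    return sorted(
        (r for r in rows if r["updated_at"] > since),
        key=lambda r: r["updated_at"],
    )

changes = capture_changes(source_orders, watermark)
for change in changes:
    print("replicate downstream:", change)   # e.g., publish to Kafka / apply to target
if changes:
    watermark = max(r["updated_at"] for r in changes)   # advance the watermark
```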

Transformation. Typically, ETL tools transform data in a staging area before loading. This involves converting a data set’s structure and format to match the target repository, typically a traditional data warehouse. Given the constraints of these warehouses, the entire data set must be transformed before loading, so transforming large data sets can be time intensive.

Today’s datasets are too large and timeliness is too important for this approach. In the more modern ELT pipeline (Extract, Load, Transform), data is loaded immediately and then transformed in the target system, typically a cloud-based data warehouse, data lake, or data lakehouse. ELT operates either on a micro-batch timescale, loading only the data modified since the last successful load, or on a CDC timescale, continually loading data as it changes at the source.

Load. This phase refers to the process of placing the data into the target system, where it can be analyzed by BI or analytics tools.

3. Idempotency
In data processing and analysis, idempotency plays a crucial role in ensuring the reliability and consistency of data pipelines. Idempotency is the property that running a pipeline repeatedly against the same source data yields identical results. This property is fundamental in data engineering: it helps maintain data integrity, simplifies error recovery, and facilitates efficient data processing.
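
A tiny Python sketch of the idea, using an invented in-memory "warehouse": appending on every run is not idempotent, while overwriting a partition keyed by run date is, so the pipeline can be re-run safely after a failure.

```python
# Toy warehouse: a dict mapping partition key (e.g., a date) -> list of rows.
warehouse = {}

def load_append(partition_key, rows):
    """NOT idempotent: re-running duplicates the partition's rows."""
    warehouse.setdefault(partition_key, []).extend(rows)

def load_overwrite(partition_key, rows):
    """Idempotent: each run replaces the partition, so reruns give the same state."""
    warehouse[partition_key] = list(rows)

rows = [{"order_id": 1, "amount": 10.0}]
load_overwrite("2024-01-01", rows)
load_overwrite("2024-01-01", rows)   # rerun after a failure: no duplicates
assert warehouse["2024-01-01"] == rows
```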

4. OLAP and OLTP in Databases
Online Analytical Processing (OLAP) refers to software tools used for the analysis of data in business decision-making. OLAP systems let users extract and view data from various perspectives, often in a multidimensional format, which helps reveal complex interrelationships in the data. These systems sit at the heart of data warehousing and business intelligence, enabling trend analysis, financial forecasting, and other forms of in-depth analysis.

OLAP Examples
Any data warehouse system is an OLAP system. Typical uses include:

  • Music streaming services that personalize homepages with custom songs and playlists based on user preferences.
  • Netflix's movie recommendation system.

Online Transaction Processing, commonly known as OLTP, is a data processing approach emphasizing real-time execution of transactions. Most OLTP systems are designed to handle large numbers of short, atomic operations that keep the database consistent. To maintain transaction integrity and reliability, these systems support the ACID (Atomicity, Consistency, Isolation, Durability) properties, which is what allows critical applications such as online banking and reservation systems to run reliably.

OLTP Examples
A classic OLTP example is an ATM: the customer who authenticates first is served first, and a withdrawal succeeds only if the requested amount is available in the ATM. Typical uses are listed below, followed by a short sketch contrasting the two access patterns.

  • An ATM network is an OLTP application.
  • OLTP systems enforce the ACID properties for every transaction the application performs.
  • Other examples include online banking, online airline ticket booking, sending a text message, and adding a book to a shopping cart.
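
A minimal sketch using Python's built-in sqlite3 module (the table and values are made up): the OLTP operation is a short atomic transaction against a single row, while the OLAP operation aggregates across many rows for analysis.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500.0), (2, 125.0)])

# OLTP-style: a short, atomic transaction touching one row (an ATM withdrawal).
with conn:   # commits on success, rolls back on error
    conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (40.0, 1))

# OLAP-style: an aggregate query scanning many rows for analysis.
total, avg = conn.execute("SELECT SUM(balance), AVG(balance) FROM accounts").fetchone()
print(total, avg)
```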

5. Columnar vs Row-based Storage
Databases and file formats store data in one of two fundamental ways, row-based or columnar, and the choice has a major impact on performance, storage efficiency, and query patterns. Row-based storage keeps all the values of a record together, which suits transactional workloads with frequent inserts, updates, and point lookups. Columnar storage keeps all the values of a column together, which suits analytical workloads that scan a few columns across many rows and typically compresses much better; formats such as Parquet and ORC, and most cloud data warehouses, are columnar.
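
A toy Python sketch of the two layouts (the dataset is invented): the same records represented row-wise and column-wise, and an analytical query that only needs one column.

```python
# Row-based layout: all fields of a record stored together
# (good for point lookups and transactional writes).
rows = [
    {"order_id": 1, "amount": 10.0, "country": "KE"},
    {"order_id": 2, "amount": 25.0, "country": "US"},
]

# Columnar layout: all values of a column stored together
# (good for scanning one column across many rows; compresses better).
columns = {
    "order_id": [1, 2],
    "amount":   [10.0, 25.0],
    "country":  ["KE", "US"],
}

# Analytical query "total amount": the columnar layout reads one array,
# while the row layout must touch every record.
print(sum(columns["amount"]))
print(sum(r["amount"] for r in rows))
```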

6. Partitioning
Data partitioning is a technique for dividing large datasets into smaller, manageable chunks called partitions. Each partition contains a subset of data and is distributed across multiple nodes or servers. These partitions can be stored, queried, and managed as individual tables, though they logically belong to the same dataset.

Types of Partitioning

  • Horizontal partitioning splits data by rows: instead of storing all the data in a single table, different sets of rows are stored as separate partitions. Every partition contains the same set of columns but a different group of rows (see the sketch after this list).

  • Vertical partitioning divides data by columns, so each partition contains the same rows but only a subset of the columns. The partition key or primary key column is repeated in every partition to preserve the logical relationship between them. Vertical partitioning is popular when sensitive information must be stored separately from regular data: sensitive columns go in one partition and standard columns in another.
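
Here is a toy Python sketch of horizontal partitioning (the orders and the `dt` partition key are invented): one logical dataset is split into per-day partitions, so a query filtered on the key only touches the relevant partition.

```python
from collections import defaultdict

# Toy horizontal partitioning: split one logical "orders" dataset into
# per-day partitions. Real systems do this with table partitions or
# partitioned files (e.g., Parquet folders such as dt=2024-01-01/).
orders = [
    {"order_id": 1, "dt": "2024-01-01", "amount": 10.0},
    {"order_id": 2, "dt": "2024-01-02", "amount": 25.0},
    {"order_id": 3, "dt": "2024-01-01", "amount": 7.5},
]

partitions = defaultdict(list)
for row in orders:
    partitions[row["dt"]].append(row)    # same columns, different rows

# A query filtered on the partition key only scans the relevant partition.
jan_first_total = sum(r["amount"] for r in partitions["2024-01-01"])
print(jan_first_total)
```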

7. ELT and ETL
7.1 ELT
Extract, Load, Transform (ELT) is the technique of extracting raw data from the source, storing it in the target data warehouse, and transforming it there for downstream users.
ELT consists of three different operations performed on the data:

  • Extract: Extracting data is the process of identifying data from one or more sources. The sources may include databases, files, ERP, CRM, or any other useful source of data.
  • Load: Loading is the process of storing the extracted raw data in a data warehouse or data lake.
  • Transform: Data transformation is the process in which the raw data from the source is transformed into the target format required for analysis.

7.2 ETL Process
ETL is the traditional technique of extracting raw data, transforming it as required by users, and storing it in a data warehouse. ELT was developed later, with ETL as its base. The three operations in ETL and ELT are the same; only their order of processing differs. The change in sequence was made to overcome some of ETL's drawbacks, notably time-consuming pre-load transformations on large datasets.

  • Extract: It is the process of extracting raw data from all available data sources such as databases, files, ERP, CRM or any other.
  • Transform: The extracted data is immediately transformed as required by the user.
  • Load: The transformed data is then loaded into the data warehouse, from where users can access it. (The sketch below contrasts the two orderings.)
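
The following minimal Python sketch (with invented extract, transform, and load steps) only illustrates the ordering difference: ETL transforms before loading, while ELT loads the raw data first and transforms it inside the target system.

```python
# Shared toy steps; names and data are illustrative only.
def extract():
    return [{"amount_usd": "10.5"}, {"amount_usd": "2.0"}]

def transform(rows):
    return [{"amount_usd": float(r["amount_usd"])} for r in rows]

def load(rows, target):
    target.extend(rows)
    return target

# ETL: transform in a staging step *before* loading into the warehouse.
warehouse_etl = load(transform(extract()), target=[])

# ELT: load raw data first, then transform inside the target system
# (in practice via SQL or dbt models running in the warehouse).
raw_zone = load(extract(), target=[])
warehouse_elt = transform(raw_zone)

print(warehouse_etl, warehouse_elt)
```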

8. CAP Theorem
The CAP theorem is a fundamental result in distributed systems theory, first proposed by Eric Brewer in 2000 and proven by Seth Gilbert and Nancy Lynch in 2002. It states that a distributed data system cannot simultaneously guarantee all three of the following properties:

8.1 Consistency
Consistency means that all the nodes (databases) inside a network will have the same copies of a replicated data item visible for various transactions. It guarantees that every node in a distributed cluster returns the same, most recent, and successful write. It refers to every client having the same view of the data. There are various types of consistency models. Consistency in CAP refers to sequential consistency, a very strong form of consistency.

8.2 Availability
Availability means that each read or write request for a data item will either be processed successfully or will receive a message that the operation cannot be completed. Every non-failing node returns a response for all the read and write requests in a reasonable amount of time. The key word here is “every”. In simple terms, every node (on either side of a network partition) must be able to respond in a reasonable amount of time.

8.3 Partition Tolerance
Partition tolerance means that the system can continue operating even if the network connecting the nodes fails in a way that splits them into two or more partitions, where the nodes in each partition can only communicate among themselves. In other words, the system keeps functioning and upholds the guarantees it makes despite network partitions. Network partitions are a fact of life, and distributed systems that guarantee partition tolerance can recover gracefully once the partition heals.

9. Windowing in Streaming
In real-time data processing, windowing is a technique that groups events that arrive over a period of time into finite sets for aggregation and analysis. This is necessary because streaming data is unbounded—it never ends.

The main types of windows are:

  • Tumbling windows: Fixed-size, non-overlapping time intervals (e.g., count sales in 5-minute blocks).
  • Sliding windows: Fixed-size intervals that overlap, capturing more granular trends (e.g., a 5-minute window sliding every minute).
  • Session windows: Dynamic windows that close when no new events arrive for a defined gap (e.g., tracking user sessions based on inactivity).

Example: A web analytics platform might use a tumbling window to count unique visitors in 5-minute intervals for live traffic dashboards.
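
As a rough illustration, here is a pure-Python tumbling-window count (the click events are invented): each event is assigned to the start of its fixed five-minute window and counted per window.

```python
from collections import Counter
from datetime import datetime

# Toy click events; in production these would come from a stream (Kafka, etc.).
events = [
    {"user": "a", "ts": datetime(2024, 1, 1, 12, 1)},
    {"user": "b", "ts": datetime(2024, 1, 1, 12, 4)},
    {"user": "a", "ts": datetime(2024, 1, 1, 12, 7)},
]

WINDOW_MINUTES = 5

def tumbling_window_key(ts):
    """Map a timestamp to the start of its fixed, non-overlapping 5-minute window."""
    floored_minute = (ts.minute // WINDOW_MINUTES) * WINDOW_MINUTES
    return ts.replace(minute=floored_minute, second=0, microsecond=0)

counts = Counter(tumbling_window_key(e["ts"]) for e in events)
for window_start, n_events in sorted(counts.items()):
    print(window_start, n_events)
```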

10. DAGs and Workflow Orchestration
A Directed Acyclic Graph (DAG) is a set of tasks connected by dependencies, where the edges indicate execution order and no cycles are allowed. DAGs form the backbone of workflow orchestration tools, ensuring tasks run in the correct sequence and only when prerequisites are met.

Popular orchestration tools include:

  • Apache Airflow – widely used for data pipelines, supports scheduling and monitoring.
  • Prefect – emphasizes ease of use and dynamic workflows.

Example: An Airflow DAG might extract sales data from an API, transform it using Python scripts, and load it into a data warehouse every morning—each task running in sequence, with automatic retries if something fails.
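
A minimal Airflow DAG along those lines might look like the sketch below. It assumes Apache Airflow 2.4+ is installed; the dag_id, task names, and placeholder functions are invented for illustration.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull sales data from the API")

def transform():
    print("clean and reshape the data")

def load():
    print("write the result to the warehouse")

with DAG(
    dag_id="daily_sales",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",               # run every morning
    catchup=False,
    default_args={"retries": 2},     # automatic retries if a task fails
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load   # edges of the DAG: execution order
```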

11. Retry Logic & Dead Letter Queues
In distributed systems, transient errors—temporary issues like network timeouts—are common. Retry logic automatically attempts the failed operation again after a delay, increasing resilience.
When retries still fail, messages or records can be moved to a Dead Letter Queue (DLQ), where they’re stored for manual inspection or later reprocessing.

Example: In Kafka, a DLQ might hold messages with invalid schemas that failed deserialization, allowing engineers to investigate without losing the data.
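
A minimal Python sketch of the pattern (the handler and in-memory DLQ are invented; in practice the DLQ would be a separate Kafka topic or queue): transient failures are retried with exponential backoff, and a message that still fails is parked in the dead letter queue along with its error.

```python
import time

dead_letter_queue = []   # in practice: a separate Kafka topic or message queue

def process_with_retries(message, handler, max_attempts=3, base_delay=1.0):
    """Retry transient failures with exponential backoff; park the message
    in a dead letter queue if every attempt fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(message)
        except Exception as exc:            # in real code, catch specific errors
            if attempt == max_attempts:
                dead_letter_queue.append({"message": message, "error": str(exc)})
                return None
            time.sleep(base_delay * 2 ** (attempt - 1))   # 1s, 2s, 4s, ...

def flaky_handler(message):
    raise TimeoutError("downstream service unavailable")

process_with_retries({"order_id": 42}, flaky_handler)
print(dead_letter_queue)
```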

12. Backfilling & Reprocessing
Backfilling means populating a system with historical data that was previously missing, while reprocessing means re-running transformations on data that was already processed—often because of a bug fix or updated business logic.

Example: If a currency conversion bug caused incorrect financial reports for the last quarter, engineers might reprocess that period’s raw data with the corrected logic, replacing the faulty results.
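
A rough sketch of that kind of reprocessing in Python (the raw reader, the writer, and the exchange rate are placeholders): iterate over the affected date range and re-run the corrected transformation, overwriting each day's output idempotently.

```python
from datetime import date, timedelta

def transform(raw_rows, fx_rate):
    """Corrected business logic, e.g., a fixed currency conversion."""
    return [{"order_id": r["order_id"], "amount_usd": r["amount"] * fx_rate}
            for r in raw_rows]

def read_raw(day):                  # placeholder: read that day's stored raw data
    return [{"order_id": 1, "amount": 100.0}]

def write_partition(day, rows):     # placeholder: idempotent overwrite of one day's output
    print(f"overwriting partition {day} with {rows}")

# Reprocess / backfill a date range by re-running the (fixed) transformation
# day by day against the stored raw data.
start, end = date(2024, 1, 1), date(2024, 1, 3)
day = start
while day <= end:
    write_partition(day, transform(read_raw(day), fx_rate=0.0078))
    day += timedelta(days=1)
```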

13. Data Governance
Data governance ensures that data is accurate, consistent, secure, and compliant with regulations. It covers policies, processes, and tools for managing data quality, privacy, and lifecycle.
Key aspects include:

  • Data quality: Validations, profiling, and cleansing.
  • Privacy & compliance: Meeting requirements like GDPR (Europe) or HIPAA (U.S. healthcare).
  • Access control: Role-based permissions to protect sensitive information.

Example: A customer dataset in a retail company might mask personally identifiable information (PII) before analysts can query it, ensuring compliance and preventing misuse.
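
A minimal Python sketch of that kind of masking (the customer record and salt value are invented): the PII column is replaced with a salted hash, so analysts can still join and count on it without seeing the raw value.

```python
import hashlib

def mask_email(email, salt="pepper"):   # the salt value here is illustrative only
    """Replace an email with a salted hash so analysts can still join on the
    column without seeing the raw PII."""
    digest = hashlib.sha256((salt + email).encode("utf-8")).hexdigest()
    return digest[:16]

customers = [{"customer_id": 1, "email": "jane@example.com", "country": "KE"}]
masked = [{**c, "email": mask_email(c["email"])} for c in customers]
print(masked)
```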

14. Time Travel & Data Versioning
Time travel in data systems allows querying historical snapshots of data as it existed at a specific point in time. Data versioning stores multiple versions of a dataset so changes can be tracked and rolled back if needed.

Example: Snowflake’s Time Travel feature can restore a table to its state from 72 hours ago, recovering from accidental deletions or schema changes without restoring from a backup.
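
As a toy illustration of the idea only (this is not how Snowflake implements it), the Python sketch below keeps immutable snapshots keyed by commit time and answers "as of" queries against them; engines such as Snowflake, Delta Lake, and Apache Iceberg achieve this with table metadata and retained data files.

```python
from datetime import datetime

versions = []   # list of (commit timestamp, snapshot)

def commit(snapshot):
    """Record an immutable snapshot of the dataset at commit time."""
    versions.append((datetime.now(), list(snapshot)))

def as_of(ts):
    """Time travel: return the latest snapshot committed at or before ts."""
    candidates = [snap for commit_ts, snap in versions if commit_ts <= ts]
    return candidates[-1] if candidates else None

commit([{"id": 1, "name": "alpha"}])
checkpoint = datetime.now()
commit([{"id": 1, "name": "renamed"}])   # later change (or accidental damage)

print(as_of(checkpoint))                 # recovers the earlier state
```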

15. Distributed Processing Concepts
Large datasets often exceed the capacity of a single machine. Distributed processing breaks the workload into smaller tasks across multiple nodes for faster computation and higher scalability.
Key concepts include:

  • Parallelization: Running tasks simultaneously.
  • Sharding: Splitting data into partitions stored across nodes.
  • Replication: Duplicating data across nodes for fault tolerance.

Example: Apache Spark processes terabytes of log data by splitting it into partitions, distributing them across a cluster, and executing transformations in parallel—dramatically reducing processing time.
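
A minimal PySpark sketch of that pattern (it assumes pyspark is installed and uses a hypothetical `logs/*.txt` path): the files are read into partitions spread across the cluster, and the filter and count run in parallel on each partition before Spark combines the partial results.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("log-errors").getOrCreate()

logs = spark.read.text("logs/*.txt")               # each line becomes a row in column "value"
print("partitions:", logs.rdd.getNumPartitions())  # the data is split across the cluster

# The filter and aggregation run in parallel on each partition; Spark then
# combines the partial results into a single count.
error_count = logs.filter(F.col("value").contains("ERROR")).count()
print("error lines:", error_count)

spark.stop()
```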

