This content originally appeared on DEV Community and was authored by Brian Ouchoh
What happens when you want to report on an event in your organization? What happens when you want to get insights into your operations through data analysis? What happens when a data scientist wants to train a large language model? The common denominator for all these tasks is consuming data. Data engineering not only provides a way to collect, store, process and access data reliably, but also the tools to design and optimize data systems.
Here are some core concepts that you need to understand as a data engineer:
1.Batch ingestion vs Streaming Ingestion
Batch ingestion is collecting data over a period of time and then processing it all at once, rather than dealing with each record as it arrives.
The period can be hourly, daily, weekly, etc. An example is a restaurant collecting all point-of-sale transactions from all servers and loading them into a database for end-of-shift reporting.
Unlike batch ingestion, streaming ingestion processes data as it arrives. An example is a point-of-sale system that updates the total sales figure as soon as a new sale is made.
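To make the contrast concrete, here is a minimal Python sketch (with made-up sale records, not tied to any particular point-of-sale system) that processes the same records as one batch and then as a stream:

```python
# Hypothetical sale records for illustration only.
sales = [
    {"server": "A", "amount": 12.50},
    {"server": "B", "amount": 7.00},
    {"server": "A", "amount": 20.25},
]

# Batch ingestion: collect everything for the shift, then process in one go.
def end_of_shift_report(batch):
    total = sum(s["amount"] for s in batch)
    print(f"End-of-shift total: {total:.2f}")

# Streaming ingestion: update the running total as each sale arrives.
def on_new_sale(sale, running_total):
    running_total += sale["amount"]
    print(f"Running total: {running_total:.2f}")
    return running_total

end_of_shift_report(sales)          # processed once, after the fact

total = 0.0
for sale in sales:                  # processed record by record, as it arrives
    total = on_new_sale(sale, total)
```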
2.Change Data Capture (CDC)
Change data capture (CDC) is the process of identifying and recording changes (inserts, updates and deletes) in a source database, then applying those changes downstream without having to reprocess the entire dataset.
For example, you have an e-commerce platform with a table called “orders” that is updated constantly as purchase statuses change. Consider two scenarios:
A. Without CDC: instead of capturing the changes, the organization would periodically export the entire “orders” table from the database to the data warehouse. This results in high resource usage, increased latency and complex deduplication.
B. With CDC: the changes to purchases that affect the “orders” table are captured and applied downstream without reprocessing the entire dataset.
CDC is powered by tools such as Debezium, Oracle GoldenGate and AWS Database Migration Service.
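Here is a minimal sketch of applying change events downstream; the event format is invented for illustration and is not the output of any specific CDC tool:

```python
# Hypothetical change events (insert/update/delete) for an "orders" table.
change_events = [
    {"op": "insert", "id": 1, "data": {"status": "placed"}},
    {"op": "update", "id": 1, "data": {"status": "shipped"}},
    {"op": "delete", "id": 2, "data": None},
]

warehouse_orders = {2: {"status": "placed"}}  # downstream copy of the table

# Apply only the changes instead of re-exporting the whole "orders" table.
for event in change_events:
    if event["op"] in ("insert", "update"):
        warehouse_orders[event["id"]] = event["data"]
    elif event["op"] == "delete":
        warehouse_orders.pop(event["id"], None)

print(warehouse_orders)  # {1: {'status': 'shipped'}}
```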
3.Idempotency
Idempotency ensures that running the same operation multiple times, such as restarting an ingestion job after a failure, has the same effect as running it once, thus avoiding duplicates.
Idempotency is achieved with techniques such as upserts and unique keys.
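A minimal sketch of an idempotent load using an upsert keyed on a unique ID (the records and keys are invented for illustration):

```python
# Downstream store keyed by a unique order_id.
target = {}

def upsert(records):
    """Insert or overwrite by key, so re-running the load has the same effect."""
    for record in records:
        target[record["order_id"]] = record

batch = [
    {"order_id": 101, "status": "placed"},
    {"order_id": 102, "status": "shipped"},
]

upsert(batch)       # first run
upsert(batch)       # re-run after a failure: no duplicates, same end state
print(len(target))  # 2
```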
4.OLTP vs OLAP
OLTP (Online Transaction Processing) systems prioritize speed, consistency and concurrency to ensure that operational systems remain fast and reliable. Hence, OLTP systems are optimized for handling a large number of small, quick transactions such as inserting or updating a single record.
OLAP (Online Analytical Processing) systems are designed for aggregations, trend analysis and multidimensional queries that may scan a large number of rows. Hence, OLAP systems are optimized for running complex analytical queries over large datasets.
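As a rough illustration, the snippet below uses SQLite as a stand-in for both kinds of systems (which a real deployment would not do): an OLTP-style statement touches a single row, while an OLAP-style query aggregates over many rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("US", 10.0), ("EU", 25.0), ("US", 40.0), ("APAC", 5.0)],
)

# OLTP-style: a small, quick transaction that updates a single record.
conn.execute("UPDATE orders SET amount = 12.0 WHERE id = 1")

# OLAP-style: an analytical aggregation that scans many rows.
for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region"
):
    print(region, total)
```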
5.Partitioning
Partitioning is a technique for splitting large datasets into smaller, more manageable parts based on a key such as a date. The aim is to improve query performance and manageability (see the sketch after the list below).
Common types of partitioning include:
A. Range partitioning – Divides data based on a continuous range of values (e.g., dates or numeric IDs).
B. List partitioning – Groups data based on a predefined list of values (e.g., regions: “US”, “EU”, “APAC”).
C. Hash partitioning – Uses a hash function on a key column to distribute rows evenly across partitions, improving load balancing.
D. Composite partitioning – Combines two or more partitioning strategies (e.g., range + hash) for better control.
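A minimal sketch of date-based (range) partitioning in plain Python; the keys and layout are illustrative rather than tied to any specific table format:

```python
from collections import defaultdict

# Hypothetical events with event_date used as the partition key.
events = [
    {"event_date": "2024-01-01", "value": 10},
    {"event_date": "2024-01-01", "value": 7},
    {"event_date": "2024-01-02", "value": 3},
]

# Range/date partitioning: group rows by their partition key.
partitions = defaultdict(list)
for event in events:
    partitions[event["event_date"]].append(event)

# A query for one day only needs to read that day's partition.
for day, rows in sorted(partitions.items()):
    print(f"partition event_date={day}: {len(rows)} rows")
```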
6.ETL vs ELT
ETL stands for Extract, Transform, Load, while ELT stands for Extract, Load, Transform. The two terms refer to different data pipeline strategies in data engineering.
In ETL, data is transformed before it is loaded into the target system; in ELT, data is loaded first and then transformed inside the target system.
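A toy ETL pipeline sketch follows; in the ELT variant, the raw rows would be loaded into the target first and the transformation pushed down into it (for example as SQL). The data and function names are invented for illustration.

```python
# Toy source data (invented for illustration).
raw_rows = [" alice ,10", " bob ,20"]

def extract():
    return raw_rows

def transform(rows):
    # Clean and reshape before loading (the "T" happens outside the target).
    return [{"name": name.strip(), "amount": int(amount)}
            for name, amount in (row.split(",") for row in rows)]

def load(rows, target):
    target.extend(rows)

warehouse = []                          # stand-in for the target system
load(transform(extract()), warehouse)   # ETL: transform, then load
print(warehouse)
```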
7.CAP Theorem
Distributed systems aim to provide consistency, availability and partition tolerance. The CAP theorem states that a distributed system can only guarantee two of the three at the same time:
A. Consistency (all nodes see the same data at the same time)
B. Availability (every request gets a response)
C. Partition tolerance (system continues to operate despite network failures)
Example: Apache Cassandra prioritizes availability and partition tolerance (AP), while traditional SQL databases often prioritize consistency and availability (CA).
8.Windowing in Streaming
Streaming data never ends. A window groups the data into finite chunks, e.g. data from the last 5 minutes, which makes processing easier (see the sketch after the list below).
Common window types:
A. Tumbling windows – Fixed-size, non-overlapping intervals (e.g., every 5 minutes).
B. Sliding windows – Overlapping intervals that “slide” forward, useful for rolling metrics.
C. Session windows – Group events that occur within a defined inactivity gap, useful for user activity sessions.
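A minimal sketch of tumbling windows, bucketing made-up timestamped events into fixed 5-minute intervals:

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # 5-minute tumbling windows

# Hypothetical events: (epoch_seconds, value)
events = [(0, 1), (120, 2), (310, 5), (420, 1), (650, 4)]

windows = defaultdict(int)
for ts, value in events:
    window_start = (ts // WINDOW_SECONDS) * WINDOW_SECONDS  # assign to its bucket
    windows[window_start] += value

for start, total in sorted(windows.items()):
    print(f"window [{start}, {start + WINDOW_SECONDS}): total={total}")
```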
9.DAGs and Workflow Orchestration
A DAG is a Directed Acyclic Graph: a set of tasks linked by dependencies, with a clear order and no circular paths. Workflow orchestrators like Apache Airflow or Prefect use DAGs to define, schedule, and monitor data pipelines.
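As a minimal sketch of the idea, Python's standard-library graphlib (available from Python 3.9) can compute a valid execution order for a small DAG; orchestrators like Airflow or Prefect express the same dependencies with their own APIs.

```python
from graphlib import TopologicalSorter

# Task dependencies: each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# A valid execution order with no circular paths.
for task in TopologicalSorter(dag).static_order():
    print(f"running {task}")
```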
10.Retry Logic & Dead Letter Queues
Retry logic automatically attempts to reprocess failed tasks to handle temporary failures that often resolve on their own when retried (transient errors).
Dead letter queues (DLQs) store messages that consistently fail processing for later inspection.
Example: A Kafka consumer might retry processing an event three times before sending it to a DLQ for manual review.
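A minimal sketch of retry logic with a dead letter queue in plain Python (not tied to Kafka's client API); the failing message is simulated:

```python
MAX_RETRIES = 3
dead_letter_queue = []  # messages that keep failing land here for inspection

def process(message):
    # Stand-in for real processing; raises on a bad message.
    if message == "bad":
        raise ValueError("cannot process")
    print(f"processed {message}")

def handle(message):
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            process(message)
            return
        except ValueError:
            print(f"attempt {attempt} failed for {message!r}")
    dead_letter_queue.append(message)  # give up after MAX_RETRIES tries

for msg in ["ok", "bad", "ok"]:
    handle(msg)

print("DLQ:", dead_letter_queue)  # ['bad']
```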
11.Back-filling & Reprocessing
Backfilling is the process of ingesting historical data that was missed or never processed initially. Gaps can occur because of a temporary outage, or because a new pipeline goes live and needs to be populated with past data.
Reprocessing involves rerunning processing logic on existing historical data to correct errors, apply updated transformations, or accommodate schema changes.
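A minimal sketch of a backfill loop that re-runs a daily job for each missed date; the job itself is a placeholder. If the daily job is idempotent, the backfill can safely be re-run without creating duplicates.

```python
from datetime import date, timedelta

def run_daily_job(day):
    # Placeholder for the real ingestion/processing logic for one day's partition.
    print(f"processing partition for {day}")

def backfill(start, end):
    """Re-run the daily job for every date in the missed range, inclusive."""
    current = start
    while current <= end:
        run_daily_job(current)
        current += timedelta(days=1)

# Example: an outage left a three-day gap to fill.
backfill(date(2024, 1, 1), date(2024, 1, 3))
```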
12.Data Governance
Data governance refers to the framework of rules, procedures, and best practices that guide how data is managed to maintain its accuracy, protect it from unauthorized access, ensure confidentiality, and meet regulatory obligations.
Examples of data governance frameworks are: Control Objectives for Information and Related Technologies (COBIT), the Data Management Capability Assessment Model (DCAM) and the NIST Privacy Framework.
13.Time Travel & Data Versioning
Time travel and data versioning are features in modern data warehouses and table formats (such as Snowflake, Delta Lake, and Apache Iceberg) that allow you to access and query historical versions of data. This means you can “look back in time” to see the state of your dataset at a specific moment, or maintain multiple dataset versions for auditing, debugging, or recovery purposes.
Why it matters:
A. Simplifies auditing and compliance reporting.
B. Helps debug data issues by comparing historical states.
C. Enables safe experimentation without risking permanent data loss.
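As a rough sketch, Delta Lake exposes time travel through a version (or timestamp) option when reading a table. This assumes a Spark session already configured with Delta Lake and an existing table at the hypothetical path used below.

```python
from pyspark.sql import SparkSession

# Assumes Delta Lake is installed and configured on this Spark session.
spark = SparkSession.builder.appName("time-travel-demo").getOrCreate()

table_path = "/data/orders_delta"  # hypothetical Delta table location

current = spark.read.format("delta").load(table_path)
as_of_v0 = (
    spark.read.format("delta")
    .option("versionAsOf", 0)      # read the table as it looked at version 0
    .load(table_path)
)

print(current.count(), as_of_v0.count())
```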
14.Distributed Processing Concepts
Distributed processing splits a workload across multiple machines to handle large-scale data efficiently. Concepts include:
Sharding: Splitting data across nodes.
Replication: Keeping copies of data for fault tolerance.
MapReduce: Dividing a task into smaller “map” tasks, then combining results in a “reduce” step.
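A toy word count in plain Python showing the map and reduce steps; real frameworks distribute the map tasks and the shuffle/reduce across many machines.

```python
from collections import Counter
from functools import reduce

documents = ["big data big pipelines", "data pipelines scale"]

# Map: each "worker" turns its document into partial word counts.
mapped = [Counter(doc.split()) for doc in documents]

# Reduce: combine the partial results into the final counts.
totals = reduce(lambda a, b: a + b, mapped, Counter())

print(totals)  # Counter({'big': 2, 'data': 2, 'pipelines': 2, 'scale': 1})
```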