A Recap of Data Engineering Concepts



This content originally appeared on DEV Community and was authored by Mwirigi Eric

As a data engineer, an in-depth comprehension of the field's core concepts plays a pivotal role both in carrying out daily tasks and in progressing your career.

In this article, we explore these concepts and explain their meanings, applications, and relevance to data engineering.

1. Batch versus Streaming Ingestion

Data Ingestion refers to the process of collecting data from various sources and moving it to a target destination, either in batches or in real time. The two primary ingestion methods are batch and streaming.

Batch Ingestion involves collecting and processing data in chunks, where runs are either scheduled or triggered automatically.

The method is effective for resource-intensive and repetitive jobs and is ideal for applications such as data warehousing and ETL processes.

Streaming Ingestion involves collecting and processing data as it is received, i.e. in real time, from source to target.

The method is effective for applications that require real-time data processing, such as network traffic monitoring, fraud detection, and mobile money services like Mpesa.

2. Change Data Capture (CDC)

This is a data integration pattern that captures only the changes made to data (through insert, update, and delete statements) and represents these changes as a list called a CDC feed.

The CDC approach focuses on incremental changes, ensuring that target systems always have access to the most current data.

CDC is usually implemented through four methods, namely:

  • Log-based method, which reads a database's transaction logs.
  • Trigger-based method, which uses database triggers attached to source table events.
  • Time-based method, which relies on a dedicated column that records the last-modified time for each record.
  • Polling-based method, which periodically queries the source for changes based on a timestamp or version column (see the sketch after this list).

3. Idempotency

Idempotency is the property of an operation producing the same result regardless of how many times it is run. It ensures that data processes are repeatable and predictable, which keeps data consistent across distributed systems and makes failures and retries safe to handle.

4. OLTP versus OLAP Data Processing Systems

  • Online Transaction Processing (OLTP) systems are designed for real-time management of transactions from many concurrent users, and focus on data consistency and fast retrieval of individual records. Examples of OLTP use cases include ATM withdrawals, online banking, and e-commerce purchases.

  • Online Analytical Processing (OLAP) systems are designed to analyze aggregated historical data from various sources for data analysis, reporting, and business intelligence. A real-world use case of an OLAP system is the Netflix movie recommendation system.

5. Columnar versus Row-based Storage

Data storage and management adopts two formats: columnar and row-based storage.

  • Columnar Storage organizes data by columns, with each column stored separately so the system can read or write specific columns independently. Common use cases include data warehouses such as Google BigQuery and Amazon Redshift.

The advantages of columnar storage include faster aggregations, efficient data compression, and efficient analytical queries (the format is optimized for read-heavy workloads).

The disadvantages include inefficiency for transactional operations and greater complexity in implementation and management.

  • Row-based Storage organizes data by rows, where each row stores a complete record. This format is used in relational databases such as PostgreSQL and MySQL.

The advantages of row-based storage include simple data access and efficiency for transactional operations.

On the other hand, the row-based format is inefficient for analytical queries and may use more space if not optimized properly.

6. Partitioning

Data Partitioning involves dividing data into smaller segments (partitions), where each partition contains a subset of the entire dataset.

Partitioning data helps improve query performance by limiting retrieval to only the relevant partitions, which reduces server workload and accelerates data processing.

Data can be partitioned using the following methods:

  • Horizontal Partitioning (Row-based) – Splits tables by rows, so each partition has the same columns but different records; rows are assigned to partitions based on a partition key (see the sketch after this list).
  • Vertical Partitioning (Column-based) – Splits tables by columns allowing the reading/writing of columns independently.
  • Functional Partitioning – Partitions data according to operational requirements, with each partition containing data specific to a particular function.

7. ETL versus ELT

The Extract, Transform and Load (ETL) and Extract, Load and Transform (ELT) methods are the two common approaches for data integration, with their major difference being the order of operations.

  • Extract, Transform and Load (ETL) – transforms data (on a separate processing server) before loading it into a target system such as a data warehouse. A real-world ETL use case is e-commerce, where ETL pipelines help businesses gain insights into customer behavior and preferences.

  • Extract, Load and Transform (ELT) – performs data transformations directly within the data warehouse itself. It allows raw data to be sent directly to the warehouse, eliminating the need for separate staging processes. A real-world ELT use case is cloud data warehouses such as Snowflake and Amazon Redshift.

8. CAP Theorem

The CAP theorem (Brewer's theorem) states that no distributed data system can concurrently guarantee all three of the desirable properties of Consistency, Availability, and Partition Tolerance.

  • Consistency – Means that all clients see the same data at the same time, no matter which node they connect to.
  • Availability – Means that any client making a request for data gets a response, even if one or more nodes are down. 
  • Partition Tolerance – Means that the cluster must continue to work despite any number of communication breakdowns between nodes in the system.

Therefore, in a distributed system only two of the properties can be guaranteed simultaneously, which guides developers in prioritizing the properties that best suit their needs.

9. Windowing in Streaming

In streaming, windowing is a way of grouping events into time-based collections, or windows, where windows are specified by time intervals or by a number of records.

Windowing allows stream processing applications to break down continuous data streams into manageable chunks for processing and analysis.

10. DAGs and Workflow Orchestration

A Directed Acyclic Graph (DAG) is a conceptual representation of a set of tasks and their ordering, expressed as a graph in which nodes are linked by one-way connections that do not form any cycles.

DAGs represent tasks as nodes and dependencies as edges, enforcing a logical execution order in which each task runs only after the tasks it depends on have completed.

In data engineering, DAGs are used to orchestrate ETL processes and to manage complex data workflows that involve multiple tasks and dependencies, such as a machine learning workflow.

11. Retry Logic and Dead Letter Queues

  • Retry logic – is a method used in software and systems development to automatically attempt an action again if it fails the first time. This helps to handle temporary issues, such as network interruptions or unavailable services, by giving the process another chance to succeed.

  • Dead Letter Queue – is a special type of message queue that stores messages that fail to be processed successfully by consumers. It acts as a safety net for handling failed messages and helps in debugging and retrying failed tasks.

A real-world use case of the dead letter queue pattern is Uber, which leverages Apache Kafka to implement non-blocking request reprocessing with dead letter queues, achieving decoupled, observable error handling without disrupting real-time traffic.

12. Backfilling and Reprocessing

  • Data Backfilling involves filling in historical data that is missing from the system or correcting data that has gone stale.

  • Data Reprocessing involves running already-ingested data through the pipeline again, for example after a bug fix or a change in transformation logic, so that downstream outputs reflect the corrected results.

13. Data Governance

Data governance is a framework of principles that manage data throughout its lifecycle, from collection and storage to processing and disposal.

Why it matters: without effective data governance, data inconsistencies across an organization's different systems may go undetected and unresolved.

14. Data Versioning and Time Travel

  • Data Versioning involves creating a unique reference (such as a query or ID) for a collection of data, which allows faster development and processing while reducing errors.

  • The Time Travel feature in data versioning enables versioning of big data stored in a data lake, allowing access to any historical version of the data and providing robust data management through rollback capabilities for bad writes or deletes.

15. Distributed Processing Concepts

Distributed data processing involves handling and analyzing data across multiple interconnected devices or nodes, leveraging their collective computing power.

Benefits of distributed processing include scalability, robust fault tolerance, enhanced system performance and efficient handling of large volumes of data.

Real-world applications of distributed processing include: fraud detection and risk management, personalized product recommendations in e-commerce, and network monitoring and optimization.

