15 Must-Know Data Engineering Tricks



This content originally appeared on DEV Community and was authored by Hilary Wambwa

Think of data engineering as the behind-the-scenes hero that makes sense of the massive amounts of data modern companies deal with every day.
Here are 15 tricks that will help you navigate this field.

Batch vs Streaming Ingestion

Batch Ingestion: Fixed chunks of data are ingested into a system on a fixed schedule or on demand.
Example: processing daily sales data for a retail company from a transactional database to a warehouse.
Streaming Ingestion: Data or events are ingested into a system in real time or near real time, as they occur or in response to a trigger.
Example: processing data from a temperature-monitoring IoT sensor in a greenhouse.
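
Below is a minimal sketch contrasting the two styles in Python, using a CSV extract as the batch source and a simple event iterator standing in for a real stream; the file, table and field names are hypothetical.

```python
import csv
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (order_id TEXT, amount REAL)")

def batch_ingest(path="daily_sales.csv"):
    """Batch: load the whole daily extract in one scheduled run."""
    with open(path) as f:
        rows = [(r["order_id"], float(r["amount"])) for r in csv.DictReader(f)]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

def stream_ingest(events):
    """Streaming: handle each event as soon as it arrives."""
    for event in events:  # e.g. messages read off a broker
        conn.execute("INSERT INTO sales VALUES (?, ?)",
                     (event["order_id"], event["amount"]))
        conn.commit()  # commit per event for near-real-time visibility
```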

Change Data Capture (CDC)

A style of data movement in which every change (insert, update, delete) in a source is captured, often in real time, and applied to a target without reprocessing the entire dataset.
Three ways to perform CDC:

  • Log-based: Every database transaction is recorded in the database’s transaction log. Changes are read from the log and moved to the target. Efficient, with no impact on the source system.
  • Query-based: The source is queried on a timestamp column to pick up changes; the source must have a column tracking the last modification time.
  • Triggers: Database triggers log the changes in a separate audit table, which is then read and propagated to the target database.

Example: A healthcare institution stores patient records in a transactional database. This data is moved to an Amazon Redshift warehouse via log-based change data capture, preserving patients’ historical data for tracking and compliance.
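
As an illustration, here is a minimal sketch of query-based CDC (the easiest variant to show in plain Python), using SQLite; the patients table, its updated_at column and the watermark handling are hypothetical.

```python
import sqlite3

source = sqlite3.connect("source.db")
target = sqlite3.connect("target.db")
target.execute("""CREATE TABLE IF NOT EXISTS patients
                  (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)""")

def incremental_sync(last_sync: str) -> str:
    """Pull only rows modified since the last sync and upsert them into the target."""
    changes = source.execute(
        "SELECT id, name, updated_at FROM patients WHERE updated_at > ?",
        (last_sync,),
    ).fetchall()
    target.executemany(
        "INSERT INTO patients VALUES (?, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name, updated_at = excluded.updated_at",
        changes,
    )
    target.commit()
    # The newest timestamp seen becomes the watermark for the next run.
    return max((row[2] for row in changes), default=last_sync)
```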

Idempotency

Given the same input, a data pipeline or process should produce the same result no matter how many times it runs, including reruns after failures (network, server or API errors).
Without idempotency, scheduled retries could lead to duplicated data, inconsistent or incomplete data states, and wasted compute and storage.
How to practice idempotency:

  • Use unique IDs for each data operation/record.
  • Verify the state of the target data to confirm whether incoming data has already been applied.
  • Deduplication: Kafka’s exactly-once semantics ensure messages are processed only once by tracking offsets and transaction IDs.
  • Use unique constraints/upsert operations in databases, e.g. INSERT … ON CONFLICT DO NOTHING/UPDATE in SQL, to avoid duplicates (see the sketch below).
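
A minimal sketch of the upsert approach, assuming SQLite (which supports ON CONFLICT since version 3.24); the orders table and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT PRIMARY KEY, amount REAL)")

def load(rows):
    """Re-running this load with the same rows leaves the table unchanged."""
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?) ON CONFLICT(order_id) DO NOTHING",
        rows,
    )
    conn.commit()

rows = [("o-1", 100.0), ("o-2", 250.0)]
load(rows)
load(rows)  # retry after a simulated failure: no duplicates are created
```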

OLTP vs OLAP

OLTP (Online Transaction Processing) systems are designed for high volumes of small, frequent read-write transactions (insert, update, delete), typically in operational databases.
OLAP (Online Analytical Processing) systems are designed to optimize complex queries and aggregations over large historical datasets, and are typically used in data warehouses.

Columnar vs Row-based Storage

To understand this, let’s consider a table with sales data: 1 million rows and the following columns (customer_id, date, amount, order_id).

In columnar storage, each column is stored separately, i.e. all 1 million customer_id values are stored together, separate from date, amount and order_id. This is perfect for OLAP systems where aggregation is common: a SUM(amount) query reads only the amount column and skips the rest, minimizing disk I/O.

In row-based storage, data is stored by row, with all attributes of one record stored together. This is perfect for OLTP, where read-write operations usually touch entire records.
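
To see the columnar benefit in practice, here is a small sketch assuming pyarrow is installed and the sales table has been written to Parquet (a columnar file format); the file path is hypothetical.

```python
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Reads only the 'amount' column from disk; the other columns are never scanned.
table = pq.read_table("sales.parquet", columns=["amount"])
total = pc.sum(table["amount"])  # equivalent of SELECT SUM(amount)
print(total.as_py())
```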

Partitioning

This is dividing data into smaller partitions using attributes such as date or geographical region. It optimizes querying and processing: on a large dataset, the system scans only the relevant partitions instead of the whole dataset (partition pruning).

  • Horizontal partitioning: Splitting a table into subsets of rows based on attributes such as date (see the sketch after this list).
  • Vertical partitioning: Splitting a table into sets of columns; the most frequently queried columns are stored in one partition and the less frequently queried ones in another.
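
A minimal sketch of horizontal (date-based) partitioning using a plain directory layout, which is how many lake and warehouse engines organize partitions on disk; the paths and fields are hypothetical.

```python
import csv
from collections import defaultdict
from pathlib import Path

def write_partitioned(rows, root="sales"):
    """Write one file per date partition, e.g. sales/date=2024-05-01/part.csv."""
    by_date = defaultdict(list)
    for row in rows:
        by_date[row["date"]].append(row)
    for date, part in by_date.items():
        folder = Path(root) / f"date={date}"
        folder.mkdir(parents=True, exist_ok=True)
        with open(folder / "part.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=part[0].keys())
            writer.writeheader()
            writer.writerows(part)

def read_partition(date, root="sales"):
    """Partition pruning: open only the folder for the requested date."""
    with open(Path(root) / f"date={date}" / "part.csv") as f:
        return list(csv.DictReader(f))
```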

ETL vs ELT

ETL (Extract, Transform, Load): Extract data from the source, transform it, then load it into the target. Common with structured data warehouses.
ELT (Extract, Load, Transform): Extract data from the source, load it into the target, then transform it there. Common with cloud data lakes and warehouses, which leverage the target's compute power.
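
A minimal sketch of the difference in ordering, using an in-memory SQLite database as the "warehouse"; the table names and cleaning rule are hypothetical.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
raw = [{"order_id": "o-1", "amount": 100.456}, {"order_id": "o-2", "amount": -5.0}]

def etl(rows):
    """ETL: transform in the pipeline first, then load only the clean result."""
    cleaned = [(r["order_id"], round(r["amount"], 2)) for r in rows if r["amount"] > 0]
    warehouse.execute("CREATE TABLE sales_etl (order_id TEXT, amount REAL)")
    warehouse.executemany("INSERT INTO sales_etl VALUES (?, ?)", cleaned)

def elt(rows):
    """ELT: load the raw data as-is, then transform inside the warehouse with SQL."""
    warehouse.execute("CREATE TABLE raw_sales (order_id TEXT, amount REAL)")
    warehouse.executemany("INSERT INTO raw_sales VALUES (?, ?)",
                          [(r["order_id"], r["amount"]) for r in rows])
    warehouse.execute("""CREATE TABLE sales_elt AS
                         SELECT order_id, ROUND(amount, 2) AS amount
                         FROM raw_sales WHERE amount > 0""")

etl(raw)
elt(raw)
```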

CAP Theorem

In a distributed system where data is stored across different nodes, when a communication failure between nodes occurs, the system must choose between consistency (all nodes show the same data) and availability (the system remains operational). Partition tolerance is a given, so the trade-off is between consistency and availability.

Examples:
Consistency: Every read retrieves the most recent write. Crucial for data accuracy. Consider banking: if a user transfers Ksh.1000, all nodes must reflect it immediately.
Availability: The system must remain operational and keep responding to requests, favouring responsiveness over accuracy. A social media platform keeps serving posts even when some nodes are disconnected, sometimes showing outdated ones.

Windowing in Streaming

When streaming, data arrives fast and continuously. Calculating aggregates such as averages or sums requires bounded chunks of data. Windowing divides the stream into these bounded chunks (windows), which can be based on:

  • Time: e.g. 5-minute intervals
  • Count: Every 100 events
  • Session: A user’s browsing/shopping session.

Example:
A ride-hailing company uses time-based windows to adjust prices based on real-time demand (e.g. ride duration or number of requests per window).
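
A minimal sketch of a time-based (tumbling) window in plain Python; real pipelines would use a stream processor such as Flink or Spark Structured Streaming, and the event fields here are hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # 5-minute tumbling windows

def requests_per_window(events):
    """Count ride requests per 5-minute window, keyed by the window's start time."""
    counts = defaultdict(int)
    for event in events:  # each event: {"ts": unix_seconds, "type": "ride_request"}
        window_start = event["ts"] - (event["ts"] % WINDOW_SECONDS)
        counts[window_start] += 1
    return dict(counts)

events = [{"ts": 1_700_000_010, "type": "ride_request"},
          {"ts": 1_700_000_050, "type": "ride_request"},
          {"ts": 1_700_000_400, "type": "ride_request"}]
print(requests_per_window(events))  # two windows: the first holds 2 events, the second 1
```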

DAGs and Workflow Orchestration

A Directed Acyclic Graph (DAG) uses nodes and directed edges to define an execution order for tasks while avoiding cycles, meaning tasks cannot loop back to earlier steps.
Workflow orchestration is the managing and scheduling of these tasks; Apache Airflow is an example of a workflow orchestration tool.
For example, a DAG written in Python extracts air quality data from an API, transforms it and loads it into a database. The Airflow scheduler enforces the execution order while the web UI monitors the DAG. If a task fails, Airflow logs the error, retries the task and displays or sends an alert.
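
A minimal sketch of such a DAG, assuming a recent Airflow 2.x install; the API call, transformation and load logic are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # Placeholder: call the air quality API and persist the raw response.
    ...

def transform():
    # Placeholder: clean and reshape the raw readings.
    ...

def load():
    # Placeholder: write the transformed rows to the database.
    ...

with DAG(
    dag_id="air_quality_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract, retries=3)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load  # execution order: no cycles allowed
```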

Retry Logic & Dead Letter Queues

Retry logic is re-attempting a failed task automatically, usually after a configurable delay rather than immediately, with the wait time growing exponentially on each attempt (exponential backoff), e.g. 1s, 2s, 4s, 8s. This avoids overwhelming the target system with retries and gives it time to recover. A dead letter queue (DLQ) holds messages or records that still fail after all retries, so they can be inspected and reprocessed later instead of being lost or blocking the pipeline.
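
A minimal sketch of exponential backoff with a dead letter queue in plain Python; the send_to_warehouse call and the list-based DLQ are stand-ins for a real sink and queue.

```python
import time

dead_letter_queue = []  # stand-in for a real DLQ such as an SQS queue or Kafka topic

def send_to_warehouse(record):
    """Placeholder sink that may raise on transient failures."""
    ...

def process_with_retries(record, max_attempts=4, base_delay=1.0):
    for attempt in range(max_attempts):
        try:
            send_to_warehouse(record)
            return True
        except Exception as exc:
            if attempt == max_attempts - 1:
                break                           # retries exhausted
            delay = base_delay * 2 ** attempt   # exponential backoff: 1s, 2s, 4s, 8s
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay}s")
            time.sleep(delay)
    dead_letter_queue.append(record)            # park the record for later inspection
    return False
```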

Backfilling & Reprocessing

Backfilling is loading historical data into a system to fill gaps or correct errors, especially when introducing a new pipeline.
Reprocessing, on the other hand, is re-running pipelines on existing data to fix errors, improve accuracy or apply new transformations.

Example: A marketing firm introduces a new analytics platform and has to backfill years of campaign data for historical analysis.
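
A minimal sketch of a backfill loop that re-runs a daily pipeline over a historical date range; run_daily_pipeline is a placeholder for the real job.

```python
from datetime import date, timedelta

def run_daily_pipeline(day: date):
    """Placeholder: extract, transform and load campaign data for a single day."""
    ...

def backfill(start: date, end: date):
    """Run the daily pipeline once per historical day, oldest first."""
    day = start
    while day <= end:
        run_daily_pipeline(day)
        day += timedelta(days=1)

backfill(date(2022, 1, 1), date(2022, 12, 31))
```

Because each day is processed independently, making the daily load idempotent means any date that fails partway can safely be re-run (reprocessed).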

Data Governance

These are rules that define the management and protection of an organization’s data to ensure security, quality and compliance.
It defines who owns the data, who can access it and what is included in the metadata.

Time Travel & Data Versioning

Time travel: This is the ability to query data as it existed at a point in the past. Imagine debugging a pipeline and needing to see what your data looked like last week, before a bug corrupted it; time travel lets you query the data as of that time without altering it.
Data versioning: Tracking changes to a dataset over time, similar to version control in software engineering (e.g., Git). Each change creates a new version of the data, allowing analysis of how it has evolved.
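
A minimal sketch of time travel, assuming a Delta Lake table on Spark with the Delta extensions configured (Snowflake and BigQuery offer similar AT / FOR SYSTEM_TIME syntax in SQL); the table path and timestamp are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake is configured on this session

# Current state of the table.
current = spark.read.format("delta").load("/data/patients")

# The same table as it looked last week, before the bug corrupted it.
last_week = (
    spark.read.format("delta")
    .option("timestampAsOf", "2024-05-01 00:00:00")
    .load("/data/patients")
)

# Compare row counts to gauge how much the data changed.
print(current.count(), last_week.count())
```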

Distributed Processing Concepts

Distributed processing is carrying out complex tasks on large data using multiple connected nodes (servers or cloud instances), dividing the work into smaller tasks that run in parallel. The opposite is centralized processing, where one machine carries out all the tasks.
Distributed processing relies on several concepts:

  • Parallel processing: Executing several tasks simultaneously, as in the sketch after this list.
  • Partitioning: Dividing data into small chunks, where each node processes a partition independently.
  • Fault tolerance: The ability to continue operating despite failures. If a node fails while processing, another node recomputes its task.
  • Load balancing: Distributing tasks evenly across nodes to avoid bottlenecks and delays.
  • Distributed coordination: Tasks across nodes have to be kept in sync; Apache ZooKeeper is a good example of a coordination service.
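
A minimal, single-machine sketch of the partition-and-process-in-parallel idea using Python's multiprocessing pool; real engines such as Spark apply the same pattern across many nodes.

```python
from multiprocessing import Pool

def process_partition(partition):
    """Each worker handles one partition independently (here, summing its values)."""
    return sum(partition)

def split(data, n_partitions):
    """Partitioning: divide the data into roughly equal chunks."""
    size = (len(data) + n_partitions - 1) // n_partitions
    return [data[i:i + size] for i in range(0, len(data), size)]

if __name__ == "__main__":
    data = list(range(1_000_000))
    partitions = split(data, 4)
    with Pool(processes=4) as pool:   # 4 worker processes stand in for 4 nodes
        partial_sums = pool.map(process_partition, partitions)
    print(sum(partial_sums))          # combine the partial results
```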

