
Mastering Data Compaction in Apache Iceberg: A Production Deep Dive

Introduction

The relentless growth of data in modern applications presents a significant engineering challenge: maintaining query performance on ever-increasing datasets. We recently faced this at scale while building a real-time analytics platform for a large e-commerce company. Ingesting billions of events daily into a data lake, we observed rapidly degrading query latency on historical data. Simple partitioning wasn’t enough. The root cause? Excessive small files and inefficient data layouts. This led us to deeply investigate and optimize data compaction strategies within Apache Iceberg, a table format designed for large analytic datasets. This post details our journey, focusing on the architectural considerations, performance tuning, and operational realities of Iceberg compaction. We’ll cover how it fits into a modern data lakehouse architecture built on Spark, Flink, and cloud storage (AWS S3 in our case).

What is Data Compaction in Big Data Systems?

Data compaction, in the context of table formats like Iceberg, Delta Lake, and Hudi, is the process of rewriting many small files into fewer, larger, more efficient files. This matters for several reasons: small files inflate metadata overhead (listing files in object storage is expensive), multiply I/O operations (more files to open and read), and reduce parallelism during query execution. Iceberg tracks data through layered metadata (manifest lists, manifest files, and data files), and frequent small writes bloat these layers.

Compaction rewrites the data files and updates the table metadata to reference the new, larger files. In Iceberg it is typically run as a scheduled maintenance job and relies on snapshot isolation for consistency. At the protocol level, a compaction job reads the existing data files, rewrites them into new files (typically Parquet or ORC), and then atomically commits new table metadata that points to the rewritten files.
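
To make this concrete, here is a minimal sketch of triggering a compaction run from Spark via Iceberg's rewrite_data_files procedure. The catalog name my_catalog and the table db.events are placeholders, and the option values are illustrative rather than recommendations:

```python
# Minimal sketch: trigger Iceberg's rewrite_data_files maintenance procedure from
# PySpark. "my_catalog" and "db.events" are placeholder names; assumes the Iceberg
# Spark runtime jar is on the classpath and the catalog points at a Hive Metastore.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-compaction-sketch")
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.my_catalog.type", "hive")
    .getOrCreate()
)

# Rewrite small data files into ~256 MB files. These are options of the
# rewrite_data_files action, not table properties.
spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table   => 'db.events',
        options => map(
            'target-file-size-bytes', '268435456',
            'min-input-files', '5'
        )
    )
""")
```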

Real-World Use Cases

  1. Clickstream Analytics: Ingesting high-velocity clickstream data requires frequent, small writes. Compaction consolidates these events into larger files for efficient analysis of user behavior.
  2. IoT Sensor Data: Similar to clickstream, IoT data often arrives in bursts. Compaction is vital for optimizing queries on time-series data.
  3. Change Data Capture (CDC) Pipelines: CDC pipelines often write updates and inserts as individual records. Compaction merges these changes into larger, optimized files.
  4. Log Analytics: Aggregating logs from numerous sources generates many small files. Compaction improves query performance for security monitoring and troubleshooting.
  5. Machine Learning Feature Stores: Frequent updates to feature values necessitate compaction to maintain efficient access to training data.

System Design & Architecture

```mermaid
graph LR
    A["Data Sources (Kafka, CDC)"] --> B("Spark Streaming / Flink");
    B --> C{"Iceberg Table (S3)"};
    C --> D["Query Engines (Spark, Trino/Presto)"];
    E[Iceberg Compaction Daemon] --> C;
    subgraph Data Lakehouse
        A
        B
        C
        D
        E
    end
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style B fill:#ccf,stroke:#333,stroke-width:2px
    style C fill:#ffc,stroke:#333,stroke-width:2px
    style D fill:#cff,stroke:#333,stroke-width:2px
    style E fill:#fcc,stroke:#333,stroke-width:2px
```

This diagram illustrates a typical Iceberg-based data lakehouse. Data is ingested from sources like Kafka or CDC systems using stream processing frameworks (Spark Streaming or Flink). The data is written to an Iceberg table stored in object storage (S3). Query engines like Spark and Trino/Presto read data directly from the Iceberg table. A dedicated Iceberg compaction daemon (often scheduled via Airflow or a similar workflow orchestrator) periodically triggers compaction jobs to optimize the table.

On AWS, this translates to:

  • Data Sources: Kafka on MSK, Debezium for CDC.
  • Processing: Spark on EMR, Flink on EMR.
  • Storage: S3.
  • Compaction: Spark jobs scheduled by Airflow.
  • Querying: Athena, Redshift Spectrum, Spark on EMR.
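
As a sketch of the scheduling piece, the hypothetical Airflow DAG below submits a daily compaction job to Spark. The DAG id, connection id, and application path are placeholders, and it assumes Airflow 2.4+ with the Apache Spark provider installed:

```python
# Hypothetical Airflow DAG that submits a daily Iceberg compaction job via
# spark-submit. Connection id and S3 script path are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="iceberg_compaction",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # compact once a day; tune to your ingest rate
    catchup=False,
) as dag:
    compact_events = SparkSubmitOperator(
        task_id="compact_events_table",
        conn_id="spark_default",                               # Spark connection in Airflow
        application="s3://my-bucket/jobs/compact_events.py",   # placeholder job script
        conf={"spark.executor.memory": "8g"},
    )
```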

Performance Tuning & Resource Management

Effective compaction requires careful tuning. Here are key parameters:

  • iceberg.compaction.target-file-size: The desired size of compacted files (e.g., 256MB). Larger files reduce metadata overhead but can increase compaction latency.
  • iceberg.compaction.max-files-per-compaction: The maximum number of files to process in a single compaction job (e.g., 1000).
  • spark.sql.shuffle.partitions: Crucial for parallelizing compaction. Adjust based on cluster size (e.g., 200 for a 50-node cluster).
  • fs.s3a.connection.maximum: Controls the number of concurrent connections to S3 (e.g., 1000). Important for avoiding throttling.
  • spark.executor.memory: Allocate sufficient memory to executors to avoid OOM errors during compaction (e.g., 8g).

We found that a target file size of 256MB, combined with a maximum of 1000 files per compaction, provided a good balance between compaction latency and query performance. Monitoring S3 request metrics (using CloudWatch) is essential to identify and address throttling issues. We also implemented dynamic partitioning based on event time to further optimize data layout.
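
The sketch below shows one way these knobs can be wired together. The shuffle, executor, and S3A keys are standard Spark/Hadoop configuration settings; write.target-file-size-bytes is the stock Iceberg table property that we assume corresponds to the target compacted file size discussed above, and all values are illustrative:

```python
# Illustrative tuning sketch: standard Spark/S3A settings plus the Iceberg table
# property controlling target data file size. Values are examples, not recommendations.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-compaction-tuning")
    .config("spark.sql.shuffle.partitions", "200")                # rewrite parallelism
    .config("spark.executor.memory", "8g")                        # headroom against OOMs
    .config("spark.hadoop.fs.s3a.connection.maximum", "1000")     # avoid S3 throttling
    .getOrCreate()
)

# Target ~256 MB data files for future writes and compaction output.
spark.sql("""
    ALTER TABLE my_catalog.db.events
    SET TBLPROPERTIES ('write.target-file-size-bytes' = '268435456')
""")
```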

Failure Modes & Debugging

Common failure modes include:

  • Data Skew: Uneven data distribution can lead to some tasks taking significantly longer than others, causing job failures. Solution: Salt the partitioning key to distribute data more evenly (see the sketch after this list).
  • Out-of-Memory Errors: Insufficient executor memory can cause OOM errors during compaction. Solution: Increase executor memory or reduce the number of files processed per compaction.
  • Job Retries: Transient network errors or S3 throttling can cause job retries. Solution: Implement exponential backoff and retry mechanisms.
  • DAG Crashes: Errors in the compaction logic can cause the entire DAG to crash. Solution: Thoroughly test compaction jobs and monitor logs for errors.
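
Here is a hedged sketch of the salting approach mentioned above. The DataFrame and column names (events_df, event_date) are hypothetical, and the number of salts would be tuned to the skew observed in the Spark UI:

```python
# Sketch of key salting: append a small random salt so that work for a hot key
# spreads across more tasks. "events_df" is an assumed existing DataFrame; the
# salt column is dropped again before the data is written.
from pyspark.sql import functions as F

N_SALTS = 16  # tune to the observed skew

salted = (
    events_df
    .withColumn("salt", (F.rand() * N_SALTS).cast("int"))
    .repartition("event_date", "salt")  # spread each hot event_date over N_SALTS tasks
)

# Write with the salted layout, then drop the helper column.
salted.drop("salt").writeTo("my_catalog.db.events").append()
```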

Debugging tools:

  • Spark UI: Provides detailed information about task execution, memory usage, and shuffle statistics.
  • Iceberg CLI: Useful for inspecting table metadata and verifying compaction status.
  • CloudWatch Logs: Collects logs from Spark executors and compaction jobs.
  • Datadog/Prometheus: Monitoring metrics like S3 request latency, executor memory usage, and compaction job duration.

Data Governance & Schema Management

Iceberg’s schema evolution capabilities are critical. We use a schema registry (Confluent Schema Registry) to manage schema changes. Compaction jobs automatically handle schema evolution, ensuring backward compatibility. We leverage Iceberg’s metadata to track schema history and allow queries to access data written with different schemas. Metadata is stored in the Hive Metastore, providing a centralized catalog for all our data assets.
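
For illustration, a short sketch of what schema evolution looks like against an Iceberg table from Spark SQL; the table and column names are placeholders, and spark is the session configured in the earlier sketches:

```python
# Schema evolution is metadata-only in Iceberg: existing data files are not rewritten,
# and old rows simply read the new column as NULL. Names below are placeholders.
spark.sql("""
    ALTER TABLE my_catalog.db.events
    ADD COLUMNS (device_type string COMMENT 'added after initial launch')
""")

# Renames are also metadata-only; compaction jobs carry the current schema forward.
spark.sql("ALTER TABLE my_catalog.db.events RENAME COLUMN device_type TO device_category")
```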

Security and Access Control

We enforce data encryption at rest using S3 KMS keys. Access control is managed using IAM policies, granting specific permissions to different users and roles. We also leverage Apache Ranger to implement fine-grained access control at the table and column level. Audit logging is enabled to track all data access and modification events.

Testing & CI/CD Integration

We use Great Expectations to validate data quality before and after compaction. DBT tests are used to verify data transformations. Compaction jobs are integrated into our CI/CD pipeline, with automated regression tests to ensure that compaction doesn’t introduce any data inconsistencies. We have staging environments where we test new compaction configurations before deploying them to production.
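
As an example of the kind of check involved, here is a minimal sketch of one invariant we assert around compaction: a rewrite must not change the logical row count. It uses plain PySpark rather than a full Great Expectations suite, and the table name is a placeholder:

```python
# Invariant check around compaction: the rewrite must preserve the logical row count.
# "spark" is an existing SparkSession with the Iceberg catalog configured.
TABLE = "my_catalog.db.events"

rows_before = spark.table(TABLE).count()

spark.sql("CALL my_catalog.system.rewrite_data_files(table => 'db.events')")

rows_after = spark.table(TABLE).count()
assert rows_before == rows_after, (
    "Compaction changed row count: {} -> {}".format(rows_before, rows_after)
)
```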

Common Pitfalls & Operational Misconceptions

  1. Ignoring Compaction: Assuming partitioning alone is sufficient. Symptom: Slow query performance on historical data. Mitigation: Implement a regular compaction schedule.
  2. Overly Aggressive Compaction: Compacting too frequently can consume excessive resources. Symptom: High cluster utilization, increased infrastructure costs. Mitigation: Tune iceberg.compaction.target-file-size and iceberg.compaction.max-files-per-compaction.
  3. Insufficient Executor Memory: Leading to OOM errors. Symptom: Compaction jobs failing with OOM exceptions. Mitigation: Increase executor memory.
  4. S3 Throttling: Causing job retries and increased latency. Symptom: High S3 request latency, 429 errors in CloudWatch. Mitigation: Increase fs.s3a.connection.maximum and implement exponential backoff.
  5. Ignoring Data Skew: Resulting in uneven task execution and job failures. Symptom: Long-running tasks, skewed shuffle statistics in Spark UI. Mitigation: Salt the partitioning key.

Enterprise Patterns & Best Practices

  • Data Lakehouse vs. Warehouse: Iceberg enables a data lakehouse architecture, combining the flexibility of a data lake with the performance and governance of a data warehouse.
  • Batch vs. Micro-Batch vs. Streaming: Choose the appropriate ingestion pattern based on data velocity and latency requirements.
  • File Format Decisions: Parquet is generally preferred for analytical workloads due to its columnar storage and efficient compression.
  • Storage Tiering: Move infrequently accessed data to cheaper storage tiers (e.g., S3 Glacier) to reduce costs.
  • Workflow Orchestration: Use Airflow or Dagster to schedule and manage compaction jobs.

Conclusion

Mastering data compaction in Iceberg is essential for building reliable, scalable, and performant Big Data infrastructure. By carefully tuning compaction parameters, monitoring performance metrics, and implementing robust error handling, you can ensure that your data lakehouse remains efficient and responsive to evolving business needs. Next steps include benchmarking different compaction configurations, introducing schema enforcement to improve data quality, and making fuller use of Iceberg features such as hidden partitioning for further optimization.

