Fodder on the Feuding Format Fracas is Fitting For Finding Final Boss Forensics
or...Why We Chose Hudi for Cybersecurity Data. In Part 1 we covered serialization. Now the part everyone actually argues about.
Part 2 of 2. You can read about the Avro selection in Part 1 here.
Iceberg, Delta Lake, and Hudi are all solid table formats. For most use cases any of them will serve you well — the differences only start to matter at the edges. This post is about our edges, and why we landed where we did.
What We Needed the Table Layer to Do
Before evaluating anything, we wrote down the requirements specific to our platform. Not features, but the actual behaviors we needed in production. A few of them are below:
Continuous streaming ingest. Detection latency starts at the table layer. We couldn’t afford a batch-oriented architecture.
Upserts. Security events change after they land. Threat intel enrichment, identity correlation, analyst annotations — a record that arrives at T+0 might look very different by T+60 seconds. The table format needed to handle that cleanly.
Incremental processing. When a new detection rule is written, we backfill it against historical data. Reprocessing the full dataset every time isn’t viable at our scale. We needed “give me everything that changed since timestamp X” as a native operation.
Timeline and auditability. In security, when a record arrived and what it looked like before enrichment are forensic questions, not optional metadata. We needed that built into the table, not maintained separately.
Avro compatibility. We covered the serialization decision in Part 1. The table format needed to work with it.
Iceberg
Iceberg’s metadata architecture is well-designed: hidden partitioning, partition evolution without rewriting data, snapshot isolation, and broad engine support across Spark, Flink, Trino, Dremio, and Snowflake. If you need multiple query engines reading the same tables, Iceberg is the most credible answer right now.
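To make “hidden partitioning” concrete, here’s a minimal sketch through Spark SQL. It assumes a Spark session with the Iceberg runtime and a catalog already configured; the catalog, table, and column names are all illustrative, not ours.

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg runtime jar and a catalog named "ice" are configured.
spark = SparkSession.builder.getOrCreate()

# Partitioning is declared as a transform of a regular column...
spark.sql("""
    CREATE TABLE ice.sec.events (
        event_id STRING,
        host     STRING,
        ts       TIMESTAMP)
    USING iceberg
    PARTITIONED BY (days(ts))
""")

# ...so readers filter on ts itself and Iceberg prunes partitions for them.
spark.sql("SELECT * FROM ice.sec.events WHERE ts > TIMESTAMP '2024-01-01 00:00:00'").show()
```

No partition column leaks into queries, and the partition scheme can evolve later without rewriting existing data.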
Where it didn’t fit us: Iceberg was built for read-heavy analytical workloads on stable data. Streaming upserts at high ingest rates require careful engineering to avoid small file accumulation. Record-level incremental consumption isn’t a native primitive. For a different team with a read-dominated workload, Iceberg would be a strong choice. For our workload, we’d have been building around its edges rather than with them.
Delta Lake
Delta Lake’s foundation is solid — ACID transactions on Parquet, a reliable transaction log, clean MERGE support. If your stack is heavily Databricks or Spark, staying in that ecosystem has real operational advantages.
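For flavor, here’s roughly what an upsert looks like through Delta Lake’s Python MERGE API. This is a sketch: the paths, the update batch, and the key column are hypothetical.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Assumes a Spark session with the Delta Lake extensions enabled.
spark = SparkSession.builder.getOrCreate()

updates = spark.read.parquet("/staging/event_updates")  # hypothetical update batch

# Merge the updates into the target table on the record key.
target = DeltaTable.forPath(spark, "/tables/events")
(target.alias("t")
       .merge(updates.alias("s"), "t.event_id = s.event_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())
```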
Two things didn’t fit us. First, our streaming layer isn’t uniformly Spark, and Delta Lake’s design reflects its Spark origins in ways that created friction we didn’t want to carry. Second, Change Data Feed — Delta’s mechanism for incremental consumption — works, but it was added after the fact to an architecture not originally designed around changelogs. For a workload where incremental processing is load-bearing, we wanted something where that model was native.
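For reference, consuming Change Data Feed looks roughly like the sketch below, assuming the table was created with the delta.enableChangeDataFeed property set to true; the path and starting version are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Delta Lake extensions assumed

# Read row-level changes committed since a given table version.
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 42)   # illustrative
           .load("/tables/events"))

# Rows carry _change_type: insert, update_preimage, update_postimage, delete.
changes.show()
```

It does the job. Our objection was architectural, not functional: it is a feature switched on per table, not the table’s native representation.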
Why Hudi
Hudi was built around a specific use case: continuous ingest, records that change after landing, consumers that only want to process what changed. That’s a good description of what we were building and a pretty good description of the world we live in for Security Engineering. Some of the key features of Hudi that were attractive to us are:
The timeline. Hudi maintains an ordered, atomic log of every action taken against a table — commits, compactions, cleanings, rollbacks — with precise timestamps. For us this serves as forensic infrastructure. When an analyst needs to know what a table looked like at the time of an incident, or when a record was modified and what it contained before, the timeline has it. We’re not maintaining a separate audit system alongside the data.
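That forensic lookback is an ordinary query, not a restore job. A minimal sketch of a time-travel read against the timeline, with an illustrative path and instant:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Hudi Spark bundle assumed on the classpath

# Read the table as it existed at a timeline instant (path and timestamp illustrative).
as_of = (spark.read.format("hudi")
         .option("as.of.instant", "2024-03-15 14:25:00")
         .load("/tables/edr_events"))
```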
Upserts. Records are identified by a primary key. When a new version arrives, Hudi handles the merge without a full partition rewrite. An EDR event lands, gets enriched with threat intel 15 seconds later, gets annotated with a case ID 45 seconds after that — three upserts, one record, handled as intended behavior.
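A sketch of what that write path looks like through the Spark datasource; the table name, key fields, and paths are hypothetical stand-ins for ours.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Hudi Spark bundle assumed

enriched = spark.read.parquet("/staging/enriched_events")  # hypothetical update batch

(enriched.write.format("hudi")
         .option("hoodie.table.name", "edr_events")
         .option("hoodie.datasource.write.operation", "upsert")
         .option("hoodie.datasource.write.recordkey.field", "event_id")     # primary key
         .option("hoodie.datasource.write.precombine.field", "updated_at")  # newest version wins
         .mode("append")
         .save("/tables/edr_events"))
```

The three arrivals in the example above are just this same write landing three times with a later precombine value each time.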
Merge on Read and Copy on Write. Hudi lets you choose a storage type per table. MoR writes delta logs for updates and merges at read time — low write latency, right for high-velocity ingest. CoW rewrites files on every upsert — faster reads, right for tables being queried constantly. We run our ingest tables on MoR and our analyst-facing tables on CoW after compaction. Matching the storage model to the access pattern per table is genuinely useful.
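The choice is a single write option per table; the rest of the pipeline code is identical. A sketch, with illustrative table names:

```python
# Same pipeline code; only the table type differs per table (names illustrative).
ingest_opts = {
    "hoodie.table.name": "edr_events_raw",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",   # absorb high-velocity upserts
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
}

analyst_opts = dict(ingest_opts,
                    **{"hoodie.table.name": "edr_events_curated",
                       "hoodie.datasource.write.table.type": "COPY_ON_WRITE"})  # optimize reads

# df.write.format("hudi").options(**ingest_opts).mode("append").save("/tables/edr_events_raw")
```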
Incremental queries. A consumer specifies a point on the timeline and gets back exactly what changed since then. No full scans, no manual offset tracking. New detection rules backfill by walking the timeline in chunks. Downstream processes that fell over during an incident resume from where they stopped.
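A sketch of the consumer side, with an illustrative begin instant (Hudi instants are timestamps on the timeline):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Hudi Spark bundle assumed

# Everything committed after the remembered instant, and nothing else.
changed = (spark.read.format("hudi")
           .option("hoodie.datasource.query.type", "incremental")
           .option("hoodie.datasource.read.begin.instanttime", "20240315142500")
           .load("/tables/edr_events"))
```

Resuming a fallen-over consumer is the same call with whatever instant it last recorded.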
Avro. Hudi uses Avro as its internal record representation in the versions we’re running. Schema evolution and field resolution are consistent between the serialization layer and the table layer, which is what we were aiming for when we made the Part 1 decision.
Tradeoffs We Knew Going In
Hudi requires more operational configuration than Delta Lake. Compaction scheduling, cleaning policies, timeline retention, index types — the defaults are a starting point, not a final answer. Plan for ongoing tuning.
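To give a sense of that tuning surface, a few of the knobs involved are sketched below; the values are illustrative starting points, not recommendations.

```python
# Illustrative operational settings for a MoR table (values are placeholders).
ops_tuning = {
    # Compaction: fold delta logs into base files every N delta commits.
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
    # Cleaning: how many commits of older file versions to keep around.
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    "hoodie.cleaner.commits.retained": "10",
    # Timeline archival: retention bounds for the active timeline.
    "hoodie.keep.min.commits": "20",
    "hoodie.keep.max.commits": "30",
}
```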
Iceberg has broader query engine support. If non-Spark engines are a hard requirement, check Hudi’s current compatibility list before committing.
MoR tables accumulate delta logs between compaction runs. If compaction falls behind ingest, read performance degrades. It’s manageable but it needs monitoring.
None of these changed the decision. They were costs we could see and plan for.
Where This Leaves Us
We chose Hudi because our specific requirements — streaming ingest, upserts, incremental processing, and a native timeline — mapped to what it was built for. Iceberg and Delta Lake are both good. They were just better fits for different workloads than ours.
So, that’s the whole analysis. The right format for your stack depends on your requirements. These were ours.
Engineering Chaos is about applying modern data engineering to rethink how security teams build and operate their data infrastructure. As always, views are mine. You can share them, but I don’t know why you’d want to.


