A Super Cereal Post About Serialization
Evaluating and picking a serialization format for true real-time threat detection
Part 1 of 2
Stream vs batch. Kappa vs Lambda. Copy on Write vs Merge on Read. These are all things security engineers think about (ok, maybe just me and a handful of others?) but they are functions of a high-performance data system.
I have a feeling they will become common parlance in the security engineering space as security and data engineering teams begin to merge. For our team, we needed to decide. Our users just see high-performance results: sub-minute detections on petabyte-scale data. What we didn't want them to see, or really to worry about, was any of the words I opened with.
So this will be a 2-part post on our thinking and selection process. This post covers serialization formats; the next will cover table formats. There's no "right" answer here: for 90% of use cases, most of the options will work. The remaining 10% comes down to how your platform performs, and that's what we'll dive into.
This is the story of why we chose Apache Avro as our serialization format for cybersecurity data and why we felt this was important for what we do.
The Inherent Problem with Cybersecurity Data
Cybersecurity data is not like analytics data. It is not like transactional data. It is not like IoT sensor data. It is its own uniquely chaotic beast. The individual feeds aren't that unusual on their own; it's when you combine them into a system-of-systems that things begin to get weird.
But here’s what makes it different:
Your schemas are someone else's decision. And they aren't all that stable. Elastic, as an example, is on version 9 of their product and version 8 of the Elastic Common Schema (ECS) (correlation, causation, all of that). This isn't a dig on Elastic — in fact they led the way here in my opinion — but I'll selfishly use the example to say cybersecurity schema management is just a mess. When an EDR vendor pushes an agent update and a field you've been relying on for 18 months is renamed, split into two fields, or now carries a nested object where a string used to live, it breaks things fast. And sometimes the only way you find out is an alert telling you your pipeline broke.
Your data sources multiply constantly. At any given time you’re ingesting from endpoint agents, network taps, cloud provider audit logs, identity providers, SaaS security APIs, threat intelligence feeds, and whatever new tool the IT team just bought. Each one has its own schema and none of them agreed on a standard before publishing.
Volume spikes are tied to incidents. When something is actively on fire, your ingest rate can jump 10x in seconds. That is exactly the wrong moment to be debugging a schema mismatch.
Late-arriving data is the norm, not the exception. Security logs get buffered, re-routed, and delayed by the very controls they’re monitoring. You cannot assume your stream is ordered or complete.
We like to say cybersecurity data is always in motion. It’s not about compacting it (that’s important), it’s not about decreasing storage volume (that’s important too), but specifically for streaming analytics it’s about taming the chaos into something close to manageable.
Why JSON and CSV Don’t Scale Here
JSON is everywhere in security. Almost every API returns it. Almost every vendor exports it. And it is a genuinely terrible choice for a high-throughput data pipeline at scale.
It’s verbose. Every record carries its own field names. When you’re processing millions of events a minute, you are paying a real compute and storage cost to repeat the string "sourceIpAddress" on every single row.
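A rough way to see that cost: serialize the same event as JSON, then as a schema-driven binary layout where the field names live in the schema rather than the payload. This is a sketch with a hypothetical event, using `struct` to stand in for a binary codec like Avro's:

```python
import json
import struct

# A single (hypothetical) network event as JSON: field names ride on every record.
event = {"sourceIpAddress": "10.0.0.1", "destinationPort": 443, "bytesSent": 1024}
json_bytes = json.dumps(event).encode("utf-8")

# The same event in a schema-driven binary layout (the Avro idea, sketched with
# struct): 4 bytes for the IP, 2 for the port, 8 for the counter, no field names.
ip_packed = bytes(int(octet) for octet in event["sourceIpAddress"].split("."))
binary_bytes = struct.pack(">4sHQ", ip_packed,
                           event["destinationPort"], event["bytesSent"])

print(len(json_bytes), len(binary_bytes))  # the field names dominate the JSON payload
```

Multiply that per-record overhead by millions of events a minute and the bill is real.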
It has no schema contract. JSON is whatever the producer decided to send. There is no enforcement, no evolution tracking, no compatibility guarantee. One bad producer can silently corrupt a downstream table with no warning.
It’s slow to parse at volume. Text-based parsing is expensive. When you’re doing stream analytics on high-velocity threat detection pipelines, that cost compounds fast.
CSV is even worse — no types, no nesting, and a comma in the wrong string field becomes your entire afternoon.
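The classic failure mode, sketched with a hypothetical row: naive comma-splitting shreds any field that contains a comma, and even a proper CSV parser only survives if the producer quoted the field correctly:

```python
import csv
import io

# Hypothetical process-execution row: the command line contains commas.
row = 'host-01,"powershell.exe -Command Get-Item a,b,c",alert\n'

naive = row.strip().split(",")               # shreds the quoted command line
parsed = next(csv.reader(io.StringIO(row)))  # respects the quoting

print(len(naive), len(parsed))  # 5 "fields" vs the 3 that were actually there
```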
As we started looking towards streaming data, we needed something binary, something schematized, and something built to handle change over time. That points squarely at the three main contenders: Protocol Buffers, Thrift, and Avro.
Avro vs Protobuf vs Thrift: Why Avro Wins for This Domain
Protobuf and Thrift are excellent serialization formats. They’re fast, compact, and widely adopted. But they carry a critical assumption: the schema lives in compiled code.
In a product company with stable, versioned APIs, that’s fine. In a security data platform ingesting from dozens of external sources that you don’t control, it’s a liability.
When a vendor changes their schema, you cannot wait for a code review, a build pipeline, and a deployment to resume ingestion. You need to adapt in the data layer, not the application layer.
Avro makes a fundamentally different architectural choice: the schema travels with the data (embedded in data files, or referenced by ID from a schema registry on a stream) and is resolved at read time. This single design decision unlocks everything that makes Avro the right choice for cybersecurity pipelines.
Schema Evolution That Actually Works
Avro has first-class support for schema evolution with well-defined compatibility rules. You can:
∙ Add a field with a default value — old readers ignore it, new readers see it. Fully backward compatible.
∙ Remove a field that has a default — new readers use the default when reading old data. Fully forward compatible.
∙ Rename a field using aliases — old data maps to the new name transparently.
This is a disciplined contract. And it means that when your EDR vendor renames process_name to process.executable.name in their next agent release, your pipeline doesn’t break — you update the schema, register the new version, define the alias, and data keeps flowing.
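Here's a sketch of what that evolved schema might look like. The field names are hypothetical, and note one wrinkle: Avro field names can't contain dots, so an ECS-style dotted name has to become a nested record or an underscored name; the alias is what maps old data onto it:

```python
import json

# Hypothetical v2 of an endpoint-event schema. "aliases" lets readers resolve
# records written under the old field name; "default" keeps v1 data readable
# after a field is added.
schema_v2 = {
    "type": "record",
    "name": "EndpointEvent",
    "fields": [
        {"name": "host", "type": "string"},
        # Renamed from process_name: old records resolve through the alias.
        {"name": "process_executable_name", "type": "string",
         "aliases": ["process_name"]},
        # Added in v2 with a default, so v1 records still deserialize.
        {"name": "parent_process_id", "type": ["null", "long"], "default": None},
    ],
}

print(json.dumps(schema_v2, indent=2))
```

Register that as the new version, and both old and new records keep flowing through the same readers.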
In cybersecurity, schema evolution isn’t a nice-to-have. It’s an operational requirement.
The Schema Registry as a Source of Truth
When you pair Avro with a schema registry (Confluent Schema Registry, AWS Glue Schema Registry, or similar), something really cool happens: your schema becomes a governed, versioned artifact.
Every producer registers its schema before writing. Every consumer fetches the schema by ID embedded in the message. Compatibility rules are enforced at write time — not discovered at query time.
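The "schema by ID embedded in the message" part is concrete: in the Confluent-style wire format, each message starts with one magic byte (0) and a 4-byte big-endian schema ID, followed by the Avro-encoded payload. A minimal sketch of framing and unframing, with a hypothetical payload:

```python
import struct

# Confluent-style wire format: 1 magic byte (0), a 4-byte big-endian schema ID,
# then the Avro-encoded payload. The consumer reads the ID and fetches the
# matching schema from the registry before deserializing.
def frame(schema_id: int, avro_payload: bytes) -> bytes:
    return struct.pack(">bI", 0, schema_id) + avro_payload

def unframe(message: bytes) -> tuple[int, bytes]:
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != 0:
        raise ValueError("not a registry-framed message")
    return schema_id, message[5:]

msg = frame(42, b"\x02\x06foo")  # hypothetical Avro-encoded bytes
print(unframe(msg))              # schema ID 42, payload intact
```

Five bytes of overhead per message buys every consumer an unambiguous pointer to the exact schema version that produced the record.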
For a security data platform, this means:
∙ Lineage becomes traceable. You know exactly what schema version produced any given record.
∙ Breaking changes are caught at ingestion, not when an analyst runs a query at midnight during an incident.
∙ New data sources are onboarded with a schema contract, not just a hope that the fields stay consistent.
∙ Security analysts get consistency across every vendor. When CrowdStrike, SentinelOne, and Microsoft Defender all land in the same table with normalized, schema-enforced field names, analysts stop memorizing vendor-specific quirks and start writing detection logic that actually works across your entire endpoint fleet.
This is data engineering discipline applied to one of the most schema-volatile domains that exists. And it changes the operational posture of the whole platform.
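The vendor-consistency point above can be sketched as a normalization step that runs before records land in the shared table. The vendor field names here are hypothetical, not actual CrowdStrike, SentinelOne, or Defender fields:

```python
# Minimal sketch: map vendor-specific field names onto one normalized,
# schema-enforced shape before records land in the shared table.
# These vendor field names are hypothetical.
FIELD_MAPS = {
    "vendor_a": {"ImageFileName": "process_name", "aip": "source_ip"},
    "vendor_b": {"process.image": "process_name", "src.ip": "source_ip"},
}

def normalize(vendor: str, raw: dict) -> dict:
    mapping = FIELD_MAPS[vendor]
    # Keep only fields the normalized schema knows about.
    return {mapping[k]: v for k, v in raw.items() if k in mapping}

a = normalize("vendor_a", {"ImageFileName": "evil.exe", "aip": "10.0.0.1"})
b = normalize("vendor_b", {"process.image": "evil.exe", "src.ip": "10.0.0.1"})
print(a == b)  # both vendors land in the same shape
```

Detection logic written against `process_name` and `source_ip` then works across the whole fleet, regardless of which agent emitted the event.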
What This Means for Stream Analytics
The payoff is in your detection pipelines. When every event flowing through your Kafka topics is Avro-encoded with a registered schema, your stream processors gain superpowers:
Enrichment is predictable. When you join an authentication event against a user identity store, you know exactly what fields are available and what types they carry. No defensive null checks for fields that might or might not exist depending on the source.
Multi-source correlation works cleanly. When you’re correlating a process execution event, a network connection event, and a file write event to detect lateral movement, all three sources are speaking the same schematized language. You can join them confidently.
Schema drift is observable. When a new schema version is registered, you can alert on it, review it, and decide whether it’s backward compatible before it affects downstream consumers. Schema changes become change management events, not surprises.
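A toy version of the check a registry runs at registration time: a new schema stays backward compatible only if every field it adds carries a default. This is deliberately simplified; real registries also check types, unions, and aliases:

```python
# Toy backward-compatibility check, sketching the rule a schema registry
# enforces: a new (reader) schema can still read old data only if every
# field it adds carries a default.
def backward_compatible(old_fields: list, new_fields: list) -> bool:
    old_names = {f["name"] for f in old_fields}
    return all(f["name"] in old_names or "default" in f for f in new_fields)

v1 = [{"name": "host", "type": "string"}]
v2_ok = v1 + [{"name": "severity", "type": "int", "default": 0}]
v2_bad = v1 + [{"name": "severity", "type": "int"}]

print(backward_compatible(v1, v2_ok))   # True: new field has a default
print(backward_compatible(v1, v2_bad))  # False: reject before it ships
```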
Historical backfills don’t break. When you need to reprocess 90 days of events against a new detection rule, you’re reading Avro records where the schema is embedded in the data. Old records still deserialize correctly even after the schema has evolved.
The Honest Tradeoffs
Avro is not perfect, and we learned a few things along the way.
Schema registry dependency is real. If your registry is unavailable, producers and consumers that require online schema resolution will fail. You need to treat your schema registry with the same operational care as your message broker. Cache schemas aggressively. Build for registry outages.
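"Cache schemas aggressively" can be as simple as never asking the registry twice for the same schema ID, so consumers ride out a registry outage on warm caches. A sketch, where `fetch_from_registry` is a hypothetical stand-in for a real registry client call:

```python
# Sketch of aggressive schema caching: once a schema ID is resolved, never
# touch the registry for it again. `fetch_from_registry` is a hypothetical
# stand-in for a real registry client.
class CachingSchemaResolver:
    def __init__(self, fetch_from_registry):
        self._fetch = fetch_from_registry
        self._cache = {}  # schema ID -> schema JSON

    def resolve(self, schema_id: int) -> str:
        if schema_id not in self._cache:
            self._cache[schema_id] = self._fetch(schema_id)  # cold miss only
        return self._cache[schema_id]

calls = []
def fake_fetch(schema_id):
    calls.append(schema_id)
    return '{"type": "string"}'

resolver = CachingSchemaResolver(fake_fetch)
resolver.resolve(7)
resolver.resolve(7)  # served from cache; the registry is not touched
print(len(calls))    # the registry was called exactly once
```

A production version would also persist the cache to disk so restarts during an outage don't start cold.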
Binary formats are harder to debug. When a JSON record is malformed you can read it. When an Avro record is malformed you need tooling to inspect it. Invest early in good observability and deserialization debugging utilities.
Schema governance requires discipline. The power of a schema registry is only realized if your teams actually use it. Producers that bypass the registry and write raw JSON or unregistered Avro undermine the whole model. We let that happen early on and saw a massive decrease in performance.
Where This Leaves Us
We chose Avro because cybersecurity data is defined by change — vendors change schemas, new sources get added, the threat landscape forces new data types into existence. A serialization format that cannot evolve gracefully is a serialization format that becomes a liability.
Schema evolution gives us the operational resilience to keep pipelines running when the data changes. The schema registry gives us the governance foundation that makes the whole platform trustworthy. And the analyst on the other end gets consistent, predictable data regardless of which vendor generated it.
That foundation is what makes the table format argument in our next post even possible to have. When the serialization layer is solid, you can focus on optimizing how data is stored and queried — rather than constantly firefighting schema breakage.
Up Next — Part 2: The Table Format Wars
We evaluated Iceberg, Delta Lake, and Hudi. We ultimately chose Hudi because it worked for our stack. As I mentioned earlier, for 90% of use cases the differences won't matter much — but the remaining 10% is where our decision was made.