Data Models and Other Professional Delusions

Photo by Mick Haupt / Unsplash

Data Models Are Like Assholes: Everyone Has One and They All Stink.

(This was my original title, but in a cutting-room-floor decision I decided to change it. It's also a good time for the disclaimer that this blog contains the opinions of me personally and nobody else, yada, yada, yada.)

I've sat through enough vendor pitches and threat intel sharing workshops to know one universal truth: everyone thinks their security data model is special. It's their uniquely perfect way of representing indicators, their revolutionary approach to storing alerts, their game-changing threat intelligence taxonomy.

The Reality Check

Here's the thing about security data models: they're compromises. All of them. That pristine STIX implementation you spent months building? It'll last right up until you need to ingest that one critical threat feed that somehow represents IP addresses in Base64 because... reasons. That elegant security data lake? Wait until compliance needs to prove exactly who looked at what alert and when, during a three-month window from last year, and they need it yesterday.

The Three Lies of Security Data Modeling

1. "Our model captures all relevant security context"

Narrator: It captured all known context from last quarter's threats.

2. "We're STIX/TAXII compliant"

Translation: We can export to STIX, as long as you don't need half the fields.

3. "The schema is vendor-agnostic"

Sure, Jan

Why They All Stink

Every security data model is built on assumptions. Assumptions about:

  • What data you'll need for detection
  • How threats will manifest
  • Which correlation rules matter
  • What attributes will be available
  • How much context you can actually get
  • Whether time zones will ever be consistent (they won't)
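That last one deserves proof. Here's a minimal, standalone sketch (plain Python, with hypothetical vendor formats I made up for illustration) of what "time zones will never be consistent" costs you: three feeds, three timestamp conventions, one coercion function to drag them all into UTC.

```python
from datetime import datetime, timezone

# Hypothetical sample: three vendors, three timestamp conventions.
RAW_TIMESTAMPS = [
    ("2024-03-01T12:00:00Z", "%Y-%m-%dT%H:%M:%S%z"),        # ISO 8601, UTC
    ("01/03/2024 07:00:00 -0500", "%d/%m/%Y %H:%M:%S %z"),  # local offset, day-first
    ("1709294400", None),                                    # epoch seconds, naturally
]

def to_utc(value, fmt):
    """Coerce a vendor timestamp into an aware UTC datetime."""
    if fmt is None:
        return datetime.fromtimestamp(int(value), tz=timezone.utc)
    return datetime.strptime(value, fmt).astimezone(timezone.utc)

for value, fmt in RAW_TIMESTAMPS:
    print(to_utc(value, fmt).isoformat())
```

All three lines are the same instant; your data model just doesn't know it yet.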

So What's the Answer?

The secret isn't building the perfect security data model. It's building one that's:

  • Resilient enough to handle dirty data (because it's all dirty data)
  • Flexible enough to adapt (because threats always change)
  • Simple enough to query (because you'll be writing those queries at 2 AM)
  • Pragmatic enough to actually detect things
  • Scalable enough to let you collect it all, Pokémon-style, without introducing bottlenecks

The Wisdom of Regret

After years of watching beautiful security architectures turn into alert graveyards, here's what I've learned:

  • Denormalization isn't a sin, it's a survival strategy (Try running that complex join when there's an active incident)
  • Normalization is a journey, not a destination
  • If you can't explain your correlation logic to a junior analyst, it's too complex
  • The threats will evolve faster than your data model
  • Sometimes grep is just fine (I said what I said)

The Denormalization Confession

Let's talk about denormalization, security's dirty little secret. Everyone's doing it, but nobody wants to admit it. You start with perfect normal forms and beautiful relationships between indicators, events, and entities. Then reality hits:

  • That perfectly normalized alert structure? It's why your queries take 30 seconds when you need them in 2.
  • Those beautifully separated IoC tables? They're why your correlation rules look like a SQL version of War and Peace.
  • That elegant evidence chain? It's why your analysts are drinking at noon.

Sometimes you need to duplicate data. Sometimes you need to flatten those relationships. Sometimes you need to sacrifice perfection at the altar of "holy crap we need to detect this NOW."
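To make that trade concrete, here's a toy sketch (plain Python, hypothetical field names) of what the confession actually looks like in practice: copy the indicator fields onto the alert at write time, so reads never join.

```python
# Normalized: the alert references an indicator by id, so every read
# needs a join (or lookup) to get the context analysts actually want.
indicators = {
    "ioc-1": {"type": "ip", "value": "203.0.113.7", "source": "feed-a"},
}
alerts = [
    {"id": "a-1", "rule": "beaconing", "indicator_id": "ioc-1"},
]

def denormalize(alert, indicator_table):
    """Denormalized: duplicate the indicator fields onto the alert
    at write time, prefixed so they don't collide with alert fields."""
    ioc = indicator_table[alert["indicator_id"]]
    flat = dict(alert)
    flat.update({f"ioc_{k}": v for k, v in ioc.items()})
    return flat

wide = [denormalize(a, indicators) for a in alerts]
# One wide record, zero joins at 2 AM.
print(wide[0])
```

Yes, the indicator data now lives in two places, and yes, you'll need a story for updating it. That's the price of the fast read path, and during an incident it's usually a price worth paying.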

Schema-on-Read

After that therapeutic rant, let me share something useful-ish: schema-on-read patterns with Apache Spark. Because sometimes the best data model is the one you don't have to enforce up front.

What if instead of trying to force every security product's weird log format into your perfect schema (and breaking everything every time vendor X decides timestamps should now be in Roman numerals), you just... didn't? What if you stored the data in its raw form and applied structure only when you need it?

This is where tools like Spark shine:

  • Raw logs in their natural habitat? Fine.
  • Weird vendor format changes? Whatever.
  • Need to correlate across different schemas? That's future-you's problem.

Let's Get Technical For a Minute

[Brief pause in our cynicism for some actual technical depth]

In traditional security data pipelines, we try to normalize everything at ingestion:

```python
# Traditional approach - pray nothing changes
def ingest_security_event(raw_event):
    normalized = force_into_standard_schema(raw_event)  # What could go wrong?
    store_in_database(normalized)  # Hope you got it right
```

With schema-on-read, we maintain flexibility:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Store raw events: one log line per row, in a single 'value' column
raw_events = spark.read.text("s3://security-events/*")
raw_events.write.parquet("s3://raw-storage/")

# Define interpretations as needed
def extract_authentication_events(raw_df):
    """Apply structure at read time: pull auth fields out of raw lines."""
    return raw_df.select(
        F.regexp_extract('value', 'timestamp:"(.*?)"', 1).alias('event_time'),
        F.regexp_extract('value', 'user:(.*?),', 1).alias('username'),
        F.regexp_extract('value', 'src_ip:(.*?),', 1).alias('source_ip'),
        F.regexp_extract('value', 'result:(SUCCESS|FAILURE)', 1).alias('auth_result'),
    )
```
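If you want to sanity-check those regex patterns without spinning up Spark, the same extraction can be exercised with plain `re` against a sample line. The log format here is the hypothetical one the patterns above assume.

```python
import re

# Hypothetical raw line in the format the regexes above expect.
line = 'timestamp:"2024-03-01T12:00:00Z" user:alice, src_ip:198.51.100.9, result:FAILURE'

PATTERNS = {
    "event_time": r'timestamp:"(.*?)"',
    "username": r"user:(.*?),",
    "source_ip": r"src_ip:(.*?),",
    "auth_result": r"result:(SUCCESS|FAILURE)",
}

def extract(raw):
    """Mirror of the regexp_extract calls: group 1 on match, '' otherwise."""
    out = {}
    for field, pattern in PATTERNS.items():
        m = re.search(pattern, raw)
        out[field] = m.group(1) if m else ""
    return out

print(extract(line))
# {'event_time': '2024-03-01T12:00:00Z', 'username': 'alice',
#  'source_ip': '198.51.100.9', 'auth_result': 'FAILURE'}
```

The empty-string fallback matters: when vendor X changes the format (they will), you get blank fields instead of a broken pipeline, and you fix the interpretation, not the storage.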

Performance Considerations (Because Someone Will Ask)

The key to making this performant:

```python
# Partition raw data intelligently
# (assumes year/month/day/event_type were derived from the raw line first)
raw_events.write.partitionBy(
    "year", "month", "day", "event_type"
).parquet("s3://raw-storage/")

# Create optimized views for common queries
frequently_needed = (
    spark.read.parquet("s3://raw-storage/")
    .where(F.col("event_type").isin(["AUTH", "NETWORK"]))
    .select(extract_common_fields())  # your helper returning the common columns
    .repartition("source_ip")
    .cache()
)
```
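Why does partitioning make schema-on-read affordable? Because partitionBy writes Hive-style key=value directories, and a filter on a partition column lets Spark skip whole directories instead of scanning everything. A minimal sketch (plain Python, just building the path Spark would write) makes the layout visible:

```python
def partition_path(base, **parts):
    """Build the Hive-style key=value path that partitionBy produces."""
    suffix = "/".join(f"{k}={v}" for k, v in parts.items())
    return f"{base.rstrip('/')}/{suffix}/"

print(partition_path("s3://raw-storage/",
                     year=2024, month=3, day=1, event_type="AUTH"))
# s3://raw-storage/year=2024/month=3/day=1/event_type=AUTH/
```

A query filtered to `event_type == "AUTH"` on March 1st touches exactly one of those directories. A query with no partition filter touches all of them, which is how "flexible" quietly becomes "expensive."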

Back to Reality

Let's talk about why this matters in the real world:

  1. Store first, structure later
Because you don't know what you'll need until you need it
Because your perfect schema is perfect right up until it isn't

  2. Transform on demand
Different teams need different views? No problem
New correlation requirement? Just add another transformation

  3. Evolve gradually
No more big schema migrations
No more "everyone stop sending logs" maintenance windows
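"Just add another transformation" can be more than a slogan. One way to sketch it (plain Python, hypothetical view names and event shapes) is a registry of named interpretations over the same raw store: adding a view is registering a function, and the raw data is never rewritten.

```python
# Hypothetical registry of schema-on-read views over the same raw store.
VIEWS = {}

def view(name):
    """Register a named transformation; raw data is never rewritten."""
    def register(fn):
        VIEWS[name] = fn
        return fn
    return register

@view("auth")
def auth_view(raw_events):
    return [e for e in raw_events if e.get("kind") == "AUTH"]

# New correlation requirement? Add another transformation. No migration,
# no maintenance window, no one stops sending logs.
@view("network")
def network_view(raw_events):
    return [e for e in raw_events if e.get("kind") == "NETWORK"]

raw = [
    {"kind": "AUTH", "user": "alice"},
    {"kind": "NETWORK", "dst": "203.0.113.7"},
]
print(VIEWS["network"](raw))
```

Each team queries the view it cares about; nobody's new requirement forces a schema change on anyone else.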

The Catch (Because There's Always a Catch)

Yes, this means:

  • You need more processing power
  • Your queries might be more complex
  • You're trading write-time complexity for read-time flexibility

But honestly? In security, that's usually the right trade-off. Because the only constant in security data is that it will change, and usually at 4:30 PM on a Friday.

My Get-Off-the-Stage Moment

Your data model still stinks. My data model still stinks. But with schema-on-read, at least it's possible to focus on finding the threat instead of finding the failed normalization pattern. And sometimes, that's the best we can hope for.

But hey, I know, I know: your perfect security data lake with automated STIX translation and real-time correlation is different. I'm sure it's lovely. Let me know how it goes when you have to ingest that one critical feed where timestamps are optional.