A Spark story about scale, grain, and why speed makes system awareness non-negotiable.

Image generated using ChatGPT

The Setup: When “Correct” Code Still Fails

Recently, I ran into a Spark job that looked perfectly fine. In this case, I had used Claude to generate the code, and it was:

  • Syntactically correct

  • Semantically correct

  • Easy to read and reason about

It worked on the sample dataset. Assuming Spark would manage the scale, we pushed it to production. On prod, however, it took more than 3 hours to complete on a 40-node EMR cluster while processing a ~99 GB fact dataset joined with ~150 MB of dimension data.

What bothered us wasn’t just the runtime but also the fact that nothing looked obviously wrong.

It turned out this wasn’t a Spark tuning problem or an infrastructure problem. And interestingly, it wasn’t really an AI problem either.

It was a systems understanding problem.


A Quick Translation (So We Don’t Get Lost in Domain Names)

Before going further, let me simplify the domain language. This will keep our focus on the point I want to make rather than on the domain itself.

The real system was ad tech-related, but the specifics don’t matter. What does matter is the shape of the data.

So for the rest of this post, think in these neutral terms:

  • Entity: Thing we want metrics for (a campaign, product, page, or item)

  • Fine-grained ID: A very detailed identifier (user, device, session, event)

  • Group: A higher-level attribute (region, segment, category)

Each row in the input dataset represented:

(entity, group, fine_grained_id) → metric

Keep this mental model in mind. It’s the key to everything that follows.
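
To make this concrete, here is a tiny, purely illustrative sketch in Spark/Scala of what such rows could look like. The names, values, and columns below are hypothetical stand-ins, not the real dataset:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("grain-example").getOrCreate()
import spark.implicits._

// Hypothetical fact rows at the finest grain:
// one row per (entity, group, fine_grained_id).
val factDataset = Seq(
  ("campaign_1", "region_A", "user_0001", 3L),
  ("campaign_1", "region_A", "user_0002", 1L),
  ("campaign_1", "region_B", "user_0003", 5L),
  ("campaign_2", "region_A", "user_0001", 2L)
).toDF("entity", "group", "fine_grained_id", "metric")

// Hypothetical dimension table: the attribute we join against.
// Note that "region_A" maps to two dimension keys, so the join fans out.
val dimension = Seq(
  ("region_A", "dim_1", 0.8),
  ("region_A", "dim_2", 0.2),
  ("region_B", "dim_1", 1.0)
).toDF("group", "dimension_key", "scaling_factor")

In the real system, a single entity accounted for millions of such fine-grained rows, not four.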


The Spark Job (Simplified)

The original Spark logic followed a very common pattern:

  1. Read a large fact dataset

  2. Join it with a small dimension table (broadcasted)

  3. Compute derived metrics

  4. Aggregate the results

In simplified form:

factDataset
  .join(broadcast(dimension), joinKey)
  .withColumn("derived_metric", ...)
  .groupBy(...)
  .agg(...)

Nothing here is wrong. In fact, this was exactly the kind of code I would have approved in a review without thinking twice.
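
For concreteness, here is roughly the same pattern with the placeholders filled in, reusing the toy factDataset and dimension from earlier. The derived-metric formula is a hypothetical stand-in, not the real job's logic:

import org.apache.spark.sql.functions.{broadcast, col, sum}

// Join first, aggregate later: every fact row is expanded by the
// dimension fan-out before anything gets reduced.
val joinFirst = factDataset
  .join(broadcast(dimension), Seq("group"))
  .withColumn("derived_metric", col("metric") * col("scaling_factor")) // hypothetical derivation
  .groupBy("entity", "dimension_key")
  .agg(sum("derived_metric").alias("total_metric"))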

So why did it crawl?

The Missing Question: What Is the Grain?

I spent a fair amount of time in the Spark UI before realizing it wasn’t helping.
The breakthrough came only after I stopped looking at Spark altogether and asked myself a simpler question:

What does one row in this dataset actually represent?

Not how big it is in GB.
Not how many partitions it has.
But the grain.

In this case, each row was entity × fine-grained ID × group.

That means:

  • Millions of rows per entity

  • Before any meaningful aggregation

This detail was easy to miss and devastating at scale.

Why Grain Matters More Than Size

From a purely logical perspective, these two approaches are equivalent, and that’s exactly why I didn’t question the original version early enough:

(join → aggregate)

and

(aggregate → join → aggregate)

Both produce the same final answer.

But Spark doesn’t execute logic. It executes physical plans.

And in Spark:

  • Joins multiply rows

  • Aggregations reduce rows

  • Shuffle cost grows with row count × row width

When you join a fine-grained dataset with a dimension that fans out:

  • Spark must materialize all expanded rows

  • Move them across the network

  • Spill them to disk and only then reduce them later

That intermediate blow-up is a row explosion.

It’s invisible in code. It’s painfully visible in runtime.
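
To get a feel for the blow-up, a back-of-envelope estimate helps. The numbers below are purely illustrative, not the real dataset’s counts; the shape of the difference is the point:

// Purely illustrative numbers; not the real dataset counts.
val entities        = 1000L      // distinct entities
val idsPerEntity    = 5000000L   // fine-grained IDs per entity
val groupsPerEntity = 50L        // distinct groups per entity
val fanOut          = 4L         // dimension rows matched per group

// Join first: every fine-grained row is expanded before any reduction.
val expandedRows = entities * idsPerEntity * fanOut   // 20,000,000,000 intermediate rows

// Aggregate first: collapse to (entity, group), then fan out.
val collapsedRows    = entities * groupsPerEntity     // 50,000 rows
val expandedAfterAgg = collapsedRows * fanOut         // 200,000 intermediate rows

The full fact dataset still has to be scanned either way; what changes is how many rows get materialized, shuffled, and spilled in between.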


The Fix Was Simple

The fix wasn’t:

  • More executors

  • Different join hints

  • Or Spark tuning flags

It was changing when aggregation happened.

Instead of joining first and aggregating later, we aggregated to the correct grain first:

factDataset
  .groupBy("entity", "group")
  .agg(sum("metric").alias("group_metric"))
  .join(broadcast(dimension), "group")
  .withColumn("scaled_metric", ...)
  .groupBy("entity", "dimension_key")
  .agg(sum("scaled_metric"))

Same logic. Same result.

Completely different execution shape.
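
One way to see that difference, rather than take it on faith, is to compare the physical plans. The sketch below assumes the two variants are bound to joinFirst and aggregateFirst; it shows the general approach, not output from the actual job:

// Spark 3.x supports explain("formatted") for a readable physical plan.
joinFirst.explain("formatted")      // the shuffle (Exchange) sits above the expanded join output
aggregateFirst.explain("formatted") // a partial HashAggregate runs before the Exchange,
                                    // so far fewer rows cross the network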

By collapsing the fine-grained rows early:

  • The row explosion disappeared

  • The shuffle volume dropped dramatically

  • Runtime fell from hours to minutes


Why This Was Actually a Prompting Failure

At this point, you may wonder:

What does this have to do with prompting?

Here’s the uncomfortable answer:

The original prompt never described the system’s execution constraints, mostly because I hadn’t articulated them clearly to myself either.

When I asked the AI to generate the Spark transformation, my prompt implicitly assumed:

  • Joins are cheap

  • Aggregation order doesn’t matter

  • Scale is just “bigger data”, not structurally different data

Those assumptions are harmless in single-node code. However, they are catastrophic in distributed systems.

The AI did exactly what I asked and exactly what I failed to constrain.


Why AI Misses Row Explosion

Row explosion is not a syntax problem. It’s an execution model problem.

Understanding it requires knowing that:

  • Joins fan out

  • Aggregation reduces cardinality

  • Shuffle cost dominates runtime

  • Order of operations changes physical plans

AI models are trained on code, not execution traces. They learn what patterns appear together, not what those patterns cost at runtime. There’s no gradient signal for “this join will blow up your cluster.”

And unless you surface that knowledge explicitly in a prompt, the model has no reason to optimize for it.


What a Better Prompt Would Have Looked Like

Looking back, a better prompt wouldn’t just say what to compute.
It would have forced me to confront assumptions I was making implicitly.

For example:

The input dataset is at fine-grained (user/device/event) level.
Joining before aggregation will cause row explosion.
Please aggregate to the lowest meaningful grain before any fan-out joins.

That single paragraph communicates:

  • Data grain

  • Join behavior

  • Performance constraints

Without that, the AI may naturally produce a logically correct but physically disastrous plan.


Prompting Is Architecture

This is the real lesson. Prompting is not about:

  • Better wording

  • Clever phrasing

  • Longer instructions

Prompting is about constraints. And constraints are architecture.

When you prompt an AI without understanding the system underneath, you’re effectively saying, “Decide the architecture for me.”

The AI will happily do so guided only by local correctness.


Closing Thought

AI didn’t cause the failure — it just surfaced gaps in my thinking faster than I would have found them on my own.

The better you understand:

  • Data grain

  • Execution models

  • Failure modes

…the better your prompts become.

And at that point, AI stops being just a code generator
and starts becoming a true force multiplier.

In the age of AI, depth matters more than ever — because when code is easy to produce, the cost of wrong system-level choices grows exponentially.

This lesson isn’t unique to Spark. It applies anywhere execution cost is invisible in code: database query planners, network serialization, and memory allocation patterns. The domain changes; the principle doesn’t.


About the Author

I am a software engineer with 15+ years of experience building data-intensive systems. I work on building distributed systems and analytical platforms, with a focus on understanding systems from first principles. I like to document engineering tradeoffs and lessons learned from building systems under practical constraints.


Review Credits

Thanks to Amaan and Vighnesh for reviewing this post and providing valuable feedback.


Tags: Writing Prompts, LLM, Claude, Apache Spark, Architecture