A Spark story about scale, grain, and why speed makes system awareness non-negotiable.

The Setup: When “Correct” Code Still Fails
Recently, I ran into a Spark job that looked perfectly fine. I had used Claude to generate the code, and it was:
Syntactically correct
Semantically correct
Easy to read and reason about
This worked on the sample dataset. Assuming Spark would handle the scale, we pushed it to production, where it took more than 3 hours to complete on a 40-node EMR cluster while processing a ~99 GB fact dataset against ~150 MB of dimension data.
What bothered us wasn’t just the runtime but also the fact that nothing looked obviously wrong.
It turned out this wasn't a Spark tuning problem or an infrastructure problem. And interestingly, it wasn't really an AI problem either.
It was a systems understanding problem.
A Quick Translation (So We Don’t Get Lost in Domain Names)
Before going further, let me simplify the domain language. This will keep our focus on the point I want to make rather than on the domain itself.
The real system was ad tech-related, but the specifics don’t matter. What does matter is the shape of the data.
So for the rest of this post, think in these neutral terms:
Entity: Thing we want metrics for (a campaign, product, page, or item)
Fine-grained ID: A very detailed identifier (user, device, session, event)
Group: A higher-level attribute (region, segment, category)
Each row in the input dataset represented:
(entity, group, fine_grained_id) → metric
Keep this mental model in mind. It's the key to everything that follows.
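To make that concrete, here is a tiny, purely illustrative slice of data at this grain. The names and numbers below are invented; only the shape matters.
// Illustrative rows at the (entity, group, fine_grained_id) grain.
// Every value here is made up; only the shape matters.
val sampleRows = Seq(
  ("campaign_1", "region_a", "user_001", 1.0),
  ("campaign_1", "region_a", "user_002", 3.0),
  ("campaign_1", "region_b", "user_003", 2.0)
) // (entity, group, fine_grained_id, metric)
Even in this toy slice, a single entity already spans multiple fine-grained IDs before anything is aggregated.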
The Spark Job (Simplified)
The original Spark logic followed a very common pattern:
Read a large fact dataset
Join it with a small dimension table (broadcasted)
Compute derived metrics
Aggregate the results
In simplified form:
factDataset
  .join(broadcast(dimension), joinKey)
  .withColumn("derived_metric", ...)
  .groupBy(...)
  .agg(...)
Nothing here is wrong. In fact, this was exactly the kind of code I would have approved in a review without thinking twice.
So why did it crawl?
The Missing Question: What Is the Grain?
I spent a fair amount of time in Spark UI before realizing it wasn’t helping.
The breakthrough came only after I stopped looking at Spark altogether and asked myself a simpler question:
What does one row in this dataset actually represent?
Not how big it is in GB.
Not how many partitions it has.
But the grain.
In this case, each row was entity × fine-grained ID × group.
That means:
Millions of rows per entity
Before any meaningful aggregation
This detail was easy to miss and devastating at scale.
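In hindsight, this was cheap to check before the job ever reached production. A minimal sketch, assuming the neutral column names from the translation above:
import org.apache.spark.sql.functions.desc

// How many rows does a single entity carry before any aggregation?
// If the top counts are in the millions, the input grain is far finer
// than the final output needs.
factDataset
  .groupBy("entity")
  .count()
  .orderBy(desc("count"))
  .show(10)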
Why Grain Matters More Than Size
From a purely logical perspective, these two approaches are equivalent, and that’s exactly why I didn’t question the original version early enough:
(join → aggregate)
and
(aggregate → join → aggregate)
Both produce the same final answer.
But Spark doesn’t execute logic. It executes physical plans.
And in Spark:
Joins multiply rows
Aggregations reduce rows
Shuffle cost grows with row count × row width
When you join a fine-grained dataset with a dimension that fans out:
Spark must materialize all expanded rows
Move them across the network
Spill them to disk, and only reduce them later
That intermediate blow-up is a row explosion.
It’s invisible in code. It’s painfully visible in runtime.
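A toy example makes the mechanics concrete. Everything below is invented and tiny compared to the real job, but the shape of the problem is the same: a fan-out join hands Spark more rows than it was given.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().master("local[*]").appName("row-explosion-sketch").getOrCreate()
import spark.implicits._

// 100,000 fine-grained fact rows: (entity, group, metric)
val fact = Seq.tabulate(100000)(i => (i % 100, i % 10, 1.0))
  .toDF("entity", "group", "metric")

// 50 dimension rows, 5 per group, so the join fans out 5x
val dimension = Seq.tabulate(50)(i => (i % 10, i, 0.1))
  .toDF("group", "dimension_key", "weight")

val exploded = fact.join(broadcast(dimension), "group")
println(s"fact rows: ${fact.count()}, after join: ${exploded.count()}")
// fact rows: 100000, after join: 500000 -- the blow-up exists before any aggregation runs
Scale those made-up numbers up to a ~99 GB fact dataset and that multiplied intermediate is what Spark has to shuffle and spill before it can reduce anything.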
The Fix Was Simple
The fix wasn’t:
More executors
Different join hints
Or Spark tuning flags
It was changing when aggregation happened.
Instead of joining first and aggregating later, we aggregated to the correct grain first:
factDataset
  .groupBy("entity", "group")
  .agg(sum("metric").alias("group_metric"))
  .join(broadcast(dimension), "group")
  .withColumn("scaled_metric", ...)
  .groupBy("entity", "dimension_key")
  .agg(sum("scaled_metric"))
Same logic. Same result.
Completely different execution shape.
By collapsing the fine-grained rows early:
The row explosion disappeared
The shuffle volume dropped dramatically
Runtime fell from hours to minutes
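You don't have to take the execution shape on faith; Spark will show it to you. Here is a minimal sketch that reuses the toy fact and dimension DataFrames from the earlier example and compares the physical plans for the two orderings (the shuffle read/write sizes in the Spark UI tell the same story):
import org.apache.spark.sql.functions.{broadcast, sum}

// Join-first: the fan-out join materializes every expanded row
// before anything can reduce them.
fact.join(broadcast(dimension), "group")
  .groupBy("entity", "dimension_key")
  .agg(sum("metric"))
  .explain()

// Aggregate-first: partial aggregation collapses the fine-grained rows
// inside each partition before the shuffle, and the broadcast join only
// fans out the already-reduced (entity, group) rows.
fact.groupBy("entity", "group")
  .agg(sum("metric").alias("group_metric"))
  .join(broadcast(dimension), "group")
  .groupBy("entity", "dimension_key")
  .agg(sum("group_metric"))
  .explain()
Both versions return the same totals; only where the exchange sits, and how many rows flow through it, changes.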
Why This Was Actually a Prompting Failure
At this point, you may wonder:
What does this have to do with prompting?
Here’s the uncomfortable answer:
The original prompt never described the system’s execution constraints, mostly because I hadn’t articulated them clearly to myself either.
When I asked the AI to generate the Spark transformation, my prompt implicitly assumed:
Joins are cheap
Aggregation order doesn’t matter
Scale is just “bigger data”, not structurally different data
Those assumptions are harmless in single-node code. However, they are catastrophic in distributed systems.
The AI did exactly what I asked and exactly what I failed to constrain.
Why AI Misses Row Explosion
Row explosion is not a syntax problem. It’s an execution model problem.
Understanding it requires knowing that:
Joins fan out
Aggregation reduces cardinality
Shuffle cost dominates runtime
Order of operations changes physical plans
AI models are trained on code, not execution traces. They learn what patterns appear together, not what those patterns cost at runtime. There’s no gradient signal for “this join will blow up your cluster.”
And unless you surface that knowledge explicitly in a prompt, the model has no reason to optimize for it.
What a Better Prompt Would Have Looked Like
Looking back, a better prompt wouldn’t just say what to compute.
It would have forced me to confront assumptions I was making implicitly.
For example:
The input dataset is at fine-grained (user/device/event) level.
Joining before aggregation will cause row explosion.
Please aggregate to the lowest meaningful grain before any fan-out joins.
That single paragraph communicates:
Data grain
Join behavior
Performance constraints
Without that, the AI may naturally produce a logically correct but physically disastrous plan.
Prompting Is Architecture
This is the real lesson. Prompting is not about:
Better wording
Clever phrasing
Longer instructions
Prompting is about constraints. And constraints are architecture.
When you prompt an AI without understanding the system underneath, you’re effectively saying, “Decide the architecture for me.”
The AI will happily do so guided only by local correctness.
Closing Thought
AI didn’t cause the failure — it just surfaced gaps in my thinking faster than I would have found them on my own.
The better you understand:
Data grain
Execution models
Failure modes
…the better your prompts become.
And at that point, AI stops being just a code generator and starts becoming a true force multiplier.
In the age of AI, depth matters more than ever — because when code is easy to produce, the cost of wrong system-level choices grows exponentially.
This lesson isn't unique to Spark. It applies anywhere execution cost is invisible in code: database query planners, network serialization, and memory allocation patterns. The domain changes; the principle doesn't.
About the Author
I am a software engineer with 15+ years of experience building data-intensive systems. I work on building distributed systems and analytical platforms, with a focus on understanding systems from first principles. I like to document engineering tradeoffs and lessons learned from building systems under practical constraints.
Review Credits
Thanks to Amaan and Vighnesh for reviewing this post and providing valuable feedback.
Tags: Writing Prompts, LLM, Claude, Apache Spark, Architecture