
By October 2015, Apache Spark had moved from promising MapReduce successor to mainstream engine of choice for many big data workloads.

Built for speed, general-purpose compute, and developer productivity, Spark redefined what it meant to process large-scale data. While it could run on its own standalone cluster manager or on Mesos, most enterprises embraced Spark on YARN for its integration with the Hadoop ecosystem.

What made Spark so compelling wasn’t just its developer-friendly APIs or its expressive power — it was how operationally different it was from MapReduce, and how it leveraged YARN as a resource orchestrator while managing its own intelligent compute pipeline.


The Architectural Shift: From MapReduce to Resilient Distributed Datasets

MapReduce is stateless. Every job writes intermediate data to HDFS between steps. It’s durable, but expensive:

  • Lots of disk I/O
  • High job latency
  • Rigid two-phase structure (Map → Shuffle → Reduce)

Spark changed the game by introducing Resilient Distributed Datasets (RDDs):

  • Immutable, partitioned collections of objects spread across the cluster
  • Kept in memory when cached, rather than written to HDFS between steps
  • Able to express complex DAGs of transformations

Instead of chaining multiple MapReduce jobs with intermediate HDFS writes, Spark could express entire workflows in memory:

val counts = sc.textFile("hdfs:///data.txt")   // read lines from HDFS
  .flatMap(_.split(" "))                       // split each line into words
  .map(word => (word, 1))                      // emit (word, 1) pairs
  .reduceByKey(_ + _)                          // sum counts per word (one shuffle)

What would’ve taken multiple jobs in MapReduce is now a single DAG in Spark.


Spark on YARN: Who Does What?

When you run Spark on YARN, you’re combining two systems:

  • YARN handles resource negotiation: CPU, memory, container placement
  • Spark handles the compute model: scheduling, task execution, data locality

Flow of Execution:

  1. Spark submits an ApplicationMaster to YARN (just like other apps)
  2. The AM negotiates containers for the driver and executors
  3. Spark driver builds the DAG of stages and tasks
  4. Tasks are serialized and shipped to executors, where they run as JVM threads
  5. Executors perform work, cache RDDs, and report back to the driver

The key here: Spark manages task execution independently once containers are allocated. YARN simply launches and monitors containers — it doesn’t orchestrate individual stages.

This separation of concerns allows Spark to run faster and more intelligently than traditional YARN-native models like MapReduce.
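For concreteness, here is roughly what a cluster-mode submission looked like in the Spark 1.x era. The class name, jar, and sizing flags below are illustrative:

# A sketch of a YARN cluster-mode submission (Spark 1.x syntax).
# YARN only sees the container requests; everything after launch is Spark's.
spark-submit \
  --master yarn-cluster \
  --num-executors 4 \
  --executor-memory 4g \
  --executor-cores 2 \
  --class com.example.WordCount \
  wordcount.jar

In client mode, by contrast, the driver stays on the submitting machine and only the executors run in YARN containers, which is why the two modes behave differently under failure.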


Operational Differences from MapReduce

1. Long-Running Executors

MapReduce launches a new JVM per task attempt — Spark uses long-lived JVMs (executors), keeping context and data in memory across stages.

This dramatically reduces overhead.

2. In-Memory Caching

With persist() and cache(), Spark lets users explicitly keep datasets in RAM. This is huge for iterative algorithms like machine learning.
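A minimal sketch of the pattern (the input path and the iteration body are placeholders): cache the working set once, then iterate over it without re-reading HDFS.

val points = sc.textFile("hdfs:///points.txt")   // illustrative input
  .map(_.split(",").map(_.toDouble))
  .cache()                                       // keep partitions in executor memory

for (i <- 1 to 10) {
  // every pass scans the cached partitions instead of going back to HDFS
  val norm = points.map(p => p.map(x => x * x).sum).reduce(_ + _)
  println(s"iteration $i: $norm")
}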

3. DAG Execution Engine

Spark builds a Directed Acyclic Graph of transformations, optimizing execution paths, reusing shuffle files, and pipelining tasks where possible.

MapReduce only understands map → shuffle → reduce — no optimization in between.
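You can see this structure directly: narrow transformations like map and filter are pipelined into a single stage, while reduceByKey marks a shuffle boundary. A small sketch:

val dag = sc.parallelize(1 to 1000000)
  .map(_ * 2)                 // narrow: pipelined with the next step
  .filter(_ % 3 == 0)         // narrow: same stage as the map above
  .map(n => (n % 10, n))      // still the same stage
  .reduceByKey(_ + _)         // wide: forces a shuffle, starts a new stage

println(dag.toDebugString)    // prints the lineage with stage boundaries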

4. Fault Recovery via Lineage

Instead of writing to disk every step, Spark remembers how to recompute lost partitions via RDD lineage. Failures are fast and scoped.
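Lineage usually makes recovery cheap, but when a chain grows very long, recomputation itself gets expensive. Checkpointing truncates the lineage by writing a durable copy. A sketch (the directory is illustrative):

sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // illustrative location
val derived = sc.parallelize(1 to 1000).map(_ * 2).filter(_ % 4 == 0)
derived.checkpoint()      // written on the next action; lineage truncated afterward
println(derived.count())  // triggers both the computation and the checkpoint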


How the JVM Makes Spark Possible

Like Hadoop and YARN, Spark is built on the JVM, but it pushes the envelope further:

  • Leverages JVM serialization for efficient task distribution
  • Manages memory off-heap where possible (via the Tungsten project) to ease GC pressure
  • Uses code generation (via Catalyst) to compile SQL queries to bytecode
  • Integrates deeply with GC tuning and thread management

The JVM isn’t just a convenience — it’s an execution platform. Spark embraces it for its introspection, portability, and performance potential.
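In practice, that means JVM-level knobs are first-class Spark settings. A sketch of two common ones from the 1.x era (the values are illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // faster than default Java serialization
  .set("spark.executor.extraJavaOptions", "-XX:+PrintGCDetails")         // pass GC flags straight to executor JVMs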


Spark on YARN in Production

By 2015, many enterprise workloads were shifting to Spark on YARN:

  • ETL pipelines replaced multiple MapReduce jobs with a single Spark app
  • SQL-on-Hadoop users moved from Hive on MR to Hive on Spark or Spark SQL
  • Data scientists embraced PySpark for exploratory analytics

Cluster administrators benefited too:

  • Unified resource governance via YARN’s schedulers
  • Multi-tenant isolation with dynamic executor allocation
  • Consistent monitoring using Hadoop-native tools
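The dynamic executor allocation mentioned above is configuration-driven. A sketch of the relevant Spark 1.x settings (values illustrative; the external shuffle service must also be running on the NodeManagers):

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")    // floor kept warm per app
  .set("spark.dynamicAllocation.maxExecutors", "20")   // cap enforced for multi-tenancy
  .set("spark.shuffle.service.enabled", "true")        // lets idle executors be released safely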

Comparing Execution Philosophies

Feature           | MapReduce on YARN          | Spark on YARN
------------------|----------------------------|------------------------------------
Compute Model     | Two-phase (Map + Reduce)   | DAG + in-memory RDDs
Task Execution    | New JVM per task           | Long-lived executors
Intermediate Data | Written to HDFS            | Memory (spilled to disk if needed)
Scheduling        | YARN-controlled            | Spark-controlled
Fault Tolerance   | Re-run task, reread data   | RDD lineage-based recompute
Ideal Workload    | Batch only                 | Batch, ML, streaming

Lessons from the Field

  1. Spark isn’t just faster — it’s fundamentally different
    Memory, lineage, and DAGs shift how jobs are written and executed.

  2. YARN is a launchpad, not a scheduler for Spark
    Once containers are allocated, Spark controls everything inside.

  3. JVM enables both flexibility and performance
    Spark proves that the JVM can support large-scale, low-latency compute.

  4. Replacing MapReduce requires more than performance
    Spark’s expressiveness, APIs, and multi-language support are just as critical.

  5. Understanding operational boundaries matters
    Spark + YARN is a powerful combo — but tuning requires knowing who controls what.


If You’re Curious…

  • Try writing a multi-stage data pipeline in both MapReduce and Spark — compare the effort
  • Run Spark in cluster mode and client mode on YARN and monitor executor placement
  • Explore Spark’s web UI to see DAGs and task distribution
  • Tune a job with dynamic allocation + caching — and watch memory usage

“If MapReduce brought data processing to the masses, Spark made it feel like programming again.”

In 2015, Spark running on YARN marked the beginning of a new era — unifying speed, scale, and simplicity for big data engineering.

And the revolution was just heating up.
