
By September 2015, Hadoop was no longer a fringe technology. It had become the de facto platform for big data infrastructure — and at the heart of this evolution was a powerful but often misunderstood component: YARN.

While clustered relational databases like Oracle RAC doubled down on shared storage and tightly coupled compute, YARN flipped the model. Instead of pulling data to compute, it moved compute to the data — a fundamental shift in architecture.

And surprisingly, a big part of what made this possible was the JVM.


The Backstory: From MapReduce to YARN

In Hadoop’s early days, MapReduce was both the execution engine and the resource manager. This tight coupling led to limitations:

  • Every job had to be written in the MapReduce model
  • Resource utilization was inefficient
  • No support for multiple compute paradigms (e.g., streaming, graph processing)

Enter YARN (Yet Another Resource Negotiator) — introduced in Hadoop 2.x — which decoupled resource management from execution logic.

YARN transformed Hadoop from a batch processing engine into a general-purpose data operating system.
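
To make the decoupling concrete, here's a minimal sketch of what submitting any application to YARN looks like through the Hadoop 2.x client API. The AM command, class name, and memory sizes are placeholders; the point is that YARN's side of the contract shrinks to "launch this coordinator process":

```java
import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

import java.util.Collections;

public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // Ask the ResourceManager for a new application id.
        YarnClientApplication app = yarn.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("demo-app");

        // The command launches *our* coordinator process. YARN does not care
        // whether it ends up coordinating MapReduce, Spark, or anything else.
        ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                Collections.emptyMap(),                 // no localized files
                Collections.emptyMap(),                 // no extra env vars
                Collections.singletonList(
                        "java -Xmx256m com.example.MyAppMaster"), // placeholder AM
                null, null, null);
        ctx.setAMContainerSpec(amContainer);
        ctx.setResource(Resource.newInstance(512, 1));  // 512 MB, 1 vcore for the AM

        yarn.submitApplication(ctx);
        yarn.stop();
    }
}
```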


YARN Under the Hood

At a high level, YARN consists of four core components:

1. ResourceManager (RM)

Acts as the master for resource allocation across the cluster. It tracks available resources (CPU, memory) and assigns them to applications.

2. NodeManager (NM)

Runs on each node and reports resource availability to the RM. It also launches and monitors containers.

3. ApplicationMaster (AM)

Per-application coordinator. Each submitted job spins up an AM, which negotiates resources with the RM and coordinates tasks on NodeManagers.

4. Container

A lease on a slice of a node's resources (CPU, memory) in which a single task process runs. Containers host the actual work — MapReduce tasks, Spark executors, Tez tasks, and so on.
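
Here's how those pieces talk to each other, as a hedged sketch of an ApplicationMaster's negotiation loop using the Hadoop 2.x AMRMClient. A real AM runs inside a container that YARN launched for it; the host name and resource sizes below are made up:

```java
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class TinyAppMaster {
    public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
        rm.init(new YarnConfiguration());
        rm.start();

        // 1. Register with the ResourceManager (host/port/URL are placeholders).
        rm.registerApplicationMaster("am-host", 0, "");

        // 2. Ask for one container: 1 GB, 1 vcore, anywhere in the cluster.
        Resource capability = Resource.newInstance(1024, 1);
        rm.addContainerRequest(
                new ContainerRequest(capability, null, null, Priority.newInstance(0)));

        // 3. Heartbeat until the RM grants the container. A real AM would then
        //    hand it to an NMClient to launch the actual task process.
        while (true) {
            AllocateResponse resp = rm.allocate(0.1f);
            if (!resp.getAllocatedContainers().isEmpty()) {
                Container c = resp.getAllocatedContainers().get(0);
                System.out.println("Granted " + c.getId() + " on " + c.getNodeId());
                break;
            }
            Thread.sleep(1000);
        }

        rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
    }
}
```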


Data Locality as a First-Class Principle

Unlike traditional systems where data is fetched over the network to a compute node, YARN asks: “Where does the data live?” and then schedules compute on that node.

This strategy:

  • Minimizes data shuffling
  • Reduces I/O latency
  • Increases throughput at scale

Here’s a typical flow:

  1. A MapReduce job is submitted
  2. YARN starts an ApplicationMaster
  3. The AM asks the RM for containers on nodes where HDFS blocks reside
  4. NodeManagers launch tasks in containers with data-local awareness

The result? Massively parallel data-local execution, enabled by HDFS’s block-level replication and metadata awareness.
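
Step 3 is where locality happens: the AM can ask the NameNode where each block's replicas live and pass those hosts to the RM as placement hints. A sketch under that assumption (the input path and sizes are placeholders; relaxLocality lets the RM fall back to rack- or cluster-local if the preferred node is busy):

```java
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class DataLocalRequests {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();
        FileSystem fs = FileSystem.get(conf);

        // Ask the NameNode where each block of the input file physically lives.
        Path input = new Path("/data/events.log"); // placeholder path
        FileStatus stat = fs.getFileStatus(input);
        BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());

        AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
        rm.init(conf);
        rm.start();

        // One container request per block, hinted at the block's replica hosts.
        // The requests are actually sent on subsequent allocate() heartbeats.
        for (BlockLocation block : blocks) {
            rm.addContainerRequest(new ContainerRequest(
                    Resource.newInstance(1024, 1),
                    block.getHosts(),          // preferred nodes: where the data is
                    null,                      // no explicit rack list
                    Priority.newInstance(0),
                    true));                    // relaxLocality: allow fallback
        }
    }
}
```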


How This Differs From Oracle RAC

Oracle RAC (Real Application Clusters) takes a very different approach:

  • All nodes share access to a centralized storage layer (usually SAN or ASM)
  • Coordination happens via a cluster-wide lock manager and internal messaging
  • Compute is distributed, but data is effectively centralized

This model works well for OLTP workloads with high concurrency and transactional integrity. But it introduces significant bottlenecks for analytical or batch workloads:

  • Data locality is nonexistent
  • Reads/writes can lead to lock contention
  • Scaling requires expensive hardware and careful tuning

In contrast, YARN + HDFS:

  • Distributes both data and compute
  • Treats failure as normal (replication instead of locking)
  • Scales horizontally on commodity machines

This is not just a technical difference — it’s a philosophical one.


The Role of the JVM

YARN’s flexibility owes a lot to the Java Virtual Machine:

  • Containers in YARN often run JVM-based applications (MapReduce, Spark, Tez)
  • Each container runs as its own JVM process (with classloader separation for finer-grained isolation inside frameworks), which allows safe multi-tenancy
  • Java’s rich ecosystem (HDFS, Hive, HBase, etc.) makes integration seamless
  • JVM introspection tools (JMX, GC logs, profilers) allow deep visibility into task execution
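
That last point is mostly a matter of passing ordinary JVM flags down to container JVMs through job configuration. A minimal sketch using the stock MapReduce property names — the flags themselves are just examples:

```java
import org.apache.hadoop.conf.Configuration;

public class ContainerJvmFlags {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // GC logging for every map-task JVM on the cluster.
        conf.set("mapreduce.map.java.opts",
                 "-Xmx1024m -verbose:gc -XX:+PrintGCDetails");

        // Same idea for reduce tasks.
        conf.set("mapreduce.reduce.java.opts",
                 "-Xmx2048m -verbose:gc -XX:+PrintGCDetails");

        // Spark's equivalent knob is spark.executor.extraJavaOptions on SparkConf.
    }
}
```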

And critically, JVM portability means that the same code can run across clusters with different OSes, configurations, or hardware — as long as there’s a JDK.

We used this extensively in 2015 for cross-data-center execution of Spark jobs using YARN as the resource backbone.


Beyond MapReduce: Multi-Engine on YARN

YARN isn’t just for MapReduce. By 2015, the Hadoop ecosystem had already started diversifying:

  • Apache Tez: DAG-based execution for low-latency jobs (used by Hive)
  • Apache Spark: In-memory processing engine that integrates with YARN
  • Apache Flink: Stream processing framework with batch support

All of these could be scheduled via YARN, thanks to its pluggable ApplicationMaster architecture.

YARN had become the kernel of the data OS — abstracting resources, managing lifecycle, and ensuring fairness across diverse applications.
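
For example, a Spark 1.x application could target YARN simply by naming it as the master; Spark ships its own ApplicationMaster, and the job code itself doesn't change. A sketch (the input path is a placeholder):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkOnYarn {
    public static void main(String[] args) {
        // "yarn-client" was the Spark 1.x way of saying "let YARN allocate my
        // executors"; the same code runs unchanged on Spark's standalone scheduler.
        SparkConf conf = new SparkConf()
                .setAppName("yarn-demo")
                .setMaster("yarn-client");
        JavaSparkContext sc = new JavaSparkContext(conf);

        long lines = sc.textFile("hdfs:///data/events.log").count(); // placeholder path
        System.out.println("Line count: " + lines);
        sc.stop();
    }
}
```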


Lessons from the Field

  1. Data-local compute is a game changer
    YARN’s ability to push compute to data fundamentally reduces I/O overhead and accelerates analytics.

  2. RAC vs YARN is a contrast of assumptions
    RAC assumes strong coordination; YARN assumes partial failure and builds in fault tolerance.

  3. JVM is not a bottleneck — it’s an enabler
    From serialization frameworks to GC tuning, the JVM powers modern data pipelines at scale.

  4. Resource decoupling enables innovation
    By splitting resource management and execution, YARN unlocked diverse compute models.

  5. Distributed doesn’t mean complex, if done right
    YARN abstracts just enough to let engineers focus on data processing logic, not cluster orchestration.


If You’re Curious…

  • Try submitting a Spark job to a YARN cluster and observe container allocation (a client-side sketch follows this list)
  • Explore YARN’s FairScheduler and CapacityScheduler — key for multi-tenant environments
  • Read the original Apache YARN architecture paper for deep insight
  • Compare HDFS replication-aware scheduling with Oracle’s global buffer cache
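
For the first bullet, here's a small way to watch allocation from the client side: list the cluster's applications and the containers each one currently holds (the report fields used are from the Hadoop 2.x YarnClient API):

```java
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.ApplicationResourceUsageReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class WatchContainers {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // Print each application's state and how many containers it holds.
        for (ApplicationReport app : yarn.getApplications()) {
            ApplicationResourceUsageReport usage = app.getApplicationResourceUsageReport();
            System.out.printf("%s  %-10s  containers=%d%n",
                    app.getApplicationId(),
                    app.getYarnApplicationState(),
                    usage.getNumUsedContainers());
        }
        yarn.stop();
    }
}
```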

“YARN didn’t just change how we run jobs. It changed where we run them — and why that matters.”

By moving computation closer to data, embracing JVM-based execution, and providing a robust, extensible resource layer, YARN redefined distributed computing for the data age.

And in 2015, that shift was only just beginning.
