Google faced a challenge: they needed to query massive amounts of raw data, but processing and generating output was time-consuming. They required a solution that could parallelize the entire computation process, deliver results faster across multiple machines, and handle failures effectively.

The goal was to build a system capable of parallel computation. The computations that derived the data were conceptually simple, but the volume of raw data (crawled pages, documents, etc.) was huge.

This led to the MapReduce model (later popularized by its open-source implementation, Hadoop), which manages parallel processing, data distribution, and load balancing behind the scenes while allowing users to focus solely on defining their tasks.

Google created a model called MapReduce, which follows two steps: first mapping every record to intermediate key/value pairs, then performing a reduce operation over all values grouped under the same key. Most of their data processing tasks fit into this paradigm.

Let’s understand this with an example.

Programming Model

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

Example: Word Count

Let’s see what Map and Reduce actually do with an example. Assume we have input from 2 documents:

  • doc1: “apple banana apple”
  • doc2: “banana orange apple”

The Map() function processes this and generates: (“apple”, “1”), (“banana”, “1”), (“apple”, “1”) from doc1, and (“banana”, “1”), (“orange”, “1”), (“apple”, “1”) from doc2.

Basically, it creates Key/Value pairs that we need for the reduce operation. Here, we’ve decided to use each word as a separate key with a default count of 1.

Then the Reduce() function combines these results: (“apple”, “3”), (“banana”, “2”), (“orange”, “1”)
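To make this concrete, here is a minimal single-machine sketch of the same word count in Python. The function names and the in-memory "shuffle" dict are illustrative only; a real cluster would shuffle the emitted pairs across machines.

```python
# Minimal single-machine sketch of the word count above. In a real
# cluster the emitted pairs would be shuffled across machines; here
# a dict plays the role of the shuffle step.
from collections import defaultdict

def map_fn(doc_name, contents):
    # key: document name, value: document contents
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # key: a word, values: list of counts for that word
    return word, sum(counts)

docs = {"doc1": "apple banana apple", "doc2": "banana orange apple"}

grouped = defaultdict(list)
for name, text in docs.items():
    for word, count in map_fn(name, text):
        grouped[word].append(count)            # the "shuffle"

result = dict(reduce_fn(w, c) for w, c in grouped.items())
print(result)  # {'apple': 3, 'banana': 2, 'orange': 1}
```

Running this prints exactly the counts we derived by hand above.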

Okay, all of this still feels abstract, and it should: users only need to worry about defining the Map and Reduce functions. Let’s now take a closer look at how data flows across machines and understand what happens behind the scenes to give us this word count result.

Architecture

  • At the top, there is one designated machine called the Master Node that orchestrates the whole process and handles resource allocation. The Master decides which workers will function as Mappers and which as Reducers.
  • Input is split into many pieces. In their original 2004 paper, Google stated that each piece is typically 16-64 MB in size.
  • Mappers compute their tasks and write results to R files on their attached disks. They then send these file addresses to the Master, which communicates this information to the Reducers that will consume those specific files.
  • The numbers of Mappers and Reducers are not fixed. There can be X Mappers and Y Reducers with no correlation between these numbers. Users can configure these based on their specific requirements.
  • Output files from Mappers are distributed to each Reducer. For example, if we have 2 Reducers, the output from Mapper(1) will be split and sent to both Reducers. So each Mapper creates multiple output files based on the number of available Reducers. We’ll discuss the advantages of this approach later.
  • Reducers read from their respective files and sort the input based on keys, which makes the Reduce operation possible. As we saw in our example, (“banana”,1), (“banana”,1) becomes (“banana”,2).
  • After completing their operations, Reducers send their output to their associated output files.

Advantages of this architecture:

  1. It’s stateless, so you can add more machines or disks at any stage and the job will still work, scaling horizontally across as many machines as you need.
  2. If any component fails, we can restart it independently or replace it quickly.
  3. The computers don’t need to be powerful—these are inexpensive commodity machines.

Why are Mappers and Reducers not correlated?

  1. If all map outputs were stored in a single file per map task, the reduce tasks would need to scan and filter through the entire file, which is inefficient.
  2. By partitioning the output into R files during the map phase, each reduce task can quickly fetch only its relevant data, reducing unnecessary data transfer and processing.

How Does It Work?

  1. Map Phase (Splitting and Partitioning)
    • Each map task processes an input split and produces intermediate key-value pairs.
    • These key-value pairs are hashed (partitioned) based on the key to determine which reduce task should process them.
    • Instead of storing all intermediate output in a single file, the map task writes R separate files (one for each reduce task).
  2. Shuffle Phase (Data Transfer)
    • Once a map task is done, the master node informs the reduce tasks where to find their respective files.
    • Each reduce task fetches its assigned partitions from all completed map tasks.
  3. Reduce Phase (Merging and Final Output)
    • Each reduce task processes only the key-value pairs assigned to it.
    • The reduce task outputs a single final file.
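The partitioning step above can be sketched in a few lines of Python. This is an illustration, not Hadoop’s actual code: the toy partition function (summing character codes) stands in for the library’s real hash partitioner, and R = 2 is an arbitrary choice.

```python
# Sketch of how a mapper partitions its output into R files,
# one per reducer, using hash(key) % R. The partition function
# here (summing character codes) is a stand-in for the real
# hash partitioner; R = 2 is chosen just for illustration.
R = 2

def partition(key, num_reducers):
    # deterministic toy hash: same word -> same reducer, always
    return sum(ord(c) for c in key) % num_reducers

mapper_output = [("apple", 1), ("banana", 1), ("apple", 1)]

partitions = [[] for _ in range(R)]   # one "file" per reduce task
for key, value in mapper_output:
    partitions[partition(key, R)].append((key, value))

for r, part in enumerate(partitions):
    print(f"partition for reducer {r}: {part}")
```

Because the hash is deterministic, every (“apple”, 1) pair, from any mapper, lands in the same partition, so one reducer sees all counts for “apple” without scanning the other mappers’ full output.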

What about faults?

Since there will be lots of machines running simultaneously, fault tolerance is crucial.

Worker fault

The Master pings every worker at regular intervals and expects acknowledgement replies. After a certain number of missed replies, the Master marks that worker as failed and resets its tasks to the idle state so they can be rescheduled on other workers. This applies to both Mappers and Reducers: if they fail, the same input is rescheduled to another machine.

Even if large numbers of machines in the cluster fail, disconnect, or become unreachable due to network or power issues, the Master will re-execute their tasks on available machines and complete the job. This might take more time, but the process will finish.
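A toy sketch of this failure-detection loop (the MAX_MISSED threshold and the worker and task names are made up for illustration; the real Master tracks far more state per worker):

```python
# Toy sketch of the Master's failure detection. MAX_MISSED, the
# worker names, and the task names are all made up for illustration.
MAX_MISSED = 3

workers = {"w1": 0, "w2": 5}                   # worker -> missed pings
tasks   = {"w1": ["map-3"], "w2": ["map-7", "reduce-1"]}
pending = []                                   # tasks to reschedule

for worker, missed in list(workers.items()):
    if missed >= MAX_MISSED:
        pending.extend(tasks.pop(worker))      # reset the worker's tasks
        del workers[worker]                    # mark the worker as failed

print(pending)  # ['map-7', 'reduce-1'] go back to the idle queue
```

Here w2 has missed too many pings, so both of its tasks go back to the pending queue for any healthy worker to pick up.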

Master failure

In this case, Google simply restarts the whole job. Users can pick a more durable machine for the Master role, but failures can never be ruled out entirely.

Users can also configure the system to elect a new Master if needed, since each Master saves progress at regular checkpoints.

Handling slow machines

If some machines respond slowly due to low disk read throughput or if the operating system prioritizes other tasks over Hadoop processes, these machines can become outliers that slow down the entire operation, even when other machines have completed their processing.

Hadoop solves this problem through a technique called speculative execution. As the process nears completion, the system identifies which tasks are still running on slow machines and assigns duplicate copies of those pending tasks to idle machines that have already finished their work. Whichever copy finishes first, the original on the slow machine or the newly scheduled duplicate, is considered successful, and its result is used. This prevents a few slow machines from becoming bottlenecks in the overall MapReduce job.

Additional Refinements

Combiner Function

Users can specify a combiner function, which works just like a reduce task but runs on the Mapper side. Before the Map writes data to disk and sends it over the network, this function executes, essentially pre-reducing data and making it more compact before transfer. This makes overall execution faster since less data needs to be sent and computed.
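A minimal sketch of what a combiner buys us, using the word count example (the function name is illustrative, not part of any real API):

```python
# Sketch of a combiner: locally pre-reducing map output before it
# is written to disk and shuffled over the network.
from collections import Counter

def map_with_combiner(contents):
    raw_pairs = [(w, 1) for w in contents.split()]  # normal map output
    combined = Counter()
    for word, count in raw_pairs:                   # combiner = local reduce
        combined[word] += count
    return list(combined.items())

print(map_with_combiner("apple banana apple"))
# 3 raw pairs shrink to 2 combined pairs: less data to shuffle
```

For word count the combiner is simply the reduce function applied locally; for skewed data (lots of repeated keys on one mapper) the savings in network transfer can be large.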

Skipping Bad Records

Some computations cause exceptions on particular inputs. We can define handlers for these exceptions so the program doesn’t halt execution, either handling them properly or skipping them entirely. If a particular record repeatedly causes failures, the Master flags that record to be skipped on re-execution.

Status Information

The Master provides status updates via HTTP and exports stats for users. This status shows progress metrics like how many tasks are completed, how many are pending, standard error and output statuses, and whether machines are working properly. This helps users plan resource allocation.

Counters

As mentioned above, the Master communicates status along with additional information if requested as part of the computation. For example, in our Word Count example, if we wanted to know “How many instances of ‘apple’ have been found so far,” we can pass this counter function to the Map-Reduce Library and receive regular updates from the Master.
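A toy sketch of such a counter (here a global variable stands in for the counter object that the MapReduce library would aggregate across workers and report through the Master):

```python
# Hypothetical sketch: count "apple" occurrences while mapping.
# In the real system the counter lives in the MapReduce library;
# workers piggyback counter values on their replies to the Master,
# which aggregates and reports them on its status page.
apple_counter = 0

def map_counting(contents):
    global apple_counter
    for word in contents.split():
        if word == "apple":
            apple_counter += 1
        yield (word, 1)

for doc in ["apple banana apple", "banana orange apple"]:
    list(map_counting(doc))  # drain the generator so counting happens

print(apple_counter)  # 3
```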

Performance testing

Let’s take two examples: (1) searching for a word in documents, and (2) sorting.

Machine configuration: a cluster of 1,800 machines. Each machine had two 2 GHz processors, 4 GB of memory, and a 160 GB storage disk, all connected via Ethernet. Of the 4 GB of memory, approximately 1-1.5 GB was reserved by other tasks running on the cluster.

Search Task

The program scans through 10^10 records of 100 bytes each, looking for a three-character pattern (found in only 92,337 records). We split the input into roughly 64MB chunks (M = 15000) and put all output in one file (R = 1). Figure 2 shows how the computation progresses over time. The Y-axis shows the data scanning rate. The speed increases as more machines join the MapReduce job, reaching its peak at over 30 GB/s with 1764 workers assigned. As map tasks finish, the rate drops and hits zero around 80 seconds into the process. The whole computation takes about 150 seconds total, including nearly a minute of startup overhead. This overhead comes from distributing the program to all worker machines and the delays in talking to GFS to open those 1000 input files and gather information needed for locality optimization.

Sort Task

The sort program sorts 10^10 100-byte records (about 1 terabyte of data). The input data is split into 64MB pieces (M = 15000), and the sorted output goes into 4000 files (R = 4000).

Let’s break this down into 3 parts:

  • (a) Normal execution
  • (b) Execution with no backup tasks (that help slower machines)
  • (c) Execution when machines were intentionally shut down, but backup tasks were enabled

(a) Normal Execution

  1. The top-left graph shows how fast we’re reading input data. The rate peaks at around 13 GB/s and drops quickly since all map tasks finish before 200 seconds. The peak is lower than in the Search program because the sort map tasks spend roughly half their time writing intermediate key-value pairs to local disk, rather than just scanning input.
  2. The middle-left graph shows data being transferred from mappers to reducers over the network as soon as map tasks complete.
  3. The bottom-left graph shows how fast the sorted data is written to final output files by reduce tasks. There’s a delay between the end of shuffling and the start of writing because machines are busy sorting the intermediate data.

The highest peak is in graph 1 because data is read locally, giving us the highest input rate. The transfer rate is lower than input but higher than output as data is being sent and shuffled. Output is slowest because two copies are written for fault tolerance.

Total time: 891 seconds

(b) No Backup Tasks

Here, no backup or helper tasks are enabled to help slow machines finish. Because of this, most tasks finish in 960 seconds, but the slow ones take an extra 300 seconds, making the total execution time 1283 seconds.

(c) Machine Failures (Backup Tasks Enabled)

Here, 200 machines were forcefully stopped mid-process, and their inputs were lost (you can see a negative dip in the top-right graph). Following the fault tolerance mechanism, these failed tasks were rescheduled to working machines. Once these machines finished their own tasks, they picked up the failed tasks, completing the whole process in 933 seconds.

Even with a failure rate of about 10% (200/1800 machines), we still performed better than when dealing with just a few slow machines, which hurts throughput much more. In this case, triggering the fault tolerance mechanism and rescheduling tasks turned out to be faster than dealing with a few slow machines.

That was an overview of the Hadoop/MapReduce architecture: one of the best examples of horizontal scaling and solving problems at scale.

Thank you for reading.