If you’ve ever wondered how Facebook handles millions of users fetching data at lightning speed, a big part of the answer is systems like Memcache, a network cache that helps Facebook serve data quickly while keeping things reliable. Its main purpose is to return frequently accessed data fast and to avoid recomputing the same information over and over. Let’s look at how Facebook modified Memcache to operate at their massive scale and solve real-world challenges, based on their paper _"Scaling Memcache at Facebook"_.

But first, a quick look at:

How Memcache Works

The basic process is simple:

For Reading Data:

  1. Web server checks the cache first
  2. If the data is there, it’s returned right away
  3. If not, the server gets it from the database
  4. The server saves this data in the cache before sending it back

For Writing Data:

  1. The system just deletes the cached key instead of updating it
  2. Deletion can happen multiple times without causing problems, since it is an idempotent operation
  3. Fresh data gets added to the cache when needed next
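
Here is a minimal cache-aside sketch of both paths. The `cache` and `db` clients and their method names are illustrative stand-ins, not Facebook’s actual API:

```python
# Cache-aside sketch (illustrative client names, not Facebook's actual API).

def read(key, cache, db):
    value = cache.get(key)          # 1. check the cache first
    if value is not None:
        return value                # 2. cache hit: return it right away
    value = db.query(key)           # 3. cache miss: fall back to the database
    cache.set(key, value)           # 4. populate the cache before returning
    return value

def write(key, new_value, cache, db):
    db.update(key, new_value)       # the write goes to the database...
    cache.delete(key)               # ...and the cached key is deleted, not updated.
                                    # Deleting is idempotent, so repeating it is safe;
                                    # the next read repopulates the cache.
```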

Reducing Latency: Making Data Fetching Lightning Fast

Server-to-server and server-to-cache communication is very common when fetching anything. A requested item can be made up of multiple parts retrieved from multiple services and different data sources, so a single user request can involve several round trips to the cache. That makes reducing latency a big factor in serving the overall request.

Facebook added a few capabilities to overcome this:

  1. Parallel Requests and Batching: instead of calling the cache server for a single item, the application can batch requests for multiple keys and send them to Memcache in one round trip.
  2. No Server-to-Server Communication: each Memcache server only cares about the keys it holds and doesn’t know that other Memcache servers exist, so there is no shared state to maintain. This simplifies the logic and the overall architecture.
  3. UDP vs TCP usage: UDP is used for get requests to reduce latency (about a 20% reduction compared to TCP), while TCP is used for set and delete operations, where reliability matters. On a UDP error the client treats the get as a cache miss, but the web server skips inserting the entry into the cache afterwards to avoid adding load to a possibly overloaded server. The high memory demands of TCP connections also make it expensive to maintain a connection between every web server thread and every Memcache server.
  4. Congestion Control for requests: clients implement a sliding-window mechanism in which the window grows as requests succeed and shrinks as requests fail (see the sketch right after this list).
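
Here is a rough sketch of such a sliding window, assuming additive growth on success and multiplicative shrink on failure; the constants are illustrative, the paper only describes the grow/shrink behaviour:

```python
# Sliding-window congestion control sketch: the client keeps only a bounded
# number of outstanding cache requests in flight. Constants are illustrative.

class SlidingWindow:
    def __init__(self, start=10, minimum=1, maximum=1000):
        self.window = start
        self.minimum = minimum
        self.maximum = maximum
        self.in_flight = 0

    def can_send(self):
        # A new request may only be dispatched if it fits inside the window.
        return self.in_flight < self.window

    def on_send(self):
        self.in_flight += 1

    def on_success(self):
        self.in_flight -= 1
        # Grow slowly while requests keep succeeding.
        self.window = min(self.maximum, self.window + 1)

    def on_failure(self):
        self.in_flight -= 1
        # Shrink quickly when requests fail, easing pressure on the servers.
        self.window = max(self.minimum, self.window // 2)
```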

Another key component Facebook introduced is Mcrouter, a proxy for Memcache servers. Instead of each application request creating its own connection to Memcache, Mcrouter acts as a middle layer. Applications can either use a client library or route requests through Mcrouter, which maintains a persistent connection pool to the Memcache servers. This approach reduces memory usage and connection overhead, making it the standard and preferred way to interact with Memcache at scale.

All of the above reduce latency by cutting down network trips (batching and Mcrouter) and connection overhead (no server-to-server communication, UDP), while congestion control protects against overload.

Reducing Load: Avoiding Database Overload

The main purpose of a cache is to reduce the load on the database by avoiding repeated fetching of the same data. However, in some scenarios, cache misses themselves become the reason for increased database calls. This not only affects performance and latency but also has a cascading impact on database load.

Stale State

This occurs when multiple servers try to set a value for the same key, but their writes land out of order, so they overwrite each other’s updates and the cache ends up holding a value that is not the latest.
For example, Server A reads a key from the cache, finds a miss, and queries the database for the value. Meanwhile, Server B hits the same scenario: a cache miss followed by a DB query. If Server B gets the newer value and writes it first, and then Server A writes its (now stale) value, we end up in an inconsistent, stale state.

  • How it’s solved
    • Server A requests “user_score” (100), gets a cache miss, and receives a 64-bit lease token from Memcached.
    • Server B requests the same key, but since A has the lease, B gets a “wait” notification instead of a token.
    • Server A calculates the new score (110), uses its lease token to set it, and Memcached accepts it.
    • Server B retries after a short delay, sees the updated 110, and proceeds without overwriting—making the operation atomic.
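
Here is a rough sketch of that flow from the client’s side. The get_with_lease/set_with_lease names are illustrative stand-ins, not memcached’s actual wire commands:

```python
import time

def read_with_lease(key, cache, db, retry_delay=0.05):
    """Cache-aside read protected by leases (illustrative API sketch)."""
    while True:
        value, lease_token = cache.get_with_lease(key)
        if value is not None:
            return value                     # normal cache hit
        if lease_token is not None:
            # We hold the lease: fetch from the database and set using the token.
            value = db.query(key)
            # The set is rejected if the lease was invalidated in the meantime
            # (e.g. by a delete), which is what prevents stale sets.
            cache.set_with_lease(key, value, lease_token)
            return value
        # No value and no token: another server holds the lease, so wait and retry.
        time.sleep(retry_delay)
```
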
Thundering Herd

This problem occurs when a key is requested by many clients at the same time, leading all those requests to hit a cache miss and go straight to the database.

  • How it’s solved
    • By slightly modifying the lease approach, Memcached limits token issuing to once every 10 seconds per key, which solves this problem too.
    • Subsequent requests within this period receive a “wait” notification.
    • Since the token holder typically sets data within milliseconds, waiting clients usually find the data in cache when they retry.
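
A sketch of that rate limit on the server side, assuming the server simply remembers when it last handed out a token for each key (the bookkeeping and token generator are illustrative):

```python
import secrets
import time

LEASE_INTERVAL = 10.0   # issue at most one lease token per key every 10 seconds
last_issued = {}        # key -> timestamp of the last token handed out

def maybe_issue_token(key):
    """Return a fresh 64-bit token, or None meaning 'wait and retry shortly'."""
    now = time.time()
    if now - last_issued.get(key, 0.0) >= LEASE_INTERVAL:
        last_issued[key] = now
        return secrets.randbits(64)
    # A token was issued recently; by the time the caller retries, the token
    # holder has usually already repopulated the cache.
    return None
```
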
Replication Within Pools

When certain keys rise in demand and the load becomes too much for one server to handle, Memcache replicates all of those keys onto two servers instead of splitting the keys between them. This works because the difference in overhead between retrieving 100 keys per request and retrieving 1 key is small.

For example, suppose a server stores 100 keys and can handle 500k requests/sec:

How to Scale Up to 1M requests/sec?

  1. Option 1: Split Key Space

    • Add another server.
    • Each server holds half the keys (50 keys each).
    • Now, for every client request (asking 100 keys):
      • Client must split the request into 2 parts (50 keys to server A, 50 keys to server B).
      • Each server still receives all 1M requests/sec, just with fewer keys per request → no real load reduction.
  2. Option 2: Replicate Keys

    • Instead of splitting keys, copy all 100 items to both servers.
    • Now, each server has the full set.
    • A client can send the full request (for 100 keys) to any one server.
    • Load gets divided → each server handles only 500k requests/sec.
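
A tiny sketch contrasting the two options from the client’s point of view; the server names and the multiget helper are purely illustrative:

```python
import random

servers = ["cache-a", "cache-b"]

def multiget(server, keys):
    # Stand-in for one batched get request against a single Memcache server.
    return {key: f"<value of {key} from {server}>" for key in keys}

def fetch_split(keys):
    # Option 1: the key space is split, so every client request fans out to
    # both servers; each server still sees the full request rate.
    half = len(keys) // 2
    result = multiget(servers[0], keys[:half])
    result.update(multiget(servers[1], keys[half:]))
    return result

def fetch_replicated(keys):
    # Option 2: both servers hold every key, so the whole batch goes to one
    # randomly chosen replica, halving the request rate each server sees.
    return multiget(random.choice(servers), keys)
```
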
Handling Failures

Memcache failures, even on a small scale, can lead to serious backend overload and cascading system failures, especially if retrying requests overwhelms the database. To handle this, a small set of dedicated servers called Gutter (about 1% of the Memcache cluster) temporarily take over the role of failed servers. When a Memcache request fails, the client retries it through the Gutter pool. If the data isn’t found there, the client queries the database and stores the result in Gutter. These entries expire quickly, reducing complexity and ensuring stale data doesn’t linger.

This approach is preferred over rehashing keys to other Memcache servers, which risks overloading them—especially if some keys are accessed much more frequently than others. By redirecting failed requests to idle Gutter servers, the system prevents load spikes, protects backend systems, and helps maintain system stability even during partial outages.
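
A sketch of that read path with a Gutter fallback; the pool clients, their methods, and the CacheServerError exception are assumptions of this sketch:

```python
class CacheServerError(Exception):
    """Signals that a Memcache server is unreachable (illustrative)."""

def get_with_gutter(key, main_pool, gutter_pool, db, short_ttl=10):
    """Read path that falls back to the Gutter pool when the main server fails."""
    try:
        value = main_pool.get(key)
    except CacheServerError:
        # The regular server failed: retry against the small Gutter pool instead
        # of rehashing the key onto other, possibly already busy, servers.
        value = gutter_pool.get(key)
        if value is None:
            value = db.query(key)
            gutter_pool.set(key, value, expire=short_ttl)  # short TTL so stale data doesn't linger
        return value
    if value is None:
        # Normal miss on a healthy server: standard cache-aside path.
        value = db.query(key)
        main_pool.set(key, value)
    return value
```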

Performance Improvements

Dynamic Hashtable

Original Problem: The original memcached used a fixed-size hash table. As more data was added, hash collisions would increase, causing lookup times to degrade toward O(n) complexity (linear search through collision chains).

Implementation: They modified memcached to automatically resize the hash table as the data volume grew. This meant:

  • The system could detect when the hash table was becoming too densely populated
  • It would allocate a larger hash table when needed
  • Keys would be rehashed into the new, larger table
  • This maintained O(1) lookup performance regardless of data size

This optimization ensures consistent performance even as the cache grows to hold millions of records.
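
A toy illustration of the idea (not memcached’s actual hash table code): a chained table that doubles in size and rehashes once its load factor crosses a threshold, keeping average lookups close to O(1):

```python
class ResizableHashTable:
    """Toy chained hash table that doubles when the load factor gets too high."""

    MAX_LOAD = 1.5  # average chain length that triggers a resize (illustrative)

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]
        self.count = 0

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

    def set(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.count += 1
        if self.count / len(self.buckets) > self.MAX_LOAD:
            self._resize()

    def _resize(self):
        # Allocate a table twice as large and rehash every key into it,
        # keeping chains short and lookups close to O(1).
        old = self.buckets
        self.buckets = [[] for _ in range(len(old) * 2)]
        for bucket in old:
            for key, value in bucket:
                self._bucket(key).append((key, value))
```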

Global Lock and Multi-threaded Access

Original Problem: The original Memcache was single-threaded, which couldn’t fully utilize modern multi-core servers and limited throughput.

Implementation: They modified memcached to:

  • Create multiple worker threads to handle requests in parallel
  • Implement a global lock to protect shared data structures from concurrent access
  • Allow threads to process different requests simultaneously when not accessing shared data
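
A minimal sketch of that threading model, with a dict standing in for memcached’s shared structures and a single global lock around them:

```python
import threading

cache_lock = threading.Lock()   # one global lock guarding the shared hash table / LRU
store = {}                      # stand-in for memcached's shared data structures

def handle_request(command, key, value=None):
    """Run by every worker thread; parsing and I/O can overlap across threads,
    but access to shared state is serialized by the global lock."""
    with cache_lock:
        if command == "set":
            store[key] = value
            return "STORED"
        if command == "get":
            return store.get(key)

def worker(requests):
    for request in requests:
        handle_request(*request)

# Several worker threads process different batches of requests in parallel.
batches = [[("set", "user:1", "alice")], [("get", "user:1")]]
threads = [threading.Thread(target=worker, args=(batch,)) for batch in batches]
for t in threads:
    t.start()
for t in threads:
    t.join()
```
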
Per-thread UDP Ports

Original Problem: In the multi-threaded version, threads would contend for network resources when sending replies, creating a bottleneck.

Implementation: They assigned each thread its own dedicated UDP port:

  • Each thread exclusively owns one UDP port for sending responses
  • This eliminates contention between threads when transmitting data
  • It distributes interrupt processing overhead across CPU cores
  • Network packet handling becomes more efficient

This optimization reduced thread contention at the network layer and improved overall throughput, particularly for UDP traffic.
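
A rough sketch of the per-thread UDP port idea; the base port number and the request handler are placeholders:

```python
import socket
import threading

BASE_PORT = 11300   # illustrative base port; each worker thread gets its own port

def handle(data: bytes) -> bytes:
    # Placeholder for real request parsing and cache lookup.
    return b"END\r\n"

def udp_worker(thread_id):
    # Each worker thread owns a dedicated UDP socket/port, so threads never
    # contend for the same socket when sending replies.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", BASE_PORT + thread_id))
    while True:
        data, client = sock.recvfrom(1500)
        sock.sendto(handle(data), client)   # reply goes out on this thread's own port

threads = [threading.Thread(target=udp_worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```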

New Adaptive Slab Allocation System

Memcached had a fixed slab allocation system, which did not adapt well when usage patterns changed:

  • If item sizes shifted (e.g., more small items, fewer large items)
  • Memory remained locked in slab classes that no longer needed it
  • Other slab classes could be constantly evicting items despite available memory elsewhere
  • This resulted in poor hit rates

New Adaptive Solution

Facebook built an algorithm that redistributes memory among needy slab classes and avoids the memory fragmentation they would otherwise have to deal with by shifting keys around and running background processes.

Memcached organizes memory into “slab classes” - groups of uniformly sized chunks designed to store items efficiently:

  1. Identifying Needy Slab Classes:
    • The system looks for classes that are actively evicting items
    • The next item to be evicted must be at least 20% more recently used than the average of the least recently used items in other slab classes
  2. Memory Transfer Mechanism:
    • When a needy class is found, the system identifies the slab class with the least recently used item
    • It frees that slab (1MB) from the “donor” class
    • It transfers that memory to the “needy” class

Chunk sizes increase by a factor of 1.25 (so with the smallest chunk size at 80 bytes, the next size is 100 bytes, and then 128 bytes). So what are the benefits and overheads of this technique? The first benefit is that memory never fragments in memcached: there is no need to compact it, and no background processes are needed to rearrange stored items. There is also no need to ever clean memory; the existing slab can simply be overwritten. On the negative side, there is some overhead in managing this process, and items rarely fill their chunk exactly, so some space is always left unused inside a slab. At the global level, however, a whole slab can be reassigned and reused.
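
Putting the "needy class" check and the memory transfer together, here is a rough sketch of one rebalancing pass. The data layout, the interpretation of the 20% rule, and the helper names are assumptions of this sketch, not memcached’s actual code:

```python
SLAB_SIZE = 1024 * 1024   # memory moves between classes one 1 MB slab at a time
RECENCY_FACTOR = 1.20     # the "at least 20% more recently used" criterion

def rebalance(slab_classes):
    """One pass of the adaptive allocator over a list of slab-class stats.

    Each class is a dict like:
      {"evicting": bool,
       "next_eviction_age": seconds since the item about to be evicted was last used,
       "lru_age": seconds since the class's least recently used item was last used}
    """
    for needy in slab_classes:
        if not needy["evicting"]:
            continue
        others = [c for c in slab_classes if c is not needy]
        if not others:
            return
        avg_lru_age = sum(c["lru_age"] for c in others) / len(others)
        # "More recently used" means a smaller age: treat the class as needy when the
        # item it is about to evict is at least 20% fresher than the others' LRU items.
        if needy["next_eviction_age"] * RECENCY_FACTOR <= avg_lru_age:
            donor = max(others, key=lambda c: c["lru_age"])  # class holding the overall LRU item
            transfer_slab(donor, needy, SLAB_SIZE)           # free a 1 MB slab and hand it over

def transfer_slab(donor, needy, size):
    # Placeholder: real memcached frees a slab from the donor class and
    # reassigns that memory to the needy class.
    pass
```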

Thanks for reading, I hope you found this insightful.

You can reach out to me with any suggestions or queries at rahuldsoni@gmail.com or https://x.com/ssoonniiii .