We all know the answer: a laptop is not powerful enough to handle millions of users. But the exact technical reason was always vague to me. Where do things break? Where are the bottlenecks? Are they related to hardware or something else? Does it depend on the framework? What actually stops us from getting there?
This will be a deep dive into why my device cannot handle that many users. Instead of settling for “we need a big machine for many users”, the goal is to understand which levers we need to pull, and when, to reach that scale.
This will be a two-part discussion: Part 1 covers the theory, and in Part 2 I will put my PC to the test.
Low level understanding of request cycle
To understand where things break at scale, we first need to understand what actually happens when a request comes in. You hit http://localhost:3000/users/42 and get a JSON response in 5ms locally. Let’s unfold it, because the cracks that appear at 1M users are sitting right here, invisible.
(We’ll be looking at this from a Linux kernel perspective, since most servers run Linux. My project is in .NET, though, so metrics and performance numbers will be from the Windows ecosystem. Don’t mind the mix.)
When your server starts, it makes a socket() system call, asking the OS kernel for a socket. A socket is just a kernel-managed data structure: a buffer for incoming bytes, a buffer for outgoing bytes, and some metadata. The process then calls bind() to claim port 3000, and listen() to tell the kernel: “when SYN packets arrive on this port, queue them.”
At this point, your process isn’t handling connections; the kernel is. The process is sleeping, blocked on an accept() call. The kernel maintains two queues here: the SYN queue (incoming connection requests) and the accept queue (connections that have completed the TCP handshake and are waiting for your app to pick them up). Once your app calls accept(), it pulls a connection off that queue and gets to work.
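The socket(), bind(), listen(), accept() sequence above can be sketched with Python's standard socket module. This is only an illustration of the system calls, not how any particular framework does it; the helper names `open_listener` and `accept_one`, the loopback address, and the backlog of 128 are my own assumptions.

```python
import socket


def open_listener(port: int = 0, backlog: int = 128) -> socket.socket:
    """Create a listening TCP socket. Port 0 lets the OS pick a free port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)    # socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))                              # bind()
    srv.listen(backlog)                                        # listen()
    return srv


def accept_one(srv: socket.socket) -> bytes:
    """Block in accept() until the kernel has a completed connection queued,
    then read whatever is sitting in that connection's receive buffer."""
    conn, _addr = srv.accept()                                 # accept()
    with conn:
        return conn.recv(1024)
```

Note that a client can connect and send data before the server ever calls accept(): the kernel completes the handshake and buffers the bytes on its own, which is exactly the accept queue and receive buffer at work.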
From there, the kernel’s send and receive buffers handle the actual bytes: incoming request bytes land in the receive buffer, and response bytes go out through the send buffer. These buffers are separate from the SYN and accept queues.
The server then parses those bytes into a request object and maps it to a handler. If there’s a database call involved, it opens another TCP connection, this time to Postgres. Postgres does its thing and returns rows. The server serializes them to JSON, which is more CPU work. Finally, the app calls write() on the file descriptor, the kernel copies the bytes into the socket’s send buffer, and they go out on the wire.
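To make the parse → route → serialize → write steps concrete, here is a minimal sketch of the CPU-side work for the /users/42 example. It assumes a well-formed request line, and the `FAKE_USERS` dictionary is a hypothetical stand-in for the Postgres round trip; a real server does far more validation than this.

```python
import json

# Hypothetical stand-in for the database call (no real Postgres involved).
FAKE_USERS = {42: {"id": 42, "name": "Ada"}}


def handle_request(raw: bytes) -> bytes:
    """Parse raw request bytes, route them, and build the response bytes."""
    # 1. Parse: the first line looks like "GET /users/42 HTTP/1.1".
    request_line = raw.split(b"\r\n", 1)[0].decode("ascii")
    method, path, _version = request_line.split(" ")

    # 2. Route: only GET /users/<id> is understood in this sketch.
    parts = path.strip("/").split("/")
    user = None
    if method == "GET" and len(parts) == 2 and parts[0] == "users" and parts[1].isdigit():
        user = FAKE_USERS.get(int(parts[1]))   # the "database call"

    # 3. Serialize: rows become JSON bytes (pure CPU work).
    if user is None:
        status, body = "404 Not Found", b'{"error":"not found"}'
    else:
        status, body = "200 OK", json.dumps(user).encode("utf-8")

    # 4. These are the bytes the app would hand to write() on the socket.
    head = (
        f"HTTP/1.1 {status}\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(body)}\r\n\r\n"
    )
    return head.encode("ascii") + body
```

Every step here is cheap for one request. The point is that each one consumes CPU time and memory that gets multiplied by the number of concurrent requests.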
Now, where do things break?
Layer 1: 0 → 100 concurrent users, the first honest bottleneck
A modern laptop can absolutely handle 100 concurrent users. But how you handle them starts to matter.
Model A: One Thread Per Connection
Threads share memory, so they’re cheaper than processes: roughly 1–2MB of stack per thread.
Why does this break?
- Context switches. Every time the OS scheduler swaps from one thread to another, it has to save the current thread’s CPU registers (general-purpose, floating-point, vector; roughly 15+ registers), update scheduler bookkeeping, and load the next thread’s registers. The raw switch itself takes ~1–5 microseconds. That doesn’t sound like much, but if you have 10,000 threads and the scheduler round-robins through them, you’re burning tens of milliseconds of pure overhead per full cycle. That is before counting the cache-miss penalty afterward, which can extend the effective cost to 10–50 microseconds per switch. At that point you can be spending more CPU on switching than on doing actual work.
- CPU cache thrashing. Every context switch cools the cache: the new thread’s data isn’t there. Cache misses turn a 1ns operation into 100ns. At high thread counts, your CPU spends most of its time waiting for memory, not computing.
- Memory. 2MB stack × 10,000 threads = 20GB of RAM just for thread stacks. Possible, but not practical. To be fair, that figure is reserved address space: physical memory is committed lazily, a page (typically 4KB) at a time as the thread actually touches its stack, so it’s not 20GB all at once.
But the deeper reason this model is wrong, and the insight that led to async/event-loop architectures, is that most of those threads aren’t doing anything. They’re blocked on I/O. Waiting for the database. Waiting for the network. Each thread is holding 2MB of stack and a kernel scheduler slot just to wait.
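The back-of-envelope numbers for Model A are easy to reproduce. This is plain arithmetic on the figures quoted above (10,000 threads, 1–5 µs per raw switch, 10–50 µs with cache effects, 2MB of reserved stack); the function names are mine and nothing here is measured.

```python
THREADS = 10_000


def full_cycle_ms(threads: int, switch_us: float) -> float:
    """Overhead of switching through every thread once, in milliseconds."""
    return threads * switch_us / 1000


def reserved_stack_gb(threads: int, stack_mb: float) -> float:
    """Stack address space reserved across all threads, in decimal GB."""
    return threads * stack_mb / 1000


print(full_cycle_ms(THREADS, 1), full_cycle_ms(THREADS, 5))     # 10.0 50.0  (raw switch)
print(full_cycle_ms(THREADS, 10), full_cycle_ms(THREADS, 50))   # 100.0 500.0 (with cache misses)
print(reserved_stack_gb(THREADS, 2))                            # 20.0
```

So one full round-robin pass costs 10–50 ms of pure switching, or 100–500 ms once cache misses are included, before a single request has made progress.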
Model B: One Thread, Non-Blocking I/O, Event Loop (Node.js, async Python)
Instead of blocking a thread on each I/O operation, we tell the kernel: “let me know when any of these 10,000 sockets has data ready.” The kernel checks which sockets are ready and alerts the server → the server handles them → they either finish or go back to waiting on I/O → repeat.
Because I/O is the slow part, a single CPU thread can juggle thousands of connections as long as each handler does only a tiny amount of CPU work between I/O calls.
The kernel maintains the watched file descriptors as a persistent data structure (on Linux, this is what epoll provides). When an FD becomes ready, it gets put on a ready list and your event loop picks it up.
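Here is a minimal sketch of such an event loop using Python's standard selectors module, which picks epoll on Linux and other mechanisms elsewhere. It is a toy echo server rather than Node's or any framework's actual loop; the `run_echo_loop` name and the `max_events` cutoff are assumptions I added so that the loop terminates.

```python
import selectors
import socket


def run_echo_loop(srv: socket.socket, max_events: int) -> int:
    """Echo bytes back on a single thread until max_events readiness
    events have been handled (or one second passes with nothing ready)."""
    sel = selectors.DefaultSelector()          # epoll on Linux
    srv.setblocking(False)
    sel.register(srv, selectors.EVENT_READ)    # watch the listening socket
    handled = 0
    while handled < max_events:
        events = sel.select(timeout=1.0)       # "tell me which FDs are ready"
        if not events:
            break                              # idle: stop instead of spinning
        for key, _mask in events:
            sock = key.fileobj
            if sock is srv:
                conn, _addr = srv.accept()     # new connection is ready
                conn.setblocking(False)
                sel.register(conn, selectors.EVENT_READ)
            else:
                data = sock.recv(4096)         # will not block: FD is ready
                if data:
                    sock.sendall(data)         # fine for tiny payloads only
                else:
                    sel.unregister(sock)       # peer closed the connection
                    sock.close()
            handled += 1
    # Clean up every connection that is still registered.
    for key in list(sel.get_map().values()):
        sel.unregister(key.fileobj)
        if key.fileobj is not srv:
            key.fileobj.close()
    sel.close()
    return handled
```

One thread serves every connection here. The catch is visible in the code too: anything slow inside the for loop (heavy CPU work, a blocking call) stalls all other connections.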
Model C: Thread Pool with Completion-Based Async I/O (.NET, Java virtual threads, Go)
Many OS threads, but threads aren’t tied to specific connections. When a request awaits I/O, its thread is released back to the pool and picks up other work. When the I/O completes, any available thread resumes the continuation.
This gives you Node’s I/O efficiency and real CPU parallelism across cores.
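The "awaiting I/O releases the thread" idea can be sketched with Python's asyncio. One caveat up front: asyncio runs on a single thread, so this shows only the await/continuation half of Model C, not the multi-core thread pool that .NET or Go add on top. `fake_io_request` is a hypothetical stand-in for a database or network wait.

```python
import asyncio
import time


async def fake_io_request(delay_s: float) -> int:
    """Pretend to wait on a database or network call."""
    await asyncio.sleep(delay_s)   # the thread is free while this is pending
    return 1


async def serve_many(n: int, delay_s: float) -> tuple[int, float]:
    """Run n simulated requests concurrently; return (completed, seconds)."""
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_io_request(delay_s) for _ in range(n)))
    return sum(results), time.perf_counter() - start


if __name__ == "__main__":
    done, elapsed = asyncio.run(serve_many(1000, 0.05))
    print(f"{done} requests in {elapsed:.2f}s")
```

Handled one after another, 1000 waits of 50 ms each would take about 50 seconds. Because no thread is parked while a request waits, they overlap and finish in roughly the time of a single wait plus scheduling overhead.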
Layer 2: 100 → 1000 concurrent users, where the cracks first appear
Our PC is unlikely to break here, but there are certain things to consider.
Accept queue overflow
Recall the two kernel queues from earlier: the SYN queue (half-open connections during handshake) and the accept queue (established connections waiting for your app to call accept()). At this scale, the accept queue is the one that breaks first.
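The server is slow to call accept() because it’s busy serving existing requests. The accept queue fills. New clients experience connection failures, but your monitoring of existing requests looks fine. Clients see occasional connection timeouts or resets (on Linux the kernel by default silently drops the handshake’s final ACK when the queue is full, so the failure often looks like a slow or timed-out connect rather than an immediate “connection refused”). This is called accept queue overflow.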
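How big is the accept queue? On Linux, the backlog you pass to listen() is capped by the net.core.somaxconn setting, so asking for more does not help unless that cap is raised too. A small sketch of that rule follows; the fallback value of 128 is an assumption for systems where the /proc file does not exist.

```python
def read_somaxconn(default: int = 128) -> int:
    """Read the system-wide cap on Linux; fall back to an assumed default."""
    try:
        with open("/proc/sys/net/core/somaxconn") as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return default


def effective_backlog(requested: int, somaxconn: int) -> int:
    """The accept queue length Linux actually grants for listen(requested)."""
    return min(requested, somaxconn)


if __name__ == "__main__":
    cap = read_somaxconn()
    print("somaxconn:", cap)
    print("listen(1024) really gives:", effective_backlog(1024, cap))
```

If a framework asks for a backlog of 1024 but the cap is 128, the queue overflows after 128 pending connections, no matter what the app believes it configured.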
File descriptors
These are a Linux concept. Windows uses HANDLEs instead: same idea (kernel-tracked references to open resources), different implementation. We’ll focus on Linux since most servers run there.
When your process calls socket(), accept(), or open(), the kernel creates an entry in a per-process table called the file descriptor table. The “file descriptor” you get back (an integer like 5 or 247) is just an index into this table. The actual entry points to a kernel structure (struct file in Linux) which holds the socket buffers, the connection state, the read/write position, permissions, and so on.
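You can see that descriptors really are just small integers. This sketch opens a socket and a pipe and returns their numbers; the exact values depend on what the process already has open, so treat any specific numbers as illustrative. `open_descriptors` is a name I made up.

```python
import os
import socket


def open_descriptors() -> list[int]:
    """Open a socket and a pipe, return their descriptor numbers, clean up."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()            # a pipe uses two descriptors
    fds = [sock.fileno(), read_end, write_end]
    sock.close()
    os.close(read_end)
    os.close(write_end)
    return fds


if __name__ == "__main__":
    # In a fresh process 0, 1 and 2 are usually stdin, stdout and stderr,
    # so new descriptors typically start at 3.
    print(open_descriptors())
```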
The kernel doesn’t let a single process hold unlimited FDs.
The table is bounded per-process. Linux traditionally defaults to a soft limit of 1024 FDs per process, though modern distros often raise this for services. You can check with ulimit -n and raise it via ulimit or systemd unit files.
This isn’t because the kernel can’t handle more (it absolutely can); it’s a guardrail, and it exists for two reasons. The per-process limit isolates one buggy process from taking resources others need. A separate system-wide limit (fs.file-max) protects total kernel memory from being exhausted by all processes combined. A runaway process leaking sockets hits its per-process limit first, before it can hurt the rest of the machine.
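The same limits that ulimit -n shows can be read from inside a process. Below is a small sketch with Python's resource module (Unix only, so it will not import on Windows). The `raise_soft_limit` helper is my own naming; it only ever raises the soft limit, and never past the hard limit.

```python
import resource  # Unix-only standard library module


def fd_limits() -> tuple[int, int]:
    """Return (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target: int) -> int:
    """Raise the soft FD limit toward target, capped at the hard limit.
    Returns the soft limit in effect afterwards."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)             # cannot exceed the hard limit
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    print("soft, hard:", fd_limits())
```

An unprivileged process can move its soft limit up to the hard limit on its own. Going beyond the hard limit needs privileges or a config change such as the systemd unit setting mentioned above.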
Why does FD accounting matter?
A single incoming HTTP request in a typical app touches multiple FDs:
- 1 FD for the client socket (per request)
- 1 FD for the database connection, but borrowed from a pool, not opened per request. A pool of 20 connections means 20 FDs total for DB, regardless of how many requests are flowing through.
- 1 FD per downstream microservice call (also typically pooled via HttpClient/keep-alive)
- 1 FD per file you open to read/write
- 1 FD per Redis connection (also pooled)
Without pooling, 1000 concurrent requests × 5 backends = 5000 backend FDs, on top of the 1000 client sockets. With pooling, it’s more like 1000 client FDs + ~100 pooled backend FDs = 1100 total. Pooling is what keeps FD counts manageable at scale, and it is the reason real production apps don’t blow through their FD limits even under heavy load.
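The FD arithmetic can be written down as two tiny functions. The sketch counts the 1000 client sockets in both cases and assumes 5 backends with a pool of 20 connections each, which is where the ~100 pooled FDs come from. The function names are mine.

```python
def fds_without_pooling(concurrent: int, backends: int) -> int:
    """One client socket plus one fresh connection per backend, per request."""
    return concurrent + concurrent * backends


def fds_with_pooling(concurrent: int, backends: int, pool_size: int) -> int:
    """One client socket per request plus a fixed pool per backend."""
    return concurrent + backends * pool_size


print(fds_without_pooling(1000, 5))     # 6000 (1000 client + 5000 backend)
print(fds_with_pooling(1000, 5, 20))    # 1100 (1000 client + 100 pooled)
```

The unpooled number grows with every extra concurrent request, while the pooled backend cost stays flat at 100 no matter how much traffic flows through.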
This is where Part 1 ends. In Part 2 we will resume from right here and test our machine under various types and levels of load.
Thanks for reading.
Feel free to connect with me at rahuldsoni2001@gmail.com