
Notes on Go Scheduler

Table of Contents

Go Scheduler

  • Go uses goroutines, which are an abstraction over real OS threads
  • Goroutines act as application-level threads that run on top of OS threads
  • Context switching under a CPU-bound load is inefficient.
    • Context switching of OS threads takes about ~1000ns, i.e. 1 microsecond. As a ballpark, that costs roughly ~12k instructions. Here’s the arithmetic: at 3 GHz, one clock cycle takes about 0.33 nanoseconds (1 / 3 billion seconds), so a core completes 3 cycles per nanosecond. At one instruction per cycle that would be ~3 instructions per nanosecond, and since modern CPUs often execute more than one instruction per cycle, let’s assume 4 IPC for simplicity. That gives ~12 instructions per nanosecond, or ~12,000 instructions lost over a ~1000ns switch (see the back-of-the-envelope snippet after this list).
    • Context switching during IO-bound workloads is fine: the CPU is not executing instructions anyway, it is just waiting for an IO operation to complete. So we can afford to context switch during IO workloads.
  • In Go, we mostly write applications that perform IO-bound workloads.
  • The Go scheduler has a scheduling interval that is roughly 10ms.
  • Say a web server is serving 10,000 requests per second, with each request handled by one goroutine.
    • We know each goroutine gets roughly 10ms of potential execution time. But what if, one millisecond in, the request hits the database and starts waiting for IO?
    • The Go scheduler detects this and context switches to another goroutine, which leads to efficient use of the runtime (a minimal server along these lines is sketched after this list).
      • This context switch takes roughly 200ns, compared to ~1000ns for an OS thread context switch.
    • Essentially, we have turned IO-bound workloads into efficient CPU-bound workloads through context switches amongst goroutines.
  • For CPU-bound workloads, it’s efficient to run just one thread per CPU core. Why?
    1. Core utilization:
      • Each CPU core can execute one thread at a time.
      • One thread per core ensures full utilization of available processing power.
    2. Minimal context switching:
      • When there are more threads than cores, the OS must switch between threads.
      • Context switches incur overhead: saving and loading thread states.
      • With one thread per core, context switches are minimized.
    3. Cache efficiency:
      • Each core has its own cache (L1 and often L2).
      • One thread per core maximizes cache utilization and reduces cache thrashing.
      • Threads staying on the same core benefit from cached data and instructions.
    4. Reduced contention:
      • Fewer threads mean less competition for shared resources.
      • Reduces time spent on thread synchronization and locking.
    5. Predictable performance:
      • With one-to-one mapping of threads to cores, execution becomes more predictable.
      • Easier to reason about and optimize performance.
    6. Energy efficiency:
      • Avoiding excessive threading reduces unnecessary CPU activity.
      • Can lead to better power consumption, especially important in battery-powered devices.
    7. Scalability:
      • This approach scales well as the number of cores increases.
      • Adding more cores with a matched thread count provides near-linear performance improvement for parallelizable workloads. Hence, GOMAXPROCS is set to the number of available cores by default (see the snippet after this list).
  • One more thread is created for the Network Poller Queue. Goroutines blocked on async network IO are parked there and, once the IO is ready, moved back to their original run queue to resume execution.
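
Here is the back-of-the-envelope context-switch cost from above as a tiny Go program. The 3 GHz clock and 4 IPC figures are the same simplifying assumptions as in the notes, not measurements:

```go
package main

import "fmt"

func main() {
	const (
		switchCostNS = 1000.0 // assumed ~1µs OS thread context switch
		cyclesPerNS  = 3.0    // 3 GHz => 3 clock cycles per nanosecond
		ipc          = 4.0    // assumed instructions retired per cycle
	)
	// Instructions the core could have executed during the switch.
	fmt.Printf("~%.0f instructions lost per OS context switch\n",
		switchCostNS*cyclesPerNS*ipc)
}
```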
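
And a minimal sketch of the goroutine-per-request pattern described in the list. net/http already runs each handler in its own goroutine; the time.Sleep stands in for a hypothetical database call, during which the scheduler parks the goroutine and runs others:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func handler(w http.ResponseWriter, r *http.Request) {
	// Stand-in for a database query: while this goroutine "waits",
	// the scheduler parks it and the OS thread runs other goroutines.
	time.Sleep(50 * time.Millisecond)
	fmt.Fprintln(w, "done")
}

func main() {
	// net/http spawns one goroutine per incoming request, so 10,000
	// concurrent requests become 10,000 goroutines multiplexed onto
	// roughly one OS thread per core.
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}
```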
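
To see the default one-thread-per-core setup on a given machine, runtime.GOMAXPROCS can be called with 0, which reports the current value without changing it:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// By default GOMAXPROCS equals the number of available cores,
	// i.e. one OS thread actively executing Go code per core.
	fmt.Println("NumCPU:    ", runtime.NumCPU())
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}
```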

Reference

Go scheduling behaviour

The Go scheduler doesn’t assign a fixed “execution period” or time slice to goroutines in the same way traditional operating systems assign time slices to processes or threads. Instead, it uses a more dynamic and cooperative approach. However, we can discuss some relevant aspects of Go’s scheduling:

  1. No fixed time slice:
    • Go doesn’t use preemptive scheduling with fixed time slices.
    • Goroutines yield control cooperatively or when they block.
  2. Approximate scheduling interval:
    • While not a “period” given to goroutines, the scheduler does have a concept of scheduling intervals.
    • This is roughly set to 10ms (10,000,000 nanoseconds).
    • However, this doesn’t mean each goroutine runs for 10ms.
  3. Preemption points:
    • Go 1.14+ introduced asynchronous preemption.
    • The runtime may preempt a goroutine after it has been running for about 10ms.
    • This helps prevent long-running goroutines from monopolizing a P (processor).
  4. Yielding:
    • Goroutines typically yield control much more frequently than every 10ms.
    • They yield on channel operations, network I/O, system calls, etc.
  5. Work-stealing:
    • If a P (processor) runs out of work, it will try to steal goroutines from other Ps.
    • This can lead to goroutines being rescheduled very quickly.
  6. Fairness:
    • The scheduler attempts to be fair, but doesn’t guarantee equal CPU time to all goroutines.
    • It balances between throughput and fairness.

So while there’s no default “execution period” for goroutines, the scheduler aims to ensure all goroutines get a chance to run, with soft limits around 10ms of continuous execution. The actual running time of a goroutine before it yields or is preempted can vary greatly depending on its behavior and system load. The sketch below shows asynchronous preemption in action.
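
A minimal demonstration of asynchronous preemption, assuming Go 1.14 or later. With a single P, a tight loop containing no function calls or channel operations has no cooperative yield points, yet the runtime preempts it after roughly 10ms so main can run:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	runtime.GOMAXPROCS(1) // single P: only one goroutine runs at a time

	// A tight CPU-bound loop with no cooperative yield points.
	go func() {
		for {
		}
	}()

	// Before Go 1.14, main could be starved forever on a single P.
	// With asynchronous preemption, the runtime interrupts the loop
	// after ~10ms and main gets scheduled again.
	time.Sleep(100 * time.Millisecond)
	fmt.Println("main still runs: the busy loop was preempted")
}
```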

How does Go manage goroutines using run queues?

The Go scheduler, which is part of the Go runtime, uses a unique approach to manage goroutines (Go’s lightweight threads) using various queues. While it doesn’t directly map to traditional operating system run queues, it does employ queue-like structures for efficient scheduling.

  1. Local Run Queues (LRQ):
    • Each P (processor) has its own local run queue
    • Holds goroutines that are ready to run
    • Typically implemented as a circular queue
    • Used for quick access to runnable goroutines
  2. Global Run Queue (GRQ):
    • Single queue shared among all Ps
    • Holds goroutines that don’t have a specific P assigned
    • Used when local queues are empty or for load balancing
  3. Network Poller Queue:
    • Holds goroutines waiting for network I/O
    • Managed by a dedicated network poller thread
  4. Timer Queue:
    • Holds goroutines that are sleeping or waiting for a timer
    • Implemented as a heap for efficient timer management
  5. GC Work Queues:
    • Used during garbage collection
    • Hold tasks related to marking and sweeping objects

The Go scheduler uses these queues in the following ways (a simplified sketch of a local run queue follows this list):

  • When a new goroutine is created, it’s typically placed in the local run queue of the P that created it.
  • If the local queue is full, the goroutine is pushed to the global queue.
  • When a P’s local queue is empty, it tries to steal work from other Ps’ local queues or the global queue.
  • The scheduler periodically checks the global queue to ensure fairness.
  • Goroutines that block on I/O or timers are moved to the appropriate queues and requeued when ready.
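
As a rough illustration, here is a simplified sketch of a local run queue, loosely modeled on the per-P runq in the Go runtime, which is a fixed-size circular buffer of 256 entries. This version leaves out the lock-free access, the runnext slot, and work-stealing:

```go
package main

import "fmt"

// g stands in for the runtime's goroutine descriptor.
type g struct{ id int }

// runq is a simplified fixed-size circular local run queue.
type runq struct {
	head, tail int
	buf        [256]*g
}

// push enqueues a goroutine; it reports false when the queue is
// full, which is when the real runtime would spill work from the
// local queue to the global run queue.
func (q *runq) push(gp *g) bool {
	if q.tail-q.head == len(q.buf) {
		return false // full: caller moves work to the GRQ
	}
	q.buf[q.tail%len(q.buf)] = gp
	q.tail++
	return true
}

// pop dequeues the next runnable goroutine, or nil if empty
// (an empty queue is when a real P would try to steal work).
func (q *runq) pop() *g {
	if q.head == q.tail {
		return nil
	}
	gp := q.buf[q.head%len(q.buf)]
	q.head++
	return gp
}

func main() {
	var q runq
	for i := 1; i <= 3; i++ {
		q.push(&g{id: i})
	}
	for gp := q.pop(); gp != nil; gp = q.pop() {
		fmt.Println("run goroutine", gp.id)
	}
}
```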