Go Scheduler

When a Go program starts up, it's given a Logical Processor (P) for every virtual core on the host machine. If you have a processor with multiple threads per physical core (Hyper-Threading), each hardware thread will be represented as a virtual core.

=> an Intel Core i7 with 4 physical cores and Hyper-Threading = 8 virtual cores

Every P is assigned an OS Thread (M).
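A quick way to see these numbers on your own machine: runtime.NumCPU reports the virtual cores, and runtime.GOMAXPROCS(0) reports how many Ps the scheduler is using (by default they are equal).

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Number of virtual cores visible to the Go runtime.
	fmt.Println("virtual cores:", runtime.NumCPU())

	// Passing 0 only reads the current setting without changing it:
	// the number of Ps, which defaults to the number of virtual cores.
	fmt.Println("logical processors (P):", runtime.GOMAXPROCS(0))
}
```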

Every Go program is given an initial Goroutine (G) which is the path of execution for a Go program. Just as OS Threads (M) are context-switched on and off a core (P), Goroutines (G) are context-switched on and off an M.
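main runs on that initial Goroutine, and the go keyword creates more of them for the scheduler to multiplex onto Ms:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	// main itself is running on the initial Goroutine.
	var wg sync.WaitGroup

	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(id int) {
			// Each of these is its own Goroutine (G), context-switched
			// on and off an M by the scheduler.
			defer wg.Done()
			fmt.Println("goroutine", id)
		}(i)
	}

	wg.Wait()
}
```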

2 different run queues:

  • GRQ = Global Run Queue

  • LRQ = Local Run Queue

Each P is given a LRQ that manages the Goroutines assigned to be executed within the context of that P. These Goroutines (G) take turns being context-switched on and off the M assigned to that P.

The GRQ is for Goroutines that have not been assigned to a P yet.
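As a mental model only (these are illustrative names, not the real runtime types), the pieces can be sketched like this:

```go
// Sketch only: illustrative names, not the actual runtime structures.
package sketch

// G is a Goroutine.
type G struct{ id int }

// M is an OS thread.
type M struct{ osThreadID int }

// P is a logical processor with its own Local Run Queue.
type P struct {
	lrq []*G // Goroutines assigned to this P, run in turn on its M
	m   *M   // the OS thread currently servicing this P
}

// grq is the Global Run Queue: Goroutines not yet assigned to any P.
var grq []*G
```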

Goroutine States

Waiting: G is stopped and waiting for something in order to continue. This could be for reasons like waiting for the operating system (system calls) or for synchronization calls (atomic and mutex operations). These types of latencies are a root cause of bad performance.

Runnable: This means G wants time on an M so it can execute its assigned instructions. If a lot of Gs want time, they all have to wait longer to get it. Also, the individual amount of time any given G gets is shortened as more Gs compete for time. This type of scheduling latency can also be a cause of bad performance.

Executing: This means the Goroutine has been placed on an M and is executing its instructions.
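The small program below (illustrative only) puts all three states on display: main is executing, the new Goroutine is runnable until the scheduler first gives it time on an M, and it becomes waiting the moment it blocks on the mutex main holds.

```go
package main

import (
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex
	mu.Lock()

	go func() {
		// Starts out runnable; moves to waiting the moment it blocks
		// on the mutex that main is still holding.
		mu.Lock()
		mu.Unlock()
	}()

	// main is executing on its M while the other Goroutine waits.
	time.Sleep(10 * time.Millisecond)
	mu.Unlock()

	// Give the other Goroutine time to run before main returns.
	time.Sleep(10 * time.Millisecond)
}
```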

Asynchronous System Calls

When the OS you are running on has the ability to handle a system call asynchronously, something called the network poller can be used to process the system call more efficiently. This is accomplished by using kqueue (macOS), epoll (Linux), or IOCP (Windows) within these respective operating systems.

=> the scheduler can prevent Goroutines from blocking the M when those system calls are made. This helps to keep the M available to execute other Goroutines in the P’s LRQ without the need to create new Ms. This helps to reduce scheduling load on the OS.

  • Goroutine-1 is executing on the M and there are 3 more Goroutines waiting in the LRQ to get their time on the M. The network poller is idle with nothing to do.

  • Goroutine-1 wants to make a network system call, so Goroutine-1 is moved to the network poller and the asynchronous network system call is processed. Once Goroutine-1 is moved to the network poller, the M is now available to execute a different Goroutine from the LRQ. In this case, Goroutine-2 is context-switched on the M.

  • The asynchronous network system call is completed by the network poller and Goroutine-1 is moved back into the LRQ for the P. Once Goroutine-1 can be context-switched back on the M, the Go related code it’s responsible for can execute again. The big win here is that, to execute network system calls, no extra Ms are needed. The network poller has an OS Thread and it is handling an efficient event loop.
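Any ordinary network call in Go takes this path. In the example below (the URL is purely illustrative), the Goroutine parks on the network poller while it waits for the connection and the response, and its M stays free for other Goroutines:

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// While waiting on the network, this Goroutine is parked on the
	// network poller instead of blocking its M, so the M can keep
	// running other Goroutines from the P's LRQ.
	resp, err := http.Get("https://example.com") // URL is illustrative
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()

	fmt.Println("status:", resp.Status)
}
```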

Synchronous System Calls

When a Goroutine makes a system call that can’t be handled asynchronously, the network poller can’t be used, and the Goroutine making the system call is going to block the M.

  • Goroutine-1 is going to make a synchronous system call that will block M1.

  • The scheduler is able to identify that Goroutine-1 has caused the M to block. At this point, the scheduler detaches M1 from the P with the blocking Goroutine-1 still attached. Then the scheduler brings in a new M2 to service the P. At that point, Goroutine-2 can be selected from the LRQ and context-switched on M2. If an M already exists because of a previous swap, this transition is quicker than having to create a new M.

  • The blocking system call that was made by Goroutine-1 finishes. At this point, Goroutine-1 can move back into the LRQ and be serviced by the P again. M1 is then placed on the side for future use if this scenario needs to happen again.
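Regular file I/O is a common case the network poller does not cover. A read like the one below (the path is just an example) blocks the Goroutine and its M together, which triggers the handoff described above:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// A file read is a synchronous system call: the Goroutine and its M
	// block until the OS returns. The scheduler detaches that M from the
	// P and brings in (or reuses) another M so the P can keep running
	// the other Goroutines in its LRQ.
	data, err := os.ReadFile("/etc/hosts") // path is just an example
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Println("read", len(data), "bytes")
}
```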

Work Stealing

Work stealing helps to balance the Goroutines across all the P’s so the work is better distributed and gets done more efficiently.

  • We have a multi-threaded Go program with two P’s servicing four Goroutines each and a single Goroutine in the GRQ. What happens if one of the P’s services all of its Goroutines quickly?

  • Half the Goroutines are taken from P2 and now P1 can execute those Goroutines. What happens if P2 finishes servicing all of its Goroutines and P1 has nothing left in its LRQ?

  • P2 finished all its work and now needs to steal some. First, it will look at the LRQ of P1 but it won’t find any Goroutines. Next, it will look at the GRQ. There it will find Goroutine-9.

  • P2 steals Goroutine-9 from the GRQ and begins to execute the work.
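Reusing the sketch types from earlier, the stealing order can be outlined roughly like this (illustrative only; the real scheduler also checks the network poller, picks victims randomly, and more):

```go
// findWork sketches what a P does when its own LRQ is empty: steal half of
// another P's LRQ, then fall back to the GRQ. Illustrative only.
func findWork(self *P, others []*P) *G {
	// 1. Try to steal half of another P's Local Run Queue.
	for _, victim := range others {
		if n := len(victim.lrq); n > 0 {
			half := (n + 1) / 2
			stolen := victim.lrq[:half]
			victim.lrq = victim.lrq[half:]
			self.lrq = append(self.lrq, stolen[1:]...)
			return stolen[0] // run the first stolen G right away
		}
	}

	// 2. Fall back to the Global Run Queue.
	if len(grq) > 0 {
		g := grq[0]
		grq = grq[1:]
		return g
	}

	return nil // nothing to do; the P goes idle
}
```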
