Launching goroutines: A journey into the Go scheduler

Introduction

In this article, we go on a journey inside the Go scheduler. We explore it to see what makes it tick and how goroutines operate.1 It can be hard to understand the scheduler by jumping directly into the code and following logs and traces. It helps to build an abstract understanding first and add the details later. Here we try to follow that path.

We start with a clean slate and assume we know nothing about the scheduler.2 Then, we follow the execution of a series of programs. We start simple and — as we gain insight — move to more complex experiments. We’re on this journey together, as I, too, was learning while writing.

Launching a goroutine

Let’s launch a single goroutine that does a bit of CPU-intensive work, taking several milliseconds on our machine, which has four CPU cores. We compile and run the following code.

func run() {
	// do work for several milliseconds
}

func main() {
	go run()

	// wait for the goroutine to finish
}

Here we launch a new goroutine and then wait for it to finish.3
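If we want to actually run this sketch, we have to fill in the elided parts. Here is one possible version; the busy loop in run and the sync.WaitGroup wait are our own choices for illustration, not necessarily what the original experiment used. The same waiting pattern applies to the later examples as well.

package main

import "sync"

// run burns CPU for a few milliseconds; the amount of work is
// arbitrary and only stands in for "several milliseconds of work".
func run() {
	sum := 0
	for i := 0; i < 50_000_000; i++ {
		sum += i
	}
	_ = sum
}

func main() {
	var wg sync.WaitGroup
	wg.Add(1)

	go func() {
		defer wg.Done()
		run()
	}()

	// wait for the goroutine to finish
	wg.Wait()
}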

If we had to do it ourselves, what would be the simplest way we could implement running a goroutine? Start a thread. That’s simple. Let’s illustrate this with a picture.4

A single goroutine running as a thread.

Here $G_0$ represents the goroutine, and $M_0$, the thread. The Go scheduler documentation and code denote a thread with the letter $M$, for “machine”. The goroutine $G_0$ executes on the thread $M_0$, created specifically for that purpose. The goroutine seems slow to start because creating a thread takes much more time than we expect launching a goroutine to take. But that does not bother us since there’s only one goroutine to run. We could be happy with such an implementation for the moment. Everything works. The goroutine runs its course and finishes. We move on to the next example.

A thousand goroutines

Now, let’s initiate many — say, a thousand — goroutines with the following code.

func main() {
	for i := 0; i < 1000; i++ {
		go run()
	}

	// wait
}

How do we imagine this happening? We use the same approach as before. Our supposed scheduler creates a new thread for each goroutine, and we end up with something like this.

A thousand goroutines running as threads.

However, as we start the goroutines, we notice two problems. First, the goroutines start slowly. It wasn’t a problem when we launched a single goroutine, but it becomes noticeable as we start more of them. The goroutines should be much quicker to start than threads. It is part of their nature as “lightweight threads”.5

There’s also a second problem. We have created significantly more threads than CPU cores (we only have four). The threads overwhelm the operating system scheduler: a thousand of them competing for just four cores creates a lot of contention. Most of the time, the threads are waiting to be scheduled. On top of that, our CPU use is inefficient because of the frequent context switches.

We can solve the latter problem by limiting the number of threads that run simultaneously. So, instead of running all the goroutines at once, we run only a few at a time. Since we don’t launch all goroutines simultaneously, we need to store them while they wait. We put them into a first-in-first-out queue, which we call the global run queue. Then we take them out one by one and execute them. When one goroutine finishes, we start another one.6 This is no less efficient than the previous attempt since the total amount of CPU work is the same. However, now there’s much less context switching, and the operating system’s scheduling works much better.

What should be the limit for the number of simultaneous goroutines? Let’s run at most as many goroutines as there are CPU cores. This approach is efficient and plays well with the operating system.7
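On a real machine we can check both numbers: runtime.NumCPU reports the number of usable cores, and calling runtime.GOMAXPROCS with 0 queries the current limit without changing it; by default the two match.

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// The number of logical CPUs usable by the current process.
	fmt.Println("cores:", runtime.NumCPU())
	// GOMAXPROCS(0) only reports the current setting.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}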

The limitation we put in place leads us to the solution for the slow start problem. We could reuse already started threads when creating new goroutines. So, when we start the goroutines after the first four, we reuse the thread of the one that just finished. This way, we don’t have to create a thread slowly every time.

It turns out we can implement both solutions — limit the number of threads and reuse them — by introducing one remarkable concept — the processor. In fact, we will have as many of them as the limit we put on the number of running goroutines — four in our case. Each thread can run a goroutine if — and only if — it owns a processor. Each processor is like a license for a thread to run a goroutine. We could also think about processors as mutexes or handles. Let’s see how this idea works for our program.

When we start execution, instead of launching a thousand threads, we put the goroutines in the global run queue. We have four processors — four “licenses” to run goroutines. We create four threads as well.

Goroutines in the queue.

The $P_0$, $P_1$, and so on are the processors. They are currently not owned — or occupied — by any threads.

We have to mention that the threads are the only actors here. Why? A thread is the only component in our picture that executes instructions on a CPU. So, when we read the code of the scheduler runtime, we should keep in mind that it is the threads that run it. We can think of the threads as actors.8

When a thread runs a goroutine, it switches to its memory environment — which, among other things, includes its call stack — and executes the goroutine’s code. However, as soon as the goroutine finishes, the thread switches to its own memory context to perform runtime-level operations that, by the way, include the scheduling itself. There’s no scheduler per se as an independent actor — threads do all the scheduling. So, moving on, the action happens from the threads' point of view.

What do the threads do in this situation? Let’s consider the thread $M_0$. First, it tries to obtain a processor to get into action — remember, a thread cannot run goroutines without a processor. It acquires the free $P_0$. Now it has permission to run goroutines. So, $M_0$ checks the global run queue. There’s some work to be done there, so it pops $G_0$ from the head of the queue and starts running it. The other threads act the same, and after they’ve picked their processors and goroutines, we have the following picture.

Threads get the processors and start the goroutines.

When a goroutine finishes, the thread becomes idle but still holds its processor. Then, it checks the queue for new work, gets a fresh goroutine, and runs it.

The thread picked a new goroutine.

Here $M_0$ finished running $G_0$ and switched to $G_4$.9 The other threads follow the same script. We continue like this until all goroutines finish.

As we can see, we did not start a new thread whenever launching a goroutine. We also didn’t overload the operating system with too many threads. Everything seems fine now.
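To make our imaginary scheduler concrete, here is a toy version of what we have assumed so far: a single lock-protected FIFO of goroutine bodies and four workers standing in for the thread-and-processor pairs. The names and the structure are invented for illustration; this is a model of our mental picture, not of the Go runtime.

package main

import "sync"

// In the toy model, a "goroutine" is just a function that a worker
// runs to completion.
type toyG func()

// globalRunQueue is the lock-protected FIFO from the text.
type globalRunQueue struct {
	mu sync.Mutex
	gs []toyG
}

func (q *globalRunQueue) push(g toyG) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.gs = append(q.gs, g)
}

// pop removes the goroutine at the head of the queue; ok is false
// when the queue is empty.
func (q *globalRunQueue) pop() (g toyG, ok bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.gs) == 0 {
		return nil, false
	}
	g = q.gs[0]
	q.gs = q.gs[1:]
	return g, true
}

func main() {
	queue := &globalRunQueue{}
	for i := 0; i < 1000; i++ {
		queue.push(func() { /* several milliseconds of CPU work */ })
	}

	// Four workers play the role of the four thread+processor pairs:
	// each repeatedly pops a goroutine and runs it until no work is left.
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				g, ok := queue.pop()
				if !ok {
					return
				}
				g()
			}
		}()
	}
	wg.Wait()
}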

Goroutines of goroutines

We encounter a new problem when running the following code.

func runMany() {
	for i := 0; i < 1000; i++ {
		go run()
	}
}

func main() {
	for i := 0; i < 1000; i++ {
		go runMany()
	}

	// wait
}

Here every goroutine starts a thousand new ones. First, the initial four goroutines start running and spawn the sub-goroutines.

Launched sub-goroutines in queue.

The $G_0$ starts the goroutines from $G_0^0$ to $G_0^{999}$, the $G_1$ starts the ones from $G_1^0$ to $G_1^{999}$, and so on. In our system, as we see it so far, all these goroutines are on the global run queue.

The threads start picking up and executing the sub-goroutines after the first four finish, and we see something like this.10

Sub-goroutines executing.

However, we notice with disappointment that the system is slow again. Why? For safe concurrency, each thread must acquire the global run queue’s lock for every interaction with it. So, when pushing in and popping out goroutines, a thread must wait for the lock, obtain it, access the queue, and release the lock. Only a single thread can access the queue at a time. All others have to wait. The threads contend with each other, and the performance drops dramatically.11

What’s the solution? We need each thread to access the global queue less frequently. To achieve this, we introduce local run queues in each processor; the threads now try to access the local queue before the global one. So, when a goroutine — run by a thread — schedules another one, the new sub-goroutine doesn’t go to the global run queue but instead to the local one. When looking for work, the threads check the local queue first; they check — and lock — the global queue only when there’s nothing in the local one.
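In the toy model from before, this change could look roughly like the sketch below. Each worker gets its own local queue and touches the global one only when the local queue is empty. We give the local queue its own small lock so that, a bit later, other workers can steal from it safely; the real runtime instead keeps a fixed-size, mostly lock-free ring buffer in each P. All names are again ours.

// localQueue is a worker's own run queue. It is mostly touched by
// its owner, so its lock is rarely contended, unlike the global one.
type localQueue struct {
	mu sync.Mutex
	gs []toyG
}

// worker models one thread that owns a processor.
type worker struct {
	local  localQueue
	global *globalRunQueue
}

// spawn is what `go f()` does in the toy while this worker runs a
// goroutine: the new sub-goroutine goes to the local queue.
func (w *worker) spawn(g toyG) {
	w.local.mu.Lock()
	w.local.gs = append(w.local.gs, g)
	w.local.mu.Unlock()
}

// next picks the next goroutine to run: the local queue first, the
// global queue (and its contended lock) only if the local is empty.
func (w *worker) next() (toyG, bool) {
	w.local.mu.Lock()
	if len(w.local.gs) > 0 {
		g := w.local.gs[0]
		w.local.gs = w.local.gs[1:]
		w.local.mu.Unlock()
		return g, true
	}
	w.local.mu.Unlock()
	return w.global.pop()
}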

We re-run our code with this modification and the picture looks different.

Sub-goroutines in the local queues.

The sub-goroutines are in the local queues. The thread that just finished a goroutine and owns a processor first checks that processor’s local queue. It only checks the global run queue if the local one is empty. After the four initial goroutines finish, we have the following situation.

Sub-goroutines executing from the local queues.

After a while, when all the goroutines spawned by, for example, $G_1$ are finished, the $P_1$ local queue is empty, and $M_1$ picks $G_4$ from the global run queue, which fills the local queue with the goroutines from $G_4^0$ to $G_4^{999}$. The pattern repeats until all goroutines and sub-goroutines complete.

As we can see, after our modification, the contention for the global run queue drops significantly, and everything works fast. So we can turn to more exciting experiments.12

An uneven load

Here’s our next experiment.

func main() {
	go runMany()

	// wait
}

It’s easy to see a problem here. After we start the sub-goroutines — with our current idea of the scheduler — we face this picture.

A single long local queue filled in P0.

Only one thread is doing all the work, while the other three remain unused.

Only M0 is executing its local queue.

We need to balance the work among the threads, so we add another detail to our scheduling algorithm: work stealing.13 After a thread checks the local and global run queues, it turns to the other processors' run queues, and if there’s work there, it steals half of it. In our example, the $M_1$ checks the $P_0$ and — finding work there — takes half of it.14

M1 took half of the goroutines from P0.

$M_2$ checks $P_0$ first and takes the front half of its queue; $M_3$ does the same. Now, every thread is working.

All threads busy.

When threads run out of work, they get more from the others. It’s easy to see how it goes: every thread that holds a processor has work to do until we run out of goroutines.
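In the toy model, the stealing step could be sketched as follows: a worker whose local and global lookups both came up empty grabs the front half of another worker’s local queue. The real stealing code (the runqgrab function mentioned in the footnotes) works on the per-P ring buffers and chooses its victims more carefully, so treat this only as an illustration of the idea.

// stealHalf moves half of the victim's local queue to our own.
// It is called when both our local queue and the global queue are
// empty. It returns false if the victim had nothing to steal.
func (w *worker) stealHalf(victim *worker) bool {
	victim.local.mu.Lock()
	n := len(victim.local.gs) / 2
	if n == 0 {
		victim.local.mu.Unlock()
		return false
	}
	stolen := append([]toyG(nil), victim.local.gs[:n]...)
	victim.local.gs = victim.local.gs[n:]
	victim.local.mu.Unlock()

	w.local.mu.Lock()
	w.local.gs = append(w.local.gs, stolen...)
	w.local.mu.Unlock()
	return true
}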

Another idea that comes to mind when faced with the problem of load balancing is to have a single central coordinator. However, it doesn’t look appealing once we think about it for a while: the coordinator would become a single central lock, and we already know such a system would be hard to scale. That’s why the scheduler designers went with the distributed approach.15

Thus, we resolved the problem of a possible imbalance between the processors.

Blocked goroutines

Another thing that happens to goroutines — and, consequently, to threads — is blocking, for example, when reading from a channel. When a goroutine is blocked, its execution halts until another goroutine unblocks it. In the case of a channel read, we need another goroutine to send something to the channel. Here’s a particularly thorny example of blocking.

func put(ch chan<- bool) {
	for i := 0; i < 1000; i++ {
		run()
	}

	ch <- true
	ch <- true
	ch <- true
}

func take(ch <-chan bool) {
	<-ch
	run()
}

func main() {
	ch := make(chan bool, 3000)

	for i := 0; i < 1000; i++ {
		go put(ch)
		go take(ch)
		go take(ch)
		go take(ch)
	}

	// wait
}

Here’s what the scheduling system looks like after the first iteration of the loop — and, actually, most of the time.16

Three blocked processors, and a single one running.

Only $M_0$ is running; the other three threads, together with their goroutines, wait for $G^{put}_0$ to send data to the channel (the grey fill in the picture reflects this). The $G_0$, $G_1$, and $G_2$ are blocked on the channel read. Right before the put goroutine $G^{put}_0$ ends, they run for a short while, and we return to the same situation. The average CPU utilization is $25.1\%$.17

Clearly, the threads shouldn’t stay occupied with the waiting goroutines: the goroutines can’t do any useful work, yet they keep holding the resources. So, we put the blocked goroutines aside and run the next ones.

However, two questions arise. First, where do the waiting goroutines go? It depends on what blocked them. In our case, the goroutines are put in the receive queue of the channel. Second, what will unblock the goroutines? Since they don’t run, how do we know when they are ready to run again? They will be unblocked by the goroutines sending on the channel, or, more specifically, by the threads running those goroutines.
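We can sketch this in the toy model too. Since a toy goroutine is just a function, we represent a parked receiver by its continuation: the rest of its work, stored on the channel’s receive queue. A sender either buffers the value or wakes one parked receiver by putting that continuation on its own worker’s local queue, just as described above. Again, the names and the mechanism are invented; the real channel implementation lives in the runtime and looks quite different.

// toyChan models only what we need here: a buffer of values and a
// queue of receivers parked because the buffer was empty.
type toyChan struct {
	mu    sync.Mutex
	buf   []bool
	recvq []toyG // continuations of parked receivers
}

// recv consumes a buffered value, if any, and schedules rest (the
// remainder of the receiving goroutine) on the worker's local queue.
// Otherwise the receiver is parked on recvq, and the worker is free
// to run other goroutines in the meantime.
func (c *toyChan) recv(rest toyG, w *worker) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if len(c.buf) > 0 {
		c.buf = c.buf[1:]
		w.spawn(rest)
		return
	}
	c.recvq = append(c.recvq, rest)
}

// send wakes one parked receiver, if there is one, by moving its
// continuation to the sending worker's local queue; otherwise the
// value simply goes to the buffer.
func (c *toyChan) send(w *worker) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if len(c.recvq) > 0 {
		rest := c.recvq[0]
		c.recvq = c.recvq[1:]
		w.spawn(rest)
		return
	}
	c.buf = append(c.buf, true)
}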

With this change in place, when running the code, we first get into the same situation as above: three goroutines blocked with their threads and processors. But now the blocked goroutines are put aside immediately, and we see the following.

Most of the goroutines are parked; four take goroutines are running.

All the blocked take goroutines go on the receive queue of the channel where they remain waiting (thus the grey color).

When the $G_0^{put}$ is about to finish, it sends data on the channel, unblocking goroutines $G^{take}_1$, $G^{take}_2$, and $G^{take}_3$ and putting them on its local queue.

G0 is about to exit. Previously parked goroutines moved to the local queue.

The freshly unblocked goroutines are picked up and executed by $M_0$.

The pattern continues, resulting — in theory — in $100\%$ processor utilization, as all the threads are busy all the time.

System calls

Another phenomenon arises when we start using system calls (or syscalls for short) — the functions that transfer control to the operating system. System calls tend to be much slower than ordinary operations. In our example, we assume each syscall lasts for $100ms$ on average, which can happen in real life. Let’s turn to the code.

func runSys() {
	// a syscall that lasts for 100ms
}

func main() {
	for i := 0; i < 1000; i++ {
		go runSys()
	}

	// wait
}
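The body of runSys is left abstract in the text. One crude way to stand in for a slow, thread-blocking system call on a Unix-like machine is to wait for a short-lived child process; the wait keeps the thread stuck in a blocking wait call for roughly the duration of the child. This is purely our substitute for the experiment, and launching a thousand such children at once is heavy, so treat it only as a sketch.

package main

import (
	"os/exec"
	"sync"
)

// runSys stands in for a goroutine that spends on the order of
// 100ms blocked in a system call: waiting for the child process
// keeps the OS thread blocked until the child exits.
func runSys() {
	_ = exec.Command("sleep", "0.1").Run()
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			runSys()
		}()
	}

	// wait
	wg.Wait()
}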

We run the code, and the scheduling plays out the same way as before: the threads find the processors and start the goroutines from the run queue. However, after a short while, we notice that the execution seems to be stuck because the threads executing syscalls block all the processors, as we can see below.

Syscalls blocking execution.

Here all four threads are blocked in syscalls together with goroutines and processors until the syscalls finish. Then the threads start four more syscalls, and so on. There are four syscalls in progress at any given time. It’s easy to calculate the total execution time to be $25$ seconds.18 We can do much better.

On top of that, the syscalls do not use any user CPU time — and frequently, they don’t use much system CPU time either — yet they occupy the processors. The main purpose of a processor is to limit concurrent access to the CPU. Thus, holding a processor during a syscall doesn’t make sense. A thread must give the processor away. That’s what the scheduler does.

Let’s turn to the threads — the main actors — and follow the events from their point of view. As a thread’s goroutine enters a syscall, the thread gives up the processor but continues running the goroutine until the syscall finishes. It also makes sure there’s another thread to take its place.19 When the syscall finishes, the thread tries to get a processor to continue the goroutine execution. If no processor is available, the thread puts the goroutine in the global run queue and stays idle.

In this case, unlike with channel blocking, we cannot park the goroutine on its own, because there would be no mechanism to put it back in play; the goroutine keeps its thread for the duration of the syscall. So, after the first four syscalls have started and the threads running them have been put aside, the system looks like this.

Syscalls put aside.

The newly created threads enter syscalls as well; they, too, are put aside; new threads start, and so on. We continue executing syscalls and spawning threads in batches of four until all the syscalls have started. Provided none of them has finished yet, we get the following picture.

All syscalls in progress.

When the syscalls start finishing, the threads put the goroutines back into run queues, where the goroutines complete.

Now we have new artifacts: idle threads. They stay around and check if there are free processors. Getting a processor is the only way they can run goroutines. If a free processor appears, they pick it up and start looking for work. If not, they hang around forever.20
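We can sketch the hand-off in the toy model if we stop fusing the thread and the processor and make processors explicit, say, as tokens in a buffered channel: a worker sends a token to acquire a processor before running a goroutine and receives one back to release it. The function below, with invented names, shows what a thread would do around a blocking syscall in this model; the real runtime does this in entersyscall and exitsyscall.

// One token per processor in the toy model; workers acquire a
// token (procTokens <- struct{}{}) before running a goroutine and
// release it (<-procTokens) when the goroutine is done.
var procTokens = make(chan struct{}, 4)

// enterToySyscall is run by a worker whose goroutine is about to
// block in a syscall. syscallWork is the blocking call itself, and
// rest is the remainder of the goroutine after the syscall.
func enterToySyscall(syscallWork, rest toyG, globalQueue *globalRunQueue) {
	<-procTokens  // give the processor away before blocking
	syscallWork() // the thread stays blocked here, holding no processor

	// The syscall finished; try to get a processor back.
	select {
	case procTokens <- struct{}{}:
		rest()       // got one: continue the goroutine on this thread
		<-procTokens // release it when the goroutine finishes
	default:
		// No free processor: put the rest of the goroutine on the
		// global run queue; this thread becomes idle.
		globalQueue.push(rest)
	}
}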

With all this, we have a reasonably performant system. However, we have left out a key detail, which we illustrate next.

The forever goroutines

So far, we assumed our goroutines are short. But that’s not true in the real world. We can have goroutines that — on the scale of the program run — last forever.

func forever() {
	for {
		run()
	}
}

func main() {
	go forever()
	go forever()
	go forever()
	go forever()

	go forever()

	// wait
}

Here the first four forever goroutines take up all four processors, and the one following them never runs.

All processors busy with forever goroutines.

That’s not what we expect. Every goroutine should have a chance to run.

The scheduler solves this problem with goroutine preemption. We preempt — that is, stop — every running goroutine after a while if it hasn’t finished by then. After that, it is put at the tail of the global queue, and another goroutine takes its place.21

Waiting goroutine executing.

This way, every goroutine has a chance to run and has its share of the CPU.22 The fairness we just added is essential.
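The preemption described above is done for us by the runtime. For completeness, a goroutine can also give up the processor voluntarily with runtime.Gosched, which puts it back on a run queue and lets another goroutine run. The self-contained variant of the experiment below sprinkles it into the loop to make the yielding explicit, though with the preemptive scheduler it is rarely needed.

package main

import "runtime"

func run() { /* some CPU work */ }

func forever() {
	for {
		run()
		// Voluntarily yield: the current goroutine is placed back
		// on a run queue and another goroutine gets the processor.
		runtime.Gosched()
	}
}

func main() {
	for i := 0; i < 5; i++ {
		go forever()
	}
	select {} // wait forever, as in the original experiment
}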

Outro

We visited the most notable landmarks on our trip through the Go scheduler. There’s much more to explore than described here, of course. We have idealized and simplified the actual algorithms to get a clear picture. If we run our experiments for real, we will see much more complex and chaotic behavior in the logs and traces.

The runtime code is the ultimate source of truth. I hope I’ve provided a good starting point to explore it.

Acknowledgements

Thanks to Mikhail Dutikov, Eric Herman, Gural Vural, and Alexey Surikov for reviewing my writings and providing invaluable feedback!


  1. Specifically, we discuss go1.19.2.

  2. I assume, however, that the reader knows the fundamentals of Go and operating systems.

  3. Note that we must wait at the end of our program; otherwise, it exits before the goroutine finishes.

  4. For clarity, the explanations and illustrations only include the goroutines and threads we start during program execution.

  5. That is how the language designers sometimes refer to goroutines. See, for example, here.

  6. We consider goroutines that run for a long time later. For now, we only look at the short ones.

  7. That’s reflected in the default value of GOMAXPROCS, which is the number of cores. Please check the proposal by Russ Cox.

  8. This view also helps to read and understand the scheduler code if we’re inclined to do so in the future.

  9. Any running goroutine could have ended first, but we assume it’s $G_0$.

  10. In reality, the goroutines may end up queueing in a different order. Also, the first four goroutines may not finish at the same time. Here we assume they do for illustration.

  11. Our example system has a mild problem since there are only 4 processors. In reality, however, we can have up to 128 cores on a single server, making the problem catastrophic.

  12. There’s one more rather obvious optimization: instead of getting a single goroutine out of the global run queue, a thread can take a whole batch. It further reduces the contention for the global queue lock, and the runtime does exactly that. See the globrunqget function. In this article, we omit this feature for clarity. Also, in reality, the local run queues have a size limit. See runqput for details.

  13. See Robert Blumofe and Charles Leiserson, “Scheduling multithreaded computations by work stealing”.

  14. The implementation is more complex. The schedule and findRunnable functions contain details on the stealing mechanism. See also runqgrab.

  15. See the corresponding section in the scheduler design document for more details.

  16. The following illustration is idealized. The real picture is more complex, and the procedure is much less predictable. However, the basic principle remains the same.

  17. This can be calculated as $(1000 + 3)/(4\cdot1000) \approx 0.251$, provided all run() functions take the same time on average.

  18. That is, $(N_{goroutines}\cdot duration) / N_{processors} = (1000 \cdot 0.1) / 4 = 25$ seconds.

  19. See the startm function that is called by entersyscallblock.

  20. One may think it’s wasteful to keep multiple idle threads. It’s true and can be a problem in some systems. However, this remains an open issue at the time of writing.

  21. How preemption works is a big topic; see here.

  22. Preemption is also in play in all the previous examples, but we did not mention it until now to keep them clear and straightforward. All the illustrations are still accurate, even with preemption in place.

Published 15/11/2022