An Introduction to the Problems of Modern Linux Kernel CPU Scheduling

Multi-core, heterogeneous embedded devices have been available for some time, but we are still learning a lot about how to use them to their full potential. My colleague and I have been trying to understand how the kernel scheduler affects the responsiveness of the user interface, and how to maximize and stabilize the frame rate without consuming excessive energy. We want to make better use of that little battery so many people complain about!

This article will focus on how the CPU and the kernel interact from the user space point of view. Later, in another blog post, we will look at how to design libraries and applications to be as energy efficient as possible. There is still a lot that could be covered on other subsystems like the GPU or the network, but these are big topics that are beyond the scope of this article.

CPU Management in Today’s Linux Kernel

Currently, the Linux kernel has three components that deal with the CPU to determine where and how all tasks are run:
– scheduler
– cpuidle
– cpufreq

The scheduler is the component in charge of choosing which task to run next and which core it will run on. Depending on the task's history, it might keep the task on its current CPU or move it to another one. Once a task has been moved to a new CPU, it loses all information regarding its history. In other words, the scheduler only decides where immediately runnable tasks should run, by guessing what they might do in the future.
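To watch the scheduler's placement decisions from user space, a minimal Python sketch like the one below can help. It assumes a Linux system with procfs mounted; the helper names are mine and are not part of any kernel or EFL API. It reads the "processor" field of /proc/<pid>/stat (the CPU the task last ran on) and the set of CPUs the task is allowed to use.

```python
import os
from pathlib import Path


def parse_last_cpu(stat_line):
    """Return the CPU a task last ran on, given the content of /proc/<pid>/stat."""
    # The command name (field 2) may contain spaces or parentheses, so split
    # after the last ')' instead of splitting the whole line on whitespace.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state); "processor" is field 39, hence index 36.
    return int(fields[36])


def last_cpu(pid="self"):
    """Read the last-used CPU of a process from procfs (Linux only)."""
    return parse_last_cpu(Path(f"/proc/{pid}/stat").read_text())


def allowed_cpus(pid=0):
    """CPUs the scheduler may place the task on (pid 0 means this process)."""
    return sorted(os.sched_getaffinity(pid))
```

Calling last_cpu() repeatedly while a program alternates between sleeping and working is a cheap way to see how often a task gets migrated.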

Cpuidle is in charge of choosing how deeply a CPU sleeps. CPUs have multiple clock domains; when one is shut down, the device uses less energy, but it will also take more time before that part of the CPU can be used again. Shutting down clock domains is therefore a tradeoff between the energy saved and the time needed to turn them back on. Cpuidle uses system-wide measures to estimate the load of the system and predict what is likely to happen next. Based on this heuristic, it might power a CPU off completely, put a certain number of domains to sleep, or do nothing at all.
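The tradeoff described above is visible in sysfs, where each idle state exposes its exit latency and how often it has been used. The following is only a sketch: the function name and the sysfs_root parameter are my own (the parameter exists so the code can be pointed at a test directory), and some kernels or virtual machines do not expose a cpuidle directory at all, in which case the list is simply empty.

```python
from pathlib import Path


def read_idle_states(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """List the idle states of one CPU, shallowest first."""
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpuidle"
    # Sort numerically so that state10 comes after state2.
    state_dirs = sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:]))
    states = []
    for state_dir in state_dirs:
        states.append({
            "name": (state_dir / "name").read_text().strip(),
            # Time needed to leave the state, in microseconds.
            "exit_latency_us": int((state_dir / "latency").read_text()),
            # Number of times this CPU was asked to enter the state.
            "entries": int((state_dir / "usage").read_text()),
            # Total time spent in the state, in microseconds.
            "time_us": int((state_dir / "time").read_text()),
        })
    return states
```

Deeper states usually report a larger exit latency, which is exactly the wake-up cost that cpuidle has to weigh against the energy saved.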

Cpufreq is the last component that impacts power management and task scheduling. It manages the frequency and voltage of a specific CPU, depending on system-wide measures (e.g. load and IO). It uses various “governors” to apply different policies. Part of each governor is a heuristic that decides how to react to the system's needs, and it typically does so by scaling the frequency gradually instead of using bursts. This works well to limit the energy used for a task, but it also means it will take additional time to reach the optimum frequency for that task. This can cause serious negative effects on the user experience, as the first few frames after idle time will always have some lag.
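To see which governor is active and which frequency range it is working with, the cpufreq sysfs files can be read directly. This is a rough sketch rather than a complete tool: the function name and the sysfs_root parameter are assumptions of mine, and not every driver exposes every file (scaling_available_governors in particular can be missing).

```python
from pathlib import Path


def read_cpufreq_policy(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return the current cpufreq policy of one CPU (frequencies in kHz)."""
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpufreq"

    def read(name):
        return (base / name).read_text().strip()

    return {
        "governor": read("scaling_governor"),
        "available_governors": read("scaling_available_governors").split(),
        "current_khz": int(read("scaling_cur_freq")),
        "min_khz": int(read("scaling_min_freq")),
        "max_khz": int(read("scaling_max_freq")),
    }
```

Polling current_khz right after the system wakes from idle is a simple way to observe how long a governor takes to reach a higher frequency.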

Figure: Cpu-balancing.png

These components are all independent of each other and make their decisions entirely on their own. This can frequently lead to bad decisions; imagine a task that spends 4 ms on an IO bound operation, followed by 5 ms on a CPU bound operation, finishing with 5 ms of a memory bound operation. The kernel would make bad decisions during most of this hypothetical operation. It would start by setting the CPU frequency as if the task were still IO bound while it is actually CPU bound, it would then raise the frequency too much while the task performs the memory bound operation, and it wouldn’t drop it enough when the task goes back to the IO bound operation. This is an endless loop of wrong decisions, and it is essentially the profile of every application with an on-screen user interface!
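The scenario above can be played out in a few lines of Python. This is a toy model and not how any real governor is implemented: it assumes a purely reactive governor that always applies the frequency that would have suited the previous phase, and the "ideal" frequencies are numbers I invented for the illustration. Only the phase durations come from the example above.

```python
# Invented "ideal" frequencies per kind of work, in MHz (illustration only).
IDEAL_MHZ = {"io": 400, "cpu": 1800, "mem": 1000}

# (kind of work, duration in ms), as in the example above.
PHASES = [("io", 4), ("cpu", 5), ("mem", 5)]


def reactive_schedule(phases, ideal, cycles=2):
    """Return (kind, duration_ms, applied_mhz, ideal_mhz) rows for a governor
    that always applies the frequency that suited the previous phase."""
    rows = []
    previous_kind = phases[-1][0]  # assume the loop was already running
    for _ in range(cycles):
        for kind, duration in phases:
            rows.append((kind, duration, ideal[previous_kind], ideal[kind]))
            previous_kind = kind
    return rows


def verdict(applied_mhz, ideal_mhz):
    """Classify the applied frequency relative to what the phase needed."""
    if applied_mhz < ideal_mhz:
        return "too low"
    if applied_mhz > ideal_mhz:
        return "too high"
    return "ok"


def report(rows):
    """Print one line per phase showing how far off the governor was."""
    for kind, duration, applied, wanted in rows:
        print(f"{kind:>3} {duration} ms: {applied} MHz applied, "
              f"{wanted} MHz wanted ({verdict(applied, wanted)})")
```

Running report(reactive_schedule(PHASES, IDEAL_MHZ)) shows the frequency being wrong in every single phase under these assumptions: too high during IO, too low during the CPU bound work, and too high again during the memory bound work.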

Hacking Our Way to More Efficient CPU Utilization

Of course, most users of Android phones and other battery-powered Linux devices will have already experienced these issues, even if they don’t realize the root of the problem. Over time various vendors have implemented fixes, or more exactly hacks, to work around these problems.

One solution is to force the CPU frequency to use a burst for the first few frames before going back to a more conservative frequency. This can help relieve sluggish user experiences in some cases, but it is not power efficient, nor does it adapt to the task that’s running.

Some vendors have also started using user space daemons that monitor applications and override the cpufreq and cpuidle settings wholesale, depending on the configured profile. This would provide the best solution if not for user space daemons’ inability to keep up with the rapidly changing needs of a task. Even if it improves the visual user experience, it will likely come with a serious impact on the battery.

A similar solution is to have privileged applications trigger the CPU frequency change themselves. It’s usually possible to know when CPU intensive tasks will be launched, and CPU frequency changes can be triggered in anticipation of this. This solution can be effective, but it comes with a few problems.
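As a rough sketch of what such a privileged application could do, the context manager below raises the minimum frequency of one CPU to the policy maximum for the duration of a block and restores the previous value afterwards. The function name and the sysfs_root parameter are mine, writing to these files normally requires root, and this is only one possible way to request a boost, not the mechanism any particular vendor uses.

```python
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def min_freq_boost(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Temporarily pin the minimum frequency of one CPU to its policy maximum."""
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpufreq"
    min_file = base / "scaling_min_freq"
    previous = min_file.read_text().strip()
    ceiling = (base / "scaling_max_freq").read_text().strip()
    min_file.write_text(ceiling)
    try:
        yield int(ceiling)  # frequency in kHz now requested as the minimum
    finally:
        # Always restore the old floor, even if the boosted work fails.
        min_file.write_text(previous)
```

An application would wrap its known CPU intensive section in `with min_freq_boost():` and let the governor take over again as soon as the block exits.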

First, it’s necessary to trust the user space, something that is not always acceptable. This is particularly true in environments that run third party applications with little to no code review. Additionally, this solution is not scalable and requires an expert to find the sweet spot for each application; this won’t be doable for an entire OS any time soon. Only a handful of applications could benefit from this kind of approach, so it would be better to make a toolkit carry out this work instead.

It’s difficult to reproduce a benchmark that demonstrates how bad the scheduler is, because this depends largely on user interaction. For example, it’s hard to benchmark something like the device going to sleep suddenly before the user can stop it. Still, even a simple benchmark can demonstrate how cpufreq governors seriously affect power consumption and performance. For example, expedite is an EFL benchmark application that can be run under the various cpufreq governors to see how the results change. This allows anyone to experience first-hand how inefficient the interaction between the user space and the kernel is for this kind of task.
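A comparison like this can be automated with a small script. The sketch below is a hypothetical helper, not something shipped with expedite or EFL: it switches one CPU through every available governor, runs whatever benchmark command the caller passes in, and records the wall-clock time and output for each. It needs root to write the governor, it only touches a single CPU's policy, and the exact expedite command line is deliberately left to the caller.

```python
import subprocess
import time
from pathlib import Path


def benchmark_governors(command, cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Run `command` once per available governor and collect the results."""
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpufreq"
    governor_file = base / "scaling_governor"
    original = governor_file.read_text().strip()
    results = {}
    try:
        for governor in (base / "scaling_available_governors").read_text().split():
            governor_file.write_text(governor)
            start = time.perf_counter()
            run = subprocess.run(command, capture_output=True, text=True, check=True)
            results[governor] = {
                "seconds": time.perf_counter() - start,
                "output": run.stdout,
            }
    finally:
        # Put the system back the way we found it.
        governor_file.write_text(original)
    return results
```

Since a benchmark such as expedite reports its own scores, the captured output is likely to be more meaningful than the elapsed time, which is only recorded here as a sanity check.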

Figure: scheduling-impact.png

If you don’t want to bother running the benchmark yourself, this graphic shows that forcing the frequency to its maximum gives a better and more stable result than the more energy-efficient governor. The difference gets even more dramatic when tested on an ARM device that aggressively turns the CPU on and off.

At this point it should be clear that none of these solutions solves our problem and that we should be working on something better. In the next blog post, I’ll take a look at how the kernel community is fixing this issue and how toolkits should evolve with respect to this requirement.

Author: Cedric Bail

Cedric has been contributing to EFL for a long time. He is known as the borker due to his work on optimizing the core libraries and triggering side effect bugs which tend to take years to be discovered.

2 thoughts on “An Introduction to the Problems of Modern Linux Kernel CPU Scheduling”

  1. This is a great article. Power Savings is not just the job of the kernel. Get too aggressive and the sluggishness shows up, get too liberal and the CPU never sleeps.

    And then there is the userspace. Today, an ill behaving application can not just drain more power, but can also bring a Linux kernel to its knees, very easily.

    Close co-ordination in between the kernel and the userspace is required. I’m glad that you mentioned toolkits, that they need to be more “power aware”.

    Today, systemd is a step in the right direction for some of the problems Linux suffers. With more awareness, hopefully, things would improve in the Linux world too.
