Commit ef80068

Merge tag 'pm-5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
 "These are PM-runtime framework changes to use ktime instead of
  jiffies for accounting, new PM core flag to mark devices that don't
  need any form of power management, cpuidle updates including driver
  API documentation and a new governor, cpufreq updates including a new
  driver for Armada 8K, thermal cleanups and more, some energy-aware
  scheduling (EAS) enabling changes, new chips support in the intel_idle
  and RAPL drivers and assorted cleanups in some other places.

  Specifics:

   - Update the PM-runtime framework to use ktime instead of jiffies for
     accounting (Thara Gopinath, Vincent Guittot)

   - Optimize the autosuspend code in the PM-runtime framework somewhat
     (Ladislav Michl)

   - Add a PM core flag to mark devices that don't need any form of
     power management (Sudeep Holla)

   - Introduce driver API documentation for cpuidle and add a new
     cpuidle governor for tickless systems (Rafael Wysocki)

   - Add Jacobsville support to the intel_idle driver (Zhang Rui)

   - Clean up a cpuidle core header file and the cpuidle-dt and ACPI
     processor-idle drivers (Yangtao Li, Joseph Lo, Yazen Ghannam)

   - Add new cpufreq driver for Armada 8K (Gregory Clement)

   - Fix and clean up cpufreq core (Rafael Wysocki, Viresh Kumar, Amit
     Kucheria)

   - Add support for light-weight tear-down and bring-up of CPUs to the
     cpufreq core and use it in the cpufreq-dt driver (Viresh Kumar)

   - Fix cpu_cooling Kconfig dependencies, add support for CPU cooling
     auto-registration to the cpufreq core and use it in multiple
     cpufreq drivers (Amit Kucheria)

   - Fix some minor issues and do some cleanups in the davinci,
     e_powersaver, ap806, s5pv210, qcom and kryo cpufreq drivers
     (Bartosz Golaszewski, Gustavo Silva, Julia Lawall, Paweł Chmiel,
     Taniya Das, Viresh Kumar)

   - Add a Hisilicon CPPC quirk to the cppc_cpufreq driver (Xiongfeng
     Wang)

   - Clean up the intel_pstate and acpi-cpufreq drivers (Erwan Velu,
     Rafael Wysocki)

   - Clean up multiple cpufreq drivers (Yangtao Li)

   - Update cpufreq-related MAINTAINERS entries (Baruch Siach, Lukas
     Bulwahn)

   - Add support for exposing the Energy Model via debugfs and make
     multiple cpufreq drivers register an Energy Model to support
     energy-aware scheduling (Quentin Perret, Dietmar Eggemann, Matthias
     Kaehlcke)

   - Add Ice Lake mobile and Jacobsville support to the Intel RAPL
     power-capping driver (Gayatri Kammela, Zhang Rui)

   - Add a power estimation helper to the operating performance points
     (OPP) framework and clean up a core function in it (Quentin Perret,
     Viresh Kumar)

   - Make minor improvements in the generic power domains (genpd), OPP
     and system suspend frameworks and in the PM core (Aditya Pakki,
     Douglas Anderson, Greg Kroah-Hartman, Rafael Wysocki, Yangtao Li)"

* tag 'pm-5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (80 commits)
  cpufreq: kryo: Release OPP tables on module removal
  cpufreq: ap806: add missing of_node_put after of_device_is_available
  cpufreq: acpi-cpufreq: Report if CPU doesn't support boost technologies
  cpufreq: Pass updated policy to driver ->setpolicy() callback
  cpufreq: Fix two debug messages in cpufreq_set_policy()
  cpufreq: Reorder and simplify cpufreq_update_policy()
  cpufreq: Add kerneldoc comments for two core functions
  PM / core: Add support to skip power management in device/driver model
  cpufreq: intel_pstate: Rework iowait boosting to be less aggressive
  cpufreq: intel_pstate: Eliminate intel_pstate_get_base_pstate()
  cpufreq: intel_pstate: Avoid redundant initialization of local vars
  powercap/intel_rapl: add Ice Lake mobile
  ACPI / processor: Set P_LVL{2,3} idle state descriptions
  cpufreq / cppc: Work around for Hisilicon CPPC cpufreq
  ACPI / CPPC: Add a helper to get desired performance
  cpufreq: davinci: move configuration to include/linux/platform_data
  cpufreq: speedstep: convert BUG() to BUG_ON()
  cpufreq: powernv: fix missing check of return value in init_powernv_pstates()
  cpufreq: longhaul: remove unneeded semicolon
  cpufreq: pcc-cpufreq: remove unneeded semicolon
  ..
2 parents 8dcd175 + 1271d6d commit ef80068

69 files changed, +1921 -558 lines changed

Documentation/admin-guide/pm/cpuidle.rst

Lines changed: 96 additions & 8 deletions
@@ -155,14 +155,14 @@ governor uses that information depends on what algorithm is implemented by it
 and that is the primary reason for having more than one governor in the
 ``CPUIdle`` subsystem.
 
-There are two ``CPUIdle`` governors available, ``menu`` and ``ladder``. Which
-of them is used depends on the configuration of the kernel and in particular on
-whether or not the scheduler tick can be `stopped by the idle
-loop <idle-cpus-and-tick_>`_. It is possible to change the governor at run time
-if the ``cpuidle_sysfs_switch`` command line parameter has been passed to the
-kernel, but that is not safe in general, so it should not be done on production
-systems (that may change in the future, though). The name of the ``CPUIdle``
-governor currently used by the kernel can be read from the
+There are three ``CPUIdle`` governors available, ``menu``, `TEO <teo-gov_>`_
+and ``ladder``. Which of them is used by default depends on the configuration
+of the kernel and in particular on whether or not the scheduler tick can be
+`stopped by the idle loop <idle-cpus-and-tick_>`_. It is possible to change the
+governor at run time if the ``cpuidle_sysfs_switch`` command line parameter has
+been passed to the kernel, but that is not safe in general, so it should not be
+done on production systems (that may change in the future, though). The name of
+the ``CPUIdle`` governor currently used by the kernel can be read from the
 :file:`current_governor_ro` (or :file:`current_governor` if
 ``cpuidle_sysfs_switch`` is present in the kernel command line) file under
 :file:`/sys/devices/system/cpu/cpuidle/` in ``sysfs``.
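
Reviewer note: the hunk above names the two ``sysfs`` files that expose the active governor. A minimal sketch of reading them follows, assuming a Linux system where one of those files exists. The helper name ``read_current_governor`` and its ``base`` parameter are hypothetical, introduced only for illustration.

```python
from pathlib import Path

# Directory named in the documentation text above.
CPUIDLE_DIR = Path("/sys/devices/system/cpu/cpuidle")


def read_current_governor(base=CPUIDLE_DIR):
    """Return the active CPUIdle governor name, or None if neither file is readable.

    Hypothetical helper: it tries current_governor first (the file used when
    cpuidle_sysfs_switch is on the kernel command line), then the read-only
    current_governor_ro variant.
    """
    for name in ("current_governor", "current_governor_ro"):
        try:
            return (Path(base) / name).read_text().strip()
        except OSError:
            continue  # file absent or unreadable, try the other name
    return None


if __name__ == "__main__":
    print(read_current_governor())
```

On a machine without ``CPUIdle`` support the helper returns ``None`` rather than raising.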
@@ -256,6 +256,8 @@ the ``menu`` governor by default and if it is not tickless, the default
 ``CPUIdle`` governor on it will be ``ladder``.
 
 
+.. _menu-gov:
+
 The ``menu`` Governor
 =====================
 
@@ -333,6 +335,92 @@ that time, the governor may need to select a shallower state with a suitable
 target residency.
 
 
+.. _teo-gov:
+
+The Timer Events Oriented (TEO) Governor
+========================================
+
+The timer events oriented (TEO) governor is an alternative ``CPUIdle`` governor
+for tickless systems. It follows the same basic strategy as the ``menu`` `one
+<menu-gov_>`_: it always tries to find the deepest idle state suitable for the
+given conditions. However, it applies a different approach to that problem.
+
+First, it does not use sleep length correction factors, but instead it attempts
+to correlate the observed idle duration values with the available idle states
+and use that information to pick up the idle state that is most likely to
+"match" the upcoming CPU idle interval. Second, it does not take the tasks
+that were running on the given CPU in the past and are waiting on some I/O
+operations to complete now at all (there is no guarantee that they will run on
+the same CPU when they become runnable again) and the pattern detection code in
+it avoids taking timer wakeups into account. It also only uses idle duration
+values less than the current time till the closest timer (with the scheduler
+tick excluded) for that purpose.
+
+Like in the ``menu`` governor `case <menu-gov_>`_, the first step is to obtain
+the *sleep length*, which is the time until the closest timer event with the
+assumption that the scheduler tick will be stopped (that also is the upper bound
+on the time until the next CPU wakeup). That value is then used to preselect an
+idle state on the basis of three metrics maintained for each idle state provided
+by the ``CPUIdle`` driver: ``hits``, ``misses`` and ``early_hits``.
+
+The ``hits`` and ``misses`` metrics measure the likelihood that a given idle
+state will "match" the observed (post-wakeup) idle duration if it "matches" the
+sleep length. They both are subject to decay (after a CPU wakeup) every time
+the target residency of the idle state corresponding to them is less than or
+equal to the sleep length and the target residency of the next idle state is
+greater than the sleep length (that is, when the idle state corresponding to
+them "matches" the sleep length). The ``hits`` metric is increased if the
+former condition is satisfied and the target residency of the given idle state
+is less than or equal to the observed idle duration and the target residency of
+the next idle state is greater than the observed idle duration at the same time
+(that is, it is increased when the given idle state "matches" both the sleep
+length and the observed idle duration). In turn, the ``misses`` metric is
+increased when the given idle state "matches" the sleep length only and the
+observed idle duration is too short for its target residency.
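
Reviewer note: the ``hits``/``misses`` bookkeeping described above can be sketched as follows. This is an illustration of the documented rules, not the governor's code. It assumes idle states are listed from shallowest to deepest with non-decreasing target residencies. ``PULSE``, ``DECAY_SHIFT`` and both function names are hypothetical choices made for the sketch.

```python
PULSE = 1024      # illustrative increment, an assumption of this sketch
DECAY_SHIFT = 3   # illustrative decay rate (lose 1/8 per update), also an assumption


def matching_state(target_residencies, duration):
    """Index of the deepest state whose target residency is <= duration, or -1."""
    idx = -1
    for i, residency in enumerate(target_residencies):
        if residency <= duration:
            idx = i
    return idx


def update_hits_misses(metrics, target_residencies, sleep_length, observed):
    """Update the metrics of the state that "matches" the sleep length."""
    # The sleep length is the upper bound on the time until the next wakeup.
    observed = min(observed, sleep_length)
    idx_timer = matching_state(target_residencies, sleep_length)
    if idx_timer < 0:
        return
    idx_hit = matching_state(target_residencies, observed)
    entry = metrics[idx_timer]
    # Both metrics decay whenever their state "matches" the sleep length.
    entry["hits"] -= entry["hits"] >> DECAY_SHIFT
    entry["misses"] -= entry["misses"] >> DECAY_SHIFT
    if idx_hit == idx_timer:
        entry["hits"] += PULSE      # matched sleep length and observed duration
    else:
        entry["misses"] += PULSE    # observed duration was too short


if __name__ == "__main__":
    residencies = [0, 100, 1000]
    metrics = [{"hits": 0, "misses": 0} for _ in residencies]
    update_hits_misses(metrics, residencies, sleep_length=500, observed=300)
    update_hits_misses(metrics, residencies, sleep_length=500, observed=50)
    print(metrics[1])
```

Only the state that "matches" the sleep length is touched on each wakeup, which is why the other entries keep their previous values.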
+
+The ``early_hits`` metric measures the likelihood that a given idle state will
+"match" the observed (post-wakeup) idle duration if it does not "match" the
+sleep length. It is subject to decay on every CPU wakeup and it is increased
+when the idle state corresponding to it "matches" the observed (post-wakeup)
+idle duration and the target residency of the next idle state is less than or
+equal to the sleep length (i.e. the idle state "matching" the sleep length is
+deeper than the given one).
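
Reviewer note: a companion sketch for the ``early_hits`` rule above, under the same assumptions (states ordered shallowest to deepest, non-decreasing target residencies). The constants and the function name are hypothetical and chosen only for illustration.

```python
PULSE = 1024      # illustrative increment, an assumption of this sketch
DECAY_SHIFT = 3   # illustrative decay rate, also an assumption


def update_early_hits(early_hits, target_residencies, sleep_length, observed):
    """Decay every early_hits value, then credit the state matching the observed duration."""
    # early_hits decays on every CPU wakeup, for every state.
    for i in range(len(early_hits)):
        early_hits[i] -= early_hits[i] >> DECAY_SHIFT

    idx_timer = -1  # state "matching" the sleep length
    idx_hit = -1    # state "matching" the observed idle duration
    for i, residency in enumerate(target_residencies):
        if residency <= sleep_length:
            idx_timer = i
            if residency <= observed:
                idx_hit = i

    # Credit only when the state matching the sleep length is deeper.
    if 0 <= idx_hit < idx_timer:
        early_hits[idx_hit] += PULSE


if __name__ == "__main__":
    residencies = [0, 100, 1000]
    early = [0, 0, 0]
    update_early_hits(early, residencies, sleep_length=2000, observed=150)
    print(early)
```

A wakeup whose observed duration matches the same state as the sleep length leaves ``early_hits`` with decay only.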
+
+The governor walks the list of idle states provided by the ``CPUIdle`` driver
+and finds the last (deepest) one with the target residency less than or equal
+to the sleep length. Then, the ``hits`` and ``misses`` metrics of that idle
+state are compared with each other and it is preselected if the ``hits`` one is
+greater (which means that that idle state is likely to "match" the observed idle
+duration after CPU wakeup). If the ``misses`` one is greater, the governor
+preselects the shallower idle state with the maximum ``early_hits`` metric
+(or if there are multiple shallower idle states with equal ``early_hits``
+metric which also is the maximum, the shallowest of them will be preselected).
+[If there is a wakeup latency constraint coming from the `PM QoS framework
+<cpu-pm-qos_>`_ which is hit before reaching the deepest idle state with the
+target residency within the sleep length, the deepest idle state with the exit
+latency within the constraint is preselected without consulting the ``hits``,
+``misses`` and ``early_hits`` metrics.]
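
Reviewer note: the preselection walk described above might look roughly like this. The sketch assumes states ordered shallowest to deepest, each represented by a plain dict. The function name, the dict keys and the fallback to state 0 are hypothetical. The text does not say what happens when ``hits`` equals ``misses``, so the sketch treats that tie like the ``misses`` case, which is an assumption.

```python
def preselect_state(states, sleep_length, latency_limit=None):
    """Return the index of the preselected idle state.

    Each entry of `states` is a dict with the keys target_residency,
    exit_latency, hits, misses and early_hits (names chosen for this sketch).
    """
    candidate = -1
    for i, state in enumerate(states):
        if state["target_residency"] > sleep_length:
            break
        if latency_limit is not None and state["exit_latency"] > latency_limit:
            # Latency constraint hit first: take the deepest state within it,
            # without consulting the metrics (state 0 is the assumed fallback).
            return max(candidate, 0)
        candidate = i

    if candidate <= 0:
        return 0  # no shallower alternative exists
    if states[candidate]["hits"] > states[candidate]["misses"]:
        return candidate

    # Otherwise pick the shallower state with the maximum early_hits;
    # the strict comparison keeps the shallowest one on ties.
    best = 0
    for i in range(1, candidate):
        if states[i]["early_hits"] > states[best]["early_hits"]:
            best = i
    return best
```

With ``latency_limit`` left at ``None`` the walk depends only on the sleep length and the three metrics.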
+
+Next, the governor takes several idle duration values observed most recently
+into consideration and if at least a half of them are greater than or equal to
+the target residency of the preselected idle state, that idle state becomes the
+final candidate to ask for. Otherwise, the average of the most recent idle
+duration values below the target residency of the preselected idle state is
+computed and the governor walks the idle states shallower than the preselected
+one and finds the deepest of them with the target residency within that average.
+That idle state is then taken as the final candidate to ask for.
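
Reviewer note: a minimal sketch of the refinement step above, assuming target residencies are listed from shallowest to deepest and ``recent`` holds the most recently observed idle durations. The function name, the integer average and the fallback to state 0 are assumptions of the sketch. The text does not say how many recent values are kept.

```python
def refine_with_recent_intervals(target_residencies, preselected, recent):
    """Return the final candidate state index given recent idle durations."""
    threshold = target_residencies[preselected]
    short = [d for d in recent if d < threshold]
    long_count = len(recent) - len(short)

    # At least half of the recent values reach the target residency:
    # keep the preselected state.
    if not recent or 2 * long_count >= len(recent):
        return preselected

    # Otherwise average the values below the threshold and pick the deepest
    # shallower state whose target residency fits within that average.
    average = sum(short) // len(short)
    choice = 0  # assumed fallback when no shallower state fits
    for i in range(preselected):
        if target_residencies[i] <= average:
            choice = i
    return choice


if __name__ == "__main__":
    residencies = [0, 100, 1000]
    print(refine_with_recent_intervals(residencies, 2, [150, 200, 1200, 120]))
```

The second branch always has a non-empty ``short`` list, because it is reached only when more than half of the recent values fall below the threshold.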
+
+Still, at this point the governor may need to refine the idle state selection if
+it has not decided to `stop the scheduler tick <idle-cpus-and-tick_>`_. That
+generally happens if the target residency of the idle state selected so far is
+less than the tick period and the tick has not been stopped already (in a
+previous iteration of the idle loop). Then, like in the ``menu`` governor
+`case <menu-gov_>`_, the sleep length used in the previous computations may not
+reflect the real time until the closest timer event and if it really is greater
+than that time, a shallower state with a suitable target residency may need to
+be selected.
+
+
 .. _idle-states-representation:
 
 Representation of Idle States

Documentation/cpuidle/driver.txt

Lines changed: 0 additions & 37 deletions
This file was deleted.

Documentation/cpuidle/governor.txt

Lines changed: 0 additions & 28 deletions
This file was deleted.
