Commit 346658a5 authored by Linus Torvalds

Merge tag 'docs-5.18' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "It has been a moderately busy cycle for documentation; some of the
  highlights are:

   - Numerous PDF-generation improvements

   - Kees's new document with guidelines for researchers studying the
     development community.

   - The ongoing stream of Chinese translations

   - Thorsten's new document on regression handling

   - A major reworking of the internal documentation for the kernel-doc
     script.

  Plus the usual stream of typo fixes and such"

* tag 'docs-5.18' of git://git.lwn.net/linux: (80 commits)
  docs/kernel-parameters: update description of mem=
  docs/zh_CN: Add sched-nice-design Chinese translation
  docs: scheduler: Convert schedutil.txt to ReST
  Docs: ktap: add code-block type
  docs: serial: fix a reference file name in driver.rst
  docs: UML: Mention telnetd for port channel
  docs/zh_CN: add damon reclaim translation
  docs/zh_CN: add damon usage translation
  docs/zh_CN: add admin-guide damon start translation
  docs/zh_CN: add admin-guide damon index translation
  docs/zh_CN: Refactoring the admin-guide directory index
  zh_CN: Add translation for admin-guide/mm/index.rst
  zh_CN: Add translations for admin-guide/mm/ksm.rst
  Add Chinese translation for vm/ksm.rst
  docs/zh_CN: Add sched-stats Chinese translation
  docs/zh_CN: add devicetree of_unittest translation
  docs/zh_CN: add devicetree usage-model translation
  docs/zh_CN: add devicetree index translation
  Documentation: describe how to apply incremental stable patches
  docs/zh_CN: add peci subsystem translation
  ...
parents d2eb5500 75c05fab
......@@ -26,7 +26,7 @@ SPHINX_CONF = conf.py
PAPER =
BUILDDIR = $(obj)/output
PDFLATEX = xelatex
LATEXOPTS = -interaction=batchmode -no-shell-escape
ifeq ($(KBUILD_VERBOSE),0)
SPHINXOPTS += "-q"
......
......@@ -315,8 +315,8 @@ To use the feature, admin should set up backing device via::
echo /dev/sda5 > /sys/block/zramX/backing_dev
before disksize setting. It supports only partitions at this moment.
If admin wants to use incompressible page writeback, they could do it via::
echo huge > /sys/block/zramX/writeback
......@@ -341,9 +341,9 @@ Admin can request writeback of those idle pages at right timing via::
echo idle > /sys/block/zramX/writeback
With the command, zram will writeback idle pages from memory to the storage.
If an admin wants to write a specific page in zram device to the backing device,
they could write a page index into the interface.
echo "page_index=1251" > /sys/block/zramX/writeback
......@@ -354,7 +354,7 @@ to guarantee storage health for entire product life.
To overcome the concern, zram supports "writeback_limit" feature.
The "writeback_limit_enable"'s default value is 0 so that it doesn't limit
any writeback. IOW, if an admin wants to apply a writeback budget, they should
enable writeback_limit_enable via::
$ echo 1 > /sys/block/zramX/writeback_limit_enable
......@@ -365,7 +365,7 @@ until admin sets the budget via /sys/block/zramX/writeback_limit.
(If admin doesn't enable writeback_limit_enable, writeback_limit's value
assigned via /sys/block/zramX/writeback_limit is meaningless.)
If an admin wants to limit writeback to 400M per day, they could do it
like below::
$ MB_SHIFT=20
......@@ -375,16 +375,16 @@ like below::
$ echo 1 > /sys/block/zram0/writeback_limit_enable
If an admin wants to allow further writeback once the budget is exhausted,
they could do it like below::
$ echo $((400<<MB_SHIFT>>4K_SHIFT)) > \
/sys/block/zram0/writeback_limit
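The arithmetic above can be sanity-checked in plain shell. This is a hypothetical sketch (variable names are illustrative; it assumes, per the text above, that writeback_limit is counted in 4K-page units):

```shell
# Sketch: compute the writeback_limit value for a 400 MB budget.
# Assumption: the limit is measured in 4K pages, as described above.
MB_SHIFT=20         # 1 MB = 2^20 bytes
PAGE_4K_SHIFT=12    # one 4K page = 2^12 bytes
BUDGET=$((400 << MB_SHIFT >> PAGE_4K_SHIFT))
echo "$BUDGET"      # prints 102400 (400 MB expressed in 4K pages)
# On a real system this value would then be written to the sysfs knob:
# echo "$BUDGET" > /sys/block/zram0/writeback_limit
```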
If an admin wants to see the remaining writeback budget since last set::
$ cat /sys/block/zramX/writeback_limit
If an admin wants to disable writeback limit, they could do::
$ echo 0 > /sys/block/zramX/writeback_limit_enable
......@@ -393,7 +393,7 @@ system reboot, echo 1 > /sys/block/zramX/reset) so keeping how many of
writeback happened until you reset the zram to allocate extra writeback
budget in the next setting is the user's job.
If an admin wants to measure the writeback count in a certain period, they could
know it via /sys/block/zram0/bd_stat's 3rd column.
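For instance, the third column can be pulled out with awk. The sample line below is made up for illustration, and the column layout (count, reads, writes) is an assumption based on the text above:

```shell
# Hypothetical sample of /sys/block/zram0/bd_stat contents; the third
# column is assumed to hold the writeback count per the text above.
stat_line="  1250     52   4096"
writebacks=$(echo "$stat_line" | awk '{print $3}')
echo "writeback count: $writebacks"   # prints "writeback count: 4096"
```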
memory tracking
......
......@@ -35,6 +35,7 @@ problems and bugs in particular.
:maxdepth: 1
reporting-issues
reporting-regressions
security-bugs
bug-hunting
bug-bisect
......
......@@ -76,7 +76,7 @@ Field 3 -- # of sectors read (unsigned long)
Field 4 -- # of milliseconds spent reading (unsigned int)
This is the total number of milliseconds spent by all reads (as
measured from blk_mq_alloc_request() to __blk_mq_end_request()).
Field 5 -- # of writes completed (unsigned long)
This is the total number of writes completed successfully.
......@@ -89,7 +89,7 @@ Field 7 -- # of sectors written (unsigned long)
Field 8 -- # of milliseconds spent writing (unsigned int)
This is the total number of milliseconds spent by all writes (as
measured from blk_mq_alloc_request() to __blk_mq_end_request()).
Field 9 -- # of I/Os currently in progress (unsigned int)
The only field that should go to zero. Incremented as requests are
......@@ -120,7 +120,7 @@ Field 14 -- # of sectors discarded (unsigned long)
Field 15 -- # of milliseconds spent discarding (unsigned int)
This is the total number of milliseconds spent by all discards (as
measured from blk_mq_alloc_request() to __blk_mq_end_request()).
Field 16 -- # of flush requests completed
This is the total number of flush requests completed successfully.
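The field numbering above can be mapped to columns with a short awk sketch. Each /proc/diskstats line is prefixed by the major number, minor number, and device name, so Field N lands in column N+3. The sample line below is fabricated for illustration:

```shell
# Fabricated /proc/diskstats-style line: major minor name, then fields 1..17.
line="   8   0 sda 4500 300 120000 900 2100 400 80000 1500 0 600 2400 0 0 0 0 50 100"
ms_reading=$(echo "$line" | awk '{print $7}')    # Field 4: ms spent reading
ms_writing=$(echo "$line" | awk '{print $11}')   # Field 8: ms spent writing
echo "reading: ${ms_reading} ms, writing: ${ms_writing} ms"
# prints: reading: 900 ms, writing: 1500 ms
```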
......
......@@ -2827,6 +2827,9 @@
For details see: Documentation/admin-guide/hw-vuln/mds.rst
mem=nn[KMG] [KNL,BOOT] Force usage of a specific amount of memory
Amount of memory to be used in cases such as the following:
......@@ -2834,6 +2837,13 @@
2 when the kernel is not able to see the whole system memory;
3 memory that lies after 'mem=' boundary is excluded from
the hypervisor, then assigned to KVM guests.
4 to limit the memory available for kdump kernel.
[ARC,MICROBLAZE] - the limit applies only to low memory,
high memory is not affected.
[ARM64] - only limits memory covered by the linear
mapping. The NOMAP regions are not affected.
[X86] Work as limiting max address. Use together
with memmap= to avoid physical address space collisions.
......@@ -2844,6 +2854,14 @@
in above case 3, memory may need to be hot-added after boot
if the hypervisor's system memory is not sufficient.
mem=nn[KMG]@ss[KMG]
[ARM,MIPS] - override the memory layout reported by
firmware.
Define a memory region of size nn[KMG] starting at
ss[KMG].
Multiple different regions can be specified with
multiple mem= parameters on the command line.
mem=nopentium [BUGS=X86-32] Disable usage of 4MB pages for kernel
memory.
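As a hedged illustration of the forms above (the addresses and sizes here are invented, not taken from this document), a boot command line could look like:

```
# Limit total usable memory (e.g. to reserve room for a kdump kernel):
mem=512M

# ARM/MIPS firmware-override form: two regions, declared with repeated
# mem= parameters (size@start):
mem=512M@0x20000000 mem=256M@0x60000000
```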
......
......@@ -8,6 +8,7 @@ Performance monitor support
:maxdepth: 1
hisi-pmu
hisi-pcie-pmu
imx-ddr
qcom_l2_pmu
qcom_l3_pmu
......
.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0)
..
If you want to distribute this text under CC-BY-4.0 only, please use 'The
Linux kernel developers' for author attribution and link this as source:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/reporting-issues.rst
..
Note: Only the content of this RST file as found in the Linux kernel sources
is available under CC-BY-4.0, as versions of this text that were processed
(for example by the kernel's build system) might contain content taken from
files which use a more restrictive license.
.. See the bottom of this file for additional redistribution information.
Reporting issues
++++++++++++++++
......@@ -395,22 +386,16 @@ fixed as soon as possible, hence there are 'issues of high priority' that get
handled slightly differently in the reporting process. Three types of cases
qualify: regressions, security issues, and really severe problems.
You deal with a regression if some application or practical use case running
fine with one Linux kernel works worse or not at all with a newer version
compiled using a similar configuration. The document
Documentation/admin-guide/reporting-regressions.rst explains this in more
detail. It also provides a good deal of other information about regressions you
might want to be aware of; it for example explains how to add your issue to the
list of tracked regressions, to ensure it won't fall through the cracks.
What qualifies as security issue is left to your judgment. Consider reading
Documentation/admin-guide/security-bugs.rst before proceeding, as it
provides additional details on how to best handle security issues.
An issue is a 'really severe problem' when something totally unacceptably bad
......@@ -517,7 +502,7 @@ line starting with 'CPU:'. It should end with 'Not tainted' if the kernel was
not tainted when it noticed the problem; it was tainted if you see 'Tainted:'
followed by a few spaces and some letters.
If your kernel is tainted, study Documentation/admin-guide/tainted-kernels.rst
to find out why. Try to eliminate the reason. Often it's caused by one of these
three things:
......@@ -1043,7 +1028,7 @@ down the culprit, as maintainers often won't have the time or setup at hand to
reproduce it themselves.
To find the change there is a process called 'bisection' which the document
Documentation/admin-guide/bug-bisect.rst describes in detail. That process
will often require you to build about ten to twenty kernel images, trying to
reproduce the issue with each of them before building the next. Yes, that takes
some time, but don't worry, it works a lot quicker than most people assume.
......@@ -1073,10 +1058,11 @@ When dealing with regressions make sure the issue you face is really caused by
the kernel and not by something else, as outlined above already.
In the whole process keep in mind: an issue only qualifies as regression if the
older and the newer kernel got built with a similar configuration. This can be
achieved by using ``make olddefconfig``, as explained in more detail by
Documentation/admin-guide/reporting-regressions.rst; that document also
provides a good deal of other information about regressions you might want to be
aware of.
Write and send the report
......@@ -1283,7 +1269,7 @@ them when sending the report by mail. If you filed it in a bug tracker, forward
the report's text to these addresses; but on top of it put a small note where
you mention that you filed it with a link to the ticket.
See Documentation/admin-guide/security-bugs.rst for more information.
Duties after the report went out
......@@ -1571,7 +1557,7 @@ Once your report is out your might get asked to do a proper one, as it allows to
pinpoint the exact change that causes the issue (which then can easily get
reverted to fix the issue quickly). Hence consider doing a proper bisection
right away if time permits. See the section 'Special care for regressions' and
the document Documentation/admin-guide/bug-bisect.rst for details on how to
perform one. In case of a successful bisection add the author of the culprit to
the recipients; also CC everyone in the signed-off-by chain, which you find at
the end of its commit message.
......@@ -1594,7 +1580,7 @@ Some fixes are too complex
Even small and seemingly obvious code-changes sometimes introduce new and
totally unexpected problems. The maintainers of the stable and longterm kernels
are very aware of that and thus only apply changes to these kernels that are
within rules outlined in Documentation/process/stable-kernel-rules.rst.
Complex or risky changes for example do not qualify and thus only get applied
to mainline. Other fixes are easy to get backported to the newest stable and
......@@ -1756,10 +1742,23 @@ art will lay some groundwork to improve the situation over time.
end-of-content
..
This document is maintained by Thorsten Leemhuis <linux@leemhuis.info>. If
you spot a typo or small mistake, feel free to let him know directly and
he'll fix it. You are free to do the same in a mostly informal way if you
want to contribute changes to the text, but for copyright reasons please CC
linux-doc@vger.kernel.org and "sign-off" your contribution as
Documentation/process/submitting-patches.rst outlines in the section "Sign
your work - the Developer's Certificate of Origin".
..
This text is available under GPL-2.0+ or CC-BY-4.0, as stated at the top
of the file. If you want to distribute this text under CC-BY-4.0 only,
please use "The Linux kernel developers" for author attribution and link
this as source:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/reporting-issues.rst
..
Note: Only the content of this RST file as found in the Linux kernel sources
is available under CC-BY-4.0, as versions of this text that were processed
(for example by the kernel's build system) might contain content taken from
files which use a more restrictive license.
.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0)
.. [see the bottom of this file for redistribution information]
Reporting regressions
+++++++++++++++++++++
"*We don't cause regressions*" is the first rule of Linux kernel development;
Linux founder and lead developer Linus Torvalds established it himself and
ensures it's obeyed.
This document describes what the rule means for users and how the Linux kernel's
development model ensures that all reported regressions are addressed; aspects relevant
for kernel developers are left to Documentation/process/handling-regressions.rst.
The important bits (aka "TL;DR")
================================
#. It's a regression if something running fine with one Linux kernel works worse
or not at all with a newer version. Note, the newer kernel has to be compiled
using a similar configuration; the explanations below describe this and other
fine print in more detail.
#. Report your issue as outlined in Documentation/admin-guide/reporting-issues.rst,
it already covers all aspects important for regressions, which are repeated
below for convenience. Two of them are important: start your report's subject
with "[REGRESSION]" and CC or forward it to `the regression mailing list
<https://lore.kernel.org/regressions/>`_ (regressions@lists.linux.dev).
#. Optional, but recommended: when sending or forwarding your report, make the
Linux kernel regression tracking bot "regzbot" track the issue by specifying
when the regression started like this::
#regzbot introduced v5.13..v5.14-rc1
All the details on Linux kernel regressions relevant for users
==============================================================
The important basics
--------------------
What is a "regression" and what is the "no regressions rule"?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It's a regression if some application or practical use case running fine with
one Linux kernel works worse or not at all with a newer version compiled using a
similar configuration. The "no regressions rule" forbids this from happening; if
it happens by accident, developers that caused it are expected to quickly fix
the issue.
It thus is a regression when a WiFi driver from Linux 5.13 works fine, but with
5.14 doesn't work at all, works significantly slower, or misbehaves somehow.
It's also a regression if a perfectly working application suddenly shows erratic
behavior with a newer kernel version; such issues can be caused by changes in
procfs, sysfs, or one of the many other interfaces Linux provides to userland
software. But keep in mind, as mentioned earlier: 5.14 in this example needs to
be built from a configuration similar to the one from 5.13. This can be achieved
using ``make olddefconfig``, as explained in more detail below.
Note the "practical use case" in the first sentence of this section: developers
are, despite the "no regressions" rule, free to change any aspect of the kernel
and even APIs or ABIs to userland, as long as no existing application or use
case breaks.
Also be aware the "no regressions" rule covers only interfaces the kernel
provides to the userland. It thus does not apply to kernel-internal interfaces
like the module API, which some externally developed drivers use to hook into
the kernel.
How do I report a regression?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Just report the issue as outlined in
Documentation/admin-guide/reporting-issues.rst; it already describes the
important points. The following aspects outlined there are especially relevant
for regressions:
* When checking for existing reports to join, also search the `archives of the
Linux regressions mailing list <https://lore.kernel.org/regressions/>`_ and
`regzbot's web-interface <https://linux-regtracking.leemhuis.info/regzbot/>`_.
* Start your report's subject with "[REGRESSION]".
* In your report, clearly mention the last kernel version that worked fine and
the first broken one. Ideally try to find the exact change causing the
regression using a bisection, as explained below in more detail.
* Remember to let the Linux regressions mailing list
(regressions@lists.linux.dev) know about your report:
* If you report the regression by mail, CC the regressions list.
* If you report your regression to some bug tracker, forward the submitted
report by mail to the regressions list while CCing the maintainer and the
mailing list for the subsystem in question.
If it's a regression within a stable or longterm series (e.g.
v5.15.3..v5.15.5), remember to CC the `Linux stable mailing list
<https://lore.kernel.org/stable/>`_ (stable@vger.kernel.org).
In case you performed a successful bisection, add everyone the culprit's commit
message mentions in lines starting with "Signed-off-by:" to the CC list.
When CCing or forwarding your report to the list, consider directly telling the
aforementioned Linux kernel regression tracking bot about your report. To do
that, include a paragraph like this in your mail::
#regzbot introduced: v5.13..v5.14-rc1
Regzbot will then consider your mail a report for a regression introduced in the
specified version range. In the above case Linux v5.13 still worked fine and Linux
v5.14-rc1 was the first version where you encountered the issue. If you
performed a bisection to find the commit that caused the regression, specify the
culprit's commit-id instead::
#regzbot introduced: 1f2e3d4c5d
Placing such a "regzbot command" is in your interest, as it will ensure the
report won't fall through the cracks unnoticed. If you omit this, the Linux
kernel's regressions tracker will take care of telling regzbot about your
regression, as long as you send a copy to the regressions mailing lists. But the
regression tracker is just one human who sometimes has to rest or occasionally
might even enjoy some time away from computers (as crazy as that might sound).
Relying on this person thus will result in an unnecessary delay before the
regression becomes mentioned `on the list of tracked and unresolved Linux
kernel regressions <https://linux-regtracking.leemhuis.info/regzbot/>`_ and the
weekly regression reports sent by regzbot. Such delays can result in Linus
Torvalds being unaware of important regressions when deciding between "continue
development or call this finished and release the final?".
Are really all regressions fixed?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Nearly all of them are, as long as the change causing the regression (the
"culprit commit") is reliably identified. Some regressions can be fixed without
this, but often it's required.
Who needs to find the root cause of a regression?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Developers of the affected code area should try to locate the culprit on their
own. But for them that's often impossible to do with reasonable effort, as quite
a lot of issues only occur in a particular environment outside the developer's
reach -- for example, a specific hardware platform, firmware, Linux distro,
system's configuration, or application. That's why in the end it's often up to
the reporter to locate the culprit commit; sometimes users might even need to
run additional tests afterwards to pinpoint the exact root cause. Developers
should offer advice and reasonably help where they can, to make this process
relatively easy and achievable for typical users.
How can I find the culprit?
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Perform a bisection, as roughly outlined in
Documentation/admin-guide/reporting-issues.rst and described in more detail by
Documentation/admin-guide/bug-bisect.rst. It might sound like a lot of work, but
in many cases it finds the culprit relatively quickly. If it's hard or
time-consuming to reliably reproduce the issue, consider teaming up with other
affected users to narrow down the search range together.
Who can I ask for advice when it comes to regressions?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Send a mail to the regressions mailing list (regressions@lists.linux.dev) while
CCing the Linux kernel's regression tracker (regressions@leemhuis.info); if the
issue might better be dealt with in private, feel free to omit the list.
Additional details about regressions
------------------------------------
What is the goal of the "no regressions rule"?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Users should feel safe when updating kernel versions and not have to worry
something might break. It is thus in the kernel developers' interest to make
updating attractive: they don't want users to stay on stable or longterm Linux
series that are either abandoned or more than one and a half years old. That's
in everybody's interest, as `those series might have known bugs, security
issues, or other problematic aspects already fixed in later versions
<http://www.kroah.com/log/blog/2018/08/24/what-stable-kernel-should-i-use/>`_.
Additionally, the kernel developers want to make it simple and appealing for
users to test the latest pre-release or regular release. That's also in
everybody's interest, as it's a lot easier to track down and fix problems, if
they are reported shortly after being introduced.
Is the "no regressions" rule really adhered to in practice?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It's taken really seriously, as can be seen by many mailing list posts from
Linux creator and lead developer Linus Torvalds, some of which are quoted in
Documentation/process/handling-regressions.rst.
Exceptions to this rule are extremely rare; in the past developers almost always
turned out to be wrong when they assumed a particular situation was warranting
an exception.
Who ensures the "no regressions" rule is actually followed?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The subsystem maintainers should take care of that; they are watched and
supported by the tree maintainers -- e.g. Linus Torvalds for mainline and
Greg Kroah-Hartman et al. for various stable/longterm series.
All of them are helped by people trying to ensure no regression report falls
through the cracks. One of them is Thorsten Leemhuis, who's currently acting as
the Linux kernel's "regressions tracker"; to facilitate this work he relies on
regzbot, the Linux kernel regression tracking bot. That's why you want to bring
your report on the radar of these people by CCing or forwarding each report to
the regressions mailing list, ideally with a "regzbot command" in your mail to
get it tracked immediately.
How quickly are regressions normally fixed?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Developers should fix any reported regression as quickly as possible, to provide
affected users with a solution in a timely manner and prevent more users from
running into the issue; nevertheless developers need to take enough time and
care to ensure regression fixes do not cause additional damage.
The answer thus depends on various factors like the impact of a regression, its
age, or the Linux series in which it occurs. In the end though, most regressions
should be fixed within two weeks.
Is it a regression, if the issue can be avoided by updating some software?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Almost always: yes. If a developer tells you otherwise, ask the regression
tracker for advice as outlined above.
Is it a regression, if a newer kernel works slower or consumes more energy?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Yes, but the difference has to be significant. A five percent slow-down in a
micro-benchmark thus is unlikely to qualify as regression, unless it also
influences the results of a broad benchmark by more than one percent. If in
doubt, ask for advice.
Is it a regression, if an external kernel module breaks when updating Linux?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
No, as the "no regression" rule is about interfaces and services the Linux
kernel provides to the userland. It thus does not cover building or running
externally developed kernel modules, as they run in kernel-space and hook into
the kernel using internal interfaces that occasionally change.
How are regressions handled that are caused by security fixes?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In extremely rare situations security issues can't be fixed without causing
regressions; those fixes are let through, as they are the lesser evil in the end.
Luckily such a dilemma can almost always be avoided, as key developers for the
affected area and often Linus Torvalds himself try very hard to fix security
issues without causing regressions.
If you nevertheless face such a case, check the mailing list archives if people
tried their best to avoid the regression. If not, report it; if in doubt, ask
for advice as outlined above.
What happens if fixing a regression is impossible without causing another?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sadly these things happen, but luckily not very often; if they occur, expert
developers of the affected code area should look into the issue to find a fix
that avoids regressions or at least their impact. If you run into such a
situation, do what was outlined already for regressions caused by security
fixes: check earlier discussions if people already tried their best and ask for
advice if in doubt.
A quick note while at it: these situations could be avoided, if people would
regularly give mainline pre-releases (say v5.15-rc1 or -rc3) from each
development cycle a test run. This is best explained by imagining a change
integrated between Linux v5.14 and v5.15-rc1 which causes a regression, but at
the same time is a hard requirement for some other improvement applied for
5.15-rc1. All these changes often can simply be reverted and the regression thus
solved, if someone finds and reports it before 5.15 is released. A few days or
weeks later this solution can become impossible, as some software might have
started to rely on aspects introduced by one of the follow-up changes: reverting
all changes would then cause a regression for users of said software and thus is
out of the question.
Is it a regression, if some feature I relied on was removed months ago?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It is, but often it's hard to fix such regressions due to the aspects outlined
in the previous section. It hence needs to be dealt with on a case-by-case
basis. This is another reason why it's in everybody's interest to regularly test
mainline pre-releases.
Does the "no regression" rule apply if I seem to be the only affected person?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It does, but only for practical usage: the Linux developers want to be free to
remove support for hardware that nowadays can only be found in attics and museums.
Note, sometimes regressions can't be avoided to make progress -- and the latter
is needed to prevent Linux from stagnation. Hence, if only very few users seem
to be affected by a regression, it might for the greater good be in their and
everyone else's interest to let things pass. Especially if there is an
easy way to circumvent the regression somehow, for example by updating some
software or using a kernel parameter created just for this purpose.
Does the regression rule apply for code in the staging tree as well?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Not according to the `help text for the configuration option covering all
staging code <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/staging/Kconfig>`_,
which since its early days states::
Please note that these drivers are under heavy development, may or
may not work, and may contain userspace interfaces that most likely
will be changed in the near future.
The staging developers nevertheless often adhere to the "no regressions" rule,
but sometimes bend it to make progress. That's for example why some users had to
deal with (often negligible) regressions when a WiFi driver from the staging
tree was replaced by a totally different one written from scratch.
Why do later versions have to be "compiled with a similar configuration"?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Because the Linux kernel developers sometimes integrate changes known to cause
regressions, but make them optional and disable them in the kernel's default
configuration. This trick allows progress, as the "no regressions" rule
otherwise would lead to stagnation.
Consider for example a new security feature blocking access to some kernel
interfaces often abused by malware, which at the same time are required to run a
few rarely used applications. The outlined approach makes both camps happy:
people using these applications can leave the new security feature off, while
everyone else can enable it without running into trouble.
How to create a configuration similar to the one of an older kernel?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Start your machine with a known-good kernel and configure the newer Linux
version with ``make olddefconfig``. This makes the kernel's build scripts pick
up the configuration file (the ".config" file) of the running kernel as the
base for the new one you are about to compile; afterwards they set all new
configuration options to their default values, which should disable new
features that might cause regressions.
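In practice this can look like the following; treat it as a sketch, as the
location of the running kernel's configuration file varies between
distributions (some instead expose it as /proc/config.gz, which requires
CONFIG_IKCONFIG_PROC)::

  $ cd ~/linux/                            # sources of the newer Linux version
  $ cp /boot/config-$(uname -r) .config    # or: zcat /proc/config.gz > .config
  $ make olddefconfig                      # new options get their default values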
Can I report a regression I found with pre-compiled vanilla kernels?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You need to ensure the newer kernel was compiled with a configuration similar
to that of the older one (see above), as those who built them might have
enabled a known-to-be-incompatible feature in the newer kernel. If in doubt,
report the matter to the kernel's provider and ask for advice.
More about regression tracking with "regzbot"
---------------------------------------------
What is regression tracking and why should I care about it?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Rules like "no regressions" need someone to ensure they are followed, otherwise
they get broken, either accidentally or on purpose. History has shown this to be
true for Linux kernel development as well. That's why Thorsten Leemhuis, the
Linux kernel's regression tracker, and a few other people try to ensure all
regressions are fixed by keeping an eye on them until they are resolved. None of
them are paid for this, which is why the work is done on a best-effort basis.
Why and how are Linux kernel regressions tracked using a bot?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tracking regressions completely manually has proven to be quite hard due to the
distributed and loosely structured nature of the Linux kernel development
process. That's why the Linux kernel's regression tracker developed regzbot to
facilitate the work, with the long-term goal of automating regression tracking
as much as possible for everyone involved.
Regzbot works by watching for replies to reports of tracked regressions.
Additionally, it looks out for posted or committed patches referencing such
reports with "Link:" tags; replies to such patch postings are tracked as well.
Combined, this data provides good insights into the current state of the fixing
process.
How to see which regressions regzbot tracks currently?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Check out `regzbot's web-interface <https://linux-regtracking.leemhuis.info/regzbot/>`_.
What kind of issues are supposed to be tracked by regzbot?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The bot is meant to track regressions, hence please don't involve regzbot for
regular issues. But it's okay with the Linux kernel's regression tracker if you
involve regzbot to track severe issues, like reports about hangs, corrupted
data, or internal errors (Panic, Oops, BUG(), warning, ...).
How to change aspects of a tracked regression?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
By using a 'regzbot command' in a direct or indirect reply to the mail with the
report. The easiest way to do that: find the report in your "Sent" folder or the
mailing list archive and reply to it using your mailer's "Reply-all" function.
In that mail, use one of the following commands in a stand-alone paragraph (IOW:
use blank lines to separate one or multiple of these commands from the rest of
the mail's text).
* Update when the regression started to happen, for example after performing a
bisection::
#regzbot introduced: 1f2e3d4c5d
* Set or update the title::
#regzbot title: foo
* Monitor a discussion or bugzilla.kernel.org ticket where additional aspects
  of the issue or a fix are discussed::
#regzbot monitor: https://lore.kernel.org/r/30th.anniversary.repost@klaava.Helsinki.FI/
#regzbot monitor: https://bugzilla.kernel.org/show_bug.cgi?id=123456789
* Point to a place with further details of interest, like a mailing list post
  or a ticket in a bug tracker that is somewhat related, but about a different
  topic::
#regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=123456789
* Mark a regression as invalid::
#regzbot invalid: wasn't a regression, problem has always existed
Regzbot supports a few other commands primarily used by developers or people
tracking regressions. These, and more details about the regzbot commands
mentioned above, can be found in the `getting started guide
<https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md>`_ and
the `reference documentation <https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md>`_
for regzbot.
..
end-of-content
..
This text is available under GPL-2.0+ or CC-BY-4.0, as stated at the top
of the file. If you want to distribute this text under CC-BY-4.0 only,
please use "The Linux kernel developers" for author attribution and link
this as source:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/reporting-regressions.rst
..
Note: Only the content of this RST file as found in the Linux kernel sources
is available under CC-BY-4.0, as versions of this text that were processed
(for example by the kernel's build system) might contain content taken from
files which use a more restrictive license.
......@@ -11,7 +11,7 @@ Getting started quick
- Compile and install kernel and modules, reboot.
- You need the udftools package (pktsetup, mkudffs, cdrwtool).
Download from http://sourceforge.net/projects/linux-udf/
Download from https://github.com/pali/udftools
- Grab a new CD-RW disc and format it (assuming CD-RW is hdc, substitute
as appropriate)::
......@@ -102,7 +102,7 @@ Using the pktcdvd sysfs interface
Since Linux 2.6.20, the pktcdvd module has a sysfs interface
and can be controlled by it. For example the "pktcdvd" tool uses
this interface. (see http://tom.ist-im-web.de/download/pktcdvd )
this interface. (see http://tom.ist-im-web.de/linux/software/pktcdvd )
"pktcdvd" works similar to "pktsetup", e.g.::
......
......@@ -409,135 +409,25 @@ latex_elements = {
# Additional stuff for the LaTeX preamble.
'preamble': '''
% Prevent column squeezing of tabulary.
\\setlength{\\tymin}{20em}
% Use some font with UTF-8 support with XeLaTeX
\\usepackage{fontspec}
\\setsansfont{DejaVu Sans}
\\setromanfont{DejaVu Serif}
\\setmonofont{DejaVu Sans Mono}
% Adjust \\headheight for fancyhdr
\\addtolength{\\headheight}{1.6pt}
\\addtolength{\\topmargin}{-1.6pt}
''',
''',
}
# Translations have Asian (CJK) characters which are only displayed if
# xeCJK is used
latex_elements['preamble'] += '''
\\IfFontExistsTF{Noto Sans CJK SC}{
% This is needed for translations
\\usepackage{xeCJK}
\\IfFontExistsTF{Noto Serif CJK SC}{
\\setCJKmainfont{Noto Serif CJK SC}[AutoFakeSlant]
}{
\\setCJKmainfont{Noto Sans CJK SC}[AutoFakeSlant]
}
\\setCJKsansfont{Noto Sans CJK SC}[AutoFakeSlant]
\\setCJKmonofont{Noto Sans Mono CJK SC}[AutoFakeSlant]
% CJK Language-specific font choices
\\IfFontExistsTF{Noto Serif CJK SC}{
\\newCJKfontfamily[SCmain]\\scmain{Noto Serif CJK SC}[AutoFakeSlant]
\\newCJKfontfamily[SCserif]\\scserif{Noto Serif CJK SC}[AutoFakeSlant]
}{
\\newCJKfontfamily[SCmain]\\scmain{Noto Sans CJK SC}[AutoFakeSlant]
\\newCJKfontfamily[SCserif]\\scserif{Noto Sans CJK SC}[AutoFakeSlant]
}
\\newCJKfontfamily[SCsans]\\scsans{Noto Sans CJK SC}[AutoFakeSlant]
\\newCJKfontfamily[SCmono]\\scmono{Noto Sans Mono CJK SC}[AutoFakeSlant]
\\IfFontExistsTF{Noto Serif CJK TC}{
\\newCJKfontfamily[TCmain]\\tcmain{Noto Serif CJK TC}[AutoFakeSlant]
\\newCJKfontfamily[TCserif]\\tcserif{Noto Serif CJK TC}[AutoFakeSlant]
}{
\\newCJKfontfamily[TCmain]\\tcmain{Noto Sans CJK TC}[AutoFakeSlant]
\\newCJKfontfamily[TCserif]\\tcserif{Noto Sans CJK TC}[AutoFakeSlant]
}
\\newCJKfontfamily[TCsans]\\tcsans{Noto Sans CJK TC}[AutoFakeSlant]
\\newCJKfontfamily[TCmono]\\tcmono{Noto Sans Mono CJK TC}[AutoFakeSlant]
\\IfFontExistsTF{Noto Serif CJK KR}{
\\newCJKfontfamily[KRmain]\\krmain{Noto Serif CJK KR}[AutoFakeSlant]
\\newCJKfontfamily[KRserif]\\krserif{Noto Serif CJK KR}[AutoFakeSlant]
}{
\\newCJKfontfamily[KRmain]\\krmain{Noto Sans CJK KR}[AutoFakeSlant]
\\newCJKfontfamily[KRserif]\\krserif{Noto Sans CJK KR}[AutoFakeSlant]
}
\\newCJKfontfamily[KRsans]\\krsans{Noto Sans CJK KR}[AutoFakeSlant]
\\newCJKfontfamily[KRmono]\\krmono{Noto Sans Mono CJK KR}[AutoFakeSlant]
\\IfFontExistsTF{Noto Serif CJK JP}{
\\newCJKfontfamily[JPmain]\\jpmain{Noto Serif CJK JP}[AutoFakeSlant]
\\newCJKfontfamily[JPserif]\\jpserif{Noto Serif CJK JP}[AutoFakeSlant]
}{
\\newCJKfontfamily[JPmain]\\jpmain{Noto Sans CJK JP}[AutoFakeSlant]
\\newCJKfontfamily[JPserif]\\jpserif{Noto Sans CJK JP}[AutoFakeSlant]
}
\\newCJKfontfamily[JPsans]\\jpsans{Noto Sans CJK JP}[AutoFakeSlant]
\\newCJKfontfamily[JPmono]\\jpmono{Noto Sans Mono CJK JP}[AutoFakeSlant]
% Dummy commands for Sphinx < 2.3 (no 'extrapackages' support)
\\providecommand{\\onehalfspacing}{}
\\providecommand{\\singlespacing}{}
% Define custom macros to on/off CJK
\\newcommand{\\kerneldocCJKon}{\\makexeCJKactive\\onehalfspacing}
\\newcommand{\\kerneldocCJKoff}{\\makexeCJKinactive\\singlespacing}
\\newcommand{\\kerneldocBeginSC}{%
\\begingroup%
\\scmain%
}
\\newcommand{\\kerneldocEndSC}{\\endgroup}
\\newcommand{\\kerneldocBeginTC}{%
\\begingroup%
\\tcmain%
\\renewcommand{\\CJKrmdefault}{TCserif}%
\\renewcommand{\\CJKsfdefault}{TCsans}%
\\renewcommand{\\CJKttdefault}{TCmono}%
}
\\newcommand{\\kerneldocEndTC}{\\endgroup}
\\newcommand{\\kerneldocBeginKR}{%
\\begingroup%
\\xeCJKDeclareCharClass{HalfLeft}{`“,`‘}%
\\xeCJKDeclareCharClass{HalfRight}{`”,`’}%
\\krmain%
\\renewcommand{\\CJKrmdefault}{KRserif}%
\\renewcommand{\\CJKsfdefault}{KRsans}%
\\renewcommand{\\CJKttdefault}{KRmono}%
\\xeCJKsetup{CJKspace = true} % For inter-phrase space
}
\\newcommand{\\kerneldocEndKR}{\\endgroup}
\\newcommand{\\kerneldocBeginJP}{%
\\begingroup%
\\xeCJKDeclareCharClass{HalfLeft}{`“,`‘}%
\\xeCJKDeclareCharClass{HalfRight}{`”,`’}%
\\jpmain%
\\renewcommand{\\CJKrmdefault}{JPserif}%
\\renewcommand{\\CJKsfdefault}{JPsans}%
\\renewcommand{\\CJKttdefault}{JPmono}%
}
\\newcommand{\\kerneldocEndJP}{\\endgroup}
% Single spacing in literal blocks
\\fvset{baselinestretch=1}
% To customize \\sphinxtableofcontents
\\usepackage{etoolbox}
% Inactivate CJK after tableofcontents
\\apptocmd{\\sphinxtableofcontents}{\\kerneldocCJKoff}{}{}
}{ % No CJK font found
% Custom macros to on/off CJK (Dummy)
\\newcommand{\\kerneldocCJKon}{}
\\newcommand{\\kerneldocCJKoff}{}
\\newcommand{\\kerneldocBeginSC}{}
\\newcommand{\\kerneldocEndSC}{}
\\newcommand{\\kerneldocBeginTC}{}
\\newcommand{\\kerneldocEndTC}{}
\\newcommand{\\kerneldocBeginKR}{}
\\newcommand{\\kerneldocEndKR}{}
\\newcommand{\\kerneldocBeginJP}{}
\\newcommand{\\kerneldocEndJP}{}
}
'''
# Fix reference escape troubles with Sphinx 1.4.x
if major == 1:
latex_elements['preamble'] += '\\renewcommand*{\\DUrole}[2]{ #2 }\n'
# Load kerneldoc specific LaTeX settings
latex_elements['preamble'] += '''
% Load kerneldoc specific LaTeX settings
\\input{kerneldoc-preamble.sty}
'''
# With Sphinx 1.6, it is possible to change the Bg color directly
# by using:
# \definecolor{sphinxnoteBgColor}{RGB}{204,255,255}
......@@ -599,6 +489,11 @@ for fn in os.listdir('.'):
# If false, no module index is generated.
#latex_domain_indices = True
# Additional LaTeX stuff to be copied to build directory
latex_additional_files = [
'sphinx/kerneldoc-preamble.sty',
]
# -- Options for manual page output ---------------------------------------
......
Entry/exit handling for exceptions, interrupts, syscalls and KVM
================================================================
All transitions between execution domains require state updates which are
subject to strict ordering constraints. State updates are required for the
following:
* Lockdep
* RCU / Context tracking
* Preemption counter
* Tracing
* Time accounting
The update order depends on the transition type and is explained below in
the transition type sections: `Syscalls`_, `KVM`_, `Interrupts and regular
exceptions`_, `NMI and NMI-like exceptions`_.
Non-instrumentable code - noinstr
---------------------------------
Most instrumentation facilities depend on RCU, so instrumentation is prohibited
for entry code before RCU starts watching and exit code after RCU stops
watching. In addition, many architectures must save and restore register state,
which means that (for example) a breakpoint in the breakpoint entry code would
overwrite the debug registers of the initial breakpoint.
Such code must be marked with the 'noinstr' attribute, placing that code into a
special section inaccessible to instrumentation and debug facilities. Some
functions are partially instrumentable, which is handled by marking them
noinstr and using instrumentation_begin() and instrumentation_end() to flag the
instrumentable ranges of code:
.. code-block:: c
noinstr void entry(void)
{
handle_entry(); // <-- must be 'noinstr' or '__always_inline'
...
instrumentation_begin();
handle_context(); // <-- instrumentable code
instrumentation_end();
...
handle_exit(); // <-- must be 'noinstr' or '__always_inline'
}
This allows verification of the 'noinstr' restrictions via objtool on
supported architectures.
Invoking non-instrumentable functions from instrumentable context has no
restrictions and is useful to protect e.g. state switching which would
cause malfunction if instrumented.
All non-instrumentable entry/exit code sections before and after the RCU
state transitions must run with interrupts disabled.
Syscalls
--------
Syscall-entry code starts in assembly code and calls out into low-level C code
after establishing low-level architecture-specific state and stack frames. This
low-level C code must not be instrumented. A typical syscall handling function
invoked from low-level assembly code looks like this:
.. code-block:: c
noinstr void syscall(struct pt_regs *regs, int nr)
{
arch_syscall_enter(regs);
nr = syscall_enter_from_user_mode(regs, nr);
instrumentation_begin();
if (!invoke_syscall(regs, nr) && nr != -1)
result_reg(regs) = __sys_ni_syscall(regs);
instrumentation_end();
syscall_exit_to_user_mode(regs);
}
syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
establishes state in the following order:
* Lockdep
* RCU / Context tracking
* Tracing
and then invokes the various entry work functions like ptrace, seccomp, audit,
syscall tracing, etc. After all that is done, the instrumentable invoke_syscall
function can be invoked. The instrumentable code section then ends, after which
syscall_exit_to_user_mode() is invoked.
syscall_exit_to_user_mode() handles all work which needs to be done before
returning to user space like tracing, audit, signals, task work etc. After
that it invokes exit_to_user_mode() which again handles the state
transition in the reverse order:
* Tracing
* RCU / Context tracking
* Lockdep
syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
available as fine grained subfunctions in cases where the architecture code
has to do extra work between the various steps. In such cases it has to
ensure that enter_from_user_mode() is called first on entry and
exit_to_user_mode() is called last on exit.
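In such a case, the architecture-specific code could look roughly like the
sketch below; arch_extra_entry_work() and arch_extra_exit_work() are
hypothetical placeholders for whatever extra work the architecture has to do
between the state transitions:

.. code-block:: c

  noinstr void syscall(struct pt_regs *regs, int nr)
  {
          enter_from_user_mode(regs);     // <-- must be invoked first
          arch_extra_entry_work(regs);    // <-- hypothetical arch-specific step
          ...
          arch_extra_exit_work(regs);     // <-- hypothetical arch-specific step
          exit_to_user_mode(regs);        // <-- must be invoked last
  }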
Do not nest syscalls. Nested syscalls will cause RCU and/or context tracking
to print a warning.
KVM
---
Entering or exiting guest mode is very similar to syscalls. From the host
kernel's point of view, the CPU goes off into user space when entering the
guest and returns to the kernel on exit.
kvm_guest_enter_irqoff() is a KVM-specific variant of exit_to_user_mode()
and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
The state operations have the same ordering.
Task work handling is done separately for guests at the boundary of the
vcpu_run() loop via xfer_to_guest_mode_handle_work(), which handles a subset of
the work handled on return to user space.
Do not nest KVM entry/exit transitions because doing so is nonsensical.
Interrupts and regular exceptions
---------------------------------
Interrupt entry and exit handling is slightly more complex than that of
syscalls and KVM transitions.
If an interrupt is raised while the CPU executes in user space, the entry
and exit handling is exactly the same as for syscalls.
If the interrupt is raised while the CPU executes in kernel space the entry and
exit handling is slightly different. RCU state is only updated when the
interrupt is raised in the context of the CPU's idle task. Otherwise, RCU will
already be watching. Lockdep and tracing have to be updated unconditionally.
irqentry_enter() and irqentry_exit() provide the implementation for this.
The architecture-specific part looks similar to syscall handling:
.. code-block:: c
noinstr void interrupt(struct pt_regs *regs, int nr)
{
arch_interrupt_enter(regs);
state = irqentry_enter(regs);
instrumentation_begin();
irq_enter_rcu();
invoke_irq_handler(regs, nr);
irq_exit_rcu();
instrumentation_end();
irqentry_exit(regs, state);
}
Note that the invocation of the actual interrupt handler is enclosed within an
irq_enter_rcu() and irq_exit_rcu() pair.
irq_enter_rcu() updates the preemption count which makes in_hardirq()
return true, handles NOHZ tick state and interrupt time accounting. This
means that up to the point where irq_enter_rcu() is invoked in_hardirq()
returns false.
irq_exit_rcu() handles interrupt time accounting, undoes the preemption
count update and eventually handles soft interrupts and NOHZ tick state.
In theory, the preemption count could be updated in irqentry_enter(). In
practice, deferring this update to irq_enter_rcu() allows the preemption-count
code to be traced, while also maintaining symmetry with irq_exit_rcu() and
irqentry_exit(), which are described in the next paragraph. The only downside
is that the early entry code up to irq_enter_rcu() must be aware that the
preemption count has not yet been updated with the HARDIRQ_OFFSET state.
Note that irq_exit_rcu() must remove HARDIRQ_OFFSET from the preemption count
before it handles soft interrupts, whose handlers must run in BH context rather
than irq-disabled context. In addition, irqentry_exit() might schedule, which
also requires that HARDIRQ_OFFSET has been removed from the preemption count.
Even though interrupt handlers are expected to run with local interrupts
disabled, interrupt nesting is common from an entry/exit perspective. For
example, softirq handling happens within an irqentry_{enter,exit}() block with
local interrupts enabled. Also, although uncommon, nothing prevents an
interrupt handler from re-enabling interrupts.
Interrupt entry/exit code doesn't strictly need to handle reentrancy, since it
runs with local interrupts disabled. But NMIs can happen anytime, and a lot of
the entry code is shared between the two.
NMI and NMI-like exceptions
---------------------------
NMIs and NMI-like exceptions (machine checks, double faults, debug
interrupts, etc.) can hit any context and must be extra careful with
the state.
State changes for debug exceptions and machine-check exceptions depend on
whether these exceptions happened in user-space (breakpoints or watchpoints) or
in kernel mode (code patching). From user-space, they are treated like
interrupts, while from kernel mode they are treated like NMIs.
NMIs and other NMI-like exceptions handle state transitions without
distinguishing between user-mode and kernel-mode origin.
The state update on entry is handled in irqentry_nmi_enter() which updates
state in the following order:
* Preemption counter
* Lockdep
* RCU / Context tracking
* Tracing
The exit counterpart irqentry_nmi_exit() does the reverse operation in the
reverse order.
Note that the update of the preemption counter has to be the first
operation on enter and the last operation on exit. The reason is that both
lockdep and RCU rely on in_nmi() returning true in this case. The
preemption count modification in the NMI entry/exit case must not be
traced.
Architecture-specific code looks like this:
.. code-block:: c
noinstr void nmi(struct pt_regs *regs)
{
arch_nmi_enter(regs);
state = irqentry_nmi_enter(regs);
instrumentation_begin();
nmi_handler(regs);
instrumentation_end();
irqentry_nmi_exit(regs);
}
and for e.g. a debug exception it can look like this:
.. code-block:: c
noinstr void debug(struct pt_regs *regs)
{
arch_nmi_enter(regs);
debug_regs = save_debug_regs();
if (user_mode(regs)) {
state = irqentry_enter(regs);
instrumentation_begin();
user_mode_debug_handler(regs, debug_regs);
instrumentation_end();
irqentry_exit(regs, state);
} else {
state = irqentry_nmi_enter(regs);
instrumentation_begin();
kernel_mode_debug_handler(regs, debug_regs);
instrumentation_end();
irqentry_nmi_exit(regs, state);
}
}
There is no combined irqentry_nmi_if_kernel() function available as the
above cannot be handled in an exception-agnostic way.
NMIs can happen in any context; for example, an NMI-like exception can be
triggered while handling an NMI. So NMI entry code has to be reentrant and
state updates need to handle nesting.
......@@ -44,6 +44,14 @@ Library functionality that is used throughout the kernel.
timekeeping
errseq
Low level entry and exit
========================
.. toctree::
:maxdepth: 1
entry
Concurrency primitives
======================
......
.. SPDX-License-Identifier: GPL-2.0
========================================
The Kernel Test Anything Protocol (KTAP)
========================================
===================================================
The Kernel Test Anything Protocol (KTAP), version 1
===================================================
TAP, or the Test Anything Protocol, is a format for specifying test results used
by a number of projects. Its website and specification are found at this `link
......@@ -68,7 +68,7 @@ Test case result lines
Test case result lines indicate the final status of a test.
They are required and must have the format:
.. code-block::
.. code-block:: none
<result> <number> [<description>][ # [<directive>] [<diagnostic data>]]
......@@ -117,32 +117,32 @@ separator.
Example result lines include:
.. code-block::
.. code-block:: none
ok 1 test_case_name
The test "test_case_name" passed.
.. code-block::
.. code-block:: none
not ok 1 test_case_name
The test "test_case_name" failed.
.. code-block::
.. code-block:: none
ok 1 test # SKIP necessary dependency unavailable
The test "test" was SKIPPED with the diagnostic message "necessary dependency
unavailable".
.. code-block::
.. code-block:: none
not ok 1 test # TIMEOUT 30 seconds
The test "test" timed out, with diagnostic data "30 seconds".
.. code-block::
.. code-block:: none
ok 5 check return code # rcode=0
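Such result lines are straightforward to parse mechanically. As an illustration
only (this is not part of the specification; the regular expression and field
names are ours), a minimal parser could be sketched in Python like this:

```python
import re

# Grammar of a test case result line, as described above:
#   <result> <number> [<description>][ # [<directive>] [<diagnostic data>]]
# The lazy description match lets the optional "#"-introduced part win.
RESULT_RE = re.compile(
    r'^(?P<result>ok|not ok) (?P<number>\d+)'
    r'(?: (?P<description>[^#]*?))?'
    r'(?: *# *(?P<directive>[A-Z]+)?(?: *(?P<diagnostic>.*))?)?$'
)

def parse_result_line(line):
    """Split a KTAP test case result line into its parts, or return None."""
    match = RESULT_RE.match(line.strip())
    return match.groupdict() if match else None
```

For example, ``parse_result_line("ok 1 test # SKIP necessary dependency
unavailable")`` would report result "ok", number "1", description "test", and
directive "SKIP".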
......@@ -174,6 +174,13 @@ There may be lines within KTAP output that do not follow the format of one of
the four formats for lines described above. This is allowed, however, they will
not influence the status of the tests.
This is an important difference from TAP. Kernel tests may print messages
to the system console or a log file. Both of these destinations may contain
messages either from unrelated kernel or userspace activity, or kernel
messages from non-test code that is invoked by the test. The kernel code
invoked by the test is likely not aware that a test is in progress and
thus cannot print the message as a diagnostic message.
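For example, in the following illustrative output the kernel log line in the
middle is such an "unknown line"; parsers must tolerate it without it affecting
the result of the test:

.. code-block:: none

   KTAP version 1
   1..1
   [ 12.345678] example: unrelated kernel log message
   ok 1 test_case_name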
Nested tests
------------
......@@ -186,13 +193,16 @@ starting with another KTAP version line and test plan, and end with the overall
result. If one of the subtests fail, for example, the parent test should also
fail.
Additionally, all result lines in a subtest should be indented. One level of
Additionally, all lines in a subtest should be indented. One level of
indentation is two spaces: " ". The indentation should begin at the version
line and should end before the parent test's result line.
"Unknown lines" are not considered to be lines in a subtest and thus are
allowed to be either indented or not indented.
An example of a test with two nested subtests:
.. code-block::
.. code-block:: none
KTAP version 1
1..1
......@@ -205,7 +215,7 @@ An example of a test with two nested subtests:
An example format with multiple levels of nested testing:
.. code-block::
.. code-block:: none
KTAP version 1
1..2
......@@ -224,10 +234,15 @@ An example format with multiple levels of nested testing:
Major differences between TAP and KTAP
--------------------------------------
Note the major differences between the TAP and KTAP specification:
- yaml and json are not recommended in diagnostic messages
- TODO directive not recognized
- KTAP allows for an arbitrary number of tests to be nested
================================================== ========= ===============
Feature                                            TAP       KTAP
================================================== ========= ===============
yaml and json in diagnostic message                ok        not recommended
TODO directive                                     ok        not recognized
allows an arbitrary number of tests to be nested   no        yes
"Unknown lines" are in category of "Anything else" yes       no
"Unknown lines" are incorrect                      yes       allowed
================================================== ========= ===============
The TAP14 specification does permit nested tests, but instead of using another
nested version line, uses a line of the form
......@@ -235,7 +250,7 @@ nested version line, uses a line of the form
Example KTAP output
--------------------
.. code-block::
.. code-block:: none
KTAP version 1
1..1
......
......@@ -311,7 +311,7 @@ hardware.
This call must not sleep
set_ldisc(port,termios)
Notifier for discipline change. See Documentation/driver-api/serial/tty.rst.
Notifier for discipline change. See Documentation/tty/tty_ldisc.rst.
Locking: caller holds tty_port->mutex
......
......@@ -247,7 +247,7 @@ based on rt_mutex which changes the semantics:
Non-PREEMPT_RT kernels disable preemption to get this effect.
PREEMPT_RT kernels use a per-CPU lock for serialization which keeps
preemption disabled. The lock disables softirq handlers and also
preemption enabled. The lock disables softirq handlers and also
prevents reentrancy due to task preemption.
PREEMPT_RT kernels preserve all other spinlock_t semantics:
......
......@@ -249,6 +249,10 @@ The 5.x.y (-stable) and 5.x patches live at
https://www.kernel.org/pub/linux/kernel/v5.x/
The 5.x.y incremental patches live at
https://www.kernel.org/pub/linux/kernel/v5.x/incr/
The -rc patches are not stored on the webserver but are generated on
demand from git tags such as
......@@ -308,12 +312,11 @@ versions.
If no 5.x.y kernel is available, then the highest numbered 5.x kernel is
the current stable kernel.
.. note::
The -stable team provides normal as well as incremental patches. Below is
how to apply these patches.
The -stable team usually do make incremental patches available as well
as patches against the latest mainline release, but I only cover the
non-incremental ones below. The incremental ones can be found at
https://www.kernel.org/pub/linux/kernel/v5.x/incr/
Normal patches
~~~~~~~~~~~~~~
These patches are not incremental, meaning that for example the 5.7.3
patch does not apply on top of the 5.7.2 kernel source, but rather on top
......@@ -331,6 +334,21 @@ Here's a small example::
$ cd ..
$ mv linux-5.7.2 linux-5.7.3 # rename the kernel source dir
Incremental patches
~~~~~~~~~~~~~~~~~~~
Incremental patches are different: instead of being applied on top of the
base 5.x kernel, they are applied on top of the previous stable kernel
(5.x.y-1).

Here's an example of how to apply them::
$ cd ~/linux-5.7.2 # change to the kernel source dir
$ patch -p1 < ../patch-5.7.2-3 # apply the new 5.7.3 patch
$ cd ..
$ mv linux-5.7.2 linux-5.7.3 # rename the kernel source dir
The -rc kernels
===============
......
.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0)
.. See the bottom of this file for additional redistribution information.
Handling regressions
++++++++++++++++++++
*We don't cause regressions* -- this document describes what this "first rule of
Linux kernel development" means in practice for developers. It complements
Documentation/admin-guide/reporting-regressions.rst, which covers the topic from a
user's point of view; if you never read that text, go and at least skim over it
before continuing here.
The important bits (aka "The TL;DR")
====================================
#. Ensure subscribers of the `regression mailing list <https://lore.kernel.org/regressions/>`_
(regressions@lists.linux.dev) quickly become aware of any new regression
report:
* When receiving a mailed report that did not CC the list, bring it into the
loop by immediately sending at least a brief "Reply-all" with the list
CCed.
* Forward or bounce any reports submitted in bug trackers to the list.
#. Make the Linux kernel regression tracking bot "regzbot" track the issue (this
is optional, but recommended):
* For mailed reports, check if the reporter included a line like ``#regzbot
introduced v5.13..v5.14-rc1``. If not, send a reply (with the regressions
list in CC) containing a paragraph like the following, which tells regzbot
when the issue started to happen::
#regzbot ^introduced 1f2e3d4c5b6a
* When forwarding reports from a bug tracker to the regressions list (see
above), include a paragraph like the following::
#regzbot introduced: v5.13..v5.14-rc1
#regzbot from: Some N. Ice Human <some.human@example.com>
#regzbot monitor: http://some.bugtracker.example.com/ticket?id=123456789
#. When submitting fixes for regressions, add "Link:" tags to the patch
description pointing to all places where the issue was reported, as
mandated by Documentation/process/submitting-patches.rst and
:ref:`Documentation/process/5.Posting.rst <development_posting>`.
#. Try to fix regressions quickly once the culprit has been identified; fixes
for most regressions should be merged within two weeks, but some need to be
resolved within two or three days.
All the details on Linux kernel regressions relevant for developers
===================================================================
The important basics in more detail
-----------------------------------
What to do when receiving regression reports
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Ensure the Linux kernel's regression tracker and other subscribers of the
`regression mailing list <https://lore.kernel.org/regressions/>`_
(regressions@lists.linux.dev) become aware of any newly reported regression:
* When you receive a report by mail that did not CC the list, immediately bring
it into the loop by sending at least a brief "Reply-all" with the list CCed;
try to ensure it gets CCed again in case you reply to a reply that omitted
the list.
* If a report submitted in a bug tracker hits your Inbox, forward or bounce it
  to the list. Consider checking the list archives beforehand to see whether
  the reporter already forwarded the report, as instructed by
  Documentation/admin-guide/reporting-issues.rst.
When doing either, consider making the Linux kernel regression tracking bot
"regzbot" immediately start tracking the issue:
* For mailed reports, check if the reporter included a "regzbot command" like
``#regzbot introduced 1f2e3d4c5b6a``. If not, send a reply (with the
regressions list in CC) with a paragraph like the following::
#regzbot ^introduced: v5.13..v5.14-rc1
This tells regzbot the version range in which the issue started to happen;
you can specify a range using commit-ids as well or state a single commit-id
in case the reporter bisected the culprit.
Note the caret (^) before the "introduced": it tells regzbot to treat the
parent mail (the one you reply to) as the initial report for the regression
you want to see tracked; that's important, as regzbot will later look out
for patches with "Link:" tags pointing to the report in the archives on
lore.kernel.org.
* When forwarding a regression reported in a bug tracker, include a paragraph
with these regzbot commands::
#regzbot introduced: 1f2e3d4c5b6a
#regzbot from: Some N. Ice Human <some.human@example.com>
#regzbot monitor: http://some.bugtracker.example.com/ticket?id=123456789
Regzbot will then automatically associate patches with the report that
contain "Link:" tags pointing to your mail or the mentioned ticket.
What's important when fixing regressions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You don't need to do anything special when submitting fixes for regressions, just
remember to do what Documentation/process/submitting-patches.rst,
:ref:`Documentation/process/5.Posting.rst <development_posting>`, and
Documentation/process/stable-kernel-rules.rst already explain in more detail:
* Point to all places where the issue was reported using "Link:" tags::
Link: https://lore.kernel.org/r/30th.anniversary.repost@klaava.Helsinki.FI/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=1234567890
* Add a "Fixes:" tag to specify the commit causing the regression.
* If the culprit was merged in an earlier development cycle, explicitly mark
the fix for backporting using the ``Cc: stable@vger.kernel.org`` tag.
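Put together, a commit message carrying these tags might look like the
following sketch; the subject line, commit hash, and report URLs here are
made-up placeholders, not real references:

```shell
# Hypothetical commit message showing the tags described above; all
# names, the hash, and the URLs are invented for illustration.
cat > msg.txt <<'EOF'
fooblk: fix io priority regression

Restore the behavior users relied on before the rework.

Link: https://lore.kernel.org/r/some-report@example.com/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=1234567890
Fixes: 1f2e3d4c5b6a ("fooblk: rework io priority handling")
Cc: stable@vger.kernel.org
EOF

# Quick sanity check that all three tag types are present:
grep -E '^(Link|Fixes|Cc):' msg.txt
```

The tags sit at the end of the message body, one per line, so tools like
regzbot and the stable scripts can pick them up mechanically.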
All this is expected from you and important when it comes to regressions, as
these tags are of great value for everyone (you included) that might be looking
into the issue weeks, months, or years later. These tags are also crucial for
tools and scripts used by other kernel developers or Linux distributions; one of
these tools is regzbot, which heavily relies on the "Link:" tags to associate
regression reports with the changes resolving them.
Prioritize work on fixing regressions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You should fix any reported regression as quickly as possible, to provide
affected users with a solution in a timely manner and prevent more users from
running into the issue; nevertheless developers need to take enough time and
care to ensure regression fixes do not cause additional damage.
In the end though, developers should do their best to prevent users from
running into situations where a regression leaves them only three options: "run
a kernel with a regression that seriously impacts usage", "continue running an
outdated and thus potentially insecure kernel version for more than two weeks
after a regression's culprit was identified", or "downgrade to a still
supported kernel series that lacks required features".
How to realize this depends a lot on the situation. Here are a few rules of
thumb for you, in order of importance:
* Prioritize work on handling regression reports and fixing regressions over all
other Linux kernel work, unless the latter concerns acute security issues or
bugs causing data loss or damage.
* Always consider reverting the culprit commits and reapplying them later
together with necessary fixes, as this might be the least dangerous and
quickest way to fix a regression.
* Developers should handle regressions in all supported kernel series, but are
free to delegate the work to the stable team, if the issue most likely never
occurred in mainline.
* Try to resolve any regressions introduced in the current development cycle
before its end. If you fear a fix might be too risky to apply only days before
a new mainline release, let Linus decide: submit the fix separately to him as
soon as possible with an explanation of the situation. He then can make a call
and postpone the release if necessary, for example if multiple such changes
show up in his inbox.
* Address regressions in stable, longterm, or proper mainline releases with
more urgency than regressions in mainline pre-releases. That changes after
the release of the fifth pre-release, aka "-rc5": mainline then becomes as
important, to ensure all the improvements and fixes are ideally tested
together for at least one week before Linus releases a new mainline version.
* Fix regressions within two or three days, if they are critical for some
reason -- for example, if the issue is likely to affect many users of the
kernel series in question on all or certain architectures. Note, this
includes mainline, as issues like compile errors otherwise might prevent many
testers or continuous integration systems from testing the series.
* Aim to fix regressions within one week after the culprit was identified, if
the issue was introduced in either:
* a recent stable/longterm release
* the development cycle of the latest proper mainline release
In the latter case (say Linux v5.14), try to address regressions even more
quickly, if the stable series for the predecessor (v5.13) will be abandoned
soon or was already stamped "End-of-Life" (EOL) -- this usually happens about
three to four weeks after a new mainline release.
* Try to fix all other regressions within two weeks after the culprit was
found. Two or three additional weeks are acceptable for performance
regressions and other issues which are annoying, but don't prevent anyone
from running Linux (unless it's an issue in the current development cycle,
as those should ideally be addressed before the release). A few weeks in
total are acceptable if a regression can only be fixed with a risky change
and at the same time is affecting only a few users; that much time is also
okay if the regression is already present in the second newest longterm
kernel series.
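The revert-first rule of thumb above can be sketched in a throwaway
repository; all file names, commit messages, and contents below are invented
for illustration:

```shell
# Sketch of reverting a culprit commit in a disposable test repository;
# everything here (names, messages, file contents) is made up.
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q
git config user.email dev@example.com
git config user.name "Some Developer"

echo "working" > driver.c
git add driver.c
git commit -qm "fooblk: initial driver"

echo "regressed" > driver.c
git commit -qam "fooblk: rework init path"
culprit=$(git rev-parse --short HEAD)

# Revert the culprit; before posting the revert, the generated message
# should additionally gain a "Link:" tag pointing to the report.
git revert --no-edit "$culprit"
cat driver.c
```

After the revert, `driver.c` is back to its pre-culprit state; the reworked
change can be reapplied later together with the necessary fixes.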
Note: The aforementioned time frames for resolving regressions are meant to
include getting the fix tested, reviewed, and merged into mainline, ideally with
the fix being in linux-next at least briefly. This leads to delays you need to
account for.
Subsystem maintainers are expected to help reach those time frames by doing
timely reviews and quick handling of accepted patches. They thus might have to
send git-pull requests earlier or more often than usual; depending on the fix,
it might even be acceptable to skip testing in linux-next. Especially fixes for
regressions in stable and longterm kernels need to be handled quickly, as fixes
need to be merged in mainline before they can be backported to older series.
More aspects regarding regressions developers should be aware of
----------------------------------------------------------------
How to deal with changes where a risk of regression is known
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Evaluate how big the risk of regressions is, for example by performing a code
search in Linux distributions and Git forges. Also consider asking other
developers or projects likely to be affected to evaluate or even test the
proposed change; if problems surface, maybe some solution acceptable for all
can be found.
If the risk of regressions in the end seems to be relatively small, go ahead
with the change, but let all involved parties know about the risk. Hence, make
sure your patch description makes this aspect obvious. Once the change is
merged, tell the Linux kernel's regression tracker and the regressions mailing
list about the risk, so everyone has the change on the radar in case reports
trickle in. Depending on the risk, you also might want to ask the subsystem
maintainer to mention the issue in his mainline pull request.
What else is there to know about regressions?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Check out Documentation/admin-guide/reporting-regressions.rst, it covers a lot
of other aspects you might want to be aware of:
* the purpose of the "no regressions rule"
* what issues actually qualify as regression
* who's in charge of finding the root cause of a regression
* how to handle tricky situations, e.g. when a regression is caused by a
security fix or when fixing a regression might cause another one
Whom to ask for advice when it comes to regressions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Send a mail to the regressions mailing list (regressions@lists.linux.dev) while
CCing the Linux kernel's regression tracker (regressions@leemhuis.info); if the
issue might better be dealt with in private, feel free to omit the list.
More about regression tracking and regzbot
------------------------------------------
Why does the Linux kernel have a regression tracker, and why is regzbot used?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Rules like "no regressions" need someone to ensure they are followed, otherwise
they are broken either accidentally or on purpose. History has shown this to be
true for the Linux kernel as well. That's why Thorsten Leemhuis volunteered to
keep an eye on things as the Linux kernel's regression tracker, who's
occasionally helped by other people. None of them are paid to do this, which
is why regression tracking is done on a best-effort basis.
Earlier attempts to track regressions manually have shown it to be exhausting
and frustrating work, which is why they were abandoned after a while. To prevent
this from happening again, Thorsten developed regzbot to facilitate the work,
with the long term goal to automate regression tracking as much as possible for
everyone involved.
How does regression tracking work with regzbot?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The bot watches for replies to reports of tracked regressions. Additionally,
it's looking out for posted or committed patches referencing such reports
with "Link:" tags; replies to such patch postings are tracked as well.
Combined, this data provides good insights into the current state of the fixing
process.
Regzbot tries to do its job with as little overhead as possible for both
reporters and developers. In fact, only reporters are burdened with an extra
duty: they need to tell regzbot about the regression report using the ``#regzbot
introduced`` command outlined above; if they don't do that, someone else can
take care of that using ``#regzbot ^introduced``.
For developers there normally is no extra work involved, they just need to make
sure to do something that was expected long before regzbot came to light: add
"Link:" tags to the patch description pointing to all reports about the issue
fixed.
Do I have to use regzbot?
~~~~~~~~~~~~~~~~~~~~~~~~~
It's in the interest of everyone if you do, as kernel maintainers like Linus
Torvalds partly rely on regzbot's tracking in their work -- for example when
deciding to release a new version or extend the development phase. For this they
need to be aware of all unfixed regressions; to do that, Linus is known to look
into the weekly reports sent by regzbot.
Do I have to tell regzbot about every regression I stumble upon?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Ideally yes: we are all humans and easily forget problems when something more
important unexpectedly comes up -- for example a bigger problem in the Linux
kernel or something in real life that's keeping us away from keyboards for a
while. Hence, it's best to tell regzbot about every regression, except when you
immediately write a fix and commit it to a tree regularly merged to the affected
kernel series.
How to see which regressions regzbot tracks currently?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Check `regzbot's web-interface <https://linux-regtracking.leemhuis.info/regzbot/>`_
for the latest info; alternatively, `search for the latest regression report
<https://lore.kernel.org/lkml/?q=%22Linux+regressions+report%22+f%3Aregzbot>`_,
which regzbot normally sends out once a week on Sunday evening (UTC), a few
hours before Linus usually publishes new (pre-)releases.
What places is regzbot monitoring?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Regzbot is watching the most important Linux mailing lists as well as the git
repositories of linux-next, mainline, and stable/longterm.
What kind of issues are supposed to be tracked by regzbot?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The bot is meant to track regressions, hence please don't involve regzbot for
regular issues. But it's okay for the Linux kernel's regression tracker if you
use regzbot to track severe issues, like reports about hangs, corrupted data,
or internal errors (Panic, Oops, BUG(), warning, ...).
Can I add regressions found by CI systems to regzbot's tracking?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Feel free to do so, if the particular regression likely has impact on practical
use cases and thus might be noticed by users; hence, please don't involve
regzbot for theoretical regressions unlikely to show themselves in real world
usage.
How to interact with regzbot?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
By using a 'regzbot command' in a direct or indirect reply to the mail with the
regression report. These commands need to be in their own paragraph (IOW: they
need to be separated from the rest of the mail using blank lines).
One such command is ``#regzbot introduced <version or commit>``, which makes
regzbot consider your mail as a regression report and start tracking the issue,
as already described above; ``#regzbot ^introduced <version or commit>`` is
another such command, which makes regzbot consider the parent mail as the
report for a regression it then starts to track.
Once one of those two commands has been used, other regzbot commands can be
used in direct or indirect replies to the report. You can write them below one
of the `introduced` commands, in replies to the mail that used one of them, or
in replies to such a reply:
* Set or update the title::
#regzbot title: foo
* Monitor a discussion or bugzilla.kernel.org ticket where additional aspects of
the issue or a fix are discussed -- for example the posting of a patch fixing
the regression::
#regzbot monitor: https://lore.kernel.org/all/30th.anniversary.repost@klaava.Helsinki.FI/
Monitoring only works for lore.kernel.org and bugzilla.kernel.org; regzbot
will consider all messages in that thread or ticket as related to the fixing
process.
* Point to a place with further details of interest, like a mailing list post
or a ticket in a bug tracker that is somewhat related, but about a different
topic::
#regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=123456789
* Mark a regression as fixed by a commit that is heading upstream or already
landed::
#regzbot fixed-by: 1f2e3d4c5d
* Mark a regression as a duplicate of another one already tracked by regzbot::
#regzbot dup-of: https://lore.kernel.org/all/30th.anniversary.repost@klaava.Helsinki.FI/
* Mark a regression as invalid::
#regzbot invalid: wasn't a regression, problem has always existed
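Putting the command rules together, the body of a reply activating regzbot
might look like the following sketch; the version range, title, and wording
are invented placeholders, and note that each command sits in a paragraph of
its own:

```shell
# Hypothetical reply body activating regzbot tracking; the range and
# title are made up. Each regzbot command is its own paragraph,
# separated from the rest of the mail by blank lines.
cat > reply.txt <<'EOF'
Thanks for the report, this looks like a regression to me.

#regzbot ^introduced: v5.13..v5.14-rc1

#regzbot title: fooblk: boot hangs on some machines

I'll try to find time to bisect this later this week.
EOF

# Both commands start at the beginning of a line:
grep -c '^#regzbot' reply.txt
```

The ``^introduced`` command makes regzbot treat the mail being replied to as
the report, and the ``title`` command can ride along in the same reply.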
Is there more to tell about regzbot and its commands?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
More detailed and up-to-date information about the Linux
kernel's regression tracking bot can be found on its
`project page <https://gitlab.com/knurd42/regzbot>`_, which among others
contains a `getting started guide <https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md>`_
and `reference documentation <https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md>`_
which both cover more details than the above section.
Quotes from Linus about regressions
-----------------------------------
Below are a few real-life examples of how Linus Torvalds expects regressions
to be handled:
* From `2017-10-26 (1/2)
<https://lore.kernel.org/lkml/CA+55aFwiiQYJ+YoLKCXjN_beDVfu38mg=Ggg5LFOcqHE8Qi7Zw@mail.gmail.com/>`_::
If you break existing user space setups THAT IS A REGRESSION.
It's not ok to say "but we'll fix the user space setup".
Really. NOT OK.
[...]
The first rule is:
- we don't cause regressions
and the corollary is that when regressions *do* occur, we admit to
them and fix them, instead of blaming user space.
The fact that you have apparently been denying the regression now for
three weeks means that I will revert, and I will stop pulling apparmor
requests until the people involved understand how kernel development
is done.
* From `2017-10-26 (2/2)
<https://lore.kernel.org/lkml/CA+55aFxW7NMAMvYhkvz1UPbUTUJewRt6Yb51QAx5RtrWOwjebg@mail.gmail.com/>`_::
People should basically always feel like they can update their kernel
and simply not have to worry about it.
I refuse to introduce "you can only update the kernel if you also
update that other program" kind of limitations. If the kernel used to
work for you, the rule is that it continues to work for you.
There have been exceptions, but they are few and far between, and they
generally have some major and fundamental reasons for having happened,
that were basically entirely unavoidable, and people _tried_hard_ to
avoid them. Maybe we can't practically support the hardware any more
after it is decades old and nobody uses it with modern kernels any
more. Maybe there's a serious security issue with how we did things,
and people actually depended on that fundamentally broken model. Maybe
there was some fundamental other breakage that just _had_ to have a
flag day for very core and fundamental reasons.
And notice that this is very much about *breaking* peoples environments.
Behavioral changes happen, and maybe we don't even support some
feature any more. There's a number of fields in /proc/<pid>/stat that
are printed out as zeroes, simply because they don't even *exist* in
the kernel any more, or because showing them was a mistake (typically
an information leak). But the numbers got replaced by zeroes, so that
the code that used to parse the fields still works. The user might not
see everything they used to see, and so behavior is clearly different,
but things still _work_, even if they might no longer show sensitive
(or no longer relevant) information.
But if something actually breaks, then the change must get fixed or
reverted. And it gets fixed in the *kernel*. Not by saying "well, fix
your user space then". It was a kernel change that exposed the
problem, it needs to be the kernel that corrects for it, because we
have a "upgrade in place" model. We don't have a "upgrade with new
user space".
And I seriously will refuse to take code from people who do not
understand and honor this very simple rule.
This rule is also not going to change.
And yes, I realize that the kernel is "special" in this respect. I'm
proud of it.
I have seen, and can point to, lots of projects that go "We need to
break that use case in order to make progress" or "you relied on
undocumented behavior, it sucks to be you" or "there's a better way to
do what you want to do, and you have to change to that new better
way", and I simply don't think that's acceptable outside of very early
alpha releases that have experimental users that know what they signed
up for. The kernel hasn't been in that situation for the last two
decades.
We do API breakage _inside_ the kernel all the time. We will fix
internal problems by saying "you now need to do XYZ", but then it's
about internal kernel API's, and the people who do that then also
obviously have to fix up all the in-kernel users of that API. Nobody
can say "I now broke the API you used, and now _you_ need to fix it
up". Whoever broke something gets to fix it too.
And we simply do not break user space.
* From `2020-05-21
<https://lore.kernel.org/all/CAHk-=wiVi7mSrsMP=fLXQrXK_UimybW=ziLOwSzFTtoXUacWVQ@mail.gmail.com/>`_::
The rules about regressions have never been about any kind of
documented behavior, or where the code lives.
The rules about regressions are always about "breaks user workflow".
Users are literally the _only_ thing that matters.
No amount of "you shouldn't have used this" or "that behavior was
undefined, it's your own fault your app broke" or "that used to work
simply because of a kernel bug" is at all relevant.
Now, reality is never entirely black-and-white. So we've had things
like "serious security issue" etc that just forces us to make changes
that may break user space. But even then the rule is that we don't
really have other options that would allow things to continue.
And obviously, if users take years to even notice that something
broke, or if we have sane ways to work around the breakage that
doesn't make for too much trouble for users (ie "ok, there are a
handful of users, and they can use a kernel command line to work
around it" kind of things) we've also been a bit less strict.
But no, "that was documented to be broken" (whether it's because the
code was in staging or because the man-page said something else) is
irrelevant. If staging code is so useful that people end up using it,
that means that it's basically regular kernel code with a flag saying
"please clean this up".
The other side of the coin is that people who talk about "API
stability" are entirely wrong. API's don't matter either. You can make
any changes to an API you like - as long as nobody notices.
Again, the regression rule is not about documentation, not about
API's, and not about the phase of the moon.
It's entirely about "we caused problems for user space that used to work".
* From `2017-11-05
<https://lore.kernel.org/all/CA+55aFzUvbGjD8nQ-+3oiMBx14c_6zOj2n7KLN3UsJ-qsd4Dcw@mail.gmail.com/>`_::
And our regression rule has never been "behavior doesn't change".
That would mean that we could never make any changes at all.
For example, we do things like add new error handling etc all the
time, which we then sometimes even add tests for in our kselftest
directory.
So clearly behavior changes all the time and we don't consider that a
regression per se.
The rule for a regression for the kernel is that some real user
workflow breaks. Not some test. Not a "look, I used to be able to do
X, now I can't".
* From `2018-08-03
<https://lore.kernel.org/all/CA+55aFwWZX=CXmWDTkDGb36kf12XmTehmQjbiMPCqCRG2hi9kw@mail.gmail.com/>`_::
YOU ARE MISSING THE #1 KERNEL RULE.
We do not regress, and we do not regress exactly because your are 100% wrong.
And the reason you state for your opinion is in fact exactly *WHY* you
are wrong.
Your "good reasons" are pure and utter garbage.
The whole point of "we do not regress" is so that people can upgrade
the kernel and never have to worry about it.
> Kernel had a bug which has been fixed
That is *ENTIRELY* immaterial.
Guys, whether something was buggy or not DOES NOT MATTER.
Why?
Bugs happen. That's a fact of life. Arguing that "we had to break
something because we were fixing a bug" is completely insane. We fix
tens of bugs every single day, thinking that "fixing a bug" means that
we can break something is simply NOT TRUE.
So bugs simply aren't even relevant to the discussion. They happen,
they get found, they get fixed, and it has nothing to do with "we
break users".
Because the only thing that matters IS THE USER.
How hard is that to understand?
Anybody who uses "but it was buggy" as an argument is entirely missing
the point. As far as the USER was concerned, it wasn't buggy - it
worked for him/her.
Maybe it worked *because* the user had taken the bug into account,
maybe it worked because the user didn't notice - again, it doesn't
matter. It worked for the user.
Breaking a user workflow for a "bug" is absolutely the WORST reason
for breakage you can imagine.
It's basically saying "I took something that worked, and I broke it,
but now it's better". Do you not see how f*cking insane that statement
is?
And without users, your program is not a program, it's a pointless
piece of code that you might as well throw away.
Seriously. This is *why* the #1 rule for kernel development is "we
don't break users". Because "I fixed a bug" is absolutely NOT AN
ARGUMENT if that bug fix broke a user setup. You actually introduced a
MUCH BIGGER bug by "fixing" something that the user clearly didn't
even care about.
And dammit, we upgrade the kernel ALL THE TIME without upgrading any
other programs at all. It is absolutely required, because flag-days
and dependencies are horribly bad.
And it is also required simply because I as a kernel developer do not
upgrade random other tools that I don't even care about as I develop
the kernel, and I want any of my users to feel safe doing the same
time.
So no. Your rule is COMPLETELY wrong. If you cannot upgrade a kernel
without upgrading some other random binary, then we have a problem.
* From `2021-06-05
<https://lore.kernel.org/all/CAHk-=wiUVqHN76YUwhkjZzwTdjMMJf_zN4+u7vEJjmEGh3recw@mail.gmail.com/>`_::
THERE ARE NO VALID ARGUMENTS FOR REGRESSIONS.
Honestly, security people need to understand that "not working" is not
a success case of security. It's a failure case.
Yes, "not working" may be secure. But security in that case is *pointless*.
* From `2011-05-06 (1/3)
<https://lore.kernel.org/all/BANLkTim9YvResB+PwRp7QTK-a5VNg2PvmQ@mail.gmail.com/>`_::
Binary compatibility is more important.
And if binaries don't use the interface to parse the format (or just
parse it wrongly - see the fairly recent example of adding uuid's to
/proc/self/mountinfo), then it's a regression.
And regressions get reverted, unless there are security issues or
similar that makes us go "Oh Gods, we really have to break things".
I don't understand why this simple logic is so hard for some kernel
developers to understand. Reality matters. Your personal wishes matter
NOT AT ALL.
If you made an interface that can be used without parsing the
interface description, then we're stuck with the interface. Theory
simply doesn't matter.
You could help fix the tools, and try to avoid the compatibility
issues that way. There aren't that many of them.
From `2011-05-06 (2/3)
<https://lore.kernel.org/all/BANLkTi=KVXjKR82sqsz4gwjr+E0vtqCmvA@mail.gmail.com/>`_::
it's clearly NOT an internal tracepoint. By definition. It's being
used by powertop.
From `2011-05-06 (3/3)
<https://lore.kernel.org/all/BANLkTinazaXRdGovYL7rRVp+j6HbJ7pzhg@mail.gmail.com/>`_::
We have programs that use that ABI and thus it's a regression if they break.
* From `2012-07-06 <https://lore.kernel.org/all/CA+55aFwnLJ+0sjx92EGREGTWOx84wwKaraSzpTNJwPVV8edw8g@mail.gmail.com/>`_::
> Now this got me wondering if Debian _unstable_ actually qualifies as a
> standard distro userspace.
Oh, if the kernel breaks some standard user space, that counts. Tons
of people run Debian unstable
* From `2019-09-15
<https://lore.kernel.org/lkml/CAHk-=wiP4K8DRJWsCo=20hn_6054xBamGKF2kPgUzpB5aMaofA@mail.gmail.com/>`_::
One _particularly_ last-minute revert is the top-most commit (ignoring
the version change itself) done just before the release, and while
it's very annoying, it's perhaps also instructive.
What's instructive about it is that I reverted a commit that wasn't
actually buggy. In fact, it was doing exactly what it set out to do,
and did it very well. In fact it did it _so_ well that the much
improved IO patterns it caused then ended up revealing a user-visible
regression due to a real bug in a completely unrelated area.
The actual details of that regression are not the reason I point that
revert out as instructive, though. It's more that it's an instructive
example of what counts as a regression, and what the whole "no
regressions" kernel rule means. The reverted commit didn't change any
API's, and it didn't introduce any new bugs. But it ended up exposing
another problem, and as such caused a kernel upgrade to fail for a
user. So it got reverted.
The point here being that we revert based on user-reported _behavior_,
not based on some "it changes the ABI" or "it caused a bug" concept.
The problem was really pre-existing, and it just didn't happen to
trigger before. The better IO patterns introduced by the change just
happened to expose an old bug, and people had grown to depend on the
previously benign behavior of that old issue.
And never fear, we'll re-introduce the fix that improved on the IO
patterns once we've decided just how to handle the fact that we had a
bad interaction with an interface that people had then just happened
to rely on incidental behavior for before. It's just that we'll have
to hash through how to do that (there are no less than three different
patches by three different developers being discussed, and there might
be more coming...). In the meantime, I reverted the thing that exposed
the problem to users for this release, even if I hope it will be
re-introduced (perhaps even backported as a stable patch) once we have
consensus about the issue it exposed.
Take-away from the whole thing: it's not about whether you change the
kernel-userspace ABI, or fix a bug, or about whether the old code
"should never have worked in the first place". It's about whether
something breaks existing users' workflow.
Anyway, that was my little aside on the whole regression thing. Since
it's that "first rule of kernel programming", I felt it is perhaps
worth just bringing it up every once in a while
..
end-of-content
..
This text is available under GPL-2.0+ or CC-BY-4.0, as stated at the top
of the file. If you want to distribute this text under CC-BY-4.0 only,
please use "The Linux kernel developers" for author attribution and link
this as source:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/process/handling-regressions.rst
..
Note: Only the content of this RST file as found in the Linux kernel sources
is available under CC-BY-4.0, as versions of this text that were processed
(for example by the kernel's build system) might contain content taken from
files which use a more restrictive license.
......@@ -25,6 +25,7 @@ Below are the essential guides that every developer should read.
code-of-conduct-interpretation
development-process
submitting-patches
handling-regressions
programming-language
coding-style
maintainer-handbooks
......@@ -48,6 +49,7 @@ Other guides to the community that are of interest to most developers are:
deprecated
embargoed-hardware-issues
maintainers
researcher-guidelines
These are some overall technical guides that have been put here for now for
lack of a better place.
......
.. SPDX-License-Identifier: GPL-2.0
.. _researcher_guidelines:
Researcher Guidelines
+++++++++++++++++++++
The Linux kernel community welcomes transparent research on the Linux
kernel, the activities involved in producing it, and any other byproducts
of its development. Linux benefits greatly from this kind of research, and
most aspects of Linux are driven by research in one form or another.
The community greatly appreciates it if researchers can share preliminary
findings before making their results public, especially if such research
involves security. Getting involved early helps improve both the quality of
the research and Linux's ability to benefit from it. In any case,
sharing open access copies of the published research with the community
is recommended.
This document seeks to clarify what the Linux kernel community considers
acceptable and non-acceptable practices when conducting such research. At
the very least, such research and related activities should follow
standard research ethics rules. For more background on research ethics
generally, ethics in technology, and research of developer communities
in particular, see:
* `History of Research Ethics <https://www.unlv.edu/research/ORI-HSR/history-ethics>`_
* `IEEE Ethics <https://www.ieee.org/about/ethics/index.html>`_
* `Developer and Researcher Views on the Ethics of Experiments on Open-Source Projects <https://arxiv.org/pdf/2112.13217.pdf>`_
The Linux kernel community expects that everyone interacting with the
project is participating in good faith to make Linux better. Research on
any publicly-available artifact (including, but not limited to source
code) produced by the Linux kernel community is welcome, though research
on developers must be distinctly opt-in.
Passive research that is based entirely on publicly available sources,
including posts to public mailing lists and commits to public
repositories, is clearly permissible. However, as with any research, standard
ethics must still be followed.
Active research on developer behavior, however, must be done with the
explicit agreement of, and full disclosure to, the individual developers
involved. Developers cannot be interacted with/experimented on without
consent; this, too, is standard research ethics.
To help clarify: sending patches to developers *is* interacting
with them, but they have already consented to receiving *good faith
contributions*. Sending intentionally flawed/vulnerable patches or
contributing misleading information to discussions is not consented
to. Such communication can be damaging to the developer (e.g. draining
time, effort, and morale) and damaging to the project by eroding
the entire developer community's trust in the contributor (and the
contributor's organization as a whole), undermining efforts to provide
constructive feedback to contributors, and putting end users at risk of
software flaws.
Participation in the development of Linux itself by researchers, as
with anyone, is welcomed and encouraged. Research into Linux code is
a common practice, especially when it comes to developing or running
analysis tools that produce actionable results.
When engaging with the developer community, sending a patch has
traditionally been the best way to make an impact. Linux already has
plenty of known bugs -- what's much more helpful is having vetted fixes.
Before contributing, carefully read the appropriate documentation:
* Documentation/process/development-process.rst
* Documentation/process/submitting-patches.rst
* Documentation/admin-guide/reporting-issues.rst
* Documentation/admin-guide/security-bugs.rst
Then send a patch (including a commit log with all the details listed
below) and follow up on any feedback from other developers.
When sending patches produced from research, the commit logs should
contain at least the following details, so that developers have
appropriate context for understanding the contribution. Answer:
* What is the specific problem that has been found?
* How could the problem be reached on a running system?
* What effect would encountering the problem have on the system?
* How was the problem found? Specifically include details about any
testing, static or dynamic analysis programs, and any other tools or
methods used to perform the work.
* Which version of Linux was the problem found on? Using the most recent
release or a recent linux-next branch is strongly preferred (see
Documentation/process/howto.rst).
* What was changed to fix the problem, and why it is believed to be correct?
* How was the change build tested and run-time tested?
* What prior commit does this change fix? This should go in a "Fixes:"
tag as the documentation describes.
* Who else has reviewed this patch? This should go in appropriate
"Reviewed-by:" tags; see below.
For example::
From: Author <author@email>
Subject: [PATCH] drivers/foo_bar: Add missing kfree()
The error path in foo_bar driver does not correctly free the allocated
struct foo_bar_info. This can happen if the attached foo_bar device
rejects the initialization packets sent during foo_bar_probe(). This
would result in a 64 byte slab memory leak once per device attach,
wasting memory resources over time.
This flaw was found using an experimental static analysis tool we are
developing, LeakMagic[1], which reported the following warning when
analyzing the v5.15 kernel release:
path/to/foo_bar.c:187: missing kfree() call?
Add the missing kfree() to the error path. No other references to
this memory exist outside the probe function, so this is the only
place it can be freed.
x86_64 and arm64 defconfig builds with CONFIG_FOO_BAR=y using GCC
11.2 show no new warnings, and LeakMagic no longer warns about this
code path. As we don't have a FooBar device to test with, no runtime
testing was able to be performed.
[1] https://url/to/leakmagic/details
Reported-by: Researcher <researcher@email>
Fixes: aaaabbbbccccdddd ("Introduce support for FooBar")
Signed-off-by: Author <author@email>
Reviewed-by: Reviewer <reviewer@email>
If you are a first time contributor it is recommended that the patch
itself be vetted by others privately before being posted to public lists.
(This is required if you have been explicitly told your patches need
more careful internal review.) These people are expected to have their
"Reviewed-by" tag included in the resulting patch. Finding another
developer familiar with Linux contribution, especially within your own
organization, and having them help with reviews before sending them to
the public mailing lists tends to significantly improve the quality of the
resulting patches, and thereby reduces the burden on other developers.
If no one can be found to internally review patches and you need
help finding such a person, or if you have any other questions
related to this document and the developer community's expectations,
please reach out to the private Technical Advisory Board mailing list:
<tech-board@lists.linux-foundation.org>.
......@@ -495,7 +495,8 @@ Using Reported-by:, Tested-by:, Reviewed-by:, Suggested-by: and Fixes:
The Reported-by tag gives credit to people who find bugs and report them and it
hopefully inspires them to help us again in the future. Please note that if
the bug was reported in private, then ask for permission first before using the
Reported-by tag.
Reported-by tag. The tag is intended for bugs; please do not use it to credit
feature requests.
A Tested-by: tag indicates that the patch has been successfully tested (in
some environment) by the person named. This tag informs maintainers that
......
......@@ -14,6 +14,7 @@ Linux Scheduler
sched-domains
sched-capacity
sched-energy
schedutil
sched-nice-design
sched-rt-group
sched-stats
......
......@@ -37,10 +37,10 @@ rebalancing event for the current runqueue has arrived. The actual load
balancing workhorse, run_rebalance_domains()->rebalance_domains(), is then run
in softirq context (SCHED_SOFTIRQ).
The latter function takes two arguments: the current CPU and whether it was idle
at the time the scheduler_tick() happened and iterates over all sched domains
our CPU is on, starting from its base domain and going up the ->parent chain.
While doing that, it checks to see if the current domain has exhausted its
The latter function takes two arguments: the runqueue of the current CPU and whether
the CPU was idle at the time the scheduler_tick() happened and iterates over all
sched domains our CPU is on, starting from its base domain and going up the ->parent
chain. While doing that, it checks to see if the current domain has exhausted its
rebalance interval. If so, it runs load_balance() on that domain. It then checks
the parent sched_domain (if it exists), and the parent of the parent and so
forth.
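The walk described above can be sketched in a few lines (a simplified, hypothetical model for illustration only, not the actual kernel code or data structures):

```python
# Hypothetical, simplified model of the sched_domain walk described
# above; not the actual kernel code.

class SchedDomain:
    def __init__(self, name, interval, parent=None):
        self.name = name
        self.balance_interval = interval  # rebalance interval (jiffies)
        self.last_balance = 0             # time of last rebalance (jiffies)
        self.parent = parent

def rebalance_domains(base, now):
    """Walk from the base domain up the ->parent chain, "rebalancing"
    every domain whose rebalance interval has been exhausted."""
    balanced = []
    sd = base
    while sd is not None:
        if now - sd.last_balance >= sd.balance_interval:
            balanced.append(sd.name)      # stands in for load_balance()
            sd.last_balance = now
        sd = sd.parent
    return balanced

top = SchedDomain("top", interval=64)
base = SchedDomain("base", interval=8, parent=top)
print(rebalance_domains(base, now=16))   # only the base domain is due
print(rebalance_domains(base, now=80))   # both domains are now due
```

Each call checks the current domain's interval before moving to its parent, mirroring the check-then-``load_balance()`` loop described above.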
......
=========
Schedutil
=========
.. note::
NOTE; all this assumes a linear relation between frequency and work capacity,
we know this is flawed, but it is the best workable approximation.
All this assumes a linear relation between frequency and work capacity,
we know this is flawed, but it is the best workable approximation.
PELT (Per Entity Load Tracking)
-------------------------------
===============================
With PELT we track some metrics across the various scheduler entities, from
individual tasks to task-group slices to CPU runqueues. As the basis for this
......@@ -38,8 +42,8 @@ while 'runnable' will increase to reflect the amount of contention.
For more detail see: kernel/sched/pelt.c
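The decay itself can be illustrated with a short sketch (a floating-point approximation with assumed period lengths; the real implementation in kernel/sched/pelt.c uses fixed-point arithmetic):

```python
# Toy illustration of PELT's geometric decay (floating point, not the
# fixed-point arithmetic the kernel actually uses).  History is decayed
# by a factor y per period, with y chosen so that y**32 == 0.5, i.e.
# contributions halve in weight every 32 periods.

Y = 0.5 ** (1 / 32)

def pelt_update(util, running):
    """Advance the average by one period: decay the old history and add
    this period's contribution (1.0 if the entity ran, else 0.0)."""
    return util * Y + (1 - Y) * (1.0 if running else 0.0)

util = 0.0
for period in range(320):
    util = pelt_update(util, running=(period % 2 == 0))
print(round(util, 2))  # converges near 0.5 for a 50% duty cycle
```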
Frequency- / CPU Invariance
---------------------------
Frequency / CPU Invariance
==========================
Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU
for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on
......@@ -47,7 +51,7 @@ a big CPU, we allow architectures to scale the time delta with two ratios, one
Dynamic Voltage and Frequency Scaling (DVFS) ratio and one microarch ratio.
For simple DVFS architectures (where software is in full control) we trivially
compute the ratio as:
compute the ratio as::
f_cur
r_dvfs := -----
......@@ -55,7 +59,7 @@ compute the ratio as:
For more dynamic systems where the hardware is in control of DVFS we use
hardware counters (Intel APERF/MPERF, ARMv8.4-AMU) to provide us this ratio.
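Numerically, the combined scaling can be illustrated like this (hypothetical values; the kernel computes these ratios per architecture in fixed point):

```python
def freq_invariant_delta(delta, f_cur, f_max, c_cur, c_max):
    """Scale a time delta by the DVFS ratio and the microarch
    (capacity) ratio, as described above."""
    r_dvfs = f_cur / f_max
    r_cpu = c_cur / c_max
    return delta * r_dvfs * r_cpu

# Running at 50% of max frequency on a CPU with 80% of the biggest
# CPU's capacity (819/1024):
print(freq_invariant_delta(1024, f_cur=1.0e9, f_max=2.0e9,
                           c_cur=819, c_max=1024))  # 409.5
```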
For Intel specifically, we use:
For Intel specifically, we use::
APERF
f_cur := ----- * P0
......@@ -87,7 +91,7 @@ For more detail see:
UTIL_EST / UTIL_EST_FASTUP
--------------------------
==========================
Because periodic tasks have their averages decayed while they sleep, even
though when running their expected utilization will be the same, they suffer a
......@@ -106,7 +110,7 @@ For more detail see: kernel/sched/fair.c:util_est_dequeue()
UCLAMP
------
======
It is possible to set effective u_min and u_max clamps on each CFS or RT task;
the runqueue keeps a max aggregate of these clamps for all running tasks.
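A minimal sketch of that aggregation (hypothetical helper names; the kernel tracks this per clamp bucket with fixed-point utilization values):

```python
# Hypothetical sketch of the max-aggregation described above.

def rq_uclamp(tasks):
    """tasks: (u_min, u_max) clamp pairs of the currently running tasks."""
    u_min = max((t[0] for t in tasks), default=0.0)
    u_max = max((t[1] for t in tasks), default=1.0)
    return u_min, u_max

def clamp_util(util, tasks):
    u_min, u_max = rq_uclamp(tasks)
    return min(max(util, u_min), u_max)

# A task with u_min=0.5 boosts the clamped utilization of a lightly
# loaded runqueue:
print(clamp_util(0.2, [(0.5, 1.0), (0.0, 0.8)]))  # 0.5
```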
......@@ -115,7 +119,7 @@ For more detail see: include/uapi/linux/sched/types.h
Schedutil / DVFS
----------------
================
Every time the scheduler load tracking is updated (task wakeup, task
migration, time progression) we call out to schedutil to update the hardware
......@@ -123,7 +127,7 @@ DVFS state.
The basis is the CPU runqueue's 'running' metric, which per the above it is
the frequency invariant utilization estimate of the CPU. From this we compute
a desired frequency like:
a desired frequency like::
max( running, util_est ); if UTIL_EST
u_cfs := { running; otherwise
......@@ -135,7 +139,7 @@ a desired frequency like:
f_des := min( f_max, 1.25 u * f_max )
XXX IO-wait; when the update is due to a task wakeup from IO-completion we
XXX IO-wait: when the update is due to a task wakeup from IO-completion we
boost 'u' above.
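Putting the pieces together, the selection can be sketched as follows (a simplified floating-point model of the formulas above, covering the CFS utilization only):

```python
def desired_freq(running, util_est, u_min, u_max, f_max,
                 util_est_enabled=True):
    """Toy model of the desired-frequency computation described above."""
    u_cfs = max(running, util_est) if util_est_enabled else running
    u = min(max(u_cfs, u_min), u_max)        # apply the UCLAMP limits
    return min(f_max, 1.25 * u * f_max)      # 1.25 headroom factor

print(desired_freq(running=0.4, util_est=0.5,
                   u_min=0.0, u_max=1.0, f_max=2.0e9))  # 1.25 GHz
```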
This frequency is then used to select a P-state/OPP or directly munged into a
......@@ -153,7 +157,7 @@ For more information see: kernel/sched/cpufreq_schedutil.c
NOTES
-----
=====
- On low-load scenarios, where DVFS is most relevant, the 'running' numbers
will closely reflect utilization.
......
% -*- coding: utf-8 -*-
% SPDX-License-Identifier: GPL-2.0
%
% LaTeX preamble for "make latexdocs" or "make pdfdocs" including:
% - TOC width settings
% - Setting of tabulary (\tymin)
% - Headheight setting for fancyhdr
% - Fontfamily settings for CJK (Chinese, Japanese, and Korean) translations
%
% Note on the suffix of .sty:
% This is not implemented as a LaTeX style file, but as a file containing
% plain LaTeX code to be included into preamble.
% ".sty" is chosen because ".tex" would cause the build scripts to confuse
% this file with a LaTeX main file.
%
% Copyright (C) 2022 Akira Yokosawa
% Custom width parameters for TOC
% - Redefine low-level commands defined in report.cls.
% - Indent of 2 chars is preserved for ease of comparison.
% Summary of changes from default params:
% Width of page number (\@pnumwidth): 1.55em -> 2.7em
% Width of chapter number: 1.5em -> 1.8em
% Indent of section number: 1.5em -> 1.8em
% Width of section number: 2.6em -> 3.2em
% Indent of subsection number: 4.1em -> 5em
% Width of subsection number: 3.5em -> 4.3em
%
% These params can have 4 digit page counts, 2 digit chapter counts,
% section counts of 4 digits + 1 period (e.g., 18.10), and subsection counts
% of 5 digits + 2 periods (e.g., 18.7.13).
\makeatletter
%% Redefine \@pnumwidth (page number width)
\renewcommand*\@pnumwidth{2.7em}
%% Redefine \l@chapter (chapter list entry)
\renewcommand*\l@chapter[2]{%
\ifnum \c@tocdepth >\m@ne
\addpenalty{-\@highpenalty}%
\vskip 1.0em \@plus\p@
\setlength\@tempdima{1.8em}%
\begingroup
\parindent \z@ \rightskip \@pnumwidth
\parfillskip -\@pnumwidth
\leavevmode \bfseries
\advance\leftskip\@tempdima
\hskip -\leftskip
#1\nobreak\hfil
\nobreak\hb@xt@\@pnumwidth{\hss #2%
\kern-\p@\kern\p@}\par
\penalty\@highpenalty
\endgroup
\fi}
%% Redefine \l@section and \l@subsection
\renewcommand*\l@section{\@dottedtocline{1}{1.8em}{3.2em}}
\renewcommand*\l@subsection{\@dottedtocline{2}{5em}{4.3em}}
\makeatother
%% Sphinx < 1.8 doesn't have \sphinxtableofcontentshook
\providecommand{\sphinxtableofcontentshook}{}
%% Undefine it for compatibility with Sphinx 1.7.9
\renewcommand{\sphinxtableofcontentshook}{} % Empty the hook
% Prevent column squeezing of tabulary. \tymin is set by Sphinx as:
% \setlength{\tymin}{3\fontcharwd\font`0 }
% , which is too short.
\setlength{\tymin}{20em}
% Adjust \headheight for fancyhdr
\addtolength{\headheight}{1.6pt}
\addtolength{\topmargin}{-1.6pt}
% Translations have Asian (CJK) characters which are only displayed if
% xeCJK is used
\IfFontExistsTF{Noto Sans CJK SC}{
% Load xeCJK when CJK font is available
\usepackage{xeCJK}
% Noto CJK fonts don't provide slant shape. [AutoFakeSlant] permits
% its emulation.
% Select KR variant at the beginning of each document so that quotation
% and apostrophe symbols of half width are used in the TOC of Latin documents.
\IfFontExistsTF{Noto Serif CJK KR}{
\setCJKmainfont{Noto Serif CJK KR}[AutoFakeSlant]
}{
\setCJKmainfont{Noto Sans CJK KR}[AutoFakeSlant]
}
\setCJKsansfont{Noto Sans CJK KR}[AutoFakeSlant]
\setCJKmonofont{Noto Sans Mono CJK KR}[AutoFakeSlant]
% Teach xeCJK of half-width symbols
\xeCJKDeclareCharClass{HalfLeft}{`“,`‘}
\xeCJKDeclareCharClass{HalfRight}{`”,`’}
% CJK Language-specific font choices
%% for Simplified Chinese
\IfFontExistsTF{Noto Serif CJK SC}{
\newCJKfontfamily[SCmain]\scmain{Noto Serif CJK SC}[AutoFakeSlant]
\newCJKfontfamily[SCserif]\scserif{Noto Serif CJK SC}[AutoFakeSlant]
}{
\newCJKfontfamily[SCmain]\scmain{Noto Sans CJK SC}[AutoFakeSlant]
\newCJKfontfamily[SCserif]\scserif{Noto Sans CJK SC}[AutoFakeSlant]
}
\newCJKfontfamily[SCsans]\scsans{Noto Sans CJK SC}[AutoFakeSlant]
\newCJKfontfamily[SCmono]\scmono{Noto Sans Mono CJK SC}[AutoFakeSlant]
%% for Traditional Chinese
\IfFontExistsTF{Noto Serif CJK TC}{
\newCJKfontfamily[TCmain]\tcmain{Noto Serif CJK TC}[AutoFakeSlant]
\newCJKfontfamily[TCserif]\tcserif{Noto Serif CJK TC}[AutoFakeSlant]
}{
\newCJKfontfamily[TCmain]\tcmain{Noto Sans CJK TC}[AutoFakeSlant]
\newCJKfontfamily[TCserif]\tcserif{Noto Sans CJK TC}[AutoFakeSlant]
}
\newCJKfontfamily[TCsans]\tcsans{Noto Sans CJK TC}[AutoFakeSlant]
\newCJKfontfamily[TCmono]\tcmono{Noto Sans Mono CJK TC}[AutoFakeSlant]
%% for Korean
\IfFontExistsTF{Noto Serif CJK KR}{
\newCJKfontfamily[KRmain]\krmain{Noto Serif CJK KR}[AutoFakeSlant]
\newCJKfontfamily[KRserif]\krserif{Noto Serif CJK KR}[AutoFakeSlant]
}{
\newCJKfontfamily[KRmain]\krmain{Noto Sans CJK KR}[AutoFakeSlant]
\newCJKfontfamily[KRserif]\krserif{Noto Sans CJK KR}[AutoFakeSlant]
}
\newCJKfontfamily[KRsans]\krsans{Noto Sans CJK KR}[AutoFakeSlant]
\newCJKfontfamily[KRmono]\krmono{Noto Sans Mono CJK KR}[AutoFakeSlant]
%% for Japanese
\IfFontExistsTF{Noto Serif CJK JP}{
\newCJKfontfamily[JPmain]\jpmain{Noto Serif CJK JP}[AutoFakeSlant]
\newCJKfontfamily[JPserif]\jpserif{Noto Serif CJK JP}[AutoFakeSlant]
}{
\newCJKfontfamily[JPmain]\jpmain{Noto Sans CJK JP}[AutoFakeSlant]
\newCJKfontfamily[JPserif]\jpserif{Noto Sans CJK JP}[AutoFakeSlant]
}
\newCJKfontfamily[JPsans]\jpsans{Noto Sans CJK JP}[AutoFakeSlant]
\newCJKfontfamily[JPmono]\jpmono{Noto Sans Mono CJK JP}[AutoFakeSlant]
% Dummy commands for Sphinx < 2.3 (no 'extrapackages' support)
\providecommand{\onehalfspacing}{}
\providecommand{\singlespacing}{}
% Define custom macros to on/off CJK
%% One and half spacing for CJK contents
\newcommand{\kerneldocCJKon}{\makexeCJKactive\onehalfspacing}
\newcommand{\kerneldocCJKoff}{\makexeCJKinactive\singlespacing}
% Define custom macros for switching CJK font setting
%% for Simplified Chinese
\newcommand{\kerneldocBeginSC}{%
\begingroup%
\scmain%
\xeCJKDeclareCharClass{FullLeft}{`“,`‘}% Full-width in SC
\xeCJKDeclareCharClass{FullRight}{`”,`’}% Full-width in SC
\renewcommand{\CJKrmdefault}{SCserif}%
\renewcommand{\CJKsfdefault}{SCsans}%
\renewcommand{\CJKttdefault}{SCmono}%
\xeCJKsetup{CJKspace = false}% gobble white spaces by ' '
% For CJK ascii-art alignment
\setmonofont{Noto Sans Mono CJK SC}[AutoFakeSlant]%
}
\newcommand{\kerneldocEndSC}{\endgroup}
%% for Traditional Chinese
\newcommand{\kerneldocBeginTC}{%
\begingroup%
\tcmain%
\xeCJKDeclareCharClass{FullLeft}{`“,`‘}% Full-width in TC
\xeCJKDeclareCharClass{FullRight}{`”,`’}% Full-width in TC
\renewcommand{\CJKrmdefault}{TCserif}%
\renewcommand{\CJKsfdefault}{TCsans}%
\renewcommand{\CJKttdefault}{TCmono}%
\xeCJKsetup{CJKspace = false}% gobble white spaces by ' '
% For CJK ascii-art alignment
\setmonofont{Noto Sans Mono CJK TC}[AutoFakeSlant]%
}
\newcommand{\kerneldocEndTC}{\endgroup}
%% for Korean
\newcommand{\kerneldocBeginKR}{%
\begingroup%
\krmain%
\renewcommand{\CJKrmdefault}{KRserif}%
\renewcommand{\CJKsfdefault}{KRsans}%
\renewcommand{\CJKttdefault}{KRmono}%
% \xeCJKsetup{CJKspace = true} % true by default
% For CJK ascii-art alignment (still misaligned for Hangul)
\setmonofont{Noto Sans Mono CJK KR}[AutoFakeSlant]%
}
\newcommand{\kerneldocEndKR}{\endgroup}
%% for Japanese
\newcommand{\kerneldocBeginJP}{%
\begingroup%
\jpmain%
\renewcommand{\CJKrmdefault}{JPserif}%
\renewcommand{\CJKsfdefault}{JPsans}%
\renewcommand{\CJKttdefault}{JPmono}%
\xeCJKsetup{CJKspace = false}% gobble white space by ' '
% For CJK ascii-art alignment
\setmonofont{Noto Sans Mono CJK JP}[AutoFakeSlant]%
}
\newcommand{\kerneldocEndJP}{\endgroup}
% Single spacing in literal blocks
\fvset{baselinestretch=1}
% To customize \sphinxtableofcontents
\usepackage{etoolbox}
% Inactivate CJK after tableofcontents
\apptocmd{\sphinxtableofcontents}{\kerneldocCJKoff}{}{}
\xeCJKsetup{CJKspace = true}% For inter-phrase space of Korean TOC
}{ % No CJK font found
% Custom macros to on/off CJK and switch CJK fonts (Dummy)
\newcommand{\kerneldocCJKon}{}
\newcommand{\kerneldocCJKoff}{}
%% By defining \kerneldocBegin(SC|TC|KR|JP) as commands with an argument
%% and ignoring the argument (#1) in their definitions, the whole contents
%% of CJK chapters can be ignored.
\newcommand{\kerneldocBeginSC}[1]{%
%% Put a note on missing CJK fonts in place of zh_CN translation.
\begin{sphinxadmonition}{note}{Note on missing fonts:}
Translations of Simplified Chinese (zh\_CN), Traditional Chinese
(zh\_TW), Korean (ko\_KR), and Japanese (ja\_JP) were skipped
due to the lack of suitable font families.
If you want them, please install ``Noto Sans CJK'' font families
by following instructions from
\sphinxcode{./scripts/sphinx-pre-install}.
Having optional ``Noto Serif CJK'' font families will improve
the looks of those translations.
\end{sphinxadmonition}}
\newcommand{\kerneldocEndSC}{}
\newcommand{\kerneldocBeginTC}[1]{}
\newcommand{\kerneldocEndTC}{}
\newcommand{\kerneldocBeginKR}[1]{}
\newcommand{\kerneldocEndKR}{}
\newcommand{\kerneldocBeginJP}[1]{}
\newcommand{\kerneldocEndJP}{}
}
......@@ -31,10 +31,13 @@ u"""
* ``dot(1)``: Graphviz (https://www.graphviz.org). If Graphviz is not
available, the DOT language is inserted as literal-block.
For conversion to PDF, ``rsvg-convert(1)`` of librsvg
(https://gitlab.gnome.org/GNOME/librsvg) is used when available.
* SVG to PDF: To generate PDF, you need at least one of these tools:
- ``convert(1)``: ImageMagick (https://www.imagemagick.org)
- ``inkscape(1)``: Inkscape (https://inkscape.org/)
List of customizations:
......@@ -49,6 +52,7 @@ import os
from os import path
import subprocess
from hashlib import sha1
import re
from docutils import nodes
from docutils.statemachine import ViewList
from docutils.parsers.rst import directives
......@@ -109,10 +113,20 @@ def pass_handle(self, node): # pylint: disable=W0613
# Graphviz's dot(1) support
dot_cmd = None
# dot(1) -Tpdf should be used
dot_Tpdf = False
# ImageMagick' convert(1) support
convert_cmd = None
# librsvg's rsvg-convert(1) support
rsvg_convert_cmd = None
# Inkscape's inkscape(1) support
inkscape_cmd = None
# Inkscape prior to 1.0 uses different command options
inkscape_ver_one = False
def setup(app):
# check toolchain first
......@@ -160,23 +174,62 @@ def setupTools(app):
This function is called once, when the builder is initiated.
"""
global dot_cmd, convert_cmd # pylint: disable=W0603
global dot_cmd, dot_Tpdf, convert_cmd, rsvg_convert_cmd # pylint: disable=W0603
global inkscape_cmd, inkscape_ver_one # pylint: disable=W0603
kernellog.verbose(app, "kfigure: check installed tools ...")
dot_cmd = which('dot')
convert_cmd = which('convert')
rsvg_convert_cmd = which('rsvg-convert')
inkscape_cmd = which('inkscape')
if dot_cmd:
kernellog.verbose(app, "use dot(1) from: " + dot_cmd)
try:
dot_Thelp_list = subprocess.check_output([dot_cmd, '-Thelp'],
stderr=subprocess.STDOUT)
except subprocess.CalledProcessError as err:
dot_Thelp_list = err.output
pass
dot_Tpdf_ptn = b'pdf'
dot_Tpdf = re.search(dot_Tpdf_ptn, dot_Thelp_list)
else:
kernellog.warn(app, "dot(1) not found, for better output quality install "
"graphviz from https://www.graphviz.org")
if convert_cmd:
kernellog.verbose(app, "use convert(1) from: " + convert_cmd)
if inkscape_cmd:
kernellog.verbose(app, "use inkscape(1) from: " + inkscape_cmd)
inkscape_ver = subprocess.check_output([inkscape_cmd, '--version'],
stderr=subprocess.DEVNULL)
ver_one_ptn = b'Inkscape 1'
inkscape_ver_one = re.search(ver_one_ptn, inkscape_ver)
convert_cmd = None
rsvg_convert_cmd = None
dot_Tpdf = False
else:
kernellog.warn(app,
"convert(1) not found, for SVG to PDF conversion install "
"ImageMagick (https://www.imagemagick.org)")
if convert_cmd:
kernellog.verbose(app, "use convert(1) from: " + convert_cmd)
else:
kernellog.warn(app,
"Neither inkscape(1) nor convert(1) found.\n"
"For SVG to PDF conversion, "
"install either Inkscape (https://inkscape.org/) (preferred) or\n"
"ImageMagick (https://www.imagemagick.org)")
if rsvg_convert_cmd:
kernellog.verbose(app, "use rsvg-convert(1) from: " + rsvg_convert_cmd)
kernellog.verbose(app, "use 'dot -Tsvg' and rsvg-convert(1) for DOT -> PDF conversion")
dot_Tpdf = False
else:
kernellog.verbose(app,
"rsvg-convert(1) not found.\n"
" SVG rendering of convert(1) is done by ImageMagick-native renderer.")
if dot_Tpdf:
kernellog.verbose(app, "use 'dot -Tpdf' for DOT -> PDF conversion")
else:
kernellog.verbose(app, "use 'dot -Tsvg' and convert(1) for DOT -> PDF conversion")
# integrate conversion tools
......@@ -242,7 +295,7 @@ def convert_image(img_node, translator, src_fname=None):
elif in_ext == '.svg':
if translator.builder.format == 'latex':
if convert_cmd is None:
if not inkscape_cmd and convert_cmd is None:
kernellog.verbose(app,
"no SVG to PDF conversion available / include SVG raw.")
img_node.replace_self(file2literal(src_fname))
......@@ -266,7 +319,14 @@ def convert_image(img_node, translator, src_fname=None):
if in_ext == '.dot':
kernellog.verbose(app, 'convert DOT to: {out}/' + _name)
ok = dot2format(app, src_fname, dst_fname)
if translator.builder.format == 'latex' and not dot_Tpdf:
svg_fname = path.join(translator.builder.outdir, fname + '.svg')
ok1 = dot2format(app, src_fname, svg_fname)
ok2 = svg2pdf_by_rsvg(app, svg_fname, dst_fname)
ok = ok1 and ok2
else:
ok = dot2format(app, src_fname, dst_fname)
elif in_ext == '.svg':
kernellog.verbose(app, 'convert SVG to: {out}/' + _name)
......@@ -303,22 +363,70 @@ def dot2format(app, dot_fname, out_fname):
return bool(exit_code == 0)
def svg2pdf(app, svg_fname, pdf_fname):
"""Converts SVG to PDF with ``convert(1)`` command.
"""Converts SVG to PDF with ``inkscape(1)`` or ``convert(1)`` command.
Uses ``convert(1)`` from ImageMagick (https://www.imagemagick.org) for
conversion. Returns ``True`` on success and ``False`` if an error occurred.
Uses ``inkscape(1)`` from Inkscape (https://inkscape.org/) or ``convert(1)``
from ImageMagick (https://www.imagemagick.org) for conversion.
Returns ``True`` on success and ``False`` if an error occurred.
* ``svg_fname`` pathname of the input SVG file with extension (``.svg``)
* ``pdf_name`` pathname of the output PDF file with extension (``.pdf``)
"""
cmd = [convert_cmd, svg_fname, pdf_fname]
# use stdout and stderr from parent
exit_code = subprocess.call(cmd)
cmd_name = 'convert(1)'
if inkscape_cmd:
cmd_name = 'inkscape(1)'
if inkscape_ver_one:
cmd = [inkscape_cmd, '-o', pdf_fname, svg_fname]
else:
cmd = [inkscape_cmd, '-z', '--export-pdf=%s' % pdf_fname, svg_fname]
try:
warning_msg = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
exit_code = 0
except subprocess.CalledProcessError as err:
warning_msg = err.output
exit_code = err.returncode
pass
if exit_code != 0:
kernellog.warn(app, "Error #%d when calling: %s" % (exit_code, " ".join(cmd)))
if warning_msg:
kernellog.warn(app, "Warning msg from %s: %s"
% (cmd_name, str(warning_msg, 'utf-8')))
elif warning_msg:
kernellog.verbose(app, "Warning msg from %s (likely harmless):\n%s"
% (cmd_name, str(warning_msg, 'utf-8')))
return bool(exit_code == 0)
def svg2pdf_by_rsvg(app, svg_fname, pdf_fname):
"""Convert SVG to PDF with ``rsvg-convert(1)`` command.
* ``svg_fname`` pathname of input SVG file, including extension ``.svg``
* ``pdf_fname`` pathname of output PDF file, including extension ``.pdf``
Input SVG file should be the one generated by ``dot2format()``.
SVG -> PDF conversion is done by ``rsvg-convert(1)``.
If ``rsvg-convert(1)`` is unavailable, fall back to ``svg2pdf()``.
"""
if rsvg_convert_cmd is None:
ok = svg2pdf(app, svg_fname, pdf_fname)
else:
cmd = [rsvg_convert_cmd, '--format=pdf', '-o', pdf_fname, svg_fname]
# use stdout and stderr from parent
exit_code = subprocess.call(cmd)
if exit_code != 0:
kernellog.warn(app, "Error #%d when calling: %s" % (exit_code, " ".join(cmd)))
ok = bool(exit_code == 0)
return ok
# image handling
# ---------------------
......
......@@ -51,7 +51,7 @@ For example::
[root@f32 ~]# cd /sys/kernel/tracing/
[root@f32 tracing]# echo osnoise > current_tracer
It is possible to follow the trace by reading the trace trace file::
It is possible to follow the trace by reading the trace file::
[root@f32 tracing]# cat trace
# tracer: osnoise
......@@ -108,7 +108,7 @@ The tracer has a set of options inside the osnoise directory, they are:
option.
- tracing_threshold: the minimum delta between two time() reads to be
considered as noise, in us. When set to 0, the default value will
will be used, which is currently 5 us.
be used, which is currently 5 us.
Additional Tracing
------------------
......
# -*- coding: utf-8 -*-
# SPDX-License-Identifier: GPL-2.0
# -- Additional options for LaTeX output ----------------------------------
# font config for ascii-art alignment
latex_elements['preamble'] += '''
\\IfFontExistsTF{Noto Sans CJK SC}{
% For CJK ascii-art alignment
\\setmonofont{Noto Sans Mono CJK SC}[AutoFakeSlant]
}{}
'''
......@@ -3,7 +3,7 @@
\renewcommand\thesection*
\renewcommand\thesubsection*
\kerneldocCJKon
\kerneldocBeginJP
\kerneldocBeginJP{
Japanese translations
=====================
......@@ -15,4 +15,4 @@ Japanese translations
.. raw:: latex
\kerneldocEndJP
}\kerneldocEndJP
......@@ -3,7 +3,7 @@
\renewcommand\thesection*
\renewcommand\thesubsection*
\kerneldocCJKon
\kerneldocBeginKR
\kerneldocBeginKR{
한국어 번역
===========
......@@ -26,5 +26,4 @@
.. raw:: latex
\normalsize
\kerneldocEndKR
}\kerneldocEndKR
......@@ -17,6 +17,8 @@ a) 等待一个CPU(任务为可运行)
b) 完成由该任务发起的块I/O同步请求
c) 页面交换
d) 内存回收
e) 页缓存抖动
f) 直接规整
并将这些统计信息通过taskstats接口提供给用户空间。
......@@ -37,10 +39,10 @@ d) 内存回收
向用户态返回一个通用数据结构,对应每pid或每tgid的统计信息。延时计数功能填写
该数据结构的特定字段。见
include/linux/taskstats.h
include/uapi/linux/taskstats.h
其描述了延时计数相关字段。系统通常以计数器形式返回 CPU、同步块 I/O、交换、内存
回收等的累积延时。
回收、页缓存抖动、直接规整等的累积延时。
取任务某计数器两个连续读数的差值,将得到任务在该时间间隔内等待对应资源的总延时。
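例如(假设的读数,仅作示意):

```python
# 假设的两次读数(单位:纳秒),取差值即该区间内等待CPU的总延时
t0_cpu_delay_total = 24001500   # 第一次读数
t1_cpu_delay_total = 24101500   # 第二次读数
print(t1_cpu_delay_total - t0_cpu_delay_total)  # 100000 ns
```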
......@@ -72,40 +74,36 @@ kernel.task_delayacct进行开关。注意,只有在启用延时计数后启
getdelays命令的一般格式::
getdelays [-t tgid] [-p pid] [-c cmd...]
getdelays [-dilv] [-t tgid] [-p pid]
获取pid为10的任务从系统启动后的延时信息::
# ./getdelays -p 10
# ./getdelays -d -p 10
(输出信息和下例相似)
获取所有tgid为5的任务从系统启动后的总延时信息::
# ./getdelays -t 5
CPU count real total virtual total delay total
7876 92005750 100000000 24001500
IO count delay total
0 0
SWAP count delay total
0 0
RECLAIM count delay total
0 0
获取指定简单命令运行时的延时信息::
# ./getdelays -c ls /
bin data1 data3 data5 dev home media opt root srv sys usr
boot data2 data4 data6 etc lib mnt proc sbin subdomain tmp var
CPU count real total virtual total delay total
6 4000250 4000000 0
IO count delay total
0 0
SWAP count delay total
0 0
RECLAIM count delay total
0 0
# ./getdelays -d -t 5
print delayacct stats ON
TGID 5
CPU count real total virtual total delay total delay average
8 7000000 6872122 3382277 0.423ms
IO count delay total delay average
0 0 0ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
0 0 0ms
THRASHING count delay total delay average
0 0 0ms
COMPACT count delay total delay average
0 0 0ms
获取pid为1的IO计数,它只和-p一起使用::
# ./getdelays -i -p 1
printing IO accounting
linuxrc: read=65536, write=0, cancelled_write=0
上面的命令与-v一起使用,可以获取更多调试信息。
......@@ -20,15 +20,15 @@ Linux 内核用户和管理员指南
Todolist:
kernel-parameters
devices
sysctl/index
* kernel-parameters
* devices
* sysctl/index
本节介绍CPU漏洞及其缓解措施。
Todolist:
hw-vuln/index
* hw-vuln/index
下面的一组文档,针对的是试图跟踪问题和bug的用户。
......@@ -44,18 +44,18 @@ Todolist:
Todolist:
reporting-bugs
ramoops
dynamic-debug-howto
kdump/index
perf/index
* reporting-bugs
* ramoops
* dynamic-debug-howto
* kdump/index
* perf/index
这是应用程序开发人员感兴趣的章节的开始。可以在这里找到涵盖内核ABI各个
方面的文档。
Todolist:
sysfs-rules
* sysfs-rules
本手册的其余部分包括各种指南,介绍如何根据您的喜好配置内核的特定行为。
......@@ -69,61 +69,61 @@ Todolist:
lockup-watchdogs
unicode
sysrq
mm/index
Todolist:
acpi/index
aoe/index
auxdisplay/index
bcache
binderfs
binfmt-misc
blockdev/index
bootconfig
braille-console
btmrvl
cgroup-v1/index
cgroup-v2
cifs/index
dell_rbu
device-mapper/index
edid
efi-stub
ext4
nfs/index
gpio/index
highuid
hw_random
initrd
iostats
java
jfs
kernel-per-CPU-kthreads
laptops/index
lcd-panel-cgram
ldm
LSM/index
md
media/index
mm/index
module-signing
mono
namespaces/index
numastat
parport
perf-security
pm/index
pnp
rapidio
ras
rtc
serial-console
svga
thunderbolt
ufs
vga-softcursor
video-output
xfs
* acpi/index
* aoe/index
* auxdisplay/index
* bcache
* binderfs
* binfmt-misc
* blockdev/index
* bootconfig
* braille-console
* btmrvl
* cgroup-v1/index
* cgroup-v2
* cifs/index
* dell_rbu
* device-mapper/index
* edid
* efi-stub
* ext4
* nfs/index
* gpio/index
* highuid
* hw_random
* initrd
* iostats
* java
* jfs
* kernel-per-CPU-kthreads
* laptops/index
* lcd-panel-cgram
* ldm
* LSM/index
* md
* media/index
* module-signing
* mono
* namespaces/index
* numastat
* parport
* perf-security
* pm/index
* pnp
* rapidio
* ras
* rtc
* serial-console
* svga
* thunderbolt
* ufs
* vga-softcursor
* video-output
* xfs
.. only:: subproject and html
......
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../../../disclaimer-zh_CN.rst
:Original: Documentation/admin-guide/mm/damon/index.rst
:翻译:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:校译:
============
监测数据访问
============
:doc:`DAMON </vm/damon/index>` 允许轻量级的数据访问监测。使用DAMON,
用户可以分析他们系统的内存访问模式,并优化它们。
.. toctree::
:maxdepth: 2
start
usage
reclaim
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../../../disclaimer-zh_CN.rst
:Original: Documentation/admin-guide/mm/damon/reclaim.rst
:翻译:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:校译:
===============
基于DAMON的回收
===============
基于DAMON的回收(DAMON_RECLAIM)是一个静态的内核模块,旨在用于轻度内存压力下的主动和轻
量级的回收。它的目的不是取代基于LRU列表的页面回收,而是有选择地用于不同程度的内存压力和要
求。
哪些地方需要主动回收?
======================
在一般的内存超量使用(over-committed systems,虚拟化相关术语)的系统上,主动回收冷页
有助于节省内存和减少延迟高峰,这些延迟是由直接回收进程或kswapd的CPU消耗引起的,同时只产
生最小的性能下降 [1]_ [2]_ 。
基于空闲页报告 [3]_ 的内存过度承诺的虚拟化系统就是很好的例子。在这样的系统中,客户机
向主机报告他们的空闲内存,而主机则将报告的内存重新分配给其他客户。因此,系统的内存得到了充
分的利用。然而,客户可能不那么节省内存,主要是因为一些内核子系统和用户空间应用程序被设计为
使用尽可能多的内存。然后,客户机可能只向主机报告少量的内存是空闲的,导致系统的内存利用率下降。
在客户中运行主动回收可以缓解这个问题。
它是如何工作的?
================
DAMON_RECLAIM找到在特定时间内没有被访问的内存区域并分页。为了避免它在分页操作中消耗过多
的CPU,可以配置一个速度限制。在这个速度限制下,它首先分页出那些没有被访问过的内存区域。系
统管理员还可以配置在什么情况下这个方案应该自动激活和停用三个内存压力水位。
接口: 模块参数
==============
要使用这个功能,你首先要确保你的系统运行在一个以 ``CONFIG_DAMON_RECLAIM=y`` 构建的内
核上。
为了让系统管理员启用或禁用它,并为给定的系统进行调整,DAMON_RECLAIM利用了模块参数。也就
是说,你可以把 ``damon_reclaim.<parameter>=<value>`` 放在内核启动命令行上,或者把
适当的值写入 ``/sys/module/damon_reclaim/parameters/<parameter>`` 文件。
注意,除 ``enabled`` 外的参数值只在DAMON_RECLAIM启动时应用。因此,如果你想在运行时应用新
的参数值,而DAMON_RECLAIM已经被启用,你应该通过 ``enabled`` 参数文件禁用和重新启用它。
在重新启用之前,应将新的参数值写入适当的参数文件中。
下面是每个参数的描述。
enabled
-------
启用或禁用DAMON_RECLAIM。
你可以通过把这个参数的值设置为 ``Y`` 来启用DAMON_RCLAIM,把它设置为 ``N`` 可以禁用
DAMON_RECLAIM。注意,由于基于水位的激活条件,DAMON_RECLAIM不能进行真正的监测和回收。
这一点请参考下面关于水位参数的描述。
min_age
-------
识别冷内存区域的时间阈值,单位是微秒。
如果一个内存区域在这个时间或更长的时间内没有被访问,DAMON_RECLAIM会将该区域识别为冷的,
并回收它。
默认为120秒。
quota_ms
--------
回收的时间限制,以毫秒为单位。
DAMON_RECLAIM 试图在一个时间窗口(quota_reset_interval_ms)内只使用到这个时间,以
尝试回收冷页。这可以用来限制DAMON_RECLAIM的CPU消耗。如果该值为零,则该限制被禁用。
默认为10ms。
quota_sz
--------
回收的内存大小限制,单位为字节。
DAMON_RECLAIM 收取在一个时间窗口(quota_reset_interval_ms)内试图回收的内存量,并
使其不超过这个限制。这可以用来限制CPU和IO的消耗。如果该值为零,则限制被禁用。
默认情况下是128 MiB。
quota_reset_interval_ms
-----------------------
时间/大小配额收取重置间隔,单位为毫秒。
时间(quota_ms)和大小(quota_sz)配额的重置间隔。也就是说,DAMON_RECLAIM在每个
quota_reset_interval_ms 毫秒内,尝试回收所用的时间不超过quota_ms毫秒,回收的内存
不超过quota_sz字节。
默认为1秒。
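配额的收取与重置逻辑大致可以用下面的shell片段来理解(仅为示意,并非内核实现,变量名均为假设):
在一个重置间隔内,累计收取量达到quota_sz后就不再回收;间隔到期后收取量清零:

```shell
quota_sz=128   # 本示意以MiB为单位
charged=0      # 本间隔内已收取的回收量

# try_reclaim <希望回收的MiB数>:结果放在 $ret 中,超出配额的部分不回收
try_reclaim() {
    room=$((quota_sz - charged))
    if [ "$1" -lt "$room" ]; then ret=$1; else ret=$room; fi
    charged=$((charged + ret))
}

try_reclaim 100; got1=$ret   # 100,配额还剩28
try_reclaim 100; got2=$ret   # 28,配额用尽
charged=0                    # quota_reset_interval_ms 到期,收取量清零
try_reclaim 100; got3=$ret   # 100
```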
wmarks_interval
---------------
当DAMON_RECLAIM被启用但由于其水位规则而不活跃时,在检查水位之前的最小等待时间。
wmarks_high
-----------
高水位的可用内存率(千分比)。
如果系统的可用内存率(以千分比计)高于这个数值,DAMON_RECLAIM就会变得不活跃,所以
它什么也不做,只是定期检查水位。
wmarks_mid
----------
中间水位的可用内存率(千分比)。
如果系统的可用内存率(以千分比计)在这个数值和低水位线之间,DAMON_RECLAIM就会被激活,
因此开始监测和回收。
wmarks_low
----------
低水位的可用内存率(千分比)。
如果系统的可用内存率(以千分比计)低于这个数值,DAMON_RECLAIM就会变得不活跃,所以
它除了定期检查水位外什么都不做。在这种情况下,系统会退回到基于LRU列表的页面粒度回收逻辑。
sample_interval
---------------
监测的采样间隔,单位是微秒。
DAMON用于监测冷内存的采样间隔。更多细节请参考DAMON文档 (:doc:`usage`) 。
aggr_interval
-------------
监测的聚集间隔,单位是微秒。
DAMON对冷内存监测的聚集间隔。更多细节请参考DAMON文档 (:doc:`usage`)。
min_nr_regions
--------------
监测区域的最小数量。
DAMON用于冷内存监测的最小监测区域数。这可以用来设置监测质量的下限。但是,设
置的太高可能会导致监测开销的增加。更多细节请参考DAMON文档 (:doc:`usage`) 。
max_nr_regions
--------------
监测区域的最大数量。
DAMON用于冷内存监测的最大监测区域数。这可以用来设置监测开销的上限值。但是,
设置得太低可能会导致监测质量不好。更多细节请参考DAMON文档 (:doc:`usage`) 。
monitor_region_start
--------------------
目标内存区域的物理地址起点。
DAMON_RECLAIM将对其进行工作的内存区域的起始物理地址。也就是说,DAMON_RECLAIM
将在这个区域中找到冷的内存区域并进行回收。默认情况下,该区域为系统内存中最大的System RAM区域。
monitor_region_end
------------------
目标内存区域的结束物理地址。
DAMON_RECLAIM将对其进行工作的内存区域的末端物理地址。也就是说,DAMON_RECLAIM将
在这个区域内找到冷的内存区域并进行回收。默认情况下,该区域为系统内存中最大的System RAM区域。
kdamond_pid
-----------
DAMON线程的PID。
如果DAMON_RECLAIM被启用,这将成为工作线程的PID。否则,为-1。
nr_reclaim_tried_regions
------------------------
试图通过DAMON_RECLAIM回收的内存区域的数量。
bytes_reclaim_tried_regions
---------------------------
试图通过DAMON_RECLAIM回收的内存区域的总字节数。
nr_reclaimed_regions
--------------------
通过DAMON_RECLAIM成功回收的内存区域的数量。
bytes_reclaimed_regions
-----------------------
通过DAMON_RECLAIM成功回收的内存区域的总字节数。
nr_quota_exceeds
----------------
超过时间/空间配额限制的次数。
例子
====
下面的示例命令使DAMON_RECLAIM找到30秒或更长时间没有被访问的内存区域并将其换出。
为了避免DAMON_RECLAIM在分页操作中消耗过多的CPU时间,回收被限制在每秒1GiB以内。
它还要求DAMON_RECLAIM在系统的可用内存率超过50%时不做任何事情,但如果它低于40%时
就开始真正的工作。如果DAMON_RECLAIM没有取得进展,因此空闲内存率低于20%,它会要求
DAMON_RECLAIM再次什么都不做,这样我们就可以退回到基于LRU列表的页面粒度回收了::
# cd /sys/module/damon_reclaim/parameters
# echo 30000000 > min_age
# echo $((1 * 1024 * 1024 * 1024)) > quota_sz
# echo 1000 > quota_reset_interval_ms
# echo 500 > wmarks_high
# echo 400 > wmarks_mid
# echo 200 > wmarks_low
# echo Y > enabled
.. [1] https://research.google/pubs/pub48551/
.. [2] https://lwn.net/Articles/787611/
.. [3] https://www.kernel.org/doc/html/latest/vm/free_page_reporting.html
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../../../disclaimer-zh_CN.rst
:Original: Documentation/admin-guide/mm/damon/start.rst
:翻译:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:校译:
========
入门指南
========
本文通过演示DAMON的默认用户空间工具,简要地介绍了如何使用DAMON。请注意,为了简洁
起见,本文档只描述了它的部分功能。更多细节请参考该工具的
`使用文档 <https://github.com/awslabs/damo/blob/next/USAGE.md>`_ 。
前提条件
========
内核
----
首先,你要确保你当前系统中运行的内核在构建时启用了 ``CONFIG_DAMON_*=y`` 这些功能选项。
用户空间工具
------------
在演示中,我们将使用DAMON的默认用户空间工具,称为DAMON Operator(DAMO)。它可以在
https://github.com/awslabs/damo 找到。下面的例子假设DAMO在你的 ``$PATH`` 中,当然
这并不是强制性的。
因为DAMO使用的是DAMON的debugfs接口(详情请参考 :doc:`usage` ),你应该确保debugfs
已被挂载。可以像下面这样手动挂载它::
# mount -t debugfs none /sys/kernel/debug/
或者在你的 ``/etc/fstab`` 文件中添加以下一行,这样你的系统就可以在启动时自动挂载
debugfs了::
debugfs /sys/kernel/debug debugfs defaults 0 0
记录数据访问模式
================
下面的命令记录了一个程序的内存访问模式,并将监测结果保存到文件中。 ::
$ git clone https://github.com/sjp38/masim
$ cd masim; make; ./masim ./configs/zigzag.cfg &
$ sudo damo record -o damon.data $(pidof masim)
命令的前两行下载了一个人工内存访问生成器程序并在后台运行。生成器将重复地逐一访问两个
100 MiB大小的内存区域。你可以用你的真实工作负载来代替它。最后一行要求 ``damo`` 将
访问模式记录在 ``damon.data`` 文件中。
将记录的模式可视化
==================
你可以在heatmap中直观地看到这种模式,显示哪个内存区域(X轴)何时被访问(Y轴)以及访
问的频率(数字)。::
$ sudo damo report heats --heatmap stdout
22222222222222222222222222222222222222211111111111111111111111111111111111111100
44444444444444444444444444444444444444434444444444444444444444444444444444443200
44444444444444444444444444444444444444433444444444444444444444444444444444444200
33333333333333333333333333333333333333344555555555555555555555555555555555555200
33333333333333333333333333333333333344444444444444444444444444444444444444444200
22222222222222222222222222222222222223355555555555555555555555555555555555555200
00000000000000000000000000000000000000288888888888888888888888888888888888888400
00000000000000000000000000000000000000288888888888888888888888888888888888888400
33333333333333333333333333333333333333355555555555555555555555555555555555555200
88888888888888888888888888888888888888600000000000000000000000000000000000000000
88888888888888888888888888888888888888600000000000000000000000000000000000000000
33333333333333333333333333333333333333444444444444444444444444444444444444443200
00000000000000000000000000000000000000288888888888888888888888888888888888888400
[...]
# access_frequency: 0 1 2 3 4 5 6 7 8 9
# x-axis: space (139728247021568-139728453431248: 196.848 MiB)
# y-axis: time (15256597248362-15326899978162: 1 m 10.303 s)
# resolution: 80x40 (2.461 MiB and 1.758 s for each character)
你也可以直观地看到工作集的大小分布,按大小排序。::
$ sudo damo report wss --range 0 101 10
# <percentile> <wss>
# target_id 18446632103789443072
# avr: 107.708 MiB
0 0 B | |
10 95.328 MiB |**************************** |
20 95.332 MiB |**************************** |
30 95.340 MiB |**************************** |
40 95.387 MiB |**************************** |
50 95.387 MiB |**************************** |
60 95.398 MiB |**************************** |
70 95.398 MiB |**************************** |
80 95.504 MiB |**************************** |
90 190.703 MiB |********************************************************* |
100 196.875 MiB |***********************************************************|
在上述命令中使用 ``--sortby`` 选项,可以显示工作集的大小是如何按时间顺序变化的。::
$ sudo damo report wss --range 0 101 10 --sortby time
# <percentile> <wss>
# target_id 18446632103789443072
# avr: 107.708 MiB
0 3.051 MiB | |
10 190.703 MiB |***********************************************************|
20 95.336 MiB |***************************** |
30 95.328 MiB |***************************** |
40 95.387 MiB |***************************** |
50 95.332 MiB |***************************** |
60 95.320 MiB |***************************** |
70 95.398 MiB |***************************** |
80 95.398 MiB |***************************** |
90 95.340 MiB |***************************** |
100 95.398 MiB |***************************** |
数据访问模式感知的内存管理
==========================
以下三个命令使你的工作负载中,每一个大小大于等于4K、且在大于等于60秒的时间内没有被访问的内存区域都被换出。 ::
$ echo "#min-size max-size min-acc max-acc min-age max-age action" > test_scheme
$ echo "4K max 0 0 60s max pageout" >> test_scheme
$ damo schemes -c test_scheme <pid of your workload>
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../../../disclaimer-zh_CN.rst
:Original: Documentation/admin-guide/mm/damon/usage.rst
:翻译:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:校译:
========
详细用法
========
DAMON 为不同的用户提供了下面三种接口。
- *DAMON用户空间工具。*
`这 <https://github.com/awslabs/damo>`_ 是为希望得到一个开箱即用的人性化界面的特权
用户(如系统管理员)准备的。
使用它,用户可以以人性化的方式使用DAMON的主要功能。不过,它可能没有为特殊情况做高度调优。
它同时支持虚拟和物理地址空间的监测。更多细节,请参考它的 `使用文档
<https://github.com/awslabs/damo/blob/next/USAGE.md>`_。
- *debugfs接口。*
:ref:`这 <debugfs_interface>` 是为那些希望更高级地使用DAMON的特权用户空间程序员准备的。
使用它,用户可以通过读取和写入特殊的debugfs文件来使用DAMON的主要功能。因此,你可以编写和使
用你个性化的DAMON debugfs包装程序,代替你读/写debugfs文件。 `DAMON用户空间工具
<https://github.com/awslabs/damo>`_ 就是这种程序的一个例子。它同时支持虚拟和物理地址
空间的监测。注意,这个接口只提供简单的监测结果 :ref:`统计 <damos_stats>` 。对于详细的监测
结果,DAMON提供了一个 :ref:`跟踪点 <tracepoint>` 。
- *内核空间编程接口。*
:doc:`这 </vm/damon/api>` 是为内核空间程序员准备的。使用它,用户可以通过编写内
核空间的DAMON应用程序,最灵活、最有效地利用DAMON的每一个功能,甚至可以为各种地址空间扩展DAMON。
详细情况请参考接口 :doc:`文档 </vm/damon/api>` 。
debugfs接口
===========
DAMON在其debugfs目录 ``<debugfs>/damon/`` 下导出了八个文件: ``attrs`` 、
``target_ids`` 、 ``init_regions`` 、 ``schemes`` 、 ``monitor_on`` 、
``kdamond_pid`` 、 ``mk_contexts`` 和 ``rm_contexts`` 。
属性
----
用户可以通过读取和写入 ``attrs`` 文件获得和设置 ``采样间隔`` 、 ``聚集间隔`` 、 ``区域更新间隔``
以及监测目标区域的最小/最大数量。要详细了解监测属性,请参考 :doc:`/vm/damon/design` 。例如,
下面的命令将这些值设置为5ms、100ms、1000ms、10和1000,然后再次检查::
# cd <debugfs>/damon
# echo 5000 100000 1000000 10 1000 > attrs
# cat attrs
5000 100000 1000000 10 1000
目标ID
------
一些类型的地址空间支持多个监测目标。例如,虚拟内存地址空间的监测可以有多个进程作为监测目标。用户
可以通过写入目标的相关id值来设置目标,并通过读取 ``target_ids`` 文件来获得当前目标的id。在监
测虚拟地址空间的情况下,这些值应该是监测目标进程的pid。例如,下面的命令将pid为42和4242的进程设
为监测目标,并再次检查::
# cd <debugfs>/damon
# echo 42 4242 > target_ids
# cat target_ids
42 4242
用户还可以通过在文件中写入一个特殊的关键字 "paddr\n" 来监测系统的物理内存地址空间。因为物理地
址空间监测不支持多个目标,读取该文件会显示一个假值 ``42`` ,如下所示::
# cd <debugfs>/damon
# echo paddr > target_ids
# cat target_ids
42
请注意,设置目标ID并不启动监测。
初始监测目标区域
----------------
在虚拟地址空间监测的情况下,DAMON自动设置和更新监测的目标区域,这样就可以覆盖目标进程的整个
内存映射。然而,用户可能希望将监测区域限制在特定的地址范围内,如堆、栈或特定的文件映射区域。
或者,一些用户可以知道他们工作负载的初始访问模式,因此希望为“自适应区域调整”设置最佳初始区域。
相比之下,DAMON在物理内存监测的情况下不会自动设置和更新监测目标区域。因此,用户应该自己设置
监测目标区域。
在这种情况下,用户可以通过在 ``init_regions`` 文件中写入适当的值,明确地设置他们想要的初
始监测目标区域。输入的每一行应代表一个区域,形式如下::
<target idx> <start address> <end address>
目标idx应该是 ``target_ids`` 文件中目标的索引,从 ``0`` 开始,区域应该按照地址顺序传递。
例如,下面的命令将设置几个地址范围, ``1-100`` 和 ``100-200`` 作为pid 42的初始监测目标
区域,这是 ``target_ids`` 中的第一个(索引 ``0`` ),另外几个地址范围, ``20-40`` 和
``50-100`` 作为pid 4242的地址,这是 ``target_ids`` 中的第二个(索引 ``1`` )::
# cd <debugfs>/damon
# cat target_ids
42 4242
# echo "0 1 100
0 100 200
1 20 40
1 50 100" > init_regions
请注意,这只是设置了初始的监测目标区域。在虚拟内存监测的情况下,DAMON会在一个 ``区域更新间隔``
后自动更新区域的边界。因此,在这种情况下,如果用户不希望更新的话,应该把 ``区域的更新间隔`` 设
置得足够大。
方案
----
对于通常的基于DAMON的数据访问感知的内存管理优化,用户只是希望系统对特定访问模式的内存区域应用内
存管理操作。DAMON从用户那里接收这种形式化的操作方案,并将这些方案应用到目标进程中。
用户可以通过读取和写入 ``schemes`` debugfs文件来获得和设置这些方案。读取该文件还可以显示每个
方案的统计数据。在文件中,每一个方案都应该在每一行中以下列形式表示出来::
<target access pattern> <action> <quota> <watermarks>
你可以通过简单地在文件中写入一个空字符串来禁用方案。
目标访问模式
~~~~~~~~~~~~
``<目标访问模式>`` 是由三个范围构成的,形式如下::
min-size max-size min-acc max-acc min-age max-age
具体来说,区域大小的字节数( `min-size` 和 `max-size` ),访问频率的每聚合区间的监测访问次
数( `min-acc` 和 `max-acc` ),区域年龄的聚合区间数( `min-age` 和 `max-age` )都被指定。
请注意,这些范围是封闭区间。
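“封闭区间”的含义可以用下面的shell小函数示意(仅为说明语义,与内核实现无关):只有当区域的大小、
访问频率和年龄都落在各自的[min, max]闭区间内时,该区域才匹配此访问模式:

```shell
# matches <sz> <acc> <age> <min_sz> <max_sz> <min_acc> <max_acc> <min_age> <max_age>
matches() {
    if [ "$1" -ge "$4" ] && [ "$1" -le "$5" ] &&
       [ "$2" -ge "$6" ] && [ "$2" -le "$7" ] &&
       [ "$3" -ge "$8" ] && [ "$3" -le "$9" ]; then
        echo yes
    else
        echo no
    fi
}

matches 4096 0 10  4096 8192 0 5 10 20   # yes:恰好落在边界上也算匹配(闭区间)
matches 4095 0 10  4096 8192 0 5 10 20   # no:大小低于min-size
```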
动作
~~~~
``<action>`` 是一个预定义的内存管理动作的整数,DAMON将应用于具有目标访问模式的区域。支持
的数字和它们的含义如下::
- 0: Call ``madvise()`` for the region with ``MADV_WILLNEED``
- 1: Call ``madvise()`` for the region with ``MADV_COLD``
- 2: Call ``madvise()`` for the region with ``MADV_PAGEOUT``
- 3: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``
- 4: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``
- 5: Do nothing but count the statistics
配额
~~~~
每个 ``动作`` 的最佳 ``目标访问模式`` 取决于工作负载,所以不容易找到。更糟糕的是,将某个
动作的方案设置得过于激进会导致严重的开销。为了避免这种开销,用户可以通过下面表格中的 ``<quota>``
来限制方案的时间和大小配额::
<ms> <sz> <reset interval> <priority weights>
这使得DAMON在 ``<reset interval>`` 毫秒内,尽量只用 ``<ms>`` 毫秒的时间对 ``目标访
问模式`` 的内存区域应用动作,并在 ``<reset interval>`` 内只对最多 ``<sz>`` 字节的内存
区域应用动作。将 ``<ms>`` 和 ``<sz>`` 都设置为零,可以禁用配额限制。
当预计超过配额限制时,DAMON会根据 ``目标访问模式`` 的大小、访问频率和年龄,对发现的内存
区域进行优先排序。为了实现个性化的优先级,用户可以在 ``<优先级权重>`` 中设置这三个属性的
权重,具体形式如下::
<size weight> <access frequency weight> <age weight>
水位
~~~~
有些方案需要根据系统特定指标的当前值来运行,如自由内存比率。对于这种情况,用户可以为该条
件指定水位。::
<metric> <check interval> <high mark> <middle mark> <low mark>
``<metric>`` 是一个预定义的整数,用于要检查的度量。支持的数字和它们的含义如下。
- 0: 忽视水位
- 1: 系统空闲内存率 (千分比)
每隔 ``<check interval>`` 微秒检查一次该指标的值。
如果该值高于 ``<high mark>`` 或低于 ``<low mark>`` ,该方案将被停用。如果该值低于
``<middle mark>`` 但高于 ``<low mark>`` ,该方案将被激活。
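激活/停用的判断逻辑可以概括为下面的shell示意函数(仅为说明文档所述规则,并非内核实现;
中水位与高水位之间的区间此处简化为停用):

```shell
# wmark_state <当前值> <high> <mid> <low> -> active / inactive
wmark_state() {
    if [ "$1" -gt "$2" ] || [ "$1" -lt "$4" ]; then
        echo inactive          # 高于高水位或低于低水位:停用
    elif [ "$1" -le "$3" ]; then
        echo active            # 在低水位与中水位之间:激活
    else
        echo inactive          # 在中水位与高水位之间:此处简化为停用
    fi
}

wmark_state 550 600 500 300   # inactive:高于中水位
wmark_state 450 600 500 300   # active:位于低水位与中水位之间
wmark_state 250 600 500 300   # inactive:低于低水位,回退到常规回收
```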
统计数据
~~~~~~~~
它还统计每个方案尝试应用的区域的总数量和字节数、每个方案成功应用的区域的总数量和字节数,以
及超过配额限制的总次数。这些统计数据可用于在线分析或调整方案。
统计数据可以通过读取方案文件来显示。读取该文件将显示你在每一行中输入的每个 ``方案`` ,
统计的五个数字将被加在每一行的末尾。
例子
~~~~
下面的命令应用了一个方案:“如果一个大小为[4KiB, 8KiB]的内存区域,在[10, 20]个聚合时间
间隔内,每个聚合时间间隔的被监测访问次数都在[0, 5]之间,则将该区域换出。换出时,每秒最多只
使用10ms,而且每秒换出的内存不超过1GiB。在这一限制下,优先换出年龄较大的内存区域。另外,
每5秒钟检查一次系统的可用内存率,当可用内存率低于50%时开始监测和换出,但如果可用内存率
高于60%或低于30%,则停止监测。”::
# cd <debugfs>/damon
# scheme="4096 8192 0 5 10 20 2" # target access pattern and action
# scheme+=" 10 $((1024*1024*1024)) 1000" # quotas
# scheme+=" 0 0 100" # prioritization weights
# scheme+=" 1 5000000 600 500 300" # watermarks
# echo "$scheme" > schemes
开关
----
除非你明确地启动监测,否则如上所述的文件设置不会产生效果。你可以通过写入和读取 ``monitor_on``
文件来启动、停止和检查监测的当前状态。向该文件写入 ``on`` 可以启动对已设置目标的监测;写入
``off`` 则停止监测。如果每个目标进程都已终止,DAMON也会停止。下面的示例命令开启、关
闭和检查DAMON的状态::
# cd <debugfs>/damon
# echo on > monitor_on
# echo off > monitor_on
# cat monitor_on
off
请注意,当监测开启时,你不能写到上述的debugfs文件。如果你在DAMON运行时写到这些文件,将会返
回一个错误代码,如 ``-EBUSY`` 。
监测线程PID
-----------
DAMON通过一个叫做kdamond的内核线程来执行请求的监测。你可以通过读取 ``kdamond_pid`` 文件获
得该线程的 ``pid`` 。当监测被 ``关闭`` 时,读取该文件会返回 ``none`` ::
# cd <debugfs>/damon
# cat monitor_on
off
# cat kdamond_pid
none
# echo on > monitor_on
# cat kdamond_pid
18594
使用多个监测线程
----------------
每个监测上下文都会创建一个 ``kdamond`` 线程。你可以使用 ``mk_contexts`` 和 ``rm_contexts``
文件为多个 ``kdamond`` 需要的用例创建和删除监测上下文。
将新上下文的名称写入 ``mk_contexts`` 文件,在 ``DAMON debugfs`` 目录上创建一个该名称的目录。
该目录将有该上下文的 ``DAMON debugfs`` 文件::
# cd <debugfs>/damon
# ls foo
# ls: cannot access 'foo': No such file or directory
# echo foo > mk_contexts
# ls foo
# attrs init_regions kdamond_pid schemes target_ids
如果不再需要上下文,你可以通过把上下文的名字放到 ``rm_contexts`` 文件中来删除它和相应的目录::
# echo foo > rm_contexts
# ls foo
# ls: cannot access 'foo': No such file or directory
注意, ``mk_contexts`` 、 ``rm_contexts`` 和 ``monitor_on`` 文件只在根目录下。
监测结果的监测点
================
DAMON通过一个跟踪点 ``damon:damon_aggregated`` 提供监测结果。当监测开启时,你可
以记录跟踪点事件,并使用perf等支持跟踪点的工具显示结果。比如说::
# echo on > monitor_on
# perf record -e damon:damon_aggregated &
# sleep 5
# kill -9 $(pidof perf)
# echo off > monitor_on
# perf script
.. include:: ../../disclaimer-zh_CN.rst
:Original: Documentation/admin-guide/mm/index.rst
:翻译:
徐鑫 xu xin <xu.xin16@zte.com.cn>
========
内存管理
========
Linux内存管理子系统,顾名思义,是负责系统中的内存管理。它包括了虚拟内存与请求
分页的实现,内核内部结构和用户空间程序的内存分配、将文件映射到进程地址空间以
及许多其他很酷的事情。
Linux内存管理是一个具有许多可配置设置的复杂系统, 且这些设置中的大多数都可以通
过 ``/proc`` 文件系统获得,并且可以使用 ``sysctl`` 进行查询和调整。这些API接
口被描述在Documentation/admin-guide/sysctl/vm.rst文件和 `man 5 proc`_ 中。
.. _man 5 proc: http://man7.org/linux/man-pages/man5/proc.5.html
Linux内存管理有它自己的术语,如果你还不熟悉它,请考虑阅读下面参考:
:ref:`Documentation/admin-guide/mm/concepts.rst <mm_concepts>`.
在此目录下,我们详细描述了如何与Linux内存管理中的各种机制交互。
.. toctree::
:maxdepth: 1
damon/index
ksm
Todolist:
* concepts
* cma_debugfs
* hugetlbpage
* idle_page_tracking
* memory-hotplug
* nommu-mmap
* numa_memory_policy
* numaperf
* pagemap
* soft-dirty
* swap_numa
* transhuge
* userfaultfd
* zswap
.. include:: ../../disclaimer-zh_CN.rst
:Original: Documentation/admin-guide/mm/ksm.rst
:翻译:
徐鑫 xu xin <xu.xin16@zte.com.cn>
============
内核同页合并
============
概述
====
KSM是一种能节省内存的数据去重功能,由CONFIG_KSM=y启用,并在2.6.32版本时被添
加到Linux内核。详见 ``mm/ksm.c`` 的实现,以及 http://lwn.net/Articles/306704 和
https://lwn.net/Articles/330589 。
KSM最初目的是为了与KVM(即著名的内核共享内存)一起使用而开发的,通过共享虚拟机
之间的公共数据,将更多虚拟机放入物理内存。但它对于任何会生成多个相同数据实例的
应用程序都是很有用的。
KSM的守护进程ksmd会定期扫描那些已注册的用户内存区域,查找内容相同的页面,这些
页面可以被单个写保护页面替换(如果进程以后想要更新其内容,将自动复制)。使用:
引用:`sysfs intraface <ksm_sysfs>` 接口来配置KSM守护程序在单个过程中所扫描的页
数以及两个过程之间的间隔时间。
KSM只合并匿名(私有)页面,从不合并页缓存(文件)页面。KSM的合并页面最初只能被
锁定在内核内存中,但现在可以就像其他用户页面一样被换出(但当它们被交换回来时共
享会被破坏: ksmd必须重新发现它们的身份并再次合并)。
以madvise控制KSM
================
KSM仅在特定的地址空间区域上运行,即应用程序通过如下所示的madvise(2)系统调用,
请求将某块地址范围作为可能的合并候选者的区域::
int madvise(addr, length, MADV_MERGEABLE)
应用程序当然也可以通过调用::
int madvise(addr, length, MADV_UNMERGEABLE)
来取消该请求,并恢复为非共享页面:此时KSM将去除合并在该范围内的任何合并页。注意:
这个去除合并的调用可能突然需要的内存量超过实际可用的内存量-那么可能会出现EAGAIN
失败,但更可能会唤醒OOM killer。
如果KSM未被配置到正在运行的内核中,则madvise MADV_MERGEABLE 和 MADV_UNMERGEABLE
的调用只会以EINVAL 失败。如果正在运行的内核是用CONFIG_KSM=y方式构建的,那么这些
调用通常会成功:即使KSM守护程序当前没有运行,MADV_MERGEABLE 仍然会在KSM守护程序
启动时注册范围,即使该范围不能包含KSM实际可以合并的任何页面,即使MADV_UNMERGEABLE
应用于从未标记为MADV_MERGEABLE的范围。
如果一块内存区域必须被拆分为至少一个新的MADV_MERGEABLE区域或MADV_UNMERGEABLE区域,
当该进程将超过 ``vm.max_map_count`` 的设定,则madvise可能返回ENOMEM。(请参阅文档
Documentation/admin-guide/sysctl/vm.rst)。
与其他madvise调用一样,它们在用户地址空间的映射区域上使用:如果指定的范围包含未
映射的间隙(尽管在中间的映射区域工作),它们将报告ENOMEM,如果没有足够的内存用于
内部结构,则可能会因EAGAIN而失败。
KSM守护进程sysfs接口
====================
KSM守护进程可以由 ``/sys/kernel/mm/ksm/`` 中的sysfs文件控制,所有人都可以读取,但
只能由root用户写入。各接口解释如下:
pages_to_scan
ksmd进程进入睡眠前要扫描的页数。
例如, ``echo 100 > /sys/kernel/mm/ksm/pages_to_scan``
默认值:100(该值被选择用于演示目的)
sleep_millisecs
ksmd在下次扫描前应休眠多少毫秒
例如, ``echo 20 > /sys/kernel/mm/ksm/sleep_millisecs``
默认值:20(该值被选择用于演示目的)
merge_across_nodes
指定是否可以合并来自不同NUMA节点的页面。当设置为0时,ksm仅合并在物理上位
于同一NUMA节点的内存区域中的页面。这降低了访问共享页面的延迟。在有明显的
NUMA距离上,具有更多节点的系统可能受益于设置该值为0时的更低延迟。而对于
需要对内存使用量最小化的较小系统来说,设置该值为1(默认设置)则可能会受
益于更大共享页面。在决定使用哪种设置之前,您可能希望比较系统在每种设置下
的性能。 ``merge_across_nodes`` 仅当系统中没有KSM共享页面时才能被更改:首先将接口
``run`` 设置为2,从而对页进行去合并,然后在修改 ``merge_across_nodes`` 后再将
``run`` 设置为1,以根据新设置来重新合并。
默认值:1(如早期的发布版本一样合并跨站点)
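修改 ``merge_across_nodes`` 的典型步骤可以用下面的片段演示。为便于在任意机器上运行,
这里用临时目录模拟 ``/sys/kernel/mm/ksm`` (仅为示意);真实操作时直接写入该sysfs目录即可:

```shell
K=$(mktemp -d)                       # 模拟 /sys/kernel/mm/ksm(仅为演示)
echo 1 > "$K/run"
echo 1 > "$K/merge_across_nodes"

echo 2 > "$K/run"                    # 1. 先对所有KSM页去合并
echo 0 > "$K/merge_across_nodes"     # 2. 此时才允许修改该设置
echo 1 > "$K/run"                    # 3. 重新开始按新设置合并
```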
run
* 设置为0可停止ksmd运行,但保留合并页面,
* 设置为1可运行ksmd,例如, ``echo 1 > /sys/kernel/mm/ksm/run`` ,
* 设置为2可停止ksmd运行,并且对所有目前已合并的页进行去合并,但保留可合并
区域以供下次运行。
默认值:0(必须设置为1才能激活KSM,除非禁用了CONFIG_SYSFS)
use_zero_pages
指定是否应当特殊处理空页(即那些仅含zero的已分配页)。当该值设置为1时,
空页与内核零页合并,而不是像通常情况下那样空页彼此合并。根据工作负载的不同,这
在具有着色零页的架构上可以提高性能。启用此设置时应小心,
因为它可能会降低某些工作负载的KSM性能,比如,当待合并的候选页面的校验和
与空页面的校验和恰好匹配的时候。此设置可随时更改,仅对那些更改后再合并
的页面有效。
默认值:0(如同早期版本的KSM正常表现)
max_page_sharing
单个KSM页面允许的最大共享站点数。这将强制执行重复数据消除限制,以避免涉
及遍历共享KSM页面的虚拟映射的虚拟内存操作的高延迟。最小值为2,因为新创
建的KSM页面将至少有两个共享者。该值越高,KSM合并内存的速度越快,去重
因子也越高,但是对于任何给定的KSM页面,虚拟映射的最坏情况遍历的速度也会
越慢。减慢了这种遍历速度就意味着在交换、压缩、NUMA平衡和页面迁移期间,
某些虚拟内存操作将有更高的延迟,从而降低这些虚拟内存操作调用者的响应能力。
其他任务如果不涉及执行虚拟映射遍历的VM操作,其任务调度延迟不受此参数的影
响,因为这些遍历本身是调度友好的。
stable_node_chains_prune_millisecs
指定KSM检查特定页面的元数据的频率(即那些达到过时信息数据去重限制标准的
页面)单位是毫秒。较小的毫秒值将以更低的延迟来释放KSM元数据,但它们将使
ksmd在扫描期间使用更多CPU。如果还没有一个KSM页面达到 ``max_page_sharing``
标准,那就没有什么用。
KSM与MADV_MERGEABLE的工作有效性体现于 ``/sys/kernel/mm/ksm/`` 路径下的接口:
pages_shared
表示多少共享页正在被使用
pages_sharing
表示还有多少站点正在共享这些共享页,即节省了多少
pages_unshared
表示有多少页是唯一的,但被反复检查以进行合并
pages_volatile
表示有多少页因变化太快而无法放在tree中
full_scans
表示所有可合并区域已扫描多少次
stable_node_chains
达到 ``max_page_sharing`` 限制的KSM页数
stable_node_dups
重复的KSM页数
比值 ``pages_sharing/pages_shared`` 的最大值受限制于 ``max_page_sharing``
的设定。要想增加该比值,则相应地要增加 ``max_page_sharing`` 的值。
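举例来说,可以用下面的shell片段根据这些接口计算去重收益。此处用临时目录和假想数值演示
(仅为示意),实际使用时直接读取 ``/sys/kernel/mm/ksm`` 下的文件即可:

```shell
K=$(mktemp -d)                     # 模拟 /sys/kernel/mm/ksm(假想数值)
echo 100 > "$K/pages_shared"
echo 900 > "$K/pages_sharing"

shared=$(cat "$K/pages_shared")
sharing=$(cat "$K/pages_sharing")
ratio=$((sharing / shared))        # pages_sharing/pages_shared,受max_page_sharing限制
echo "ratio=$ratio saved=$sharing pages"   # pages_sharing本身即为节省的页数(见上文)
```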
......@@ -42,6 +42,7 @@
kref
assoc_array
xarray
rbtree
Todolist:
......@@ -49,7 +50,6 @@ Todolist:
idr
circular-buffers
rbtree
generic-radix-tree
packing
bus-virt-phys-mapping
......
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/core-api/rbtree.rst
:翻译:
唐艺舟 Tang Yizhou <tangyeechou@gmail.com>
=========================
Linux中的红黑树(rbtree)
=========================
:日期: 2007年1月18日
:作者: Rob Landley <rob@landley.net>
何为红黑树,它们有什么用?
--------------------------
红黑树是一种自平衡二叉搜索树,被用来存储可排序的键/值数据对。这与基数树(被用来高效
存储稀疏数组,因此使用长整型下标来插入/访问/删除结点)和哈希表(没有保持排序因而无法
容易地按序遍历,同时必须调节其大小和哈希函数,然而红黑树可以优雅地伸缩以便存储任意
数量的键)不同。
红黑树和AVL树类似,但在插入和删除时提供了更快的实时有界的最坏情况性能(分别最多两次
旋转和三次旋转,来平衡树),查询时间轻微变慢(但时间复杂度仍然是O(log n))。
引用Linux每周新闻(Linux Weekly News):
内核中有多处红黑树的使用案例。最后期限调度器和完全公平排队(CFQ)I/O调度器利用
红黑树跟踪请求;数据包CD/DVD驱动程序也是如此。高精度时钟代码使用一颗红黑树组织
未完成的定时器请求。ext3文件系统用红黑树跟踪目录项。虚拟内存区域(VMAs)、epoll
文件描述符、密码学密钥和在“分层令牌桶”调度器中的网络数据包都由红黑树跟踪。
本文档涵盖了对Linux红黑树实现的使用方法。更多关于红黑树的性质和实现的信息,参见:
Linux每周新闻关于红黑树的文章
https://lwn.net/Articles/184495/
维基百科红黑树词条
https://en.wikipedia.org/wiki/Red-black_tree
红黑树的Linux实现
-----------------
Linux的红黑树实现在文件“lib/rbtree.c”中。要使用它,需要“#include <linux/rbtree.h>”。
Linux的红黑树实现对速度进行了优化,因此比传统的实现少一个间接层(有更好的缓存局部性)。
每个rb_node结构体的实例嵌入在它管理的数据结构中,因此不需要靠指针来分离rb_node和它
管理的数据结构。用户应该编写他们自己的树搜索和插入函数,来调用已提供的红黑树函数,
而不是使用一个比较回调函数指针。加锁代码也留给红黑树的用户编写。
创建一颗红黑树
--------------
红黑树中的数据结点是包含rb_node结构体成员的结构体::
struct mytype {
struct rb_node node;
char *keystring;
};
当处理一个指向内嵌rb_node结构体的指针时,包住rb_node的结构体可用标准的container_of()
宏访问。此外,个体成员可直接用rb_entry(node, type, member)访问。
每颗红黑树的根是一个rb_root数据结构,它由以下方式初始化为空::

  struct rb_root mytree = RB_ROOT;
在一颗红黑树中搜索值
--------------------
为你的树写一个搜索函数是相当简单的:从树根开始,比较每个值,然后根据需要继续前往左边或
右边的分支。
示例::
struct mytype *my_search(struct rb_root *root, char *string)
{
struct rb_node *node = root->rb_node;
while (node) {
struct mytype *data = container_of(node, struct mytype, node);
int result;
result = strcmp(string, data->keystring);
if (result < 0)
node = node->rb_left;
else if (result > 0)
node = node->rb_right;
else
return data;
}
return NULL;
}
在一颗红黑树中插入数据
----------------------
在树中插入数据的步骤包括:首先搜索插入新结点的位置,然后插入结点并对树再平衡
("recoloring")。
插入的搜索和上文的搜索不同,它要找到嫁接新结点的位置。新结点也需要一个指向它的父节点
的链接,以达到再平衡的目的。
示例::
int my_insert(struct rb_root *root, struct mytype *data)
{
struct rb_node **new = &(root->rb_node), *parent = NULL;
/* Figure out where to put new node */
while (*new) {
struct mytype *this = container_of(*new, struct mytype, node);
int result = strcmp(data->keystring, this->keystring);
parent = *new;
if (result < 0)
new = &((*new)->rb_left);
else if (result > 0)
new = &((*new)->rb_right);
else
return FALSE;
}
/* Add new node and rebalance tree. */
rb_link_node(&data->node, parent, new);
rb_insert_color(&data->node, root);
return TRUE;
}
在一颗红黑树中删除或替换已经存在的数据
--------------------------------------
若要从树中删除一个已经存在的结点,调用::
void rb_erase(struct rb_node *victim, struct rb_root *tree);
示例::
struct mytype *data = mysearch(&mytree, "walrus");
if (data) {
rb_erase(&data->node, &mytree);
myfree(data);
}
若要用一个新结点替换树中一个已经存在的键值相同的结点,调用::
void rb_replace_node(struct rb_node *old, struct rb_node *new,
struct rb_root *tree);
通过这种方式替换结点不会对树做重排序:如果新结点的键值和旧结点不同,红黑树可能被
破坏。
(按排序的顺序)遍历存储在红黑树中的元素
----------------------------------------
我们提供了四个函数,用于以排序的方式遍历一颗红黑树的内容。这些函数可以在任意红黑树
上工作,并且不需要被修改或包装(除非加锁的目的)::
struct rb_node *rb_first(struct rb_root *tree);
struct rb_node *rb_last(struct rb_root *tree);
struct rb_node *rb_next(struct rb_node *node);
struct rb_node *rb_prev(struct rb_node *node);
要开始迭代,需要使用一个指向树根的指针调用rb_first()或rb_last(),它将返回一个指向
树中第一个或最后一个元素所包含的节点结构的指针。要继续的话,可以在当前结点上调用
rb_next()或rb_prev()来获取下一个或上一个结点。当没有剩余的结点时,将返回NULL。
迭代器函数返回一个指向被嵌入的rb_node结构体的指针,由此,包住rb_node的结构体可用
标准的container_of()宏访问。此外,个体成员可直接用rb_entry(node, type, member)
访问。
示例::
struct rb_node *node;
for (node = rb_first(&mytree); node; node = rb_next(node))
printk("key=%s\n", rb_entry(node, struct mytype, node)->keystring);
带缓存的红黑树
--------------
计算最左边(最小的)结点是二叉搜索树的一个相当常见的任务,例如用于遍历,或用户根据
他们自己的逻辑依赖一个特定的顺序。为此,用户可以使用'struct rb_root_cached'来优化
时间复杂度为O(logN)的rb_first()的调用,以简单地获取指针,避免了潜在的昂贵的树迭代。
维护操作的额外运行时间开销可忽略,不过内存占用较大。
和rb_root结构体类似,带缓存的红黑树由以下方式初始化为空::
struct rb_root_cached mytree = RB_ROOT_CACHED;
带缓存的红黑树只是一个常规的rb_root,加上一个额外的指针来缓存最左边的节点。这使得
rb_root_cached可以存在于rb_root存在的任何地方,并且只需增加几个接口来支持带缓存的
树::
struct rb_node *rb_first_cached(struct rb_root_cached *tree);
void rb_insert_color_cached(struct rb_node *, struct rb_root_cached *, bool);
void rb_erase_cached(struct rb_node *node, struct rb_root_cached *);
插入和删除也有对应的带缓存的增强型树的调用::
void rb_insert_augmented_cached(struct rb_node *node, struct rb_root_cached *,
bool, struct rb_augment_callbacks *);
void rb_erase_augmented_cached(struct rb_node *, struct rb_root_cached *,
struct rb_augment_callbacks *);
对增强型红黑树的支持
--------------------
增强型红黑树是一种在每个结点里存储了“一些”附加数据的红黑树,其中结点N的附加数据
必须是以N为根的子树中所有结点的内容的函数。它是建立在红黑树基础设施之上的可选特性。
想要使用这个特性的红黑树用户,插入和删除结点时必须调用增强型接口并提供增强型回调函数。
实现增强型红黑树操作的C文件必须包含<linux/rbtree_augmented.h>而不是<linux/rbtree.h>。
注意,linux/rbtree_augmented.h暴露了一些红黑树实现的细节而你不应依赖它们,请坚持
使用文档记录的API,并且不要在头文件中包含<linux/rbtree_augmented.h>,以最小化你的
用户意外地依赖这些实现细节的可能。
插入时,用户必须更新通往被插入节点的路径上的增强信息,然后像往常一样调用rb_link_node(),
然后是rb_augment_inserted()而不是平时的rb_insert_color()调用。如果
rb_augment_inserted()再平衡了红黑树,它将回调至一个用户提供的函数来更新受影响的
子树上的增强信息。
删除一个结点时,用户必须调用rb_erase_augmented()而不是rb_erase()。
rb_erase_augmented()回调至一个用户提供的函数来更新受影响的子树上的增强信息。
在两种情况下,回调都是通过rb_augment_callbacks结构体提供的。必须定义3个回调:
- 一个传播回调,它更新一个给定结点和它的祖先们的增强数据,直到一个给定的停止点
(如果停止点是NULL,则一路更新到树根)。
- 一个复制回调,它将一颗给定子树的增强数据复制到一个新指定的子树树根。
- 一个树旋转回调,它将一颗给定的子树的增强值复制到新指定的子树树根上,并重新计算
先前的子树树根的增强值。
rb_erase_augmented()编译后的代码可能会内联传播、复制回调,这将导致函数体积更大,
因此每个增强型红黑树的用户应该只有一个rb_erase_augmented()的调用点,以限制编译后
的代码大小。
使用示例
^^^^^^^^
区间树是增强型红黑树的一个例子。参考Cormen,Leiserson,Rivest和Stein写的
《算法导论》。区间树的更多细节:
经典的红黑树只有一个键,它不能直接用来存储像[lo:hi]这样的区间范围,也不能快速查找
与新的lo:hi重叠的部分,或者查找是否有与新的lo:hi完全匹配的部分。
然而,红黑树可以被增强,以一种结构化的方式来存储这种区间范围,从而使高效的查找和
精确匹配成为可能。
这个存储在每个节点中的“额外信息”是其所有后代结点中的最大hi(max_hi)值。这个信息
可以保持在每个结点上,只需查看一下该结点和它的直系子结点们。这将被用于时间复杂度
为O(log n)的最低匹配查找(所有可能的匹配中最低的起始地址),就像这样::
struct interval_tree_node *
interval_tree_first_match(struct rb_root *root,
unsigned long start, unsigned long last)
{
struct interval_tree_node *node;
if (!root->rb_node)
return NULL;
node = rb_entry(root->rb_node, struct interval_tree_node, rb);
while (true) {
if (node->rb.rb_left) {
struct interval_tree_node *left =
rb_entry(node->rb.rb_left,
struct interval_tree_node, rb);
if (left->__subtree_last >= start) {
/*
* Some nodes in left subtree satisfy Cond2.
* Iterate to find the leftmost such node N.
* If it also satisfies Cond1, that's the match
* we are looking for. Otherwise, there is no
* matching interval as nodes to the right of N
* can't satisfy Cond1 either.
*/
node = left;
continue;
}
}
if (node->start <= last) { /* Cond1 */
if (node->last >= start) /* Cond2 */
return node; /* node is leftmost match */
if (node->rb.rb_right) {
node = rb_entry(node->rb.rb_right,
struct interval_tree_node, rb);
if (node->__subtree_last >= start)
continue;
}
}
return NULL; /* No match */
}
}
插入/删除是通过以下增强型回调来定义的::
static inline unsigned long
compute_subtree_last(struct interval_tree_node *node)
{
unsigned long max = node->last, subtree_last;
if (node->rb.rb_left) {
subtree_last = rb_entry(node->rb.rb_left,
struct interval_tree_node, rb)->__subtree_last;
if (max < subtree_last)
max = subtree_last;
}
if (node->rb.rb_right) {
subtree_last = rb_entry(node->rb.rb_right,
struct interval_tree_node, rb)->__subtree_last;
if (max < subtree_last)
max = subtree_last;
}
return max;
}
static void augment_propagate(struct rb_node *rb, struct rb_node *stop)
{
while (rb != stop) {
struct interval_tree_node *node =
rb_entry(rb, struct interval_tree_node, rb);
unsigned long subtree_last = compute_subtree_last(node);
if (node->__subtree_last == subtree_last)
break;
node->__subtree_last = subtree_last;
rb = rb_parent(&node->rb);
}
}
static void augment_copy(struct rb_node *rb_old, struct rb_node *rb_new)
{
struct interval_tree_node *old =
rb_entry(rb_old, struct interval_tree_node, rb);
struct interval_tree_node *new =
rb_entry(rb_new, struct interval_tree_node, rb);
new->__subtree_last = old->__subtree_last;
}
static void augment_rotate(struct rb_node *rb_old, struct rb_node *rb_new)
{
struct interval_tree_node *old =
rb_entry(rb_old, struct interval_tree_node, rb);
struct interval_tree_node *new =
rb_entry(rb_new, struct interval_tree_node, rb);
new->__subtree_last = old->__subtree_last;
old->__subtree_last = compute_subtree_last(old);
}
static const struct rb_augment_callbacks augment_callbacks = {
augment_propagate, augment_copy, augment_rotate
};
void interval_tree_insert(struct interval_tree_node *node,
struct rb_root *root)
{
struct rb_node **link = &root->rb_node, *rb_parent = NULL;
unsigned long start = node->start, last = node->last;
struct interval_tree_node *parent;
while (*link) {
rb_parent = *link;
parent = rb_entry(rb_parent, struct interval_tree_node, rb);
if (parent->__subtree_last < last)
parent->__subtree_last = last;
if (start < parent->start)
link = &parent->rb.rb_left;
else
link = &parent->rb.rb_right;
}
node->__subtree_last = last;
rb_link_node(&node->rb, rb_parent, link);
rb_insert_augmented(&node->rb, root, &augment_callbacks);
}
void interval_tree_remove(struct interval_tree_node *node,
struct rb_root *root)
{
rb_erase_augmented(&node->rb, root, &augment_callbacks);
}
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/devicetree/index.rst
:翻译:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:校译:
=============================
Open Firmware 和 Devicetree
=============================
该文档是整个设备树文档的总目录,标题中多是业内默认的术语,初见就翻译成中文,
晦涩难懂,因此尽量保留,后面翻译其子文档时,可能会根据语境,灵活地翻译为中文。
内核Devicetree的使用
=======================
.. toctree::
:maxdepth: 1
usage-model
of_unittest
Todolist:
* kernel-api
Devicetree Overlays
===================
.. toctree::
:maxdepth: 1
Todolist:
* changesets
* dynamic-resolution-notes
* overlay-notes
Devicetree Bindings
===================
.. toctree::
:maxdepth: 1
Todolist:
* bindings/index
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/devicetree/of_unittest.rst
:翻译:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:校译:
=================================
Open Firmware Devicetree 单元测试
=================================
作者: Gaurav Minocha <gaurav.minocha.os@gmail.com>
1. 概述
=======
本文档解释了执行 OF 单元测试所需的测试数据是如何动态地附加到实时树上的,与机器的架构无关。
建议在继续读下去之前,先阅读以下文件。
(1) Documentation/devicetree/usage-model.rst
(2) http://www.devicetree.org/Device_Tree_Usage
OF Selftest被设计用来测试提供给设备驱动开发者的接口(include/linux/of.h),以从未扁平
化的设备树数据结构中获取设备信息等。这个接口被大多数设备驱动在各种使用情况下使用。
2. 测试数据
===========
设备树源文件(drivers/of/unittest-data/testcases.dts)包含执行drivers/of/unittest.c
中自动化单元测试所需的测试数据。目前,以下设备树源包含文件(.dtsi)被包含在testcases.dts中::
drivers/of/unittest-data/tests-interrupts.dtsi
drivers/of/unittest-data/tests-platform.dtsi
drivers/of/unittest-data/tests-phandle.dtsi
drivers/of/unittest-data/tests-match.dtsi
当内核在启用OF_SELFTEST的情况下被构建时,那么下面的make规则::
$(obj)/%.dtb: $(src)/%.dts FORCE
$(call if_changed_dep, dtc)
用于将DT源文件(testcases.dts)编译成二进制blob(testcases.dtb),也被称为扁平化的DT。
之后,使用以下规则将上述二进制blob包装成一个汇编文件(testcases.dtb.S)::
$(obj)/%.dtb.S: $(obj)/%.dtb
$(call cmd, dt_S_dtb)
汇编文件被编译成一个对象文件(testcases.dtb.o),并被链接到内核镜像中。
2.1. 添加测试数据
-----------------
未扁平化的设备树结构体:
未扁平化的设备树由连接的设备节点组成,其树状结构形式如下所述::
// following struct members are used to construct the tree
struct device_node {
...
struct device_node *parent;
struct device_node *child;
struct device_node *sibling;
...
};
图1描述了一个机器的未扁平化设备树的通用结构,只考虑了子节点和同级指针。存在另一个指针,
``*parent`` ,用于反向遍历该树。因此,在一个特定的层次上,子节点和所有的兄弟姐妹节点将
有一个指向共同节点的父指针(例如,child1、sibling2、sibling3、sibling4的父指针指向
根节点)::
root ('/')
|
child1 -> sibling2 -> sibling3 -> sibling4 -> null
| | | |
| | | null
| | |
| | child31 -> sibling32 -> null
| | | |
| | null null
| |
| child21 -> sibling22 -> sibling23 -> null
| | | |
| null null null
|
child11 -> sibling12 -> sibling13 -> sibling14 -> null
| | | |
| | | null
| | |
null null child131 -> null
|
null
Figure 1: 未扁平化的设备树的通用结构
在执行OF单元测试之前,需要将测试数据附加到机器的设备树上(如果存在)。因此,当调用
selftest_data_add()时,首先会读取通过以下内核符号链接到内核镜像中的扁平化设备树
数据::
__dtb_testcases_begin - address marking the start of test data blob
__dtb_testcases_end - address marking the end of test data blob
其次,它调用of_fdt_unflatten_tree()来解除扁平化的blob。最后,如果机器的设备树
(即实时树)是存在的,那么它将未扁平化的测试数据树附加到实时树上,否则它将自己作为
实时设备树附加。
attach_node_and_children()使用of_attach_node()将节点附加到实时树上,如下所
述。为了解释这一点,图2中描述的测试数据树被附加到图1中描述的实时树上::
root ('/')
|
testcase-data
|
test-child0 -> test-sibling1 -> test-sibling2 -> test-sibling3 -> null
| | | |
test-child01 null null null
Figure 2: 将测试数据树附在实时树上的例子。
根据上面的方案,实时树已经存在,所以不需要附加根('/')节点。所有其他节点都是通过在
每个节点上调用of_attach_node()来附加的。
在函数of_attach_node()中,新的节点被附在实时树中给定的父节点的子节点上。但是,如
果父节点已经有了一个孩子,那么新节点就会取代当前的孩子,并将其变成其兄弟姐妹。因此,
当测试案例的数据节点被连接到上面的实时树(图1)时,最终的结构如图3所示::
root ('/')
|
testcase-data -> child1 -> sibling2 -> sibling3 -> sibling4 -> null
| | | | |
(...) | | | null
| | child31 -> sibling32 -> null
| | | |
| | null null
| |
| child21 -> sibling22 -> sibling23 -> null
| | | |
| null null null
|
child11 -> sibling12 -> sibling13 -> sibling14 -> null
| | | |
null null | null
|
child131 -> null
|
null
-----------------------------------------------------------------------
root ('/')
|
testcase-data -> child1 -> sibling2 -> sibling3 -> sibling4 -> null
| | | | |
| (...) (...) (...) null
|
test-sibling3 -> test-sibling2 -> test-sibling1 -> test-child0 -> null
| | | |
null null null test-child01
Figure 3: 附加测试案例数据后的实时设备树结构。
聪明的读者会注意到,与先前的结构相比,test-child0节点成为最后一个兄弟姐妹(图2)。
在连接了第一个test-child0节点之后,又连接了test-sibling1节点,该节点推动子节点
(即test-child0)成为兄弟姐妹,并使自己成为子节点,如上所述。
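of_attach_node()的这种“头插”行为可以用下面的shell片段模拟:用空格分隔的字符串表示某父节点
的子节点链表(仅为示意,与内核数据结构无关),每次附加都把新节点放到链表头部:

```shell
children=""                   # testcase-data 的子节点链表,初始为空

# 新节点插入为第一个子节点,原有子节点顺次后移为其兄弟
attach() {
    children="$1${children:+ $children}"
}

attach test-child0
attach test-sibling1
attach test-sibling2
attach test-sibling3

echo "$children"   # test-sibling3 test-sibling2 test-sibling1 test-child0
```

可以看到,最先附加的test-child0最终排在链表末尾,与图3中描述的结构一致。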
如果发现一个重复的节点(即如果一个具有相同full_name属性的节点已经存在于实时树中),
那么该节点不会被附加,而是通过调用函数update_node_properties()将其属性更新到实时树
的对应节点中。
2.2. 删除测试数据
-----------------
一旦测试用例执行完,selftest_data_remove被调用,以移除最初连接的设备节点(首先是
叶子节点被分离,然后向上移动父节点被移除,最后是整个树)。selftest_data_remove()
调用detach_node_and_children(),使用of_detach_node()将节点从实时设备树上分离。
为了分离一个节点,of_detach_node()要么将给定节点的父节点的子节点指针更新为其同级节
点,要么根据情况将前一个同级节点附在给定节点的同级节点上。就这样吧。 :)
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/devicetree/usage-model.rst
:翻译:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:校译:
===================
Linux 和 Devicetree
===================
Linux对设备树数据的使用模型
:作者: Grant Likely <grant.likely@secretlab.ca>
这篇文章描述了Linux如何使用设备树。关于设备树数据格式的概述可以在
devicetree.org\ [1]_ 的设备树使用页面上找到。
.. [1] https://www.devicetree.org/specifications/
"Open Firmware Device Tree",或简称为Devicetree(DT),是一种用于描述硬
件的数据结构和语言。更确切地说,它是一种操作系统可读的硬件描述,这样操作系统就不
需要对机器的细节进行硬编码。
从结构上看,DT是一棵树,或者说是带有命名节点的无环图,节点可以有任意数量的命名
属性来封装任意的数据。还存在一种机制,可以在自然的树状结构之外创建从一个节点到
另一个节点的任意链接。
从概念上讲,一套通用的使用惯例,称为 "bindings"(后文译为绑定),被定义为数据
应该如何出现在树中,以描述典型的硬件特性,包括数据总线、中断线、GPIO连接和外围
设备。
尽可能使用现有的绑定来描述硬件,以最大限度地利用现有的支持代码,但由于属性和节
点名称是简单的文本字符串,通过定义新的节点和属性来扩展现有的绑定或创建新的绑定
很容易。然而,要警惕的是,在创建一个新的绑定之前,最好先对已经存在的东西做一些
功课。目前有两种不同的、不兼容的i2c总线的绑定,这是因为在创建新的绑定时没有事先
调查i2c设备在现有系统中是如何被枚举的。
1. 历史
-------
DT最初是由Open Firmware创建的,作为将数据从Open Firmware传递给客户程序
(如传递给操作系统)的通信方法的一部分。操作系统使用设备树在运行时探测硬件的拓
扑结构,从而在没有硬编码信息的情况下支持大多数可用的硬件(假设所有设备的驱动程
序都可用)。
由于Open Firmware通常在PowerPC和SPARC平台上使用,长期以来,对这些架构的
Linux支持一直使用设备树。
2005年,当PowerPC Linux开始大规模清理并合并32位和64位支持时,决定在所有
Powerpc平台上要求DT支持,无论它们是否使用Open Firmware。为了做到这一点,
我们创建了一个叫做扁平化设备树(FDT)的DT表示法,它可以作为一个二进制的blob
传递给内核,而不需要真正的Open Firmware实现。U-Boot、kexec和其他引导程序
被修改,以支持传递设备树二进制(dtb)和在引导时修改dtb。DT也被添加到PowerPC
引导包装器(arch/powerpc/boot/\*)中,这样dtb就可以被包裹在内核镜像中,以
支持引导现有的不感知DT的固件。
一段时间后,FDT基础架构被普及到了所有的架构中。在写这篇文章的时候,6个主线架
构(arm、microblaze、mips、powerpc、sparc和x86)和1个非主线架构(nios)
有某种程度的DT支持。
2. 数据模型
-----------
如果你还没有读过设备树用法\ [1]_页,那么现在就去读吧。没关系,我等着....
2.1 高层次视角
--------------
最重要的是要明白,DT只是一个描述硬件的数据结构。它没有什么神奇之处,也不会神
奇地让所有的硬件配置问题消失。它所做的是提供一种语言,将硬件配置与Linux内核
(或任何其他操作系统)中的板卡和设备驱动支持解耦。使用它可以使板卡和设备支持
变成数据驱动;根据传递到内核的数据做出设置决定,而不是根据每台机器的硬编码选
择。
理想情况下,数据驱动的平台设置应该导致更少的代码重复,并使其更容易用一个内核
镜像支持各种硬件。
Linux使用DT数据有三个主要目的:
1) 平台识别。
2) 运行时配置,以及
3) 设备填充(device population)。
2.2 平台识别
------------
首先,内核将使用DT中的数据来识别特定的机器。在一个理想的世界里,具体的平台对
内核来说并不重要,因为所有的平台细节都会被设备树以一致和可靠的方式完美描述。
但是,硬件并不完美,所以内核必须在早期启动时识别机器,以便有机会运行特定于机
器的修复程序。
在大多数情况下,机器的身份是不相关的,而内核将根据机器的核心CPU或SoC来选择
设置代码。例如,在ARM上,arch/arm/kernel/setup.c中的setup_arch()将调
用arch/arm/kernel/devtree.c中的setup_machine_fdt(),它通过
machine_desc表搜索并选择与设备树数据最匹配的machine_desc。它通过查看根
设备树节点中的'compatible'属性,并将其与struct machine_desc中的
dt_compat列表(如果你好奇,该列表定义在arch/arm/include/asm/mach/arch.h
中)进行比较,从而确定最佳匹配。
“compatible” 属性包含一个排序的字符串列表,以机器的确切名称开始,后面是
一个可选的与之兼容的板子列表,从最兼容到最不兼容排序。例如,TI BeagleBoard
和它的后继者BeagleBoard xM板的根兼容属性可能看起来分别为::
compatible = "ti,omap3-beagleboard", "ti,omap3450", "ti,omap3";
compatible = "ti,omap3-beagleboard-xm", "ti,omap3450", "ti,omap3";
其中“ti,omap3-beagleboard-xm”指定了确切的型号,它还声称它与OMAP 3450 SoC
以及一般的OMAP3系列SoC兼容。你会注意到,该列表从最具体的(确切的板子)到最
不具体的(SoC系列)进行排序。
聪明的读者可能会指出,Beagle xM也可以声称与原Beagle板兼容。然而,我们应
该当心在板级上这样做,因为通常情况下,即使在同一产品系列中,每块板都有很高
的变化,而且当一块板声称与另一块板兼容时,很难确定到底是什么意思。对于高层
来说,最好是谨慎行事,不要声称一块板子与另一块板子兼容。值得注意的例外是,
当一块板子是另一块板子的载体时,例如CPU模块连接到一个载体板上。
关于兼容值还有一个注意事项。在兼容属性中使用的任何字符串都必须有文件说明它
表示什么。在Documentation/devicetree/bindings中添加兼容字符串的文档。
同样在ARM上,对于每个machine_desc,内核会查看是否有任何dt_compat列表条
目出现在兼容属性中。如果有,那么该machine_desc就是驱动该机器的候选者。在搜索
了整个machine_descs表之后,setup_machine_fdt()根据每个machine_desc
在兼容属性中匹配的条目,返回 “最兼容” 的machine_desc。如果没有找到匹配
的machine_desc,那么它将返回NULL。
这个方案背后的原因是观察到,在大多数情况下,如果它们都使用相同的SoC或相同
系列的SoC,一个machine_desc可以支持大量的电路板。然而,不可避免地会有一些例
外情况,即特定的板子需要特殊的设置代码,这在一般情况下是没有用的。特殊情况
可以通过在通用设置代码中明确检查有问题的板子来处理,但如果超过几个情况下,
这样做很快就会变得很难看和/或无法维护。
相反,兼容列表允许通用machine_desc通过在dt_compat列表中指定“不太兼容”的值
来提供对广泛的通用板的支持。在上面的例子中,通用板支持可以声称与“ti,omap3”
或“ti,omap3450”兼容。如果在最初的beagleboard上发现了一个bug,需要在
早期启动时使用特殊的变通代码,那么可以添加一个新的machine_desc,实现变通,
并且只在“ti,omap3-beagleboard”上匹配。
PowerPC使用了一个稍微不同的方案,它从每个machine_desc中调用.probe()钩子,
并使用第一个返回TRUE的钩子。然而,这种方法没有考虑到兼容列表的优先级,对于
新的架构支持可能应该避免。
2.3 运行时配置
--------------
在大多数情况下,DT是将数据从固件传递给内核的唯一方法,所以也被用来传递运行
时和配置数据,如内核参数字符串和initrd镜像的位置。
这些数据大部分都包含在/chosen节点中,当启动Linux时,它看起来就像这样::
chosen {
bootargs = "console=ttyS0,115200 loglevel=8";
initrd-start = <0xc8000000>;
initrd-end = <0xc8200000>;
};
bootargs属性包含内核参数,initrd-\*属性定义initrd blob的地址和大小。注
意initrd-end是initrd映像后的第一个地址,所以这与结构体资源的通常语义不一
致。选择的节点也可以选择包含任意数量的额外属性,用于平台特定的配置数据。
在早期启动过程中,架构设置代码通过不同的辅助回调函数多次调用
of_scan_flat_dt()来解析设备树数据,然后进行分页设置。of_scan_flat_dt()
代码扫描设备树,并使用辅助函数来提取早期启动期间所需的信息。通常情况下,
early_init_dt_scan_chosen()辅助函数用于解析所选节点,包括内核参数,
early_init_dt_scan_root()用于初始化DT地址空间模型,early_init_dt_scan_memory()
用于确定可用RAM的大小和位置。
在ARM上,函数setup_machine_fdt()负责在选择支持板子的正确machine_desc
后,对设备树进行早期扫描。
2.4 设备填充
------------
在电路板被识别后,在早期配置数据被解析后,内核初始化可以以正常方式进行。在
这个过程中的某个时刻,unflatten_device_tree()被调用以将数据转换成更有
效的运行时表示。这也是调用机器特定设置钩子的时候,比如ARM上的machine_desc
.init_early()、.init_irq()和.init_machine()钩子。本节的其余部分使用
了ARM实现的例子,但所有架构在使用DT时都会做几乎相同的事情。
从名称上可以猜到,.init_early()用于在启动过程早期需要执行的任何机器特定设
置,而.init_irq()则用于设置中断处理。使用DT并不会实质性地改变这两个函数的
行为。如果提供了DT,那么.init_early()和.init_irq()都能调用任何一个DT查
询函数(of_* in include/linux/of*.h),以获得关于平台的额外数据。
DT上下文中最有趣的钩子是.init_machine(),它主要负责将平台的数据填充到
Linux设备模型中。历史上,这在嵌入式平台上是通过在板卡support .c文件中定
义一组静态时钟结构、platform_devices和其他数据,并在.init_machine()中
大量注册来实现的。当使用DT时,就不用为每个平台的静态设备进行硬编码,可以通过
解析DT获得设备列表,并动态分配设备结构体。
最简单的情况是,.init_machine()只负责注册一个platform_devices。
platform_device是Linux使用的一个概念,用于不能被硬件检测到的内存或I/O映
射的设备,以及“复合”或 “虚拟”设备(后面会详细介绍)。虽然DT没有“平台设备”的
术语,但平台设备大致对应于树根的设备节点和简单内存映射总线节点的子节点。
现在是举例说明的好时机。下面是NVIDIA Tegra板的设备树的一部分::
/{
compatible = "nvidia,harmony", "nvidia,tegra20";
#address-cells = <1>;
#size-cells = <1>;
interrupt-parent = <&intc>;
chosen { };
aliases { };
memory {
device_type = "memory";
reg = <0x00000000 0x40000000>;
};
soc {
compatible = "nvidia,tegra20-soc", "simple-bus";
#address-cells = <1>;
#size-cells = <1>;
ranges;
intc: interrupt-controller@50041000 {
compatible = "nvidia,tegra20-gic";
interrupt-controller;
#interrupt-cells = <1>;
reg = <0x50041000 0x1000>, < 0x50040100 0x0100 >;
};
serial@70006300 {
compatible = "nvidia,tegra20-uart";
reg = <0x70006300 0x100>;
interrupts = <122>;
};
i2s1: i2s@70002800 {
compatible = "nvidia,tegra20-i2s";
reg = <0x70002800 0x100>;
interrupts = <77>;
codec = <&wm8903>;
};
i2c@7000c000 {
compatible = "nvidia,tegra20-i2c";
#address-cells = <1>;
#size-cells = <0>;
reg = <0x7000c000 0x100>;
interrupts = <70>;
wm8903: codec@1a {
compatible = "wlf,wm8903";
reg = <0x1a>;
interrupts = <347>;
};
};
};
sound {
compatible = "nvidia,harmony-sound";
i2s-controller = <&i2s1>;
i2s-codec = <&wm8903>;
};
};
在.init_machine()时,Tegra板支持代码将需要查看这个DT,并决定为哪些节点
创建platform_devices。然而,看一下这个树,并不能立即看出每个节点代表什么
类型的设备,甚至不能看出一个节点是否代表一个设备。/chosen、/aliases和
/memory节点是信息节点,并不描述设备(尽管可以说内存可以被认为是一个设备)。
/soc节点的子节点是内存映射的设备,但是codec@1a是一个i2c设备,而sound节
点代表的不是一个设备,而是其他设备是如何连接在一起以创建音频子系统的。我知
道每个设备是什么,因为我熟悉电路板的设计,但是内核怎么知道每个节点该怎么做?
诀窍在于,内核从树的根部开始,寻找具有“兼容”属性的节点。首先,一般认为任何
具有“兼容”属性的节点都代表某种设备;其次,可以认为树根的任何节点要么直接连
接到处理器总线上,要么是无法用其他方式描述的杂项系统设备。对于这些节点中的
每一个,Linux都会分配和注册一个platform_device,它又可能被绑定到一个
platform_driver。
为什么为这些节点使用platform_device是一个安全的假设?嗯,就Linux对设备
的建模方式而言,几乎所有的总线类型都假定其设备是总线控制器的孩子。例如,每
个i2c_client是i2c_master的一个子节点。每个spi_device都是SPI总线的一
个子节点。类似的还有USB、PCI、MDIO等。同样的层次结构也出现在DT中,I2C设
备节点只作为I2C总线节点的子节点出现。同理,SPI、MDIO、USB等等。唯一不需
要特定类型的父设备的设备是platform_devices(和amba_devices,但后面会
详细介绍),它们将愉快地运行在Linux/sys/devices树的底部。因此,如果一个
DT节点位于树的根部,那么它真的可能最好注册为platform_device。
Linux板支持代码调用of_platform_populate(NULL, NULL, NULL, NULL)来
启动树根的设备发现。参数都是NULL,因为当从树的根部开始时,不需要提供一个起
始节点(第一个NULL),一个父结构设备(最后一个NULL),而且我们没有使用匹配
表(尚未)。对于只需要注册设备的板子,除了of_platform_populate()的调用,
.init_machine()可以完全为空。
在Tegra的例子中,这说明了/soc和/sound节点,但是SoC节点的子节点呢?它们
不应该也被注册为平台设备吗?对于Linux DT支持,一般的行为是子设备在驱动
.probe()时被父设备驱动注册。因此,一个i2c总线设备驱动程序将为每个子节点
注册一个i2c_client,一个SPI总线驱动程序将注册其spi_device子节点,其他
总线类型也是如此。根据该模型,可以编写一个与SoC节点绑定的驱动程序,并简单
地为其每个子节点注册platform_device。板卡支持代码将分配和注册一个SoC设
备,一个(理论上的)SoC设备驱动程序可以绑定到SoC设备,并在其.probe()钩
中为/soc/interruptcontroller、/soc/serial、/soc/i2s和/soc/i2c注
册platform_devices。很简单,对吗?
实际上,事实证明,将一些platform_device的子设备注册为更多的platform_device
是一种常见的模式,设备树支持代码反映了这一点,并使上述例子更简单。
of_platform_populate()的第二个参数是一个of_device_id表,任何与该表
中的条目相匹配的节点也将获得其子节点的注册。在Tegra的例子中,代码可以是
这样的::
static void __init harmony_init_machine(void)
{
/* ... */
of_platform_populate(NULL, of_default_bus_match_table, NULL, NULL);
}
“simple-bus”在Devicetree规范中被定义为一个属性,意味着一个简单的内存映射
的总线,所以of_platform_populate()代码可以被写成只是假设简单总线兼容的节
点将总是被遍历。然而,我们把它作为一个参数传入,以便电路板支持代码可以随时覆
盖默认行为。
[需要添加关于添加i2c/spi/etc子设备的讨论] 。
附录A:AMBA设备
---------------
ARM Primecell是连接到ARM AMBA总线的某种设备,它包括对硬件检测和电源管理
的一些支持。在Linux中,amba_device和amba_bus_type结构体被用来表示
Primecell设备。然而,棘手的一点是,AMBA总线上的所有设备并非都是Primecell,
而且对于Linux来说,典型的情况是amba_device和platform_device实例会出现在同
一总线段上。
当使用DT时,这给of_platform_populate()带来了问题,因为它必须决定是否将
每个节点注册为platform_device或amba_device。不幸的是,这使设备创建模型
变得有点复杂,但事实证明解决方案的侵入性并不算太强。如果一个节点与“arm,amba-primecell”
兼容,那么of_platform_populate()将把它注册为amba_device而不是
platform_device。
......@@ -5,7 +5,7 @@
\renewcommand\thesection*
\renewcommand\thesubsection*
\kerneldocCJKon
\kerneldocBeginSC
\kerneldocBeginSC{
.. _linux_doc_zh:
......@@ -56,10 +56,14 @@ TODOList:
下列文档描述了内核需要的平台固件相关信息。
.. toctree::
:maxdepth: 2
devicetree/index
TODOList:
* firmware-guide/index
* devicetree/index
应用程序开发人员文档
--------------------
......@@ -104,14 +108,17 @@ TODOList:
:maxdepth: 2
core-api/index
accounting/index
cpu-freq/index
iio/index
infiniband/index
power/index
virt/index
sound/index
filesystems/index
virt/index
infiniband/index
accounting/index
scheduler/index
vm/index
peci/index
TODOList:
......@@ -129,7 +136,6 @@ TODOList:
* netlabel/index
* networking/index
* pcmcia/index
* power/index
* target/index
* timers/index
* spi/index
......@@ -140,7 +146,6 @@ TODOList:
* gpu/index
* security/index
* crypto/index
* vm/index
* bpf/index
* usb/index
* PCI/index
......@@ -198,4 +203,4 @@ TODOList:
.. raw:: latex
\kerneldocEndSC
}\kerneldocEndSC
.. SPDX-License-Identifier: GPL-2.0-only
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/peci/index.rst
:翻译:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:校译:
=================
Linux PECI 子系统
=================
.. toctree::
peci
.. only:: subproject and html
Indices
=======
* :ref:`genindex`
.. SPDX-License-Identifier: GPL-2.0-only
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/peci/peci.rst
:翻译:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:校译:
====
概述
====
平台环境控制接口(PECI)是英特尔处理器和管理控制器(如底板管理控制器,BMC)
之间的一个通信接口。PECI提供的服务允许管理控制器通过访问各种寄存器来配置、监
控和调试平台。它定义了一个专门的命令协议,管理控制器作为PECI的发起者,处理器
作为PECI的响应者。PECI可以用于基于单处理器和多处理器的系统中。
注意:英特尔PECI规范没有作为专门的文件发布,而是作为英特尔CPU的外部设计规范
(EDS)的一部分。外部设计规范通常是不公开的。
PECI 线
---------
PECI线接口使用单线进行自时钟同步和数据传输。它不需要任何额外的控制线:物理层是一条
自时钟的单线总线信号,每一个比特都以从接近零伏的空闲状态驱动出的上升沿开始。驱动高
电平信号的持续时间决定该位是逻辑“0”还是逻辑“1”。PECI线还支持随每条消息建立的可变
数据速率。
对于PECI线,每个处理器包将在一个定义的范围内利用唯一的、固定的地址,该地址应
该与处理器插座ID有固定的关系--如果其中一个处理器被移除,它不会影响其余处理器
的地址。
PECI子系统代码内嵌文档
------------------------
该API在以下内核代码中:
include/linux/peci.h
drivers/peci/internal.h
drivers/peci/core.c
drivers/peci/request.c
PECI CPU 驱动 API
-------------------
该API在以下内核代码中:
drivers/peci/cpu.c
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/power/energy-model.rst
:翻译:
唐艺舟 Tang Yizhou <tangyeechou@gmail.com>
============
设备能量模型
============
1. 概述
-------
能量模型(EM)框架是一种驱动程序与内核子系统之间的接口。其中驱动程序了解不同
性能层级的设备所消耗的功率,而内核子系统愿意使用该信息做出能量感知决策。
设备所消耗的功率的信息来源在不同的平台上可能有很大的不同。这些功率成本在某些
情况下可以使用设备树数据来估算。在其它情况下,固件会更清楚。或者,用户空间可能
是最清楚的。以此类推。为了避免每一个客户端子系统对每一种可能的信息源自己重新
实现支持,EM框架作为一个抽象层介入,它在内核中对功率成本表的格式进行标准化,
因此能够避免多余的工作。
功率值可以用毫瓦或“抽象刻度”表示。多个子系统可能使用EM,由系统集成商来检查
功率值刻度类型的要求是否满足。可以在能量感知调度器的文档中找到一个例子
Documentation/scheduler/sched-energy.rst。对于一些子系统,比如热能或
powercap,用“抽象刻度”描述功率值可能会导致问题。这些子系统对过去使用的功率的
估算值更感兴趣,因此可能需要真实的毫瓦。这些要求的一个例子可以在智能功率分配
Documentation/driver-api/thermal/power_allocator.rst文档中找到。
内核子系统可能(基于EM内部标志位)实现了对EM注册设备是否具有不一致刻度的自动
检查。要记住的重要事情是,当功率值以“抽象刻度”表示时,从中推导以毫焦耳为单位
的真实能量消耗是不可能的。
下图描述了一个驱动的例子(这里是针对Arm的,但该方法适用于任何体系结构),它
向EM框架提供了功率成本,感兴趣的客户端可从中读取数据::
+---------------+ +-----------------+ +---------------+
| Thermal (IPA) | | Scheduler (EAS) | | Other |
+---------------+ +-----------------+ +---------------+
| | em_cpu_energy() |
| | em_cpu_get() |
+---------+ | +---------+
| | |
v v v
+---------------------+
| Energy Model |
| Framework |
+---------------------+
^ ^ ^
| | | em_dev_register_perf_domain()
+----------+ | +---------+
| | |
+---------------+ +---------------+ +--------------+
| cpufreq-dt | | arm_scmi | | Other |
+---------------+ +---------------+ +--------------+
^ ^ ^
| | |
+--------------+ +---------------+ +--------------+
| Device Tree | | Firmware | | ? |
+--------------+ +---------------+ +--------------+
对于CPU设备,EM框架管理着系统中每个“性能域”的功率成本表。一个性能域是一组
性能一起伸缩的CPU。性能域通常与CPUFreq策略具有1对1映射。一个性能域中的
所有CPU要求具有相同的微架构。不同性能域中的CPU可以有不同的微架构。
2. 核心API
----------
2.1 配置选项
^^^^^^^^^^^^
必须使能CONFIG_ENERGY_MODEL才能使用EM框架。
2.2 性能域的注册
^^^^^^^^^^^^^^^^
“高级”EM的注册
~~~~~~~~~~~~~~~~
“高级”EM因它允许驱动提供更精确的功率模型而得名。它并不受限于框架中的一些已
实现的数学公式(就像“简单”EM那样)。它可以更好地反映每个性能状态的实际功率
测量。因此,在EM静态功率(漏电流功率)是重要的情况下,应该首选这种注册方式。
驱动程序应通过以下API将性能域注册到EM框架中::
int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
struct em_data_callback *cb, cpumask_t *cpus, bool milliwatts);
驱动程序必须提供一个回调函数,为每个性能状态返回<频率,功率>元组。驱动程序
提供的回调函数可以自由地从任何相关位置(DT、固件......)以及以任何被认为是
必要的方式获取数据。只有对于CPU设备,驱动程序必须使用cpumask指定性能域的CPU。
对于CPU以外的其他设备,最后一个参数必须被设置为NULL。
最后一个参数“milliwatts”(毫瓦)设置成正确的值是很重要的,使用EM的内核
子系统可能会依赖这个标志来检查所有的EM设备是否使用相同的刻度。如果有不同的
刻度,这些子系统可能决定:返回警告/错误,停止工作或崩溃(panic)。
关于实现这个回调函数的驱动程序的例子,参见第3节。或者在第2.4节阅读这个API
的更多文档。
“简单”EM的注册
~~~~~~~~~~~~~~~~
“简单”EM是用框架的辅助函数cpufreq_register_em_with_opp()注册的。它实现了
一个和以下数学公式紧密相关的功率模型::
Power = C * V^2 * f
使用这种方法注册的EM可能无法正确反映真实设备的物理特性,例如当静态功率
(漏电流功率)很重要时。
2.3 访问性能域
^^^^^^^^^^^^^^
有两个API函数提供对能量模型的访问。em_cpu_get()以CPU id为参数,em_pd_get()
以设备指针为参数。使用哪个接口取决于子系统,但对于CPU设备来说,这两个函数都返
回相同的性能域。
对CPU的能量模型感兴趣的子系统可以通过em_cpu_get() API检索它。在创建性能域时
分配一次能量模型表,它保存在内存中不被修改。
一个性能域所消耗的能量可以使用em_cpu_energy() API来估算。该估算假定CPU设备
使用的CPUfreq监管器是schedutil。当前该计算不能提供给其它类型的设备。
关于上述API的更多细节可以在 ``<linux/energy_model.h>`` 或第2.4节中找到。
2.4 API的细节描述
^^^^^^^^^^^^^^^^^
参见 include/linux/energy_model.h 和 kernel/power/energy_model.c 的kernel doc。
3. 驱动示例
-----------
CPUFreq框架支持专用的回调函数,用于为指定的CPU(们)注册EM:
cpufreq_driver::register_em()。这个回调必须为每个特定的驱动程序正确实现,
因为框架会在设置过程中适时地调用它。本节提供了一个简单的例子,展示CPUFreq驱动
在能量模型框架中使用(假的)“foo”协议注册性能域。该驱动实现了一个est_power()
函数提供给EM框架::
-> drivers/cpufreq/foo_cpufreq.c
01 static int est_power(unsigned long *mW, unsigned long *KHz,
02 struct device *dev)
03 {
04 long freq, power;
05
06 /* 使用“foo”协议设置频率上限 */
07 freq = foo_get_freq_ceil(dev, *KHz);
08 if (freq < 0)
09 return freq;
10
11 /* 估算相关频率下设备的功率成本 */
12 power = foo_estimate_power(dev, freq);
13 if (power < 0)
14 return power;
15
16 /* 将这些值返回给EM框架 */
17 *mW = power;
18 *KHz = freq;
19
20 return 0;
21 }
22
23 static void foo_cpufreq_register_em(struct cpufreq_policy *policy)
24 {
25 struct em_data_callback em_cb = EM_DATA_CB(est_power);
26 struct device *cpu_dev;
27 int nr_opp;
28
29 cpu_dev = get_cpu_device(cpumask_first(policy->cpus));
30
31 /* 查找该策略支持的OPP数量 */
32 nr_opp = foo_get_nr_opp(policy);
33
34 /* 并注册新的性能域 */
35 em_dev_register_perf_domain(cpu_dev, nr_opp, &em_cb, policy->cpus,
36 true);
37 }
38
39 static struct cpufreq_driver foo_cpufreq_driver = {
40 .register_em = foo_cpufreq_register_em,
41 };
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/power/index.rst
:翻译:
唐艺舟 Tang Yizhou <tangyeechou@gmail.com>
========
电源管理
========
.. toctree::
:maxdepth: 1
energy-model
opp
TODOList:
* apm-acpi
* basic-pm-debugging
* charger-manager
* drivers-testing
* freezing-of-tasks
* pci
* pm_qos_interface
* power_supply_class
* runtime_pm
* s2ram
* suspend-and-cpuhotplug
* suspend-and-interrupts
* swsusp-and-swap-files
* swsusp-dmcrypt
* swsusp
* video
* tricks
* userland-swsusp
* powercap/powercap
* powercap/dtpm
* regulator/consumer
* regulator/design
* regulator/machine
* regulator/overview
* regulator/regulator
.. only:: subproject and html
Indices
=======
* :ref:`genindex`
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/power/opp.rst
:翻译:
唐艺舟 Tang Yizhou <tangyeechou@gmail.com>
======================
操作性能值(OPP)库
======================
(C) 2009-2010 Nishanth Menon <nm@ti.com>, 德州仪器公司
.. 目录
1. 简介
2. OPP链表初始注册
3. OPP搜索函数
4. OPP可用性控制函数
5. OPP数据检索函数
6. 数据结构
1. 简介
=======
1.1 何为操作性能值(OPP)?
------------------------------
当今复杂的单片系统(SoC)由多个子模块组成,这些子模块会联合工作。在一个执行不同用例
的操作系统中,并不是SoC中的所有模块都需要一直以最高频率工作。为了促成这一点,SoC中
的子模块被分组为不同域,允许一些域以较低的电压和频率运行,而其它域则以较高的“电压/
频率对”运行。
设备按域支持的由频率电压对组成的离散的元组的集合,被称为操作性能值(组),或OPPs。
举例来说:
让我们考虑一个支持下述频率、电压值的内存保护单元(MPU)设备:
{300MHz,最低电压为1V}, {800MHz,最低电压为1.2V}, {1GHz,最低电压为1.3V}
我们能将它们表示为3个OPP,如下述{Hz, uV}元组(译注:频率的单位是赫兹,电压的单位是
微伏)。
- {300000000, 1000000}
- {800000000, 1200000}
- {1000000000, 1300000}
1.2 操作性能值库
----------------
OPP库提供了一组辅助函数来组织和查询OPP信息。该库位于drivers/opp/目录下,其头文件
位于include/linux/pm_opp.h中。OPP库可以通过开启CONFIG_PM_OPP来启用。某些SoC,
如德州仪器的OMAP框架允许在不需要cpufreq的情况下可选地在某一OPP下启动。
OPP库的典型用法如下::
(用户) -> 注册一个默认的OPP集合 -> (库)
(SoC框架) -> 在必要的情况下,对某些OPP进行修改 -> OPP层
-> 搜索/检索信息的查询 ->
OPP层期望每个域由一个唯一的设备指针来表示。SoC框架在OPP层为每个设备注册了一组初始
OPP。这个链表的长度被期望是一个最优化的小数字,通常每个设备大约5个。初始链表包含了
一个OPP集合,这个集合被期望能在系统中安全使能。
关于OPP可用性的说明
^^^^^^^^^^^^^^^^^^^
随着系统的运行,SoC框架可能会基于各种外部因素选择让某些OPP在每个设备上可用或不可用,
示例:温度管理或其它异常场景中,SoC框架可能会选择禁用一个较高频率的OPP以安全地继续
运行,直到该OPP被重新启用(如果可能)。
OPP库在它的实现中达成了这个概念。以下操作函数只能对可用的OPP使用:
dev_pm_opp_find_freq_{ceil, floor}, dev_pm_opp_get_voltage,
dev_pm_opp_get_freq, dev_pm_opp_get_opp_count。
dev_pm_opp_find_freq_exact是用来查找OPP指针的,该指针可被用在dev_pm_opp_enable/
disable函数,使一个OPP在被需要时变为可用。
警告:如果对一个设备调用dev_pm_opp_enable/disable函数,OPP库的用户应该使用
dev_pm_opp_get_opp_count来刷新OPP的可用性计数。触发这些的具体机制,或者对有依赖的
子系统(比如cpufreq)的通知机制,都是由使用OPP库的SoC特定框架酌情处理的。在这些操作
中,同样需要注意刷新cpufreq表。
2. OPP链表初始注册
==================
SoC的实现会迭代调用dev_pm_opp_add函数来增加每个设备的OPP。预期SoC框架将以最优的
方式注册OPP条目 - 典型的数字范围小于5。通过注册OPP生成的OPP链表,在整个设备运行过程
中由OPP库维护。SoC框架随后可以使用dev_pm_opp_enable / disable函数动态地
控制OPP的可用性。
dev_pm_opp_add
为设备指针所指向的特定域添加一个新的OPP。OPP是用频率和电压定义的。一旦完成
添加,OPP被认为是可用的,可以用dev_pm_opp_enable/disable函数来控制其可用性。
OPP库内部用dev_pm_opp结构体存储并管理这些信息。这个函数可以被SoC框架根据SoC
的使用环境的需求来定义一个最优链表。
警告:
不要在中断上下文使用这个函数。
示例::
soc_pm_init()
{
/* 做一些事情 */
r = dev_pm_opp_add(mpu_dev, 1000000, 900000);
if (r) {
pr_err("%s: unable to register mpu opp(%d)\n", __func__, r);
goto no_cpufreq;
}
/* 做一些和cpufreq相关的事情 */
no_cpufreq:
/* 做剩余的事情 */
}
3. OPP搜索函数
==============
cpufreq等高层框架对频率进行操作,为了将频率映射到相应的OPP,OPP库提供了便利的函数
来搜索OPP库内部管理的OPP链表。这些搜索函数如果找到匹配的OPP,将返回指向该OPP的指针,
否则返回错误。这些错误预计由标准的错误检查,如IS_ERR()来处理,并由调用者采取适当的
行动。
这些函数的调用者应在使用完OPP后调用dev_pm_opp_put()。否则,OPP的内存将永远不会
被释放,并导致内存泄露。
dev_pm_opp_find_freq_exact
根据 *精确的* 频率和可用性来搜索OPP。这个函数对默认不可用的OPP特别有用。
例子:在SoC框架检测到更高频率可用的情况下,它可以使用这个函数在调用
dev_pm_opp_enable之前找到OPP::
opp = dev_pm_opp_find_freq_exact(dev, 1000000000, false);
dev_pm_opp_put(opp);
/* 不要操作指针.. 只是做有效性检查.. */
if (IS_ERR(opp)) {
pr_err("frequency not disabled!\n");
/* 触发合适的操作.. */
} else {
dev_pm_opp_enable(dev,1000000000);
}
注意:
这是唯一一个可以搜索不可用OPP的函数。
dev_pm_opp_find_freq_floor
搜索一个 *最多* 提供指定频率的可用OPP。这个函数在搜索较小的匹配或按频率
递减的顺序操作OPP信息时很有用。
例子:要找到一个设备的最高OPP::
freq = ULONG_MAX;
opp = dev_pm_opp_find_freq_floor(dev, &freq);
dev_pm_opp_put(opp);
dev_pm_opp_find_freq_ceil
搜索一个 *最少* 提供指定频率的可用OPP。这个函数在搜索较大的匹配或按频率
递增的顺序操作OPP信息时很有用。
例1:找到一个设备最小的OPP::
freq = 0;
opp = dev_pm_opp_find_freq_ceil(dev, &freq);
dev_pm_opp_put(opp);
例: 一个SoC的cpufreq_driver->target的简易实现::
soc_cpufreq_target(..)
{
/* 做策略检查等操作 */
/* 找到和请求最接近的频率 */
opp = dev_pm_opp_find_freq_ceil(dev, &freq);
dev_pm_opp_put(opp);
if (!IS_ERR(opp))
soc_switch_to_freq_voltage(freq);
else
/* 当不能满足请求时,要做的事 */
/* 做其它事 */
}
4. OPP可用性控制函数
====================
在OPP库中注册的默认OPP链表也许无法满足所有可能的场景。OPP库提供了一套函数来修改
OPP链表中的某个OPP的可用性。这使得SoC框架能够精细地动态控制哪一组OPP是可用于操作
的。设计这些函数的目的是在诸如考虑温度时 *暂时地* 删除某个OPP(例如,在温度下降
之前不要使用某OPP)。
警告:
不要在中断上下文使用这些函数。
dev_pm_opp_enable
使一个OPP可用于操作。
例子:假设1GHz的OPP只有在SoC温度低于某个阈值时才可用。SoC框架的实现可能
会选择做以下事情::
if (cur_temp < temp_low_thresh) {
/* 若1GHz未使能,则使能 */
opp = dev_pm_opp_find_freq_exact(dev, 1000000000, false);
dev_pm_opp_put(opp);
/* 仅仅是错误检查 */
if (!IS_ERR(opp))
ret = dev_pm_opp_enable(dev, 1000000000);
else
goto try_something_else;
}
dev_pm_opp_disable
使一个OPP不可用于操作。
例子:假设1GHz的OPP只有在SoC温度高于某个阈值时才可用。SoC框架的实现可能
会选择做以下事情::
if (cur_temp > temp_high_thresh) {
/* 若1GHz已使能,则关闭 */
opp = dev_pm_opp_find_freq_exact(dev, 1000000000, true);
dev_pm_opp_put(opp);
/* 仅仅是错误检查 */
if (!IS_ERR(opp))
ret = dev_pm_opp_disable(dev, 1000000000);
else
goto try_something_else;
}
5. OPP数据检索函数
==================
由于OPP库对OPP信息进行了抽象化处理,因此需要一组函数来从dev_pm_opp结构体中提取
信息。一旦使用搜索函数检索到一个OPP指针,以下函数就可以被SoC框架用来检索OPP层
内部描述的信息。
dev_pm_opp_get_voltage
检索OPP指针描述的电压。
例子: 当cpufreq切换到到不同频率时,SoC框架需要用稳压器框架将OPP描述
的电压设置到提供电压的电源管理芯片中::
soc_switch_to_freq_voltage(freq)
{
/* 做一些事情 */
opp = dev_pm_opp_find_freq_ceil(dev, &freq);
v = dev_pm_opp_get_voltage(opp);
dev_pm_opp_put(opp);
if (v)
regulator_set_voltage(.., v);
/* 做其它事 */
}
dev_pm_opp_get_freq
检索OPP指针描述的频率。
例子:比方说,SoC框架使用了几个辅助函数,通过这些函数,我们可以将OPP
指针传入,而不是传入额外的参数,用来处理一系列数据参数::
soc_cpufreq_target(..)
{
/* 做一些事情.. */
max_freq = ULONG_MAX;
max_opp = dev_pm_opp_find_freq_floor(dev,&max_freq);
requested_opp = dev_pm_opp_find_freq_ceil(dev,&freq);
if (!IS_ERR(max_opp) && !IS_ERR(requested_opp))
r = soc_test_validity(max_opp, requested_opp);
dev_pm_opp_put(max_opp);
dev_pm_opp_put(requested_opp);
/* 做其它事 */
}
soc_test_validity(..)
{
if(dev_pm_opp_get_voltage(max_opp) < dev_pm_opp_get_voltage(requested_opp))
return -EINVAL;
if(dev_pm_opp_get_freq(max_opp) < dev_pm_opp_get_freq(requested_opp))
return -EINVAL;
/* 做一些事情.. */
}
dev_pm_opp_get_opp_count
检索某个设备可用的OPP数量。
例子:假设SoC中的一个协处理器需要知道某个表中的可用频率,主处理器可以
按如下方式发出通知::
soc_notify_coproc_available_frequencies()
{
/* 做一些事情 */
num_available = dev_pm_opp_get_opp_count(dev);
speeds = kzalloc(sizeof(u32) * num_available, GFP_KERNEL);
/* 按升序填充表 */
freq = 0;
while (!IS_ERR(opp = dev_pm_opp_find_freq_ceil(dev, &freq))) {
speeds[i] = freq;
freq++;
i++;
dev_pm_opp_put(opp);
}
soc_notify_coproc(AVAILABLE_FREQs, speeds, num_available);
/* 做其它事 */
}
6. 数据结构
===========
通常,一个SoC包含多个可变电压域。每个域由一个设备指针描述。和OPP之间的关系可以
按以下方式描述::
SoC
|- device 1
| |- opp 1 (availability, freq, voltage)
| |- opp 2 ..
... ...
| `- opp n ..
|- device 2
...
`- device m
OPP库维护着一个内部链表,SoC框架使用上文描述的各个函数来填充和访问。然而,描述
真实OPP和域的结构体是OPP库自身的内部组成,以允许合适的抽象在不同系统中得到复用。
struct dev_pm_opp
OPP库的内部数据结构,用于表示一个OPP。除了频率、电压、可用性信息外,
它还包含OPP库运行所需的内部统计信息。指向这个结构体的指针被提供给
用户(比如SoC框架)使用,在与OPP层的交互中作为OPP的标识符。
警告:
结构体dev_pm_opp的指针不应该由用户解析或修改。一个实例的默认值由
dev_pm_opp_add填充,但OPP的可用性由dev_pm_opp_enable/disable函数
修改。
struct device
这用于向OPP层标识一个域。设备的性质和它的实现是由OPP库的用户决定的,
如SoC框架。
总体来说,以一个简化的视角看,对数据结构的操作可以描述为下面各图::
初始化 / 修改:
+-----+ /- dev_pm_opp_enable
dev_pm_opp_add --> | opp | <-------
| +-----+ \- dev_pm_opp_disable
\-------> domain_info(device)
搜索函数:
/-- dev_pm_opp_find_freq_ceil ---\ +-----+
domain_info<---- dev_pm_opp_find_freq_exact -----> | opp |
\-- dev_pm_opp_find_freq_floor ---/ +-----+
检索函数:
+-----+ /- dev_pm_opp_get_voltage
| opp | <---
+-----+ \- dev_pm_opp_get_freq
domain_info <- dev_pm_opp_get_opp_count
......@@ -18,6 +18,7 @@ RISC-V 体系结构
:maxdepth: 1
boot-image-header
vm-layout
pmu
patch-acceptance
......
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/riscv/vm-layout.rst
:翻译:
司延腾 Yanteng Si <siyanteng@loongson.cn>
============================
RISC-V Linux上的虚拟内存布局
============================
:作者: Alexandre Ghiti <alex@ghiti.fr>
:日期: 12 February 2021
这份文件描述了RISC-V Linux内核使用的虚拟内存布局。
32位 RISC-V Linux 内核
======================
RISC-V Linux Kernel SV32
------------------------
TODO
64位 RISC-V Linux 内核
======================
RISC-V特权架构文档指出,64位地址 "必须使第63-48位值都等于第47位,否则将发生缺页异常。":这将虚
拟地址空间分成两半,中间有一个非常大的洞,下半部分是用户空间所在的地方,上半部分是RISC-V Linux
内核所在的地方。
RISC-V Linux Kernel SV39
------------------------
::
========================================================================================================================
开始地址 | 偏移 | 结束地址 | 大小 | 虚拟内存区域描述
========================================================================================================================
| | | |
0000000000000000 | 0 | 0000003fffffffff | 256 GB | 用户空间虚拟内存,每个内存管理器不同
__________________|____________|__________________|_________|___________________________________________________________
| | | |
0000004000000000 | +256 GB | ffffffbfffffffff | ~16M TB | ... 巨大的、几乎64位宽的直到内核映射的-256GB地方
| | | | 开始偏移的非经典虚拟内存地址空洞。
| | | |
__________________|____________|__________________|_________|___________________________________________________________
|
| 内核空间的虚拟内存,在所有进程之间共享:
____________________________________________________________|___________________________________________________________
| | | |
ffffffc6fee00000 | -228 GB | ffffffc6feffffff | 2 MB | fixmap
ffffffc6ff000000 | -228 GB | ffffffc6ffffffff | 16 MB | PCI io
ffffffc700000000 | -228 GB | ffffffc7ffffffff | 4 GB | vmemmap
ffffffc800000000 | -224 GB | ffffffd7ffffffff | 64 GB | vmalloc/ioremap space
ffffffd800000000 | -160 GB | fffffff6ffffffff | 124 GB | 直接映射所有物理内存
fffffff700000000 | -36 GB | fffffffeffffffff | 32 GB | kasan
__________________|____________|__________________|_________|____________________________________________________________
|
|
____________________________________________________________|____________________________________________________________
| | | |
ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | modules, BPF
ffffffff80000000 | -2 GB | ffffffffffffffff | 2 GB | kernel
__________________|____________|__________________|_________|____________________________________________________________
......@@ -5,6 +5,7 @@
:翻译:
司延腾 Yanteng Si <siyanteng@loongson.cn>
唐艺舟 Tang Yizhou <tangyeechou@gmail.com>
:校译:
......@@ -23,16 +24,14 @@ Linux调度器
sched-design-CFS
sched-domains
sched-capacity
sched-energy
sched-nice-design
sched-stats
TODOList:
sched-bwc
sched-deadline
sched-energy
sched-nice-design
sched-rt-group
sched-stats
text_files
......
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/scheduler/sched-energy.rst
:翻译:
唐艺舟 Tang Yizhou <tangyeechou@gmail.com>
============
能量感知调度
============
1. 简介
-------
能量感知调度(EAS)使调度器有能力预测其决策对CPU所消耗的能量的影响。EAS依靠
一个能量模型(EM)来为每个任务选择一个节能的CPU,同时最小化对吞吐率的影响。
本文档致力于介绍EAS是如何工作的、它背后的主要设计决策是什么,以及使其运行
所需条件的细节。
在进一步阅读之前,请注意,在撰写本文时::
/!\ EAS不支持对称CPU拓扑的平台 /!\
EAS只在异构CPU拓扑结构(如Arm大小核,big.LITTLE)上运行。因为在这种情况下,
通过调度来节约能量的潜力是最大的。
EAS实际使用的EM不是由调度器维护的,而是一个专门的框架。关于这个框架的细节和
它提供的内容,请参考其文档(见Documentation/power/energy-model.rst)。
2. 背景和术语
-------------
从一开始就说清楚定义:
- 能量 = [焦耳] (比如供电设备上的电池提供的资源)
- 功率 = 能量/时间 = [焦耳/秒] = [瓦特]
EAS的目标是最小化能量消耗,同时仍能将工作完成。也就是说,我们要最大化::
性能 [指令数/秒]
----------------
功率 [瓦特]
它等效于最小化::
能量 [焦耳]
-----------
指令数
同时仍然获得“良好”的性能。当前调度器只考虑性能目标,因此该式子本质上是一个
可选的优化目标,它同时考虑了两个目标:能量效率和性能。
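这一等价关系可以用一个简单的量纲推导说明(设I为指令数,E为能量,t为时间):

```latex
\frac{\text{性能}}{\text{功率}}
  = \frac{I/t}{E/t}
  = \frac{I}{E}
  = \left(\frac{E}{I}\right)^{-1}
```

因此,最大化 性能/功率 与最小化 能量/指令数 是同一个目标。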
引入EM的想法是为了让调度器评估其决策的影响,而不是盲目地应用可能仅在部分
平台有正面效果的节能技术。同时,EM必须尽可能的简单,以最小化调度器的时延
影响。
简而言之,EAS改变了CFS任务分配给CPU的方式。当调度器决定一个任务应该在哪里
运行时(在唤醒期间),EM被用来在不损害系统吞吐率的情况下,从几个较好的候选
CPU中挑选一个经预测能量消耗最优的CPU。EAS的预测依赖于对平台拓扑结构特定元素
的了解,包括CPU的“算力”,以及它们各自的能量成本。
3. 拓扑信息
-----------
EAS(以及调度器的剩余部分)使用“算力”的概念来区分不同计算吞吐率的CPU。一个
CPU的“算力”代表了它在最高频率下运行时能完成的工作量,且这个值是相对系统中
算力最大的CPU而言的。算力值被归一化为1024以内,并且可与由每实体负载跟踪
(PELT)机制算出的利用率信号做对比。由于有算力值和利用率值,EAS能够估计一个
任务/CPU有多大/有多忙,并在评估性能与能量时将其考虑在内。CPU算力由特定体系
结构实现的arch_scale_cpu_capacity()回调函数提供。
EAS使用的其余平台信息是直接从能量模型(EM)框架中读取的。一个平台的EM是一张
表,表中每项代表系统中一个“性能域”的功率成本。(若要了解更多关于性能域的细节,
见Documentation/power/energy-model.rst)
当调度域被建立或重新建立时,调度器管理对拓扑代码中EM对象的引用。对于每个根域
(rd),调度器维护一个与当前rd->span相交的所有性能域的单向链表。链表中的每个
节点都包含一个指向EM框架所提供的结构体em_perf_domain的指针。
链表被附加在根域上,以应对独占的cpuset的配置。由于独占的cpuset的边界不一定与
性能域的边界一致,不同根域的链表可能包含重复的元素。
示例1
让我们考虑一个有12个CPU的平台,分成3个性能域,(pd0,pd4和pd8),按以下
方式组织::
CPUs: 0 1 2 3 4 5 6 7 8 9 10 11
PDs: |--pd0--|--pd4--|---pd8---|
RDs: |----rd1----|-----rd2-----|
现在,考虑用户空间决定将系统分成两个独占的cpusets,因此创建了两个独立的根域,
每个根域包含6个CPU。这两个根域在上图中被表示为rd1和rd2。由于pd4与rd1和rd2
都有交集,它将同时出现于附加在这两个根域的“->pd”链表中:
* rd1->pd: pd0 -> pd4
* rd2->pd: pd4 -> pd8
请注意,调度器将为pd4创建两个重复的链表节点(每个链表中各一个)。然而这
两个节点持有指向同一个EM框架的共享数据结构的指针。
由于对这些链表的访问可能与热插拔及其它事件并发发生,因此它们受RCU锁保护,就像
被调度器操控的拓扑结构体中剩下字段一样。
EAS同样维护了一个静态键(sched_energy_present),当至少有一个根域满足EAS
启动的所有条件时,这个键就会被启动。在第6节中总结了这些条件。
4. 能量感知任务放置
-------------------
EAS覆盖了CFS的任务唤醒平衡代码。在唤醒平衡时,它使用平台的EM和PELT信号来选择节能
的目标CPU。当EAS被启用时,select_task_rq_fair()调用find_energy_efficient_cpu()
来做任务放置决定。这个函数在每个性能域中寻找具有最多剩余算力(CPU算力 - CPU
利用率)的CPU,因为它能让我们保持最低的频率。然后,该函数检查把任务放在新CPU上,
相比依然放在之前活动的prev_cpu上,是否可以节省能量。
如果唤醒的任务被迁移,find_energy_efficient_cpu()使用compute_energy()来估算
系统将消耗多少能量。compute_energy()检查各CPU当前的利用率情况,并尝试调整来
“模拟”任务迁移。EM框架提供了API em_pd_energy()计算每个性能域在给定的利用率条件
下的预期能量消耗。
下面详细介绍一个优化能量消耗的任务放置决策的例子。
示例2
让我们考虑一个有两个独立性能域的(伪)平台,每个性能域含有2个CPU。CPU0和CPU1
是小核,CPU2和CPU3是大核。
调度器必须决定将任务P放在哪个CPU上,这个任务的util_avg = 200(平均利用率),
prev_cpu = 0(上一次运行在CPU0)。
目前CPU的利用率情况如下图所示。CPU 0-3的util_avg分别为400、100、600和500。
每个性能域有三个操作性能值(OPP)。与每个OPP相关的CPU算力和功率成本列在能量
模型表中。P的util_avg在图中显示为"PP"::
CPU util.
1024 - - - - - - - Energy Model
+-----------+-------------+
| Little | Big |
768 ============= +-----+-----+------+------+
| Cap | Pwr | Cap | Pwr |
+-----+-----+------+------+
512 =========== - ##- - - - - | 170 | 50 | 512 | 400 |
## ## | 341 | 150 | 768 | 800 |
341 -PP - - - - ## ## | 512 | 300 | 1024 | 1700 |
PP ## ## +-----+-----+------+------+
170 -## - - - - ## ##
## ## ## ##
------------ -------------
CPU0 CPU1 CPU2 CPU3
Current OPP: ===== Other OPP: - - - util_avg (100 each): ##
find_energy_efficient_cpu()将首先在两个性能域中寻找具有最大剩余算力的CPU。
在这个例子中是CPU1和CPU3。然后,它将估算,当P被放在它们中的任意一个时,系统的
能耗,并检查这样做是否会比把P放在CPU0上节省一些能量。EAS假定OPPs遵循利用率
(这与CPUFreq监管器schedutil的行为一致。关于这个问题的更多细节,见第6节)。
**情况1. P被迁移到CPU1**::
1024 - - - - - - -
Energy calculation:
768 ============= * CPU0: 200 / 341 * 150 = 88
* CPU1: 300 / 341 * 150 = 131
* CPU2: 600 / 768 * 800 = 625
512 - - - - - - - ##- - - - - * CPU3: 500 / 768 * 800 = 520
## ## => total_energy = 1364
341 =========== ## ##
PP ## ##
170 -## - - PP- ## ##
## ## ## ##
------------ -------------
CPU0 CPU1 CPU2 CPU3
**情况2. P被迁移到CPU3**::
1024 - - - - - - -
Energy calculation:
768 ============= * CPU0: 200 / 341 * 150 = 88
* CPU1: 100 / 341 * 150 = 43
PP * CPU2: 600 / 768 * 800 = 625
512 - - - - - - - ##- - -PP - * CPU3: 700 / 768 * 800 = 729
## ## => total_energy = 1485
341 =========== ## ##
## ##
170 -## - - - - ## ##
## ## ## ##
------------ -------------
CPU0 CPU1 CPU2 CPU3
**情况3. P依旧留在prev_cpu/CPU0**::
1024 - - - - - - -
Energy calculation:
768 ============= * CPU0: 400 / 512 * 300 = 234
* CPU1: 100 / 512 * 300 = 58
* CPU2: 600 / 768 * 800 = 625
512 =========== - ##- - - - - * CPU3: 500 / 768 * 800 = 520
## ## => total_energy = 1437
341 -PP - - - - ## ##
PP ## ##
170 -## - - - - ## ##
## ## ## ##
------------ -------------
CPU0 CPU1 CPU2 CPU3
从这些计算结果来看,情况1的总能量最低。所以从节约能量的角度看,CPU1是最佳候选
者。
大核通常比小核更耗电,因此主要在任务不适合在小核运行时使用。然而,小核并不总是比
大核节能。举例来说,对于某些系统,小核的高OPP可能比大核的低OPP能量消耗更高。因此,
如果小核在某一特定时间点刚好有足够的利用率,在此刻被唤醒的小任务放在大核执行可能
会更节能,尽管它在小核上运行也是合适的。
即使在大核所有OPP都不如小核OPP节能的情况下,在某些特定条件下,令小任务运行在大核
上依然可能节能。事实上,将一个任务放在一个小核上可能导致整个性能域的OPP提高,这将
增加已经在该性能域运行的任务的能量成本。如果唤醒的任务被放在一个大核上,它的执行
成本可能比放在小核上更高,但这不会影响小核上的其它任务,这些任务将继续以较低的OPP
运行。因此,当考虑CPU消耗的总能量时,在大核上运行一个任务的额外成本可能小于为所有
其它运行在小核的任务提高OPP的成本。
上面的例子几乎不可能以一种通用的方式得到正确的结果;同时,对于所有平台,在不知道
系统所有CPU每个不同OPP的运行成本时,也无法得到正确的结果。得益于基于EM的设计,
EAS应该能够正确处理这些问题而不会引发太多麻烦。然而,为了确保对高利用率场景的
吞吐率造成的影响最小化,EAS同样实现了另外一种叫“过度利用率”的机制。
5. 过度利用率
-------------
从一般的角度来看,EAS能提供最大帮助的是那些涉及低、中CPU利用率的使用场景。每当CPU
密集型的长任务运行,它们将需要所有的可用CPU算力,调度器将没有什么办法来节省能量同时
又不损害吞吐率。为了避免EAS损害性能,一旦CPU被使用的算力超过80%,它将被标记为“过度
利用”。只要根域中没有CPU是过度利用状态,负载均衡被禁用,而EAS将覆盖唤醒平衡代码。
EAS很可能将负载放置在系统中能量效率最高的CPU而不是其它CPU上,只要不损害吞吐率。
因此,负载均衡器被禁用以防止它打破EAS发现的节能任务放置。当系统未处于过度利用状态时,
这样做是安全的,因为低于80%的临界点意味着:
a. 所有的CPU都有一些空闲时间,所以EAS使用的利用率信号很可能准确地代表各种任务
的“大小”。
b. 所有任务,不管它们的nice值是多大,都应该被提供了足够多的CPU算力。
c. 既然有多余的算力,那么所有的任务都必须定期阻塞/休眠,在唤醒时进行平衡就足够
了。
只要一个CPU利用率超过80%的临界点,上述三个假设中至少有一个是不正确的。在这种情况下,
整个根域的“过度利用”标志被设置,EAS被禁用,负载均衡器被重新启用。通过这样做,调度器
又回到了在CPU密集的条件下基于负载的算法做负载均衡。这更好地尊重了任务的nice值。
由于过度利用率的概念在很大程度上依赖于检测系统中是否有一些空闲时间,所以必须考虑
(比CFS)更高优先级的调度类(以及中断)“窃取”的CPU算力。像这样,对过度使用率的检测
不仅要考虑CFS任务使用的算力,还需要考虑其它调度类和中断。
6. EAS的依赖和要求
------------------
能量感知调度依赖系统的CPU具有特定的硬件属性,以及内核中的其它特性被启用。本节列出
了这些依赖,并对如何满足这些依赖提供了提示。
6.1 - 非对称CPU拓扑
^^^^^^^^^^^^^^^^^^^
如简介所提,目前只有非对称CPU拓扑结构的平台支持EAS。通过在运行时查询
SD_ASYM_CPUCAPACITY_FULL标志位是否在创建调度域时已设置来检查这一要求是否满足。
参阅Documentation/scheduler/sched-capacity.rst以了解在sched_domain层次结构
中设置此标志位所需满足的要求。
请注意,EAS并非从根本上与SMP不兼容,但在SMP平台上还没有观察到明显的节能。这一
限制可以在将来进行修改,如果被证明不是这样的话。
6.2 - 当前的能量模型
^^^^^^^^^^^^^^^^^^^^
EAS使用一个平台的EM来估算调度决策对能量的影响。因此,你的平台必须向EM框架提供
能量成本表,以启动EAS。要做到这一点,请参阅文档
Documentation/power/energy-model.rst中的独立EM框架部分。
另请注意,调度域需要在EM注册后重建,以便启动EAS。
EAS使用EM对能量使用进行预测性决策,因此它在检查任务放置的各个可能选项时,关注
的是能量的相对差异。对于EAS来说,EM的功率值是以毫瓦还是以“抽象刻度”为单位表示
并不重要。
6.3 - 能量模型复杂度
^^^^^^^^^^^^^^^^^^^^
任务唤醒路径是时延敏感的。当一个平台的EM太复杂(太多CPU,太多性能域,太多状态
等),在唤醒路径中使用它的成本就会升高到不可接受。能量感知唤醒算法的复杂度为:
C = Nd * (Nc + Ns)
其中:Nd是性能域的数量;Nc是CPU的数量;Ns是OPP的总数(例如:对于两个性能域,
每个域有4个OPP,则Ns = 8)。
当调度域建立时,复杂性检查是在根域上进行的。如果一个根域的复杂度C恰好高于完全
主观设定的EM_MAX_COMPLEXITY阈值(在本文写作时,是2048),则EAS不会在此根域
启动。
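这一复杂度检查可以用如下草图表达(EM_MAX_COMPLEXITY取文中写作时的值,仅为示意)::

```python
EM_MAX_COMPLEXITY = 2048  # 文中提到的、完全主观设定的阈值(写作时)

def em_complexity(nr_domains, nr_cpus, nr_opps_total):
    # C = Nd * (Nc + Ns)
    return nr_domains * (nr_cpus + nr_opps_total)

# 例:2个性能域、8个CPU、每个域4个OPP(Ns = 8)
c = em_complexity(2, 8, 8)
print(c, c <= EM_MAX_COMPLEXITY)  # 32 True —— EAS可以启动
```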
如果你的平台的能量模型的复杂度太高,EAS无法在这个根域上使用,但你真的想用,
那么你就只剩下两个可能的选择:
1. 将你的系统拆分成分离的、较小的、使用独占cpuset的根域,并在每个根域局部
启用EAS。这个方案的好处是开箱即用,但缺点是无法在根域之间实现负载均衡,
这可能会导致总体系统负载不均衡。
2. 提交补丁以降低EAS唤醒算法的复杂度,从而使其能够在合理的时间内处理更大
的EM。
6.4 - Schedutil监管器
^^^^^^^^^^^^^^^^^^^^^
EAS试图预测CPU在不久的将来会在哪个OPP下运行,以估算它们的能量消耗。为了做到
这一点,它假定CPU的OPP跟随CPU利用率变化而变化。
尽管在实践中很难对这一假设的准确性提供硬性保证(因为,举例来说,硬件可能不会做
它被要求做的事情),相对于其他CPUFreq监管器,schedutil至少 _请求_ 使用利用率
信号计算的频率。因此,与EAS一起使用的唯一合理的监管器是schedutil,因为它是
唯一一个在频率请求和能量预测之间提供某种程度的一致性的监管器。
不支持将EAS与schedutil之外的任何其它监管器一起使用。
6.5 刻度不变性使用率信号
^^^^^^^^^^^^^^^^^^^^^^^^
为了对不同的CPU和所有的性能状态做出准确的预测,EAS需要频率不变的和CPU不变的
PELT信号。这些信号可以通过每个体系结构定义的arch_scale{cpu,freq}_capacity()
回调函数获取。
不支持在没有实现这两个回调函数的平台上使用EAS。
6.6 多线程(SMT)
^^^^^^^^^^^^^^^^^
当前实现的EAS是不感知SMT的,因此无法利用多线程硬件节约能量。EAS认为线程是独立的
CPU,这实际上对性能和能量消耗都是不利的。
不支持在SMT上使用EAS。
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/scheduler/sched-nice-design.rst
:翻译:
唐艺舟 Tang Yizhou <tangyeechou@gmail.com>
=====================
调度器nice值设计
=====================
本文档解释了新的Linux调度器中修改和精简后的nice级别的实现思路。
Linux的nice级别总是非常脆弱,人们持续不断地缠着我们,让nice +19的任务占用
更少的CPU时间。
不幸的是,在旧的调度器中,这不是那么容易实现的(否则我们早就做到了),因为对
nice级别的支持在历史上是与时间片长度耦合的,而时间片单位是由HZ滴答驱动的,
所以最小的时间片是1/HZ。
在O(1)调度器中(2003年),我们改变了负的nice级别,使它们比2.4内核更强
(人们对这一变化很满意),而且我们还故意校正了线性时间片准则,使得nice +19
的级别 _正好_ 是1 jiffy。为了让大家更好地理解它,时间片的图会是这样的(质量
不佳的ASCII艺术提醒!)::
A
\ | [timeslice length]
\ |
\ |
\ |
\ |
\|___100msecs
|^ . _
| ^ . _
| ^ . _
-*----------------------------------*-----> [nice level]
-20 | +19
|
|
因此,如果有人真的想renice任务,相较线性规则,+19会给出更大的效果(改变
ABI来扩展优先级的解决方案在早期就被放弃了)。
这种方法在一定程度上奏效了一段时间,但后来HZ=1000时,它导致1 jiffy为
1ms,这意味着0.1%的CPU使用率,我们认为这有点过度。过度 _不是_ 因为它表示
的CPU使用率过小,而是因为它引发了过于频繁(每毫秒1次)的重新调度(因此会
破坏缓存,等等。请记住,硬件更弱、cache更小是很久以前的事了,当时人们在
nice +19级别运行数量颇多的应用程序)。
因此,对于HZ=1000,我们将nice +19改为5毫秒,因为这感觉像是正确的最小
粒度——这相当于5%的CPU利用率。但nice +19的根本的HZ敏感属性依然保持不变,
我们没有收到过关于nice +19在CPU利用率方面太 _弱_ 的任何抱怨,我们只收到
过它(依然)太 _强_ 的抱怨 :-)。
总结一下:我们一直想让nice各级别一致性更强,但在HZ和jiffies的限制下,以及
nice级别与时间片、调度粒度耦合是令人讨厌的设计,这一目标并不真正可行。
第二个关于Linux nice级别支持的抱怨是(不那么频繁,但仍然定期发生),它
在原点周围的不对称性(你可以在上面的图片中看到),或者更准确地说:事实上
nice级别的行为取决于 _绝对的_ nice级别,而nice应用程序接口本身从根本上
说是“相对”的:
int nice(int inc);
asmlinkage long sys_nice(int increment)
(第一个是glibc的应用程序接口,第二个是syscall的应用程序接口)
注意,“inc”是相对当前nice级别而言的,类似bash的“nice”命令等工具是这个
相对性应用程序接口的镜像。
在旧的调度器中,举例来说,如果你以nice +1启动一个任务,并以nice +2启动
另一个任务,这两个任务的CPU分配将取决于父外壳程序的nice级别——如果它是
nice -10,那么CPU的分配不同于+5或+10。
第三个关于Linux nice级别支持的抱怨是,负数nice级别“不够有力”,以至于很多人
不得不诉诸于实时调度优先级来运行音频(和其它多媒体)应用程序,比如
SCHED_FIFO。但这也造成了其它问题:SCHED_FIFO未被证明是免于饥饿的,一个
有问题的SCHED_FIFO应用程序也会锁住运行良好的系统。
v2.6.23版内核的新调度器解决了这三种类型的抱怨:
为了解决第一个抱怨(nice级别不够“有力”),调度器与“时间片”、HZ的概念
解耦(调度粒度被处理成一个和nice级别独立的概念),因此有可能实现更好、
更一致的nice +19支持:在新的调度器中,nice +19的任务得到一个HZ无关的
1.5%CPU使用率,而不是旧版调度器中3%-5%-9%的可变范围。
为了解决第二个抱怨(nice各级别不一致),新调度器令调用nice(1)对各任务的
CPU利用率有相同的影响,无论其绝对nice级别如何。所以在新调度器上,运行一个
nice +10和一个nice +11的任务会与运行一个nice -5和一个nice -4的任务的
CPU利用率分割是相同的(一个会得到55%的CPU,另一个会得到45%)。这是为什么
nice级别被改为“乘法”(或指数)——这样的话,不管你从哪个级别开始,“相对”
结果将总是一样的。
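这种“乘法”效果可以用一个简化的权重模型来演示(假设相邻nice级别的权重比约为1.25;这只是对内核预计算权重表思路的近似,并非实际实现)::

```python
def nice_weight(nice):
    # 示意:每差一个nice级别,权重约相差1.25倍
    return 1024 / (1.25 ** nice)

def cpu_share(nice_a, nice_b):
    # 两个竞争任务中,nice_a任务获得的CPU份额
    wa, wb = nice_weight(nice_a), nice_weight(nice_b)
    return wa / (wa + wb)

# 无论绝对级别如何,相差1个nice级别的两个任务的CPU分配比例都相同
print(round(cpu_share(10, 11), 2), round(cpu_share(-5, -4), 2))  # 0.56 0.56
```

这正是文中所说的“相对”结果与绝对nice级别无关:0.56对0.44,约等于文中的55%对45%。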
第三个抱怨(负数nice级别不够“有力”,并迫使音频应用程序在更危险的
SCHED_FIFO调度策略下运行)几乎被新的调度器自动解决了:更强的负数级别
具有重新校正nice级别动态范围的自动化副作用。
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/scheduler/sched-stats.rst
:翻译:
唐艺舟 Tang Yizhou <tangyeechou@gmail.com>
==============
调度器统计数据
==============
第15版schedstats去掉了sched_yield的一些计数器:yld_exp_empty,yld_act_empty
和yld_both_empty。在其它方面和第14版完全相同。
第14版schedstats包括对sched_domains(译注:调度域)的支持,该特性进入内核
主线2.6.20,不过这一版schedstats与2.6.13-2.6.19内核的版本12的统计数据是完全
相同的(内核未发布第13版)。有些计数器按每个运行队列统计是更有意义的,其它则
按每个调度域统计是更有意义的。注意,调度域(以及它们的附属信息)仅在开启
CONFIG_SMP的机器上是相关的和可用的。
在第14版schedstat中,每个被列出的CPU至少会有一级域统计数据,且很可能有一个
以上的域。在这个实现中,域没有特别的名字,但是编号最高的域通常在机器上所有的
CPU上仲裁平衡,而domain0是最紧密聚焦的域,有时仅在一对CPU之间进行平衡。此时,
没有任何体系结构需要3层以上的域。域统计数据中的第一个字段是一个位图,表明哪些
CPU受该域的影响。
这些字段是计数器,而且只能递增。使用这些字段的程序将需要从基线观测开始,然后在
后续每一个观测中计算出计数器的变化。一个能以这种方式处理其中很多字段的perl脚本
可见
http://eaglet.pdxhosts.com/rick/linux/schedstat/
请注意,任何这样的脚本都必须是特定于版本的,改变版本的主要原因是输出格式的变化。
对于那些希望编写自己的脚本的人,可以参考这里描述的各个字段。
CPU统计数据
-----------
cpu<N> 1 2 3 4 5 6 7 8 9
第一个字段是sched_yield()的统计数据:
1) sched_yield()被调用了#次
接下来的三个是schedule()的统计数据:
2) 这个字段是一个过时的数组过期计数,在O(1)调度器中使用。为了ABI兼容性,
我们保留了它,但它总是被设置为0。
3) schedule()被调用了#次
4) 调用schedule()导致处理器变为空闲了#次
接下来的两个是try_to_wake_up()的统计数据:
5) try_to_wake_up()被调用了#次
6) 调用try_to_wake_up()导致本地CPU被唤醒了#次
接下来的三个统计数据描述了调度延迟:
7) 本处理器运行任务的总时间,单位是jiffies
8) 本处理器任务等待运行的时间,单位是jiffies
9) 本CPU运行了#个时间片
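上述cpu行可以用如下草图解析(字段名为依照上文描述起的示意性名称,并非内核导出的名称)::

```python
def parse_cpu_line(line):
    # 解析形如 "cpu0 1 2 3 4 5 6 7 8 9" 的一行,字段含义见上文
    fields = line.split()
    name, values = fields[0], [int(v) for v in fields[1:]]
    keys = ["yld_count", "legacy_always_zero", "schedule_count",
            "sched_goidle", "ttwu_count", "ttwu_local",
            "run_time_jiffies", "wait_time_jiffies", "timeslices"]
    return name, dict(zip(keys, values))

name, stats = parse_cpu_line("cpu0 10 0 500 120 300 280 9000 1200 480")
print(name, stats["schedule_count"], stats["sched_goidle"])  # cpu0 500 120
```

注意这些字段都是只增计数器,实际使用时应按文中所说,对两次观测做差。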
域统计数据
----------
对于每个被描述的CPU,和它相关的每一个调度域均会产生下面一行数据(注意,如果
CONFIG_SMP没有被定义,那么*没有*调度域被使用,这些行不会出现在输出中)。
domain<N> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
第一个字段是一个位掩码,表明该域在操作哪些CPU。
接下来的24个字段是load_balance()函数的各个统计数据,按空闲类型分组(空闲,
繁忙,新空闲):
1) 当CPU空闲时,load_balance()在这个调度域中被调用了#次
2) 当CPU空闲时,load_balance()在这个调度域中被调用,但是发现负载无需
均衡#次
3) 当CPU空闲时,load_balance()在这个调度域中被调用,试图迁移1个或更多
任务且失败了#次
4) 当CPU空闲时,load_balance()在这个调度域中被调用,发现不均衡(如果有)
#次
5) 当CPU空闲时,pull_task()在这个调度域中被调用#次
6) 当CPU空闲时,尽管目标任务是热缓存状态,pull_task()依然被调用#次
7) 当CPU空闲时,load_balance()在这个调度域中被调用,未能找到更繁忙的
队列#次
8) 当CPU空闲时,在调度域中找到了更繁忙的队列,但未找到更繁忙的调度组
#次
9) 当CPU繁忙时,load_balance()在这个调度域中被调用了#次
10) 当CPU繁忙时,load_balance()在这个调度域中被调用,但是发现负载无需
均衡#次
11) 当CPU繁忙时,load_balance()在这个调度域中被调用,试图迁移1个或更多
任务且失败了#次
12) 当CPU繁忙时,load_balance()在这个调度域中被调用,发现不均衡(如果有)
#次
13) 当CPU繁忙时,pull_task()在这个调度域中被调用#次
14) 当CPU繁忙时,尽管目标任务是热缓存状态,pull_task()依然被调用#次
15) 当CPU繁忙时,load_balance()在这个调度域中被调用,未能找到更繁忙的
队列#次
16) 当CPU繁忙时,在调度域中找到了更繁忙的队列,但未找到更繁忙的调度组
#次
17) 当CPU新空闲时,load_balance()在这个调度域中被调用了#次
18) 当CPU新空闲时,load_balance()在这个调度域中被调用,但是发现负载无需
均衡#次
19) 当CPU新空闲时,load_balance()在这个调度域中被调用,试图迁移1个或更多
任务且失败了#次
20) 当CPU新空闲时,load_balance()在这个调度域中被调用,发现不均衡(如果有)
#次
21) 当CPU新空闲时,pull_task()在这个调度域中被调用#次
22) 当CPU新空闲时,尽管目标任务是热缓存状态,pull_task()依然被调用#次
23) 当CPU新空闲时,load_balance()在这个调度域中被调用,未能找到更繁忙的
队列#次
24) 当CPU新空闲时,在调度域中找到了更繁忙的队列,但未找到更繁忙的调度组
#次
接下来的3个字段是active_load_balance()函数的各个统计数据:
25) active_load_balance()被调用了#次
26) active_load_balance()被调用,试图迁移1个或更多任务且失败了#次
27) active_load_balance()被调用,成功迁移了#个任务
接下来的3个字段是sched_balance_exec()函数的各个统计数据:
28) sbe_cnt不再被使用
29) sbe_balanced不再被使用
30) sbe_pushed不再被使用
接下来的3个字段是sched_balance_fork()函数的各个统计数据:
31) sbf_cnt不再被使用
32) sbf_balanced不再被使用
33) sbf_pushed不再被使用
接下来的3个字段是try_to_wake_up()函数的各个统计数据:
34) 在这个调度域中调用try_to_wake_up()唤醒任务时,任务在调度域中一个
和上次运行不同的新CPU上运行了#次
35) 在这个调度域中调用try_to_wake_up()唤醒任务时,任务被迁移到发生唤醒
的CPU次数为#,因为该任务在原CPU是冷缓存状态
36) 在这个调度域中调用try_to_wake_up()唤醒任务时,引发被动负载均衡#次
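上面36个字段可以按空闲类型分组解析,如下草图所示(字段名为示意性命名)::

```python
def parse_domain_line(line):
    # 解析形如 "domainN <cpumask> f1 ... f36" 的一行,
    # 前24个字段按空闲类型(idle/busy/newidle)每8个一组
    parts = line.split()
    name, cpumask = parts[0], parts[1]
    f = [int(x) for x in parts[2:]]
    lb_keys = ["lb_count", "lb_balanced", "lb_failed", "lb_imbalance",
               "lb_gained", "lb_hot_gained", "lb_nobusyq", "lb_nobusyg"]
    stats = {}
    for i, idle in enumerate(["idle", "busy", "newidle"]):
        stats[idle] = dict(zip(lb_keys, f[i * 8:(i + 1) * 8]))
    stats["active_load_balance"] = f[24:27]   # 字段25-27
    stats["sbe_unused"] = f[27:30]            # 字段28-30,已不再使用
    stats["sbf_unused"] = f[30:33]            # 字段31-33,已不再使用
    stats["ttwu"] = f[33:36]                  # 字段34-36
    return name, cpumask, stats

_, _, s = parse_domain_line("domain0 3 " + " ".join(str(i) for i in range(1, 37)))
print(s["busy"]["lb_count"])  # 第9个字段:9
```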
/proc/<pid>/schedstat
---------------------
schedstats还添加了一个新的/proc/<pid>/schedstat文件,来提供一些进程级的
相同信息。这个文件中,有三个字段与该进程相关:
1) 在CPU上运行花费的时间
2) 在运行队列上等待的时间
3) 在CPU上运行了#个时间片
可以很容易地编写一个程序,利用这些额外的字段来报告一个特定的进程或一组进程在
调度器策略下的表现如何。这样的程序的一个简单版本可在下面的链接找到
http://eaglet.pdxhosts.com/rick/linux/schedstat/v12/latency.c
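读取这三个字段的一个最小草图如下(字段单位与含义以上文为准)::

```python
def read_task_schedstat(text):
    # /proc/<pid>/schedstat 的三个字段:运行时间、等待时间、时间片数
    run_time, wait_time, timeslices = (int(x) for x in text.split())
    return {"run_time": run_time, "wait_time": wait_time,
            "timeslices": timeslices}

s = read_task_schedstat("123456 7890 42")
# 平均每个时间片的等待时间,可作为调度延迟的一个粗略指标
print(s["wait_time"] / s["timeslices"])
```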
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/vm/active_mm.rst
:翻译:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:校译:
=========
Active MM
=========
这是linux之父回复开发者的一封邮件,所以翻译时我尽量保持邮件格式的完整。
::
List: linux-kernel
Subject: Re: active_mm
From: Linus Torvalds <torvalds () transmeta ! com>
Date: 1999-07-30 21:36:24
因为我并不经常写这类解释,所以已经抄送到linux-kernel邮件列表;而当我写了,
并且有更多的人阅读它们时,我感觉会更好。
1999年7月30日 星期五, David Mosberger 写道:
>
> 是否有一个简短的描述,说明task_struct中的
> "mm" 和 "active_mm"应该如何使用? (如果
> 这个问题在邮件列表中讨论过,我表示歉意--我刚
> 刚度假回来,有一段时间没能关注linux-kernel了)。
基本上,新的设定是:
- 我们有“真实地址空间”和“匿名地址空间”。区别在于,匿名地址空间根本不关心用
户级页表,所以当我们做上下文切换到匿名地址空间时,我们只是让以前的地址空间
处于活动状态。
一个“匿名地址空间”的明显用途是任何不需要任何用户映射的线程--所有的内核线
程基本上都属于这一类,但即使是“真正的”线程也可以暂时说在一定时间内它们不
会对用户空间感兴趣,调度器不妨试着避免在切换VM状态上浪费时间。目前只有老
式的bdflush sync能做到这一点。
- “tsk->mm” 指向 “真实地址空间”。对于一个匿名进程来说,tsk->mm将是NULL,
其逻辑原因是匿名进程实际上根本就 “没有” 真正的地址空间。
- 然而,我们显然需要跟踪我们为这样的匿名用户“偷用”了哪个地址空间。为此,我们
有 “tsk->active_mm”,它显示了当前活动的地址空间是什么。
规则是,对于一个有真实地址空间的进程(即tsk->mm是 non-NULL),active_mm
显然必须与真实的mm相同。
对于一个匿名进程,tsk->mm == NULL,而tsk->active_mm是匿名进程运行时
“借用”的mm。当匿名进程被调度走时,借用的地址空间被返回并清除。
为了支持所有这些,“struct mm_struct”现在有两个计数器:一个是 “mm_users”
计数器,即有多少 “真正的地址空间用户”,另一个是 “mm_count”计数器,即 “lazy”
用户(即匿名用户)的数量,如果有任何真正的用户,则加1。
通常情况下,至少有一个真正的用户,但也可能是真正的用户在另一个CPU上退出,而
一个lazy的用户仍在活动,所以你实际上得到的情况是,你有一个地址空间 **只**
被lazy的用户使用。这通常是一个短暂的生命周期状态,因为一旦调度器选中一个真正的
线程来取代这个lazy线程,这个 “僵尸” mm就会被释放,因为 “mm_count” 变成了零。
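这两个计数器的规则可以用一个简化模型来演示(这里借用现代内核中mmget/mmput/mmgrab/mmdrop的命名作示意;写这封邮件时并无这些函数名)::

```python
class MmStruct:
    # 示意模型:mm_users为“真实地址空间用户”数;
    # mm_count为lazy(匿名)用户数,若存在任何真实用户则再加1
    def __init__(self):
        self.mm_users = 1   # 创建者是第一个真实用户
        self.mm_count = 1   # “存在真实用户”贡献的那一次引用
        self.freed = False

    def mmget(self):        # 新增一个真实用户
        self.mm_users += 1

    def mmput(self):        # 一个真实用户退出
        self.mm_users -= 1
        if self.mm_users == 0:
            self.mmdrop()   # “存在真实用户”的引用随之释放

    def mmgrab(self):       # 一个lazy用户“借用”该地址空间
        self.mm_count += 1

    def mmdrop(self):       # 一个lazy用户归还
        self.mm_count -= 1
        if self.mm_count == 0:
            self.freed = True   # “僵尸”mm被释放

mm = MmStruct()
mm.mmgrab()      # 一个匿名线程借用
mm.mmput()       # 唯一的真实用户退出:地址空间只剩lazy用户
print(mm.freed)  # False —— 正是上文描述的“僵尸”状态
mm.mmdrop()
print(mm.freed)  # True —— mm_count变为零,mm被释放
```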
另外,一个新的规则是,**没有人** 再把 “init_mm” 作为一个真正的MM了。
“init_mm”应该被认为只是一个 “没有其他上下文时的lazy上下文”,事实上,它主
要是在启动时使用,当时还没有真正的VM被创建。因此,用来检查的代码
if (current->mm == &init_mm)
一般来说,应该用
if (!current->mm)
取代上面的写法(这更有意义--测试基本上是 “我们是否有一个用户环境”,并且通常
由缺页异常处理程序和类似的东西来完成)。
总之,我刚才在ftp.kernel.org上放了一个pre-patch-2.3.13-1,因为它稍微改
变了接口以适配alpha(谁会想到呢,但alpha体系结构上下文切换代码实际上最终是
最丑陋的之一--不像其他架构的MM和寄存器状态是分开的,alpha的PALcode将两者
连接起来,你需要同时切换两者)。
(文档来源 http://marc.info/?l=linux-kernel&m=93337278602211&w=2)
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/vm/balance.rst
:翻译:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:校译:
========
内存平衡
========
2000年1月开始,作者:Kanoj Sarcar <kanoj@sgi.com>
对于 !__GFP_HIGH 和 !__GFP_KSWAPD_RECLAIM 以及非 __GFP_IO 的分配,需要进行
内存平衡。
调用者避免回收的第一个原因是调用者由于持有自旋锁或处于中断环境中而无法睡眠。第二个
原因可能是,调用者愿意在不产生页面回收开销的情况下分配失败。这可能发生在有0阶回退
选项的机会主义高阶分配请求中。在这种情况下,调用者可能也希望避免唤醒kswapd。
__GFP_IO分配请求是为了防止文件系统死锁。
在没有非睡眠分配请求的情况下,做平衡似乎是有害的。页面回收可以被懒散地启动,也就是
说,只有在需要的时候(也就是区域的空闲内存为0),而不是让它成为一个主动的过程。
也就是说,内核应该尝试从直接映射池中满足对直接映射页的请求,而不是回退到dma池中,
这样就可以保持dma池为dma请求(不管是不是原子的)所填充。类似的争论也适用于高内存
和直接映射的页面。相反,如果有很多空闲的dma页,最好是通过从dma池中分配一个来满足
常规的内存请求,而不是产生常规区域平衡的开销。
在2.2中,只有当空闲页总数低于总内存的1/64时,才会启动内存平衡/页面回收。如果dma
和常规内存的比例合适,即使dma区完全空了,也很可能不会进行平衡。2.2已经在不同内存
大小的生产机器上运行,即使有这个问题存在,似乎也做得不错。在2.3中,由于HIGHMEM的
存在,这个问题变得更加严重。
在2.3中,区域平衡可以用两种方式之一来完成:根据区域的大小(可能是低级区域的大小),
我们可以在初始化阶段决定在平衡任何区域时应该争取多少空闲页。好的方面是,在平衡的时
候,我们不需要看低级区的大小,坏的方面是,我们可能会因为忽略低级区可能较低的使用率
而做过于频繁的平衡。另外,只要对分配程序稍作修改,就有可能将memclass()宏简化为一
个简单的等式。
另一个可能的解决方案是,我们只在一个区 **和** 其所有低级区的空闲内存低于该区及其
低级区总内存的1/64时进行平衡。这就解决了2.2的平衡问题,并尽可能地保持了与2.2行为
的接近。另外,平衡算法在各种架构上的工作方式也是一样的,这些架构有不同数量和类型的
内存区。如果我们想变得更花哨一点,我们可以在未来为不同区域的自由页面分配不同的权重。
请注意,如果普通区的大小与dma区相比是巨大的,那么在决定是否平衡普通区的时候,考虑
空闲的dma页就变得不那么重要了。那么第一个解决方案就变得更有吸引力。
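第二种方案的判定条件可以用如下草图表达(区的大小、空闲页数均为假设值,仅为示意)::

```python
def zone_needs_balance(zones):
    # 第二种方案:当某个区 **和** 其所有低级区的空闲页总和
    # 低于该区及其低级区总内存的1/64时,才进行平衡。
    # zones按从低级区到当前区排列,例如 [dma, normal]
    free = sum(z["free"] for z in zones)
    total = sum(z["size"] for z in zones)
    return free < total / 64

# 假设的例子(单位:页):dma区16K页、normal区880K页
print(zone_needs_balance([{"size": 16 << 10, "free": 100},
                          {"size": 880 << 10, "free": 12000}]))  # True
```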
所附的补丁实现了第二个解决方案。它还 “修复”了两个问题:首先,在低内存条件下,kswapd
被唤醒,就像2.2中的非睡眠分配。第二,HIGHMEM区也被平衡了,以便给replace_with_highmem()
一个争取获得HIGHMEM页的机会,同时确保HIGHMEM分配不会落回普通区。这也确保了HIGHMEM
页不会被泄露(例如,在一个HIGHMEM页在交换缓存中但没有被任何人使用的情况下)。
kswapd还需要知道它应该平衡哪些区。kswapd主要是在无法进行平衡的情况下需要的,可能
是因为所有的分配请求都来自中断上下文,而所有的进程上下文都在睡眠。对于2.3,
kswapd并不真正需要平衡高内存区,因为中断上下文并不请求高内存页。kswapd看zone
结构体中的zone_wake_kswapd字段来决定一个区是否需要平衡。
如果从进程内存和shm中偷取页面可以减轻该页面节点中任何区的内存压力,而该区的内存压力
已经低于其水位,则会进行偷取。
watermark[WMARK_MIN/WMARK_LOW/WMARK_HIGH]/low_on_memory/zone_wake_kswapd:
这些是每个区的字段,用于确定一个区何时需要平衡。当页面数低于水位[WMARK_MIN]时,
带滞回(hysteresis)特性的字段low_on_memory被设置。这个字段会一直被设置,直到空闲页数达到水位
[WMARK_HIGH]。当low_on_memory被设置时,页面分配请求将尝试释放该区域的一些页面(如果
请求中设置了GFP_WAIT)。与此相反的是,决定唤醒kswapd以释放一些区的页。这个决定不是基于
hysteresis 的,而是当空闲页的数量低于watermark[WMARK_LOW]时就会进行;在这种情况下,
zone_wake_kswapd也被设置。
我所听到的(超棒的)想法:
1. 动态经历应该影响平衡:可以跟踪一个区的失败请求的数量,并反馈到平衡方案中(jalvo@mbay.net)。
2. 实现一个类似于replace_with_highmem()的replace_with_regular(),以保留dma页面。
(lkd@tantalophile.demon.co.uk)
.. SPDX-License-Identifier: GPL-2.0
:Original: Documentation/vm/damon/api.rst
:翻译:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:校译:
=======
API参考
=======
内核空间的程序可以使用下面的API来使用DAMON的每个功能。你所需要做的就是引用 ``damon.h`` ,
它位于源代码树的include/linux/。
结构体
======
该API在以下内核代码中:
include/linux/damon.h
函数
====
该API在以下内核代码中:
mm/damon/core.c
.. SPDX-License-Identifier: GPL-2.0
:Original: Documentation/vm/damon/design.rst
:翻译:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:校译:
====
设计
====
可配置的层
==========
DAMON提供了数据访问监控功能,同时使其准确性和开销可控。基本的访问监控需要依赖于目标地址空间
并为之优化的基元。另一方面,作为DAMON的核心,准确性和开销的权衡机制是在纯逻辑空间中。DAMON
将这两部分分离在不同的层中,并定义了它的接口,以允许各种低层次的基元实现与核心逻辑的配置。
由于这种分离的设计和可配置的接口,用户可以通过配置核心逻辑和适当的低级基元实现来扩展DAMON的
任何地址空间。如果没有提供合适的,用户可以自己实现基元。
例如,物理内存、虚拟内存、交换空间、那些特定的进程、NUMA节点、文件和支持的内存设备将被支持。
另外,如果某些架构或设备支持特殊的优化访问检查基元,这些基元将很容易被配置。
特定地址空间基元的参考实现
==========================
基本访问监测的低级基元被定义为两部分:
1. 确定地址空间的监测目标地址范围
2. 目标空间中特定地址范围的访问检查。
DAMON目前为物理和虚拟地址空间提供了基元的实现。下面两个小节描述了这些工作的方式。
基于VMA的目标地址范围构造
-------------------------
这仅仅是针对虚拟地址空间基元的实现。对于物理地址空间,只是要求用户手动设置监控目标地址范围。
在进程的超级巨大的虚拟地址空间中,只有小部分被映射到物理内存并被访问。因此,跟踪未映射的地
址区域只是一种浪费。然而,由于DAMON可以使用自适应区域调整机制来处理一定程度的噪声,所以严
格来说,跟踪每一个映射并不是必须的,但在某些情况下甚至会产生很高的开销。也就是说,监测目标
内部过于巨大的未映射区域应该被移除,以不占用自适应机制的时间。
出于这个原因,这个实现将复杂的映射转换为三个不同的区域,覆盖地址空间的每个映射区域。这三个
区域之间的两个空隙是给定地址空间中两个最大的未映射区域。这两个最大的未映射区域是堆和最上面
的mmap()区域之间的间隙,以及最下面的mmap()区域和栈之间的间隙(在大多数情况下如此)。因为
这些间隙在通常的地址空间中异常巨大,排除它们就足以做出合理的权衡。下面详细说明了这一点::
<heap>
<BIG UNMAPPED REGION 1>
<uppermost mmap()-ed region>
(small mmap()-ed regions and munmap()-ed regions)
<lowermost mmap()-ed region>
<BIG UNMAPPED REGION 2>
<stack>
基于PTE访问位的访问检查
-----------------------
物理和虚拟地址空间的实现都使用PTE Accessed-bit进行基本访问检查。唯一的区别在于从地址中
找到相关的PTE访问位的方式。虚拟地址的实现是为该地址的目标任务查找页表,而物理地址的实现则
是查找与该地址有映射关系的每一个页表。通过这种方式,实现者找到并清除下一个采样目标地址的位,
并检查该位是否在一个采样周期后再次设置。这可能会干扰其他使用访问位的内核子系统,即空闲页跟
踪和回收逻辑。为了避免这种干扰,DAMON使其与空闲页面跟踪相互排斥,并使用 ``PG_idle`` 和
``PG_young`` 页面标志来解决与回收逻辑的冲突,就像空闲页面跟踪那样。
独立于地址空间的核心机制
========================
下面四个部分分别描述了DAMON的核心机制和五个监测属性,即 ``采样间隔`` 、 ``聚集间隔`` 、
``区域更新间隔`` 、 ``最小区域数`` 和 ``最大区域数`` 。
访问频率监测
------------
DAMON的输出显示了在给定的时间内哪些页面的访问频率是多少。访问频率的分辨率是通过设置
``采样间隔`` 和 ``聚集间隔`` 来控制的。详细地说,DAMON检查每个 ``采样间隔`` 对每
个页面的访问,并将结果汇总。换句话说,计算每个页面的访问次数。在每个 ``聚合间隔`` 过
去后,DAMON调用先前由用户注册的回调函数,以便用户可以阅读聚合的结果,然后再清除这些结
果。这可以用以下简单的伪代码来描述::
while monitoring_on:
for page in monitoring_target:
if accessed(page):
nr_accesses[page] += 1
if time() % aggregation_interval == 0:
for callback in user_registered_callbacks:
callback(monitoring_target, nr_accesses)
for page in monitoring_target:
nr_accesses[page] = 0
sleep(sampling interval)
这种机制的监测开销将随着目标工作负载规模的增长而任意增加。
基于区域的抽样调查
------------------
为了避免开销的无限制增加,DAMON将假定具有相同访问频率的相邻页面归入一个区域。只要保持
这个假设(一个区域内的页面具有相同的访问频率),该区域内就只需要检查一个页面。因此,对
于每个 ``采样间隔`` ,DAMON在每个区域中随机挑选一个页面,等待一个 ``采样间隔`` ,检
查该页面是否同时被访问,如果被访问则增加该区域的访问频率。因此,监测开销是可以通过设置
区域的数量来控制的。DAMON允许用户设置最小和最大的区域数量来进行权衡。
然而,如果假设没有得到保证,这个方案就不能保持输出的质量。
适应性区域调整
--------------
即使最初的监测目标区域被很好地构建以满足假设(同一区域内的页面具有相似的访问频率),数
据访问模式也会被动态地改变。这将导致监测质量下降。为了尽可能地保持假设,DAMON根据每个
区域的访问频率自适应地进行合并和拆分。
对于每个 ``聚集区间`` ,它比较相邻区域的访问频率,如果频率差异较小,就合并这些区域。
然后,在它报告并清除每个区域的聚合接入频率后,如果区域总数不超过用户指定的最大区域数,
它将每个区域拆分为两个或三个区域。
通过这种方式,DAMON提供了其最佳的质量和最小的开销,同时保持了用户为其权衡设定的界限。
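自适应合并的思路可以用如下草图表示(区域表示为(起始, 结束, 访问计数)三元组;合并阈值与加权方式均为本例假设,并非DAMON的实际实现)::

```python
def merge_regions(regions, threshold):
    # 相邻且访问频率差不超过阈值的区域被合并为一个区域,
    # 合并后的访问计数按区域大小做加权平均
    merged = [list(regions[0])]
    for start, end, acc in regions[1:]:
        last = merged[-1]
        if last[1] == start and abs(last[2] - acc) <= threshold:
            size_a, size_b = last[1] - last[0], end - start
            last[2] = (last[2] * size_a + acc * size_b) // (size_a + size_b)
            last[1] = end
        else:
            merged.append([start, end, acc])
    return [tuple(r) for r in merged]

regions = [(0, 100, 5), (100, 200, 6), (200, 300, 20)]
print(merge_regions(regions, threshold=2))
# [(0, 200, 5), (200, 300, 20)]
```

拆分则是相反的操作:在区域总数不超过最大区域数时,把一个区域分成两到三个更小的区域,以便下一轮采样能发现区域内部的访问差异。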
动态目标空间更新处理
--------------------
监测目标地址范围可以动态改变。例如,虚拟内存可以动态地被映射和解映射。物理内存可以被
热插拔。
由于在某些情况下变化可能相当频繁,DAMON检查动态内存映射的变化,并仅在用户指定的时间
间隔( ``区域更新间隔`` )内将其应用于抽象的目标区域。
.. SPDX-License-Identifier: GPL-2.0
:Original: Documentation/vm/damon/faq.rst
:翻译:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:校译:
========
常见问题
========
为什么是一个新的子系统,而不是扩展perf或其他用户空间工具?
==========================================================
首先,因为它需要尽可能的轻量级,以便可以在线使用,所以应该避免任何不必要的开销,如内核-用户
空间的上下文切换成本。第二,DAMON的目标是被包括内核在内的其他程序所使用。因此,对特定工具
(如perf)的依赖性是不可取的。这就是DAMON在内核空间实现的两个最大的原因。
“闲置页面跟踪” 或 “perf mem” 可以替代DAMON吗?
==============================================
闲置页跟踪是物理地址空间访问检查的一个低层次的原始方法。“perf mem”也是类似的,尽管它可以
使用采样来减少开销。另一方面,DAMON是一个更高层次的框架,用于监控各种地址空间。它专注于内
存管理优化,并提供复杂的精度/开销处理机制。因此,“空闲页面跟踪” 和 “perf mem” 可以提供
DAMON输出的一个子集,但不能替代DAMON。
DAMON是否只支持虚拟内存?
=========================
不,DAMON的核心是独立于地址空间的。用户可以在DAMON核心上实现和配置特定地址空间的低级原始
部分,包括监测目标区域的构造和实际的访问检查。通过这种方式,DAMON用户可以用任何访问检查技
术来监测任何地址空间。
尽管如此,DAMON默认为虚拟内存和物理内存提供了基于vma/rmap跟踪和PTE访问位检查的地址空间
相关功能的实现,以供参考和方便使用。
我可以简单地监测页面的粒度吗?
==============================
是的,你可以通过设置 ``min_nr_regions`` 属性高于工作集大小除以页面大小的值来实现。
因为监视目标区域的大小被强制为 ``>=page size`` ,所以区域分割不会产生任何影响。
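这一设置可以用简单的算术来说明(页大小假设为4KiB)::

```python
PAGE_SIZE = 4096  # 假设4KiB页

def min_nr_regions_for_page_granularity(workingset_bytes):
    # 将 min_nr_regions 设置为不低于“工作集大小 / 页面大小”;
    # 由于区域大小被强制 >= 页大小,此时区域分割不再产生影响,
    # 监测粒度即为页面粒度
    return workingset_bytes // PAGE_SIZE

print(min_nr_regions_for_page_granularity(64 << 20))  # 64MiB工作集:16384
```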
.. SPDX-License-Identifier: GPL-2.0
:Original: Documentation/vm/damon/index.rst
:翻译:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:校译:
==========================
DAMON:数据访问监视器
==========================
DAMON是Linux内核的一个数据访问监控框架子系统。DAMON的核心机制使其成为
(该核心机制详见 Documentation/translations/zh_CN/vm/damon/design.rst)
- *准确度* (监测输出对DRAM级别的内存管理足够有用;但可能不适合CPU Cache级别),
- *轻量级* (监控开销低到可以在线应用),以及
- *可扩展* (无论目标工作负载的大小,开销的上限值都在恒定范围内)。
因此,利用这个框架,内核的内存管理机制可以做出高级决策。会导致高数据访问监控开销的实
验性内存管理优化工作可以再次进行。同时,在用户空间,有一些特殊工作负载的用户可以编写
个性化的应用程序,以便更好地了解和优化他们的工作负载和系统。
.. toctree::
:maxdepth: 2
faq
design
api
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/vm/free_page_reporting.rst
:翻译:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:校译:
==========
空闲页报告
==========
空闲页报告是一个API,设备可以通过它来注册接收系统当前未使用的页面列表。这在虚拟
化的情况下是很有用的,客户机能够使用这些数据来通知管理器它不再使用内存中的某些页
面。
对于驱动,通常是气球驱动要使用这个功能,它将分配和初始化一个page_reporting_dev_info
结构体。它要填充的结构体中的字段是用于处理散点列表的 "report" 函数指针。它还必
须保证每次调用该函数时能处理至少相当于PAGE_REPORTING_CAPACITY的散点列表条目。
假设没有其他页面报告设备已经注册, 对page_reporting_register的调用将向报告框
架注册页面报告接口。
一旦注册,页面报告API将开始向驱动报告成批的页面。API将在接口被注册后2秒开始报告
页面,并在任何足够高阶的页面被释放之后2秒继续报告。
报告的页面将被存储在传递给报告函数的散点列表(scatterlist)中,最后一个条目的结束位
被设置在条目nent-1中。当页面被报告函数处理时,分配器将无法访问它们。一旦报告函数
完成,这些页将被返回到它们所来自的空闲区域。
在移除使用空闲页报告的驱动之前,有必要调用page_reporting_unregister,以移除
目前被空闲页报告使用的page_reporting_dev_info结构体。这样做将阻止进一步的报
告通过该接口发出。如果另一个驱动或同一驱动被注册,它就有可能恢复前一个驱动在报告
空闲页方面的工作。
Alexander Duyck, 2019年12月04日
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/vm/highmem.rst
:翻译:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:校译:
==========
高内存处理
==========
作者: Peter Zijlstra <a.p.zijlstra@chello.nl>
.. contents:: :local:
高内存是什么?
==============
当物理内存的大小接近或超过虚拟内存的最大大小时,就会使用高内存(highmem)。在这一点上,内
核不可能在任何时候都保持所有可用的物理内存的映射。这意味着内核需要开始使用它想访问的物理内
存的临时映射。
没有被永久映射覆盖的那部分(物理)内存就是我们所说的 "高内存"。对于这个边界的确切位置,有
各种架构上的限制。
例如,在i386架构中,我们选择将内核映射到每个进程的虚拟空间,这样我们就不必为内核的进入/退
出付出全部的TLB作废代价。这意味着可用的虚拟内存空间(i386上为4GiB)必须在用户和内核空间之
间进行划分。
使用这种方法的架构的传统分配方式是3:1,3GiB用于用户空间,顶部的1GiB用于内核空间。::
+--------+ 0xffffffff
| Kernel |
+--------+ 0xc0000000
| |
| User |
| |
+--------+ 0x00000000
这意味着内核在任何时候最多可以映射1GiB的物理内存,但是由于我们需要虚拟地址空间来做其他事
情--包括访问其余物理内存的临时映射--实际的直接映射通常会更少(通常在~896MiB左右)。
其他有mm上下文标签的TLB的架构可以有独立的内核和用户映射。然而,一些硬件(如一些ARM)在使
用mm上下文标签时,其虚拟空间有限。
临时虚拟映射
============
内核包含几种创建临时映射的方法。:
* vmap(). 这可以用来将多个物理页长期映射到一个连续的虚拟空间。它需要synchronization
来解除映射。
* kmap(). 这允许对单个页面进行短期映射。它需要synchronization,但在一定程度上被摊销。
当以嵌套方式使用时,它也很容易出现死锁,因此不建议在新代码中使用它。
* kmap_atomic(). 这允许对单个页面进行非常短的时间映射。由于映射被限制在发布它的CPU上,
它表现得很好,但发布任务因此被要求留在该CPU上直到它完成,以免其他任务取代它的映射。
kmap_atomic() 也可以由中断上下文使用,因为它不睡眠,而且调用者可能在调用kunmap_atomic()
之后才睡眠。
可以假设k[un]map_atomic()不会失败。
使用kmap_atomic
===============
何时何地使用 kmap_atomic() 是很直接的。当代码想要访问一个可能从高内存(见__GFP_HIGHMEM)
分配的页面的内容时,例如在页缓存中的页面,就会使用它。该API有两个函数,它们的使用方式与
下面类似::
/* 找到感兴趣的页面。 */
struct page *page = find_get_page(mapping, offset);
/* 获得对该页内容的访问权。 */
void *vaddr = kmap_atomic(page);
/* 对该页的内容做一些处理。 */
memset(vaddr, 0, PAGE_SIZE);
/* 解除该页面的映射。 */
kunmap_atomic(vaddr);
注意,kunmap_atomic()调用的是kmap_atomic()调用的结果而不是参数。
如果你需要映射两个页面,因为你想从一个页面复制到另一个页面,你需要保持kmap_atomic调用严
格嵌套,如::
vaddr1 = kmap_atomic(page1);
vaddr2 = kmap_atomic(page2);
memcpy(vaddr1, vaddr2, PAGE_SIZE);
kunmap_atomic(vaddr2);
kunmap_atomic(vaddr1);
临时映射的成本
==============
创建临时映射的代价可能相当高。体系架构必须操作内核的页表、数据TLB和/或MMU的寄存器。
如果CONFIG_HIGHMEM没有被设置,那么内核会尝试用一点计算来创建映射,将页面结构地址转换成
指向页面内容的指针,而不是去捣鼓映射。在这种情况下,解映射操作可能是一个空操作。
如果CONFIG_MMU没有被设置,那么就不可能有临时映射和高内存。在这种情况下,也将使用计算方法。
i386 PAE
========
在某些情况下,i386 架构将允许你在 32 位机器上安装多达 64GiB 的内存。但这有一些后果:
* Linux需要为系统中的每个页面建立一个页帧结构,而且页帧需要驻在永久映射中,这意味着:
* 你最多可以有896M/sizeof(struct page)页帧;由于页结构体是32字节的,所以最终会有
112G的页;然而,内核需要在内存中存储更多的页帧......
* PAE使你的页表变大--这使系统变慢,因为更多的数据需要在TLB填充等方面被访问。一个好处
是,PAE有更多的PTE位,可以提供像NX和PAT这样的高级功能。
一般的建议是,你不要在32位机器上使用超过8GiB的空间--尽管更多的空间可能对你和你的工作
量有用,但你几乎是靠你自己--不要指望内核开发者真的会很关心事情的进展情况。
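上面第一条中的页帧算术可以这样验证(32字节的struct page与约896MiB的永久映射均为文中的假设值)::

```python
SIZEOF_STRUCT_PAGE = 32          # 文中假设:32字节的页结构体
PAGE_SIZE = 4096
DIRECT_MAP = 896 << 20           # 永久映射约896MiB

# 最多 896M/sizeof(struct page) 个页帧
max_page_frames = DIRECT_MAP // SIZEOF_STRUCT_PAGE
max_memory = max_page_frames * PAGE_SIZE
print(max_page_frames, max_memory >> 30)  # 29360128 112 —— 即文中的“112G的页”
```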
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/vm/index.rst
:翻译:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:校译:
=================
Linux内存管理文档
=================
这是一个关于Linux内存管理(mm)子系统内部的文档集,其中有不同层次的细节,包括注释
和邮件列表的回复,用于阐述数据结构和算法的基本情况。如果你正在寻找关于简单分配内存的建
议,请参阅(Documentation/translations/zh_CN/core-api/memory-allocation.rst)。
对于控制和调整指南,请参阅(Documentation/admin-guide/mm/index)。
TODO:待引用文档集被翻译完毕后请及时修改此处)
.. toctree::
:maxdepth: 1
active_mm
balance
damon/index
free_page_reporting
highmem
ksm
TODOLIST:
* arch_pgtable_helpers
* free_page_reporting
* frontswap
* hmm
* hwpoison
* hugetlbfs_reserv
* memory-model
* mmu_notifier
* numa
* overcommit-accounting
* page_migration
* page_frags
* page_owner
* page_table_check
* remap_file_pages
* slub
* split_page_table_lock
* transhuge
* unevictable-lru
* vmalloced-kernel-stacks
* z3fold
* zsmalloc
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/vm/ksm.rst
:翻译:
徐鑫 xu xin <xu.xin16@zte.com.cn>
============
内核同页合并
============
KSM 是一种节省内存的数据去重功能,由CONFIG_KSM=y启用,并在2.6.32版本时被添加
到Linux内核。详见 ``mm/ksm.c`` 的实现,以及http://lwn.net/Articles/306704和
https://lwn.net/Articles/330589
KSM的用户空间的接口在Documentation/translations/zh_CN/admin-guide/mm/ksm.rst
文档中有描述。
设计
====
概述
----
概述内容请见mm/ksm.c文档中的“DOC: Overview”
逆映射
------
KSM维护着稳定树中的KSM页的逆映射信息。
当一个KSM页被少于 ``max_page_sharing`` 个虚拟内存区域(VMA)共享时,稳定树中
代表该KSM页的节点指向一个rmap_item结构体列表,同时这个KSM页的 ``page->mapping``
指向该稳定树节点。
如果共享数超过了阈值,KSM将给稳定树添加第二个维度。稳定树就变成链接一个或多
个稳定树"副本"的"链"。每个副本都保留KSM页的逆映射信息,其中 ``page->mapping``
指向该"副本"。
每个链以及链接到该链中的所有“副本”必须保持的不变量是:它们代表相同的写保护内存
内容,尽管其中每个“副本”可能由该内容的不同KSM复制页所指向。
这样一来,相比于无限长的逆映射链表,稳定树的查找计算复杂度不受影响。但在稳定树
本身中不能有重复的KSM页面内容,这一点仍然是强制要求。
由 ``max_page_sharing`` 强制决定的数据去重限制是必要的,以此来避免虚拟内存
rmap链表变得过大。rmap的遍历具有O(N)的复杂度,其中N是共享该页面的rmap_item(即
虚拟映射)的数量,而这个数量又被 ``max_page_sharing`` 所限制。
因此,这有效地将线性O(N)计算复杂度从rmap遍历中分散到不同的KSM页面上。ksmd进
程在稳定节点"链"上的遍历也是O(N),但这个N是稳定树"副本"的数量,而不是rmap项
的数量,因此它对ksmd性能没有显著影响。实际上,最佳稳定树"副本"的候选节点将
保留在"副本"列表的开头。
``max_page_sharing`` 的值设置得高了会促使更快的内存合并(因为将有更少的稳定
树副本排队进入稳定节点chain->hlist)和更高的数据去重系数,但代价是在交换、压
缩、NUMA平衡和页面迁移过程中可能导致KSM页的最大rmap遍历速度较慢。
``stable_node_dups/stable_node_chains`` 的比值还受 ``max_page_sharing`` 调控
的影响,高比值可能意味着稳定节点dup中存在碎片,这可以通过在ksmd中引入碎片算
法来解决,该算法将rmap项从一个稳定节点dup重定位到另一个稳定节点dup,以便释放
那些仅包含极少rmap项的稳定节点"dup",但这可能会增加ksmd进程的CPU使用率,并可
能会减慢应用程序在KSM页面上的只读计算。
KSM会定期扫描稳定节点"链"中链接的所有稳定树"副本",以便删减过时了的稳定节点。
这种扫描的频率由 ``stable_node_chains_prune_millisecs`` 这个sysfs 接口定义。
参考
====
内核代码请见mm/ksm.c。
涉及的函数(mm_slot ksm_scan stable_node rmap_item)。
......@@ -5,7 +5,7 @@
\renewcommand\thesection*
\renewcommand\thesubsection*
\kerneldocCJKon
\kerneldocBeginTC
\kerneldocBeginTC{
.. _linux_doc_zh_tw:
......@@ -174,4 +174,4 @@ TODOList:
.. raw:: latex
\kerneldocEndTC
}\kerneldocEndTC
......@@ -256,7 +256,7 @@ Code Seq# Include File Comments
'l' 00-3F linux/tcfs_fs.h transparent cryptographic file system
<http://web.archive.org/web/%2A/http://mikonos.dia.unisa.it/tcfs>
'l' 40-7F linux/udf_fs_i.h in development:
<http://sourceforge.net/projects/linux-udf/>
<https://github.com/pali/udftools>
'm' 00-09 linux/mmtimer.h conflict!
'm' all linux/mtio.h conflict!
'm' all linux/soundcard.h conflict!
......
......@@ -664,7 +664,11 @@ one is input, the second one output.
* The fd channel - use file descriptor numbers for input/output. Example:
``con1=fd:0,fd:1.``
* The port channel - listen on TCP port number. Example: ``con1=port:4321``
* The port channel - start a telnet server on TCP port number. Example:
``con1=port:4321``. The host must have /usr/sbin/in.telnetd (usually part of
a telnetd package) and the port-helper from the UML utilities (see the
information for the xterm channel below). UML will not boot until a client
connects.
* The pty and pts channels - use system pty/pts.
......
......@@ -26,9 +26,9 @@ fragmentation statistics can be obtained through gfp flag information of
each page. It is already implemented and activated if page owner is
enabled. Other usages are more than welcome.
page owner is disabled in default. So, if you'd like to use it, you need
to add "page_owner=on" into your boot cmdline. If the kernel is built
with page owner and page owner is disabled in runtime due to no enabling
page owner is disabled by default. So, if you'd like to use it, you need
to add "page_owner=on" to your boot cmdline. If the kernel is built
with page owner and page owner is disabled in runtime due to not enabling
boot option, runtime overhead is marginal. If disabled in runtime, it
doesn't require memory to store owner information, so there is no runtime
memory overhead. And, page owner inserts just two unlikely branches into
......@@ -85,7 +85,7 @@ Usage
cat /sys/kernel/debug/page_owner > page_owner_full.txt
./page_owner_sort page_owner_full.txt sorted_page_owner.txt
The general output of ``page_owner_full.txt`` is as follows:
The general output of ``page_owner_full.txt`` is as follows::
Page allocated via order XXX, ...
PFN XXX ...
......@@ -100,7 +100,7 @@ Usage
and pages of buf, and finally sorts them according to the times.
See the result about who allocated each page
in the ``sorted_page_owner.txt``. General output:
in the ``sorted_page_owner.txt``. General output::
XXX times, XXX pages:
Page allocated via order XXX, ...
......
......@@ -10455,6 +10455,8 @@ KERNEL REGRESSIONS
M: Thorsten Leemhuis <linux@leemhuis.info>
L: regressions@lists.linux.dev
S: Supported
F: Documentation/admin-guide/reporting-regressions.rst
F: Documentation/process/handling-regressions.rst
KERNEL SELFTEST FRAMEWORK
M: Shuah Khan <shuah@kernel.org>
......
......@@ -12,202 +12,46 @@ use strict;
## ##
## #define enhancements by Armin Kuster <akuster@mvista.com> ##
## Copyright (c) 2000 MontaVista Software, Inc. ##
## ##
## This software falls under the GNU General Public License. ##
## Please read the COPYING file for more information ##
#
# Copyright (C) 2022 Tomasz Warniełło (POD)
# 18/01/2001 - Cleanups
# Functions prototyped as foo(void) same as foo()
# Stop eval'ing where we don't need to.
# -- huggie@earth.li
use Pod::Usage qw/pod2usage/;
# 27/06/2001 - Allowed whitespace after initial "/**" and
# allowed comments before function declarations.
# -- Christian Kreibich <ck@whoop.org>
=head1 NAME
# Still to do:
# - add perldoc documentation
# - Look more closely at some of the scarier bits :)
kernel-doc - Print formatted kernel documentation to stdout
# 26/05/2001 - Support for separate source and object trees.
# Return error code.
# Keith Owens <kaos@ocs.com.au>
=head1 SYNOPSIS
# 23/09/2001 - Added support for typedefs, structs, enums and unions
# Support for Context section; can be terminated using empty line
# Small fixes (like spaces vs. \s in regex)
# -- Tim Jansen <tim@tjansen.de>
kernel-doc [-h] [-v] [-Werror]
[ -man |
-rst [-sphinx-version VERSION] [-enable-lineno] |
-none
]
[
-export |
-internal |
[-function NAME] ... |
[-nosymbol NAME] ...
]
[-no-doc-sections]
[-export-file FILE] ...
FILE ...
# 25/07/2012 - Added support for HTML5
# -- Dan Luedtke <mail@danrl.de>
Run `kernel-doc -h` for details.
sub usage {
my $message = <<"EOF";
Usage: $0 [OPTION ...] FILE ...
=head1 DESCRIPTION
Read C language source or header FILEs, extract embedded documentation comments,
and print formatted documentation to standard output.
The documentation comments are identified by "/**" opening comment mark. See
Documentation/doc-guide/kernel-doc.rst for the documentation comment syntax.
Output format selection (mutually exclusive):
-man Output troff manual page format. This is the default.
-rst Output reStructuredText format.
-none Do not output documentation, only warnings.
Output format selection modifier (affects only ReST output):
-sphinx-version Use the ReST C domain dialect compatible with an
specific Sphinx Version.
If not specified, kernel-doc will auto-detect using
the sphinx-build version found on PATH.
Output selection (mutually exclusive):
-export Only output documentation for symbols that have been
exported using EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL()
in any input FILE or -export-file FILE.
-internal Only output documentation for symbols that have NOT been
exported using EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL()
in any input FILE or -export-file FILE.
-function NAME Only output documentation for the given function(s)
or DOC: section title(s). All other functions and DOC:
sections are ignored. May be specified multiple times.
-nosymbol NAME Exclude the specified symbols from the output
documentation. May be specified multiple times.
Output selection modifiers:
-no-doc-sections Do not output DOC: sections.
-enable-lineno Enable output of #define LINENO lines. Only works with
reStructuredText format.
-export-file FILE Specify an additional FILE in which to look for
EXPORT_SYMBOL() and EXPORT_SYMBOL_GPL(). To be used with
-export or -internal. May be specified multiple times.
Other parameters:
-v Verbose output, more warnings and other information.
-h Print this help.
-Werror Treat warnings as errors.
EOF
print $message;
exit 1;
}
The documentation comments are identified by the "/**" opening comment mark.
#
# format of comments.
# In the following table, (...)? signifies optional structure.
# (...)* signifies 0 or more structure elements
# /**
# * function_name(:)? (- short description)?
# (* @parameterx: (description of parameter x)?)*
# (* a blank line)?
# * (Description:)? (Description of function)?
# * (section header: (section description)? )*
# (*)?*/
#
# So .. the trivial example would be:
#
# /**
# * my_function
# */
#
# If the Description: header tag is omitted, then there must be a blank line
# after the last parameter specification.
# e.g.
# /**
# * my_function - does my stuff
# * @my_arg: its mine damnit
# *
# * Does my stuff explained.
# */
#
# or, could also use:
# /**
# * my_function - does my stuff
# * @my_arg: its mine damnit
# * Description: Does my stuff explained.
# */
# etc.
#
# Besides functions you can also write documentation for structs, unions,
# enums and typedefs. Instead of the function name you must write the name
# of the declaration; the struct/union/enum/typedef must always precede
# the name. Nesting of declarations is not supported.
# Use the argument mechanism to document members or constants.
# e.g.
# /**
# * struct my_struct - short description
# * @a: first member
# * @b: second member
# *
# * Longer description
# */
# struct my_struct {
# int a;
# int b;
# /* private: */
# int c;
# };
#
# All descriptions can be multiline, except the short function description.
#
# For really longs structs, you can also describe arguments inside the
# body of the struct.
# eg.
# /**
# * struct my_struct - short description
# * @a: first member
# * @b: second member
# *
# * Longer description
# */
# struct my_struct {
# int a;
# int b;
# /**
# * @c: This is longer description of C
# *
# * You can use paragraphs to describe arguments
# * using this method.
# */
# int c;
# };
#
# This should be use only for struct/enum members.
#
# You can also add additional sections. When documenting kernel functions you
# should document the "Context:" of the function, e.g. whether the functions
# can be called form interrupts. Unlike other sections you can end it with an
# empty line.
# A non-void function should have a "Return:" section describing the return
# value(s).
# Example-sections should contain the string EXAMPLE so that they are marked
# appropriately in DocBook.
#
# Example:
# /**
# * user_function - function that can only be called in user context
# * @a: some argument
# * Context: !in_interrupt()
# *
# * Some description
# * Example:
# * user_function(22);
# */
# ...
#
#
# All descriptive text is further processed, scanning for the following special
# patterns, which are highlighted appropriately.
#
# 'funcname()' - function
# '$ENVVAR' - environmental variable
# '&struct_name' - name of a structure (up to two words including 'struct')
# '&struct_name.member' - name of a structure member
# '@parameter' - name of a parameter
# '%CONST' - name of a constant.
# '``LITERAL``' - literal string without any spaces on it.
See Documentation/doc-guide/kernel-doc.rst for the documentation comment syntax.
=cut
# more perldoc at the end of the file
## init lots of data
......@@ -273,7 +117,13 @@ my $blankline_rst = "\n";
# read arguments
if ($#ARGV == -1) {
usage();
pod2usage(
-message => "No arguments!\n",
-exitval => 1,
-verbose => 99,
-sections => 'SYNOPSIS',
-output => \*STDERR,
);
}
my $kernelversion;
......@@ -468,7 +318,7 @@ while ($ARGV[0] =~ m/^--?(.*)/) {
} elsif ($cmd eq "Werror") {
$Werror = 1;
} elsif (($cmd eq "h") || ($cmd eq "help")) {
usage();
pod2usage(-exitval => 0, -verbose => 2);
} elsif ($cmd eq 'no-doc-sections') {
$no_doc_sections = 1;
} elsif ($cmd eq 'enable-lineno') {
......@@ -494,7 +344,22 @@ while ($ARGV[0] =~ m/^--?(.*)/) {
}
} else {
# Unknown argument
usage();
pod2usage(
-message => "Argument unknown!\n",
-exitval => 1,
-verbose => 99,
-sections => 'SYNOPSIS',
-output => \*STDERR,
);
}
if ($#ARGV < 0) {
pod2usage(
-message => "FILE argument missing\n",
-exitval => 1,
-verbose => 99,
-sections => 'SYNOPSIS',
-output => \*STDERR,
);
}
}
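The change above replaces the hand-rolled `usage()` with `Pod::Usage::pod2usage()`, which prints the SYNOPSIS section of the embedded POD to stderr and exits non-zero on a bad or missing argument, and prints the full help on `-h`. As a loose analogy only (not the kernel-doc implementation), Python's `argparse` gives the same print-usage-and-exit behavior:

```python
import argparse

# Loose Python analogy of the pod2usage() calls above: a missing FILE
# argument prints the usage synopsis to stderr and exits non-zero,
# while -h prints the full help text and exits 0.
parser = argparse.ArgumentParser(prog="kernel-doc")
group = parser.add_mutually_exclusive_group()
group.add_argument("-man", action="store_true", help="output troff manual page format")
group.add_argument("-rst", action="store_true", help="output reStructuredText (default)")
group.add_argument("-none", action="store_true", help="output only warnings")
parser.add_argument("files", nargs="+", metavar="FILE", help="source file(s) to parse")
```

With this, `parser.parse_args(["-rst"])` fails with exit status 2 because FILE is missing, mirroring the `"FILE argument missing"` branch above.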
@@ -2521,3 +2386,118 @@ if ($Werror && $warnings) {
} else {
exit($output_mode eq "none" ? 0 : $errors)
}
__END__

=head1 OPTIONS

=head2 Output format selection (mutually exclusive):

=over 8

=item -man

Output troff manual page format.

=item -rst

Output reStructuredText format. This is the default.

=item -none

Do not output documentation, only warnings.

=back

=head2 Output format modifiers

=head3 reStructuredText only

=over 8

=item -sphinx-version VERSION

Use the ReST C domain dialect compatible with a specific Sphinx Version.
If not specified, kernel-doc will auto-detect using the sphinx-build version
found on PATH.

=back
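The auto-detection described above amounts to running `sphinx-build --version` and extracting the version triple. A hypothetical Python sketch of that step (the real script does this in Perl):

```python
import re
import subprocess

def parse_sphinx_version(version_output):
    """Extract (major, minor, patch) from `sphinx-build --version` output,
    e.g. 'sphinx-build 4.3.2'. Returns None if no version number is found."""
    m = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", version_output)
    if not m:
        return None
    return (int(m.group(1)), int(m.group(2)), int(m.group(3) or 0))

def detect_sphinx_version():
    """Ask the sphinx-build found on PATH for its version; None if absent."""
    try:
        result = subprocess.run(["sphinx-build", "--version"],
                                capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_sphinx_version(result.stdout)
```

Separating the parsing from the subprocess call keeps the version-string handling testable without Sphinx installed.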
=head2 Output selection (mutually exclusive):

=over 8

=item -export

Only output documentation for the symbols that have been exported using
EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() in any input FILE or -export-file FILE.

=item -internal

Only output documentation for the symbols that have NOT been exported using
EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() in any input FILE or -export-file FILE.

=item -function NAME

Only output documentation for the given function or DOC: section title.
All other functions and DOC: sections are ignored.

May be specified multiple times.

=item -nosymbol NAME

Exclude the specified symbol from the output documentation.

May be specified multiple times.

=back
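-export and -internal split documented symbols by whether they appear in an EXPORT_SYMBOL()/EXPORT_SYMBOL_GPL() invocation somewhere in the input. A minimal, hypothetical Python sketch of that split (kernel-doc itself is Perl and does considerably more):

```python
import re

# Names passed to EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() in the source.
EXPORT_RE = re.compile(r"EXPORT_SYMBOL(?:_GPL)?\s*\(\s*(\w+)\s*\)")

def exported_symbols(source):
    """Collect the set of exported symbol names from C source text."""
    return set(EXPORT_RE.findall(source))

def select_symbols(documented, source, mode):
    """mode 'export' keeps exported symbols; 'internal' keeps the rest."""
    exported = exported_symbols(source)
    if mode == "export":
        return [s for s in documented if s in exported]
    return [s for s in documented if s not in exported]
```

So with `EXPORT_SYMBOL(foo)` in the source, `foo` lands in the -export output and everything else documented there lands in the -internal output.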
=head2 Output selection modifiers:

=over 8

=item -no-doc-sections

Do not output DOC: sections.

=item -export-file FILE

Specify an additional FILE in which to look for EXPORT_SYMBOL() and
EXPORT_SYMBOL_GPL().

To be used with -export or -internal.

May be specified multiple times.

=back

=head3 reStructuredText only

=over 8

=item -enable-lineno

Enable output of #define LINENO lines.

=back

=head2 Other parameters:

=over 8

=item -h, -help

Print this help.

=item -v

Verbose output, more warnings and other information.

=item -Werror

Treat warnings as errors.

=back

=cut