Commit 233a806b authored by Linus Torvalds

Merge tag 'docs-5.14' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "This was a reasonably active cycle for documentation; this includes:

   - Some kernel-doc cleanups. That script is still a regex onslaught from
     hell, but it has gotten a little better.

   - Improvements to the checkpatch docs, which are also used by the
     tool itself.

   - A major update to the pathname lookup documentation.

   - Elimination of :doc: markup, since our automarkup magic can create
     references from filenames without all the extra noise.

   - The flurry of Chinese translation activity continues.

  Plus, of course, the usual collection of updates, typo fixes, and
  warning fixes"

* tag 'docs-5.14' of git://git.lwn.net/linux: (115 commits)
  docs: path-lookup: use bare function() rather than literals
  docs: path-lookup: update symlink description
  docs: path-lookup: update get_link() ->follow_link description
  docs: path-lookup: update WALK_GET, WALK_PUT desc
  docs: path-lookup: no get_link()
  docs: path-lookup: update i_op->put_link and cookie description
  docs: path-lookup: i_op->follow_link replaced with i_op->get_link
  docs: path-lookup: Add macro name to symlink limit description
  docs: path-lookup: remove filename_mountpoint
  docs: path-lookup: update do_last() part
  docs: path-lookup: update path_mountpoint() part
  docs: path-lookup: update path_to_nameidata() part
  docs: path-lookup: update follow_managed() part
  docs: Makefile: Use CONFIG_SHELL not SHELL
  docs: Take a little noise out of the build process
  docs: x86: avoid using ReST :doc:`foo` markup
  docs: virt: kvm: s390-pv-boot.rst: avoid using ReST :doc:`foo` markup
  docs: userspace-api: landlock.rst: avoid using ReST :doc:`foo` markup
  docs: trace: ftrace.rst: avoid using ReST :doc:`foo` markup
  docs: trace: coresight: coresight.rst: avoid using ReST :doc:`foo` markup
  ...
parents 122fa8c5 98cf4951
......@@ -6,4 +6,4 @@ Description:
with the update that cpuidle governor can be changed at runtime in default,
both current_governor and current_governor_ro co-exist under
/sys/devices/system/cpu/cpuidle/ file, it's duplicate so make
current_governor_ro obselete.
current_governor_ro obsolete.
......@@ -5,7 +5,7 @@ Contact: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Description:
The /sys/kernel/uids/<uid>/cpu_shares tunable is used
to set the cpu bandwidth a user is allowed. This is a
propotional value. What that means is that if there
proportional value. What that means is that if there
are two users logged in, each with an equal number of
shares, then they will get equal CPU bandwidth. Another
example would be, if User A has shares = 1024 and user
......
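A short sketch of the proportional-share arithmetic described above (the share values are hypothetical; only the ratio matters)::

    # Illustration only: the CPU bandwidth fraction each user receives is
    # its cpu_shares value divided by the sum of all logged-in users' shares.
    def bandwidth_fractions(shares):
        total = sum(shares.values())
        return {user: value / total for user, value in shares.items()}

    # User A with 1024 shares and user B with 2048 shares: B gets twice
    # the CPU bandwidth of A.
    print(bandwidth_fractions({"A": 1024, "B": 2048}))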
......@@ -61,7 +61,7 @@ Date: September. 2017
KernelVersion: 4.14
Contact: Stephen Hemminger <sthemmin@microsoft.com>
Description: Directory for per-channel information
NN is the VMBUS relid associtated with the channel.
NN is the VMBUS relid associated with the channel.
What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/cpu
Date: September. 2017
......
......@@ -19,7 +19,7 @@ Date: April 2011
KernelVersion: 3.0
Contact: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Description:
The major:minor number (in hexidecimal) of the
The major:minor number (in hexadecimal) of the
physical device providing the storage for this backend
block device.
......
......@@ -23,3 +23,86 @@ Description: Default value for the Data Stream Control Register (DSCR) on
here).
If set by a process it will be inherited by child processes.
Values: 64 bit unsigned integer (bit field)
What: /sys/devices/system/cpu/cpuX/topology/physical_package_id
Description: physical package id of cpuX. Typically corresponds to a physical
socket number, but the actual value is architecture and platform
dependent.
Values: integer
What: /sys/devices/system/cpu/cpuX/topology/die_id
Description: the CPU die ID of cpuX. Typically it is the hardware platform's
identifier (rather than the kernel's). The actual value is
architecture and platform dependent.
Values: integer
What: /sys/devices/system/cpu/cpuX/topology/core_id
Description: the CPU core ID of cpuX. Typically it is the hardware platform's
identifier (rather than the kernel's). The actual value is
architecture and platform dependent.
Values: integer
What: /sys/devices/system/cpu/cpuX/topology/book_id
Description: the book ID of cpuX. Typically it is the hardware platform's
identifier (rather than the kernel's). The actual value is
architecture and platform dependent. it's only used on s390.
Values: integer
What: /sys/devices/system/cpu/cpuX/topology/drawer_id
Description: the drawer ID of cpuX. Typically it is the hardware platform's
identifier (rather than the kernel's). The actual value is
architecture and platform dependent. it's only used on s390.
Values: integer
What: /sys/devices/system/cpu/cpuX/topology/core_cpus
Description: internal kernel map of CPUs within the same core.
(deprecated name: "thread_siblings")
Values: hexadecimal bitmask.
What: /sys/devices/system/cpu/cpuX/topology/core_cpus_list
Description: human-readable list of CPUs within the same core.
The format is like 0-3, 8-11, 14,17.
(deprecated name: "thread_siblings_list").
Values: decimal list.
What: /sys/devices/system/cpu/cpuX/topology/package_cpus
Description: internal kernel map of the CPUs sharing the same physical_package_id.
(deprecated name: "core_siblings").
Values: hexadecimal bitmask.
What: /sys/devices/system/cpu/cpuX/topology/package_cpus_list
Description: human-readable list of CPUs sharing the same physical_package_id.
The format is like 0-3, 8-11, 14,17.
(deprecated name: "core_siblings_list")
Values: decimal list.
What: /sys/devices/system/cpu/cpuX/topology/die_cpus
Description: internal kernel map of CPUs within the same die.
Values: hexadecimal bitmask.
What: /sys/devices/system/cpu/cpuX/topology/die_cpus_list
Description: human-readable list of CPUs within the same die.
The format is like 0-3, 8-11, 14,17.
Values: decimal list.
What: /sys/devices/system/cpu/cpuX/topology/book_siblings
Description: internal kernel map of cpuX's hardware threads within the same
book_id. it's only used on s390.
Values: hexadecimal bitmask.
What: /sys/devices/system/cpu/cpuX/topology/book_siblings_list
Description: human-readable list of cpuX's hardware threads within the same
book_id.
The format is like 0-3, 8-11, 14,17. it's only used on s390.
Values: decimal list.
What: /sys/devices/system/cpu/cpuX/topology/drawer_siblings
Description: internal kernel map of cpuX's hardware threads within the same
drawer_id. it's only used on s390.
Values: hexadecimal bitmask.
What: /sys/devices/system/cpu/cpuX/topology/drawer_siblings_list
Description: human-readable list of cpuX's hardware threads within the same
drawer_id.
The format is like 0-3, 8-11, 14,17. it's only used on s390.
Values: decimal list.
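As a rough userspace illustration of the two value formats documented above (hexadecimal bitmask files such as core_cpus versus decimal list files such as core_cpus_list), a reader could decode them along these lines; the sysfs paths come from the entries above, everything else is an assumption of this sketch::

    # Decode a topology bitmask (hex, possibly comma-separated into 32-bit
    # words, e.g. "00000000,0000000f") into a list of CPU numbers.
    def read_mask(path):
        mask = int(open(path).read().strip().replace(",", ""), 16)
        return [cpu for cpu in range(mask.bit_length()) if mask >> cpu & 1]

    # Decode a decimal list such as "0-3,8-11,14,17".
    def read_list(path):
        cpus = []
        for part in open(path).read().strip().split(","):
            lo, _, hi = part.partition("-")
            cpus.extend(range(int(lo), int(hi or lo) + 1))
        return cpus

    base = "/sys/devices/system/cpu/cpu0/topology/"
    print(read_mask(base + "core_cpus"), read_list(base + "core_cpus_list"))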
......@@ -173,7 +173,7 @@ What: /sys/bus/dsa/devices/wq<m>.<n>/priority
Date: Oct 25, 2019
KernelVersion: 5.6.0
Contact: dmaengine@vger.kernel.org
Description: The priority value of this work queue, it is a vlue relative to
Description: The priority value of this work queue, it is a value relative to
other work queue in the same group to control quality of service
for dispatching work from multiple workqueues in the same group.
......
......@@ -137,7 +137,7 @@ Contact: Vadim Pasternak <vadimp@mellanox.com>
Description: These files show the system reset cause, as following:
COMEX thermal shutdown; wathchdog power off or reset was derived
by one of the next components: COMEX, switch board or by Small Form
Factor mezzanine, reset requested from ASIC, reset cuased by BIOS
Factor mezzanine, reset requested from ASIC, reset caused by BIOS
reload. Value 1 in file means this is reset cause, 0 - otherwise.
Only one of the above causes could be 1 at the same time, representing
only last reset cause.
......@@ -183,7 +183,7 @@ What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/vpd_wp
Date: January 2020
KernelVersion: 5.6
Contact: Vadim Pasternak <vadimp@mellanox.com>
Description: This file allows to overwrite system VPD hardware wrtie
Description: This file allows to overwrite system VPD hardware write
protection when attribute is set 1.
The file is read/write.
......
......@@ -31,4 +31,4 @@ Date: April 2016
KernelVersion: 4.7
Description:
Dummy IIO devices directory. Creating a directory here will result
in creating a dummy IIO device in the IIO subystem.
in creating a dummy IIO device in the IIO subsystem.
......@@ -20,7 +20,7 @@ Description:
subbuffer_size
configure the sub-buffer size for this channel
(needed for synchronous and isochrnous data)
(needed for synchronous and isochronous data)
num_buffers
......@@ -75,7 +75,7 @@ Description:
subbuffer_size
configure the sub-buffer size for this channel
(needed for synchronous and isochrnous data)
(needed for synchronous and isochronous data)
num_buffers
......@@ -130,7 +130,7 @@ Description:
subbuffer_size
configure the sub-buffer size for this channel
(needed for synchronous and isochrnous data)
(needed for synchronous and isochronous data)
num_buffers
......@@ -196,7 +196,7 @@ Description:
subbuffer_size
configure the sub-buffer size for this channel
(needed for synchronous and isochrnous data)
(needed for synchronous and isochronous data)
num_buffers
......
......@@ -137,7 +137,7 @@ Description:
This group contains "OS String" extension handling attributes.
============= ===============================================
use flag turning "OS Desctiptors" support on/off
use flag turning "OS Descriptors" support on/off
b_vendor_code one-byte value used for custom per-device and
per-interface requests
qw_sign an identifier to be reported as "OS String"
......
......@@ -170,7 +170,7 @@ Description: Default color matching descriptors
bMatrixCoefficients matrix used to compute luma and
chroma values from the color primaries
bTransferCharacteristics optoelectronic transfer
characteristic of the source picutre,
characteristic of the source picture,
also called the gamma function
bColorPrimaries color primaries and the reference
white
......@@ -311,7 +311,7 @@ Description: Specific streaming header descriptors
a hardware trigger interrupt event
bTriggerSupport flag specifying if hardware
triggering is supported
bStillCaptureMethod method of still image caputre
bStillCaptureMethod method of still image capture
supported
bTerminalLink id of the output terminal to which
the video endpoint of this interface
......
......@@ -31,7 +31,7 @@ What: /sys/kernel/debug/genwqe/genwqe<n>_card/prev_regs
Date: Oct 2013
Contact: haver@linux.vnet.ibm.com
Description: Dump of the error registers before the last reset of
the card occured.
the card occurred.
Only available for PF.
What: /sys/kernel/debug/genwqe/genwqe<n>_card/prev_dbg_uid0
......
......@@ -153,7 +153,7 @@ KernelVersion: 5.1
Contact: ogabbay@kernel.org
Description: Triggers an I2C transaction that is generated by the device's
CPU. Writing to this file generates a write transaction while
reading from the file generates a read transcation
reading from the file generates a read transaction
What: /sys/kernel/debug/habanalabs/hl<n>/i2c_reg
Date: Jan 2019
......
......@@ -12,7 +12,7 @@ KernelVersion: 4.12
Contact: linux-fsi@lists.ozlabs.org
Description:
Sends an FSI BREAK command on a master's communication
link to any connnected slaves. A BREAK resets connected
link to any connected slaves. A BREAK resets connected
device's logic and preps it to receive further commands
from the master.
......
......@@ -786,7 +786,7 @@ What: /sys/.../events/in_capacitanceY_adaptive_thresh_rising_en
What: /sys/.../events/in_capacitanceY_adaptive_thresh_falling_en
KernelVersion: 5.13
Contact: linux-iio@vger.kernel.org
Descrption:
Description:
Adaptive thresholds are similar to normal fixed thresholds
but the value is expressed as an offset from a value which
provides a low frequency approximation of the channel itself.
......@@ -798,10 +798,10 @@ What: /sys/.../in_capacitanceY_adaptive_thresh_rising_timeout
What: /sys/.../in_capacitanceY_adaptive_thresh_falling_timeout
KernelVersion: 5.11
Contact: linux-iio@vger.kernel.org
Descrption:
Description:
When adaptive thresholds are used, the tracking signal
may adjust too slowly to step changes in the raw signal.
*_timeout (in seconds) specifies a time for which the
Thus these specify the time in seconds for which the
difference between the slow tracking signal and the raw
signal is allowed to remain out-of-range before a reset
event occurs in which the tracking signal is made equal
......
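A toy model of the timeout behaviour described above, purely to make the wording concrete (the filter, limits and constants are invented for illustration and are not the device's algorithm)::

    # The tracking signal is a low-frequency approximation of the raw channel.
    # If the two stay out of range for longer than *_timeout seconds, the
    # tracking signal is reset to the raw value.
    def track(raw_samples, dt, alpha=0.01, limit=5.0, timeout=2.0):
        tracking, out_of_range_for = raw_samples[0], 0.0
        for raw in raw_samples:
            tracking += alpha * (raw - tracking)
            if abs(raw - tracking) > limit:
                out_of_range_for += dt
                if out_of_range_for >= timeout:
                    tracking, out_of_range_for = raw, 0.0   # reset event
            else:
                out_of_range_for = 0.0
        return tracking

    # A step in the raw signal eventually snaps the tracker to it.
    print(track([0.0] * 10 + [100.0] * 40, dt=0.1))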
......@@ -139,8 +139,8 @@ Description:
binary file containing the Vital Product Data for the
device. It should follow the VPD format defined in
PCI Specification 2.1 or 2.2, but users should consider
that some devices may have malformatted data. If the
underlying VPD has a writable section then the
that some devices may have incorrectly formatted data.
If the underlying VPD has a writable section then the
corresponding section of this file will be writable.
What: /sys/bus/pci/devices/.../virtfnN
......
......@@ -84,3 +84,103 @@ Description:
It can be enabled by writing the value stored in
/sys/class/backlight/<backlight>/max_brightness to
/sys/class/backlight/<backlight>/brightness.
What: /sys/class/backlight/<backlight>/<ambient light zone>_max
Date: Sep, 2009
KernelVersion: v2.6.32
Contact: device-drivers-devel@blackfin.uclinux.org
Description:
Control the maximum brightness for <ambient light zone>
on this <backlight>. Values are between 0 and 127. This file
will also show the brightness level stored for this
<ambient light zone>.
The <ambient light zone> is device-driver specific:
For ADP5520 and ADP5501, <ambient light zone> can be:
=========== ================================================
Ambient sysfs entry
light zone
=========== ================================================
daylight /sys/class/backlight/<backlight>/daylight_max
office /sys/class/backlight/<backlight>/office_max
dark /sys/class/backlight/<backlight>/dark_max
=========== ================================================
For ADP8860, <ambient light zone> can be:
=========== ================================================
Ambient sysfs entry
light zone
=========== ================================================
l1_daylight /sys/class/backlight/<backlight>/l1_daylight_max
l2_office /sys/class/backlight/<backlight>/l2_office_max
l3_dark /sys/class/backlight/<backlight>/l3_dark_max
=========== ================================================
For ADP8870, <ambient light zone> can be:
=========== ================================================
Ambient sysfs entry
light zone
=========== ================================================
l1_daylight /sys/class/backlight/<backlight>/l1_daylight_max
l2_bright /sys/class/backlight/<backlight>/l2_bright_max
l3_office /sys/class/backlight/<backlight>/l3_office_max
l4_indoor /sys/class/backlight/<backlight>/l4_indoor_max
l5_dark /sys/class/backlight/<backlight>/l5_dark_max
=========== ================================================
See also: /sys/class/backlight/<backlight>/ambient_light_zone.
What: /sys/class/backlight/<backlight>/<ambient light zone>_dim
Date: Sep, 2009
KernelVersion: v2.6.32
Contact: device-drivers-devel@blackfin.uclinux.org
Description:
Control the dim brightness for <ambient light zone>
on this <backlight>. Values are between 0 and 127, typically
set to 0. Full off when the backlight is disabled.
This file will also show the dim brightness level stored for
this <ambient light zone>.
The <ambient light zone> is device-driver specific:
For ADP5520 and ADP5501, <ambient light zone> can be:
=========== ================================================
Ambient sysfs entry
light zone
=========== ================================================
daylight /sys/class/backlight/<backlight>/daylight_dim
office /sys/class/backlight/<backlight>/office_dim
dark /sys/class/backlight/<backlight>/dark_dim
=========== ================================================
For ADP8860, <ambient light zone> can be:
=========== ================================================
Ambient sysfs entry
light zone
=========== ================================================
l1_daylight /sys/class/backlight/<backlight>/l1_daylight_dim
l2_office /sys/class/backlight/<backlight>/l2_office_dim
l3_dark /sys/class/backlight/<backlight>/l3_dark_dim
=========== ================================================
For ADP8870, <ambient light zone> can be:
=========== ================================================
Ambient sysfs entry
light zone
=========== ================================================
l1_daylight /sys/class/backlight/<backlight>/l1_daylight_dim
l2_bright /sys/class/backlight/<backlight>/l2_bright_dim
l3_office /sys/class/backlight/<backlight>/l3_office_dim
l4_indoor /sys/class/backlight/<backlight>/l4_indoor_dim
l5_dark /sys/class/backlight/<backlight>/l5_dark_dim
=========== ================================================
See also: /sys/class/backlight/<backlight>/ambient_light_zone.
sysfs interface for analog devices adp5520(01) backlight driver
---------------------------------------------------------------
The backlight brightness control operates at three different levels for the
adp5520 and adp5501 devices: daylight (level 1), office (level 2) and dark
(level 3). By default the brightness operates at the daylight brightness level.
What: /sys/class/backlight/<backlight>/daylight_max
What: /sys/class/backlight/<backlight>/office_max
What: /sys/class/backlight/<backlight>/dark_max
Date: Sep, 2009
KernelVersion: v2.6.32
Contact: Michael Hennerich <michael.hennerich@analog.com>
Description:
(RW) Maximum current setting for the backlight when brightness
is at one of the three levels (daylight, office or dark). This
is an input code between 0 and 127, which is transformed to a
value between 0 mA and 30 mA using linear or non-linear
algorithms.
What: /sys/class/backlight/<backlight>/daylight_dim
What: /sys/class/backlight/<backlight>/office_dim
What: /sys/class/backlight/<backlight>/dark_dim
Date: Sep, 2009
KernelVersion: v2.6.32
Contact: Michael Hennerich <michael.hennerich@analog.com>
Description:
(RW) Dim current setting for the backlight when brightness is at
one of the three levels (daylight, office or dark). This is an
input code between 0 and 127, which is transformed to a value
between 0 mA and 30 mA using linear or non-linear algorithms.
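The 0..127 input code to 0..30 mA conversion mentioned above is device dependent; as an illustration of the linear case only (an assumption, not the driver's actual transfer function)::

    # Hypothetical linear mapping from a brightness code to a sink current.
    # Real ADP88xx parts may use a non-linear law.
    def code_to_ma(code, full_scale_ma=30.0):
        if not 0 <= code <= 127:
            raise ValueError("code must be in 0..127")
        return full_scale_ma * code / 127

    print(code_to_ma(0), code_to_ma(64), code_to_ma(127))  # 0.0 ... 30.0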
sysfs interface for analog devices adp8860 backlight driver
-----------------------------------------------------------
The backlight brightness control operates at three different levels for the
adp8860, adp8861 and adp8863 devices: daylight (level 1), office (level 2) and
dark (level 3). By default the brightness operates at the daylight brightness
level.
See also /sys/class/backlight/<backlight>/ambient_light_level and
/sys/class/backlight/<backlight>/ambient_light_zone.
What: /sys/class/backlight/<backlight>/l1_daylight_max
What: /sys/class/backlight/<backlight>/l2_office_max
What: /sys/class/backlight/<backlight>/l3_dark_max
Date: Apr, 2010
KernelVersion: v2.6.35
Contact: Michael Hennerich <michael.hennerich@analog.com>
Description:
(RW) Maximum current setting for the backlight when brightness
is at one of the three levels (daylight, office or dark). This
is an input code between 0 and 127, which is transformed to a
value between 0 mA and 30 mA using linear or non-linear
algorithms.
What: /sys/class/backlight/<backlight>/l1_daylight_dim
What: /sys/class/backlight/<backlight>/l2_office_dim
What: /sys/class/backlight/<backlight>/l3_dark_dim
Date: Apr, 2010
KernelVersion: v2.6.35
Contact: Michael Hennerich <michael.hennerich@analog.com>
Description:
(RW) Dim current setting for the backlight when brightness is at
one of the three levels (daylight, office or dark). This is an
input code between 0 and 127, which is transformed to a value
between 0 mA and 30 mA using linear or non-linear algorithms.
See also /sys/class/backlight/<backlight>/ambient_light_level and
/sys/class/backlight/<backlight>/ambient_light_zone.
What: /sys/class/backlight/<backlight>/<ambient light zone>_max
What: /sys/class/backlight/<backlight>/l1_daylight_max
What: /sys/class/backlight/<backlight>/l2_bright_max
What: /sys/class/backlight/<backlight>/l3_office_max
What: /sys/class/backlight/<backlight>/l4_indoor_max
What: /sys/class/backlight/<backlight>/l5_dark_max
Date: May 2011
KernelVersion: 3.0
Contact: device-drivers-devel@blackfin.uclinux.org
Description:
Control the maximum brightness for <ambient light zone>
on this <backlight>. Values are between 0 and 127. This file
will also show the brightness level stored for this
<ambient light zone>.
What: /sys/class/backlight/<backlight>/<ambient light zone>_dim
What: /sys/class/backlight/<backlight>/l2_bright_dim
What: /sys/class/backlight/<backlight>/l3_office_dim
What: /sys/class/backlight/<backlight>/l4_indoor_dim
What: /sys/class/backlight/<backlight>/l5_dark_dim
Date: May 2011
KernelVersion: 3.0
Contact: device-drivers-devel@blackfin.uclinux.org
Description:
Control the dim brightness for <ambient light zone>
on this <backlight>. Values are between 0 and 127, typically
set to 0. Full off when the backlight is disabled.
This file will also show the dim brightness level stored for
this <ambient light zone>.
What: /sys/class/leds/<led>/repeat
Date: September 2019
KernelVersion: 5.5
Description:
EL15203000 supports only indefinitely patterns,
so this file should always store -1.
For more info, please see:
Documentation/ABI/testing/sysfs-class-led-trigger-pattern
......@@ -35,3 +35,6 @@ Description:
This file will always return the originally written repeat
number.
It should be noticed that some leds, like EL15203000 may
only support indefinitely patterns, so they always store -1.
......@@ -50,7 +50,7 @@ Description: Dynamic addition and removal of CPU's. This is not hotplug
architecture specific.
release: writes to this file dynamically remove a CPU from
the system. Information writtento the file to remove CPU's
the system. Information written to the file to remove CPU's
is architecture specific.
What: /sys/devices/system/cpu/cpu#/node
......@@ -97,7 +97,7 @@ Description: CPU topology files that describe a logical CPU's relationship
corresponds to a physical socket number, but the actual value
is architecture and platform dependent.
thread_siblings: internel kernel map of cpu#'s hardware
thread_siblings: internal kernel map of cpu#'s hardware
threads within the same core as cpu#
thread_siblings_list: human-readable list of cpu#'s hardware
......@@ -280,7 +280,7 @@ Description: Disable L3 cache indices
on a processor with this functionality will return the currently
disabled index for that node. There is one L3 structure per
node, or per internal node on MCM machines. Writing a valid
index to one of these files will cause the specificed cache
index to one of these files will cause the specified cache
index to be disabled.
All AMD processors with L3 caches provide this functionality.
......@@ -295,7 +295,7 @@ Description: Processor frequency boosting control
This switch controls the boost setting for the whole system.
Boosting allows the CPU and the firmware to run at a frequency
beyound it's nominal limit.
beyond it's nominal limit.
More details can be found in
Documentation/admin-guide/pm/cpufreq.rst
......@@ -532,7 +532,7 @@ What: /sys/devices/system/cpu/smt
/sys/devices/system/cpu/smt/control
Date: June 2018
Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
Description: Control Symetric Multi Threading (SMT)
Description: Control Symmetric Multi Threading (SMT)
active: Tells whether SMT is active (enabled and siblings online)
......
......@@ -168,7 +168,7 @@ Description: This file shows the manufacturing date in BCD format.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/manufacturer_id
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the manufacturee ID. This is one of the
Description: This file shows the manufacturer ID. This is one of the
UFS device descriptor parameters. The full information about
the descriptor could be found at UFS specifications 2.1.
......@@ -521,7 +521,7 @@ Description: This file shows maximum VCC, VCCQ and VCCQ2 value for
What: /sys/bus/platform/drivers/ufshcd/*/string_descriptors/manufacturer_name
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file contains a device manufactureer name string.
Description: This file contains a device manufacturer name string.
The full information about the descriptor could be found at
UFS specifications 2.1.
......
......@@ -238,7 +238,7 @@ Description: Shows current reserved blocks in system, it may be temporarily
What: /sys/fs/f2fs/<disk>/gc_urgent
Date: August 2017
Contact: "Jaegeuk Kim" <jaegeuk@kernel.org>
Description: Do background GC agressively when set. When gc_urgent = 1,
Description: Do background GC aggressively when set. When gc_urgent = 1,
background thread starts to do GC by given gc_urgent_sleep_time
interval. When gc_urgent = 2, F2FS will lower the bar of
checking idle in order to process outstanding discard commands
......
......@@ -25,14 +25,10 @@ Description: /sys/kernel/iommu_groups/reserved_regions list IOVA
the base IOVA, the second is the end IOVA and the third
field describes the type of the region.
What: /sys/kernel/iommu_groups/reserved_regions
Date: June 2019
KernelVersion: v5.3
Contact: Eric Auger <eric.auger@redhat.com>
Description: In case an RMRR is used only by graphics or USB devices
it is now exposed as "direct-relaxable" instead of "direct".
In device assignment use case, for instance, those RMRR
are considered to be relaxable and safe.
Since kernel 5.3, in case an RMRR is used only by graphics or
USB devices it is now exposed as "direct-relaxable" instead
of "direct". In device assignment use case, for instance,
those RMRR are considered to be relaxable and safe.
What: /sys/kernel/iommu_groups/<grp_id>/type
Date: November 2020
......
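The reserved_regions records described above carry three whitespace-separated fields (base IOVA, end IOVA, region type); a minimal sketch of consuming them, with the exact formatting otherwise assumed::

    # Parse /sys/kernel/iommu_groups/<grp_id>/reserved_regions lines of the
    # form "<start> <end> <type>", where the addresses are hexadecimal.
    def reserved_regions(path):
        for line in open(path):
            start, end, rtype = line.split()
            yield int(start, 16), int(end, 16), rtype

    for start, end, rtype in reserved_regions(
            "/sys/kernel/iommu_groups/0/reserved_regions"):
        print(f"{rtype}: {start:#x}-{end:#x}")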
......@@ -76,7 +76,7 @@ quiet_cmd_sphinx = SPHINX $@ --> file://$(abspath $(BUILDDIR)/$3/$4)
PYTHONDONTWRITEBYTECODE=1 \
BUILDDIR=$(abspath $(BUILDDIR)) SPHINX_CONF=$(abspath $(srctree)/$(src)/$5/$(SPHINX_CONF)) \
$(PYTHON3) $(srctree)/scripts/jobserver-exec \
$(SHELL) $(srctree)/Documentation/sphinx/parallel-wrapper.sh \
$(CONFIG_SHELL) $(srctree)/Documentation/sphinx/parallel-wrapper.sh \
$(SPHINXBUILD) \
-b $2 \
-c $(abspath $(srctree)/$(src)) \
......
......@@ -22,9 +22,9 @@ or if the device has INTx interrupts connected by platform interrupt
controllers and a _PRT is needed to describe those connections.
ACPI resource description is done via _CRS objects of devices in the ACPI
namespace [2].   The _CRS is like a generalized PCI BAR: the OS can read
namespace [2]. The _CRS is like a generalized PCI BAR: the OS can read
_CRS and figure out what resource is being consumed even if it doesn't have
a driver for the device [3].  That's important because it means an old OS
a driver for the device [3]. That's important because it means an old OS
can work correctly even on a system with new devices unknown to the OS.
The new devices might not do anything, but the OS can at least make sure no
resources conflict with them.
......@@ -41,15 +41,15 @@ ACPI, that device will have a specific _HID/_CID that tells the OS what
driver to bind to it, and the _CRS tells the OS and the driver where the
device's registers are.
PCI host bridges are PNP0A03 or PNP0A08 devices.  Their _CRS should
describe all the address space they consume.  This includes all the windows
PCI host bridges are PNP0A03 or PNP0A08 devices. Their _CRS should
describe all the address space they consume. This includes all the windows
they forward down to the PCI bus, as well as registers of the host bridge
itself that are not forwarded to PCI.  The host bridge registers include
itself that are not forwarded to PCI. The host bridge registers include
things like secondary/subordinate bus registers that determine the bus
range below the bridge, window registers that describe the apertures, etc.
These are all device-specific, non-architected things, so the only way a
PNP0A03/PNP0A08 driver can manage them is via _PRS/_CRS/_SRS, which contain
the device-specific details.  The host bridge registers also include ECAM
the device-specific details. The host bridge registers also include ECAM
space, since it is consumed by the host bridge.
ACPI defines a Consumer/Producer bit to distinguish the bridge registers
......@@ -66,7 +66,7 @@ the PNP0A03/PNP0A08 device itself. The workaround was to describe the
bridge registers (including ECAM space) in PNP0C02 catch-all devices [6].
With the exception of ECAM, the bridge register space is device-specific
anyway, so the generic PNP0A03/PNP0A08 driver (pci_root.c) has no need to
know about it.  
know about it.
New architectures should be able to use "Consumer" Extended Address Space
descriptors in the PNP0A03 device for bridge registers, including ECAM,
......@@ -75,9 +75,9 @@ ia64 kernels assume all address space descriptors, including "Consumer"
Extended Address Space ones, are windows, so it would not be safe to
describe bridge registers this way on those architectures.
PNP0C02 "motherboard" devices are basically a catch-all.  There's no
PNP0C02 "motherboard" devices are basically a catch-all. There's no
programming model for them other than "don't use these resources for
anything else."  So a PNP0C02 _CRS should claim any address space that is
anything else." So a PNP0C02 _CRS should claim any address space that is
(1) not claimed by _CRS under any other device object in the ACPI namespace
and (2) should not be assigned by the OS to something else.
......
......@@ -125,4 +125,4 @@ all the EPF devices are created and linked with the EPC device.
| interrupt_pin
| function
[1] :doc:`pci-endpoint`
[1] Documentation/PCI/endpoint/pci-endpoint.rst
......@@ -265,7 +265,7 @@ Set the DMA mask size
---------------------
.. note::
If anything below doesn't make sense, please refer to
:doc:`/core-api/dma-api`. This section is just a reminder that
Documentation/core-api/dma-api.rst. This section is just a reminder that
drivers need to indicate DMA capabilities of the device and is not
an authoritative source for DMA interfaces.
......@@ -291,7 +291,7 @@ Many 64-bit "PCI" devices (before PCI-X) and some PCI-X devices are
Setup shared control data
-------------------------
Once the DMA masks are set, the driver can allocate "consistent" (a.k.a. shared)
memory. See :doc:`/core-api/dma-api` for a full description of
memory. See Documentation/core-api/dma-api.rst for a full description of
the DMA APIs. This section is just a reminder that it needs to be done
before enabling DMA on the device.
......@@ -421,7 +421,7 @@ owners if there is one.
Then clean up "consistent" buffers which contain the control data.
See :doc:`/core-api/dma-api` for details on unmapping interfaces.
See Documentation/core-api/dma-api.rst for details on unmapping interfaces.
Unregister from other subsystems
......
......@@ -2,87 +2,10 @@
How CPU topology info is exported via sysfs
===========================================
Export CPU topology info via sysfs. Items (attributes) are similar
to /proc/cpuinfo output of some architectures. They reside in
/sys/devices/system/cpu/cpuX/topology/:
physical_package_id:
physical package id of cpuX. Typically corresponds to a physical
socket number, but the actual value is architecture and platform
dependent.
die_id:
the CPU die ID of cpuX. Typically it is the hardware platform's
identifier (rather than the kernel's). The actual value is
architecture and platform dependent.
core_id:
the CPU core ID of cpuX. Typically it is the hardware platform's
identifier (rather than the kernel's). The actual value is
architecture and platform dependent.
book_id:
the book ID of cpuX. Typically it is the hardware platform's
identifier (rather than the kernel's). The actual value is
architecture and platform dependent.
drawer_id:
the drawer ID of cpuX. Typically it is the hardware platform's
identifier (rather than the kernel's). The actual value is
architecture and platform dependent.
core_cpus:
internal kernel map of CPUs within the same core.
(deprecated name: "thread_siblings")
core_cpus_list:
human-readable list of CPUs within the same core.
(deprecated name: "thread_siblings_list");
package_cpus:
internal kernel map of the CPUs sharing the same physical_package_id.
(deprecated name: "core_siblings")
package_cpus_list:
human-readable list of CPUs sharing the same physical_package_id.
(deprecated name: "core_siblings_list")
die_cpus:
internal kernel map of CPUs within the same die.
die_cpus_list:
human-readable list of CPUs within the same die.
book_siblings:
internal kernel map of cpuX's hardware threads within the same
book_id.
book_siblings_list:
human-readable list of cpuX's hardware threads within the same
book_id.
drawer_siblings:
internal kernel map of cpuX's hardware threads within the same
drawer_id.
drawer_siblings_list:
human-readable list of cpuX's hardware threads within the same
drawer_id.
CPU topology info is exported via sysfs. Items (attributes) are similar
to /proc/cpuinfo output of some architectures. They reside in
/sys/devices/system/cpu/cpuX/topology/. Please refer to the ABI file:
Documentation/ABI/stable/sysfs-devices-system-cpu.
Architecture-neutral, drivers/base/topology.c, exports these attributes.
However, the book and drawer related sysfs files will only be created if
......
......@@ -392,7 +392,7 @@ When mounting an ext4 filesystem, the following option are accepted:
dax
Use direct access (no page cache). See
Documentation/filesystems/dax.txt. Note that this option is
Documentation/filesystems/dax.rst. Note that this option is
incompatible with data=journal.
inlinecrypt
......
......@@ -3,7 +3,8 @@
SRBDS - Special Register Buffer Data Sampling
=============================================
SRBDS is a hardware vulnerability that allows MDS :doc:`mds` techniques to
SRBDS is a hardware vulnerability that allows MDS
Documentation/admin-guide/hw-vuln/mds.rst techniques to
infer values returned from special register accesses. Special register
accesses are accesses to off core registers. According to Intel's evaluation,
the special register reads that have a security expectation of privacy are
......
......@@ -2,7 +2,7 @@
Documentation for Kdump - The kexec-based Crash Dumping Solution
================================================================
This document includes overview, setup and installation, and analysis
This document includes overview, setup, installation, and analysis
information.
Overview
......@@ -13,9 +13,9 @@ dump of the system kernel's memory needs to be taken (for example, when
the system panics). The system kernel's memory image is preserved across
the reboot and is accessible to the dump-capture kernel.
You can use common commands, such as cp and scp, to copy the
memory image to a dump file on the local disk, or across the network to
a remote system.
You can use common commands, such as cp, scp or makedumpfile to copy
the memory image to a dump file on the local disk, or across the network
to a remote system.
Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
s390x, arm and arm64 architectures.
......@@ -26,13 +26,15 @@ the dump-capture kernel. This ensures that ongoing Direct Memory Access
The kexec -p command loads the dump-capture kernel into this reserved
memory.
On x86 machines, the first 640 KB of physical memory is needed to boot,
regardless of where the kernel loads. Therefore, kexec backs up this
region just before rebooting into the dump-capture kernel.
On x86 machines, the first 640 KB of physical memory is needed for boot,
regardless of where the kernel loads. For simpler handling, the whole
low 1M is reserved to avoid any later kernel or device driver writing
data into this area. Like this, the low 1M can be reused as system RAM
by kdump kernel without extra handling.
Similarly on PPC64 machines first 32KB of physical memory is needed for
booting regardless of where the kernel is loaded and to support 64K page
size kexec backs up the first 64KB memory.
On PPC64 machines first 32KB of physical memory is needed for booting
regardless of where the kernel is loaded and to support 64K page size
kexec backs up the first 64KB memory.
For s390x, when kdump is triggered, the crashkernel region is exchanged
with the region [0, crashkernel region size] and then the kdump kernel
......@@ -46,14 +48,14 @@ passed to the dump-capture kernel through the elfcorehdr= boot
parameter. Optionally the size of the ELF header can also be passed
when using the elfcorehdr=[size[KMG]@]offset[KMG] syntax.
With the dump-capture kernel, you can access the memory image through
/proc/vmcore. This exports the dump as an ELF-format file that you can
write out using file copy commands such as cp or scp. Further, you can
use analysis tools such as the GNU Debugger (GDB) and the Crash tool to
debug the dump file. This method ensures that the dump pages are correctly
ordered.
write out using file copy commands such as cp or scp. You can also use
makedumpfile utility to analyze and write out filtered contents with
options, e.g with '-d 31' it will only write out kernel data. Further,
you can use analysis tools such as the GNU Debugger (GDB) and the Crash
tool to debug the dump file. This method ensures that the dump pages are
correctly ordered.
Setup and Installation
======================
......@@ -125,9 +127,18 @@ dump-capture kernels for enabling kdump support.
System kernel config options
----------------------------
1) Enable "kexec system call" in "Processor type and features."::
1) Enable "kexec system call" or "kexec file based system call" in
"Processor type and features."::
CONFIG_KEXEC=y or CONFIG_KEXEC_FILE=y
And both of them will select KEXEC_CORE::
CONFIG_KEXEC=y
CONFIG_KEXEC_CORE=y
Subsequently, CRASH_CORE is selected by KEXEC_CORE::
CONFIG_CRASH_CORE=y
2) Enable "sysfs file system support" in "Filesystem" -> "Pseudo
filesystems." This is usually enabled by default::
......@@ -175,17 +186,19 @@ Dump-capture kernel config options (Arch Dependent, i386 and x86_64)
CONFIG_HIGHMEM4G
2) On i386 and x86_64, disable symmetric multi-processing support
under "Processor type and features"::
2) With CONFIG_SMP=y, usually nr_cpus=1 need specified on the kernel
command line when loading the dump-capture kernel because one
CPU is enough for kdump kernel to dump vmcore on most of systems.
CONFIG_SMP=n
However, you can also specify nr_cpus=X to enable multiple processors
in kdump kernel. In this case, "disable_cpu_apicid=" is needed to
tell kdump kernel which cpu is 1st kernel's BSP. Please refer to
admin-guide/kernel-parameters.txt for more details.
(If CONFIG_SMP=y, then specify maxcpus=1 on the kernel command line
when loading the dump-capture kernel, see section "Load the Dump-capture
Kernel".)
With CONFIG_SMP=n, the above things are not related.
3) If one wants to build and use a relocatable kernel,
Enable "Build a relocatable kernel" support under "Processor type and
3) A relocatable kernel is suggested to be built by default. If not yet,
enable "Build a relocatable kernel" support under "Processor type and
features"::
CONFIG_RELOCATABLE=y
......@@ -232,7 +245,7 @@ Dump-capture kernel config options (Arch Dependent, ia64)
as a dump-capture kernel if desired.
The crashkernel region can be automatically placed by the system
kernel at run time. This is done by specifying the base address as 0,
kernel at runtime. This is done by specifying the base address as 0,
or omitting it all together::
crashkernel=256M@0
......@@ -241,10 +254,6 @@ Dump-capture kernel config options (Arch Dependent, ia64)
crashkernel=256M
If the start address is specified, note that the start address of the
kernel will be aligned to 64Mb, so if the start address is not then
any space below the alignment point will be wasted.
Dump-capture kernel config options (Arch Dependent, arm)
----------------------------------------------------------
......@@ -260,46 +269,82 @@ Dump-capture kernel config options (Arch Dependent, arm64)
on non-VHE systems even if it is configured. This is because the CPU
will not be reset to EL2 on panic.
Extended crashkernel syntax
crashkernel syntax
===========================
1) crashkernel=size@offset
While the "crashkernel=size[@offset]" syntax is sufficient for most
configurations, sometimes it's handy to have the reserved memory dependent
on the value of System RAM -- that's mostly for distributors that pre-setup
the kernel command line to avoid a unbootable system after some memory has
been removed from the machine.
Here 'size' specifies how much memory to reserve for the dump-capture kernel
and 'offset' specifies the beginning of this reserved memory. For example,
"crashkernel=64M@16M" tells the system kernel to reserve 64 MB of memory
starting at physical address 0x01000000 (16MB) for the dump-capture kernel.
The syntax is::
The crashkernel region can be automatically placed by the system
kernel at run time. This is done by specifying the base address as 0,
or omitting it all together::
crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset]
range=start-[end]
crashkernel=256M@0
For example::
or::
crashkernel=512M-2G:64M,2G-:128M
crashkernel=256M
This would mean:
If the start address is specified, note that the start address of the
kernel will be aligned to a value (which is Arch dependent), so if the
start address is not then any space below the alignment point will be
wasted.
1) if the RAM is smaller than 512M, then don't reserve anything
(this is the "rescue" case)
2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
3) if the RAM size is larger than 2G, then reserve 128M
2) range1:size1[,range2:size2,...][@offset]
While the "crashkernel=size[@offset]" syntax is sufficient for most
configurations, sometimes it's handy to have the reserved memory dependent
on the value of System RAM -- that's mostly for distributors that pre-setup
the kernel command line to avoid a unbootable system after some memory has
been removed from the machine.
The syntax is::
Boot into System Kernel
=======================
crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset]
range=start-[end]
For example::
crashkernel=512M-2G:64M,2G-:128M
This would mean:
1) if the RAM is smaller than 512M, then don't reserve anything
(this is the "rescue" case)
2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
3) if the RAM size is larger than 2G, then reserve 128M
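To make the selection rule above concrete, a sketch that picks the reservation for a given amount of System RAM from a range list such as crashkernel=512M-2G:64M,2G-:128M (suffix parsing is simplified; actual kernel behaviour may differ in corner cases)::

    UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

    def to_bytes(value):
        return int(value[:-1]) * UNITS[value[-1]] if value[-1] in UNITS else int(value)

    def crashkernel_size(spec, ram_bytes):
        # spec example: "512M-2G:64M,2G-:128M"
        for entry in spec.split(","):
            rng, size = entry.split(":")
            start, _, end = rng.partition("-")
            if to_bytes(start) <= ram_bytes and (not end or ram_bytes < to_bytes(end)):
                return to_bytes(size)
        return 0   # RAM below the first range: reserve nothing

    spec = "512M-2G:64M,2G-:128M"
    print(crashkernel_size(spec, 1 << 30) >> 20)    # 1G RAM   -> 64 (MB)
    print(crashkernel_size(spec, 4 << 30) >> 20)    # 4G RAM   -> 128 (MB)
    print(crashkernel_size(spec, 256 << 20) >> 20)  # 256M RAM -> 0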
3) crashkernel=size,high and crashkernel=size,low
If memory above 4G is preferred, crashkernel=size,high can be used to
fulfill that. With it, physical memory is allowed to be allocated from top,
so could be above 4G if system has more than 4G RAM installed. Otherwise,
memory region will be allocated below 4G if available.
When crashkernel=X,high is passed, kernel could allocate physical memory
region above 4G, low memory under 4G is needed in this case. There are
three ways to get low memory:
1) Kernel will allocate at least 256M memory below 4G automatically
if crashkernel=Y,low is not specified.
2) Let user specify low memory size instead.
3) Specified value 0 will disable low memory allocation::
crashkernel=0,low
Boot into System Kernel
-----------------------
1) Update the boot loader (such as grub, yaboot, or lilo) configuration
files as necessary.
2) Boot the system kernel with the boot parameter "crashkernel=Y@X",
where Y specifies how much memory to reserve for the dump-capture kernel
and X specifies the beginning of this reserved memory. For example,
"crashkernel=64M@16M" tells the system kernel to reserve 64 MB of memory
starting at physical address 0x01000000 (16MB) for the dump-capture kernel.
2) Boot the system kernel with the boot parameter "crashkernel=Y@X".
On x86 and x86_64, use "crashkernel=64M@16M".
On x86 and x86_64, use "crashkernel=Y[@X]". Most of the time, the
start address 'X' is not necessary, kernel will search a suitable
area. Unless an explicit start address is expected.
On ppc64, use "crashkernel=128M@32M".
......@@ -331,8 +376,8 @@ of dump-capture kernel. Following is the summary.
For i386 and x86_64:
- Use vmlinux if kernel is not relocatable.
- Use bzImage/vmlinuz if kernel is relocatable.
- Use vmlinux if kernel is not relocatable.
For ppc64:
......@@ -392,7 +437,7 @@ loading dump-capture kernel.
For i386, x86_64 and ia64:
"1 irqpoll maxcpus=1 reset_devices"
"1 irqpoll nr_cpus=1 reset_devices"
For ppc64:
......@@ -400,7 +445,7 @@ For ppc64:
For s390x:
"1 maxcpus=1 cgroup_disable=memory"
"1 nr_cpus=1 cgroup_disable=memory"
For arm:
......@@ -408,7 +453,7 @@ For arm:
For arm64:
"1 maxcpus=1 reset_devices"
"1 nr_cpus=1 reset_devices"
Notes on loading the dump-capture kernel:
......@@ -488,6 +533,10 @@ the following command::
cp /proc/vmcore <dump-file>
You can also use makedumpfile utility to write out the dump file
with specified options to filter out unwanted contents, e.g::
makedumpfile -l --message-level 1 -d 31 /proc/vmcore <dump-file>
Analysis
========
......@@ -535,8 +584,7 @@ This will cause a kdump to occur at the add_taint()->panic() call.
Contact
=======
- Vivek Goyal (vgoyal@redhat.com)
- Maneesh Soni (maneesh@in.ibm.com)
- kexec@lists.infradead.org
GDB macros
==========
......
......@@ -3513,6 +3513,9 @@
nr_uarts= [SERIAL] maximum number of UARTs to be registered.
numa=off [KNL, ARM64, PPC, RISCV, SPARC, X86] Disable NUMA, Only
set up a single NUMA node spanning all memory.
numa_balancing= [KNL,ARM64,PPC,RISCV,S390,X86] Enable or disable automatic
NUMA balancing.
Allowed values are enable and disable
......
......@@ -20,8 +20,8 @@ Nehalem and later generations of Intel processors, but the level of support for
a particular processor model in it depends on whether or not it recognizes that
processor model and may also depend on information coming from the platform
firmware. [To understand ``intel_idle`` it is necessary to know how ``CPUIdle``
works in general, so this is the time to get familiar with :doc:`cpuidle` if you
have not done that yet.]
works in general, so this is the time to get familiar with
Documentation/admin-guide/pm/cpuidle.rst if you have not done that yet.]
``intel_idle`` uses the ``MWAIT`` instruction to inform the processor that the
logical CPU executing it is idle and so it may be possible to put some of the
......@@ -53,7 +53,8 @@ processor) corresponding to them depends on the processor model and it may also
depend on the configuration of the platform.
In order to create a list of available idle states required by the ``CPUIdle``
subsystem (see :ref:`idle-states-representation` in :doc:`cpuidle`),
subsystem (see :ref:`idle-states-representation` in
Documentation/admin-guide/pm/cpuidle.rst),
``intel_idle`` can use two sources of information: static tables of idle states
for different processor models included in the driver itself and the ACPI tables
of the system. The former are always used if the processor model at hand is
......@@ -98,7 +99,8 @@ states may not be enabled by default if there are no matching entries in the
preliminary list of idle states coming from the ACPI tables. In that case user
space still can enable them later (on a per-CPU basis) with the help of
the ``disable`` idle state attribute in ``sysfs`` (see
:ref:`idle-states-representation` in :doc:`cpuidle`). This basically means that
:ref:`idle-states-representation` in
Documentation/admin-guide/pm/cpuidle.rst). This basically means that
the idle states "known" to the driver may not be enabled by default if they have
not been exposed by the platform firmware (through the ACPI tables).
......@@ -186,7 +188,8 @@ be desirable. In practice, it is only really necessary to do that if the idle
states in question cannot be enabled during system startup, because in the
working state of the system the CPU power management quality of service (PM
QoS) feature can be used to prevent ``CPUIdle`` from touching those idle states
even if they have been enumerated (see :ref:`cpu-pm-qos` in :doc:`cpuidle`).
even if they have been enumerated (see :ref:`cpu-pm-qos` in
Documentation/admin-guide/pm/cpuidle.rst).
Setting ``max_cstate`` to 0 causes the ``intel_idle`` initialization to fail.
The ``no_acpi`` and ``use_acpi`` module parameters (recognized by ``intel_idle``
......@@ -202,7 +205,8 @@ Namely, the positions of the bits that are set in the ``states_off`` value are
the indices of idle states to be disabled by default (as reflected by the names
of the corresponding idle state directories in ``sysfs``, :file:`state0`,
:file:`state1` ... :file:`state<i>` ..., where ``<i>`` is the index of the given
idle state; see :ref:`idle-states-representation` in :doc:`cpuidle`).
idle state; see :ref:`idle-states-representation` in
Documentation/admin-guide/pm/cpuidle.rst).
For example, if ``states_off`` is equal to 3, the driver will disable idle
states 0 and 1 by default, and if it is equal to 8, idle state 3 will be
......
......@@ -18,8 +18,8 @@ General Information
(``CPUFreq``). It is a scaling driver for the Sandy Bridge and later
generations of Intel processors. Note, however, that some of those processors
may not be supported. [To understand ``intel_pstate`` it is necessary to know
how ``CPUFreq`` works in general, so this is the time to read :doc:`cpufreq` if
you have not done that yet.]
how ``CPUFreq`` works in general, so this is the time to read
Documentation/admin-guide/pm/cpufreq.rst if you have not done that yet.]
For the processors supported by ``intel_pstate``, the P-state concept is broader
than just an operating frequency or an operating performance point (see the
......@@ -445,8 +445,9 @@ Interpretation of Policy Attributes
-----------------------------------
The interpretation of some ``CPUFreq`` policy attributes described in
:doc:`cpufreq` is special with ``intel_pstate`` as the current scaling driver
and it generally depends on the driver's `operation mode <Operation Modes_>`_.
Documentation/admin-guide/pm/cpufreq.rst is special with ``intel_pstate``
as the current scaling driver and it generally depends on the driver's
`operation mode <Operation Modes_>`_.
First of all, the values of the ``cpuinfo_max_freq``, ``cpuinfo_min_freq`` and
``scaling_cur_freq`` attributes are produced by applying a processor-specific
......
......@@ -1248,7 +1248,7 @@ paragraph makes the severeness obvious.
In case you performed a successful bisection, use the title of the change that
introduced the regression as the second part of your subject. Make the report
also mention the commit id of the culprit. In case of an unsuccessful bisection,
also mention the commit id of the culprit. In case of an unsuccessful bisection,
make your report mention the latest tested version that's working fine (say 5.7)
and the oldest where the issue occurs (say 5.8-rc1).
......
......@@ -11,7 +11,7 @@ Documentation for /proc/sys/abi/
Copyright (c) 2020, Stephen Kitt
For general info, see :doc:`index`.
For general info, see Documentation/admin-guide/sysctl/index.rst.
------------------------------------------------------------------------------
......
......@@ -9,7 +9,8 @@ Copyright (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>
Copyright (c) 2009, Shen Feng<shen@cn.fujitsu.com>
For general info and legal blurb, please look in :doc:`index`.
For general info and legal blurb, please look in
Documentation/admin-guide/sysctl/index.rst.
------------------------------------------------------------------------------
......@@ -54,7 +55,7 @@ free space valid for 30 seconds.
acpi_video_flags
================
See :doc:`/power/video`. This allows the video resume mode to be set,
See Documentation/power/video.rst. This allows the video resume mode to be set,
in a similar fashion to the ``acpi_sleep`` kernel parameter, by
combining the following values:
......@@ -89,7 +90,7 @@ is 0x15 and the full version number is 0x234, this file will contain
the value 340 = 0x154.
See the ``type_of_loader`` and ``ext_loader_type`` fields in
:doc:`/x86/boot` for additional information.
Documentation/x86/boot.rst for additional information.
bootloader_version (x86 only)
......@@ -99,7 +100,7 @@ The complete bootloader version number. In the example above, this
file will contain the value 564 = 0x234.
See the ``type_of_loader`` and ``ext_loader_ver`` fields in
:doc:`/x86/boot` for additional information.
Documentation/x86/boot.rst for additional information.
bpf_stats_enabled
......@@ -269,7 +270,7 @@ see the ``hostname(1)`` man page.
firmware_config
===============
See :doc:`/driver-api/firmware/fallback-mechanisms`.
See Documentation/driver-api/firmware/fallback-mechanisms.rst.
The entries in this directory allow the firmware loader helper
fallback to be controlled:
......@@ -297,7 +298,7 @@ crashes and outputting them to a serial console.
ftrace_enabled, stack_tracer_enabled
====================================
See :doc:`/trace/ftrace`.
See Documentation/trace/ftrace.rst.
hardlockup_all_cpu_backtrace
......@@ -325,7 +326,7 @@ when a hard lockup is detected.
1 Panic on hard lockup.
= ===========================
See :doc:`/admin-guide/lockup-watchdogs` for more information.
See Documentation/admin-guide/lockup-watchdogs.rst for more information.
This can also be set using the nmi_watchdog kernel parameter.
......@@ -333,7 +334,12 @@ hotplug
=======
Path for the hotplug policy agent.
Default value is "``/sbin/hotplug``".
Default value is ``CONFIG_UEVENT_HELPER_PATH``, which in turn defaults
to the empty string.
This file only exists when ``CONFIG_UEVENT_HELPER`` is enabled. Most
modern systems rely exclusively on the netlink-based uevent source and
don't need this.
hung_task_all_cpu_backtrace
......@@ -582,7 +588,8 @@ in a KVM virtual machine. This default can be overridden by adding::
nmi_watchdog=1
to the guest kernel command line (see :doc:`/admin-guide/kernel-parameters`).
to the guest kernel command line (see
Documentation/admin-guide/kernel-parameters.rst).
numa_balancing
......@@ -1067,7 +1074,7 @@ that support this feature.
real-root-dev
=============
See :doc:`/admin-guide/initrd`.
See Documentation/admin-guide/initrd.rst.
reboot-cmd (SPARC only)
......@@ -1161,7 +1168,7 @@ will take effect.
seccomp
=======
See :doc:`/userspace-api/seccomp_filter`.
See Documentation/userspace-api/seccomp_filter.rst.
sg-big-buff
......@@ -1332,7 +1339,7 @@ the boot PROM.
sysrq
=====
See :doc:`/admin-guide/sysrq`.
See Documentation/admin-guide/sysrq.rst.
tainted
......@@ -1362,15 +1369,16 @@ ORed together. The letters are seen in "Tainted" line of Oops reports.
131072 `(T)` The kernel was built with the struct randomization plugin
====== ===== ==============================================================
See :doc:`/admin-guide/tainted-kernels` for more information.
See Documentation/admin-guide/tainted-kernels.rst for more information.
Note:
writes to this sysctl interface will fail with ``EINVAL`` if the kernel is
booted with the command line option ``panic_on_taint=<bitmask>,nousertaint``
and any of the ORed together values being written to ``tainted`` match with
the bitmask declared on panic_on_taint.
See :doc:`/admin-guide/kernel-parameters` for more details on that particular
kernel command line option and its optional ``nousertaint`` switch.
See Documentation/admin-guide/kernel-parameters.rst for more details on
that particular kernel command line option and its optional
``nousertaint`` switch.
threads-max
===========
......@@ -1394,7 +1402,7 @@ If a value outside of this range is written to ``threads-max`` an
traceoff_on_warning
===================
When set, disables tracing (see :doc:`/trace/ftrace`) when a
When set, disables tracing (see Documentation/trace/ftrace.rst) when a
``WARN()`` is hit.
......@@ -1414,8 +1422,8 @@ will send them to printk() again.
This only works if the kernel was booted with ``tp_printk`` enabled.
See :doc:`/admin-guide/kernel-parameters` and
:doc:`/trace/boottime-trace`.
See Documentation/admin-guide/kernel-parameters.rst and
Documentation/trace/boottime-trace.rst.
.. _unaligned-dump-stack:
......
......@@ -259,7 +259,7 @@ Storage family
https://web.archive.org/web/20191129073953/http://www.marvell.com/storage/armada-sp/
Core:
Sheeva ARMv7 comatible Quad-core PJ4C
Sheeva ARMv7 compatible Quad-core PJ4C
(not supported in upstream Linux kernel)
......
......@@ -196,7 +196,7 @@ a virtual address mapping (unlike the earlier scheme of virtual address
do not have a corresponding kernel virtual address space mapping) and
low-memory pages.
Note: Please refer to :doc:`/core-api/dma-api-howto` for a discussion
Note: Please refer to Documentation/core-api/dma-api-howto.rst for a discussion
on PCI high mem DMA aspects and mapping of scatter gather lists, and support
for 64 bit PCI.
......
......@@ -62,7 +62,7 @@ queue, to be sent in the future, when the hardware is able.
Software staging queues
~~~~~~~~~~~~~~~~~~~~~~~
The block IO subsystem adds requests in the software staging queues
The block IO subsystem adds requests in the software staging queues
(represented by struct blk_mq_ctx) in case that they weren't sent
directly to the driver. A request is one or more BIOs. They arrived at the
block layer through the data structure struct bio. The block layer
......@@ -132,7 +132,7 @@ In order to indicate which request has been completed, every request is
identified by an integer, ranging from 0 to the dispatch queue size. This tag
is generated by the block layer and later reused by the device driver, removing
the need to create a redundant identifier. When a request is completed in the
drive, the tag is sent back to the block layer to notify it of the finalization.
driver, the tag is sent back to the block layer to notify it of the finalization.
This removes the need to do a linear search to find out which IO has been
completed.
......
......@@ -18,7 +18,7 @@ A.
each, it would be impossible to guarantee that a set of readings
represent a single point in time.
The stat file consists of a single line of text containing 11 decimal
The stat file consists of a single line of text containing 17 decimal
values separated by whitespace. The fields are summarized in the
following table, and described in more detail below.
......
......@@ -20,10 +20,10 @@ LSM hook:
Other LSM hooks which can be instrumented can be found in
``include/linux/lsm_hooks.h``.
eBPF programs that use :doc:`/bpf/btf` do not need to include kernel headers
for accessing information from the attached eBPF program's context. They can
simply declare the structures in the eBPF program and only specify the fields
that need to be accessed.
eBPF programs that use Documentation/bpf/btf.rst do not need to include kernel
headers for accessing information from the attached eBPF program's context.
They can simply declare the structures in the eBPF program and only specify
the fields that need to be accessed.
.. code-block:: c
......@@ -88,8 +88,9 @@ example:
The ``__attribute__((preserve_access_index))`` is a clang feature that allows
the BPF verifier to update the offsets for the access at runtime using the
:doc:`/bpf/btf` information. Since the BPF verifier is aware of the types, it
also validates all the accesses made to the various types in the eBPF program.
Documentation/bpf/btf.rst information. Since the BPF verifier is aware of the
types, it also validates all the accesses made to the various types in the
eBPF program.
Loading
-------
......
......@@ -41,15 +41,7 @@ extensions = ['kerneldoc', 'rstFlatTable', 'kernel_include',
'maintainers_include', 'sphinx.ext.autosectionlabel',
'kernel_abi', 'kernel_feat']
#
# cdomain is badly broken in Sphinx 3+. Leaving it out generates *most*
# of the docs correctly, but not all. Scream bloody murder but allow
# the process to proceed; hopefully somebody will fix this properly soon.
#
if major >= 3:
sys.stderr.write('''WARNING: The kernel documentation build process
support for Sphinx v3.0 and above is brand new. Be prepared for
possible issues in the generated output.\n''')
if (major > 3) or (minor > 0 or patch >= 2):
# Sphinx c function parser is more pedantic with regards to type
# checking. Due to that, having macros at c:function cause problems.
......@@ -353,6 +345,8 @@ latex_elements = {
# Additional stuff for the LaTeX preamble.
'preamble': '''
% Prevent column squeezing of tabulary.
\\setlength{\\tymin}{20em}
% Use some font with UTF-8 support with XeLaTeX
\\usepackage{fontspec}
\\setsansfont{DejaVu Sans}
......@@ -366,11 +360,23 @@ latex_elements = {
cjk_cmd = check_output(['fc-list', '--format="%{family[0]}\n"']).decode('utf-8', 'ignore')
if cjk_cmd.find("Noto Sans CJK SC") >= 0:
print ("enabling CJK for LaTeX builder")
latex_elements['preamble'] += '''
% This is needed for translations
\\usepackage{xeCJK}
\\setCJKmainfont{Noto Sans CJK SC}
% Define custom macros to on/off CJK
\\newcommand{\\kerneldocCJKon}{\\makexeCJKactive}
\\newcommand{\\kerneldocCJKoff}{\\makexeCJKinactive}
% To customize \sphinxtableofcontents
\\usepackage{etoolbox}
% Inactivate CJK after tableofcontents
\\apptocmd{\\sphinxtableofcontents}{\\kerneldocCJKoff}{}{}
'''
else:
latex_elements['preamble'] += '''
% Custom macros to on/off CJK (Dummy)
\\newcommand{\\kerneldocCJKon}{}
\\newcommand{\\kerneldocCJKoff}{}
'''
# Fix reference escape troubles with Sphinx 1.4.x
......
......@@ -8,7 +8,7 @@ How to access I/O mapped memory from within device drivers
The virt_to_bus() and bus_to_virt() functions have been
superseded by the functionality provided by the PCI DMA interface
(see :doc:`/core-api/dma-api-howto`). They continue
(see Documentation/core-api/dma-api-howto.rst). They continue
to be documented below for historical purposes, but new code
must not use them. --davidm 00/12/12
......
......@@ -5,7 +5,7 @@ Dynamic DMA mapping using the generic device
:Author: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
This document describes the DMA API. For a more gentle introduction
of the API (and actual examples), see :doc:`/core-api/dma-api-howto`.
of the API (and actual examples), see Documentation/core-api/dma-api-howto.rst.
This API is split into two pieces. Part I describes the basic API.
Part II describes extensions for supporting non-consistent memory
......@@ -479,7 +479,8 @@ without the _attrs suffixes, except that they pass an optional
dma_attrs.
The interpretation of DMA attributes is architecture-specific, and
each attribute should be documented in :doc:`/core-api/dma-attributes`.
each attribute should be documented in
Documentation/core-api/dma-attributes.rst.
If dma_attrs are 0, the semantics of each of these functions
is identical to those of the corresponding function
......
......@@ -17,7 +17,7 @@ To do ISA style DMA you need to include two headers::
#include <asm/dma.h>
The first is the generic DMA API used to convert virtual addresses to
bus addresses (see :doc:`/core-api/dma-api` for details).
bus addresses (see Documentation/core-api/dma-api.rst for details).
The second contains the routines specific to ISA DMA transfers. Since
this is not present on all platforms make sure you construct your
......
......@@ -48,7 +48,7 @@ Concurrency primitives
======================
How Linux keeps everything from happening at the same time. See
:doc:`/locking/index` for more related documentation.
Documentation/locking/index.rst for more related documentation.
.. toctree::
:maxdepth: 1
......@@ -77,7 +77,7 @@ Memory management
=================
How to allocate and use memory in the kernel. Note that there is a lot
more memory-management documentation in :doc:`/vm/index`.
more memory-management documentation in Documentation/vm/index.rst.
.. toctree::
:maxdepth: 1
......
......@@ -37,14 +37,13 @@ Integer types
u64 %llu or %llx
If <type> is dependent on a config option for its size (e.g., sector_t,
blkcnt_t) or is architecture-dependent for its size (e.g., tcflag_t), use a
format specifier of its largest possible type and explicitly cast to it.
If <type> is architecture-dependent for its size (e.g., cycles_t, tcflag_t) or
is dependent on a config option for its size (e.g., blk_status_t), use a format
specifier of its largest possible type and explicitly cast to it.
Example::
printk("test: sector number/total blocks: %llu/%llu\n",
(unsigned long long)sector, (unsigned long long)blockcount);
printk("test: latency: %llu cycles\n", (unsigned long long)time);
Reminder: sizeof() returns type size_t.
......
......@@ -246,6 +246,7 @@ Allocation style
The first argument for kcalloc or kmalloc_array should be the
number of elements. sizeof() as the first argument is generally
wrong.
See: https://www.kernel.org/doc/html/latest/core-api/memory-allocation.html
**ALLOC_SIZEOF_STRUCT**
......@@ -264,6 +265,7 @@ Allocation style
**ALLOC_WITH_MULTIPLY**
Prefer kmalloc_array/kcalloc over kmalloc/kzalloc with a
sizeof multiply.
See: https://www.kernel.org/doc/html/latest/core-api/memory-allocation.html
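    For illustration only (``buf`` and ``count`` are hypothetical names), the
    conversion usually looks like::

      /* discouraged: the open-coded multiply can overflow */
      buf = kmalloc(count * sizeof(*buf), GFP_KERNEL);

      /* preferred: kmalloc_array() checks for multiplication overflow */
      buf = kmalloc_array(count, sizeof(*buf), GFP_KERNEL);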
......@@ -284,6 +286,7 @@ API usage
BUG() or BUG_ON() should be avoided totally.
Use WARN() and WARN_ON() instead, and handle the "impossible"
error condition as gracefully as possible.
See: https://www.kernel.org/doc/html/latest/process/deprecated.html#bug-and-bug-on
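    One possible rewrite, assuming the error can be propagated to the caller
    (``ptr`` and the return value are hypothetical)::

      /* instead of: BUG_ON(!ptr); */
      if (WARN_ON(!ptr))
              return -EINVAL;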
**CONSIDER_KSTRTO**
......@@ -292,12 +295,161 @@ API usage
may lead to unexpected results in callers. The respective kstrtol(),
kstrtoll(), kstrtoul(), and kstrtoull() functions tend to be the
correct replacements.
See: https://www.kernel.org/doc/html/latest/process/deprecated.html#simple-strtol-simple-strtoll-simple-strtoul-simple-strtoull
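    A sketch of the conversion (``buf``, ``val`` and ``ret`` are hypothetical)::

      /* instead of: val = simple_strtoul(buf, NULL, 10); */
      ret = kstrtoul(buf, 10, &val);
      if (ret)
              return ret;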
**CONSTANT_CONVERSION**
Use of __constant_<foo> form is discouraged for the following functions::
__constant_cpu_to_be[x]
__constant_cpu_to_le[x]
__constant_be[x]_to_cpu
__constant_le[x]_to_cpu
__constant_htons
__constant_ntohs
Using any of these outside of include/uapi/ is not preferred as using the
function without __constant_ is identical when the argument is a
constant.
In big endian systems, the macros like __constant_cpu_to_be32(x) and
cpu_to_be32(x) expand to the same expression::
#define __constant_cpu_to_be32(x) ((__force __be32)(__u32)(x))
#define __cpu_to_be32(x) ((__force __be32)(__u32)(x))
In little endian systems, the macros __constant_cpu_to_be32(x) and
cpu_to_be32(x) expand to __constant_swab32 and __swab32. __swab32
has a __builtin_constant_p check::
#define __swab32(x) \
(__builtin_constant_p((__u32)(x)) ? \
___constant_swab32(x) : \
__fswab32(x))
So ultimately they have a special case for constants.
Similar is the case with all of the macros in the list. Thus
using the __constant_... forms is unnecessarily verbose and
not preferred outside of include/uapi.
See: https://lore.kernel.org/lkml/1400106425.12666.6.camel@joe-AO725/
**DEPRECATED_API**
Usage of a deprecated RCU API is detected. It is recommended to replace
old flavourful RCU APIs by their new vanilla-RCU counterparts.
The full list of available RCU APIs can be viewed from the kernel docs.
See: https://www.kernel.org/doc/html/latest/RCU/whatisRCU.html#full-list-of-rcu-apis
**DEPRECATED_VARIABLE**
EXTRA_{A,C,CPP,LD}FLAGS are deprecated and should be replaced by the new
flags added via commit f77bf01425b1 ("kbuild: introduce ccflags-y,
asflags-y and ldflags-y").
The following conversion scheme may be used::
EXTRA_AFLAGS -> asflags-y
EXTRA_CFLAGS -> ccflags-y
EXTRA_CPPFLAGS -> cppflags-y
EXTRA_LDFLAGS -> ldflags-y
See:
1. https://lore.kernel.org/lkml/20070930191054.GA15876@uranus.ravnborg.org/
2. https://lore.kernel.org/lkml/1313384834-24433-12-git-send-email-lacombar@gmail.com/
3. https://www.kernel.org/doc/html/latest/kbuild/makefiles.html#compilation-flags
**DEVICE_ATTR_FUNCTIONS**
The function names used in DEVICE_ATTR are unusual.
Typically, the store and show functions are used with <attr>_store and
<attr>_show, where <attr> is a named attribute variable of the device.
Consider the following examples::
static DEVICE_ATTR(type, 0444, type_show, NULL);
static DEVICE_ATTR(power, 0644, power_show, power_store);
The function names should preferably follow the above pattern.
See: https://www.kernel.org/doc/html/latest/driver-api/driver-model/device.html#attributes
**DEVICE_ATTR_RO**
The DEVICE_ATTR_RO(name) helper macro can be used instead of
DEVICE_ATTR(name, 0444, name_show, NULL);
Note that the macro automatically appends _show to the named
attribute variable of the device for the show method.
See: https://www.kernel.org/doc/html/latest/driver-api/driver-model/device.html#attributes
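    For example, a read-only attribute might be declared as follows (the
    ``type_show()`` callback and the value it prints are hypothetical)::

      static ssize_t type_show(struct device *dev,
                               struct device_attribute *attr, char *buf)
      {
              return sysfs_emit(buf, "%d\n", 0);
      }
      static DEVICE_ATTR_RO(type);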
**DEVICE_ATTR_RW**
The DEVICE_ATTR_RW(name) helper macro can be used instead of
DEVICE_ATTR(name, 0644, name_show, name_store);
Note that the macro automatically appends _show and _store to the
named attribute variable of the device for the show and store methods.
See: https://www.kernel.org/doc/html/latest/driver-api/driver-model/device.html#attributes
**DEVICE_ATTR_WO**
The DEVICE_ATTR_WO(name) helper macro can be used instead of
DEVICE_ATTR(name, 0200, NULL, name_store);
Note that the macro automatically appends _store to the
named attribute variable of the device for the store method.
See: https://www.kernel.org/doc/html/latest/driver-api/driver-model/device.html#attributes
**DUPLICATED_SYSCTL_CONST**
Commit d91bff3011cf ("proc/sysctl: add shared variables for range
check") added some shared const variables to be used instead of a local
copy in each source file.
Consider replacing the sysctl range checking value with the shared
one in include/linux/sysctl.h. The following conversion scheme may
be used::
&zero -> SYSCTL_ZERO
&one -> SYSCTL_ONE
&int_max -> SYSCTL_INT_MAX
See:
1. https://lore.kernel.org/lkml/20190430180111.10688-1-mcroce@redhat.com/
2. https://lore.kernel.org/lkml/20190531131422.14970-1-mcroce@redhat.com/
**ENOSYS**
ENOSYS means that a nonexistent system call was called.
Earlier, it was wrongly used for things like invalid operations on
otherwise valid syscalls. This should be avoided in new code.
See: https://lore.kernel.org/lkml/5eb299021dec23c1a48fa7d9f2c8b794e967766d.1408730669.git.luto@amacapital.net/
**ENOTSUPP**
ENOTSUPP is not a standard error code and should be avoided in new patches.
EOPNOTSUPP should be used instead.
See: https://lore.kernel.org/netdev/20200510182252.GA411829@lunn.ch/
**EXPORT_SYMBOL**
EXPORT_SYMBOL should immediately follow the symbol to be exported.
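    For example (``my_func()`` is a hypothetical exported function)::

      int my_func(void)
      {
              return 0;
      }
      EXPORT_SYMBOL(my_func);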
**IN_ATOMIC**
in_atomic() is not for driver use so any such use is reported as an ERROR.
Also in_atomic() is often used to determine if sleeping is permitted,
but it is not reliable in this use model. Therefore its use is
strongly discouraged.
However, in_atomic() is ok for core kernel use.
See: https://lore.kernel.org/lkml/20080320201723.b87b3732.akpm@linux-foundation.org/
**LOCKDEP**
The lockdep_no_validate class was added as a temporary measure to
prevent warnings on conversion of device->sem to device->mutex.
It should not be used for any other purpose.
See: https://lore.kernel.org/lkml/1268959062.9440.467.camel@laptop/
**MALFORMED_INCLUDE**
......@@ -308,14 +460,21 @@ API usage
**USE_LOCKDEP**
lockdep_assert_held() annotations should be preferred over
assertions based on spin_is_locked()
See: https://www.kernel.org/doc/html/latest/locking/lockdep-design.html#annotations
**UAPI_INCLUDE**
No #include statements in include/uapi should use a uapi/ path.
**USLEEP_RANGE**
usleep_range() should be preferred over udelay(). The proper way of
using usleep_range() is mentioned in the kernel docs.
Comment style
-------------
See: https://www.kernel.org/doc/html/latest/timers/timers-howto.html#delays-information-on-the-various-kernel-delay-sleep-mechanisms
Comments
--------
**BLOCK_COMMENT_STYLE**
The comment style is incorrect. The preferred style for multi-
......@@ -338,8 +497,24 @@ Comment style
**C99_COMMENTS**
C99 style single line comments (//) should not be used.
Prefer the block comment style instead.
See: https://www.kernel.org/doc/html/latest/process/coding-style.html#commenting
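    A quick illustration of the two styles::

      // C99 style single line comment (discouraged)

      /* preferred block comment style */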
**DATA_RACE**
Applications of data_race() should have a comment documenting the
reasoning behind why it was deemed safe.
See: https://lore.kernel.org/lkml/20200401101714.44781-1-elver@google.com/
**FSF_MAILING_ADDRESS**
Kernel maintainers reject new instances of the GPL boilerplate paragraph
directing people to write to the FSF for a copy of the GPL, since the
FSF has moved in the past and may do so again.
So do not write paragraphs about writing to the Free Software Foundation's
mailing address.
See: https://lore.kernel.org/lkml/20131006222342.GT19510@leaf/
Commit message
--------------
......@@ -347,6 +522,7 @@ Commit message
**BAD_SIGN_OFF**
The signed-off-by line does not fall in line with the standards
specified by the community.
See: https://www.kernel.org/doc/html/latest/process/submitting-patches.html#developer-s-certificate-of-origin-1-1
**BAD_STABLE_ADDRESS_STYLE**
......@@ -368,12 +544,33 @@ Commit message
**COMMIT_MESSAGE**
The patch is missing a commit description. A brief
description of the changes made by the patch should be added.
See: https://www.kernel.org/doc/html/latest/process/submitting-patches.html#describe-your-changes
**EMAIL_SUBJECT**
Naming the tool that found the issue is not very useful in the
subject line. A good subject line summarizes the change that
the patch brings.
See: https://www.kernel.org/doc/html/latest/process/submitting-patches.html#describe-your-changes
**FROM_SIGN_OFF_MISMATCH**
The author's email does not match that in the Signed-off-by:
line(s). This can sometimes be caused by an improperly configured
email client.
This message is emitted due to any of the following reasons::
- The email names do not match.
- The email addresses do not match.
- The email subaddresses do not match.
- The email comments do not match.
**MISSING_SIGN_OFF**
The patch is missing a Signed-off-by line. A signed-off-by
line should be added according to Developer's certificate of
Origin.
See: https://www.kernel.org/doc/html/latest/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin
**NO_AUTHOR_SIGN_OFF**
......@@ -382,6 +579,7 @@ Commit message
end of explanation of the patch to denote that the author has
written it or otherwise has the rights to pass it on as an open
source patch.
See: https://www.kernel.org/doc/html/latest/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin
**DIFF_IN_COMMIT_MSG**
......@@ -389,6 +587,7 @@ Commit message
This causes problems when one tries to apply a file containing both
the changelog and the diff because patch(1) tries to apply the diff
which it found in the changelog.
See: https://lore.kernel.org/lkml/20150611134006.9df79a893e3636019ad2759e@linux-foundation.org/
**GERRIT_CHANGE_ID**
......@@ -431,6 +630,7 @@ Comparison style
**BOOL_COMPARISON**
Comparisons of A to true and false are better written
as A and !A.
See: https://lore.kernel.org/lkml/1365563834.27174.12.camel@joe-AO722/
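    A brief sketch (``foo``, ``bar`` and the called functions are hypothetical)::

      /* instead of: if (foo == true) */
      if (foo)
              do_something();

      /* instead of: if (bar == false) */
      if (!bar)
              do_something_else();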
**COMPARISON_TO_NULL**
......@@ -442,6 +642,87 @@ Comparison style
side of the test should be avoided.
Indentation and Line Breaks
---------------------------
**CODE_INDENT**
Code indent should use tabs instead of spaces.
Outside of comments, documentation and Kconfig,
spaces are never used for indentation.
See: https://www.kernel.org/doc/html/latest/process/coding-style.html#indentation
**DEEP_INDENTATION**
Indentation with 6 or more tabs usually indicates overly indented
code.
It is suggested to refactor excessive indentation of
if/else/for/do/while/switch statements.
See: https://lore.kernel.org/lkml/1328311239.21255.24.camel@joe2Laptop/
**SWITCH_CASE_INDENT_LEVEL**
switch should be at the same indent as case.
Example::
switch (suffix) {
case 'G':
case 'g':
mem <<= 30;
break;
case 'M':
case 'm':
mem <<= 20;
break;
case 'K':
case 'k':
mem <<= 10;
fallthrough;
default:
break;
}
See: https://www.kernel.org/doc/html/latest/process/coding-style.html#indentation
**LONG_LINE**
The line has exceeded the specified maximum length.
To use a different maximum line length, the --max-line-length=n option
may be added while invoking checkpatch.
Earlier, the default line length was 80 columns. Commit bdc48fa11e46
("checkpatch/coding-style: deprecate 80-column warning") increased the
limit to 100 columns. This is not a hard limit either and it's
preferable to stay within 80 columns whenever possible.
See: https://www.kernel.org/doc/html/latest/process/coding-style.html#breaking-long-lines-and-strings
**LONG_LINE_STRING**
A string starts before but extends beyond the maximum line length.
To use a different maximum line length, the --max-line-length=n option
may be added while invoking checkpatch.
See: https://www.kernel.org/doc/html/latest/process/coding-style.html#breaking-long-lines-and-strings
**LONG_LINE_COMMENT**
A comment starts before but extends beyond the maximum line length.
To use a different maximum line length, the --max-line-length=n option
may be added while invoking checkpatch.
See: https://www.kernel.org/doc/html/latest/process/coding-style.html#breaking-long-lines-and-strings
**TRAILING_STATEMENTS**
Trailing statements (for example after any conditional) should be
on the next line.
Statements, such as::
if (x == y) break;
should be::
if (x == y)
break;
Macros, Attributes and Symbols
------------------------------
......@@ -472,7 +753,7 @@ Macros, Attributes and Symbols
**BIT_MACRO**
Defines like: 1 << <digit> could be BIT(digit).
The BIT() macro is defined in include/linux/bitops.h::
The BIT() macro is defined via include/linux/bits.h::
#define BIT(nr) (1UL << (nr))
......@@ -492,6 +773,7 @@ Macros, Attributes and Symbols
The kernel does *not* use the ``__DATE__`` and ``__TIME__`` macros,
and enables warnings if they are used as they can lead to
non-deterministic builds.
See: https://www.kernel.org/doc/html/latest/kbuild/reproducible-builds.html#timestamps
**DEFINE_ARCH_HAS**
......@@ -502,8 +784,12 @@ Macros, Attributes and Symbols
want architectures able to override them with optimized ones, we
should either use weak functions (appropriate for some cases), or
the symbol that protects them should be the same symbol we use.
See: https://lore.kernel.org/lkml/CA+55aFycQ9XJvEOsiM3txHL5bjUc8CeKWJNR_H+MiicaddB42Q@mail.gmail.com/
**DO_WHILE_MACRO_WITH_TRAILING_SEMICOLON**
do {} while(0) macros should not have a trailing semicolon.
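    For instance (``foo`` and ``bar`` are hypothetical)::

      /* discouraged: the extra semicolon breaks use in if/else bodies */
      #define foo(x) do { bar(x); } while (0);

      /* preferred */
      #define foo(x) do { bar(x); } while (0)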
**INIT_ATTRIBUTE**
Const init definitions should use __initconst instead of
__initdata.
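    A small sketch (the table name and contents are hypothetical)::

      /* instead of: static const int foo_table[] __initdata = { 1, 2, 3 }; */
      static const int foo_table[] __initconst = { 1, 2, 3 };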
......@@ -528,6 +814,20 @@ Macros, Attributes and Symbols
...
}
**MISPLACED_INIT**
It is possible to use section markers on variables in a way
which gcc doesn't understand (or at least not the way the
developer intended)::
static struct __initdata samsung_pll_clock exynos4_plls[nr_plls] = {
does not put exynos4_plls in the .initdata section. The __initdata
marker can be virtually anywhere on the line, except right after
"struct". The preferred location is before the "=" sign if there is
one, or before the trailing ";" otherwise.
See: https://lore.kernel.org/lkml/1377655732.3619.19.camel@joe-AO722/
**MULTISTATEMENT_MACRO_USE_DO_WHILE**
Macros with multiple statements should be enclosed in a
do - while block. Same should also be the case for macros
......@@ -541,6 +841,10 @@ Macros, Attributes and Symbols
See: https://www.kernel.org/doc/html/latest/process/coding-style.html#macros-enums-and-rtl
**PREFER_FALLTHROUGH**
Use the `fallthrough;` pseudo keyword instead of
`/* fallthrough */` like comments.
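    For example (the case labels and helpers are hypothetical)::

      switch (mode) {
      case 1:
              do_something();
              fallthrough;    /* rather than a bare fallthrough comment */
      case 2:
              do_something_else();
              break;
      }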
**WEAK_DECLARATION**
Using weak declarations like __attribute__((weak)) or __weak
can have unintended link defects. Avoid using them.
......@@ -551,8 +855,51 @@ Functions and Variables
**CAMELCASE**
Avoid CamelCase Identifiers.
See: https://www.kernel.org/doc/html/latest/process/coding-style.html#naming
**CONST_CONST**
Using `const <type> const *` is generally meant to be
written `const <type> * const`.
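    A short illustration (``name`` is hypothetical)::

      /* usually intended: a constant pointer to constant data */
      const char * const name = "example";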
**CONST_STRUCT**
Using const is generally a good idea. Checkpatch reads
a list of frequently used structs that are always or
almost always constant.
The existing structs list can be viewed from
`scripts/const_structs.checkpatch`.
See: https://lore.kernel.org/lkml/alpine.DEB.2.10.1608281509480.3321@hadrien/
**EMBEDDED_FUNCTION_NAME**
Embedded function names are less appropriate to use as
refactoring can cause function renaming. Prefer the use of
"%s", __func__ to embedded function names.
Note that this does not work with the -f (--file) checkpatch option,
as it depends on the patch context providing the function name.
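    A minimal sketch of the conversion (the message text is hypothetical)::

      /* instead of: pr_err("my_func: allocation failed\n"); */
      pr_err("%s: allocation failed\n", __func__);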
**FUNCTION_ARGUMENTS**
This warning is emitted due to any of the following reasons:
1. Arguments for the function declaration do not follow
the identifier name. Example::
void foo
(int bar, int baz)
This should be corrected to::
void foo(int bar, int baz)
2. Some arguments for the function definition do not
have an identifier name. Example::
void foo(int)
All arguments should have identifier names.
**FUNCTION_WITHOUT_ARGS**
Function declarations without arguments like::
......@@ -583,6 +930,34 @@ Functions and Variables
return bar;
Permissions
-----------
**DEVICE_ATTR_PERMS**
The permissions used in DEVICE_ATTR are unusual.
Typically only three permissions are used - 0644 (RW), 0444 (RO)
and 0200 (WO).
See: https://www.kernel.org/doc/html/latest/filesystems/sysfs.html#attributes
**EXECUTE_PERMISSIONS**
There is no reason for source files to be executable. The executable
bit can be removed safely.
**EXPORTED_WORLD_WRITABLE**
Exporting world writable sysfs/debugfs files is usually a bad thing.
When done arbitrarily they can introduce serious security bugs.
In the past, some of the debugfs vulnerabilities would seemingly allow
any local user to write arbitrary values into device registers - a
situation from which little good can be expected to emerge.
See: https://lore.kernel.org/linux-arm-kernel/cover.1296818921.git.segoon@openwall.com/
**NON_OCTAL_PERMISSIONS**
Permission bits should use 4 digit octal permissions (like 0700 or 0444).
Avoid using any other base like decimal.
Spacing and Brackets
--------------------
......@@ -616,7 +991,7 @@ Spacing and Brackets
1. With a type on the left::
;int [] a;
int [] a;
2. At the beginning of a line for slice initialisers::
......@@ -626,12 +1001,6 @@ Spacing and Brackets
= { [0...10] = 5 }
**CODE_INDENT**
Code indent should use tabs instead of spaces.
Outside of comments, documentation and Kconfig,
spaces are never used for indentation.
See: https://www.kernel.org/doc/html/latest/process/coding-style.html#indentation
**CONCATENATED_STRING**
Concatenated elements should have a space in between.
Example::
......@@ -644,17 +1013,20 @@ Spacing and Brackets
**ELSE_AFTER_BRACE**
`else {` should follow the closing block `}` on the same line.
See: https://www.kernel.org/doc/html/latest/process/coding-style.html#placing-braces-and-spaces
**LINE_SPACING**
Multiple blank lines waste vertical space, given the limited number
of lines an editor window can display.
See: https://www.kernel.org/doc/html/latest/process/coding-style.html#spaces
**OPEN_BRACE**
The opening brace should follow the function definition on the
next line. For any non-functional block it should be on the same line
as the last construct.
See: https://www.kernel.org/doc/html/latest/process/coding-style.html#placing-braces-and-spaces
**POINTER_LOCATION**
......@@ -671,37 +1043,47 @@ Spacing and Brackets
**SPACING**
Whitespace style used in the kernel sources is described in kernel docs.
See: https://www.kernel.org/doc/html/latest/process/coding-style.html#spaces
**SWITCH_CASE_INDENT_LEVEL**
switch should be at the same indent as case.
Example::
switch (suffix) {
case 'G':
case 'g':
mem <<= 30;
break;
case 'M':
case 'm':
mem <<= 20;
break;
case 'K':
case 'k':
mem <<= 10;
/* fall through */
default:
break;
}
See: https://www.kernel.org/doc/html/latest/process/coding-style.html#indentation
See: https://www.kernel.org/doc/html/latest/process/coding-style.html#spaces
**TRAILING_WHITESPACE**
Trailing whitespace should always be removed.
Some editors highlight the trailing whitespace and cause visual
distractions when editing files.
See: https://www.kernel.org/doc/html/latest/process/coding-style.html#spaces
**UNNECESSARY_PARENTHESES**
Parentheses are not required in the following cases:
1. Function pointer uses::
(foo->bar)();
could be::
foo->bar();
2. Comparisons in if::
if ((foo->bar) && (foo->baz))
if ((foo == bar))
could be::
if (foo->bar && foo->baz)
if (foo == bar)
3. addressof/dereference single Lvalues::
&(foo->bar)
*(foo->bar)
could be::
&foo->bar
*foo->bar
**WHILE_AFTER_BRACE**
while should follow the closing bracket on the same line::
......@@ -723,17 +1105,50 @@ Others
The patch seems to be corrupted or lines are wrapped.
Please regenerate the patch file before sending it to the maintainer.
**CVS_KEYWORD**
Since linux moved to git, the CVS markers are no longer used.
So, CVS style keywords ($Id$, $Revision$, $Log$) should not be
added.
**DEFAULT_NO_BREAK**
switch default case is sometimes written as "default:;". This can
cause new cases added below default to be defective.
A "break;" should be added after empty default statement to avoid
unwanted fallthrough.
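    Illustration (the switch expression and helper are hypothetical)::

      switch (mode) {
      case 1:
              do_something();
              break;
      default:
              /* an empty default still gets an explicit break */
              break;
      }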
**DOS_LINE_ENDINGS**
For DOS-formatted patches, there are extra ^M symbols at the end of
the line. These should be removed.
**EXECUTE_PERMISSIONS**
There is no reason for source files to be executable. The executable
bit can be removed safely.
**DT_SCHEMA_BINDING_PATCH**
DT bindings moved to a json-schema based format instead of
freeform text.
**NON_OCTAL_PERMISSIONS**
Permission bits should use 4 digit octal permissions (like 0700 or 0444).
Avoid using any other base like decimal.
See: https://www.kernel.org/doc/html/latest/devicetree/bindings/writing-schema.html
**DT_SPLIT_BINDING_PATCH**
Devicetree bindings should be their own patch. This is because
bindings are logically independent from a driver implementation,
they have a different maintainer (even though they often
are applied via the same tree), and it makes for a cleaner history in the
DT only tree created with git-filter-branch.
See: https://www.kernel.org/doc/html/latest/devicetree/bindings/submitting-patches.html#i-for-patch-submitters
**EMBEDDED_FILENAME**
Embedding the complete filename path inside the file isn't particularly
useful, as the path is often moved around and becomes incorrect.
**FILE_PATH_CHANGES**
Whenever files are added, moved, or deleted, the MAINTAINERS file
patterns can be out of sync or outdated.
So MAINTAINERS might need updating in these cases.
**MEMSET**
The memset use appears to be incorrect. This may be caused by
badly ordered parameters. Please recheck the usage.
**NOT_UNIFIED_DIFF**
The patch file does not appear to be in unified-diff format. Please
......@@ -742,14 +1157,12 @@ Others
**PRINTF_0XDECIMAL**
Prefixing 0x with decimal output is defective and should be corrected.
**TRAILING_STATEMENTS**
Trailing statements (for example after any conditional) should be
on the next line.
Like::
if (x == y) break;
**SPDX_LICENSE_TAG**
The source file is missing or has an improper SPDX identifier tag.
The Linux kernel requires the precise SPDX identifier in all source files,
and it is thoroughly documented in the kernel docs.
should be::
See: https://www.kernel.org/doc/html/latest/process/license-rules.html
if (x == y)
break;
**TYPO_SPELLING**
Some words may have been misspelled. Consider reviewing them.
......@@ -10,7 +10,7 @@ API Reference
This section documents the KUnit kernel testing API. It is divided into the
following sections:
================================= ==============================================
:doc:`test` documents all of the standard testing API
excluding mocking or mocking related features.
================================= ==============================================
Documentation/dev-tools/kunit/api/test.rst
- documents all of the standard testing API excluding mocking
or mocking related features.
......@@ -97,7 +97,7 @@ things to try.
modules will automatically execute associated tests when loaded. Test results
can be collected from ``/sys/kernel/debug/kunit/<test suite>/results``, and
can be parsed with ``kunit.py parse``. For more details, see "KUnit on
non-UML architectures" in :doc:`usage`.
non-UML architectures" in Documentation/dev-tools/kunit/usage.rst.
If none of the above tricks help, you are always welcome to email any issues to
kunit-dev@googlegroups.com.
......@@ -36,7 +36,7 @@ To make running these tests (and reading the results) easier, KUnit offers
results. This provides a quick way of running KUnit tests during development,
without requiring a virtual machine or separate hardware.
Get started now: :doc:`start`
Get started now: Documentation/dev-tools/kunit/start.rst
Why KUnit?
==========
......@@ -88,9 +88,9 @@ it takes to read their test log?
How do I use it?
================
* :doc:`start` - for new users of KUnit
* :doc:`tips` - for short examples of best practices
* :doc:`usage` - for a more detailed explanation of KUnit features
* :doc:`api/index` - for the list of KUnit APIs used for testing
* :doc:`kunit-tool` - for more information on the kunit_tool helper script
* :doc:`faq` - for answers to some common questions about KUnit
* Documentation/dev-tools/kunit/start.rst - for new users of KUnit
* Documentation/dev-tools/kunit/tips.rst - for short examples of best practices
* Documentation/dev-tools/kunit/usage.rst - for a more detailed explanation of KUnit features
* Documentation/dev-tools/kunit/api/index.rst - for the list of KUnit APIs used for testing
* Documentation/dev-tools/kunit/kunit-tool.rst - for more information on the kunit_tool helper script
* Documentation/dev-tools/kunit/faq.rst - for answers to some common questions about KUnit
......@@ -21,7 +21,7 @@ The wrapper can be run with:
./tools/testing/kunit/kunit.py run
For more information on this wrapper (also called kunit_tool) check out the
:doc:`kunit-tool` page.
Documentation/dev-tools/kunit/kunit-tool.rst page.
Creating a .kunitconfig
-----------------------
......@@ -234,7 +234,7 @@ Congrats! You just wrote your first KUnit test!
Next Steps
==========
* Check out the :doc:`tips` page for tips on
* Check out the Documentation/dev-tools/kunit/tips.rst page for tips on
writing idiomatic KUnit tests.
* Optional: see the :doc:`usage` page for a more
in-depth explanation of KUnit.
......@@ -125,7 +125,8 @@ Here's a slightly in-depth example of how one could implement "mocking":
Note: here we're able to get away with using ``test->priv``, but if you wanted
something more flexible you could use a named ``kunit_resource``, see :doc:`api/test`.
something more flexible you could use a named ``kunit_resource``, see
Documentation/dev-tools/kunit/api/test.rst.
Failing the current test
------------------------
......@@ -185,5 +186,5 @@ Alternatively, one can take full control over the error message by using ``KUNIT
Next Steps
==========
* Optional: see the :doc:`usage` page for a more
* Optional: see the Documentation/dev-tools/kunit/usage.rst page for a more
in-depth explanation of KUnit.
......@@ -10,7 +10,7 @@ understand it. This guide assumes a working knowledge of the Linux kernel and
some basic knowledge of testing.
For a high level introduction to KUnit, including setting up KUnit for your
project, see :doc:`start`.
project, see Documentation/dev-tools/kunit/start.rst.
Organization of this document
=============================
......@@ -99,7 +99,8 @@ violated; however, the test will continue running, potentially trying other
expectations until the test case ends or is otherwise terminated. This is as
opposed to *assertions* which are discussed later.
To learn about more expectations supported by KUnit, see :doc:`api/test`.
To learn about more expectations supported by KUnit, see
Documentation/dev-tools/kunit/api/test.rst.
.. note::
A single test case should be pretty short, pretty easy to understand,
......@@ -216,7 +217,8 @@ test suite in a special linker section so that it can be run by KUnit either
after late_init, or when the test module is loaded (depending on whether the
test was built in or not).
For more information on these types of things see the :doc:`api/test`.
For more information on these types of things see the
Documentation/dev-tools/kunit/api/test.rst.
Common Patterns
===============
......
......@@ -71,15 +71,15 @@ can be used to verify that a test is executing particular functions or lines
of code. This is useful for determining how much of the kernel is being tested,
and for finding corner-cases which are not covered by the appropriate test.
:doc:`gcov` is GCC's coverage testing tool, which can be used with the kernel
to get global or per-module coverage. Unlike KCOV, it does not record per-task
coverage. Coverage data can be read from debugfs, and interpreted using the
usual gcov tooling.
:doc:`kcov` is a feature which can be built in to the kernel to allow
capturing coverage on a per-task level. It's therefore useful for fuzzing and
other situations where information about code executed during, for example, a
single syscall is useful.
Documentation/dev-tools/gcov.rst is GCC's coverage testing tool, which can be
used with the kernel to get global or per-module coverage. Unlike KCOV, it
does not record per-task coverage. Coverage data can be read from debugfs,
and interpreted using the usual gcov tooling.
Documentation/dev-tools/kcov.rst is a feature which can be built in to the
kernel to allow capturing coverage on a per-task level. It's therefore useful
for fuzzing and other situations where information about code executed during,
for example, a single syscall is useful.
Dynamic Analysis Tools
......
......@@ -7,8 +7,8 @@ Submitting Devicetree (DT) binding patches
I. For patch submitters
=======================
0) Normal patch submission rules from Documentation/process/submitting-patches.rst
applies.
0) Normal patch submission rules from
Documentation/process/submitting-patches.rst applies.
1) The Documentation/ and include/dt-bindings/ portion of the patch should
be a separate patch. The preferred subject prefix for binding patches is::
......@@ -25,8 +25,8 @@ I. For patch submitters
make dt_binding_check
See Documentation/devicetree/bindings/writing-schema.rst for more details about
schema and tools setup.
See Documentation/devicetree/bindings/writing-schema.rst for more details
about schema and tools setup.
3) DT binding files should be dual licensed. The preferred license tag is
(GPL-2.0-only OR BSD-2-Clause).
......@@ -84,7 +84,8 @@ II. For kernel maintainers
III. Notes
==========
0) Please see :doc:`ABI` for details regarding devicetree ABI.
0) Please see Documentation/devicetree/bindings/ABI.rst for details
regarding devicetree ABI.
1) This document is intended as a general familiarization with the process as
decided at the 2013 Kernel Summit. When in doubt, the current word of the
......
......@@ -237,10 +237,10 @@ We have been trying to improve the situation through the creation of
a set of "books" that group documentation for specific readers. These
include:
- :doc:`../admin-guide/index`
- :doc:`../core-api/index`
- :doc:`../driver-api/index`
- :doc:`../userspace-api/index`
- Documentation/admin-guide/index.rst
- Documentation/core-api/index.rst
- Documentation/driver-api/index.rst
- Documentation/userspace-api/index.rst
As well as this book on documentation itself.
......
......@@ -276,4 +276,4 @@ before they become available from the ACPICA release process.
# git clone https://github.com/acpica/acpica
# git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
# cd acpica
# generate/linux/divergences.sh -s ../linux
# generate/linux/divergence.sh -s ../linux
......@@ -9,13 +9,13 @@ with them.
For examples of already existing generic drivers that will also be good
examples for any other kernel drivers you want to author, refer to
:doc:`drivers-on-gpio`
Documentation/driver-api/gpio/drivers-on-gpio.rst
For any kind of mass produced system you want to support, such as servers,
laptops, phones, tablets, routers, and any consumer or office or business goods
using appropriate kernel drivers is paramount. Submit your code for inclusion
in the upstream Linux kernel when you feel it is mature enough and you will get
help to refine it, see :doc:`../../process/submitting-patches`.
help to refine it, see Documentation/process/submitting-patches.rst.
In Linux GPIO lines also have a userspace ABI.
......
......@@ -25,16 +25,16 @@ ioctl commands that follow modern conventions: ``_IO``, ``_IOR``,
with the correct parameters:
_IO/_IOR/_IOW/_IOWR
The macro name specifies how the argument will be used.  It may be a
The macro name specifies how the argument will be used. It may be a
pointer to data to be passed into the kernel (_IOW), out of the kernel
(_IOR), or both (_IOWR).  _IO can indicate either commands with no
(_IOR), or both (_IOWR). _IO can indicate either commands with no
argument or those passing an integer value instead of a pointer.
It is recommended to only use _IO for commands without arguments,
and use pointers for passing data.
type
An 8-bit number, often a character literal, specific to a subsystem
or driver, and listed in :doc:`../userspace-api/ioctl/ioctl-number`
or driver, and listed in Documentation/userspace-api/ioctl/ioctl-number.rst
nr
An 8-bit number identifying the specific command, unique for a give
......@@ -200,10 +200,10 @@ cause an information leak, which can be used to defeat kernel address
space layout randomization (KASLR), helping in an attack.
For this reason (and for compat support) it is best to avoid any
implicit padding in data structures.  Where there is implicit padding
implicit padding in data structures. Where there is implicit padding
in an existing structure, kernel drivers must be careful to fully
initialize an instance of the structure before copying it to user
space.  This is usually done by calling memset() before assigning to
space. This is usually done by calling memset() before assigning to
individual members.
Subsystem abstractions
......
......@@ -217,7 +217,7 @@ system-wide transition to a sleep state even though its :c:member:`runtime_auto`
flag is clear.
For more information about the runtime power management framework, refer to
:file:`Documentation/power/runtime_pm.rst`.
Documentation/power/runtime_pm.rst.
Calling Drivers to Enter and Leave System Sleep States
......@@ -655,7 +655,7 @@ been thawed. Generally speaking, the PM notifiers are suitable for performing
actions that either require user space to be available, or at least won't
interfere with user space.
For details refer to :doc:`notifiers`.
For details refer to Documentation/driver-api/pm/notifiers.rst.
Device Low-Power (suspend) States
......@@ -726,7 +726,7 @@ it into account in any way.
Devices may be defined as IRQ-safe which indicates to the PM core that their
runtime PM callbacks may be invoked with disabled interrupts (see
:file:`Documentation/power/runtime_pm.rst` for more information). If an
Documentation/power/runtime_pm.rst for more information). If an
IRQ-safe device belongs to a PM domain, the runtime PM of the domain will be
disallowed, unless the domain itself is defined as IRQ-safe. However, it
makes sense to define a PM domain as IRQ-safe only if all the devices in it
......@@ -805,7 +805,7 @@ The ``DPM_FLAG_MAY_SKIP_RESUME`` Driver Flag
--------------------------------------------
During system-wide resume from a sleep state it's easiest to put devices into
the full-power state, as explained in :file:`Documentation/power/runtime_pm.rst`.
the full-power state, as explained in Documentation/power/runtime_pm.rst.
[Refer to that document for more information regarding this particular issue as
well as for information on the device runtime power management framework in
general.] However, it often is desirable to leave devices in suspend after
......
......@@ -5,7 +5,8 @@ Client Driver Documentation
===========================
This is the documentation for client drivers themselves. Refer to
:doc:`../client` for documentation on how to write client drivers.
Documentation/driver-api/surface_aggregator/client.rst for documentation
on how to write client drivers.
.. toctree::
:maxdepth: 1
......
......@@ -87,10 +87,11 @@ native SSAM devices, i.e. devices that are not defined in ACPI and not
implemented as platform devices, via |ssam_device| and |ssam_device_driver|
simplify management of client devices and client drivers.
Refer to :doc:`client` for documentation regarding the client device/driver
API and interface options for other kernel drivers. It is recommended to
familiarize oneself with that chapter and the :doc:`ssh` before continuing
with the architectural overview below.
Refer to Documentation/driver-api/surface_aggregator/client.rst for
documentation regarding the client device/driver API and interface options
for other kernel drivers. It is recommended to familiarize oneself with
that chapter and the Documentation/driver-api/surface_aggregator/ssh.rst
before continuing with the architectural overview below.
Packet Transport Layer
......@@ -190,9 +191,9 @@ with success on the transmitter thread.
Transmission of sequenced packets is limited by the number of concurrently
pending packets, i.e. a limit on how many packets may be waiting for an ACK
from the EC in parallel. This limit is currently set to one (see :doc:`ssh`
for the reasoning behind this). Control packets (i.e. ACK and NAK) can
always be transmitted.
from the EC in parallel. This limit is currently set to one (see
Documentation/driver-api/surface_aggregator/ssh.rst for the reasoning behind
this). Control packets (i.e. ACK and NAK) can always be transmitted.
Receiver Thread
---------------
......
......@@ -73,5 +73,7 @@ being a direct response to a previous request. We may also refer to requests
without response as commands. In general, events need to be enabled via one
of multiple dedicated requests before they are sent by the EC.
See :doc:`ssh` for a more technical protocol documentation and
:doc:`internal` for an overview of the internal driver architecture.
See Documentation/driver-api/surface_aggregator/ssh.rst for a
more technical protocol documentation and
Documentation/driver-api/surface_aggregator/internal.rst for an
overview of the internal driver architecture.
......@@ -10,7 +10,7 @@ API overview
The big picture is that USB drivers can continue to ignore most DMA issues,
though they still must provide DMA-ready buffers (see
:doc:`/core-api/dma-api-howto`). That's how they've worked through
Documentation/core-api/dma-api-howto.rst). That's how they've worked through
the 2.4 (and earlier) kernels, or they can now be DMA-aware.
DMA-aware usb drivers:
......@@ -60,7 +60,7 @@ and effects like cache-trashing can impose subtle penalties.
force a consistent memory access ordering by using memory barriers. It's
not using a streaming DMA mapping, so it's good for small transfers on
systems where the I/O would otherwise thrash an IOMMU mapping. (See
:doc:`/core-api/dma-api-howto` for definitions of "coherent" and
Documentation/core-api/dma-api-howto.rst for definitions of "coherent" and
"streaming" DMA mappings.)
Asking for 1/Nth of a page (as well as asking for N pages) is reasonably
......@@ -91,7 +91,7 @@ Working with existing buffers
Existing buffers aren't usable for DMA without first being mapped into the
DMA address space of the device. However, most buffers passed to your
driver can safely be used with such DMA mapping. (See the first section
of :doc:`/core-api/dma-api-howto`, titled "What memory is DMA-able?")
of Documentation/core-api/dma-api-howto.rst, titled "What memory is DMA-able?")
- When you're using scatterlists, you can map everything at once. On some
systems, this kicks in an IOMMU and turns the scatterlists into single
......
......@@ -78,8 +78,10 @@ configuration of fault-injection capabilities.
- /sys/kernel/debug/fail*/times:
specifies how many times failures may happen at most.
A value of -1 means "no limit".
specifies how many times failures may happen at most. A value of -1
means "no limit". Note, though, that this file only accepts unsigned
values. So, if you want to specify -1, use 'printf' instead
of 'echo', e.g.: $ printf %#x -1 > times
- /sys/kernel/debug/fail*/space:
......@@ -167,11 +169,13 @@ configuration of fault-injection capabilities.
- ERRNO: retval must be -1 to -MAX_ERRNO (-4096).
- ERR_NULL: retval must be 0 or -1 to -MAX_ERRNO (-4096).
- /sys/kernel/debug/fail_function/<functiuon-name>/retval:
- /sys/kernel/debug/fail_function/<function-name>/retval:
specifies the "error" return value to inject to the given
function for given function. This will be created when
user specifies new injection entry.
specifies the "error" return value to inject to the given function.
This will be created when the user specifies a new injection entry.
Note that this file only accepts unsigned values. So, if you want to
use a negative errno, use 'printf' instead of 'echo', e.g.:
$ printf %#x -12 > retval
Boot option
^^^^^^^^^^^
......@@ -255,7 +259,7 @@ Application Examples
echo Y > /sys/kernel/debug/$FAILTYPE/task-filter
echo 10 > /sys/kernel/debug/$FAILTYPE/probability
echo 100 > /sys/kernel/debug/$FAILTYPE/interval
echo -1 > /sys/kernel/debug/$FAILTYPE/times
printf %#x -1 > /sys/kernel/debug/$FAILTYPE/times
echo 0 > /sys/kernel/debug/$FAILTYPE/space
echo 2 > /sys/kernel/debug/$FAILTYPE/verbose
echo 1 > /sys/kernel/debug/$FAILTYPE/ignore-gfp-wait
......@@ -309,7 +313,7 @@ Application Examples
echo N > /sys/kernel/debug/$FAILTYPE/task-filter
echo 10 > /sys/kernel/debug/$FAILTYPE/probability
echo 100 > /sys/kernel/debug/$FAILTYPE/interval
echo -1 > /sys/kernel/debug/$FAILTYPE/times
printf %#x -1 > /sys/kernel/debug/$FAILTYPE/times
echo 0 > /sys/kernel/debug/$FAILTYPE/space
echo 2 > /sys/kernel/debug/$FAILTYPE/verbose
echo 1 > /sys/kernel/debug/$FAILTYPE/ignore-gfp-wait
......@@ -336,11 +340,11 @@ Application Examples
FAILTYPE=fail_function
FAILFUNC=open_ctree
echo $FAILFUNC > /sys/kernel/debug/$FAILTYPE/inject
echo -12 > /sys/kernel/debug/$FAILTYPE/$FAILFUNC/retval
printf %#x -12 > /sys/kernel/debug/$FAILTYPE/$FAILFUNC/retval
echo N > /sys/kernel/debug/$FAILTYPE/task-filter
echo 100 > /sys/kernel/debug/$FAILTYPE/probability
echo 0 > /sys/kernel/debug/$FAILTYPE/interval
echo -1 > /sys/kernel/debug/$FAILTYPE/times
printf %#x -1 > /sys/kernel/debug/$FAILTYPE/times
echo 0 > /sys/kernel/debug/$FAILTYPE/space
echo 1 > /sys/kernel/debug/$FAILTYPE/verbose
......
=======================
Direct Access for files
-----------------------
=======================
Motivation
----------
......@@ -9,7 +10,7 @@ It is also used to provide the pages which are mapped into userspace
by a call to mmap.
For block devices that are memory-like, the page cache pages would be
unnecessary copies of the original storage. The DAX code removes the
unnecessary copies of the original storage. The `DAX` code removes the
extra copy by performing reads and writes directly to the storage device.
For file mappings, the storage device is mapped directly into userspace.
......@@ -17,20 +18,20 @@ For file mappings, the storage device is mapped directly into userspace.
Usage
-----
If you have a block device which supports DAX, you can make a filesystem
on it as usual. The DAX code currently only supports files with a block
size equal to your kernel's PAGE_SIZE, so you may need to specify a block
If you have a block device which supports `DAX`, you can make a filesystem
on it as usual. The `DAX` code currently only supports files with a block
size equal to your kernel's `PAGE_SIZE`, so you may need to specify a block
size when creating the filesystem.
Currently 3 filesystems support DAX: ext2, ext4 and xfs. Enabling DAX on them
Currently 3 filesystems support `DAX`: ext2, ext4 and xfs. Enabling `DAX` on them
is different.
Enabling DAX on ext2
-----------------------------
--------------------
When mounting the filesystem, use the "-o dax" option on the command line or
add 'dax' to the options in /etc/fstab. This works to enable DAX on all files
within the filesystem. It is equivalent to the '-o dax=always' behavior below.
When mounting the filesystem, use the ``-o dax`` option on the command line or
add 'dax' to the options in ``/etc/fstab``. This works to enable `DAX` on all files
within the filesystem. It is equivalent to the ``-o dax=always`` behavior below.
Enabling DAX on xfs and ext4
......@@ -39,51 +40,56 @@ Enabling DAX on xfs and ext4
Summary
-------
1. There exists an in-kernel file access mode flag S_DAX that corresponds to
the statx flag STATX_ATTR_DAX. See the manpage for statx(2) for details
1. There exists an in-kernel file access mode flag `S_DAX` that corresponds to
the statx flag `STATX_ATTR_DAX`. See the manpage for statx(2) for details
about this access mode.
2. There exists a persistent flag FS_XFLAG_DAX that can be applied to regular
2. There exists a persistent flag `FS_XFLAG_DAX` that can be applied to regular
files and directories. This advisory flag can be set or cleared at any
time, but doing so does not immediately affect the S_DAX state.
time, but doing so does not immediately affect the `S_DAX` state.
3. If the persistent FS_XFLAG_DAX flag is set on a directory, this flag will
3. If the persistent `FS_XFLAG_DAX` flag is set on a directory, this flag will
be inherited by all regular files and subdirectories that are subsequently
created in this directory. Files and subdirectories that exist at the time
this flag is set or cleared on the parent directory are not modified by
this modification of the parent directory.
4. There exist dax mount options which can override FS_XFLAG_DAX in the
setting of the S_DAX flag. Given underlying storage which supports DAX the
4. There exist dax mount options which can override `FS_XFLAG_DAX` in the
setting of the `S_DAX` flag. Given underlying storage which supports `DAX` the
following hold:
"-o dax=inode" means "follow FS_XFLAG_DAX" and is the default.
``-o dax=inode`` means "follow `FS_XFLAG_DAX`" and is the default.
"-o dax=never" means "never set S_DAX, ignore FS_XFLAG_DAX."
``-o dax=never`` means "never set `S_DAX`, ignore `FS_XFLAG_DAX`."
"-o dax=always" means "always set S_DAX ignore FS_XFLAG_DAX."
``-o dax=always`` means "always set `S_DAX` ignore `FS_XFLAG_DAX`."
"-o dax" is a legacy option which is an alias for "dax=always".
This may be removed in the future so "-o dax=always" is
the preferred method for specifying this behavior.
``-o dax`` is a legacy option which is an alias for ``dax=always``.
NOTE: Modifications to and the inheritance behavior of FS_XFLAG_DAX remain
the same even when the filesystem is mounted with a dax option. However,
in-core inode state (S_DAX) will be overridden until the filesystem is
remounted with dax=inode and the inode is evicted from kernel memory.
.. warning::
5. The S_DAX policy can be changed via:
The option ``-o dax`` may be removed in the future so ``-o dax=always`` is
the preferred method for specifying this behavior.
a) Setting the parent directory FS_XFLAG_DAX as needed before files are
.. note::
Modifications to and the inheritance behavior of `FS_XFLAG_DAX` remain
the same even when the filesystem is mounted with a dax option. However,
in-core inode state (`S_DAX`) will be overridden until the filesystem is
remounted with dax=inode and the inode is evicted from kernel memory.
5. The `S_DAX` policy can be changed via:
a) Setting the parent directory `FS_XFLAG_DAX` as needed before files are
created
b) Setting the appropriate dax="foo" mount option
c) Changing the FS_XFLAG_DAX flag on existing regular files and
c) Changing the `FS_XFLAG_DAX` flag on existing regular files and
directories. This has runtime constraints and limitations that are
described in 6) below.
6. When changing the S_DAX policy via toggling the persistent FS_XFLAG_DAX
6. When changing the `S_DAX` policy via toggling the persistent `FS_XFLAG_DAX`
flag, the change to existing regular files won't take effect until the
files are closed by all processes.
......@@ -91,16 +97,16 @@ Summary
Details
-------
There are 2 per-file dax flags. One is a persistent inode setting (FS_XFLAG_DAX)
There are 2 per-file dax flags. One is a persistent inode setting (`FS_XFLAG_DAX`)
and the other is a volatile flag indicating the active state of the feature
(S_DAX).
(`S_DAX`).
FS_XFLAG_DAX is preserved within the filesystem. This persistent config
setting can be set, cleared and/or queried using the FS_IOC_FS[GS]ETXATTR ioctl
`FS_XFLAG_DAX` is preserved within the filesystem. This persistent config
setting can be set, cleared and/or queried using the `FS_IOC_FS`[`GS`]`ETXATTR` ioctl
(see ioctl_xfs_fsgetxattr(2)) or a utility such as 'xfs_io'.
New files and directories automatically inherit FS_XFLAG_DAX from
their parent directory _when_ _created_. Therefore, setting FS_XFLAG_DAX at
New files and directories automatically inherit `FS_XFLAG_DAX` from
their parent directory **when created**. Therefore, setting `FS_XFLAG_DAX` at
directory creation time can be used to set a default behavior for an entire
sub-tree.
......@@ -108,51 +114,64 @@ To clarify inheritance, here are 3 examples:
Example A:
mkdir -p a/b/c
xfs_io -c 'chattr +x' a
mkdir a/b/c/d
mkdir a/e
.. code-block:: shell
dax: a,e
no dax: b,c,d
mkdir -p a/b/c
xfs_io -c 'chattr +x' a
mkdir a/b/c/d
mkdir a/e
------[outcome]------
dax: a,e
no dax: b,c,d
Example B:
mkdir a
xfs_io -c 'chattr +x' a
mkdir -p a/b/c/d
.. code-block:: shell
mkdir a
xfs_io -c 'chattr +x' a
mkdir -p a/b/c/d
dax: a,b,c,d
no dax:
------[outcome]------
dax: a,b,c,d
no dax:
Example C:
mkdir -p a/b/c
xfs_io -c 'chattr +x' c
mkdir a/b/c/d
.. code-block:: shell
mkdir -p a/b/c
xfs_io -c 'chattr +x' c
mkdir a/b/c/d
dax: c,d
no dax: a,b
------[outcome]------
dax: c,d
no dax: a,b
The current enabled state (S_DAX) is set when a file inode is instantiated in
The current enabled state (`S_DAX`) is set when a file inode is instantiated in
memory by the kernel. It is set based on the underlying media support, the
value of FS_XFLAG_DAX and the filesystem's dax mount option.
value of `FS_XFLAG_DAX` and the filesystem's dax mount option.
statx can be used to query `S_DAX`.
statx can be used to query S_DAX. NOTE that only regular files will ever have
S_DAX set and therefore statx will never indicate that S_DAX is set on
directories.
.. note::
Setting the FS_XFLAG_DAX flag (specifically or through inheritance) occurs even
Only regular files will ever have `S_DAX` set, and therefore statx
will never indicate that `S_DAX` is set on directories.
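As a hedged illustration (not part of this patch), the query could be done from
userspace roughly as follows, assuming a libc that wraps statx() and headers
that define ``STATX_ATTR_DAX``:

.. code-block:: c

   #define _GNU_SOURCE
   #include <fcntl.h>      /* AT_EMPTY_PATH */
   #include <sys/stat.h>   /* statx(); STATX_ATTR_DAX may need <linux/stat.h> */

   /* Return 1 if S_DAX is currently in effect for fd, 0 if not, -1 on error. */
   static int file_is_dax(int fd)
   {
           struct statx stx;

           if (statx(fd, "", AT_EMPTY_PATH, STATX_BASIC_STATS, &stx) < 0)
                   return -1;
           return !!(stx.stx_attributes & STATX_ATTR_DAX);
   }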
Setting the `FS_XFLAG_DAX` flag (specifically or through inheritance) occurs even
if the underlying media does not support dax and/or the filesystem is
overridden with a mount option.
Implementation Tips for Block Driver Writers
--------------------------------------------
To support DAX in your block driver, implement the 'direct_access'
To support `DAX` in your block driver, implement the 'direct_access'
block device operation. It is used to translate the sector number
(expressed in units of 512-byte sectors) to a page frame number (pfn)
that identifies the physical page for the memory. It also returns a
......@@ -179,19 +198,20 @@ These block devices may be used for inspiration:
Implementation Tips for Filesystem Writers
------------------------------------------
Filesystem support consists of
- adding support to mark inodes as being DAX by setting the S_DAX flag in
Filesystem support consists of:
* Adding support to mark inodes as being `DAX` by setting the `S_DAX` flag in
i_flags
- implementing ->read_iter and ->write_iter operations which use dax_iomap_rw()
when inode has S_DAX flag set
- implementing an mmap file operation for DAX files which sets the
VM_MIXEDMAP and VM_HUGEPAGE flags on the VMA, and setting the vm_ops to
* Implementing ->read_iter and ->write_iter operations which use
:c:func:`dax_iomap_rw()` when the inode has the `S_DAX` flag set
* Implementing an mmap file operation for `DAX` files which sets the
`VM_MIXEDMAP` and `VM_HUGEPAGE` flags on the `VMA`, and setting the vm_ops to
include handlers for fault, pmd_fault, page_mkwrite, pfn_mkwrite. These
handlers should probably call dax_iomap_fault() passing the appropriate
fault size and iomap operations.
- calling iomap_zero_range() passing appropriate iomap operations instead of
block_truncate_page() for DAX files
- ensuring that there is sufficient locking between reads, writes,
handlers should probably call :c:func:`dax_iomap_fault()` passing the
appropriate fault size and iomap operations.
* Calling :c:func:`iomap_zero_range()` passing appropriate iomap operations
instead of :c:func:`block_truncate_page()` for `DAX` files
* Ensuring that there is sufficient locking between reads, writes,
truncates and page faults
The iomap handlers for allocating blocks must make sure that allocated blocks
......@@ -199,9 +219,18 @@ are zeroed out and converted to written extents before being returned to avoid
exposure of uninitialized data through mmap.
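As a rough, hedged sketch of the read/write item in the list above (loosely
modeled on existing DAX filesystems; the ``myfs_`` names are hypothetical and
real implementations also take the appropriate inode locks):

.. code-block:: c

   #include <linux/dax.h>
   #include <linux/fs.h>
   #include <linux/uio.h>

   /* Hypothetical iomap operations, filled in elsewhere by the filesystem. */
   static const struct iomap_ops myfs_iomap_ops = { };

   /* Dispatch DAX inodes to dax_iomap_rw(), everything else to the page cache. */
   static ssize_t myfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
   {
           struct inode *inode = file_inode(iocb->ki_filp);

           if (IS_DAX(inode))
                   return dax_iomap_rw(iocb, to, &myfs_iomap_ops);

           return generic_file_read_iter(iocb, to);
   }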
These filesystems may be used for inspiration:
- ext2: see Documentation/filesystems/ext2.rst
- ext4: see Documentation/filesystems/ext4/
- xfs: see Documentation/admin-guide/xfs.rst
.. seealso::
ext2: see Documentation/filesystems/ext2.rst
.. seealso::
xfs: see Documentation/admin-guide/xfs.rst
.. seealso::
ext4: see Documentation/filesystems/ext4/
Handling Media Errors
......@@ -210,12 +239,12 @@ Handling Media Errors
The libnvdimm subsystem stores a record of known media error locations for
each pmem block device (in gendisk->badblocks). If we fault at such location,
or one with a latent error not yet discovered, the application can expect
to receive a SIGBUS. Libnvdimm also allows clearing of these errors by simply
to receive a `SIGBUS`. Libnvdimm also allows clearing of these errors by simply
writing the affected sectors (through the pmem driver, and if the underlying
NVDIMM supports the clear_poison DSM defined by ACPI).
Since DAX IO normally doesn't go through the driver/bio path, applications or
sysadmins have an option to restore the lost data from a prior backup/inbuilt
Since `DAX` IO normally doesn't go through the ``driver/bio`` path, applications or
sysadmins have an option to restore the lost data from a prior backup or inbuilt
redundancy in the following ways:
1. Delete the affected file, and restore from a backup (sysadmin route):
......@@ -227,7 +256,7 @@ redundancy in the following ways:
an entire aligned sector has to be hole-punched, but not necessarily an
entire filesystem block).
These are the two basic paths that allow DAX filesystems to continue operating
These are the two basic paths that allow `DAX` filesystems to continue operating
in the presence of media errors. More robust error recovery mechanisms can be
built on top of this in the future, for example, involving redundancy/mirroring
provided at the block layer through DM, or additionally, at the filesystem
......@@ -240,18 +269,23 @@ Shortcomings
------------
Even if the kernel or its modules are stored on a filesystem that supports
DAX on a block device that supports DAX, they will still be copied into RAM.
`DAX` on a block device that supports `DAX`, they will still be copied into RAM.
The DAX code does not work correctly on architectures which have virtually
mapped caches such as ARM, MIPS and SPARC.
Calling get_user_pages() on a range of user memory that has been mmaped
from a DAX file will fail when there are no 'struct page' to describe
Calling :c:func:`get_user_pages()` on a range of user memory that has been
mmaped from a `DAX` file will fail when there are no 'struct page' to describe
those pages. This problem has been addressed in some device drivers
by adding optional struct page support for pages under the control of
the driver (see CONFIG_NVDIMM_PFN in drivers/nvdimm for an example of
how to do this). In the non struct page cases O_DIRECT reads/writes to
those memory ranges from a non-DAX file will fail (note that O_DIRECT
reads/writes _of a DAX file_ do work, it is the memory that is being
accessed that is key here). Other things that will not work in the
non struct page case include RDMA, sendfile() and splice().
the driver (see `CONFIG_NVDIMM_PFN` in ``drivers/nvdimm`` for an example of
how to do this). In the non struct page cases `O_DIRECT` reads/writes to
those memory ranges from a non-`DAX` file will fail
.. note::
`O_DIRECT` reads/writes *of a `DAX` file* do work; it is the memory that
is being accessed that is key here. Other things that will not work in
the non struct page case include RDMA, :c:func:`sendfile()` and
:c:func:`splice()`.
......@@ -25,7 +25,7 @@ check=none, nocheck (*) Don't do extra checking of bitmaps on mount
(check=normal and check=strict options removed)
dax Use direct access (no page cache). See
Documentation/filesystems/dax.txt.
Documentation/filesystems/dax.rst.
debug Extra debugging information is sent to the
kernel syslog. Useful for developers.
......
......@@ -84,7 +84,7 @@ Without the option META\_BG, for safety concerns, all block group
descriptors copies are kept in the first block group. Given the default
128MiB(2^27 bytes) block group size and 64-byte group descriptors, ext4
can have at most 2^27/64 = 2^21 block groups. This limits the entire
filesystem size to 2^21 2^27 = 2^48bytes or 256TiB.
filesystem size to 2^21 * 2^27 = 2^48 bytes or 256TiB.
The solution to this problem is to use the metablock group feature
(META\_BG), which is already in ext3 for all 2.6 releases. With the
......
......@@ -77,6 +77,7 @@ Documentation for filesystem implementations.
coda
configfs
cramfs
dax
debugfs
dlmfs
ecryptfs
......
......@@ -448,15 +448,17 @@ described. If it finds a ``LAST_NORM`` component it first calls
filesystem to revalidate the result if it is that sort of filesystem.
If that doesn't get a good result, it calls "``lookup_slow()``" which
takes ``i_rwsem``, rechecks the cache, and then asks the filesystem
to find a definitive answer. Each of these will call
``follow_managed()`` (as described below) to handle any mount points.
In the absence of symbolic links, ``walk_component()`` creates a new
``struct path`` containing a counted reference to the new dentry and a
reference to the new ``vfsmount`` which is only counted if it is
different from the previous ``vfsmount``. It then calls
``path_to_nameidata()`` to install the new ``struct path`` in the
``struct nameidata`` and drop the unneeded references.
to find a definitive answer.
As the last step of walk_component(), step_into() will be called either
directly from walk_component() or from handle_dots(). It calls
handle_mounts(), to check and handle mount points, in which a new
``struct path`` is created containing a counted reference to the new dentry and
a reference to the new ``vfsmount`` which is only counted if it is
different from the previous ``vfsmount``. Then if there is
a symbolic link, step_into() calls pick_link() to deal with it,
otherwise it installs the new ``struct path`` in the ``struct nameidata``, and
drops the unneeded references.
This "hand-over-hand" sequencing of getting a reference to the new
dentry before dropping the reference to the previous dentry may
......@@ -470,8 +472,8 @@ Handling the final component
``nd->last_type`` to refer to the final component of the path. It does
not call ``walk_component()`` that last time. Handling that final
component remains for the caller to sort out. Those callers are
``path_lookupat()``, ``path_parentat()``, ``path_mountpoint()`` and
``path_openat()`` each of which handles the differing requirements of
path_lookupat(), path_parentat() and
path_openat() each of which handles the differing requirements of
different system calls.
``path_parentat()`` is clearly the simplest - it just wraps a little bit
......@@ -486,20 +488,18 @@ perform their operation.
object is wanted such as by ``stat()`` or ``chmod()``. It essentially just
calls ``walk_component()`` on the final component through a call to
``lookup_last()``. ``path_lookupat()`` returns just the final dentry.
``path_mountpoint()`` handles the special case of unmounting which must
not try to revalidate the mounted filesystem. It effectively
contains, through a call to ``mountpoint_last()``, an alternate
implementation of ``lookup_slow()`` which skips that step. This is
important when unmounting a filesystem that is inaccessible, such as
It is worth noting that when flag ``LOOKUP_MOUNTPOINT`` is set,
path_lookupat() will unset LOOKUP_JUMPED in nameidata so that in the
subsequent path traversal d_weak_revalidate() won't be called.
This is important when unmounting a filesystem that is inaccessible, such as
one provided by a dead NFS server.
Finally ``path_openat()`` is used for the ``open()`` system call; it
contains, in support functions starting with "``do_last()``", all the
contains, in support functions starting with "open_last_lookups()", all the
complexity needed to handle the different subtleties of O_CREAT (with
or without O_EXCL), final "``/``" characters, and trailing symbolic
links. We will revisit this in the final part of this series, which
focuses on those symbolic links. "``do_last()``" will sometimes, but
focuses on those symbolic links. "open_last_lookups()" will sometimes, but
not always, take ``i_rwsem``, depending on what it finds.
Each of these, or the functions which call them, need to be alert to
......@@ -535,8 +535,7 @@ covered in greater detail in autofs.txt in the Linux documentation
tree, but a few notes specifically related to path lookup are in order
here.
The Linux VFS has a concept of "managed" dentries which is reflected
in function names such as "``follow_managed()``". There are three
The Linux VFS has a concept of "managed" dentries. There are three
potentially interesting things about these dentries corresponding
to three different flags that might be set in ``dentry->d_flags``:
......@@ -652,10 +651,10 @@ RCU-walk finds it cannot stop gracefully, it simply gives up and
restarts from the top with REF-walk.
This pattern of "try RCU-walk, if that fails try REF-walk" can be
clearly seen in functions like ``filename_lookup()``,
``filename_parentat()``, ``filename_mountpoint()``,
``do_filp_open()``, and ``do_file_open_root()``. These five
correspond roughly to the four ``path_*()`` functions we met earlier,
clearly seen in functions like filename_lookup(),
filename_parentat(),
do_filp_open(), and do_file_open_root(). These four
correspond roughly to the three ``path_*()`` functions we met earlier,
each of which calls ``link_path_walk()``. The ``path_*()`` functions are
called using different mode flags until a mode is found which works.
They are first called with ``LOOKUP_RCU`` set to request "RCU-walk". If
......@@ -993,8 +992,8 @@ is 4096. There are a number of reasons for this limit; not letting the
kernel spend too much time on just one path is one of them. With
symbolic links you can effectively generate much longer paths so some
sort of limit is needed for the same reason. Linux imposes a limit of
at most 40 symlinks in any one path lookup. It previously imposed a
further limit of eight on the maximum depth of recursion, but that was
at most 40 (MAXSYMLINKS) symlinks in any one path lookup. It previously imposed
a further limit of eight on the maximum depth of recursion, but that was
raised to 40 when a separate stack was implemented, so there is now
just the one limit.
......@@ -1061,42 +1060,26 @@ filesystem cannot successfully get a reference in RCU-walk mode, it
must return ``-ECHILD`` and ``unlazy_walk()`` will be called to return to
REF-walk mode in which the filesystem is allowed to sleep.
The place for all this to happen is the ``i_op->follow_link()`` inode
method. In the present mainline code this is never actually called in
RCU-walk mode as the rewrite is not quite complete. It is likely that
in a future release this method will be passed an ``inode`` pointer when
called in RCU-walk mode so it both (1) knows to be careful, and (2) has the
validated pointer. Much like the ``i_op->permission()`` method we
looked at previously, ``->follow_link()`` would need to be careful that
The place for all this to happen is the ``i_op->get_link()`` inode
method. This is called both in RCU-walk and REF-walk. In RCU-walk the
``dentry*`` argument is NULL; ``->get_link()`` can return -ECHILD to drop out of
RCU-walk. Much like the ``i_op->permission()`` method we
looked at previously, ``->get_link()`` would need to be careful that
all the data structures it references are safe to be accessed while
holding no counted reference, only the RCU lock. Though getting a
reference with ``->follow_link()`` is not yet done in RCU-walk mode, the
code is ready to release the reference when that does happen.
This need to drop the reference to a symlink adds significant
complexity. It requires a reference to the inode so that the
``i_op->put_link()`` inode operation can be called. In REF-walk, that
reference is kept implicitly through a reference to the dentry, so
keeping the ``struct path`` of the symlink is easiest. For RCU-walk,
the pointer to the inode is kept separately. To allow switching from
RCU-walk back to REF-walk in the middle of processing nested symlinks
we also need the seq number for the dentry so we can confirm that
switching back was safe.
Finally, when providing a reference to a symlink, the filesystem also
provides an opaque "cookie" that must be passed to ``->put_link()`` so that it
knows what to free. This might be the allocated memory area, or a
pointer to the ``struct page`` in the page cache, or something else
completely. Only the filesystem knows what it is.
holding no counted reference, only the RCU lock. A callback
``struct delayed_call`` will be passed to ``->get_link()``:
file systems can set their own put_link function and argument through
set_delayed_call(). Later on, when VFS wants to put link, it will call
do_delayed_call() to invoke that callback function with the argument.
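For illustration only, a hedged sketch of that pattern follows; the ``myfs_``
names and the helper that reads the link body are hypothetical:

.. code-block:: c

   #include <linux/delayed_call.h>
   #include <linux/err.h>
   #include <linux/fs.h>
   #include <linux/slab.h>

   static char *myfs_read_symlink(struct inode *inode); /* hypothetical helper */

   static void myfs_put_link(void *arg)
   {
           kfree(arg);             /* free the buffer holding the link body */
   }

   static const char *myfs_get_link(struct dentry *dentry, struct inode *inode,
                                    struct delayed_call *done)
   {
           char *link;

           if (!dentry)            /* RCU-walk: refuse rather than sleep */
                   return ERR_PTR(-ECHILD);

           link = myfs_read_symlink(inode);
           if (IS_ERR(link))
                   return link;

           set_delayed_call(done, myfs_put_link, link);
           return link;
   }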
In order for the reference to each symlink to be dropped when the walk completes,
whether in RCU-walk or REF-walk, the symlink stack needs to contain,
along with the path remnants:
- the ``struct path`` to provide a reference to the inode in REF-walk
- the ``struct inode *`` to provide a reference to the inode in RCU-walk
- the ``struct path`` to provide a reference to the previous path
- the ``const char *`` to provide a reference to the previous name
- the ``seq`` to allow the path to be safely switched from RCU-walk to REF-walk
- the ``cookie`` that tells ``->put_path()`` what to put.
- the ``struct delayed_call`` for later invocation.
This means that each entry in the symlink stack needs to hold five
pointers and an integer instead of just one pointer (the path
......@@ -1120,12 +1103,10 @@ doesn't need to notice. Getting this ``name`` variable on and off the
stack is very straightforward; pushing and popping the references is
a little more complex.
When a symlink is found, ``walk_component()`` returns the value ``1``
(``0`` is returned for any other sort of success, and a negative number
is, as usual, an error indicator). This causes ``get_link()`` to be
called; it then gets the link from the filesystem. Providing that
operation is successful, the old path ``name`` is placed on the stack,
and the new value is used as the ``name`` for a while. When the end of
When a symlink is found, walk_component() calls pick_link() via step_into()
which returns the link from the filesystem.
Providing that operation is successful, the old path ``name`` is placed on the
stack, and the new value is used as the ``name`` for a while. When the end of
the path is found (i.e. ``*name`` is ``'\0'``) the old ``name`` is restored
off the stack and path walking continues.
......@@ -1142,23 +1123,23 @@ stack in ``walk_component()`` immediately when the symlink is found;
old symlink as it walks that last component. So it is quite
convenient for ``walk_component()`` to release the old symlink and pop
the references just before pushing the reference information for the
new symlink. It is guided in this by two flags; ``WALK_GET``, which
gives it permission to follow a symlink if it finds one, and
``WALK_PUT``, which tells it to release the current symlink after it has been
followed. ``WALK_PUT`` is tested first, leading to a call to
``put_link()``. ``WALK_GET`` is tested subsequently (by
``should_follow_link()``) leading to a call to ``pick_link()`` which sets
up the stack frame.
new symlink. It is guided in this by three flags: ``WALK_NOFOLLOW`` which
forbids it from following a symlink if it finds one, ``WALK_MORE``
which indicates that it is yet too early to release the
current symlink, and ``WALK_TRAILING`` which indicates that it is on the final
component of the lookup, so we will check the userspace flag ``LOOKUP_FOLLOW`` to
decide whether to follow it when it is a symlink, and call ``may_follow_link()`` to
check that we have the privilege to follow it.
Symlinks with no final component
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A pair of special-case symlinks deserve a little further explanation.
Both result in a new ``struct path`` (with mount and dentry) being set
up in the ``nameidata``, and result in ``get_link()`` returning ``NULL``.
up in the ``nameidata``, and result in pick_link() returning ``NULL``.
The more obvious case is a symlink to "``/``". All symlinks starting
with "``/``" are detected in ``get_link()`` which resets the ``nameidata``
with "``/``" are detected in pick_link() which resets the ``nameidata``
to point to the effective filesystem root. If the symlink only
contains "``/``" then there is nothing more to do, no components at all,
so ``NULL`` is returned to indicate that the symlink can be released and
......@@ -1175,12 +1156,11 @@ something that looks like a symlink. It is really a reference to the
target file, not just the name of it. When you ``readlink`` these
objects you get a name that might refer to the same file - unless it
has been unlinked or mounted over. When ``walk_component()`` follows
one of these, the ``->follow_link()`` method in "procfs" doesn't return
a string name, but instead calls ``nd_jump_link()`` which updates the
``nameidata`` in place to point to that target. ``->follow_link()`` then
returns ``NULL``. Again there is no final component and ``get_link()``
reports this by leaving the ``last_type`` field of ``nameidata`` as
``LAST_BIND``.
one of these, the ``->get_link()`` method in "procfs" doesn't return
a string name, but instead calls nd_jump_link() which updates the
``nameidata`` in place to point to that target. ``->get_link()`` then
returns ``NULL``. Again there is no final component and pick_link()
returns ``NULL``.
Following the symlink in the final component
--------------------------------------------
......@@ -1197,42 +1177,38 @@ potentially need to call ``link_path_walk()`` again and again on
successive symlinks until one is found that doesn't point to another
symlink.
This case is handled by the relevant caller of ``link_path_walk()``, such as
``path_lookupat()`` using a loop that calls ``link_path_walk()``, and then
handles the final component. If the final component is a symlink
that needs to be followed, then ``trailing_symlink()`` is called to set
things up properly and the loop repeats, calling ``link_path_walk()``
again. This could loop as many as 40 times if the last component of
each symlink is another symlink.
The various functions that examine the final component and possibly
report that it is a symlink are ``lookup_last()``, ``mountpoint_last()``
and ``do_last()``, each of which use the same convention as
``walk_component()`` of returning ``1`` if a symlink was found that needs
to be followed.
Of these, ``do_last()`` is the most interesting as it is used for
opening a file. Part of ``do_last()`` runs with ``i_rwsem`` held and this
part is in a separate function: ``lookup_open()``.
Explaining ``do_last()`` completely is beyond the scope of this article,
but a few highlights should help those interested in exploring the
code.
1. Rather than just finding the target file, ``do_last()`` needs to open
This case is handled by the relevant callers of link_path_walk(), such as
path_lookupat() and path_openat(), using a loop that calls link_path_walk()
and then handles the final component by calling open_last_lookups() or
lookup_last(). If the final component is a symlink that needs to be followed,
open_last_lookups() or lookup_last() will set things up properly and
return the path so that the loop repeats, calling
link_path_walk() again. This could loop as many as 40 times if the last
component of each symlink is another symlink.
Of the various functions that examine the final component,
open_last_lookups() is the most interesting as it works in tandem
with do_open() for opening a file. Part of open_last_lookups() runs
with ``i_rwsem`` held and this part is in a separate function: lookup_open().
Explaining open_last_lookups() and do_open() completely is beyond the scope
of this article, but a few highlights should help those interested in exploring
the code.
1. Rather than just finding the target file, do_open() is used after
open_last_lookups() to open
it. If the file was found in the dcache, then ``vfs_open()`` is used for
this. If not, then ``lookup_open()`` will either call ``atomic_open()`` (if
the filesystem provides it) to combine the final lookup with the open, or
will perform the separate ``lookup_real()`` and ``vfs_create()`` steps
will perform the separate ``i_op->lookup()`` and ``i_op->create()`` steps
directly. In the later case the actual "open" of this newly found or
created file will be performed by ``vfs_open()``, just as if the name
created file will be performed by vfs_open(), just as if the name
were found in the dcache.
2. ``vfs_open()`` can fail with ``-EOPENSTALE`` if the cached information
wasn't quite current enough. Rather than restarting the lookup from
the top with ``LOOKUP_REVAL`` set, ``lookup_open()`` is called instead,
giving the filesystem a chance to resolve small inconsistencies.
If that doesn't work, only then is the lookup restarted from the top.
2. vfs_open() can fail with ``-EOPENSTALE`` if the cached information
wasn't quite current enough. If this happens in RCU-walk, ``-ECHILD`` will be
returned; otherwise ``-ESTALE`` is returned. When ``-ESTALE`` is returned, the
caller may retry with the ``LOOKUP_REVAL`` flag set.
3. An open with O_CREAT **does** follow a symlink in the final component,
unlike other creation system calls (like ``mkdir``). So the sequence::
......@@ -1242,8 +1218,8 @@ code.
will create a file called ``/tmp/bar``. This is not permitted if
``O_EXCL`` is set but otherwise is handled for an O_CREAT open much
like for a non-creating open: ``should_follow_link()`` returns ``1``, and
so does ``do_last()`` so that ``trailing_symlink()`` gets called and the
like for a non-creating open: lookup_last() or open_last_lookups()
returns a non ``NULL`` value, and link_path_walk() gets called and the
open process continues on the symlink that was found.
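A hedged userspace illustration of point 3 (not part of this patch); after it
runs, the written data lands in ``/tmp/bar``, the target of the ``/tmp/foo``
symlink:

.. code-block:: c

   #include <fcntl.h>
   #include <stdio.h>
   #include <unistd.h>

   int main(void)
   {
           int fd;

           symlink("/tmp/bar", "/tmp/foo");        /* foo -> bar */

           /* O_CREAT follows the trailing symlink, so this creates /tmp/bar. */
           fd = open("/tmp/foo", O_CREAT | O_WRONLY, 0644);
           if (fd >= 0) {
                   if (write(fd, "hello\n", 6) < 0)
                           perror("write");
                   close(fd);
           }
           return 0;
   }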
Updating the access time
......
......@@ -79,7 +79,8 @@ the ANOD object which is also the final target node of the reference.
})
}
Please also see a graph example in :doc:`graph`.
Please also see a graph example in
Documentation/firmware-guide/acpi/dsd/graph.rst.
References
==========
......
......@@ -174,4 +174,4 @@ References
referenced 2016-10-04.
[7] _DSD Device Properties Usage Rules.
:doc:`../DSD-properties-rules`
Documentation/firmware-guide/acpi/DSD-properties-rules.rst
......@@ -339,8 +339,8 @@ a code like this::
There are also devm_* versions of these functions which release the
descriptors once the device is released.
See Documentation/firmware-guide/acpi/gpio-properties.rst for more information about the
_DSD binding related to GPIOs.
See Documentation/firmware-guide/acpi/gpio-properties.rst for more information
about the _DSD binding related to GPIOs.
MFD devices
===========
......@@ -460,7 +460,8 @@ the _DSD of the device object itself or the _DSD of its ancestor in the
Otherwise, the _DSD itself is regarded as invalid and therefore the "compatible"
property returned by it is meaningless.
Refer to :doc:`DSD-properties-rules` for more information.
Refer to Documentation/firmware-guide/acpi/DSD-properties-rules.rst for more
information.
PCI hierarchy representation
============================
......
......@@ -59,7 +59,7 @@ Declare the I2C devices via ACPI
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ACPI can also describe I2C devices. There is special documentation for this
which is currently located at :doc:`../firmware-guide/acpi/enumeration`.
which is currently located at Documentation/firmware-guide/acpi/enumeration.rst.
Declare the I2C devices in board files
......
......@@ -17,7 +17,8 @@ address), ``force`` (to forcibly attach the driver to a given device) and
With the conversion of the I2C subsystem to the standard device driver
binding model, it became clear that these per-module parameters were no
longer needed, and that a centralized implementation was possible. The new,
sysfs-based interface is described in :doc:`instantiating-devices`, section
sysfs-based interface is described in
Documentation/i2c/instantiating-devices.rst, section
"Method 4: Instantiate from user-space".
Below is a mapping from the old module parameters to the new interface.
......
......@@ -27,8 +27,8 @@ a different protocol operation entirely.
Each transaction type corresponds to a functionality flag. Before calling a
transaction function, a device driver should always check (just once) for
the corresponding functionality flag to ensure that the underlying I2C
adapter supports the transaction in question. See :doc:`functionality` for
the details.
adapter supports the transaction in question. See
Documentation/i2c/functionality.rst for the details.
Key to symbols
......
......@@ -263,7 +263,7 @@ possible overrun should the name be too long::
char name[128];
if (ioctl(fd, JSIOCGNAME(sizeof(name)), name) < 0)
strncpy(name, "Unknown", sizeof(name));
strscpy(name, "Unknown", sizeof(name));
printf("Name: %s\n", name);
......
......@@ -601,7 +601,7 @@ Defined in ``include/linux/export.h``
This is the variant of `EXPORT_SYMBOL()` that allows specifying a symbol
namespace. Symbol Namespaces are documented in
:doc:`../core-api/symbol-namespaces`
Documentation/core-api/symbol-namespaces.rst
:c:func:`EXPORT_SYMBOL_NS_GPL()`
--------------------------------
......@@ -610,7 +610,7 @@ Defined in ``include/linux/export.h``
This is the variant of `EXPORT_SYMBOL_GPL()` that allows specifying a symbol
namespace. Symbol Namespaces are documented in
:doc:`../core-api/symbol-namespaces`
Documentation/core-api/symbol-namespaces.rst
Routines and Conventions
========================
......
......@@ -466,7 +466,7 @@ network. PTP support varies among Intel devices that support this driver. Use
"ethtool -T <netdev name>" to get a definitive list of PTP capabilities
supported by the device.
IEEE 802.1ad (QinQ) Support
IEEE 802.1ad (QinQ) Support
---------------------------
The IEEE 802.1ad standard, informally known as QinQ, allows for multiple VLAN
IDs within a single Ethernet frame. VLAN IDs are sometimes referred to as
......@@ -523,8 +523,8 @@ of a port's bandwidth (should it be available). The sum of all the values for
Maximum Bandwidth is not restricted, because no more than 100% of a port's
bandwidth can ever be used.
NOTE: X710/XXV710 devices fail to enable Max VFs (64) when Multiple Functions
per Port (MFP) and SR-IOV are enabled. An error from i40e is logged that says
NOTE: X710/XXV710 devices fail to enable Max VFs (64) when Multiple Functions
per Port (MFP) and SR-IOV are enabled. An error from i40e is logged that says
"add vsi failed for VF N, aq_err 16". To workaround the issue, enable less than
64 virtual functions (VFs).
......
......@@ -113,7 +113,7 @@ which the AVF is associated. The following are base mode features:
- AVF device ID
- HW mailbox is used for VF to PF communications (including on Windows)
IEEE 802.1ad (QinQ) Support
IEEE 802.1ad (QinQ) Support
---------------------------
The IEEE 802.1ad standard, informally known as QinQ, allows for multiple VLAN
IDs within a single Ethernet frame. VLAN IDs are sometimes referred to as
......
......@@ -22,7 +22,7 @@ The major benefit to creating a region is to provide access to internal
address regions that are otherwise inaccessible to the user.
Regions may also be used to provide an additional way to debug complex error
states, but see also :doc:`devlink-health`
states, but see also Documentation/networking/devlink/devlink-health.rst
Regions may optionally support capturing a snapshot on demand via the
``DEVLINK_CMD_REGION_NEW`` netlink message. A driver wishing to allow
......
......@@ -495,8 +495,8 @@ help debug packet drops caused by these exceptions. The following list includes
links to the description of driver-specific traps registered by various device
drivers:
* :doc:`netdevsim`
* :doc:`mlxsw`
* Documentation/networking/devlink/netdevsim.rst
* Documentation/networking/devlink/mlxsw.rst
.. _Generic-Packet-Trap-Groups:
......
......@@ -153,7 +153,7 @@ As capture, each frame contains two parts::
struct ifreq s_ifr;
...
strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));
strscpy_pad (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));
/* get interface index of eth0 */
ioctl(this->socket, SIOCGIFINDEX, &s_ifr);
......
......@@ -107,7 +107,7 @@ Note that the character pointer becomes overwritten with the real device name
*/
ifr.ifr_flags = IFF_TUN;
if( *dev )
strncpy(ifr.ifr_name, dev, IFNAMSIZ);
strscpy_pad(ifr.ifr_name, dev, IFNAMSIZ);
if( (err = ioctl(fd, TUNSETIFF, (void *) &ifr)) < 0 ){
close(fd);
......
......@@ -10,10 +10,11 @@ can greatly increase the chances of your change being accepted.
This document contains a large number of suggestions in a relatively terse
format. For detailed information on how the kernel development process
works, see :doc:`development-process`. Also, read :doc:`submit-checklist`
works, see Documentation/process/development-process.rst. Also, read
Documentation/process/submit-checklist.rst
for a list of items to check before submitting code. If you are submitting
a driver, also read :doc:`submitting-drivers`; for device tree binding patches,
read :doc:`submitting-patches`.
a driver, also read Documentation/process/submitting-drivers.rst; for device
tree binding patches, read Documentation/process/submitting-patches.rst.
This documentation assumes that you're using ``git`` to prepare your patches.
If you're unfamiliar with ``git``, you would be well-advised to learn how to
......@@ -178,8 +179,7 @@ Style-check your changes
------------------------
Check your patch for basic style violations, details of which can be
found in
:ref:`Documentation/process/coding-style.rst <codingstyle>`.
found in Documentation/process/coding-style.rst.
Failure to do so simply wastes
the reviewers time and will get your patch rejected, probably
without even being read.
......@@ -238,7 +238,7 @@ If you have a patch that fixes an exploitable security bug, send that patch
to security@kernel.org. For severe bugs, a short embargo may be considered
to allow distributors to get the patch out to users; in such cases,
obviously, the patch should not be sent to any public lists. See also
:doc:`/admin-guide/security-bugs`.
Documentation/admin-guide/security-bugs.rst.
Patches that fix a severe bug in a released kernel should be directed
toward the stable maintainers by putting a line like this::
......@@ -246,9 +246,8 @@ toward the stable maintainers by putting a line like this::
Cc: stable@vger.kernel.org
into the sign-off area of your patch (note, NOT an email recipient). You
should also read
:ref:`Documentation/process/stable-kernel-rules.rst <stable_kernel_rules>`
in addition to this file.
should also read Documentation/process/stable-kernel-rules.rst
in addition to this document.
If changes affect userland-kernel interfaces, please send the MAN-PAGES
maintainer (as listed in the MAINTAINERS file) a man-pages patch, or at
......@@ -305,8 +304,8 @@ decreasing the likelihood of your MIME-attached change being accepted.
Exception: If your mailer is mangling patches then someone may ask
you to re-send them using MIME.
See :doc:`/process/email-clients` for hints about configuring your e-mail
client so that it sends your patches untouched.
See Documentation/process/email-clients.rst for hints about configuring
your e-mail client so that it sends your patches untouched.
Respond to review comments
--------------------------
......@@ -324,7 +323,7 @@ for their time. Code review is a tiring and time-consuming process, and
reviewers sometimes get grumpy. Even in that case, though, respond
politely and address the problems they have pointed out.
See :doc:`email-clients` for recommendations on email
See Documentation/process/email-clients.rst for recommendations on email
clients and mailing list etiquette.
......@@ -562,10 +561,10 @@ method for indicating a bug fixed by the patch. See :ref:`describe_changes`
for more details.
Note: Attaching a Fixes: tag does not subvert the stable kernel rules
process nor the requirement to Cc: stable@vger.kernel.org on all stable
process nor the requirement to Cc: stable@vger.kernel.org on all stable
patch candidates. For more information, please read
:ref:`Documentation/process/stable-kernel-rules.rst <stable_kernel_rules>`
Documentation/process/stable-kernel-rules.rst.
.. _the_canonical_patch_format:
The canonical patch format
......@@ -824,8 +823,7 @@ Greg Kroah-Hartman, "How to piss off a kernel subsystem maintainer".
NO!!!! No more huge patch bombs to linux-kernel@vger.kernel.org people!
<https://lore.kernel.org/r/20050711.125305.08322243.davem@davemloft.net>
Kernel Documentation/process/coding-style.rst:
:ref:`Documentation/process/coding-style.rst <codingstyle>`
Kernel Documentation/process/coding-style.rst
Linus Torvalds's mail on the canonical patch format:
<https://lore.kernel.org/r/Pine.LNX.4.58.0504071023190.28951@ppc970.osdl.org>
......
......@@ -29,7 +29,7 @@ Quota and period are managed within the cpu subsystem via cgroupfs.
.. note::
The cgroupfs files described in this section are only applicable
to cgroup v1. For cgroup v2, see
:ref:`Documentation/admin-guide/cgroupv2.rst <cgroup-v2-cpu>`.
:ref:`Documentation/admin-guide/cgroup-v2.rst <cgroup-v2-cpu>`.
- cpu.cfs_quota_us: the total available run-time within a period (in
microseconds)
......
......@@ -60,7 +60,7 @@ within the constraints of HZ and jiffies and their nasty design level
coupling to timeslices and granularity it was not really viable.
The second (less frequent but still periodically occurring) complaint
about Linux's nice level support was its assymetry around the origo
about Linux's nice level support was its asymmetry around the origin
(which you can see demonstrated in the picture above), or more
accurately: the fact that nice level behavior depended on the _absolute_
nice level as well, while the nice API itself is fundamentally
......
......@@ -25,7 +25,8 @@ Any user can enforce Landlock rulesets on their processes. They are merged and
evaluated according to the inherited ones in a way that ensures that only more
constraints can be added.
User space documentation can be found here: :doc:`/userspace-api/landlock`.
User space documentation can be found here:
Documentation/userspace-api/landlock.rst.
Guiding principles for safe access controls
===========================================
......
......@@ -427,7 +427,7 @@ the ‘TRC’ prefix.
:Syntax:
``echo idx > vmid_idx``
Where idx <  numvmidc
Where idx < numvmidc
----
......
......@@ -315,7 +315,8 @@ intermediate links as required.
Note: ``cti_sys0`` appears in two of the connections lists above.
CTIs can connect to multiple devices and are arranged in a star topology
via the CTM. See (:doc:`coresight-ect`) [#fourth]_ for further details.
via the CTM. See (Documentation/trace/coresight/coresight-ect.rst)
[#fourth]_ for further details.
Looking at this device we see 4 connections::
linaro-developer:~# ls -l /sys/bus/coresight/devices/cti_sys0/connections
......@@ -606,7 +607,8 @@ interface provided for that purpose by the generic STM API::
crw------- 1 root root 10, 61 Jan 3 18:11 /dev/stm0
root@genericarmv8:~#
Details on how to use the generic STM API can be found here:- :doc:`../stm` [#second]_.
Details on how to use the generic STM API can be found here:
- Documentation/trace/stm.rst [#second]_.
The CTI & CTM Modules
---------------------
......@@ -616,7 +618,7 @@ individual CTIs and components, and can propagate these between all CTIs via
channels on the CTM (Cross Trigger Matrix).
A separate documentation file is provided to explain the use of these devices.
(:doc:`coresight-ect`) [#fourth]_.
(Documentation/trace/coresight/coresight-ect.rst) [#fourth]_.
.. [#first] Documentation/ABI/testing/sysfs-bus-coresight-devices-stm
......
......@@ -40,7 +40,7 @@ See events.rst for more information.
Implementation Details
----------------------
See :doc:`ftrace-design` for details for arch porters and such.
See Documentation/trace/ftrace-design.rst for details for arch porters and such.
The File System
......@@ -354,8 +354,8 @@ of ftrace. Here is a list of some of the key files:
is being directly called by the function. If the count is greater
than 1 it most likely will be ftrace_ops_list_func().
If the callback of the function jumps to a trampoline that is
specific to a the callback and not the standard trampoline,
If the callback of a function jumps to a trampoline that is
specific to the callback and which is not the standard trampoline,
its address will be printed as well as the function that the
trampoline calls.
......
......@@ -18,6 +18,10 @@ Translations
Disclaimer
----------
.. raw:: latex
\kerneldocCJKoff
Translation's purpose is to ease reading and understanding in languages other
than English. Its aim is to help people who do not understand English or have
doubts about its interpretation. Additionally, some people prefer to read
......
......@@ -4,6 +4,10 @@
Traduzione italiana
===================
.. raw:: latex
\kerneldocCJKoff
:manutentore: Federico Vaga <federico.vaga@vaga.pv.it>
.. _it_disclaimer:
......
......@@ -62,7 +62,7 @@ i ``case``. Un esempio.:
case 'K':
case 'k':
mem <<= 10;
/* fall through */
fallthrough;
default:
break;
}
......
.. raw:: latex
\renewcommand\thesection*
\renewcommand\thesubsection*
\renewcommand\thesection*
\renewcommand\thesubsection*
\kerneldocCJKon
Japanese translations
=====================
......
.. raw:: latex
\renewcommand\thesection*
\renewcommand\thesubsection*
\renewcommand\thesection*
\renewcommand\thesubsection*
\kerneldocCJKon
한국어 번역
===========
......
......@@ -65,6 +65,7 @@ Todolist:
clearing-warn-once
cpu-load
lockup-watchdogs
unicode
Todolist:
......@@ -100,7 +101,6 @@ Todolist:
laptops/index
lcd-panel-cgram
ldm
lockup-watchdogs
LSM/index
md
media/index
......
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/admin-guide/lockup-watchdogs.rst
:Translator: Hailong Liu <liu.hailong6@zte.com.cn>
.. _cn_lockup-watchdogs:
=================================================
Softlockup与hardlockup检测机制(又名:nmi_watchdog)
=================================================
Linux中内核实现了一种用以检测系统发生softlockup和hardlockup的看门狗机制。
Softlockup是一种会引发系统在内核态中一直循环超过20秒(详见下面“实现”小节)导致
其他任务没有机会得到运行的BUG。一旦检测到'softlockup'发生,默认情况下系统会打
印当前堆栈跟踪信息并进入锁定状态。也可配置使其在检测到'softlockup'后进入panic
状态;通过sysctl命令设置“kernel.softlockup_panic”、使用内核启动参数
“softlockup_panic”(详见Documentation/admin-guide/kernel-parameters.rst)以及使
能内核编译选项“BOOTPARAM_SOFTLOCKUP_PANIC”都可实现这种配置。
而'hardlockup'是一种会引发系统在内核态一直循环超过10秒钟(详见"实现"小节)导致其
他中断没有机会运行的缺陷。与'softlockup'情况类似,除了使用sysctl命令设置
'hardlockup_panic'、使能内核选项“BOOTPARAM_HARDLOCKUP_PANIC”以及使用内核参数
"nmi_watchdog"(详见:”Documentation/admin-guide/kernel-parameters.rst“)外,一旦检
测到'hardlockup'默认情况下系统打印当前堆栈跟踪信息,然后进入锁定状态。
这个panic选项也可以与panic_timeout结合使用(这个panic_timeout是通过稍具迷惑性的
sysctl命令"kernel.panic"来设置),使系统在panic指定时间后自动重启。
实现
====
Softlockup和hardlockup分别建立在hrtimer(高精度定时器)和perf两个子系统上而实现。
这也就意味着理论上任何架构只要实现了这两个子系统就支持这两种检测机制。
Hrtimer用于周期性产生中断并唤醒watchdog线程;NMI perf事件则以”watchdog_thresh“
(编译时默认初始化为10秒,也可通过”watchdog_thresh“这个sysctl接口来进行配置修改)
为间隔周期产生以检测 hardlockups。如果一个CPU在这个时间段内没有检测到hrtimer中
断发生,'hardlockup 检测器'(即NMI perf事件处理函数)将会视系统配置而选择产生内核
警告或者直接panic。
而watchdog线程本质上是一个高优先级内核线程,每调度一次就对时间戳进行一次更新。
如果时间戳在2*watchdog_thresh(这个是softlockup的触发门限)这段时间都未更新,那么
"softlocup 检测器"(内部hrtimer定时器回调函数)会将相关的调试信息打印到系统日志中,
然后如果系统配置了进入panic流程则进入panic,否则内核继续执行。
Hrtimer定时器的周期是2*watchdog_thresh/5,也就是说在hardlockup被触发前hrtimer有
2~3次机会产生时钟中断。
如上所述,内核相当于为系统管理员提供了一个可调节hrtimer定时器和perf事件周期长度
的调节旋钮。如何通过这个旋钮为特定使用场景配置一个合理的周期值要对lockups检测的
响应速度和lockups检测开销这二者之间进行权衡。
默认情况下所有在线cpu上都会运行一个watchdog线程。不过在内核配置了”NO_HZ_FULL“的
情况下watchdog线程默认只会运行在管家(housekeeping)cpu上,而”nohz_full“启动参数指
定的cpu上则不会有watchdog线程运行。试想,如果我们允许watchdog线程在”nohz_full“指
定的cpu上运行,这些cpu上必须得运行时钟定时器来激发watchdog线程调度;这样一来就会
使”nohz_full“保护用户程序免受内核干扰的功能失效。当然,副作用就是”nohz_full“指定
的cpu即使在内核产生了lockup问题我们也无法检测到。不过,至少我们可以允许watchdog
线程在管家(non-tickless)核上继续运行以便我们能继续正常的监测这些cpus上的lockups
事件。
不论哪种情况都可以通过sysctl命令kernel.watchdog_cpumask来对没有运行watchdog线程
的cpu集合进行调节。对于nohz_full而言,如果nohz_full cpu上有异常挂住的情况,通过
这种方式打开这些cpu上的watchdog进行调试可能会有所作用。
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/core-api/cachetlb.rst
:翻译:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:校译:
吴想成 Wu XiangCheng <bobwxc@email.cn>
.. _cn_core-api_cachetlb:
======================
Linux下的缓存和TLB刷新
======================
:作者: David S. Miller <davem@redhat.com>
*译注:TLB,Translation Lookaside Buffer,页表缓存/变换旁查缓冲器*
本文描述了由Linux虚拟内存子系统调用的缓存/TLB刷新接口。它列举了每个接
口,描述了它的预期目的,以及接口被调用后的预期副作用。
下面描述的副作用是针对单处理器的实现,以及在单个处理器上发生的情况。若
为SMP,则只需将定义简单地扩展一下,使发生在某个特定接口的副作用扩展到系
统的所有处理器上。不要被这句话吓到,以为SMP的缓存/tlb刷新一定是很低
效的,事实上,这是一个可以进行很多优化的领域。例如,如果可以证明一个用
户地址空间从未在某个cpu上执行过(见mm_cpumask()),那么就不需要在该
cpu上对这个地址空间进行刷新。
首先是TLB刷新接口,因为它们是最简单的。在Linux下,TLB被抽象为cpu
用来缓存从软件页表获得的虚拟->物理地址转换的东西。这意味着,如果软件页
表发生变化,这个“TLB”缓存中就有可能出现过时(脏)的翻译。因此,当软件页表
发生变化时,内核会在页表发生 *变化后* 调用以下一种刷新方法:
1) ``void flush_tlb_all(void)``
最严格的刷新。在这个接口运行后,任何以前的页表修改都会对cpu可见。
这通常是在内核页表被改变时调用的,因为这种转换在本质上是“全局”的。
2) ``void flush_tlb_mm(struct mm_struct *mm)``
这个接口从TLB中刷新整个用户地址空间。在运行后,这个接口必须确保
以前对地址空间‘mm’的任何页表修改对cpu来说是可见的。也就是说,在
运行后,TLB中不会有‘mm’的页表项。
这个接口被用来处理整个地址空间的页表操作,比如在fork和exec过程
中发生的事情。
3) ``void flush_tlb_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end)``
这里我们要从TLB中刷新一个特定范围的(用户)虚拟地址转换。在运行后,
这个接口必须确保以前对‘start’到‘end-1’范围内的地址空间‘vma->vm_mm’
的任何页表修改对cpu来说是可见的。也就是说,在运行后,TLB中不会有
‘mm’的页表项用于‘start’到‘end-1’范围内的虚拟地址。
“vma”是用于该区域的备份存储。主要是用于munmap()类型的操作。
提供这个接口是希望端口能够找到一个合适的有效方法来从TLB中删除多
个页面大小的转换,而不是让内核为每个可能被修改的页表项调用
flush_tlb_page(见下文)。
4) ``void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr)``
这一次我们需要从TLB中删除PAGE_SIZE大小的转换。‘vma’是Linux用来跟
踪进程的mmap区域的支持结构体,地址空间可以通过vma->vm_mm获得。另
外,可以通过测试(vma->vm_flags & VM_EXEC)来查看这个区域是否是
可执行的(因此在split-tlb类型的设置中可能在“指令TLB”中)。
在运行后,这个接口必须确保之前对用户虚拟地址“addr”的地址空间
“vma->vm_mm”的页表修改对cpu来说是可见的。也就是说,在运行后,TLB
中不会有虚拟地址‘addr’的‘vma->vm_mm’的页表项。
这主要是在故障处理时使用。
5) ``void update_mmu_cache(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep)``
在每个页面故障结束时,这个程序被调用,以告诉体系结构特定的代码,在
软件页表中,在地址空间“vma->vm_mm”的虚拟地址“地址”处,现在存在
一个翻译。
可以用它所选择的任何方式使用这个信息来进行移植。例如,它可以使用这
个事件来为软件管理的TLB配置预装TLB转换。目前sparc64移植就是这么干
的。
接下来,我们有缓存刷新接口。一般来说,当Linux将现有的虚拟->物理映射
改变为新的值时,其顺序将是以下形式之一::
1) flush_cache_mm(mm);
change_all_page_tables_of(mm);
flush_tlb_mm(mm);
2) flush_cache_range(vma, start, end);
change_range_of_page_tables(mm, start, end);
flush_tlb_range(vma, start, end);
3) flush_cache_page(vma, addr, pfn);
set_pte(pte_pointer, new_pte_val);
flush_tlb_page(vma, addr);
缓存级别的刷新将永远是第一位的,因为这允许我们正确处理那些缓存严格,
且在虚拟地址被从缓存中刷新时要求一个虚拟地址的虚拟->物理转换存在的系统。
HyperSparc cpu就是这样一个具有这种属性的cpu。
下面的缓存刷新程序只需要在特定的cpu需要的范围内处理缓存刷新。大多数
情况下,这些程序必须为cpu实现,这些cpu有虚拟索引的缓存,当虚拟->物
理转换被改变或移除时,必须被刷新。因此,例如,IA32处理器的物理索引
的物理标记的缓存没有必要实现这些接口,因为这些缓存是完全同步的,并
且不依赖于翻译信息。
下面逐个列出这些程序:
1) ``void flush_cache_mm(struct mm_struct *mm)``
这个接口将整个用户地址空间从高速缓存中刷掉。也就是说,在运行后,
将没有与‘mm’相关的缓存行。
这个接口被用来处理整个地址空间的页表操作,比如在退出和执行过程
中发生的事情。
2) ``void flush_cache_dup_mm(struct mm_struct *mm)``
这个接口将整个用户地址空间从高速缓存中刷新掉。也就是说,在运行
后,将没有与‘mm’相关的缓存行。
这个接口被用来处理整个地址空间的页表操作,比如在fork过程中发生
的事情。
这个选项与flush_cache_mm分开,以允许对VIPT缓存进行一些优化。
3) ``void flush_cache_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end)``
在这里,我们要从缓存中刷新一个特定范围的(用户)虚拟地址。运行
后,在“start”到“end-1”范围内的虚拟地址的“vma->vm_mm”的缓存中
将没有页表项。
“vma”是被用于该区域的备份存储。主要是用于munmap()类型的操作。
提供这个接口是希望端口能够找到一个合适的有效方法来从缓存中删
除多个页面大小的区域, 而不是让内核为每个可能被修改的页表项调
用 flush_cache_page (见下文)。
4) ``void flush_cache_page(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn)``
这一次我们需要从缓存中删除一个PAGE_SIZE大小的区域。“vma”是
Linux用来跟踪进程的mmap区域的支持结构体,地址空间可以通过
vma->vm_mm获得。另外,我们可以通过测试(vma->vm_flags &
VM_EXEC)来查看这个区域是否是可执行的(因此在“Harvard”类
型的缓存布局中可能是在“指令缓存”中)。
“pfn”表示“addr”所对应的物理页框(通过PAGE_SHIFT左移这个
值来获得物理地址)。正是这个映射应该从缓存中删除。
在运行之后,对于虚拟地址‘addr’的‘vma->vm_mm’,在缓存中不会
有任何页表项,它被翻译成‘pfn’。
这主要是在故障处理过程中使用。
5) ``void flush_cache_kmaps(void)``
只有在平台使用高位内存的情况下才需要实现这个程序。它将在所有的
kmaps失效之前被调用。
运行后,内核虚拟地址范围PKMAP_ADDR(0)到PKMAP_ADDR(LAST_PKMAP)
的缓存中将没有页表项。
这个程序应该在asm/highmem.h中实现。
6) ``void flush_cache_vmap(unsigned long start, unsigned long end)``
``void flush_cache_vunmap(unsigned long start, unsigned long end)``
在这里,在这两个接口中,我们从缓存中刷新一个特定范围的(内核)
虚拟地址。运行后,在“start”到“end-1”范围内的虚拟地址的内核地
址空间的缓存中不会有页表项。
这两个程序中的第一个是在vmap_range()安装了页表项之后调用的。
第二个是在vunmap_range()删除页表项之前调用的。
还有一类cpu缓存问题,目前需要一套完全不同的接口来正确处理。最大
的问题是处理器的数据缓存中的虚拟别名。
.. note::
这段内容有些晦涩,为了减轻中文阅读压力,特作此译注。
别名(alias)属于缓存一致性问题,当不同的虚拟地址映射相同的
物理地址,而这些虚拟地址的index不同,此时就发生了别名现象(多
个虚拟地址被称为别名)。通俗点来说就是指同一个物理地址的数据被
加载到不同的cacheline中就会出现别名现象。
常见的解决方法有两种:第一种是硬件维护一致性,设计特定的cpu电
路来解决问题(例如设计为PIPT的cache);第二种是软件维护一致性,
就是下面介绍的sparc的解决方案——页面染色,涉及的技术细节太多,
译者不便展开,请读者自行查阅相关资料。
您的移植是否容易在其D-cache中出现虚拟别名?嗯,如果您的D-cache
是虚拟索引的,且cache大于PAGE_SIZE(页大小),并且不能防止同一
物理地址的多个cache行同时存在,您就会遇到这个问题。
如果你的D-cache有这个问题,首先正确定义asm/shmparam.h SHMLBA,
它基本上应该是你的虚拟寻址D-cache的大小(或者如果大小是可变的,
则是最大的可能大小)。这个设置将迫使SYSv IPC层只允许用户进程在
这个值的倍数的地址上对共享内存进行映射。
.. note::
这并不能解决共享mmaps的问题,请查看sparc64移植解决
这个问题的一个方法(特别是 SPARC_FLAG_MMAPSHARED)。
接下来,你必须解决所有其他情况下的D-cache别名问题。请记住这个事
实,对于一个给定的页面映射到某个用户地址空间,总是至少还有一个映
射,那就是内核在其线性映射中从PAGE_OFFSET开始。因此,一旦第一个
用户将一个给定的物理页映射到它的地址空间,就意味着D-cache的别名
问题有可能存在,因为内核已经将这个页映射到它的虚拟地址。
``void copy_user_page(void *to, void *from, unsigned long addr, struct page *page)``
``void clear_user_page(void *to, unsigned long addr, struct page *page)``
这两个程序在用户匿名或COW页中存储数据。它允许一个端口有效地
避免用户空间和内核之间的D-cache别名问题。
例如,一个端口可以在复制过程中把“from”和“to”暂时映射到内核
的虚拟地址上。这两个页面的虚拟地址的选择方式是,内核的加载/存
储指令发生在虚拟地址上,而这些虚拟地址与用户的页面映射是相同
的“颜色”。例如,Sparc64就使用这种技术。
“addr”参数告诉了用户最终要映射这个页面的虚拟地址,“page”参
数给出了一个指向目标页结构体的指针。
如果D-cache别名不是问题,这两个程序可以简单地直接调用
memcpy/memset而不做其他事情。
``void flush_dcache_page(struct page *page)``
任何时候,当内核写到一个页面缓存页,或者内核要从一个页面缓存
页中读出,并且这个页面的用户空间共享/可写映射可能存在时,
这个程序就会被调用。
.. note::
这个程序只需要为有可能被映射到用户进程的地址空间的
页面缓存调用。因此,例如,处理页面缓存中vfs符号链
接的VFS层代码根本不需要调用这个接口。
“内核写入页面缓存的页面”这句话的意思是,具体来说,内核执行存
储指令,在该页面的页面->虚拟映射处弄脏该页面的数据。在这里,通
过刷新的手段处理D-cache的别名是很重要的,以确保这些内核存储对
该页的用户空间映射是可见的。
推论的情况也同样重要,如果有用户对这个文件有共享+可写的映射,
我们必须确保内核对这些页面的读取会看到用户所做的最新的存储。
如果D-cache别名不是一个问题,这个程序可以简单地定义为该架构上
的nop。
在page->flags (PG_arch_1)中有一个位是“架构私有”。内核保证,
对于分页缓存的页面,当这样的页面第一次进入分页缓存时,它将清除
这个位。
这使得这些接口可以更有效地被实现。如果目前没有用户进程映射这个
页面,它允许我们“推迟”(也许是无限期)实际的刷新过程。请看
sparc64的flush_dcache_page和update_mmu_cache实现,以了解如
何做到这一点。
这个想法是,首先在flush_dcache_page()时,如果page->mapping->i_mmap
是一个空树,只需标记架构私有页标志位。之后,在update_mmu_cache()
中,会对这个标志位进行检查,如果设置了,就进行刷新,并清除标志位。
.. important::
通常很重要的是,如果你推迟刷新,实际的刷新发生在同一个
CPU上,因为它将cpu存储到页面上,使其变脏。同样,请看
sparc64关于如何处理这个问题的例子。
``void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
unsigned long user_vaddr, void *dst, void *src, int len)``
``void copy_from_user_page(struct vm_area_struct *vma, struct page *page,
unsigned long user_vaddr, void *dst, void *src, int len)``
当内核需要复制任意的数据进出任意的用户页时(比如ptrace()),它将使
用这两个程序。
任何必要的缓存刷新或其他需要发生的一致性操作都应该在这里发生。如果
处理器的指令缓存没有对cpu存储进行窥探,那么你很可能需要为
copy_to_user_page()刷新指令缓存。
``void flush_anon_page(struct vm_area_struct *vma, struct page *page,
unsigned long vmaddr)``
当内核需要访问一个匿名页的内容时,它会调用这个函数(目前只有
get_user_pages())。注意:flush_dcache_page()故意对匿名页不起作
用。默认的实现是nop(对于所有相干的架构应该保持这样)。对于不一致性
的架构,它应该刷新vmaddr处的页面缓存。
``void flush_kernel_dcache_page(struct page *page)``
当内核需要修改一个用kmap获得的用户页时,它会在所有修改完成后(但在
kunmapping之前)调用这个函数,以使底层页面达到最新状态。这里假定用
户没有不一致性的缓存副本(即原始页面是从类似get_user_pages()的机制
中获得的)。默认的实现是一个nop,在所有相干的架构上都应该如此。在不
一致性的架构上,这应该刷新内核缓存中的页面(使用page_address(page))。
``void flush_icache_range(unsigned long start, unsigned long end)``
当内核存储到它将执行的地址中时(例如在加载模块时),这个函数被调用。
如果icache不对存储进行窥探,那么这个程序将需要对其进行刷新。
``void flush_icache_page(struct vm_area_struct *vma, struct page *page)``
flush_icache_page的所有功能都可以在flush_dcache_page和update_mmu_cache
中实现。在未来,我们希望能够完全删除这个接口。
最后一类API是用于I/O到内核内特意设置的别名地址范围。这种别名是通过使用
vmap/vmalloc API设置的。由于内核I/O是通过物理页进行的,I/O子系统假定用户
映射和内核偏移映射是唯一的别名。这对vmap别名来说是不正确的,所以内核中任何
试图对vmap区域进行I/O的东西都必须手动管理一致性。它必须在做I/O之前刷新vmap
范围,并在I/O返回后使其失效。
``void flush_kernel_vmap_range(void *vaddr, int size)``
刷新vmap区域中指定的虚拟地址范围的内核缓存。这是为了确保内核在vmap范围
内修改的任何数据对物理页是可见的。这个设计是为了使这个区域可以安全地执
行I/O。注意,这个API并 *没有* 刷新该区域的偏移映射别名。
``void invalidate_kernel_vmap_range(void *vaddr, int size) invalidates``
在vmap区域的一个给定的虚拟地址范围的缓存,这可以防止处理器在物理页的I/O
发生时通过投机性地读取数据而使缓存变脏。这只对读入vmap区域的数据是必要的。
......@@ -19,12 +19,13 @@
来的大量 kerneldoc 信息;有朝一日,若有人有动力的话,应当把它们拆分
出来。
Todolist:
.. toctree::
:maxdepth: 1
kernel-api
workqueue
printk-basics
printk-formats
workqueue
symbol-namespaces
数据结构和低级实用程序
......@@ -32,9 +33,13 @@ Todolist:
在整个内核中使用的函数库。
Todolist:
.. toctree::
:maxdepth: 1
kobject
Todolist:
kref
assoc_array
xarray
......@@ -58,12 +63,12 @@ Linux如何让一切同时发生。 详情请参阅
:maxdepth: 1
irq/index
Todolist:
refcount-vs-atomic
local_ops
padata
Todolist:
../RCU/index
低级硬件管理
......@@ -71,9 +76,14 @@ Todolist:
缓存管理,CPU热插拔管理等。
Todolist:
.. toctree::
:maxdepth: 1
cachetlb
Todolist:
cpu_hotplug
memory-hotplug
genericirq
......
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/core-api/kernel-api.rst
:Translator: Yanteng Si <siyanteng@loongson.cn>
.. _cn_kernel-api.rst:
============
Linux内核API
============
列表管理函数
============
该API在以下内核代码中:
include/linux/list.h
基本的C库函数
=============
在编写驱动程序时,一般不能使用C库中的例程。部分函数通常很有用,它们在
下面被列出。这些函数的行为可能会与ANSI定义的略有不同,这些偏差会在文中
注明。
字符串转换
----------
该API在以下内核代码中:
lib/vsprintf.c
include/linux/kernel.h
include/linux/kernel.h
lib/kstrtox.c
lib/string_helpers.c
字符串处理
----------
该API在以下内核代码中:
lib/string.c
include/linux/string.h
mm/util.c
基本的内核库函数
================
Linux内核提供了很多实用的基本函数。
位运算
------
该API在以下内核代码中:
include/asm-generic/bitops/instrumented-atomic.h
include/asm-generic/bitops/instrumented-non-atomic.h
include/asm-generic/bitops/instrumented-lock.h
位图运算
--------
该API在以下内核代码中:
lib/bitmap.c
include/linux/bitmap.h
include/linux/bitmap.h
include/linux/bitmap.h
lib/bitmap.c
lib/bitmap.c
include/linux/bitmap.h
命令行解析
----------
该API在以下内核代码中:
lib/cmdline.c
排序
----
该API在以下内核代码中:
lib/sort.c
lib/list_sort.c
文本检索
--------
该API在以下内核代码中:
lib/textsearch.c
lib/textsearch.c
include/linux/textsearch.h
Linux中的CRC和数学函数
======================
CRC函数
-------
*译注:CRC,Cyclic Redundancy Check,循环冗余校验*
该API在以下内核代码中:
lib/crc4.c
lib/crc7.c
lib/crc8.c
lib/crc16.c
lib/crc32.c
lib/crc-ccitt.c
lib/crc-itu-t.c
基数为2的对数和幂函数
---------------------
该API在以下内核代码中:
include/linux/log2.h
整数幂函数
----------
该API在以下内核代码中:
lib/math/int_pow.c
lib/math/int_sqrt.c
除法函数
--------
该API在以下内核代码中:
include/asm-generic/div64.h
include/linux/math64.h
lib/math/div64.c
lib/math/gcd.c
UUID/GUID
---------
该API在以下内核代码中:
lib/uuid.c
内核IPC设备
===========
IPC实用程序
-----------
该API在以下内核代码中:
ipc/util.c
FIFO 缓冲区
===========
kfifo接口
---------
该API在以下内核代码中:
include/linux/kfifo.h
转发接口支持
============
转发接口支持旨在为工具和设备提供一种有效的机制,将大量数据从内核空间
转发到用户空间。
转发接口
--------
该API在以下内核代码中:
kernel/relay.c
kernel/relay.c
模块支持
========
模块加载
--------
该API在以下内核代码中:
kernel/kmod.c
模块接口支持
------------
更多信息请参考文件kernel/module.c。
硬件接口
========
该API在以下内核代码中:
kernel/dma.c
资源管理
--------
该API在以下内核代码中:
kernel/resource.c
kernel/resource.c
MTRR处理
--------
该API在以下内核代码中:
arch/x86/kernel/cpu/mtrr/mtrr.c
安全框架
========
该API在以下内核代码中:
security/security.c
security/inode.c
审计接口
========
该API在以下内核代码中:
kernel/audit.c
kernel/auditsc.c
kernel/auditfilter.c
核算框架
========
该API在以下内核代码中:
kernel/acct.c
块设备
======
该API在以下内核代码中:
block/blk-core.c
block/blk-core.c
block/blk-map.c
block/blk-sysfs.c
block/blk-settings.c
block/blk-exec.c
block/blk-flush.c
block/blk-lib.c
block/blk-integrity.c
kernel/trace/blktrace.c
block/genhd.c
block/genhd.c
字符设备
========
该API在以下内核代码中:
fs/char_dev.c
时钟框架
========
时钟框架定义了编程接口,以支持系统时钟树的软件管理。该框架广泛用于系统级芯片(SOC)平
台,以支持电源管理和各种可能需要自定义时钟速率的设备。请注意,这些 “时钟”与计时或实
时时钟(RTC)无关,它们都有单独的框架。这些:c:type: `struct clk <clk>` 实例可用于管理
各种时钟信号,例如一个96理例如96MHz的时钟信号,该信号可被用于总线或外设的数据交换,或以
其他方式触发系统硬件中的同步状态机转换。
通过明确的软件时钟门控来支持电源管理:未使用的时钟被禁用,因此系统不会因为改变不在使用
中的晶体管的状态而浪费电源。在某些系统中,这可能是由硬件时钟门控支持的,其中时钟被门控
而不在软件中被禁用。芯片的部分,在供电但没有时钟的情况下,可能会保留其最后的状态。这种
低功耗状态通常被称为*保留模式*。这种模式仍然会产生漏电流,特别是在电路几何结构较细的情
况下,但对于CMOS电路来说,电能主要是随着时钟翻转而被消耗的。
电源感知驱动程序只有在其管理的设备处于活动使用状态时才会启用时钟。此外,系统睡眠状态通
常根据哪些时钟域处于活动状态而有所不同:“待机”状态可能允许从多个活动域中唤醒,而
"mem"(暂停到RAM)状态可能需要更全面地关闭来自高速PLL和振荡器的时钟,从而限制了可能
的唤醒事件源的数量。驱动器的暂停方法可能需要注意目标睡眠状态的系统特定时钟约束。
一些平台支持可编程时钟发生器。这些可以被各种外部芯片使用,如其他CPU、多媒体编解码器以
及对接口时钟有严格要求的设备。
该API在以下内核代码中:
include/linux/clk.h
同步原语
========
读-复制-更新(RCU)
-------------------
该API在以下内核代码中:
include/linux/rcupdate.h
kernel/rcu/tree.c
kernel/rcu/tree_exp.h
kernel/rcu/update.c
include/linux/srcu.h
kernel/rcu/srcutree.c
include/linux/rculist_bl.h
include/linux/rculist.h
include/linux/rculist_nulls.h
include/linux/rcu_sync.h
kernel/rcu/sync.c
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/core-api/kobject.rst
:Translator: Yanteng Si <siyanteng@loongson.cn>
.. _cn_core_api_kobject.rst:
=======================================================
关于kobjects、ksets和ktypes的一切你没想过需要了解的东西
=======================================================
:作者: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
:最后一次更新: December 19, 2007
根据Jon Corbet于2003年10月1日为lwn.net撰写的原创文章改编,网址是:
https://lwn.net/Articles/51437/
理解驱动模型和建立在其上的kobject抽象的部分的困难在于,没有明显的切入点。
处理kobjects需要理解一些不同的类型,所有这些类型都会相互引用。为了使事情
变得更简单,我们将多路并进,从模糊的术语开始,并逐渐增加细节。那么,先来
了解一些我们将要使用的术语的简明定义吧。
- 一个kobject是一个kobject结构体类型的对象。Kobjects有一个名字和一个
引用计数。一个kobject也有一个父指针(允许对象被排列成层次结构),一个
特定的类型,并且,通常在sysfs虚拟文件系统中表示。
Kobjects本身通常并不引人关注;相反它们常常被嵌入到其他包含真正引人注目
的代码的结构体中。
任何结构体都 **不应该** 有一个以上的kobject嵌入其中。如果有的话,对象的引用计
数肯定会被打乱,而且不正确,你的代码就会出现错误。所以不要这样做。
- ktype是嵌入一个kobject的对象的类型。每个嵌入kobject的结构体都需要一个
相应的ktype。ktype控制着kobject在被创建和销毁时的行为。
- 一个kset是一组kobjects。这些kobjects可以是相同的ktype或者属于不同的
ktype。kset是kobjects集合的基本容器类型。Ksets包含它们自己的kobjects,
但你可以安全地忽略这个实现细节,因为kset的核心代码会自动处理这个kobject。
当你看到一个下面全是其他目录的sysfs目录时,通常这些目录中的每一个都对应
于同一个kset中的一个kobject。
我们将研究如何创建和操作所有这些类型。将采取一种自下而上的方法,所以我们
将回到kobjects。
嵌入kobjects
=============
内核代码很少创建孤立的kobject,只有一个主要的例外,下面会解释。相反,
kobjects被用来控制对一个更大的、特定领域的对象的访问。为此,kobjects会被
嵌入到其他结构中。如果你习惯于用面向对象的术语来思考问题,那么kobjects可
以被看作是一个顶级的抽象类,其他的类都是从它派生出来的。一个kobject实现了
一系列的功能,这些功能本身并不特别有用,但在其他对象中却很好用。C语言不允
许直接表达继承,所以必须使用其他技术——比如结构体嵌入。
(对于那些熟悉内核链表实现的人来说,这类似于“list_head”结构本身很少有用,
但总是被嵌入到感兴趣的更大的对象中)。
例如, ``drivers/uio/uio.c`` 中的IO代码有一个结构体,定义了与uio设备相
关的内存区域::
struct uio_map {
struct kobject kobj;
struct uio_mem *mem;
};
如果你有一个uio_map结构体,找到其嵌入的kobject只是一个使用kobj成员的问题。
然而,与kobjects一起工作的代码往往会遇到相反的问题:给定一个结构体kobject
的指针,指向包含结构体的指针是什么?你必须避免使用一些技巧(比如假设
kobject在结构的开头),相反,你得使用container_of()宏,其可以在 ``<linux/kernel.h>``
中找到::
container_of(ptr, type, member)
其中:
* ``ptr`` 是一个指向嵌入kobject的指针,
* ``type`` 是包含结构体的类型,
* ``member`` 是 ``指针`` 所指向的结构体域的名称。
container_of()的返回值是一个指向相应容器类型的指针。因此,例如,一个嵌入到
uio_map结构 **中** 的kobject结构体的指针kp可以被转换为一个指向 **包含** uio_map
结构体的指针,方法是::
struct uio_map *u_map = container_of(kp, struct uio_map, kobj);
为了方便起见,程序员经常定义一个简单的宏,用于将kobject指针 **反推** 到包含
类型。在早期的 ``drivers/uio/uio.c`` 中正是如此,你可以在这里看到::
struct uio_map {
struct kobject kobj;
struct uio_mem *mem;
};
#define to_map(map) container_of(map, struct uio_map, kobj)
其中宏的参数“map”是一个指向有关的kobject结构体的指针。该宏随后被调用::
struct uio_map *map = to_map(kobj);
kobjects的初始化
================
当然,创建kobject的代码必须初始化该对象。一些内部字段是通过(强制)调用kobject_init()
来设置的::
void kobject_init(struct kobject *kobj, struct kobj_type *ktype);
ktype是正确创建kobject的必要条件,因为每个kobject都必须有一个相关的kobj_type。
在调用kobject_init()后,为了向sysfs注册kobject,必须调用函数kobject_add()::
int kobject_add(struct kobject *kobj, struct kobject *parent,
const char *fmt, ...);
这将正确设置kobject的父级和kobject的名称。如果该kobject要与一个特定的kset相关
联,在调用kobject_add()之前必须分配kobj->kset。如果kset与kobject相关联,则
kobject的父级可以在调用kobject_add()时被设置为NULL,则kobject的父级将是kset
本身。
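综合上述接口,一个粗略的示意如下(结构体 ``foo_obj`` 、其ktype以及名字均为假设的示例,ktype各成员的填写见后文)::

    #include <linux/kobject.h>
    #include <linux/slab.h>

    struct foo_obj {
            struct kobject kobj;
            int value;
    };

    static struct kobj_type foo_ktype;      /* release等成员的填写见后文 */

    static int create_foo(struct kobject *parent, struct kset *set)
    {
            struct foo_obj *foo;
            int ret;

            foo = kzalloc(sizeof(*foo), GFP_KERNEL);
            if (!foo)
                    return -ENOMEM;

            /* 如需关联kset,必须在kobject_add()之前设置 */
            foo->kobj.kset = set;

            kobject_init(&foo->kobj, &foo_ktype);
            ret = kobject_add(&foo->kobj, parent, "foo%d", 1);
            if (ret)
                    /* 初始化之后出错也要用kobject_put()释放,而不能kfree() */
                    kobject_put(&foo->kobj);
            return ret;
    }
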
由于kobject的名字是在它被添加到内核时设置的,所以kobject的名字不应该被直接操作。
如果你必须改变kobject的名字,请调用kobject_rename()::
int kobject_rename(struct kobject *kobj, const char *new_name);
kobject_rename()函数不会执行任何锁定操作,也不会对name进行可靠性检查,所以调用
者自己检查和串行化操作是明智的选择。
有一个叫kobject_set_name()的函数,但那是历史遗产,正在被删除。如果你的代码需
要调用这个函数,那么它是不正确的,需要被修复。
要正确访问kobject的名称,请使用函数kobject_name()::
const char *kobject_name(const struct kobject * kobj);
有一个辅助函数可以同时初始化和添加kobject到内核中,令人惊讶的是,该函数被称为
kobject_init_and_add()::
int kobject_init_and_add(struct kobject *kobj, struct kobj_type *ktype,
struct kobject *parent, const char *fmt, ...);
参数与上面描述的单个kobject_init()和kobject_add()函数相同。
Uevents
=======
当一个kobject被注册到kobject核心后,你需要向全世界宣布它已经被创建了。这可以通
过调用kobject_uevent()来实现::
int kobject_uevent(struct kobject *kobj, enum kobject_action action);
当kobject第一次被添加到内核时,使用 *KOBJ_ADD* 动作。这应该在该kobject的任
何属性或子对象被正确初始化后进行,因为当这个调用发生时,用户空间会立即开始寻
找它们。
当kobject从内核中移除时(关于如何做的细节在下面), **KOBJ_REMOVE** 的uevent
将由kobject核心自动创建,所以调用者不必担心手动操作。
引用计数
========
kobject的关键功能之一是作为它所嵌入的对象的一个引用计数器。只要对该对象的引用
存在,该对象(以及支持它的代码)就必须继续存在。用于操作kobject的引用计数的低
级函数是::
struct kobject *kobject_get(struct kobject *kobj);
void kobject_put(struct kobject *kobj);
对kobject_get()的成功调用将增加kobject的引用计数器值并返回kobject的指针。
当引用被释放时,对kobject_put()的调用将递减引用计数值,并可能释放该对象。请注
意,kobject_init()将引用计数设置为1,所以设置kobject的代码最终需要kobject_put()
来释放该引用。
因为kobjects是动态的,所以它们不能以静态方式或在堆栈中声明,而总是以动态方式分
配。未来版本的内核将包含对静态创建的kobjects的运行时检查,并将警告开发者这种不
当的使用。
如果你使用struct kobject只是为了给你的结构体提供一个引用计数器,请使用struct kref
来代替;kobject是多余的。关于如何使用kref结构体的更多信息,请参见Linux内核源代
码树中的文件Documentation/core-api/kref.rst。
创建“简单的”kobjects
====================
有时,开发者想要的只是在sysfs层次结构中创建一个简单的目录,而不必去搞那些复杂
的ksets、显示和存储函数,以及其他细节。这是一个应该创建单个kobject的例外。要
创建这样一个条目(即简单的目录),请使用函数::
struct kobject *kobject_create_and_add(const char *name, struct kobject *parent);
这个函数将创建一个kobject,并将其放在sysfs中指定的父kobject下面的位置。要创
建与此kobject相关的简单属性,请使用::
int sysfs_create_file(struct kobject *kobj, const struct attribute *attr);
或者::
int sysfs_create_group(struct kobject *kobj, const struct attribute_group *grp);
这里使用的两种类型的属性,与已经用kobject_create_and_add()创建的kobject,
都可以是kobj_attribute类型,所以不需要创建特殊的自定义属性。
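一个粗略的示意如下(目录名 ``example`` 与属性名 ``value`` 均为假设)::

    #include <linux/kobject.h>
    #include <linux/sysfs.h>

    static int example_value;
    static struct kobject *example_kobj;

    static ssize_t value_show(struct kobject *kobj, struct kobj_attribute *attr,
                              char *buf)
    {
            return sysfs_emit(buf, "%d\n", example_value);
    }

    static struct kobj_attribute value_attr = __ATTR_RO(value);

    static int __init example_init(void)
    {
            /* 在 /sys/kernel/ 下创建一个名为 example 的目录 */
            example_kobj = kobject_create_and_add("example", kernel_kobj);
            if (!example_kobj)
                    return -ENOMEM;

            return sysfs_create_file(example_kobj, &value_attr.attr);
    }
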
参见示例模块, ``samples/kobject/kobject-example.c`` ,以了解一个简单的
kobject和属性的实现。
ktypes和释放方法
================
以上讨论中还缺少一件重要的事情,那就是当一个kobject的引用次数达到零的时候
会发生什么。创建kobject的代码通常不知道何时会发生这种情况;首先,如果它知
道,那么使用kobject就没有什么意义。当sysfs被引入时,即使是可预测的对象生命
周期也会变得更加复杂,因为内核的其他部分可以获得在系统中注册的任何kobject
的引用。
最终的结果是,一个由kobject保护的结构体在其引用计数归零之前不能被释放。引
用计数不受创建kobject的代码的直接控制。因此,每当它的一个kobjects的最后一
个引用消失时,必须异步通知该代码。
一旦你通过kobject_add()注册了你的kobject,你绝对不能使用kfree()来直接释
放它。唯一安全的方法是使用kobject_put()。在kobject_init()之后总是使用
kobject_put()以避免错误的发生是一个很好的做法。
这个通知是通过kobject的release()方法完成的。通常这样的方法有如下形式::
void my_object_release(struct kobject *kobj)
{
struct my_object *mine = container_of(kobj, struct my_object, kobj);
/* Perform any additional cleanup on this object, then... */
kfree(mine);
}
有一点很重要:每个kobject都必须有一个release()方法,而且这个kobject必
须持续存在(处于一致的状态),直到这个方法被调用。如果这些约束条件没有
得到满足,那么代码就是有缺陷的。注意,如果你忘记提供release()方法,内
核会警告你。不要试图通过提供一个“空”的释放函数来摆脱这个警告。
如果你的清理函数只需要调用kfree(),那么你必须创建一个包装函数,该函数
使用container_of()来向上造型到正确的类型(如上面的例子所示),然后在整个
结构体上调用kfree()。
注意,kobject的名字在release函数中是可用的,但它不能在这个回调中被改
变。否则,在kobject核心中会出现内存泄漏,这让人很不爽。
有趣的是,release()方法并不存储在kobject本身;相反,它与ktype相关。
因此,让我们引入结构体kobj_type::
struct kobj_type {
void (*release)(struct kobject *kobj);
const struct sysfs_ops *sysfs_ops;
struct attribute **default_attrs;
const struct attribute_group **default_groups;
const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj);
const void *(*namespace)(struct kobject *kobj);
void (*get_ownership)(struct kobject *kobj, kuid_t *uid, kgid_t *gid);
};
这个结构体用来描述一个特定类型的kobject(或者更正确地说,包含对象的
类型)。每个kobject都需要有一个相关的kobj_type结构;当你调用
kobject_init()或kobject_init_and_add()时必须指定一个指向该结构的
指针。
当然,kobj_type结构中的release字段是指向这种类型的kobject的release()
方法的一个指针。另外两个字段(sysfs_ops 和 default_attrs)控制这种
类型的对象如何在 sysfs 中被表示;它们超出了本文的范围。
default_attrs 指针是一个默认属性的列表,它将为任何用这个 ktype 注册
的 kobject 自动创建。
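延续前面my_object的例子,一个最简的ktype定义示意如下(其中 ``kobj_sysfs_ops`` 是内核为kobj_attribute提供的默认sysfs操作,其余名称沿用前文假设的例子)::

    static struct kobj_type my_object_ktype = {
            .release   = my_object_release,
            .sysfs_ops = &kobj_sysfs_ops,
    };

随后即可用 ``kobject_init(&mine->kobj, &my_object_ktype)`` 初始化这种类型的kobject。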
ksets
=====
一个kset仅仅是一个希望相互关联的kobjects的集合。没有限制它们必须是相
同的ktype,但是如果它们不是相同的,就要非常小心。
一个kset有以下功能:
- 它像是一个包含一组对象的袋子。一个kset可以被内核用来追踪“所有块
设备”或“所有PCI设备驱动”。
- kset也是sysfs中的一个子目录,与kset相关的kobjects可以在这里显示
出来。每个kset都包含一个kobject,它可以被设置为其他kobject的父对象;
sysfs层次结构的顶级目录就是以这种方式构建的。
- Ksets可以支持kobjects的 "热插拔",并影响uevent事件如何被报告给
用户空间。
在面向对象的术语中,“kset”是顶级的容器类;ksets包含它们自己的kobject,
但是这个kobject是由kset代码管理的,不应该被任何其他用户所操纵。
kset在一个标准的内核链表中保存它的子对象。Kobjects通过其kset字段指向其
包含的kset。在几乎所有的情况下,属于一个kset的kobjects在它们的父
对象中都有那个kset(或者,严格地说,它的嵌入kobject)。
由于kset中包含一个kobject,它应该总是被动态地创建,而不是静态地
或在堆栈中声明。要创建一个新的kset,请使用::
struct kset *kset_create_and_add(const char *name,
const struct kset_uevent_ops *uevent_ops,
struct kobject *parent_kobj);
当你完成对kset的处理后,调用::
void kset_unregister(struct kset *k);
来销毁它。这将从sysfs中删除该kset并递减其引用计数值。当引用计数
为零时,该kset将被释放。因为对该kset的其他引用可能仍然存在,
释放可能发生在kset_unregister()返回之后。
一个使用kset的例子可以在内核树中的 ``samples/kobject/kset-example.c``
文件中看到。
如果一个kset希望控制与它相关的kobjects的uevent操作,它可以使用
结构体kset_uevent_ops来处理它::
struct kset_uevent_ops {
int (* const filter)(struct kset *kset, struct kobject *kobj);
const char *(* const name)(struct kset *kset, struct kobject *kobj);
int (* const uevent)(struct kset *kset, struct kobject *kobj,
struct kobj_uevent_env *env);
};
过滤器函数允许kset阻止一个特定kobject的uevent被发送到用户空间。
如果该函数返回0,该uevent将不会被发射出去。
name函数将被调用以覆盖uevent发送到用户空间的kset的默认名称。默
认情况下,该名称将与kset本身相同,但这个函数,如果存在,可以覆盖
该名称。
当uevent即将被发送至用户空间时,uevent函数将被调用,以允许更多
的环境变量被添加到uevent中。
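一个粗略的示意如下(kset名称与判断逻辑仅作演示)::

    #include <linux/kobject.h>

    static int example_filter(struct kset *kset, struct kobject *kobj)
    {
            /* 返回0会抑制该kobject的uevent,返回非0则照常发送 */
            return 1;
    }

    static const struct kset_uevent_ops example_uevent_ops = {
            .filter = example_filter,
    };

    static struct kset *example_kset;

    static int __init example_kset_init(void)
    {
            example_kset = kset_create_and_add("example_kset",
                                               &example_uevent_ops, kernel_kobj);
            return example_kset ? 0 : -ENOMEM;
    }
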
有人可能会问,鉴于前面并没有介绍任何执行此操作的函数,究竟如何将一个kobject
添加到一个kset中。答案是这个任务是由kobject_add()处理的。当一个
kobject被传递给kobject_add()时,它的kset成员应该指向这个kobject
所属的kset。 kobject_add()将处理剩下的部分。
如果属于一个kset的kobject没有父kobject集,它将被添加到kset的目
录中。并非所有的kset成员都必须住在kset目录中。如果在添加kobject
之前分配了一个明确的父kobject,那么该kobject将被注册到kset中,
但是被添加到父kobject下面。
移除Kobject
===========
当一个kobject在kobject核心注册成功后,在代码使用完它时,必须将其
清理掉。要做到这一点,请调用kobject_put()。通过这样做,kobject核
心会自动清理这个kobject分配的所有内存。如果为这个对象发送了 ``KOBJ_ADD``
uevent,那么相应的 ``KOBJ_REMOVE`` uevent也将被发送,任何其他的
sysfs内务将被正确处理。
如果你需要分两次对kobject进行删除(比如说在你要销毁对象时无权睡眠),
那么调用kobject_del()将从sysfs中取消kobject的注册。这使得kobject
“不可见”,但它并没有被清理掉,而且该对象的引用计数仍然是一样的。在稍
后的时间调用kobject_put()来完成与该kobject相关的内存的清理。
kobject_del()可以用来放弃对父对象的引用,如果循环引用被构建的话。
在某些情况下,一个父对象引用一个子对象是有效的。循环引用必须通过明
确调用kobject_del()来打断,这样一个释放函数就会被调用,前一个循环
中的对象会相互释放。
示例代码出处
============
关于正确使用ksets和kobjects的更完整的例子,请参见示例程序
``samples/kobject/{kobject-example.c,kset-example.c}`` ,如果
您选择 ``CONFIG_SAMPLE_KOBJECT`` ,它们将被构建为可加载模块。
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/core-api/local_ops.rst
:Translator: Yanteng Si <siyanteng@loongson.cn>
.. _cn_local_ops:
========================
本地原子操作的语义和行为
========================
:作者: Mathieu Desnoyers
本文解释了本地原子操作的目的,如何为任何给定的架构实现这些操作,并说明了
如何正确使用这些操作。它还强调了在内存写入顺序很重要的情况下,跨CPU读取
这些本地变量时必须采取的预防措施。
.. note::
注意,基于 ``local_t`` 的操作不建议用于一般内核操作。请使用 ``this_cpu``
操作来代替使用,除非真的有特殊目的。大多数内核中使用的 ``local_t`` 已
经被 ``this_cpu`` 操作所取代。 ``this_cpu`` 操作在一条指令中结合了重
定位和类似 ``local_t`` 的语义,产生了更紧凑和更快的执行代码。
本地原子操作的目的
==================
本地原子操作的目的是提供快速和高度可重入的每CPU计数器。它们通过移除LOCK前
缀和通常需要在CPU间同步的内存屏障,将标准原子操作的性能成本降到最低。
在许多情况下,拥有快速的每CPU原子计数器是很有吸引力的:它不需要禁用中断来保护中
断处理程序,它允许在NMI(Non Maskable Interrupt)处理程序中使用连贯的计数器。
它对追踪目的和各种性能监测计数器特别有用。
本地原子操作只保证在拥有数据的CPU上的变量修改的原子性。因此,必须注意确保只
有一个CPU写到 ``local_t`` 的数据。这是通过使用每CPU的数据来实现的,并确
保我们在一个抢占式安全上下文中修改它。然而,从任何一个CPU读取 ``local_t``
数据都是允许的:这样它就会显得与所有者CPU的其他内存写入顺序不一致。
针对特定架构的实现
==================
这可以通过稍微修改标准的原子操作来实现:只有它们的UP变体必须被保留。这通常
意味着删除LOCK前缀(在i386和x86_64上)和任何SMP同步屏障。如果架构在SMP和
UP之间没有不同的行为,在你的架构的 ``local.h`` 中包括 ``asm-generic/local.h``
就足够了。
通过在一个结构体中嵌入一个 ``atomic_long_t`` , ``local_t`` 类型被定义为
一个不透明的 ``signed long`` 。这样做的目的是为了使从这个类型到
``long`` 的转换失败。该定义看起来像::
typedef struct { atomic_long_t a; } local_t;
使用本地原子操作时应遵循的规则
==============================
* 被本地操作触及的变量必须是每cpu的变量。
* *只有* 这些变量的CPU所有者才可以写入这些变量。
* 这个CPU可以从任何上下文(进程、中断、软中断、nmi...)中使用本地操作来更新
它的local_t变量。
* 当在进程上下文中使用本地操作时,必须禁用抢占(或中断),以确保进程在获得每
CPU变量和进行实际的本地操作之间不会被迁移到不同的CPU。
* 当在中断上下文中使用本地操作时,在主线内核上不需要特别注意,因为它们将在局
部CPU上运行,并且已经禁用了抢占。然而,我建议无论如何都要明确地禁用抢占,
以确保它在-rt内核上仍能正确工作。
* 读取本地cpu变量将提供该变量的当前拷贝。
* 对这些变量的读取可以从任何CPU进行,因为对对齐的 “ ``long`` ” 变量的更新
总是原子的。由于写入程序的CPU没有进行内存同步,所以在读取 *其他* cpu的变
量时,可以读取该变量的过期副本。
如何使用本地原子操作
====================
::
#include <linux/percpu.h>
#include <asm/local.h>
static DEFINE_PER_CPU(local_t, counters) = LOCAL_INIT(0);
计数器
======
计数是在一个signed long的所有位上进行的。
在可抢占的上下文中,围绕本地原子操作使用 ``get_cpu_var()`` 和
``put_cpu_var()`` :它确保在对每个cpu变量进行写访问时,抢占被禁用。比如
说::
local_inc(&get_cpu_var(counters));
put_cpu_var(counters);
如果你已经在一个抢占安全上下文中,你可以使用 ``this_cpu_ptr()`` 代替::
local_inc(this_cpu_ptr(&counters));
读取计数器
==========
那些本地计数器可以从外部的CPU中读取,以求得计数的总和。请注意,local_read
所看到的跨CPU的数据必须被认为是相对于拥有该数据的CPU上发生的其他内存写入来
说不符合顺序的::
long sum = 0;
for_each_online_cpu(cpu)
sum += local_read(&per_cpu(counters, cpu));
如果你想使用远程local_read来同步CPU之间对资源的访问,必须在写入者和读取者
的CPU上分别使用显式的 ``smp_wmb()`` 和 ``smp_rmb()`` 内存屏障。如果你使
用 ``local_t`` 变量作为写在缓冲区中的字节的计数器,就会出现这种情况:在缓
冲区写和计数器增量之间应该有一个 ``smp_wmb()`` ,在计数器读和缓冲区读之间
也应有一个 ``smp_rmb()`` 。
下面是一个使用 ``local.h`` 实现每个cpu基本计数器的示例模块::
/* test-local.c
*
* Sample module for local.h usage.
*/
#include <asm/local.h>
#include <linux/module.h>
#include <linux/timer.h>
static DEFINE_PER_CPU(local_t, counters) = LOCAL_INIT(0);
static struct timer_list test_timer;
/* IPI called on each CPU. */
static void test_each(void *info)
{
/* Increment the counter from a non preemptible context */
printk("Increment on cpu %d\n", smp_processor_id());
local_inc(this_cpu_ptr(&counters));
/* This is what incrementing the variable would look like within a
* preemptible context (it disables preemption) :
*
* local_inc(&get_cpu_var(counters));
* put_cpu_var(counters);
*/
}
static void do_test_timer(unsigned long data)
{
int cpu;
/* Increment the counters */
on_each_cpu(test_each, NULL, 1);
/* Read all the counters */
printk("Counters read from CPU %d\n", smp_processor_id());
for_each_online_cpu(cpu) {
printk("Read : CPU %d, count %ld\n", cpu,
local_read(&per_cpu(counters, cpu)));
}
mod_timer(&test_timer, jiffies + 1000);
}
static int __init test_init(void)
{
/* initialize the timer that will increment the counter */
timer_setup(&test_timer, do_test_timer, 0);
mod_timer(&test_timer, jiffies + 1);
return 0;
}
static void __exit test_exit(void)
{
del_timer_sync(&test_timer);
}
module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Mathieu Desnoyers");
MODULE_DESCRIPTION("Local Atomic Ops");
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/core-api/padata.rst
:Translator: Yanteng Si <siyanteng@loongson.cn>
.. _cn_core_api_padata.rst:
==================
padata并行执行机制
==================
:日期: 2020年5月
Padata是一种机制,内核可以通过此机制将工作分散到多个CPU上并行完成,同时
可以选择保持它们的顺序。
它最初是为IPsec开发的,它需要在不对这些数据包重新排序的前提下,为大量的数
据包进行加密和解密。这是目前padata的序列化作业支持的唯一用途。
Padata还支持多线程作业,将作业平均分割,同时在线程之间进行负载均衡和协调。
执行序列化作业
==============
初始化
------
使用padata执行序列化作业的第一步是建立一个padata_instance结构体,以全面
控制作业的运行方式::
#include <linux/padata.h>
struct padata_instance *padata_alloc(const char *name);
'name'即标识了这个实例。
然后,通过分配一个padata_shell来完成padata的初始化::
struct padata_shell *padata_alloc_shell(struct padata_instance *pinst);
一个padata_shell用于向padata提交一个作业,并允许一系列这样的作业被独立地
序列化。一个padata_instance可以有一个或多个padata_shell与之相关联,每个
都允许一系列独立的作业。
修改cpumasks
------------
用于运行作业的CPU可以通过两种方式改变,通过padata_set_cpumask()编程或通
过sysfs。前者的定义是::
int padata_set_cpumask(struct padata_instance *pinst, int cpumask_type,
cpumask_var_t cpumask);
这里cpumask_type是PADATA_CPU_PARALLEL(并行)或PADATA_CPU_SERIAL(串行)之一,其中并
行cpumask描述了哪些处理器将被用来并行执行提交给这个实例的作业,串行cpumask
定义了哪些处理器被允许用作串行化回调处理器。 cpumask指定了要使用的新cpumask。
一个实例的cpumasks可能有sysfs文件。例如,pcrypt的文件在
/sys/kernel/pcrypt/<instance-name>。在一个实例的目录中,有两个文件,parallel_cpumask
和serial_cpumask,任何一个cpumask都可以通过在文件中回显(echo)一个bitmask
来改变,比如说::
echo f > /sys/kernel/pcrypt/pencrypt/parallel_cpumask
读取其中一个文件会显示用户提供的cpumask,它可能与“可用”的cpumask不同。
Padata内部维护着两对cpumask,用户提供的cpumask和“可用的”cpumask(每一对由一个
并行和一个串行cpumask组成)。用户提供的cpumasks在实例分配时默认为所有可能的CPU,
并且可以如上所述进行更改。可用的cpumasks总是用户提供的cpumasks的一个子集,只包
含用户提供的掩码中的在线CPU;这些是padata实际使用的cpumasks。因此,向padata提
供一个包含离线CPU的cpumask是合法的。一旦用户提供的cpumask中的一个离线CPU上线,
padata就会使用它。
改变CPU掩码的操作代价很高,所以不应频繁更改。
运行一个作业
-------------
实际上向padata实例提交工作需要创建一个padata_priv结构体,它代表一个作业::
struct padata_priv {
/* Other stuff here... */
void (*parallel)(struct padata_priv *padata);
void (*serial)(struct padata_priv *padata);
};
这个结构体几乎肯定会被嵌入到一些针对要做的工作的大结构体中。它的大部分字段对
padata来说是私有的,但是这个结构在初始化时应该被清零,并且应该提供parallel()和
serial()函数。在完成工作的过程中,这些函数将被调用,我们马上就会遇到。
工作的提交是通过::
int padata_do_parallel(struct padata_shell *ps,
struct padata_priv *padata, int *cb_cpu);
ps和padata结构体必须如上所述进行设置;cb_cpu指向作业完成后用于最终回调的首选CPU;
它必须在当前实例的CPU掩码中(如果不是,cb_cpu指针将被更新为指向实际选择的CPU)。
padata_do_parallel()的返回值在成功时为0,表示工作正在进行中。-EBUSY意味着有人
在其他地方正在搞乱实例的CPU掩码,而当cb_cpu不在串行cpumask中、并行或串行cpumasks
中无在线CPU,或实例停止时,则会出现-EINVAL反馈。
每个提交给padata_do_parallel()的作业将依次传递给一个CPU上的上述parallel()函数
的一个调用,所以真正的并行是通过提交多个作业来实现的。parallel()在运行时禁用软
件中断,因此不能睡眠。parallel()函数把获得的padata_priv结构体指针作为其唯一的参
数;关于实际要做的工作的信息可能是通过使用container_of()找到封装结构体来获得的。
请注意,parallel()没有返回值;padata子系统假定parallel()将从此时开始负责这项工
作。作业不需要在这次调用中完成,但是,如果parallel()留下了未完成的工作,它应该准
备在前一个作业完成之前,被以新的作业再次调用。
序列化作业
----------
当一个作业完成时,parallel()(或任何实际完成该工作的函数)应该通过调用通知padata此
事::
void padata_do_serial(struct padata_priv *padata);
在未来的某个时刻,padata_do_serial()将触发对padata_priv结构体中serial()函数的调
用。这个调用将发生在最初要求调用padata_do_parallel()的CPU上;它也是在本地软件中断
被禁用的情况下运行的。
请注意,这个调用可能会被推迟一段时间,因为padata代码会努力确保作业按照提交的顺序完
成。
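把上述步骤串起来,一个高度简化的示意如下( ``struct my_request`` 以及 ``do_heavy_work()`` 、 ``complete_request()`` 等处理函数均为假设的示例)::

    #include <linux/padata.h>

    struct my_request {
            struct padata_priv padata;      /* 必须清零,并填好parallel/serial */
            /* 这里放真正要处理的数据 */
    };

    static void my_parallel(struct padata_priv *padata)
    {
            struct my_request *req = container_of(padata, struct my_request, padata);

            /* 软中断被禁用,不可睡眠 */
            do_heavy_work(req);             /* 假设的实际处理函数 */

            /* 完成后交回padata,等待按提交顺序回调serial() */
            padata_do_serial(padata);
    }

    static void my_serial(struct padata_priv *padata)
    {
            struct my_request *req = container_of(padata, struct my_request, padata);

            complete_request(req);          /* 假设的收尾函数 */
    }

    static int submit_request(struct padata_shell *ps, struct my_request *req)
    {
            int cb_cpu = smp_processor_id();        /* 首选的串行回调CPU,仅作示意 */

            req->padata.parallel = my_parallel;
            req->padata.serial = my_serial;
            return padata_do_parallel(ps, &req->padata, &cb_cpu);
    }
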
销毁
----
清理一个padata实例时,可以预见的是调用两个free函数,这两个函数对应于分配的逆过程::
void padata_free_shell(struct padata_shell *ps);
void padata_free(struct padata_instance *pinst);
用户有责任确保在调用上述任何一项之前,所有未完成的工作都已完成。
运行多线程作业
==============
一个多线程作业有一个主线程和零个或多个辅助线程,主线程参与作业,然后等待所有辅助线
程完成。padata将作业分割成称为chunk的单元,其中chunk是一个线程在一次调用线程函数
中完成的作业片段。
用户必须做三件事来运行一个多线程作业。首先,通过定义一个padata_mt_job结构体来描述
作业,这在接口部分有解释。这包括一个指向线程函数的指针,padata每次将作业块分配给线
程时都会调用这个函数。然后,定义线程函数,它接受三个参数: ``start`` 、 ``end`` 和 ``arg`` ,
其中前两个参数限定了线程操作的范围,最后一个是指向作业共享状态的指针,如果有的话。
准备好共享状态,它通常被分配在主线程的堆栈中。最后,调用padata_do_multithreaded(),
它将在作业完成后返回。
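一个粗略的示意如下(字段取值仅作演示,具体成员以include/linux/padata.h中padata_mt_job的定义为准)::

    static void init_range(unsigned long start, unsigned long end, void *arg)
    {
            /* 处理 [start, end) 这一段作业 */
    }

    static void init_all(unsigned long total_units)
    {
            struct padata_mt_job job = {
                    .thread_fn   = init_range,
                    .fn_arg      = NULL,
                    .start       = 0,
                    .size        = total_units,
                    .align       = 1,
                    .min_chunk   = 1024,    /* 每个线程一次至少处理的单元数 */
                    .max_threads = 4,
            };

            padata_do_multithreaded(&job);  /* 作业全部完成后才返回 */
    }
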
接口
====
该API在以下内核代码中:
include/linux/padata.h
kernel/padata.c
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/core-api/printk-basics.rst
:Translator: Yanteng Si <siyanteng@loongson.cn>
.. _cn_printk-basics.rst:
==================
使用printk记录消息
==================
printk()是Linux内核中最广为人知的函数之一。它是我们打印消息的标准工具,通常也是追踪和调试
的最基本方法。如果你熟悉printf(3),你就能够知道printk()是基于它的,尽管它在功能上有一些不
同之处:
- printk() 消息可以指定日志级别。
- 格式字符串虽然与C99基本兼容,但并不遵循完全相同的规范。它有一些扩展和一些限制(没
有 ``%n`` 或浮点转换指定符)。参见 :ref:`如何正确地获得printk格式指定符 <printk-specifiers>` 。
所有的printk()消息都会被打印到内核日志缓冲区,这是一个通过/dev/kmsg输出到用户空间的环
形缓冲区。读取它的通常方法是使用 ``dmesg`` 。
printk()的用法通常是这样的::
printk(KERN_INFO "Message: %s\n", arg);
其中 ``KERN_INFO`` 是日志级别(注意,它与格式字符串连在一起,日志级别不是一个单独的参数)。
可用的日志级别是:
+----------------+--------+-----------------------------------------------+
| 名称 | 字符串 | 别名函数 |
+================+========+===============================================+
| KERN_EMERG | "0" | pr_emerg() |
+----------------+--------+-----------------------------------------------+
| KERN_ALERT | "1" | pr_alert() |
+----------------+--------+-----------------------------------------------+
| KERN_CRIT | "2" | pr_crit() |
+----------------+--------+-----------------------------------------------+
| KERN_ERR | "3" | pr_err() |
+----------------+--------+-----------------------------------------------+
| KERN_WARNING | "4" | pr_warn() |
+----------------+--------+-----------------------------------------------+
| KERN_NOTICE | "5" | pr_notice() |
+----------------+--------+-----------------------------------------------+
| KERN_INFO | "6" | pr_info() |
+----------------+--------+-----------------------------------------------+
| KERN_DEBUG | "7" | pr_debug() and pr_devel() 若定义了DEBUG |
+----------------+--------+-----------------------------------------------+
| KERN_DEFAULT | "" | |
+----------------+--------+-----------------------------------------------+
| KERN_CONT | "c" | pr_cont() |
+----------------+--------+-----------------------------------------------+
日志级别指定了一条消息的重要性。内核根据日志级别和当前 *console_loglevel* (一个内核变量)决
定是否立即显示消息(将其打印到当前控制台)。如果消息的优先级比 *console_loglevel* 高(日志级
别值较低),消息将被打印到控制台。
如果省略了日志级别,则以 ``KERN_DEFAULT`` 级别打印消息。
你可以用以下方法检查当前的 *console_loglevel* ::
$ cat /proc/sys/kernel/printk
4 4 1 7
结果显示了 *current*, *default*, *minimum* 和 *boot-time-default* 日志级别。
要改变当前的 console_loglevel,只需在 ``/proc/sys/kernel/printk`` 中写入所需的
级别。例如,要打印所有的消息到控制台上::
# echo 8 > /proc/sys/kernel/printk
另一种方式,使用 ``dmesg``::
# dmesg -n 5
设置 console_loglevel 打印 KERN_WARNING (4) 或更严重的消息到控制台。更多消息参
见 ``dmesg(1)`` 。
作为printk()的替代方案,你可以使用 ``pr_*()`` 别名来记录日志。这个系列的宏在宏名中
嵌入了日志级别。例如::
pr_info("Info message no. %d\n", msg_num);
打印 ``KERN_INFO`` 消息。
除了比等效的printk()调用更简洁之外,它们还可以通过pr_fmt()宏为格式字符串使用一个通用
的定义。例如,在源文件的顶部(在任何 ``#include`` 指令之前)定义这样的内容。::
#define pr_fmt(fmt) "%s:%s: " fmt, KBUILD_MODNAME, __func__
会在该文件中的每一条 pr_*() 消息前加上发起该消息的模块和函数名称。
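一个粗略的效果示意(模块名与函数名均为假设)::

    #define pr_fmt(fmt) "%s:%s: " fmt, KBUILD_MODNAME, __func__
    #include <linux/printk.h>

    static void probe_device(void)
    {
            /* 日志中大致会显示:mydriver:probe_device: probing done */
            pr_info("probing done\n");
    }
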
为了调试,还有两个有条件编译的宏:
pr_debug()和pr_devel(),除非定义了 ``DEBUG`` (或者在pr_debug()的情况下定义了
``CONFIG_DYNAMIC_DEBUG`` ),否则它们会被编译。
函数接口
========
该API在以下内核代码中:
kernel/printk/printk.c
include/linux/printk.h
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/core-api/printk-formats.rst
:Translator: Yanteng Si <siyanteng@loongson.cn>
.. _cn_printk-formats.rst:
==============================
如何获得正确的printk格式占位符
==============================
:作者: Randy Dunlap <rdunlap@infradead.org>
:作者: Andrew Murray <amurray@mpc-data.co.uk>
整数类型
========
::
若变量类型是Type,则使用printk格式占位符。
-------------------------------------------
char %d 或 %x
unsigned char %u 或 %x
short int %d 或 %x
unsigned short int %u 或 %x
int %d 或 %x
unsigned int %u 或 %x
long %ld 或 %lx
unsigned long %lu 或 %lx
long long %lld 或 %llx
unsigned long long %llu 或 %llx
size_t %zu 或 %zx
ssize_t %zd 或 %zx
s8 %d 或 %x
u8 %u 或 %x
s16 %d 或 %x
u16 %u 或 %x
s32 %d 或 %x
u32 %u 或 %x
s64 %lld 或 %llx
u64 %llu 或 %llx
如果 <type> 的大小依赖于配置选项 (例如 sector_t, blkcnt_t) 或其大小依赖于架构
(例如 tcflag_t),则使用其可能的最大类型的格式占位符并显式强制转换为它。
例如
::
printk("test: sector number/total blocks: %llu/%llu\n",
(unsigned long long)sector, (unsigned long long)blockcount);
提醒:sizeof()返回类型为size_t。
内核的printf不支持%n。显而易见,浮点格式(%e, %f, %g, %a)也不被识别。使用任何不
支持的占位符或长度限定符都会导致一个WARN并且终止vsnprintf()执行。
指针类型
========
一个原始指针值可以用%p打印,它将在打印前对地址进行哈希处理。内核也支持扩展占位符来打印
不同类型的指针。
一些扩展占位符会打印给定地址上的数据,而不是打印地址本身。在这种情况下,以下错误消息可能
会被打印出来,而不是无法访问的消息::
(null) data on plain NULL address
(efault) data on invalid address
(einval) invalid data on a valid address
普通指针
----------
::
%p abcdef12 or 00000000abcdef12
没有指定扩展名的指针(即没有修饰符的%p)被哈希(hash),以防止内核内存布局消息的泄露。这
样还有一个额外的好处,就是提供一个唯一的标识符。在64位机器上,前32位被清零。当没有足够的
熵进行散列处理时,内核将打印(ptrval)代替。
如果可能的话,使用专门的修饰符,如%pS或%pB(如下所述),以避免打印一个必须事后解释的非哈
希地址。如果不可能,而且打印地址的目的是为调试提供更多的消息,使用%p,并在调试过程中
用 ``no_hash_pointers`` 参数启动内核,这将打印所有未修改的%p地址。如果你 *真的* 想知
道未修改的地址,请看下面的%px。
如果(也只有在)你将地址作为虚拟文件的内容打印出来,例如在procfs或sysfs中(使用
seq_printf(),而不是printk())由用户空间进程读取,使用下面描述的%pK修饰符,不
要用%p或%px。
错误指针
--------
::
%pe -ENOSPC
用于打印错误指针(即IS_ERR()为真的指针)的符号错误名。不知道符号名的错误值会以十进制打印,
而作为%pe参数传递的非ERR_PTR会被视为普通的%p。
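一个使用示意(这里借用会在失败时返回ERR_PTR的devm_clk_get(),时钟名为假设)::

    struct clk *clk = devm_clk_get(dev, "core");

    if (IS_ERR(clk))
            pr_err("clk lookup failed: %pe\n", clk);    /* 例如打印 -ENOENT */
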
符号/函数指针
-------------
::
%pS versatile_init+0x0/0x110
%ps versatile_init
%pSR versatile_init+0x9/0x110
(with __builtin_extract_return_addr() translation)
%pB prev_fn_of_versatile_init+0x88/0x88
``S`` 和 ``s`` 占位符用于打印符号格式的指针。它们的结果是符号名称带有(S)或不带有(s)偏移
量。如果禁用KALLSYMS,则打印符号地址。
``B`` 占位符的结果是带有偏移量的符号名,在打印堆栈回溯时应该使用。占位符将考虑编译器优化
的影响,当使用尾部调用并使用noreturn GCC属性标记时,可能会发生这种优化。
如果指针在一个模块内,模块名称和可选的构建ID将被打印在符号名称之后,并在说明符的末尾添加
一个额外的 ``b`` 。
::
%pS versatile_init+0x0/0x110 [module_name]
%pSb versatile_init+0x0/0x110 [module_name ed5019fdf5e53be37cb1ba7899292d7e143b259e]
%pSRb versatile_init+0x9/0x110 [module_name ed5019fdf5e53be37cb1ba7899292d7e143b259e]
(with __builtin_extract_return_addr() translation)
%pBb prev_fn_of_versatile_init+0x88/0x88 [module_name ed5019fdf5e53be37cb1ba7899292d7e143b259e]
来自BPF / tracing追踪的探查指针
----------------------------------
::
%pks kernel string
%pus user string
``k`` 和 ``u`` 指定符用于打印来自内核内存(k)或用户内存(u)的先前探测的内存。后面的 ``s`` 指
定符的结果是打印一个字符串。对于直接在常规的vsnprintf()中使用时,(k)和(u)注释被忽略,但是,当
在BPF的bpf_trace_printk()之外使用时,它会读取它所指向的内存,不会出现错误。
内核指针
--------
::
%pK 01234567 or 0123456789abcdef
用于打印应该对非特权用户隐藏的内核指针。%pK的行为取决于kptr_restrict sysctl——详见
Documentation/admin-guide/sysctl/kernel.rst。
未经修改的地址
--------------
::
%px 01234567 or 0123456789abcdef
对于打印指针,当你 *真的* 想打印地址时。在用%px打印指针之前,请考虑你是否泄露了内核内
存布局的敏感消息。%px在功能上等同于%lx(或%lu)。%px是首选,因为它在grep查找时更唯一。
如果将来我们需要修改内核处理打印指针的方式,我们将能更好地找到调用点。
在使用%px之前,请考虑使用%p并在调试过程中启用 ``no_hash_pointers`` 内核参数是否足
够(参见上面的%p描述)。%px的一个有效场景可能是在panic发生之前立即打印消息,这样无论如何
都可以防止任何敏感消息被利用,使用%px就不需要用no_hash_pointers来重现panic。
指针差异
--------
::
%td 2560
%tx a00
为了打印指针的差异,使用ptrdiff_t的%t修饰符。
例如::
printk("test: difference between pointers: %td\n", ptr2 - ptr1);
结构体资源(Resources)
-----------------------
::
%pr [mem 0x60000000-0x6fffffff flags 0x2200] or
[mem 0x0000000060000000-0x000000006fffffff flags 0x2200]
%pR [mem 0x60000000-0x6fffffff pref] or
[mem 0x0000000060000000-0x000000006fffffff pref]
用于打印结构体资源。 ``R`` 和 ``r`` 占位符的结果是打印出的资源带有(R)或不带有(r)解码标志
成员。
通过引用传递。
物理地址类型 phys_addr_t
------------------------
::
%pa[p] 0x01234567 or 0x0123456789abcdef
用于打印phys_addr_t类型(以及它的衍生物,如resource_size_t),该类型可以根据构建选项而
变化,无论CPU数据真实物理地址宽度如何。
通过引用传递。
DMA地址类型dma_addr_t
---------------------
::
%pad 0x01234567 or 0x0123456789abcdef
用于打印dma_addr_t类型,该类型可以根据构建选项而变化,而不考虑CPU数据路径的宽度。
通过引用传递。
原始缓冲区为转义字符串
----------------------
::
%*pE[achnops]
用于将原始缓冲区打印成转义字符串。对于以下缓冲区::
1b 62 20 5c 43 07 22 90 0d 5d
几个例子展示了如何进行转换(不包括两端的引号)。::
%*pE "\eb \C\a"\220\r]"
%*pEhp "\x1bb \C\x07"\x90\x0d]"
%*pEa "\e\142\040\\\103\a\042\220\r\135"
转换规则是根据可选的标志组合来应用的(详见:c:func:`string_escape_mem` 内核文档):
- a - ESCAPE_ANY
- c - ESCAPE_SPECIAL
- h - ESCAPE_HEX
- n - ESCAPE_NULL
- o - ESCAPE_OCTAL
- p - ESCAPE_NP
- s - ESCAPE_SPACE
默认情况下,使用 ESCAPE_ANY_NP。
ESCAPE_ANY_NP是许多情况下的明智选择,特别是对于打印SSID。
如果字段宽度被省略,那么将只转义1个字节。
原始缓冲区为十六进制字符串
--------------------------
::
%*ph 00 01 02 ... 3f
%*phC 00:01:02: ... :3f
%*phD 00-01-02- ... -3f
%*phN 000102 ... 3f
对于打印小的缓冲区(最长64个字节),可以用一定的分隔符作为一个
十六进制字符串。对于较大的缓冲区,可以考虑使用
:c:func:`print_hex_dump` 。
MAC/FDDI地址
------------
::
%pM 00:01:02:03:04:05
%pMR 05:04:03:02:01:00
%pMF 00-01-02-03-04-05
%pm 000102030405
%pmR 050403020100
用于打印以十六进制表示的6字节MAC/FDDI地址。 ``M`` 和 ``m`` 占位符导致打印的
地址有(M)或没有(m)字节分隔符。默认的字节分隔符是冒号(:)。
对于FDDI地址,可以在 ``M`` 占位符之后使用 ``F`` 说明,以使用破折号(-)分隔符
代替默认的分隔符。
对于蓝牙地址, ``R`` 占位符应使用在 ``M`` 占位符之后,以使用反转的字节顺序,适
合于以小尾端顺序的蓝牙地址的肉眼可见的解析。
通过引用传递。
IPv4地址
--------
::
%pI4 1.2.3.4
%pi4 001.002.003.004
%p[Ii]4[hnbl]
用于打印IPv4点分隔的十进制地址。 ``I4`` 和 ``i4`` 占位符的结果是打印的地址
有(i4)或没有(I4)前导零。
附加的 ``h`` 、 ``n`` 、 ``b`` 和 ``l`` 占位符分别用于指定主机、网络、大
尾端或小尾端地址。如果没有提供占位符,则使用默认的网络/大尾端顺序。
通过引用传递。
IPv6 地址
---------
::
%pI6 0001:0002:0003:0004:0005:0006:0007:0008
%pi6 00010002000300040005000600070008
%pI6c 1:2:3:4:5:6:7:8
用于打印IPv6网络顺序的16位十六进制地址。 ``I6`` 和 ``i6`` 占位符的结果是
打印的地址有(I6)或没有(i6)冒号。始终使用前导零。
额外的 ``c`` 占位符可与 ``I`` 占位符一起使用,以打印压缩的IPv6地址,如
https://tools.ietf.org/html/rfc5952 所述
通过引用传递。
IPv4/IPv6地址(generic, with port, flowinfo, scope)
--------------------------------------------------
::
%pIS 1.2.3.4 or 0001:0002:0003:0004:0005:0006:0007:0008
%piS 001.002.003.004 or 00010002000300040005000600070008
%pISc 1.2.3.4 or 1:2:3:4:5:6:7:8
%pISpc 1.2.3.4:12345 or [1:2:3:4:5:6:7:8]:12345
%p[Ii]S[pfschnbl]
用于打印一个IP地址,不需要区分它的类型是AF_INET还是AF_INET6。一个指向有效结构
体sockaddr的指针,通过 ``IS`` 或 ``iS`` 指定,可以传递给这个格式占位符。
附加的 ``p`` 、 ``f`` 和 ``s`` 占位符用于指定port(IPv4, IPv6)、
flowinfo (IPv6)和scope(IPv6)。port有一个 ``:`` 前缀,flowinfo是 ``/`` 和
范围是 ``%`` ,每个后面都跟着实际的值。
对于IPv6地址,如果指定了额外的指定符 ``c`` ,则使用
https://tools.ietf.org/html/rfc5952 描述的压缩IPv6地址。
如https://tools.ietf.org/html/draft-ietf-6man-text-addr-representation-07
所建议的,当出现额外的占位符 ``p`` 、 ``f`` 或 ``s`` 时,IPv6地址将由'[',']'包围。
对于IPv4地址,也可以使用额外的 ``h`` , ``n`` , ``b`` 和 ``l`` 说
明符,但对于IPv6地址则忽略。
通过引用传递。
更多例子::
%pISfc 1.2.3.4 or [1:2:3:4:5:6:7:8]/123456789
%pISsc 1.2.3.4 or [1:2:3:4:5:6:7:8]%1234567890
%pISpfc 1.2.3.4:12345 or [1:2:3:4:5:6:7:8]:12345/123456789
UUID/GUID地址
-------------
::
%pUb 00010203-0405-0607-0809-0a0b0c0d0e0f
%pUB 00010203-0405-0607-0809-0A0B0C0D0E0F
%pUl 03020100-0504-0706-0809-0a0b0c0e0e0f
%pUL 03020100-0504-0706-0809-0A0B0C0E0E0F
用于打印16字节的UUID/GUIDs地址。附加的 ``l`` , ``L`` , ``b`` 和 ``B`` 占位符用
于指定小写(l)或大写(L)十六进制表示法中的小尾端顺序,以及小写(b)或大写(B)十六进制表
示法中的大尾端顺序。
如果没有使用额外的占位符,则将打印带有小写十六进制表示法的默认大端顺序。
通过引用传递。
目录项(dentry)的名称
----------------------
::
%pd{,2,3,4}
%pD{,2,3,4}
用于打印dentry名称;如果我们用 :c:func:`d_move` 和它比较,名称可能是新旧混合的,但
不会oops。 %pd dentry比较安全,其相当于我们以前用的%s dentry->d_name.name,%pd<n>打
印 ``n`` 最后的组件。 %pD对结构文件做同样的事情。
通过引用传递。
块设备(block_device)名称
--------------------------
::
%pg sda, sda1 or loop0p1
用于打印block_device指针的名称。
va_format结构体
---------------
::
%pV
用于打印结构体va_format。这些结构包含一个格式字符串
和va_list如下
::
struct va_format {
const char *fmt;
va_list *va;
};
实现 "递归vsnprintf"。
如果没有一些机制来验证格式字符串和va_list参数的正确性,请不要使用这个功能。
通过引用传递。
设备树节点
----------
::
%pOF[fnpPcCF]
用于打印设备树节点结构。默认行为相当于%pOFf。
- f - 设备节点全称
- n - 设备节点名
- p - 设备节点句柄
- P - 设备节点路径规范(名称+@单位)
- F - 设备节点标志
- c - 主要兼容字符串
- C - 全兼容字符串
当使用多个参数时,分隔符是':'。
例如
::
%pOF /foo/bar@0 - Node full name
%pOFf /foo/bar@0 - Same as above
%pOFfp /foo/bar@0:10 - Node full name + phandle
%pOFfcF /foo/bar@0:foo,device:--P- - Node full name +
major compatible string +
node flags
D - dynamic
d - detached
P - Populated
B - Populated bus
通过引用传递。
Fwnode handles
--------------
::
%pfw[fP]
用于打印fwnode_handles的消息。默认情况下是打印完整的节点名称,包括路径。
这些修饰符在功能上等同于上面的%pOF。
- f - 节点的全名,包括路径。
- P - 节点名称,包括地址(如果有的话)。
例如 (ACPI)
::
%pfwf \_SB.PCI0.CIO2.port@1.endpoint@0 - Full node name
%pfwP endpoint@0 - Node name
例如 (OF)
::
%pfwf /ocp@68000000/i2c@48072000/camera@10/port/endpoint - Full name
%pfwP endpoint - Node name
时间和日期
----------
::
%pt[RT] YYYY-mm-ddTHH:MM:SS
%pt[RT]s YYYY-mm-dd HH:MM:SS
%pt[RT]d YYYY-mm-dd
%pt[RT]t HH:MM:SS
%pt[RT][dt][r][s]
用于打印日期和时间::
R struct rtc_time structure
T time64_t type
以我们(人类)可读的格式。
默认情况下,打印时年份会加上1900,月份会加上1。 使用%pt[RT]r (raw)
来抑制这种行为。
%pt[RT]s(空格)将覆盖ISO 8601的分隔符,在日期和时间之间使用''(空格)而
不是'T'(大写T)。当日期或时间被省略时,它不会有任何影响。
通过引用传递。
clk结构体
---------
::
%pC pll1
%pCn pll1
用于打印clk结构。%pC 和 %pCn 打印时钟的名称(通用时钟框架)或唯一的32位
ID(传统时钟框架)。
通过引用传递。
位图及其衍生物,如cpumask和nodemask
-----------------------------------
::
%*pb 0779
%*pbl 0,3-6,8-10
对于打印位图(bitmap)及其派生的cpumask和nodemask,%*pb输出以字段宽度为位数的位图,
%*pbl输出以字段宽度为位数的范围列表。
字段宽度用值传递,位图用引用传递。可以使用辅助宏cpumask_pr_args()和
nodemask_pr_args()来方便打印cpumask和nodemask。
标志位字段,如页标志、gfp_flags
-------------------------------
::
%pGp referenced|uptodate|lru|active|private|node=0|zone=2|lastcpupid=0x1fffff
%pGg GFP_USER|GFP_DMA32|GFP_NOWARN
%pGv read|exec|mayread|maywrite|mayexec|denywrite
将flags位字段打印为构造值的符号常量集合。标志的类型由第三个字符给出。目前支持的
是[p]age flags, [v]ma_flags(都期望 ``unsigned long *`` )和
[g]fp_flags(期望 ``gfp_t *`` )。标志名称和打印顺序取决于特定的类型。
注意,这种格式不应该直接用于跟踪点的:c:func:`TP_printk()` 部分。相反,应使
用 <trace/events/mmflags.h>中的show_*_flags()函数。
通过引用传递。
网络设备特性
------------
::
%pNF 0x000000000000c000
用于打印netdev_features_t。
通过引用传递。
V4L2和DRM FourCC代码(像素格式)
------------------------------
::
%p4cc
打印V4L2或DRM使用的FourCC代码,包括格式端序及其十六进制的数值。
通过引用传递。
例如::
%p4cc BG12 little-endian (0x32314742)
%p4cc Y10 little-endian (0x20303159)
%p4cc NV12 big-endian (0xb231564e)
谢谢
====
如果您添加了其他%p扩展,请在可行的情况下,用一个或多个测试用例扩展<lib/test_printf.c>。
谢谢你的合作和关注。
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/core-api/refcount-vs-atomic.rst
:Translator: Yanteng Si <siyanteng@loongson.cn>
.. _cn_refcount-vs-atomic:
=======================================
与atomic_t相比,refcount_t的API是这样的
=======================================
.. contents:: :local:
简介
====
refcount_t API的目标是为实现对象的引用计数器提供一个最小的API。虽然来自
lib/refcount.c的独立于架构的通用实现在下面使用了原子操作,但一些 ``refcount_*()``
和 ``atomic_*()`` 函数在内存顺序保证方面有很多不同。本文档概述了这些差异,并
提供了相应的例子,以帮助开发者根据这些内存顺序保证的变化来验证他们的代码。
本文档中使用的术语尽量遵循tools/memory-model/Documentation/explanation.txt
中定义的正式LKMM。
memory-barriers.txt和atomic_t.txt提供了更多关于内存顺序的背景,包括通用的
和针对原子操作的。
内存顺序的相关类型
==================
.. note:: 下面的部分只涵盖了本文使用的与原子操作和引用计数器有关的一些内存顺
序类型。如果想了解更广泛的情况,请查阅memory-barriers.txt文件。
在没有任何内存顺序保证的情况下(即完全无序),atomics和refcounters只提供原
子性和程序顺序(program order, po)关系(在同一个CPU上)。它保证每个
``atomic_* ()`` 和 ``refcount_*()`` 操作都是原子性的,指令在单个CPU上按程序
顺序执行。这是用READ_ONCE()/WRITE_ONCE()和比较并交换原语实现的。
强(完全)内存顺序保证在同一CPU上的所有较早加载和存储的指令(所有程序顺序较早
[po-earlier]指令)在执行任何程序顺序较后指令(po-later)之前完成。它还保证
同一CPU上储存的程序优先较早的指令和来自其他CPU传播的指令必须在该CPU执行任何
程序顺序较后指令之前传播到其他CPU(A-累积属性)。这是用smp_mb()实现的。
RELEASE内存顺序保证了在同一CPU上所有较早加载和存储的指令(所有程序顺序较早
指令)在此操作前完成。它还保证同一CPU上储存的程序优先较早的指令和来自其他CPU
传播的指令必须在释放(release)操作之前传播到所有其他CPU(A-累积属性)。这是用
smp_store_release()实现的。
ACQUIRE内存顺序保证了同一CPU上的所有后加载和存储的指令(所有程序顺序较后
指令)在获取(acquire)操作之后完成。它还保证在获取操作执行后,同一CPU上
储存的所有程序顺序较后指令必须传播到所有其他CPU。这是用
smp_acquire__after_ctrl_dep()实现的。
对Refcounters的控制依赖(取决于成功)保证了如果一个对象的引用被成功获得(引用计数
器的增量或增加行为发生了,函数返回true),那么进一步的存储是针对这个操作的命令。对存
储的控制依赖没有使用任何明确的屏障来实现,而是依赖于CPU不对存储进行猜测。这只是
一个单一的CPU关系,对其他CPU不提供任何保证。
函数的比较
==========
情况1) - 非 “读/修改/写”(RMW)操作
------------------------------------
函数变化:
* atomic_set() --> refcount_set()
* atomic_read() --> refcount_read()
内存顺序保证变化:
* none (两者都是完全无序的)
情况2) - 基于增量的操作,不返回任何值
--------------------------------------
函数变化:
* atomic_inc() --> refcount_inc()
* atomic_add() --> refcount_add()
内存顺序保证变化:
* none (两者都是完全无序的)
情况3) - 基于递减的RMW操作,没有返回值
---------------------------------------
函数变化:
* atomic_dec() --> refcount_dec()
内存顺序保证变化:
* 完全无序的 --> RELEASE顺序
情况4) - 基于增量的RMW操作,返回一个值
---------------------------------------
函数变化:
* atomic_inc_not_zero() --> refcount_inc_not_zero()
* 无原子性对应函数 --> refcount_add_not_zero()
内存顺序保证变化:
* 完全有序的 --> 控制依赖于存储的成功
.. note:: 此处 **假设** 了,必要的顺序是作为获得对象指针的结果而提供的。
情况 5) - 基于Dec/Sub递减的通用RMW操作,返回一个值
---------------------------------------------------
函数变化:
* atomic_dec_and_test() --> refcount_dec_and_test()
* atomic_sub_and_test() --> refcount_sub_and_test()
内存顺序保证变化:
* 完全有序的 --> RELEASE顺序 + 成功后ACQUIRE顺序
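以情况4和情况5为例,内核中常见的“查找时获取引用、用完释放”模式大致如下( ``struct foo`` 及其成员为假设的示例)::

    struct foo {
            refcount_t refs;
            /* ... 其他成员 ... */
    };

    /* 查找路径(情况4):只有在引用计数仍大于0时才能获得新引用 */
    static bool foo_get(struct foo *foo)
    {
            return refcount_inc_not_zero(&foo->refs);
    }

    /* 释放路径(情况5):最后一个引用消失时才真正释放对象 */
    static void foo_put(struct foo *foo)
    {
            if (refcount_dec_and_test(&foo->refs))
                    kfree(foo);
    }
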
情况6)其他基于递减的RMW操作,返回一个值
----------------------------------------
函数变化:
* 无原子性对应函数 --> refcount_dec_if_one()
* ``atomic_add_unless(&var, -1, 1)`` --> ``refcount_dec_not_one(&var)``
内存顺序保证变化:
* 完全有序的 --> RELEASE顺序 + 控制依赖
.. note:: atomic_add_unless()只在执行成功时提供完整的顺序。
情况7)--基于锁的RMW
--------------------
函数变化:
* atomic_dec_and_lock() --> refcount_dec_and_lock()
* atomic_dec_and_mutex_lock() --> refcount_dec_and_mutex_lock()
内存顺序保证变化:
* 完全有序 --> RELEASE顺序 + 控制依赖 + 持有
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/core-api/symbol-namespaces.rst
:Translator: Yanteng Si <siyanteng@loongson.cn>
.. _cn_symbol-namespaces.rst:
=================================
符号命名空间(Symbol Namespaces)
=================================
本文档描述了如何使用符号命名空间来构造通过EXPORT_SYMBOL()系列宏导出的内核内符号的导出面。
.. 目录
=== 1 简介
=== 2 如何定义符号命名空间
--- 2.1 使用EXPORT_SYMBOL宏
--- 2.2 使用DEFAULT_SYMBOL_NAMESPACE定义
=== 3 如何使用命名空间中导出的符号
=== 4 加载使用命名空间符号的模块
=== 5 自动创建MODULE_IMPORT_NS声明
1. 简介
=======
符号命名空间已经被引入,作为构造内核内API的导出面的一种手段。它允许子系统维护者将
他们导出的符号划分进独立的命名空间。这对于文档的编写非常有用(想想SUBSYSTEM_DEBUG
命名空间),也可以限制一组符号在内核其他部分的使用。今后,使用导出到命名空间的符号
的模块必须导入命名空间。否则,内核将根据其配置,拒绝加载该模块或警告说缺少
导入。
2. 如何定义符号命名空间
=======================
符号可以用不同的方法导出到命名空间。所有这些都在改变 EXPORT_SYMBOL 和与之类似的那些宏
被检测到的方式,以创建 ksymtab 条目。
2.1 使用EXPORT_SYMBOL宏
=======================
除了允许将内核符号导出到内核符号表的宏EXPORT_SYMBOL()和EXPORT_SYMBOL_GPL()之外,
这些宏的变体还可以将符号导出到某个命名空间:EXPORT_SYMBOL_NS() 和 EXPORT_SYMBOL_NS_GPL()。
它们需要一个额外的参数:命名空间(the namespace)。请注意,由于宏扩展,该参数需
要是一个预处理器符号。例如,要把符号 ``usb_stor_suspend`` 导出到命名空间 ``USB_STORAGE``,
请使用::
EXPORT_SYMBOL_NS(usb_stor_suspend, USB_STORAGE);
相应的 ksymtab 条目结构体 ``kernel_symbol`` 将相应地设置其 ``namespace`` 成员。
对于导出时未指明命名空间的符号,该成员为 ``NULL`` 。如果没有定义命名空间,则默认没有。
``modpost`` 和kernel/module.c分别在构建时或模块加载时使用名称空间。
2.2 使用DEFAULT_SYMBOL_NAMESPACE定义
====================================
为一个子系统的所有符号定义命名空间可能会非常冗长,并可能变得难以维护。因此,我
们提供了一个默认定义(DEFAULT_SYMBOL_NAMESPACE),如果设置了这个定义, 它将成
为所有没有指定命名空间的 EXPORT_SYMBOL() 和 EXPORT_SYMBOL_GPL() 宏扩展的默认
定义。
有多种方法来指定这个定义,使用哪种方法取决于子系统和维护者的喜好。第一种方法是在
子系统的 ``Makefile`` 中定义默认命名空间。例如,如果要将usb-common中定义的所有符号导
出到USB_COMMON命名空间,可以在drivers/usb/common/Makefile中添加这样一行::
ccflags-y += -DDEFAULT_SYMBOL_NAMESPACE=USB_COMMON
这将影响所有 EXPORT_SYMBOL() 和 EXPORT_SYMBOL_GPL() 语句。当这个定义存在时,
用EXPORT_SYMBOL_NS()导出的符号仍然会被导出到作为命名空间参数传递的命名空间中,
因为这个参数优先于默认的符号命名空间。
定义默认命名空间的第二个选项是直接在编译单元中作为预处理声明。上面的例子就会变
成::
#undef DEFAULT_SYMBOL_NAMESPACE
#define DEFAULT_SYMBOL_NAMESPACE USB_COMMON
应置于相关编译单元中任何 EXPORT_SYMBOL 宏之前。
3. 如何使用命名空间中导出的符号
===============================
为了使用被导出到命名空间的符号,内核模块需要明确地导入这些命名空间。
否则内核可能会拒绝加载该模块。模块代码需要使用宏MODULE_IMPORT_NS来
表示它所使用的命名空间的符号。例如,一个使用usb_stor_suspend符号的
模块,需要使用如下语句导入命名空间USB_STORAGE::
MODULE_IMPORT_NS(USB_STORAGE);
这将在模块中为每个导入的命名空间创建一个 ``modinfo`` 标签。这也顺带
使得可以用modinfo检查模块已导入的命名空间::
$ modinfo drivers/usb/storage/ums-karma.ko
[...]
import_ns: USB_STORAGE
[...]
建议将 MODULE_IMPORT_NS() 语句添加到靠近其他模块元数据定义的地方,
如 MODULE_AUTHOR() 或 MODULE_LICENSE() 。关于自动创建缺失的导入
语句的方法,请参考第5节。
4. 加载使用命名空间符号的模块
=============================
在模块加载时(比如 ``insmod`` ),内核将检查每个从模块中引用的符号是否可
用,以及它可能被导出到的名字空间是否被模块导入。内核的默认行为是拒绝
加载那些没有指明足够的命名空间导入的模块。此错误会被记录下来,并且加载将以
EINVAL方式失败。要允许加载不满足这个前提条件的模块,可以使用此配置选项:
设置 MODULE_ALLOW_MISSING_NAMESPACE_IMPORTS=y 将使加载不受影响,但会
发出警告。
5. 自动创建MODULE_IMPORT_NS声明
===============================
缺少命名空间的导入可以在构建时很容易被检测到。事实上,如果一个模块
使用了一个命名空间的符号而没有导入它,modpost会发出警告。
MODULE_IMPORT_NS()语句通常会被添加到一个明确的位置(和其他模块元
数据一起)。为了使模块作者(和子系统维护者)的生活更加轻松,我们提
供了一个脚本和make目标来修复丢失的导入。修复丢失的导入可以用::
$ make nsdeps
对模块作者来说,以下情况可能很典型::
- 编写依赖未导入命名空间的符号的代码
- ``make``
- 注意 ``modpost`` 的警告,提醒你有一个丢失的导入。
- 运行 ``make nsdeps``将导入添加到正确的代码位置。
对于引入命名空间的子系统维护者来说,其步骤非常相似。同样,make nsdeps最终将
为树内模块添加缺失的命名空间导入::
- 向命名空间转移或添加符号(例如,使用EXPORT_SYMBOL_NS())。
- ``make`` (最好是用allmodconfig来覆盖所有的内核模块)。
- 注意 ``modpost`` 的警告,提醒你有一个丢失的导入。
- 运行 ``make nsdeps`` 将导入添加到正确的代码位置。
你也可以为外部模块的构建运行nsdeps。典型的用法是::
$ make -C <path_to_kernel_src> M=$PWD nsdeps
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/core-api/workqueue.rst
:Translator: Yanteng Si <siyanteng@loongson.cn>
.. _cn_workqueue.rst:
=========================
并发管理的工作队列 (cmwq)
=========================
:日期: September, 2010
:作者: Tejun Heo <tj@kernel.org>
:作者: Florian Mickler <florian@mickler.org>
简介
====
在很多情况下,需要一个异步进程的执行环境,工作队列(wq)API是这种情况下
最常用的机制。
当需要这样一个异步执行上下文时,一个描述将要执行的函数的工作项(work,
即一个待执行的任务)被放在队列中。一个独立的线程作为异步执行环境。该队
列被称为workqueue,线程被称为工作者(worker,即执行这一队列的线程)。
当工作队列上有工作项时,工作者会一个接一个地执行与工作项相关的函数。当
工作队列中没有任何工作项时,工作者就会变得空闲。当一个新的工作项被排入
队列时,工作者又开始执行。
为什么要cmwq?
=============
在最初的wq实现中,多线程(MT)wq在每个CPU上有一个工作者线程,而单线程
(ST)wq在全系统有一个工作者线程。一个MT wq需要保持与CPU数量相同的工
作者数量。这些年来,内核增加了很多MT wq的用户,随着CPU核心数量的不断
增加,一些系统刚启动就达到了默认的32k PID的饱和空间。
尽管MT wq浪费了大量的资源,但所提供的并发性水平却不能令人满意。这个限
制在ST和MT wq中都有,只是在MT中没有那么严重。每个wq都保持着自己独立的
工作者池。一个MT wq只能为每个CPU提供一个执行环境,而一个ST wq则为整个
系统提供一个。工作项必须竞争这些非常有限的执行上下文,从而导致各种问题,
包括在单一执行上下文周围容易发生死锁。
(MT wq)所提供的并发性水平和资源使用之间的矛盾也迫使其用户做出不必要的权衡,比
如libata选择使用ST wq来轮询PIO,并接受一个不必要的限制,即没有两个轮
询PIO可以同时进行。由于MT wq并没有提供更好的并发性,需要更高层次的并
发性的用户,如async或fscache,不得不实现他们自己的线程池。
并发管理工作队列(cmwq)是对wq的重新实现,重点是以下目标。
* 保持与原始工作队列API的兼容性。
* 使用由所有wq共享的每CPU统一的工作者池,在不浪费大量资源的情况下按
  需提供灵活的并发水平。
* 自动调节工作者池和并发水平,使API用户不需要担心这些细节。
设计
====
为了简化函数的异步执行,引入了一个新的抽象概念,即工作项。
一个工作项是一个简单的结构,它持有一个指向将被异步执行的函数的指针。
每当一个驱动程序或子系统希望一个函数被异步执行时,它必须建立一个指
向该函数的工作项,并在工作队列中排队等待该工作项。(就是挂到workqueue
队列里面去)
特定目的线程,称为工作线程(工作者),一个接一个地执行队列中的功能。
如果没有工作项排队,工作者线程就会闲置。这些工作者线程被管理在所谓
的工作者池中。
cmwq设计区分了面向用户的工作队列,子系统和驱动程序在上面排队工作,
以及管理工作者池和处理排队工作项的后端机制。
每个可能的CPU都有两个工作者池,一个用于正常的工作项,另一个用于高
优先级的工作项,还有一些额外的工作者池,用于服务未绑定工作队列的工
作项目——这些后备池的数量是动态的。
当他们认为合适的时候,子系统和驱动程序可以通过特殊的
``workqueue API`` 函数创建和排队工作项。他们可以通过在工作队列上
设置标志来影响工作项执行方式的某些方面,他们把工作项放在那里。这些
标志包括诸如CPU定位、并发限制、优先级等等。要获得详细的概述,请参
考下面的 ``alloc_workqueue()`` 的 API 描述。
当一个工作项被排入一个工作队列时,目标工作池将根据队列参数和工作队
列属性确定,并被附加到工作池的共享工作列表上。例如,除非特别重写,
否则一个绑定的工作队列的工作项将被排在与发起线程运行的CPU相关的普通
或高优先级工作者池的工作项列表中。
对于任何工作者池的实施,管理并发水平(有多少执行上下文处于活动状
态)是一个重要问题。最低水平是为了节省资源,而饱和水平是指系统被
充分使用。
每个与实际CPU绑定的worker-pool通过钩住调度器来实现并发管理。每当
一个活动的工作者被唤醒或睡眠时,工作者池就会得到通知,并跟踪当前可
运行的工作者的数量。一般来说,工作项不会占用CPU并消耗很多周期。这
意味着保持足够的并发性以防止工作处理停滞应该是最优的。只要CPU上有
一个或多个可运行的工作者,工作者池就不会开始执行新的工作,但是,当
最后一个运行的工作者进入睡眠状态时,它会立即安排一个新的工作者,这
样CPU就不会在有待处理的工作项目时闲置。这允许在不损失执行带宽的情
况下使用最少的工作者。
除了kthreads的内存空间外,保留空闲的工作者并没有其他成本,所以cmwq
在杀死它们之前会保留一段时间的空闲。
对于非绑定的工作队列,后备池的数量是动态的。可以使用
``apply_workqueue_attrs()`` 为非绑定工作队列分配自定义属性,
workqueue将自动创建与属性相匹配的后备工作者池。调节并发水平的责任在
用户身上。也有一个标志可以将绑定的wq标记为忽略并发管理。
详情请参考API部分。
前进进度的保证依赖于当需要更多的执行上下文时可以创建工作者,这也是
通过使用救援工作者来保证的。所有可能在处理内存回收的代码路径上使用
的工作项都需要在wq上排队,wq上保留了一个救援工作者,以便在内存有压
力的情况下执行。否则,工作者池就有可能出现死锁,等待执行上下文释
放出来。
应用程序编程接口 (API)
======================
``alloc_workqueue()`` 分配了一个wq。原来的 ``create_*workqueue()``
函数已被废弃,并计划删除。 ``alloc_workqueue()`` 需要三个
参数 - ``@name`` , ``@flags`` 和 ``@max_active`` 。
``@name`` 是wq的名称,如果有的话,也用作救援线程的名称。
一个wq不再管理执行资源,而是作为前进进度保证、刷新(flush)和
工作项属性的域。 ``@flags`` 和 ``@max_active`` 控制着工作
项如何被分配执行资源、安排和执行。
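一个典型的用法示意如下(wq名称、结构体与处理函数均为假设; ``@flags`` 和 ``@max_active`` 取0即采用默认行为,详见下文)::

    #include <linux/workqueue.h>
    #include <linux/slab.h>

    static struct workqueue_struct *my_wq;

    struct my_work {
            struct work_struct work;
            int data;
    };

    static void my_work_fn(struct work_struct *work)
    {
            struct my_work *mw = container_of(work, struct my_work, work);

            pr_info("handling %d\n", mw->data);
            kfree(mw);
    }

    static int submit_one(int data)
    {
            struct my_work *mw = kzalloc(sizeof(*mw), GFP_KERNEL);

            if (!mw)
                    return -ENOMEM;

            mw->data = data;
            INIT_WORK(&mw->work, my_work_fn);
            queue_work(my_wq, &mw->work);
            return 0;
    }

    static int __init my_init(void)
    {
            my_wq = alloc_workqueue("my_wq", 0, 0);
            return my_wq ? 0 : -ENOMEM;
    }
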
``flags``
---------
``WQ_UNBOUND``
排队到非绑定wq的工作项由特殊的工作者池提供服务,这些工作者不
绑定在任何特定的CPU上。这使得wq表现得像一个简单的执行环境提
供者,没有并发管理。非绑定工作者池试图尽快开始执行工作项。非
绑定的wq牺牲了局部性,但在以下情况下是有用的。
* 预计并发水平要求会有很大的波动,使用绑定的wq最终可能会在不
同的CPU上产生大量大部分未使用的工作者,因为发起线程在不同
的CPU上跳转。
* 长期运行的CPU密集型工作负载,可以由系统调度器更好地管理。
``WQ_FREEZABLE``
一个可冻结的wq参与了系统暂停操作的冻结阶段。wq上的工作项被
排空,在解冻之前没有新的工作项开始执行。
``WQ_MEM_RECLAIM``
所有可能在内存回收路径中使用的wq都必须设置这个标志。无论内
存压力如何,wq都能保证至少有一个执行上下文。
``WQ_HIGHPRI``
高优先级wq的工作项目被排到目标cpu的高优先级工作者池中。高
优先级的工作者池由具有较高级别的工作者线程提供服务。
请注意,普通工作者池和高优先级工作者池之间并不相互影响。他
们各自维护其独立的工作者池,并在其工作者之间实现并发管理。
``WQ_CPU_INTENSIVE``
CPU密集型wq的工作项对并发水平没有贡献。换句话说,可运行的
CPU密集型工作项不会阻止同一工作者池中的其他工作项开始执行。
这对于那些预计会占用CPU周期的绑定工作项很有用,这样它们的
执行就会受到系统调度器的监管。
尽管CPU密集型工作项不会对并发水平做出贡献,但它们的执行开
始仍然受到并发管理的管制,可运行的非CPU密集型工作项会延迟
CPU密集型工作项的执行。
这个标志对于未绑定的wq来说是没有意义的。
请注意,标志 ``WQ_NON_REENTRANT`` 不再存在,因为现在所有的工作
队列都是不可逆的——任何工作项都保证在任何时间内最多被整个系统的一
个工作者执行。
``max_active``
--------------
``@max_active`` 决定了每个CPU可以分配给wq的工作项的最大执行上
下文数量。例如,如果 ``@max_active为16`` ,每个CPU最多可以同
时执行16个wq的工作项。
目前,对于一个绑定的wq, ``@max_active`` 的最大限制是512,当指
定为0时使用的默认值是256。对于非绑定的wq,其限制是512和
4 * ``num_possible_cpus()`` 中的较高值。这些值被选得足够高,所
以它们不是限制性因素,同时会在失控情况下提供保护。
一个wq的活动工作项的数量通常由wq的用户来调节,更具体地说,是由用
户在同一时间可以排列多少个工作项来调节。除非有特定的需求来控制活动
工作项的数量,否则建议指定为0。
一些用户依赖于ST wq的严格执行顺序。 ``@max_active`` 为1和 ``WQ_UNBOUND``
的组合用来实现这种行为。这种wq上的工作项目总是被排到未绑定的工作池
中,并且在任何时候都只有一个工作项目处于活动状态,从而实现与ST wq相
同的排序属性。
在目前的实现中,上述配置只保证了特定NUMA节点内的ST行为。相反,
``alloc_ordered_queue()`` 应该被用来实现全系统的ST行为。
执行场景示例
============
下面的示例执行场景试图说明cmwq在不同配置下的行为。
工作项w0、w1、w2被排到同一个CPU上的一个绑定的wq q0上。w0
消耗CPU 5ms,然后睡眠10ms,然后在完成之前再次消耗CPU 5ms。
忽略所有其他的任务、工作和处理开销,并假设简单的FIFO调度,
下面是一个高度简化的原始wq的可能事件序列的版本。::
TIME IN MSECS EVENT
0 w0 starts and burns CPU
5 w0 sleeps
15 w0 wakes up and burns CPU
20 w0 finishes
20 w1 starts and burns CPU
25 w1 sleeps
35 w1 wakes up and finishes
35 w2 starts and burns CPU
40 w2 sleeps
50 w2 wakes up and finishes
而当使用cmwq且 ``@max_active`` >= 3 时, ::
TIME IN MSECS EVENT
0 w0 starts and burns CPU
5 w0 sleeps
5 w1 starts and burns CPU
10 w1 sleeps
10 w2 starts and burns CPU
15 w2 sleeps
15 w0 wakes up and burns CPU
20 w0 finishes
20 w1 wakes up and finishes
25 w2 wakes up and finishes
如果 ``@max_active`` == 2, ::
TIME IN MSECS EVENT
0 w0 starts and burns CPU
5 w0 sleeps
5 w1 starts and burns CPU
10 w1 sleeps
15 w0 wakes up and burns CPU
20 w0 finishes
20 w1 wakes up and finishes
20 w2 starts and burns CPU
25 w2 sleeps
35 w2 wakes up and finishes
现在,我们假设w1和w2被排到了不同的wq q1上,这个wq q1
有 ``WQ_CPU_INTENSIVE`` 设置::
TIME IN MSECS EVENT
0 w0 starts and burns CPU
5 w0 sleeps
5 w1 and w2 start and burn CPU
10 w1 sleeps
15 w2 sleeps
15 w0 wakes up and burns CPU
20 w0 finishes
20 w1 wakes up and finishes
25 w2 wakes up and finishes
指南
====
* 如果一个wq可能处理在内存回收期间使用的工作项目,请不
要忘记使用 ``WQ_MEM_RECLAIM`` 。每个设置了
``WQ_MEM_RECLAIM`` 的wq都有一个为其保留的执行环境。
如果在内存回收过程中使用的多个工作项之间存在依赖关系,
它们应该被排在不同的wq中,每个wq都有 ``WQ_MEM_RECLAIM`` 。
* 除非需要严格排序,否则没有必要使用ST wq。
* 除非有特殊需要,建议使用0作为@max_active。在大多数使用情
况下,并发水平通常保持在默认限制之下。
* 一个wq作为前进进度保证(WQ_MEM_RECLAIM)、冲洗(flush)和工
作项属性的域。不涉及内存回收的工作项,不需要作为工作项组的一
部分被刷新,也不需要任何特殊属性,可以使用系统中的一个wq。使
用专用wq和系统wq在执行特性上没有区别。
* 除非工作项预计会消耗大量的CPU周期,否则使用绑定的wq通常是有
益的,因为wq操作和工作项执行中的定位水平提高了。
调试
====
因为工作函数是由通用的工作者线程执行的,所以需要一些手段来揭示一些行为不端的工作队列用户。
工作者线程在进程列表中显示为: ::
root 5671 0.0 0.0 0 0 ? S 12:07 0:00 [kworker/0:1]
root 5672 0.0 0.0 0 0 ? S 12:07 0:00 [kworker/1:2]
root 5673 0.0 0.0 0 0 ? S 12:12 0:00 [kworker/0:0]
root 5674 0.0 0.0 0 0 ? S 12:13 0:00 [kworker/1:0]
如果kworkers失控了(使用了太多的cpu),有两类可能的问题:
1. 正在迅速调度的事情
2. 一个消耗大量cpu周期的工作项。
第一个可以用追踪的方式进行跟踪: ::
$ echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace_pipe > out.txt
(wait a few secs)
如果有什么东西在工作队列上忙着做循环,它就会主导输出,可以用工作项函数确定违规者。
对于第二类问题,应该可以只检查违规工作者线程的堆栈跟踪。 ::
$ cat /proc/THE_OFFENDING_KWORKER/stack
工作项的函数在堆栈追踪中应该是一眼就能看出来的。
内核内联文档参考
================
该API在以下内核代码中:
include/linux/workqueue.h
kernel/workqueue.c
......@@ -19,13 +19,13 @@
:maxdepth: 2
gcov
kasan
Todolist:
- coccinelle
- sparse
- kcov
- kasan
- ubsan
- kmemleak
- kcsan
......
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/dev-tools/kasan.rst
:Translator: 万家兵 Wan Jiabing <wanjiabing@vivo.com>
内核地址消毒剂(KASAN)
=====================
概述
----
KernelAddressSANitizer(KASAN)是一种动态内存安全错误检测工具,主要功能是
检查内存越界访问和使用已释放内存的问题。KASAN有三种模式:
1. 通用KASAN(与用户空间的ASan类似)
2. 基于软件标签的KASAN(与用户空间的HWASan类似)
3. 基于硬件标签的KASAN(基于硬件内存标签)
由于通用KASAN的内存开销较大,通用KASAN主要用于调试。基于软件标签的KASAN
可用于dogfood测试,因为它具有较低的内存开销,并允许将其用于真实的工作负载。
基于硬件标签的KASAN具有较低的内存和性能开销,因此可用于生产。同时可用于
检测现场内存问题或作为安全缓解措施。
软件KASAN模式(#1和#2)使用编译时工具在每次内存访问之前插入有效性检查,
因此需要一个支持它的编译器版本。
通用KASAN在GCC和Clang中均受支持。GCC需要8.3.0或更高版本。任何受支持的Clang
版本都是兼容的,但从Clang 11才开始支持检测全局变量的越界访问。
基于软件标签的KASAN模式仅在Clang中受支持。
硬件KASAN模式(#3)依赖硬件来执行检查,但仍需要支持内存标签指令的编译器
版本。GCC 10+和Clang 11+支持此模式。
两种软件KASAN模式都适用于SLUB和SLAB内存分配器,而基于硬件标签的KASAN目前
仅支持SLUB。
目前x86_64、arm、arm64、xtensa、s390、riscv架构支持通用KASAN模式,仅
arm64架构支持基于标签的KASAN模式。
用法
----
要启用KASAN,请使用以下命令配置内核::
CONFIG_KASAN=y
同时在 ``CONFIG_KASAN_GENERIC`` (启用通用KASAN模式), ``CONFIG_KASAN_SW_TAGS``
(启用基于软件标签的KASAN模式),和 ``CONFIG_KASAN_HW_TAGS`` (启用基于硬件标签
的KASAN模式)之间进行选择。
对于软件模式,还可以在 ``CONFIG_KASAN_OUTLINE`` 和 ``CONFIG_KASAN_INLINE``
之间进行选择。outline和inline是编译器插桩类型。前者产生较小的二进制文件,
而后者快1.1-2倍。
要将受影响的slab对象的alloc和free堆栈跟踪包含到报告中,请启用
``CONFIG_STACKTRACE`` 。要包括受影响物理页面的分配和释放堆栈跟踪的话,
请启用 ``CONFIG_PAGE_OWNER`` 并使用 ``page_owner=on`` 进行引导。
错误报告
~~~~~~~~
典型的KASAN报告如下所示::
==================================================================
BUG: KASAN: slab-out-of-bounds in kmalloc_oob_right+0xa8/0xbc [test_kasan]
Write of size 1 at addr ffff8801f44ec37b by task insmod/2760
CPU: 1 PID: 2760 Comm: insmod Not tainted 4.19.0-rc3+ #698
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
dump_stack+0x94/0xd8
print_address_description+0x73/0x280
kasan_report+0x144/0x187
__asan_report_store1_noabort+0x17/0x20
kmalloc_oob_right+0xa8/0xbc [test_kasan]
kmalloc_tests_init+0x16/0x700 [test_kasan]
do_one_initcall+0xa5/0x3ae
do_init_module+0x1b6/0x547
load_module+0x75df/0x8070
__do_sys_init_module+0x1c6/0x200
__x64_sys_init_module+0x6e/0xb0
do_syscall_64+0x9f/0x2c0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f96443109da
RSP: 002b:00007ffcf0b51b08 EFLAGS: 00000202 ORIG_RAX: 00000000000000af
RAX: ffffffffffffffda RBX: 000055dc3ee521a0 RCX: 00007f96443109da
RDX: 00007f96445cff88 RSI: 0000000000057a50 RDI: 00007f9644992000
RBP: 000055dc3ee510b0 R08: 0000000000000003 R09: 0000000000000000
R10: 00007f964430cd0a R11: 0000000000000202 R12: 00007f96445cff88
R13: 000055dc3ee51090 R14: 0000000000000000 R15: 0000000000000000
Allocated by task 2760:
save_stack+0x43/0xd0
kasan_kmalloc+0xa7/0xd0
kmem_cache_alloc_trace+0xe1/0x1b0
kmalloc_oob_right+0x56/0xbc [test_kasan]
kmalloc_tests_init+0x16/0x700 [test_kasan]
do_one_initcall+0xa5/0x3ae
do_init_module+0x1b6/0x547
load_module+0x75df/0x8070
__do_sys_init_module+0x1c6/0x200
__x64_sys_init_module+0x6e/0xb0
do_syscall_64+0x9f/0x2c0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Freed by task 815:
save_stack+0x43/0xd0
__kasan_slab_free+0x135/0x190
kasan_slab_free+0xe/0x10
kfree+0x93/0x1a0
umh_complete+0x6a/0xa0
call_usermodehelper_exec_async+0x4c3/0x640
ret_from_fork+0x35/0x40
The buggy address belongs to the object at ffff8801f44ec300
which belongs to the cache kmalloc-128 of size 128
The buggy address is located 123 bytes inside of
128-byte region [ffff8801f44ec300, ffff8801f44ec380)
The buggy address belongs to the page:
page:ffffea0007d13b00 count:1 mapcount:0 mapping:ffff8801f7001640 index:0x0
flags: 0x200000000000100(slab)
raw: 0200000000000100 ffffea0007d11dc0 0000001a0000001a ffff8801f7001640
raw: 0000000000000000 0000000080150015 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff8801f44ec200: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
ffff8801f44ec280: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>ffff8801f44ec300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 03
^
ffff8801f44ec380: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
ffff8801f44ec400: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
==================================================================
报告标题总结了发生的错误类型以及导致该错误的访问类型。紧随其后的是错误访问的
堆栈跟踪、所访问内存分配位置的堆栈跟踪(对于访问了slab对象的情况)以及对象
被释放的位置的堆栈跟踪(对于访问已释放内存的问题报告)。接下来是对访问的
slab对象的描述以及关于访问的内存页的信息。
最后,报告展示了访问地址周围的内存状态。在内部,KASAN单独跟踪每个内存颗粒的
内存状态,根据KASAN模式分为8或16个对齐字节。报告的内存状态部分中的每个数字
都显示了围绕访问地址的其中一个内存颗粒的状态。
对于通用KASAN,每个内存颗粒的大小为8个字节。每个颗粒的状态被编码在一个影子字节
中。这8个字节可以是可访问的,部分访问的,已释放的或成为Redzone的一部分。KASAN
对每个影子字节使用以下编码:00表示对应内存区域的所有8个字节都可以访问;数字N
(1 <= N <= 7)表示前N个字节可访问,其他(8 - N)个字节不可访问;任何负值都表示
无法访问整个8字节。KASAN使用不同的负值来区分不同类型的不可访问内存,如redzones
或已释放的内存(参见 mm/kasan/kasan.h)。
在上面的报告中,箭头指向影子字节 ``03`` ,表示访问的地址是部分可访问的。
对于基于标签的KASAN模式,报告最后的部分显示了访问地址周围的内存标签
(参考 `实施细则`_ 章节)。
请注意,KASAN错误标题(如 ``slab-out-of-bounds`` 或 ``use-after-free`` )
是尽量接近的:KASAN根据其拥有的有限信息打印出最可能的错误类型。错误的实际类型
可能会有所不同。
通用KASAN还报告两个辅助调用堆栈跟踪。这些堆栈跟踪指向代码中与对象交互但不直接
出现在错误访问堆栈跟踪中的位置。目前,这包括 call_rcu() 和排队的工作队列。
启动参数
~~~~~~~~
KASAN受通用 ``panic_on_warn`` 命令行参数的影响。启用该功能后,KASAN在打印错误
报告后会引起内核恐慌。
默认情况下,KASAN只为第一次无效内存访问打印错误报告。使用 ``kasan_multi_shot`` ,
KASAN会针对每个无效访问打印报告。这有效地禁用了KASAN报告的 ``panic_on_warn`` 。
基于硬件标签的KASAN模式(请参阅下面有关各种模式的部分)旨在在生产中用作安全缓解
措施。因此,它支持允许禁用KASAN或控制其功能的引导参数。
- ``kasan=off`` 或 ``=on`` 控制KASAN是否启用 (默认: ``on`` )。
- ``kasan.mode=sync`` 或 ``=async`` 控制KASAN是否配置为同步或异步执行模式(默认:
``sync`` )。同步模式:当标签检查错误发生时,立即检测到错误访问。异步模式:
延迟错误访问检测。当标签检查错误发生时,信息存储在硬件中(在arm64的
TFSR_EL1寄存器中)。内核会定期检查硬件,并且仅在这些检查期间报告标签错误。
- ``kasan.stacktrace=off`` 或 ``=on`` 禁用或启用alloc和free堆栈跟踪收集
(默认: ``on`` )。
- ``kasan.fault=report`` 或 ``=panic`` 控制是只打印KASAN报告还是同时使内核恐慌
(默认: ``report`` )。即使启用了 ``kasan_multi_shot`` ,也会发生内核恐慌。
实施细则
--------
通用KASAN
~~~~~~~~~
软件KASAN模式使用影子内存来记录每个内存字节是否可以安全访问,并使用编译时工具
在每次内存访问之前插入影子内存检查。
通用KASAN将1/8的内核内存专用于其影子内存(16TB以覆盖x86_64上的128TB),并使用
具有比例和偏移量的直接映射将内存地址转换为其相应的影子地址。
这是将地址转换为其相应影子地址的函数::
static inline void *kasan_mem_to_shadow(const void *addr)
{
return (void *)((unsigned long)addr >> KASAN_SHADOW_SCALE_SHIFT)
+ KASAN_SHADOW_OFFSET;
}
在这里 ``KASAN_SHADOW_SCALE_SHIFT = 3`` 。
编译时工具用于插入内存访问检查。编译器在每次访问大小为1、2、4、8或16的内存之前
插入函数调用( ``__asan_load*(addr)`` , ``__asan_store*(addr)``)。这些函数通过
检查相应的影子内存来检查内存访问是否有效。
使用inline插桩,编译器不进行函数调用,而是直接插入代码来检查影子内存。此选项
显著地增大了内核体积,但与outline插桩内核相比,它提供了x1.1-x2的性能提升。
通用KASAN是唯一一种通过隔离延迟重新使用已释放对象的模式
(参见 mm/kasan/quarantine.c 以了解实现)。
基于软件标签的KASAN模式
~~~~~~~~~~~~~~~~~~~~~~~
基于软件标签的KASAN使用软件内存标签方法来检查访问有效性。目前仅针对arm64架构实现。
基于软件标签的KASAN使用arm64 CPU的顶部字节忽略(TBI)特性在内核指针的顶部字节中
存储一个指针标签。它使用影子内存来存储与每个16字节内存单元相关的内存标签(因此,
它将内核内存的1/16专用于影子内存)。
在每次内存分配时,基于软件标签的KASAN都会生成一个随机标签,用这个标签标记分配
的内存,并将相同的标签嵌入到返回的指针中。
基于软件标签的KASAN使用编译时工具在每次内存访问之前插入检查。这些检查确保正在
访问的内存的标签等于用于访问该内存的指针的标签。如果标签不匹配,基于软件标签
的KASAN会打印错误报告。
基于软件标签的KASAN也有两种插桩模式(outline,发出回调来检查内存访问;inline,
执行内联的影子内存检查)。使用outline插桩模式,会从执行访问检查的函数打印错误
报告。使用inline插桩,编译器会发出 ``brk`` 指令,并使用专用的 ``brk`` 处理程序
来打印错误报告。
基于软件标签的KASAN使用0xFF作为匹配所有指针标签(不检查通过带有0xFF指针标签
的指针进行的访问)。值0xFE当前保留用于标记已释放的内存区域。
基于软件标签的KASAN目前仅支持对Slab和page_alloc内存进行标记。
基于硬件标签的KASAN模式
~~~~~~~~~~~~~~~~~~~~~~~
基于硬件标签的KASAN在概念上类似于软件模式,但它是使用硬件内存标签作为支持而
不是编译器插桩和影子内存。
基于硬件标签的KASAN目前仅针对arm64架构实现,并且基于ARMv8.5指令集架构中引入
的arm64内存标记扩展(MTE)和最高字节忽略(TBI)。
特殊的arm64指令用于为每次内存分配指定内存标签。相同的标签被指定给指向这些分配
的指针。在每次内存访问时,硬件确保正在访问的内存的标签等于用于访问该内存的指针
的标签。如果标签不匹配,则会生成故障并打印报告。
基于硬件标签的KASAN使用0xFF作为匹配所有指针标签(不检查通过带有0xFF指针标签的
指针进行的访问)。值0xFE当前保留用于标记已释放的内存区域。
基于硬件标签的KASAN目前仅支持对Slab和page_alloc内存进行标记。
如果硬件不支持MTE(ARMv8.5之前),则不会启用基于硬件标签的KASAN。在这种情况下,
所有KASAN引导参数都将被忽略。
请注意,启用CONFIG_KASAN_HW_TAGS始终会导致启用内核中的TBI。即使提供了
``kasan.mode=off`` 或硬件不支持MTE(但支持TBI)。
基于硬件标签的KASAN只报告第一个发现的错误。之后,MTE标签检查将被禁用。
影子内存
--------
内核将内存映射到地址空间的几个不同部分。内核虚拟地址的范围很大:没有足够的真实
内存来支持内核可以访问的每个地址的真实影子区域。因此,KASAN只为地址空间的某些
部分映射真实的影子。
默认行为
~~~~~~~~
默认情况下,体系结构仅将实际内存映射到用于线性映射的阴影区域(以及可能的其他
小区域)。对于所有其他区域 —— 例如vmalloc和vmemmap空间 —— 一个只读页面被映射
到阴影区域上。这个只读的影子页面声明所有内存访问都是允许的。
这给模块带来了一个问题:它们不存在于线性映射中,而是存在于专用的模块空间中。
通过连接模块分配器,KASAN临时映射真实的影子内存以覆盖它们。例如,这允许检测
对模块全局变量的无效访问。
这也造成了与 ``VMAP_STACK`` 的不兼容:如果堆栈位于vmalloc空间中,它将被分配
只读页面的影子内存,并且内核在尝试为堆栈变量设置影子数据时会出错。
CONFIG_KASAN_VMALLOC
~~~~~~~~~~~~~~~~~~~~
使用 ``CONFIG_KASAN_VMALLOC`` ,KASAN可以以更大的内存使用为代价覆盖vmalloc
空间。目前,这在x86、riscv、s390和powerpc上受支持。
这通过连接到vmalloc和vmap并动态分配真实的影子内存来支持映射。
vmalloc空间中的大多数映射都很小,需要不到一整页的阴影空间。因此,为每个映射
分配一个完整的影子页面将是一种浪费。此外,为了确保不同的映射使用不同的影子
页面,映射必须与 ``KASAN_GRANULE_SIZE * PAGE_SIZE`` 对齐。
相反,KASAN跨多个映射共享后备空间。当vmalloc空间中的映射使用影子区域的特定
页面时,它会分配一个后备页面。此页面稍后可以由其他vmalloc映射共享。
KASAN连接到vmap基础架构以懒清理未使用的影子内存。
为了避免交换映射的困难,KASAN预期覆盖vmalloc空间的那部分阴影区域不会被早期
的阴影页面覆盖,而是保持未映射状态。这将需要更改特定于arch的代码。
这允许在x86上支持 ``VMAP_STACK`` ,并且可以简化对没有固定模块区域的架构的支持。
对于开发者
----------
忽略访问
~~~~~~~~
软件KASAN模式使用编译器插桩来插入有效性检查。此类检测可能与内核的某些部分
不兼容,因此需要禁用。
内核的其他部分可能会访问已分配对象的元数据。通常,KASAN会检测并报告此类访问,
但在某些情况下(例如,在内存分配器中),这些访问是有效的。
对于软件KASAN模式,要禁用特定文件或目录的检测,请将 ``KASAN_SANITIZE`` 添加
到相应的内核Makefile中:
- 对于单个文件(例如,main.o)::
KASAN_SANITIZE_main.o := n
- 对于一个目录下的所有文件::
KASAN_SANITIZE := n
对于软件KASAN模式,要在每个函数的基础上禁用检测,请使用KASAN特定的
``__no_sanitize_address`` 函数属性或通用的 ``noinstr`` 。
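例如(函数名为假设)::

    /* 该函数内的直接内存访问不会被软件KASAN插桩检查 */
    static void __no_sanitize_address scan_raw_metadata(struct page *page)
    {
            /* 例如在分配器内部合法地访问对象的元数据 */
    }
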
请注意,禁用编译器插桩(基于每个文件或每个函数)会使KASAN忽略在软件KASAN模式
的代码中直接发生的访问。当访问是间接发生的(通过调用检测函数)或使用没有编译器
插桩的基于硬件标签的模式时,它没有帮助。
对于软件KASAN模式,要在当前任务的一部分内核代码中禁用KASAN报告,请使用
``kasan_disable_current()``/``kasan_enable_current()`` 部分注释这部分代码。
这也会禁用通过函数调用发生的间接访问的报告。
对于基于标签的KASAN模式(包括硬件模式),要禁用访问检查,请使用
``kasan_reset_tag()`` 或 ``page_kasan_tag_reset()`` 。请注意,通过
``page_kasan_tag_reset()`` 临时禁用访问检查需要通过 ``page_kasan_tag``
/ ``page_kasan_tag_set`` 保存和恢复每页KASAN标签。
测试
~~~~
有一些KASAN测试可以验证KASAN是否正常工作并可以检测某些类型的内存损坏。
测试由两部分组成:
1. 与KUnit测试框架集成的测试。使用 ``CONFIG_KASAN_KUNIT_TEST`` 启用。
这些测试可以通过几种不同的方式自动运行和部分验证;请参阅下面的说明。
2. 与KUnit不兼容的测试。使用 ``CONFIG_KASAN_MODULE_TEST`` 启用并且只能作为模块
运行。这些测试只能通过加载内核模块并检查内核日志以获取KASAN报告来手动验证。
如果检测到错误,每个KUnit兼容的KASAN测试都会打印多个KASAN报告之一,然后测试打印
其编号和状态。
当测试通过::
ok 28 - kmalloc_double_kzfree
当由于 ``kmalloc`` 失败而导致测试失败时::
# kmalloc_large_oob_right: ASSERTION FAILED at lib/test_kasan.c:163
Expected ptr is not null, but is
not ok 4 - kmalloc_large_oob_right
当由于缺少KASAN报告而导致测试失败时::
# kmalloc_double_kzfree: EXPECTATION FAILED at lib/test_kasan.c:629
Expected kasan_data->report_expected == kasan_data->report_found, but
kasan_data->report_expected == 1
kasan_data->report_found == 0
not ok 28 - kmalloc_double_kzfree
最后打印所有KASAN测试的累积状态。成功::
ok 1 - kasan
或者,如果其中一项测试失败::
not ok 1 - kasan
有几种方法可以运行与KUnit兼容的KASAN测试。
1. 可加载模块
启用 ``CONFIG_KUNIT`` 后,KASAN-KUnit测试可以构建为可加载模块,并通过使用
``insmod`` 或 ``modprobe`` 加载 ``test_kasan.ko`` 来运行。
2. 内置
通过内置 ``CONFIG_KUNIT`` ,也可以内置KASAN-KUnit测试。在这种情况下,
测试将在启动时作为后期初始化调用运行。
3. 使用kunit_tool
通过内置 ``CONFIG_KUNIT`` 和 ``CONFIG_KASAN_KUNIT_TEST`` ,还可以使用
``kunit_tool`` 以更易读的方式查看KUnit测试结果。这不会打印通过测试
的KASAN报告。有关 ``kunit_tool`` 更多最新信息,请参阅
`KUnit文档 <https://www.kernel.org/doc/html/latest/dev-tools/kunit/index.html>`_ 。
.. _KUnit: https://www.kernel.org/doc/html/latest/dev-tools/kunit/index.html
......@@ -4,6 +4,7 @@
\renewcommand\thesection*
\renewcommand\thesubsection*
\kerneldocCJKon
.. _linux_doc_zh:
......@@ -72,11 +73,11 @@ TODOlist:
dev-tools/index
doc-guide/index
kernel-hacking/index
maintainer/index
TODOList:
* trace/index
* maintainer/index
* fault-injection/index
* livepatch/index
* rust/index
......@@ -153,6 +154,7 @@ TODOList:
arm64/index
riscv/index
openrisc/index
parisc/index
TODOList:
......@@ -160,7 +162,6 @@ TODOList:
* ia64/index
* m68k/index
* nios2/index
* parisc/index
* powerpc/index
* s390/index
* sh/index
......
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/maintainer/configure-git.rst
:译者:
吴想成 Wu XiangCheng <bobwxc@email.cn>
.. _configuregit_zh:
Git配置
=======
本章讲述了维护者级别的git配置。
Documentation/maintainer/pull-requests.rst 中使用的标记分支应使用开发人员的
GPG公钥进行签名。可以通过将 ``-u`` 标志传递给 ``git tag`` 来创建签名标记。
但是,由于 *通常* 对同一项目使用同一个密钥,因此可以设置::
git config user.signingkey "keyname"
或者手动编辑你的 ``.git/config`` 或 ``~/.gitconfig`` 文件::
[user]
name = Jane Developer
email = jd@domain.org
signingkey = jd@domain.org
你可能需要告诉 ``git`` 去使用 ``gpg2``::
[gpg]
program = /path/to/gpg2
你可能也需要告诉 ``gpg`` 去使用哪个 ``tty`` (添加到你的shell rc文件中)::
export GPG_TTY=$(tty)
创建链接到lore.kernel.org的提交
-------------------------------
http://lore.kernel.org 网站是所有涉及或影响内核开发的邮件列表的总存档。在这里
存储补丁存档是推荐的做法,当维护人员将补丁应用到子系统树时,最好提供一个指向
lore存档链接的标签,以便浏览提交历史的人可以找到某个更改背后的相关讨论和基本
原理。链接标签如下所示:
Link: https://lore.kernel.org/r/<message-id>
通过在git中添加以下钩子,可以将此配置为在发布 ``git am`` 时自动执行:
.. code-block:: none
$ git config am.messageid true
$ cat >.git/hooks/applypatch-msg <<'EOF'
#!/bin/sh
. git-sh-setup
perl -pi -e 's|^Message-Id:\s*<?([^>]+)>?$|Link: https://lore.kernel.org/r/$1|g;' "$1"
test -x "$GIT_DIR/hooks/commit-msg" &&
exec "$GIT_DIR/hooks/commit-msg" ${1+"$@"}
:
EOF
$ chmod a+x .git/hooks/applypatch-msg
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/maintainer/index.rst
==============
内核维护者手册
==============
本文档本是内核维护者手册的首页。
本手册还需要大量完善!请自由提出(和编写)本手册的补充内容。
*译注:指英文原版*
.. toctree::
:maxdepth: 2
configure-git
rebasing-and-merging
pull-requests
maintainer-entry-profile
modifying-patches
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/maintainer/maintainer-entry-profile.rst
:译者:
吴想成 Wu XiangCheng <bobwxc@email.cn>
.. _maintainerentryprofile_zh:
维护者条目概要
==============
维护人员条目概要补充了顶层过程文档(提交补丁,提交驱动程序……),增加了子系
统/设备驱动程序本地习惯以及有关补丁提交生命周期的相关内容。贡献者使用此文档
来调整他们的期望和避免常见错误;维护人员可以使用这些信息超越子系统层面查看
是否有机会汇聚到通用实践中。
总览
----
提供了子系统如何操作的介绍。MAINTAINERS文件告诉了贡献者应发送某文件的补丁到哪,
但它没有传达其他子系统的本地基础设施和机制以协助开发。
请考虑以下问题:
- 当补丁被本地树接纳或合并到上游时是否有通知?
- 子系统是否使用patchwork实例?Patchwork状态变更是否有通知?
- 是否有任何机器人或CI基础设施监视列表,或子系统是否使用自动测试反馈以便把
控接纳补丁?
- 被拉入-next的Git分支是哪个?
- 贡献者应针对哪个分支提交?
- 是否链接到其他维护者条目概要?例如一个设备驱动可能指向其父子系统的条目。
这使得贡献者意识到某维护者可能对提交链中其他维护者负有的义务。
提交检查单补遗
--------------
列出强制性和咨询性标准,超出通用的“提交补丁”检查表,以便维护者检查一个补丁是否
足够健康。例如:“通过checkpatch.pl,没有错误、没有警告。通过单元测试,详见某处”。
提交检查单补遗还可以包括有关硬件规格状态的详细信息。例如,子系统接受补丁之前
是否需要考虑在某个修订版上发布的规范。
开发周期的关键日期
------------------
提交者常常会误以为补丁可以在合并窗口关闭之前的任何时间发送,且下一个-rc1时仍
可以。事实上,大多数补丁都需要在下一个合并窗口打开之前提前进入linux-next中。
向提交者澄清关键日期(以-rc发布周为标志)以明确什么时候补丁会被考虑合并以及
何时需要等待下一个-rc。
至少需要讲明:
- 最后一个可以提交新功能的-rc:
针对下一个合并窗口的新功能提交应该在此点之前首次发布以供考虑。在此时间点
之后提交的补丁应该明确他们的目标为下下个合并窗口,或者给出应加快进度被接受
的充足理由。通常新特性贡献者的提交应出现在-rc5之前。
- 最后合并-rc:合并决策的最后期限。
向贡献者指出尚未接受的补丁集需要等待下下个合并窗口。当然,维护者没有义务
接受所有给定的补丁集,但是如果审阅在此时间点尚未结束,那么希望贡献者应该
等待并在下一个合并窗口重新提交。
可选项:
- 开发基线分支的首个-rc,列在概述部分,视为已为新提交做好准备。
审阅节奏
--------
贡献者最担心的问题之一是:补丁集已发布却未收到反馈,应在多久后发送提醒。除了
指定在重新提交之前要等待多长时间,还可以指示更新的首选样式;例如,重新发送
整个系列,或私下发送提醒邮件。本节也可以列出本区域的代码审阅方式,以及获取
不能直接从维护者那里得到的反馈的方法。
现有概要
--------
这里列出了现有的维护人员条目概要;我们可能会想要在不久的将来做一些不同的事情。
.. toctree::
:maxdepth: 1
../doc-guide/maintainer-profile
../../../nvdimm/maintainer-entry-profile
../../../riscv/patch-acceptance
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/maintainer/modifying-patches.rst
:译者:
吴想成 Wu XiangCheng <bobwxc@email.cn>
.. _modifyingpatches_zh:
修改补丁
========
如果你是子系统或者分支的维护者,由于代码在你的和提交者的树中并不完全相同,
有时你需要稍微修改一下收到的补丁以合并它们。
如果你严格遵守开发者来源证书的规则(c),你应该要求提交者重做,但这完全是会
适得其反的时间、精力浪费。规则(b)允许你调整代码,但这样修改提交者的代码并
让他背书你的错误是非常不礼貌的。为解决此问题,建议在你之前最后一个
Signed-off-by标签和你的之间添加一行,以指示更改的性质。这没有强制性要求,最
好在描述前面加上你的邮件和/或姓名,用方括号括住整行,以明显指出你对最后一刻
的更改负责。例如::
Signed-off-by: Random J Developer <random@developer.example.org>
[lucky@maintainer.example.org: struct foo moved from foo.c to foo.h]
Signed-off-by: Lucky K Maintainer <lucky@maintainer.example.org>
如果您维护着一个稳定的分支,并希望同时明确贡献、跟踪更改、合并修复,并保护
提交者免受责难,这种做法尤其有用。请注意,在任何情况下都不得更改作者的身份
(From头),因为它会在变更日志中显示。
向后移植(back-port)人员特别要注意:为了便于跟踪,请在提交消息的顶部(即主题行
之后)插入补丁的来源,这是一种常见而有用的做法。例如,我们可以在3.x稳定版本
中看到以下内容::
Date: Tue Oct 7 07:26:38 2014 -0400
libata: Un-break ATA blacklist
commit 1c40279960bcd7d52dbdf1d466b20d24b99176c8 upstream.
下面是一个旧的内核在某补丁被向后移植后会出现的::
Date: Tue May 13 22:12:27 2008 +0200
wireless, airo: waitbusy() won't delay
[backport of 2.6 commit b7acbdfbd1f277c1eb23f344f899cfa4cd0bf36a]
不管什么格式,这些信息都为人们跟踪你的树,以及试图解决你树中的错误的人提供了
有价值的帮助。
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/maintainer/pull-requests.rst
:译者:
吴想成 Wu XiangCheng <bobwxc@email.cn>
.. _pullrequests_zh:
如何创建拉取请求
================
本章描述维护人员如何创建并向其他维护人员提交拉取请求。这对将更改从一个维护者
树转移到另一个维护者树非常有用。
本文档由Tobin C. Harding(当时他尚不是一名经验丰富的维护人员)编写,内容主要
来自Greg Kroah Hartman和Linus Torvalds在LKML上的评论。Jonathan Corbet和Mauro
Carvalho Chehab提出了一些建议和修改。错误不可避免,如有问题,请找Tobin C.
Harding <me@tobin.cc>。
原始邮件线程::
http://lkml.kernel.org/r/20171114110500.GA21175@kroah.com
创建分支
--------
首先,您需要将希望包含在拉取请求里的所有更改都放在单独分支中。通常您将基于某开发
人员树的一个分支,一般是打算向其发送拉取请求的开发人员。
为了创建拉取请求,您必须首先标记刚刚创建的分支。建议您选择一个有意义的标记名,
以便即使过了一段时间,您和他人仍能理解其含义。在名称中包含源子系统和目标内核版本
的指示也是一个好的做法。
Greg提供了以下内容。对于一个含有drivers/char中混杂事项、将应用于4.15-rc1内核的
拉取请求,可以命名为 ``char-misc-4.15-rc1`` 。如果要在 ``char-misc-next`` 分支
上打上此标记,您可以使用以下命令::
git tag -s char-misc-4.15-rc1 char-misc-next
这将在 ``char-misc-next`` 分支的最后一个提交上创建一个名为 ``char-misc-4.15-rc1``
的标记,并用您的gpg密钥签名(参见 Documentation/maintainer/configure-git.rst )。
Linus只接受基于签名过的标记的拉取请求。其他维护者可能会有所不同。
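在发送拉取请求之前,也可以先在本地验证标记的签名是否正确(标记名沿用上面的示例)::
git tag -v char-misc-4.15-rc1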
当您运行上述命令时 ``git`` 会打开编辑器要求你描述一下这个标记。在本例中您需要
描述拉取请求,所以请概述一下包含的内容,为什么要合并,是否完成任何测试。所有
这些信息都将留在标记中,然后在维护者合并拉取请求时保留在合并提交中。所以把它
写好,它将永远留在内核中。
正如Linus所说::
不管怎么样,至少对我来说,重要的是 *信息* 。我需要知道我在拉取什么、
为什么我要拉取。我也希望将此消息用于合并消息,因此它不仅应该对我有
意义,也应该可以成为一个有意义的历史记录。
注意,如果拉取请求有一些不寻常的地方,请详细说明。如果你修改了并非
由你维护的文件,请解释 **为什么** 。我总会在差异中看到的,如果你不
提的话,我只会觉得分外可疑。当你在合并窗口后给我发新东西的时候,
(甚至是比较重大的错误修复),不仅需要解释做了什么、为什么这么做,
还请解释一下 **时间问题** 。为什么错过了合并窗口……
我会看你写在拉取请求邮件和签名标记里面的内容,所以根据你的工作流,
你可以在签名标记里面描述工作内容(也会自动放进拉取请求邮件),也
可以只在标记里面放个占位符,稍后在你实际发给我拉取请求时描述工作内容。
是的,我会编辑这些消息。部分因为我需要做一些琐碎的格式调整(整体缩进、
括号等),也因为此消息可能对我有意义(描述了冲突或一些个人问题)而对
合并提交信息上下文没啥意义,因此我需要尽力让它有意义起来。我也会
修复一些拼写和语法错误,特别是非母语者(母语者也是;^)。但我也会删掉
或增加一些内容。
Linus
Greg给出了一个拉取请求的例子::
Char/Misc patches for 4.15-rc1
Here is the big char/misc patch set for the 4.15-rc1 merge window.
Contained in here is the normal set of new functions added to all
of these crazy drivers, as well as the following brand new
subsystems:
- time_travel_controller: Finally a set of drivers for the
latest time travel bus architecture that provides i/o to
the CPU before it asked for it, allowing uninterrupted
processing
- relativity_shifters: due to the affect that the
time_travel_controllers have on the overall system, there
was a need for a new set of relativity shifter drivers to
accommodate the newly formed black holes that would
threaten to suck CPUs into them. This subsystem handles
this in a way to successfully neutralize the problems.
There is a Kconfig option to force these to be enabled
when needed, so problems should not occur.
All of these patches have been successfully tested in the latest
linux-next releases, and the original problems that it found have
all been resolved (apologies to anyone living near Canberra for the
lack of the Kconfig options in the earlier versions of the
linux-next tree creations.)
Signed-off-by: Your-name-here <your_email@domain>
此标记消息格式就像一个git提交。顶部有一行“总结标题”,一定要在下面sign-off。
现在您已经有了一个本地签名标记,您需要将它推送到可以被拉取的位置::
git push origin char-misc-4.15-rc1
创建拉取请求
------------
最后要做的是创建拉取请求消息。可以使用 ``git request-pull`` 命令让 ``git``
为你做这件事,但它需要确定你想拉取什么,以及拉取针对的基础(显示正确的拉取
更改和变更状态)。以下命令将生成一个拉取请求::
git request-pull master git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git/ char-misc-4.15-rc1
引用Greg的话::
此命令要求git比较从“char-misc-4.15-rc1”标记位置到“master”分支头(上述
例子中指向了我从Linus的树分叉的地方,通常是-rc发布)的差异,并去使用
git:// 协议拉取。如果你希望使用 https:// 协议,也可以用在这里(但是请
注意,部分人由于防火墙问题没法用https协议拉取)。
如果char-misc-4.15-rc1标记没有出现在我要求拉取的仓库中,git会提醒
它不在那里,所以记得推送到公开地方。
“git request-pull”会包含git树的地址和需要拉取的特定标记,以及标记
描述全文(详尽描述标记)。同时它也会创建此拉取请求的差异状态和单个
提交的缩短日志。
Linus回复说他倾向于 ``git://`` 协议。其他维护者可能有不同的偏好。另外,请注意
如果你创建的拉取请求没有签名标记, ``https://`` 可能是更好的选择。完整的讨论
请看原邮件。
提交拉取请求
------------
拉取请求的提交方式与普通补丁相同。向维护人员发送内联电子邮件并抄送LKML以及
任何必要特定子系统的列表。对Linus的拉取请求通常有如下主题行::
[GIT PULL] <subsystem> changes for v4.15-rc1
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/maintainer/rebasing-and-merging.rst
:译者:
吴想成 Wu XiangCheng <bobwxc@email.cn>
==========
变基与合并
==========
一般来说,维护子系统需要熟悉Git源代码管理系统。Git是一个功能强大的工具,有
很多功能;就像这类工具常出现的情况一样,使用这些功能的方法有对有错。本文档
特别介绍了变基与合并的用法。维护者经常在错误使用这些工具时遇到麻烦,但避免
问题实际上并不那么困难。
总的来说,需要注意的一点是:与许多其他项目不同,内核社区并不害怕在其开发历史
中看到合并提交。事实上,考虑到该项目的规模,避免合并几乎是不可能的。维护者会
在希望避免合并时遇到一些问题,而过于频繁的合并也会带来另一些问题。
变基
====
“变基(Rebase)”是更改存储库中一系列提交的历史记录的过程。有两种不同类型的操作
都被称为变基,因为这两种操作都使用 ``git rebase`` 命令,但它们之间存在显著
差异:
- 更改一系列补丁的父提交(起始提交)。例如,变基操作可以将基于上一内核版本
的一个补丁集重建到当前版本上。在下面的讨论中,我们将此操作称为“变根”。
- 通过修复(或删除)损坏的提交、添加补丁、添加标记以更改一系列补丁的历史,
来提交变更日志或更改已应用提交的顺序。在下文中,这种类型的操作称为“历史
修改”。
术语“变基”将用于指代上述两种操作。如果使用得当,变基可以产生更清晰、更整洁的
开发历史;如果使用不当,它可能会模糊历史并引入错误。
以下一些经验法则可以帮助开发者避免最糟糕的变基风险:
- 已经发布到你私人系统之外世界的历史通常不应更改。其他人可能会拉取你的树
的副本,然后基于它进行工作;修改你的树会给他们带来麻烦。如果工作需要变基,
这通常是表明它还没有准备好提交到公共存储库的信号。
但是,总有例外。有些树(linux-next是一个典型的例子)由于它们的需要经常
变基,开发人员知道不要基于它们来工作。开发人员有时会公开一个不稳定的分支,
供其他人或自动测试服务进行测试。如果您确实以这种方式公开了一个可能不稳定
的分支,请确保潜在使用者知道不要基于它来工作。
- 不要在包含由他人创建的历史的分支上变基。如果你从别的开发者的仓库拉取了变更,
那你现在就成了他们历史记录的保管人。你不应该改变它,除了少数例外情况。例如
树中有问题的提交必须显式恢复(即通过另一个补丁修复),而不是通过修改历史而
消失。
- 没有合理理由,不要对树变根。仅为了切换到更新的基或避免与上游储存库的合并
通常不是合理理由。
- 如果你必须对储存库进行变根,请不要随机选取一个提交作为新基。在发布节点之间
内核通常处于一个相对不稳定的状态;基于其中某点进行开发会增加遇到意外错误的
几率。当一系列补丁必须移动到新基时,请选择移动到一个稳定节点(例如-rc版本
节点),具体操作示例见本列表之后。
- 请知悉对补丁系列进行变根(或做明显的历史修改)会改变它们的开发环境,且很
可能使做过的大部分测试失效。一般来说,变基后的补丁系列应当像新代码一样对
待,并重新测试。
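下面是一个把补丁系列变根到稳定节点的简化示例(分支名与版本标记均为假设)::
# 将基于 v5.12 开发的 my-topic 分支整体移到稳定标记 v5.13 上
git rebase --onto v5.13 v5.12 my-topic
# 变根之后,整个系列应当像新代码一样重新测试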
合并窗口麻烦的一个常见原因是,Linus收到了一个明显在拉取请求发送之前不久才变根
(通常是变根到随机的提交上)的补丁系列。这样一个系列被充分测试的可能性相对较
低,拉取请求被接受的几率也同样较低。
相反,如果变基仅限于私有树、提交基于一个通用的起点、且经过充分测试,则引起
麻烦的可能性就很低。
合并
====
内核开发过程中,合并是一个很常见的操作;5.1版本开发周期中有超过1126个合并
——差不多占了整体的9%。内核开发工作积累在100多个不同的子系统树中,每个
子系统树都可能包含多个主题分支;每个分支通常独立于其他分支进行开发。因此
在任何给定分支进入上游储存库之前,至少需要一次合并。
许多项目要求拉取请求中的分支基于当前主干,这样历史记录中就不会出现合并提交。
内核并不是这样;任何为了避免合并而重新对分支变基都很可能导致麻烦。
子系统维护人员发现他们必须进行两种类型的合并:从较低层级的子系统树和从其他
子系统树(同级树或主线)进行合并。这两种情况下要遵循的最佳实践是不同的。
合并较低层级树
--------------
较大的子系统往往有多个级别的维护人员,较低级别的维护人员向较高级别发送拉取
请求。合并这样的请求几乎肯定会生成一个合并提交;这也是应该的。实际上,
子系统维护人员可能希望在极少数快进合并情况下使用 ``--no-ff`` 标志来强制添加
合并提交,以便记录合并的原因。 **任何** 类型的合并的变更日志必须说明
*为什么* 合并。对于较低级别的树,“为什么”通常是对该拉取所带来的变化的总结。
各级维护人员都应在他们的拉取请求上使用经签名的标签,上游维护人员应在拉取分支
时验证标签。不这样做会威胁整个开发过程的安全。
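下面是一个简化的操作示例(仓库地址与标记名均为假设),先验证下游的签名标记,再用
``--no-ff`` 保留合并提交::
git fetch git://git.example.org/linux-subsys.git tag subsys-fixes-5.14-rc1
git verify-tag subsys-fixes-5.14-rc1
git merge --no-ff subsys-fixes-5.14-rc1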
根据上面列出的规则,一旦您将其他人的历史记录合并到树中,您就不得对该分支进行
变基,即使您能够这样做。
合并同级树或上游树
------------------
虽然来自下游的合并是常见且不起眼的,但当需要将一个分支推向上游时,其中来自
其他树的合并往往是一个危险信号。这种合并需要仔细考虑并加以充分证明,否则后续
的拉取请求很可能会被拒绝。
想要将主分支合并到存储库中是很自然的;这种类型的合并通常被称为“反向合并”
。反向合并有助于确保与并行的开发没有冲突,并且通常会给人一种温暖、舒服的
感觉,即处于最新。但这种诱惑几乎总是应该避免的。
为什么呢?反向合并将搅乱你自己分支的开发历史。它们会大大增加你遇到来自社区
其他地方的错误的机会,且使你很难确保你所管理的工作稳定并准备好合入上游。
频繁的合并还可以掩盖树中开发过程中的问题;它们会隐藏与其他树的交互,而这些
交互不应该(经常)发生在管理良好的分支中。
也就是说,偶尔需要进行反向合并;当这种情况发生时,一定要在提交信息中记录
*为什么* 。同样,在一个众所周知的稳定点进行合并,而不是随机提交。即使这样,
你也不应该反向合并一棵比你的直接上游树更高层级的树;如果确实需要更高级别的
反向合并,应首先在上游树进行。
导致合并相关问题最常见的原因之一是:在发送拉取请求之前维护者合并上游以解决
合并冲突。同样,这种诱惑很容易理解,但绝对应该避免。对于最终拉取请求来说
尤其如此:Linus坚信他更愿意看到合并冲突,而不是不必要的反向合并。看到冲突
可以让他了解潜在的问题所在。他做过很多合并(在5.1版本开发周期中是382次),
而且在解决冲突方面也很在行——通常比参与的开发人员要强。
那么,当他们的子系统分支和主线之间发生冲突时,维护人员应该怎么做呢?最重要
的一步是在拉取请求中提示Linus会发生冲突;如果啥都没说则表明您的分支可以正常
合入。对于特别困难的冲突,创建并推送一个 *独立* 分支来展示你将如何解决问题。
在拉取请求中提到该分支,但是请求本身应该针对未合并的分支。
即使不存在已知冲突,在发送拉取请求之前进行合并测试也是个好主意。它可能会提醒
您一些在linux-next树中没有发现的问题,并帮助您准确地理解您正在要求上游做什么。
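一个简化的合并测试做法(分支名与版本号均为假设)是在一次性分支上试合并、观察冲突后丢弃::
git checkout -b test-merge my-branch
git merge v5.13-rc1 # 仅用于观察是否存在冲突以及冲突规模
git merge --abort # 如有冲突,放弃这次试合并
git checkout my-branch
git branch -D test-merge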
合并上游树或另一个子系统树的另一个原因是解决依赖关系。这些依赖性问题有时确实
会发生,而且有时与另一棵树交叉合并是解决这些问题的最佳方法;同样,在这种情况
下,合并提交应该解释为什么要进行合并。花点时间把它做好;会有人阅读这些变更
日志。
然而依赖性问题通常表明需要改变方法。合并另一个子系统树以解决依赖性风险会带来
其他缺陷,几乎永远不应这样做。如果该子系统树无法被合到上游,那么它的任何问题
也都会阻碍你的树合并。更可取的选择包括与维护人员达成一致意见,在其中一个树中
同时进行两组更改;或者创建一个主题分支专门处理可以合并到两个树中的先决条件提交。
如果依赖关系与主要的基础结构更改相关,正确的解决方案可能是将依赖提交保留一个
开发周期,以便这些更改有时间在主线上稳定。
最后
====
在开发周期的开头合并主线是比较常见的,可以获取树中其他地方的更改和修复。同样,
这样的合并应该选择一个众所周知的发布点,而不是一些随机点。如果在合并窗口期间
上游分支已完全清空到主线中,则可以使用以下命令向前拉取它::
git merge v5.2-rc1^0
“^0”使Git执行快进合并(在这种情况下这应该可以),从而避免多余的虚假合并提交。
上面列出的就是指导方针了。总是会有一些情况需要不同的解决方案,这些指导原则
不应阻止开发人员在需要时做正确的事情。但是,我们应该时刻考虑是否真的出现了
这样的需求,并准备好解释为什么需要做一些不寻常的事情。
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/parisc/debugging.rst
:Translator: Yanteng Si <siyanteng@loongson.cn>
.. _cn_parisc_debugging:
=================
调试PA-RISC
=================
好吧,这里有一些关于调试linux/parisc的较底层部分的信息。
1. 绝对地址
=====================
很多汇编代码目前运行在实模式下,这意味着会使用绝对地址,而不是像内核其他
部分那样使用虚拟地址。要将绝对地址转换为虚拟地址,你可以在System.map中查
找,添加__PAGE_OFFSET(目前是0x10000000)。
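例如(地址仅为假设),若实模式代码访问的绝对地址是 0x00123456,可以这样换算出对应的
虚拟地址,再到System.map里查找::
# 绝对地址 + __PAGE_OFFSET(0x10000000)= 虚拟地址
printf '0x%x\n' $(( 0x00123456 + 0x10000000 )) # 输出 0x10123456
grep 10123456 System.map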
2. HPMCs
========
当实模式的代码试图访问不存在的内存时,会出现HPMC(high priority machine
check)而不是内核oops。若要调试HPMC,请尝试找到系统响应程序/请求程序地址。
系统请求程序地址应该与(某)处理器的HPA(I/O范围内的高地址)相匹配;系统响应程
序地址是实模式代码试图访问的地址。
系统响应程序地址的典型值是大于__PAGE_OFFSET (0x10000000)的地址,这意味着
在实模式试图访问它之前,虚拟地址没有被翻译成物理地址。
3. 有趣的Q位
============
某些非常关键的代码必须清除PSW中的Q位。当Q位被清除时,CPU不会更新中断处理
程序所读取的寄存器,以找出机器被中断的位置——所以如果你在清除Q位的指令和再
次设置Q位的RFI之间遇到中断,你不知道它到底发生在哪里。如果你幸运的话,IAOQ
会指向清除Q位的指令,如果你不幸运的话,它会指向任何地方。通常Q位的问题会
表现为无法解释的系统挂起或物理内存越界。
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/parisc/index.rst
:Translator: Yanteng Si <siyanteng@loongson.cn>
.. _cn_parisc_index:
====================
PA-RISC体系架构
====================
.. toctree::
:maxdepth: 2
debugging
registers
Todolist:
features
.. only:: subproject and html
Indices
=======
* :ref:`genindex`
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/parisc/registers.rst
:Translator: Yanteng Si <siyanteng@loongson.cn>
.. _cn_parisc_registers:
=========================
Linux/PA-RISC的寄存器用法
=========================
[ 用星号表示目前尚未实现的计划用途。 ]
ABI约定的通用寄存器
===================
控制寄存器
----------
============================ =================================
CR 0 (恢复计数器) 用于ptrace
CR 1-CR 7(无定义) 未使用
CR 8 (Protection ID) 每进程值*
CR 9, 12, 13 (PIDS) 未使用
CR10 (CCR) FPU延迟保存*
CR11 按照ABI的规定(SAR)
CR14 (中断向量) 初始化为 fault_vector
CR15 (EIEM) 所有位初始化为1*
CR16 (间隔计时器) 读取周期数/写入开始时间间隔计时器
CR17-CR22 中断参数
CR19 中断指令寄存器
CR20 中断空间寄存器
CR21 中断偏移量寄存器
CR22 中断 PSW
CR23 (EIRR) 读取未决中断/写入清除位
CR24 (TR 0) 内核空间页目录指针
CR25 (TR 1) 用户空间页目录指针
CR26 (TR 2) 不使用
CR27 (TR 3) 线程描述符指针
CR28 (TR 4) 不使用
CR29 (TR 5) 不使用
CR30 (TR 6) 当前 / 0
CR31 (TR 7) 临时寄存器,在不同地方使用
============================ =================================
空间寄存器(内核模式)
----------------------
======== ==============================
SR0 临时空间寄存器
SR4-SR7 设置为0
SR1 临时空间寄存器
SR2 内核不应该破坏它
SR3 用于用户空间访问(当前进程)
======== ==============================
空间寄存器(用户模式)
----------------------
======== ============================
SR0 临时空间寄存器
SR1 临时空间寄存器
SR2 保存Linux gateway page的空间
SR3 在内核中保存用户地址空间的值
SR4-SR7 定义了用户/内核的短地址空间
======== ============================
处理器状态字
------------
====================== ================================================
W (64位地址) 0
E (小尾端) 0
S (安全间隔计时器) 0
T (产生分支陷阱) 0
H (高特权级陷阱) 0
L (低特权级陷阱) 0
N (撤销下一条指令) 被C代码使用
X (数据存储中断禁用) 0
B (产生分支) 被C代码使用
C (代码地址转译) 1, 在执行实模式代码时为0
V (除法步长校正) 被C代码使用
M (HPMC 掩码) 0, 在执行HPMC操作*时为1
C/B (进/借 位) 被C代码使用
O (有序引用) 1*
F (性能监视器) 0
R (回收计数器陷阱) 0
Q (收集中断状态) 1 (在rfi之前的代码中为0)
P (保护标识符) 1*
D (数据地址转译) 1, 在执行实模式代码时为0
I (外部中断掩码) 由cli()/sti()宏使用。
====================== ================================================
“隐形”寄存器(影子寄存器)
---------------------------
============= ===================
PSW W 默认值 0
PSW E 默认值 0
影子寄存器 被中断处理代码使用
TOC启用位 1
============= ===================
----------------------------------------------------------
PA-RISC架构定义了7个寄存器作为“影子寄存器”。这些寄存器在
RETURN FROM INTERRUPTION AND RESTORE指令中使用,通过消
除中断处理程序中对一般寄存器(GR)的保存和恢复的需要来减
少状态保存和恢复时间。影子寄存器是GRs 1, 8, 9, 16, 17,
24和25。
-------------------------------------------------------------------------
寄存器使用说明,最初由John Marvin提供,并由Randolph Chung提供一些补充说明。
对于通用寄存器:
r1,r2,r19-r26,r28,r29 & r31可以在不保存它们的情况下被使用。当然,如果你
关心它们,在调用另一个程序之前,你也需要保存它们。上面的一些寄存器确实
有特殊的含义,你应该注意一下:
r1:
addil指令是硬性规定将其结果放在r1中,所以如果你使用这条指令要
注意这点。
r2:
这就是返回指针。一般来说,你不想使用它,因为你需要这个指针来返
回给你的调用者。然而,它与这组寄存器组合在一起,因为调用者不能
依赖你返回时的值是相同的,也就是说,你可以将r2复制到另一个寄存
器,并在作废r2后通过该寄存器返回,这应该不会给调用程序带来问题。
r19-r22:
这些通常被认为是临时寄存器。
请注意,在64位中它们是arg7-arg4。
r23-r26:
这些是arg3-arg0,也就是说,如果你不再关心传入的值,你可以使用
它们。
r28,r29:
这俩是ret0和ret1。它们是你传入返回值的地方。r28是主返回值。当返回
小结构体时,r29也可以用来将数据传回给调用程序。
r30:
栈指针
r31:
ble指令将返回指针放在这里。
r3-r18,r27,r30需要被保存和恢复。r3-r18只是一般用途的寄存器。
r27是数据指针,用来使对全局变量的引用更容易。r30是栈指针。
......@@ -19,7 +19,7 @@
:ref:`Documentation/translations/zh_CN/process/howto.rst <cn_process_howto>`
文件是一个重要的起点;
:ref:`Documentation/translations/zh_CN/process/submitting-patches.rst <cn_submittingpatches>`
和 :ref:`Documentation/transaltions/zh_CN/process/submitting-drivers.rst <cn_submittingdrivers>`
和 :ref:`Documentation/translations/zh_CN/process/submitting-drivers.rst <cn_submittingdrivers>`
也是所有内核开发人员都应该阅读的内容。许多内部内核API都是使用kerneldoc机制
记录的;“make htmldocs”或“make pdfdocs”可用于以HTML或PDF格式生成这些文档
(尽管某些发行版提供的tex版本会遇到内部限制,无法正确处理文档)。
......
......@@ -61,7 +61,7 @@ Linux 内核代码风格
case 'K':
case 'k':
mem <<= 10;
/* fall through */
fallthrough;
default:
break;
}
......
===========
===========
EHCI driver
===========
......
===============================
===============================
Linux USB Printer Gadget Driver
===============================
......
......@@ -145,7 +145,8 @@ Bind mounts and OverlayFS
Landlock enables to restrict access to file hierarchies, which means that these
access rights can be propagated with bind mounts (cf.
:doc:`/filesystems/sharedsubtree`) but not with :doc:`/filesystems/overlayfs`.
Documentation/filesystems/sharedsubtree.rst) but not with
Documentation/filesystems/overlayfs.rst.
A bind mount mirrors a source file hierarchy to a destination. The destination
hierarchy is then composed of the exact same files, on which Landlock rules can
......@@ -170,8 +171,8 @@ Inheritance
Every new thread resulting from a :manpage:`clone(2)` inherits Landlock domain
restrictions from its parent. This is similar to the seccomp inheritance (cf.
:doc:`/userspace-api/seccomp_filter`) or any other LSM dealing with task's
:manpage:`credentials(7)`. For instance, one process's thread may apply
Documentation/userspace-api/seccomp_filter.rst) or any other LSM dealing with
task's :manpage:`credentials(7)`. For instance, one process's thread may apply
Landlock rules to itself, but they will not be automatically applied to other
sibling threads (unlike POSIX thread credential changes, cf.
:manpage:`nptl(7)`).
......@@ -278,7 +279,7 @@ Memory usage
------------
Kernel memory allocated to create rulesets is accounted and can be restricted
by the :doc:`/admin-guide/cgroup-v1/memory`.
by the Documentation/admin-guide/cgroup-v1/memory.rst.
Questions and answers
=====================
......@@ -303,7 +304,7 @@ issues, especially when untrusted processes can manipulate them (cf.
Additional documentation
========================
* :doc:`/security/landlock`
* Documentation/security/landlock.rst
* https://landlock.io
.. Links
......
......@@ -6620,7 +6620,7 @@ system fingerprint. To prevent userspace from circumventing such restrictions
by running an enclave in a VM, KVM prevents access to privileged attributes by
default.
See Documentation/x86/sgx/2.Kernel-internals.rst for more details.
See Documentation/x86/sgx.rst for more details.
7.26 KVM_CAP_PPC_RPT_INVALIDATE
-------------------------------
......
......@@ -10,7 +10,7 @@ The memory of Protected Virtual Machines (PVMs) is not accessible to
I/O or the hypervisor. In those cases where the hypervisor needs to
access the memory of a PVM, that memory must be made accessible.
Memory made accessible to the hypervisor will be encrypted. See
:doc:`s390-pv` for details."
Documentation/virt/kvm/s390-pv.rst for details."
On IPL (boot) a small plaintext bootloader is started, which provides
information about the encrypted components and necessary metadata to
......
......@@ -304,6 +304,6 @@ VCPU returns from the call.
References
==========
.. [atomic-ops] Documentation/core-api/atomic_ops.rst
.. [atomic-ops] Documentation/atomic_bitops.txt and Documentation/atomic_t.txt
.. [memory-barriers] Documentation/memory-barriers.txt
.. [lwn-mb] https://lwn.net/Articles/573436/
......@@ -10,7 +10,7 @@ Overview
Zswap is a lightweight compressed cache for swap pages. It takes pages that are
in the process of being swapped out and attempts to compress them into a
dynamically allocated RAM-based memory pool. zswap basically trades CPU cycles
for potentially reduced swap I/O.  This trade-off can also result in a
for potentially reduced swap I/O. This trade-off can also result in a
significant performance improvement if reads from the compressed cache are
faster than reads from a swap device.
......@@ -26,7 +26,7 @@ faster than reads from a swap device.
performance impact of swapping.
* Overcommitted guests that share a common I/O resource can
dramatically reduce their swap I/O pressure, avoiding heavy handed I/O
throttling by the hypervisor. This allows more work to get done with less
throttling by the hypervisor. This allows more work to get done with less
impact to the guest workload and guests sharing the I/O subsystem
* Users with SSDs as swap devices can extend the life of the device by
drastically reducing life-shortening writes.
......
......@@ -1343,7 +1343,7 @@ follow::
In addition to read/modify/write the setup header of the struct
boot_params as that of 16-bit boot protocol, the boot loader should
also fill the additional fields of the struct boot_params as
described in chapter :doc:`zero-page`.
described in chapter Documentation/x86/zero-page.rst.
After setting up the struct boot_params, the boot loader can load the
32/64-bit kernel in the same way as that of 16-bit boot protocol.
......@@ -1379,7 +1379,7 @@ can be calculated as follows::
In addition to read/modify/write the setup header of the struct
boot_params as that of 16-bit boot protocol, the boot loader should
also fill the additional fields of the struct boot_params as described
in chapter :doc:`zero-page`.
in chapter Documentation/x86/zero-page.rst.
After setting up the struct boot_params, the boot loader can load
64-bit kernel in the same way as that of 16-bit boot protocol, but
......
......@@ -28,7 +28,7 @@ are aligned with platform MTRR setup. If MTRRs are only set up by the platform
firmware code though and the OS does not make any specific MTRR mapping
requests mtrr_type_lookup() should always return MTRR_TYPE_INVALID.
For details refer to :doc:`pat`.
For details refer to Documentation/x86/pat.rst.
.. tip::
On Intel P6 family processors (Pentium Pro, Pentium II and later)
......
......@@ -399,7 +399,7 @@ struct dev_links_info {
* along with subsystem-level and driver-level callbacks.
* @em_pd: device's energy model performance domain
* @pins: For device pin management.
* See Documentation/driver-api/pinctl.rst for details.
* See Documentation/driver-api/pin-control.rst for details.
* @msi_list: Hosts MSI descriptors
* @msi_domain: The generic MSI domain this device is using.
* @numa_node: NUMA node this device is close to.
......
......@@ -31,7 +31,7 @@ struct pinctrl_map;
* @irq_flags: Mode for primary IRQ (defaults to active low)
* @gpio_base: Base GPIO number
* @gpio_configs: Array of GPIO configurations (See
* Documentation/driver-api/pinctl.rst)
* Documentation/driver-api/pin-control.rst)
* @n_gpio_configs: Number of entries in gpio_configs
* @gpsw: General purpose switch mode setting. Depends on the external
* hardware connected to the switch. (See the SW1_MODE field
......
......@@ -89,7 +89,7 @@ struct pinctrl_map;
* it.
* @PIN_CONFIG_OUTPUT: this will configure the pin as an output and drive a
* value on the line. Use argument 1 to indicate high level, argument 0 to
* indicate low level. (Please see Documentation/driver-api/pinctl.rst,
* indicate low level. (Please see Documentation/driver-api/pin-control.rst,
* section "GPIO mode pitfalls" for a discussion around this parameter.)
* @PIN_CONFIG_PERSIST_STATE: retain pin state across sleep or controller reset
* @PIN_CONFIG_POWER_SOURCE: if the pin can select between different power
......
......@@ -2,7 +2,7 @@
/*
* Platform profile sysfs interface
*
* See Documentation/ABI/testing/sysfs-platform_profile.rst for more
* See Documentation/userspace-api/sysfs-platform_profile.rst for more
* information.
*/
......
......@@ -10,6 +10,8 @@
* whenever kernel_clone() is invoked to create a new process.
*/
#define pr_fmt(fmt) "%s: " fmt, __func__
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kprobes.h>
......@@ -27,32 +29,31 @@ static struct kprobe kp = {
static int __kprobes handler_pre(struct kprobe *p, struct pt_regs *regs)
{
#ifdef CONFIG_X86
pr_info("<%s> pre_handler: p->addr = 0x%p, ip = %lx, flags = 0x%lx\n",
pr_info("<%s> p->addr = 0x%p, ip = %lx, flags = 0x%lx\n",
p->symbol_name, p->addr, regs->ip, regs->flags);
#endif
#ifdef CONFIG_PPC
pr_info("<%s> pre_handler: p->addr = 0x%p, nip = 0x%lx, msr = 0x%lx\n",
pr_info("<%s> p->addr = 0x%p, nip = 0x%lx, msr = 0x%lx\n",
p->symbol_name, p->addr, regs->nip, regs->msr);
#endif
#ifdef CONFIG_MIPS
pr_info("<%s> pre_handler: p->addr = 0x%p, epc = 0x%lx, status = 0x%lx\n",
pr_info("<%s> p->addr = 0x%p, epc = 0x%lx, status = 0x%lx\n",
p->symbol_name, p->addr, regs->cp0_epc, regs->cp0_status);
#endif
#ifdef CONFIG_ARM64
pr_info("<%s> pre_handler: p->addr = 0x%p, pc = 0x%lx,"
" pstate = 0x%lx\n",
pr_info("<%s> p->addr = 0x%p, pc = 0x%lx, pstate = 0x%lx\n",
p->symbol_name, p->addr, (long)regs->pc, (long)regs->pstate);
#endif
#ifdef CONFIG_ARM
pr_info("<%s> pre_handler: p->addr = 0x%p, pc = 0x%lx, cpsr = 0x%lx\n",
pr_info("<%s> p->addr = 0x%p, pc = 0x%lx, cpsr = 0x%lx\n",
p->symbol_name, p->addr, (long)regs->ARM_pc, (long)regs->ARM_cpsr);
#endif
#ifdef CONFIG_RISCV
pr_info("<%s> pre_handler: p->addr = 0x%p, pc = 0x%lx, status = 0x%lx\n",
pr_info("<%s> p->addr = 0x%p, pc = 0x%lx, status = 0x%lx\n",
p->symbol_name, p->addr, regs->epc, regs->status);
#endif
#ifdef CONFIG_S390
pr_info("<%s> pre_handler: p->addr, 0x%p, ip = 0x%lx, flags = 0x%lx\n",
pr_info("<%s> p->addr, 0x%p, ip = 0x%lx, flags = 0x%lx\n",
p->symbol_name, p->addr, regs->psw.addr, regs->flags);
#endif
......@@ -65,31 +66,31 @@ static void __kprobes handler_post(struct kprobe *p, struct pt_regs *regs,
unsigned long flags)
{
#ifdef CONFIG_X86
pr_info("<%s> post_handler: p->addr = 0x%p, flags = 0x%lx\n",
pr_info("<%s> p->addr = 0x%p, flags = 0x%lx\n",
p->symbol_name, p->addr, regs->flags);
#endif
#ifdef CONFIG_PPC
pr_info("<%s> post_handler: p->addr = 0x%p, msr = 0x%lx\n",
pr_info("<%s> p->addr = 0x%p, msr = 0x%lx\n",
p->symbol_name, p->addr, regs->msr);
#endif
#ifdef CONFIG_MIPS
pr_info("<%s> post_handler: p->addr = 0x%p, status = 0x%lx\n",
pr_info("<%s> p->addr = 0x%p, status = 0x%lx\n",
p->symbol_name, p->addr, regs->cp0_status);
#endif
#ifdef CONFIG_ARM64
pr_info("<%s> post_handler: p->addr = 0x%p, pstate = 0x%lx\n",
pr_info("<%s> p->addr = 0x%p, pstate = 0x%lx\n",
p->symbol_name, p->addr, (long)regs->pstate);
#endif
#ifdef CONFIG_ARM
pr_info("<%s> post_handler: p->addr = 0x%p, cpsr = 0x%lx\n",
pr_info("<%s> p->addr = 0x%p, cpsr = 0x%lx\n",
p->symbol_name, p->addr, (long)regs->ARM_cpsr);
#endif
#ifdef CONFIG_RISCV
pr_info("<%s> post_handler: p->addr = 0x%p, status = 0x%lx\n",
pr_info("<%s> p->addr = 0x%p, status = 0x%lx\n",
p->symbol_name, p->addr, regs->status);
#endif
#ifdef CONFIG_S390
pr_info("<%s> pre_handler: p->addr, 0x%p, flags = 0x%lx\n",
pr_info("<%s> p->addr, 0x%p, flags = 0x%lx\n",
p->symbol_name, p->addr, regs->flags);
#endif
}
......
......@@ -24,7 +24,7 @@ my $help = 0;
my $fix = 0;
my $warn = 0;
if (! -d ".git") {
if (! -e ".git") {
printf "Warning: can't check if file exists, as this is not a git tree\n";
exit 0;
}
......
......@@ -406,6 +406,8 @@ my $doc_inline_sect = '\s*\*\s*(@\s*[\w][\w\.]*\s*):(.*)';
my $doc_inline_end = '^\s*\*/\s*$';
my $doc_inline_oneline = '^\s*/\*\*\s*(@[\w\s]+):\s*(.*)\s*\*/\s*$';
my $export_symbol = '^\s*EXPORT_SYMBOL(_GPL)?\s*\(\s*(\w+)\s*\)\s*;';
my $function_pointer = qr{([^\(]*\(\*)\s*\)\s*\(([^\)]*)\)};
my $attribute = qr{__attribute__\s*\(\([a-z0-9,_\*\s\(\)]*\)\)}i;
my %parameterdescs;
my %parameterdesc_start_lines;
......@@ -694,7 +696,7 @@ sub output_function_man(%) {
$post = ");";
}
$type = $args{'parametertypes'}{$parameter};
if ($type =~ m/([^\(]*\(\*)\s*\)\s*\(([^\)]*)\)/) {
if ($type =~ m/$function_pointer/) {
# pointer-to-function
print ".BI \"" . $parenth . $1 . "\" " . " \") (" . $2 . ")" . $post . "\"\n";
} else {
......@@ -974,7 +976,7 @@ sub output_function_rst(%) {
$count++;
$type = $args{'parametertypes'}{$parameter};
if ($type =~ m/([^\(]*\(\*)\s*\)\s*\(([^\)]*)\)/) {
if ($type =~ m/$function_pointer/) {
# pointer-to-function
print $1 . $parameter . ") (" . $2 . ")";
} else {
......@@ -1211,7 +1213,9 @@ sub dump_struct($$) {
my $members;
my $type = qr{struct|union};
# For capturing struct/union definition body, i.e. "{members*}qualifiers*"
my $definition_body = qr{\{(.*)\}(?:\s*(?:__packed|__aligned|____cacheline_aligned_in_smp|____cacheline_aligned|__attribute__\s*\(\([a-z0-9,_\s\(\)]*\)\)))*};
my $qualifiers = qr{$attribute|__packed|__aligned|____cacheline_aligned_in_smp|____cacheline_aligned};
my $definition_body = qr{\{(.*)\}\s*$qualifiers*};
my $struct_members = qr{($type)([^\{\};]+)\{([^\{\}]*)\}([^\{\}\;]*)\;};
if ($x =~ /($type)\s+(\w+)\s*$definition_body/) {
$decl_type = $1;
......@@ -1235,27 +1239,27 @@ sub dump_struct($$) {
# strip comments:
$members =~ s/\/\*.*?\*\///gos;
# strip attributes
$members =~ s/\s*__attribute__\s*\(\([a-z0-9,_\*\s\(\)]*\)\)/ /gi;
$members =~ s/\s*$attribute/ /gi;
$members =~ s/\s*__aligned\s*\([^;]*\)/ /gos;
$members =~ s/\s*__packed\s*/ /gos;
$members =~ s/\s*CRYPTO_MINALIGN_ATTR/ /gos;
$members =~ s/\s*____cacheline_aligned_in_smp/ /gos;
$members =~ s/\s*____cacheline_aligned/ /gos;
my $args = qr{([^,)]+)};
# replace DECLARE_BITMAP
$members =~ s/__ETHTOOL_DECLARE_LINK_MODE_MASK\s*\(([^\)]+)\)/DECLARE_BITMAP($1, __ETHTOOL_LINK_MODE_MASK_NBITS)/gos;
$members =~ s/DECLARE_BITMAP\s*\(([^,)]+),\s*([^,)]+)\)/unsigned long $1\[BITS_TO_LONGS($2)\]/gos;
$members =~ s/DECLARE_BITMAP\s*\($args,\s*$args\)/unsigned long $1\[BITS_TO_LONGS($2)\]/gos;
# replace DECLARE_HASHTABLE
$members =~ s/DECLARE_HASHTABLE\s*\(([^,)]+),\s*([^,)]+)\)/unsigned long $1\[1 << (($2) - 1)\]/gos;
$members =~ s/DECLARE_HASHTABLE\s*\($args,\s*$args\)/unsigned long $1\[1 << (($2) - 1)\]/gos;
# replace DECLARE_KFIFO
$members =~ s/DECLARE_KFIFO\s*\(([^,)]+),\s*([^,)]+),\s*([^,)]+)\)/$2 \*$1/gos;
$members =~ s/DECLARE_KFIFO\s*\($args,\s*$args,\s*$args\)/$2 \*$1/gos;
# replace DECLARE_KFIFO_PTR
$members =~ s/DECLARE_KFIFO_PTR\s*\(([^,)]+),\s*([^,)]+)\)/$2 \*$1/gos;
$members =~ s/DECLARE_KFIFO_PTR\s*\($args,\s*$args\)/$2 \*$1/gos;
my $declaration = $members;
# Split nested struct/union elements as newer ones
while ($members =~ m/(struct|union)([^\{\};]+)\{([^\{\}]*)\}([^\{\}\;]*)\;/) {
while ($members =~ m/$struct_members/) {
my $newmember;
my $maintype = $1;
my $ids = $4;
......@@ -1315,7 +1319,7 @@ sub dump_struct($$) {
}
}
}
$members =~ s/(struct|union)([^\{\};]+)\{([^\{\}]*)\}([^\{\}\;]*)\;/$newmember/;
$members =~ s/$struct_members/$newmember/;
}
# Ignore other nested elements, like enums
......@@ -1555,8 +1559,9 @@ sub create_parameterlist($$$$) {
my $param;
# temporarily replace commas inside function pointer definition
while ($args =~ /(\([^\),]+),/) {
$args =~ s/(\([^\),]+),/$1#/g;
my $arg_expr = qr{\([^\),]+};
while ($args =~ /$arg_expr,/) {
$args =~ s/($arg_expr),/$1#/g;
}
foreach my $arg (split($splitter, $args)) {
......@@ -1707,7 +1712,7 @@ sub check_sections($$$$$) {
foreach $px (0 .. $#prms) {
$prm_clean = $prms[$px];
$prm_clean =~ s/\[.*\]//;
$prm_clean =~ s/__attribute__\s*\(\([a-z,_\*\s\(\)]*\)\)//i;
$prm_clean =~ s/$attribute//i;
# ignore array size in a parameter string;
# however, the original param string may contain
# spaces, e.g.: addr[6 + 2]
......@@ -1809,8 +1814,14 @@ sub dump_function($$) {
# - parport_register_device (function pointer parameters)
# - atomic_set (macro)
# - pci_match_device, __copy_to_user (long return type)
if ($define && $prototype =~ m/^()([a-zA-Z0-9_~:]+)\s+/) {
my $name = qr{[a-zA-Z0-9_~:]+};
my $prototype_end1 = qr{[^\(]*};
my $prototype_end2 = qr{[^\{]*};
my $prototype_end = qr{\(($prototype_end1|$prototype_end2)\)};
my $type1 = qr{[\w\s]+};
my $type2 = qr{$type1\*+};
if ($define && $prototype =~ m/^()($name)\s+/) {
# This is an object-like macro, it has no return type and no parameter
# list.
# Function-like macros are not allowed to have spaces between
......@@ -1818,23 +1829,9 @@ sub dump_function($$) {
$return_type = $1;
$declaration_name = $2;
$noret = 1;
} elsif ($prototype =~ m/^()([a-zA-Z0-9_~:]+)\s*\(([^\(]*)\)/ ||
$prototype =~ m/^(\w+)\s+([a-zA-Z0-9_~:]+)\s*\(([^\(]*)\)/ ||
$prototype =~ m/^(\w+\s*\*+)\s*([a-zA-Z0-9_~:]+)\s*\(([^\(]*)\)/ ||
$prototype =~ m/^(\w+\s+\w+)\s+([a-zA-Z0-9_~:]+)\s*\(([^\(]*)\)/ ||
$prototype =~ m/^(\w+\s+\w+\s*\*+)\s*([a-zA-Z0-9_~:]+)\s*\(([^\(]*)\)/ ||
$prototype =~ m/^(\w+\s+\w+\s+\w+)\s+([a-zA-Z0-9_~:]+)\s*\(([^\(]*)\)/ ||
$prototype =~ m/^(\w+\s+\w+\s+\w+\s*\*+)\s*([a-zA-Z0-9_~:]+)\s*\(([^\(]*)\)/ ||
$prototype =~ m/^()([a-zA-Z0-9_~:]+)\s*\(([^\{]*)\)/ ||
$prototype =~ m/^(\w+)\s+([a-zA-Z0-9_~:]+)\s*\(([^\{]*)\)/ ||
$prototype =~ m/^(\w+\s*\*+)\s*([a-zA-Z0-9_~:]+)\s*\(([^\{]*)\)/ ||
$prototype =~ m/^(\w+\s+\w+)\s+([a-zA-Z0-9_~:]+)\s*\(([^\{]*)\)/ ||
$prototype =~ m/^(\w+\s+\w+\s*\*+)\s*([a-zA-Z0-9_~:]+)\s*\(([^\{]*)\)/ ||
$prototype =~ m/^(\w+\s+\w+\s+\w+)\s+([a-zA-Z0-9_~:]+)\s*\(([^\{]*)\)/ ||
$prototype =~ m/^(\w+\s+\w+\s+\w+\s*\*+)\s*([a-zA-Z0-9_~:]+)\s*\(([^\{]*)\)/ ||
$prototype =~ m/^(\w+\s+\w+\s+\w+\s+\w+)\s+([a-zA-Z0-9_~:]+)\s*\(([^\{]*)\)/ ||
$prototype =~ m/^(\w+\s+\w+\s+\w+\s+\w+\s*\*+)\s*([a-zA-Z0-9_~:]+)\s*\(([^\{]*)\)/ ||
$prototype =~ m/^(\w+\s+\w+\s*\*+\s*\w+\s*\*+\s*)\s*([a-zA-Z0-9_~:]+)\s*\(([^\{]*)\)/) {
} elsif ($prototype =~ m/^()($name)\s*$prototype_end/ ||
$prototype =~ m/^($type1)\s+($name)\s*$prototype_end/ ||
$prototype =~ m/^($type2+)\s*($name)\s*$prototype_end/) {
$return_type = $1;
$declaration_name = $2;
my $args = $3;
......@@ -2111,12 +2108,12 @@ sub process_name($$) {
} elsif (/$doc_decl/o) {
$identifier = $1;
my $is_kernel_comment = 0;
my $decl_start = qr{\s*\*};
my $decl_start = qr{$doc_com};
# test for pointer declaration type, foo * bar() - desc
my $fn_type = qr{\w+\s*\*\s*};
my $parenthesis = qr{\(\w*\)};
my $decl_end = qr{[-:].*};
if (/^$decl_start\s*([\w\s]+?)$parenthesis?\s*$decl_end?$/) {
if (/^$decl_start([\w\s]+?)$parenthesis?\s*$decl_end?$/) {
$identifier = $1;
}
if ($identifier =~ m/^(struct|union|enum|typedef)\b\s*(\S*)/) {
......@@ -2126,8 +2123,8 @@ sub process_name($$) {
}
# Look for foo() or static void foo() - description; or misspelt
# identifier
elsif (/^$decl_start\s*$fn_type?(\w+)\s*$parenthesis?\s*$decl_end?$/ ||
/^$decl_start\s*$fn_type?(\w+.*)$parenthesis?\s*$decl_end$/) {
elsif (/^$decl_start$fn_type?(\w+)\s*$parenthesis?\s*$decl_end?$/ ||
/^$decl_start$fn_type?(\w+.*)$parenthesis?\s*$decl_end$/) {
$identifier = $1;
$decl_type = 'function';
$identifier =~ s/^define\s+//;
......
......@@ -22,16 +22,18 @@ my $need = 0;
my $optional = 0;
my $need_symlink = 0;
my $need_sphinx = 0;
my $need_venv = 0;
my $need_pip = 0;
my $need_virtualenv = 0;
my $rec_sphinx_upgrade = 0;
my $install = "";
my $virtenv_dir = "";
my $python_cmd = "";
my $activate_cmd;
my $min_version;
my $cur_version;
my $rec_version = "1.7.9"; # PDF won't build here
my $min_pdf_version = "2.4.4"; # Min version where pdf builds
my $latest_avail_ver;
#
# Command line arguments
......@@ -319,10 +321,7 @@ sub check_sphinx()
return;
}
if ($cur_version lt $rec_version) {
$rec_sphinx_upgrade = 1;
return;
}
return if ($cur_version lt $rec_version);
# On version check mode, just assume Sphinx has all mandatory deps
exit (0) if ($version_check);
......@@ -701,6 +700,162 @@ sub deactivate_help()
printf "\tdeactivate\n";
}
sub get_virtenv()
{
my $ver;
my $min_activate = "$ENV{'PWD'}/${virtenv_prefix}${min_version}/bin/activate";
my @activates = glob "$ENV{'PWD'}/${virtenv_prefix}*/bin/activate";
@activates = sort {$b cmp $a} @activates;
foreach my $f (@activates) {
next if ($f lt $min_activate);
my $sphinx_cmd = $f;
$sphinx_cmd =~ s/activate/sphinx-build/;
next if (! -f $sphinx_cmd);
my $ver = get_sphinx_version($sphinx_cmd);
if ($need_sphinx && ($ver ge $min_version)) {
return ($f, $ver);
} elsif ($ver gt $cur_version) {
return ($f, $ver);
}
}
return ("", "");
}
sub recommend_sphinx_upgrade()
{
my $venv_ver;
# Avoid running sphinx-builds from venv if $cur_version is good
if ($cur_version && ($cur_version ge $rec_version)) {
$latest_avail_ver = $cur_version;
return;
}
# Get the highest version from sphinx_*/bin/sphinx-build and the
# corresponding command to activate the venv/virtenv
($activate_cmd, $venv_ver) = get_virtenv();
# Store the highest version from Sphinx existing virtualenvs
if (($activate_cmd ne "") && ($venv_ver gt $cur_version)) {
$latest_avail_ver = $venv_ver;
} else {
$latest_avail_ver = $cur_version if ($cur_version);
}
# As we don't know package version of Sphinx, and there's no
# virtual environments, don't check if upgrades are needed
if (!$virtualenv) {
return if (!$latest_avail_ver);
}
# Either there are already a virtual env or a new one should be created
$need_pip = 1;
# Return if the reason is due to an upgrade or not
if ($latest_avail_ver lt $rec_version) {
$rec_sphinx_upgrade = 1;
}
}
#
# The logic here is complex, as it has to deal with different versions:
# - minimal supported version;
# - minimal PDF version;
# - recommended version.
# It also needs to work fine with both distro's package and venv/virtualenv
sub recommend_sphinx_version($)
{
my $virtualenv_cmd = shift;
if ($latest_avail_ver lt $min_pdf_version) {
print "note: If you want pdf, you need at least Sphinx $min_pdf_version.\n";
}
# Version is OK. Nothing to do.
return if ($cur_version && ($cur_version ge $rec_version));
if (!$need_sphinx) {
# sphinx-build is present and its version is >= $min_version
#only recommend enabling a newer virtenv version if makes sense.
if ($latest_avail_ver gt $cur_version) {
printf "\nYou may also use the newer Sphinx version $latest_avail_ver with:\n";
printf "\tdeactivate\n" if ($ENV{'PWD'} =~ /${virtenv_prefix}/);
printf "\t. $activate_cmd\n";
deactivate_help();
return;
}
return if ($latest_avail_ver ge $rec_version);
}
if (!$virtualenv) {
# No sphinx either via package or via virtenv. As we can't
# Compare the versions here, just return, recommending the
# user to install it from the package distro.
return if (!$latest_avail_ver);
# User doesn't want a virtenv recommendation, but he already
# installed one via virtenv with a newer version.
# So, print commands to enable it
if ($latest_avail_ver gt $cur_version) {
printf "\nYou may also use the Sphinx virtualenv version $latest_avail_ver with:\n";
printf "\tdeactivate\n" if ($ENV{'PWD'} =~ /${virtenv_prefix}/);
printf "\t. $activate_cmd\n";
deactivate_help();
return;
}
print "\n";
} else {
$need++ if ($need_sphinx);
}
# Suggest newer versions if current ones are too old
if ($latest_avail_ver && $cur_version ge $min_version) {
# If there's a good enough version, ask the user to enable it
if ($latest_avail_ver ge $rec_version) {
printf "\nNeed to activate Sphinx (version $latest_avail_ver) on virtualenv with:\n";
printf "\t. $activate_cmd\n";
deactivate_help();
return;
}
# Version is above the minimal required one, but may be
# below the recommended one. So, print warnings/notes
if ($latest_avail_ver lt $rec_version) {
print "Warning: It is recommended at least Sphinx version $rec_version.\n";
}
}
# At this point, either it needs Sphinx or upgrade is recommended,
# both via pip
if ($rec_sphinx_upgrade) {
if (!$virtualenv) {
print "Instead of install/upgrade Python Sphinx pkg, you could use pip/pypi with:\n\n";
} else {
print "To upgrade Sphinx, use:\n\n";
}
} else {
print "Sphinx needs to be installed either as a package or via pip/pypi with:\n";
}
$python_cmd = find_python_no_venv();
printf "\t$virtualenv_cmd $virtenv_dir\n";
printf "\t. $virtenv_dir/bin/activate\n";
printf "\tpip install -r $requirement_file\n";
deactivate_help();
}
sub check_needs()
{
# Check if Sphinx is already accessible from current environment
......@@ -722,15 +877,14 @@ sub check_needs()
if ($virtualenv) {
my $tmp = qx($python_cmd --version 2>&1);
if ($tmp =~ m/(\d+\.)(\d+\.)/) {
if ($1 >= 3 && $2 >= 3) {
$need_venv = 1; # python 3.3 or upper
} else {
$need_virtualenv = 1;
}
if ($1 < 3) {
# Fail if it finds python2 (or worse)
die "Python 3 is required to build the kernel docs\n";
}
if ($1 == 3 && $2 < 3) {
# Need Python 3.3 or upper for venv
$need_virtualenv = 1;
}
} else {
die "Warning: couldn't identify $python_cmd version!";
}
......@@ -739,14 +893,22 @@ sub check_needs()
}
}
# Set virtualenv command line, if python < 3.3
recommend_sphinx_upgrade();
my $virtualenv_cmd;
if ($need_virtualenv) {
$virtualenv_cmd = findprog("virtualenv-3");
$virtualenv_cmd = findprog("virtualenv-3.5") if (!$virtualenv_cmd);
if (!$virtualenv_cmd) {
check_program("virtualenv", 0);
$virtualenv_cmd = "virtualenv";
if ($need_pip) {
# Set virtualenv command line, if python < 3.3
if ($need_virtualenv) {
$virtualenv_cmd = findprog("virtualenv-3");
$virtualenv_cmd = findprog("virtualenv-3.5") if (!$virtualenv_cmd);
if (!$virtualenv_cmd) {
check_program("virtualenv", 0);
$virtualenv_cmd = "virtualenv";
}
} else {
$virtualenv_cmd = "$python_cmd -m venv";
check_python_module("ensurepip", 0);
}
}
......@@ -763,10 +925,6 @@ sub check_needs()
check_program("rsvg-convert", 2) if ($pdf);
check_program("latexmk", 2) if ($pdf);
if ($need_sphinx || $rec_sphinx_upgrade) {
check_python_module("ensurepip", 0) if ($need_venv);
}
# Do distro-specific checks and output distro-install commands
check_distros();
......@@ -784,67 +942,7 @@ sub check_needs()
which("sphinx-build-3");
}
# NOTE: if the system has a too old Sphinx version installed,
# it will recommend installing a newer version using virtualenv
if ($need_sphinx || $rec_sphinx_upgrade) {
my $min_activate = "$ENV{'PWD'}/${virtenv_prefix}${min_version}/bin/activate";
my @activates = glob "$ENV{'PWD'}/${virtenv_prefix}*/bin/activate";
if ($cur_version lt $rec_version) {
print "Warning: It is recommended at least Sphinx version $rec_version.\n";
print " If you want pdf, you need at least $min_pdf_version.\n";
}
if ($cur_version lt $min_pdf_version) {
print "Note: It is recommended at least Sphinx version $min_pdf_version if you need PDF support.\n";
}
@activates = sort {$b cmp $a} @activates;
my ($activate, $ver);
foreach my $f (@activates) {
next if ($f lt $min_activate);
my $sphinx_cmd = $f;
$sphinx_cmd =~ s/activate/sphinx-build/;
next if (! -f $sphinx_cmd);
$ver = get_sphinx_version($sphinx_cmd);
if ($need_sphinx && ($ver ge $min_version)) {
$activate = $f;
last;
} elsif ($ver gt $cur_version) {
$activate = $f;
last;
}
}
if ($activate ne "") {
if ($need_sphinx) {
printf "\nNeed to activate Sphinx (version $ver) on virtualenv with:\n";
printf "\t. $activate\n";
deactivate_help();
exit (1);
} else {
printf "\nYou may also use a newer Sphinx (version $ver) with:\n";
printf "\tdeactivate && . $activate\n";
}
} else {
my $rec_activate = "$virtenv_dir/bin/activate";
print "To upgrade Sphinx, use:\n\n" if ($rec_sphinx_upgrade);
$python_cmd = find_python_no_venv();
if ($need_venv) {
printf "\t$python_cmd -m venv $virtenv_dir\n";
} else {
printf "\t$virtualenv_cmd $virtenv_dir\n";
}
printf "\t. $rec_activate\n";
printf "\tpip install -r $requirement_file\n";
deactivate_help();
$need++ if (!$rec_sphinx_upgrade);
}
}
recommend_sphinx_version($virtualenv_cmd);
printf "\n";
print "All optional dependencies are met.\n" if (!$optional);
......
......@@ -196,7 +196,7 @@ else
fi
echo "For a more detailed explanation of the various taint flags see"
echo " Documentation/admin-guide/tainted-kernels.rst in the the Linux kernel sources"
echo " Documentation/admin-guide/tainted-kernels.rst in the Linux kernel sources"
echo " or https://kernel.org/doc/html/latest/admin-guide/tainted-kernels.html"
echo "Raw taint value as int/string: $taint/'$out'"
#EOF#