Commit b6c17ea4 authored by Rusty Russell's avatar Rusty Russell Committed by Linus Torvalds

[PATCH] Update Documentation/DocBook/kernel-hacking.tmpl

Update the hacking guide, before CONFIG_PREEMPT_RT goes in and it needs
rewriting again.

Changes include modernization of quotes, removal of most references to
bottom halves (some mention required because we still use bh in places to
mean softirq).

It would be nice to have a discussion of sparse and various annotations.
Please send patches straight to akpm.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (authored)
Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
parent f92273c1
......@@ -8,8 +8,7 @@
<authorgroup>
<author>
<firstname>Paul</firstname>
<othername>Rusty</othername>
<firstname>Rusty</firstname>
<surname>Russell</surname>
<affiliation>
<address>
......@@ -20,7 +19,7 @@
</authorgroup>
<copyright>
<year>2001</year>
<year>2005</year>
<holder>Rusty Russell</holder>
</copyright>
......@@ -64,7 +63,7 @@
<chapter id="introduction">
<title>Introduction</title>
<para>
Welcome, gentle reader, to Rusty's Unreliable Guide to Linux
Welcome, gentle reader, to Rusty's Remarkably Unreliable Guide to Linux
Kernel Hacking. This document describes the common routines and
general requirements for kernel code: its goal is to serve as a
primer for Linux kernel development for experienced C
......@@ -96,13 +95,13 @@
<listitem>
<para>
not associated with any process, serving a softirq, tasklet or bh;
not associated with any process, serving a softirq or tasklet;
</para>
</listitem>
<listitem>
<para>
running in kernel space, associated with a process;
running in kernel space, associated with a process (user context);
</para>
</listitem>
......@@ -114,11 +113,12 @@
</itemizedlist>
<para>
There is a strict ordering between these: other than the last
category (userspace) each can only be pre-empted by those above.
For example, while a softirq is running on a CPU, no other
softirq will pre-empt it, but a hardware interrupt can. However,
any other CPUs in the system execute independently.
There is an ordering between these. The bottom two can preempt
each other, but above that is a strict hierarchy: each can only be
preempted by the ones above it. For example, while a softirq is
running on a CPU, no other softirq will preempt it, but a hardware
interrupt can. However, any other CPUs in the system execute
independently.
</para>
<para>
......@@ -130,10 +130,10 @@
<title>User Context</title>
<para>
User context is when you are coming in from a system call or
other trap: you can sleep, and you own the CPU (except for
interrupts) until you call <function>schedule()</function>.
In other words, user context (unlike userspace) is not pre-emptable.
User context is when you are coming in from a system call or other
trap: like userspace, you can be preempted by more important tasks
and by interrupts. You can sleep, by calling
<function>schedule()</function>.
</para>
<note>
......@@ -153,7 +153,7 @@
<caution>
<para>
Beware that if you have interrupts or bottom halves disabled
Beware that if you have preemption or softirqs disabled
(see below), <function>in_interrupt()</function> will return a
false positive.
</para>
......@@ -168,10 +168,10 @@
<hardware>keyboard</hardware> are examples of real
hardware which produce interrupts at any time. The kernel runs
interrupt handlers, which services the hardware. The kernel
guarantees that this handler is never re-entered: if another
guarantees that this handler is never re-entered: if the same
interrupt arrives, it is queued (or dropped). Because it
disables interrupts, this handler has to be fast: frequently it
simply acknowledges the interrupt, marks a `software interrupt'
simply acknowledges the interrupt, marks a 'software interrupt'
for execution and exits.
</para>
......@@ -188,60 +188,52 @@
</sect1>
<sect1 id="basics-softirqs">
<title>Software Interrupt Context: Bottom Halves, Tasklets, softirqs</title>
<title>Software Interrupt Context: Softirqs and Tasklets</title>
<para>
Whenever a system call is about to return to userspace, or a
hardware interrupt handler exits, any `software interrupts'
hardware interrupt handler exits, any 'software interrupts'
which are marked pending (usually by hardware interrupts) are
run (<filename>kernel/softirq.c</filename>).
</para>
<para>
Much of the real interrupt handling work is done here. Early in
the transition to <acronym>SMP</acronym>, there were only `bottom
the transition to <acronym>SMP</acronym>, there were only 'bottom
halves' (BHs), which didn't take advantage of multiple CPUs. Shortly
after we switched from wind-up computers made of match-sticks and snot,
we abandoned this limitation.
we abandoned this limitation and switched to 'softirqs'.
</para>
<para>
<filename class="headerfile">include/linux/interrupt.h</filename> lists the
different BH's. No matter how many CPUs you have, no two BHs will run at
the same time. This made the transition to SMP simpler, but sucks hard for
scalable performance. A very important bottom half is the timer
BH (<filename class="headerfile">include/linux/timer.h</filename>): you
can register to have it call functions for you in a given length of time.
different softirqs. A very important softirq is the
timer softirq (<filename
class="headerfile">include/linux/timer.h</filename>): you can
register to have it call functions for you in a given length of
time.
</para>
<para>
2.3.43 introduced softirqs, and re-implemented the (now
deprecated) BHs underneath them. Softirqs are fully-SMP
versions of BHs: they can run on as many CPUs at once as
required. This means they need to deal with any races in shared
data using their own locks. A bitmask is used to keep track of
which are enabled, so the 32 available softirqs should not be
used up lightly. (<emphasis>Yes</emphasis>, people will
notice).
</para>
<para>
tasklets (<filename class="headerfile">include/linux/interrupt.h</filename>)
are like softirqs, except they are dynamically-registrable (meaning you
can have as many as you want), and they also guarantee that any tasklet
will only run on one CPU at any time, although different tasklets can
run simultaneously (unlike different BHs).
Softirqs are often a pain to deal with, since the same softirq
will run simultaneously on more than one CPU. For this reason,
tasklets (<filename
class="headerfile">include/linux/interrupt.h</filename>) are more
often used: they are dynamically-registrable (meaning you can have
as many as you want), and they also guarantee that any tasklet
will only run on one CPU at any time, although different tasklets
can run simultaneously.
</para>
<caution>
<para>
The name `tasklet' is misleading: they have nothing to do with `tasks',
The name 'tasklet' is misleading: they have nothing to do with 'tasks',
and probably more to do with some bad vodka Alexey Kuznetsov had at the
time.
</para>
</caution>
<para>
You can tell you are in a softirq (or bottom half, or tasklet)
You can tell you are in a softirq (or tasklet)
using the <function>in_softirq()</function> macro
(<filename class="headerfile">include/linux/interrupt.h</filename>).
</para>
......@@ -288,11 +280,10 @@
<term>A rigid stack limit</term>
<listitem>
<para>
The kernel stack is about 6K in 2.2 (for most
architectures: it's about 14K on the Alpha), and shared
with interrupts so you can't use it all. Avoid deep
recursion and huge local arrays on the stack (allocate
them dynamically instead).
Depending on configuration options the kernel stack is about 3K to 6K for most 32-bit architectures: it's
about 14K on most 64-bit archs, and often shared with interrupts
so you can't use it all. Avoid deep recursion and huge local
arrays on the stack (allocate them dynamically instead).
</para>
</listitem>
</varlistentry>
......@@ -339,7 +330,7 @@ asmlinkage long sys_mycall(int arg)
<para>
If all your routine does is read or write some parameter, consider
implementing a <function>sysctl</function> interface instead.
implementing a <function>sysfs</function> interface instead.
</para>
<para>
......@@ -417,7 +408,10 @@ cond_resched(); /* Will sleep */
</para>
<para>
You will eventually lock up your box if you break these rules.
You should always compile your kernel
<symbol>CONFIG_DEBUG_SPINLOCK_SLEEP</symbol> on, and it will warn
you if you break these rules. If you <emphasis>do</emphasis> break
the rules, you will eventually lock up your box.
</para>
<para>
......@@ -515,8 +509,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
success).
</para>
</caution>
[Yes, this moronic interface makes me cringe. Please submit a
patch and become my hero --RR.]
[Yes, this moronic interface makes me cringe. The flamewar comes up every year or so. --RR.]
</para>
<para>
The functions may sleep implicitly. This should never be called
......@@ -587,10 +580,11 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
</variablelist>
<para>
If you see a <errorname>kmem_grow: Called nonatomically from int
</errorname> warning message you called a memory allocation function
from interrupt context without <constant>GFP_ATOMIC</constant>.
You should really fix that. Run, don't walk.
If you see a <errorname>sleeping function called from invalid
context</errorname> warning message, then maybe you called a
sleeping allocation function from interrupt context without
<constant>GFP_ATOMIC</constant>. You should really fix that.
Run, don't walk.
</para>
<para>
......@@ -639,16 +633,16 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
</sect1>
<sect1 id="routines-udelay">
<title><function>udelay()</function>/<function>mdelay()</function>
<title><function>mdelay()</function>/<function>udelay()</function>
<filename class="headerfile">include/asm/delay.h</filename>
<filename class="headerfile">include/linux/delay.h</filename>
</title>
<para>
The <function>udelay()</function> function can be used for small pauses.
Do not use large values with <function>udelay()</function> as you risk
The <function>udelay()</function> and <function>ndelay()</function> functions can be used for small pauses.
Do not use large values with them as you risk
overflow - the helper function <function>mdelay()</function> is useful
here, or even consider <function>schedule_timeout()</function>.
here, or consider <function>msleep()</function>.
</para>
</sect1>
......@@ -698,8 +692,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
These routines disable soft interrupts on the local CPU, and
restore them. They are reentrant; if soft interrupts were
disabled before, they will still be disabled after this pair
of functions has been called. They prevent softirqs, tasklets
and bottom halves from running on the current CPU.
of functions has been called. They prevent softirqs and tasklets
from running on the current CPU.
</para>
</sect1>
......@@ -708,10 +702,16 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
<filename class="headerfile">include/asm/smp.h</filename></title>
<para>
<function>smp_processor_id()</function> returns the current
processor number, between 0 and <symbol>NR_CPUS</symbol> (the
maximum number of CPUs supported by Linux, currently 32). These
values are not necessarily continuous.
<function>get_cpu()</function> disables preemption (so you won't
suddenly get moved to another CPU) and returns the current
processor number, between 0 and <symbol>NR_CPUS</symbol>. Note
that the CPU numbers are not necessarily continuous. You return
it again with <function>put_cpu()</function> when you are done.
</para>
<para>
If you know you cannot be preempted by another task (ie. you are
in interrupt context, or have preemption disabled) you can use
smp_processor_id().
</para>
</sect1>
......@@ -722,19 +722,14 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
<para>
After boot, the kernel frees up a special section; functions
marked with <type>__init</type> and data structures marked with
<type>__initdata</type> are dropped after boot is complete (within
modules this directive is currently ignored). <type>__exit</type>
<type>__initdata</type> are dropped after boot is complete: similarly
modules discard this memory after initialization. <type>__exit</type>
is used to declare a function which is only required on exit: the
function will be dropped if this file is not compiled as a module.
See the header file for use. Note that it makes no sense for a function
marked with <type>__init</type> to be exported to modules with
<function>EXPORT_SYMBOL()</function> - this will break.
</para>
<para>
Static data structures marked as <type>__initdata</type> must be initialised
(as opposed to ordinary static data which is zeroed BSS) and cannot be
<type>const</type>.
</para>
</sect1>
......@@ -762,9 +757,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
<para>
The function can return a negative error number to cause
module loading to fail (unfortunately, this has no effect if
the module is compiled into the kernel). For modules, this is
called in user context, with interrupts enabled, and the
kernel lock held, so it can sleep.
the module is compiled into the kernel). This function is
called in user context with interrupts enabled, so it can sleep.
</para>
</sect1>
......@@ -779,6 +773,34 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
reached zero. This function can also sleep, but cannot fail:
everything must be cleaned up by the time it returns.
</para>
<para>
Note that this macro is optional: if it is not present, your
module will not be removable (except for 'rmmod -f').
</para>
</sect1>
<sect1 id="routines-module-use-counters">
<title> <function>try_module_get()</function>/<function>module_put()</function>
<filename class="headerfile">include/linux/module.h</filename></title>
<para>
These manipulate the module usage count, to protect against
removal (a module also can't be removed if another module uses one
of its exported symbols: see below). Before calling into module
code, you should call <function>try_module_get()</function> on
that module: if it fails, then the module is being removed and you
should act as if it wasn't there. Otherwise, you can safely enter
the module, and call <function>module_put()</function> when you're
finished.
</para>
<para>
Most registerable structures have an
<structfield>owner</structfield> field, such as in the
<structname>file_operations</structname> structure. Set this field
to the macro <symbol>THIS_MODULE</symbol>.
</para>
</sect1>
<!-- add info on new-style module refcounting here -->
......@@ -821,7 +843,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
There is a macro to do this:
<function>wait_event_interruptible()</function>
<filename class="headerfile">include/linux/sched.h</filename> The
<filename class="headerfile">include/linux/wait.h</filename> The
first argument is the wait queue head, and the second is an
expression which is evaluated; the macro returns
<returnvalue>0</returnvalue> when this expression is true, or
......@@ -847,10 +869,11 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
<para>
Call <function>wake_up()</function>
<filename class="headerfile">include/linux/sched.h</filename>;,
<filename class="headerfile">include/linux/wait.h</filename>;,
which will wake up every process in the queue. The exception is
if one has <constant>TASK_EXCLUSIVE</constant> set, in which case
the remainder of the queue will not be woken.
the remainder of the queue will not be woken. There are other variants
of this basic function available in the same header.
</para>
</sect1>
</chapter>
......@@ -863,7 +886,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
first class of operations work on <type>atomic_t</type>
<filename class="headerfile">include/asm/atomic.h</filename>; this
contains a signed integer (at least 24 bits long), and you must use
contains a signed integer (at least 32 bits long), and you must use
these functions to manipulate or read atomic_t variables.
<function>atomic_read()</function> and
<function>atomic_set()</function> get and set the counter,
......@@ -882,13 +905,12 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
<para>
Note that these functions are slower than normal arithmetic, and
so should not be used unnecessarily. On some platforms they
are much slower, like 32-bit Sparc where they use a spinlock.
so should not be used unnecessarily.
</para>
<para>
The second class of atomic operations is atomic bit operations on a
<type>long</type>, defined in
The second class of atomic operations is atomic bit operations on an
<type>unsigned long</type>, defined in
<filename class="headerfile">include/linux/bitops.h</filename>. These
operations generally take a pointer to the bit pattern, and a bit
......@@ -899,7 +921,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
<function>test_and_clear_bit()</function> and
<function>test_and_change_bit()</function> do the same thing,
except return true if the bit was previously set; these are
particularly useful for very simple locking.
particularly useful for atomically setting flags.
</para>
<para>
......@@ -907,12 +929,6 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
than BITS_PER_LONG. The resulting behavior is strange on big-endian
platforms though so it is a good idea not to do this.
</para>
<para>
Note that the order of bits depends on the architecture, and in
particular, the bitfield passed to these operations must be at
least as large as a <type>long</type>.
</para>
</chapter>
<chapter id="symbols">
......@@ -932,11 +948,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
<filename class="headerfile">include/linux/module.h</filename></title>
<para>
This is the classic method of exporting a symbol, and it works
for both modules and non-modules. In the kernel all these
declarations are often bundled into a single file to help
genksyms (which searches source files for these declarations).
See the comment on genksyms and Makefiles below.
This is the classic method of exporting a symbol: dynamically
loaded modules will be able to use the symbol as normal.
</para>
</sect1>
......@@ -949,7 +962,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
symbols exported by <function>EXPORT_SYMBOL_GPL()</function> can
only be seen by modules with a
<function>MODULE_LICENSE()</function> that specifies a GPL
compatible license.
compatible license. It implies that the function is considered
an internal implementation issue, and not really an interface.
</para>
</sect1>
</chapter>
......@@ -962,12 +976,13 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
<filename class="headerfile">include/linux/list.h</filename></title>
<para>
There are three sets of linked-list routines in the kernel
headers, but this one seems to be winning out (and Linus has
used it). If you don't have some particular pressing need for
a single list, it's a good choice. In fact, I don't care
whether it's a good choice or not, just use it so we can get
rid of the others.
There used to be three sets of linked-list routines in the kernel
headers, but this one is the winner. If you don't have some
particular pressing need for a single list, it's a good choice.
</para>
<para>
In particular, <function>list_for_each_entry</function> is useful.
</para>
</sect1>
......@@ -979,14 +994,13 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
convention, and return <returnvalue>0</returnvalue> for success,
and a negative error number
(eg. <returnvalue>-EFAULT</returnvalue>) for failure. This can be
unintuitive at first, but it's fairly widespread in the networking
code, for example.
unintuitive at first, but it's fairly widespread in the kernel.
</para>
<para>
The filesystem code uses <function>ERR_PTR()</function>
Using <function>ERR_PTR()</function>
<filename class="headerfile">include/linux/fs.h</filename>; to
<filename class="headerfile">include/linux/err.h</filename>; to
encode a negative error number into a pointer, and
<function>IS_ERR()</function> and <function>PTR_ERR()</function>
to get it back out again: avoids a separate pointer parameter for
......@@ -1040,7 +1054,7 @@ static struct block_device_operations opt_fops = {
supported, due to lack of general use, but the following are
considered standard (see the GCC info page section "C
Extensions" for more details - Yes, really the info page, the
man page is only a short summary of the stuff in info):
man page is only a short summary of the stuff in info).
</para>
<itemizedlist>
<listitem>
......@@ -1091,7 +1105,7 @@ static struct block_device_operations opt_fops = {
</listitem>
<listitem>
<para>
Function names as strings (__FUNCTION__)
Function names as strings (__func__).
</para>
</listitem>
<listitem>
......@@ -1164,63 +1178,35 @@ static struct block_device_operations opt_fops = {
<listitem>
<para>
Usually you want a configuration option for your kernel hack.
Edit <filename>Config.in</filename> in the appropriate directory
(but under <filename>arch/</filename> it's called
<filename>config.in</filename>). The Config Language used is not
bash, even though it looks like bash; the safe way is to use only
the constructs that you already see in
<filename>Config.in</filename> files (see
<filename>Documentation/kbuild/kconfig-language.txt</filename>).
It's good to run "make xconfig" at least once to test (because
it's the only one with a static parser).
</para>
<para>
Variables which can be Y or N use <type>bool</type> followed by a
tagline and the config define name (which must start with
CONFIG_). The <type>tristate</type> function is the same, but
allows the answer M (which defines
<symbol>CONFIG_foo_MODULE</symbol> in your source, instead of
<symbol>CONFIG_FOO</symbol>) if <symbol>CONFIG_MODULES</symbol>
is enabled.
Edit <filename>Kconfig</filename> in the appropriate directory.
The Config language is simple to use by cut and paste, and there's
complete documentation in
<filename>Documentation/kbuild/kconfig-language.txt</filename>.
</para>
<para>
You may well want to make your CONFIG option only visible if
<symbol>CONFIG_EXPERIMENTAL</symbol> is enabled: this serves as a
warning to users. There many other fancy things you can do: see
the various <filename>Config.in</filename> files for ideas.
the various <filename>Kconfig</filename> files for ideas.
</para>
</listitem>
<listitem>
<para>
Edit the <filename>Makefile</filename>: the CONFIG variables are
exported here so you can conditionalize compilation with `ifeq'.
If your file exports symbols then add the names to
<varname>export-objs</varname> so that genksyms will find them.
<caution>
<para>
There is a restriction on the kernel build system that objects
which export symbols must have globally unique names.
If your object does not have a globally unique name then the
standard fix is to move the
<function>EXPORT_SYMBOL()</function> statements to their own
object with a unique name.
This is why several systems have separate exporting objects,
usually suffixed with ksyms.
</para>
</caution>
In your description of the option, make sure you address both the
expert user and the user who knows nothing about your feature. Mention
incompatibilities and issues here. <emphasis> Definitely
</emphasis> end your description with <quote> if in doubt, say N
</quote> (or, occasionally, `Y'); this is for people who have no
idea what you are talking about.
</para>
</listitem>
<listitem>
<para>
Document your option in Documentation/Configure.help. Mention
incompatibilities and issues here. <emphasis> Definitely
</emphasis> end your description with <quote> if in doubt, say N
</quote> (or, occasionally, `Y'); this is for people who have no
idea what you are talking about.
Edit the <filename>Makefile</filename>: the CONFIG variables are
exported here so you can usually just add a "obj-$(CONFIG_xxx) +=
xxx.o" line. The syntax is documented in
<filename>Documentation/kbuild/makefiles.txt</filename>.
</para>
</listitem>
......@@ -1253,20 +1239,12 @@ static struct block_device_operations opt_fops = {
</para>
<para>
<filename>include/linux/brlock.h:</filename>
<filename>include/asm-i386/delay.h:</filename>
</para>
<programlisting>
extern inline void br_read_lock (enum brlock_indices idx)
{
/*
* This causes a link-time bug message if an
* invalid index is used:
*/
if (idx >= __BR_END)
__br_lock_usage_bug();
read_lock(&amp;__brlock_array[smp_processor_id()][idx]);
}
#define ndelay(n) (__builtin_constant_p(n) ? \
((n) > 20000 ? __bad_ndelay() : __const_udelay((n) * 5ul)) : \
__ndelay(n))
</programlisting>
<para>
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment