Commit 0150aedd authored by Jonathan Corbet

Merge branch 'mauro' into docs-next

Mauro sez:

  There are lots of plain text documents under Documentation/filesystems.
  Manually convert several of those to ReST and add them to the index file.

Signed-off-by: Jonathan Corbet <corbet@lwn.net>
parents 3eb30c51 9a610812

.. SPDX-License-Identifier: GPL-2.0

=======================================
v9fs: Plan 9 Resource Sharing for Linux
=======================================

About
=====

v9fs is a Unix implementation of the Plan 9 9p remote filesystem protocol.

@@ -14,9 +17,11 @@ and Maya Gokhale. Additional development by Greg Watson

The best detailed explanation of the Linux implementation and applications of
the 9p client is available in the form of a USENIX paper:

    http://www.usenix.org/events/usenix05/tech/freenix/hensbergen.html

Other applications are described in the following papers:

    * XCPU & Clustering
      http://xcpu.org/papers/xcpu-talk.pdf
    * KVMFS: control file system for KVM

@@ -28,18 +33,18 @@ Other applications are described in the following papers:

    * VirtFS: A Virtualization Aware File System pass-through
      http://goo.gl/3WPDg

Usage
=====

For remote file server::

    mount -t 9p 10.10.1.2 /mnt/9

For Plan 9 From User Space applications (http://swtch.com/plan9)::

    mount -t 9p `namespace`/acme /mnt/9 -o trans=unix,uname=$USER

For server running on QEMU host with virtio transport::

    mount -t 9p -o trans=virtio <mount_tag> /mnt/9

@@ -48,18 +53,22 @@ mount points. Each 9P export is seen by the client as a virtio device with an

associated "mount_tag" property. Available mount tags can be
seen by reading /sys/bus/virtio/drivers/9pnet_virtio/virtio<n>/mount_tag files.

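A minimal C sketch of that lookup follows. It assumes a Linux guest with the 9pnet_virtio driver bound. The sketch builds the sysfs path for virtio device <n> and reads the tag. It then passes the tag to mount(2) with the same trans=virtio option as the shell example. The helper names (mount_tag_path, read_mount_tag, mount_9p_virtio) are hypothetical, and the mount call needs CAP_SYS_ADMIN.

```c
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/* Build ".../virtio<n>/mount_tag"; returns 0, or -1 if buf is too small. */
int mount_tag_path(char *buf, size_t len, unsigned int n)
{
    int r = snprintf(buf, len,
                     "/sys/bus/virtio/drivers/9pnet_virtio/virtio%u/mount_tag",
                     n);

    return (r < 0 || (size_t)r >= len) ? -1 : 0;
}

/* Read the tag of virtio device <n>; strips a trailing newline if present. */
int read_mount_tag(unsigned int n, char *tag, size_t len)
{
    char path[128];
    FILE *f;

    if (len < 2 || mount_tag_path(path, sizeof(path), n) < 0)
        return -1;
    f = fopen(path, "r");
    if (f == NULL)
        return -1;                      /* no such device or no 9p export */
    if (fgets(tag, (int)len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    tag[strcspn(tag, "\n")] = '\0';
    return 0;
}

/* Equivalent of: mount -t 9p -o trans=virtio <mount_tag> <target> */
int mount_9p_virtio(const char *tag, const char *target)
{
    return mount(tag, target, "9p", 0, "trans=virtio");
}
```

For example, read_mount_tag(0, tag, sizeof(tag)) followed by mount_9p_virtio(tag, "/mnt/9") would mount the export behind the first virtio device.
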
Options
=======

============= ===============================================================
trans=name    select an alternative transport. Valid options are
              currently:

              ======== ============================================
              unix     specifying a named pipe mount point
              tcp      specifying a normal TCP/IP connection
              fd       use passed file descriptors for connection
                       (see rfdno and wfdno)
              virtio   connect to the next virtio channel available
                       (from QEMU with trans_virtio module)
              rdma     connect to a specified RDMA channel
              ======== ============================================

uname=name    user name to attempt mount as on the remote server. The
              server may override or ignore this value. Certain user
@@ -69,28 +78,36 @@ OPTIONS
              offering several exported file systems.

cache=mode    specifies a caching policy. By default, no caches are used.

              none
                      default no cache policy, metadata and data
                      alike are synchronous.
              loose
                      no attempts are made at consistency,
                      intended for exclusive, read-only mounts
              fscache
                      use FS-Cache for a persistent, read-only
                      cache backend.
              mmap
                      minimal cache that is only used for read-write
                      mmap. Nothing else is cached, like cache=none

debug=n       specifies debug level. The debug level is a bitmask.

              ===== ================================
              0x01  display verbose error messages
              0x02  developer debug (DEBUG_CURRENT)
              0x04  display 9p trace
              0x08  display VFS trace
              0x10  display Marshalling debug
              0x20  display RPC debug
              0x40  display transport debug
              0x80  display allocation debug
              0x100 display protocol message debug
              0x200 display Fid debug
              0x400 display packet debug
              0x800 display fscache tracing debug
              ===== ================================

rfdno=n       the file descriptor for reading with trans=fd

@@ -103,9 +120,12 @@ OPTIONS

noextend      force legacy mode (no 9p2000.u or 9p2000.L semantics)

version=name  Select 9P protocol version. Valid options are:

              ======== ==============================
              9p2000   Legacy mode (same as noextend)
              9p2000.u Use 9P2000.u protocol
              9p2000.L Use 9P2000.L protocol
              ======== ==============================

dfltuid       attempt to mount as a particular uid

@@ -118,22 +138,27 @@ OPTIONS
              hosts. This functionality will be expanded in later versions.

access        there are four access modes.

              user
                      if a user tries to access a file on v9fs
                      filesystem for the first time, v9fs sends an
                      attach command (Tattach) for that user.
                      This is the default mode.
              <uid>
                      allows only user with uid=<uid> to access
                      the files on the mounted filesystem
              any
                      v9fs does single attach and performs all
                      operations as one user
              client
                      ACL based access check on the 9p client
                      side for access validation

cachetag      cache tag to use the specified persistent cache.
              cache tags for existing cache sessions can be listed at
              /sys/fs/9p/caches. (applies only to cache=fscache)
============= ===============================================================
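The options above reach the kernel as one comma-separated string. The value of debug=n is the OR of the bits in its table. The C sketch below composes such a string; it is a hypothetical helper, not part of the v9fs sources. The DBG_* names are made up for illustration. The debug value is printed in decimal, so the sketch does not depend on how the kernel parses hexadecimal.

```c
#include <stdio.h>
#include <string.h>

/* Debug bits taken from the debug=n table; the names are hypothetical. */
enum p9_debug_bit {
    DBG_ERROR   = 0x01,   /* verbose error messages */
    DBG_9P      = 0x04,   /* 9p trace */
    DBG_VFS     = 0x08,   /* VFS trace */
    DBG_TRANS   = 0x40,   /* transport debug */
    DBG_FSCACHE = 0x800,  /* fscache tracing debug */
};

/* Append "key=val", preceded by ',' unless it is the first option. */
static int opt_append(char *buf, size_t len, size_t *used,
                      const char *key, const char *val)
{
    int r;

    if (val == NULL || *val == '\0')
        return 0;                       /* option not requested */
    r = snprintf(buf + *used, len - *used, "%s%s=%s",
                 *used ? "," : "", key, val);
    if (r < 0 || (size_t)r >= len - *used)
        return -1;                      /* would be truncated */
    *used += (size_t)r;
    return 0;
}

/*
 * Build e.g. "trans=virtio,version=9p2000.L,cache=loose,debug=5".
 * NULL or "" skips an option and debug == 0 omits debug=.
 * Returns 0 on success, -1 if buf is too small.
 */
int p9_build_opts(char *buf, size_t len, const char *trans,
                  const char *version, const char *cache, unsigned int debug)
{
    size_t used = 0;
    char dbg[16] = "";

    if (len == 0)
        return -1;
    buf[0] = '\0';
    if (debug)
        snprintf(dbg, sizeof(dbg), "%u", debug);
    if (opt_append(buf, len, &used, "trans", trans) ||
        opt_append(buf, len, &used, "version", version) ||
        opt_append(buf, len, &used, "cache", cache) ||
        opt_append(buf, len, &used, "debug", dbg))
        return -1;
    return 0;
}
```

For instance, p9_build_opts(buf, sizeof(buf), "virtio", "9p2000.L", "loose", DBG_ERROR | DBG_9P) produces "trans=virtio,version=9p2000.L,cache=loose,debug=5". That string could be given after -o on the command line or as the data argument of mount(2).
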

Resources
=========

Protocol specifications are maintained on github:

@@ -158,4 +183,3 @@ http://plan9.bell-labs.com/plan9

For information on Plan 9 from User Space (Plan 9 applications and libraries
ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9

.. SPDX-License-Identifier: GPL-2.0

===============================
Acorn Disc Filing System - ADFS
===============================

Filesystems supported by ADFS
-----------------------------

@@ -25,6 +31,7 @@ directory updates, specifically updating the access mode and timestamp.

Mount options for ADFS
----------------------

============ ======================================================
uid=nnn      All files in the partition will be owned by
             user id nnn. Default 0 (root).
gid=nnn      All files in the partition will be in group
@@ -36,22 +43,23 @@ Mount options for ADFS
ftsuffix=n   When ftsuffix=0, no file type suffix will be applied.
             When ftsuffix=1, a hexadecimal suffix corresponding to
             the RISC OS file type will be added. Default 0.
============ ======================================================

Mapping of ADFS permissions to Linux permissions
------------------------------------------------

ADFS permissions consist of the following:

- Owner read
- Owner write
- Other read
- Other write

(In older versions, an 'execute' permission did exist, but this
does not hold the same meaning as the Linux 'execute' permission
and is now obsolete).

The mapping is performed as follows::

    Owner read -> -r--r--r--
    Owner write -> --w--w--w-

@@ -66,17 +74,18 @@ Mapping of ADFS permissions to Linux permissions

    Possible other mode permissions -> ----rwxrwx

Hence, with the default masks, if a file is owner read/write, and
not a UnixExec filetype, then the permissions will be::

    -rw-------

However, if the masks were ownmask=0770,othmask=0007, then this would
be modified to::

    -rw-rw----

There is no restriction on what you can do with these masks. You may
wish that either read bits give read access to the file for all, but
keep the default write protection (ownmask=0755,othmask=0577)::

    -rw-r--r--

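The mapping can be sketched in C as follows. This is an illustration, not the adfs driver. The ADFS_* attribute values are hypothetical, and directories and special file types are ignored. Owner write is taken to expand to --w--w--w-, which is what the worked examples require. Each ADFS bit expands to a user/group/other pattern and is then masked by ownmask or othmask. The defaults are assumed to be ownmask=0700 and othmask=0077.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical encodings of the four ADFS permission bits. */
#define ADFS_OWNER_READ  0x01u
#define ADFS_OWNER_WRITE 0x02u
#define ADFS_OTHER_READ  0x04u
#define ADFS_OTHER_WRITE 0x08u

/* Map the ADFS permissions of a regular file to Linux mode bits. */
mode_t adfs_to_linux_mode(unsigned int attr, int unixexec,
                          mode_t ownmask, mode_t othmask)
{
    /* A read bit expands to r for user, group and other; a UnixExec
     * file also gets x wherever it gets r. */
    mode_t rmask = S_IRUSR | S_IRGRP | S_IROTH;
    const mode_t wmask = S_IWUSR | S_IWGRP | S_IWOTH;
    mode_t mode = 0;

    if (unixexec)
        rmask |= S_IXUSR | S_IXGRP | S_IXOTH;
    if (attr & ADFS_OWNER_READ)
        mode |= rmask & ownmask;        /* owner bits go through ownmask */
    if (attr & ADFS_OWNER_WRITE)
        mode |= wmask & ownmask;
    if (attr & ADFS_OTHER_READ)
        mode |= rmask & othmask;        /* other bits go through othmask */
    if (attr & ADFS_OTHER_WRITE)
        mode |= wmask & othmask;
    return mode;
}
```

Take an owner read/write file. The three mask pairs used in the examples give 0600, 0660 and 0644, that is -rw-------, -rw-rw---- and -rw-r--r--.
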

.. SPDX-License-Identifier: GPL-2.0

=============================
Overview of Amiga Filesystems
=============================

Not all varieties of the Amiga filesystems are supported for reading and
writing. The Amiga currently knows six different filesystems:

============== ===============================================================
DOS\0          The old or original filesystem, not really suited for
               hard disks and normally not used on them, either.
               Supported read/write.
@@ -23,6 +27,7 @@ DOS\4 The original filesystem with directory cache. The directory
               sense on hard disks. Supported read only.

DOS\5          The Fast File System with directory cache. Supported read only.
============== ===============================================================

All of the above filesystems allow block sizes from 512 to 32K bytes.
Supported block sizes are: 512, 1024, 2048 and 4096 bytes. Larger blocks

@@ -36,14 +41,18 @@ are supported, too.

Mount options for the AFFS
==========================

protect
        If this option is set, the protection bits cannot be altered.

setuid[=uid]
        This sets the owner of all files and directories in the file
        system to uid or the uid of the current user, respectively.

setgid[=gid]
        Same as above, but for gid.

mode=mode
        Sets the mode flags to the given (octal) value, regardless
        of the original permissions. Directories will get an x
        permission if the corresponding r bit is set.
        This is useful since most of the plain AmigaOS files
@@ -53,33 +62,41 @@ nofilenametruncate
        The file system will return an error when filename exceeds
        standard maximum filename length (30 characters).

reserved=num
        Sets the number of reserved blocks at the start of the
        partition to num. You should never need this option.
        Default is 2.

root=block
        Sets the block number of the root block. This should never
        be necessary.

bs=blksize
        Sets the blocksize to blksize. Valid block sizes are 512,
        1024, 2048 and 4096. Like the root option, this should
        never be necessary, as the affs can figure it out itself.

quiet
        The file system will not return an error for disallowed
        mode changes.

verbose
        The volume name, file system type and block size will
        be written to the syslog when the filesystem is mounted.

mufs
        The filesystem is really a muFS, although it doesn't
        identify itself as one. This option is necessary if
        the filesystem wasn't formatted as muFS, but is used
        as one.

prefix=path
        Path will be prefixed to every absolute path name of
        symbolic links on an AFFS partition. Default = "/".
        (See below.)

volume=name
        When symbolic links with an absolute path are created
        on an AFFS partition, name will be prepended as the
        volume name. Default = "" (empty string).
        (See below.)

@@ -148,11 +165,13 @@ might be "User", "WB" and "Graphics", the mount points /amiga/User,

Examples
========

Command line::

    mount Archive/Amiga/Workbench3.1.adf /mnt -t affs -o loop,verbose
    mount /dev/sda3 /Amiga -t affs

/etc/fstab entry::

    /dev/sdb5 /amiga/Workbench affs noauto,user,exec,verbose 0 0

IMPORTANT NOTE

@@ -170,7 +189,8 @@ before booting Windows!

If the damage is already done, the following should fix the RDB
(where <disk> is the device name).

DO AT YOUR OWN RISK::

    dd if=/dev/<disk> of=rdb.tmp count=1
    cp rdb.tmp rdb.fixed

@@ -189,10 +209,14 @@ By default, filenames are truncated to 30 characters without warning.

'nofilenametruncate' mount option can change that behavior.

Case is ignored by the affs in filename matching, but Linux shells
do care about the case. Example (with /wb being an affs mounted fs)::

    rm /wb/WRONGCASE

will remove /wb/wrongcase, but::

    rm /wb/WR*

will not since the names are matched by the shell.

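The difference can be modelled in a few lines of C. This is a sketch under stated assumptions, not the affs lookup code. Names are compared after truncation to 30 characters (the default without nofilenametruncate), with ASCII-only case folding. The shell side is modelled with POSIX fnmatch(3). Like the shell's glob expansion, fnmatch(3) matches case-sensitively against the names read from the directory.

```c
#include <ctype.h>
#include <fnmatch.h>
#include <stddef.h>

#define AFFS_NAME_MAX 30    /* default maximum filename length */

/* Filesystem side: 1 if a and b name the same entry, 0 otherwise. */
int affs_names_match(const char *a, const char *b)
{
    size_t i;

    for (i = 0; i < AFFS_NAME_MAX; i++) {
        unsigned char ca = (unsigned char)a[i];
        unsigned char cb = (unsigned char)b[i];

        if (toupper(ca) != toupper(cb))
            return 0;
        if (ca == '\0')             /* both names ended here */
            return 1;
    }
    return 1;                       /* equal up to the truncation limit */
}

/* Shell side: glob expansion is case-sensitive. */
int shell_glob_matches(const char *pattern, const char *name)
{
    return fnmatch(pattern, name, 0) == 0;
}
```

affs_names_match("WRONGCASE", "wrongcase") returns 1, so the plain rm finds the file. shell_glob_matches("WR*", "wrongcase") returns 0, so the glob never expands to it.
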
The block allocation is designed for hard disk partitions. If more

@@ -219,4 +243,4 @@ due to an incompatibility with the Amiga floppy controller.

If you are interested in an Amiga Emulator for Linux, look at

http://web.archive.org/web/%2E/http://www.freiburg.linux.de/~uae/

.. SPDX-License-Identifier: GPL-2.0

====================
kAFS: AFS FILESYSTEM
====================

.. Contents:

 - Overview.
 - Usage.

@@ -14,8 +16,7 @@ Contents:

 - The @sys substitution.

Overview
========

This filesystem provides a fairly simple secure AFS filesystem driver. It is

@@ -35,35 +36,33 @@ It does not yet support the following AFS features:

 (*) pioctl() system call.

Compilation
===========

The filesystem should be enabled by turning on the kernel configuration
options::

    CONFIG_AF_RXRPC - The RxRPC protocol transport
    CONFIG_RXKAD - The RxRPC Kerberos security handler
    CONFIG_AFS - The AFS filesystem

Additionally, the following can be turned on to aid debugging::

    CONFIG_AF_RXRPC_DEBUG - Permit AF_RXRPC debugging to be enabled
    CONFIG_AFS_DEBUG - Permit AFS debugging to be enabled

They permit the debugging messages to be turned on dynamically by manipulating
the masks in the following files::

    /sys/module/af_rxrpc/parameters/debug
    /sys/module/kafs/parameters/debug

Usage
=====

When inserting the driver modules the root cell must be specified along with a
list of volume location server IP addresses::

    modprobe rxrpc
    modprobe kafs rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91

@@ -77,14 +76,14 @@ The second module is the kerberos RxRPC security driver, and the third module

is the actual filesystem driver for the AFS filesystem.

Once the module has been loaded, more cells can be added by the following
procedure::

    echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells

Where the parameters to the "add" command are the name of a cell and a list of
volume location servers within that cell, with the latter separated by colons.

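A program that adds cells without a shell can assemble the command string as below. This is a hypothetical C sketch. afs_cell_add_cmd() joins the addresses with colons, as in the echo example. afs_cell_add() writes the result to /proc/fs/afs/cells, with the trailing newline that echo would add. That file only exists once kafs is loaded, and writing it normally requires root.

```c
#include <stdio.h>
#include <string.h>

/*
 * Format "add <cell> <addr>[:<addr>]...". Returns the command length,
 * or -1 if naddrs is 0 or buf is too small.
 */
int afs_cell_add_cmd(char *buf, size_t len, const char *cell,
                     const char *const *addrs, size_t naddrs)
{
    size_t used, i;
    int r;

    if (naddrs == 0)
        return -1;
    r = snprintf(buf, len, "add %s ", cell);
    if (r < 0 || (size_t)r >= len)
        return -1;
    used = (size_t)r;
    for (i = 0; i < naddrs; i++) {
        /* volume location server addresses are separated by colons */
        r = snprintf(buf + used, len - used, "%s%s", i ? ":" : "", addrs[i]);
        if (r < 0 || (size_t)r >= len - used)
            return -1;
        used += (size_t)r;
    }
    return (int)used;
}

/* Write the command to the proc file; returns 0 on success, -1 on error. */
int afs_cell_add(const char *cmd)
{
    FILE *f = fopen("/proc/fs/afs/cells", "w");
    int ok;

    if (f == NULL)
        return -1;                      /* kafs not loaded, or no permission */
    ok = fprintf(f, "%s\n", cmd) >= 0;  /* echo appends a newline */
    if (fclose(f) != 0)                 /* buffered data is written here */
        ok = 0;
    return ok ? 0 : -1;
}
```

With the grand.central.org addresses from the example, afs_cell_add_cmd() yields the same "add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87" line that the echo command writes.
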
Filesystems can be mounted anywhere by commands similar to the following::

    mount -t afs "%cambridge.redhat.com:root.afs." /afs
    mount -t afs "#cambridge.redhat.com:root.cell." /afs/cambridge

@@ -104,8 +103,7 @@ named volume will be looked up in the cell specified during modprobe.

Additional cells can be added through /proc (see later section).

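The mount source strings in the examples have a simple shape:

- a leading '%' or '#',
- an optional cell name followed by ':',
- a volume name ending in '.'.

The C sketch below only splits that shape into its parts. It is a hypothetical helper, not the kafs parser. It does not interpret the prefix character or any volume-name suffix.

```c
#include <string.h>

/* Parts of a source such as "%cambridge.redhat.com:root.afs." */
struct afs_source {
    char type;          /* leading '%' or '#', left uninterpreted */
    const char *cell;   /* cell name, or NULL if none was given */
    size_t cell_len;
    const char *vol;    /* volume name (points into the input) */
    size_t vol_len;     /* length excluding a trailing '.' */
};

/* Split a mount source string; returns 0 on success, -1 if malformed. */
int afs_split_source(const char *src, struct afs_source *out)
{
    const char *colon;

    if ((src[0] != '%' && src[0] != '#') || src[1] == '\0')
        return -1;
    out->type = src[0];
    src++;
    colon = strchr(src, ':');
    if (colon != NULL) {
        out->cell = src;
        out->cell_len = (size_t)(colon - src);
        src = colon + 1;
    } else {
        out->cell = NULL;   /* volume is looked up in the modprobe cell */
        out->cell_len = 0;
    }
    out->vol = src;
    out->vol_len = strlen(src);
    if (out->vol_len > 0 && src[out->vol_len - 1] == '.')
        out->vol_len--;     /* drop the terminating dot */
    return out->vol_len ? 0 : -1;
}
```

For "%cambridge.redhat.com:root.afs." it reports cell "cambridge.redhat.com" and volume "root.afs". For a string without a colon, cell is NULL.
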
Mountpoints
===========

AFS has a concept of mountpoints. In AFS terms, these are specially formatted

@@ -123,42 +121,40 @@ culled first. If all are culled, then the requested volume will also be

unmounted, otherwise error EBUSY will be returned.

This can be used by the administrator to attempt to unmount the whole AFS tree
mounted on /afs in one go by doing::

    umount /afs

Dynamic Root
============

A mount option is available to create a serverless mount that is only usable
for dynamic lookup. Creating such a mount can be done by, for example::

    mount -t afs none /afs -o dyn

This creates a mount that just has an empty directory at the root. Attempting
to look up a name in this directory will cause a mountpoint to be created that
looks up a cell of the same name, for example::

    ls /afs/grand.central.org/

Proc Filesystem
===============

The AFS module creates a "/proc/fs/afs/" directory and populates it:

  (*) A "cells" file that lists cells currently known to the afs module and
      their usage counts::

        [root@andromeda ~]# cat /proc/fs/afs/cells
        USE NAME
          3 cambridge.redhat.com

  (*) A directory per cell that contains files that list volume location
      servers, volumes, and active servers known within that cell::

        [root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/servers
        USE ADDR STATE

@@ -171,8 +167,7 @@ The AFS modules creates a "/proc/fs/afs/" directory and populates it:

          1 Val 20000000 20000001 20000002 root.afs

The Cell Database
=================

The filesystem maintains an internal database of all the cells it knows and the

@@ -181,7 +176,7 @@ the system belongs is added to the database when modprobe is performed by the

"rootcell=" argument or, if compiled in, using a "kafs.rootcell=" argument on
the kernel command line.

Further cells can be added by commands similar to the following::

    echo add CELLNAME VLADDR[:VLADDR][:VLADDR]... >/proc/fs/afs/cells
    echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells

@@ -189,8 +184,7 @@ Further cells can be added by commands similar to the following:

No other cell database operations are available at this time.

Security
========

Secure operations are initiated by acquiring a key using the klog program. A

@@ -198,17 +192,17 @@ very primitive klog program is available at:

    http://people.redhat.com/~dhowells/rxrpc/klog.c

This should be compiled by::

    make klog LDLIBS="-lcrypto -lcrypt -lkrb4 -lkeyutils"

And then run as::

    ./klog

Assuming it's successful, this adds a key of type RxRPC, named for the service
and cell, e.g. "afs@<cellname>". This can be viewed with the keyctl program or
by cat'ing /proc/keys::

    [root@andromeda ~]# keyctl show
    Session Keyring

@@ -232,20 +226,19 @@ socket), then the operations on the file will be made with key that was used to

open the file.

The @sys Substitution
=====================

The list of up to 16 @sys substitutions for the current network namespace can
be configured by writing a list to /proc/fs/afs/sysname::

    [root@andromeda ~]# echo foo amd64_linux_26 >/proc/fs/afs/sysname

or cleared entirely by writing an empty list::

    [root@andromeda ~]# echo >/proc/fs/afs/sysname

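A program can build the same line that these echo commands write. The hypothetical C helper below joins up to 16 names with spaces and appends the newline echo would add. For an empty list it yields just "\n", which is the clearing case. Writing the result to /proc/fs/afs/sysname is left to the caller.

```c
#include <stdio.h>
#include <string.h>

#define AFS_MAX_SYSNAMES 16     /* "up to 16 @sys substitutions" */

/*
 * Format names as "name1 name2 ...\n". Returns 0 on success, or -1 if
 * there are too many names or buf is too small.
 */
int afs_sysname_line(char *buf, size_t len, const char *const *names,
                     size_t n)
{
    size_t used = 0, i;
    int r;

    if (n > AFS_MAX_SYSNAMES || len == 0)
        return -1;
    for (i = 0; i < n; i++) {
        r = snprintf(buf + used, len - used, "%s%s", i ? " " : "", names[i]);
        if (r < 0 || (size_t)r >= len - used)
            return -1;
        used += (size_t)r;
    }
    r = snprintf(buf + used, len - used, "\n");    /* echo's newline */
    if (r < 0 || (size_t)r >= len - used)
        return -1;
    return 0;
}
```

With the names foo and amd64_linux_26 this produces "foo amd64_linux_26\n", matching the first echo example.
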
The current list for the current network namespace can be retrieved by::

    [root@andromeda ~]# cat /proc/fs/afs/sysname
    foo

.. SPDX-License-Identifier: GPL-2.0

====================================================================
Miscellaneous Device control operations for the autofs kernel module
====================================================================

@@ -36,24 +38,24 @@ For example, there are two types of automount maps, direct (in the kernel

module source you will see a third type called an offset, which is just
a direct mount in disguise) and indirect.

Here is a master map with direct and indirect map entries::

    /- /etc/auto.direct
    /test /etc/auto.indirect

and the corresponding map files::

    /etc/auto.direct:

    /automount/dparse/g6 budgie:/autofs/export1
    /automount/dparse/g1 shark:/autofs/export1
    and so on.

/etc/auto.indirect::

    g1 shark:/autofs/export1
    g6 budgie:/autofs/export1
    and so on.

For the above indirect map an autofs file system is mounted on /test and For the above indirect map an autofs file system is mounted on /test and
mounts are triggered for each sub-directory key by the inode lookup mounts are triggered for each sub-directory key by the inode lookup
...@@ -69,18 +71,18 @@ use the follow_link inode operation to trigger the mount. ...@@ -69,18 +71,18 @@ use the follow_link inode operation to trigger the mount.
But, each entry in direct and indirect maps can have offsets (making But, each entry in direct and indirect maps can have offsets (making
them multi-mount map entries). them multi-mount map entries).
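As a rough illustration of where offset mounts end up, the sketch below joins an autofs mount point, a map key and an offset into a mount path. The helper name `offset_mount_path()` is hypothetical and not part of autofs or the kernel. It assumes that the offset "/" refers to the key directory itself and that, for a direct map, the key is already an absolute path (pass an empty `mount_root`).

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: compute the mount path for one offset of a
 * multi-mount map entry. For an indirect map, mount_root is the autofs
 * mount point (e.g. "/test") and key is the map key (e.g. "g1"). For a
 * direct map, pass "" as mount_root and the absolute key path as key.
 * Returns 0 on success, -1 if the result does not fit in out.
 */
static int offset_mount_path(char *out, size_t outlen, const char *mount_root,
                             const char *key, const char *offset)
{
	int n;

	/* The root offset "/" means "mount on the key directory itself". */
	if (strcmp(offset, "/") == 0)
		offset = "";

	if (mount_root[0] != '\0')
		n = snprintf(out, outlen, "%s/%s%s", mount_root, key, offset);
	else
		n = snprintf(out, outlen, "%s%s", key, offset);

	/* Treat output errors and truncation as failure. */
	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

With the indirect map mounted on /test, key g1 with offset /s1/ss1 resolves to /test/g1/s1/ss1, while the offset / resolves to /test/g1.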
For example, an indirect mount map entry could also be: For example, an indirect mount map entry could also be::
g1 \ g1 \
/ shark:/autofs/export5/testing/test \ / shark:/autofs/export5/testing/test \
/s1 shark:/autofs/export/testing/test/s1 \ /s1 shark:/autofs/export/testing/test/s1 \
/s2 shark:/autofs/export5/testing/test/s2 \ /s2 shark:/autofs/export5/testing/test/s2 \
/s1/ss1 shark:/autofs/export1 \ /s1/ss1 shark:/autofs/export1 \
/s2/ss2 shark:/autofs/export2 /s2/ss2 shark:/autofs/export2
and similarly a direct mount map entry could also be: and similarly a direct mount map entry could also be::
/automount/dparse/g1 \ /automount/dparse/g1 \
/ shark:/autofs/export5/testing/test \ / shark:/autofs/export5/testing/test \
/s1 shark:/autofs/export/testing/test/s1 \ /s1 shark:/autofs/export/testing/test/s1 \
/s2 shark:/autofs/export5/testing/test/s2 \ /s2 shark:/autofs/export5/testing/test/s2 \
...@@ -170,9 +172,9 @@ autofs Miscellaneous Device mount control interface ...@@ -170,9 +172,9 @@ autofs Miscellaneous Device mount control interface
The control interface works by opening a device node, typically /dev/autofs. The control interface works by opening a device node, typically /dev/autofs.
All the ioctls use a common structure to pass the needed parameter All the ioctls use a common structure to pass the needed parameter
information and return operation results: information and return operation results::
struct autofs_dev_ioctl { struct autofs_dev_ioctl {
__u32 ver_major; __u32 ver_major;
__u32 ver_minor; __u32 ver_minor;
__u32 size; /* total size of data passed in __u32 size; /* total size of data passed in
...@@ -195,7 +197,7 @@ struct autofs_dev_ioctl { ...@@ -195,7 +197,7 @@ struct autofs_dev_ioctl {
}; };
char path[0]; char path[0];
}; };
The ioctlfd field is a mount point file descriptor of an autofs mount The ioctlfd field is a mount point file descriptor of an autofs mount
point. It is returned by the open call and is used by all calls except point. It is returned by the open call and is used by all calls except
...@@ -212,7 +214,7 @@ is used account for the increased structure length when translating the ...@@ -212,7 +214,7 @@ is used account for the increased structure length when translating the
structure sent from user space. structure sent from user space.
This structure can be initialized before setting specific fields by using This structure can be initialized before setting specific fields by using
the void function call init_autofs_dev_ioctl(struct autofs_dev_ioctl *). the void function call init_autofs_dev_ioctl(``struct autofs_dev_ioctl *``).
All of the ioctls perform a copy of this structure from user space to All of the ioctls perform a copy of this structure from user space to
kernel space and return -EINVAL if the size parameter is smaller than kernel space and return -EINVAL if the size parameter is smaller than
......
.. SPDX-License-Identifier: GPL-2.0
=========================
BeOS filesystem for Linux BeOS filesystem for Linux
=========================
Document last updated: Dec 6, 2001 Document last updated: Dec 6, 2001
WARNING Warning
======= =======
Make sure you understand that this is alpha software. This means that the Make sure you understand that this is alpha software. This means that the
implementation is neither complete nor well-tested. implementation is neither complete nor well-tested.
I DISCLAIM ALL RESPONSIBILITY FOR ANY POSSIBLE BAD EFFECTS OF THIS CODE! I DISCLAIM ALL RESPONSIBILITY FOR ANY POSSIBLE BAD EFFECTS OF THIS CODE!
LICENSE License
===== =======
This software is covered by the GNU General Public License. This software is covered by the GNU General Public License.
See the file COPYING for the complete text of the license. See the file COPYING for the complete text of the license.
Or the GNU website: <http://www.gnu.org/licenses/licenses.html> Or the GNU website: <http://www.gnu.org/licenses/licenses.html>
AUTHOR Author
===== ======
The largest part of the code was written by Will Dyson <will_dyson@pobox.com> The largest part of the code was written by Will Dyson <will_dyson@pobox.com>
He has been working on the code since Aug 13, 2001. See the changelog for He has been working on the code since Aug 13, 2001. See the changelog for
details. details.
Original Author: Makoto Kato <m_kato@ga2.so-net.ne.jp> Original Author: Makoto Kato <m_kato@ga2.so-net.ne.jp>
His original code can still be found at: His original code can still be found at:
<http://hp.vector.co.jp/authors/VA008030/bfs/> <http://hp.vector.co.jp/authors/VA008030/bfs/>
Does anyone know of a more current email address for Makoto? He doesn't Does anyone know of a more current email address for Makoto? He doesn't
respond to the address given above... respond to the address given above...
This filesystem doesn't have a maintainer. This filesystem doesn't have a maintainer.
WHAT IS THIS DRIVER? What is this Driver?
================== ====================
This module implements the native filesystem of BeOS http://www.beincorporated.com/ This module implements the native filesystem of BeOS http://www.beincorporated.com/
for Linux 2.4.1 and later kernels. Currently it is a read-only for Linux 2.4.1 and later kernels. Currently it is a read-only
implementation. implementation.
Which is it, BFS or BEFS? Which is it, BFS or BEFS?
================ =========================
Be, Inc said, "BeOS Filesystem is officially called BFS, not BeFS". Be, Inc said, "BeOS Filesystem is officially called BFS, not BeFS".
But Unixware Boot Filesystem is called bfs, too. And they are already in But Unixware Boot Filesystem is called bfs, too. And they are already in
the kernel. Because of this naming conflict, on Linux the BeOS the kernel. Because of this naming conflict, on Linux the BeOS
filesystem is called befs. filesystem is called befs.
HOW TO INSTALL How to Install
============== ==============
step 1. Install the BeFS patch into the source code tree of linux. step 1. Install the BeFS patch into the source code tree of linux.
...@@ -63,7 +69,7 @@ The linux kernel has many compile-time options. Most of them are beyond the ...@@ -63,7 +69,7 @@ The linux kernel has many compile-time options. Most of them are beyond the
scope of this document. I suggest the Kernel-HOWTO document as a good general scope of this document. I suggest the Kernel-HOWTO document as a good general
reference on this topic. http://www.linuxdocs.org/HOWTOs/Kernel-HOWTO-4.html reference on this topic. http://www.linuxdocs.org/HOWTOs/Kernel-HOWTO-4.html
However, to use the BeFS module, you must enable it at configure time. However, to use the BeFS module, you must enable it at configure time::
cd /foo/bar/linux cd /foo/bar/linux
make menuconfig (or xconfig) make menuconfig (or xconfig)
...@@ -82,35 +88,40 @@ step 3. Install ...@@ -82,35 +88,40 @@ step 3. Install
See the kernel howto <http://www.linux.com/howto/Kernel-HOWTO.html> for See the kernel howto <http://www.linux.com/howto/Kernel-HOWTO.html> for
instructions on this critical step. instructions on this critical step.
USING BFS Using BFS
========= =========
To use the BeOS filesystem, use filesystem type 'befs'. To use the BeOS filesystem, use filesystem type 'befs'.
ex) ex::
mount -t befs /dev/fd0 /beos mount -t befs /dev/fd0 /beos
MOUNT OPTIONS Mount Options
============= =============
============= ===========================================================
uid=nnn All files in the partition will be owned by user id nnn. uid=nnn All files in the partition will be owned by user id nnn.
gid=nnn All files in the partition will be in group nnn. gid=nnn All files in the partition will be in group nnn.
iocharset=xxx Use xxx as the name of the NLS translation table. iocharset=xxx Use xxx as the name of the NLS translation table.
debug The driver will output debugging information to the syslog. debug The driver will output debugging information to the syslog.
============= ===========================================================
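When mounting from a program rather than with mount(8), the options in the table above are passed as a single comma-separated string in the `data` argument of mount(2). A minimal sketch that builds that string follows; `befs_mount_opts()` and `append_opt()` are hypothetical helpers, and a negative uid/gid or a NULL iocharset simply leaves that option out.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Append one option to out, preceded by ',' unless it is the first. */
static int append_opt(char *out, size_t len, const char *fmt, ...)
{
	size_t used = strlen(out);
	va_list ap;
	int n;

	if (used && used + 1 < len) {
		out[used++] = ',';
		out[used] = '\0';
	}
	va_start(ap, fmt);
	n = vsnprintf(out + used, len - used, fmt, ap);
	va_end(ap);
	/* Report output errors and truncation as failure. */
	return (n < 0 || (size_t)n >= len - used) ? -1 : 0;
}

/* Hypothetical helper: build the befs option string for mount(2). */
static int befs_mount_opts(char *out, size_t len, long uid, long gid,
                           const char *iocharset, int debug)
{
	if (len == 0)
		return -1;
	out[0] = '\0';
	if (uid >= 0 && append_opt(out, len, "uid=%ld", uid))
		return -1;
	if (gid >= 0 && append_opt(out, len, "gid=%ld", gid))
		return -1;
	if (iocharset && append_opt(out, len, "iocharset=%s", iocharset))
		return -1;
	if (debug && append_opt(out, len, "debug"))
		return -1;
	return 0;
}
```

The resulting string could then be used as, for example, `mount("/dev/fd0", "/beos", "befs", MS_RDONLY, opts)`, which requires root privileges; `MS_RDONLY` matches the driver being read-only.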
HOW TO GET LATEST VERSION How to Get Latest Version
========================== ==========================
The latest version is currently available at: The latest version is currently available at:
<http://befs-driver.sourceforge.net/> <http://befs-driver.sourceforge.net/>
ANY KNOWN BUGS? Any Known Bugs?
=========== ===============
As of Jan 20, 2002: As of Jan 20, 2002:
None None
SPECIAL THANKS Special Thanks
============== ==============
Dominic Giampalo ... Writing "Practical file system design with Be filesystem" Dominic Giampalo ... Writing "Practical file system design with Be filesystem"
Hiroyuki Yamada ... Testing LinuxPPC. Hiroyuki Yamada ... Testing LinuxPPC.
......
BFS FILESYSTEM FOR LINUX .. SPDX-License-Identifier: GPL-2.0
========================
BFS Filesystem for Linux
======================== ========================
The BFS filesystem is used by SCO UnixWare OS for the /stand slice, which The BFS filesystem is used by SCO UnixWare OS for the /stand slice, which
...@@ -9,20 +12,20 @@ In order to access /stand partition under Linux you obviously need to ...@@ -9,20 +12,20 @@ In order to access /stand partition under Linux you obviously need to
know the partition number and the kernel must support UnixWare disk slices know the partition number and the kernel must support UnixWare disk slices
(CONFIG_UNIXWARE_DISKLABEL config option). However BFS support does not (CONFIG_UNIXWARE_DISKLABEL config option). However BFS support does not
depend on having UnixWare disklabel support because one can also mount depend on having UnixWare disklabel support because one can also mount
BFS filesystem via loopback: BFS filesystem via loopback::
# losetup /dev/loop0 stand.img # losetup /dev/loop0 stand.img
# mount -t bfs /dev/loop0 /mnt/stand # mount -t bfs /dev/loop0 /mnt/stand
where stand.img is a file containing the image of BFS filesystem. where stand.img is a file containing the image of BFS filesystem.
When you have finished using it and unmounted it, you also need to deallocate When you have finished using it and unmounted it, you also need to deallocate
the /dev/loop0 device with: the /dev/loop0 device with::
# losetup -d /dev/loop0 # losetup -d /dev/loop0
You can simplify mounting by just typing: You can simplify mounting by just typing::
# mount -t bfs -o loop stand.img /mnt/stand # mount -t bfs -o loop stand.img /mnt/stand
this will allocate the first available loopback device (and load loop.o this will allocate the first available loopback device (and load loop.o
kernel module if necessary) automatically. If the loopback driver is not kernel module if necessary) automatically. If the loopback driver is not
...@@ -33,21 +36,21 @@ that modprobe is functioning. Beware that umount will not deallocate ...@@ -33,21 +36,21 @@ that modprobe is functioning. Beware that umount will not deallocate
losetup(8). Read losetup(8) manpage for more info. losetup(8). Read losetup(8) manpage for more info.
To create the BFS image under UnixWare you need to find out first which To create the BFS image under UnixWare you need to find out first which
slice contains it. The command prtvtoc(1M) is your friend: slice contains it. The command prtvtoc(1M) is your friend::
# prtvtoc /dev/rdsk/c0b0t0d0s0 # prtvtoc /dev/rdsk/c0b0t0d0s0
(assuming your root disk is on target=0, lun=0, bus=0, controller=0). Then you (assuming your root disk is on target=0, lun=0, bus=0, controller=0). Then you
look for the slice with tag "STAND", which is usually slice 10. With this look for the slice with tag "STAND", which is usually slice 10. With this
information you can use dd(1) to create the BFS image: information you can use dd(1) to create the BFS image::
# umount /stand # umount /stand
# dd if=/dev/rdsk/c0b0t0d0sa of=stand.img bs=512 # dd if=/dev/rdsk/c0b0t0d0sa of=stand.img bs=512
Just in case, you can verify that you have done the right thing by checking Just in case, you can verify that you have done the right thing by checking
the magic number: the magic number::
# od -Ad -tx4 stand.img | more # od -Ad -tx4 stand.img | more
The first 4 bytes should be 0x1badface. The first 4 bytes should be 0x1badface.
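The same check can be done programmatically. This sketch assumes, as the od example implies on a little-endian x86 host, that the magic is stored little-endian on disk; it decodes the bytes explicitly so the result does not depend on the host byte order. `is_bfs_image()` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

#define BFS_MAGIC 0x1badfaceU

/*
 * Return 1 if buf starts with the BFS magic number, 0 otherwise.
 * Assumes a little-endian on-disk value and decodes it byte by byte.
 */
static int is_bfs_image(const unsigned char *buf, size_t len)
{
	uint32_t magic;

	if (len < 4)
		return 0;
	magic = (uint32_t)buf[0] | (uint32_t)buf[1] << 8 |
		(uint32_t)buf[2] << 16 | (uint32_t)buf[3] << 24;
	return magic == BFS_MAGIC;
}
```

Read at least the first 4 bytes of stand.img (for example with fread()) and pass them in.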
......
.. SPDX-License-Identifier: GPL-2.0
=====
BTRFS BTRFS
===== =====
......
.. SPDX-License-Identifier: GPL-2.0
============================
Ceph Distributed File System Ceph Distributed File System
============================ ============================
...@@ -15,6 +18,7 @@ Basic features include: ...@@ -15,6 +18,7 @@ Basic features include:
* Easy deployment: most FS components are userspace daemons * Easy deployment: most FS components are userspace daemons
Also, Also,
* Flexible snapshots (on any directory) * Flexible snapshots (on any directory)
* Recursive accounting (nested files, directories, bytes) * Recursive accounting (nested files, directories, bytes)
...@@ -63,7 +67,7 @@ no 'du' or similar recursive scan of the file system is required. ...@@ -63,7 +67,7 @@ no 'du' or similar recursive scan of the file system is required.
Finally, Ceph also allows quotas to be set on any directory in the system. Finally, Ceph also allows quotas to be set on any directory in the system.
The quota can restrict the number of bytes or the number of files stored The quota can restrict the number of bytes or the number of files stored
beneath that point in the directory hierarchy. Quotas can be set using beneath that point in the directory hierarchy. Quotas can be set using
extended attributes 'ceph.quota.max_files' and 'ceph.quota.max_bytes', eg: extended attributes 'ceph.quota.max_files' and 'ceph.quota.max_bytes', eg::
setfattr -n ceph.quota.max_bytes -v 100000000 /some/dir setfattr -n ceph.quota.max_bytes -v 100000000 /some/dir
getfattr -n ceph.quota.max_bytes /some/dir getfattr -n ceph.quota.max_bytes /some/dir
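The setfattr call above maps directly onto setxattr(2). A minimal sketch follows, assuming the value is passed as a decimal string exactly as setfattr sends it; `set_ceph_max_bytes()` is a hypothetical helper. On a directory outside a CephFS mount the call is expected to fail with errno set.

```c
#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>

/*
 * Hypothetical C equivalent of
 *   setfattr -n ceph.quota.max_bytes -v <bytes> <dir>
 * Returns 0 on success, -1 (with errno set by setxattr) on failure.
 */
static int set_ceph_max_bytes(const char *dir, unsigned long long bytes)
{
	char val[32];
	int n = snprintf(val, sizeof val, "%llu", bytes);

	if (n < 0 || (size_t)n >= sizeof val)
		return -1;
	/* The attribute value is the decimal text, without a trailing NUL. */
	return setxattr(dir, "ceph.quota.max_bytes", val, (size_t)n, 0);
}
```

Reading the quota back works the same way with getxattr(2) and the same attribute name.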
...@@ -76,7 +80,7 @@ from writing as much data as it needs. ...@@ -76,7 +80,7 @@ from writing as much data as it needs.
Mount Syntax Mount Syntax
============ ============
The basic mount syntax is: The basic mount syntax is::
# mount -t ceph monip[:port][,monip2[:port]...]:/[subdir] mnt # mount -t ceph monip[:port][,monip2[:port]...]:/[subdir] mnt
...@@ -84,7 +88,7 @@ You only need to specify a single monitor, as the client will get the ...@@ -84,7 +88,7 @@ You only need to specify a single monitor, as the client will get the
full list when it connects. (However, if the monitor you specify full list when it connects. (However, if the monitor you specify
happens to be down, the mount won't succeed.) The port can be left happens to be down, the mount won't succeed.) The port can be left
off if the monitor is using the default. So if the monitor is at off if the monitor is using the default. So if the monitor is at
1.2.3.4, 1.2.3.4::
# mount -t ceph 1.2.3.4:/ /mnt/ceph # mount -t ceph 1.2.3.4:/ /mnt/ceph
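To make the device syntax above concrete, here is a sketch that splits such a string into its monitor list and the path to mount. `split_ceph_dev()` is hypothetical. It only handles hostnames and IPv4 addresses, relying on the fact that ":/" cannot appear inside an addr:port pair, and does not attempt IPv6.

```c
#include <stdio.h>
#include <string.h>

/*
 * Split "monip[:port][,monip2[:port]...]:/[subdir]" into the monitor
 * list and the path (which keeps its leading '/').
 * Returns 0 on success, -1 on malformed input or short buffers.
 */
static int split_ceph_dev(const char *dev, char *mons, size_t monslen,
                          char *path, size_t pathlen)
{
	const char *sep = strstr(dev, ":/");
	size_t n;

	if (!sep || sep == dev)
		return -1;	/* no ":/" separator, or no monitor before it */
	n = (size_t)(sep - dev);
	if (n >= monslen || strlen(sep + 1) >= pathlen)
		return -1;
	memcpy(mons, dev, n);
	mons[n] = '\0';
	strcpy(path, sep + 1);
	return 0;
}
```

For "1.2.3.4:/" this yields the monitor list "1.2.3.4" and the path "/".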
...@@ -179,8 +183,8 @@ For more information on Ceph, see the home page at ...@@ -179,8 +183,8 @@ For more information on Ceph, see the home page at
https://ceph.com/ https://ceph.com/
The Linux kernel client source tree is available at The Linux kernel client source tree is available at
https://github.com/ceph/ceph-client.git - https://github.com/ceph/ceph-client.git
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git - git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git
and the source for the full system is at and the source for the full system is at
https://github.com/ceph/ceph.git https://github.com/ceph/ceph.git
.. SPDX-License-Identifier: GPL-2.0
Cramfs - cram a filesystem onto a small ROM ===========================================
Cramfs - cram a filesystem onto a small ROM
===========================================
cramfs is designed to be simple and small, and to compress things well. cramfs is designed to be simple and small, and to compress things well.
...@@ -28,9 +31,9 @@ issue. ...@@ -28,9 +31,9 @@ issue.
Hard links are supported, but hard linked files Hard links are supported, but hard linked files
will still have a link count of 1 in the cramfs image. will still have a link count of 1 in the cramfs image.
Cramfs directories have no `.' or `..' entries. Directories (like Cramfs directories have no ``.`` or ``..`` entries. Directories (like
every other file on cramfs) always have a link count of 1. (There's every other file on cramfs) always have a link count of 1. (There's
no need to use -noleaf in `find', btw.) no need to use -noleaf in ``find``, btw.)
No timestamps are stored in a cramfs, so these default to the epoch No timestamps are stored in a cramfs, so these default to the epoch
(1970 GMT). Recently-accessed files may have updated timestamps, but (1970 GMT). Recently-accessed files may have updated timestamps, but
...@@ -70,9 +73,9 @@ MTD drivers are cfi_cmdset_0001 (Intel/Sharp CFI flash) or physmap ...@@ -70,9 +73,9 @@ MTD drivers are cfi_cmdset_0001 (Intel/Sharp CFI flash) or physmap
(Flash device in physical memory map). MTD partitions based on such devices (Flash device in physical memory map). MTD partitions based on such devices
are fine too. Then that device should be specified with the "mtd:" prefix are fine too. Then that device should be specified with the "mtd:" prefix
as the mount device argument. For example, to mount the MTD device named as the mount device argument. For example, to mount the MTD device named
"fs_partition" on the /mnt directory: "fs_partition" on the /mnt directory::
$ mount -t cramfs mtd:fs_partition /mnt $ mount -t cramfs mtd:fs_partition /mnt
To boot a kernel with this as the root filesystem, it suffices to specify To boot a kernel with this as the root filesystem, it suffices to specify
something like "root=mtd:fs_partition" on the kernel command line. something like "root=mtd:fs_partition" on the kernel command line.
...@@ -90,6 +93,7 @@ https://github.com/npitre/cramfs-tools ...@@ -90,6 +93,7 @@ https://github.com/npitre/cramfs-tools
For /usr/share/magic For /usr/share/magic
-------------------- --------------------
===== ======================= =======================
0 ulelong 0x28cd3d45 Linux cramfs offset 0 0 ulelong 0x28cd3d45 Linux cramfs offset 0
>4 ulelong x size %d >4 ulelong x size %d
>8 ulelong x flags 0x%x >8 ulelong x flags 0x%x
...@@ -110,6 +114,7 @@ For /usr/share/magic ...@@ -110,6 +114,7 @@ For /usr/share/magic
>552 ulelong x fsid.blocks %d >552 ulelong x fsid.blocks %d
>556 ulelong x fsid.files %d >556 ulelong x fsid.files %d
>560 string >\0 name "%.16s" >560 string >\0 name "%.16s"
===== ======================= =======================
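For tools that would rather not depend on file(1), the first three rules above can be applied directly. This is a sketch assuming the little-endian layout those rules describe; `cramfs_probe()` and `get_le32()` are hypothetical names, and the remaining superblock fields are not decoded.

```c
#include <stddef.h>
#include <stdint.h>

#define CRAMFS_MAGIC 0x28cd3d45U

/* Read a little-endian 32-bit value ("ulelong" in magic(5) terms). */
static uint32_t get_le32(const unsigned char *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/*
 * Check the magic at offset 0 and pick up the size (offset 4) and
 * flags (offset 8). Returns 0 if buf looks like a cramfs image.
 */
static int cramfs_probe(const unsigned char *buf, size_t len,
                        uint32_t *size, uint32_t *flags)
{
	if (len < 12 || get_le32(buf) != CRAMFS_MAGIC)
		return -1;
	*size = get_le32(buf + 4);
	*flags = get_le32(buf + 8);
	return 0;
}
```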
Hacker Notes Hacker Notes
......
Copyright 2009 Jonathan Corbet <corbet@lwn.net> .. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
=======
DebugFS
=======
Copyright |copy| 2009 Jonathan Corbet <corbet@lwn.net>
Debugfs exists as a simple way for kernel developers to make information Debugfs exists as a simple way for kernel developers to make information
available to user space. Unlike /proc, which is only meant for information available to user space. Unlike /proc, which is only meant for information
...@@ -6,11 +13,11 @@ about a process, or sysfs, which has strict one-value-per-file rules, ...@@ -6,11 +13,11 @@ about a process, or sysfs, which has strict one-value-per-file rules,
debugfs has no rules at all. Developers can put any information they want debugfs has no rules at all. Developers can put any information they want
there. The debugfs filesystem is also intended to not serve as a stable there. The debugfs filesystem is also intended to not serve as a stable
ABI to user space; in theory, there are no stability constraints placed on ABI to user space; in theory, there are no stability constraints placed on
files exported there. The real world is not always so simple, though [1]; files exported there. The real world is not always so simple, though [1]_;
even debugfs interfaces are best designed with the idea that they will need even debugfs interfaces are best designed with the idea that they will need
to be maintained forever. to be maintained forever.
Debugfs is typically mounted with a command like: Debugfs is typically mounted with a command like::
mount -t debugfs none /sys/kernel/debug mount -t debugfs none /sys/kernel/debug
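A program that needs debugfs can check whether it is actually mounted before touching /sys/kernel/debug. This sketch compares the statfs(2) filesystem type against DEBUGFS_MAGIC (0x64626720, the value in <linux/magic.h>, copied here so the example stands alone); `is_debugfs()` is a hypothetical helper.

```c
#include <sys/vfs.h>

/* Same value as DEBUGFS_MAGIC in <linux/magic.h>. */
#define DEBUGFS_MAGIC 0x64626720

/* Return 1 if path is on a mounted debugfs, 0 if not, -1 if statfs fails. */
static int is_debugfs(const char *path)
{
	struct statfs st;

	if (statfs(path, &st) != 0)
		return -1;
	return st.f_type == DEBUGFS_MAGIC;
}
```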
...@@ -23,7 +30,7 @@ Note that the debugfs API is exported GPL-only to modules. ...@@ -23,7 +30,7 @@ Note that the debugfs API is exported GPL-only to modules.
Code using debugfs should include <linux/debugfs.h>. Then, the first order Code using debugfs should include <linux/debugfs.h>. Then, the first order
of business will be to create at least one directory to hold a set of of business will be to create at least one directory to hold a set of
debugfs files: debugfs files::
struct dentry *debugfs_create_dir(const char *name, struct dentry *parent); struct dentry *debugfs_create_dir(const char *name, struct dentry *parent);
...@@ -36,7 +43,7 @@ something went wrong. If ERR_PTR(-ENODEV) is returned, that is an ...@@ -36,7 +43,7 @@ something went wrong. If ERR_PTR(-ENODEV) is returned, that is an
indication that the kernel has been built without debugfs support and none indication that the kernel has been built without debugfs support and none
of the functions described below will work. of the functions described below will work.
The most general way to create a file within a debugfs directory is with: The most general way to create a file within a debugfs directory is with::
struct dentry *debugfs_create_file(const char *name, umode_t mode, struct dentry *debugfs_create_file(const char *name, umode_t mode,
struct dentry *parent, void *data, struct dentry *parent, void *data,
...@@ -53,7 +60,7 @@ ERR_PTR(-ERROR) on error, or ERR_PTR(-ENODEV) if debugfs support is ...@@ -53,7 +60,7 @@ ERR_PTR(-ERROR) on error, or ERR_PTR(-ENODEV) if debugfs support is
missing. missing.
To create a file with an initial size, the following function can be used To create a file with an initial size, the following function can be used
instead: instead::
struct dentry *debugfs_create_file_size(const char *name, umode_t mode, struct dentry *debugfs_create_file_size(const char *name, umode_t mode,
struct dentry *parent, void *data, struct dentry *parent, void *data,
...@@ -66,7 +73,7 @@ as the function debugfs_create_file. ...@@ -66,7 +73,7 @@ as the function debugfs_create_file.
In a number of cases, the creation of a set of file operations is not In a number of cases, the creation of a set of file operations is not
actually necessary; the debugfs code provides a number of helper functions actually necessary; the debugfs code provides a number of helper functions
for simple situations. Files containing a single integer value can be for simple situations. Files containing a single integer value can be
created with any of: created with any of::
void debugfs_create_u8(const char *name, umode_t mode, void debugfs_create_u8(const char *name, umode_t mode,
struct dentry *parent, u8 *value); struct dentry *parent, u8 *value);
...@@ -80,7 +87,7 @@ created with any of: ...@@ -80,7 +87,7 @@ created with any of:
These files support both reading and writing the given value; if a specific These files support both reading and writing the given value; if a specific
file should not be written to, simply set the mode bits accordingly. The file should not be written to, simply set the mode bits accordingly. The
values in these files are in decimal; if hexadecimal is more appropriate, values in these files are in decimal; if hexadecimal is more appropriate,
the following functions can be used instead: the following functions can be used instead::
void debugfs_create_x8(const char *name, umode_t mode, void debugfs_create_x8(const char *name, umode_t mode,
struct dentry *parent, u8 *value); struct dentry *parent, u8 *value);
...@@ -94,7 +101,7 @@ the following functions can be used instead: ...@@ -94,7 +101,7 @@ the following functions can be used instead:
These functions are useful as long as the developer knows the size of the These functions are useful as long as the developer knows the size of the
value to be exported. Some types can have different widths on different value to be exported. Some types can have different widths on different
architectures, though, complicating the situation somewhat. There are architectures, though, complicating the situation somewhat. There are
functions meant to help out in such special cases: functions meant to help out in such special cases::
void debugfs_create_size_t(const char *name, umode_t mode, void debugfs_create_size_t(const char *name, umode_t mode,
struct dentry *parent, size_t *value); struct dentry *parent, size_t *value);
...@@ -103,7 +110,7 @@ As might be expected, this function will create a debugfs file to represent ...@@ -103,7 +110,7 @@ As might be expected, this function will create a debugfs file to represent
a variable of type size_t. a variable of type size_t.
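From user space, the files created by the integer helpers above are plain text and can be read back with a few lines of C. This sketch assumes the decimal helpers print an unprefixed number and the hexadecimal ones print a "0x" prefix, so strtoull() with base 0 covers both; `read_debugfs_u64()` is a hypothetical name.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Read one integer from a debugfs file created by a decimal or
 * hexadecimal helper. Returns 0 on success, -1 on error.
 */
static int read_debugfs_u64(const char *path, unsigned long long *val)
{
	char buf[64], *end;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof buf, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	/* Base 0: "0x2a" is parsed as hex, "42" as decimal. */
	errno = 0;
	*val = strtoull(buf, &end, 0);
	if (errno || end == buf)
		return -1;
	return 0;
}
```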
Similarly, there are helpers for variables of type unsigned long, in decimal Similarly, there are helpers for variables of type unsigned long, in decimal
and hexadecimal: and hexadecimal::
struct dentry *debugfs_create_ulong(const char *name, umode_t mode, struct dentry *debugfs_create_ulong(const char *name, umode_t mode,
struct dentry *parent, struct dentry *parent,
...@@ -111,7 +118,7 @@ and hexadecimal: ...@@ -111,7 +118,7 @@ and hexadecimal:
void debugfs_create_xul(const char *name, umode_t mode, void debugfs_create_xul(const char *name, umode_t mode,
struct dentry *parent, unsigned long *value); struct dentry *parent, unsigned long *value);
Boolean values can be placed in debugfs with: Boolean values can be placed in debugfs with::
struct dentry *debugfs_create_bool(const char *name, umode_t mode, struct dentry *debugfs_create_bool(const char *name, umode_t mode,
struct dentry *parent, bool *value); struct dentry *parent, bool *value);
...@@ -120,7 +127,7 @@ A read on the resulting file will yield either Y (for non-zero values) or ...@@ -120,7 +127,7 @@ A read on the resulting file will yield either Y (for non-zero values) or
N, followed by a newline. If written to, it will accept either upper- or N, followed by a newline. If written to, it will accept either upper- or
lower-case values, or 1 or 0. Any other input will be silently ignored. lower-case values, or 1 or 0. Any other input will be silently ignored.
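The boolean format described above is simple enough to interpret with a switch on the first character. This is a sketch: `debugfs_bool_value()` is hypothetical, and it assumes "upper- or lower-case values" means y/Y and n/N.

```c
/*
 * Interpret debugfs bool text: "Y\n"/"N\n" on read, y/Y/n/N or 1/0 on
 * write. Returns 1 or 0, or -1 for input the kernel would ignore.
 */
static int debugfs_bool_value(const char *s)
{
	switch (s[0]) {
	case 'Y': case 'y': case '1':
		return 1;
	case 'N': case 'n': case '0':
		return 0;
	default:
		return -1;
	}
}
```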
Also, atomic_t values can be placed in debugfs with: Also, atomic_t values can be placed in debugfs with::
void debugfs_create_atomic_t(const char *name, umode_t mode, void debugfs_create_atomic_t(const char *name, umode_t mode,
struct dentry *parent, atomic_t *value) struct dentry *parent, atomic_t *value)
...@@ -129,7 +136,7 @@ A read of this file will get atomic_t values, and a write of this file ...@@ -129,7 +136,7 @@ A read of this file will get atomic_t values, and a write of this file
will set atomic_t values. will set atomic_t values.
Another option is exporting a block of arbitrary binary data, with Another option is exporting a block of arbitrary binary data, with
this structure and function: this structure and function::
struct debugfs_blob_wrapper { struct debugfs_blob_wrapper {
void *data; void *data;
...@@ -151,7 +158,7 @@ If you want to dump a block of registers (something that happens quite ...@@ -151,7 +158,7 @@ If you want to dump a block of registers (something that happens quite
often during development, even if little such code reaches mainline), often during development, even if little such code reaches mainline),
debugfs offers two functions: one to make a registers-only file, and debugfs offers two functions: one to make a registers-only file, and
another to insert a register block in the middle of another sequential another to insert a register block in the middle of another sequential
file. file::
struct debugfs_reg32 { struct debugfs_reg32 {
char *name; char *name;
...@@ -175,7 +182,7 @@ The "base" argument may be 0, but you may want to build the reg32 array ...@@ -175,7 +182,7 @@ The "base" argument may be 0, but you may want to build the reg32 array
using __stringify, and a number of register names (macros) are actually using __stringify, and a number of register names (macros) are actually
byte offsets over a base for the register block. byte offsets over a base for the register block.
If you want to dump a u32 array in debugfs, you can create a file with: If you want to dump a u32 array in debugfs, you can create a file with::
void debugfs_create_u32_array(const char *name, umode_t mode, void debugfs_create_u32_array(const char *name, umode_t mode,
struct dentry *parent, struct dentry *parent,
...@@ -185,7 +192,7 @@ The "array" argument provides data, and the "elements" argument is ...@@ -185,7 +192,7 @@ The "array" argument provides data, and the "elements" argument is
the number of elements in the array. Note: once the array is created, its the number of elements in the array. Note: once the array is created, its
size cannot be changed. size cannot be changed.
There is a helper function to create device related seq_file: There is a helper function to create device related seq_file::
struct dentry *debugfs_create_devm_seqfile(struct device *dev, struct dentry *debugfs_create_devm_seqfile(struct device *dev,
const char *name, const char *name,
...@@ -197,7 +204,7 @@ The "dev" argument is the device related to this debugfs file, and ...@@ -197,7 +204,7 @@ The "dev" argument is the device related to this debugfs file, and
the "read_fn" is a function pointer which is called to print the the "read_fn" is a function pointer which is called to print the
seq_file content. seq_file content.
There are a couple of other directory-oriented helper functions: There are a couple of other directory-oriented helper functions::
struct dentry *debugfs_rename(struct dentry *old_dir, struct dentry *debugfs_rename(struct dentry *old_dir,
struct dentry *old_dentry, struct dentry *old_dentry,
...@@ -219,7 +226,7 @@ module is unloaded without explicitly removing debugfs entries, the result ...@@ -219,7 +226,7 @@ module is unloaded without explicitly removing debugfs entries, the result
will be a lot of stale pointers and no end of highly antisocial behavior. will be a lot of stale pointers and no end of highly antisocial behavior.
So all debugfs users - at least those which can be built as modules - must So all debugfs users - at least those which can be built as modules - must
be prepared to remove all files and directories they create there. A file be prepared to remove all files and directories they create there. A file
can be removed with: can be removed with::
void debugfs_remove(struct dentry *dentry); void debugfs_remove(struct dentry *dentry);
...@@ -229,7 +236,7 @@ be removed. ...@@ -229,7 +236,7 @@ be removed.
Once upon a time, debugfs users were required to remember the dentry Once upon a time, debugfs users were required to remember the dentry
pointer for every debugfs file they created so that all files could be pointer for every debugfs file they created so that all files could be
cleaned up. We live in more civilized times now, though, and debugfs users cleaned up. We live in more civilized times now, though, and debugfs users
can call: can call::
void debugfs_remove_recursive(struct dentry *dentry); void debugfs_remove_recursive(struct dentry *dentry);
@@ -237,5 +244,4 @@ If this function is passed a pointer for the dentry corresponding to the
top-level directory, the entire hierarchy below that directory will be
removed.

.. [1] http://lwn.net/Articles/309298/
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

=====
DLMFS
=====

A minimal DLM userspace interface implemented via a virtual file
system.

dlmfs is built with OCFS2 as it requires most of its infrastructure.

:Project web page:    http://ocfs2.wiki.kernel.org
:Tools web page:      https://github.com/markfasheh/ocfs2-tools
:OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/

All code copyright 2005 Oracle except when otherwise noted.

Credits
=======

Some code taken from ramfs which is Copyright |copy| 2000 Linus Torvalds
and Transmeta Corp.

Mark Fasheh <mark.fasheh@oracle.com>
@@ -96,14 +101,19 @@ operation. If the lock succeeds, you'll get an fd.
open(2) with O_CREAT to ensure the resource inode is created - dlmfs does
not automatically create inodes for existing lock resources.

============  ===========================
Open Flag     Lock Request Type
============  ===========================
O_RDONLY      Shared Read
O_RDWR        Exclusive
============  ===========================


============  ===========================
Open Flag     Resulting Locking Behavior
============  ===========================
O_NONBLOCK    Trylock operation
============  ===========================

You must provide exactly one of O_RDONLY or O_RDWR.
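The following minimal sketch makes the flag rules above concrete. It composes the open(2) flags for a dlmfs lock request. The enum and the function name ``dlmfs_lock_flags`` are hypothetical and exist only for this illustration. dlmfs itself only interprets the flags.

```c
#include <fcntl.h>
#include <stdbool.h>

/* Hypothetical lock levels for this sketch; dlmfs itself only sees open flags. */
enum dlmfs_lock_level {
	DLMFS_LOCK_SHARED,	/* requested with O_RDONLY (shared read) */
	DLMFS_LOCK_EXCLUSIVE,	/* requested with O_RDWR (exclusive) */
};

/*
 * Compose open(2) flags for a dlmfs lock resource file:
 *  - exactly one of O_RDONLY or O_RDWR;
 *  - O_CREAT, because dlmfs does not create inodes for existing
 *    lock resources on its own;
 *  - O_NONBLOCK only when a trylock is wanted.
 */
static int dlmfs_lock_flags(enum dlmfs_lock_level level, bool trylock)
{
	int flags = O_CREAT;

	flags |= (level == DLMFS_LOCK_EXCLUSIVE) ? O_RDWR : O_RDONLY;
	if (trylock)
		flags |= O_NONBLOCK;
	return flags;
}
```

A caller might then write ``open("/dlm/mydomain/mylock", dlmfs_lock_flags(DLMFS_LOCK_EXCLUSIVE, true), 0600)`` to request an exclusive trylock. The ``/dlm`` mount point, the domain directory and the lock name are placeholders assumed for the example.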
......
.. SPDX-License-Identifier: GPL-2.0

======================================================
eCryptfs: A stacked cryptographic filesystem for Linux
======================================================

eCryptfs is free software. Please see the file COPYING for details.
For documentation, please see the files in the doc/ subdirectory. For
building and installation instructions please see the INSTALL file.

:Maintainer: Phillip Hellewell
:Lead developer: Michael A. Halcrow <mhalcrow@us.ibm.com>
:Developers: Michael C. Thompson
             Kent Yoder
:Web Site: http://ecryptfs.sf.net

This software is currently undergoing development. Make sure to
maintain a backup copy of any data you write into eCryptfs.
@@ -19,13 +23,15 @@ SourceForge site:
http://sourceforge.net/projects/ecryptfs/

Userspace requirements include:

- David Howells' userspace keyring headers and libraries (version
  1.0 or higher), obtainable from
  http://people.redhat.com/~dhowells/keyutils/
- Libgcrypt


Notes
=====

In the beta/experimental releases of eCryptfs, when you upgrade
eCryptfs, you should copy the files to an unencrypted location and
@@ -33,20 +39,21 @@ then copy the files back into the new eCryptfs mount to migrate the
files.


Mount-wide Passphrase
=====================

Create a new directory into which eCryptfs will write its encrypted
files (e.g., /root/crypt). Then, create the mount point directory
(e.g., /mnt/crypt). Now it's time to mount eCryptfs::

    mount -t ecryptfs /root/crypt /mnt/crypt

You should be prompted for a passphrase and a salt (the salt may be
blank).

Try writing a new file::

    echo "Hello, World" > /mnt/crypt/hello.txt

The operation will complete. Notice that there is a new file in
/root/crypt that is at least 12288 bytes in size (depending on your
@@ -59,10 +66,13 @@ keyctl clear @u
Then umount /mnt/crypt and mount again per the instructions given
above.

::

    cat /mnt/crypt/hello.txt


Notes
=====

eCryptfs version 0.1 should only be mounted on (1) empty directories
or (2) directories containing files only created by eCryptfs. If you
......
.. SPDX-License-Identifier: GPL-2.0

=======================================
efivarfs - a (U)EFI variable filesystem
=======================================

The efivarfs filesystem was created to address the shortcomings of
using entries in sysfs to maintain EFI variables. The old sysfs EFI
@@ -11,7 +14,7 @@ than a single page, sysfs isn't the best interface for this.
Variables can be created, deleted and modified with the efivarfs
filesystem.

efivarfs is typically mounted like this::

    mount -t efivarfs none /sys/firmware/efi/efivars
......
.. SPDX-License-Identifier: GPL-2.0

======================================
Enhanced Read-Only File System - EROFS
======================================

Overview
========
@@ -6,6 +12,7 @@ from other read-only file systems, it aims to be designed for flexibility,
scalability, while being kept simple and high-performance.

It is designed as a better filesystem solution for the following scenarios:

- read-only storage media or

- part of a fully trusted read-only solution, which means it needs to be
@@ -17,6 +24,7 @@ It is designed as a better filesystem solution for the following scenarios:
  for those embedded devices with limited memory (e.g., smartphones);

Here are the main features of EROFS:

- Little endian on-disk design;

- Currently 4KB block size (nobh) and therefore maximum 16TB address space;
@@ -24,13 +32,17 @@ Here is the main features of EROFS:
- Metadata & data could be mixed by design;

- 2 inode versions for different requirements:

  =====================  ============  =====================================
                         compact (v1)  extended (v2)
  =====================  ============  =====================================
  Inode metadata size    32 bytes      64 bytes
  Max file size          4 GB          16 EB (also limited by max. vol size)
  Max uids/gids          65536         4294967296
  File change time       no            yes (64 + 32-bit timestamp)
  Max hardlinks          65536         4294967296
  Metadata reserved      4 bytes       14 bytes
  =====================  ============  =====================================

- Support extended attributes (xattrs) as an option;
@@ -43,29 +55,36 @@ Here is the main features of EROFS:
The following git tree provides the file system user-space tools under
development (e.g., the formatting tool mkfs.erofs):

- git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git

Bug reports and patches are welcome; please send them to the following
linux-erofs mailing list:

- linux-erofs mailing list   <linux-erofs@lists.ozlabs.org>
Mount options
=============

===================    =========================================================
(no)user_xattr         Set up Extended User Attributes. Note: xattr is enabled
                       by default if CONFIG_EROFS_FS_XATTR is selected.
(no)acl                Set up POSIX Access Control List. Note: acl is enabled
                       by default if CONFIG_EROFS_FS_POSIX_ACL is selected.
cache_strategy=%s      Select a strategy for cached decompression from now on:

                       ==========  =============================================
                       disabled    In-place I/O decompression only;
                       readahead   Cache the last incomplete compressed physical
                                   cluster for further reading. It still does
                                   in-place I/O decompression for the remaining
                                   compressed physical clusters;
                       readaround  Cache both ends of incomplete compressed
                                   physical clusters for further reading.
                                   It still does in-place I/O decompression
                                   for the remaining compressed physical clusters.
                       ==========  =============================================
===================    =========================================================
On-disk details
===============
@@ -73,7 +92,7 @@ On-disk details
Summary
-------
Unlike other read-only file systems, an EROFS volume is designed
to be as simple as possible::

    |-> aligned with the block size
    ____________________________________________________________
@@ -83,13 +102,17 @@ to be as simple as possible:
All data areas should be aligned with the block size, but metadata areas
need not be. All metadata can now be observed in two different spaces (views):

1. Inode metadata space

   Each valid inode should be aligned with an inode slot, which is a fixed
   value (32 bytes) and designed to be kept in line with compact inode size.

   Each inode can be directly found with the following formula::

       inode offset = meta_blkaddr * block_size + 32 * nid

   ::

       |-> aligned with 8B
       |-> followed closely
       + meta_blkaddr blocks |-> another slot
@@ -117,7 +140,7 @@ may not. All metadatas can be now observed in two different spaces (views):
       |-> aligned with 4B

   The two inode sizes (32 or 64 bytes) can be distinguished by i_format,
   a field common to all inode versions::

       __________________ __________________
       | i_format | | i_format |
@@ -132,16 +155,19 @@ may not. All metadatas can be now observed in two different spaces (views):
   proper alignment, and they could be optional for different data mappings.
   *Currently*, a total of 4 valid data mappings are supported:

   ==  ====================================================================
   0   flat file data without data inline (no extent);
   1   fixed-sized output data compression (with non-compacted indexes);
   2   flat file data with tail packing data inline (no extent);
   3   fixed-sized output data compression (with compacted indexes, v5.3+).
   ==  ====================================================================

   The size of the optional xattrs is indicated by i_xattr_count in the inode
   header. Large xattrs or xattrs shared by many different files can be
   stored in the shared xattrs metadata rather than inlined right after the
   inode.
2. Shared xattrs metadata space

   Shared xattrs space is similar to the above inode space: it starts at
   a specific block indicated by xattr_blkaddr, and the xattrs are organized
   one by one with proper alignment.
@@ -149,6 +175,8 @@ may not. All metadatas can be now observed in two different spaces (views):
   Each shared xattr can also be directly found by the following formula::

       xattr offset = xattr_blkaddr * block_size + 4 * xattr_id

   ::

       |-> aligned by 4 bytes
       + xattr_blkaddr blocks |-> aligned with 4 bytes
       _________________________________________________________________________
@@ -163,13 +191,15 @@ random file lookup, and all directory entries are _strictly_ recorded in
alphabetical order to support an improved prefix binary search
algorithm (refer to the related source code).
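Because the entries in a block are sorted, a name can be found with an ordinary binary search. The sketch below uses a plain binary search, not the improved prefix variant mentioned above. It assumes the dirents of one block have already been decoded into an array of name offsets (``nameoff``). Each name is assumed to end where the next one begins, and only the last name may be followed by ``'\0'`` padding, as in the directory block diagram that follows. ``struct dir_block_view`` and all function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical, simplified view of one directory block: the dirents have
 * already been decoded into an array of name offsets. Names are stored in
 * sorted order and are not NUL-terminated, except that the last one may be
 * followed by '\0' padding up to the end of the block.
 */
struct dir_block_view {
	const char *block;		/* raw block contents */
	size_t block_size;		/* e.g. 4096 */
	const uint16_t *nameoff;	/* nameoff[i]: start of name i */
	size_t count;			/* number of dirents in the block */
};

/* Name i ends where name i + 1 starts; the last one ends at '\0' or EOB. */
static size_t dir_name_len(const struct dir_block_view *d, size_t i)
{
	const char *start = d->block + d->nameoff[i];
	size_t room = d->block_size - d->nameoff[i];
	const char *nul;

	if (i + 1 < d->count)
		return (size_t)(d->nameoff[i + 1] - d->nameoff[i]);
	nul = memchr(start, '\0', room);
	return nul ? (size_t)(nul - start) : room;
}

/* Byte-wise comparison of a C string with a length-delimited name. */
static int dir_name_cmp(const char *name, const char *disk, size_t disk_len)
{
	size_t len = strlen(name);
	int r = memcmp(name, disk, len < disk_len ? len : disk_len);

	if (r != 0)
		return r;
	return (len > disk_len) - (len < disk_len);
}

/* Plain binary search over one block; returns the dirent index or -1. */
static long dir_lookup(const struct dir_block_view *d, const char *name)
{
	size_t lo = 0, hi = d->count;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int r = dir_name_cmp(name, d->block + d->nameoff[mid],
				     dir_name_len(d, mid));

		if (r == 0)
			return (long)mid;
		if (r < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return -1;
}
```

A directory spanning several blocks would need an outer search to choose the block first. That step is omitted here.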
::

                      ___________________________
                     /                           |
                    /              ______________|________________
                   /              /              | nameoff1       | nameoffN-1
      ____________.______________._______________v________________v__________
     | dirent | dirent | ... | dirent | filename | filename | ... | filename |
     |___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____|
          \                           ^
           \                          |                           * could have
            \                         |                             trailing '\0'
@@ -184,14 +214,14 @@ introduce another on-disk field at all.
Compression
-----------
Currently, EROFS supports 4KB fixed-sized output transparent file compression,
as illustrated below::

             |---- Variant-Length Extent ----|-------- VLE --------|----- VLE -----
             clusterofs                      clusterofs            clusterofs
             |                               |                     |   logical data
    _________v_______________________________v_____________________v_______________
    ... |    .        |             |        .    |             |  .          | ...
    ____|____.________|_____________|________.____|_____________|__.__________|____
        |-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-|
             size          size          size          size          size
    . . . .
@@ -208,4 +238,3 @@ at most. For each logical cluster, there is a corresponding on-disk index to
describe its cluster type, physical cluster address, etc.
See "struct z_erofs_vle_decompressed_index" in erofs_fs.h for more details.
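The two offset formulas given earlier are plain arithmetic, and so is locating the logical cluster that holds a given file offset. The sketch below spells them out. The constant and function names are hypothetical. The 4KB block size follows the text, and the 4KB logical cluster size is an assumption made for illustration.

```c
#include <stdint.h>

/* From the text: 4KB blocks, 32-byte inode slots, 4-byte shared xattr
 * slots. The 4KB logical cluster size is an assumption of this sketch. */
#define BLK_SIZE		4096u
#define INODE_SLOT_SIZE		32u
#define XATTR_SLOT_SIZE		4u
#define LCLUSTER_SIZE		4096u

/* inode offset = meta_blkaddr * block_size + 32 * nid */
static uint64_t inode_offset(uint32_t meta_blkaddr, uint64_t nid)
{
	return (uint64_t)meta_blkaddr * BLK_SIZE + INODE_SLOT_SIZE * nid;
}

/* xattr offset = xattr_blkaddr * block_size + 4 * xattr_id */
static uint64_t xattr_offset(uint32_t xattr_blkaddr, uint32_t xattr_id)
{
	return (uint64_t)xattr_blkaddr * BLK_SIZE +
	       (uint64_t)XATTR_SLOT_SIZE * xattr_id;
}

/* Slots taken by an inode header: 1 for compact (32B), 2 for extended (64B). */
static unsigned int inode_slots(unsigned int inode_size)
{
	return (inode_size + INODE_SLOT_SIZE - 1) / INODE_SLOT_SIZE;
}

/* Logical cluster holding byte 'pos' of the uncompressed file data... */
static uint64_t lcluster_index(uint64_t pos)
{
	return pos / LCLUSTER_SIZE;
}

/* ...and the offset of that byte inside its logical cluster. */
static uint32_t lcluster_offset(uint64_t pos)
{
	return (uint32_t)(pos % LCLUSTER_SIZE);
}
```

For instance, with ``meta_blkaddr = 1`` the inode with ``nid = 3`` starts at byte 4096 + 3 * 32 = 4192. A 64-byte extended inode spans two 32-byte slots.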
.. SPDX-License-Identifier: GPL-2.0


The Second Extended Filesystem
==============================
@@ -14,8 +16,9 @@ Options
Most defaults are determined by the filesystem superblock, and can be
set using tune2fs(8). Kernel-determined defaults are indicated by (*).

====================    ===     ================================================
bsddf                   (*)     Makes ``df`` act like BSD.
minixdf                         Makes ``df`` act like Minix.
check=none, nocheck     (*)     Don't do extra checking of bitmaps on mount
                                (check=normal and check=strict options removed)
@@ -62,6 +65,7 @@ quota, usrquota Enable user disk quota support
grpquota                        Enable group disk quota support
                                (requires CONFIG_QUOTA).
====================    ===     ================================================

The noquota option is silently ignored by ext2.
@@ -294,9 +298,9 @@ respective fsck programs.
If you're exceptionally paranoid, there are 3 ways of making metadata
writes synchronous on ext2:

- per-file if you have the program source: use the O_SYNC flag to open()
- per-file if you don't have the source: use "chattr +S" on the file
- per-filesystem: add the "sync" option to mount (or in /etc/fstab)

The first and last are not ext2-specific but do force the metadata to
be written synchronously. See also Journaling below.
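For the first option, a program opens the file with ``O_SYNC``. Each write(2) then returns only once the data and the metadata needed to retrieve it have reached the device. The helper below is a minimal sketch. Its name and error handling are illustrative choices, not part of ext2.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Write 'len' bytes of 'buf' to 'path' through an O_SYNC descriptor, so
 * every write(2) completes only after data and metadata are on disk.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int write_file_sync(const char *path, const void *buf, size_t len)
{
	const char *p = buf;
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);

	if (fd < 0)
		return -1;
	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			int saved;

			if (errno == EINTR)
				continue;	/* interrupted: retry */
			saved = errno;
			close(fd);
			errno = saved;
			return -1;
		}
		p += n;
		len -= (size_t)n;
	}
	return close(fd);
}
```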
@@ -316,10 +320,12 @@ Most of these limits could be overcome with slight changes in the on-disk
format and using a compatibility flag to signal the format change (at
the expense of some compatibility).

=====================  =======  =======  =======  ========
Filesystem block size  1kB      2kB      4kB      8kB
=====================  =======  =======  =======  ========
File size limit        16GB     256GB    2048GB   2048GB
Filesystem size limit  2047GB   8192GB   16384GB  32768GB
=====================  =======  =======  =======  ========
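The file size row can be reproduced with some arithmetic. The sketch below relies on ext2 on-disk details that this section does not spell out. It assumes 12 direct block pointers plus single, double and triple indirect blocks of 32-bit block numbers. It also assumes a 32-bit ``i_blocks`` count in 512-byte units. Treat it as an approximate check rather than an exact limit.

```c
#include <stdint.h>

/*
 * Approximate maximum ext2 file size for a given block size. Assumes
 * 12 direct pointers, single/double/triple indirect blocks of 32-bit
 * block numbers, and a 32-bit i_blocks counter in 512-byte sectors.
 * Sectors used by the indirect blocks themselves are ignored.
 */
static uint64_t ext2_max_file_bytes(uint64_t block_size)
{
	uint64_t ptrs = block_size / 4;	/* block numbers per indirect block */
	uint64_t blocks = 12 + ptrs + ptrs * ptrs + ptrs * ptrs * ptrs;
	uint64_t by_pointers = blocks * block_size;
	uint64_t by_i_blocks = (UINT64_C(1) << 32) * 512;	/* 2 TiB */

	return by_pointers < by_i_blocks ? by_pointers : by_i_blocks;
}
```

With 1kB blocks the pointer tree is the bottleneck (about 16GB). With 4kB and 8kB blocks, the ``i_blocks`` counter caps files at 2048GB, which is why the last two columns agree.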
There is a 2.4 kernel limit of 2048GB for a single block device, so no
filesystem larger than that can be created at this time. There is also
@@ -370,19 +376,24 @@ ext4 and journaling.
References
==========

=======================   ===============================================
The kernel source         file:/usr/src/linux/fs/ext2/
e2fsprogs (e2fsck)        http://e2fsprogs.sourceforge.net/
Design & Implementation   http://e2fsprogs.sourceforge.net/ext2intro.html
Journaling (ext3)         ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/
Filesystem Resizing       http://ext2resize.sourceforge.net/
Compression [1]_          http://e2compr.sourceforge.net/
=======================   ===============================================

Implementations for:

=======================   ===========================================================
Windows 95/98/NT/2000     http://www.chrysocome.net/explore2fs
Windows 95 [1]_           http://www.yipton.net/content.html#FSDEXT2
DOS client [1]_           ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/
OS/2 [2]_                 ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/
RISC OS client            http://www.esw-heim.tu-clausthal.de/~marco/smorbrod/IscaFS/
=======================   ===========================================================

.. [1] no longer actively developed/supported (as of Apr 2001)
.. [2] no longer actively developed/supported (as of Mar 2009)
.. SPDX-License-Identifier: GPL-2.0

===============
Ext3 Filesystem
===============
......
.. SPDX-License-Identifier: GPL-2.0

==========================================
WHAT IS Flash-Friendly File System (F2FS)?
==========================================

NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
been used in a variety of systems, ranging from mobile to server systems. Since
@@ -20,14 +22,15 @@ layout, but also for selecting allocation and cleaning algorithms.
The following git tree provides the file system formatting tool (mkfs.f2fs),
a consistency checking tool (fsck.f2fs), and a debugging tool (dump.f2fs).

- git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git

For reporting bugs and sending patches, please use the following mailing list:

- linux-f2fs-devel@lists.sourceforge.net

Background and Design issues
============================

Log-structured File System (LFS)
--------------------------------
@@ -61,6 +64,7 @@ needs to reclaim these obsolete blocks seamlessly to users. This job is called
as a cleaning process.

The process consists of three operations as follows.

1. A victim segment is selected by referencing the segment usage table.
2. It loads parent index structures of all the data in the victim identified by
   the segment summary blocks.
@@ -71,9 +75,8 @@ This cleaning job may cause unexpected long delays, so the most important goal
is to hide the latencies from users. It should also reduce the
amount of valid data to be moved, and move it quickly.
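Step 1 can be illustrated with a greedy policy. Among the segments eligible for cleaning, it picks the one with the fewest valid blocks, which minimizes the data that must be moved per segment reclaimed. This is a simplified sketch over a hypothetical in-memory usage table, not the f2fs implementation, which may also weigh factors such as segment age.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Greedy victim selection over a simplified segment usage table:
 * valid[i] is the number of valid blocks in segment i, and dirty[i] is
 * nonzero if segment i is a cleaning candidate. Returns the chosen
 * segment number, or -1 if no segment is dirty.
 */
static long pick_victim_greedy(const uint16_t *valid,
			       const unsigned char *dirty, size_t nsegs)
{
	long victim = -1;

	for (size_t i = 0; i < nsegs; i++) {
		if (!dirty[i])
			continue;
		if (victim < 0 || valid[i] < valid[victim])
			victim = (long)i;
	}
	return victim;
}
```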

Key Features
============

Flash Awareness
---------------
@@ -94,10 +97,11 @@ Cleaning Overhead
- Support multi-head logs for static/dynamic hot and cold data separation
- Introduce adaptive logging for efficient block allocation

Mount Options
=============


======================   ============================================================
background_gc=%s         Turn on/off cleaning operations, namely garbage
                         collection, triggered in the background when the I/O subsystem is
                         idle. If background_gc=on, it will turn on the garbage
@@ -167,7 +171,10 @@ fault_injection=%d Enable fault injection in all supported types with
fault_type=%d            Configure the fault injection type. It must be used
                         together with the fault_injection option. The type
                         values are shown below; a single type or a combination
                         of types (the bitwise OR of their values) may be given.

                         ===================      ===========
                         Type_Name                Type_Value
                         ===================      ===========
                         FAULT_KMALLOC            0x000000001
                         FAULT_KVMALLOC           0x000000002
                         FAULT_PAGE_ALLOC         0x000000004
@@ -183,6 +190,7 @@ fault_type=%d Support configuring fault injection type, should be
                         FAULT_CHECKPOINT         0x000001000
                         FAULT_DISCARD            0x000002000
                         FAULT_WRITE_IO           0x000004000
                         ===================      ===========

mode=%s                  Control block allocation mode which supports "adaptive"
                         and "lfs". In "lfs" mode, there should be no random
                         writes towards the main area.
@@ -246,22 +254,22 @@ compress_extension=%s Support adding specified extension, so that f2fs can enab
                         on compression extension list and enable compression on
                         these files by default rather than requiring it to be enabled via ioctl.
                         For other files, we can still enable compression via ioctl.
======================   ============================================================
================================================================================ Debugfs Entries
DEBUGFS ENTRIES ===============
================================================================================
/sys/kernel/debug/f2fs/ contains information about all the partitions mounted as /sys/kernel/debug/f2fs/ contains information about all the partitions mounted as
f2fs. Each file shows the whole f2fs information. f2fs. Each file shows the whole f2fs information.
/sys/kernel/debug/f2fs/status includes: /sys/kernel/debug/f2fs/status includes:
- major file system information managed by f2fs currently - major file system information managed by f2fs currently
- average SIT information about whole segments - average SIT information about whole segments
- current memory footprint consumed by f2fs. - current memory footprint consumed by f2fs.
================================================================================ Sysfs Entries
SYSFS ENTRIES =============
================================================================================
Information about mounted f2fs file systems can be found in Information about mounted f2fs file systems can be found in
/sys/fs/f2fs. Each mounted filesystem will have a directory in /sys/fs/f2fs. Each mounted filesystem will have a directory in
@@ -271,20 +279,22 @@ The files in each per-device directory are shown in table below.
Files in /sys/fs/f2fs/<devname> Files in /sys/fs/f2fs/<devname>
(see also Documentation/ABI/testing/sysfs-fs-f2fs) (see also Documentation/ABI/testing/sysfs-fs-f2fs)
================================================================================ Usage
USAGE =====
================================================================================
1. Download userland tools and compile them. 1. Download userland tools and compile them.
2. Skip, if f2fs was compiled statically inside kernel. 2. Skip, if f2fs was compiled statically inside kernel.
Otherwise, insert the f2fs.ko module. Otherwise, insert the f2fs.ko module::
# insmod f2fs.ko # insmod f2fs.ko
3. Create a directory trying to mount 3. Create a directory trying to mount::
# mkdir /mnt/f2fs # mkdir /mnt/f2fs
4. Format the block device, and then mount as f2fs 4. Format the block device, and then mount as f2fs::
# mkfs.f2fs -l label /dev/block_device # mkfs.f2fs -l label /dev/block_device
# mount -t f2fs /dev/block_device /mnt/f2fs # mount -t f2fs /dev/block_device /mnt/f2fs
@@ -294,18 +304,26 @@ The mkfs.f2fs is for the use of formatting a partition as the f2fs filesystem,
which builds a basic on-disk layout. which builds a basic on-disk layout.
The options consist of: The options consist of:
-l [label] : Give a volume label, up to 512 unicode name.
-a [0 or 1] : Split start location of each area for heap-based allocation. =============== ===========================================================
``-l [label]`` Give a volume label, up to 512 unicode name.
``-a [0 or 1]`` Split start location of each area for heap-based allocation.
1 is set by default, which performs this. 1 is set by default, which performs this.
-o [int] : Set overprovision ratio in percent over volume size. ``-o [int]`` Set overprovision ratio in percent over volume size.
5 is set by default. 5 is set by default.
-s [int] : Set the number of segments per section. ``-s [int]`` Set the number of segments per section.
1 is set by default. 1 is set by default.
-z [int] : Set the number of sections per zone. ``-z [int]`` Set the number of sections per zone.
1 is set by default. 1 is set by default.
-e [str] : Set basic extension list. e.g. "mp3,gif,mov" ``-e [str]`` Set basic extension list. e.g. "mp3,gif,mov"
-t [0 or 1] : Disable discard command or not. ``-t [0 or 1]`` Disable discard command or not.
1 is set by default, which conducts discard. 1 is set by default, which conducts discard.
=============== ===========================================================
fsck.f2fs fsck.f2fs
--------- ---------
@@ -314,7 +332,8 @@ partition, which examines whether the filesystem metadata and user-made data
are cross-referenced correctly or not. are cross-referenced correctly or not.
Note that, initial version of the tool does not fix any inconsistency. Note that, initial version of the tool does not fix any inconsistency.
The options consist of: The options consist of::
-d debug level [default:0] -d debug level [default:0]
dump.f2fs dump.f2fs
@@ -327,20 +346,21 @@ It shows on-disk inode information recognized by a given inode number, and is
able to dump all the SSA and SIT entries into predefined files, ./dump_ssa and able to dump all the SSA and SIT entries into predefined files, ./dump_ssa and
./dump_sit respectively. ./dump_sit respectively.
The options consist of: The options consist of::
-d debug level [default:0] -d debug level [default:0]
-i inode no (hex) -i inode no (hex)
-s [SIT dump segno from #1~#2 (decimal), for all 0~-1] -s [SIT dump segno from #1~#2 (decimal), for all 0~-1]
-a [SSA dump segno from #1~#2 (decimal), for all 0~-1] -a [SSA dump segno from #1~#2 (decimal), for all 0~-1]
Examples: Examples::
# dump.f2fs -i [ino] /dev/sdx
# dump.f2fs -s 0~-1 /dev/sdx (SIT dump)
# dump.f2fs -a 0~-1 /dev/sdx (SSA dump)
================================================================================ # dump.f2fs -i [ino] /dev/sdx
DESIGN # dump.f2fs -s 0~-1 /dev/sdx (SIT dump)
================================================================================ # dump.f2fs -a 0~-1 /dev/sdx (SSA dump)
Design
======
On-disk Layout On-disk Layout
-------------- --------------
@@ -351,7 +371,7 @@ consists of a set of sections. By default, section and zone sizes are set to one
segment size identically, but users can easily modify the sizes by mkfs. segment size identically, but users can easily modify the sizes by mkfs.
F2FS splits the entire volume into six areas, and all the areas except superblock F2FS splits the entire volume into six areas, and all the areas except superblock
consists of multiple segments as described below. consists of multiple segments as described below::
align with the zone size <-| align with the zone size <-|
|-> align with the segment size |-> align with the segment size
@@ -373,28 +393,28 @@ consists of multiple segments as described below.
|__zone__| |__zone__|
- Superblock (SB) - Superblock (SB)
: It is located at the beginning of the partition, and there exist two copies It is located at the beginning of the partition, and there exist two copies
to avoid file system crash. It contains basic partition information and some to avoid file system crash. It contains basic partition information and some
default parameters of f2fs. default parameters of f2fs.
- Checkpoint (CP) - Checkpoint (CP)
: It contains file system information, bitmaps for valid NAT/SIT sets, orphan It contains file system information, bitmaps for valid NAT/SIT sets, orphan
inode lists, and summary entries of current active segments. inode lists, and summary entries of current active segments.
- Segment Information Table (SIT) - Segment Information Table (SIT)
: It contains segment information such as valid block count and bitmap for the It contains segment information such as valid block count and bitmap for the
validity of all the blocks. validity of all the blocks.
- Node Address Table (NAT) - Node Address Table (NAT)
: It is composed of a block address table for all the node blocks stored in It is composed of a block address table for all the node blocks stored in
Main area. Main area.
- Segment Summary Area (SSA) - Segment Summary Area (SSA)
: It contains summary entries which contains the owner information of all the It contains summary entries which contains the owner information of all the
data and node blocks stored in Main area. data and node blocks stored in Main area.
- Main Area - Main Area
: It contains file and directory data including their indices. It contains file and directory data including their indices.
In order to avoid misalignment between file system and flash-based storage, F2FS In order to avoid misalignment between file system and flash-based storage, F2FS
aligns the start block address of CP with the segment size. Also, it aligns the aligns the start block address of CP with the segment size. Also, it aligns the
@@ -414,7 +434,7 @@ One of them always indicates the last valid data, which is called as shadow copy
mechanism. In addition to CP, NAT and SIT also adopt the shadow copy mechanism. mechanism. In addition to CP, NAT and SIT also adopt the shadow copy mechanism.
For file system consistency, each CP points to which NAT and SIT copies are For file system consistency, each CP points to which NAT and SIT copies are
valid, as shown as below. valid, as shown as below::
+--------+----------+---------+ +--------+----------+---------+
| CP | SIT | NAT | | CP | SIT | NAT |
@@ -438,7 +458,7 @@ indirect node. F2FS assigns 4KB to an inode block which contains 923 data block
indices, two direct node pointers, two indirect node pointers, and one double indices, two direct node pointers, two indirect node pointers, and one double
indirect node pointer as described below. One direct node block contains 1018 indirect node pointer as described below. One direct node block contains 1018
data blocks, and one indirect node block contains also 1018 node blocks. Thus, data blocks, and one indirect node block contains also 1018 node blocks. Thus,
one inode block (i.e., a file) covers: one inode block (i.e., a file) covers::
4KB * (923 + 2 * 1018 + 2 * 1018 * 1018 + 1018 * 1018 * 1018) := 3.94TB. 4KB * (923 + 2 * 1018 + 2 * 1018 * 1018 + 1018 * 1018 * 1018) := 3.94TB.
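The per-file coverage formula in the f2fs hunk above can be checked with a few lines of Python. The constant and function names are illustrative and are not taken from the kernel. The sketch also assumes that "TB" in the text means 2**40 bytes, because that is the unit for which the total rounds to 3.94.

```python
BLOCK_SIZE = 4096       # bytes per block (4KB)
INODE_ADDRS = 923       # data block indices stored in the inode itself
ADDRS_PER_NODE = 1018   # data block indices in one direct node block
NIDS_PER_NODE = 1018    # node pointers in one indirect node block


def max_file_blocks() -> int:
    """Count the data blocks reachable from one inode."""
    direct = 2 * ADDRS_PER_NODE                      # two direct node pointers
    indirect = 2 * NIDS_PER_NODE * ADDRS_PER_NODE    # two indirect node pointers
    double_indirect = NIDS_PER_NODE * NIDS_PER_NODE * ADDRS_PER_NODE  # one double indirect
    return INODE_ADDRS + direct + indirect + double_indirect


def max_file_bytes() -> int:
    """Maximum file size in bytes covered by one inode."""
    return BLOCK_SIZE * max_file_blocks()


print(max_file_blocks())                   # 1057053439
print(round(max_file_bytes() / 2**40, 2))  # 3.94
```

The double indirect term contributes almost the entire total, which is why it dominates the 3.94TB figure.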
@@ -473,6 +493,8 @@ A dentry block consists of 214 dentry slots and file names. Therein a bitmap is
used to represent whether each dentry is valid or not. A dentry block occupies used to represent whether each dentry is valid or not. A dentry block occupies
4KB with the following composition. 4KB with the following composition.
::
Dentry Block(4 K) = bitmap (27 bytes) + reserved (3 bytes) + Dentry Block(4 K) = bitmap (27 bytes) + reserved (3 bytes) +
dentries(11 * 214 bytes) + file name (8 * 214 bytes) dentries(11 * 214 bytes) + file name (8 * 214 bytes)
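The dentry block composition above adds up to exactly one 4KB block. Its 27-byte bitmap also holds one validity bit for each of the 214 slots. The following small Python check uses illustrative constant names, not kernel identifiers.

```python
BLOCK_SIZE = 4096
DENTRY_SLOTS = 214
BITMAP_BYTES = 27
RESERVED_BYTES = 3
DENTRY_BYTES = 11        # bytes per dentry
NAME_BYTES_PER_SLOT = 8  # file name bytes per slot


def dentry_block_size() -> int:
    """Sum of bitmap, reserved area, dentries and file name slots."""
    return (BITMAP_BYTES + RESERVED_BYTES
            + DENTRY_SLOTS * DENTRY_BYTES
            + DENTRY_SLOTS * NAME_BYTES_PER_SLOT)


def bitmap_bytes_needed(slots: int) -> int:
    """Smallest number of bytes holding one validity bit per slot."""
    return (slots + 7) // 8


print(dentry_block_size())                 # 4096
print(bitmap_bytes_needed(DENTRY_SLOTS))   # 27
```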
@@ -498,23 +520,25 @@ F2FS implements multi-level hash tables for directory structure. Each level has
a hash table with dedicated number of hash buckets as shown below. Note that a hash table with dedicated number of hash buckets as shown below. Note that
"A(2B)" means a bucket includes 2 data blocks. "A(2B)" means a bucket includes 2 data blocks.
---------------------- ::
A : bucket
B : block
N : MAX_DIR_HASH_DEPTH
----------------------
level #0 | A(2B) ----------------------
A : bucket
B : block
N : MAX_DIR_HASH_DEPTH
----------------------
level #0 | A(2B)
| |
level #1 | A(2B) - A(2B) level #1 | A(2B) - A(2B)
| |
level #2 | A(2B) - A(2B) - A(2B) - A(2B) level #2 | A(2B) - A(2B) - A(2B) - A(2B)
. | . . . . . | . . . .
level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B) level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
. | . . . . . | . . . .
level #N | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B) level #N | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B)
The number of blocks and buckets are determined by, The number of blocks and buckets are determined by::
,- 2, if n < MAX_DIR_HASH_DEPTH / 2, ,- 2, if n < MAX_DIR_HASH_DEPTH / 2,
# of blocks in level #n = | # of blocks in level #n = |
@@ -532,7 +556,7 @@ dentry consisting of the file name and its inode number. If not found, F2FS
scans the next hash table in level #1. In this way, F2FS scans hash tables in scans the next hash table in level #1. In this way, F2FS scans hash tables in
each levels incrementally from 1 to N. In each levels F2FS needs to scan only each levels incrementally from 1 to N. In each levels F2FS needs to scan only
one bucket determined by the following equation, which shows O(log(# of files)) one bucket determined by the following equation, which shows O(log(# of files))
complexity. complexity::
bucket number to scan in level #n = (hash value) % (# of buckets in level #n) bucket number to scan in level #n = (hash value) % (# of buckets in level #n)
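The lookup rule above picks one bucket per level as the hash value modulo that level's bucket count. It can be sketched as follows. The caller passes in the per-level bucket counts because the bucket-count formula is cut off in this diff. The example counts 1, 2 and 4 only mirror levels #0 to #2 of the figure, and the function names are hypothetical.

```python
from typing import Iterator, Sequence, Tuple


def bucket_to_scan(hash_value: int, buckets_in_level: int) -> int:
    """bucket number to scan in level #n = hash value % buckets in level #n."""
    return hash_value % buckets_in_level


def lookup_path(hash_value: int,
                buckets_per_level: Sequence[int]) -> Iterator[Tuple[int, int]]:
    """Yield (level, bucket) pairs in the order a lookup would visit them."""
    for level, buckets in enumerate(buckets_per_level):
        yield level, bucket_to_scan(hash_value, buckets)


print(list(lookup_path(13, [1, 2, 4])))  # [(0, 0), (1, 1), (2, 1)]
```

A lookup reads only one bucket per level. The work therefore grows with the number of levels, not with the number of directory entries.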
@@ -540,7 +564,8 @@ In the case of file creation, F2FS finds empty consecutive slots that cover the
file name. F2FS searches the empty slots in the hash tables of whole levels from file name. F2FS searches the empty slots in the hash tables of whole levels from
1 to N in the same way as the lookup operation. 1 to N in the same way as the lookup operation.
The following figure shows an example of two cases holding children. The following figure shows an example of two cases holding children::
--------------> Dir <-------------- --------------> Dir <--------------
| | | |
child child child child
@@ -611,14 +636,15 @@ Write-hint Policy
2) whint_mode=user-based. F2FS tries to pass down hints given by 2) whint_mode=user-based. F2FS tries to pass down hints given by
users. users.
===================== ======================== ===================
User F2FS Block User F2FS Block
---- ---- ----- ===================== ======================== ===================
META WRITE_LIFE_NOT_SET META WRITE_LIFE_NOT_SET
HOT_NODE " HOT_NODE "
WARM_NODE " WARM_NODE "
COLD_NODE " COLD_NODE "
*ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME
*extension list " " extension list " "
-- buffered io -- buffered io
WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME
@@ -635,11 +661,13 @@ WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET
WRITE_LIFE_NONE " WRITE_LIFE_NONE WRITE_LIFE_NONE " WRITE_LIFE_NONE
WRITE_LIFE_MEDIUM " WRITE_LIFE_MEDIUM WRITE_LIFE_MEDIUM " WRITE_LIFE_MEDIUM
WRITE_LIFE_LONG " WRITE_LIFE_LONG WRITE_LIFE_LONG " WRITE_LIFE_LONG
===================== ======================== ===================
3) whint_mode=fs-based. F2FS passes down hints with its policy. 3) whint_mode=fs-based. F2FS passes down hints with its policy.
===================== ======================== ===================
User F2FS Block User F2FS Block
---- ---- ----- ===================== ======================== ===================
META WRITE_LIFE_MEDIUM; META WRITE_LIFE_MEDIUM;
HOT_NODE WRITE_LIFE_NOT_SET HOT_NODE WRITE_LIFE_NOT_SET
WARM_NODE " WARM_NODE "
@@ -662,6 +690,7 @@ WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET
WRITE_LIFE_NONE " WRITE_LIFE_NONE WRITE_LIFE_NONE " WRITE_LIFE_NONE
WRITE_LIFE_MEDIUM " WRITE_LIFE_MEDIUM WRITE_LIFE_MEDIUM " WRITE_LIFE_MEDIUM
WRITE_LIFE_LONG " WRITE_LIFE_LONG WRITE_LIFE_LONG " WRITE_LIFE_LONG
===================== ======================== ===================
Fallocate(2) Policy Fallocate(2) Policy
------------------- -------------------
@@ -681,6 +710,7 @@ Allocating disk space
However, once F2FS receives ioctl(fd, F2FS_IOC_SET_PIN_FILE) in prior to However, once F2FS receives ioctl(fd, F2FS_IOC_SET_PIN_FILE) in prior to
fallocate(fd, DEFAULT_MODE), it allocates on-disk blocks addressess having fallocate(fd, DEFAULT_MODE), it allocates on-disk blocks addressess having
zero or random data, which is useful to the below scenario where: zero or random data, which is useful to the below scenario where:
1. create(fd) 1. create(fd)
2. ioctl(fd, F2FS_IOC_SET_PIN_FILE) 2. ioctl(fd, F2FS_IOC_SET_PIN_FILE)
3. fallocate(fd, 0, 0, size) 3. fallocate(fd, 0, 0, size)
@@ -692,26 +722,28 @@ Compression implementation
-------------------------- --------------------------
- New term named cluster is defined as basic unit of compression, file can - New term named cluster is defined as basic unit of compression, file can
be divided into multiple clusters logically. One cluster includes 4 << n be divided into multiple clusters logically. One cluster includes 4 << n
(n >= 0) logical pages, compression size is also cluster size, each of (n >= 0) logical pages, compression size is also cluster size, each of
cluster can be compressed or not. cluster can be compressed or not.
- In cluster metadata layout, one special block address is used to indicate - In cluster metadata layout, one special block address is used to indicate
cluster is compressed one or normal one, for compressed cluster, following cluster is compressed one or normal one, for compressed cluster, following
metadata maps cluster to [1, 4 << n - 1] physical blocks, in where f2fs metadata maps cluster to [1, 4 << n - 1] physical blocks, in where f2fs
stores data including compress header and compressed data. stores data including compress header and compressed data.
- In order to eliminate write amplification during overwrite, F2FS only - In order to eliminate write amplification during overwrite, F2FS only
support compression on write-once file, data can be compressed only when support compression on write-once file, data can be compressed only when
all logical blocks in file are valid and cluster compress ratio is lower all logical blocks in file are valid and cluster compress ratio is lower
than specified threshold. than specified threshold.
- To enable compression on regular inode, there are three ways: - To enable compression on regular inode, there are three ways:
* chattr +c file
* chattr +c dir; touch dir/file
* mount w/ -o compress_extension=ext; touch file.ext
Compress metadata layout: * chattr +c file
* chattr +c dir; touch dir/file
* mount w/ -o compress_extension=ext; touch file.ext
Compress metadata layout::
[Dnode Structure] [Dnode Structure]
+-----------------------------------------------+ +-----------------------------------------------+
| cluster 1 | cluster 2 | ......... | cluster N | | cluster 1 | cluster 2 | ......... | cluster N |
@@ -719,9 +751,9 @@ Compress metadata layout:
. . . . . . . .
. . . . . . . .
. Compressed Cluster . . Normal Cluster . . Compressed Cluster . . Normal Cluster .
+----------+---------+---------+---------+ +---------+---------+---------+---------+ +----------+---------+---------+---------+ +---------+---------+---------+---------+
|compr flag| block 1 | block 2 | block 3 | | block 1 | block 2 | block 3 | block 4 | |compr flag| block 1 | block 2 | block 3 | | block 1 | block 2 | block 3 | block 4 |
+----------+---------+---------+---------+ +---------+---------+---------+---------+ +----------+---------+---------+---------+ +---------+---------+---------+---------+
. . . .
. . . .
. . . .
......
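The f2fs compression text above gives the following cluster arithmetic. A cluster spans `4 << n` logical pages, and a compressed cluster maps to between 1 and `(4 << n) - 1` physical blocks. This sketch assumes that compression is kept only when the compressed data fits in that range, meaning it saves at least one block. That is a simplification of the text's "compress ratio is lower than specified threshold", and the function names are hypothetical.

```python
def cluster_pages(log_cluster_size: int) -> int:
    """Logical pages per cluster: 4 << n, with n >= 0."""
    if log_cluster_size < 0:
        raise ValueError("n must be >= 0")
    return 4 << log_cluster_size


def cluster_of_page(page_index: int, log_cluster_size: int) -> int:
    """Index of the cluster containing a given logical page."""
    return page_index // cluster_pages(log_cluster_size)


def fits_as_compressed(compressed_blocks: int, log_cluster_size: int) -> bool:
    """True if the compressed data fits in 1 .. (4 << n) - 1 blocks."""
    return 1 <= compressed_blocks <= cluster_pages(log_cluster_size) - 1


print(cluster_pages(2))          # 16
print(cluster_of_page(9, 0))     # 2
print(fits_as_compressed(4, 0))  # False: no block would be saved
```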
uevents and GFS2 .. SPDX-License-Identifier: GPL-2.0
==================
================
uevents and GFS2
================
During the lifetime of a GFS2 mount, a number of uevents are generated. During the lifetime of a GFS2 mount, a number of uevents are generated.
This document explains what the events are and what they are used This document explains what the events are and what they are used
for (by gfs_controld in gfs2-utils). for (by gfs_controld in gfs2-utils).
A list of GFS2 uevents A list of GFS2 uevents
----------------------- ======================
1. ADD 1. ADD
------
The ADD event occurs at mount time. It will always be the first The ADD event occurs at mount time. It will always be the first
uevent generated by the newly created filesystem. If the mount uevent generated by the newly created filesystem. If the mount
@@ -21,6 +25,7 @@ with no journal assigned), and read-only (with journal assigned) status
of the filesystem respectively. of the filesystem respectively.
2. ONLINE 2. ONLINE
---------
The ONLINE uevent is generated after a successful mount or remount. It The ONLINE uevent is generated after a successful mount or remount. It
has the same environment variables as the ADD uevent. The ONLINE has the same environment variables as the ADD uevent. The ONLINE
@@ -29,6 +34,7 @@ RDONLY are a relatively recent addition (2.6.32-rc+) and will not
be generated by older kernels. be generated by older kernels.
3. CHANGE 3. CHANGE
---------
The CHANGE uevent is used in two places. One is when reporting the The CHANGE uevent is used in two places. One is when reporting the
successful mount of the filesystem by the first node (FIRSTMOUNT=Done). successful mount of the filesystem by the first node (FIRSTMOUNT=Done).
@@ -52,6 +58,7 @@ cluster. For this reason the ONLINE uevent was used when adding a new
uevent for a successful mount or remount. uevent for a successful mount or remount.
4. OFFLINE 4. OFFLINE
----------
The OFFLINE uevent is only generated due to filesystem errors and is used The OFFLINE uevent is only generated due to filesystem errors and is used
as part of the "withdraw" mechanism. Currently this doesn't give any as part of the "withdraw" mechanism. Currently this doesn't give any
@@ -59,6 +66,7 @@ information about what the error is, which is something that needs to
be fixed. be fixed.
5. REMOVE 5. REMOVE
---------
The REMOVE uevent is generated at the end of an unsuccessful mount The REMOVE uevent is generated at the end of an unsuccessful mount
or at the end of a umount of the filesystem. All REMOVE uevents will or at the end of a umount of the filesystem. All REMOVE uevents will
@@ -68,9 +76,10 @@ kobject subsystem.
Information common to all GFS2 uevents (uevent environment variables) Information common to all GFS2 uevents (uevent environment variables)
---------------------------------------------------------------------- =====================================================================
1. LOCKTABLE= 1. LOCKTABLE=
--------------
The LOCKTABLE is a string, as supplied on the mount command The LOCKTABLE is a string, as supplied on the mount command
line (locktable=) or via fstab. It is used as a filesystem label line (locktable=) or via fstab. It is used as a filesystem label
@@ -78,6 +87,7 @@ as well as providing the information for a lock_dlm mount to be
able to join the cluster. able to join the cluster.
2. LOCKPROTO= 2. LOCKPROTO=
-------------
The LOCKPROTO is a string, and its value depends on what is set The LOCKPROTO is a string, and its value depends on what is set
on the mount command line, or via fstab. It will be either on the mount command line, or via fstab. It will be either
@@ -85,12 +95,14 @@ lock_nolock or lock_dlm. In the future other lock managers
may be supported. may be supported.
3. JOURNALID= 3. JOURNALID=
-------------
If a journal is in use by the filesystem (journals are not If a journal is in use by the filesystem (journals are not
assigned for spectator mounts) then this will give the assigned for spectator mounts) then this will give the
numeric journal id in all GFS2 uevents. numeric journal id in all GFS2 uevents.
4. UUID= 4. UUID=
--------
With recent versions of gfs2-utils, mkfs.gfs2 writes a UUID With recent versions of gfs2-utils, mkfs.gfs2 writes a UUID
into the filesystem superblock. If it exists, this will into the filesystem superblock. If it exists, this will
......
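The GFS2 uevent variables described above arrive in the uevent environment as `KEY=VALUE` strings. The following hypothetical Python helper shows how a userspace listener might extract the documented fields. The sample values are invented. JOURNALID is converted to an integer only because the text describes it as a numeric journal id.

```python
from typing import Dict, Iterable, Union

GFS2_KEYS = ("LOCKTABLE", "LOCKPROTO", "JOURNALID", "UUID")


def parse_gfs2_uevent(env: Iterable[str]) -> Dict[str, Union[str, int]]:
    """Pick the GFS2-specific variables out of KEY=VALUE uevent strings."""
    fields: Dict[str, Union[str, int]] = {}
    for entry in env:
        key, sep, value = entry.partition("=")
        if not sep or key not in GFS2_KEYS:
            continue  # skip malformed entries and unrelated variables
        fields[key] = int(value) if key == "JOURNALID" else value
    return fields


sample = ["ACTION=add", "LOCKTABLE=mycluster:myfs",
          "LOCKPROTO=lock_dlm", "JOURNALID=0"]
print(parse_gfs2_uevent(sample))
```

A spectator mount has no journal assigned, so its events would produce no `JOURNALID` key in the result.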
.. SPDX-License-Identifier: GPL-2.0
==================
Global File System Global File System
------------------ ==================
https://fedorahosted.org/cluster/wiki/HomePage https://fedorahosted.org/cluster/wiki/HomePage
@@ -14,16 +17,18 @@ on one machine show up immediately on all other machines in the cluster.
GFS uses interchangeable inter-node locking mechanisms, the currently GFS uses interchangeable inter-node locking mechanisms, the currently
supported mechanisms are: supported mechanisms are:
lock_nolock -- allows gfs to be used as a local file system lock_nolock
- allows gfs to be used as a local file system
lock_dlm -- uses a distributed lock manager (dlm) for inter-node locking lock_dlm
- uses a distributed lock manager (dlm) for inter-node locking.
The dlm is found at linux/fs/dlm/ The dlm is found at linux/fs/dlm/
Lock_dlm depends on user space cluster management systems found Lock_dlm depends on user space cluster management systems found
at the URL above. at the URL above.
To use gfs as a local file system, no external clustering systems are To use gfs as a local file system, no external clustering systems are
needed, simply: needed, simply::
$ mkfs -t gfs2 -p lock_nolock -j 1 /dev/block_device $ mkfs -t gfs2 -p lock_nolock -j 1 /dev/block_device
$ mount -t gfs2 /dev/block_device /dir $ mount -t gfs2 /dev/block_device /dir
@@ -37,9 +42,12 @@ GFS2 is not on-disk compatible with previous versions of GFS, but it
is pretty close. is pretty close.
The following man pages can be found at the URL above: The following man pages can be found at the URL above:
============ =============================================
fsck.gfs2 to repair a filesystem fsck.gfs2 to repair a filesystem
gfs2_grow to expand a filesystem online gfs2_grow to expand a filesystem online
gfs2_jadd to add journals to a filesystem online gfs2_jadd to add journals to a filesystem online
tunegfs2 to manipulate, examine and tune a filesystem tunegfs2 to manipulate, examine and tune a filesystem
gfs2_convert to convert a gfs filesystem to gfs2 in-place gfs2_convert to convert a gfs filesystem to gfs2 in-place
mkfs.gfs2 to make a filesystem mkfs.gfs2 to make a filesystem
============ =============================================
Note: This filesystem doesn't have a maintainer. .. SPDX-License-Identifier: GPL-2.0
==================================
Macintosh HFS Filesystem for Linux Macintosh HFS Filesystem for Linux
================================== ==================================
HFS stands for ``Hierarchical File System'' and is the filesystem used
.. Note:: This filesystem doesn't have a maintainer.
HFS stands for ``Hierarchical File System`` and is the filesystem used
by the Mac Plus and all later Macintosh models. Earlier Macintosh by the Mac Plus and all later Macintosh models. Earlier Macintosh
models used MFS (``Macintosh File System''), which is not supported, models used MFS (``Macintosh File System``), which is not supported,
MacOS 8.1 and newer support a filesystem called HFS+ that's similar to MacOS 8.1 and newer support a filesystem called HFS+ that's similar to
HFS but is extended in various areas. Use the hfsplus filesystem driver HFS but is extended in various areas. Use the hfsplus filesystem driver
to access such filesystems from Linux. to access such filesystems from Linux.
@@ -49,25 +54,25 @@ Writing to HFS Filesystems
HFS is not a UNIX filesystem, thus it does not have the usual features you'd HFS is not a UNIX filesystem, thus it does not have the usual features you'd
expect: expect:
o You can't modify the set-uid, set-gid, sticky or executable bits or the uid * You can't modify the set-uid, set-gid, sticky or executable bits or the uid
and gid of files. and gid of files.
o You can't create hard- or symlinks, device files, sockets or FIFOs. * You can't create hard- or symlinks, device files, sockets or FIFOs.
HFS does on the other have the concepts of multiple forks per file. These HFS does on the other have the concepts of multiple forks per file. These
non-standard forks are represented as hidden additional files in the normal non-standard forks are represented as hidden additional files in the normal
filesystems namespace which is kind of a cludge and makes the semantics for filesystems namespace which is kind of a cludge and makes the semantics for
the a little strange: the a little strange:
o You can't create, delete or rename resource forks of files or the * You can't create, delete or rename resource forks of files or the
Finder's metadata. Finder's metadata.
o They are however created (with default values), deleted and renamed * They are however created (with default values), deleted and renamed
along with the corresponding data fork or directory. along with the corresponding data fork or directory.
o Copying files to a different filesystem will loose those attributes * Copying files to a different filesystem will loose those attributes
that are essential for MacOS to work. that are essential for MacOS to work.
Creating HFS filesystems Creating HFS filesystems
=================================== ========================
The hfsutils package from Robert Leslie contains a program called The hfsutils package from Robert Leslie contains a program called
hformat that can be used to create HFS filesystem. See hformat that can be used to create HFS filesystem. See
......
.. SPDX-License-Identifier: GPL-2.0
======================================
Macintosh HFSPlus Filesystem for Linux Macintosh HFSPlus Filesystem for Linux
====================================== ======================================
......
.. SPDX-License-Identifier: GPL-2.0
====================
Read/Write HPFS 2.09 Read/Write HPFS 2.09
====================
1998-2004, Mikulas Patocka 1998-2004, Mikulas Patocka
email: mikulas@artax.karlin.mff.cuni.cz :email: mikulas@artax.karlin.mff.cuni.cz
homepage: http://artax.karlin.mff.cuni.cz/~mikulas/vyplody/hpfs/index-e.cgi :homepage: http://artax.karlin.mff.cuni.cz/~mikulas/vyplody/hpfs/index-e.cgi
CREDITS: Credits
=======
Chris Smith, 1993, original read-only HPFS, some code and hpfs structures file Chris Smith, 1993, original read-only HPFS, some code and hpfs structures file
is taken from it is taken from it
Jacques Gelinas, MSDos mmap, Inspired by fs/nfs/mmap.c (Jon Tombs 15 Aug 1993) Jacques Gelinas, MSDos mmap, Inspired by fs/nfs/mmap.c (Jon Tombs 15 Aug 1993)
Werner Almesberger, 1992, 1993, MSDos option parser & CR/LF conversion Werner Almesberger, 1992, 1993, MSDos option parser & CR/LF conversion
Mount options Mount options
@@ -50,6 +58,7 @@ timeshift=(-)nnn (default 0)
File names File names
==========
As in OS/2, filenames are case insensitive. However, the shell thinks that names As in OS/2, filenames are case insensitive. However, the shell thinks that names
are case sensitive, so for example when you create a file FOO, you can use are case sensitive, so for example when you create a file FOO, you can use
...@@ -64,6 +73,7 @@ access it under names 'a.', 'a..', 'a . . . ' etc. ...@@ -64,6 +73,7 @@ access it under names 'a.', 'a..', 'a . . . ' etc.
Extended attributes Extended attributes
===================
On HPFS partitions, OS/2 can associate with each file special information called On HPFS partitions, OS/2 can associate with each file special information called
extended attributes. Extended attributes are pairs of (key,value) where key is extended attributes. Extended attributes are pairs of (key,value) where key is
...@@ -88,6 +98,7 @@ values doesn't work. ...@@ -88,6 +98,7 @@ values doesn't work.
Symlinks Symlinks
========
You can create symlinks on an HPFS partition; symlinks are achieved by setting extended You can create symlinks on an HPFS partition; symlinks are achieved by setting extended
attribute named "SYMLINK" with symlink value. Like on ext2, you can chown and attribute named "SYMLINK" with symlink value. Like on ext2, you can chown and
...@@ -101,6 +112,7 @@ to analyze or change OS2SYS.INI. ...@@ -101,6 +112,7 @@ to analyze or change OS2SYS.INI.
Codepages Codepages
=========
HPFS can contain several uppercasing tables for several codepages and each HPFS can contain several uppercasing tables for several codepages and each
file has a pointer to the codepage its name is in. However OS/2 was created in file has a pointer to the codepage its name is in. However OS/2 was created in
...@@ -128,6 +140,7 @@ this codepage - if you don't try to do what I described above :-) ...@@ -128,6 +140,7 @@ this codepage - if you don't try to do what I described above :-)
Known bugs Known bugs
==========
HPFS386 on an OS/2 server is not supported. HPFS386 installed on a normal OS/2 client HPFS386 on an OS/2 server is not supported. HPFS386 installed on a normal OS/2 client
should work. If you have an OS/2 server, use only read-only mode. I don't know how should work. If you have an OS/2 server, use only read-only mode. I don't know how
...@@ -152,7 +165,8 @@ would result in directory tree splitting, that takes disk space. Workaround is ...@@ -152,7 +165,8 @@ would result in directory tree splitting, that takes disk space. Workaround is
to delete other files that are leaf (probability that the file is non-leaf is to delete other files that are leaf (probability that the file is non-leaf is
about 1/50) or to truncate the file first to make some space. about 1/50) or to truncate the file first to make some space.
You encounter this problem only if you have many directories so that You encounter this problem only if you have many directories so that
the preallocated directory band is full i.e. the preallocated directory band is full i.e.::
number_of_directories / size_of_filesystem_in_mb > 4. number_of_directories / size_of_filesystem_in_mb > 4.
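The directory-band condition above can be restated as a small C predicate. The function name `hpfs_dirband_likely_full()` is hypothetical, and the driver exposes nothing like it. It multiplies instead of dividing so the comparison stays exact in integer arithmetic. For example, a 100 MB partition crosses the threshold once it holds more than 400 directories.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the HPFS driver) restating the rule of
 * thumb above: the preallocated directory band is likely full when
 * number_of_directories / size_of_filesystem_in_mb > 4.
 * "dirs > 4 * size_mb" is the same test without a division, so it needs no
 * floating point and cannot divide by zero.
 */
static bool hpfs_dirband_likely_full(uint64_t number_of_directories,
                                     uint64_t size_of_filesystem_in_mb)
{
    return number_of_directories > 4 * size_of_filesystem_in_mb;
}
```

Exactly four directories per megabyte (400 on 100 MB) does not yet satisfy the strict "> 4".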
You can't delete open directories. You can't delete open directories.
...@@ -174,6 +188,7 @@ anybody know what does it mean? ...@@ -174,6 +188,7 @@ anybody know what does it mean?
What does "unbalanced tree" message mean? What does "unbalanced tree" message mean?
=========================================
Old versions of this driver sometimes created unbalanced dnode trees. OS/2 Old versions of this driver sometimes created unbalanced dnode trees. OS/2
chkdsk doesn't scream if the tree is unbalanced (and sometimes creates chkdsk doesn't scream if the tree is unbalanced (and sometimes creates
...@@ -187,6 +202,7 @@ whole created by this driver, it is BUG - let me know about it. ...@@ -187,6 +202,7 @@ whole created by this driver, it is BUG - let me know about it.
Bugs in OS/2 Bugs in OS/2
============
When you have two (or more) lost directories pointing to each other, chkdsk When you have two (or more) lost directories pointing to each other, chkdsk
locks up when repairing the filesystem. locks up when repairing the filesystem.
...@@ -199,13 +215,16 @@ File names like "a .b" are marked as 'long' by OS/2 but chkdsk "corrects" it and ...@@ -199,13 +215,16 @@ File names like "a .b" are marked as 'long' by OS/2 but chkdsk "corrects" it and
marks them as short (and writes "minor fs error corrected"). This bug is not in marks them as short (and writes "minor fs error corrected"). This bug is not in
HPFS386. HPFS386.
Codepage bugs described above. Codepage bugs described above
=============================
If you don't install fixpacks, there are many, many more... If you don't install fixpacks, there are many, many more...
History History
=======
====== =========================================================================
0.90 First public release 0.90 First public release
0.91 Fixed bug that caused shooting to memory when write_inode was called on 0.91 Fixed bug that caused shooting to memory when write_inode was called on
open inode (rarely happened) open inode (rarely happened)
...@@ -219,78 +238,116 @@ History ...@@ -219,78 +238,116 @@ History
1.91 Fixed a bug that chk_sectors failed when sectors were at the end of disk 1.91 Fixed a bug that chk_sectors failed when sectors were at the end of disk
Fixed a race-condition when write_inode is called while deleting a file Fixed a race-condition when write_inode is called while deleting a file
Fixed a bug that could possibly happen (with very low probability) when Fixed a bug that could possibly happen (with very low probability) when
using 0xff in filenames using 0xff in filenames.
Rewritten locking to avoid race-conditions Rewritten locking to avoid race-conditions
Mount option 'eas' now works Mount option 'eas' now works
Fsync no longer returns error Fsync no longer returns error
Files beginning with '.' are marked hidden Files beginning with '.' are marked hidden
Remount support added Remount support added
Alloc is not so slow when filesystem becomes full Alloc is not so slow when filesystem becomes full
Atimes are no longer updated because it slows down operation Atimes are no longer updated because it slows down operation
Code cleanup (removed all commented debug prints) Code cleanup (removed all commented debug prints)
1.92 Corrected a bug when sync was called just before closing file 1.92 Corrected a bug when sync was called just before closing file
1.93 Modified, so that it works with kernels >= 2.1.131, I don't know if it 1.93 Modified, so that it works with kernels >= 2.1.131, I don't know if it
works with previous versions works with previous versions
Fixed a possible problem with disks > 64G (but I don't have one, so I can't Fixed a possible problem with disks > 64G (but I don't have one, so I can't
test it) test it)
Fixed a file overflow at 2G Fixed a file overflow at 2G
Added new option 'timeshift' Added new option 'timeshift'
Changed behaviour on HPFS386: It is now possible to operate on HPFS386 in Changed behaviour on HPFS386: It is now possible to operate on HPFS386 in
read-only mode read-only mode
Fixed a bug that slowed down alloc and prevented allocating 100% space Fixed a bug that slowed down alloc and prevented allocating 100% space
(this bug was not destructive) (this bug was not destructive)
1.94 Added workaround for one bug in Linux 1.94 Added workaround for one bug in Linux
Fixed one buffer leak Fixed one buffer leak
Fixed some incompatibilities with large extended attributes (but it's still Fixed some incompatibilities with large extended attributes (but it's still
not 100% ok, I have no info on it and OS/2 doesn't want to create them) not 100% ok, I have no info on it and OS/2 doesn't want to create them)
Rewritten allocation Rewritten allocation
Fixed a bug with i_blocks (du sometimes didn't display correct values) Fixed a bug with i_blocks (du sometimes didn't display correct values)
Directories no longer have the archive attribute set (some programs don't like Directories no longer have the archive attribute set (some programs don't like
it) it)
Fixed a bug that set one flag incorrectly in a large anode tree (it was not Fixed a bug that set one flag incorrectly in a large anode tree (it was not
destructive) destructive)
1.95 Fixed one buffer leak, that could happen on corrupted filesystem 1.95 Fixed one buffer leak, that could happen on corrupted filesystem
Fixed one bug in allocation in 1.94 Fixed one bug in allocation in 1.94
1.96 Added workaround for one bug in OS/2 (HPFS locked up, HPFS386 reported 1.96 Added workaround for one bug in OS/2 (HPFS locked up, HPFS386 reported
error sometimes when opening directories in PMSHELL) error sometimes when opening directories in PMSHELL)
Fixed a possible bitmap race Fixed a possible bitmap race
Fixed possible problem on large disks Fixed possible problem on large disks
You can now delete open files You can now delete open files
Fixed a nondestructive race in rename Fixed a nondestructive race in rename
1.97 Support for HPFS v3 (on large partitions) 1.97 Support for HPFS v3 (on large partitions)
Fixed a bug that didn't allow creation of files > 128M (it should be 2G)
Fixed a bug that didn't allow creation of files > 128M
(it should be 2G)
1.97.1 Changed names of global symbols 1.97.1 Changed names of global symbols
Fixed a bug when chmoding or chowning the root directory Fixed a bug when chmoding or chowning the root directory
1.98 Fixed a deadlock when using old_readdir 1.98 Fixed a deadlock when using old_readdir
Better directory handling; workaround for "unbalanced tree" bug in OS/2 Better directory handling; workaround for "unbalanced tree" bug in OS/2
1.99 Corrected a possible problem when there's not enough space while deleting 1.99 Corrected a possible problem when there's not enough space while deleting
file file
Now it tries to truncate the file if there's not enough space when deleting
Now it tries to truncate the file if there's not enough space when
deleting
Removed a lot of redundant code Removed a lot of redundant code
2.00 Fixed a bug in rename (it was there since 1.96) 2.00 Fixed a bug in rename (it was there since 1.96)
Better anti-fragmentation strategy Better anti-fragmentation strategy
2.01 Fixed problem with directory listing over NFS 2.01 Fixed problem with directory listing over NFS
Directory lseek now checks for proper parameters Directory lseek now checks for proper parameters
Fixed race-condition in buffer code - it is in all filesystems in Linux; Fixed race-condition in buffer code - it is in all filesystems in Linux;
when reading device (cat /dev/hda) while creating files on it, files when reading device (cat /dev/hda) while creating files on it, files
could be damaged could be damaged
2.02 Workaround for bug in breada in Linux. breada could cause accesses beyond 2.02 Workaround for bug in breada in Linux. breada could cause accesses beyond
end of partition end of partition
2.03 Char, block devices and pipes are correctly created 2.03 Char, block devices and pipes are correctly created
Fixed non-crashing race in unlink (Alexander Viro) Fixed non-crashing race in unlink (Alexander Viro)
Now it works with Japanese version of OS/2 Now it works with Japanese version of OS/2
2.04 Fixed error when ftruncate used to extend file 2.04 Fixed error when ftruncate used to extend file
2.05 Fixed crash when mount parameters were given without = 2.05 Fixed crash when mount parameters were given without =
Fixed crash when allocation of anode failed due to full disk Fixed crash when allocation of anode failed due to full disk
Fixed some crashes when block io or inode allocation failed Fixed some crashes when block io or inode allocation failed
2.06 Fixed some crashes on corrupted disk structures 2.06 Fixed some crashes on corrupted disk structures
Better allocation strategy Better allocation strategy
Reschedule points added so that it doesn't lock the CPU for a long time Reschedule points added so that it doesn't lock the CPU for a long time
It should work in read-only mode on Warp Server It should work in read-only mode on Warp Server
2.07 More fixes for Warp Server. Now it really works 2.07 More fixes for Warp Server. Now it really works
2.08 Creating new files is not so slow on large disks 2.08 Creating new files is not so slow on large disks
An attempt to sync a deleted file does not generate a filesystem error An attempt to sync a deleted file does not generate a filesystem error
2.09 Fixed error on extremely fragmented files 2.09 Fixed error on extremely fragmented files
====== =========================================================================
vim: set textwidth=80:
...@@ -46,9 +46,53 @@ Documentation for filesystem implementations. ...@@ -46,9 +46,53 @@ Documentation for filesystem implementations.
.. toctree:: .. toctree::
:maxdepth: 2 :maxdepth: 2
9p
adfs
affs
afs
autofs autofs
autofs-mount-control
befs
bfs
btrfs
ceph
cramfs
debugfs
dlmfs
ecryptfs
efivarfs
erofs
ext2
ext3
f2fs
gfs2
gfs2-uevents
hfs
hfsplus
hpfs
fuse fuse
inotify
isofs
nilfs2
nfs/index
ntfs
ocfs2
ocfs2-online-filecheck
omfs
orangefs
overlayfs overlayfs
proc
qnx6
ramfs-rootfs-initramfs
relay
romfs
squashfs
sysfs
sysv-fs
tmpfs
ubifs
ubifs-authentication.rst
udf
virtiofs virtiofs
vfat vfat
nfs/index zonefs
inotify .. SPDX-License-Identifier: GPL-2.0
a powerful yet simple file change notification system
===============================================================
Inotify - A Powerful yet Simple File Change Notification System
===============================================================
Document started 15 Mar 2005 by Robert Love <rml@novell.com> Document started 15 Mar 2005 by Robert Love <rml@novell.com>
Document updated 4 Jan 2015 by Zhang Zhen <zhenzhang.zhang@huawei.com> Document updated 4 Jan 2015 by Zhang Zhen <zhenzhang.zhang@huawei.com>
--Deleted obsoleted interface, just refer to manpages for user interface.
- Deleted obsoleted interface, just refer to manpages for user interface.
(i) Rationale (i) Rationale
Q: What is the design decision behind not tying the watch to the open fd of Q:
What is the design decision behind not tying the watch to the open fd of
the watched object? the watched object?
A: Watches are associated with an open inotify device, not an open file. A:
Watches are associated with an open inotify device, not an open file.
This solves the primary problem with dnotify: keeping the file open pins This solves the primary problem with dnotify: keeping the file open pins
the file and thus, worse, pins the mount. Dnotify is therefore infeasible the file and thus, worse, pins the mount. Dnotify is therefore infeasible
for use on a desktop system with removable media as the media cannot be for use on a desktop system with removable media as the media cannot be
unmounted. Watching a file should not require that it be open. unmounted. Watching a file should not require that it be open.
Q: What is the design decision behind using an-fd-per-instance as opposed to Q:
What is the design decision behind using an-fd-per-instance as opposed to
an fd-per-watch? an fd-per-watch?
A: An fd-per-watch quickly consumes more file descriptors than are allowed, A:
An fd-per-watch quickly consumes more file descriptors than are allowed,
more fd's than are feasible to manage, and more fd's than are optimally more fd's than are feasible to manage, and more fd's than are optimally
select()-able. Yes, root can bump the per-process fd limit and yes, users select()-able. Yes, root can bump the per-process fd limit and yes, users
can use epoll, but requiring both is a silly and extraneous requirement. can use epoll, but requiring both is a silly and extraneous requirement.
...@@ -65,9 +74,11 @@ A: An fd-per-watch quickly consumes more file descriptors than are allowed, ...@@ -65,9 +74,11 @@ A: An fd-per-watch quickly consumes more file descriptors than are allowed,
need not be a one-fd-per-process mapping; it is one-fd-per-queue and a need not be a one-fd-per-process mapping; it is one-fd-per-queue and a
process can easily want more than one queue. process can easily want more than one queue.
Q: Why the system call approach? Q:
Why the system call approach?
A: The poor user-space interface is the second biggest problem with dnotify. A:
The poor user-space interface is the second biggest problem with dnotify.
Signals are a terrible, terrible interface for file notification. Or for Signals are a terrible, terrible interface for file notification. Or for
anything, for that matter. The ideal solution, from all perspectives, is a anything, for that matter. The ideal solution, from all perspectives, is a
file descriptor-based one that allows basic file I/O and poll/select. file descriptor-based one that allows basic file I/O and poll/select.
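The answers above reduce to three properties. A watch is attached to an inotify instance rather than to an open file inside the watched tree. One instance, and so one fd, can carry many watches. That fd is read with ordinary read() and waited on with poll()/select(). The following is a minimal C sketch of those properties, using the glibc wrappers `inotify_init1()` and `inotify_add_watch()` documented in inotify(7). The helper names `wait_for_create()` and `demo_one_fd_two_watches()`, the /tmp directory templates, the file name and the 1000 ms timeout are illustrative assumptions and not part of the kernel interface.

```c
#define _GNU_SOURCE /* mkdtemp(), O_CLOEXEC, PATH_MAX on glibc */
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/inotify.h>
#include <unistd.h>

/*
 * Wait (with poll(), as with any other fd) for an IN_CREATE event on one
 * inotify instance and copy the created entry's name into out.
 * Returns 0 on success, -1 on timeout or error.
 */
static int wait_for_create(int fd, int timeout_ms, char *out, size_t out_len)
{
    /* Aligned buffer large enough for several events with full names. */
    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    assert(out_len > 0);
    while (poll(&pfd, 1, timeout_ms) > 0) {
        ssize_t len = read(fd, buf, sizeof(buf)); /* plain file I/O */
        if (len <= 0)
            return -1;
        /* One read() may return several variable-length events. */
        for (char *p = buf; p < buf + len;) {
            const struct inotify_event *ev = (const struct inotify_event *)p;
            if ((ev->mask & IN_CREATE) && ev->len > 0) {
                snprintf(out, out_len, "%s", ev->name);
                return 0;
            }
            p += sizeof(struct inotify_event) + ev->len;
        }
    }
    return -1;
}

/*
 * One inotify instance (a single fd) carrying watches on two directories.
 * Creates two temporary directories, watches both, creates a file in the
 * second, and reports the name seen. Returns 0 on success.
 */
static int demo_one_fd_two_watches(char *name, size_t name_len)
{
    char dir_a[] = "/tmp/inotify-a-XXXXXX";
    char dir_b[] = "/tmp/inotify-b-XXXXXX";
    char path[PATH_MAX];
    int rc = -1;

    if (!mkdtemp(dir_a))
        return -1;
    if (!mkdtemp(dir_b)) {
        rmdir(dir_a);
        return -1;
    }

    int fd = inotify_init1(IN_CLOEXEC); /* one fd for the whole instance */
    if (fd >= 0 &&
        inotify_add_watch(fd, dir_a, IN_CREATE) >= 0 &&
        inotify_add_watch(fd, dir_b, IN_CREATE) >= 0) {
        /* Simulate another process creating a file in dir_b; the watcher
         * itself holds no fd inside either directory. */
        snprintf(path, sizeof(path), "%s/created-in-b", dir_b);
        int f = open(path, O_CREAT | O_WRONLY | O_CLOEXEC, 0600);
        if (f >= 0) {
            close(f);
            rc = wait_for_create(fd, 1000, name, name_len);
            unlink(path);
        }
    }
    if (fd >= 0)
        close(fd); /* closing the instance drops all of its watches */
    rmdir(dir_b);
    rmdir(dir_a);
    return rc;
}
```

Unlike dnotify, the watcher never keeps a file descriptor open inside the watched directories, so the watches do not stop the filesystem from being unmounted. A real program would usually add the instance fd to an existing poll() or epoll set instead of polling it on its own.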
......
.. SPDX-License-Identifier: GPL-2.0
==================
ISO9660 Filesystem
==================
Mount options that are the same as for msdos and vfat partitions. Mount options that are the same as for msdos and vfat partitions.
========= ========================================================
gid=nnn All files in the partition will be in group nnn. gid=nnn All files in the partition will be in group nnn.
uid=nnn All files in the partition will be owned by user id nnn. uid=nnn All files in the partition will be owned by user id nnn.
umask=nnn The permission mask (see umask(1)) for the partition. umask=nnn The permission mask (see umask(1)) for the partition.
========= ========================================================
Mount options that are the same as vfat partitions. These are only useful Mount options that are the same as vfat partitions. These are only useful
when using discs encoded using Microsoft's Joliet extensions. when using discs encoded using Microsoft's Joliet extensions.
============== =============================================================
iocharset=name Character set to use for converting from Unicode to iocharset=name Character set to use for converting from Unicode to
ASCII. Joliet filenames are stored in Unicode format, but ASCII. Joliet filenames are stored in Unicode format, but
Unix for the most part doesn't know how to deal with Unicode. Unix for the most part doesn't know how to deal with Unicode.
There is also an option of doing UTF-8 translations with the There is also an option of doing UTF-8 translations with the
utf8 option. utf8 option.
utf8 Encode Unicode names in UTF-8 format. Default is no. utf8 Encode Unicode names in UTF-8 format. Default is no.
============== =============================================================
Mount options unique to the isofs filesystem. Mount options unique to the isofs filesystem.
================= ============================================================
block=512 Set the block size for the disk to 512 bytes block=512 Set the block size for the disk to 512 bytes
block=1024 Set the block size for the disk to 1024 bytes block=1024 Set the block size for the disk to 1024 bytes
block=2048 Set the block size for the disk to 2048 bytes block=2048 Set the block size for the disk to 2048 bytes
...@@ -39,10 +52,13 @@ Mount options unique to the isofs filesystem. ...@@ -39,10 +52,13 @@ Mount options unique to the isofs filesystem.
recreate previous unhide behavior recreate previous unhide behavior
session=x Select number of session on multisession CD session=x Select number of session on multisession CD
sbsector=xxx Session begins from sector xxx sbsector=xxx Session begins from sector xxx
================= ============================================================
Recommended documents about ISO 9660 standard are located at: Recommended documents about ISO 9660 standard are located at:
http://www.y-adagio.com/
ftp://ftp.ecma.ch/ecma-st/Ecma-119.pdf - http://www.y-adagio.com/
- ftp://ftp.ecma.ch/ecma-st/Ecma-119.pdf
Quoting from the PDF "This 2nd Edition of Standard ECMA-119 is technically Quoting from the PDF "This 2nd Edition of Standard ECMA-119 is technically
identical with ISO 9660.", so it is a valid and gratis substitute for the identical with ISO 9660.", so it is a valid and gratis substitute for the
official ISO specification. official ISO specification.
.. SPDX-License-Identifier: GPL-2.0
======
NILFS2 NILFS2
------ ======
NILFS2 is a log-structured file system (LFS) supporting continuous NILFS2 is a log-structured file system (LFS) supporting continuous
snapshotting. In addition to versioning capability of the entire file snapshotting. In addition to versioning capability of the entire file
...@@ -25,9 +28,9 @@ available from the following download page. At least "mkfs.nilfs2", ...@@ -25,9 +28,9 @@ available from the following download page. At least "mkfs.nilfs2",
cleaner or garbage collector) are required. Details on the tools are cleaner or garbage collector) are required. Details on the tools are
described in the man pages included in the package. described in the man pages included in the package.
Project web page: https://nilfs.sourceforge.io/ :Project web page: https://nilfs.sourceforge.io/
Download page: https://nilfs.sourceforge.io/en/download.html :Download page: https://nilfs.sourceforge.io/en/download.html
List info: http://vger.kernel.org/vger-lists.html#linux-nilfs :List info: http://vger.kernel.org/vger-lists.html#linux-nilfs
Caveats Caveats
======= =======
...@@ -47,6 +50,7 @@ Mount options ...@@ -47,6 +50,7 @@ Mount options
NILFS2 supports the following mount options: NILFS2 supports the following mount options:
(*) == default (*) == default
======================= =======================================================
barrier(*) This enables/disables the use of write barriers. This barrier(*) This enables/disables the use of write barriers. This
nobarrier requires an IO stack which can support barriers, and nobarrier requires an IO stack which can support barriers, and
if nilfs gets an error on a barrier write, it will if nilfs gets an error on a barrier write, it will
...@@ -79,6 +83,7 @@ discard This enables/disables the use of discard/TRIM commands. ...@@ -79,6 +83,7 @@ discard This enables/disables the use of discard/TRIM commands.
nodiscard(*) The discard/TRIM commands are sent to the underlying nodiscard(*) The discard/TRIM commands are sent to the underlying
block device when blocks are freed. This is useful block device when blocks are freed. This is useful
for SSD devices and sparse/thinly-provisioned LUNs. for SSD devices and sparse/thinly-provisioned LUNs.
======================= =======================================================
Ioctls Ioctls
====== ======
...@@ -87,9 +92,11 @@ There is some NILFS2 specific functionality which can be accessed by application ...@@ -87,9 +92,11 @@ There is some NILFS2 specific functionality which can be accessed by application
through the system call interfaces. The list of all NILFS2 specific ioctls is through the system call interfaces. The list of all NILFS2 specific ioctls is
shown in the table below. shown in the table below.
Table of NILFS2 specific ioctls Table of NILFS2 specific ioctls:
..............................................................................
============================== ===============================================
Ioctl Description Ioctl Description
============================== ===============================================
NILFS_IOCTL_CHANGE_CPMODE Change mode of given checkpoint between NILFS_IOCTL_CHANGE_CPMODE Change mode of given checkpoint between
checkpoint and snapshot state. This ioctl is checkpoint and snapshot state. This ioctl is
used in chcp and mkcp utilities. used in chcp and mkcp utilities.
...@@ -142,11 +149,12 @@ Table of NILFS2 specific ioctls ...@@ -142,11 +149,12 @@ Table of NILFS2 specific ioctls
NILFS_IOCTL_SET_ALLOC_RANGE Define lower limit of segments in bytes and NILFS_IOCTL_SET_ALLOC_RANGE Define lower limit of segments in bytes and
upper limit of segments in bytes. This ioctl upper limit of segments in bytes. This ioctl
is used by nilfs_resize utility. is used by nilfs_resize utility.
============================== ===============================================
NILFS2 usage NILFS2 usage
============ ============
To use nilfs2 as a local file system, simply: To use nilfs2 as a local file system, simply::
# mkfs -t nilfs2 /dev/block_device # mkfs -t nilfs2 /dev/block_device
# mount -t nilfs2 /dev/block_device /dir # mount -t nilfs2 /dev/block_device /dir
...@@ -157,18 +165,20 @@ This will also invoke the cleaner through the mount helper program ...@@ -157,18 +165,20 @@ This will also invoke the cleaner through the mount helper program
Checkpoints and snapshots are managed by the following commands. Checkpoints and snapshots are managed by the following commands.
Their manpages are included in the nilfs-utils package above. Their manpages are included in the nilfs-utils package above.
==== ===========================================================
lscp list checkpoints or snapshots. lscp list checkpoints or snapshots.
mkcp make a checkpoint or a snapshot. mkcp make a checkpoint or a snapshot.
chcp change an existing checkpoint to a snapshot or vice versa. chcp change an existing checkpoint to a snapshot or vice versa.
rmcp invalidate specified checkpoint(s). rmcp invalidate specified checkpoint(s).
==== ===========================================================
To mount a snapshot, To mount a snapshot::
# mount -t nilfs2 -r -o cp=<cno> /dev/block_device /snap_dir # mount -t nilfs2 -r -o cp=<cno> /dev/block_device /snap_dir
where <cno> is the checkpoint number of the snapshot. where <cno> is the checkpoint number of the snapshot.
To unmount the NILFS2 mount point or snapshot, simply: To unmount the NILFS2 mount point or snapshot, simply::
# umount /dir # umount /dir
...@@ -181,7 +191,7 @@ Disk format ...@@ -181,7 +191,7 @@ Disk format
A nilfs2 volume is equally divided into a number of segments except A nilfs2 volume is equally divided into a number of segments except
for the super block (SB) and segment #0. A segment is the container for the super block (SB) and segment #0. A segment is the container
of logs. Each log is composed of summary information blocks, payload of logs. Each log is composed of summary information blocks, payload
blocks, and an optional super root block (SR): blocks, and an optional super root block (SR)::
______________________________________________________ ______________________________________________________
| |SB| | Segment | Segment | Segment | ... | Segment | | | |SB| | Segment | Segment | Segment | ... | Segment | |
...@@ -200,7 +210,7 @@ blocks, and an optional super root block (SR): ...@@ -200,7 +210,7 @@ blocks, and an optional super root block (SR):
|_blocks__|_________________|__| |_blocks__|_________________|__|
The payload blocks are organized per file, and each file consists of The payload blocks are organized per file, and each file consists of
data blocks and B-tree node blocks: data blocks and B-tree node blocks::
|<--- File-A --->|<--- File-B --->| |<--- File-A --->|<--- File-B --->|
_______________________________________________________________ _______________________________________________________________
...@@ -213,7 +223,7 @@ files without data blocks or B-tree node blocks. ...@@ -213,7 +223,7 @@ files without data blocks or B-tree node blocks.
The organization of the blocks is recorded in the summary information The organization of the blocks is recorded in the summary information
blocks, which contain a header structure (nilfs_segment_summary), per blocks, which contain a header structure (nilfs_segment_summary), per
file structures (nilfs_finfo), and per block structures (nilfs_binfo): file structures (nilfs_finfo), and per block structures (nilfs_binfo)::
_________________________________________________________________________ _________________________________________________________________________
| Summary | finfo | binfo | ... | binfo | finfo | binfo | ... | binfo |... | Summary | finfo | binfo | ... | binfo | finfo | binfo | ... | binfo |...
...@@ -223,7 +233,7 @@ file structures (nilfs_finfo), and per block structures (nilfs_binfo): ...@@ -223,7 +233,7 @@ file structures (nilfs_finfo), and per block structures (nilfs_binfo):
The logs include regular files, directory files, symbolic link files The logs include regular files, directory files, symbolic link files
and several meta data files. The meta data files are the files used and several meta data files. The meta data files are the files used
to maintain file system meta data. The current version of NILFS2 uses to maintain file system meta data. The current version of NILFS2 uses
the following meta data files: the following meta data files::
1) Inode file (ifile) -- Stores on-disk inodes 1) Inode file (ifile) -- Stores on-disk inodes
2) Checkpoint file (cpfile) -- Stores checkpoints 2) Checkpoint file (cpfile) -- Stores checkpoints
...@@ -232,7 +242,7 @@ the following meta data files: ...@@ -232,7 +242,7 @@ the following meta data files:
(DAT) block numbers. This file serves to (DAT) block numbers. This file serves to
make on-disk blocks relocatable. make on-disk blocks relocatable.
The following figure shows a typical organization of the logs: The following figure shows a typical organization of the logs::
_________________________________________________________________________ _________________________________________________________________________
| Summary | regular file | file | ... | ifile | cpfile | sufile | DAT |SR| | Summary | regular file | file | ... | ifile | cpfile | sufile | DAT |SR|
...@@ -250,7 +260,7 @@ three special inodes, inodes for the DAT, cpfile, and sufile. Inodes ...@@ -250,7 +260,7 @@ three special inodes, inodes for the DAT, cpfile, and sufile. Inodes
of regular files, directories, symlinks and other special files, are of regular files, directories, symlinks and other special files, are
included in the ifile. The inode of ifile itself is included in the included in the ifile. The inode of ifile itself is included in the
corresponding checkpoint entry in the cpfile. Thus, the hierarchy corresponding checkpoint entry in the cpfile. Thus, the hierarchy
among NILFS2 files can be depicted as follows: among NILFS2 files can be depicted as follows::
Super block (SB) Super block (SB)
| |
......
.. SPDX-License-Identifier: GPL-2.0
================================
The Linux NTFS filesystem driver The Linux NTFS filesystem driver
================================ ================================
Table of contents .. Table of contents
=================
- Overview - Overview
- Web site - Web site
- Features - Features
- Supported mount options - Supported mount options
- Known bugs and (mis-)features - Known bugs and (mis-)features
- Using NTFS volume and stripe sets - Using NTFS volume and stripe sets
- The Device-Mapper driver - The Device-Mapper driver
- The Software RAID / MD driver - The Software RAID / MD driver
- Limitations when using the MD driver - Limitations when using the MD driver
...@@ -66,8 +68,10 @@ Features ...@@ -66,8 +68,10 @@ Features
partition by creating a large file while in Windows and then loopback partition by creating a large file while in Windows and then loopback
mounting the file while in Linux and creating a Linux filesystem on it that mounting the file while in Linux and creating a Linux filesystem on it that
is used to install Linux on it. is used to install Linux on it.
- A comparison of the two drivers using: - A comparison of the two drivers using::
time find . -type f -exec md5sum "{}" \; time find . -type f -exec md5sum "{}" \;
run three times in sequence with each driver (after a reboot) on a 1.4GiB run three times in sequence with each driver (after a reboot) on a 1.4GiB
NTFS partition, showed the new driver to be 20% faster in total time elapsed NTFS partition, showed the new driver to be 20% faster in total time elapsed
(from 9:43 minutes on average down to 7:53). The time spent in user space (from 9:43 minutes on average down to 7:53). The time spent in user space

@@ -104,6 +108,7 @@ In addition to the generic mount options described by the manual page for the
mount command (man 8 mount, also see man 5 fstab), the NTFS driver supports the
following mount options:

======================= =======================================================
iocharset=name          Deprecated option. Still supported but please use
                        nls=name in the future. See description for nls=name.

@@ -175,16 +180,22 @@ disable_sparse=<BOOL> If disable_sparse is specified, creation of sparse
errors=opt              What to do when critical filesystem errors are found.
                        The following values can be used for "opt":

                        ======== =========================================
                        continue DEFAULT, try to clean up as much as
                                 possible, e.g. marking a corrupt inode as
                                 bad so it is no longer accessed, and then
                                 continue.
                        recover  At present, only recovery of the boot
                                 sector from the backup copy is supported.
                                 For a read-only mount, the recovery is
                                 done in memory only and not written to
                                 disk.
                        ======== =========================================

                        Note that the options are additive, i.e. specifying::

                           errors=continue,errors=recover

                        means the driver will attempt to recover and if that
                        fails it will clean up as much as possible and
                        continue.

@@ -202,12 +213,18 @@ mft_zone_multiplier= Set the MFT zone multiplier for the volume (this
                        In general, use the default. If you have a lot of small
                        files then use a higher value. The values have the
                        following meaning:

                        ===== =================================
                        Value MFT zone size (% of volume size)
                        ===== =================================
                          1   12.5%
                          2   25%
                          3   37.5%
                          4   50%
                        ===== =================================

                        Note this option is irrelevant for read-only mounts.
======================= =======================================================
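
Each step of mft_zone_multiplier adds another eighth (12.5%) of the volume to the MFT zone. The short C sketch below only restates the table above; both function names are hypothetical, and rounding the sector count down is an assumption, not a description of what the driver itself does.

```c
#include <assert.h>

/* Hypothetical helper: MFT zone size as a percentage of the volume. */
static double mft_zone_percent(int multiplier)
{
	assert(multiplier >= 1 && multiplier <= 4);	/* values listed above */
	return multiplier * 12.5;			/* one eighth per step */
}

/*
 * Hypothetical helper: approximate MFT zone size for a volume of
 * vol_sectors sectors, i.e. multiplier/8 of the volume, rounded down.
 */
static unsigned long long mft_zone_sectors(unsigned long long vol_sectors,
					   int multiplier)
{
	assert(multiplier >= 1 && multiplier <= 4);
	return vol_sectors * (unsigned int)multiplier / 8;
}
```

For example, the default multiplier of 1 reserves an eighth of the volume, and 4 reserves half of it.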

Known bugs and (mis-)features

@@ -252,13 +269,13 @@ To create the table describing your volume you will need to know each of its
components and their sizes in sectors, i.e. multiples of 512-byte blocks.
For NT4 fault tolerant volumes you can obtain the sizes using fdisk. So for
example if one of your partitions is /dev/hda2 you would do::

    $ fdisk -ul /dev/hda

    Disk /dev/hda: 81.9 GB, 81964302336 bytes
    255 heads, 63 sectors/track, 9964 cylinders, total 160086528 sectors
    Units = sectors of 1 * 512 = 512 bytes

       Device Boot      Start         End      Blocks   Id  System
    /dev/hda1   *          63     4209029     2104483+  83  Linux

@@ -271,15 +288,17 @@ And you would know that /dev/hda2 has a size of 37768814 - 4209030 + 1 =
For Win2k and later dynamic disks, you can for example use the ldminfo utility
which is part of the Linux LDM tools (the latest version at the time of
writing is linux-ldm-0.0.8.tar.bz2). You can download it from:

    http://www.linux-ntfs.org/

Simply extract the downloaded archive (tar xvjf linux-ldm-0.0.8.tar.bz2), go
into it (cd linux-ldm-0.0.8) and change to the test directory (cd test). You
will find the precompiled (i386) ldminfo utility there. NOTE: You will not be
able to compile this yourself easily so use the binary version!

Then you would use ldminfo in dump mode to obtain the necessary information::

    $ ./ldminfo --dump /dev/hda

This would dump the LDM database found on /dev/hda which describes all of your
dynamic disks and all the volumes on them. At the bottom you will see the

@@ -305,42 +324,36 @@ give you the correct information to do this.
Assuming you know all your devices and their sizes things are easy.

For a linear raid the table would look like this (note all values are in
512-byte sectors)::

    # Offset into  Size of this  Raid type  Device     Start sector
    # volume       device                              of device
    0              1028161       linear     /dev/hda1  0
    1028161        3903762       linear     /dev/hdb2  0
    4931923        2103211       linear     /dev/hdc1  0
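
In the linear table, each row's offset is the sum of the sizes of all rows before it, and a partition's size in sectors is End - Start + 1 from the fdisk listing. The C sketch below generates such a table from a list of member devices. The struct and function names are hypothetical and not part of dmsetup or any other tool, and every device is assumed to be used from its first sector, as in the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical description of one member of a linear (spanned) volume. */
struct component {
	const char *dev;		/* e.g. "/dev/hda1" */
	unsigned long long sectors;	/* size in 512-byte sectors */
};

/* Partition size from "fdisk -ul" output: End - Start + 1 sectors. */
static unsigned long long part_sectors(unsigned long long start,
				       unsigned long long end)
{
	assert(end >= start);
	return end - start + 1;
}

/*
 * Write one "linear" line per component into buf.  Each offset is the sum
 * of the sizes of all preceding components, and each device is used from
 * its first sector (start sector 0).  Returns the length written, or 0 if
 * buf is too small.
 */
static size_t format_linear_table(char *buf, size_t len,
				  const struct component *c, size_t n)
{
	unsigned long long offset = 0;
	size_t used = 0;

	if (len == 0)
		return 0;
	buf[0] = '\0';
	for (size_t i = 0; i < n; i++) {
		int w = snprintf(buf + used, len - used,
				 "%llu %llu linear %s 0\n",
				 offset, c[i].sectors, c[i].dev);
		if (w < 0 || (size_t)w >= len - used)
			return 0;
		used += (size_t)w;
		offset += c[i].sectors;
	}
	return used;
}
```

Given the three partitions and sizes used in the example table, this reproduces its three data rows. The comment lines are omitted because they are only for the reader.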

For a striped volume, i.e. raid level 0, you will need to know the chunk size
you used when creating the volume. Windows uses 64KiB as the default, so it
will probably be this unless you changed the defaults when creating the array.

For a raid level 0 the table would look like this (note all values are in
512-byte sectors)::

    # Offset  Size     Raid     Number   Chunk  1st        Start   2nd        Start
    # into    of the   type     of       size   Device     in      Device     in
    # volume  volume            stripes                    device             device
    0         2056320  striped  2        128    /dev/hda1  0       /dev/hdb1  0

If there are more than two devices, just add each of them to the end of the
line.
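
The chunk size column is in 512-byte sectors, so the 64KiB Windows default becomes 65536 / 512 = 128, as in the example. The C sketch below does that conversion and builds a striped line for any number of member devices, appending one "device start-sector" pair per member at the end of the line. The function names are hypothetical, and the whole volume is assumed to start at offset 0 with each member used from sector 0, as above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Convert a chunk size in bytes to 512-byte sectors (64KiB -> 128). */
static unsigned long chunk_bytes_to_sectors(unsigned long bytes)
{
	assert(bytes % 512 == 0);
	return bytes / 512;
}

/*
 * Format a "striped" line: offset 0, size, "striped", number of stripes,
 * chunk size, then a "device 0" pair per member device.
 * Returns the length written, or -1 if buf is too small.
 */
static int format_striped_line(char *buf, size_t len,
			       unsigned long long size, unsigned long chunk,
			       const char *const devs[], size_t ndevs)
{
	size_t used;
	int w = snprintf(buf, len, "0 %llu striped %zu %lu",
			 size, ndevs, chunk);

	if (w < 0 || (size_t)w >= len)
		return -1;
	used = (size_t)w;
	for (size_t i = 0; i < ndevs; i++) {
		w = snprintf(buf + used, len - used, " %s 0", devs[i]);
		if (w < 0 || (size_t)w >= len - used)
			return -1;
		used += (size_t)w;
	}
	return (int)used;
}
```

With two members and a 64KiB chunk, this gives the same line as the example table. A third device would simply add another " /dev/... 0" pair at the end.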

Finally, for a mirrored volume, i.e. raid level 1, the table would look like
this (note all values are in 512-byte sectors)::

    # Ofs  Size     Raid    Log   Number  Region  Should  Number   Source     Start   Target     Start
    # in   of the   type    type  of log  size    sync?   of       Device     in      Device     in
    # vol  volume                 params                  mirrors             Device             Device
    0      2056320  mirror  core  2       16      nosync  2        /dev/hda1  0       /dev/hdb1  0

If you are mirroring to multiple devices you can specify further targets at the
end of the line.

@@ -353,17 +366,17 @@ to the "Target Device" or if you specified multiple target devices to all of
them.

Once you have your table, save it in a file somewhere (e.g. /etc/ntfsvolume1),
and hand it over to dmsetup to work with, like so::

    $ dmsetup create myvolume1 /etc/ntfsvolume1

You can obviously replace "myvolume1" with whatever name you like.

If it all worked, you will now have the device /dev/device-mapper/myvolume1
which you can then just use as an argument to the mount command as usual to
mount the ntfs volume. For example::

    $ mount -t ntfs -o ro /dev/device-mapper/myvolume1 /mnt/myvol1

(You need to create the directory /mnt/myvol1 first and of course you can use
anything you like instead of /mnt/myvol1 as long as it is an existing

@@ -395,9 +408,9 @@ Windows by default uses a stripe chunk size of 64k, so you probably want the
"chunk-size 64k" option for each raid-disk, too.

For example, if you have a stripe set consisting of two partitions /dev/hda5
and /dev/hdb1 your /etc/raidtab would look like this::

    raiddev /dev/md0
            raid-level      0
            nr-raid-disks   2
            nr-spare-disks  0

@@ -427,7 +440,9 @@ Once the raidtab is setup, run for example raid0run -a to start all devices or
raid0run /dev/md0 to start a particular md device, in this case /dev/md0.

Then just use the mount command as usual to mount the ntfs volume using for
example::

    mount -t ntfs -o ro /dev/md0 /mnt/myntfsvolume

It is advisable to do the mount read-only to see if the md volume has been
set up correctly to avoid the possibility of causing damage to the data on the

.. SPDX-License-Identifier: GPL-2.0

=====================================
OCFS2 file system - online file check
=====================================

This document describes the OCFS2 online file check feature.

@@ -40,7 +43,7 @@ When there are errors in the OCFS2 filesystem, they are usually accompanied
by the inode number which caused the error. This inode number would be the
input to check/fix the file.

There is a sysfs directory for each mounted OCFS2 file system::

    /sys/fs/ocfs2/<devname>/filecheck

@@ -50,34 +53,36 @@ communicate with kernel space, tell which file(inode number) will be checked or
fixed. Currently, three operations are supported: checking an inode, fixing an
inode, and setting the size of the result record history.

1. If you want to know exactly what error occurred on <inode> before fixing,
   do::

      # echo "<inode>" > /sys/fs/ocfs2/<devname>/filecheck/check
      # cat /sys/fs/ocfs2/<devname>/filecheck/check

   The output is like this::

      INO         DONE       ERROR
      39502       1          GENERATION

   <INO> lists the inode numbers.
   <DONE> indicates whether the operation has been finished.
   <ERROR> says what kind of error was found. For the detailed error numbers,
   please refer to the file linux/fs/ocfs2/filecheck.h.

2. If you decide to fix this inode, do::

      # echo "<inode>" > /sys/fs/ocfs2/<devname>/filecheck/fix
      # cat /sys/fs/ocfs2/<devname>/filecheck/fix

   The output is like this::

      INO         DONE       ERROR
      39502       1          SUCCESS

   This time, the <ERROR> column indicates whether this fix is successful or
   not.

3. The record cache is used to store the history of check/fix results. Its
   default size is 10, and it can be adjusted within the range of 10 to 100.
   You can adjust the size like this::

      # echo "<size>" > /sys/fs/ocfs2/<devname>/filecheck/set
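
A user-space tool driving this interface mostly needs to parse the result rows. The following is a minimal C sketch that assumes the whitespace-separated three-column layout of the sample output above. The struct and function names are hypothetical, opening and reading the sysfs files is left out, and the 10 to 100 range check just restates item 3.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One result row from the filecheck "check" or "fix" file. */
struct filecheck_result {
	unsigned long ino;	/* <INO>: inode number */
	int done;		/* <DONE>: whether the operation finished */
	char error[32];		/* <ERROR>: e.g. "GENERATION" or "SUCCESS" */
};

/*
 * Parse a data row such as "39502 1 GENERATION".  Returns 0 on success and
 * -1 for the "INO DONE ERROR" header line or any malformed line.
 */
static int parse_filecheck_row(const char *line, struct filecheck_result *r)
{
	assert(line != NULL && r != NULL);
	memset(r, 0, sizeof(*r));
	if (sscanf(line, "%lu %d %31s", &r->ino, &r->done, r->error) != 3)
		return -1;
	return 0;
}

/* The record cache size written to "set" must be between 10 and 100. */
static int filecheck_size_valid(unsigned int size)
{
	return size >= 10 && size <= 100;
}
```

The header line fails to parse because "INO" is not a number, so a caller can feed every line of the file through the same function and skip the rows that return -1.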

.. SPDX-License-Identifier: GPL-2.0

================
OCFS2 filesystem
================

OCFS2 is a general purpose extent based shared disk cluster file
system with many similarities to ext3. It supports 64 bit inode
numbers, and has automatically extending metadata groups which may

@@ -14,22 +18,26 @@ OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/
All code copyright 2005 Oracle except when otherwise noted.

Credits
=======

Lots of code taken from ext3 and other projects.

Authors in alphabetical order:

- Joel Becker   <joel.becker@oracle.com>
- Zach Brown    <zach.brown@oracle.com>
- Mark Fasheh   <mfasheh@suse.com>
- Kurt Hackel   <kurt.hackel@oracle.com>
- Tao Ma        <tao.ma@oracle.com>
- Sunil Mushran <sunil.mushran@oracle.com>
- Manish Singh  <manish.singh@oracle.com>
- Tiger Yang    <tiger.yang@oracle.com>

Caveats
=======
Features which OCFS2 does not support yet:

  - Directory change notification (F_NOTIFY)
  - Distributed Caching (F_SETLEASE/F_GETLEASE/break_lease)

@@ -37,8 +45,10 @@ Mount options
=============

OCFS2 supports the following mount options:

(*) == default

======================= ========================================================
barrier=1               This enables/disables barriers. barrier=0 disables it,
                        barrier=1 enables it.
errors=remount-ro(*)    Remount the filesystem read-only on an error.

@@ -104,3 +114,4 @@ journal_async_commit Commit block can be written to disk without waiting
                        for descriptor blocks. If enabled older kernels cannot
                        mount the device. This will enable 'journal_checksum'
                        internally.
======================= ========================================================

.. SPDX-License-Identifier: GPL-2.0

================================
Optimized MPEG Filesystem (OMFS)
================================

Overview
========

@@ -29,11 +33,13 @@ Options
OMFS supports the following mount-time options:

============   ========================================
uid=n          make all files owned by specified user
gid=n          make all files owned by specified group
umask=xxx      set permission umask to xxx
fmask=xxx      set umask to xxx for files
dmask=xxx      set umask to xxx for directories
============   ========================================

Disk format
===========

@@ -46,9 +52,9 @@ have a smaller size than a data block, but since they are both addressed by the
same 64-bit block number, any remaining space in the smaller sysblock is
unused.

Sysblock header information::

    struct omfs_header {
            __be64 h_self;                  /* FS block where this is located */
            __be32 h_body_size;             /* size of useful data after header */
            __be16 h_crc;                   /* crc-ccitt of body_size bytes */

@@ -58,11 +64,11 @@ struct omfs_header {
            u8 h_magic;                     /* OMFS_IMAGIC */
            u8 h_check_xor;                 /* XOR of header bytes before this */
            __be32 h_fill2;
    };
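
h_check_xor is described as the XOR of the header bytes that come before it. The following is a minimal C sketch of computing and checking it over a raw header buffer. The caller passes the byte offset of h_check_xor, because some header fields are elided in the listing above. Both function names are hypothetical, and the separate h_crc check is not covered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * XOR together the first n bytes of a raw (on-disk) omfs_header, i.e. the
 * bytes that precede h_check_xor.
 */
static uint8_t omfs_header_xor(const uint8_t *hdr, size_t n)
{
	uint8_t x = 0;

	assert(hdr != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		x ^= hdr[i];
	return x;
}

/* Nonzero if the stored h_check_xor byte, at offset n, matches. */
static int omfs_header_xor_ok(const uint8_t *hdr, size_t n)
{
	return omfs_header_xor(hdr, n) == hdr[n];
}
```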

Files and directories are both represented by omfs_inode::

    struct omfs_inode {
            struct omfs_header i_head;      /* header */
            __be64 i_parent;                /* parent containing this inode */
            __be64 i_sibling;               /* next inode in hash bucket */

@@ -73,7 +79,7 @@ struct omfs_inode {
            char i_fill3[64];
            char i_name[OMFS_NAMELEN];      /* filename */
            __be64 i_size;                  /* size of file, in bytes */
    };

Directories in OMFS are implemented as a large hash table. Filenames are
hashed then prepended into the bucket list beginning at OMFS_DIR_START.

@@ -82,19 +88,19 @@ until a match is found on i_name. Empty buckets are represented by block
pointers with all-1s (~0).
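
A directory lookup walks the singly linked chain of one bucket. The filename hash is not specified here, so this C sketch starts from a bucket head that has already been read from the table at OMFS_DIR_START. An in-memory array indexed by block number stands in for reading blocks from disk. The struct, macro and function names are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define OMFS_EMPTY_BUCKET (~(uint64_t)0)	/* empty bucket: all-1s (~0) */

/*
 * Simplified stand-in for an omfs_inode: only the fields the lookup
 * needs, already converted from big-endian.
 */
struct dir_entry {
	uint64_t sibling;	/* i_sibling: next inode in this bucket */
	const char *name;	/* i_name */
};

/*
 * Follow one bucket chain from `head` until an entry's name matches.
 * Returns NULL if the name is not in this bucket.
 */
static const struct dir_entry *dir_lookup(const struct dir_entry *table,
					  size_t nblocks, uint64_t head,
					  const char *name)
{
	uint64_t blk = head;

	while (blk != OMFS_EMPTY_BUCKET) {
		assert(blk < nblocks);
		if (strcmp(table[blk].name, name) == 0)
			return &table[blk];
		blk = table[blk].sibling;
	}
	return NULL;
}
```

Because new names are prepended, the most recently added entry in a bucket is found first.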

A file is an omfs_inode structure followed by an extent table beginning at
OMFS_EXTENT_START::

    struct omfs_extent_entry {
            __be64 e_cluster;               /* start location of a set of blocks */
            __be64 e_blocks;                /* number of blocks after e_cluster */
    };

    struct omfs_extent {
            __be64 e_next;                  /* next extent table location */
            __be32 e_extent_count;          /* total # extents in this table */
            __be32 e_fill;
            struct omfs_extent_entry e_entry;       /* start of extent entries */
    };
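
Together, these structures describe a file as runs of blocks. The following is a minimal C sketch of mapping a file-relative block to a disk block using one extent table. It assumes that e_blocks counts the blocks of the run starting at e_cluster and that the values have already been converted from big-endian. Following e_next and handling the terminator entry are left out, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXTENT_UNMAPPED (~(uint64_t)0)	/* hypothetical "not found" value */

/* Host-endian stand-in for struct omfs_extent_entry. */
struct extent_run {
	uint64_t cluster;	/* e_cluster: first block of the run */
	uint64_t blocks;	/* e_blocks: length of the run in blocks */
};

/* Translate a block index within the file into a disk block number. */
static uint64_t extent_map(const struct extent_run *e, uint32_t count,
			   uint64_t file_block)
{
	assert(e != NULL || count == 0);
	for (uint32_t i = 0; i < count; i++) {
		if (file_block < e[i].blocks)
			return e[i].cluster + file_block;
		file_block -= e[i].blocks;	/* skip past this run */
	}
	return EXTENT_UNMAPPED;
}
```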

Each extent holds the block offset followed by the number of blocks allocated
to the extent. The final extent in each table is a terminator with e_cluster

.. SPDX-License-Identifier: GPL-2.0

========
ORANGEFS
========

@@ -21,25 +24,25 @@ Orangefs features include:
  * Stateless

Mailing List Archives
=====================

http://lists.orangefs.org/pipermail/devel_lists.orangefs.org/

Mailing List Submissions
========================

devel@lists.orangefs.org

Documentation
=============

http://www.orangefs.org/documentation/

Userspace Filesystem Source
===========================

http://www.orangefs.org/download

@@ -48,16 +51,16 @@ Orangefs versions prior to 2.9.3 would not be compatible with the
upstream version of the kernel client.

Running ORANGEFS On a Single Server
===================================

OrangeFS is usually run in large installations with multiple servers and
clients, but a complete filesystem can be run on a single machine for
development and testing.

On Fedora, install orangefs and orangefs-server::

    dnf -y install orangefs orangefs-server

There is an example server configuration file in
/etc/orangefs/orangefs.conf. Change localhost to your hostname if

@@ -70,29 +73,29 @@ single line. Uncomment it and change the hostname if necessary. This
controls clients which use libpvfs2. This does not control the
pvfs2-client-core.

Create the filesystem::

    pvfs2-server -f /etc/orangefs/orangefs.conf

Start the server::

    systemctl start orangefs-server

Test the server::

    pvfs2-ping -m /pvfsmnt

Start the client. The module must be compiled in or loaded before this
point::

    systemctl start orangefs-client

Mount the filesystem::

    mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt

Building ORANGEFS on a Single Server
====================================

Where OrangeFS cannot be installed from distribution packages, it may be

@@ -102,49 +105,51 @@ You can omit --prefix if you don't care that things are sprinkled around
in /usr/local. As of version 2.9.6, OrangeFS uses Berkeley DB by
default; we will probably be changing the default to LMDB soon.

::

    ./configure --prefix=/opt/ofs --with-db-backend=lmdb
    make
    make install

Create an orangefs config file::

    /opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf

Create an /etc/pvfs2tab file::

    echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \
        /etc/pvfs2tab

Create the mount point you specified in the tab file if needed::

    mkdir /pvfsmnt

Bootstrap the server::

    /opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.conf

Start the server::

    /opt/ofs/sbin/pvfs2-server /etc/pvfs2.conf

Now the server should be running. pvfs2-ls is a simple
test to verify that the server is running::

    /opt/ofs/bin/pvfs2-ls /pvfsmnt

If stuff seems to be working, load the kernel module and
turn on the client core::

    /opt/ofs/sbin/pvfs2-client -p /opt/ofs/sbin/pvfs2-client-core

Mount your filesystem::

    mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt

Running xfstests
================

It is useful to use a scratch filesystem with xfstests. This can be

@@ -159,21 +164,23 @@ Then there are two FileSystem sections: orangefs and scratch.
This change should be made before creating the filesystem.

::

    pvfs2-server -f /etc/orangefs/orangefs.conf

To run xfstests, create /etc/xfsqa.config::

    TEST_DIR=/orangefs
    TEST_DEV=tcp://localhost:3334/orangefs
    SCRATCH_MNT=/scratch
    SCRATCH_DEV=tcp://localhost:3334/scratch

Then xfstests can be run::

    ./check -pvfs2

Options
=======

The following mount options are accepted:

@@ -193,32 +200,32 @@ The following mount options are accepted:
    Distributed locking is being worked on for the future.

Debugging
=========

If you want the debug (GOSSIP) statements in a particular
source file (inode.c for example) go to syslog::

    echo inode > /sys/kernel/debug/orangefs/kernel-debug

No debugging (the default)::

    echo none > /sys/kernel/debug/orangefs/kernel-debug

Debugging from several source files::

    echo inode,dir > /sys/kernel/debug/orangefs/kernel-debug

All debugging::

    echo all > /sys/kernel/debug/orangefs/kernel-debug

Get a list of all debugging keywords::

    cat /sys/kernel/debug/orangefs/debug-help

Protocol between Kernel Module and Userspace
============================================

Orangefs is a user space filesystem and an associated kernel module.

@@ -234,7 +241,8 @@ The kernel module implements a pseudo device that userspace
can read from and write to. Userspace can also manipulate the
kernel module through the pseudo device with ioctl.

The Bufmap
----------

At startup userspace allocates two page-size-aligned (posix_memalign)
mlocked memory buffers, one is used for IO and one is used for readdir

@@ -250,7 +258,8 @@ copied from user space to kernel space with copy_from_user and is used
to initialize the kernel module's "bufmap" (struct orangefs_bufmap), which
then contains:

  * refcnt
    - a reference counter
  * desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE (4194304) - the IO buffer's
    partition size, which represents the filesystem's block size and
    is used for s_blocksize in super blocks.

@@ -259,15 +268,17 @@ then contains:
  * desc_shift - log2(desc_size), used for s_blocksize_bits in super blocks.
  * total_size - the total size of the IO buffer.
  * page_count - the number of 4096 byte pages in the IO buffer.
  * page_array - a pointer to ``page_count * (sizeof(struct page*))`` bytes
    of kcalloced memory. This memory is used as an array of pointers
    to each of the pages in the IO buffer through a call to get_user_pages.
  * desc_array - a pointer to ``desc_count * (sizeof(struct orangefs_bufmap_desc))``
    bytes of kcalloced memory. This memory is further initialized:

      user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc
      structure. user_desc->ptr points to the IO buffer.

      ::

        pages_per_desc = bufmap->desc_size / PAGE_SIZE
        offset = 0

@@ -293,7 +304,8 @@ then contains:
* readdir_index_lock - a spinlock to protect readdir_index_array during * readdir_index_lock - a spinlock to protect readdir_index_array during
update. update.
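
The relationship between desc_size, desc_shift and the pages that back each partition can be checked with a small C sketch. The helper names ``size_to_shift()`` and ``pages_per_desc()`` are hypothetical and are not kernel functions. The 4194304-byte descriptor size and the 4096-byte page come from the list above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: log2 of a power-of-two size. This is how
 * desc_shift (used for s_blocksize_bits) relates to desc_size.
 */
static unsigned int size_to_shift(size_t size)
{
	unsigned int shift = 0;

	assert(size != 0 && (size & (size - 1)) == 0); /* power of two */
	while ((size >> shift) != 1)
		shift++;
	return shift;
}

/* Hypothetical helper: number of pages backing one IO buffer partition. */
static size_t pages_per_desc(size_t desc_size, size_t page_size)
{
	assert(page_size != 0 && desc_size % page_size == 0);
	return desc_size / page_size;
}
```

With the default sizes, ``size_to_shift(4194304)`` is 22 and ``pages_per_desc(4194304, 4096)`` is 1024.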

Operations
----------

The kernel module builds an "op" (struct orangefs_kernel_op_s) when it
needs to communicate with userspace. Part of the op contains the "upcall"

@@ -308,13 +320,19 @@ in flight at any given time.

Ops are stateful:

  * unknown - op was just initialized
  * waiting - op is on request_list (upward bound)
  * inprogr - op is in progress (waiting for downcall)
  * serviced - op has matching downcall; ok
  * purged - op has to start a timer since client-core
    exited uncleanly before servicing op
  * given up - submitter has given up waiting for it
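
The op states can be modelled as a C enum for illustration. The identifiers below are hypothetical and need not match the kernel module's own names or values. The predicate groups the two states that, according to the list above, do not yet have a downcall.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names for the op states listed above. */
enum op_state {
	OP_UNKNOWN,	/* just initialized */
	OP_WAITING,	/* on request_list (upward bound) */
	OP_INPROGR,	/* in progress, waiting for downcall */
	OP_SERVICED,	/* has matching downcall */
	OP_PURGED,	/* client-core exited uncleanly before servicing */
	OP_GIVEN_UP,	/* submitter gave up waiting */
};

static const char *const op_state_names[] = {
	[OP_UNKNOWN] = "unknown",   [OP_WAITING] = "waiting",
	[OP_INPROGR] = "inprogr",   [OP_SERVICED] = "serviced",
	[OP_PURGED] = "purged",     [OP_GIVEN_UP] = "given up",
};

/* Map a state name from the list above to the enum; -1 if unknown. */
static int op_state_from_name(const char *name)
{
	for (size_t i = 0; i < sizeof(op_state_names) / sizeof(op_state_names[0]); i++)
		if (strcmp(op_state_names[i], name) == 0)
			return (int)i;
	return -1;
}

/* True while an op still lacks a downcall: queued or in progress. */
static int op_awaiting_downcall(enum op_state s)
{
	return s == OP_WAITING || s == OP_INPROGR;
}
```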

When some arbitrary userspace program needs to perform a
filesystem operation on Orangefs (readdir, I/O, create, whatever)

@@ -389,10 +407,15 @@ union of structs, each of which is associated with a particular
response type.

The several members outside of the union are:

``int32_t type``
    type of operation.
``int32_t status``
    return code for the operation.
``int64_t trailer_size``
    0 unless readdir operation.
``char *trailer_buf``
    initialized to NULL, used during readdir operations.

The appropriate member inside the union is filled out for any
particular response.
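
The layout described above has four common members followed by a union of per-response structs. The C sketch below is hypothetical and is not the real downcall definition. The struct name and the two union members are placeholders.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical downcall layout; the union members are placeholders. */
struct example_downcall {
	int32_t type;		/* type of operation */
	int32_t status;		/* return code for the operation */
	int64_t trailer_size;	/* 0 unless readdir operation */
	char *trailer_buf;	/* NULL except during readdir operations */
	union {
		struct { int64_t size; } getattr;	/* placeholder */
		struct { int32_t count; } readdir;	/* placeholder */
	} resp;
};

/* Start a response of the given type with no trailer. */
static void example_downcall_init(struct example_downcall *d, int32_t type)
{
	memset(d, 0, sizeof(*d));	/* also clears the union */
	d->type = type;
	d->trailer_buf = NULL;		/* set explicitly, not via memset */
}

/* A trailer is only present for readdir-style responses. */
static int example_downcall_has_trailer(const struct example_downcall *d)
{
	return d->trailer_size > 0 && d->trailer_buf != NULL;
}
```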

@@ -449,18 +472,20 @@ Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests
made by the kernel side.

A buffer_list is passed to the function PINT_dev_write_list, which
performs the writev. The buffer_list contains:

  - a pointer to the prepared response to the request from the
    kernel (struct pvfs2_downcall_t);
  - in the case of a readdir request, also a pointer to a buffer
    containing descriptors for the objects in the target directory.
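
The gather write can be sketched with POSIX ``writev()``. The function ``send_response()`` is hypothetical and simplified. It sends only a protocol version, the response and an optional trailer, so it does not reproduce the exact io_array layout listed below. ``fd`` stands in for an open descriptor on /dev/pvfs2-req.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical sketch: write a protocol version, a fixed-size response
 * and (for readdir-like replies) a trailer buffer in one writev() call.
 */
static ssize_t send_response(int fd, int32_t proto_ver,
			     const void *resp, size_t resp_len,
			     const void *trailer, size_t trailer_len)
{
	struct iovec iov[3];
	int n = 2;

	iov[0].iov_base = &proto_ver;
	iov[0].iov_len = sizeof(proto_ver);
	iov[1].iov_base = (void *)resp;
	iov[1].iov_len = resp_len;
	if (trailer != NULL && trailer_len > 0) {
		iov[2].iov_base = (void *)trailer;
		iov[2].iov_len = trailer_len;
		n = 3;
	}
	return writev(fd, iov, n);
}
```

A short write is possible in principle, so a caller should compare the return value with the total length of all buffers.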

PINT_dev_write_list has a local iovec array: ``struct iovec io_array[10];``

The first four elements of io_array are initialized like this for all
responses::

  io_array[0].iov_base = address of local variable "proto_ver" (int32_t)
  io_array[0].iov_len = sizeof(int32_t)

@@ -475,7 +500,7 @@ responses:
                         of global variable vfs_request (vfs_request_t)
  io_array[3].iov_len = sizeof(pvfs2_downcall_t)

Readdir responses initialize the fifth element of io_array like this::

  io_array[4].iov_base = contents of member trailer_buf (char *)
                         from out_downcall member of global variable

@@ -517,13 +542,13 @@ from a dentry is cheap, obtaining it from userspace is relatively expensive,
hence the motivation to use the dentry when possible.

The timeout values d_time and getattr_time are jiffy based, and the
code is designed to avoid the jiffy-wrap problem::

    "In general, if the clock may have wrapped around more than once, there
    is no way to tell how much time has elapsed. However, if the times t1
    and t2 are known to be fairly close, we can reliably compute the
    difference in a way that takes into account the possibility that the
    clock may have wrapped between times."

    from course notes by instructor Andy Wang
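
The kernel's ``time_after()`` macro performs the wrap-safe comparison that the quote describes. It subtracts the two tick values as unsigned numbers and then tests the sign of the difference. The sketch uses the hypothetical name ``ticks_after()``. It is valid only while the two times are less than half the counter range apart.

```c
#include <assert.h>
#include <limits.h>

/*
 * Nonzero if tick value a is later than b, even across a counter wrap.
 * Same expression as the kernel's time_after(a, b): ((long)((b) - (a)) < 0).
 * The unsigned subtraction wraps modulo ULONG_MAX + 1; converting the
 * result to long relies on two's-complement behaviour, as the kernel does.
 */
static int ticks_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}
```

A naive ``a > b`` test would report a time just after the wrap as earlier than a time just before it.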
.. SPDX-License-Identifier: GPL-2.0

====================
The /proc Filesystem
====================

=====================  =======================================  ================
/proc/sys              Terrehon Bowden <terrehon@pacbell.net>,  October 7 1999
                       Bodo Bauer <bb@ricochet.net>
2.4.x update           Jorge Nerin <comandante@zaralinux.com>   November 14 2000
move /proc/sys         Shen Feng <shen@cn.fujitsu.com>          April 1 2009
fixes/update part 1.1  Stefani Seibold <stefani@seibold.net>    June 9 2009
=====================  =======================================  ================

.. Table of Contents

  0     Preface
  0.1   Introduction/Credits

@@ -50,9 +51,8 @@ Table of Contents
  4     Configuring procfs
  4.1   Mount options

Preface
=======

0.1 Introduction/Credits
------------------------

@@ -95,20 +95,18 @@ We don't guarantee the correctness of this document, and if you come to us
complaining about how you screwed up your system because of incorrect
documentation, we won't feel responsible...

Chapter 1: Collecting System Information
========================================

In This Chapter
---------------

* Investigating the properties of the pseudo file system /proc and its
  ability to provide information on the running Linux system
* Examining /proc's structure
* Uncovering various information about the kernel and the processes running
  on the system

The proc file system acts as an interface to internal data structures in the
kernel. It can be used to obtain information about the system and to change

@@ -134,9 +132,11 @@ never act on any new process that the kernel may, through chance, have
also assigned the process ID <pid>. Instead, operations on these FDs
usually fail with ESRCH.

.. table:: Table 1-1: Process specific entries in /proc

   =============  ===============================================================
   File           Content
   =============  ===============================================================
   clear_refs     Clears page referenced bits shown in smaps output
   cmdline        Command line arguments
   cpu            Current and last cpu on which it was executed (2.4)(smp)

@@ -160,10 +160,10 @@ Table 1-1: Process specific entries in /proc
                  can be derived from smaps, but is faster and more convenient
   numa_maps      An extension based on maps, showing the memory locality and
                  binding policy as well as mem usage (in pages) of each mapping.
   =============  ===============================================================

For example, to get the status information of a process, all you have to do is
read the file /proc/PID/status::

  >cat /proc/self/status
  Name:   cat

@@ -222,14 +222,17 @@ contains details information about the process itself. Its fields are
explained in Table 1-4.

(for SMP CONFIG users)

To make accounting scalable, RSS-related information is handled
asynchronously, so the value may not be very precise. For a precise snapshot
of a moment, read the /proc/<pid>/smaps file and scan the page table. It's
slow but very precise.

.. table:: Table 1-2: Contents of the status files (as of 4.19)

   ==========================  ===================================================
   Field                       Content
   ==========================  ===================================================
   Name                        filename of the executable
   Umask                       file mode creation mask
   State                       state (R is running, S is sleeping, D is sleeping

@@ -254,7 +257,8 @@ Table 1-2: Contents of the status files (as of 4.19)
   VmPin                       pinned memory size
   VmHWM                       peak resident set size ("high water mark")
   VmRSS                       size of memory portions. It is the sum of the
                               following three parts
                               (VmRSS = RssAnon + RssFile + RssShmem)
   RssAnon                     size of resident anonymous memory
   RssFile                     size of resident file mappings
   RssShmem                    size of resident shmem memory (includes SysV shm,

@@ -292,11 +296,14 @@ Table 1-2: Contents of the status files (as of 4.19)
   Mems_allowed_list           Same as previous, but in "list format"
   voluntary_ctxt_switches     number of voluntary context switches
   nonvoluntary_ctxt_switches  number of non-voluntary context switches
   ==========================  ===================================================
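
The hypothetical helper below shows one way to read this file from C. It extracts a ``kB`` value from status text by key. The test uses a sample string; in practice the text would come from /proc/self/status. According to the table above, VmRSS should equal RssAnon + RssFile + RssShmem.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return the number from a "Key:   <n> kB" line in
 * the text of /proc/PID/status, or -1 if the key is not present.
 */
static long status_field_kb(const char *text, const char *key)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line != NULL && *line != '\0') {
		/* Require the colon so "Vm" does not match "VmRSS". */
		if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
			long val;

			if (sscanf(line + klen + 1, "%ld", &val) == 1)
				return val;
			return -1;
		}
		line = strchr(line, '\n');
		if (line != NULL)
			line++;
	}
	return -1;
}
```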

.. table:: Table 1-3: Contents of the statm files (as of 2.6.8-rc3)

   ========  ===============================  ==============================
   Field     Content
   ========  ===============================  ==============================
   size      total program size (pages)       (same as VmSize in status)
   resident  size of memory portions (pages)  (same as VmRSS in status)
   shared    number of pages that are shared  (i.e. backed by a file, same

@@ -307,12 +314,14 @@ Table 1-3: Contents of the statm files (as of 2.6.8-rc3)
   drs       number of pages of data/stack    (including libs; broken,
                                              includes library text)
   dt        number of dirty pages            (always 0 on 2.6)
   ========  ===============================  ==============================
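
A minimal sketch for parsing statm follows. The proc(5) man page describes the file as seven space-separated page counts, named there size, resident, shared, text, lib, data and dt. The struct and function names are hypothetical. Multiplying by ``sysconf(_SC_PAGESIZE)`` converts pages to bytes.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Field order as documented for /proc/PID/statm in proc(5). */
struct statm_info {
	unsigned long size, resident, shared, text, lib, data, dt;
};

/* Parse the seven page counts; returns 0 on success, -1 otherwise. */
static int parse_statm(const char *text, struct statm_info *s)
{
	int n = sscanf(text, "%lu %lu %lu %lu %lu %lu %lu",
		       &s->size, &s->resident, &s->shared, &s->text,
		       &s->lib, &s->data, &s->dt);

	return n == 7 ? 0 : -1;
}

/* Resident set size in bytes, using the runtime page size. */
static unsigned long statm_resident_bytes(const struct statm_info *s)
{
	return s->resident * (unsigned long)sysconf(_SC_PAGESIZE);
}
```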

.. table:: Table 1-4: Contents of the stat files (as of 2.6.30-rc7)

   =============  ===============================================================
   Field          Content
   =============  ===============================================================
   pid            process id
   tcomm          filename of the executable
   state          state (R is running, S is sleeping, D is sleeping in an

@@ -348,7 +357,8 @@ Table 1-4: Contents of the stat files (as of 2.6.30-rc7)
   blocked        bitmap of blocked signals
   sigign         bitmap of ignored signals
   sigcatch       bitmap of caught signals
   0              (placeholder, used to be the wchan address,
                  use /proc/PID/wchan instead)
   0              (placeholder)
   0              (placeholder)
   exit_signal    signal to send to parent thread on exit

@@ -365,39 +375,40 @@ Table 1-4: Contents of the stat files (as of 2.6.30-rc7)
   arg_end        address below which program command line is placed
   env_start      address above which program environment is placed
   env_end        address below which program environment is placed
   exit_code      the thread's exit_code in the form reported by the waitpid
                  system call
   =============  ===============================================================
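
When parsing stat, note that tcomm is printed in parentheses and can itself contain spaces or ')'. A common approach is to find the last ')' in the line and parse the remaining fields after it. The helper ``parse_stat_head()`` below is a hypothetical sketch of that approach.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: extract pid, tcomm and state from a line of
 * /proc/PID/stat. tcomm is wrapped in parentheses and may contain
 * spaces or ')', so the last ')' in the line marks its end.
 * Returns 0 on success, -1 on a malformed line.
 */
static int parse_stat_head(const char *line, int *pid, char *comm,
			   size_t comm_len, char *state)
{
	const char *open = strchr(line, '(');
	const char *close = strrchr(line, ')');
	size_t n;

	if (open == NULL || close == NULL || close < open || comm_len == 0)
		return -1;
	if (sscanf(line, "%d", pid) != 1)
		return -1;
	n = (size_t)(close - open - 1);
	if (n >= comm_len)
		n = comm_len - 1;	/* truncate rather than overflow */
	memcpy(comm, open + 1, n);
	comm[n] = '\0';
	/* Fields after tcomm are space separated; state comes first. */
	return sscanf(close + 1, " %c", state) == 1 ? 0 : -1;
}
```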

The /proc/PID/maps file contains the currently mapped memory regions and
their access permissions.

The format is::

    address           perms offset  dev   inode      pathname

    08048000-08049000 r-xp 00000000 03:00 8312       /opt/test
    08049000-0804a000 rw-p 00001000 03:00 8312       /opt/test
    0804a000-0806b000 rw-p 00000000 00:00 0          [heap]
    a7cb1000-a7cb2000 ---p 00000000 00:00 0
    a7cb2000-a7eb2000 rw-p 00000000 00:00 0
    a7eb2000-a7eb3000 ---p 00000000 00:00 0
    a7eb3000-a7ed5000 rw-p 00000000 00:00 0
    a7ed5000-a8008000 r-xp 00000000 03:00 4222       /lib/libc.so.6
    a8008000-a800a000 r--p 00133000 03:00 4222       /lib/libc.so.6
    a800a000-a800b000 rw-p 00135000 03:00 4222       /lib/libc.so.6
    a800b000-a800e000 rw-p 00000000 00:00 0
    a800e000-a8022000 r-xp 00000000 03:00 14462      /lib/libpthread.so.0
    a8022000-a8023000 r--p 00013000 03:00 14462      /lib/libpthread.so.0
    a8023000-a8024000 rw-p 00014000 03:00 14462      /lib/libpthread.so.0
    a8024000-a8027000 rw-p 00000000 00:00 0
    a8027000-a8043000 r-xp 00000000 03:00 8317       /lib/ld-linux.so.2
    a8043000-a8044000 r--p 0001b000 03:00 8317       /lib/ld-linux.so.2
    a8044000-a8045000 rw-p 0001c000 03:00 8317       /lib/ld-linux.so.2
    aff35000-aff4a000 rw-p 00000000 00:00 0          [stack]
    ffffe000-fffff000 r-xp 00000000 00:00 0          [vdso]

where "address" is the address range that the mapping occupies in the
process's address space, and "perms" is a set of permissions::

 r = read
 w = write

@@ -411,42 +422,44 @@ with the memory region, as the case would be with BSS (uninitialized data).
The "pathname" shows the name of the file associated with this mapping. If
the mapping is not associated with a file:

    =======  ====================================
    [heap]   the heap of the program
    [stack]  the stack of the main process
    [vdso]   the "virtual dynamic shared object",
             the kernel system call handler
    =======  ====================================

or if empty, the mapping is anonymous.
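
``sscanf()`` can split the fields of a maps line. The sketch below is hypothetical, and its struct and function names are illustrative only. As described above, it leaves ``pathname`` empty for anonymous mappings.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct maps_entry {
	unsigned long start, end, offset, inode;
	unsigned int dev_major, dev_minor;
	char perms[5];		/* e.g. "r-xp" */
	char pathname[256];	/* empty for anonymous mappings */
};

/* Parse one /proc/PID/maps line; returns 0 on success, -1 otherwise. */
static int parse_maps_line(const char *line, struct maps_entry *e)
{
	int consumed = 0;

	memset(e, 0, sizeof(*e));
	if (sscanf(line, "%lx-%lx %4s %lx %x:%x %lu %n",
		   &e->start, &e->end, e->perms, &e->offset,
		   &e->dev_major, &e->dev_minor, &e->inode, &consumed) < 7)
		return -1;
	/* Whatever follows the padding after the inode is the pathname. */
	if (consumed > 0)
		sscanf(line + consumed, "%255[^\n]", e->pathname);
	return 0;
}
```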

The /proc/PID/smaps is an extension based on maps, showing the memory
consumption for each of the process's mappings. For each mapping (aka Virtual
Memory Area, or VMA) there is a series of lines such as the following::

    08048000-080bc000 r-xp 00000000 03:02 13130      /bin/bash

    Size:               1084 kB
    KernelPageSize:        4 kB
    MMUPageSize:           4 kB
    Rss:                 892 kB
    Pss:                 374 kB
    Shared_Clean:        892 kB
    Shared_Dirty:          0 kB
    Private_Clean:         0 kB
    Private_Dirty:         0 kB
    Referenced:          892 kB
    Anonymous:             0 kB
    LazyFree:              0 kB
    AnonHugePages:         0 kB
    ShmemPmdMapped:        0 kB
    Shared_Hugetlb:        0 kB
    Private_Hugetlb:       0 kB
    Swap:                  0 kB
    SwapPss:               0 kB
    KernelPageSize:        4 kB
    MMUPageSize:           4 kB
    Locked:                0 kB
    THPeligible:           0
    VmFlags: rd ex mr mw me dw

The first of these lines shows the same information as is displayed for the
mapping in /proc/PID/maps. Following lines show the size of the mapping

@@ -461,26 +474,35 @@ The "proportional set size" (PSS) of a process is the count of pages it has
in memory, where each page is divided by the number of processes sharing it.
So if a process has 1000 pages all to itself, and 1000 shared with one other
process, its PSS will be 1500.
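
The PSS arithmetic can be written out directly. Each resident page contributes 1/N, where N is the number of processes sharing it. In the example above, 1000 private pages plus 1000 pages shared with one other process give 1000 + 1000/2 = 1500. The sketch below is a hypothetical floating-point version of that definition and is not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: a group of resident pages and how many processes map them. */
struct page_group {
	unsigned long pages;
	unsigned int sharers;	/* 1 means the pages are private */
};

/*
 * PSS in pages: every page contributes 1/sharers. A double is used
 * because shared pages contribute fractions of a page.
 */
static double pss_pages(const struct page_group *g, size_t n)
{
	double pss = 0.0;

	for (size_t i = 0; i < n; i++) {
		assert(g[i].sharers > 0);
		pss += (double)g[i].pages / g[i].sharers;
	}
	return pss;
}
```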

Note that even a page which is part of a MAP_SHARED mapping, but has only
a single pte mapped, i.e. is currently used by only one process, is accounted
as private and not as shared.

"Referenced" indicates the amount of memory currently marked as referenced or
accessed.

"Anonymous" shows the amount of memory that does not belong to any file. Even
a mapping associated with a file may contain anonymous pages: when a page of a
MAP_PRIVATE mapping is modified, the file page is replaced by a private
anonymous copy.

"LazyFree" shows the amount of memory which is marked by madvise(MADV_FREE).
The memory isn't freed immediately with madvise(); it's freed under memory
pressure if the memory is clean. Please note that the printed value might
be lower than the real value due to optimizations used in the current
implementation. If this is not desirable please file a bug report.

"AnonHugePages" shows the amount of memory backed by transparent hugepages.

"ShmemPmdMapped" shows the amount of shared (shmem/tmpfs) memory backed by
huge pages.

"Shared_Hugetlb" and "Private_Hugetlb" show the amounts of memory backed by
hugetlbfs pages, which are *not* counted in the "RSS" or "PSS" fields for
historical reasons. They are also not included in the
{Shared,Private}_{Clean,Dirty} fields.

"Swap" shows how much would-be-anonymous memory is also used, but out on swap.

For shmem mappings, "Swap" also includes the size of the mapped (and not
replaced by copy-on-write) part of the underlying shmem object out on swap.
"SwapPss" shows the proportional swap share of this mapping. Unlike "Swap", this

@@ -489,36 +511,39 @@ does not take into account swapped out page of underlying shmem objects.
"THPeligible" indicates whether the mapping is eligible for allocating THP
pages: 1 if true, 0 otherwise. It just shows the current status.

The "VmFlags" field deserves a separate description. It represents the kernel
flags associated with the particular virtual memory area, encoded as
two-letter codes. The codes are the following:

    ==  =======================================
    rd  readable
    wr  writeable
    ex  executable
    sh  shared
    mr  may read
    mw  may write
    me  may execute
    ms  may share
    gd  stack segment grows down
    pf  pure PFN range
    dw  disabled write to the mapped file
    lo  pages are locked in memory
    io  memory mapped I/O area
    sr  sequential read advice provided
    rr  random read advice provided
    dc  do not copy area on fork
    de  do not expand area on remapping
    ac  area is accountable
    nr  swap space is not reserved for the area
    ht  area uses huge tlb pages
    ar  architecture specific flag
    dd  do not include area into core dump
    sd  soft-dirty flag
    mm  mixed map area
    hg  huge page advice flag
    nh  no-huge page advice flag
    mg  mergeable advice flag
    ==  =======================================
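
The set of mnemonics can change between kernel releases, as noted below, so a consumer should look flags up by name and tolerate codes it does not know. The sketch below is hypothetical and covers only a subset of the table above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Subset of the VmFlags mnemonics from the table above. */
static const struct {
	const char *code;
	const char *meaning;
} vmflags[] = {
	{ "rd", "readable" },    { "wr", "writeable" },
	{ "ex", "executable" },  { "sh", "shared" },
	{ "mr", "may read" },    { "mw", "may write" },
	{ "me", "may execute" }, { "ms", "may share" },
	{ "dw", "disabled write to the mapped file" },
	{ "ht", "area uses huge tlb pages" },
};

/* Meaning of a two-letter code, or NULL if this table does not know it. */
static const char *vmflag_meaning(const char *code)
{
	for (size_t i = 0; i < sizeof(vmflags) / sizeof(vmflags[0]); i++)
		if (strcmp(vmflags[i].code, code) == 0)
			return vmflags[i].meaning;
	return NULL;
}

/* Nonzero if code appears in the value of a "VmFlags:" line. */
static int vmflags_has(const char *value, const char *code)
{
	char tok[3];
	int used;

	while (sscanf(value, " %2s%n", tok, &used) == 1) {
		if (strcmp(tok, code) == 0)
			return 1;
		value += used;
	}
	return 0;
}
```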

Note that there is no guarantee that every flag and associated mnemonic will
be present in all future kernel releases. Things get changed, the flags may

@@ -531,6 +556,7 @@ enabled.

Note: reading /proc/PID/maps or /proc/PID/smaps is inherently racy (consistent
output can be achieved only in a single read call).

This typically manifests when doing partial reads of these files while the
memory map is being modified. Despite the races, we do provide the following
guarantees:
@@ -544,9 +570,9 @@ The /proc/PID/smaps_rollup file includes the same fields as /proc/PID/smaps,
but their values are the sums of the corresponding values for all mappings of
the process. Additionally, it contains these fields:

- Pss_Anon
- Pss_File
- Pss_Shmem

They represent the proportional shares of anonymous, file, and shmem pages, as
described for smaps above. These fields are omitted in smaps since each
@@ -558,20 +584,25 @@ The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG
bits on both physical and virtual pages associated with a process, and the
soft-dirty bit on pte (see Documentation/admin-guide/mm/soft-dirty.rst
for details).

To clear the bits for all the pages associated with the process::

    > echo 1 > /proc/PID/clear_refs

To clear the bits for the anonymous pages associated with the process::

    > echo 2 > /proc/PID/clear_refs

To clear the bits for the file mapped pages associated with the process::

    > echo 3 > /proc/PID/clear_refs

To clear the soft-dirty bit::

    > echo 4 > /proc/PID/clear_refs

To reset the peak resident set size ("high water mark") to the process's
current value::

    > echo 5 > /proc/PID/clear_refs

Any other value written to /proc/PID/clear_refs will have no effect.
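
The same writes can be issued from a program. This is a minimal sketch built around a hypothetical helper, ``clear_refs()``, with an assumed ``proc_root`` parameter that exists only so the code can be tried against a scratch directory. Writing to a real /proc/PID/clear_refs usually requires owning the process or being root.

```python
# Minimal sketch: write one of the documented clear_refs values for a process.
# clear_refs() and proc_root are illustrative assumptions, not a kernel API;
# proc_root lets the function be exercised outside the real /proc.
import os

CLEAR_REFS = {
    "all": 1,         # referenced/accessed bits on all pages
    "anon": 2,        # ... on anonymous pages only
    "file": 3,        # ... on file mapped pages only
    "soft-dirty": 4,  # soft-dirty bits
    "reset-hwm": 5,   # reset peak RSS ("high water mark")
}


def clear_refs(pid: int, what: str, proc_root: str = "/proc") -> None:
    value = CLEAR_REFS[what]  # KeyError for anything not documented above
    path = os.path.join(proc_root, str(pid), "clear_refs")
    # Equivalent to: echo <value> > /proc/<pid>/clear_refs
    with open(path, "w") as f:
        f.write(f"{value}\n")
```

Rejecting unknown names in Python is a design choice: the kernel itself silently ignores other values, so a typo would otherwise go unnoticed.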
@@ -584,30 +615,33 @@ Documentation/admin-guide/mm/pagemap.rst.
The /proc/pid/numa_maps file is an extension based on maps, showing the memory
locality and binding policy, as well as the memory usage (in pages) of
each mapping. The output follows a general format in which the mapping details
are summarized as space-separated fields, one mapping per line::

    address           policy    mapping details

    00400000 default file=/usr/local/bin/app mapped=1 active=0 N3=1 kernelpagesize_kB=4
    00600000 default file=/usr/local/bin/app anon=1 dirty=1 N3=1 kernelpagesize_kB=4
    3206000000 default file=/lib64/ld-2.12.so mapped=26 mapmax=6 N0=24 N3=2 kernelpagesize_kB=4
    320621f000 default file=/lib64/ld-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4
    3206220000 default file=/lib64/ld-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4
    3206221000 default anon=1 dirty=1 N3=1 kernelpagesize_kB=4
    3206800000 default file=/lib64/libc-2.12.so mapped=59 mapmax=21 active=55 N0=41 N3=18 kernelpagesize_kB=4
    320698b000 default file=/lib64/libc-2.12.so
    3206b8a000 default file=/lib64/libc-2.12.so anon=2 dirty=2 N3=2 kernelpagesize_kB=4
    3206b8e000 default file=/lib64/libc-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4
    3206b8f000 default anon=3 dirty=3 active=1 N3=3 kernelpagesize_kB=4
    7f4dc10a2000 default anon=3 dirty=3 N3=3 kernelpagesize_kB=4
    7f4dc10b4000 default anon=2 dirty=2 active=1 N3=2 kernelpagesize_kB=4
    7f4dc1200000 default file=/anon_hugepage\040(deleted) huge anon=1 dirty=1 N3=1 kernelpagesize_kB=2048
    7fff335f0000 default stack anon=3 dirty=3 N3=3 kernelpagesize_kB=4
    7fff3369d000 default mapped=1 mapmax=35 active=0 N3=1 kernelpagesize_kB=4

Where:

"address" is the starting address for the mapping;

"policy" reports the NUMA memory policy set for the mapping (see
Documentation/admin-guide/mm/numa_memory_policy.rst);

"mapping details" summarizes mapping data such as mapping type, page usage
counters, node locality page counters (N0 == node0, N1 == node1, ...) and the
kernel page size, in KB, backing the mapping.
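
A minimal parsing sketch, assuming the layout described above: the first two fields are the address and policy, ``key=value`` tokens are details, ``N<node>=<pages>`` tokens are node locality counters, and bare tokens such as ``huge`` or ``stack`` are treated as flags. ``NumaMapping`` and ``parse_numa_maps_line`` are hypothetical names chosen for the example.

```python
# Minimal sketch: split one /proc/PID/numa_maps line into its parts.
# File names containing spaces appear escaped (e.g. "\040"), so a plain
# whitespace split keeps each token intact.
from dataclasses import dataclass, field


@dataclass
class NumaMapping:
    address: int
    policy: str
    details: dict[str, str] = field(default_factory=dict)
    flags: list[str] = field(default_factory=list)
    pages_per_node: dict[int, int] = field(default_factory=dict)


def parse_numa_maps_line(line: str) -> NumaMapping:
    address, policy, *rest = line.split()
    m = NumaMapping(address=int(address, 16), policy=policy)
    for token in rest:
        key, sep, value = token.partition("=")  # split on the first "=" only
        if not sep:
            m.flags.append(token)                # e.g. "huge", "stack"
        elif key.startswith("N") and key[1:].isdigit():
            m.pages_per_node[int(key[1:])] = int(value)  # N3=2 -> node 3
        else:
            m.details[key] = value               # file=, anon=, mapped=, ...
    return m
```

Summing ``pages_per_node`` over all mappings of a process gives a rough per-node page count. Keep in mind that the sizes of those pages differ, as ``kernelpagesize_kB`` shows.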
@@ -621,9 +655,11 @@ the running kernel. The files used to obtain this information are contained in
system. Which files are present, and which are missing, depends on the kernel
configuration and the loaded modules.

.. table:: Table 1-5: Kernel info in /proc

 ============ ===============================================================
 File         Content
 ============ ===============================================================
 apm          Advanced power management info
 buddyinfo    Kernel memory allocator information (see text) (2.5)
 bus          Directory containing bus specific information
@@ -669,10 +705,10 @@ Table 1-5: Kernel info in /proc
 version      Kernel version
 video        bttv info of video resources (2.4)
 vmallocinfo  Show vmalloced areas
 ============ ===============================================================

You can, for example, check which interrupts are currently in use and what
they are used for by looking in the file /proc/interrupts::

    > cat /proc/interrupts
               CPU0
@@ -691,7 +727,7 @@ they are used for by looking in the file /proc/interrupts:
    NMI:          0

In 2.4.* a couple of lines were added to this file, LOC and ERR (this time the
output is from an SMP machine)::

    > cat /proc/interrupts
@@ -726,21 +762,25 @@ In 2.6.2* /proc/interrupts was expanded again. This time the goal was for
/proc/interrupts to display every IRQ vector in use by the system, not
just those considered 'most important'. The new vectors are:

THR
  interrupt raised when a machine check threshold counter
  (typically counting ECC corrected errors of memory or cache) exceeds
  a configurable threshold. Only available on some systems.

TRM
  a thermal event interrupt occurs when a temperature threshold
  has been exceeded for the CPU. This interrupt may also be generated
  when the temperature drops back to normal.

SPU
  a spurious interrupt is some interrupt that was raised then lowered
  by some I/O device before it could be fully processed by the APIC. Hence
  the APIC sees the interrupt but does not know what device it came from.
  For this case the APIC will generate the interrupt with an IRQ vector
  of 0xff. This might also be generated by chipset bugs.

RES, CAL, TLB
  rescheduling, call and TLB flush interrupts are
  sent from one CPU to another per the needs of the OS. Typically,
  their statistics are used by kernel developers and interested users to
  determine the occurrence of interrupts of the given type.
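
To see which vectors dominate, the per-CPU counters can be summed per line. This is a minimal sketch that assumes the layout shown above: a header of CPU names, then one line per IRQ or vector with up to one counter per CPU, followed by a free-form description. Lines such as ``ERR:`` carry a single counter. ``interrupt_totals`` is a hypothetical helper name.

```python
# Minimal sketch: total the per-CPU counters in /proc/interrupts text.
# Counting stops at the first non-numeric token, so descriptions such as
# "IO-APIC 2-edge timer" and single-counter lines like "ERR:" are handled.
def interrupt_totals(text: str) -> dict[str, int]:
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())          # "CPU0 CPU1 ..." header
    totals = {}
    for line in lines[1:]:
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        counts = []
        for token in rest.split()[:ncpus]:
            if not token.isdigit():
                break                      # reached the description
            counts.append(int(token))
        totals[name.strip()] = sum(counts)
    return totals
```

Reading the file twice and subtracting the totals gives interrupt rates per IRQ or vector.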
@@ -756,7 +796,8 @@ IRQ to only one CPU, or to exclude a CPU of handling IRQs. The contents of the
irq subdir are one subdirectory for each IRQ and two files: default_smp_affinity
and prof_cpu_mask.

For example::

    > ls /proc/irq/
    0  10  12  14  16  18  2  4  6  8  prof_cpu_mask
    1  11  13  15  17  19  3  5  7  9  default_smp_affinity
@@ -764,20 +805,20 @@ For example
    smp_affinity

smp_affinity is a bitmask in which you can specify which CPUs can handle the
IRQ. You can set it by doing::

    > echo 1 > /proc/irq/10/smp_affinity

This means that only the first CPU will handle the IRQ. You can also echo
5, which means that only the first and third CPUs can handle the IRQ.
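
The value is an ordinary bitmask in which bit n selects CPU n, so 1 selects CPU0 and 5 (binary 101) selects CPU0 and CPU2. A minimal sketch of building such a value, assuming plain hexadecimal output. On machines with many CPUs the kernel prints wider masks as comma-separated 32-bit groups, and this hypothetical helper does not produce that form.

```python
# Minimal sketch: hexadecimal smp_affinity value for a set of CPU numbers.
# Bit n of the mask corresponds to CPU n.
def cpus_to_affinity_mask(cpus: set[int]) -> str:
    if not cpus:
        raise ValueError("at least one CPU is required")
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    return format(mask, "x")
```

``cpus_to_affinity_mask({0, 2})`` returns ``"5"``, the value from the example above.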

The contents of each smp_affinity file are the same by default::

    > cat /proc/irq/0/smp_affinity
    ffffffff

There is an alternate interface, smp_affinity_list, which allows specifying
a CPU range instead of a bitmask::

    > cat /proc/irq/0/smp_affinity_list
    1024-1031
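
A CPU list such as ``1024-1031`` can be expanded into individual CPU numbers. This is a minimal sketch that handles only the comma-separated ``n`` and ``a-b`` forms. Any other syntax the kernel's list parser may accept is deliberately out of scope, and ``parse_cpu_list`` is a hypothetical name.

```python
# Minimal sketch: expand "0-3,8"-style CPU lists into a set of CPU numbers.
def parse_cpu_list(text: str) -> set[int]:
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, sep, last = part.partition("-")
        lo = int(first)
        hi = int(last) if sep else lo   # a single number is a one-CPU range
        if hi < lo:
            raise ValueError(f"bad range: {part}")
        cpus.update(range(lo, hi + 1))
    return cpus
```

For the output above, ``parse_cpu_list("1024-1031")`` yields the eight CPUs 1024 through 1031.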
@@ -810,13 +851,13 @@ Linux uses slab pools for memory management above page level in version 2.2.
Commonly used objects have their own slab pool (such as network buffers,
directory cache, and so on).

::

    > cat /proc/buddyinfo
    Node 0, zone      DMA      0      4      5      4      4      3 ...
    Node 0, zone   Normal      1      0      0      1    101      8 ...
    Node 0, zone  HighMem      2      0      0      1      1      0 ...

External fragmentation is a problem under some workloads, and buddyinfo is a
useful tool for helping diagnose these problems. Buddyinfo will give you a
@@ -829,27 +870,27 @@ ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
available in ZONE_NORMAL, etc...
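
Because column k counts free chunks of 2^k pages, a zone's free pages are the sum of count * 2^k over all columns. This is a minimal sketch of that arithmetic for one buddyinfo line. It assumes a 4096-byte PAGE_SIZE by default (query the real value, for instance with ``os.sysconf("SC_PAGE_SIZE")``) and a complete line: the example lines above are truncated with "...". ``buddyinfo_free_bytes`` is a hypothetical name.

```python
# Minimal sketch: free memory described by one /proc/buddyinfo line.
# Column k holds the number of free chunks of 2**k contiguous pages.
def buddyinfo_free_bytes(line: str, page_size: int = 4096) -> tuple[str, int]:
    _, _, rest = line.partition("zone")
    fields = rest.split()
    zone, orders = fields[0], [int(c) for c in fields[1:]]
    free_pages = sum(count << order for order, count in enumerate(orders))
    return zone, free_pages * page_size
```

For the first six DMA columns shown above (0 4 5 4 4 3) this gives 0 + 8 + 20 + 32 + 64 + 96 = 220 pages.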

More information relevant to external fragmentation can be found in
pagetypeinfo::

    > cat /proc/pagetypeinfo
    Page block order: 9
    Pages per block: 512

    Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
    Node    0, zone      DMA, type    Unmovable      0      0      0      1      1      1      1      1      1      1      0
    Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
    Node    0, zone      DMA, type      Movable      1      1      2      1      2      1      1      0      1      0      2
    Node    0, zone      DMA, type      Reserve      0      0      0      0      0      0      0      0      0      1      0
    Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
    Node    0, zone    DMA32, type    Unmovable    103     54     77      1      1      1     11      8      7      1      9
    Node    0, zone    DMA32, type  Reclaimable      0      0      2      1      0      0      0      0      1      0      0
    Node    0, zone    DMA32, type      Movable    169    152    113     91     77     54     39     13      6      1    452
    Node    0, zone    DMA32, type      Reserve      1      2      2      2      2      0      1      1      1      1      0
    Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0

    Number of blocks type     Unmovable  Reclaimable      Movable      Reserve      Isolate
    Node 0, zone      DMA             2            0            5            1            0
    Node 0, zone    DMA32            41            6          967            2            0

Fragmentation avoidance in the kernel works by grouping pages of the same
migrate type into the same contiguous regions of memory called page blocks.
@@ -870,59 +911,63 @@ unless memory has been mlock()'d. Some of the Reclaimable blocks should
also be allocatable although a lot of filesystem metadata may have to be
reclaimed to achieve this.
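
A quick way to judge whether large allocations can still succeed is to count the free chunks of at least a given order for one migrate type. With "Page block order: 9", ``min_order=9`` counts chunks at least one page block in size. This is a minimal sketch for a single "Free pages count" row; ``free_chunks_at_least`` is a hypothetical helper, and the regular expression assumes the row layout shown above.

```python
# Minimal sketch: free chunks of order >= min_order in one pagetypeinfo row.
import re

ROW = re.compile(r"Node\s+(\d+),\s+zone\s+(\S+),\s+type\s+(\S+)\s+(.*)")


def free_chunks_at_least(row: str, min_order: int) -> tuple[str, str, int]:
    m = ROW.match(row.strip())
    if m is None:
        raise ValueError("not a free-pages row")
    _node, zone, mtype, counts = m.groups()
    per_order = [int(c) for c in counts.split()]  # index == order
    return zone, mtype, sum(per_order[min_order:])
```

For the DMA32 Movable row above, orders 9 and 10 contribute 1 + 452 = 453 chunks.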

meminfo
~~~~~~~

Provides information about the distribution and utilization of memory. This
varies by architecture and compile options. The following is from a
16GB PIII, which has highmem enabled. You may not have all of these fields.

::

    > cat /proc/meminfo

    MemTotal:           16344972 kB
    MemFree:            13634064 kB
    MemAvailable:       14836172 kB
    Buffers:                3656 kB
    Cached:              1195708 kB
    SwapCached:                0 kB
    Active:               891636 kB
    Inactive:            1077224 kB
    HighTotal:          15597528 kB
    HighFree:           13629632 kB
    LowTotal:             747444 kB
    LowFree:                4432 kB
    SwapTotal:                 0 kB
    SwapFree:                  0 kB
    Dirty:                   968 kB
    Writeback:                 0 kB
    AnonPages:            861800 kB
    Mapped:               280372 kB
    Shmem:                   644 kB
    KReclaimable:         168048 kB
    Slab:                 284364 kB
    SReclaimable:         159856 kB
    SUnreclaim:           124508 kB
    PageTables:            24448 kB
    NFS_Unstable:              0 kB
    Bounce:                    0 kB
    WritebackTmp:              0 kB
    CommitLimit:         7669796 kB
    Committed_AS:         100056 kB
    VmallocTotal:         112216 kB
    VmallocUsed:             428 kB
    VmallocChunk:         111088 kB
    Percpu:                62080 kB
    HardwareCorrupted:         0 kB
    AnonHugePages:         49152 kB
    ShmemHugePages:            0 kB
    ShmemPmdMapped:            0 kB
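
Programs usually read this file as a set of ``Name: value kB`` pairs. This is a minimal parsing sketch. Note that a few lines (for example ``HugePages_Total``) carry no unit and are kept here as raw counts. Because fields vary between kernels and configurations, callers should not assume any particular key exists. ``parse_meminfo`` is a hypothetical name.

```python
# Minimal sketch: parse /proc/meminfo text into {field: number}.
# Values are in kB except for the few unit-less count fields.
def parse_meminfo(text: str) -> dict[str, int]:
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts:
            info[name.strip()] = int(parts[0])
    return info
```

With the highmem example above, LowFree + HighFree (4432 + 13629632) equals MemFree (13634064), and LowTotal + HighTotal equals MemTotal.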

MemTotal
              Total usable RAM (i.e., physical RAM minus a few reserved
              bits and the kernel binary code)
MemFree
              Total free RAM. On highmem systems, the sum of
              LowFree+HighFree
MemAvailable
              An estimate of how much memory is available for starting new
              applications, without swapping. Calculated from MemFree,
              SReclaimable, the size of the file LRU lists, and the low
              watermarks in each zone.
@@ -930,69 +975,99 @@ MemAvailable: An estimate of how much memory is available for starting new
              page cache to function well, and that not all reclaimable
              slab will be reclaimable, due to items being in use. The
              impact of those factors will vary from system to system.
Buffers
              Relatively temporary storage for raw disk blocks;
              shouldn't get tremendously large (20MB or so)
Cached
              In-memory cache for files read from the disk (the
              pagecache). Doesn't include SwapCached
SwapCached
              Memory that once was swapped out, is swapped back in, but
              is still also in the swapfile (if memory is needed, it
              doesn't need to be swapped out AGAIN because it is already
              in the swapfile; this saves I/O)
Active
              Memory that has been used more recently and usually not
              reclaimed unless absolutely necessary.
Inactive
              Memory which has been less recently used. It is more
              eligible to be reclaimed for other purposes.
HighTotal, HighFree
              Highmem is all memory above ~860MB of physical memory.
              Highmem areas are for use by userspace programs, or
              for the pagecache. The kernel must use tricks to access
              this memory, making it slower to access than lowmem.
LowTotal, LowFree
              Lowmem is memory which can be used for everything that
              highmem can be used for, but it is also available for the
              kernel's use for its own data structures. Among many
              other things, it is where everything from the Slab is
              allocated. Bad things happen when you're out of lowmem.
SwapTotal
              Total amount of swap space available
SwapFree
              Amount of swap space that is currently unused. Memory
              which has been evicted from RAM and is temporarily on the
              disk accounts for the difference from SwapTotal.
Dirty
              Memory which is waiting to get written back to the disk
Writeback
              Memory which is actively being written back to the disk
AnonPages
              Non-file backed pages mapped into userspace page tables
HardwareCorrupted
              The amount of RAM/memory in KB that the kernel identifies as
              corrupted.
AnonHugePages
              Non-file backed huge pages mapped into userspace page tables
Mapped
              Files which have been mmapped, such as libraries
Shmem
              Total memory used by shared memory (shmem) and tmpfs
ShmemHugePages
              Memory used by shared memory (shmem) and tmpfs allocated
              with huge pages
ShmemPmdMapped
              Shared memory mapped into userspace with huge pages
KReclaimable
              Kernel allocations that the kernel will attempt to reclaim
              under memory pressure. Includes SReclaimable (below), and other
              direct allocations with a shrinker.
Slab
              In-kernel data structures cache
SReclaimable
              Part of Slab that might be reclaimed, such as caches
SUnreclaim
              Part of Slab that cannot be reclaimed under memory pressure
PageTables
              Amount of memory dedicated to the lowest level of page
              tables.
NFS_Unstable
              NFS pages sent to the server, but not yet committed to stable
              storage
Bounce
              Memory used for block device "bounce buffers"
WritebackTmp
              Memory used by FUSE for temporary writeback buffers
CommitLimit
              Based on the overcommit ratio ('vm.overcommit_ratio'),
              this is the total amount of memory currently available to
              be allocated on the system. This limit is only adhered to
              if strict overcommit accounting is enabled (mode 2 in
              'vm.overcommit_memory').

              The CommitLimit is calculated with the following formula::

                CommitLimit = ([total RAM pages] - [total huge TLB pages]) *
                               overcommit_ratio / 100 + [total swap pages]

              For example, on a system with 1G of physical RAM and 7G
              of swap with a 'vm.overcommit_ratio' of 30 it would
              yield a CommitLimit of 7.3G.

              For more details, see the memory overcommit documentation
              in vm/overcommit-accounting.
Committed_AS
              The amount of memory presently allocated on the system.
              The committed memory is a sum of all of the memory which
              has been allocated by processes, even if it has not been
              "used" by them as of yet. A process which malloc()'s 1G
@@ -1005,21 +1080,25 @@ Committed_AS: The amount of memory presently allocated on the system.
              This is useful if one needs to guarantee that processes will
              not fail due to lack of memory once that memory has been
              successfully allocated.
VmallocTotal
              Total size of vmalloc memory area
VmallocUsed
              Amount of vmalloc area which is used
VmallocChunk
              Largest contiguous block of vmalloc area which is free
Percpu
              Memory allocated to the percpu allocator used to back percpu
              allocations. This stat excludes the cost of metadata.
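
The CommitLimit formula given above for strict overcommit mode can be checked with a short worked version in pages. This sketch assumes 4 KiB pages and integer division for "overcommit_ratio / 100". It does not model the alternative 'vm.overcommit_kbytes' setting, and ``commit_limit_pages`` is a hypothetical name.

```python
# Worked version of the CommitLimit formula, in pages.
def commit_limit_pages(total_ram_pages: int, hugetlb_pages: int,
                       swap_pages: int, overcommit_ratio: int) -> int:
    return ((total_ram_pages - hugetlb_pages) * overcommit_ratio // 100
            + swap_pages)


PAGE = 4096       # assumed 4 KiB pages for this example
GiB = 1 << 30
# The 1G RAM / 7G swap / ratio 30 example from the text:
limit = commit_limit_pages(1 * GiB // PAGE, 0, 7 * GiB // PAGE, 30)
print(f"{limit * PAGE / GiB:.1f} GiB")  # prints 7.3 GiB
```

Here 262144 RAM pages * 30 / 100 gives 78643 pages, and adding 1835008 swap pages yields 1913651 pages, roughly 7.3G.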

vmallocinfo
~~~~~~~~~~~

Provides information about vmalloced/vmapped areas. One line per area,
containing the virtual address range of the area, size in bytes,
caller information of the creator, and optional information depending
on the kind of area:

========== ===================================================
pages=nr   number of pages
phys=addr  if a physical address was specified
ioremap    I/O mapping (ioremap() and friends)
@@ -1029,39 +1108,44 @@ on the kind of area :
vpages     buffer for pages pointers was vmalloced (huge area)
N<node>=nr (Only on NUMA kernels)
           Number of pages allocated on memory node <node>
========== ===================================================

::

    > cat /proc/vmallocinfo
    0xffffc20000000000-0xffffc20000201000 2101248 alloc_large_system_hash+0x204 ...
      /0x2c0 pages=512 vmalloc N0=128 N1=128 N2=128 N3=128
    0xffffc20000201000-0xffffc20000302000 1052672 alloc_large_system_hash+0x204 ...
      /0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64
    0xffffc20000302000-0xffffc20000304000    8192 acpi_tb_verify_table+0x21/0x4f...
      phys=7fee8000 ioremap
    0xffffc20000304000-0xffffc20000307000   12288 acpi_tb_verify_table+0x21/0x4f...
      phys=7fee7000 ioremap
    0xffffc2000031d000-0xffffc2000031f000    8192 init_vdso_vars+0x112/0x210
    0xffffc2000031f000-0xffffc2000032b000   49152 cramfs_uncompress_init+0x2e ...
      /0x80 pages=11 vmalloc N0=3 N1=3 N2=2 N3=3
    0xffffc2000033a000-0xffffc2000033d000   12288 sys_swapon+0x640/0xac0 ...
      pages=2 vmalloc N1=2
    0xffffc20000347000-0xffffc2000034c000   20480 xt_alloc_table_info+0xfe ...
      /0x130 [x_tables] pages=4 vmalloc N0=4
    0xffffffffa0000000-0xffffffffa000f000   61440 sys_init_module+0xc27/0x1d00 ...
      pages=14 vmalloc N2=14
    0xffffffffa000f000-0xffffffffa0014000   20480 sys_init_module+0xc27/0x1d00 ...
      pages=4 vmalloc N1=4
    0xffffffffa0014000-0xffffffffa0017000   12288 sys_init_module+0xc27/0x1d00 ...
      pages=2 vmalloc N1=2
    0xffffffffa0017000-0xffffffffa0022000   45056 sys_init_module+0xc27/0x1d00 ...
      pages=10 vmalloc N0=10
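
A minimal sketch for parsing one entry, assuming the unwrapped single-line form found in the real file (the listing above wraps each entry for width). ``key=value`` tokens are kept as strings, ``N<node>=nr`` tokens become per-node page counts, and remaining tokens such as ``vmalloc``, ``ioremap`` or a module name go into ``other``. ``parse_vmallocinfo`` is a hypothetical name.

```python
# Minimal sketch: parse one unwrapped /proc/vmallocinfo line.
import re

ENTRY = re.compile(r"0x([0-9a-f]+)-0x([0-9a-f]+)\s+(\d+)\s+(\S+)(.*)")


def parse_vmallocinfo(line: str) -> dict:
    m = ENTRY.match(line.strip())
    if m is None:
        raise ValueError("unrecognised vmallocinfo line")
    start, end, size, caller, rest = m.groups()
    info = {
        "start": int(start, 16),
        "end": int(end, 16),
        "size": int(size),          # bytes
        "caller": caller,
        "other": [],
        "pages_per_node": {},
    }
    for token in rest.split():
        key, sep, value = token.partition("=")
        if sep and key.startswith("N") and key[1:].isdigit():
            info["pages_per_node"][int(key[1:])] = int(value)
        elif sep:
            info[key] = value       # e.g. pages=512, phys=7fee8000
        else:
            info["other"].append(token)
    return info
```

In the first entry above, the size 2101248 equals end - start (0x201000), and the N0..N3 counts add up to pages=512.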

softirqs
~~~~~~~~

Provides counts of softirq handlers serviced since boot time, for each CPU.

::

    > cat /proc/softirqs
                        CPU0       CPU1       CPU2       CPU3
              HI:          0          0          0          0
           TIMER:      27166      27120      27097      27034
@@ -1083,7 +1167,7 @@ file drivers and a link for each IDE device, pointing to the device directory
in the controller-specific subtree.

The file ``drivers`` contains general information about the drivers used for
the IDE devices::

    > cat /proc/ide/drivers
    ide-cdrom version 4.53
@@ -1094,23 +1178,27 @@ subdirectories. These are named ide0, ide1 and so on. Each of these
directories contains the files shown in Table 1-6.

.. table:: Table 1-6: IDE controller info in /proc/ide/ide?

 ======= =======================================
 File    Content
 ======= =======================================
 channel IDE channel (0 or 1)
 config  Configuration (only for PCI/IDE bridge)
 mate    Mate name
 model   Type/Chipset of IDE controller
 ======= =======================================

Each device connected to a controller has a separate subdirectory in the
controller's directory. The files listed in Table 1-7 are contained in these
directories.

.. table:: Table 1-7: IDE device information

 ================ ==========================================
 File             Content
 ================ ==========================================
 cache            The cache
 capacity         Capacity of the medium (in 512-byte blocks)
 driver           driver and version
@@ -1121,10 +1209,10 @@ Table 1-7: IDE device information
 settings         device setup
 smart_thresholds IDE disk management thresholds
 smart_values     IDE disk management values
 ================ ==========================================

The most interesting file is ``settings``. This file contains a nice
overview of the drive parameters::

    # cat /proc/ide/ide0/hda/settings
    name value min max mode
@@ -1155,9 +1243,11 @@ additional values you get for IP version 6 if you configure the kernel to
support this. Table 1-9 lists the files and their meaning.

.. table:: Table 1-8: IPv6 info in /proc/net

 ========== =====================================================
 File       Content
 ========== =====================================================
 udp6       UDP sockets (IPv6)
 tcp6       TCP sockets (IPv6)
 raw6       Raw device statistics (IPv6)
@@ -1167,12 +1257,13 @@ Table 1-8: IPv6 info in /proc/net
 rt6_stats  Global IPv6 routing tables statistics
 sockstat6  Socket statistics (IPv6)
 snmp6      SNMP data (IPv6)
 ========== =====================================================

.. table:: Table 1-9: Network info in /proc/net

 ============= ================================================================
 File          Content
 ============= ================================================================
 arp           Kernel ARP table
 dev           network devices with statistics
 dev_mcast     the Layer 2 multicast groups a device is listening to
@@ -1199,10 +1290,10 @@ Table 1-9: Network info in /proc/net
 netlink       List of PF_NETLINK sockets
 ip_mr_vifs    List of multicast virtual interfaces
 ip_mr_cache   List of multicast routing cache
 ============= ================================================================

You can use this information to see which network devices are available in
your system and how much traffic was routed over those devices::

    > cat /proc/net/dev
    Inter-|Receive |[...
@@ -1228,7 +1319,7 @@ many times the slaves link has failed.

If you have a SCSI host adapter in your system, you'll find a subdirectory
named after the driver for this adapter in /proc/scsi. You'll also see a list
of all recognized SCSI devices in /proc/scsi::

    > cat /proc/scsi/scsi
    Attached devices:
@@ -1244,7 +1335,7 @@ The directory named after the driver has one file for each adapter found in
the system. These files contain information about the controller, including
the IRQ used and the I/O address range. The amount of information shown
depends on the adapter you use. The example shows the output for an Adaptec
AHA-2940 SCSI adapter::

    > cat /proc/scsi/aic7xxx/0
@@ -1296,9 +1387,11 @@ number (0,1,2,...).
These directories contain the four files shown in Table 1-10.

.. table:: Table 1-10: Files in /proc/parport

 ========= ====================================================================
 File      Content
 ========= ====================================================================
 autoprobe Any IEEE-1284 device ID information that has been acquired.
 devices   list of the device drivers using that port. A + will appear by the
           name of the device currently using the port (it might not appear
@@ -1307,7 +1400,7 @@ Table 1-10: Files in /proc/parport
 irq       IRQ that parport is using for that port. This is in a separate
           file to allow you to alter it by writing a new value in (IRQ
           number or none).
 ========= ====================================================================

1.7 TTY info in /proc/tty
-------------------------

@@ -1317,16 +1410,18 @@ directory /proc/tty.You'll find entries for drivers and line disciplines in
this directory, as shown in Table 1-11.

.. table:: Table 1-11: Files in /proc/tty

 ============= ==============================================
 File          Content
 ============= ==============================================
 drivers       list of drivers and their usage
 ldiscs        registered line disciplines
 driver/serial usage statistics and status of single tty lines
 ============= ==============================================

To see which ttys are currently in use, you can simply look into the file
/proc/tty/drivers::

    > cat /proc/tty/drivers
    pty_slave /dev/pts 136 0-255 pty:slave
@@ -1347,7 +1442,7 @@ To see which tty's are currently in use, you can simply look into the file
Various pieces of information about kernel activity are available in the
/proc/stat file. All of the numbers reported in this file are aggregates
since the system first booted. For a quick look, simply cat the file::

    > cat /proc/stat
    cpu 2255 34 2290 22625563 6290 127 456 0 0 0
@@ -1372,6 +1467,7 @@ second). The meanings of the columns are as follows, from left to right:
- idle: twiddling thumbs
- iowait: In a word, iowait stands for waiting for I/O to complete. But there
  are several problems:

  1. The CPU will not wait for I/O to complete; iowait is the time that a task
     is waiting for I/O to complete. When a CPU goes into idle state for
     outstanding task I/O, another task will be scheduled on this CPU.
@@ -1379,6 +1475,7 @@ second). The meanings of the columns are as follows, from left to right:
     on any CPU, so the iowait of each CPU is difficult to calculate.
  3. The value of the iowait field in /proc/stat will decrease in certain
     conditions.

  So, the iowait value read from /proc/stat is not reliable.
- irq: servicing interrupts
- softirq: servicing softirqs
@@ -1422,18 +1519,19 @@ Information about mounted ext4 file systems can be found in
/proc/fs/ext4/dm-0). The files in each per-device directory are shown
in Table 1-12, below.

.. table:: Table 1-12: Files in /proc/fs/ext4/<devname>

 ============== ==========================================================
 File           Content
 ============== ==========================================================
 mb_groups      details of multiblock allocator buddy cache of free blocks
 ============== ==========================================================

2.0 /proc/consoles
------------------

Shows registered system console lines.

To see which character device lines are currently used for the system console
/dev/console, you may simply look into the file /proc/consoles::

    > cat /proc/consoles
    tty0 -WU (ECp) 4:7
@@ -1441,41 +1539,45 @@ To see which character device lines are currently used for the system console
The columns are:

+--------------------+-------------------------------------------------------+
| device             | name of the device                                    |
+--------------------+-------------------------------------------------------+
| operations         | * R = can do read operations                          |
|                    | * W = can do write operations                         |
|                    | * U = can do unblank                                  |
+--------------------+-------------------------------------------------------+
| flags              | * E = it is enabled                                   |
|                    | * C = it is the preferred console                     |
|                    | * B = it is the primary boot console                  |
|                    | * p = it is used for the printk buffer                |
|                    | * b = it is not a TTY but a Braille device            |
|                    | * a = it is safe to use when the CPU is offline       |
+--------------------+-------------------------------------------------------+
| major:minor        | major and minor number of the device separated by a   |
|                    | colon                                                 |
+--------------------+-------------------------------------------------------+
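
A small parser makes the column layout concrete. This hedged Python sketch derives its regular expression from the sample line shown above, so the regex is an assumption rather than a kernel-specified grammar. It also tolerates blanks inside the parentheses, on the assumption that unset flags may be printed as spaces on a real system. The helper names are hypothetical.

```python
import re

# Assumed layout, based on the sample "tty0 -WU (ECp) 4:7": device name,
# three operation letters (or '-'), flags in parentheses, major:minor.
CONSOLE_LINE = re.compile(r"^(\S+)\s+([R-][W-][U-])\s+\(([^)]*)\)\s+(\d+):(\d+)\s*$")

OPERATIONS = {"R": "read", "W": "write", "U": "unblank"}
FLAGS = {
    "E": "enabled",
    "C": "preferred console",
    "B": "primary boot console",
    "p": "printk buffer",
    "b": "Braille device",
    "a": "safe when CPU is offline",
}


def parse_console_line(line):
    """Decode one /proc/consoles line into named operations and flags."""
    m = CONSOLE_LINE.match(line)
    if m is None:
        raise ValueError(f"unrecognised /proc/consoles line: {line!r}")
    name, ops, flags, major, minor = m.groups()
    return {
        "device": name,
        # '-' marks an operation the console cannot perform.
        "operations": {OPERATIONS[c] for c in ops if c != "-"},
        # Blanks and unknown letters inside the parentheses are skipped.
        "flags": {FLAGS[c] for c in flags if c in FLAGS},
        "major": int(major),
        "minor": int(minor),
    }


def read_consoles(path="/proc/consoles"):
    with open(path) as f:
        return [parse_console_line(line) for line in f if line.strip()]
```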

Summary
-------

The /proc file system serves information about the running system. It not only
allows access to process data but also allows you to request the kernel status
by reading files in the hierarchy.

The directory structure of /proc reflects the types of information and makes
it easy, if not obvious, to know where to look for specific data.

Chapter 2: Modifying System Parameters
======================================

In This Chapter
---------------

* Modifying kernel parameters by writing into files found in /proc/sys
* Exploring the files which modify certain parameters
* Review of the /proc/sys file tree

A very interesting part of /proc is the directory /proc/sys. This is not only
a source of information, it also allows you to change parameters within the
@@ -1503,19 +1605,18 @@ kernels, and became part of it in version 2.2.1 of the Linux kernel.

Please see the Documentation/admin-guide/sysctl/ directory for descriptions of
these entries.

Summary
-------

Certain aspects of kernel behavior can be modified at runtime, without the
need to recompile the kernel, or even to reboot the system. The files in the
/proc/sys tree can not only be read, but also modified. You can use the echo
command to write values into these files, thereby changing the default settings
of the kernel.
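
For scripts, the effect of the echo command can be expressed in a few lines of Python. This sketch assumes the convention also used by sysctl(8): a dotted name such as kernel.hostname maps to the file /proc/sys/kernel/hostname. The helper names are hypothetical. Writing to most entries requires root privileges, and the change lasts only until the next reboot.

```python
from pathlib import Path


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted name like 'net.ipv4.ip_forward' to its file under /proc/sys."""
    return Path(root, *name.split("."))


def read_sysctl(name, root="/proc/sys"):
    return sysctl_path(name, root).read_text().strip()


def write_sysctl(name, value, root="/proc/sys"):
    # Same effect as: echo VALUE > /proc/sys/<path>
    sysctl_path(name, root).write_text(f"{value}\n")
```

For example, `write_sysctl("net.ipv4.ip_forward", 1)` corresponds to `echo 1 > /proc/sys/net/ipv4/ip_forward`.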

Chapter 3: Per-process Parameters
=================================

3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer score
----------------------------------------------------------------------------------

@@ -1588,26 +1689,28 @@ process should be killed in an out-of-memory situation.
This file contains I/O statistics for each running process.

Example
~~~~~~~

::

    test:/tmp # dd if=/dev/zero of=/tmp/test.dat &
    [1] 3828

    test:/tmp # cat /proc/3828/io
    rchar: 323934931
    wchar: 323929600
    syscr: 632687
    syscw: 632675
    read_bytes: 0
    write_bytes: 323932160
    cancelled_write_bytes: 0

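
The following Python sketch shows how a monitoring script might consume this file: it parses the sample output above into integers. The helper names are illustrative. Reading another process's io file generally requires permission to ptrace that process.

```python
SAMPLE = """\
rchar: 323934931
wchar: 323929600
syscr: 632687
syscw: 632675
read_bytes: 0
write_bytes: 323932160
cancelled_write_bytes: 0
"""


def parse_proc_io(text):
    """Turn the 'name: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(":")
        counters[key.strip()] = int(value)  # int() ignores surrounding spaces
    return counters


def read_proc_io(pid="self"):
    with open(f"/proc/{pid}/io") as f:
        return parse_proc_io(f.read())
```

In the sample, wchar is exactly 512 times syscw, which matches dd's default block size of 512 bytes.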
Description
~~~~~~~~~~~

rchar
^^^^^

I/O counter: chars read
The number of bytes which this task has caused to be read from storage. This
@@ -1618,7 +1721,7 @@ pagecache)

wchar
^^^^^

I/O counter: chars written
The number of bytes which this task has caused, or shall cause, to be written
@@ -1626,7 +1729,7 @@ to disk. Similar caveats apply here as with rchar.

syscr
^^^^^

I/O counter: read syscalls
Attempt to count the number of read I/O operations, i.e. syscalls like read()
@@ -1634,7 +1737,7 @@ and pread().

syscw
^^^^^

I/O counter: write syscalls
Attempt to count the number of write I/O operations, i.e. syscalls like
@@ -1642,7 +1745,7 @@ write() and pwrite().

read_bytes
^^^^^^^^^^

I/O counter: bytes read
Attempt to count the number of bytes which this process really did cause to
@@ -1652,7 +1755,7 @@ CIFS at a later time>

write_bytes
^^^^^^^^^^^

I/O counter: bytes written
Attempt to count the number of bytes which this process caused to be sent to
@@ -1660,7 +1763,7 @@ the storage layer. This is done at page-dirtying time.

cancelled_write_bytes
^^^^^^^^^^^^^^^^^^^^^

The big inaccuracy here is truncate. If a process writes 1MB to a file and
then deletes the file, it will in fact perform no writeout. But it will have
@@ -1673,12 +1776,11 @@ from the truncating task's write_bytes, but there is information loss in doing
that.

.. Note::

   At its current implementation state, this is a bit racy on 32-bit machines:
   if process A reads process B's /proc/pid/io while process B is updating one
   of those 64-bit counters, process A could see an intermediate result.

More information about this can be found within the taskstats documentation in
@@ -1698,6 +1800,7 @@ of memory types. If a bit of the bitmask is set, memory segments of the
corresponding memory type are dumped, otherwise they are not dumped.

The following 9 memory types are supported:

- (bit 0) anonymous private memory
- (bit 1) anonymous shared memory
- (bit 2) file-backed private memory
@@ -1719,13 +1822,13 @@ The default value of coredump_filter is 0x33; this means all anonymous memory
segments, ELF header pages and hugetlb private memory are dumped.

If you don't want to dump all shared memory segments attached to pid 1234,
write 0x31 to the process's proc file::

    $ echo 0x31 > /proc/1234/coredump_filter

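
The bitmask arithmetic can be checked with a short Python sketch. The bit names follow this section's list of the nine memory types (bits 0 to 8). The helper names are hypothetical. `read_coredump_filter()` assumes the file holds a hexadecimal number, as in the examples here.

```python
COREDUMP_BITS = [
    "anonymous private memory",                              # bit 0
    "anonymous shared memory",                               # bit 1
    "file-backed private memory",                            # bit 2
    "file-backed shared memory",                             # bit 3
    "ELF header pages in file-backed private memory areas",  # bit 4
    "hugetlb private memory",                                # bit 5
    "hugetlb shared memory",                                 # bit 6
    "DAX private memory",                                    # bit 7
    "DAX shared memory",                                     # bit 8
]


def decode_coredump_filter(mask):
    """Return the memory types whose bit is set in a coredump_filter value."""
    return [name for bit, name in enumerate(COREDUMP_BITS) if mask & (1 << bit)]


def encode_coredump_filter(names):
    """Inverse of decode_coredump_filter()."""
    return sum(1 << COREDUMP_BITS.index(name) for name in set(names))


def read_coredump_filter(pid="self"):
    with open(f"/proc/{pid}/coredump_filter") as f:
        return int(f.read().strip(), 16)  # int() accepts an optional 0x prefix
```

Decoding shows that 0x31 is the default 0x33 with bit 1 (anonymous shared memory) cleared. File-backed shared memory (bit 3) is already excluded by default.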
When a new process is created, the process inherits the bitmask status from its
parent. It is useful to set up coredump_filter before the program runs.
For example::

    $ echo 0x7 > /proc/self/coredump_filter
    $ ./some_program
@@ -1733,35 +1836,37 @@ For example:

3.5 /proc/<pid>/mountinfo - Information about mounts
--------------------------------------------------------

This file contains lines of the form::

    36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue
    (1)(2)(3)   (4)   (5)      (6)      (7)   (8) (9)   (10)         (11)

    (1) mount ID: unique identifier of the mount (may be reused after umount)
    (2) parent ID: ID of parent (or of self for the top of the mount tree)
    (3) major:minor: value of st_dev for files on filesystem
    (4) root: root of the mount within the filesystem
    (5) mount point: mount point relative to the process's root
    (6) mount options: per-mount options
    (7) optional fields: zero or more fields of the form "tag[:value]"
    (8) separator: marks the end of the optional fields
    (9) filesystem type: name of filesystem of the form "type[.subtype]"
    (10) mount source: filesystem specific information or "none"
    (11) super options: per-superblock options

Parsers should ignore all unrecognised optional fields. Currently the
possible optional fields are:

================ ==============================================================
shared:X         mount is shared in peer group X
master:X         mount is slave to peer group X
propagate_from:X mount is slave and receives propagation from peer group X [#]_
unbindable       mount is unbindable
================ ==============================================================

.. [#] X is the closest dominant peer group under the process's root. If
       X is the immediate master of the mount, or if there's no dominant peer
       group under the same root, then only the "master:X" field is present
       and not the "propagate_from:X" field.
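
The number of optional fields varies, so a parser has to locate the separator rather than rely on fixed positions. The Python sketch below does that for one line; it is a hypothetical helper, not a kernel or library API. It splits on whitespace, which works because the kernel escapes whitespace in paths as octal sequences such as ``\040``. The sketch leaves those escapes undecoded.

```python
def parse_mountinfo_line(line):
    """Split one /proc/<pid>/mountinfo line into the fields (1)-(11) above."""
    fields = line.split()
    # (8) The separator follows a variable number of optional fields,
    # so search for it from position 6 onwards.
    sep = fields.index("-", 6)
    major, minor = fields[2].split(":")
    optional = {}
    for entry in fields[6:sep]:              # (7) "tag[:value]" entries
        tag, _, value = entry.partition(":")
        optional[tag] = value or None        # e.g. "unbindable" has no value
    return {
        "mount_id": int(fields[0]),
        "parent_id": int(fields[1]),
        "dev": (int(major), int(minor)),
        "root": fields[3],
        "mount_point": fields[4],
        "mount_options": fields[5].split(","),
        "optional": optional,                # callers ignore unknown tags
        "fstype": fields[sep + 1],
        "source": fields[sep + 2],
        "super_options": fields[sep + 3].split(","),
    }
```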

For more information on mount propagation see:
@@ -1804,77 +1909,86 @@ created with [see open(2) for details] and 'mnt_id' represents mount ID of
the file system containing the opened file [see 3.5 /proc/<pid>/mountinfo
for details].

A typical output is::

    pos: 0
    flags: 0100002
    mnt_id: 19

All locks associated with a file descriptor are shown in its fdinfo too::

    lock: 1: FLOCK ADVISORY WRITE 359 00:13:11691 0 EOF

Files such as eventfd, fsnotify, signalfd and epoll provide, in addition to
the regular pos/flags pair, information particular to the objects they
represent.

Eventfd files
~~~~~~~~~~~~~

::

    pos: 0
    flags: 04002
    mnt_id: 9
    eventfd-count: 5a

where 'eventfd-count' is the hex value of the counter.
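
The pos, flags and mnt_id lines and the eventfd line are simple key: value pairs, but the values use different bases. 'flags' is printed in octal and 'eventfd-count' in hex. The following hedged Python sketch uses a hypothetical `parse_fdinfo()` helper.

```python
def parse_fdinfo(text):
    """Parse 'key: value' fdinfo lines into a dict of raw strings."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")  # split at the first colon only
        if sep:
            info[key.strip()] = value.strip()
    return info


# The eventfd example shown above.
SAMPLE = """\
pos: 0
flags: 04002
mnt_id: 9
eventfd-count: 5a
"""

info = parse_fdinfo(SAMPLE)
flags = int(info["flags"], 8)            # octal O_xxx mask
count = int(info["eventfd-count"], 16)   # hexadecimal counter
```

On x86, 02 is O_RDWR and 04000 is O_NONBLOCK, so the sample eventfd is open read-write in non-blocking mode. The helper splits only at the first colon, so it does not suit the multi-field inotify, fanotify or epoll 'tfd' lines.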

Signalfd files
~~~~~~~~~~~~~~

::

    pos: 0
    flags: 04002
    mnt_id: 9
    sigmask: 0000000000000200

where 'sigmask' is the hex value of the signal mask associated
with a file.
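
The mask becomes readable once it is converted to signal numbers. This follows the kernel's sigset convention: bit n of the mask stands for signal n + 1. The helper name in this Python sketch is hypothetical.

```python
def sigmask_to_signals(hex_mask):
    """Return the signal numbers set in a signalfd 'sigmask' value.

    Bit n of the mask stands for signal number n + 1.
    """
    mask = int(hex_mask, 16)
    return [bit + 1 for bit in range(mask.bit_length()) if (mask >> bit) & 1]
```

For the sample, `sigmask_to_signals("0000000000000200")` yields `[10]`; on Linux x86, signal 10 is SIGUSR1.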

Epoll files
~~~~~~~~~~~

::

    pos: 0
    flags: 02
    mnt_id: 9
    tfd: 5 events: 1d data: ffffffffffffffff pos:0 ino:61af sdev:7

where 'tfd' is a target file descriptor number in decimal form,
'events' is the mask of events being watched and 'data' is the data
associated with a target [see epoll(7) for more details].

The 'pos' is the current offset of the target file in decimal form
[see lseek(2)]; 'ino' and 'sdev' are the inode and device numbers
where the target file resides, all in hex format.
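
The 'tfd' line mixes decimal and hex fields, some written with a space after the colon and some without. The following Python sketch parses it and names the event bits. The helper names are hypothetical. The EPOLLIN, EPOLLPRI, EPOLLOUT, EPOLLERR and EPOLLHUP values are those defined in <sys/epoll.h>, and the sketch ignores any other bits.

```python
import re

EPOLL_BITS = {0x001: "EPOLLIN", 0x002: "EPOLLPRI", 0x004: "EPOLLOUT",
              0x008: "EPOLLERR", 0x010: "EPOLLHUP"}

# "key:value" or "key: value" pairs, as in the example line above.
FIELD = re.compile(r"(\w+):\s*(\S+)")


def parse_tfd_line(line):
    """Parse an epoll fdinfo 'tfd:' line into integers."""
    f = dict(FIELD.findall(line))
    return {
        "tfd": int(f["tfd"]),             # decimal
        "events": int(f["events"], 16),   # hex
        "data": int(f["data"], 16),       # hex
        "pos": int(f["pos"]),             # decimal
        "ino": int(f["ino"], 16),         # hex
        "sdev": int(f["sdev"], 16),       # hex
    }


def event_names(events):
    return [name for bit, name in sorted(EPOLL_BITS.items()) if events & bit]
```

For the sample, the events mask 0x1d decodes to EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP. The all-ones data value is 2**64 - 1.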

Fsnotify files
~~~~~~~~~~~~~~

For inotify files the format is the following::

    pos: 0
    flags: 02000000
    inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:7e9e0000640d1b6d

where 'wd' is a watch descriptor in decimal form, i.e. the value returned
by inotify_add_watch(2) rather than a file descriptor. 'ino' and 'sdev'
are the inode and device numbers where the target file resides, and 'mask'
is the mask of events; these three are in hex form [see inotify(7) for
more details].

If the kernel was built with exportfs support, the path to the target
file is encoded as a file handle. The file handle is provided by three
fields, 'fhandle-bytes', 'fhandle-type' and 'f_handle', all in hex
format.

If the kernel is built without exportfs support, the file handle won't be
printed out.

If there is no inotify mark attached yet, the 'inotify' line will be omitted.
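The 'f_handle' field is the raw handle, printed byte by byte as two hex digits. 'fhandle-bytes' (also hex) gives its length. The sketch below decodes the handle into a byte buffer. decode_fhandle() is hypothetical. Reopening the file from such a handle with open_by_handle_at(2) is left out because that call needs the CAP_DAC_READ_SEARCH capability.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Decode the 'f_handle' hex string of an inotify/fanotify fdinfo line.
 * 'nbytes' is the value of 'fhandle-bytes'.
 * Returns 0 on success, -1 on short input or a too-small buffer.
 */
static int decode_fhandle(const char *hex, size_t nbytes,
                          unsigned char *out, size_t outlen)
{
        if (nbytes > outlen || strlen(hex) < 2 * nbytes)
                return -1;

        for (size_t i = 0; i < nbytes; i++) {
                unsigned int byte;

                /* each byte is printed as exactly two hex digits */
                if (sscanf(hex + 2 * i, "%2x", &byte) != 1)
                        return -1;
                out[i] = (unsigned char)byte;
        }
        return 0;
}
```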

For fanotify files the format is::

    pos: 0
    flags: 02
@@ -1883,20 +1997,22 @@ pair provide additional information particular to the objects they represent.
    fanotify mnt_id:12 mflags:40 mask:38 ignored_mask:40000003
    fanotify ino:4f969 sdev:800013 mflags:0 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:69f90400c275b5b4

where fanotify 'flags' and 'event-flags' are the values used in the
fanotify_init call, 'mnt_id' is the mount point identifier, and 'mflags'
is the value of the flags associated with the mark, which are tracked
separately from the events mask. 'ino' and 'sdev' are the target inode
and device, 'mask' is the events mask, and 'ignored_mask' is the mask of
events which are to be ignored. All of these are in hex format. Together,
'mflags', 'mask' and 'ignored_mask' show the flags and masks that were
passed to the fanotify_mark call [see fanotify_mark(2) for details].

While the first three lines are mandatory and always printed, the rest is
optional and may be omitted if no marks have been created yet.
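The hex masks can be decoded against the fanotify event bits. The sketch below lists only a few bits. Their values are as defined in the Linux fanotify UAPI header: FAN_ACCESS 0x01, FAN_MODIFY 0x02, FAN_CLOSE_WRITE 0x08, FAN_CLOSE_NOWRITE 0x10 and FAN_OPEN 0x20. describe_fan_mask() is a hypothetical helper, and it silently skips any bit it does not know.

```c
#include <stdlib.h>
#include <string.h>

struct fan_bit {
        unsigned long long bit;
        const char *name;
};

/* A few event bits; values as in the fanotify UAPI header. */
static const struct fan_bit fan_bits[] = {
        { 0x01, "FAN_ACCESS" },
        { 0x02, "FAN_MODIFY" },
        { 0x08, "FAN_CLOSE_WRITE" },
        { 0x10, "FAN_CLOSE_NOWRITE" },
        { 0x20, "FAN_OPEN" },
};

/* Write the names of the known bits in 'hexmask' to buf, joined by '|'. */
static void describe_fan_mask(const char *hexmask, char *buf, size_t len)
{
        unsigned long long mask = strtoull(hexmask, NULL, 16);

        if (len == 0)
                return;
        buf[0] = '\0';
        for (size_t i = 0; i < sizeof(fan_bits) / sizeof(fan_bits[0]); i++) {
                if (!(mask & fan_bits[i].bit))
                        continue;
                if (buf[0] != '\0')
                        strncat(buf, "|", len - strlen(buf) - 1);
                strncat(buf, fan_bits[i].name, len - strlen(buf) - 1);
        }
}
```

For the first mark in the example, mask:38 decodes to FAN_CLOSE_WRITE|FAN_CLOSE_NOWRITE|FAN_OPEN.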

Timerfd files
~~~~~~~~~~~~~

::

    pos: 0
    flags: 02
@@ -1907,18 +2023,18 @@ pair provide additional information particular to the objects they represent.
    it_value: (0, 49406829)
    it_interval: (1, 0)

where 'clockid' is the clock type and 'ticks' is the number of timer
expirations that have occurred [see timerfd_create(2) for details].
'settime flags' are the flags, in octal form, that were used to set up the
timer [see timerfd_settime(2) for details]. 'it_value' is the remaining
time until the timer expires, and 'it_interval' is the interval for the
timer. Note that the timer might have been set up with the
TFD_TIMER_ABSTIME option, which will be shown in 'settime flags', but
'it_value' still shows the timer's remaining time.
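'it_value' and 'it_interval' are each printed as a (seconds, nanoseconds) pair. The sketch below converts one such line to nanoseconds and assumes that field order. timerfd_ns() is a hypothetical helper.

```c
#include <stdio.h>

/*
 * Convert a line such as "it_value: (0, 49406829)" to nanoseconds.
 * Returns -1 if the line is not a "(sec, nsec)" pair.
 */
static long long timerfd_ns(const char *line)
{
        long long sec, nsec;

        /* skip the field name up to the colon, then read "(sec, nsec)" */
        if (sscanf(line, "%*[^:]: (%lld, %lld)", &sec, &nsec) != 2)
                return -1;
        return sec * 1000000000LL + nsec;
}
```

For the example, it_value gives 49406829 ns (about 49 ms). it_interval gives 1000000000 ns, which is a one-second period.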

3.9 /proc/<pid>/map_files - Information about memory mapped files
---------------------------------------------------------------------

This directory contains symbolic links which represent memory mapped files
the process is maintaining. Example output::

     | lr-------- 1 root root 64 Jan 27 11:24 333c600000-333c620000 -> /usr/lib64/ld-2.18.so
     | lr-------- 1 root root 64 Jan 27 11:24 333c81f000-333c820000 -> /usr/lib64/ld-2.18.so
@@ -1976,17 +2092,22 @@ When CONFIG_PROC_PID_ARCH_STATUS is enabled, this file displays the
architecture specific status of the task.

Example
~~~~~~~

::

 $ cat /proc/6753/arch_status
 AVX512_elapsed_ms: 8

Description
~~~~~~~~~~~

x86 specific entries:
~~~~~~~~~~~~~~~~~~~~~

AVX512_elapsed_ms:
^^^^^^^^^^^^^^^^^^

If AVX512 is supported on the machine, this entry shows the milliseconds
elapsed since the last time AVX512 usage was recorded. The recording
happens on a best effort basis when a task is scheduled out. This means
@@ -2010,17 +2131,18 @@ x86 specific entries:
the task is unlikely to be an AVX512 user, but depending on the workload
and the scheduling scenario, it could also be a false negative as
mentioned above.

Configuring procfs
------------------

4.1 Mount options
---------------------

The following mount options are supported:

=========  ========================================================
hidepid=   Set /proc/<pid>/ access mode.
gid=       Set the group authorized to learn process information.
=========  ========================================================

hidepid=0 means classic mode - everybody may access all /proc/<pid>/ directories
(default).
......
.. SPDX-License-Identifier: GPL-2.0

===================
The QNX6 Filesystem
===================

@@ -14,10 +17,12 @@ Specification
qnx6fs shares many properties with traditional Unix filesystems. It has the
concepts of blocks, inodes and directories.

On QNX it is possible to create both little endian and big endian qnx6
filesystems. This makes it possible to create a filesystem whose endianness
matches the target platform (QNX is used on quite a range of embedded
systems), even when that differs from the endianness of the system creating
it.

The Linux driver handles both endiannesses (LE and BE) transparently.

Blocks
@@ -26,6 +31,7 @@ Blocks
The space in the device or file is split up into blocks. These are a fixed
size of 512, 1024, 2048 or 4096 bytes, which is decided when the filesystem
is created.

Block pointers are 32 bits wide, so the maximum space that can be addressed
is 2^32 * 4096 bytes, or 16TB.
@@ -50,6 +56,7 @@ Each of these root nodes holds information like total size of the stored
data and the addressing levels in that specific tree.

If the level value is 0, up to 16 direct blocks can be addressed by each
node.

Level 1 adds an additional indirect addressing level where each indirect
addressing block holds up to blocksize / 4 block pointers (4 bytes each)
to data blocks.
Level 2 adds an additional indirect addressing block level (so, already up
@@ -57,11 +64,13 @@ to 16 * 256 * 256 = 1048576 blocks that can be addressed by such a tree).

Unused block pointers are always set to ~0, whether they are in root nodes,
indirect addressing blocks or inodes.

Data leaves are always on the lowest level, so no data is stored on upper
tree levels.
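The addressing capacity of a tree follows directly from these rules. The root has 16 pointers, and every indirect level multiplies that by blocksize / 4. The sketch below computes it. qnx6_tree_capacity() is a hypothetical helper and is not part of the Linux driver.

```c
#include <stdint.h>

/*
 * Number of data blocks one qnx6fs tree can address at a given level:
 * 16 direct pointers in the root, times (blocksize / 4) 32-bit block
 * pointers for every indirect level.
 */
static uint64_t qnx6_tree_capacity(uint32_t blocksize, unsigned int level)
{
        uint64_t blocks = 16;                   /* pointers in the root node */
        uint64_t ptrs_per_block = blocksize / 4;

        while (level-- > 0)
                blocks *= ptrs_per_block;
        return blocks;
}
```

With a 1024-byte block size and level 2, this reproduces the 16 * 256 * 256 = 1048576 figure given above.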

The first superblock is located at 0x2000 (0x2000 is the bootblock size).
On the Audi MMI 3G, the first superblock starts directly at byte 0.

The second superblock position can be calculated either from the superblock
information (total number of filesystem blocks) or by taking the highest
device address, zeroing the last 3 bytes and then subtracting 0x1000 from
@@ -84,6 +93,7 @@ Object mode field is POSIX format. (which makes things easier)
There are also pointers to the first 16 blocks, if the object data can be
addressed with 16 direct blocks.

For more than 16 blocks, indirect addressing in the form of another tree is
used (the scheme is the same as the one used for the superblock root nodes).
@@ -96,13 +106,18 @@ Directories
A directory is a filesystem object and has an inode just like a file.
It is a specially formatted file containing records which associate each
name with an inode number.

'.' inode number points to the directory inode

'..' inode number points to the parent directory inode

Each filename record additionally has a filename length field.

One special case is long filenames or subdirectory names.
These have the filename length field set to 0xff in the corresponding
directory record, and the longfilename inode number is also stored in that
record.
With that longfilename inode number, the longfilename tree can be walked
starting with the superblock longfilename root node pointers.
@@ -111,6 +126,7 @@ Special files
Symbolic links are also filesystem objects with inodes. They have a specific
bit in the inode mode field identifying them as symbolic links.

The directory entry file inode pointer points to the target file inode.

Hard links have an inode and a directory entry, but a specific mode bit set,
@@ -126,9 +142,11 @@ Long filenames
Long filenames are stored in a separate addressing tree. The starting point
is the longfilename root node in the active superblock.

Each data block (tree leaf) holds one long filename. That filename is
limited to 510 bytes. The first two bytes are used as the length field
for the actual filename.

Since this structure has to fit into every allowed block size, including
the smallest one of 512 bytes, the actual filename stored is limited to
512 - 2 = 510 bytes.
@@ -138,6 +156,7 @@ Bitmap
The qnx6fs filesystem allocation bitmap is stored in a tree under the bitmap
root node in the superblock, and each bit in the bitmap represents one
filesystem block.

The first block is block 0, which starts 0x1000 after the start of the
superblock. So for a normal qnx6fs, block 0 is located at physical address
0x3000 (bootblock + superblock).
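The block-to-offset arithmetic can be written down directly. This sketch assumes that block n lies n * blocksize bytes after block 0. qnx6_block_offset() is a hypothetical helper.

```c
#include <stdint.h>

/*
 * Byte offset of qnx6fs block 'n' on the device. Block 0 starts 0x1000
 * after the start of the first superblock. sb_offset is 0x2000 for a
 * normal qnx6fs (after the bootblock) and 0 for the Audi MMI 3G variant.
 */
static uint64_t qnx6_block_offset(uint64_t sb_offset, uint32_t blocksize,
                                  uint32_t n)
{
        return sb_offset + 0x1000 + (uint64_t)n * blocksize;
}
```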
@@ -149,11 +168,14 @@ Bitmap system area
------------------

The bitmap itself is divided into three parts.
First the system area, which is split into two halves.
Then the userspace area.

The requirement for a static, fixed, preallocated system area comes from how
qnx6fs deals with writes.

Each superblock has its own half of the system area. So superblock #1
always uses blocks from the lower half, while superblock #2 only writes to
blocks represented by the upper half of the bitmap system area bits.
......
.. SPDX-License-Identifier: GPL-2.0

===========================
Ramfs, rootfs and initramfs
===========================

October 17, 2005

Rob Landley <rob@landley.net>
=============================
@@ -99,14 +105,14 @@ out of that.
All this differs from the old initrd in several ways:

  - The old initrd was always a separate file, while the initramfs archive is
    linked into the Linux kernel image. (The directory ``linux-*/usr`` is
    devoted to generating this archive during the build.)

  - The old initrd file was a gzipped filesystem image (in some file format,
    such as ext2, that needed a driver built into the kernel), while the new
    initramfs archive is a gzipped cpio archive (like tar only simpler,
    see cpio(1) and Documentation/driver-api/early-userspace/buffer-format.rst).
    The kernel's cpio extraction code is not only extremely small, it's also
    __init text and data that can be discarded during the boot process.

  - The program run by the old initrd (which was called /initrd, not /init) did
@@ -139,7 +145,7 @@ and living in usr/Kconfig) can be used to specify a source for the
initramfs archive, which will automatically be incorporated into the
resulting binary. This option can point to an existing gzipped cpio
archive, a directory containing files to be archived, or a text file
specification such as the following example::

  dir /dev 755 0 0
  nod /dev/console 644 0 0 c 5 1
@@ -175,12 +181,12 @@ or extracting your own preprepared cpio files to feed to the kernel build
(instead of a config file or directory).

The following command line can extract a cpio image (generated either by
the above script or by the kernel build) back into its component files::

  cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames
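The "-H newc" option selects the cpio format the kernel expects. As described in buffer-format.rst, each entry starts with a 110-byte header: the magic "070701" followed by 13 fields of eight ASCII hex digits. The fields are inode, mode, uid, gid, nlink, mtime, filesize, devmajor, devminor, rdevmajor, rdevminor, namesize and check. The minimal sketch below reads the fields needed to walk an archive. The struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* The few header fields needed to step from one entry to the next. */
struct newc_hdr {
        unsigned long mode, filesize, namesize;
};

/* Parse one 8-digit ASCII hex field. */
static unsigned long hex8(const char *p)
{
        char tmp[9];

        memcpy(tmp, p, 8);
        tmp[8] = '\0';
        return strtoul(tmp, NULL, 16);
}

/* Returns 0 on success, -1 if the buffer is short or the magic is wrong. */
static int parse_newc(const char *buf, size_t len, struct newc_hdr *h)
{
        if (len < 110 || memcmp(buf, "070701", 6) != 0)
                return -1;
        /* field i (0-based, after the magic) starts at offset 6 + 8 * i */
        h->mode     = hex8(buf + 6 + 8 * 1);   /* c_mode */
        h->filesize = hex8(buf + 6 + 8 * 6);   /* c_filesize */
        h->namesize = hex8(buf + 6 + 8 * 11);  /* c_namesize, includes NUL */
        return 0;
}
```

The file name follows the header. Both the name and the file data are padded to a multiple of four bytes before the next header begins.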

The following shell script can create a prebuilt cpio archive you can
use in place of the above config file::

  #!/bin/sh
@@ -202,14 +208,17 @@ use in place of the above config file:
    exit 1
  fi

.. Note::

   The cpio man page contains some bad advice that will break your initramfs
   archive if you follow it. It says "A typical way to generate the list
   of filenames is with the find command; you should give find the -depth
   option to minimize problems with permissions on directories that are
   unwritable or not searchable." Don't do this when creating
   initramfs.cpio.gz images; it won't work. The Linux kernel cpio extractor
   won't create files in a directory that doesn't exist, so the directory
   entries must go before the files that go in those directories.
   The above script gets them in the right order.
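The ordering rule in the note can be checked mechanically. This sketch assumes the archive's entries are available in order as relative paths without trailing slashes. first_orphan() is a hypothetical helper, and it does not verify that the parent entry really is a directory.

```c
#include <stddef.h>
#include <string.h>

/*
 * Check that every entry's parent directory appears earlier in the list,
 * as the kernel's cpio extractor requires. Top-level entries (no '/')
 * have no parent to check. Returns the index of the first entry whose
 * parent has not been seen yet, or -1 if the order is fine.
 */
static int first_orphan(const char *const *paths, int n)
{
        for (int i = 0; i < n; i++) {
                const char *slash = strrchr(paths[i], '/');
                size_t plen;
                int found = 0;

                if (slash == NULL)
                        continue;               /* top-level entry */
                plen = (size_t)(slash - paths[i]);
                for (int j = 0; j < i && !found; j++)
                        found = strlen(paths[j]) == plen &&
                                strncmp(paths[j], paths[i], plen) == 0;
                if (!found)
                        return i;
        }
        return -1;
}
```

A list produced by "find -depth" puts "dev/console" before "dev", so this check flags it at index 0.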

External initramfs images:
--------------------------
@@ -236,9 +245,10 @@ An initramfs archive is a complete self-contained root filesystem for Linux.
If you don't already understand what shared libraries, devices, and paths
you need to get a minimal root filesystem up and running, here are some
references:

- http://www.tldp.org/HOWTO/Bootdisk-HOWTO/
- http://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html
- http://www.linuxfromscratch.org/lfs/view/stable/

The "klibc" package (http://www.kernel.org/pub/linux/libs/klibc) is
designed to be a tiny C library to statically link early userspace
@@ -255,7 +265,7 @@ name lookups, even when otherwise statically linked.)
A good first step is to get initramfs to run a statically linked "hello world"
program as init, and test it under an emulator like qemu (www.qemu.org) or
User Mode Linux, like so::

  cat > hello.c << EOF
  #include <stdio.h>
@@ -326,8 +336,8 @@ the above threads) is:
explained his reasoning:

  - http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1550.html
  - http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1638.html

and, most importantly, designed and implemented the initramfs code.
......
.. SPDX-License-Identifier: GPL-2.0

==================================
relay interface (formerly relayfs)
==================================
@@ -108,6 +111,7 @@ The relay interface implements basic file operations for user space
access to relay channel buffer data. Here are the file operations
that are available and some comments regarding their behavior:

=========== ============================================================
open()      enables user to open an *existing* channel buffer.

mmap()      results in channel buffer being mapped into the caller's
@@ -136,13 +140,16 @@ poll() POLLIN/POLLRDNORM/POLLERR supported. User applications are
close()     decrements the channel buffer's refcount. When the refcount
            reaches 0, i.e. when no process or kernel client has the
            buffer open, the channel buffer is freed.
=========== ============================================================

In order for a user application to make use of relay files, the
host filesystem must be mounted. For example::

    mount -t debugfs debugfs /sys/kernel/debug

.. Note::

    The host filesystem doesn't need to be mounted for kernel
    clients to create or use channels - it only needs to be
    mounted when user space applications need access to the buffer
    data.
@@ -154,7 +161,7 @@ The relay interface kernel API
Here's a summary of the API the relay interface provides to in-kernel clients:

TBD(curr. line MT:/API/)

channel management functions::

    relay_open(base_filename, parent, subbuf_size, n_subbufs,
               callbacks, private_data)
@@ -162,17 +169,17 @@ TBD(curr. line MT:/API/)
    relay_flush(chan)
    relay_reset(chan)

channel management typically called on instigation of userspace::

    relay_subbufs_consumed(chan, cpu, subbufs_consumed)

write functions::

    relay_write(chan, data, length)
    __relay_write(chan, data, length)
    relay_reserve(chan, length)

callbacks::

    subbuf_start(buf, subbuf, prev_subbuf, prev_padding)
    buf_mapped(buf, filp)
@@ -180,7 +187,7 @@ TBD(curr. line MT:/API/)
    create_buf_file(filename, parent, mode, buf, is_global)
    remove_buf_file(dentry)

helper functions::

    relay_buf_full(buf)
    subbuf_start_reserve(buf, length)
@@ -215,41 +222,41 @@ the file(s) created in create_buf_file() and is called during
relay_close().

Here are some typical definitions for these callbacks, in this case
using debugfs::

    /*
     * create_buf_file() callback.  Creates relay file in debugfs.
     */
    static struct dentry *create_buf_file_handler(const char *filename,
                                                  struct dentry *parent,
                                                  umode_t mode,
                                                  struct rchan_buf *buf,
                                                  int *is_global)
    {
            return debugfs_create_file(filename, mode, parent, buf,
                                       &relay_file_operations);
    }

    /*
     * remove_buf_file() callback.  Removes relay file from debugfs.
     */
    static int remove_buf_file_handler(struct dentry *dentry)
    {
            debugfs_remove(dentry);

            return 0;
    }

    /*
     * relay interface callbacks
     */
    static struct rchan_callbacks relay_callbacks =
    {
            .create_buf_file = create_buf_file_handler,
            .remove_buf_file = remove_buf_file_handler,
    };

And an example relay_open() invocation using them::

  chan = relay_open("cpu", NULL, SUBBUF_SIZE, N_SUBBUFS, &relay_callbacks, NULL);
@@ -339,13 +346,13 @@ whether or not to actually move on to the next sub-buffer.
To implement 'no-overwrite' mode, the kernel client would provide
an implementation of the subbuf_start() callback something like the
following::

    static int subbuf_start(struct rchan_buf *buf,
                            void *subbuf,
                            void *prev_subbuf,
                            size_t prev_padding)
    {
            if (prev_subbuf)
                    *((unsigned *)prev_subbuf) = prev_padding;
@@ -355,7 +362,7 @@ static int subbuf_start(struct rchan_buf *buf,
            subbuf_start_reserve(buf, sizeof(unsigned int));

            return 1;
    }

If the current buffer is full, i.e. all sub-buffers remain unconsumed,
the callback returns 0 to indicate that the buffer switch should not
@@ -370,20 +377,20 @@ ready sub-buffers will relay_buf_full() return 0, in which case the
buffer switch can continue.
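On the reading side, each completed sub-buffer then starts with the padding count that the subbuf_start() callback stored there. The sketch below shows how a user space reader could locate the payload. It assumes the reader already holds one complete sub-buffer of subbuf_size bytes, for example obtained via read() or mmap() on the relay file. subbuf_payload() is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return a pointer to the payload of one complete sub-buffer and store its
 * length in *len. The first unsigned int holds the padding, i.e. the number
 * of unused bytes at the end of the sub-buffer.
 */
static const char *subbuf_payload(const void *subbuf, size_t subbuf_size,
                                  size_t *len)
{
        const size_t hdr = sizeof(unsigned int);
        unsigned int padding;

        *len = 0;
        if (subbuf_size < hdr)
                return NULL;
        /* memcpy() avoids assuming the buffer is suitably aligned */
        memcpy(&padding, subbuf, sizeof(padding));
        if (padding > subbuf_size - hdr)
                return NULL;            /* inconsistent padding value */
        *len = subbuf_size - hdr - padding;
        return (const char *)subbuf + hdr;
}
```

The same layout applies to the 'overwrite' variant, because both callbacks reserve and fill the leading unsigned int in the same way.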

The implementation of the subbuf_start() callback for 'overwrite' mode
would be very similar::

    static int subbuf_start(struct rchan_buf *buf,
                            void *subbuf,
                            void *prev_subbuf,
                            size_t prev_padding)
    {
            if (prev_subbuf)
                    *((unsigned *)prev_subbuf) = prev_padding;

            subbuf_start_reserve(buf, sizeof(unsigned int));

            return 1;
    }

In this case, the relay_buf_full() check is meaningless and the
callback always returns 1, causing the buffer switch to occur
......
ROMFS - ROM FILE SYSTEM .. SPDX-License-Identifier: GPL-2.0
=======================
ROMFS - ROM File System
=======================
This is quite a dumb, read-only filesystem, mainly for initial RAM This is quite a dumb, read-only filesystem, mainly for initial RAM
disks of installation disks. It grew out of the need to have disks of installation disks. It grew out of the need to have
...@@ -51,9 +55,9 @@ the 16 byte padding for the name and the contents, also 16+14+15 = 45 ...@@ -51,9 +55,9 @@ the 16 byte padding for the name and the contents, also 16+14+15 = 45
bytes. This is quite rare however, since most file names are longer bytes. This is quite rare however, since most file names are longer
than 3 bytes, and shorter than 15 bytes. than 3 bytes, and shorter than 15 bytes.
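The 16+14+15 = 45 byte figure can be checked with a small helper. This is a sketch assuming the per-file layout this document describes: a 16-byte header, then the zero-terminated name and the file contents, each padded to a 16-byte boundary. `romfs_file_bytes()` and `romfs_pad16()` are illustrative names, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ROMFS_ALIGN 16u	/* headers, names and data start on 16-byte boundaries */

/* Round n up to the next multiple of 16. */
static size_t romfs_pad16(size_t n)
{
	return (n + ROMFS_ALIGN - 1) & ~(size_t)(ROMFS_ALIGN - 1);
}

/*
 * Bytes one file occupies: the 16-byte header, the NUL-terminated name
 * padded to 16 bytes, then the contents padded to 16 bytes.
 */
static size_t romfs_file_bytes(const char *name, size_t size)
{
	assert(name != NULL && name[0] != '\0');
	return ROMFS_ALIGN + romfs_pad16(strlen(name) + 1) + romfs_pad16(size);
}
```

For a one-character name and one byte of data the helper returns 48: 3 bytes of payload (the name, its terminator and the data byte) plus 45 bytes of header and padding. An empty file whose name is shorter than 16 characters takes 32 bytes.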
The layout of the filesystem is the following: The layout of the filesystem is the following::
offset content offset content
+---+---+---+---+ +---+---+---+---+
0 | - | r | o | m | \ 0 | - | r | o | m | \
...@@ -84,9 +88,9 @@ the source. This algorithm was chosen because although it's not quite ...@@ -84,9 +88,9 @@ the source. This algorithm was chosen because although it's not quite
reliable, it does not require any tables, and it is very simple. reliable, it does not require any tables, and it is very simple.
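A sketch of how the checksum longword can be computed and verified. It assumes the AFFS-style rule given in the full document: the big-endian 32-bit words of the first 512 bytes (or of the whole image, if smaller) must sum to zero modulo 2^32, with the checksum stored at offset 12 of the superblock. The function names are illustrative; the kernel's own check lives in fs/romfs/super.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read one big-endian 32-bit longword. */
static uint32_t romfs_be32(const unsigned char *p)
{
	return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
	       (uint32_t)p[2] << 8 | (uint32_t)p[3];
}

/* Sum the longwords of the first 512 bytes (or fs_size, if smaller). */
static uint32_t romfs_sum(const unsigned char *img, size_t fs_size)
{
	size_t len = fs_size < 512 ? fs_size : 512;
	uint32_t sum = 0;

	assert(len % 4 == 0);
	for (size_t i = 0; i < len; i += 4)
		sum += romfs_be32(img + i);	/* wraps modulo 2^32 */
	return sum;
}

/* Fill the checksum longword at offset 12 so the sum becomes zero. */
static void romfs_set_checksum(unsigned char *img, size_t fs_size)
{
	uint32_t c;

	img[12] = img[13] = img[14] = img[15] = 0;
	c = 0u - romfs_sum(img, fs_size);
	img[12] = (unsigned char)(c >> 24);
	img[13] = (unsigned char)(c >> 16);
	img[14] = (unsigned char)(c >> 8);
	img[15] = (unsigned char)c;
}
```

A reader validates an image by checking that romfs_sum() returns 0; a tool building an image would call romfs_set_checksum() last, once the rest of the first 512 bytes is final.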
The following bytes are now part of the file system; each file header The following bytes are now part of the file system; each file header
must begin on a 16 byte boundary. must begin on a 16 byte boundary::
offset content offset content
+---+---+---+---+ +---+---+---+---+
0 | next filehdr|X| The offset of the next file header 0 | next filehdr|X| The offset of the next file header
...@@ -114,7 +118,9 @@ file is user and group 0, this should never be a problem for the ...@@ -114,7 +118,9 @@ file is user and group 0, this should never be a problem for the
intended use. The mapping of the 8 possible values to file types is intended use. The mapping of the 8 possible values to file types is
the following: the following:
== =============== ============================================
mapping spec.info means mapping spec.info means
== =============== ============================================
0 hard link link destination [file header] 0 hard link link destination [file header]
1 directory first file's header 1 directory first file's header
2 regular file unused, must be zero [MBZ] 2 regular file unused, must be zero [MBZ]
...@@ -123,6 +129,7 @@ the following: ...@@ -123,6 +129,7 @@ the following:
5 char device - " - 5 char device - " -
6 socket unused, MBZ 6 socket unused, MBZ
7 fifo unused, MBZ 7 fifo unused, MBZ
== =============== ============================================
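Because every file header starts on a 16-byte boundary, the low four bits of the "next filehdr" longword are free to carry mode information. The sketch below decodes such a word. It assumes, as in the kernel's include/uapi/linux/romfs_fs.h, that bits 0..2 hold the type from the table above and bit 3 the executable flag, and it fills in types 3 and 4 (symbolic link, block device), which fall outside the excerpt shown here. `romfs_decode_next()` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Mode bits kept in the low nibble of the "next filehdr" longword;
 * values mirror ROMFH_TYPE, ROMFH_EXEC and ROMFH_MASK in the kernel. */
#define ROMFH_TYPE 7u		/* bits 0..2: file type, see the table */
#define ROMFH_EXEC 8u		/* bit 3: file is executable */
#define ROMFH_MASK (~15u)	/* headers are 16-byte aligned */

static const char *const romfs_type_names[8] = {
	"hard link", "directory", "regular file", "symbolic link",
	"block device", "char device", "socket", "fifo",
};

struct romfs_next {
	uint32_t next;		/* offset of the next header, 0 if none */
	const char *type;	/* file type decoded from bits 0..2 */
	int exec;		/* non-zero if the executable bit is set */
};

static struct romfs_next romfs_decode_next(uint32_t word)
{
	struct romfs_next r = {
		.next = word & ROMFH_MASK,
		.type = romfs_type_names[word & ROMFH_TYPE],
		.exec = (word & ROMFH_EXEC) != 0,
	};
	return r;
}
```

For example, a word of 0x29 decodes to a next header at offset 0x20, type 1 (directory), with the executable bit set.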
Note that hard links are specifically marked in this filesystem, but Note that hard links are specifically marked in this filesystem, but
they behave as you would expect (i.e. they share the inode number). they behave as you would expect (i.e. they share the inode number).
...@@ -158,24 +165,24 @@ to romfs-subscribe@shadow.banki.hu, the content is irrelevant. ...@@ -158,24 +165,24 @@ to romfs-subscribe@shadow.banki.hu, the content is irrelevant.
Pending issues: Pending issues:
- Permissions and owner information are pretty essential features of a - Permissions and owner information are pretty essential features of a
Un*x like system, but romfs does not provide the full possibilities. Un*x like system, but romfs does not provide the full possibilities.
I have never found this limiting, but others might. I have never found this limiting, but others might.
- The file system is read only, so it can be very small, but in case - The file system is read only, so it can be very small, but in case
one wants to write _anything_ to a file system, one still needs one wants to write _anything_ to a file system, one still needs
a writable file system, thus negating the size advantages. Possible a writable file system, thus negating the size advantages. Possible
solutions: implement write access as a compile-time option, or a new, solutions: implement write access as a compile-time option, or a new,
similarly small writable filesystem for RAM disks. similarly small writable filesystem for RAM disks.
- Since the files are only required to have alignment on a 16 byte - Since the files are only required to have alignment on a 16 byte
boundary, it is currently possibly suboptimal to read or execute files boundary, it is currently possibly suboptimal to read or execute files
from the filesystem. It might be resolved by reordering file data to from the filesystem. It might be resolved by reordering file data to
have most of it (i.e. except the start and the end) lying at "natural" have most of it (i.e. except the start and the end) lying at "natural"
boundaries, thus it would be possible to directly map a big portion of boundaries, thus it would be possible to directly map a big portion of
the file contents to the mm subsystem. the file contents to the mm subsystem.
- Compression might be a useful feature, but memory is quite a - Compression might be a useful feature, but memory is quite a
limiting factor in my eyes. limiting factor in my eyes.
- Where is it used? - Where is it used?
...@@ -183,4 +190,5 @@ limiting factor in my eyes. ...@@ -183,4 +190,5 @@ limiting factor in my eyes.
Have fun, Have fun,
Janos Farkas <chexum@shadow.banki.hu> Janos Farkas <chexum@shadow.banki.hu>
SQUASHFS 4.0 FILESYSTEM .. SPDX-License-Identifier: GPL-2.0
=======================
Squashfs 4.0 Filesystem
======================= =======================
Squashfs is a compressed read-only filesystem for Linux. Squashfs is a compressed read-only filesystem for Linux.
It uses zlib, lz4, lzo, or xz compression to compress files, inodes and It uses zlib, lz4, lzo, or xz compression to compress files, inodes and
directories. Inodes in the system are very small and all blocks are packed to directories. Inodes in the system are very small and all blocks are packed to
minimise data overhead. Block sizes greater than 4K are supported up to a minimise data overhead. Block sizes greater than 4K are supported up to a
...@@ -15,31 +19,33 @@ needed. ...@@ -15,31 +19,33 @@ needed.
Mailing list: squashfs-devel@lists.sourceforge.net Mailing list: squashfs-devel@lists.sourceforge.net
Web site: www.squashfs.org Web site: www.squashfs.org
1. FILESYSTEM FEATURES 1. Filesystem Features
---------------------- ----------------------
Squashfs filesystem features versus Cramfs: Squashfs filesystem features versus Cramfs:
============================== ========= ==========
Squashfs Cramfs Squashfs Cramfs
============================== ========= ==========
Max filesystem size: 2^64 256 MiB Max filesystem size 2^64 256 MiB
Max file size: ~ 2 TiB 16 MiB Max file size ~ 2 TiB 16 MiB
Max files: unlimited unlimited Max files unlimited unlimited
Max directories: unlimited unlimited Max directories unlimited unlimited
Max entries per directory: unlimited unlimited Max entries per directory unlimited unlimited
Max block size: 1 MiB 4 KiB Max block size 1 MiB 4 KiB
Metadata compression: yes no Metadata compression yes no
Directory indexes: yes no Directory indexes yes no
Sparse file support: yes no Sparse file support yes no
Tail-end packing (fragments): yes no Tail-end packing (fragments) yes no
Exportable (NFS etc.): yes no Exportable (NFS etc.) yes no
Hard link support: yes no Hard link support yes no
"." and ".." in readdir: yes no "." and ".." in readdir yes no
Real inode numbers: yes no Real inode numbers yes no
32-bit uids/gids: yes no 32-bit uids/gids yes no
File creation time: yes no File creation time yes no
Xattr support: yes no Xattr support yes no
ACL support: no no ACL support no no
============================== ========= ==========
Squashfs compresses data, inodes and directories. In addition, inode and Squashfs compresses data, inodes and directories. In addition, inode and
directory data are highly compacted, and packed on byte boundaries. Each directory data are highly compacted, and packed on byte boundaries. Each
...@@ -47,7 +53,7 @@ compressed inode is on average 8 bytes in length (the exact length varies on ...@@ -47,7 +53,7 @@ compressed inode is on average 8 bytes in length (the exact length varies on
file type, i.e. regular file, directory, symbolic link, and block/char device file type, i.e. regular file, directory, symbolic link, and block/char device
inodes have different sizes). inodes have different sizes).
2. USING SQUASHFS 2. Using Squashfs
----------------- -----------------
As squashfs is a read-only filesystem, the mksquashfs program must be used to As squashfs is a read-only filesystem, the mksquashfs program must be used to
...@@ -58,11 +64,11 @@ obtained from this site also. ...@@ -58,11 +64,11 @@ obtained from this site also.
The squashfs-tools development tree is now located on kernel.org The squashfs-tools development tree is now located on kernel.org
git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git
3. SQUASHFS FILESYSTEM DESIGN 3. Squashfs Filesystem Design
----------------------------- -----------------------------
A squashfs filesystem consists of a maximum of nine parts, packed together on a A squashfs filesystem consists of a maximum of nine parts, packed together on a
byte alignment: byte alignment::
--------------- ---------------
| superblock | | superblock |
...@@ -229,15 +235,15 @@ location of the xattr list inside each inode, a 32-bit xattr id ...@@ -229,15 +235,15 @@ location of the xattr list inside each inode, a 32-bit xattr id
is stored. This xattr id is mapped into the location of the xattr is stored. This xattr id is mapped into the location of the xattr
list using a second xattr id lookup table. list using a second xattr id lookup table.
4. TODOS AND OUTSTANDING ISSUES 4. TODOs and Outstanding Issues
------------------------------- -------------------------------
4.1 Todo list 4.1 TODO list
------------- -------------
Implement ACL support. Implement ACL support.
4.2 Squashfs internal cache 4.2 Squashfs Internal Cache
--------------------------- ---------------------------
Blocks in Squashfs are compressed. To avoid repeatedly decompressing Blocks in Squashfs are compressed. To avoid repeatedly decompressing
......
.. SPDX-License-Identifier: GPL-2.0
sysfs - _The_ filesystem for exporting kernel objects. =====================================================
sysfs - _The_ filesystem for exporting kernel objects
=====================================================
Patrick Mochel <mochel@osdl.org> Patrick Mochel <mochel@osdl.org>
Mike Murphy <mamurph@cs.clemson.edu> Mike Murphy <mamurph@cs.clemson.edu>
Revised: 16 August 2011 :Revised: 16 August 2011
Original: 10 January 2003 :Original: 10 January 2003
What it is: What it is:
...@@ -24,7 +28,7 @@ Using sysfs ...@@ -24,7 +28,7 @@ Using sysfs
~~~~~~~~~~~ ~~~~~~~~~~~
sysfs is always compiled in if CONFIG_SYSFS is defined. You can access sysfs is always compiled in if CONFIG_SYSFS is defined. You can access
it by doing: it by doing::
mount -t sysfs sysfs /sys mount -t sysfs sysfs /sys
...@@ -65,17 +69,17 @@ formatting of data is heavily frowned upon. Doing these things may get ...@@ -65,17 +69,17 @@ formatting of data is heavily frowned upon. Doing these things may get
you publicly humiliated and your code rewritten without notice. you publicly humiliated and your code rewritten without notice.
An attribute definition is simply: An attribute definition is simply::
struct attribute { struct attribute {
char * name; char * name;
struct module *owner; struct module *owner;
umode_t mode; umode_t mode;
}; };
int sysfs_create_file(struct kobject * kobj, const struct attribute * attr); int sysfs_create_file(struct kobject * kobj, const struct attribute * attr);
void sysfs_remove_file(struct kobject * kobj, const struct attribute * attr); void sysfs_remove_file(struct kobject * kobj, const struct attribute * attr);
A bare attribute contains no means to read or write the value of the A bare attribute contains no means to read or write the value of the
...@@ -83,38 +87,38 @@ attribute. Subsystems are encouraged to define their own attribute ...@@ -83,38 +87,38 @@ attribute. Subsystems are encouraged to define their own attribute
structure and wrapper functions for adding and removing attributes for structure and wrapper functions for adding and removing attributes for
a specific object type. a specific object type.
For example, the driver model defines struct device_attribute like: For example, the driver model defines struct device_attribute like::
struct device_attribute { struct device_attribute {
struct attribute attr; struct attribute attr;
ssize_t (*show)(struct device *dev, struct device_attribute *attr, ssize_t (*show)(struct device *dev, struct device_attribute *attr,
char *buf); char *buf);
ssize_t (*store)(struct device *dev, struct device_attribute *attr, ssize_t (*store)(struct device *dev, struct device_attribute *attr,
const char *buf, size_t count); const char *buf, size_t count);
}; };
int device_create_file(struct device *, const struct device_attribute *); int device_create_file(struct device *, const struct device_attribute *);
void device_remove_file(struct device *, const struct device_attribute *); void device_remove_file(struct device *, const struct device_attribute *);
It also defines this helper for defining device attributes: It also defines this helper for defining device attributes::
#define DEVICE_ATTR(_name, _mode, _show, _store) \ #define DEVICE_ATTR(_name, _mode, _show, _store) \
struct device_attribute dev_attr_##_name = __ATTR(_name, _mode, _show, _store) struct device_attribute dev_attr_##_name = __ATTR(_name, _mode, _show, _store)
For example, declaring For example, declaring::
static DEVICE_ATTR(foo, S_IWUSR | S_IRUGO, show_foo, store_foo); static DEVICE_ATTR(foo, S_IWUSR | S_IRUGO, show_foo, store_foo);
is equivalent to doing: is equivalent to doing::
static struct device_attribute dev_attr_foo = { static struct device_attribute dev_attr_foo = {
.attr = { .attr = {
.name = "foo", .name = "foo",
.mode = S_IWUSR | S_IRUGO, .mode = S_IWUSR | S_IRUGO,
}, },
.show = show_foo, .show = show_foo,
.store = store_foo, .store = store_foo,
}; };
Note as stated in include/linux/kernel.h "OTHER_WRITABLE? Generally Note as stated in include/linux/kernel.h "OTHER_WRITABLE? Generally
considered a bad idea." so trying to set a sysfs file writable for considered a bad idea." so trying to set a sysfs file writable for
...@@ -127,15 +131,21 @@ readable. The above case could be shortened to: ...@@ -127,15 +131,21 @@ readable. The above case could be shortened to:
static struct device_attribute dev_attr_foo = __ATTR_RW(foo); static struct device_attribute dev_attr_foo = __ATTR_RW(foo);
The list of helpers available to define your wrapper function is: The list of helpers available to define your wrapper function is:
__ATTR_RO(name): assumes default name_show and mode 0444
__ATTR_WO(name): assumes a name_store only and is restricted to mode __ATTR_RO(name):
assumes default name_show and mode 0444
__ATTR_WO(name):
assumes a name_store only and is restricted to mode
0200, that is, root write access only. 0200, that is, root write access only.
__ATTR_RO_MODE(name, mode): for more restrictive RO access; currently __ATTR_RO_MODE(name, mode):
for more restrictive RO access; currently
the only use case is the EFI System Resource Table the only use case is the EFI System Resource Table
(see drivers/firmware/efi/esrt.c) (see drivers/firmware/efi/esrt.c)
__ATTR_RW(name): assumes default name_show, name_store and setting __ATTR_RW(name):
assumes default name_show, name_store and setting
mode to 0644. mode to 0644.
__ATTR_NULL: which sets the name to NULL and is used as an end-of-list __ATTR_NULL:
which sets the name to NULL and is used as an end-of-list
indicator (see: kernel/workqueue.c) indicator (see: kernel/workqueue.c)
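The naming convention behind these helpers can be reproduced in plain C. The sketch below uses hypothetical stand-ins (`struct my_attribute`, `struct my_device_attribute`, `MY_ATTR_RO`, `MY_ATTR_RW`) rather than the kernel's own definitions, but follows the same pattern: the attribute name is stringified and the callbacks are found by pasting _show/_store onto it, with modes 0444 and 0644 as listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Userspace stand-ins; the kernel's structures carry more fields. */
struct my_attribute {
	const char *name;
	unsigned int mode;
};

struct my_device_attribute {
	struct my_attribute attr;
	int (*show)(char *buf, size_t len);
	int (*store)(const char *buf, size_t count);
};

/* Like __ATTR_RO/__ATTR_RW: the name is stringified and the callbacks
 * are found by pasting _show/_store onto it. */
#define MY_ATTR_RO(_name) \
	{ .attr = { .name = #_name, .mode = 0444 }, .show = _name##_show }
#define MY_ATTR_RW(_name)					\
	{ .attr = { .name = #_name, .mode = 0644 },		\
	  .show = _name##_show, .store = _name##_store }

static int foo_value = 7;	/* the value exported as "foo" */

static int foo_show(char *buf, size_t len)
{
	return snprintf(buf, len, "%d\n", foo_value);
}

static int foo_store(const char *buf, size_t count)
{
	assert(buf != NULL && count > 0);
	return sscanf(buf, "%d", &foo_value) == 1 ? 0 : -1;
}

static struct my_device_attribute dev_attr_foo = MY_ATTR_RW(foo);
```

Declaring `dev_attr_foo` with MY_ATTR_RW(foo) fails to compile unless both foo_show() and foo_store() exist, which is the same convention the kernel helpers rely on.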
Subsystem-Specific Callbacks Subsystem-Specific Callbacks
...@@ -143,12 +153,12 @@ Subsystem-Specific Callbacks ...@@ -143,12 +153,12 @@ Subsystem-Specific Callbacks
When a subsystem defines a new attribute type, it must implement a When a subsystem defines a new attribute type, it must implement a
set of sysfs operations for forwarding read and write calls to the set of sysfs operations for forwarding read and write calls to the
show and store methods of the attribute owners. show and store methods of the attribute owners::
struct sysfs_ops { struct sysfs_ops {
ssize_t (*show)(struct kobject *, struct attribute *, char *); ssize_t (*show)(struct kobject *, struct attribute *, char *);
ssize_t (*store)(struct kobject *, struct attribute *, const char *, size_t); ssize_t (*store)(struct kobject *, struct attribute *, const char *, size_t);
}; };
[ Subsystems should have already defined a struct kobj_type as a [ Subsystems should have already defined a struct kobj_type as a
descriptor for this type, which is where the sysfs_ops pointer is descriptor for this type, which is where the sysfs_ops pointer is
...@@ -160,14 +170,14 @@ and struct attribute pointers to the appropriate pointer types, and ...@@ -160,14 +170,14 @@ and struct attribute pointers to the appropriate pointer types, and
calls the associated methods. calls the associated methods.
To illustrate: To illustrate::
#define to_dev(obj) container_of(obj, struct device, kobj) #define to_dev(obj) container_of(obj, struct device, kobj)
#define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr) #define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr)
static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr, static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr,
char *buf) char *buf)
{ {
struct device_attribute *dev_attr = to_dev_attr(attr); struct device_attribute *dev_attr = to_dev_attr(attr);
struct device *dev = to_dev(kobj); struct device *dev = to_dev(kobj);
ssize_t ret = -EIO; ssize_t ret = -EIO;
...@@ -179,7 +189,7 @@ static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr, ...@@ -179,7 +189,7 @@ static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr,
dev_attr->show); dev_attr->show);
} }
return ret; return ret;
} }
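The demultiplexing step can be tried outside the kernel. The sketch below uses simplified userspace stand-ins for struct kobject, struct attribute, struct device and struct device_attribute (the real ones have many more fields, and only show() is modelled), a userspace container_of() built on offsetof(), and an assumed 32-byte output buffer where the kernel passes a page.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>	/* ssize_t */

#define SHOW_BUF_LEN 32	/* assumed buffer size; the kernel passes a page */

/* Userspace container_of: step back from a member to its container. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified stand-ins for the driver-model structures. */
struct kobject { const char *name; };
struct attribute { const char *name; };

struct device {
	int id;
	struct kobject kobj;		/* embedded generic object */
};

struct device_attribute {
	struct attribute attr;		/* embedded generic attribute */
	ssize_t (*show)(struct device *dev, struct device_attribute *attr,
			char *buf);
};

#define to_dev(obj) container_of(obj, struct device, kobj)
#define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr)

/* Generic show entry point: it receives only the embedded members and
 * recovers the typed containers before calling the attribute's method. */
static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr,
			     char *buf)
{
	struct device_attribute *dev_attr;
	struct device *dev;

	assert(kobj != NULL && attr != NULL);
	dev_attr = to_dev_attr(attr);
	dev = to_dev(kobj);
	if (!dev_attr->show)
		return -EIO;
	return dev_attr->show(dev, dev_attr, buf);
}

/* An example attribute method that prints the device id. */
static ssize_t show_id(struct device *dev, struct device_attribute *attr,
		       char *buf)
{
	(void)attr;
	return snprintf(buf, SHOW_BUF_LEN, "%d\n", dev->id);
}
```

Because the kobject and the attribute are embedded by value, container_of() is plain pointer arithmetic; the trick would not work if a subsystem stored pointers to them instead.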
...@@ -188,10 +198,10 @@ Reading/Writing Attribute Data ...@@ -188,10 +198,10 @@ Reading/Writing Attribute Data
To read or write attributes, show() or store() methods must be To read or write attributes, show() or store() methods must be
specified when declaring the attribute. The method types should be as specified when declaring the attribute. The method types should be as
simple as those defined for device attributes: simple as those defined for device attributes::
ssize_t (*show)(struct device *dev, struct device_attribute *attr, char *buf); ssize_t (*show)(struct device *dev, struct device_attribute *attr, char *buf);
ssize_t (*store)(struct device *dev, struct device_attribute *attr, ssize_t (*store)(struct device *dev, struct device_attribute *attr,
const char *buf, size_t count); const char *buf, size_t count);
In other words, they should take only an object, an attribute, and a buffer as parameters. In other words, they should take only an object, an attribute, and a buffer as parameters.
...@@ -251,23 +261,23 @@ Other notes: ...@@ -251,23 +261,23 @@ Other notes:
sure to have a way to check this, if necessary. sure to have a way to check this, if necessary.
A very simple (and naive) implementation of a device attribute is: A very simple (and naive) implementation of a device attribute is::
static ssize_t show_name(struct device *dev, struct device_attribute *attr, static ssize_t show_name(struct device *dev, struct device_attribute *attr,
char *buf) char *buf)
{ {
return scnprintf(buf, PAGE_SIZE, "%s\n", dev->name); return scnprintf(buf, PAGE_SIZE, "%s\n", dev->name);
} }
static ssize_t store_name(struct device *dev, struct device_attribute *attr, static ssize_t store_name(struct device *dev, struct device_attribute *attr,
const char *buf, size_t count) const char *buf, size_t count)
{ {
snprintf(dev->name, sizeof(dev->name), "%.*s", snprintf(dev->name, sizeof(dev->name), "%.*s",
(int)min(count, sizeof(dev->name) - 1), buf); (int)min(count, sizeof(dev->name) - 1), buf);
return count; return count;
} }
static DEVICE_ATTR(name, S_IRUGO, show_name, store_name); static DEVICE_ATTR(name, S_IRUGO, show_name, store_name);
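The naive pair above can be exercised in userspace. This sketch keeps the same store_name() logic but uses a hypothetical `struct fake_dev` with an 8-byte name and a caller-supplied buffer length for show. It illustrates two properties of the naive version: long input is silently truncated, and a trailing newline (as written by `echo`) is stored as part of the name.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>	/* ssize_t */

#define min(a, b) ((a) < (b) ? (a) : (b))

/* Hypothetical device with a small fixed-size name buffer. */
struct fake_dev {
	char name[8];
};

static ssize_t show_name(struct fake_dev *dev, char *buf, size_t len)
{
	return snprintf(buf, len, "%s\n", dev->name);
}

static ssize_t store_name(struct fake_dev *dev, const char *buf, size_t count)
{
	assert(dev != NULL && buf != NULL);
	/* Copy at most sizeof(name) - 1 bytes; snprintf always terminates. */
	snprintf(dev->name, sizeof(dev->name), "%.*s",
		 (int)min(count, sizeof(dev->name) - 1), buf);
	return count;	/* the whole write is reported as consumed */
}
```

Writing "xy\n" and reading it back yields "xy\n\n", which is one reason a real implementation would strip the newline before storing.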
(Note that the real implementation doesn't allow userspace to set the (Note that the real implementation doesn't allow userspace to set the
...@@ -280,23 +290,23 @@ Top Level Directory Layout ...@@ -280,23 +290,23 @@ Top Level Directory Layout
The sysfs directory arrangement exposes the relationship of kernel The sysfs directory arrangement exposes the relationship of kernel
data structures. data structures.
The top level sysfs directory looks like: The top level sysfs directory looks like::
block/ block/
bus/ bus/
class/ class/
dev/ dev/
devices/ devices/
firmware/ firmware/
net/ net/
fs/ fs/
devices/ contains a filesystem representation of the device tree. It maps devices/ contains a filesystem representation of the device tree. It maps
directly to the internal kernel device tree, which is a hierarchy of directly to the internal kernel device tree, which is a hierarchy of
struct device. struct device.
bus/ contains a flat directory layout of the various bus types in the bus/ contains a flat directory layout of the various bus types in the
kernel. Each bus's directory contains two subdirectories: kernel. Each bus's directory contains two subdirectories::
devices/ devices/
drivers/ drivers/
...@@ -331,71 +341,71 @@ Current Interfaces ...@@ -331,71 +341,71 @@ Current Interfaces
The following interface layers currently exist in sysfs: The following interface layers currently exist in sysfs:
- devices (include/linux/device.h) devices (include/linux/device.h)
---------------------------------- --------------------------------
Structure: Structure::
struct device_attribute { struct device_attribute {
struct attribute attr; struct attribute attr;
ssize_t (*show)(struct device *dev, struct device_attribute *attr, ssize_t (*show)(struct device *dev, struct device_attribute *attr,
char *buf); char *buf);
ssize_t (*store)(struct device *dev, struct device_attribute *attr, ssize_t (*store)(struct device *dev, struct device_attribute *attr,
const char *buf, size_t count); const char *buf, size_t count);
}; };
Declaring: Declaring::
DEVICE_ATTR(_name, _mode, _show, _store); DEVICE_ATTR(_name, _mode, _show, _store);
Creation/Removal: Creation/Removal::
int device_create_file(struct device *dev, const struct device_attribute * attr); int device_create_file(struct device *dev, const struct device_attribute * attr);
void device_remove_file(struct device *dev, const struct device_attribute * attr); void device_remove_file(struct device *dev, const struct device_attribute * attr);
- bus drivers (include/linux/device.h) bus drivers (include/linux/device.h)
-------------------------------------- ------------------------------------
Structure: Structure::
struct bus_attribute { struct bus_attribute {
struct attribute attr; struct attribute attr;
ssize_t (*show)(struct bus_type *, char * buf); ssize_t (*show)(struct bus_type *, char * buf);
ssize_t (*store)(struct bus_type *, const char * buf, size_t count); ssize_t (*store)(struct bus_type *, const char * buf, size_t count);
}; };
Declaring: Declaring::
static BUS_ATTR_RW(name); static BUS_ATTR_RW(name);
static BUS_ATTR_RO(name); static BUS_ATTR_RO(name);
static BUS_ATTR_WO(name); static BUS_ATTR_WO(name);
Creation/Removal: Creation/Removal::
int bus_create_file(struct bus_type *, struct bus_attribute *); int bus_create_file(struct bus_type *, struct bus_attribute *);
void bus_remove_file(struct bus_type *, struct bus_attribute *); void bus_remove_file(struct bus_type *, struct bus_attribute *);
- device drivers (include/linux/device.h) device drivers (include/linux/device.h)
----------------------------------------- ---------------------------------------
Structure: Structure::
struct driver_attribute { struct driver_attribute {
struct attribute attr; struct attribute attr;
ssize_t (*show)(struct device_driver *, char * buf); ssize_t (*show)(struct device_driver *, char * buf);
ssize_t (*store)(struct device_driver *, const char * buf, ssize_t (*store)(struct device_driver *, const char * buf,
size_t count); size_t count);
}; };
Declaring: Declaring::
DRIVER_ATTR_RO(_name) DRIVER_ATTR_RO(_name)
DRIVER_ATTR_RW(_name) DRIVER_ATTR_RW(_name)
Creation/Removal: Creation/Removal::
int driver_create_file(struct device_driver *, const struct driver_attribute *); int driver_create_file(struct device_driver *, const struct driver_attribute *);
void driver_remove_file(struct device_driver *, const struct driver_attribute *); void driver_remove_file(struct device_driver *, const struct driver_attribute *);
Documentation Documentation
......
.. SPDX-License-Identifier: GPL-2.0
==================
SystemV Filesystem
==================
It implements all of It implements all of
- Xenix FS, - Xenix FS,
- SystemV/386 FS, - SystemV/386 FS,
- Coherent FS. - Coherent FS.
To install: To install:
* Answer the 'System V and Coherent filesystem support' question with 'y' * Answer the 'System V and Coherent filesystem support' question with 'y'
when configuring the kernel. when configuring the kernel.
* To mount a disk or a partition, use * To mount a disk or a partition, use::
mount [-r] -t sysv device mountpoint mount [-r] -t sysv device mountpoint
The file system type names
The file system type names::
-t sysv -t sysv
-t xenix -t xenix
-t coherent -t coherent
may be used interchangeably, but the last two will eventually disappear. may be used interchangeably, but the last two will eventually disappear.
Bugs in the present implementation: Bugs in the present implementation:
- Coherent FS: - Coherent FS:
- The "free list interleave" n:m is currently ignored. - The "free list interleave" n:m is currently ignored.
- Only file systems with no filesystem name and no pack name are recognized. - Only file systems with no filesystem name and no pack name are recognized.
(See Coherent "man mkfs" for a description of these features.) (See Coherent "man mkfs" for a description of these features.)
- SystemV Release 2 FS: - SystemV Release 2 FS:
The superblock is only searched for in blocks 9, 15 and 18, which The superblock is only searched for in blocks 9, 15 and 18, which
correspond to the beginning of track 1 on floppy disks. No support correspond to the beginning of track 1 on floppy disks. No support
for this FS on hard disk yet. for this FS on hard disk yet.
...@@ -28,12 +43,14 @@ Bugs in the present implementation: ...@@ -28,12 +43,14 @@ Bugs in the present implementation:
These filesystems are rather similar. Here is a comparison with Minix FS: These filesystems are rather similar. Here is a comparison with Minix FS:
* Linux fdisk reports on partitions * Linux fdisk reports on partitions
- Minix FS 0x81 Linux/Minix - Minix FS 0x81 Linux/Minix
- Xenix FS ?? - Xenix FS ??
- SystemV FS ?? - SystemV FS ??
- Coherent FS 0x08 AIX bootable - Coherent FS 0x08 AIX bootable
* Size of a block or zone (data allocation unit on disk) * Size of a block or zone (data allocation unit on disk)
- Minix FS 1024 - Minix FS 1024
- Xenix FS 1024 (also 512 ??) - Xenix FS 1024 (also 512 ??)
- SystemV FS 1024 (also 512 and 2048) - SystemV FS 1024 (also 512 and 2048)
...@@ -45,37 +62,51 @@ These filesystems are rather similar. Here is a comparison with Minix FS: ...@@ -45,37 +62,51 @@ These filesystems are rather similar. Here is a comparison with Minix FS:
all the block numbers (including the super block) are offset by one track. all the block numbers (including the super block) are offset by one track.
* Byte ordering of "short" (16 bit entities) on disk: * Byte ordering of "short" (16 bit entities) on disk:
- Minix FS little endian 0 1 - Minix FS little endian 0 1
- Xenix FS little endian 0 1 - Xenix FS little endian 0 1
- SystemV FS little endian 0 1 - SystemV FS little endian 0 1
- Coherent FS little endian 0 1 - Coherent FS little endian 0 1
Of course, this affects only the file system, not the data of files on it! Of course, this affects only the file system, not the data of files on it!
* Byte ordering of "long" (32 bit entities) on disk: * Byte ordering of "long" (32 bit entities) on disk:
- Minix FS little endian 0 1 2 3 - Minix FS little endian 0 1 2 3
- Xenix FS little endian 0 1 2 3 - Xenix FS little endian 0 1 2 3
- SystemV FS little endian 0 1 2 3 - SystemV FS little endian 0 1 2 3
- Coherent FS PDP-11 2 3 0 1 - Coherent FS PDP-11 2 3 0 1
Of course, this affects only the file system, not the data of files on it! Of course, this affects only the file system, not the data of files on it!
* Inode on disk: "short", 0 means non-existent, the root dir ino is: * Inode on disk: "short", 0 means non-existent, the root dir ino is:
- Minix FS 1
- Xenix FS, SystemV FS, Coherent FS 2 ================================= ==
Minix FS 1
Xenix FS, SystemV FS, Coherent FS 2
================================= ==
* Maximum number of hard links to a file: * Maximum number of hard links to a file:
- Minix FS 250
- Xenix FS ?? =========== =========
- SystemV FS ?? Minix FS 250
- Coherent FS >=10000 Xenix FS ??
SystemV FS ??
Coherent FS >=10000
=========== =========
* Free inode management: * Free inode management:
- Minix FS a bitmap
- Minix FS
a bitmap
- Xenix FS, SystemV FS, Coherent FS - Xenix FS, SystemV FS, Coherent FS
There is a cache of a certain number of free inodes in the super-block. There is a cache of a certain number of free inodes in the super-block.
When it is exhausted, new free inodes are found using a linear search. When it is exhausted, new free inodes are found using a linear search.
* Free block management: * Free block management:
- Minix FS a bitmap
- Minix FS
a bitmap
- Xenix FS, SystemV FS, Coherent FS - Xenix FS, SystemV FS, Coherent FS
Free blocks are organized in a "free list". Maybe a misleading term, Free blocks are organized in a "free list". Maybe a misleading term,
since it is not true that every free block contains a pointer to since it is not true that every free block contains a pointer to
...@@ -86,13 +117,18 @@ These filesystems are rather similar. Here is a comparison with Minix FS: ...@@ -86,13 +117,18 @@ These filesystems are rather similar. Here is a comparison with Minix FS:
0 on Xenix FS and SystemV FS, with a block zeroed out on Coherent FS. 0 on Xenix FS and SystemV FS, with a block zeroed out on Coherent FS.
* Super-block location: * Super-block location:
- Minix FS block 1 = bytes 1024..2047
- Xenix FS block 1 = bytes 1024..2047 =========== ==========================
- SystemV FS bytes 512..1023 Minix FS block 1 = bytes 1024..2047
- Coherent FS block 1 = bytes 512..1023 Xenix FS block 1 = bytes 1024..2047
SystemV FS bytes 512..1023
Coherent FS block 1 = bytes 512..1023
=========== ==========================
* Super-block layout: * Super-block layout:
- Minix FS
- Minix FS::
unsigned short s_ninodes; unsigned short s_ninodes;
unsigned short s_nzones; unsigned short s_nzones;
unsigned short s_imap_blocks; unsigned short s_imap_blocks;
@@ -101,7 +137,9 @@ These filesystems are rather similar. Here is a comparison with Minix FS:
             unsigned short s_log_zone_size;
             unsigned long s_max_size;
             unsigned short s_magic;
-  - Xenix FS, SystemV FS, Coherent FS
+  - Xenix FS, SystemV FS, Coherent FS::
             unsigned short s_firstdatazone;
             unsigned long s_nzones;
             unsigned short s_fzone_count;
@@ -120,23 +158,33 @@ These filesystems are rather similar. Here is a comparison with Minix FS:
             unsigned short s_interleave_m,s_interleave_n; -- Coherent FS only
             char s_fname[6];
             char s_fpack[6];
       then they differ considerably:
-          Xenix FS
+          Xenix FS::
             char s_clean;
             char s_fill[371];
             long s_magic;
             long s_type;
-          SystemV FS
+          SystemV FS::
             long s_fill[12 or 14];
             long s_state;
             long s_magic;
             long s_type;
-          Coherent FS
+          Coherent FS::
             unsigned long s_unique;
       Note that Coherent FS has no magic.
 * Inode layout:
-  - Minix FS
+  - Minix FS::
             unsigned short i_mode;
             unsigned short i_uid;
             unsigned long i_size;
@@ -144,7 +192,9 @@ These filesystems are rather similar. Here is a comparison with Minix FS:
             unsigned char i_gid;
             unsigned char i_nlinks;
             unsigned short i_zone[7+1+1];
-  - Xenix FS, SystemV FS, Coherent FS
+  - Xenix FS, SystemV FS, Coherent FS::
             unsigned short i_mode;
             unsigned short i_nlink;
             unsigned short i_uid;
@@ -155,38 +205,55 @@ These filesystems are rather similar. Here is a comparison with Minix FS:
             unsigned long i_mtime;
             unsigned long i_ctime;
 * Regular file data blocks are organized as
-  - Minix FS
-             7 direct blocks
-             1 indirect block (pointers to blocks)
-             1 double-indirect block (pointer to pointers to blocks)
-  - Xenix FS, SystemV FS, Coherent FS
-             10 direct blocks
-             1 indirect block (pointers to blocks)
-             1 double-indirect block (pointer to pointers to blocks)
-             1 triple-indirect block (pointer to pointers to pointers to blocks)
-* Inode size, inodes per block
-  - Minix FS        32        32
-  - Xenix FS        64        16
-  - SystemV FS      64        16
-  - Coherent FS     64        8
+  - Minix FS:
+             - 7 direct blocks
+             - 1 indirect block (pointers to blocks)
+             - 1 double-indirect block (pointer to pointers to blocks)
+  - Xenix FS, SystemV FS, Coherent FS:
+             - 10 direct blocks
+             - 1 indirect block (pointers to blocks)
+             - 1 double-indirect block (pointer to pointers to blocks)
+             - 1 triple-indirect block (pointer to pointers to pointers to blocks)
+  =========== ========== ================
+              Inode size inodes per block
+  =========== ========== ================
+  Minix FS    32         32
+  Xenix FS    64         16
+  SystemV FS  64         16
+  Coherent FS 64         8
+  =========== ========== ================
 * Directory entry on disk
-  - Minix FS
+  - Minix FS::
             unsigned short inode;
             char name[14/30];
-  - Xenix FS, SystemV FS, Coherent FS
+  - Xenix FS, SystemV FS, Coherent FS::
             unsigned short inode;
             char name[14];
-* Dir entry size, dir entries per block
-  - Minix FS        16/32    64/32
-  - Xenix FS        16       64
-  - SystemV FS      16       64
-  - Coherent FS     16       32
+  =========== ============== =====================
+              Dir entry size dir entries per block
+  =========== ============== =====================
+  Minix FS    16/32          64/32
+  Xenix FS    16             64
+  SystemV FS  16             64
+  Coherent FS 16             32
+  =========== ============== =====================
 * How to implement symbolic links such that the host fsck doesn't scream:
   - Minix FS     normal
   - Xenix FS     kludge: as regular files with chmod 1000
   - SystemV FS   ??
......
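The sysv comparison above gives the block-addressing scheme of each filesystem and its per-block counts. The C below is a minimal sketch, not kernel code, of how a logical file block maps onto those direct and indirect slots, plus the per-block arithmetic behind both tables. The names `per_block()` and `block_to_path()` are hypothetical. The block sizes (1024 bytes, and 512 bytes for Coherent FS) are inferred from the table figures rather than stated in them, and the Minix FS case assumes 16-bit block numbers inside indirect blocks, matching the `unsigned short i_zone[7+1+1]` declaration.

```c
#include <assert.h>
#include <stddef.h>

/* How many fixed-size records (inodes, directory entries, block
   numbers) fit in one block. */
static size_t per_block(size_t block_size, size_t record_size)
{
    assert(record_size > 0);
    return block_size / record_size;
}

/* Map logical block 'blk' of a regular file onto the inode's block list.
     ndirect: direct slots (7 on Minix FS, 10 on Xenix/SystemV/Coherent FS)
     levels:  indirection levels (2 on Minix FS, 3 on the others)
     ptrs:    block numbers held by one indirect block
   Returns the depth (0 = direct block) and fills path[0] with the slot in
   the inode and path[1..depth] with the index into each indirect block,
   outermost first. Returns -1 if the inode cannot address 'blk'. */
static int block_to_path(unsigned long blk, unsigned ndirect, unsigned levels,
                         unsigned long ptrs, unsigned long path[4])
{
    assert(levels <= 3 && ptrs > 0);
    if (blk < ndirect) {
        path[0] = blk;
        return 0;
    }
    blk -= ndirect;
    unsigned long span = 1;
    for (unsigned depth = 1; depth <= levels; depth++) {
        span *= ptrs;                      /* blocks reachable at this depth */
        if (blk < span) {
            path[0] = ndirect + depth - 1; /* indirect slots follow the directs */
            for (unsigned i = depth; i >= 1; i--) {
                path[i] = blk % ptrs;      /* fill the innermost index first */
                blk /= ptrs;
            }
            return (int)depth;
        }
        blk -= span;
    }
    return -1;
}
```

Under those assumptions a Minix FS indirect block holds 1024 / 2 = 512 block numbers, so an inode addresses 7 + 512 + 512 * 512 = 262663 blocks and logical block 262662 is the last one `block_to_path()` accepts.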
+.. SPDX-License-Identifier: GPL-2.0
+=====
+Tmpfs
+=====
 Tmpfs is a file system which keeps all files in virtual memory.
@@ -34,7 +40,7 @@ tmpfs has the following uses:
 2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for
    POSIX shared memory (shm_open, shm_unlink). Adding the following
-   line to /etc/fstab should take care of this:
+   line to /etc/fstab should take care of this::
        tmpfs /dev/shm tmpfs defaults 0 0
@@ -56,15 +62,17 @@ tmpfs has the following uses:
 tmpfs has three mount options for sizing:
-size:      The limit of allocated bytes for this tmpfs instance. The
+========= ============================================================
+size      The limit of allocated bytes for this tmpfs instance. The
           default is half of your physical RAM without swap. If you
           oversize your tmpfs instances the machine will deadlock
           since the OOM handler will not be able to free that memory.
-nr_blocks: The same as size, but in blocks of PAGE_SIZE.
-nr_inodes: The maximum number of inodes for this instance. The default
+nr_blocks The same as size, but in blocks of PAGE_SIZE.
+nr_inodes The maximum number of inodes for this instance. The default
           is half of the number of your physical RAM pages, or (on a
           machine with highmem) the number of lowmem RAM pages,
           whichever is the lower.
+========= ============================================================
 These parameters accept a suffix k, m or g for kilo, mega and giga and
 can be changed on remount. The size parameter also accepts a suffix %
@@ -82,6 +90,7 @@ tmpfs has a mount option to set the NUMA memory allocation policy for
 all files in that instance (if CONFIG_NUMA is enabled) - which can be
 adjusted on the fly via 'mount -o remount ...'
+======================== ==============================================
 mpol=default             use the process allocation policy
                          (see set_mempolicy(2))
 mpol=prefer:Node         prefers to allocate memory from the given Node
@@ -89,6 +98,7 @@ mpol=bind:NodeList allocates memory only from nodes in NodeList
 mpol=interleave          prefers to allocate from each node in turn
 mpol=interleave:NodeList allocates from each node of NodeList in turn
 mpol=local               prefers to allocate memory from the local node
+======================== ==============================================
 NodeList format is a comma-separated list of decimal numbers and ranges,
 a range being two hyphen-separated decimal numbers, the smallest and
@@ -98,9 +108,9 @@ A memory policy with a valid NodeList will be saved, as specified, for
 use at file creation time. When a task allocates a file in the file
 system, the mount option memory policy will be applied with a NodeList,
 if any, modified by the calling task's cpuset constraints
-[See Documentation/admin-guide/cgroup-v1/cpusets.rst] and any optional flags, listed
-below. If the resulting NodeLists is the empty set, the effective memory
-policy for the file will revert to "default" policy.
+[See Documentation/admin-guide/cgroup-v1/cpusets.rst] and any optional flags,
+listed below. If the resulting NodeLists is the empty set, the effective
+memory policy for the file will revert to "default" policy.
 NUMA memory allocation policies have optional flags that can be used in
 conjunction with their modes. These optional flags can be specified
@@ -109,6 +119,8 @@ See Documentation/admin-guide/mm/numa_memory_policy.rst for a list of
 all available memory allocation policy mode flags and their effect on
 memory policy.
+::
     =static       is equivalent to    MPOL_F_STATIC_NODES
     =relative     is equivalent to    MPOL_F_RELATIVE_NODES
@@ -128,9 +140,11 @@ on MountPoint, by 'mount -o remount,mpol=Policy:NodeList MountPoint'.
 To specify the initial root directory you can use the following mount
 options:
-mode:      The permissions as an octal number
-uid:       The user id
-gid:       The group id
+==== ==================================
+mode The permissions as an octal number
+uid  The user id
+gid  The group id
+==== ==================================
 These options do not have any effect on remount. You can change these
 parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem.
@@ -141,9 +155,9 @@ will give you tmpfs instance on /mytmpfs which can allocate 10GB
 RAM/SWAP in 10240 inodes and it is only accessible by root.
-Author:
+:Author:
    Christoph Rohland <cr@sap.com>, 1.12.01
-Updated:
+:Updated:
    Hugh Dickins, 4 June 2007
-Updated:
+:Updated:
    KOSAKI Motohiro, 16 Mar 2010
+.. SPDX-License-Identifier: GPL-2.0
 :orphan:
 .. UBIFS Authentication
@@ -92,11 +94,11 @@ UBIFS Index & Tree Node Cache
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Basic on-flash UBIFS entities are called *nodes*. UBIFS knows different types
-of nodes. Eg. data nodes (`struct ubifs_data_node`) which store chunks of file
-contents or inode nodes (`struct ubifs_ino_node`) which represent VFS inodes.
-Almost all types of nodes share a common header (`ubifs_ch`) containing basic
+of nodes. Eg. data nodes (``struct ubifs_data_node``) which store chunks of file
+contents or inode nodes (``struct ubifs_ino_node``) which represent VFS inodes.
+Almost all types of nodes share a common header (``ubifs_ch``) containing basic
 information like node type, node length, a sequence number, etc. (see
-`fs/ubifs/ubifs-media.h`in kernel source). Exceptions are entries of the LPT
+``fs/ubifs/ubifs-media.h`` in kernel source). Exceptions are entries of the LPT
 and some less important node types like padding nodes which are used to pad
 unusable content at the end of LEBs.
......
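The UBIFS authentication excerpt above says almost all node types share the common header `ubifs_ch` with the node type, node length and a sequence number. As a hedged sketch, the C below decodes such a header from a raw little-endian buffer. The field order and widths (magic, crc, sqnum, len, node_type, group_type, two padding bytes, 24 bytes in total) are our reading of `struct ubifs_ch` in fs/ubifs/ubifs-media.h and should be checked against the kernel tree in use; `struct ch_fields` and `decode_ch()` are hypothetical names, and the sketch verifies neither the magic number nor the CRC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded copy of the on-flash common header, in host byte order. */
struct ch_fields {
    uint32_t magic;
    uint32_t crc;
    uint64_t sqnum;      /* sequence number */
    uint32_t len;        /* full node length in bytes */
    uint8_t node_type;
    uint8_t group_type;
};

enum { CH_SIZE = 24 };   /* 4 + 4 + 8 + 4 + 1 + 1 + 2 padding bytes */

/* Read an n-byte little-endian integer. */
static uint64_t get_le(const uint8_t *p, size_t n)
{
    uint64_t v = 0;
    for (size_t i = n; i-- > 0;)
        v = (v << 8) | p[i];     /* the most significant byte comes last */
    return v;
}

/* Returns 0 on success, -1 if the buffer is too short to hold a header
   or the header claims a node length shorter than the header itself. */
static int decode_ch(const uint8_t *buf, size_t buflen, struct ch_fields *ch)
{
    if (buflen < CH_SIZE)
        return -1;
    ch->magic      = (uint32_t)get_le(buf + 0, 4);
    ch->crc        = (uint32_t)get_le(buf + 4, 4);
    ch->sqnum      = get_le(buf + 8, 8);
    ch->len        = (uint32_t)get_le(buf + 16, 4);
    ch->node_type  = buf[20];
    ch->group_type = buf[21];
    return ch->len < CH_SIZE ? -1 : 0;
}
```

Decoding explicitly byte by byte, rather than casting the buffer to a packed struct, keeps the sketch independent of host endianness and alignment.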
+.. SPDX-License-Identifier: GPL-2.0
+===============
+UBI File System
+===============
 Introduction
-=============
+============
 UBIFS file-system stands for UBI File System. UBI stands for "Unsorted
 Block Images". UBIFS is a flash file system, which means it is designed
@@ -79,6 +85,7 @@ Mount options
 (*) == default.
+==================== =======================================================
 bulk_read            read more in one go to take advantage of flash
                      media that read faster sequentially
 no_bulk_read (*)     do not bulk-read
@@ -98,6 +105,7 @@ auth_key= specify the key used for authenticating the filesystem.
 auth_hash_name=      The hash algorithm used for authentication. Used for
                      both hashing and for creating HMACs. Typical values
                      include "sha256" or "sha512"
+==================== =======================================================
 Quick usage instructions
@@ -107,12 +115,14 @@ The UBI volume to mount is specified using "ubiX_Y" or "ubiX:NAME" syntax,
 where "X" is UBI device number, "Y" is UBI volume number, and "NAME" is
 UBI volume name.
-Mount volume 0 on UBI device 0 to /mnt/ubifs:
-$ mount -t ubifs ubi0_0 /mnt/ubifs
+Mount volume 0 on UBI device 0 to /mnt/ubifs::
+    $ mount -t ubifs ubi0_0 /mnt/ubifs
 Mount "rootfs" volume of UBI device 0 to /mnt/ubifs ("rootfs" is volume
-name):
-$ mount -t ubifs ubi0:rootfs /mnt/ubifs
+name)::
+    $ mount -t ubifs ubi0:rootfs /mnt/ubifs
 The following is an example of the kernel boot arguments to attach mtd0
 to UBI and mount volume "rootfs":
@@ -122,5 +132,6 @@ References
 ==========
 UBIFS documentation and FAQ/HOWTO at the MTD web site:
-http://www.linux-mtd.infradead.org/doc/ubifs.html
-http://www.linux-mtd.infradead.org/faq/ubifs.html
+- http://www.linux-mtd.infradead.org/doc/ubifs.html
+- http://www.linux-mtd.infradead.org/faq/ubifs.html
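The UBIFS quick-usage text above names the volume to mount either as "ubiX_Y" (UBI device X, volume number Y) or as "ubiX:NAME". A small C sketch of splitting those two forms follows. `parse_ubi_spec()` and `struct ubi_vol_spec` are hypothetical; the sketch accepts only the two forms quoted above, and the kernel's own parser may accept others.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct ubi_vol_spec {
    long dev;          /* X: UBI device number */
    long vol;          /* Y: volume number, or -1 if selected by name */
    const char *name;  /* NAME, or NULL if selected by number */
};

/* Accepts "ubiX_Y" and "ubiX:NAME" only. */
static bool parse_ubi_spec(const char *s, struct ubi_vol_spec *out)
{
    char *end;

    if (strncmp(s, "ubi", 3) != 0 || !isdigit((unsigned char)s[3]))
        return false;
    long dev = strtol(s + 3, &end, 10);

    if (*end == '_' && isdigit((unsigned char)end[1])) {
        long vol = strtol(end + 1, &end, 10);
        if (*end != '\0')
            return false;              /* trailing junk after Y */
        *out = (struct ubi_vol_spec){ dev, vol, NULL };
        return true;
    }
    if (*end == ':' && end[1] != '\0') {
        *out = (struct ubi_vol_spec){ dev, -1, end + 1 };
        return true;
    }
    return false;
}
```

With this sketch, `ubi0_0` selects device 0, volume 0, and `ubi0:rootfs` selects device 0 with the volume named "rootfs", matching the two mount commands above.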
-*
-* Documentation/filesystems/udf.txt
-*
+.. SPDX-License-Identifier: GPL-2.0
+===============
+UDF file system
+===============
 If you encounter problems with reading UDF discs using this driver,
 please report them according to MAINTAINERS file.
@@ -18,8 +20,10 @@ performance due to very poor read-modify-write support supplied internally
 by drive firmware.
 -------------------------------------------------------------------------------
 The following mount options are supported:
+=========== ======================================
 gid=        Set the default group.
 umask=      Set the default umask.
 mode=       Set the default file permissions.
@@ -34,6 +38,7 @@ The following mount options are supported:
 longad      Use long ad's (default)
 nostrict    Unset strict conformance
 iocharset=  Set the NLS character set
+=========== ======================================
 The uid= and gid= options need a bit more explaining. They will accept a
 decimal numeric value and all inodes on that mount will then appear as
@@ -47,13 +52,17 @@ the interactive user will always see the files on the disk as belonging to him.
 The remaining are for debugging and disaster recovery:
+===== ================================
 novrs Skip volume sequence recognition
+===== ================================
 The following expect a offset from 0.
+========== =================================================
 session=   Set the CDROM session (default= last session)
 anchor=    Override standard anchor location. (default= 256)
 lastblock= Set the last block of the filesystem/
+========== =================================================
 -------------------------------------------------------------------------------
@@ -62,5 +71,5 @@ For the latest version and toolset see:
        https://github.com/pali/udftools
 Documentation on UDF and ECMA 167 is available FREE from:
-       http://www.osta.org/
-       http://www.ecma-international.org/
+    - http://www.osta.org/
+    - http://www.ecma-international.org/
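The UDF options above take block numbers counted from 0, with anchor= defaulting to 256. ECMA 167 records the Anchor Volume Descriptor Pointer in at least two of three places: block 256, block N - 256 and block N, where N is the last block. The C sketch below lists those candidates, putting an anchor= override first, and converts a block number to a byte offset. The helper names are hypothetical, the override-first order is an assumption, and the kernel driver's real search uses further heuristics not modelled here.

```c
#include <assert.h>
#include <stddef.h>

/* Append 'blk' to the list unless it is already there. */
static void push_unique(unsigned long *out, size_t *n, unsigned long blk)
{
    for (size_t i = 0; i < *n; i++)
        if (out[i] == blk)
            return;
    out[(*n)++] = blk;
}

/* Candidate AVDP locations. anchor: value given with anchor=, or 0 if
   the option was not used; lastblock: last block N, or 0 if unknown.
   Returns the number of entries written to out. */
static size_t avdp_candidates(unsigned long anchor, unsigned long lastblock,
                              unsigned long out[4])
{
    size_t n = 0;

    if (anchor)
        push_unique(out, &n, anchor);      /* override first (assumed order) */
    push_unique(out, &n, 256);             /* the standard location */
    if (lastblock > 256) {
        push_unique(out, &n, lastblock - 256);
        push_unique(out, &n, lastblock);
    }
    return n;
}

/* Byte offset of a block counted from 0. */
static unsigned long long block_offset(unsigned long block,
                                       unsigned block_size)
{
    return (unsigned long long)block * block_size;
}
```

On optical media with 2048-byte blocks, the default anchor at block 256 sits at byte offset 524288.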
+.. SPDX-License-Identifier: GPL-2.0
+================================================
 ZoneFS - Zone filesystem for Zoned block devices
+================================================
 Introduction
 ============
@@ -29,6 +33,7 @@ Zoned block devices
 Zoned storage devices belong to a class of storage devices with an address
 space that is divided into zones. A zone is a group of consecutive LBAs and all
 zones are contiguous (there are no LBA gaps). Zones may have different types.
 * Conventional zones: there are no access constraints to LBAs belonging to
   conventional zones. Any read or write access can be executed, similarly to a
   regular block device.
@@ -158,6 +163,7 @@ Format options
 --------------
 Several optional features of zonefs can be enabled at format time.
 * Conventional zone aggregation: ranges of contiguous conventional zones can be
   aggregated into a single larger file instead of the default one file per zone.
 * File ownership: The owner UID and GID of zone files is by default 0 (root)
@@ -249,7 +255,7 @@ permissions.
 Further action taken by zonefs I/O error recovery can be controlled by the user
 with the "errors=xxx" mount option. The table below summarizes the result of
 zonefs I/O error processing depending on the mount option and on the zone
-conditions.
+conditions::
     +--------------+-----------+-----------------------------------------+
     |              |           |            Post error state             |
@@ -275,6 +281,7 @@ conditions.
     +--------------+-----------+-----------------------------------------+
 Further notes:
 * The "errors=remount-ro" mount option is the default behavior of zonefs I/O
   error processing if no errors mount option is specified.
 * With the "errors=remount-ro" mount option, the change of the file access
@@ -302,6 +309,7 @@ Mount options
 zonefs define the "errors=<behavior>" mount option to allow the user to specify
 zonefs behavior in response to I/O errors, inode size inconsistencies or zone
 condition chages. The defined behaviors are as follow:
 * remount-ro (default)
 * zone-ro
 * zone-offline
@@ -325,77 +333,77 @@ Examples
 --------
 The following formats a 15TB host-managed SMR HDD with 256 MB zones
-with the conventional zones aggregation feature enabled.
+with the conventional zones aggregation feature enabled::
     # mkzonefs -o aggr_cnv /dev/sdX
     # mount -t zonefs /dev/sdX /mnt
     # ls -l /mnt/
     total 0
     dr-xr-xr-x 2 root root     1 Nov 25 13:23 cnv
     dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq
 The size of the zone files sub-directories indicate the number of files
 existing for each type of zones. In this example, there is only one
 conventional zone file (all conventional zones are aggregated under a single
-file).
+file)::
     # ls -l /mnt/cnv
     total 137101312
     -rw-r----- 1 root root 140391743488 Nov 25 13:23 0
-This aggregated conventional zone file can be used as a regular file.
+This aggregated conventional zone file can be used as a regular file::
     # mkfs.ext4 /mnt/cnv/0
     # mount -o loop /mnt/cnv/0 /data
 The "seq" sub-directory grouping files for sequential write zones has in this
-example 55356 zones.
+example 55356 zones::
     # ls -lv /mnt/seq
     total 14511243264
     -rw-r----- 1 root root 0 Nov 25 13:23 0
     -rw-r----- 1 root root 0 Nov 25 13:23 1
     -rw-r----- 1 root root 0 Nov 25 13:23 2
     ...
     -rw-r----- 1 root root 0 Nov 25 13:23 55354
     -rw-r----- 1 root root 0 Nov 25 13:23 55355
 For sequential write zone files, the file size changes as data is appended at
-the end of the file, similarly to any regular file system.
+the end of the file, similarly to any regular file system::
     # dd if=/dev/zero of=/mnt/seq/0 bs=4096 count=1 conv=notrunc oflag=direct
     1+0 records in
     1+0 records out
     4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00044121 s, 9.3 MB/s
     # ls -l /mnt/seq/0
     -rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0
 The written file can be truncated to the zone size, preventing any further
-write operation.
+write operation::
     # truncate -s 268435456 /mnt/seq/0
     # ls -l /mnt/seq/0
     -rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0
 Truncation to 0 size allows freeing the file zone storage space and restart
-append-writes to the file.
+append-writes to the file::
     # truncate -s 0 /mnt/seq/0
     # ls -l /mnt/seq/0
     -rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0
 Since files are statically mapped to zones on the disk, the number of blocks of
-a file as reported by stat() and fstat() indicates the size of the file zone.
+a file as reported by stat() and fstat() indicates the size of the file zone::
     # stat /mnt/seq/0
     File: /mnt/seq/0
     Size: 0 Blocks: 524288 IO Block: 4096 regular empty file
     Device: 870h/2160d Inode: 50431 Links: 1
     Access: (0640/-rw-r-----) Uid: ( 0/ root) Gid: ( 0/ root)
     Access: 2019-11-25 13:23:57.048971997 +0900
     Modify: 2019-11-25 13:52:25.553805765 +0900
     Change: 2019-11-25 13:52:25.553805765 +0900
     Birth: -
 The number of blocks of the file ("Blocks") in units of 512B blocks gives the
......
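The zonefs examples above show `stat` reporting Blocks: 524288 for /mnt/seq/0, in 512-byte units, and state that this block count reflects the size of the file zone. A minimal C sketch of turning `st_blocks` into the zone capacity and the space still available for appends follows. `zone_capacity()`, `zone_remaining()` and `zone_remaining_path()` are hypothetical helpers, and the sketch relies on the behavior described above: on zonefs, st_blocks reflects the zone size rather than the amount of data written.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <sys/stat.h>

/* st_blocks counts 512-byte units; on zonefs it reflects the file's
   zone size (or the aggregated zones of a conventional file). */
static long long zone_capacity(const struct stat *st)
{
    return (long long)st->st_blocks * 512;
}

/* Bytes that can still be appended before the zone is full. */
static long long zone_remaining(const struct stat *st)
{
    long long left = zone_capacity(st) - (long long)st->st_size;
    return left > 0 ? left : 0;
}

/* Same, for a path; returns -1 if the file cannot be stat()ed. */
static long long zone_remaining_path(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return zone_remaining(&st);
}
```

The example listings are consistent with this reading: 524288 * 512 = 268435456 bytes, the size passed to truncate above. Assuming ls reports totals in 1 KiB units (its default), the "total 14511243264" shown for /mnt/seq equals 55356 zones of 262144 KiB each, and the 140391743488-byte aggregated conventional file holds exactly 523 such zones.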