Commit b0a4aa95 authored by Mauro Carvalho Chehab

docs: nvdimm: convert to ReST

Rename the nvdimm documentation files to ReST, add an
index for them and adjust in order to produce a nice html
output via the Sphinx build system.

At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
parent 6e58e2d8
=============================
BTT - Block Translation Table
=============================
1. Introduction
---------------
===============
Persistent memory based storage is able to perform IO at byte (or more
accurately, cache line) granularity. However, we often want to expose such
......@@ -25,7 +26,7 @@ provides atomic sector updates.
2. Static Layout
----------------
================
The underlying storage on which a BTT can be laid out is not limited in any way.
The BTT, however, splits the available space into chunks of up to 512 GiB,
......@@ -33,27 +34,27 @@ called "Arenas".
Each arena follows the same layout for its metadata, and all references in an
arena are internal to it (with the exception of one field that points to the
next arena). The following depicts the "On-disk" metadata layout:
next arena). The following depicts the "On-disk" metadata layout::
Backing Store +-------> Arena
+---------------+ | +------------------+
| | | | Arena info block |
| Arena 0 +---+ | 4K |
| 512G | +------------------+
| | | |
+---------------+ | |
| | | |
| Arena 1 | | Data Blocks |
| 512G | | |
| | | |
+---------------+ | |
| . | | |
| . | | |
| . | | |
| | | |
| | | |
+---------------+ +------------------+
+---------------+ | +------------------+
| | | | Arena info block |
| Arena 0 +---+ | 4K |
| 512G | +------------------+
| | | |
+---------------+ | |
| | | |
| Arena 1 | | Data Blocks |
| 512G | | |
| | | |
+---------------+ | |
| . | | |
| . | | |
| . | | |
| | | |
| | | |
+---------------+ +------------------+
| |
| BTT Map |
| |
......@@ -69,7 +70,7 @@ next arena). The following depicts the "On-disk" metadata layout:
3. Theory of Operation
----------------------
======================
a. The BTT Map
......@@ -79,31 +80,37 @@ The map is a simple lookup/indirection table that maps an LBA to an internal
block. Each map entry is 32 bits. The two most significant bits are special
flags, and the remaining form the internal block number.
======== =============================================================
Bit Description
31 - 30 : Error and Zero flags - Used in the following way:
Bit Description
31 30
-----------------------------------------------------------------------
00 Initial state. Reads return zeroes; Premap = Postmap
01 Zero state: Reads return zeroes
10 Error state: Reads fail; Writes clear 'E' bit
11 Normal Block – has valid postmap
======== =============================================================
31 - 30 Error and Zero flags - Used in the following way:
== == ====================================================
31 30 Description
== == ====================================================
0 0 Initial state. Reads return zeroes; Premap = Postmap
0 1 Zero state: Reads return zeroes
1 0 Error state: Reads fail; Writes clear 'E' bit
1 1 Normal Block – has valid postmap
== == ====================================================
29 - 0 : Mappings to internal 'postmap' blocks
29 - 0 Mappings to internal 'postmap' blocks
======== =============================================================
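As an illustration only (not part of this commit), the bit layout in the table above can be sketched in Python. The names ``decode_map_entry``, ``STATES`` and ``POSTMAP_MASK`` are hypothetical; only the split into two flag bits and a 30-bit block number comes from the table.

```python
# Hypothetical helper; only the bit layout is taken from the table above.
FLAG_SHIFT = 30                 # bits 31-30 hold the Error and Zero flags
POSTMAP_MASK = (1 << 30) - 1    # bits 29-0 hold the postmap block number

STATES = {
    0b00: "initial",  # reads return zeroes; premap == postmap
    0b01: "zero",     # reads return zeroes
    0b10: "error",    # reads fail
    0b11: "normal",   # entry carries a valid postmap
}


def decode_map_entry(entry, premap_aba):
    """Split a 32-bit map entry into (state, postmap_aba)."""
    if not 0 <= entry < (1 << 32):
        raise ValueError("map entry must fit in 32 bits")
    state = STATES[entry >> FLAG_SHIFT]
    postmap_aba = entry & POSTMAP_MASK
    if state == "initial":
        # In the initial state the mapping is the identity.
        postmap_aba = premap_aba
    return state, postmap_aba
```

For instance, ``decode_map_entry(0xC0000005, 9)`` reports a normal block mapped to internal block 5, while an all-zero entry maps the premap ABA to itself.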
Some of the terminology that will be subsequently used:
External LBA : LBA as made visible to upper layers.
ABA : Arena Block Address - Block offset/number within an arena
Premap ABA : The block offset into an arena, which was decided upon by range
============ ================================================================
External LBA LBA as made visible to upper layers.
ABA Arena Block Address - Block offset/number within an arena
Premap ABA The block offset into an arena, which was decided upon by range
checking the External LBA
Postmap ABA : The block number in the "Data Blocks" area obtained after
Postmap ABA The block number in the "Data Blocks" area obtained after
indirection from the map
nfree : The number of free blocks that are maintained at any given time.
nfree The number of free blocks that are maintained at any given time.
This is the number of concurrent writes that can happen to the
arena.
============ ================================================================
For example, after adding a BTT, we surface a disk of 1024G. We get a read for
......@@ -121,19 +128,21 @@ i.e. Every write goes to a "free" block. A running list of free blocks is
maintained in the form of the BTT flog. 'Flog' is a combination of the words
"free list" and "log". The flog contains 'nfree' entries, and an entry contains:
lba : The premap ABA that is being written to
old_map : The old postmap ABA - after 'this' write completes, this will be a
======== =====================================================================
lba The premap ABA that is being written to
old_map The old postmap ABA - after 'this' write completes, this will be a
free block.
new_map : The new postmap ABA. The map will be updated to reflect this
new_map The new postmap ABA. The map will be updated to reflect this
lba->postmap_aba mapping, but we log it here in case we have to
recover.
seq : Sequence number to mark which of the 2 sections of this flog entry is
seq Sequence number to mark which of the 2 sections of this flog entry is
valid/newest. It cycles between 01->10->11->01 (binary) under normal
operation, with 00 indicating an uninitialized state.
lba' : alternate lba entry
old_map': alternate old postmap entry
new_map': alternate new postmap entry
seq' : alternate sequence number.
lba' alternate lba entry
old_map' alternate old postmap entry
new_map' alternate new postmap entry
seq' alternate sequence number.
======== =====================================================================
Each of the above fields is 32-bit, making one entry 32 bytes. Entries are also
padded to 64 bytes to avoid cache line sharing or aliasing. Flog updates are
......@@ -147,8 +156,10 @@ c. The concept of lanes
While 'nfree' describes the number of concurrent IOs an arena can process
concurrently, 'nlanes' is the number of IOs the BTT device as a whole can
process.
process::
nlanes = min(nfree, num_cpus)
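A minimal sketch of the formula above, for illustration only. ``num_lanes`` and ``lane_for_cpu`` are hypothetical names, and folding CPU numbers onto lanes with a modulo is an assumption made here, not something this text specifies.

```python
def num_lanes(nfree, num_cpus):
    """Lanes available to the BTT device as a whole."""
    return min(nfree, num_cpus)


def lane_for_cpu(cpu, nlanes):
    """Pick a lane for the CPU issuing the IO.

    Assumption: CPUs are folded onto lanes by modulo, so several CPUs
    may share one lane when there are more CPUs than lanes.
    """
    return cpu % nlanes
```

With ``nfree = 4`` and 16 CPUs this gives 4 lanes, and CPUs 1, 5, 9 and 13 would all land on lane 1 under the modulo assumption.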
A lane number is obtained at the start of any IO, and is used for indexing into
all the on-disk and in-memory data structures for the duration of the IO. If
there are more CPUs than the max number of available lanes, then lanes are
......@@ -180,10 +191,10 @@ e. In-memory data structure: map locks
--------------------------------------
Consider a case where two writer threads are writing to the same LBA. There can
be a race in the following sequence of steps:
be a race in the following sequence of steps::
free[lane] = map[premap_aba]
map[premap_aba] = postmap_aba
free[lane] = map[premap_aba]
map[premap_aba] = postmap_aba
Both threads can update their respective free[lane] with the same old, freed
postmap_aba. This has made the layout inconsistent by losing a free entry, and
......@@ -202,6 +213,7 @@ On startup, we analyze the BTT flog to create our list of free blocks. We walk
through all the entries, and for each lane, of the set of two possible
'sections', we always look at the most recent one only (based on the sequence
number). The reconstruction rules/steps are simple:
- Read map[log_entry.lba].
- If log_entry.new matches the map entry, then log_entry.old is free.
- If log_entry.new does not match the map entry, then log_entry.new is free.
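The three rules above can be sketched as follows; this is illustrative and not part of the commit. It assumes the newest section of each lane's flog entry has already been selected, that the map holds plain postmap ABAs with the flag bits stripped, and it uses the field names ``old_map`` and ``new_map`` from the flog table for what the rules call ``log_entry.old`` and ``log_entry.new``. ``rebuild_free_list`` is a hypothetical name.

```python
def rebuild_free_list(flog, btt_map):
    """Return the free postmap block for each lane.

    flog    -- one dict per lane with keys 'lba', 'old_map', 'new_map'
               (assumed to be the newest section of each entry)
    btt_map -- list indexed by premap ABA, holding postmap ABAs
    """
    free = []
    for entry in flog:
        current = btt_map[entry["lba"]]
        if entry["new_map"] == current:
            # The map update completed, so the old block is free.
            free.append(entry["old_map"])
        else:
            # The map was never updated, so the new block is still free.
            free.append(entry["new_map"])
    return free
```

For example, with ``btt_map = [10, 11, 12, 13]``, a lane whose entry logged ``lba=1, old_map=20, new_map=11`` frees block 20, while a lane that logged ``lba=2, old_map=12, new_map=21`` frees block 21 because the map still points at 12.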
......@@ -245,6 +257,7 @@ Write:
An arena would be in an error state if any of the metadata is corrupted
irrecoverably, either due to a bug or a media error. The following conditions
indicate an error:
- Info block checksum does not match (and recovering from the copy also fails)
- All internal available blocks are not uniquely and entirely addressed by the
sum of mapped blocks and free blocks (from the BTT flog).
......@@ -263,11 +276,10 @@ The BTT can be set up on any disk (namespace) exposed by the libnvdimm subsystem
(pmem, or blk mode). The easiest way to set up such a namespace is using the
'ndctl' utility [1]:
For example, the ndctl command line to setup a btt with a 4k sector size is:
For example, the ndctl command line to setup a btt with a 4k sector size is::
ndctl create-namespace -f -e namespace0.0 -m sector -l 4k
See ndctl create-namespace --help for more options.
[1]: https://github.com/pmem/ndctl
:orphan:
===================================
Non-Volatile Memory Device (NVDIMM)
===================================
.. toctree::
:maxdepth: 1
nvdimm
btt
security
NVDIMM SECURITY
===============
NVDIMM Security
===============
1. Introduction
......@@ -138,4 +139,5 @@ This command is only available when the master security is enabled, indicated
by the extended security status.
[1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdf
[2]: http://www.t13.org/documents/UploadedDocuments/docs2006/e05179r4-ACS-SecurityClarifications.pdf
......@@ -33,7 +33,7 @@ config BLK_DEV_PMEM
Documentation/admin-guide/kernel-parameters.rst). This driver converts
these persistent memory ranges into block devices that are
capable of DAX (direct-access) file system mappings. See
Documentation/nvdimm/nvdimm.txt for more details.
Documentation/nvdimm/nvdimm.rst for more details.
Say Y if you want to use an NVDIMM
......