Commit 9ab67617 authored by Ophélie Gagnard

WIP: Rewrite the README. (The old one is kept for now.)

parent 39cdcb1b
# metadata-collect-agent
In the context of the project [GNU/Linux System files on-boot Tamper Detection System](https://www.erp5.com/group_section/forum/GNU-Linux-System-files-on-boot-Tamper-Detection-System-94DGdYfmx1), we need to create an agent that will be run inside an initramfs to report as much metadata as useful while keeping system boot times acceptable. It must then report that metadata to a remote service for later analysis.
## Compile from source
### Without dracut
Get the executables:

    make no-dracut

At this stage, the useful files are in `bin/` and `lib/`; `flb.conf` is needed as well.

Install (don't forget to set the `$DESTDIR` and `$PREFIX` variables if you want to; defaults: `DESTDIR=""`, `PREFIX="/usr/local"`):

    make install-no-dracut

### With dracut
#### Without signing the image
## Current performance properties
- Reads file system metadata from the main thread (stat, xattrs (SELinux, ...), POSIX ACLs)
- Reads files and hashes them with MD5, SHA-1, SHA-256 and SHA-512 across multiple processes (as many as the core count) using the Python `multiprocessing` module
- Successfully maximizes disk I/O utilization: the Python code's performance is not the bottleneck, the disk is (a good sign)

Tested on a laptop with:
- an NVMe SSD reading at up to 3.2 GB/s
- an Intel(R) Core(TM) i7-1065G7 CPU (4 cores, 8 threads) @ 1.30 GHz base / 3.90 GHz boost
- ~2 GHz per thread on average under full multithreaded load, due to heat and suboptimal laptop thermals (Dell XPS 13 2020)

For ~1 million files on EXT4 over LUKS+LVM and ~140 GB of occupied disk space:

```
real 6m11.532s
user 31m7.676s
sys  3m27.251s
```

That is, 6 minutes and 12 seconds of wall-clock time.

This will hardly get any better, because the disk is the bottleneck: CPU usage is not saturated, but disk I/O utilization is, peaking at 500 MB/s reads under these test conditions. The SSD's 3.2 GB/s figure only holds for sequential reads (optimal conditions).

It can, and probably will, be faster on performant servers with fewer files, less disk space used, more CPU cores and a similar disk.
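The multi-process hashing described above can be sketched roughly as follows. This is a minimal illustration, not the agent's actual code: the chunk size and the error handling are assumptions.

```python
import hashlib
import multiprocessing
import os

ALGORITHMS = ("md5", "sha1", "sha256", "sha512")

def hash_file(path):
    """Read a file once and digest it with every algorithm in parallel."""
    hashers = {name: hashlib.new(name) for name in ALGORITHMS}
    try:
        with open(path, "rb") as f:
            # Read in 1 MiB chunks so memory stays bounded for large files.
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                for h in hashers.values():
                    h.update(chunk)
    except OSError:
        return path, None  # unreadable file: report no digests
    return path, {name: h.hexdigest() for name, h in hashers.items()}

def hash_files(paths):
    """Hash files across as many worker processes as there are cores."""
    with multiprocessing.Pool(os.cpu_count()) as pool:
        return dict(pool.map(hash_file, paths))
```

Each worker re-reads its own files, so the workers compete only for disk I/O, which matches the observation above that the disk, not the Python code, is the bottleneck.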
## Desired performance properties
- Reduce memory usage
  - Avoid storing all the collected data in memory at the same time
    - Encode and output JSON as the program runs (incompatible with the current tree-like data structure)
    - Discard data after output so that memory usage becomes deterministic
- Beware of stack overflows
  - The file system traversal function is currently recursive, and Python does not optimize tail calls, so it could in principle overflow the stack. Since file system paths are limited in length (is that always true? is the limit file-system specific?), it probably never will.
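Both goals above can be addressed together: an explicit-stack traversal removes the recursion-depth concern, and emitting one JSON record per entry as it is visited keeps memory usage flat, since nothing is retained after it is written. A minimal sketch, assuming a JSON-lines output format (the agent's real record contents and format may differ):

```python
import json
import os
import sys

def walk_and_emit(root, out=sys.stdout):
    """Traverse a directory tree iteratively, writing one JSON line per
    entry and discarding each record immediately after output."""
    stack = [root]  # explicit stack replaces recursion
    while stack:
        directory = stack.pop()
        try:
            entries = os.scandir(directory)
        except OSError:
            continue  # unreadable directory: skip it
        with entries:
            for entry in entries:
                try:
                    st = entry.stat(follow_symlinks=False)
                except OSError:
                    continue
                record = {
                    "path": entry.path,
                    "size": st.st_size,
                    "mode": st.st_mode,
                    "mtime": st.st_mtime,
                }
                out.write(json.dumps(record) + "\n")  # emitted, then dropped
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
```

Because records are written as they are produced, peak memory depends only on the directory stack, not on the number of files scanned.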