    [PATCH] dentry and inode cache hash algorithm performance changes. · 99effef9
    Andrew Morton authored
    From: "Jose R. Santos" <jrsantos@austin.ibm.com>
    
    This patch alleviates some issues seen with Linux when accessing millions of
    files on machines with large amounts of RAM (32GB+).  Both algorithms are
    based on studies that Dominique Heger was doing on hash table efficiency in
    Linux.  The dentry hash function has been tested on small systems with one
    internal IDE hard disk as well as on large SMP machines with many Fibre
    Channel disks.  Dominique claims that in all the testing done, they did not
    see one case where this hash function provided worse performance, and that
    in most tests they were seeing better performance.
    
    The inode hash function was done by me, based on Dominique's original work,
    and has only been stress tested with SpecSFS.  It provided a 3% improvement
    over the default algorithm in the SpecSFS results and speedups in the
    response time of almost all filesystem operations the benchmark stresses.
    With the better distribution it is also possible to reduce the number of
    inode buckets from 32 million to 16 million and still get slightly better
    results.
    
    Anton was nice enough to provide some graphs that show the distribution 
    before and after the patch at http://samba.org/~anton/linux/sfs/1/
    
    For the dentry hash function, some of my other coworkers have put this hash
    function through various tests and have concluded that it is equal to or
    better than the default hash function.  These runs were done with a
    (hopefully to be Open Source soon) benchmark called FFSB, which can simulate
    various I/O patterns across many filesystems and variable file sizes.
    
    The SpecSFS fileset is basically a lot of small files, and its size varies
    depending on the size of the run.  For a not so big SMP system the number of
    files is in the 20+ million range.  Of those 20 million files only 10% are
    accessed randomly by the client.  The purpose of this is that the benchmark
    tries to stress not only the NFS layer but the VM and filesystem layers as
    well.  The filesets are also hundreds of gigabytes in size in order to
    promote disk head movement by guaranteeing cache misses in memory.  In SFS,
    27% of the workload is lookups, and __d_lookup has been showing up high in
    my profiles.
    
    For the inode hash, the problem that I see is that when running a benchmark
    with this huge fileset we end up trying to free a lot of inode entries during
    the run while trying to put new entries in the cache.  We end up calling
    ifind_fast(), which calls find_inodes_fast() with inode_lock held.  In order
    to avoid holding inode_lock for long, we needed to avoid having long chains
    in that hash table.
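
The cost of long chains can be seen with a small experiment. The following is a minimal sketch in Python, not code from the patch: the key set (64-byte-aligned values), the table size, and the multiplier are all assumptions chosen for illustration. It counts how many keys land in each bucket under a trivial hash and under a simple multiplicative hash.

```python
from collections import Counter

BITS = 10                      # assumed table size: 2**10 = 1024 buckets
NBUCKETS = 1 << BITS
MULTIPLIER = 0x9E370001        # illustrative odd 32-bit multiplier (an assumption)


def chain_lengths(keys, hash_fn, nbuckets):
    """Return a Counter mapping bucket index -> number of keys chained there."""
    return Counter(hash_fn(k) % nbuckets for k in keys)


def naive_hash(key):
    # Uses the key as-is, so aligned keys collapse onto a few buckets.
    return key


def multiplicative_hash(key):
    # Keep the top BITS bits of the 32-bit product.
    return ((key * MULTIPLIER) & 0xFFFFFFFF) >> (32 - BITS)


# Hypothetical keys: 4096 values aligned to 64 bytes, like object addresses.
keys = [i * 64 for i in range(4096)]

naive = chain_lengths(keys, naive_hash, NBUCKETS)
mixed = chain_lengths(keys, multiplicative_hash, NBUCKETS)

print(len(naive), max(naive.values()))   # 16 buckets used, longest chain 256
print(len(mixed), max(mixed.values()))   # 1024 buckets used, longest chain 4
```

With the same number of buckets, the longest chain a lookup has to walk while holding a lock drops from 256 entries to 4 in this toy setup; real workloads will not be this clean.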
    
    When I took a look at the original hash function, I found it to be a bit too
    simple for any workload.  My solution (which takes advantage of Dominique's
    work) was to create a hash function that could generate completely
    different hashes depending on the hashval and the superblock in order to have
    the hash scale as we added more filesystems to the machine.
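
To make the idea of mixing the hashval with the superblock concrete, here is a hedged Python sketch. The commit message does not show the hash itself, so the formula, the constants, and the name `inode_bucket` are assumptions for illustration only; the 64-bit mask models C `unsigned long` wraparound on a 64-bit machine.

```python
MASK64 = (1 << 64) - 1         # model 64-bit unsigned wraparound
L1_CACHE_BYTES = 64            # assumed cache-line size
MULTIPLIER = 0x9E370001        # illustrative odd constant (an assumption)


def inode_bucket(sb_addr, hashval, bits):
    """Map (superblock address, hashval) to a bucket index in [0, 2**bits).

    Illustrative mixing only: multiply the two inputs together so that the
    same hashval on different superblocks spreads out, then fold the high
    bits back into the low bits before masking.
    """
    tmp = (hashval * sb_addr) & MASK64
    tmp ^= ((MULTIPLIER + hashval) & MASK64) // L1_CACHE_BYTES
    tmp ^= (tmp ^ MULTIPLIER) >> bits
    return tmp & ((1 << bits) - 1)


# Made-up superblock addresses for two hypothetical filesystems.
sb_a = 0xFFFF880012340000
sb_b = 0xFFFF880056780000

for ino in (1, 2, 3):
    print(ino, inode_bucket(sb_a, ino, 16), inode_bucket(sb_b, ino, 16))
```

The point of the sketch is only that the bucket depends on both inputs, so the same inode number on two filesystems does not have to share a chain.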
    
    Both of these problems can be somewhat tuned out by increasing the number of
    buckets of both the d and i caches, but it got to a point where I had 256MB
    of inode and 128MB of dentry hash buckets on a not so large SMP.  With the
    hash changes I have been able to reduce the hash buckets to 128MB for the
    inode cache and to 32MB for the dentry cache and still get better
    performance.
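
The memory figures can be translated into bucket counts with a little arithmetic. This Python sketch assumes that each bucket head is a single 8-byte pointer (as on a 64-bit machine) and that "MB" means 2^20 bytes; neither is stated in the message. Under those assumptions, 256MB and 128MB of inode buckets come out as 32 million and 16 million buckets, consistent with the counts quoted earlier.

```python
MIB = 1024 * 1024
BUCKET_BYTES = 8   # assumption: one 8-byte pointer per bucket head


def bucket_count(table_bytes, bucket_bytes=BUCKET_BYTES):
    """Number of hash buckets that fit in a table of the given size."""
    return table_bytes // bucket_bytes


for label, before_mb, after_mb in (("inode", 256, 128), ("dentry", 128, 32)):
    before = bucket_count(before_mb * MIB)
    after = bucket_count(after_mb * MIB)
    print(f"{label}: {before} -> {after} buckets")
# inode: 33554432 -> 16777216 buckets
# dentry: 16777216 -> 4194304 buckets
```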
    
    If it helps my case...  I haven't been running this benchmark for long, so I
    haven't been able to find a way to cheat.  I need to come up with generic
    solutions until I can find a cheat for the benchmark.  :)
    
    
    SDET results:
    
    Steve Pratt seems to have an SDET setup already, and he did me the favor of
    running SDET with a reduced dentry hash table size.  I believe that
    his table suggests that less than 3% change is acceptable variability, but
    at 16 and 32 threads he got a roughly 5% better number using the new hash
    algorithm.
    
    A) x4408way1.sdet.2.6.5100000-8p.04-05-05_12.08.44 vs 
    B) x4408way1.sdet.2.6.5+hash-100000-8p.04-05-05_11.48.02
    
    
      Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
      Inode-cache hash table entries: 1048576 (order: 10, 4194304 bytes) 
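
The two boot lines above are internally consistent, which the following Python sketch checks. It assumes a 4096-byte page size (not stated in the message) and reads "order" as the base-2 logarithm of the number of pages in the table; the function name `table_layout` is made up for this illustration.

```python
PAGE_SIZE = 4096   # assumption: 4 KiB pages


def table_layout(entries, total_bytes, page_size=PAGE_SIZE):
    """Return (bytes per bucket, pages used, allocation order)."""
    bucket_bytes = total_bytes // entries
    pages = total_bytes // page_size
    order = pages.bit_length() - 1      # exact log2 when pages is a power of two
    return bucket_bytes, pages, order


print(table_layout(131072, 524288))     # dentry cache: (4, 128, 7)
print(table_layout(1048576, 4194304))   # inode cache:  (4, 1024, 10)
```

Both tables work out to 4 bytes per bucket, which would fit a single pointer on a 32-bit machine.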
    
    Results:Throughput
    
                                              tolerance = 0.00 + 3.00% of A
                          A            B
       Threads      Ops/sec      Ops/sec    %diff         diff    tolerance
    ----------- ------------ ------------ -------- ------------ ------------
             1    4341.9300    4401.9500     1.38        60.02       130.26 
             2    8242.2000    8165.1200    -0.94       -77.08       247.27 
             4   15274.4900   15257.1000    -0.11       -17.39       458.23 
             8   21326.9200   21320.7000    -0.03        -6.22       639.81 
            16   23056.2100   24282.8000     5.32      1226.59       691.69  * 
            32   23397.2500   24684.6100     5.50      1287.36       701.92  * 
            64   23372.7600   23632.6500     1.11       259.89       701.18 
           128   17009.3900   16651.9600    -2.10      -357.43       510.28 
    =========================================================================
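
The derived columns of the table can be recomputed from the two Ops/sec columns. This is a small Python sketch; the reading that `*` marks rows where the absolute difference exceeds the tolerance is inferred from the table, not stated in the message.

```python
TOLERANCE_PCT = 3.00   # "tolerance = 0.00 + 3.00% of A"

# (threads, A ops/sec, B ops/sec) copied from the table above.
ROWS = [
    (1, 4341.93, 4401.95),
    (2, 8242.20, 8165.12),
    (4, 15274.49, 15257.10),
    (8, 21326.92, 21320.70),
    (16, 23056.21, 24282.80),
    (32, 23397.25, 24684.61),
    (64, 23372.76, 23632.65),
    (128, 17009.39, 16651.96),
]


def compare(a, b, tolerance_pct=TOLERANCE_PCT):
    """Return (%diff, diff, tolerance, outside_tolerance) for one row."""
    diff = b - a
    pct = 100.0 * diff / a
    tolerance = a * tolerance_pct / 100.0
    return round(pct, 2), round(diff, 2), round(tolerance, 2), abs(diff) > tolerance


flagged = [threads for threads, a, b in ROWS if compare(a, b)[3]]
print(flagged)   # [16, 32]
```

Only the 16- and 32-thread runs fall outside the 3% band, and both are improvements for B.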
dcache.c 41.8 KB