Commit eb28ef20 authored by Tim Peters

New collision resolution scheme: no polynomials, simpler, faster, less
code, less memory.  Tests have uncovered no drawbacks.  Christian and
Vladimir are the other two people who have burned many brain cells on the
dict code in recent years, and they like the approach too, so I'm checking
it in without further ado.
parent 951a8841
@@ -116,13 +116,11 @@ Core
   to crash if the element comparison routines for the dict keys and/or
   values mutated the dicts.  Making the code bulletproof slowed it down.

-- Collisions in dicts now use polynomial division instead of multiplication
-  to generate the probe sequence, following an idea of Christian Tismer's.
-  This allows all bits of the hash code to come into play.  It should have
-  little or no effect on speed in ordinary cases, but can help dramatically
-  in bad cases.  For example, looking up every key in a dict d with
-  d.keys() = [i << 16 for i in range(20000)] is approximately 500x faster
-  now.
+- Collisions in dicts are resolved via a new approach, which can help
+  dramatically in bad cases.  For example, looking up every key in a dict
+  d with d.keys() = [i << 16 for i in range(20000)] is approximately 500x
+  faster now.  Thanks to Christian Tismer for pointing out the cause and
+  the nature of an effective cure (last December! better late than never).

 Library
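A quick way to see why that key set was pathological (an illustrative Python
snippet, not part of this patch; it relies only on ints being their own hash
codes, a point the dictobject.c comment below also makes):

    # All 20000 keys have zero low-order bits, so in a table of 2**15 slots
    # every one of them maps to the same initial slot.
    keys = [i << 16 for i in range(20000)]
    print(len(set(hash(k) & (2**15 - 1) for k in keys)))    # prints 1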
@@ -3,121 +3,116 @@
 #include "Python.h"

-/*
- * MINSIZE is the minimum size of a dictionary.  This many slots are
+/* MINSIZE is the minimum size of a dictionary.  This many slots are
  * allocated directly in the dict object (in the ma_smalltable member).
- * This must be a power of 2, and the first entry in the polys[] vector must
- * match.
+ * It must be a power of 2, and at least 4.  8 allows dicts with no more than
+ * 5 active entries to live in ma_smalltable (and so avoid an additional
+ * malloc); instrumentation suggested this suffices for the majority of
+ * dicts (consisting mostly of usually-small instance dicts and usually-small
+ * dicts created to pass keyword arguments).
  */
 #define MINSIZE 8

-/* define this out if you don't want conversion statistics on exit */
+/* Define this out if you don't want conversion statistics on exit. */
 #undef SHOW_CONVERSION_COUNTS

+/* See large comment block below.  This must be >= 1. */
+#define PERTURB_SHIFT 5
+
 /*
-Table of irreducible polynomials to efficiently cycle through
-GF(2^n)-{0}, 2<=n<=30.  A table size is always a power of 2.
-For a table size of 2**i, the polys entry is 2**i + j for some j in 1 thru
-2**i-1 inclusive.  The polys[] entries here happen to add in the smallest j
-values "that work".  Work means this:  given any integer k in 1 thru 2**i-1
-inclusive, a poly works if & only if repeating this code:
-    print k
-    k <<= 1
-    if k >= 2**i:
-        k ^= poly
-prints every integer in 1 thru 2**i-1 inclusive exactly once before printing
-k a second time.  Theory can be used to find such polys efficiently, but the
-operational defn. of "works" is sufficient to find them in reasonable time
-via brute force program (hint:  any poly that has an even number of 1 bits
-cannot work; ditto any poly with low bit 0; exploit those).
-
-Some major subtleties:  Most hash schemes depend on having a "good" hash
-function, in the sense of simulating randomness.  Python doesn't:  some of
-its hash functions are trivial, such as hash(i) == i for ints i (excepting
-i == -1, because -1 is the "error occurred" return value from tp_hash).
-
-This isn't necessarily bad!  To the contrary, that our hash tables are powers
-of 2 in size, and that we take the low-order bits as the initial table index,
-means that there are no collisions at all for dicts indexed by a contiguous
-range of ints.  This is "better than random" behavior, and that's very
-desirable.
-
-On the other hand, when collisions occur, the tendency to fill contiguous
-slices of the hash table makes a good collision resolution strategy crucial;
-e.g., linear probing is right out.
-
-Reimer Behrends contributed the idea of using a polynomial-based approach,
-using repeated multiplication by x in GF(2**n) where a polynomial is chosen
-such that x is a primitive root.  This visits every table location exactly
-once, and the sequence of locations probed is highly non-linear.
-
-The same is also largely true of quadratic probing for power-of-2 tables, of
-the specific
-
-    (i + comb(1, 2)) mod size
-    (i + comb(2, 2)) mod size
-    (i + comb(3, 2)) mod size
-    (i + comb(4, 2)) mod size
-    ...
-    (i + comb(j, 2)) mod size
-
-flavor.  The polynomial approach "scrambles" the probe indices better, but
-more importantly allows to get *some* additional bits of the hash code into
-play via computing the initial increment, thus giving a weak form of double
-hashing.  Quadratic probing cannot be extended that way (the first probe
-offset must be 1, the second 3, the third 6, etc).
-
-Christian Tismer later contributed the idea of using polynomial division
-instead of multiplication.  The problem is that the multiplicative method
-can't get *all* the bits of the hash code into play without expensive
-computations that slow down the initial index and/or initial increment
-computation.  For a set of keys like [i << 16 for i in range(20000)], under
-the multiplicative method the initial index and increment were the same for
-all keys, so every key followed exactly the same probe sequence, and so
-this degenerated into a (very slow) linear search.  The division method uses
-all the bits of the hash code naturally in the increment, although it *may*
-visit locations more than once until such time as all the high bits of the
-increment have been shifted away.  It's also impossible to tell in advance
-whether incr is congruent to 0 modulo poly, so each iteration of the loop has
-to guard against incr becoming 0.  These are minor costs, as we usually don't
-get into the probe loop, and when we do we usually get out on its first
-iteration.
+Major subtleties ahead:  Most hash schemes depend on having a "good" hash
+function, in the sense of simulating randomness.  Python doesn't:  its most
+important hash functions (for strings and ints) are very regular in common
+cases:
+
+>>> map(hash, (0, 1, 2, 3))
+[0, 1, 2, 3]
+>>> map(hash, ("namea", "nameb", "namec", "named"))
+[-1658398457, -1658398460, -1658398459, -1658398462]
+>>>
+
+This isn't necessarily bad!  To the contrary, in a table of size 2**i, taking
+the low-order i bits as the initial table index is extremely fast, and there
+are no collisions at all for dicts indexed by a contiguous range of ints.
+The same is approximately true when keys are "consecutive" strings.  So this
+gives better-than-random behavior in common cases, and that's very desirable.
+
+OTOH, when collisions occur, the tendency to fill contiguous slices of the
+hash table makes a good collision resolution strategy crucial.  Taking only
+the last i bits of the hash code is also vulnerable:  for example, consider
+[i << 16 for i in range(20000)] as a set of keys.  Since ints are their own
+hash codes, and this fits in a dict of size 2**15, the last 15 bits of every
+hash code are all 0:  they *all* map to the same table index.
+
+But catering to unusual cases should not slow the usual ones, so we just take
+the last i bits anyway.  It's up to collision resolution to do the rest.  If
+we *usually* find the key we're looking for on the first try (and, it turns
+out, we usually do -- the table load factor is kept under 2/3, so the odds
+are solidly in our favor), then it makes best sense to keep the initial index
+computation dirt cheap.
+
+The first half of collision resolution is to visit table indices via this
+recurrence:
+
+    j = ((5*j) + 1) mod 2**i
+
+For any initial j in range(2**i), repeating that 2**i times generates each
+int in range(2**i) exactly once (see any text on random-number generation for
+proof).  By itself, this doesn't help much:  like linear probing (setting
+j += 1, or j -= 1, on each loop trip), it scans the table entries in a fixed
+order.  This would be bad, except that's not the only thing we do, and it's
+actually *good* in the common cases where hash keys are consecutive.  In an
+example that's really too small to make this entirely clear, for a table of
+size 2**3 the order of indices is:
+
+    0 -> 1 -> 6 -> 7 -> 4 -> 5 -> 2 -> 3 -> 0 [and here it's repeating]
+
+If two things come in at index 5, the first place we look after is index 2,
+not 6, so if another comes in at index 6 the collision at 5 didn't hurt it.
+Linear probing is deadly in this case because there the fixed probe order
+is the *same* as the order consecutive keys are likely to arrive.  But it's
+extremely unlikely hash codes will follow a 5*j+1 recurrence by accident,
+and certain that consecutive hash codes do not.
+
+The other half of the strategy is to get the other bits of the hash code
+into play.  This is done by initializing a (unsigned) vrbl "perturb" to the
+full hash code, and changing the recurrence to:
+
+    j = (5*j) + 1 + perturb;
+    perturb >>= PERTURB_SHIFT;
+    use j % 2**i as the next table index;
+
+Now the probe sequence depends (eventually) on every bit in the hash code,
+and the pseudo-scrambling property of recurring on 5*j+1 is more valuable,
+because it quickly magnifies small differences in the bits that didn't affect
+the initial index.  Note that because perturb is unsigned, if the recurrence
+is executed often enough perturb eventually becomes and remains 0.  At that
+point (very rarely reached) the recurrence is on (just) 5*j+1 again, and
+that's certain to find an empty slot eventually (since it generates every int
+in range(2**i), and we make sure there's always at least one empty slot).
+
+Selecting a good value for PERTURB_SHIFT is a balancing act.  You want it
+small so that the high bits of the hash code continue to affect the probe
+sequence across iterations; but you want it large so that in really bad cases
+the high-order hash bits have an effect on early iterations.  5 was "the
+best" in minimizing total collisions across experiments Tim Peters ran (on
+both normal and pathological cases), but 4 and 6 weren't significantly worse.
+
+Historical:  Reimer Behrends contributed the idea of using a polynomial-based
+approach, using repeated multiplication by x in GF(2**n) where an irreducible
+polynomial for each table size was chosen such that x was a primitive root.
+Christian Tismer later extended that to use division by x instead, as an
+efficient way to get the high bits of the hash code into play.  This scheme
+also gave excellent collision statistics, but was more expensive:  two
+if-tests were required inside the loop; computing "the next" index took about
+the same number of operations but without as much potential parallelism
+(e.g., computing 5*j can go on at the same time as computing 1+perturb in the
+above, and then shifting perturb can be done while the table index is being
+masked); and the dictobject struct required a member to hold the table's
+polynomial.  In Tim's experiments the current scheme ran faster, produced
+equally good collision statistics, needed less code & used less memory.
 */

-static long polys[] = {
-        /* 4 + 3, */    /* first active entry if MINSIZE == 4 */
-        8 + 3,          /* first active entry if MINSIZE == 8 */
-        16 + 3,
-        32 + 5,
-        64 + 3,
-        128 + 3,
-        256 + 29,
-        512 + 17,
-        1024 + 9,
-        2048 + 5,
-        4096 + 83,
-        8192 + 27,
-        16384 + 43,
-        32768 + 3,
-        65536 + 45,
-        131072 + 9,
-        262144 + 39,
-        524288 + 39,
-        1048576 + 9,
-        2097152 + 5,
-        4194304 + 3,
-        8388608 + 33,
-        16777216 + 27,
-        33554432 + 9,
-        67108864 + 71,
-        134217728 + 39,
-        268435456 + 9,
-        536870912 + 5,
-        1073741824 + 83
-        /* 2147483648 + 9 -- if we ever boost this to unsigned long */
-};
-
 /* Object used as dummy key to fill deleted entries */
 static PyObject *dummy; /* Initialized by first call to newdictobject() */
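For reference, the probe sequence described in the new comment block above can
be simulated outside the C code.  This is a minimal Python sketch (not part of
the patch), assuming a nonnegative hash code and the same PERTURB_SHIFT of 5:

    def probe_indices(h, i, count):
        # First `count` table indices visited for hash code h in a table of
        # size 2**i, following j = 5*j + 1 + perturb; perturb >>= 5.
        mask = 2**i - 1
        j = h & mask
        perturb = h
        result = [j]
        while len(result) < count:
            j = (5*j + 1 + perturb) & mask
            perturb >>= 5            # PERTURB_SHIFT
            result.append(j)
        return result

With perturb == 0 this reproduces the pure 5*j+1 cycle quoted in the comment
for a table of size 2**3:

    >>> probe_indices(0, 3, 9)
    [0, 1, 6, 7, 4, 5, 2, 3, 0]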
@@ -168,7 +163,6 @@ struct dictobject {
        int ma_fill;  /* # Active + # Dummy */
        int ma_used;  /* # Active */
        int ma_size;  /* total # slots in ma_table */
-       int ma_poly;  /* appopriate entry from polys vector */
        /* ma_table points to ma_smalltable for small tables, else to
         * additional malloc'ed memory.  ma_table is never NULL!  This rule
         * saves repeated runtime null-tests in the workhorse getitem and
@@ -202,8 +196,6 @@ show_counts(void)
        (mp)->ma_table = (mp)->ma_smalltable;                          \
        (mp)->ma_size = MINSIZE;                                       \
        (mp)->ma_used = (mp)->ma_fill = 0;                             \
-       (mp)->ma_poly = polys[0];                                      \
-       assert(MINSIZE < (mp)->ma_poly && (mp)->ma_poly < MINSIZE*2);  \
    } while(0)

 PyObject *
@@ -235,28 +227,26 @@ The basic lookup function used by all operations.
 This is based on Algorithm D from Knuth Vol. 3, Sec. 6.4.
 Open addressing is preferred over chaining since the link overhead for
 chaining would be substantial (100% with typical malloc overhead).
-However, instead of going through the table at constant steps, we cycle
-through the values of GF(2^n).  This avoids modulo computations, being
-much cheaper on RISC machines, without leading to clustering.

-The initial probe index is computed as hash mod the table size.
-Subsequent probe indices use the values of x^i in GF(2^n)-{0} as an offset,
-where x is a root.  The initial offset is derived from hash, too.
+The initial probe index is computed as hash mod the table size.  Subsequent
+probe indices are computed as explained earlier.

 All arithmetic on hash should ignore overflow.

-(This version is due to Reimer Behrends, some ideas are also due to
-Jyrki Alakuijala and Vladimir Marangozov.)
+(The details in this version are due to Tim Peters, building on many past
+contributions by Reimer Behrends, Jyrki Alakuijala, Vladimir Marangozov and
+Christian Tismer).

 This function must never return NULL; failures are indicated by returning
 a dictentry* for which the me_value field is NULL.  Exceptions are never
 reported by this function, and outstanding exceptions are maintained.
 */
 static dictentry *
 lookdict(dictobject *mp, PyObject *key, register long hash)
 {
        register int i;
-       register unsigned int incr;
+       register unsigned int perturb;
        register dictentry *freeslot;
        register unsigned int mask = mp->ma_size-1;
        dictentry *ep0 = mp->ma_table;
@@ -265,9 +255,7 @@ lookdict(dictobject *mp, PyObject *key, register long hash)
        register int checked_error = 0;
        register int cmp;
        PyObject *err_type, *err_value, *err_tb;
-       /* We must come up with (i, incr) such that 0 <= i < ma_size
-          and 0 < incr < ma_size and both are a function of hash.
-          i is the initial table index and incr the initial probe offset. */
+
        i = hash & mask;
        ep = &ep0[i];
        if (ep->me_key == NULL || ep->me_key == key)
@@ -294,16 +282,12 @@ lookdict(dictobject *mp, PyObject *key, register long hash)
                }
                freeslot = NULL;
        }
-       /* Derive incr from hash, just to make it more arbitrary. Note that
-          incr must not be 0, or we will get into an infinite loop.*/
-       incr = hash ^ ((unsigned long)hash >> 3);
+
        /* In the loop, me_key == dummy is by far (factor of 100s) the
           least likely outcome, so test for that last. */
-       for (;;) {
-               if (!incr)
-                       incr = 1;       /* and incr will never be 0 again */
-               ep = &ep0[(i + incr) & mask];
+       for (perturb = hash; ; perturb >>= PERTURB_SHIFT) {
+               i = (i << 2) + i + perturb + 1;
+               ep = &ep0[i & mask];
                if (ep->me_key == NULL) {
                        if (restore_error)
                                PyErr_Restore(err_type, err_value, err_tb);
@@ -335,10 +319,6 @@ lookdict(dictobject *mp, PyObject *key, register long hash)
                }
                else if (ep->me_key == dummy && freeslot == NULL)
                        freeslot = ep;
-               /* Cycle through GF(2**n). */
-               if (incr & 1)
-                       incr ^= mp->ma_poly;    /* clears the lowest bit */
-               incr >>= 1;
        }
 }
@@ -356,7 +336,7 @@ static dictentry *
 lookdict_string(dictobject *mp, PyObject *key, register long hash)
 {
        register int i;
-       register unsigned int incr;
+       register unsigned int perturb;
        register dictentry *freeslot;
        register unsigned int mask = mp->ma_size-1;
        dictentry *ep0 = mp->ma_table;
@@ -370,8 +350,6 @@ lookdict_string(dictobject *mp, PyObject *key, register long hash)
                mp->ma_lookup = lookdict;
                return lookdict(mp, key, hash);
        }
-       /* We must come up with (i, incr) such that 0 <= i < ma_size
-          and 0 < incr < ma_size and both are a function of hash */
        i = hash & mask;
        ep = &ep0[i];
        if (ep->me_key == NULL || ep->me_key == key)
@@ -385,16 +363,12 @@ lookdict_string(dictobject *mp, PyObject *key, register long hash)
                }
                freeslot = NULL;
        }
-       /* Derive incr from hash, just to make it more arbitrary. Note that
-          incr must not be 0, or we will get into an infinite loop.*/
-       incr = hash ^ ((unsigned long)hash >> 3);
+
        /* In the loop, me_key == dummy is by far (factor of 100s) the
           least likely outcome, so test for that last. */
-       for (;;) {
-               if (!incr)
-                       incr = 1;       /* and incr will never be 0 again */
-               ep = &ep0[(i + incr) & mask];
+       for (perturb = hash; ; perturb >>= PERTURB_SHIFT) {
+               i = (i << 2) + i + perturb + 1;
+               ep = &ep0[i & mask];
                if (ep->me_key == NULL)
                        return freeslot == NULL ? ep : freeslot;
                if (ep->me_key == key
@@ -404,10 +378,6 @@ lookdict_string(dictobject *mp, PyObject *key, register long hash)
                        return ep;
                if (ep->me_key == dummy && freeslot == NULL)
                        freeslot = ep;
-               /* Cycle through GF(2**n). */
-               if (incr & 1)
-                       incr ^= mp->ma_poly;    /* clears the lowest bit */
-               incr >>= 1;
        }
 }
@@ -448,7 +418,7 @@ actually be smaller than the old one.
 static int
 dictresize(dictobject *mp, int minused)
 {
-       int newsize, newpoly;
+       int newsize;
        dictentry *oldtable, *newtable, *ep;
        int i;
        int is_oldtable_malloced;
@@ -456,20 +426,12 @@ dictresize(dictobject *mp, int minused)
        assert(minused >= 0);

-       /* Find the smallest table size > minused, and its poly[] entry. */
-       newpoly = 0;
-       newsize = MINSIZE;
-       for (i = 0; i < sizeof(polys)/sizeof(polys[0]); ++i) {
-               if (newsize > minused) {
-                       newpoly = polys[i];
-                       break;
-               }
-               newsize <<= 1;
-               if (newsize < 0)   /* overflow */
-                       break;
-       }
-       if (newpoly == 0) {
-               /* Ran out of polynomials or newsize overflowed. */
+       /* Find the smallest table size > minused. */
+       for (newsize = MINSIZE;
+            newsize <= minused && newsize > 0;
+            newsize <<= 1)
+               ;
+       if (newsize <= 0) {
                PyErr_NoMemory();
                return -1;
        }
@@ -511,7 +473,6 @@ dictresize(dictobject *mp, int minused)
        mp->ma_table = newtable;
        mp->ma_size = newsize;
        memset(newtable, 0, sizeof(dictentry) * newsize);
-       mp->ma_poly = newpoly;
        mp->ma_used = 0;
        i = mp->ma_fill;
        mp->ma_fill = 0;
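The new sizing loop in dictresize simply picks the smallest power of 2 greater
than minused; the old code additionally had to locate the matching polys[]
entry.  An illustrative Python equivalent (not part of the patch, and ignoring
the C code's int overflow check):

    def new_size(minused, MINSIZE=8):
        newsize = MINSIZE
        while newsize <= minused:
            newsize <<= 1
        return newsize

    # e.g. new_size(5) == 8, new_size(20000) == 32768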