On Fri, Dec 14, 2007, Ralf S. Engelschall wrote:
> On Fri, Dec 14, 2007, Jeff Johnson wrote:
>
> > @@ -170,4 +170,8 @@
> > script kiddie progress bars across the screen.
> > - jbj: arbitrary "%foo -p /bar" scriptlets as pair'ed RPMTAG_{FOO,FOOPROG}.
> > - jbj: arbitrary triggers, like scriptlets, but with a condition check too.
> > + - jbj: the hash *ahem* algorithm at rpmio/rpmhash.c:77 hashFunctionString()
> > + is pathetic. Ditto, rpmdb/fprint.c:186 fpHashFunction(). FYI, the
> > + fingerprint hash is in the top 10 pigs when profiling rpm installs, so
> > + better has immediate performance benefits. rpmio/lookup3.c == better.
>
> Question: what is the purpose of hashFunctionString()'s result:
> to be used for mostly uniquely identify the input (aka a more
> cryptographically strong message digest) or just to be used for filing
> the input in a hash table (aka just a standard hashing function). If
> the first is the intention, then the implementation is semantically
> bad, but a better implementation will be even much slower, of course.
> If the second is the intention, we can improve it: a few years ago I've
> measured lots of hash functions and have one of the fastest and best (in
> terms of equal distribution) ones at hand...
If the function is for plain hashing, feel free to performance
compare the old one with this:
--------------------------------------------------------------------------
Index: rpmio/rpmhash.c
--- rpmio/rpmhash.c 15 Nov 2007 17:27:59 -0000 1.7
+++ rpmio/rpmhash.c 14 Dec 2007 15:30:20 -0000
@@ -77,20 +77,67 @@
static uint32_t hashFunctionString(uint32_t h, const void * data, size_t size)
/*@*/
{
- const char * chp = data;
- unsigned char sum = (unsigned char)0;
- unsigned char xor = (unsigned char)0;
- size_t i;
+ const char *key = data;
if (size == 0)
- size = strlen(chp);
- for (i = 0; i < size; i++, chp++) {
- xor ^= *chp;
- sum += *chp;
- }
-
- h += ((uint32_t)(size << 16) + (uint32_t)(sum << 8) + (uint32_t)xor);
+ size = strlen(key);
+ /*
+ * DJBX33A (Daniel J. Bernstein, Times 33 with Addition)
+ *
+ * This is Daniel J. Bernstein's popular `times 33' hash function as
+ * posted by him years ago on comp.lang.c. It basically uses a function
+ * like ``hash(i) = hash(i-1) * 33 + str[i]''. This is one of the best
+ * known hash functions for strings. Because it is both computed very
+ * fast and distributes very well.
+ *
+ * The magic of number 33, i.e. why it works better than many other
+ * constants, prime or not, has never been adequately explained by
+ * anyone. So I try an explanation: if one experimentally tests all
+ * multipliers between 1 and 256 (as RSE did now) one detects that even
+ * numbers are not useable at all. The remaining 128 odd numbers
+ * (except for the number 1) work more or less all equally well. They
+ * all distribute in an acceptable way and this way fill a hash table
+ * with an average percent of approx. 86%.
+ *
+ * If one compares the Chi^2 values of the variants, the number 33 not
+ * even has the best value. But the number 33 and a few other equally
+ * good numbers like 17, 31, 63, 127 and 129 have nevertheless a great
+ * advantage to the remaining numbers in the large set of possible
+ * multipliers: their multiply operation can be replaced by a faster
+ * operation based on just one shift plus either a single addition
+ * or subtraction operation. And because a hash function has to both
+ * distribute good _and_ has to be very fast to compute, those few
+ * numbers should be preferred and seems to be the reason why Daniel J.
+ * Bernstein also preferred it.
+ *
+ * Below you can find the variant implemented with the
+ * multiplication optimized via bit shifts and hash unrolled eight
+ * times for speed.
+ * -- Ralf S. Engelschall <rse@engelschall.com>
+ */
+ if (h == 0)
+ h = 5381;
+ for (; size >= 8; size -= 8) {
+ h = ((h << 5) + h) + *key++;
+ h = ((h << 5) + h) + *key++;
+ h = ((h << 5) + h) + *key++;
+ h = ((h << 5) + h) + *key++;
+ h = ((h << 5) + h) + *key++;
+ h = ((h << 5) + h) + *key++;
+ h = ((h << 5) + h) + *key++;
+ h = ((h << 5) + h) + *key++;
+ }
+ switch (size) {
+ case 7: h = ((h << 5) + h) + *key++; /* fallthrough... */
+ case 6: h = ((h << 5) + h) + *key++; /* fallthrough... */
+ case 5: h = ((h << 5) + h) + *key++; /* fallthrough... */
+ case 4: h = ((h << 5) + h) + *key++; /* fallthrough... */
+ case 3: h = ((h << 5) + h) + *key++; /* fallthrough... */
+ case 2: h = ((h << 5) + h) + *key++; /* fallthrough... */
+ case 1: h = ((h << 5) + h) + *key++; break;
+ default: /* case 0: */ break;
+ }
return h;
}
--------------------------------------------------------------------------
As the comment explains, the DJBX33A function is one of the best (with
the meaning of the best balance between fast and still very good
distributing) hashing functions. Please check whether it helps you to
speed up RPM...
Ralf S. Engelschall
rse@engelschall.com
www.engelschall.com
Received on Fri Dec 14 16:32:52 2007