RPM Community Forums

Mailing List Message of <rpm-devel>

Re: Hierarchical data cache using UUID's

From: Jeff Johnson <n3npq@mac.com>
Date: Wed 14 May 2008 - 10:51:09 CEST
Message-Id: <44987CFD-0B1B-4191-9300-28091B6994F7@mac.com>

On May 14, 2008, at 4:28 AM, Ralf S. Engelschall wrote:

> On Tue, May 13, 2008, Jeff Johnson wrote:
>
>> On May 13, 2008, at 7:08 PM, Dmitry V. Levin wrote:
>>
>>> On Tue, May 13, 2008 at 05:25:36PM -0400, Jeff Johnson wrote:
>>> [...]
>>>> There's a problem with throwing thousands of files into one  
>>>> directory
>>>> however.
>>>> The fix is what I'm calling "terminfo-like", where all terminfo
>>>> entries that start
>>>> with 'a' are save as a/a*, etc etc to statistically minimize the
>>>> number of files
>>>> in each sub-directory.
>>>>
>>>> Someone, somewhere, has done the implementation I've just  
>>>> described, its
>>>> too obvious and too simple not to have been attempted (my guess).
>>>>
>>>> So I'm looking for pointers to existing implementations in C.
>>>
>>> Something similar, implemented in C:
>>> Git stores objects in filesystem using SHA-1 as a hash.
>>> It uses simple dir/file scheme:
>>> hex presentation of the first byte forms directory name, and hex
>>> presentation of remained 19 bytes forms file name within this
>>> directory, e.g. de/adbeefdeadc0decafef00dbadc0deddeadbeef.
>>
>> Poifect, exactly what I was looking for. Thanks.
>
> But one has to carefully select _which_ parts of the hash to use for
> the directory structure to end up with a reasonable distribution of
> files over directories. The hex-encoding refelcts part of the internal
> structure of a UUID (and each UUID version has a different structure)
> and hence one has to make sure not to pick up the wrong bytes. For
> instance if one would pick up the last bytes of a UUIDv1 one would end
> up with all files in the same directory as UUIDv1 usually (if not made
> out of random multicast addresses) have the hosts MAC encoded there
> (which does not change!). For UUIDv3, UUIDv4 and UUIDv5 is it less a
> problem as long as one doesn't pick out form near the middle (as there
> are the version encoded, etc).
>

Yup.

While I have your attention ...

1) I had to make certain choices (like for the UUIDv1 clock bits) in  
order to
retrofit what (imho) is a "universal" (and useful) time stamp. What  
should
be put in the clock field of a retrofitted UUIDv1 time stamp so that  
it can be
usefully shared?

2) Is there any guidance on how to substitute MD5/SHA1 with some other
hash/digest? Both UUIDv3/UUIDv5 substitute for the 60bit time stamp  
in a UUIDv1
iirc. Can other 60 bit strings be used similarly? I can see uses for  
a crc64 that
can be linearly combined with other UUID/crc64's, and yet still  
should be sufficiently
collision free to preserve uniqueness.

3) Is there some conventional (or de facto dominant) store using  
UUIDs that is
commonly used on web sites? Or is that usually just handled with a db  
schema,
which is certainly pretty easy to do. Perhap crc60 instead if  
truncated crc64
cannot be linearly combined (but the 4 bits can perhaps be stashed  
somewhere).

4) The design of the URI hierarchical name space (for a package  
cache, or more generally,
for *.rpm metadata) likely needs more thought than the backing store,  
or using UUID's as
primary keys. Any thoughts?

73 de Jeff
Received on Wed May 14 10:52:44 2008
Driven by Jeff Johnson and the RPM project team.
Hosted by OpenPKG and Ralf S. Engelschall.
Powered by FreeBSD and OpenPKG.