On May 14, 2008, at 4:28 AM, Ralf S. Engelschall wrote:
> On Tue, May 13, 2008, Jeff Johnson wrote:
>
>> On May 13, 2008, at 7:08 PM, Dmitry V. Levin wrote:
>>
>>> On Tue, May 13, 2008 at 05:25:36PM -0400, Jeff Johnson wrote:
>>> [...]
>>>> There's a problem with throwing thousands of files into one
>>>> directory
>>>> however.
>>>> The fix is what I'm calling "terminfo-like", where all terminfo
>>>> entries that start
>>>> with 'a' are save as a/a*, etc etc to statistically minimize the
>>>> number of files
>>>> in each sub-directory.
>>>>
>>>> Someone, somewhere, has done the implementation I've just
>>>> described, its
>>>> too obvious and too simple not to have been attempted (my guess).
>>>>
>>>> So I'm looking for pointers to existing implementations in C.
>>>
>>> Something similar, implemented in C:
>>> Git stores objects in filesystem using SHA-1 as a hash.
>>> It uses simple dir/file scheme:
>>> hex presentation of the first byte forms directory name, and hex
>>> presentation of remained 19 bytes forms file name within this
>>> directory, e.g. de/adbeefdeadc0decafef00dbadc0deddeadbeef.
>>
>> Poifect, exactly what I was looking for. Thanks.
>
> But one has to carefully select _which_ parts of the hash to use for
> the directory structure to end up with a reasonable distribution of
> files over directories. The hex-encoding refelcts part of the internal
> structure of a UUID (and each UUID version has a different structure)
> and hence one has to make sure not to pick up the wrong bytes. For
> instance if one would pick up the last bytes of a UUIDv1 one would end
> up with all files in the same directory as UUIDv1 usually (if not made
> out of random multicast addresses) have the hosts MAC encoded there
> (which does not change!). For UUIDv3, UUIDv4 and UUIDv5 is it less a
> problem as long as one doesn't pick out form near the middle (as there
> are the version encoded, etc).
>
Yup.
While I have your attention ...
1) I had to make certain choices (like for the UUIDv1 clock bits) in
order to
retrofit what (imho) is a "universal" (and useful) time stamp. What
should
be put in the clock field of a retrofitted UUIDv1 time stamp so that
it can be
usefully shared?
2) Is there any guidance on how to substitute MD5/SHA1 with some other
hash/digest? Both UUIDv3/UUIDv5 substitute for the 60bit time stamp
in a UUIDv1
iirc. Can other 60 bit strings be used similarly? I can see uses for
a crc64 that
can be linearly combined with other UUID/crc64's, and yet still
should be sufficiently
collision free to preserve uniqueness.
3) Is there some conventional (or de facto dominant) store using
UUIDs that is
commonly used on web sites? Or is that usually just handled with a db
schema,
which is certainly pretty easy to do. Perhap crc60 instead if
truncated crc64
cannot be linearly combined (but the 4 bits can perhaps be stashed
somewhere).
4) The design of the URI hierarchical name space (for a package
cache, or more generally,
for *.rpm metadata) likely needs more thought than the backing store,
or using UUID's as
primary keys. Any thoughts?
73 de Jeff
Received on Wed May 14 10:52:44 2008