A recent paper published by Mancoosi
describes (and has the necessary references for "Bloom filters")
an efficient algorithm for distributing package repository contents.
The fundamental problem with 100's of package repositories that are
in multiple locations is choosing the "best" repository from which to
download a package. There is usually duplication of package content when
mirroring, and repositories often have overlapped package content,
i.e. package that are sufficiently similar to be used interchangeably.
The paper uses "Bloom filters" which have a long history in combining
parallel/distributed database queries to optimize network bandwidth
usage in package discovery protocols.
But "packages" contained in "repositories" has an analogous problem
within RPM detecting "files" in "packages", also known as detecting
There are two locations within RPM code that build an in-memory store
of file paths in order to detect (essentially) set intersections
of file paths provided (or contained) in multiple packages.
One of the RPM code locations was described here
Using bloom filters to detect file dependencies is another
approach, complementary to using a rpmal backing store,
to reduce rpmlib memory footprint and scale to handling
a larger number of file paths in a given transaction.
The other RPM code location where Bloom filters have a
usage case (imho) is in detecting file conflicts between packages,
which has just shown up as a problem again
Precomputing Bloom filters and adding to *.rpm metadata,
(likely but not necessarily) through new tags in
package headers would be the first step towards
a better (i.e. smaller memory footprint, increased scalability
to more file paths in a transaction) approach to detect file conflicts.
73 de Jeff
Received on Wed Jan 14 15:48:19 2009
- application/pkcs7-signature attachment: smime.p7s