RPM Community Forums

Mailing List Message of <rpm-devel>

Using Bloom filters to detect file conflicts

From: Jeff Johnson <n3npq@mac.com>
Date: Wed 14 Jan 2009 - 15:47:54 CET
Message-id: <61977999-AE18-48D6-9D24-4E9B30CBE569@mac.com>
A recent paper published by Mancoosi
	http://www.mancoosi.org/papers/edbt09.pdf
describes (and has the necessary references for "Bloom filters")
an efficient algorithm for distributing package repository contents.

The fundamental problem with 100's of package repositories that are  
mirrored
in multiple locations is choosing the "best" repository from which to
download a package. There is usually duplication of package content when
mirroring, and repositories often have overlapped package content,
i.e. package that are sufficiently similar to be used interchangeably.

The paper uses "Bloom filters" which have a long history in combining
parallel/distributed database queries to optimize network bandwidth
usage in package discovery protocols.

But "packages" contained in "repositories" has an analogous problem
within RPM detecting "files" in "packages", also known as detecting
"file conflicts".

There are two locations within RPM code that build an in-memory store
of file paths in order to detect (essentially) set intersections
of file paths provided (or contained) in multiple packages.

One of the RPM code locations was described here
	http://rpm5.org/community/rpm-devel/3210.html
Using bloom filters to detect file dependencies is another
approach, complementary to using a rpmal backing store,
to reduce rpmlib memory footprint and scale to handling
a larger number of file paths in a given transaction.

The other RPM code location where Bloom filters have a
usage case (imho) is in detecting file conflicts between packages,
which has just shown up as a problem again
	http://rpm5.org/community/rpm-devel/3378.html

Precomputing Bloom filters and adding to *.rpm metadata,
(likely but not necessarily) through new tags in
package headers would be the first step towards
a better (i.e. smaller memory footprint, increased scalability
to more file paths in a transaction) approach to detect file conflicts.

Opinions?

73 de Jeff




  • application/pkcs7-signature attachment: smime.p7s
Received on Wed Jan 14 15:48:19 2009
Driven by Jeff Johnson and the RPM project team.
Hosted by OpenPKG and Ralf S. Engelschall.
Powered by FreeBSD and OpenPKG.