On Sat, Feb 09, 2008, Jeff Johnson wrote:
> (aside) I first made this request 4+ years ago:
>
> https://lists.dulug.duke.edu/pipermail/rpm-devel/2004-November/000139.html
>
> That's how long it's taken to restart rpm development, dealing with issues
> like rpmrc files and NPTL in rpmdb and multilib and selinux and forks and
> ...
>
> Since 2004 I have managed to get back to the point where the href's
> contained within a plain (non-DAV) URI need to be iterated for
> Opendir/Glob functionality in rpmio.
>
> The best (i.e. most maintainable and least surprising imho) choice proposed
> was -lxml2:
>
> I have a modified testHTML.c from libxml2 and indeed the HTML parser in
> libxml2 can be used.
>
> There's also lhtml, a Lua HTML parser around these days.
>
> If up to me, I'm going to pare down testHTML.c to extract the href's within
> so that rpmio Opendir/Glob function through plain HTTP.
>
> That does mean that libxml2 is mandatory if you want plain HTTP support.
> neon already needs an XML parser, typically expat is used, but one could
> in principle choose the already supported libxml2 for neon use.
>
> Any other ideas?

If your only purpose for an XML library is to extract hyperlinks
(<a href="..."> tags) from HTML/XHTML pages, I strongly recommend _not_
using a fully featured XML library. Why? First, unless the XML library
explicitly supports a "smart HTML mode", it can easily fail on even
slightly broken HTML (the HTML found on the net is usually far from
strictly conforming XHTML, and therefore far from well-formed XML).
Second, a dependency on libxml2 is rather heavy for this purpose, because
libxml2 in turn requires other libraries such as libiconv. Alternatives
could be AXL or libnxml, as those have fewer dependencies and are a lot
smaller.

But for just extracting hyperlinks I personally would simply use a few
moderately complex regular expressions. RPM already uses regular
expressions, so no additional library dependency is required, and one can
more easily accept all kinds of broken HTML, too.
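
To illustrate the regular-expression approach, here is a minimal sketch using POSIX <regex.h>. It is not code from RPM: the helper name extract_hrefs() and its interface are hypothetical, and the pattern is only one plausible choice. It accepts double-quoted, single-quoted and unquoted href values in any letter case, and does not try to handle every corner case (comments, scripts, multiple href-like attributes in one tag).

```c
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of rpmio): collect up to `max` href values
 * found in <a ...> tags of `html` into `out`. Each entry is malloc()ed and
 * must be freed by the caller. Returns the number of hrefs stored, or -1
 * on error. */
static int extract_hrefs(const char *html, char **out, int max)
{
    /* Group 3: "double-quoted", group 4: 'single-quoted', group 5: unquoted. */
    static const char pattern[] =
        "<a[[:space:]]+([^>]*[[:space:]])?href[[:space:]]*=[[:space:]]*"
        "(\"([^\"]*)\"|'([^']*)'|([^[:space:]>\"']+))";
    regex_t re;
    regmatch_t m[6];
    const char *p = html;
    int n = 0;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE) != 0)
        return -1;

    while (n < max && regexec(&re, p, 6, m, 0) == 0) {
        /* Exactly one of groups 3..5 takes part in any match. */
        int g = (m[3].rm_so != -1) ? 3 : (m[4].rm_so != -1) ? 4 : 5;
        size_t len = (size_t)(m[g].rm_eo - m[g].rm_so);
        char *s = malloc(len + 1);

        if (s == NULL) {
            /* Out of memory: release what was collected so far. */
            while (n > 0)
                free(out[--n]);
            regfree(&re);
            return -1;
        }
        memcpy(s, p + m[g].rm_so, len);
        s[len] = '\0';
        out[n++] = s;

        p += m[0].rm_eo;    /* continue scanning after this match */
    }

    regfree(&re);
    return n;
}
```

A caller would pass the fetched page body plus an array of char pointers, then iterate over the returned entries (e.g. to feed an Opendir/Glob listing) and free() each one. Relative hrefs would still have to be resolved against the base URI, which this sketch leaves out.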
Ralf S. Engelschall
rse@engelschall.com
www.engelschall.com
Received on Sat Feb 9 18:52:34 2008