RPM Community Forums

Mailing List Message of <rpm-devel>

Re: Anyone know of a tasteful LGPL HTML parser in C?

From: Ralf S. Engelschall <rse+rpm-devel@rpm5.org>
Date: Sat 09 Feb 2008 - 18:51:12 CET
Message-ID: <20080209175112.GA40845@engelschall.com>

On Sat, Feb 09, 2008, Jeff Johnson wrote:

> (aside) I first made this request 4+ years ago:
>
> https://lists.dulug.duke.edu/pipermail/rpm-devel/2004-November/000139.html
>
> That's how long it's taken to restart rpm development, dealing with issues
> like rpmrc files and NPTL in rpmdb and multilib and selinux and forks and
> ...
>
> Since 2004 I have managed to get back to the point where the href's
> contained within a plain (non-DAV) URI need to be iterated for
> Opendir/Glob functionality in rpmio.
>
> The best (i.e. most maintainable and least surprising imho) choice proposed
> was -lxml2:
>
> I have a modified testHTML.c from libxml2 and indeed the HTML parser in
> libxml2 can be used.
>
> There's also lhtml, a Lua HTML parser around these days.
>
> If up to me, I'm going to pare down testHTML.c to extract the href's within
> so that rpmio Opendir/Glob function through plain HTTP.
>
> That does mean that libxml2 is mandatory if you want plain HTTP support.
> neon already needs an XML parser, typically expat is used, but one could
> in principle choose the already supported libxml2 for neon use.
>
> Any other ideas?

If your only purpose for an XML library is to extract hyperlinks
(<a href="..."> tags) from HTML/XHTML pages, I strongly recommend _not_
using any full-featured XML library. Why? First, unless the XML library
explicitly supports a "smart HTML mode", it will too easily fail on
even slightly broken HTML (the HTML flying around on the net is usually
far from strictly conforming XHTML, and therefore far from XML).
Second, a dependency on LibXML might be a bit much for this purpose,
because LibXML requires other libraries such as libiconv. Alternatives
could be AXL or libnxml, as those have fewer dependencies and are a lot
smaller.

But for just extracting hyperlinks I personally would simply use some
medium-complex regular expressions. RPM already uses regular
expressions, so no additional library dependency is required, and one
can more easily accept all kinds of broken HTML, too.
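
To illustrate the idea, here is a minimal sketch, assuming POSIX
<regex.h> is available. The helper name extract_hrefs, the HREF_MAX
limit and the pattern itself are hypothetical and not taken from the RPM
sources. The pattern accepts double-quoted, single-quoted and unquoted
attribute values and ignores case, which covers much of the sloppy HTML
seen in directory listings.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

#define HREF_MAX 256   /* assumed upper bound on a stored href value */

/*
 * Hypothetical helper: scan html for <a ... href=...> tags and copy up
 * to max href values into out. Returns the number of values stored.
 */
size_t extract_hrefs(const char *html, char out[][HREF_MAX], size_t max)
{
    /* Group 2: "double quoted", group 3: 'single quoted', group 4: bare. */
    static const char *pattern =
        "<a[[:space:]][^>]*href[[:space:]]*=[[:space:]]*"
        "(\"([^\"]*)\"|'([^']*)'|([^[:space:]>\"']+))";
    regex_t re;
    regmatch_t m[5];
    size_t n = 0;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE) != 0)
        return 0;

    while (n < max && regexec(&re, html, 5, m, 0) == 0) {
        /* Exactly one of the groups 2..4 took part in the match. */
        int g = (m[2].rm_so != -1) ? 2 : (m[3].rm_so != -1) ? 3 : 4;
        size_t len = (size_t)(m[g].rm_eo - m[g].rm_so);

        if (len >= HREF_MAX)
            len = HREF_MAX - 1;          /* truncate overlong values */
        memcpy(out[n], html + m[g].rm_so, len);
        out[n][len] = '\0';
        n++;

        html += m[0].rm_eo;              /* continue after this match */
    }

    regfree(&re);
    return n;
}
```

A caller would pass the fetched page body and an array such as
"char hrefs[64][HREF_MAX]", then iterate over the returned count. The
sketch does not decode entities like &amp;, does not resolve relative
URLs, and would also match attributes that merely end in "href", so it
is a starting point rather than a complete solution.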

                                       Ralf S. Engelschall
                                       rse@engelschall.com
                                       www.engelschall.com
Received on Sat Feb 9 18:52:34 2008