On Sat, Feb 09, 2008, Ralf S. Engelschall wrote:
> On Sat, Feb 09, 2008, Jeff Johnson wrote:
>
> > (aside) I first made this request 4+ years ago:
> >
> > https://lists.dulug.duke.edu/pipermail/rpm-devel/2004-November/000139.html
> >
> > That's how long it's taken to restart rpm development, dealing with issues
> > like rpmrc files and NPTL in rpmdb and multilib and selinux and forks and
> > ...
> >
> > Since 2004 I have managed to get back to the point where the href's
> > contained within a plain (non-DAV) URI need to be iterated for
> > Opendir/Glob functionality in rpmio.
> >
> > The best (i.e. most maintainable and least surprising imho) choice proposed
> > was -lxml2:
> >
> > I have a modified testHTML.c from libxml2 and indeed the HTML parser in
> > libxml2 can be used.
> >
> > There's also lhtml, a Lua HTML parser around these days.
> >
> > If it's up to me, I'm going to pare down testHTML.c to extract the href's within
> > so that rpmio Opendir/Glob function through plain HTTP.
> >
> > That does mean that libxml2 is mandatory if you want plain HTTP support.
> > neon already needs an XML parser, typically expat is used, but one could
> > in principle choose the already supported libxml2 for neon use.
> >
> > Any other ideas?
>
> If your only purpose for an XML library is to extract hyperlinks
> (<a href="..."> tags) from HTML/XHTML pages, I strongly recommend _not_
> using any fully featured XML library. Why? First, because if the XML
> library does not explicitly support a "smart HTML mode", you can too
> easily fail when confronted with even slightly broken HTML (the HTML
> flying around on the net is usually far from strictly conforming XHTML,
> and thus far from XML). Second, the dependency on LibXML might be a bit
> too much for this purpose, because LibXML requires other libraries like
> libiconv, etc. Alternatives could be AXL or libnxml, as those have fewer
> dependencies and are a lot smaller.
>
> But for just extracting hyperlinks I personally would simply leverage some
> medium-complex regular expressions. RPM already uses regular expressions,
> so no additional dependency on another library is required and one
> can more easily accept all types of broken HTML, too.

Oh, sorry, I forgot to give you an example of the regex I'm thinking
about (using PCRE functionality to make it easier, but it can be changed
to work with plain POSIX functionality, too):

  (?i)<a(?:\s+[a-z][a-z0-9_]*(?:=(?:"[^"]*"|\S+))?)*?\s+href=(?:"([^"]*)"|(\S+))

I've not tested it, so perhaps it is still buggy. But it should already
give you an impression of what I'm thinking about. It should not become
much more complex than this...
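
To make the idea concrete, here is a minimal sketch that applies the
regex exactly as written above to a small directory-listing page. It is
in Python rather than C purely for brevity, since Python's re module
accepts the same PCRE-style syntax. The function name extract_hrefs and
the sample page are made up for illustration. Trimming unquoted values at
the first '>' is an assumption added here, because \S+ also consumes the
closing bracket of the tag and whatever follows it up to the next
whitespace.

```python
import re

# The regex copied verbatim from the message above.
HREF_RE = re.compile(
    r'(?i)<a(?:\s+[a-z][a-z0-9_]*(?:=(?:"[^"]*"|\S+))?)*?\s+href=(?:"([^"]*)"|(\S+))'
)


def extract_hrefs(html):
    """Return the href values of all <a> tags found in html (hypothetical helper)."""
    hrefs = []
    for match in HREF_RE.finditer(html):
        quoted, unquoted = match.group(1), match.group(2)
        if quoted is not None:
            # href="..." form: group 1 holds the text between the quotes.
            hrefs.append(quoted)
        else:
            # href=value form: \S+ runs past the closing '>' of the tag,
            # so keep only the part before the first '>' (an assumption of
            # this sketch, not part of the original regex).
            hrefs.append(unquoted.split(">", 1)[0])
    return hrefs


# A made-up listing page with upper-case tags, an extra attribute,
# and one unquoted href.
SAMPLE_PAGE = """<html><body>
<A HREF="../">Parent Directory</A>
<a class="f" href="foo-1.0.rpm">foo</a>
<a href=bar-2.0.rpm>bar</a>
</body></html>"""

if __name__ == "__main__":
    for href in extract_hrefs(SAMPLE_PAGE):
        print(href)
```

On the sample page this prints ../, foo-1.0.rpm and bar-2.0.rpm, one per
line. Two limitations show up quickly when experimenting: attribute names
containing a hyphen before href prevent a match, and single-quoted values
are returned with their quotes, so the pattern would likely need small
extensions for real-world listings.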
Ralf S. Engelschall
rse@engelschall.com
www.engelschall.com
Received on Sat Feb 9 19:27:58 2008