RPM Community Forums

Mailing List Message of <rpm-devel>

Re: Anyone know of a tasteful LGPL HTML parser in C?

From: Jeff Johnson <n3npq@mac.com>
Date: Sat 09 Feb 2008 - 19:25:47 CET
Message-Id: <9CFA5DD8-E13C-47F1-B167-892685D5A1A0@mac.com>

On Feb 9, 2008, at 12:51 PM, Ralf S. Engelschall wrote:

> On Sat, Feb 09, 2008, Jeff Johnson wrote:
>
>> (aside) I first made this request 4+ years ago:
>>
>> https://lists.dulug.duke.edu/pipermail/rpm-devel/2004-November/000139.html
>>
>> That's how long it's taken to restart rpm development, dealing with
>> issues like rpmrc files and NPTL in rpmdb and multilib and selinux and
>> forks and ...
>>
>> Since 2004 I have managed to get back to the point where the href's
>> contained within a plain (non-DAV) URI need to be iterated for
>> Opendir/Glob functionality in rpmio.
>>
>> The best (i.e. most maintainable and least surprising imho) choice
>> proposed was -lxml2:
>>
>> I have a modified testHTML.c from libxml2 and indeed the HTML parser in
>> libxml2 can be used.
>>
>> There's also lhtml, a Lua HTML parser around these days.
>>
>> If up to me, I'm going to pare down testHTML.c to extract the href's
>> within so that rpmio Opendir/Glob function through plain HTTP.
>>
>> That does mean that libxml2 is mandatory if you want plain HTTP support.
>> neon already needs an XML parser, typically expat is used, but one could
>> in principle choose the already supported libxml2 for neon use.
>>
>> Any other ideas?
>
> If your only purpose for an XML library is to extract hyperlinks (tags
> "<a href="...">") from HTML/XHTML pages, I strongly recommend _not_
> using any fully-featured XML library. Why? First, because if the XML
> library does not explicitly support a "smart HTML mode" you might too
> easily fail when confronted with just slightly broken HTML (as
> the HTML flying around on the net is usually far away from strictly
> conforming XHTML and thus far away from XML). Second, the dependency
> on LibXML might be a little bit too much for this purpose because LibXML
> requires other libraries like libiconv, etc. Alternatives could be AXL
> or libnxml as those have fewer dependencies and are a lot smaller.
>

Portability needs/problems are well known. That's basically why I did DAV
first instead.

The needs for iterating Opendir are dirt simple. I need the analogue
of dp->d_name from struct dirent in an ARGV_t array. That's all I am
getting from DAV "collections". For performance I'd like a few
other items at the same time, but I can always run a HEAD to
get what is needed if necessary.
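
As an illustration of how little is needed here, the following is a minimal sketch of collecting entry names (the analogue of dp->d_name) into a NULL-terminated, argv-style array. The `name_list` type and the `name_list_add`/`name_list_free` helpers are hypothetical names made up for this sketch; they are not rpm's ARGV_t API. The sketch assumes the entry name is simply the last path component of each href.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an argv-style array; NOT rpm's ARGV_t API. */
typedef struct {
    char **names;   /* NULL-terminated once at least one add was attempted */
    size_t count;   /* number of stored names, excluding the terminator */
} name_list;

/*
 * Append the last path component of href (the analogue of dp->d_name).
 * One trailing '/' is ignored, so "subdir/" yields "subdir".
 * Returns 0 on success (or when there is nothing to add), -1 on
 * allocation failure.
 */
int name_list_add(name_list *nl, const char *href)
{
    assert(nl != NULL && href != NULL);

    size_t len = strlen(href);
    if (len > 0 && href[len - 1] == '/')
        len--;

    size_t start = len;
    while (start > 0 && href[start - 1] != '/')
        start--;
    if (start == len)
        return 0;                       /* e.g. "" or "/": no name */

    /* Room for the new name plus the NULL terminator. */
    char **grown = realloc(nl->names, (nl->count + 2) * sizeof(*grown));
    if (grown == NULL)
        return -1;
    nl->names = grown;
    nl->names[nl->count] = NULL;        /* keep the array terminated */

    char *name = malloc(len - start + 1);
    if (name == NULL)
        return -1;
    memcpy(name, href + start, len - start);
    name[len - start] = '\0';

    nl->names[nl->count++] = name;
    nl->names[nl->count] = NULL;
    return 0;
}

/* Release every stored name and the array itself. */
void name_list_free(name_list *nl)
{
    for (size_t i = 0; i < nl->count; i++)
        free(nl->names[i]);
    free(nl->names);
    nl->names = NULL;
    nl->count = 0;
}
```

Starting from `name_list nl = { NULL, 0 };`, a caller would pass each href found in the directory page to `name_list_add()` and then walk `nl.names` up to the NULL terminator, much as one would walk argv.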

For Fts(3) I will also need the attached href pointers to traverse the
hierarchy logically, plus handling for loops and cross-site links and all
the other goop needed too.

> But for just extracting hyperlinks I personally would just leverage some
> medium-complex regular expressions. RPM already uses regular expressions
> so there is no additional dependency on another library required and one
> can more easily accept all types of broken HTML, too.
>
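
For reference, the regular-expression approach suggested above might look roughly like the sketch below, using POSIX regcomp()/regexec(). The pattern, the `HREF_MAX` limit and the `extract_hrefs` function are assumptions made up for this illustration, not code from rpm. The pattern only handles quoted attribute values and would also match the string "href" in places other than an <a> tag, which hints at the kind of corner cases a hand-rolled ripper runs into.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

#define HREF_MAX 256   /* illustrative limit on a stored href value */

/* Matches href="..." or href='...'; unquoted values are NOT handled. */
#define HREF_PATTERN \
    "href[[:space:]]*=[[:space:]]*(\"([^\"]*)\"|'([^']*)')"

/*
 * Copy up to max href values found in html into out.
 * Returns the number of values stored, or -1 if the pattern fails to
 * compile. Values longer than HREF_MAX - 1 bytes are truncated.
 */
int extract_hrefs(const char *html, char out[][HREF_MAX], int max)
{
    regex_t re;
    regmatch_t m[4];    /* whole match, quoted form, "value", 'value' */
    int n = 0;

    if (regcomp(&re, HREF_PATTERN, REG_EXTENDED | REG_ICASE) != 0)
        return -1;

    while (n < max && regexec(&re, html, 4, m, 0) == 0) {
        /* Exactly one of the two inner groups took part in the match. */
        const regmatch_t *v = (m[2].rm_so != -1) ? &m[2] : &m[3];
        size_t len = (size_t)(v->rm_eo - v->rm_so);
        if (len >= HREF_MAX)
            len = HREF_MAX - 1;
        memcpy(out[n], html + v->rm_so, len);
        out[n][len] = '\0';
        n++;
        html += m[0].rm_eo;     /* resume scanning after this match */
    }

    regfree(&re);
    return n;
}
```

REG_ICASE is used so that `HREF=` and `href=` are treated alike; anything beyond that (unquoted values, entities, relative-URL resolution) would need further patterns or a real parser.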

Yup, I need an HTML parser. I have an HTML href ripper from Alan Cox (see my
original request) that could be used. However, if I write my own custom
HTML ripper I'm going to be plagued by obscurely obscene HTML parser issues
until the year 2020. I hope to have retired to Tahiti by then instead ;-)

Which is my winding rationale for choosing the HTML parser in libxml2 instead
of just hacking out some RE's. Yes, build bloat, but rpm->neon-> has already
dragged in many, many build-bloat libraries.

73 de Jeff
Received on Sat Feb 9 19:26:04 2008