On Feb 9, 2008, at 12:51 PM, Ralf S. Engelschall wrote:
> On Sat, Feb 09, 2008, Jeff Johnson wrote:
>
>> (aside) I first made this request 4+ years ago:
>>
>> https://lists.dulug.duke.edu/pipermail/rpm-devel/2004-November/000139.html
>>
>> That's how long it's taken to restart rpm development, dealing with
>> issues like rpmrc files and NPTL in rpmdb and multilib and selinux
>> and forks and ...
>>
>> Since 2004 I have managed to get back to the point where the hrefs
>> contained within a plain (non-DAV) URI need to be iterated for
>> Opendir/Glob functionality in rpmio.
>>
>> The best (i.e. most maintainable and least surprising imho) choice
>> proposed was -lxml2:
>>
>> I have a modified testHTML.c from libxml2 and indeed the HTML
>> parser in libxml2 can be used.
>>
>> There's also lhtml, a Lua HTML parser around these days.
>>
>> If it's up to me, I'm going to pare down testHTML.c to extract the
>> hrefs within so that rpmio Opendir/Glob function through plain
>> HTTP.
>>
>> That does mean that libxml2 is mandatory if you want plain HTTP
>> support. neon already needs an XML parser (typically expat is
>> used), but one could in principle choose the already supported
>> libxml2 for neon use.
>>
>> Any other ideas?
>
> If your only purpose for an XML library is to extract hyperlinks
> (tags like <a href="...">) from HTML/XHTML pages, I strongly
> recommend _not_ using any fully-featured XML library. Why? First,
> because if the XML library does not explicitly support a "smart HTML
> mode" you might too easily fail when confronted with just slightly
> broken HTML (the HTML flying around on the net is usually far from
> strictly conforming XHTML, and therefore far from XML). Second, the
> dependency on LibXML might be a little bit too much for this purpose
> because LibXML requires other libraries like libiconv, etc.
> Alternatives could be AXL or libnxml, as those have fewer
> dependencies and are a lot smaller.
>
Portability needs/problems are well known. That's basically why I did
DAV first instead.

The needs for iterating Opendir are dirt simple. I need the analogue
of dp->d_name from struct dirent in an ARGV_t array. That's all I am
getting from DAV "collections". For performance I'd like a few
other items at the same time, but I can always run a HEAD to
get what is needed if necessary.

For Fts(3) I will also need the attached href pointers to traverse
the hierarchy logically, with loop and cross-site handling and all
the other goop needed too.

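To make the "analogue of dp->d_name" concrete, here is a minimal sketch
of reducing one href to a directory-entry style name. The function name
href_to_name is hypothetical (it is not an rpmio API), and the rules it
applies (drop any query string or fragment, drop trailing slashes, keep
the last path component) are assumptions about what a plain HTTP index
page would need, not something taken from the rpm sources.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of rpmio): reduce an href to a
 * d_name-like entry name. Returns a malloc'd string that the caller
 * frees, or NULL if no name remains or allocation fails.
 */
char *href_to_name(const char *href)
{
    size_t len, start;
    char *name;

    assert(href != NULL);

    /* Ignore any query string or fragment, e.g. index sort links. */
    len = strcspn(href, "?#");

    /* Drop trailing slashes, as found on directory-style links. */
    while (len > 0 && href[len - 1] == '/')
        len--;

    /* Keep only what follows the last remaining slash. */
    start = len;
    while (start > 0 && href[start - 1] != '/')
        start--;

    if (start == len)
        return NULL;    /* nothing left, e.g. "/" or "?C=N;O=D" */

    name = malloc(len - start + 1);
    if (name == NULL)
        return NULL;
    memcpy(name, href + start, len - start);
    name[len - start] = '\0';
    return name;
}
```

Each non-NULL result would then be appended to the argv-style array.
An absolute URL with no path component would come back as its host
name here, so real code would want proper URL parsing first.
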
> But for just extracting hyperlinks I personally would just leverage
> some medium-complex regular expressions. RPM already uses regular
> expressions, so no additional dependency on another library is
> required, and one can more easily accept all types of broken HTML,
> too.
>
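For reference, the regular-expression approach suggested above might
look roughly like the sketch below, using POSIX regcomp()/regexec().
The names href_extract and href_free and the pattern itself are
illustrative assumptions, not code from rpm. The pattern only handles
quoted attribute values, and it will also match attribute names that
merely end in "href".

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Free a NULL-terminated array returned by href_extract(). */
void href_free(char **av)
{
    char **p;

    if (av == NULL)
        return;
    for (p = av; *p != NULL; p++)
        free(*p);
    free(av);
}

/*
 * Hypothetical helper: collect every quoted href value found in html
 * into a NULL-terminated, malloc'd array stored in *out.
 * Returns the number of values found, or -1 on error.
 */
int href_extract(const char *html, char ***out)
{
    /* Group 2 holds a double-quoted value, group 3 a single-quoted one. */
    static const char pattern[] =
        "href[[:space:]]*=[[:space:]]*(\"([^\"]*)\"|'([^']*)')";
    regex_t re;
    regmatch_t m[4];
    char **av, **tmp;
    int ac = 0;

    assert(html != NULL && out != NULL);

    if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE) != 0)
        return -1;
    av = malloc(sizeof(*av));
    if (av == NULL) {
        regfree(&re);
        return -1;
    }
    av[0] = NULL;

    while (regexec(&re, html, 4, m, 0) == 0) {
        int g = (m[2].rm_so >= 0) ? 2 : 3;
        size_t len = (size_t)(m[g].rm_eo - m[g].rm_so);

        /* Grow the array by one slot, keeping room for the terminator. */
        tmp = realloc(av, (size_t)(ac + 2) * sizeof(*av));
        if (tmp == NULL)
            goto fail;
        av = tmp;
        av[ac] = malloc(len + 1);
        if (av[ac] == NULL)
            goto fail;
        memcpy(av[ac], html + m[g].rm_so, len);
        av[ac][len] = '\0';
        av[++ac] = NULL;

        /* Resume scanning just past this match. */
        html += m[0].rm_eo;
    }
    regfree(&re);
    *out = av;
    return ac;

fail:
    regfree(&re);
    href_free(av);
    return -1;
}
```

Unquoted values, entity decoding, relative-URL resolution and links
inside comments or scripts are all ignored here, which is the kind of
corner case the reply below is concerned about.
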
Yup, I need an HTML parser. I have an HTML href ripper from Alan Cox
(see my original request) that could be used. However, if I write my
own custom HTML ripper I'm going to be plagued by obscurely obscene
HTML parser issues until the year 2020. I hope to have retired to
Tahiti by then instead ;-)

Which is my winding rationale for choosing the HTML parser in libxml2
instead of just hacking out some REs. Yes, build bloat, but
rpm->neon-> already dragged in many many build bloat libraries.

73 de Jeff
Received on Sat Feb 9 19:26:04 2008