RPM Community Forums

Mailing List Message of <rpm-devel>

Re: Anyone know of a tasteful LGPL HTML parser in C?

From: Jeff Johnson <n3npq@mac.com>
Date: Thu 14 Feb 2008 - 03:32:56 CET
Message-Id: <5F25D511-35F8-415B-AD13-09589F17E8F9@mac.com>

On Feb 9, 2008, at 1:26 PM, Ralf S. Engelschall wrote:

>
> (?i)<a(?:\s+[a-z][a-z0-9_]*(?:=(?:"[^"]*"|\S+))?)*?\s+href=(?:"([^"]*)"|(\S+))
>
> I've not tested it, so perhaps it is still buggy. But it should already
> give you an impression of what I'm thinking about. A lot more complex it
> should not become...
>

Now that rpmgrep exists (so that I can apply PCRE regexes to URIs),
I can see from what the RE above matches, when run as

     rpmgrep --color HREF_PATTERN_ABOVE http://rpm5.org/

that the pattern you gave me is a very promising first step that may
be an alternative to using the libxml2 HTML parser.
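
As a rough illustration of what the pattern captures, here is a minimal Python sketch (not rpmgrep itself) that applies the quoted RE, joined back onto one line, to a made-up index page fragment. The names HREF_RE, extract_hrefs and SAMPLE are hypothetical, and the sample only uses quoted attribute values: an unquoted href would be captured by \S+, which also swallows the closing '>' and anything after it up to the next whitespace.

```python
import re

# The pattern from the quoted message, joined back onto one line.
HREF_RE = re.compile(
    r'(?i)<a(?:\s+[a-z][a-z0-9_]*(?:=(?:"[^"]*"|\S+))?)*?\s+href=(?:"([^"]*)"|(\S+))'
)


def extract_hrefs(html):
    """Return the href value of every <a ...> tag that the pattern matches."""
    hrefs = []
    for m in HREF_RE.finditer(html):
        # Group 1 holds a double-quoted value, group 2 an unquoted one.
        hrefs.append(m.group(1) if m.group(1) is not None else m.group(2))
    return hrefs


# Hypothetical fragment of a plain HTTP directory listing (an assumption,
# not the actual content of http://rpm5.org/).
SAMPLE = '''
<a href="../">Parent Directory</a>
<A class="dir" HREF="SRPMS/">SRPMS/</A>
<a title="pkg" href="popt-1.13-1.src.rpm">popt</a>
'''

if __name__ == "__main__":
    for href in extract_hrefs(SAMPLE):
        print(href)
```

The (?i) flag is what lets the same pattern match both `<a href=` and `<A HREF=`.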

What remains to do is to find the elements in the "collection" of a
plain HTTP URI that are analogues of Readdir(3) dp->d_name.
The hrefs are (if you will) the equivalent of Readlink(2) end-points;
the names of the element(s) in the "collection" will also need to
be matched.

Hmmm, "collections" (as in sub-directories to be traversed) always seem
to have the pesky trailing '/'; that might be sufficient to distinguish
DIRECTORY from FILE.

Perhaps a pattern to match *.rpm suffix on the href is the analogue
of a FILE in the "collection" when using rpmio traversal through
plain (i.e. non-DAV) HTTP transport.
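
The two heuristics above (trailing '/' means DIRECTORY, a *.rpm suffix means FILE) could be sketched as follows. This is only an assumption about how the classification might look, not rpmio code: classify_href and entry_name are hypothetical helpers, and taking the last path component of the href as the d_name analogue is a guess, not something settled in this thread.

```python
def classify_href(href):
    """Classify an href using the trailing-'/' and *.rpm heuristics."""
    if href.endswith('/'):
        return 'DIRECTORY'      # sub-collection to be traversed
    if href.lower().endswith('.rpm'):
        return 'FILE'           # package file inside the collection
    return 'OTHER'              # anything else (index pages, anchors, ...)


def entry_name(href):
    """Assumed d_name analogue: the last path component of the href."""
    return href.rstrip('/').rsplit('/', 1)[-1]


if __name__ == "__main__":
    # Hypothetical hrefs, as they might come out of the pattern match.
    for href in ['SRPMS/', 'popt-1.13-1.src.rpm', 'index.html']:
        print(classify_href(href), entry_name(href))
```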

Dunno. More actual experience is needed, I'll hack up some scriptie
tomorrow.

(off the wall aside) I never would have dreamed that I would ever find
colorized grep output useful. Adding --color to display the value that
is matched by the pattern is so so so much less eye bleed.

Thank you!

73 de Jeff
Received on Thu Feb 14 03:33:08 2008