If you only want to remove certain parts, you could take a look at HTML::TreeBuilder and friends and use them to selectively pull out the elements you want from what's there. Alternatively, if your HTML is well-formed enough, you could use XML::Twig to do something similar.
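A minimal sketch of the HTML::TreeBuilder approach, in case it helps. The sample markup and the choice of which tags to drop or unwrap are assumptions made up for illustration, not something from the original question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample input, invented for this sketch.
my $html = '<p>Hello <font color="red">big <b>world</b></font></p>'
         . '<script>alert("evil")</script>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# Drop these elements together with everything inside them.
$_->delete for $tree->find_by_tag_name('script');

# Unwrap these elements: the tag goes away, its children stay in place.
$_->replace_with_content->delete for $tree->find_by_tag_name('font');

# Serialize only the children of <body>, so the implicit <html>/<head>
# wrapper that TreeBuilder adds is left out. The empty hashref asks
# as_HTML to emit every end tag, including optional ones like </p>.
my $body  = $tree->find_by_tag_name('body');
my $clean = join '',
    map { ref $_ ? $_->as_HTML( undef, undef, {} ) : $_ } $body->content_list;

print "$clean\n";
$tree->delete;
```

Deleting versus unwrapping is the main design decision here: script and style content should normally vanish entirely, while presentational tags can usually be unwrapped so their text survives.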
Another source of inspiration might be to get the slashcode source and look at its comment filtering (seeing as that's what this sounds like you're trying to do).
#!/usr/bin/perl -w
use HTML::Scrubber;
use strict;

my $html = q[
    a  => <a href=1>link</a>
    br => <br>
    b  => <b>bold</b>
    u  => <u>UNDERLINE</u>
];

# only allow the following tags
my $scrubber = HTML::Scrubber->new( allow => [ qw[ p b i u hr br ] ] );
print $scrubber->scrub($html);

__END__
Output:
    a  => link
    br => <br>
    b  => <b>bold</b>
    u  => <u>UNDERLINE</u>
There's a module based on HTML::Parser for this on my pages (not released to CPAN) that allows you to do just that and more. It lets you specify not just the list of tags to allow, but also the attributes. So no more unexpected onMouseOvers and onLoads ;-)
Just the module name is a bit silly ...
Jenda
XML sucks. Badly. SOAP on the other hand is the most powerful vacuum pump ever invented.
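Since the module mentioned above is not named in this thread, here is a rough sketch of the same idea (whitelisting attributes as well as tags) using HTML::Scrubber's rules and default options instead. The sample markup, the allowed tag list, and the href pattern are assumptions chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Scrubber;

# Hypothetical input with event-handler attributes we want gone.
my $html = q[<a href="http://example.com/" onmouseover="evil()">link</a> ]
         . q[<b onclick="evil()">bold</b>];

my $scrubber = HTML::Scrubber->new(
    # Plain formatting tags are allowed.
    allow => [ qw( b i u br ) ],
    # <a> is allowed, but only with an http(s) href; '*' => 0 denies
    # every other attribute on that tag.
    rules => [
        a => { href => qr{^https?://}i, '*' => 0 },
    ],
    # Default rule: deny unlisted tags (0) and unlisted attributes.
    default => [ 0 => { '*' => 0 } ],
);

my $clean = $scrubber->scrub($html);
print "$clean\n";
```

The point of the deny-by-default rule is that new or unusual attributes (onLoad, style, and so on) are stripped without having to be listed one by one.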
perlmonks.org content © perlmonks.org and Anonymous Monk, bmann, Fletch, Jenda