easy HTML::TokeParser help request
2ge
created: 2006-08-03 09:11:12

Hello guys,

I am using this great module for parsing. I like it and it is quite easy to use, but now I have run into trouble: I really don't know how to parse the following:

I have read all the nodes posted here, and the tutorial too. The main trouble is that I don't know how to create a while() loop that covers only the <div class="full"> HTML code. If I knew that, I could write the parser myself.

thanks for any help
Re: easy HTML::TokeParser help request
created: 2006-08-03 09:16:27

Not a TokeParser solution, but using HTML::TreeBuilder I'd use $t->look_down( _tag => "div", class => "full" ) to get a list of the divs you're interested in and then call $div->look_down( _tag => 'a' ) on each of those. Sometimes the tree solution's just conceptually easier to get your brane around.
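A minimal sketch of the look_down approach described above. The sample markup, the variable names and the @hrefs array are assumptions made for illustration, since the original HTML is not shown in the post:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup; the HTML from the original question is not available.
my $html = <<'HTML';
<div class="full">
  <div class="content"><a href="/one">One</a> <a href="/two">Two</a></div>
</div>
<div class="other"><a href="/skip">Skip</a></div>
HTML

my $t = HTML::TreeBuilder->new_from_content($html);

my @hrefs;

# In list context look_down returns every matching element.
for my $div ( $t->look_down( _tag => 'div', class => 'full' ) ) {
    for my $link ( $div->look_down( _tag => 'a' ) ) {
        my $href = $link->attr('href');
        push @hrefs, $href if defined $href;
    }
}

print "$_\n" for @hrefs;

# Release the tree explicitly; older HTML::Tree versions need this to free memory.
$t->delete;
```

With this sample only the links inside the "full" div are collected. If "full" divs could be nested inside each other, the same link would be found more than once, so that case would need extra handling.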

Re^2: easy HTML::TokeParser help request
2ge
created: 2006-08-03 09:27:42
Thanks, I know HTML::TreeBuilder; I used it some time ago. But I'd like to use only one parser module if possible, and I have settled on HTML::TokeParser. I hope there is a solution with this module, and I don't want to resort to hacks with regular expressions or the like. Maybe I have to use $p->unget_token( @tokens ) to get the desired links.
Re^3: easy HTML::TokeParser help request
created: 2006-08-03 09:33:12

If you're dead set on tokeing it, build a state machine:

  • start in looking_for_full until you see a div with class full, when you transition to looking_for_content
  • when you see a div with class content in state looking_for_content, transition to looking_for_anchors
  • when you see an anchor in looking_for_anchors, save the href attribute
  • when you see a </div> in looking_for_anchors, go back to looking_for_full

Additional: Little note on implementation: you'd have a $state variable which keeps track of which state you're in (start with my $state = 'looking_for_full';). You'd then have a while( my $t = $stream->get_token ) { ... } loop, inside of which you'd implement the above behaviors. Any non-interesting token for the current state would be ignored (e.g. just next back to fetch the next token).
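The state machine described above could be sketched roughly like this. The sample markup and the @hrefs array are assumptions for illustration, and the sketch takes it for granted that the "content" div contains no nested divs, so the first closing div seen in looking_for_anchors ends the block:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup; the HTML from the original question is not available.
my $html = <<'HTML';
<div class="other"><a href="/skip">Skip</a></div>
<div class="full">
  <div class="content"><a href="/one">One</a> <a href="/two">Two</a></div>
</div>
<a href="/outside">Outside</a>
HTML

my $stream = HTML::TokeParser->new( \$html ) or die "can't parse: $!";

my $state = 'looking_for_full';
my @hrefs;

while ( my $t = $stream->get_token ) {
    # Start tags are ["S", $tag, $attr, ...]; end tags are ["E", $tag, $text].
    my ( $type, $tag, $attr ) = @$t;

    # Text, comments and declarations are never interesting here.
    next unless $type eq 'S' or $type eq 'E';

    # Only start tags carry an attribute hash.
    my $class = $type eq 'S' ? ( $attr->{class} || '' ) : '';

    if ( $state eq 'looking_for_full' ) {
        $state = 'looking_for_content'
            if $type eq 'S' and $tag eq 'div' and $class eq 'full';
    }
    elsif ( $state eq 'looking_for_content' ) {
        $state = 'looking_for_anchors'
            if $type eq 'S' and $tag eq 'div' and $class eq 'content';
    }
    elsif ( $state eq 'looking_for_anchors' ) {
        if ( $type eq 'S' and $tag eq 'a' ) {
            push @hrefs, $attr->{href} if defined $attr->{href};
        }
        elsif ( $type eq 'E' and $tag eq 'div' ) {
            # The content div has closed: start over.
            $state = 'looking_for_full';
        }
    }
}

print "$_\n" for @hrefs;
```

Each branch handles only the tokens that matter in that state and lets everything else fall through to the next loop iteration, which is the "ignore any non-interesting token" rule from the note above.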

Re^4: easy HTML::TokeParser help request
2ge
created: 2006-08-03 10:21:23

Thanks for the nice explanation, Fletch. I think this is how it should be done. But in real life it is easier to write a regexp for the content I want to parse, so unless there is a simpler solution with HTML::TokeParser, I will have to fall back on regexes.

Also, which parser is better: HTML::TreeBuilder or HTML::TokeParser? I want to learn only one, one that can parse these relatively easy things and has no other glitches...
Re^5: easy HTML::TokeParser help request
created: 2006-08-03 10:42:11

If you really have a fixed format that you can guarantee isn't going to change (e.g. this is a one-off throw away program to convert old data into a new format), sure go ahead and use regexen. Otherwise you'll find out n months down the road that you're going to spend the time again re-implementing it when the HTML changes because the designer got a new version of Dreampage 06 X.

As for which parser is better: depends. Which is better, a Ferrari or a heavy duty pickup? Try moving a couple pallets of bricks with the former, or winning a race with the latter.

In my personal experience the answer is: depends. :) I used to use TokeParser more than TreeBuilder (writing my own RSS feeds before sites provided them themselves), but more often that's now the other way around. As you can see, for a task with more context sensitivity (foo elements 2 levels down inside bar elements) it takes more scaffolding from the programmer to do things with TokeParser than with a tree. But there are other types of tasks (extract any foo elements with class zorch) that'll probably be simpler to think of in the TokeParser manner.
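As a rough illustration of the second kind of task, here is a sketch that pulls out every element of one type with a given class using plain HTML::TokeParser. The span element stands in for "foo", and the sample markup and the @found array are assumptions made for this example:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup; "span" plays the role of the "foo" element.
my $html = <<'HTML';
<p><span class="zorch">first</span> <span>plain</span> <span class="zorch">second</span></p>
HTML

my $p = HTML::TokeParser->new( \$html ) or die "can't parse: $!";

my @found;

# get_tag skips ahead to the next <span> start tag and returns [$tag, $attr, ...].
while ( my $tag = $p->get_tag('span') ) {
    next unless ( $tag->[1]{class} || '' ) eq 'zorch';

    # Collect the text up to the matching end tag, with whitespace trimmed.
    push @found, $p->get_trimmed_text('/span');
}

print "$_\n" for @found;
```

No state has to be tracked here because the task does not depend on where the element sits in the document, which is why the token approach stays short.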

If you haven't looked at it you should also take a gander at HTML::TokeParser::Simple which provides an even nicer token interface.

Re^6: easy HTML::TokeParser help request
2ge
created: 2006-08-04 08:20:06

I think that when the format changes, the program often needs to change as well, not only the regexp. For now it is OK; let's assume it will not change.

Thanks for the nice answer comparing TokeParser and TreeBuilder. That is really enough for me, and I see the difference. Maybe there should be some other module that puts those two properties together (taking tokens while also keeping the tree stored somewhere). I don't know if that is possible. The main thing is that the problem is solved, and I hope this node will help other monks too!
Re^3: easy HTML::TokeParser help request
created: 2006-08-03 15:35:52

Of about the last 10 "I need to parse this HTML/XML structure" questions asked here, nine of the answers were trivial using ::TreeBuilder (there are XML and HTML versions) and the other was trivial using XML::Twig.

Personally I use TreeBuilder more often in an HTML context and XML::Twig for XHTML and XML. XML::Twig is very powerful for editing, TreeBuilder is very good at looking stuff up.

At the end of the day the more modules you know a little bit about the more quickly and reliably you get stuff done. Don't be afraid to read documentation! Sometimes a quick question in the CB can save a huge amount of time, if you have a general idea where you are headed in the first place.

Limiting yourself to a single module is ... limiting! There is no one tool that does every job, not even computers.


DWIM is Perl's answer to Gödel
Re: easy HTML::TokeParser help request
created: 2006-08-03 10:56:23
Relies on well-formed HTML:
#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser;

my $doc = do { local $/; <DATA> };

my $p = HTML::TokeParser->new( \$doc ); 

while ( my $outer = $p->get_tag("div") ) {
	
	# skip divs without a class attribute, or with a different class
	next unless ( $outer->[1]{class} || '' ) eq "full";
	my $nested_div = 0;

	while ( my $inner = $p->get_tag ) {
	
		# keep count of nested divs
		$nested_div++ if $inner->[0] eq "div";
		$nested_div-- if $inner->[0] eq "/div";

		# "full" div has closed
		last if $nested_div == -1;

		print $p->get_text, "\n" if $inner->[0] eq "a";
	}
}

__DATA__

Re^2: easy HTML::TokeParser help request
2ge
created: 2006-08-04 08:12:05
Thanks! Very very nice solution, easy to understand and it works like it should. I give you ++
Re: easy HTML::TokeParser help request
created: 2006-08-03 12:21:28
With HTML::TokeParser::Simple

(assumes all links within class="full")

#!/usr/bin/perl

use strict;
use warnings;
use HTML::TokeParser::Simple;

my $html = do { local $/; <DATA> };
my $p = HTML::TokeParser::Simple->new(\$html)
  or die "can't parse: $!";
  
my ($in_full, @href);

while (my $t = $p->get_token){
  
  next if  
    $t->is_start_tag('div') 
      and 
    $t->get_attr('class') 
      and 
    $t->get_attr('class') eq 'content';
    
  # "|| ''" avoids an uninitialized-value warning for divs with no class
  $in_full++, next if 
    $t->is_start_tag('div')
      and
    ($t->get_attr('class') || '') eq 'full';
     
  $in_full = 0, next if
    $t->is_start_tag('div')
      and
    ($t->get_attr('class') || '') ne 'full';
        
  next unless $in_full;
  
  push @href, $t->get_attr('href') if 
    $t->is_start_tag('a');
  
}

print "$_\n" for @href;

__DATA__

perlmonks.org content © perlmonks.org and 2ge, Fletch, GrandFather, un-chomp, wfsp
