simple regex
Anonymous Monk
created: 2006-03-03 12:35:21
Having a little trouble matching on a simple regex. My guess is some of the " or > or such are special characters. I always wondered if there's a list that contains all the special characters to watch out for in a regex?

Here is the HTML I am trying to match. I'm trying to get the number value.


$page =~ m/name=\"VERIFIER\" value=\"(\d+\-\d+)\">/x;
print $1;
Re: simple regex
created: 2006-03-03 12:42:17
Try:
$page =~ m/value="([-0-9]+?)"/;
Re^2: simple regex
created: 2006-03-03 12:51:14

You may also wish to use the m/regex/si flags in case the HTML spans multiple lines or the VERIFIER name attribute is not upper-cased.
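A minimal sketch of that idea, assuming a made-up tag (the real HTML is not shown in this thread) whose attributes are lower- or upper-cased and split across lines. Note that /s only changes what . matches, so for a pattern without a dot it is \s+ between the attributes that copes with the line breaks, while /i handles the case difference.

```perl
use strict;
use warnings;

# Hypothetical sample: attribute names in mixed case, spread over several lines.
my $page = qq{<input type="hidden"\n  NAME="verifier"\n  VALUE="076-62">};

# /i ignores case; \s+ matches any run of whitespace, newlines included.
my ($value) = $page =~ m/name="verifier"\s+value="([-0-9]+)"/i;

print defined $value ? "$value\n" : "no match\n";
```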


No good deed goes unpunished. -- (attributed to) Oscar Wilde
Re^2: simple regex
created: 2006-03-03 14:37:38
Who says 'Fun with regexes' is not a contact sport?
$text .= 'blahblah';
$text .= "\nblah\n";
$text .= 'blahblah';
$text .= "\nblah\n";

$page = (($text =~ m/name="VERIFIER"\ value="([-0-9]+?)">/xg),$1)[0];

print "$page";
So, $page will end up as "076-62" in your example... and to hell with $1.




If you wanted to make it more interesting, you could replace the appropriate line with:
$page = (($text =~ m//xg),$1.'-'.$2)[0];
or
$page = (($text =~ m//xg),$1.'-'.$2)[1];
if you only wanted a portion of the value. (The '072' or '062' in your example.)
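For comparison, here is a plainer way to get the same effect, sketched against a made-up sample tag (the HTML in the original post is not shown, so the tag below is an assumption). In list context a match returns its captures, so they can be assigned straight to variables without touching $1 or slicing a list.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the page text.
my $text = qq{blahblah\n<input name="VERIFIER" value="076-62">\nblah\n};

# One capture group: the whole value lands in $whole.
my ($whole) = $text =~ m/name="VERIFIER"\ value="([-0-9]+?)">/x;

# Two capture groups: each half of the value gets its own variable.
my ($first, $second) = $text =~ m/name="VERIFIER"\ value="(\d+)-(\d+)">/x;

print "$whole $first $second\n";
```

If the match fails, the variables are left undefined, so it is worth checking them before use.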




FYI, the 'special character' that's goofing with you (at least as written in your example) is the space between "VERIFIER" and value. The " characters do not have to be backwhack-escaped, but because of the /x flag, that non-escaped space in the middle of your regex is ignored, and that causes the match to fail.

Well, that's my $.02 worth. No, for refunds you'll have to check our customer service department.
Re: simple regex
created: 2006-03-03 12:51:05
Your RE is failing because of the /x flag, which makes the RE engine ignore all unescaped whitespace in your RE, like the space between name=\"VERIFIER\" and value=. Check out perlre for more info.
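To see the effect in isolation, here is a small sketch run against a made-up sample tag (the original HTML is not shown in the thread, so the string below is an assumption). It tries the same pattern three ways: with /x and a bare space, with /x and an escaped space, and without /x.

```perl
use strict;
use warnings;

# Hypothetical sample; not the poster's actual HTML.
my $page = '<input type="hidden" name="VERIFIER" value="076-62">';

# Under /x the bare space in the pattern is ignored, so this fails to match.
my $with_x = $page =~ m/name="VERIFIER" value="(\d+-\d+)">/x ? $1 : 'no match';

# Escaping the space makes it significant again under /x.
my $escaped = $page =~ m/name="VERIFIER"\ value="(\d+-\d+)">/x ? $1 : 'no match';

# Dropping /x also works, since the space is then matched literally.
my $without_x = $page =~ m/name="VERIFIER" value="(\d+-\d+)">/ ? $1 : 'no match';

print "with /x:    $with_x\n";
print "escaped:    $escaped\n";
print "without /x: $without_x\n";
```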
Re: simple regex
created: 2006-03-03 12:53:22

Hi,

Good try. It's not a special-characters problem.

$page =~ m/name="VERIFIER" value="(\d+)-(\d+)"/g;
print "$1 $2";
Re: simple regex
created: 2006-03-03 12:54:06
"always wondered if there's a list that contains all the special characters to watch out for"

This is pretty much covered in perldoc perlre
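As a quick illustration, perlre lists the metacharacters as \ | ( ) [ { ^ $ * + ? . and neither " nor > is among them. When in doubt, quotemeta (or \Q...\E inside a pattern) escapes everything that could be special. The sketch below uses a made-up string containing a +, purely as an example.

```perl
use strict;
use warnings;

# Hypothetical text to search for; the + is a regex metacharacter.
my $literal = 'value="1+1"';
my $text    = '<input value="1+1">';

# Interpolated as-is, the + means "one or more", so the literal text is not found.
my $naive = $text =~ /$literal/ ? 'match' : 'no match';

# \Q...\E quotes every non-word character, so the + is matched literally.
my $quoted = $text =~ /\Q$literal\E/ ? 'match' : 'no match';

# quotemeta produces the same escaped string that \Q...\E uses.
my $escaped = quotemeta $literal;

print "naive:   $naive\n";
print "quoted:  $quoted\n";
print "escaped: $escaped\n";
```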

Re: simple regex
created: 2006-03-03 13:29:38
I find that "simple regex" and HTML very rarely fit nicely together!

You can never be sure what the HTML will actually look like, as ptum mentions above. If your HTML is part of a file it would be well worth considering a parser.

#!/bin/perl5

use strict;
use warnings;
use HTML::TokeParser::Simple;

my $html = '';
my $attribute = 'value';
my $value;

my $tp = HTML::TokeParser::Simple->new(\$html)
  or die "Couldn't parse string: $!";

while (my $t = $tp->get_token) {
	
  if (
    $t->is_start_tag('input') and 
    $value = $t->get_attr($attribute)
  )
  {
      print "*$value*\n";
  }

}

perlmonks.org content © perlmonks.org and Anonymous Monk, explorer, gube, lo_tech, McDarren, Paladin, ptum, wfsp

prlmnks.org © 2006 edmund von der burg (eccles & toad)
