Regex Report
japhy
created: 2004-06-27 13:17:39
I plan on having a draft of my regex article ready for review by the end of June. Hopefully, by early July, Regexp::Parser will be on CPAN. Once that's ready to use, I'm going to make a couple of sub-modules (like Regexp::Explain), and then I'm going to work on subclassing it to match Perl 6 regexes.

What follows is rescinded by me; I won't delete the text, but it's here in a small red font to let you know it's (already) outdated.

That being said, I'm also going to release (if I can figure out how to do it safely) re::capture, which will introduce a new assertion: (?N=pat). It will allow you to specify what capture group you're assigning to. Here's an example of its use:

# parses text like:
# name = japhy  age = "22"  lang = 'Perl'
# into a hash... but it retains those pesky quotes :/
my %data = $text =~ m{
  ([^=\s]+) \s* = \s* 
  (
    ' [^']* ' |
    " [^"]* " |
    \S+
  )
}xg;
That's pesky because then you have to post-process the quotes out of them. re::capture (isn't that a witty name?) will allow you to say:
# parses text like:
# name = japhy  age = "22"  lang = 'Perl'
# into a hash... but doesn't capture the quotes!
my %data = $text =~ m{
  ([^=\s]+) \s* = \s* 
  (?:
    ' (?2= [^']* ) ' |
    " (?2= [^"]* ) " |
    ( \S+ )
  )
}xg;
This case might be resolved in other ways, but it's a good demonstration of what the module does. The other thing I think I'll make it implement is captures that exist only inside the regex and are ignored (that is, not returned) afterwards. That means you can write:
# parses text like:
# name = japhy  age = "22"  lang = 'Perl'
# into a hash... but doesn't capture the quotes!
my %data = $text =~ m{
  ([^=\s]+) \s* = \s* 
  (?:
    (?*3= ['"] ) (?2= .*? ) \3 |
    ( \S+ )
  )
}xg;
and the regex will only return ($1, $2) each time it matches.

This is not going to be a filter, but rather will work like [re], and redefine the functions Perl uses to do its compiling and matching. It won't change much, but it will add support for this new assertion.
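
For comparison, here is a minimal sketch of the second example written without re::capture. It assumes Perl 5.10 or later (newer than this post), where the branch-reset group `(?|...)` restarts capture numbering in each alternative, so all three branches fill $2. The sample `$text` is taken from the comments in the examples above.

```perl
use strict;
use warnings;

# Sample input taken from the comments in the examples above.
my $text = q{name = japhy  age = "22"  lang = 'Perl'};

# (?| ... ) is a branch-reset group (Perl 5.10+): every alternative
# numbers its captures from the same starting point, so each branch
# stores its value in $2 and the quotes stay outside the capture.
my %data = $text =~ m{
  ([^=\s]+) \s* = \s*
  (?|
    ' ( [^']* ) ' |
    " ( [^"]* ) " |
    ( \S+ )
  )
}xg;

print "$_ => $data{$_}\n" for sort keys %data;
```

This only covers the "assign to a chosen group" half of the idea; it does nothing for the temporary, not-returned captures of the third example.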

_____________________________________________________
Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a [http://www.pobox.com/~japhy/resume.txt|job] (NYC-area)
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;
Re: Regex Report
created: 2004-06-27 16:30:21

Just something to think about but the dotNet regex library supports named captures:

(?<name>PATTERN)

While you are hacking this funky stuff maybe such a thing would also be cool. Maybe you could use %+ to hold the captures? So

if ('demerphq'=~/(?<perlmonk>\w+)/) {
  print $+{perlmonk}
}

would work. I mean, it's a bit embarrassing that dotNet has a cool regex feature that perl doesn't. (IMO anyway. :-)
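
A minimal sketch of that idea, assuming Perl 5.10 or later, where named captures of the form `(?<name>pattern)` and the `%+` hash are built in. The variable `$who` is only there for illustration.

```perl
use strict;
use warnings;

# Named capture: (?<perlmonk>...) stores its match under the key
# 'perlmonk' in the special hash %+ (Perl 5.10 and later).
my $who;
if ('demerphq' =~ /(?<perlmonk>\w+)/) {
    $who = $+{perlmonk};
    print "$who\n";
}
```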

Thanks for your efforts [japhy].

---
demerphq

    First they ignore you, then they laugh at you, then they fight you, then you win.
    -- Gandhi


Re: Regex Report
created: 2004-06-27 22:23:46
Sounds interesting, and though I cannot ATM think of any situation where I might want to use it, I am almost certain that I have in the past wished for something similar (if not the same). I am not sure if you meant for the last example:
# parses text like:
# name = japhy  age = "22"  lang = 'Perl'
# into a hash... but doesn't capture the quotes!
my %data = $text =~ m{
  ([^=\s]+) \s* = \s* 
  (?:
    (?3= ['"] ) (.*? ) \3 |
    ( \S+ )
  )
}xg;
So that what was previously in the (?2= .* ) would be returned.

"This case might be resolved in other ways,"
Sorry, couldn't help presenting one:
while (  ) {
  my %data = $_ =~ m{
    ([^=\s]+) \s* = \s* 
    ["']?(
      (?<=')[^']* |
      (?<=")[^"]* |
      \S+
    )["']?
  }xg;
}

-enlil

(as a side note I would have to agree with [demerphq] that [id://370038|named captures] would be cool as well)
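
The lookbehind pattern from this post can be tried against the sample line like this. This is a sketch only: `parse_pairs` is a hypothetical helper, and the input comes from a string rather than a filehandle.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the pattern from the post above.
sub parse_pairs {
    my ($line) = @_;
    my %data = $line =~ m{
      ([^=\s]+) \s* = \s*
      ["']?(
        (?<=')[^']* |
        (?<=")[^"]* |
        \S+
      )["']?
    }xg;
    return \%data;
}

my $good = parse_pairs(q{name = japhy  age = "22"  lang = 'Perl'});
print "$_ => $good->{$_}\n" for sort keys %$good;

# An unterminated quote is still accepted, because the closing
# quote is optional in this pattern.
my $bad = parse_pairs(q{x = "abc});
print "x => $bad->{x}\n";
```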
Re^2: Regex Report
created: 2004-06-27 23:40:26
I would make your example a little safer:
my %data = $_ =~ m{
  ([^=\s]+) \s* = \s* 
  ["']? (
    (?<= '  ) [^']* (?=  '   ) |
    (?<=  " ) [^"]* (?=   "  ) |
    (?

That, to me, seems safer, because it ensures a string that starts quoted ends quoted, and a string that doesn't start quoted doesn't end quoted (for some bizarre reason).
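
The code in this post is cut off after `(?`, so what follows is a guess at a complete pattern with the behaviour described, not the original text. The third alternative (a bare word guarded by negative lookarounds, made atomic so it cannot be shortened to dodge the lookahead) and the helper name `parse_pairs_strict` are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper. The first two alternatives come from the post;
# the third alternative is an assumption about the lost part.
sub parse_pairs_strict {
    my ($line) = @_;
    my %data = $line =~ m{
      ([^=\s]+) \s* = \s*
      ["']? (
        (?<= ' ) [^']* (?= ' ) |
        (?<= " ) [^"]* (?= " ) |
        (?<! ['"] ) (?> [^\s'"]+ ) (?! ['"] )
      ) ["']?
    }xg;
    return \%data;
}

my $good = parse_pairs_strict(q{name = japhy  age = "22"  lang = 'Perl'});
print "$_ => $good->{$_}\n" for sort keys %$good;

# A value that opens a quote but never closes it is rejected.
my $bad = parse_pairs_strict(q{x = "abc});
print "pairs found: ", scalar(keys %$bad), "\n";
```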


_____________________________________________________
Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a [http://www.pobox.com/~japhy/resume.txt|job] (NYC-area)
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;
Re: Regex Report
created: 2004-06-27 23:59:41
Ok. Regexp::Parser and the Perl 6 regex parser will be my primary concerns (apart from the article). That whole re-ordered captures and temporary-captures thing will have to wait.

In the meantime, if you want named captures, I suggest you look at Steve Grazzini's Regexp::Fields, which does what you want. I'm glad I didn't try implementing it -- it doesn't look like a cake-walk.

_____________________________________________________
Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;
Re: Regex Report
created: 2004-06-28 09:32:48

When you post this, please be sure not to leave important parts of the grammar in inaccessible lexicals. I tried to do something once with YAPE::Regex but had to write node 251219 just to get access to the grammar in my %pat. That really sucked. The only change I really needed was to have %pat be a global so it'd be reusable.

Purty please? Won't you think of the children?

Re^2: Regex Report
created: 2004-06-28 09:39:38
Heh, sorry. You'll be happy to know all the grammar is stored (gasp?) as methods of the object. This means you have method names like "(" and "[" and "|". If you think this is blasphemous, tough cookies. In fact, the only non-weird looking method name is "atom", which is the starting node for the grammar.

Another thing. Right now, the grammar is determined on the fly. That is, each rule (upon successful match) tells the object what possible rules follow it. Perhaps I should implement that differently.
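
A rough sketch of the rules-as-methods idea, not Regexp::Parser's actual code: the class names `My::Grammar` and `My::Grammar::Perl6`, the rule bodies, and the helper `run_rule` are all made up for illustration. Names like "|" and "[" are not valid identifiers, so they are installed through the symbol table and called through a string holding the method name.

```perl
use strict;
use warnings;

package My::Grammar;    # hypothetical class, for illustration only

sub new { my ($class) = @_; return bless {}, $class }

# An ordinary identifier can be declared the usual way.
sub atom { return 'atom rule' }

# Non-identifier names are installed by assigning to a glob.
{
    no strict 'refs';
    *{ __PACKAGE__ . '::|' } = sub { return 'alternation rule' };
    *{ __PACKAGE__ . '::[' } = sub { return 'character class rule' };
}

package My::Grammar::Perl6;    # hypothetical subclass
our @ISA = ('My::Grammar');

# Override a single rule; everything else is inherited.
{
    no strict 'refs';
    *{ __PACKAGE__ . '::[' } = sub { return 'Perl 6 character class rule' };
}

package main;

# Call a rule by name; normal method resolution finds the right sub.
sub run_rule {
    my ($obj, $name) = @_;
    return $obj->$name();
}

my $base = My::Grammar->new;
my $p6   = My::Grammar::Perl6->new;

print run_rule($base, '['), "\n";
print run_rule($p6,   '['), "\n";
print run_rule($p6,   '|'), "\n";
```

The subclass replaces one rule and picks up the rest through @ISA, which a hash of rules would not get without extra merging code.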

_____________________________________________________
Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;
Re^3: Regex Report
created: 2004-06-28 11:27:11

So there's code that looks like $next = List::Util::first { eval { $self->$_ } } @tokens? Oof. Holey AUTOLOAD batman! Why not just make the token a parameter to some function instead of passing the value via the function name? Or is this so you can get overriding? When do we get to see this code and are you sure you couldn't have written this using a mundane method?

Re^4: Regex Report
created: 2004-06-28 11:52:52
Rules are methods of the object so that object inheritance works easily. Data is not inherited easily; methods are.
_____________________________________________________
Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

perlmonks.org content © perlmonks.org and demerphq, diotalevi, Enlil, japhy
