when $$s =~ m/\G.../gc is too verbose
stefp
created: 2006-02-02 17:15:37
For manually writing lexers, my favorite idiom is $$s =~ m/\G.../gc. In scalar context it lets me advance through the string $$s that I want to lex. If the pattern matches, the current position moves past the match; if not, the position is unchanged. \G anchors the match at the current position. I could also use $$s =~ s/^...//. It does not cost much, because the implementation does not move the string to truncate it but just moves an internal pointer. But this is immaterial to the following discussion.
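A minimal sketch of that behaviour, for readers who have not used \G with /gc before. The variable names are made up for illustration, and the `//` operator assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

my $text = "42 foo";
my $s    = \$text;    # reference to the string to lex, as in the post

# A failed match with /c leaves pos() untouched (still undef, i.e. offset 0).
my $word_matched      = $$s =~ m/\G([a-z]+)/gc ? 1 : 0;
my $pos_after_failure = pos($$s) // 0;

# A successful match in scalar context moves pos() just past the match.
my $int;
$int = $1 if $$s =~ m/\G(\d+)/gc;
my $pos_after_success = pos($$s);

print "after failed match:     pos = $pos_after_failure\n";
print "after matching '$int':   pos = $pos_after_success\n";
```

Without the c flag, the failed first match would reset the position, which is harmless here only because the position was still at the start.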

A lexer for Parse::Yapp ends up looking like

 sub lexer {
    my ($parser) = shift;
    my $s = $parser->YYData->{INPUT}; # reference to the string to lex
    $$s =~ m/\G\s+/gc;  # skip any spaces
    return ('INT', $1) if $$s =~ m/\G(\d+)/gc;
    return ('ID', $1)  if $$s =~ m/\G([A-Z]\w*)/gc;
    ...  # and it goes on for many tentative matches
 }
I know that I always match on $$s, so why should I restate it at each match? I _had_ to remove these useless $$s!

It took me a long time to realize that I could do it with a typeglob trick :

  *_ = $parser->YYData->{INPUT}; # reference to the string to lex
Now $_ is an alias to the string to lex, so I can match on it and don't need the =~ operator anymore.
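Put together, the lexer could look roughly like the sketch below. It is not a real Parse::Yapp lexer: next_token is a hypothetical stand-in that takes the string reference directly instead of reading $parser->YYData->{INPUT}, and the ('', undef) return is used here simply as a "nothing left" marker.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the lexer; takes the string reference directly.
sub next_token {
    my ($input_ref) = @_;
    *_ = $input_ref;    # $_ now aliases $$input_ref (globally: there is no local here)

    m/\G\s+/gc;                                   # skip any spaces
    return ( 'INT', $1 ) if m/\G(\d+)/gc;
    return ( 'ID',  $1 ) if m/\G([A-Z]\w*)/gc;
    return ( '', undef );                         # nothing recognised
}

my $input = "12 Foo";
my @tokens;
while ( my ( $type, $value ) = next_token( \$input ) ) {
    last if $type eq '';
    push @tokens, "$type=$value";
}
print "@tokens\n";
```

Because the alias points at the caller's string, the position set by one call is still there on the next call, which is what lets the loop walk through the input.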

-- stefp

Re: when $$s =~ m/\G.../gc is too verbose
created: 2006-02-02 17:36:34

Clever. I usually put a local in there though, just to avoid trouble.

Re^2: when $$s =~ m/\G.../gc is too verbose
created: 2006-02-02 19:09:07
Oops, I forgot it. I always localize or lexicalize variables proper to a subroutine. That's why I never noticed that $_ is not implicitly localized at the entry of a subroutine, contrary to what I thought.

Sadly, localizing *_ or $_ doesn't play well with reference shuffling of strings with positions.

 sub lexer {  (*_) = @_;    print $1 if m/\G(A)/gc || m/\G(B)/gc ; }
 my $a = "AB"; lexer \$a; lexer \$a;
This prints "A" then "B". If I add a local *_ or a local $_ at the entry of the lexer routine, that does not work anymore. So much for a cool trick.

-- stefp

Re: when $$s =~ m/\G.../gc is too verbose
created: 2006-02-02 18:28:40
Maybe I don't understand the problem, but if you really hate typing so much, why not generalize the solution instead?

Not that I like to generalize things, because I usually end up un-generalizing them a few months later (stupid shifting requirements!), but it seems easier than playing with symbol table manipulations just to save a few keystrokes to me... am I missing something?

I'm thinking of something roughly along these lines... completely untested and possibly wrong code is below. ;-)

# make a table of regular expression patterns
# (hash keys come back as plain strings, in no particular order)
my %table = ( qr/(\d+)/      => 'INT',
              qr/([A-Z]\w*)/ => 'ID',
              # .... more tokens here
            );
my ($parser) = shift;

my $s = $parser->YYData->{INPUT};

my @matches; # any matches found by our re go in here

foreach my $re ( keys %table ) {
        # for each regexp, check to see if it matches, and
        # put all the captured values in @matches if it does
        @matches = ( $$s =~ m/\G$re/gc );

       # return the appropriate token, and captures...
        return( $table{$re}, @matches) if (@matches);
} # end search for a token match

# token not found... put error handling here ...

--
Ytrew

Re^2: when $$s =~ m/\G.../gc is too verbose
created: 2006-02-02 19:00:02
Without going to the extremities of toke.c (the Perl tokenizer), things are usually more complicated than mere pattern matching. One may have to test various flags. Otherwise, indeed, one could factor things out one way or another.

-- stefp

Re: when $$s =~ m/\G.../gc is too verbose (for)
tye
created: 2006-02-03 00:00:15
for( $$s ) {
    ....
}

- tye

Re^2: when $$s =~ m/\G.../gc is too verbose (for)
created: 2006-02-03 04:49:57
This is the trick used by Calc.yp in the Parse::Yapp distribution. It does create an alias, but it conveys the wrong message because the block is not really used as a loop.
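For comparison, here is a rough sketch of the same lexer written with for as the aliasing construct. lex_with_for is a made-up name, and it takes the string reference directly rather than a parser object. Unlike the typeglob version, foreach restores the caller's $_ when the block is left.

```perl
use strict;
use warnings;

# Hypothetical lexer: the for block runs once and exists only to alias $_.
sub lex_with_for {
    my ($input_ref) = @_;
    for ($$input_ref) {
        m/\G\s+/gc;                                   # skip any spaces
        return ( 'INT', $1 ) if m/\G(\d+)/gc;
        return ( 'ID',  $1 ) if m/\G([A-Z]\w*)/gc;
    }
    return ( '', undef );                             # nothing recognised
}

$_ = "outer";          # to show that the caller's $_ survives
my $input = "7 Bar 8";
my @tokens;
while ( my ( $type, $value ) = lex_with_for( \$input ) ) {
    last if $type eq '';
    push @tokens, "$type=$value";
}
print "@tokens\n";
print "caller's \$_ is still '$_'\n";
```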

-- stefp

Re^3: when $$s =~ m/\G.../gc is too verbose (for)
created: 2006-02-03 06:15:34
That's why I wished Perl allowed another keyword as yet another synonym for for/foreach — I'd propose "with", for example:
with($$s) {
    ...
}
But in the meantime, I've trained myself to actually read/see
for(SCALAR) { ... }
as
with(SCALAR) { ... }

Chalk it up as another Perl idiom.

Re^4: when $$s =~ m/\G.../gc is too verbose (for)
created: 2006-02-03 14:10:07
It's called "given" in Perl 6 (or in Perl6::Switch, if you want to play with 6ish topicalizers in Perl 5).

Perl 6 also has syntactic relief for the m/\G.../gc monstrosity. That turns into m:p/.../, where the :p tells it to start matching at the current position. (But generally you don't even need that, since subrules in a grammar always anchor to the current position anyway.)

Re^3: when $$s =~ m/\G.../gc is too verbose (for)
tye
created: 2006-02-03 14:45:32

Much like in English, you can use Perl's for() for iterating over a list, iterating via initialization + check + step, or associating a single topic with a block of syntax. So I, without apology, use for() for topicalizing. For you, I won't stop doing this. (: Excuse me for not demonstrating the use of English "for" analogous to init + check + step.

- tye        

Re: when $$s =~ m/\G.../gc is too verbose
created: 2006-02-03 10:48:07

You can also use a single regexp with all the alternatives, with \G and the g flag but without the c flag. Then you can decide which alternative matched by checking the definedness of $1 and the other match variables.

I sometimes use that idiom instead of many regexps with a gc flag. A nice example is the glob_to_re function in cgrep (snapshot) (which is btw an improved version of my Egrep clone with function name display). A simpler example is in Re: Logic trouble parsing a formatted text file into hashes of hashes (of hashes, etc.).
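A small sketch of that single-regexp idiom, reusing the INT and ID tokens from the root post (the input string and the token formatting are invented for the example):

```perl
use strict;
use warnings;

my $input = "12 Foo 7";
my @tokens;

# One pattern, one capture group per alternative; \G plus /g walks the string.
while ( $input =~ m/\G\s*(?:(\d+)|([A-Z]\w*))/g ) {
    if    ( defined $1 ) { push @tokens, "INT=$1" }
    elsif ( defined $2 ) { push @tokens, "ID=$2" }
}
print "@tokens\n";
```

Since there is no c flag, the first failed match resets the position and ends the loop, so unrecognised input stops the scan silently unless it is checked for separately.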

Re: when $$s =~ m/\G.../gc is too verbose
created: 2006-02-07 12:53:24

Why not just store $$s in a local copy of $_?

#Either
local $_ = $$s;

#Or
s//$$s/; #tricky.. ;-)

Actually, in this case, I'd be tempted to alter your approach altogether and use a regex table.

sub lexer {
   my ($parser) = shift;
   my $s = $parser->YYData->{INPUT};
   
   # I don't get your line: 'm/\G\s+/gc;  skip any spaces'
   my %dispatch = (
                    INT  => qr/\G(\d+)/,
                    ID   => qr/\G([A-Z]\w*)/,
                    #.. and so on ..
                  );
   
   while (my ($key, $regex) = each %dispatch) {
      # qr// does not accept the g and c flags, so they go on the match itself
      return ($key, $1) if $$s =~ m/$regex/gc;
   }
}
<-radiant.matrix->
A collection of thoughts and links from the minds of geeks
The Code that can be seen is not the true Code
I haven't found a problem yet that can't be solved by a well-placed [http://en.wikipedia.org/wiki/Trebuchet|trebuchet]
Re^2: when $$s =~ m/\G.../gc is too verbose
created: 2006-02-07 12:58:15

Your second solution is no solution:

$_ = '!';
$s = \'No';
s//$$s/;
print;

You'd need to empty out $_ first, so the local $_ is the way.

Re^3: when $$s =~ m/\G.../gc is too verbose
created: 2006-02-07 14:32:07

Absolutely. That's the tricky part. ;-) It will work when $_ is undefined, but not otherwise. Of course, you could always change it to s/.*/$$s/, but still not advisable. More of an obfu trick...

<-radiant.matrix->
Re^2: when $$s =~ m/\G.../gc is too verbose
created: 2006-02-07 17:31:41
About the copy: before even thinking about positions in strings, using a string copy is a no-no. Copying the string to be parsed for each token is madness.
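Besides the cost, a copy also loses the match position. A small sketch, assuming a hypothetical lex_copy routine modelled on the A/B example earlier in the thread:

```perl
use strict;
use warnings;

# Hypothetical lexer that works on a copy of the input.
sub lex_copy {
    my ($ref) = @_;
    local $_ = $$ref;    # copies the whole string; the copy starts with no pos()
    return $1 if m/\G(A)/gc || m/\G(B)/gc;
    return;
}

my $text = "AB";
my @got  = ( lex_copy( \$text ), lex_copy( \$text ) );
print "@got\n";          # the second call starts from the beginning again
```

Each call matches at the start of a fresh copy, and the position of the original string is never updated.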

About a table for lexing: this is irrelevant to the discussion. Also, lexing can be more complex than matching. Yes, one can insert regular code in a regex, but that is the sign that table-based lexing is not appropriate.

As I said to tye, using for is the right way to alias to $_. I don't like it because in the programming space, for is a loop... for me. :)

In the natural language space, well, English is not my first language.

So to paraphrase Churchill, for is the worst solution, but it is the only one.

Hopefully, as TimToady said, Perl6 will be cleaner.

-- stefp

perlmonks.org content © perlmonks.org and ambrus, Anonymous Monk, bart, chromatic, Corion, radiantmatrix, stefp, TimToady, tye
