A few days ago, Limbic~Region was eagerly hunting all over the place for every bit of information he could find on parsing, in particular, on Yacc-style and recursive descent parsers. He was doubting whether there is any use in trying to come up with an implementation idea of your own, and write your own parser, when such decent toolkits are readily available.
I suddenly recalled a parser that I wrote myself, in the days before I had any formal education about compilers. When later I looked into the Dragon Book, I was rather amazed that the system I had come up with, was not described. There is someting that comes somewhat in the neighbourhood, called "Operator Precedence Parsing" (section 4.6). But, at least, I do not have to invent any new weird math symbols in order to explain how it works. :)
Its purpose is to parse and calculate infix expressions, stuff like "2+3*(4+5)". As is the case with recursive descent parsing, it is easy to implement by hand — actually, easier; in fact it's so easy that I had no trouble at all implementing it in pure Z80 assembler. But, in contrast with recursive descent parsing, precedence (calculation order) rules for the various operators are defined by a precedence table: each operator has a numerical precedence value (the higher the value, the tighter it binds) and an associativity type (left or right). As a result, new infix operators can easily be added without changing any code, but by merely adding their entries to the operators table.
The grammar it parses has the basic form:
expr := value ( op value )*where value is something like a number or a variable's name, and op is an infix operator.
The code layout is a bit different from the above spec, and in pseudocode, looks like this:
sub parse_expr {
my($value, @stack);
while(1) {
$value = parse_value or die "Parse error";
my $op = parse_operator or last;
$value = process(\@stack, $value, $op); # intermediate processing
push @stack, { value => $value, op => $op }; # store for the next round
}
$value = process(\@stack, $value); # final processing
return $value;
}
process() uses a stack to hold intermediate values and operators. In addition, there are a few variables holding the last value that got parsed or calculated ($value), and the last operator that was just seen ($op).
Lexing (tokenizing) the data happens in 2 independent lexers: one for the core values, and one for the operators. Which lexer is used depends on whether an operator or a value is expected. That way you can reuse the same symbol for different roles without conflict, such as using "-" for a subtraction operator, and as an unary minus.
If a value is expected, either at the start of an expression, or following an infix operator, and none is recognized, a parsing error is raised.
If an operator is expected and none is seen, it is merely assumed that the end of the expression has been reached, and the parser returns the calculated value, after completing any postponed operations. The input pointer remains where it is, and you can continue parsing using any other means, where it stopped.
Processing happens like this: the precedence of the last recognized operator is compared to the precedence of the operator on top of the stack. If it's higher, the current value and operator are pushed onto the stack and the next item will be parsed first; if it's lower, the previous postponed calculation will be performed first, possibly repeated for any older values and operators on the stack. Finally, the final result, and the operator, are pushed onto the stack.
You can compare this to the LALR (Yacc style) parsers, where people talk about shift (= push) and reduce (= calculate).
If their precedence is the same, the associativity of the operator (on the stack) determines whether a calculation will be performed first (left), or postponed (right). I strongly recommend against ever using different types associativity for the same precedence level, so it doesn't matter which operator's associativity you use.
That's it. That's the whole parser.
How about nested parens? Simple, use the parsing rule
value := '(' expr ')'
That is, if you recognize an opening paren, recurse into parsing an expression, end by trying to match a closing paren. (As a closing paren is not recognized as an operator, hence the expression parser will just return when it gets there.)
This way, you can nest parens to any depth.
And now: code. This first implementation stays close to the theoretical description. A lot of the following code is not necessary, as it serves to show what is going on.
#!perl -w
use strict;
my %var;
my %op = (
'+' => { prec => 10, assoc => 'L', exec => sub { $_[0] + $_[1] }},
'-' => { prec => 10, assoc => 'L', exec => sub { $_[0] - $_[1] }},
'*' => { prec => 20, assoc => 'L', exec => sub { $_[0] * $_[1] }},
'/' => { prec => 20, assoc => 'L', exec => sub { $_[0] / $_[1] }},
'%' => { prec => 20, assoc => 'L', exec => sub { $_[0] % $_[1] }},
'**' => { prec => 30, assoc => 'R', exec => sub { $_[0] ** $_[1] }},
);
# fields
use constant VALUE => 0;
use constant OP => 1;
sub parse_expr {
my @stack;
while (1) {
trace();
my($value) = parse_value() or die "Parse error at " . where();
trace("value=$value");
if(/\G\s*(\*\*|[+\-*\/%\\])/gc) { # operator
my $op = $1;
trace("op=$op");
while (@stack and (($op{$op}{prec} < $op{$stack[-1][OP]}{prec})
or ($op{$op}{prec} == $op{$stack[-1][OP]}{prec})
and $op{$stack[-1][OP]}{assoc} eq "L")) {
my($lhs, $op) = @{pop @stack};
trace("popping $lhs $op");
$value = $op{$op}{exec}->($lhs, $value);
trace("result = $value");
}
trace("pushing $value $op");
push @stack, [ $value, $op ];
} else { # no more
while(@stack) {
my($lhs, $op) = @{pop @stack};
trace("popping $lhs $op");
$value = $op{$op}{exec}->($lhs, $value);
trace("result = $value");
}
trace("returning $value");
return $value;
}
}
}
sub parse_value {
/\G\s+/gc;
if(/\G\+/gc) { # '+' value
trace("Unary plus");
return parse_value();
}
if(/\G-/gc) { # '-' value
trace("Unary minus");
return -parse_value();
}
if(/\G((?:\d+\.?\d*|\.\d+)(?i:E[+-]?\d+)?)/gc) { # number
return $1;
}
if(/\G((?i:[a-z]\w*))/gc) { # variable
return $var{$1};
}
if(/\G\(/gc) { # '(' expr ')'
my $value = parse_expr();
/\G\s*/gc;
/\G\)/gc or die "Parse error: ')' expected at: " . where() ;
return $value;
}
return;
}
sub evaluate {
local $_ = shift;
my $value = parse_expr();
/\G\s+/gc;
/\G$/gc or die sprintf "Junk characters at end: \"%s\"", where();
return $value;
}
sub where { # debugging info
my $s = $_;
substr($s, pos || 0, 0) = "\267";
return $s;
}
sub trace {
my($pkg, $file, $line) = caller;
printf STDERR "Line %d \"%s\" %s\n", $line, where(), @_ ? shift : "";
}
%var = ( a => 101, b => 7 );
$_ = "20+3*a+10*-b-5*(3 +2)*5";
$\ = "\n";
print evaluate($_);
As an implementation, it's not ideal: it contains code duplication, the precedence/associativity test is quite verbose, and it's not machine code friendly (low on resources, using as few variables and datastructures as possible), due to the local stacks in the expression parser. That all can be remedied using a few simple tricks.
While processing the data on the stack, you do not have to check the stack depth all the time, as you can use a sentinel value on the stack instead. Just make sure the precedence values of all operators are > 0, and give the sentinel a precedence value of 0, lower than any precedence values for the operators, then every calculation will be performed before parsing ever finishes, while the sentinel just sits safely on the stack. This way, we can even use a global stack, such as the return stack when coding in assembler, in a perfectly safe way.
Associativity can be tackled by pretending a left associative operator on the stack has a higher precedence than the same operator that just got parsed, while a right associative operator has a lower or equal (thus, not higher) precedence on stack than by default, and then the calculation will be postponed. In practice, we just have to make sure the precedence values in the table are even, and increment on the precedence value, just before it gets pushed onto the stack, for a left associative operator.
In the follwoing code, only the sub parse_expr has been changed, but I've posted the whole code to ease copy/run on the program.
#!perl -w
use strict;
my %var;
my %op = (
'+' => { prec => 10, assoc => 'L', exec => sub { $_[0] + $_[1] }},
'-' => { prec => 10, assoc => 'L', exec => sub { $_[0] - $_[1] }},
'*' => { prec => 20, assoc => 'L', exec => sub { $_[0] * $_[1] }},
'/' => { prec => 20, assoc => 'L', exec => sub { $_[0] / $_[1] }},
'%' => { prec => 20, assoc => 'L', exec => sub { $_[0] % $_[1] }},
'**' => { prec => 30, assoc => 'R', exec => sub { $_[0] ** $_[1] }},
);
# fields
use constant VALUE => 0;
use constant OP => 1;
sub parse_expr {
my @stack;
while (1) {
trace();
my($value) = parse_value() or die "Parse error at " . where();
trace("value=$value");
if(/\G\s*(\*\*|[+\-*\/%\\])/gc) { # operator
my $op = $1;
trace("op=$op");
while (@stack and (($op{$op}{prec} < $op{$stack[-1][OP]}{prec})
or ($op{$op}{prec} == $op{$stack[-1][OP]}{prec})
and $op{$stack[-1][OP]}{assoc} eq "L")) {
my($lhs, $op) = @{pop @stack};
trace("popping $lhs $op");
$value = $op{$op}{exec}->($lhs, $value);
trace("result = $value");
}
trace("pushing $value $op");
push @stack, [ $value, $op ];
} else { # no more
while(@stack) {
my($lhs, $op) = @{pop @stack};
trace("popping $lhs $op");
$value = $op{$op}{exec}->($lhs, $value);
trace("result = $value");
}
trace("returning $value");
return $value;
}
}
}
sub parse_value {
/\G\s+/gc;
if(/\G\+/gc) { # '+' value
trace("Unary plus");
return parse_value();
}
if(/\G-/gc) { # '-' value
trace("Unary minus");
return -parse_value();
}
if(/\G((?:\d+\.?\d*|\.\d+)(?i:E[+-]?\d+)?)/gc) { # number
return $1;
}
if(/\G((?i:[a-z]\w*))/gc) { # variable
return $var{$1};
}
if(/\G\(/gc) { # '(' expr ')'
my $value = parse_expr();
/\G\s*/gc;
/\G\)/gc or die "Parse error: ')' expected at: " . where() ;
return $value;
}
return;
}
sub evaluate {
local $_ = shift;
my $value = parse_expr();
/\G\s+/gc;
/\G$/gc or die sprintf "Junk characters at end: \"%s\"", where();
return $value;
}
sub where { # debugging info
my $s = $_;
substr($s, pos || 0, 0) = "\267";
return $s;
}
sub trace {
my($pkg, $file, $line) = caller;
printf STDERR "Line %d \"%s\" %s\n", $line, where(), @_ ? shift : "";
}
%var = ( a => 101, b => 7 );
$_ = "20+3*a+10*-b-5*(3 +2)*5";
$\ = "\n";
print evaluate($_);
Extending the parser to recognize new syntax is easy enough. For example, if you have it recognize a name followed by an opening paren, you can treat this as a prefix function call:
value := name '(' ( expr ( ',' expr ) * )? ')'
(You will have to recognize the opening paren before you assume a name is just a variable)
One by one, parse (and calculate) the expressions in the arguments list, putting the results onto a stack, and finally, call the associated function.
If you want to recognize and handle assignments to variables in the form of "a=expr", then you'll have to implement a way to have lvalues. I can think up 2 basic routes from the top of my head. One is to use Perl voodoo, and have a variable return an object that stringifies/nummifies to the variable's current value, but which you can also use to modify its value. The other is to always return a record structure for the values.
Extending the idea of lvalues to pre-/post- -increment/-decrement, you'll probably have to forbid "++" and "--" for any other use, so "--2" is no longer equivalent to 2.
It'd be very nice to recognize advanced syntax structures, such as "1 < x < 2", in a more programmer-friendly rather than computer-friendly way, meaning "x > 1 and x < 2". I think it's doable, as we're comparing precedence of the two "<" operators anyway, but it'll have to do some exceptional processing for that particular operator.
Other special cases, like optional operators, so that "2a" could mean "2*a>", or "foo bar" meaning "
Instead of just calculating a value on the fly, the parser can easily be modified to return a parse tree, or produce postfix code instead.
This is the basic parser, which, because it doesn't have to recognize everything (it just stops at the first thing it doesn't recognize, it's perfectly happy if it just recognized a valid expression as a prefix), can easily be embedded in other parsers. For example, you can write a handcoded parser to recognize basic command statements, control structures, etc. where this (sub)parser handles expressions, which most programmers find the hardest things to tackle, anyway.
Like I said, I invented this parsing method myself (some 15 or more years ago), and you're free to use any code, any derived code, or any reimplementations as you see fit. I'd just like some credit if you do so, so please don't pretend you invented it yourself. (Let's say it's under a BSD-style license.) Just, no patenting, or any other such crap.
a + b.c - 1.3 * 2 + 3would show up as:
@tree = qw(a + b.c - 1.3 * 2 + 3);As I am parsing I also create a hash of found operators and I note their precedence which I get from a table (it is possible to add any amount of other operators to the table). This hash for this example would look like
%found = ('+' => 85, '-' => 85, '*' => 90)
I then call a method called "apply_precedence" passing it the tree and the %found
hash. Apply precedence takes the highest precedence operator and splits the
@tree array into sub trees whenever it finds an operator. Each of those
sub trees recursively calls "apply_precedence" until each sub tree only has one
element. The returned elements are placed in an execution optree that looks like
['+', 'a', 'b.c']. The previous example would parse down to something like (but with
a little bit different syntax for encoding the parsed variables, operators and arguments
to expressions:
['*', ['+', 'a', ['-', 'b.c', 1.3]], ['+', 2, 3]]
=head1 VARIABLE PARSE TREE
CGI::Ex::Template parses templates into an tree of operations. Even
variable access is parsed into a tree. This is done in a manner
somewhat similar to the way that TT operates except that nested
variables such as foo.bar|baz contain the '.' or '|' in between each
name level. Operators are parsed and stored as part of the variable (it
may be more appropriate to say we are parsing a term or an expression).
The following table shows a variable or expression and the corresponding parsed tree
(this is what the parse_variable method would return).
one [ 'one', 0 ]
one() [ 'one', [] ]
one.two [ 'one', 0, '.', 'two', 0 ]
one|two [ 'one', 0, '|', 'two', 0 ]
one.$two [ 'one', 0, '.', ['two', 0 ], 0 ]
one(two) [ 'one', [ ['two', 0] ] ]
one.${two().three} [ 'one', 0, '.', ['two', [], '.', 'three', 0], 0]
2.34 2.34
"one" "one"
"one"|length [ \"one", 0, '|', 'length', 0 ]
"one $a two" [ \ [ '~', 'one ', ['a', 0], ' two' ], 0 ]
[0, 1, 2] [ \ [ 'array', 0, 1, 2 ], 0 ]
[0, 1, 2].size [ \ [ 'array', 0, 1, 2 ], 0, '.', 'size', 0 ]
['a', a, $a ] [ \ [ 'array', 'a', ['a', 0], [['a', 0], 0] ], 0]
{a => 'b'} [ \ [ 'hash', 'a', 'b' ], 0 ]
{a => 'b'}.size [ \ [ 'hash', 'a', 'b' ], 0, '.', 'size', 0 ]
{$a => b} [ \ [ 'hash', ['a', 0], ['b', 0] ], 0 ]
1 + 2 [ \ [ '+', 1, 2 ], 0]
a + b [ \ [ '+', ['a', 0], ['b', 0] ], 0 ]
a * (b + c) [ \ [ '*', ['a', 0], [ \ ['+', ['b', 0], ['c', 0]], 0 ]], 0 ]
(a + b) [ \ [ '+', ['a', 0], ['b', 0] ]], 0 ]
(a + b) * c [ \ [ '*', [ \ [ '+', ['a', 0], ['b', 0] ], 0 ], ['c', 0] ], 0 ]
a ? b : c [ \ [ '?', ['a', 0], ['b', 0], ['c', 0] ], 0 ]
a || b || c [ \ [ '||', ['a', 0], [ \ [ '||', ['b', 0], ['c', 0] ], 0 ] ], 0 ]
! a [ \ [ '!', ['a', 0] ], 0 ]
Some notes on the parsing.
Operators are parsed as part of the variable and become part of the variable tree.
Operators are stored in the variable tree using a reference to the arrayref - which
allows for quickly descending the parsed variable tree and determining that the next
node is an operator.
Parenthesis () can be used at any point in an expression to disambiguate precedence.
"Variables" that appear to be literal strings or literal numbers
are returned as the literal (no operator tree).
The following perl can be typed at the command line to view the parsed variable tree:
perl -e 'use CGI::Ex::Template; print CGI::Ex::Template::dump_parse("foo.bar + 2")."\n"'
Also the following can be included in a template to view the output in a template:
[% USE cet = CGI::Ex::Template %]
[%~ cet.dump_parse('foo.bar + 2').replace('\s+', ' ') %]
$OPERATORS = [
# type precedence symbols action (undef means play_operator will handle)
['prefix', 98, ['++'], undef ],
['prefix', 98, ['--'], undef ],
['postfix', 98, ['++'], undef ],
['postfix', 98, ['--'], undef ],
['infix', 96, ['**', 'pow'], sub { $_[0] ** $_[1] } ],
['prefix', 93, ['!'], sub { ! $_[0] } ],
['prefix', 93, ['-'], sub { @_ == 1 ? 0 - $_[0] : $_[0] - $_[1] } ],
['infix', 90, ['*'], sub { $_[0] * $_[1] } ],
['infix', 90, ['/'], sub { $_[0] / $_[1] } ],
['infix', 90, ['div', 'DIV'], sub { int($_[0] / $_[1]) } ],
['infix', 90, ['%', 'mod', 'MOD'], sub { $_[0] % $_[1] } ],
['infix', 85, ['+'], sub { $_[0] + $_[1] } ],
['infix', 85, ['-'], sub { @_ == 1 ? 0 - $_[0] : $_[0] - $_[1] } ],
['infix', 85, ['~', '_'], sub { join "", @_ } ],
['infix', 80, ['<'], sub { $_[0] < $_[1] } ],
['infix', 80, ['>'], sub { $_[0] > $_[1] } ],
['infix', 80, ['<='], sub { $_[0] <= $_[1] } ],
['infix', 80, ['>='], sub { $_[0] >= $_[1] } ],
['infix', 80, ['lt'], sub { $_[0] lt $_[1] } ],
['infix', 80, ['gt'], sub { $_[0] gt $_[1] } ],
['infix', 80, ['le'], sub { $_[0] le $_[1] } ],
['infix', 80, ['ge'], sub { $_[0] ge $_[1] } ],
['infix', 75, ['==', 'eq'], sub { $_[0] eq $_[1] } ],
['infix', 75, ['!=', 'ne'], sub { $_[0] ne $_[1] } ],
['infix', 70, ['&&'], undef ],
['infix', 65, ['||'], undef ],
['infix', 60, ['..'], sub { $_[0] .. $_[1] } ],
['ternary', 55, ['?', ':'], undef ],
['assign', 53, ['+='], sub { $_[0] + $_[1] } ],
['assign', 53, ['-='], sub { $_[0] - $_[1] } ],
['assign', 53, ['*='], sub { $_[0] * $_[1] } ],
['assign', 53, ['/='], sub { $_[0] / $_[1] } ],
['assign', 53, ['%='], sub { $_[0] % $_[1] } ],
['assign', 53, ['**='], sub { $_[0]** $_[1] } ],
['assign', 53, ['~=', '_='], sub { $_[0] . $_[1] } ],
['assign', 52, ['='], undef ],
['prefix', 50, ['not', 'NOT'], sub { ! $_[0] } ],
['infix', 45, ['and', 'AND'], undef ],
['infix', 40, ['or', 'OR'], undef ],
];
Again - it is sort of funny to see the same ideas discovered and rediscovered over and over.
What about associativity?
It printed out:6 / 5 * 4
CET: [% 6 / 5 * 4 %]
Perl: [% PERL %]print 6 / 5 * 4 [% END %]
2 ** 3 ** 4
CET: [% 2 ** 3 ** 4 %]
Perl: [% PERL %]print 2 ** 3 ** 4[% END %]
a---b
CET: [% a = 5; b = 2 %][% a---b %]
Perl: [% PERL %]$a = 5; $b = 2; print $a---$b[% END %]
a--- -b
CET: [% a = 5; b = 2 %][% a--- -b %]
Perl: [% PERL %]$a = 5; $b = 2; print $a--- -$b[% END %]
a--- --b
CET: [% a = 5; b = 2 %][% a--- --b %]
Perl: [% PERL %]$a = 5; $b = 2; print $a--- --$b[% END %]
6 / 5 * 4 CET: 0.3 Perl: 4.8 2 ** 3 ** 4 CET: 2.41785163922926e+24 Perl: 2.41785163922926e+24 a---b CET: 3 Perl: 3 a--- -b CET: 7 Perl: 7 a--- --b CET: 4 Perl: 4So - it looks like I need to fix my right vs left vs non-associative. I'll add that to the table and change the parser (it is always doing right right now - it used to always do left - it will be trivial and won't even cause a speed hit to allow it to do both). Thank you - I knew about precedence and precedence makes complete sense - associativity rules seem like they are a little more arbitrary and it seems to cry foul to the user that it isn't consistent (such is legacy). In the perl6 operators table they don't even mention if the operator group is right or left (though it probably does elsewhere in the doc).
I don't know if I triggered [Limbic~Region|L~R]'s hunt, but I was asking in the CB about modules for expression parsing and he expressed an interest.
I was looking for something that would be supplied with a) the string to parse, and b) the parse rules - a set of functions, binary and unary operators, and atoms. Something like this:
Expr::Parse->new({
function => [
{
name => 'p',
type => 'bool',
args => [ 'int' ],
test => sub { _is_prime($_[1]) },
},
{
name => 'rev',
type => 'int',
args => [ 'int' ],
test => sub { scalar reverse $_[1] },
},
],
binop => [
{
name => '+',
type => 'int',
args => [ 'int', 'int' ],
prec => 4,
test => sub { $_[1] + $_[2] },
},
{
name => '*',
type => 'int',
args => [ 'int', 'int' ],
prec => 3,
test => sub { $_[1] * $_[2] },
},
{
name => '=',
type => 'bool',
args => [ 'int', 'int' ],
prec => 6,
test => sub { $_[1] == $_[2] },
},
unop => [
{
name => '!',
type => 'bool',
args => [ 'bool' ],
prec => 1,
test => sub { $_[1] ? 0 : 1 },
},
],
atom => [
{
name => 'const',
pat => '\d+',
type => 'int',
test => sub { $_[1] },
},
],
});
Nobody in the CB could suggest anything at the time, so I wrote my own constructing a [cpan://Tree::Simple] tree as output. I found it surprisingly hard to write, even treating '(', ',' and ')' as builtins (for the function(arg, list) support), and ignoring the type information - I wrote and threw away hundreds of lines of code in at least half a dozen trial implementations before finally coming up with something workable if not particularly pretty.
If your parser can easily be extended to handle function calls I'll have a go at adapting it for my application.
Hugo
If your parser can easily be extended to handle function calls I'll have a go at adapting it for my application.I said it would be easy, didn't I? Well I've got to put my money (er, code) where my mouth is, so I adapted it to process function calls. With this global spec, which defines 2 functions:
my %function = (
sumsq => sub { my $sum = 0; foreach(@_) { $sum += $_*$_; } return $sum; }, # sum of squares
sqrt => sub { return sqrt shift; },
);
I added this piece in parse_value(), just in front of the code to handle variables:
if(/\G((?i:[a-z]\w*))\s*\(/gc) { # function '('
my $function = $1;
$function{$function} or die sprintf "Undefined function '$function' called at: \"%s\"", where();
my @arg;
unless(/\G\s*(?=\))/gc) {
while(1){
my($value) = parse_expr() or die sprintf "Expression expected at: \"%s\"", where();
push @arg, $value;
/\G\s*,/gc or last;
}
}
/\G\s+/gc;
/\G\)/gc or die sprintf "Parse error: ')' expected at: \"%s\"", where();
trace(sprintf "function '$function' called with %d argument%s", scalar @arg, @arg==1 ? "" : "s");
return $function{$function}->(@arg);
}
and with the data string "sumsq(3,2+2)*sqrt(36)", the output is:
Line 29 "·sumsq(3,2+2)*sqrt(36)" Line 29 "sumsq(·3,2+2)*sqrt(36)" Line 31 "sumsq(3·,2+2)*sqrt(36)" value=3 Line 29 "sumsq(3,·2+2)*sqrt(36)" Line 31 "sumsq(3,2·+2)*sqrt(36)" value=2 Line 36 "sumsq(3,2+·2)*sqrt(36)" op=+ Line 29 "sumsq(3,2+·2)*sqrt(36)" Line 31 "sumsq(3,2+2·)*sqrt(36)" value=2 Line 43 "sumsq(3,2+2·)*sqrt(36)" popping 2 + Line 45 "sumsq(3,2+2·)*sqrt(36)" result = 4 Line 85 "sumsq(3,2+2)·*sqrt(36)" function 'sumsq' called with 2 arguments Line 31 "sumsq(3,2+2)·*sqrt(36)" value=25 Line 36 "sumsq(3,2+2)*·sqrt(36)" op=* Line 29 "sumsq(3,2+2)*·sqrt(36)" Line 29 "sumsq(3,2+2)*sqrt(·36)" Line 31 "sumsq(3,2+2)*sqrt(36·)" value=36 Line 85 "sumsq(3,2+2)*sqrt(36)·" function 'sqrt' called with 1 argument Line 31 "sumsq(3,2+2)*sqrt(36)·" value=6 Line 43 "sumsq(3,2+2)*sqrt(36)·" popping 25 * Line 45 "sumsq(3,2+2)*sqrt(36)·" result = 150 150 Stack: This value is never affected
It seems to work fine, was finished in something like 1/4 hour, and doesn't take hundreds of lines of code. :)
type: prefix, postfix, left, right, none, ternary, or assign (right) precedence: the relative precedence value symbols or names: an arrayref of symbols or operators for that operation function: a code ref to run when that operator is found.
I wrote my own constructing a Tree::Simple tree as output.For completeness sake, I've rewritten the parser so it produces a parse tree. I took a quick look at [mod://Tree::Simple], and I found it too hard to my taste for the little benefit it would give me, so I'm using a handrolled function object instead — yes I'm converting the infix operators into prefix function calls. As an extra benefit, I can use [mod://overload] to return a symbolic representation of the parse tree when used as a string, or actually evaluate it, when used as a number.
I think it clearly demontrates its viability as a parser for real work.
#!perl -w
use strict;
my %var;
my %op = (
'+' => { prec => 10, assoc => 'L', exec => sub { $_[0] + $_[1] }, function => 'sum'},
'-' => { prec => 10, assoc => 'L', exec => sub { $_[0] - $_[1] }, function => 'dif'},
'*' => { prec => 20, assoc => 'L', exec => sub { $_[0] * $_[1] }, function => 'mul'},
'/' => { prec => 20, assoc => 'L', exec => sub { $_[0] / $_[1] }, function => 'div'},
'%' => { prec => 20, assoc => 'L', exec => sub { $_[0] % $_[1] }, function => 'mod'},
'**' => { prec => 30, assoc => 'R', exec => sub { $_[0] ** $_[1] }, function => 'pow'},
);
my %function = (
sumsq => sub { my $sum = 0; foreach(@_) { $sum += $_*$_; } return $sum; },
sqrt => sub { return sqrt shift; },
negate => sub { return -shift },
);
# turn the operators into a function
foreach my $op (values %op) {
$function{$op->{function}} = $op->{exec};
}
{
package Function;
sub new {
my $class = shift;
my %self;
$self{function} = shift;
$self{arguments} = [ @_ ];
return bless \%self, $class;
}
sub stringify {
my $self = shift;
local $" = ", ";
return "$self->{function}(@{$self->{arguments}})";
}
use overload '""' => \&stringify;
sub evaluate {
my $self = shift;
my $code = $function{$self->{function}} or die "No code provided for '$self->{function}'";
return $code->(map 0+$_, @{$self->{arguments}});
}
use overload '0+' => \&evaluate, fallback => 1;
}
# fields:
use constant VALUE => 0;
use constant OP => 1;
use constant PREC => 2;
use constant TRACE => 1;
sub parse_expr {
my @stack;
push @stack, [ undef, undef, 0 ]; # sentinel
while (1) {
trace() if TRACE;
my($value) = parse_value() or die "Parse error at " . where();
trace("value=$value") if TRACE;
my($op, $prec);
if(/\G\s*(\*\*|[+\-*\/%\\])/gc) { # operator
$op = $1;
$prec = $op{$op}{prec};
trace("op=$op") if TRACE;
} else { # no more
$prec = 0;
}
# process
while($stack[-1][PREC] > $prec) {
my($lhs, $op) = @{pop @stack};
trace("popping $lhs $op") if TRACE;
$value = Function->new($op{$op}{function}, $lhs, $value);
trace("result = $value") if TRACE;
}
if($prec) {
if($op{$op}{assoc} eq 'L') {
$prec++;
}
push @stack, [ $value, $op, $prec ];
} else {
pop @stack; # sentinel
return $value;
}
}
}
sub parse_value {
/\G\s+/gc;
if(/\G\+/gc) { # '+' value
trace("Unary plus") if TRACE;
return parse_value();
}
if(/\G-/gc) { # '-' value
trace("Unary minus") if TRACE;
return Function->new(negate => parse_value());
}
if(/\G((?:\d+\.?\d*|\.\d+)(?i:E[+-]?\d+)?)/gc) { # number
return $1;
}
if(/\G((?i:[a-z]\w*))\s*\(/gc) { # function '('
my $function = $1;
$function{$function} or die sprintf "Undefined function '$function' called at: \"%s\"", where();
my @arg;
unless(/\G\s*(?=\))/gc) {
while(1){
my($value) = parse_expr() or die sprintf "Expression expected at: \"%s\"", where();
push @arg, $value;
/\G\s*,/gc or last;
}
}
/\G\s+/gc;
/\G\)/gc or die sprintf "Parse error: ')' expected at: \"%s\"", where();
trace(sprintf "function '$function' called with %d argument%s", scalar @arg, @arg==1 ? "" : "s")
if TRACE;
return Function->new($function, @arg);
}
if(/\G((?i:[a-z]\w*))/gc) { # variable
return $var{$1};
}
if(/\G\(/gc) { # '(' expr ')'
my $value = parse_expr();
/\G\s+/gc;
/\G\)/gc or die sprintf "Parse error: ')' expected at: \"%s\"", where();
return $value;
}
return;
}
sub evaluate {
local $_ = shift;
my $value = parse_expr();
/\G\s+/gc;
/\G$/gc or die sprintf "Junk characters at end: \"%s\"", where();
return $value;
}
sub where { # debugging info
my $s = $_;
substr($s, pos || 0, 0) = "\267";
return $s;
}
sub trace {
my($pkg, $file, $line) = caller;
printf STDERR "Line %d \"%s\" %s\n", $line, where(), @_ ? shift : "";
}
%var = ( a => 101, b => 7 );
$_ = "sumsq(3,2+2)*sqrt(36)/2";
my $result = evaluate($_);
print "\nsource: $_\n";
use Data::Dumper;
$Data::Dumper::Indent = 1;
print "Dumped:\n", Dumper $result;
print "Stringified: $result\n";
print "Numerified: " . (0+$result) . "\n";
Output:
Line 61 "·sumsq(3,2+2)*sqrt(36)/2"
Line 61 "sumsq(·3,2+2)*sqrt(36)/2"
Line 63 "sumsq(3·,2+2)*sqrt(36)/2" value=3
Line 61 "sumsq(3,·2+2)*sqrt(36)/2"
Line 63 "sumsq(3,2·+2)*sqrt(36)/2" value=2
Line 68 "sumsq(3,2+·2)*sqrt(36)/2" op=+
Line 61 "sumsq(3,2+·2)*sqrt(36)/2"
Line 63 "sumsq(3,2+2·)*sqrt(36)/2" value=2
Line 75 "sumsq(3,2+2·)*sqrt(36)/2" popping 2 +
Line 77 "sumsq(3,2+2·)*sqrt(36)/2" result = sum(2, 2)
Line 117 "sumsq(3,2+2)·*sqrt(36)/2" function 'sumsq' called with 2 arguments
Line 63 "sumsq(3,2+2)·*sqrt(36)/2" value=sumsq(3, sum(2, 2))
Line 68 "sumsq(3,2+2)*·sqrt(36)/2" op=*
Line 61 "sumsq(3,2+2)*·sqrt(36)/2"
Line 61 "sumsq(3,2+2)*sqrt(·36)/2"
Line 63 "sumsq(3,2+2)*sqrt(36·)/2" value=36
Line 117 "sumsq(3,2+2)*sqrt(36)·/2" function 'sqrt' called with 1 argument
Line 63 "sumsq(3,2+2)*sqrt(36)·/2" value=sqrt(36)
Line 68 "sumsq(3,2+2)*sqrt(36)/·2" op=/
Line 75 "sumsq(3,2+2)*sqrt(36)/·2" popping sumsq(3, sum(2, 2)) *
Line 77 "sumsq(3,2+2)*sqrt(36)/·2" result = mul(sumsq(3, sum(2, 2)), sqrt(36))
Line 61 "sumsq(3,2+2)*sqrt(36)/·2"
Line 63 "sumsq(3,2+2)*sqrt(36)/2·" value=2
Line 75 "sumsq(3,2+2)*sqrt(36)/2·" popping mul(sumsq(3, sum(2, 2)), sqrt(36)) /
Line 77 "sumsq(3,2+2)*sqrt(36)/2·" result = div(mul(sumsq(3, sum(2, 2)), sqrt(36)), 2)
source: sumsq(3,2+2)*sqrt(36)/2
Dumped:
$VAR1 = bless( {
'function' => 'div',
'arguments' => [
bless( {
'function' => 'mul',
'arguments' => [
bless( {
'function' => 'sumsq',
'arguments' => [
'3',
bless( {
'function' => 'sum',
'arguments' => [
'2',
'2'
]
}, 'Function' )
]
}, 'Function' ),
bless( {
'function' => 'sqrt',
'arguments' => [
'36'
]
}, 'Function' )
]
}, 'Function' ),
'2'
]
}, 'Function' );
Stringified: div(mul(sumsq(3, sum(2, 2)), sqrt(36)), 2)
Numerified: 75
I have something similar in an interpreter I started to write ages ago in perl, but has abandonned it. The interpreter works but cannot interpret user-defined functions which makes it unsuitable for most purposes. I am now not motivated to continue it.
You can download the interpreter but I also copy some relevant parts of the code here. You can download the full version and run it with -p to dump the optree instead of interpretting it. (Piping the dump through | sed 's/\bline( /\n&/g' may make it more readable.)
For this, you have to know that the get_tok function gets the next token as a string, and unget_tok pushes a token back so that it's the next token get_tok will read. The following code will parse expressions, and is called at various places from the code that parses statements.
{
my($pars_expr1,$pars_expr2);
my %binops= (
# fun(args), var[index], and var.elt will be handled specially in code
"+", "add", "-", "sub", "*", "mul",
"/", "div", "div", "idiv", "mod", "mod",
"<", "lt", "=", "eq", ">", "gt", "<=", "le", "<>", "ne", ">=", "ge",
"es", "and", "vagy", "or",
);
my %binoppr= (
"+", 50, "-", 50, "*", 40,
"/", 40, "div", 40, "mod", 40,
"<", 60, "=", 60, ">", 60, "<=", 60, "<>", 60, ">=", 60,
"es", 40, "vagy", 50,
# all operators with the same precedence are left-to-right associative,
# eg: a-b-c ==> (a-b)-c
);
my %unops= (
# (expr), +expr are handled specially by code
"-", "neg", "nem", "not",
);
my %unoppr= (
"-", 30, "nem", 30,
);
my %unfunc= (
"kerekit", "round", "round", "round", "egeszresz", "floor",
"sqr", "sqr", "sqrt", "sqrt",
);
# More precedences hard-coded in the code below:
# maximal:137, unary+:30, unary():-INF, unary functions:30,
# binary[]:20, binary.:20, binary():20
$pars_expr1= sub {
my($t,$f,$x,$u,$p,$g);
($p,)= @_;
$t= get_tok ();
$t=~ m!^\d! and
return [$t=~/[.eE]/?"qf":"qi", 0+$t];
is_id $t and
return ["var", $t];
$f= $unops{$t} and do {
$g= $unoppr{$t};
return [$f, $pars_expr2->($g<$p?$g:$p)];
};
$t eq "(" and do {
$x= $pars_expr2->(137);
($u= get_tok) eq ")" or
die qq!error parsing expr at "$u": ")" expected!;
return $x;
};
$t eq "+" and
return $pars_expr2->(30<$p?30:$p);
$f= $unfunc{$t} and 30<=$p and do {
($u= get_tok) eq "(" or
die qq!parse error at "$u": "(" expected!;
$x= $pars_expr2->(137);
($u= get_tok) eq ")" or
die qq!parse error at "$u": ")" expected!;
return [$f, $x];
};
$t eq "'" and
return ["qs", get_tok];
die qq(error parsing expr at "$t": expr expected);
};
$pars_expr2= sub {
my($o,$f,$y,$x,$u,$p,$g);
($p,)= @_;
$x= $pars_expr1->($p);
{
$o= get_tok;
$f= $binops{$o} and
($g= $binoppr{$o})<=$p and
do {
$y= $pars_expr2->($g-1);
$x= [$f, $x, $y];
redo;
};
$o eq "." and 20<=$p and
do {
$$x[0] eq "var" or
die qq(parse error at ".": expression cannot be dotted);
$y= get_tok;
is_id $y or
die qq(parse error at "$y": record field expected);
push @$x, $y;
redo;
};
$o eq "[" and 20<=$p and
do {
$$x[0] eq "var" or
die qq(parse error at "[": expression cannot be subscripted);
$y= $pars_expr2->(137);
($u= get_tok) eq "]" or
die qq(parse error at "$u": "]" expected);
push @$x, $y;
redo;
};
};
unget_tok $o;
$x;
};
sub pars_expr () {
$pars_expr2->(137);
};
sub pars_lvalue () {
$pars_expr2->(20);
};
};
[http://www.math.bme.hu/~ambrus/pu/scan-snapshot.tgz|Another interpreter] I wrote in C++ as a school project also has a similar parsing routine.
perlmonks.org content © perlmonks.org and ambrus, bart, hv, ikegami, Rhandom
prlmnks.org © 2006 edmund von der burg (eccles & toad)
v 0.03