While trying to emulate the C preprocessor I had major trouble handling C-style /* ... */ comments. Two things cause particular grief: comments can span lines and, at least for some compilers, comments can be nested (and they are nested in the code I need to handle).
An additional gotcha is that anything that looks like a comment inside a string must be retained.
The code below parses an input string and generates an output string containing the original text without the C-style comments. Note that it leaves C++ single-line comments in place, but those are easily dealt with in a second pass.
use strict;
use warnings;
use Parse::RecDescent;
my $decommendedText = '';
sub concat ($) {$decommendedText .= $_[0]; 1;}
my $decomment = <<'GRAMMAR';
file : block(s)
block : string
        {::concat ($item{string}); 1}
    | m{((?!/\*|"|').)+}s
        {::concat ($item[-1]); 1}
    | comment
        {::concat ($item{comment}); 1;}
string : /"([^"]|\\")*"/
        {$return = $item[-1] . ($text =~ /^\n/ ? "\n" : ''); 1;}
    | /'([^']|\\')*'/
        {$return = $item[-1] . ($text =~ /^\n/ ? "\n" : ''); 1;}
comment : '/*' commentBlock '*/'
        {$return = $text =~ /^\n/ ? "\n" : ''; 1;}
commentBlock : m{((?! \*/ | /\* ).)*}sx comment m{((?! \*/ | /\* ).)*}sx
        {$return = "\n"; 1;}
    | m{((?! \*/ | /\* ).)+}sx
        {$return = ''; 1;}
GRAMMAR
my $parse = Parse::RecDescent->new ($decomment);
my $input = <<'DATA';
#include "StdAfx.h" // Tail comment
#include "Utility\perftime.h"
#pragma hdrstop
/* Comment before MACRO */
/* Comment /* and nested comment */ lines */
#define MACRO 10\
+ 3 // Multi line macro with comment
#define __DEBUG /* comment */ 1
#define STRING 'This is a string' /* comment */
#define COMMENT "/* comment in \"a\" string */"
// c++ comment line
/* Comment at start for a number of lines */
/* multi-line comment
/* nested */
block */
// cpp block
char PerfTimer::Buf[64];
DATA
$parse->file($input) or die "Parse failed\n";
print $decommendedText;
Prints:
#include "StdAfx.h"// Tail comment
#include "Utility\perftime.h"
#pragma hdrstop
#define MACRO 10\
+ 3 // Multi line macro with comment
#define __DEBUG 1
#define STRING 'This is a string'
#define COMMENT "/* comment in \"a\" string */"
// c++ comment line
// cpp block
char PerfTimer::Buf[64];
I have a few small improvements, which the following version incorporates.
use strict;
use warnings;
use Parse::RecDescent;
my $decomment = <<'GRAMMAR';
{
    use strict;
    use warnings;
    my $decommendedText = '';    # accumulates the comment-free text
    sub concat { $decommendedText .= $_[0]; }
}
file : block(s) /\Z/
        {$return = $decommendedText; 1;}
block : string
        {concat ($item{string}); 1;}
    | m{((?!/\*|"|').)+}s
        {concat ($item[-1]); 1;}
    | comment
        {concat ($item{comment}); 1;}
string : /"([^"]|\\")*"/
        {$return = $item[-1] . ($text =~ /^\n/ ? "\n" : ''); 1;}
    | /'([^']|\\')*'/
        {$return = $item[-1] . ($text =~ /^\n/ ? "\n" : ''); 1;}
comment : '/*' commentBlock '*/'
        {$return = $text =~ /^\n/ ? "\n" : ''; 1;}
commentBlock : m{((?! \*/ | /\* ).)*}sx comment m{((?! \*/ | /\* ).)*}sx
        {$return = "\n"; 1;}
    | m{((?! \*/ | /\* ).)+}sx
        {$return = ''; 1;}
GRAMMAR
...
my $decommendedText = $parse->file($input);
die "Parse failed\n" if not defined $decommendedText;
print $decommendedText;
Update: Now fixes $decommendedText being a global.
Actually concat is there:
sub concat ($) {$decommendedText .= $_[0]; 1;}
:)
Update: Thanks for the other tips - especially the returned result from file and eof detection.
This doesn't handle this case:
// We do not use /*-style comments
It doesn't even handle the case:
// We don't use old C-style comments
because it tries to find the closing single quote to match the apostrophe in "don't". You simply have to parse //-style comments for such a tool.
/"([^"]|\\")*"/
This doesn't handle "\\". Also note that it will fail for strings of 32K characters, which is why I prefer to add the + in "([^"\\]+|\\.)*".
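A minimal sketch of the suggested pattern in plain Perl, for checking by hand how it treats escapes. The helper name leading_string and the sample inputs are made up for illustration; only the regex comes from the text above (written here with a non-capturing group).

```perl
use strict;
use warnings;

# Suggested string pattern: between double quotes, accept runs of ordinary
# characters, or a backslash followed by any one character.
my $string_re = qr/"(?:[^"\\]+|\\.)*"/;

# Hypothetical helper: return the string literal at the start of $src, or undef.
sub leading_string {
    my ($src) = @_;
    return $src =~ /\A($string_re)/ ? $1 : undef;
}

my @samples = (
    '"\\\\" /* comment */ "x"',        # the C string "\\" followed by a comment
    '"say \\"hi\\"" /* comment */',    # a C string containing escaped quotes
);
print leading_string($_), "\n" for @samples;
```

Run as is, this should print "\\" and then "say \"hi\"": each backslash is consumed together with the character after it, so neither an escaped backslash nor an escaped quote ends the string early.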
Why don't you factor out m{((?! \*/ | /\* ).)+}sx into its own rule so you don't have to repeat that regex three times and so you can assign a descriptive name to it to aid understanding?
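One way that factoring might look, as a sketch only: the rule name commentText is invented here, and the fragment has not been run against the full grammar. Because the named rule uses +, the two optional pieces around a nested comment become commentText(?).

```perl
use strict;
use warnings;

# Grammar fragment only; it would replace the comment and commentBlock rules
# in the grammar passed to Parse::RecDescent in the posts above.
# "commentText" is a hypothetical rule name chosen for this sketch.
my $commentRules = <<'GRAMMAR';
comment      : '/*' commentBlock '*/'
                 {$return = $text =~ /^\n/ ? "\n" : ''; 1;}
commentBlock : commentText(?) comment commentText(?)
                 {$return = "\n"; 1;}
             | commentText
                 {$return = ''; 1;}
commentText  : m{((?! \*/ | /\* ).)+}sx
GRAMMAR

print $commentRules;
```

The regex now appears once, under a name that says what it matches.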
m{((?!/\*|"|').)+}s could be replaced by an alternation of [^/"'] and /(?!\*), which is more to my taste, but YMMV.
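A small sketch comparing the two spellings on a couple of made-up inputs. The variable names, the prefix helper and the exact form of the second pattern are assumptions about what is meant above, not code from the thread.

```perl
use strict;
use warnings;

my $lookahead = qr{((?!/\*|"|').)+}s;        # pattern used in the grammar
my $charclass = qr{(?:[^/"']|/(?!\*))+};     # assumed reading of the alternative

# Hypothetical helper: the part of $src that $re matches at the very start.
sub prefix {
    my ($re, $src) = @_;
    return $src =~ /\A$re/ ? substr($src, 0, $+[0]) : '';
}

for my $src ('a/b + c /* note */', 'x = y; "str"') {
    printf "[%s] [%s]\n", prefix($lookahead, $src), prefix($charclass, $src);
}
```

Both patterns should stop at the same place: just before a quote character or the start of a /* comment, while a lone / (as in a/b) is still accepted.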
And I'd probably do this all with simpler regexes and a simple state machine instead of resorting to Parse::RecDescent (not that my result will be simpler code in total). Note that I even avoid having to slurp the entire input into a single string.
#!/usr/bin/perl -w
use strict;
$|= 1; # Useful for ad-hoc testing
my $canNest= 1; # Whether /*-style comments can be nested
my $depth= 0;
my $output= "";
while( <DATA> ) {
    while( ! m[\G\z]gc ) {
        while( $depth && m[/[*]|[*]/]gc ) {
            if( "/" eq substr( $_, $-[0], 1 ) ) {
                $depth++;
            } elsif( $canNest ) {
                $depth--;
            } else {
                $depth= 0;
            }
        }
        last if $depth;
        if( m[
                \G
                (?:
                    [^'"/]+
                |   ' (?: [^'\\]+ | \\. )* '
                |   " (?: [^"\\]+ | \\. )* "
                |   /(?![/*])
                )+
            ]xgc
        ) {
            $output .= substr( $_, $-[0], $+[0] - $-[0] );
        } elsif( m[\G//.*]gc ) {
            # skip C++ comments
        } elsif( m[\G/[*]]gc ) {
            $depth++;
        } elsif( m[\G['"]]gc ) {
            warn "Ignoring unclosed quote: $_";
        } else {
            die $_, ' ' x pos($_), "^\nCouldn't be parsed";
        }
    }
    print $output;
    $output= "";
}
warn "$depth unclosed /*-comments\n" if $depth;
__END__
#include "StdAfx.h" // Tail comment
#include "Utility\perftime.h"
#pragma hdrstop
/* Comment before MACRO */
/* Comment /* and nested comment */ lines */
#define MACRO 10\
+ 3 // Multi line macro with comment
#define __DEBUG /* comment */ 1
#define STRING 'This is a string' /* comment */
#define BACKSLASH '\\'
#define COMMENT "/* comment in \"a\" string */"
// c++ comment line
/* Comment at start for a number of lines */
/* multi-line comment
/* nested */
block */
// cpp block
char PerfTimer::Buf[64];
// Don't use contractions
// /*-style comment below over multiple lines:
test/*ing how newlines work
when a comment spans lines, does it st*/ing?
total/*divide*//count//*comment
Produces:
#include "StdAfx.h"
#include "Utility\perftime.h"
#pragma hdrstop
#define MACRO 10\
+ 3
#define __DEBUG 1
#define STRING 'This is a string'
#define BACKSLASH '\\'
#define COMMENT "/* comment in \"a\" string */"
char PerfTimer::Buf[64];
testing?
total/count
(Very minor updates applied.)
- [tye]
C comments don't nest!
They don't lay eggs, either... :)
It may well be that the comments that you are familiar with are sterile and have no inclination toward nesting. However, I can assure you, that our comments are quite lively and prone to nesting. They therefore require appropriate management.
Perhaps I should mention that these are actually C++ comments and it may be that hybrid vigor accounts for the difference in nesting behaviour?
perlmonks.org content © perlmonks.org and ForgotPasswordAgain, GrandFather, ikegami, tye