C comment stripping preprocessor
GrandFather
created: 2006-08-09 14:20:17

While trying to emulate the C preprocessor I had major trouble handling C-style /* ... */ comments. Two issues cause particular grief: comments can span lines and, at least for some compilers, comments can be nested (and they are in the code I need to handle).

An additional gotcha is that things that look like comments in strings need to be retained.

The code below parses an input string and generates an output string comprising the original text sans C-style comments. Note that it leaves C++ single-line (//) comments in place; those are easily dealt with in a second pass.
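For what it's worth, here is a minimal sketch of what such a second pass might look like. It is not part of the code below: strip_line_comments is a hypothetical helper, and it assumes the /* ... */ comments are already gone and that string literals never span lines unescaped. Quoted literals are matched first and copied through, so a // inside a string is left alone.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not from the original post): remove
# //-comments from text whose /* ... */ comments were already stripped.
sub strip_line_comments {
    my ($text) = @_;
    $text =~ s{
        (                               # group 1: a quoted literal, kept verbatim
            " (?: [^"\\\n]+ | \\. )* "
          | ' (?: [^'\\\n]+ | \\. )* '
        )
      | // [^\n]*                       # a line comment up to end of line, dropped
    }{ defined $1 ? $1 : '' }gesx;
    return $text;
}

my $src = qq{#include "a//b.h" // tail\nint x = 1; // don't\n};
print strip_line_comments($src);
```

Because the substitution always takes the leftmost match, the apostrophe in "don't" is swallowed as part of the comment and never gets a chance to open a quoted literal.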

use strict;
use warnings;
use Parse::RecDescent;

my $decommendedText = '';

sub concat ($) {$decommendedText .= $_[0]; 1;}

my $decomment = <<'GRAMMAR';
file : block(s)

block   : string
            {::concat ($item{string}); 1}
        | m{((?!/\*|"|').)+}s
            {::concat ($item[-1]); 1}
        | comment
            {::concat ($item{comment}); 1;}
            
string  : /"([^"]|\\")*"/
            {$return = $item[-1] . ($text =~ /^\n/ ? "\n" : ''); 1;}
        | /'([^']|\\')*'/
            {$return = $item[-1] . ($text =~ /^\n/ ? "\n" : ''); 1;}
        
comment : '/*' commentBlock '*/'
            {$return = $text =~ /^\n/ ? "\n" : ''; 1;}

commentBlock   :  m{((?! \*/ | /\* ).)*}sx comment m{((?! \*/ | /\* ).)*}sx
            {$return = "\n"; 1;}
        | m{((?! \*/ | /\* ).)+}sx
            {$return = ''; 1;}

GRAMMAR


my $parse = Parse::RecDescent->new($decomment) or die "Bad grammar\n";

my $input = <<'DATA';
#include "StdAfx.h" // Tail comment
#include "Utility\perftime.h"
#pragma hdrstop

/* Comment before MACRO */
/* Comment /* and nested comment */ lines */

#define MACRO 10\
              + 3 // Multi line macro with comment

#define __DEBUG /* comment */ 1
#define STRING 'This is a string' /* comment */
#define COMMENT "/* comment in \"a\" string */"

// c++ comment line
/* Comment at start for a number of lines */
/* multi-line comment
/* nested */
block */
// cpp block

char PerfTimer::Buf[64];
DATA

$parse->file($input) or die "Parse failed\n";
print $decommendedText;

Prints:

#include "StdAfx.h"// Tail comment
#include "Utility\perftime.h"
#pragma hdrstop



#define MACRO 10\
              + 3 // Multi line macro with comment

#define __DEBUG 1
#define STRING 'This is a string'
#define COMMENT "/* comment in \"a\" string */"
// c++ comment line


// cpp block

char PerfTimer::Buf[64];

DWIM is Perl's answer to Gödel
Re: C comment stripping preprocessor
created: 2006-08-09 14:34:25

I have a few little improvements.

  • strict and warnings are specified for your program, but not for the code snippets in your grammar.
  • There is an undue requirement on the user (i.e. the calling script) to provide concat.
  • $decommendedText should not be global, and the user (i.e. the calling script) should not have to initialize it.
  • There is no check for end-of-file. If the parser fails halfway through, there's no way of knowing.

The following addresses these issues.

use strict;
use warnings;
use Parse::RecDescent;

my $decomment = <<'GRAMMAR';

{
   use strict;
   use warnings;

   sub concat { $decommendedText .= $_[0]; }
}

file    : <rulevar: local $decommendedText = ''>
        | block(s) /\Z/
            {$return = $decommendedText; 1;}

block   : string
            {concat ($item{string}); 1;}
        | m{((?!/\*|"|').)+}s
            {concat ($item[-1]); 1;}
        | comment
            {concat ($item{comment}); 1;}
            
string  : /"([^"]|\\")*"/
            {$return = $item[-1] . ($text =~ /^\n/ ? "\n" : ''); 1;}
        | /'([^']|\\')*'/
            {$return = $item[-1] . ($text =~ /^\n/ ? "\n" : ''); 1;}
        
comment : '/*' commentBlock '*/'
            {$return = $text =~ /^\n/ ? "\n" : ''; 1;}

commentBlock   :  m{((?! \*/ | /\* ).)*}sx comment m{((?! \*/ | /\* ).)*}sx
            {$return = "\n"; 1;}
        | m{((?! \*/ | /\* ).)+}sx
            {$return = ''; 1;}

GRAMMAR

...

my $decommendedText = $parse->file($input);
die "Parse failed\n" if not defined $decommendedText;
print $decommendedText;

Update: Now fixes $decommendedText being a global.

Re^2: C comment stripping preprocessor
created: 2006-08-09 14:37:08

Actually concat is there:

sub concat ($) {$decommendedText .= $_[0]; 1;}

:)

Update: Thanks for the other tips - especially the returned result from file and eof detection.


Re^3: C comment stripping preprocessor
created: 2006-08-09 14:43:09

It was in main::, in the caller. Why would a module rely on the calling script to provide its internal functions? That's no good. I moved the function into the grammar, where it belongs. The problem becomes extremely evident when you Precompile the parser (as you should).

Re: C comment stripping preprocessor (problems)
tye
created: 2006-08-09 15:20:50

This doesn't handle this case:

// We do not use /*-style comments

It doesn't even handle the case:

// We don't use old C-style comments

because it tries to find the closing single quote to match the apostrophe in "don't". You simply have to parse //-style comments for such a tool.

/"([^"]|\\")*"/

This doesn't handle "\\". Also note that it will fail for strings of 32K characters, which is why I prefer to add the + in "([^"\\]+|\\.)*".
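A quick way to compare the two string patterns is to run both against literals that contain escapes. This is only a sketch; the variable names and sample inputs are made up for illustration.

```perl
use strict;
use warnings;

# The pattern from the grammar, and the form suggested above.
my $simple  = qr/"([^"]|\\")*"/;
my $escaped = qr/"(?:[^"\\]+|\\.)*"/;

# Two C string literals: one holding an escaped quote, one a lone backslash.
my @samples = ( '"a\"b" rest', '"\\\\" rest' );

for my $src (@samples) {
    my ($s) = $src =~ /($simple)/;     # first capture is the whole literal
    my ($e) = $src =~ /($escaped)/;
    print "source:  $src\n  simple:  $s\n  escaped: $e\n";
}
```

On the first sample the grammar's pattern stops at the escaped quote and returns "a\" because [^"] is tried first and eats the backslash on its own. The suggested pattern always consumes a backslash together with the character it escapes, so it returns both literals whole.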

Why don't you factor out m{((?! \*/ | /\* ).)+}sx into its own rule so you don't have to repeat that regex three times and so you can assign a descriptive name to it to aid understanding?
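In the grammar that would mean a separate rule holding the regex. The same idea can be sketched in plain Perl by naming the pattern once with qr// and building on it. All three names here are hypothetical, and the last pattern only copes with one level of nesting.

```perl
use strict;
use warnings;

# Hypothetical names; the first pattern is the one repeated in the grammar.
my $comment_text   = qr{(?:(?! \*/ | /\* ).)+}sx;    # text free of comment delimiters
my $flat_comment   = qr{/\* $comment_text? \*/}x;    # a comment with nothing nested inside
my $nested_comment = qr{/\* (?: $comment_text | $flat_comment )* \*/}x;   # one nesting level

my $src = '/* Comment /* and nested comment */ lines */ int x;';
my ($comment) = $src =~ /\A($nested_comment)/;
print "$comment\n";
```

With the pieces named, the structure of a comment reads off the last pattern directly instead of having to be decoded from three copies of the same lookahead.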

m{((?!/\*|"|').)+}s could be replaced by [^/"'] and /(?!\*), which is more to my taste but YMMV.
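To see that the two spellings accept the same text, one can run both over a few inputs and compare what each consumes. A small sketch follows; the sample strings and variable names are invented for the purpose.

```perl
use strict;
use warnings;

my $lookahead = qr{((?!/\*|"|').)+}s;       # form used in the grammar
my $classes   = qr{(?:[^/"']|/(?!\*))+};    # character class plus a guarded slash

my @samples = ( "a / b /* c */", "x = y; 'q'", "line1\nline2 \"s\"" );

for my $src (@samples) {
    my ($l) = $src =~ /\A($lookahead)/;     # first capture is the whole run
    my ($c) = $src =~ /\A($classes)/;
    print( ( $l eq $c ? "same" : "differ" ), ": [$l]\n" );
}
```

Both stop just before the first /*, double quote, or single quote, and both let a lone slash through. The character class needs no /s flag because a negated class already matches newlines.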

And I'd probably do this all with simpler regexes and a simple state machine instead of resorting to Parse::RecDescent (not that my result will be simpler code in total). Note that I even avoid having to slurp the entire input into a single string.

#!/usr/bin/perl -w
use strict;

$|= 1;              # Useful for ad-hoc testing
my $canNest= 1;     # Whether /*-style comments can be nested

my $depth= 0;
my $output= "";
while( <DATA> ) {
    while(  ! m[\G\z]gc  ) {
        while(  $depth  &&  m[/[*]|[*]/]gc  ) {
            if(  "/" eq substr( $_, $-[0], 1 )  ) {
                $depth++;
            } elsif(  $canNest  ) {
                $depth--;
            } else {
                $depth= 0;
            }
        }
        last   if  $depth;
        if( m[
                \G
                (?:
                    [^'"/]+
                  | ' (?: [^'\\]+ | \\. )* '
                  | " (?: [^"\\]+ | \\. )* "
                  | /(?![/*])
                )+
            ]xgc
        ) {
            $output .= substr( $_, $-[0], $+[0] - $-[0] );
        } elsif(  m[\G//.*]gc  ) {
            # skip C++ comments
        } elsif(  m[\G/[*]]gc  ) {
            $depth++;
        } elsif(  m[\G['"]]gc  ) {
            warn "Ignoring unclosed quote: $_";
        } else {
            die $_, ' ' x pos($_), "^\nCouldn't be parsed";
        }
    }
    print $output;
    $output= "";
}
warn "$depth unclosed /*-comments\n"   if  $depth;
__END__
#include "StdAfx.h" // Tail comment
#include "Utility\perftime.h"
#pragma hdrstop

/* Comment before MACRO */
/* Comment /* and nested comment */ lines */

#define MACRO 10\
              + 3 // Multi line macro with comment

#define __DEBUG /* comment */ 1
#define STRING 'This is a string' /* comment */
#define BACKSLASH '\\'
#define COMMENT "/* comment in \"a\" string */"

// c++ comment line
/* Comment at start for a number of lines */
/* multi-line comment
/* nested */
block */
// cpp block

char PerfTimer::Buf[64];
// Don't use contractions
// /*-style comment below over multiple lines:
test/*ing how newlines work
when a comment spans lines, does it st*/ing?
total/*divide*//count//*comment

Produces

#include "StdAfx.h"
#include "Utility\perftime.h"
#pragma hdrstop




#define MACRO 10\
              + 3

#define __DEBUG  1
#define STRING 'This is a string'
#define BACKSLASH '\\'
#define COMMENT "/* comment in \"a\" string */"






char PerfTimer::Buf[64];


testing?
total/count

(Very minor updates applied.)

- [tye]        

Re: C comment stripping preprocessor
created: 2006-08-10 06:32:48

C comments don't nest!

They don't lay eggs, either... :)

Re^2: C comment stripping preprocessor
created: 2006-08-10 06:42:35

It may well be that the comments that you are familiar with are sterile and have no inclination toward nesting. However, I can assure you that our comments are quite lively and prone to nesting. They therefore require appropriate management.

Perhaps I should mention that these are actually C++ comments and it may be that hybrid vigor accounts for the difference in nesting behaviour?
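To make the difference concrete, here is a small sketch that strips the same input under both rules. strip_block_comments is a hypothetical helper; it deliberately ignores string literals and //-comments, so it is an illustration only and not a replacement for the code earlier in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: drop /* ... */ comments, optionally letting them nest.
# String literals and //-comments are ignored to keep the sketch short.
sub strip_block_comments {
    my ($src, $nest) = @_;
    my ($out, $depth) = ('', 0);
    # The capturing split keeps the delimiters as tokens of their own.
    for my $tok (split m{(/\*|\*/)}, $src) {
        if ($tok eq '/*') {
            if    ($depth == 0) { $depth = 1 }   # open a comment
            elsif ($nest)       { $depth++ }     # nested opener counts only when nesting
        }
        elsif ($tok eq '*/' && $depth > 0) {
            $depth--;                            # close the innermost open comment
        }
        elsif ($depth == 0) {
            $out .= $tok;                        # ordinary text outside any comment
        }
    }
    return $out;
}

my $src = 'a /* outer /* inner */ tail */ b';
print "standard: [", strip_block_comments($src, 0), "]\n";
print "nesting:  [", strip_block_comments($src, 1), "]\n";
```

Under the standard rule the first */ ends the comment, so " tail */" leaks into the output. With nesting enabled the whole outer comment disappears.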



perlmonks.org content © perlmonks.org and ForgotPasswordAgain, GrandFather, ikegami, tye
