Regular Expression tricky newline problem
jkva
created: 2006-01-02 17:03:08
Fellow monks,

I am trying to accomplish the following : Say I have a file called info.txt :
Line1 : Dit is de eerste regel
Line2 : Dit is de tweede regel
Line3 : Dit is de derde regel
Line4 : Dit is de vierde regel

I have slurped this into a scalar called $contents using open and my $contents = join('', ). This all works right.

My objective is then to create a regular expression that captures what comes after "Line3 :" until the end of that line, so basically until it meets a newline after that.
I have tried several things for quite a while now and don't seem to be getting closer. The greedy .* operator seems to get me the closest, but since I have to ignore newlines using the /s flag I can't get this to work.

I would be grateful for any help.... knowing PM I will be getting a "D'oh why didn't I think of that" answer ;-)

-- jkva
Re: Regular Expression tricky newline problem
created: 2006-01-02 17:13:03

Use a non-greedy match:

#!/usr/bin/perl

use strict;
use warnings;

local $/=undef;
my $contents=<DATA>;
my ($thirdline) = $contents =~ m/Line3 : (.*?)\n/s;
print $thirdline."\n";


__DATA__
Line1 : Dit is de eerste regel
Line2 : Dit is de tweede regel
Line3 : Dit is de derde regel
Line4 : Dit is de vierde regel
Output:
Dit is de derde regel
Or you could match against a negated character class:
my ($thirdline) = $contents =~ m/Line3 : ([^\n]+)/s;

A computer is a state machine. Threads are for people who can't program state machines. -- Alan Cox
Reaped: Re^2: Regular Expression tricky newline problem
created: 2006-01-02 22:58:53
This node was taken out by the NodeReaper on 2006-01-03 02-01-22
Reason: [jdporter]: troll rampage continues


Re: Regular Expression tricky newline problem
created: 2006-01-02 17:13:13
Greedy matching works as long as you're not using the /s modifier:
use strict;

my $string = join '', <DATA>;

if($string =~ /^Line3 : (.*)/m) {
    print "$1\n";
}

__DATA__
Line1 : Dit is de eerste regel
Line2 : Dit is de tweede regel
Line3 : Dit is de derde regel
Line4 : Dit is de vierde regel
The /m modifier is necessary to have the ^ anchor match the beginning of any line in a multi-line string. The greedy .* will then match anything until the end of that line.

Had you used the /s modifier, the greedy .* would have matched newlines as well and therefore gobbled up everything until the end of the multi-line string.
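The difference between the two modifiers can be seen side by side. This is only a small sketch: the heredoc stands in for the slurped info.txt, and the variable names $no_s and $with_s are made up for illustration.

```perl
use strict;
use warnings;

# Sample text mirroring info.txt from the question.
my $contents = <<'END';
Line1 : Dit is de eerste regel
Line2 : Dit is de tweede regel
Line3 : Dit is de derde regel
Line4 : Dit is de vierde regel
END

# Without /s, "." never matches "\n", so greedy .* stops at the end of the line.
my ($no_s) = $contents =~ /^Line3 : (.*)/m;

# With /s, "." also matches "\n", so greedy .* runs to the end of the string.
my ($with_s) = $contents =~ /^Line3 : (.*)/ms;

print "without /s: [$no_s]\n";
print "with /s:    [$with_s]\n";
```

The first capture holds only the rest of line 3; the second also contains line 4 and the final newline.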

Reaped: Re^2: Regular Expression tricky newline problem
created: 2006-01-02 22:57:32
This node was taken out by the NodeReaper on 2006-01-03 02-04-46
Reason: [jdporter]: troll rampage continues


Re: Regular Expression tricky newline problem
created: 2006-01-02 17:16:15

Use '?' to get a nongreedy match up to the first newline:

$data =~ m/Line3 : (.*?)\n/s;

You can also use a greedy match with a negated character class:

$data =~ m/Line3 : ([^\n]*)/s;

My test code follows:

use warnings;
use strict;

my $data = join( '', <DATA> );
print "[$data]\n\n";

$data =~ m/Line3 : (.*?)\n/s or die;
print "match: [$1]\n";

$data =~ m/Line3 : ([^\n]*)/s or die;
print "match: [$1]\n";

__DATA__
Line1 : Dit is de eerste regel
Line2 : Dit is de tweede regel
Line3 : Dit is de derde regel
Line4 : Dit is de vierde regel
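The two patterns are not quite interchangeable. As a hedged illustration, assume a hypothetical input in which the wanted line is the last one and has no trailing newline: the non-greedy form insists on a literal "\n" and fails, while the negated character class still matches. The names $lazy and $negated are invented for this sketch.

```perl
use strict;
use warnings;

# Hypothetical input: the wanted line is last and has no trailing newline.
my $data = "Line1 : Dit is de eerste regel\nLine3 : Dit is de derde regel";

# Requires a literal "\n" after the capture, so there is no match here
# and $lazy stays undefined.
my ($lazy) = $data =~ /Line3 : (.*?)\n/s;

# [^\n]* stops at a newline or at the end of the string, whichever comes first.
my ($negated) = $data =~ /Line3 : ([^\n]*)/;

print "lazy:    ", ( defined($lazy) ? "[$lazy]" : "no match" ), "\n";
print "negated: [$negated]\n";
```

If the file might end without a newline, the character-class version is the more forgiving choice.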

Re: Regular Expression tricky newline problem
created: 2006-01-02 17:37:08

Some sample code, along with "this is what I get" and "this is what I want", would help us understand where you are having a problem. The following should be a good starting point, if not the stimulus for a D'oh moment :).

use strict;
use warnings;

my $lines = do {local $/; <DATA>};

my ($line3) = $lines =~ /Line 3 : (.*?)\n/;
print ">$line3<";

__DATA__
Line 1 : Dit is de eerste regel
Line 2 : Dit is de tweede regel
Line 3 : Dit is de derde regel
Line 4 : Dit is de vierde regel

Prints:

>Dit is de derde regel<

DWIM is Perl's answer to Gödel
Reaped: Re^2: Regular Expression tricky newline problem
created: 2006-01-02 23:00:18
This node was taken out by the NodeReaper on 2006-01-03 02-04-51
Reason: [jdporter]: troll rampage continues


Re: Regular Expression tricky newline problem
created: 2006-01-02 21:58:13
I have slurped this into a scalar called $contents using open and my $contents = join('', ). This all works right.
That works, of course .. just wanted to share my immediate thought of this node: node 287647
Re: Regular Expression tricky newline problem
created: 2006-01-02 22:50:51
You've got answers for doing the appropriate regex on the slurped file data, as well as suggestions on improving how you do the slurp, so I'd just like to add that I wouldn't use a whole-file slurp into a scalar in a case like this.

The task appears to be line-oriented, so it would make sense to stick with line-oriented handling of the data. Depending on what else might need to be done with the file contents in the same script (whether you need to do things with other lines besides "Line 3"), you could either read the whole file into an array of lines and use grep on the array, or else use grep directly on the line-oriented file-read operator:

# load file into an array of lines, and use "Line 3":
# (assumes $fh is a filehandle already opened on the file)

my @lines = <$fh>;
my ( $keeper ) = grep /^Line 3 : /, @lines;

# or just get "Line 3" from the file, and skip the rest:
#my ($keeper) = grep /^Line 3 : /, <$fh>;

# (update: added parens around $keeper, as per Aristotle's correction)

# either way, remove the unwanted content from the kept line:

$keeper =~ s/Line 3 : //;
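For completeness, the same line-oriented idea can be written as a plain read loop that stops as soon as the line turns up. This is a sketch under assumptions: the in-memory filehandle $fh and the $text variable are stand-ins for a real file opened with open, and are not taken from the posts above.

```perl
use strict;
use warnings;

# Stand-in for the file contents; a real script would open info.txt instead.
my $text = <<'END';
Line 1 : Dit is de eerste regel
Line 2 : Dit is de tweede regel
Line 3 : Dit is de derde regel
Line 4 : Dit is de vierde regel
END

# In-memory filehandle (assumption for this sketch) read one line at a time.
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my $keeper;
while ( my $line = <$fh> ) {
    chomp $line;
    if ( $line =~ /^Line 3 : (.*)/ ) {
        $keeper = $1;    # text after the "Line 3 : " prefix
        last;            # stop reading once the line is found
    }
}
close $fh;

print "$keeper\n";
```

Because the loop exits at the first match, the rest of the file is never read, and no separate substitution is needed to strip the prefix.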
Reaped: Re^2: Regular Expression tricky newline problem
created: 2006-01-02 23:01:37
This node was taken out by the NodeReaper on 2006-01-03 02-04-52
Reason: [jdporter]: troll rampage continues


Re^2: Regular Expression tricky newline problem
created: 2006-01-03 08:16:00

Careful, you’re invoking grep in scalar context. $keeper will only contain the count of matches. This has to be written with a parenthesised my, like so:

my ( $keeper ) = grep /^Line 3 : /, @lines;

However, that always goes through the entire data, regardless of where the match is found. A better way would be List::Util’s first, with which the context does not matter either:

use List::Util qw( first );

my $keeper = first { /^Line 3 : / } @lines;
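The three variants can be compared directly. A minimal sketch, with a hard-coded @lines array standing in for the lines read from the file; $count, $listed and $found are names invented here.

```perl
use strict;
use warnings;
use List::Util qw( first );

# Stand-in for the lines read from the file.
my @lines = (
    "Line 1 : Dit is de eerste regel\n",
    "Line 2 : Dit is de tweede regel\n",
    "Line 3 : Dit is de derde regel\n",
    "Line 4 : Dit is de vierde regel\n",
);

# Scalar context: grep returns the number of matching lines, not a line.
my $count = grep /^Line 3 : /, @lines;

# List context (parenthesised my): the first matching line is assigned.
my ($listed) = grep /^Line 3 : /, @lines;

# first returns the first matching element and stops searching there.
my $found = first { /^Line 3 : / } @lines;

chomp( $listed, $found );
print "count:  $count\n";
print "listed: $listed\n";
print "found:  $found\n";
```

Here $count is a number, while $listed and $found both hold the full line, prefix included.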

Makeshifts last the longest.

perlmonks.org content © perlmonks.org and Aristotle, bobf, davidrw, graff, GrandFather, jkva, NodeReaper, saintmike, tirwhan

prlmnks.org © 2006 edmund von der burg (eccles & toad)
