1 John Doe, joe blogs, title journal Animal Female Protein 2 Fred.......
The keyword field starts at Animal, followed by a carriage return, then Female, then Protein. I need to replace these carriage returns with tabs while retaining the carriage return that separates the entries, i.e. the one between Protein and 2. Any suggestions would be great.
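my @fields;
my $i = 0;
while ( my $line = <$FILE> ) {
    chomp $line;
    if ( $line =~ /\t/ ) {
        # A line containing a tab starts a new record.
        if ( $i > 0 ) {
            # do something with the previous record's @fields array
        }
        @fields = split "\t", $line;
        $i++;
    }
    else {
        # No tab: a continuation line, i.e. one more keyword.
        push @fields, $line;
    }
}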
UPDATE: after the while loop finishes, @fields still contains the last record, so it has to be handled once more after the loop.
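The loop above, including the UPDATE about the final record, could be filled in roughly as follows. This is only a sketch: join_records is a hypothetical helper name, the sample lines are made up, and it assumes that every line containing a tab starts a new record while a line without one is a further keyword.

```perl
use strict;
use warnings;

# Hypothetical helper: join continuation lines onto the record they follow.
# Assumes a line containing a tab starts a new record and a line without
# one is an extra keyword belonging to the previous record.
sub join_records {
    my @lines = @_;
    my ( @records, @fields );
    for my $line (@lines) {
        chomp $line;
        if ( $line =~ /\t/ ) {
            # Flush the previous record before starting the next one.
            push @records, join( "\t", @fields ) if @fields;
            @fields = split /\t/, $line;
        }
        else {
            push @fields, $line;
        }
    }
    # The final record is still pending once the input runs out.
    push @records, join( "\t", @fields ) if @fields;
    return @records;
}

# Made-up sample input in the shape described in the question.
my @sample = (
    "1\tJohn Doe, Joe Bloggs\ttitle\tjournal\tAnimal\n",
    "Female\n",
    "Protein\n",
    "2\tMary Clary\ttitle\tmagazine\tFish\n",
    "Nor\n",
    "Fowl\n",
);
print "$_\n" for join_records(@sample);
```

Each returned record is a single tab-separated line, so printing it with a trailing newline restores one entry per line.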
use strict;
use warnings;
# Slurp data.
#
{
    local $/ = undef;
    $_ = <DATA>;
}
# Construct regular expression to pull out each data
# item; use extended syntax and single-line matching
# (i.e. a . matches newline).
#
my $rxExtract = qr{(?xs)
    (.+?)           # capture one or more characters
                    # with non-greedy matching
    (?=             # zero-width look-ahead
        (?:         # alternation group, either
            \d+\t   # digits then a tab
            |       # or
            \z      # end of string
        )           # close alternation
    )               # close look-ahead
};
# Global match to pull data items out of string. Then
# replace all but the last newline in each data item
# with tabs. Print items out.
#
my @items = /$rxExtract/g;
s/\n(?=.)/\t/g for @items;
print @items;
__END__
1 John Doe, Joe Bloggs title journal Animal
Female
Protein
2 Mary Clary title magazine Fish
Nor
Fowl
3 Charley Farley, Piggy Malone title book The
Phantom
Raspberry
Blower
This produces
1 John Doe, Joe Bloggs title journal Animal Female Protein
2 Mary Clary title magazine Fish Nor Fowl
3 Charley Farley, Piggy Malone title book The Phantom Raspberry Blower
I hope this is of use.
Cheers,
JohnGG
$ perl -e '($s)="aaabbaabbb"=~/(a.*)b/;print "$s\n";'
aaabbaabb
$ perl -e '($s)="aaabbaabbb"=~/(a.*?)b/;print "$s\n";'
aaa
The next construct is the tricky bit. The (?= ... ) is called a zero-width positive look-ahead assertion; I think I've got that right. Basically, the regular expression engine keeps track of where it has reached in the string; the look-ahead says to the engine: staying where you are, look further along from this point to see if you can find whatever is inside the assertion. In our case we are looking for one of two things: one or more digits followed by a tab (the \d+\t) or the end of the string (the \z), in effect EOF. The (?: ... ) uses the '(' and ')' to group the alternatives ('|' is the regular expression "or") and the ?: switches off regular expression memory, i.e. it makes the group non-capturing, because we aren't interested in what the look-ahead finds, only that it has found it.
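To see why the look-ahead matters, here is a small sketch comparing it with an ordinary non-capturing group on made-up data (the string in $text and the array names are illustrative only, not from the original post). With the look-ahead, the "2\t" that ends the first item is left in place to begin the second; without it, the "2\t" is swallowed by the first match.

```perl
use strict;
use warnings;

# Made-up sample: two records, the first with one continuation line.
my $text = "1\tfoo\nbar\n2\tbaz\n";

# Look-ahead: the "2\t" that ends the first item is checked but not
# consumed, so it is still there to start the second item.
my @with_lookahead = $text =~ /(.+?)(?=\d+\t|\z)/sg;

# Plain non-capturing group instead: the "2\t" is consumed as part of
# the first match, so the second item loses its leading ID.
my @consumed = $text =~ /(.+?)(?:\d+\t|\z)/sg;

print scalar(@with_lookahead), " items, second starts with '",
    substr( $with_lookahead[1], 0, 1 ), "'\n";
print scalar(@consumed), " items, second starts with '",
    substr( $consumed[1], 0, 1 ), "'\n";
```

Both versions find two items, but only the look-ahead version keeps the record number at the front of the second one.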
The line
my @items = /$rxExtract/g;
does a couple of things. It uses our previously constructed regular expression and matches it against $_, which is the default behaviour. The thing to note is that the match is done globally with the / ... /g flag. Because of global, the expression keeps going along the string finding matches, and because we have used regular expression memory (the capturing parentheses) and are assigning to a list, everything it captures is assigned to the @items list, all in one fell swoop.
As an aside, if we had slurped the file into a lexical variable like this
my $string = <DATA>;
you can't rely on the default matching against $_ so you would do this
my @items = $string =~ /$rxExtract/g;
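A short sketch of the global match in list context, using made-up data rather than the file from the question (the string and variable names here are illustrative assumptions). It shows the explicit binding with =~, the default match against $_, and what comes back when the pattern has more than one capture group.

```perl
use strict;
use warnings;

my $string = "a=1;b=2;c=3";    # made-up sample data

# One capture group: list context returns every captured value.
my @values = $string =~ /=(\d+)/g;

# Two capture groups: the captures come back flattened, pair after pair,
# which happens to be the right shape to initialise a hash.
my %pairs = $string =~ /(\w+)=(\d+)/g;

# The same match against $_ needs no binding operator.
my @defaults;
for ($string) {
    @defaults = /=(\d+)/g;
}

print "@values\n";
print "$_=$pairs{$_}\n" for sort keys %pairs;
```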
We now have each data item in its own element in the list, but the items still contain the unwanted newlines that you wish to turn into tabs. We can again use a look-ahead assertion, this time in a substitution. We want to replace a newline only if it is followed by another character; it doesn't matter what character. We don't want to touch the last newline in the data item, as we want that in our modified data file, and that one will not be followed by anything else. The \n(?=.) says a newline followed by some single character, and because the look-ahead consumes no characters, leaving the pointer just after the newline, only the newline gets replaced. The
s/\n(?=.)/\t/g for @items;
iterates over @items aliasing each element in turn to $_ and then doing a global substitution of any newline in the middle of the data item with a tab.
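The aliasing is what makes the one-liner change @items in place. A minimal sketch on made-up items (the data and the $original/$copy names are assumptions for illustration) shows the effect, along with a variant that works on a copy when the original must be kept.

```perl
use strict;
use warnings;

# Made-up data items, each ending in the newline we want to keep.
my @items = (
    "1\tAnimal\nFemale\nProtein\n",
    "2\tFish\nNor\nFowl\n",
);

# Statement-modifier form, as in the post: $_ is an alias to each
# element, so the substitution edits the array elements themselves.
s/\n(?=.)/\t/g for @items;

print @items;

# Working on a copy instead leaves the original string untouched.
my $original = "x\ny\n";
( my $copy = $original ) =~ s/\n(?=.)/\t/g;
```

Because . does not match a newline unless the /s modifier is in effect, the final newline of each item, which has nothing after it, is never replaced.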
I hope this makes things clearer for you.
Cheers,
JohnGG
perlmonks.org content © perlmonks.org and Anonymous Monk, johngg, lima1