Special behavior for LF and CR in RegExs?
Argel
created: 2006-01-04 19:34:45
I'm curious as to why the following does not work. I gather it has something to do with some underlying magic of regular expressions (or perhaps in split)? It just seems odd that I have to do an m/$regex/m to split on LFs and CRs. I would have thought switching to using octal or hex codes would override some of that default behavior.
# Doesn't DWIM
$data =~ s/\012/\015/;
$data =~ s/\015+/\015/;
@records = split /\015/, $data;
My other question: is there a way to split without having to resort to the map+chomp afterwards (while leaving the rest of the data intact)?
# Works, but the map+chomp seems ugly
@records = map {chomp $_;  $_} split /^/xms, $data;
Note: I realize only the 'm' option is necessary. The 'xs' are as per Perl Best Practices.

Thank you oh great wise ones!!

-- Argel

Re: Special behavior for LF and CR in RegExs?
created: 2006-01-04 19:52:35
what about just this?
my @records = split($/, $data);
Update: Your original code will work if you add the /g modifier to the substitutions.
perl -le '$_="blah\r\nfoo\r\nstuff\r\n"; s/\012/\015/g; s/\015+/\015/g; print join ":", split(/\015/,$_)'
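To see why the missing /g matters, here is a minimal sketch that runs the substitutions both ways on the same sample string as the one-liner above. The variable names ($once, $all, @broken) are made up for illustration. Without /g, each s/// touches only its first match, so later LFs survive and end up inside the fields.

```perl
use strict;
use warnings;

my $data = "blah\r\nfoo\r\nstuff\r\n";

# Without /g: only the first LF is converted and only the first CR run collapsed.
(my $once = $data) =~ s/\012/\015/;
$once =~ s/\015+/\015/;
my @broken = split /\015/, $once;

# With /g: every LF is converted and every CR run is collapsed.
(my $all = $data) =~ s/\012/\015/g;
$all =~ s/\015+/\015/g;
my @records = split /\015/, $all;

print scalar(@broken), " fields without /g\n";   # 4 fields, two of them still containing LF
print join(":", @records), "\n";                 # blah:foo:stuff
```

Note that split drops trailing empty fields by default, which is why the final CR does not produce an extra empty record.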
Re: Special behavior for LF and CR in RegExs?
created: 2006-01-04 19:55:24

The “doesn’t DWIM” snip seems to be missing /g modifiers. Posting accident, or is that so in your code as well?

Anyway, if split /^/m works, it seems that split /\n/ also should. Does it not?

You can minimise that code quite a bit, btw, by simply saying chomp( @records = split /^/xms, $data );
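A minimal sketch of that shorter form, assuming the data uses plain "\n" line endings and $/ is left at its default (the sample string is made up). chomp accepts a list assignment, so it strips the trailing newline from every element in one go:

```perl
use strict;
use warnings;

my $data = "one\ntwo\nthree\n";

# split /^/ breaks the string at line starts and keeps each trailing newline;
# chomp on the list assignment then removes $/ from every element.
chomp( my @records = split /^/m, $data );

print join("|", @records), "\n";   # one|two|three
```

Unlike split /\n/, this keeps empty lines at the end of the data as (empty) records, since split only discards trailing fields that are empty before the chomp.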

Makeshifts last the longest.

Re^2: Special behavior for LF and CR in RegExs?
created: 2006-01-04 20:19:38
Good catch on the missing 'g'!! You are right, that did work.

I have seen splitting on \n work and have also seen it fail. I'm using a self-compiled perl 5.8.0 on Solaris 8, so perhaps there is a bug buried away in there?

Looks like davidrw's $/ suggestion also works. Given the above \n problem I think I will use that instead.

Thanks for all the help!!

-- Argel

Re^3: Special behavior for LF and CR in RegExs?
created: 2006-01-04 20:39:56

Well, $/ is the input record separator; generally, in strings and patterns, \n is magically mapped to that behind the scenes – even if it consists of multiple characters on the platform in question, such as CR/LF on DOS.

Basically, using \n will always work so long as the data you’re processing comes from the same platform that you’re running on. If not, you’ll need to convert end-of-line markers. There’s no way to avoid this.

So outside specific scenarios, you should use \n or $/ and let Perl handle the specifics. That will also yield the most portable scripts.

Makeshifts last the longest.

Re^4: Special behavior for LF and CR in RegExs? (Ah! No!!)
tye
created: 2006-01-05 00:31:55

\n is not magically mapped to $/ "behind the scenes". There are so many misconceptions combined in that sentence that I'm at a loss as to where to start.

My reaction is strong because these many misconceptions are common, I've railed against them several times, and I respect you enough to be truly shocked to hear this from you.

It's late and I'm very tired and yet also rather busy so I'll make the rude suggestion that you might want to super search for nodes by me regarding newline + $/ + \n + \r (not because I'm the only one who has anything useful to say on that subject, but because even using all of those terms, I suspect you'd otherwise get a lot to sort through while I'm sure I have several treatments of these all-too-common misconceptions under my name).

Update: Struck out one search term to yield a more interesting search. Though the meat of it is mostly collected in node 264785.

- tye

Re^5: Special behavior for LF and CR in RegExs?
created: 2006-01-05 07:50:40

Ah, I was missing the last paragraph from the linked node. There is some magic mapping – I remember that much because I started learning Perl on Windows; but it happens elsewhere than what I thought – which I never found out first-hand because I left Windows behind at a time when I was still a Perl greenhorn.

Thanks for the correction. (And yes, I know the issue is touchy. :-))

I was going to ask a question to be sure whether I understood correctly, then I remembered binmode… and now it’s all coming back. Maybe I wasn’t so green back then; maybe it has simply been too long.

At least, if my regained understanding is right, my advice is sound anyway: that in general, you want to use \n and not worry about it; and that if you have files from other platforms, you’ll have to convert anyway.
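For the "files from other platforms" case, here is a rough sketch of one way to do the conversion at split time. It assumes the data was read in raw form (for example after binmode) and may contain any mix of CRLF, lone CR, or lone LF; the sample string is made up. The octal escapes name the bytes explicitly, so the pattern does not depend on what \n or \r mean on the platform running the script.

```perl
use strict;
use warnings;

# Hypothetical sample: LF (Unix), CRLF (DOS) and lone CR (old Mac) endings mixed.
my $data = "unix\012dos\015\012oldmac\015last";

# Try CRLF first so it counts as a single separator, then a lone CR or LF.
my @records = split /\015\012|\015|\012/, $data;

print join("|", @records), "\n";   # unix|dos|oldmac|last
```

The alternation is spelled out rather than using \R because \R only exists from perl 5.10 on, and the thread mentions 5.8.0.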

Makeshifts last the longest.

perlmonks.org content © perlmonks.org and Argel, Aristotle, davidrw, tye
