Circulating through each section of a regex
Win
created: 2006-02-03 07:12:04
Dear Monks,

I have a long regex (longer than shown) and I want to take each component and do something with it. The code below illustrates the point but I am aware that it is wrong.


 if ($_ =~ /(.{10})\t  #1 
	 ([\d|\.]+)\t  #2
	 ([\d|\.]+)\t  #3
	 ([\d|\.]+)\t  #4
         ([\d|\.]+)\t  #5
	 ([\d|\.]+)\t  #6
	 ([\d|\.]+)\t  #7
	 ([\d|\.]+)\t  #8
	 ([\d|\.]+)\t  #9
		 /x) {
	for (5..9){  print $$_; }
	    
	}

Please could somebody point me in the right direction.
Re: Circulating through each section of a regex
created: 2006-02-03 07:33:09

Using a temp variable for clarity:

my $string = $_;
if (my @match = $string =~ /(.{10})\t  #1
     ([\d|\.]+)\t  #2
     ([\d|\.]+)\t  #3
     ([\d|\.]+)\t  #4
     ([\d|\.]+)\t  #5
     ([\d|\.]+)\t  #6
     ([\d|\.]+)\t  #7
     ([\d|\.]+)\t  #8
     ([\d|\.]+)\t  #9
         /x) {
    # @match is 0-based, so capture N lives in $match[N - 1]
    for (5..9) { print $match[$_ - 1]; }
}

All dogma is stupid.
Re^2: Circulating through each section of a regex
Win
created: 2006-02-03 07:41:23
I don't think that is right. I want to print each element of the regex. Sorry I'm finding it difficult to explain what I mean.
Re^3: Circulating through each section of a regex
created: 2006-02-03 07:50:21

That's exactly what my code will do (although it'll only print the 5th to 9th matched elements). If that's not what you want, give some sample input and expected output.


Re^3: Circulating through each section of a regex
created: 2006-02-03 08:40:40

"each element of the regex" is an expression devoid of sense, or at least it does not make any sense in the context of your question. But your example code is enough to show what you really want, and tirwhan showed you code that does just that, although it is not strictly necessary to use a temporary variable. What makes you think it's wrong? Did you at least try it?

Re^4: Circulating through each section of a regex
Win
created: 2006-02-03 08:51:41
I am trying it now. I can see the logic of the code now after a bit more thought.
Re: Circulating through each section of a regex
created: 2006-02-03 08:33:32

You're using symrefs, don't! Well, do not make a habit of it at least. This is an error under

use strict; WHICH YOU SHOULD BE USING ANYWAY!!

Use the return value of the match instead!

Re^2: Circulating through each section of a regex
Win
created: 2006-02-03 09:01:20
Sorry I don't understand the point you are making here.
Re^3: Circulating through each section of a regex
created: 2006-02-03 11:24:29

Ok, I'll explain it bit by bit:

I wrote:

You're using symrefs, don't! Well, do not make a habit of it at least. This is an error under

Well, in the code you gave as an example of what you wanted you had a line like this:

for (5..9){  print $$_; }

What you're doing there is accessing a numbered variable by prepending a dollar sign to its name, which happens to be contained in $_. This is called a symbolic reference, or symref for short. Now, there are situations in which symrefs are useful or even necessary, but in general they're error-prone and risky. This is why the strict pragma exists: among other things, it prohibits them. To do so it is enough to insert in your code a

use strict 'refs';

line, which will make every attempt to use a symref a fatal run-time error for the rest of the enclosing lexical scope. Actually this restriction is also activated if you just write:

use strict;

along with other restrictions aimed at reducing the risk of making trivial mistakes.
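To see the difference for yourself, here is a minimal sketch. The variable names ($answer, $name) are made up for the illustration. It dereferences the same string twice, once with the restriction switched off in a small block and once with it in force, and catches the resulting error with eval:

```perl
use strict;
use warnings;

our $answer = 42;        # package variable, reachable by its name
my  $name   = 'answer';  # plain string holding that name

# Without strict refs, the string is silently used as a variable name.
my $loose = do { no strict 'refs'; ${$name} };

# With strict refs in force, the same dereference dies when it is executed.
my $strict = eval { ${$name} };
my $error  = $@;

print "loose:  $loose\n";
print "strict: ", (defined $strict ? $strict : 'undef'), "\n";
print "error:  $error";
```

The error message names the "strict refs" restriction, which makes this kind of mistake easy to track down.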

use strict; WHICH YOU SHOULD BE USING ANYWAY!!

This is exactly what I meant and still mean. Just about every program, with the partial exception of the most trivial ones such as one-liners, should have

use strict;
use warnings;

at the top. Do yours have them? I suspect they do not, but I may be wrong...

Use the return value of the match instead!

This means exactly: use a technique like the one in tirwhan's solution. That is, a match in list context returns the list of captured substrings, the same values that end up in the numbered variables $1, $2 and so on. So, instead of accessing those variables by whatever means, in this case it is convenient to gather them that way. You said that it took you some more thought but that you eventually understood the logic of the suggested code: if you really did, then my comment should be clear to you as well. Is it?
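In case a concrete run helps, here is a small sketch of the list-context match. The input line is invented for the example (a 10-character label followed by tab-terminated numbers), and the pattern is shortened to six captures:

```perl
use strict;
use warnings;

# Hypothetical record: a 10-character label, then tab-terminated numbers.
my $line = join("\t", 'ABCDEFGHIJ', 1.5, 2, 3.25, 4, 5.5) . "\t";

my @fields;
if (my @match = $line =~ /(.{10})\t ([\d.]+)\t ([\d.]+)\t ([\d.]+)\t ([\d.]+)\t ([\d.]+)\t/x) {
    # Capture N is stored at index N - 1.
    @fields = @match[2 .. 4];   # captures 3, 4 and 5
}
print "$_\n" for @fields;
```

Because the array is 0-based, an array slice is a handy way to pick out a run of captures without touching $1, $2 and friends at all.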

Re: Circulating through each section of a regex
created: 2006-02-03 08:49:34
I prefer the approach of tirwhan myself, capturing the matches into an array, but sometimes you can't use it, namely if you're using the /g modifier too. In that case you just get a flat list of all submatches for all matches.
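A quick sketch of that flattening, on a made-up string with two capture groups. The while loop at the end is one common way around it (an alternative I am adding here, not something from the original question): in scalar context /g advances one match at a time, so $1 and $2 belong to a single match.

```perl
use strict;
use warnings;

my $text = "a=1 b=2 c=3";

# Two capture groups, three matches: /g returns one flat list of six items.
my @flat = $text =~ /(\w)=(\d)/g;
print scalar(@flat), " items: @flat\n";

# Matching in scalar context inside a loop keeps each match's groups apart.
my @pairs;
while ($text =~ /(\w)=(\d)/g) {
    push @pairs, [$1, $2];
}
print "$_->[0] -> $_->[1]\n" for @pairs;
```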

In such a case, it's still possible to go deeper into the nuts and bolts and use the @- and @+ arrays (documented in perlvar) together with substr. You do need direct access to the scalar you matched on.

Something like this:

for my $i (5 .. 9) {
    print substr($_, $-[$i], $+[$i]-$-[$i]);
}
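Here is the same idea as a self-contained sketch, with an invented input string and a named variable instead of $_. After a successful match, $-[$i] is the offset where capture $i starts and $+[$i] is the offset just past its end:

```perl
use strict;
use warnings;

my $line = "2006-02-03 07:12:04";   # sample input, made up for the example

my @parts;
if ($line =~ /(\d+)-(\d+)-(\d+)/) {
    for my $i (1 .. 3) {
        # start offset, then length = end offset minus start offset
        push @parts, substr($line, $-[$i], $+[$i] - $-[$i]);
    }
}
print "$_\n" for @parts;
```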

BTW I think your [\d|\.]+ is an error, you probably don't want to match "|" characters.

Re^2: Circulating through each section of a regex
Win
created: 2006-02-03 09:00:30
I want to match any number (a string of digits) that may also contain a decimal point. How do I write that as a regex? Is it [\d|\.]+ ?
Re^3: Circulating through each section of a regex
created: 2006-02-03 09:06:37
No, like I said, that matches "|" too. Simply drop it: ([\d.]+). That'll match any sequence of digits and dots.
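A short sketch to show the difference, using a made-up sample string that contains a "|":

```perl
use strict;
use warnings;

my $sample = "3.14|2.71";

my ($with_pipe)    = $sample =~ /([\d|\.]+)/;   # class includes a literal "|"
my ($without_pipe) = $sample =~ /([\d.]+)/;     # digits and dots only

print "with pipe:    $with_pipe\n";      # 3.14|2.71
print "without pipe: $without_pipe\n";   # 3.14
```

Inside a character class "|" is not alternation, it is just another character to match, and the dot needs no backslash there either.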

If you want to be a bit more restrictive, for example allowing only one dot, and on the other hand be more relaxed in your number format, for example accepting minus signs, you can take a look at Regexp::Common, in particular the Regexp::Common::number submodule.
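If you would rather not install a module, a hand-written pattern can get you part of the way. The sketch below is my own guess at a reasonable format (optional minus sign, digits, at most one dot followed by more digits); it is not the pattern Regexp::Common uses and it handles far fewer cases:

```perl
use strict;
use warnings;

# Hypothetical hand-rolled pattern, not the one Regexp::Common provides:
# optional minus sign, digits, then at most one fractional part.
my $number = qr/^-?\d+(?:\.\d+)?$/;

my %is_number;
$is_number{$_} = ($_ =~ $number) ? 1 : 0 for qw(42 -3.5 1.2.3 12. .5);

print "$_: $is_number{$_}\n" for sort keys %is_number;
```

Whether "12." or ".5" should count as numbers depends on your data, so adjust the pattern (or switch to the module) accordingly.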

perlmonks.org content © perlmonks.org and bart, blazar, tirwhan, Win
