substring extraction
bfdi533
created: 2006-01-03 16:56:57

I have a file with a really long string in it (it is actually XML, but for some reason it is stored on one line). I need to search the file for a substring and print the "word" that contains it. This "word" might be a URL, a description, etc. For coding and extraction purposes, a "word" is delimited by whitespace, so I need to back up to the beginning of the "word" and print through to its end.

Here is the code I have so far. As you can see, it uses a fixed substring size, and I need it to be dynamic:

while (<>) {
        my $istr = lc($_);
        my $offset = index($istr,"cesi");
        print $offset."\n";
        if ($offset > -1) {
                my $str = substr($istr, $offset-20, 100);
                print $str."\n";
        }
}

Thanks in advance for any input.

Re: substring extraction
created: 2006-01-03 17:08:10
pretty easy with a regex....
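my $string = 'cesi';
# $istr is the line being searched, as in the loop from the question
while ($istr =~ /(\S*\Q$string\E\S*)/gi) {   # \Q..\E: match $string literally
  print "$1\n";
}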
Not tested, but it should work. The i modifier makes the match case-insensitive, and the g modifier matches repeatedly, so the loop catches all occurrences. Lowercasing the string ahead of time may help the speed, especially if you want the output to be lowercase (though you probably don't if you have things like URLs).
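
For anyone who wants to try this standalone, here is a minimal sketch of the same idea wrapped in a sub. The name words_containing and the sample line are made up for illustration and are not from the posts above; \Q...\E is added on the assumption that the search term should be treated as literal text rather than as a pattern.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return every whitespace-delimited
# "word" in $text that contains $needle, ignoring case.
sub words_containing {
    my ($text, $needle) = @_;
    my @found;
    # \Q...\E escapes regex metacharacters so $needle is matched literally.
    while ($text =~ /(\S*\Q$needle\E\S*)/gi) {
        push @found, $1;
    }
    return @found;
}

# Made-up sample data standing in for the one-line XML file.
my $line = '<a href="http://example.com/CESI/index.html"> the Cesi project </a>';
print "$_\n" for words_containing($line, 'cesi');
```

Because the original text is searched with /i instead of being lowercased first, the matched words keep their original case, which matters for things like URLs.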

If you want to know the location of the word in the source string, the special arrays @- and @+ should come in handy.
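
As a rough sketch of how @- and @+ could be used here: after a successful match, $-[1] is the offset where capture group 1 starts and $+[1] is the offset just past its end. The sub name word_positions and the sample line are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: return [word, start, end] for each matching word.
# Offsets are 0-based; end is the offset just past the last character.
sub word_positions {
    my ($text, $needle) = @_;
    my @hits;
    while ($text =~ /(\S*\Q$needle\E\S*)/gi) {
        # $-[1] and $+[1] are the start and end offsets of capture group 1.
        push @hits, [ $1, $-[1], $+[1] ];
    }
    return @hits;
}

# Made-up sample line.
my $line = 'see http://example.com/cesi now';
for my $hit (word_positions($line, 'cesi')) {
    my ($word, $start, $end) = @$hit;
    printf "%s at %d..%d\n", $word, $start, $end;
}
```

With the offsets in hand, substr($line, $start, $end - $start) gives back the same word, so the fixed-size substr call from the question is no longer needed.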

                - Ant

Re^2: substring extraction
created: 2006-01-05 10:54:17

Perfect; that was the missing piece. I knew I most likely had to use a regex, but that is admittedly a weak point for me. This does just what I am looking for.

Thanks for the help and the rapid reply.

Re: substring extraction
created: 2006-01-03 17:11:26

Your description doesn't completely tally with your code. If the file contains a single long string, then your while loop will only iterate once. However, to print out all whitespace-delimited words that either match or contain a given search term, you could use:

my $string = 'this is a really long string (no really, it is!) that contains a whitespace delimited word';
print "$1\n" while $string =~ m[(\b\S*limit\S*\b)]gi; ## All matching words, case insensitive.
## prints: delimited
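
One thing worth noting about the \b anchors in that pattern: they trim leading and trailing punctuation off the match, whereas a plain \S* pattern returns the whole whitespace-delimited token. A small sketch of the difference, using a made-up sample string:

```perl
use strict;
use warnings;

# Made-up sample text; the search term 'limit' follows the post above.
my $text = 'a (delimited), word';

# In list context with /g, a match returns every capture it finds.
my @plain   = $text =~ /(\S*limit\S*)/gi;      # whole whitespace-delimited token
my @bounded = $text =~ /(\b\S*limit\S*\b)/gi;  # token trimmed to word boundaries

print "plain:   @plain\n";
print "bounded: @bounded\n";
```

Which one is preferable depends on the data: for URLs or XML attribute values, the untrimmed token is probably what you want.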


perlmonks.org content © perlmonks.org and bfdi533, BrowserUk, suaveant
