Checking Lines in Text File
Anonymous Monk
created: 2006-06-01 14:52:28
Hi Monks,
What would be the best way to open a text file, check each line for duplication, and delete any duplicated line from the file? I have been trying to find information on this but have found nothing specific.
Thanks for helping!
Re: Checking Lines in Text File
created: 2006-06-01 15:02:07

For small files, read the file a line at a time and add the lines to a hash.

When you have read a line first check that it's not in the hash already. If not, print it to the output file and insert it into the hash, otherwise skip to the next line.

If you can't figure that out, write some code that is your best guess at how to do it, ask again and include your code.
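
For reference, a minimal sketch of the hash approach described above. The sub name copy_unique_lines and the two command-line file arguments are placeholders chosen for illustration; they are not from the original question.

```perl
use strict;
use warnings;

# Copy lines from $in to $out, skipping any line that was already written.
sub copy_unique_lines {
    my ($in, $out) = @_;
    my %seen;                            # keys are the lines written so far
    while (my $line = <$in>) {
        next if exists $seen{$line};     # duplicate: skip to the next line
        $seen{$line} = 1;                # remember it ...
        print {$out} $line;              # ... and write the first occurrence
    }
    return;
}

# Hypothetical usage: perl dedupe.pl infile outfile
if (@ARGV == 2) {
    my ($in_name, $out_name) = @ARGV;
    open my $in,  '<', $in_name  or die "Can't read $in_name: $!";
    open my $out, '>', $out_name or die "Can't write $out_name: $!";
    copy_unique_lines($in, $out);
    close $out or die "Can't close $out_name: $!";
}
```

The hash holds every distinct line in memory, which is why this suits small files. Writing to a second file and renaming it afterwards is safer than editing the original in place.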


DWIM is Perl's answer to Gödel
Re^2: Checking Lines in Text File
created: 2006-06-01 16:47:51
Grandfather, I have a further question related to your suggestion. If I have hundreds of lines of data like this:

tag1:xxxxxxx tag2:xxxxxx tag3:xxxxxxx tag4:yyyyy

How can I remove all lines that have the same tags 1 through 3 and replace them with a single line that has a new tag4? Currently I am able to remove all excess lines with the same tags 1 through 3 using your method, but I am unable to change tag4 because the hash method works by not writing subsequent values. Hence, once I find out I have a duplicate, it is too late to change it, as the first line has already been written. Any suggestions? Thanks!
Re^3: Checking Lines in Text File
created: 2006-06-01 19:13:29

Provide half a dozen lines of sample data, the test code you are currently using, and a sample of the output you expect to see.

For the test code it is easiest to use a __DATA__ section for the test data rather than an external file and simply print the result rather than generating an external file.


DWIM is Perl's answer to Gödel
Re^4: Checking Lines in Text File
created: 2006-06-01 20:27:58
Thanks GrandFather!

My data looks like this:
MCAT: 0xf30cbe01 PCAT: 0xcda2b409 LMAT: 0x00100000 TYPE: KA0
MCAT: 0xcc3fbed1 PCAT: 0x000fb109 LMAT: 0x00000800 TYPE: KA1
MCAT: 0xeeccbe01 PCAT: 0xcda2b409 LMAT: 0x00100000 TYPE: KA1
MCAT: 0xf30cbe91 PCAT: 0xafaddd09 LMAT: 0x00040000 TYPE: KA0
MCAT: 0xeeecbe01 PCAT: 0xcda2b409 LMAT: 0x00100000 TYPE: KA0
MCAT: 0xcc000331 PCAT: 0x000fb109 LMAT: 0x00000800 TYPE: KA1
MCAT: 0xe554be01 PCAT: 0xcda2b409 LMAT: 0x00100000 TYPE: KA1
MCAT: 0xf30cbe91 PCAT: 0xafaddd09 LMAT: 0x00040000 TYPE: KA1
So every set of data (MCAT, PCAT, LMAT) is unique except for the fourth and eighth, which differ only by TYPE.

I would like to roll the fourth and eighth together so it would look like this:

MCAT: 0xf30cbe01 PCAT: 0xcda2b409 LMAT: 0x00100000 TYPE: KA0
MCAT: 0xcc3fbed1 PCAT: 0x000fb109 LMAT: 0x00000800 TYPE: KA1
MCAT: 0xeeccbe01 PCAT: 0xcda2b409 LMAT: 0x00100000 TYPE: KA1
MCAT: 0xf30cbe91 PCAT: 0xafaddd09 LMAT: 0x00040000 TYPE: KA0, KA1
MCAT: 0xeeecbe01 PCAT: 0xcda2b409 LMAT: 0x00100000 TYPE: KA0
MCAT: 0xcc000331 PCAT: 0x000fb109 LMAT: 0x00000800 TYPE: KA1
MCAT: 0xe554be01 PCAT: 0xcda2b409 LMAT: 0x00100000 TYPE: KA1

This leaves me with a total of 7 data groups instead of 8.

So far my code doesn't even come close to accomplishing this.
Re^5: Checking Lines in Text File
created: 2006-06-01 20:38:27

The following is OK for reasonably sized files, but may bog down when things get huge because every distinct line is held in memory.

use strict;
use warnings;
use Data::Dump::Streamer;

my %firstLines;
my @lines;

while (<DATA>) {
    chomp;
    my ($data, $type) = /(.*)\s+TYPE:\s+(\w+)$/;
    
    next if ! defined $type; # ignore malformed line
    if (exists $firstLines{$data}) {
        $lines[$firstLines{$data}] .= ", $type";
    } else {
        $firstLines{$data} = @lines;
        push @lines, $_;
    }
}

print join "\n", @lines;

__DATA__
MCAT: 0xf30cbe01 PCAT: 0xcda2b409 LMAT: 0x00100000 TYPE: KA0
MCAT: 0xcc3fbed1 PCAT: 0x000fb109 LMAT: 0x00000800 TYPE: KA1
MCAT: 0xeeccbe01 PCAT: 0xcda2b409 LMAT: 0x00100000 TYPE: KA1
MCAT: 0xf30cbe91 PCAT: 0xafaddd09 LMAT: 0x00040000 TYPE: KA0
MCAT: 0xeeecbe01 PCAT: 0xcda2b409 LMAT: 0x00100000 TYPE: KA0
MCAT: 0xcc000331 PCAT: 0x000fb109 LMAT: 0x00000800 TYPE: KA1
MCAT: 0xe554be01 PCAT: 0xcda2b409 LMAT: 0x00100000 TYPE: KA1
MCAT: 0xf30cbe91 PCAT: 0xafaddd09 LMAT: 0x00040000 TYPE: KA1

Prints:

MCAT: 0xf30cbe01 PCAT: 0xcda2b409 LMAT: 0x00100000 TYPE: KA0
MCAT: 0xcc3fbed1 PCAT: 0x000fb109 LMAT: 0x00000800 TYPE: KA1
MCAT: 0xeeccbe01 PCAT: 0xcda2b409 LMAT: 0x00100000 TYPE: KA1
MCAT: 0xf30cbe91 PCAT: 0xafaddd09 LMAT: 0x00040000 TYPE: KA0, KA1
MCAT: 0xeeecbe01 PCAT: 0xcda2b409 LMAT: 0x00100000 TYPE: KA0
MCAT: 0xcc000331 PCAT: 0x000fb109 LMAT: 0x00000800 TYPE: KA1
MCAT: 0xe554be01 PCAT: 0xcda2b409 LMAT: 0x00100000 TYPE: KA1

DWIM is Perl's answer to Gödel
Re^6: Checking Lines in Text File
created: 2006-06-02 12:52:49
Grandfather,

Thank you very much!!! I really appreciate you solving that problem for me. I'm very new to Perl and have to admit that your code took a while to make sense to me. I didn't realize that you could access an array by the data value; I thought you had to access it by location (0,1,2,3... etc). Thanks again for your help!
Re^7: Checking Lines in Text File
created: 2006-06-02 19:38:22

Just in case there is some confusion or misunderstanding of some of the Perl tricks used, I'd better go through some of that code and elaborate on what's happening. Note that I've taken interesting lines in processing order rather than the order they are coded.

$firstLines{$data} = @lines;

this is a little tricksy. It creates a new entry in %firstLines that contains the index to the new line as the value and is keyed by the unique part of the line contents. @lines in scalar context returns the number of elements in the array.

if (exists $firstLines{$data})

checks to see if we've already seen a specific line.

$lines[$firstLines{$data}] .= ", $type";

builds the multiple entries for duplicated lines. Note that $firstLines{$data} returns the index number that was stored earlier.
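
To make the index trick concrete, here is a small stand-alone sketch. The key 'key' and the sample strings are invented for illustration. Note that the array is still accessed by position; the hash only maps the data to that position.

```perl
use strict;
use warnings;

my @lines = ('first', 'second');
my %firstLines;

# In scalar context @lines is the element count (2 here), which is
# exactly the index the next pushed element will receive.
$firstLines{key} = @lines;
push @lines, 'third: KA0';

# The stored index finds the same line again so it can be extended.
$lines[ $firstLines{key} ] .= ', KA1';

print "$firstLines{key}\n";    # prints 2
print "$lines[2]\n";           # prints third: KA0, KA1
```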


DWIM is Perl's answer to Gödel
Re^8: Checking Lines in Text File
created: 2006-06-05 19:11:52
Thanks much for the added commentary! Now I think I actually understand what you did.
Re: Checking Lines in Text File
created: 2006-06-01 15:05:22
You can also do this on the command line:
  sort -u infile > outfile
# or (you might want some of uniq's extra options):
  sort infile | uniq > outfile
Re: Checking Lines in Text File
created: 2006-06-01 15:44:21
It would help me to know what you are trying to do. The solutions offered thus far -- "sort -u" and using a hash -- both assume that you want to eliminate a line if it has a duplicate anywhere in the file. If all you want to do is eliminate successive repeated lines, something like this might be better:
my $last = $_ = <>;
print;
while (<>)
        {
        print if ($_ ne $last);
        $last = $_;
        }
Re^2: Checking Lines in Text File
created: 2006-06-01 16:01:06
Successive repeated lines can also be eliminated at the (unixy) command line with uniq infile > outfile, so long as you don't run the data through sort first.

Also note that, of the solutions provided thus far, the hash-based option is the only one which will both eliminate all duplicates (printing only the first appearance of each line) and also preserve the original order of the (remaining) lines, which may or may not be significant to you.
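
The difference between the three approaches can be seen on one small input. This is only an illustrative sketch; the sample data and variable names are made up, and the second and third blocks imitate what "sort -u" and "uniq" do rather than calling those tools.

```perl
use strict;
use warnings;

my @input = qw(b a b b c a);

# Hash-based: drops every repeat and keeps first-appearance order.
my %seen;
my @hash_unique = grep { !$seen{$_}++ } @input;

# Like "sort -u": drops every repeat, but the output is sorted.
my %seen_sorted;
my @sorted_unique = grep { !$seen_sorted{$_}++ } sort @input;

# Like "uniq" on unsorted data: drops only adjacent repeats.
my @adjacent_unique;
for my $item (@input) {
    push @adjacent_unique, $item
        unless @adjacent_unique && $adjacent_unique[-1] eq $item;
}

print "@hash_unique\n";        # b a c
print "@sorted_unique\n";      # a b c
print "@adjacent_unique\n";    # b a b c a
```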

perlmonks.org content © perlmonks.org and Anonymous Monk, davidrw, esper, GrandFather, ibeneedinghelp