How to Remove Junk Characters
Rajeshk
created: 2006-01-05 04:46:09

Hi Monks,
I have a problem downloading HTML files with LWP::UserAgent.
There are some junk characters in the downloaded files.
Is there any way to download the file without the junk?
Note: I am on Windows. Download this page to see the junk characters: 'http://www.whitecase.com/attorneys/detail.aspx?attorney=1148'
Here is my attempt:

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->proxy(['http'] => 'http://00.00.0.00:0000');
my $url = 'http://www.whitecase.com/attorneys/detail.aspx?attorney=1148';

# Create a request and attach the proxy credentials
my $req = HTTP::Request->new('GET' => $url);
$req->proxy_authorization_basic("xxxxx", "xxxxx");

# Send the request and save the body on success
my $res = $ua->request($req);
if ($res->is_success) {
	my $file_cnt = $res->content;
	print $file_cnt;
	open WOUT, ">out.html" or die "Can't open File: out.html: $!";
	print WOUT $file_cnt;
	close WOUT;
}
else {
	print "Download Error: ", $res->status_line, "\n";
}


Thanks & Regards,
Rajesh.K


Re: How to Remove Junk Characters
created: 2006-01-05 05:18:05

Could you show us what it looks like?

Post a sample to give us some idea.

Update:
Try this

my $file_cnt = $res->content;
$file_cnt =~ s/\r//g;
Re^2: How to Remove Junk Characters
created: 2006-01-05 07:20:50

Hi wfsp,

I tried your code. $file_cnt =~ s/\r//g;
It's not working.


Thanks,
Rajesh.K

Re: How to Remove Junk Characters
created: 2006-01-05 07:31:29
I took out your $ua->proxy line and your code runs fine. The out.html has no corruption. I'm on Linux using Mozilla.

I'm not really a human, but I play one on earth. flash japh
Re^2: How to Remove Junk Characters
created: 2006-01-05 08:51:01
He may mean this
....Dr. Börries.....
It should, of course, be "Börries"
I tried
binmode(STDOUT, ':utf8');
with no success. Any idea what's happening?

wfsp

Re^3: How to Remove Junk Characters
created: 2006-01-05 12:58:41
This works:
use Unicode::String qw(utf8);
#....
print utf8($file_cnt);


holli, /regexed monk/
Re: How to Remove Junk Characters
created: 2006-01-05 08:33:10

I am not sure what you mean by "junk" characters. Could you post an example of what you mean?

I took out the proxy lines and ran the code; the file downloaded without any errors. However, I think you are referring to accented characters such as ö in the source. Use HTML::Entities if you want to encode them into the proper &ouml; format.

But please post an example so we can be sure of what you want.

~abseed
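
The HTML::Entities suggestion above could look roughly like the sketch below. It assumes the text has already been decoded into Perl characters; the sample string and the variable names are illustrative, not taken from the original script.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# A decoded (character) string; \x{F6} is the o-umlaut in "Börries".
my $name = "Dr. B\x{F6}rries Ahrens";

# Treat every non-ASCII character as "unsafe" and replace it with an entity.
# ASCII markup characters such as < and & are left untouched by this class.
my $encoded = encode_entities($name, '^\x00-\x7F');

print "$encoded\n";
```

The result is plain ASCII, so it displays the same way whatever encoding the viewer assumes.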
Re^2: How to Remove Junk Characters
created: 2006-01-06 00:47:58

Hi Monks,

Here are some sample junk characters:

Downloaded file -- Original page
===========================================

1. jury trial.  For his -- jury trial.  For his
2. Börries Ahrens -- Börries Ahrens
3. Aldejohann’s main -- Mr. Aldejohann’s
4. University of Münster -- University of Münster
5. the €625 million senior and €130 -- €625 million senior and €130
6. acquisition of a properties’ -- acquisition of a properties’
7. Westfield College – University -- Westfield College – University
8. Teléfonos -- Teléfonos
9. (Celumóvil S -- (Celumóvil S
10. Dr. jur., 1990, with a dissertation on “Die Unabhängigkeit des genossenschaftlichen Prüfungsverbandes” (“The Independence of the Cooperative Inspection Association”) -- Dr. jur., 1990, with a dissertation on "Die Unabhängigkeit des genossenschaftlichen Prüfungsverbandes" ("The Independence of the Cooperative Inspection Association")

Thanks,
Rajesh.K
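
Junk of this kind typically appears when UTF-8 bytes are displayed as if they were a single-byte Windows encoding. The sketch below reproduces the effect for one name from the list. It assumes the page is served as UTF-8 and the viewer assumes cp1252; neither is confirmed in this thread, and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The text as it appears on the original page ("Börries").
my $original = "B\x{F6}rries";

# What travels over the wire under the UTF-8 assumption:
# the o-umlaut becomes the two bytes 0xC3 0xB6.
my $bytes = encode('UTF-8', $original);

# What a viewer shows if it assumes cp1252: each byte becomes one character.
my $misread = decode('cp1252', $bytes);

# Decoding with the matching charset recovers the original text.
my $fixed = decode('UTF-8', $bytes);

binmode STDOUT, ':encoding(UTF-8)';
print "misread: $misread\n";
print "fixed:   $fixed\n";
```

Under those assumptions the single accented letter turns into two unrelated characters, which matches the general pattern of one special character becoming two or three junk ones.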

Re^3: How to Remove Junk Characters
created: 2006-01-06 14:56:18
Change
my $file_cnt = $res->content;
to
my $file_cnt = $res->decoded_content;

See the HTTP::Message documentation on CPAN for an explanation of the difference.

Many thanks to the search artist kwapping for finding it and to tye for explaining it :-)
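
To see the difference between the two methods without a network or proxy, here is a minimal sketch that builds an HTTP::Response by hand. It assumes the server declares charset=UTF-8 in its Content-Type header; the body text and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTTP::Response;

# Hand-built response standing in for what $ua->request($req) would return.
# Assumption: the body is UTF-8 and the header says so.
my $body = encode('UTF-8', "Dr. B\x{F6}rries Ahrens");
my $res  = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    $body,
);

my $raw     = $res->content;          # raw bytes, exactly as received
my $decoded = $res->decoded_content;  # Perl characters, charset applied

printf "content:         %d bytes\n",      length $raw;
printf "decoded_content: %d characters\n", length $decoded;
```

The raw content is one unit longer than the decoded text because the accented letter occupies two bytes in UTF-8. When writing the decoded text to a file, opening the handle with an explicit layer such as '>:encoding(UTF-8)' makes the output encoding unambiguous; which encoding to pick depends on what the program reading the file expects.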

perlmonks.org content © perlmonks.org and abcde, holli, Rajeshk, wfsp, zentara
