Hi Monks,
I have a problem downloading HTML files with LWP::UserAgent.
The downloaded files contain junk characters.
Is there any way to download the file without the junk?
Note: I am on Windows. Download this web page to see the junk characters:
'http://www.whitecase.com/attorneys/detail.aspx?attorney=1148'
Here is my attempt:
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new;
$ua->proxy(['http'] => 'http://00.00.0.00:0000');
my $url = 'http://www.whitecase.com/attorneys/detail.aspx?attorney=1148';

# Create a request
my $req = HTTP::Request->new('GET' => $url);
$req->proxy_authorization_basic("xxxxx", "xxxxx");

my $res = $ua->request($req);
if ($res->is_success) {
    # Print the page and save a copy to out.html
    my $file_cnt = $res->content;
    print $file_cnt;
    open my $wout, '>', 'out.html' or die "Can't open file out.html: $!";
    print {$wout} $file_cnt;
    close $wout;
}
else {
    print "Download Error\n";
}
Thanks & Regards,
Rajesh.K
Could you show us what it looks like?
Post a sample to give us some idea.
Update:
Try this
my $file_cnt = $res->content;
$file_cnt =~ s/\r//g;
It should, of course, be "Börries"....Dr. Börries.....
I tried
binmode(STDOUT, ':utf8');
with no success. Any idea what's happening?
wfsp
I am not sure what you mean by "junk" characters. Could you post an example?
I took out the proxy lines and ran the code; the file downloaded without any errors. However, I think you are referring to accented characters such as ö in the source. Use HTML::Entities if you want to encode them as named entities such as &ouml;.
But please post an example so we can be sure of what you want.
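For what it's worth, here is a minimal sketch of the HTML::Entities suggestion. The sample string is made up for illustration, and encode_entities is called with its default set of "unsafe" characters, which I believe covers accented letters like ö.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Illustrative sample only; \x{F6} is the character "o with diaeresis".
my $name = "Dr. B\x{F6}rries Ahrens";

# With no second argument, encode_entities replaces control characters,
# high-bit characters and <, &, >, ' and " with entities.
my $encoded = encode_entities($name);

print "$encoded\n";    # Dr. B&ouml;rries Ahrens
```

Note that this only helps if the string really holds the character ö. If it holds undecoded bytes, each byte gets turned into its own entity and the junk is still there.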
Hi Monks,
Here are some sample junk characters:
Downloaded file -- Original page
===========================================
1. jury trial. For his -- jury trial. For his
2. Börries Ahrens -- Börries Ahrens
3. Aldejohannâs main -- Mr. Aldejohanns
4. University of Münster -- University of Münster
5. the â¬625 million senior and â¬130 -- 625 million senior and 130
6. acquisition of a propertiesâ -- acquisition of a properties
7. Westfield College â University -- Westfield College University
8. Teléfonos -- Teléfonos
9.(Celumóvil S -- (Celumóvil S
10. Dr. jur., 1990, with a dissertation on âDie Unabhängigkeit des genossenschaftlichen Prüfungsverbandesâ (âThe Independence of the Cooperative Inspection Associationâ) ---
Dr. jur., 1990, with a dissertation on "Die Unabhängigkeit des genossenschaftlichen Prüfungsverbandes" ("The Independence of the Cooperative Inspection Association")
Thanks,
Rajesh.K
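Junk of this shape (an "â" in front of quotes and the euro sign, two characters where one accented letter should be) usually means UTF-8 bytes are being displayed as if they were a single-byte encoding. A rough sketch of that effect follows; it assumes the page is served as UTF-8 and the viewer reads it as Windows-1252, which I have not verified for this site. The helper name misread_as_cp1252 is made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: encode a character string as UTF-8 (what the server
# would send), then wrongly decode those bytes as Windows-1252.
sub misread_as_cp1252 {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);
    return decode('cp1252', $bytes);
}

my $city = misread_as_cp1252("M\x{FC}nster");    # u-umlaut becomes two characters
my $euro = misread_as_cp1252("\x{20AC}625");     # euro sign becomes three characters

binmode STDOUT, ':encoding(UTF-8)';
print "$city\n";    # "M", A-tilde, one-quarter sign, "nster"
print "$euro\n";    # a-circumflex, low quote, not-sign, "625"
```

If the samples above match this pattern, the bytes in the file are probably fine and only the interpretation is off.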
Change
my $file_cnt = $res->content;
to
my $file_cnt = $res->decoded_content;
See the HTTP::Message documentation on CPAN for an explanation of the difference.
Many thanks to the search artist kwapping for finding it and to tye for explaining it :-)
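To make the difference concrete, here is a small sketch that builds a response by hand, so it runs without the proxy or the network. The body and the charset=UTF-8 header are assumptions for illustration, not what the real server sends. It also writes the decoded text through an explicit UTF-8 output layer, since printing decoded characters to a plain filehandle tends to produce "Wide character" warnings or more junk.

```perl
use strict;
use warnings;
use HTTP::Response;
use Encode qw(encode);

# Hand-built response standing in for the real download (assumed UTF-8).
my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    encode('UTF-8', "<p>Dr. B\x{F6}rries</p>"),
);

my $raw     = $res->content;            # undecoded bytes as received
my $decoded = $res->decoded_content;    # characters, decoded per the charset

printf "content:         %d bytes\n", length $raw;        # 19
printf "decoded_content: %d characters\n", length $decoded;    # 18

# Write the characters back out as UTF-8.
open my $out, '>:encoding(UTF-8)', 'out.html'
    or die "Can't open out.html: $!";
print {$out} $decoded;
close $out or die "Can't close out.html: $!";
```

In the original script the same two changes would apply: use decoded_content instead of content, and open out.html with an encoding layer.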
perlmonks.org content © perlmonks.org and abcde, holli, Rajeshk, wfsp, zentara