If I print this string to a filehandle with :utf8 layer set (via binmode or at open), will the utf8 be garbled somehow, since the string is not marked/known utf8? Or will everything be OK, since the utf8 string is in byte mode anyway?
I have read perlunicode, perluniintro, utf8, encoding, Encode and other docs and this is one thing I cannot figure out.
This is important because LWP does not seem to set the utf8 flag for HTML strings, even when utf-8 is sent in the appropriate HTTP header. And I am doing some scraping on these pages and printing some of the text back out on a filehandle where I set the layer to ':utf8' and I'm hoping this will Just Work.
If not, I need to add some code that applies HTML::Encoding to the HTTP response, and even to the HTML itself, to sniff out the encoding and, if one is found, call Encode::decode() on the content.
For example:
use strict;
use warnings;
use LWP::Simple;
use Encode;
#I have confirmed the correct HTTP header sent here
my $html = get('http://-redacted-/file.utf8.html');
print "utf8: " . utf8::is_utf8($html) . "\n";
#Output is utf8:
my $html2 = decode('UTF-8' => $html);
print "utf8: " . utf8::is_utf8($html2) . "\n";
#Output is utf8: 1
Yes, the output will be garbled: Perl treats the contents of the string as ISO-8859-1 (Latin-1), so the :utf8 layer will "helpfully" convert every byte to UTF-8 on output, encoding your already-encoded data a second time.
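To make the double encoding concrete, here is a minimal sketch that uses only core modules. It writes the same two bytes through a :utf8 layer twice, once undecoded and once decoded, and captures what actually lands on the filehandle. The helper name bytes_written and the in-memory filehandle are my own assumptions for illustration; they are not part of the original script.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two bytes 0xC3 0xA9 are the UTF-8 encoding of "e with acute" (U+00E9).
my $raw = "\xC3\xA9";

# Hypothetical helper: print a string through a :utf8 layer into an
# in-memory buffer and return the bytes that were actually written.
sub bytes_written {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>:utf8', \$buffer or die "Cannot open in-memory file: $!";
    print {$fh} $string;
    close $fh or die "Cannot close in-memory file: $!";
    return $buffer;
}

# Undecoded: each byte is treated as a Latin-1 character and encoded again.
my $garbled = bytes_written($raw);

# Decoded first: the single character U+00E9 is encoded exactly once.
my $correct = bytes_written(decode('UTF-8', $raw));

printf "undecoded: %d bytes written\n", length $garbled;   # 4 bytes
printf "decoded:   %d bytes written\n", length $correct;   # 2 bytes
```

The undecoded string comes out as C3 83 C2 A9, which is what shows up as "Ã©" in a UTF-8 viewer; the decoded one comes out as the original C3 A9.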
You could just set the UTF-8 flag on the string and leave the bytes as they are. One way is to use the private function _utf8_on() in Encode (well, it's not exactly private, but you're advised to use it very sparingly). Another way is to use pack, like this:
$perl_utf8 = pack 'U0a*', $raw_utf8;
I'd recommend checking afterwards that the UTF-8 is in a "consistent state", for example with utf8::valid().
p.s. I just came across this function in the docs for utf8:
- utf8::decode($string)
- Attempts to convert in-place the octet sequence in UTF-X to the corresponding character sequence. The UTF-8 flag is turned on only if the source string contains multiple-byte UTF-X characters. If $string is invalid as UTF-X, returns false; otherwise returns true.
I haven't tried it, but it sounds like something you could use.
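For what it's worth, here is a small sketch of how utf8::decode behaves on three kinds of input, going by the documentation quoted above. The sample strings are made up for illustration.

```perl
use strict;
use warnings;

my $ascii   = "plain";         # ASCII only, no multi-byte sequences
my $encoded = "caf\xC3\xA9";   # "cafe" with e-acute, as UTF-8 bytes
my $invalid = "caf\xE9";       # Latin-1 byte 0xE9, not well-formed UTF-8

# utf8::decode works in place and returns true on success, false otherwise.
my $ok_ascii   = utf8::decode($ascii);
my $ok_encoded = utf8::decode($encoded);
my $ok_invalid = utf8::decode($invalid);

for my $case (
    [ 'ascii',   $ascii,   $ok_ascii   ],
    [ 'encoded', $encoded, $ok_encoded ],
    [ 'invalid', $invalid, $ok_invalid ],
) {
    my ($name, $string, $ok) = @$case;
    printf "%-8s decoded=%s flag=%s length=%d\n",
        $name,
        ($ok ? 'yes' : 'no'),
        (utf8::is_utf8($string) ? 'on' : 'off'),
        length $string;
}
```

The return value is the useful part: when it is false, the string is left as it was, so you can fall back to treating the data as some other encoding instead of forcing the flag on.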
utf8::decode I had not considered -- I thought maybe utf8::upgrade, but now it looks like that is only for actual Latin-1 strings.
What I think I'll end up doing is use HTML::Encoding to properly sniff out the encoding of the various docs I pull off the Web with LWP, since I shouldn't be making assumptions about their encoding anyway. Then I'll use Encode::decode to decode them (to a Perl utf8 string) based on whatever encoding HTML::Encoding reports.
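A rough sketch of that plan, using only the core Encode module. Note that this is a simplified stand-in, not HTML::Encoding's API: it only looks at a Content-Type header value, and the helper names charset_from_content_type and decode_body, the ISO-8859-1 fallback, and the sample inputs are all assumptions of mine.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical helper: pull the charset parameter out of a Content-Type
# header value such as 'text/html; charset=utf-8'. Returns undef if absent.
sub charset_from_content_type {
    my ($content_type) = @_;
    return unless defined $content_type;
    return $1 if $content_type =~ /;\s*charset\s*=\s*"?([^\s;"]+)/i;
    return;
}

# Hypothetical helper: decode raw bytes into a Perl character string,
# using the declared charset when Encode recognizes it, else the fallback.
sub decode_body {
    my ($bytes, $content_type, $fallback) = @_;
    my $charset = charset_from_content_type($content_type);
    $charset = $fallback unless defined $charset && find_encoding($charset);
    return decode($charset, $bytes);
}

# Declared as UTF-8: the two bytes C3 A9 become one character.
my $text = decode_body("caf\xC3\xA9", 'text/html; charset=utf-8', 'ISO-8859-1');

# No charset declared: fall back to Latin-1 (an assumption, not a rule).
my $latin = decode_body("caf\xE9", 'text/html', 'ISO-8859-1');

printf "utf-8 input:   %d characters\n", length $text;
printf "latin-1 input: %d characters\n", length $latin;
```

Once the content is decoded like this, printing it to a filehandle with the :utf8 layer encodes it exactly once. A real page may also declare its encoding in a meta element or a byte order mark, which is the part a sniffing module is meant to handle.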
Tough going, this utf8 business.
perlmonks.org content © perlmonks.org and bart, ryantate