utf-8 keys in a tied hash cause warning
saintmike
created: 2006-08-03 17:10:11
I'm puzzled that hashes tied via dbmopen apparently don't like utf-8 encoded keys:
    my $utf8key = "\x{05D0}";

    dbmopen(my %hash, "/tmp/mydb", 0666) || die "d'oh!";
    $hash{$utf8key} = "bar";
    dbmclose(%hash);
prints
    Wide character in null operation at ./test.pl line 8.
As checked with Encode::is_utf8, the string in $utf8key has the utf8 flag on.

Is this a bug in the dbm implementation or am I just confused?

It happens with perl5.8.8 and perl5.9.3. Thanks for any help.

Re: utf-8 keys in a tied hash cause warning
created: 2006-08-03 18:04:40
Here's what use diagnostics has to say about that:

(W utf8) Perl met a wide character (>255) when it wasn't expecting one. This warning is by default on for I/O (like print). The easiest way to quiet this warning is simply to add the :utf8 layer to the output, e.g. binmode STDOUT, ':utf8'. Another way to turn off the warning is to add no warnings 'utf8'; but that is often closer to cheating. In general, you are supposed to explicitly mark the filehandle with an encoding, see open and perlfunc/binmode.

Hope that helps!

---
It's all fine and dandy until someone has to look at the code.
Re: utf-8 keys in a tied hash cause warning
created: 2006-08-03 18:22:19

It's not the fault of tied hashes.

    use Tie::Hash qw( );
    our @ISA = 'Tie::StdHash';

    sub STORE {
       my ($self, $key, $val) = @_;
       print($key eq "\x{05D0}" ? "utf" : "not utf", "\n");
       return $self->SUPER::STORE($key, $val);
    }

    my %h;
    tie %h, __PACKAGE__;
    $h{"\x{05D0}"} = 1;   # Prints 'utf' in 5.8.6

dbm probably doesn't support Unicode keys. The workaround is to encode your strings of chars into strings of bytes before using them as keys. UTF-8 is probably the best-suited encoding.

Re: utf-8 keys in a tied hash cause warning
created: 2006-08-03 22:37:03
Following up on the earlier replies, if I supplement the OP code like so:
    use Encode;

    my $utf8key = "\x{05D0}";
    my $usable_key = encode( 'utf8', $utf8key );

    dbmopen(my %hash, "/tmp/mydb", 0666) || die "d'oh!";
    $hash{$usable_key} = "bar";
    dbmclose(%hash);
I don't get the warning message. I also noticed some differences in the content of the resulting dbm file -- the OP version had null bytes where the 'encoded' version had non-null bytes, suggesting that the warning issued by the OP version reflects an actual failure to store the data.

Having to encode the hash keys like this is certainly a PITA (a minor one, but still). Perhaps the maintainer(s) of the various *DBM_File modules can be persuaded to update them so as to handle this properly -- easy enough to do, I'd expect.

Re^2: utf-8 keys in a tied hash cause warning
created: 2006-08-04 00:32:51
I don't get the warning message

It's not a warning. It's a fatal error. It was added to newer versions of Perl.

I also noticed some differences in the content of the resulting dbm file

Not here. 5.8.0 with Encode, 5.8.0 without Encode and 5.8.8 with Encode output:

0000: 02 00 FE 03 FB 03 00 00 00 00 00 00 00 00 00 00
0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
...
03D0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
03E0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
03F0: 00 00 00 00 00 00 00 00 00 00 00 62 61 72 D7 90

Since strings of chars are stored internally as UTF-8, the resulting file is identical.
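
For anyone wondering where the trailing D7 90 in the dump comes from, here is a minimal sketch (not part of the original test; the variable names are just for illustration) comparing the character string with its UTF-8 encoded byte string:

```perl
use strict;
use warnings;
use Encode qw(encode is_utf8);

# One character, U+05D0 (HEBREW LETTER ALEF), as in the OP's key.
my $chars = "\x{05D0}";

# The same key as a string of bytes, suitable for a byte-oriented store.
my $bytes = encode('UTF-8', $chars);

printf "chars: length %d, utf8 flag %s\n",
    length($chars), is_utf8($chars) ? "on" : "off";
printf "bytes: length %d, utf8 flag %s, hex %s\n",
    length($bytes), is_utf8($bytes) ? "on" : "off", uc(unpack('H*', $bytes));
```

The encoded key is two bytes long, D7 and 90, with the utf8 flag off, which matches the last two bytes of the dump above.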

Re^2: utf-8 keys in a tied hash cause warning
created: 2006-08-04 05:56:22
dbmopen is obsolete.
Re^3: utf-8 keys in a tied hash cause warning
created: 2006-08-04 12:16:59
It doesn't matter if you use dbmopen or tie in this scenario, the problem seems to lie in the storage engine(s) used by these functions.
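
If encoding every key by hand gets tedious, the DBM filter hooks described in perldbmfilter can do it at the tie level. What follows is only a sketch under a few assumptions: it uses SDBM_File, a scratch directory from File::Temp instead of /tmp/mydb, and filters on keys only (values containing wide characters would need filter_store_value and filter_fetch_value as well).

```perl
use strict;
use warnings;
use Fcntl;                          # exports O_RDWR and O_CREAT
use SDBM_File;
use Encode qw(encode decode);
use File::Temp qw(tempdir);

# Scratch location for this example only (an assumption, not the OP's path).
my $dir = tempdir(CLEANUP => 1);

my $db = tie my %hash, 'SDBM_File', "$dir/mydb", O_RDWR | O_CREAT, 0666
    or die "Cannot tie SDBM file: $!";

# Keys going to the DBM become UTF-8 bytes; keys coming back become characters.
$db->filter_store_key(sub { $_ = encode('UTF-8', $_) });
$db->filter_fetch_key(sub { $_ = decode('UTF-8', $_) });

my $key = "\x{05D0}";
$hash{$key} = "bar";                # the store filter encodes the key first

my ($stored_key) = keys %hash;      # the fetch filter decodes it again
my $value        = $hash{$key};

print "round trip ok\n" if $stored_key eq $key && $value eq 'bar';

undef $db;                          # drop the object reference before untie
untie %hash;
```

The rest of the program keeps working with character strings, and only the filters know that the file holds bytes.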

perlmonks.org content © perlmonks.org and Anonymous Monk, graff, ikegami, kwaping, saintmike
