Win32 API directory searches that return wide / unicode filenames
dallen16
created: 2006-01-28 09:54:25
I had a Win32 Perl coding challenge to search a directory(WinXP, NTFS) that may contain files with unicode / wide filenames in addition to ANSI filenames. Through trial, error, and searches against Perl mailing list archives, it seems apparent that Win32 Perl's builtin directory functions do not support return Win32 unicode / wide filenames. More specifically, the builtin functions return filenames like "??????_HostID_2006-01-19_213218.xls" when the filename contains unicode / wide characters (same behavior as the DOS "dir" command). The problem is that you can't pass these filenames to something like stat or via OLE to ask Excel to open it. After much trial and error, I found a solution using Win32::API, the FindFirstFileW, FindNextFileW, and FindClose Win32 API calls -- with an "unpack" workaround suggested by $Bill Luebkert.
use strict;
use Win32::API;
use Unicode::String;
use English;

$OUTPUT_AUTOFLUSH=1;

$Win32::API::DEBUG = 0;

binmode(STDOUT, ":utf8"); 

use constant ERROR_NO_MORE_FILES  => 18;  
use constant INVALID_HANDLE_VALUE => -1;  

Win32::API::Struct-> typedef('FILETIME', qw(
  DWORD dwLowDateTime;
  DWORD dwHighDateTime;
));				# 8 bytes

use constant FILE_ATTRIBUTE_READONLY =>  0x00000001;
use constant FILE_ATTRIBUTE_HIDDEN =>  0x00000002;
use constant FILE_ATTRIBUTE_SYSTEM =>  0x00000004;
use constant FILE_ATTRIBUTE_DIRECTORY =>  0x00000010;
use constant FILE_ATTRIBUTE_ARCHIVE =>  0x00000020;
use constant FILE_ATTRIBUTE_NORMAL =>  0x00000080;
use constant FILE_ATTRIBUTE_TEMPORARY =>  0x00000100;
use constant FILE_ATTRIBUTE_COMPRESSED =>  0x00000800;
use constant MAX_PATH =>  260;

Win32::API::Struct-> typedef('WIN32_FIND_DATAW', qw(
  DWORD dwFileAttributes;
  FILETIME ftCreationTime;
  FILETIME ftLastAccessTime;
  FILETIME ftLastWriteTime;
  DWORD nFileSizeHigh;
  DWORD nFileSizeLow;
  DWORD dwReserved0;
  DWORD dwReserved1;
  WCHAR cFileName[520];
  WCHAR cAlternateFileName[28];
));		# 4 + 8 x 3 + 4 x 4 + 520 + 28 = 592 bytes 

# Note:  Win32::API::Struct->Typedef should seemingly 
# allocate 2 bytes per WCHAR, the sizeof('WCHAR'), but it 
# actually allocates only 1 byte -- same as TCHAR

my $FindFirstFile = Win32::API->new('kernel32.dll', 'FindFirstFileW', 'PS', 'N') or  die "FindFirstFile: $^E";
my $FindNextFile  = Win32::API->new('kernel32.dll', 'FindNextFileW', 'NS', 'I')  or die "FindNextFile $^E";  
my $FindClose     = Win32::API->new('kernel32.dll', 'FindClose', 'N', 'I')   or die "FileClose $^E";  

#  set your own value here... 
my $FileSpec = "//?/C:/My Documents/Tool/*.xls";

my $FileInfo = Win32::API::Struct-> new('WIN32_FIND_DATAW');

my $uFileSpec = Unicode::String->new;
$uFileSpec->utf8($FileSpec);
print "FileSpec = ", $uFileSpec->as_string, "\n";

my $handle = $FindFirstFile-> Call($uFileSpec->utf16le, $FileInfo);

if ($handle == INVALID_HANDLE_VALUE) {
	printf "Error is %d - %s\n", Win32::GetLastError (),
	  Win32::FormatMessage (Win32::GetLastError ());
	exit(1);
} else {
	print "FindFirstFile worked\n";
	
	my $count = 1;
	
	my $ufn = Unicode::String->new;
	my $ualtfn = Unicode::String->new;

# to get the filename in unicode UTF-16LE format, must
# unpack the $FileInfo 'buffer' hash element because
# Win32 API doesn't recognize the cFileName hash
# element as an UTF-16LE format character string
# It treats it as an ASCII Z / null terminated string

	my ($cFileName, $cAlternateFileName) = unpack( "x44A520A28", $FileInfo->{buffer} );
		
	# if length is odd, pad filenames with null byte as utf16le char string byte length must be even
	# occurs when last UTF-16LE char of filename is ANSI (e.g., "S") and unpack A template strips off following null byte
	# e.g., "S\0"
	
	if (length($cFileName) & 1) {  
		$cFileName .= "\x00";
	}
	if (length($cAlternateFileName) & 1) { 
		$cAlternateFileName .= "\x00";
	}
	
	$ufn->utf16le($cFileName);
	$ualtfn->utf16le($cAlternateFileName);
	
	print "($count) filename = ", $ufn->as_string, "\n";
	print "\talt = ", $ualtfn->as_string, "\n";
	
	while (my $result = $FindNextFile->Call($handle,$FileInfo)) {
		
		$count++;
		
		($cFileName, $cAlternateFileName) = unpack( "x44A520A28", $FileInfo->{buffer} );
				
		if (length($cFileName) & 1) {  
			$cFileName .= "\x00";
		}
		if (length($cAlternateFileName) & 1) {
			$cAlternateFileName .= "\x00";
		}
		
		$ufn->utf16le($cFileName);
		$ualtfn->utf16le($cAlternateFileName);
	
		print "($count) filename = ", $ufn->as_string, "\n";
		print "\talt = ", $ualtfn->as_string, "\n";
	}
}

$FindClose->Call($handle) or die "FindClose $^E";

exit(0);
Re: Win32 API directory searches that return wide / unicode filenames
created: 2006-01-28 10:50:03
I was under the impression that you could make perl use the wide system calls by supplying the -C switch. However, perlrun.pod says:
(In Perls earlier than 5.8.1 the -C switch was a Win32-only switch that enabled the use of Unicode-aware "wide system call" Win32 APIs. This feature was practically unused, however, and the command line switch was therefore "recycled".)
I actually had been using that switch in the past. Pity.
Re: Win32 API directory searches that return wide / unicode filenames
created: 2006-03-03 12:01:29
Last December, I brought this problem up on p5p:

http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2005-12/msg00259.html

No one had any good solution for it.

Ideally, perl/win32.* will be updated by someone (or by me) to support the Wide functionality. It's rather a huge shortcoming, when standard tools like File::Find and opendir() won't work at all. It's hard to write cross platform code that way when Win32 _Perl_ is fundamentally broken wrt Unicode filenames.

Re^2: Win32 API directory searches that return wide / unicode filenames
created: 2006-03-03 12:35:49
deally, perl/win32.* will be updated by someone (or by me) to support the Wide functionality.

All the code to support this option is still there from when the -C command line option enabled teh use of teh wide apis on win32.

# Win32.c:775-782
    /* do the FindFirstFile call */
    if (USING_WIDE()) {
	A2WHELPER(scanname, wbuffer, sizeof(wbuffer));
	fh = FindFirstFileW(PerlDir_mapW(wbuffer), &wFindData);
    }
    else {
	fh = FindFirstFileA(PerlDir_mapA(scanname), &aFindData);
    }

# Win32.c:858-870
	    if (USING_WIDE()) {
		res = FindNextFileW(dirp->handle, &wFindData);
		if (res) {
		    W2AHELPER(wFindData.cFileName, buffer, sizeof(buffer));
		    ptr = buffer;
		}
	    }
	    else {
		res = FindNextFileA(dirp->handle, &aFindData);
		if (res)
		    ptr = aFindData.cFileName;
	    }

All that is needed is the addition of code to enable this at runtime in perl.c/S_parse_body() where it presumaly used to live. I can't see any signs of it there now.

Of course, much harder might be persuading the powers that be to allocate a new switch letter for the purpose. Maybe additional values could be added to the new interpretation of the -C switch. Say 'W' and 128 for Wide calls. Ie. perl -CW script.pl would do what -C used to do?


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re^3: Win32 API directory searches that return wide / unicode filenames
created: 2006-03-04 14:26:33
FYI - all the if (USING_WIDE()) calls were removed by a patch from Jan back in November in bleadperl.

So that's really not the route to go.

I'm not entirely sure why the A functions are the default instead of W.

Re^4: Win32 API directory searches that return wide / unicode filenames
created: 2006-03-04 14:48:06

That seems a shame, but if Jan is Jan Duboius, he probably had good reasons?


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Win32 API directory searches that return wide / unicode filenames
created: 2006-03-31 22:46:37
How do I use this code in conjunction with MP3::Info to get the tags of mp3 files with unicode filenames?

perlmonks.org content © perlmonks.org and Anonymous Monk, BrowserUk, dallen16, dsully, perlslicker, rhesa

prlmnks.org © 2006 edmund von der burg (eccles & toad)

v 0.03