Skåne Sjælland Linux User Group - http://www.sslug.dk
 

Re: [PERL] UTF-8 check



Anders Sønderberg Mortensen <sslug@sslug> writes:

> But it would still be interesting if there were something simpler
> that doesn't depend on other libraries.

It should be fairly simple to read RFC 2279 and implement something
yourself:

   The table below summarizes the format of these different octet
   types.  The letter x indicates bits available for encoding bits of
   the UCS-4 character value.

   UCS-4 range (hex.)           UTF-8 octet sequence (binary)
   0000 0000-0000 007F   0xxxxxxx
   0000 0080-0000 07FF   110xxxxx 10xxxxxx
   0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx

   0001 0000-001F FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
   0020 0000-03FF FFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
   0400 0000-7FFF FFFF   1111110x 10xxxxxx ... 10xxxxxx
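
To make the table concrete, here is a minimal sketch of the encoding direction. The helper name encode_ucs4 is hypothetical (it is not part of the original post or of any library); it simply picks the row of the table that covers the code point and fills in the x bits from the right.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): encode one UCS-4
# code point into the octet sequence given by the RFC 2279 table.
sub encode_ucs4 {
    my ($cp) = @_;
    return chr $cp if $cp < 0x80;            # 0xxxxxxx

    # Number of octets needed for each range in the table.
    my $len = $cp < 0x800     ? 2
            : $cp < 0x10000   ? 3
            : $cp < 0x200000  ? 4
            : $cp < 0x4000000 ? 5
            :                   6;

    my @octets;
    # Trailing octets, filled from the right: 10xxxxxx, six bits each.
    for (2 .. $len) {
        unshift @octets, 0x80 | ($cp & 0x3F);
        $cp >>= 6;
    }
    # Lead octet: $len one-bits, a zero bit, then the remaining value bits.
    my $lead_mask = (0xFF << (8 - $len)) & 0xFF;
    unshift @octets, $lead_mask | $cp;

    return pack "C*", @octets;
}

# U+00E6 (the letter ae) falls in the second row of the table.
print join(" ", map { sprintf "%02X", $_ } unpack "C*", encode_ucs4(0xE6)), "\n";
# prints "C3 A6"
```

A validator only has to check the reverse: that every lead octet is followed by the right number of 10xxxxxx octets.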

So something along the lines of:

sub validp {
    my @bytes = unpack "C*", $_[0];  # octet values of the input string
    while (@bytes) {
        if ($bytes[0] < 128) {       # 0xxxxxxx
            splice @bytes, 0, 1;
        } elsif ($bytes[0] < 192) {  # 10xxxxxx cannot start a sequence
            return 0;
        } elsif ($bytes[0] < 224) {  # 110xxxxx
            return 0 if @bytes < 2;  # sequence is truncated
            return 0 unless ($bytes[1] & 0xC0) == 0x80;
            splice @bytes, 0, 2;
        } elsif ($bytes[0] < 240) {  # 1110xxxx
            return 0 if @bytes < 3;
            return 0 unless ($bytes[1] & 0xC0) == 0x80;
            return 0 unless ($bytes[2] & 0xC0) == 0x80;
            splice @bytes, 0, 3;
        } elsif ($bytes[0] < 248) {  # 11110xxx
            return 0 if @bytes < 4;
            return 0 unless ($bytes[1] & 0xC0) == 0x80;
            return 0 unless ($bytes[2] & 0xC0) == 0x80;
            return 0 unless ($bytes[3] & 0xC0) == 0x80;
            splice @bytes, 0, 4;
        } elsif ($bytes[0] < 252) {  # 111110xx
            return 0 if @bytes < 5;
            return 0 unless ($bytes[1] & 0xC0) == 0x80;
            return 0 unless ($bytes[2] & 0xC0) == 0x80;
            return 0 unless ($bytes[3] & 0xC0) == 0x80;
            return 0 unless ($bytes[4] & 0xC0) == 0x80;
            splice @bytes, 0, 5;
        } elsif ($bytes[0] < 254) {  # 1111110x
            return 0 if @bytes < 6;
            return 0 unless ($bytes[1] & 0xC0) == 0x80;
            return 0 unless ($bytes[2] & 0xC0) == 0x80;
            return 0 unless ($bytes[3] & 0xC0) == 0x80;
            return 0 unless ($bytes[4] & 0xC0) == 0x80;
            return 0 unless ($bytes[5] & 0xC0) == 0x80;
            splice @bytes, 0, 6;
        } else { # 11111110 or 11111111 is invalid
            return 0;
        }
    }
    return 1;
}


It can surely be made shorter, but the above should work and
outlines the general idea of how it is done.
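
As one possible shorter form, the same structural check can be sketched as a single regular expression with one alternative per row of the RFC 2279 table. The name valid_utf8_re is hypothetical, and the sketch assumes the argument is a byte string rather than an already decoded character string. Like the loop version, it only checks the octet pattern: overlong forms such as C0 80 are still accepted, and it follows RFC 2279's six-octet limit rather than the four-octet limit of the later RFC 3629.

```perl
use strict;
use warnings;

# Hypothetical shorter variant: one alternative per row of the
# RFC 2279 table. Assumes $octets is a byte string.
sub valid_utf8_re {
    my ($octets) = @_;
    return $octets =~ /\A(?:
          [\x00-\x7F]                  # 0xxxxxxx
        | [\xC0-\xDF][\x80-\xBF]       # 110xxxxx + 1 continuation octet
        | [\xE0-\xEF][\x80-\xBF]{2}    # 1110xxxx + 2
        | [\xF0-\xF7][\x80-\xBF]{3}    # 11110xxx + 3
        | [\xF8-\xFB][\x80-\xBF]{4}    # 111110xx + 4
        | [\xFC-\xFD][\x80-\xBF]{5}    # 1111110x + 5
    )*\z/x ? 1 : 0;
}

print valid_utf8_re("\xC3\xA6") ? "valid\n" : "invalid\n";   # complete two-octet sequence
print valid_utf8_re("\xC3")     ? "valid\n" : "invalid\n";   # truncated sequence
```

The first call reports the sequence as valid and the second as invalid, since the lead octet C3 is missing its continuation octet.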

-- 
 Peter Makholm     |        We constantly have to keep in mind why natural
 sslug@sslug |    languages are good at what they're good at. And to
 http://hacking.dk |     never forget that Perl is a human language first,
                   |                        and a computer language second


 

 
 
Questions about the web-pages to <www_admin>. Last modified 2005-08-10, 19:55 CEST
This page is maintained by MHonArc