In preparation for a presentation I’m giving at this month’s Syracuse PHP Users Group meeting, I needed to read Unicode characters in PHP one at a time. Unicode is still a second-class citizen in PHP; PHP 6 failed, and we have to fall back on extensions like mbstring and/or libraries like Portable UTF-8. Even with those, I didn’t see a Unicode-capable fgetc(), so I wrote my own.
Years ago, I wrote a post describing how to read Unicode characters in C, so the logic was already familiar. As a refresher, UTF-8 is a multi-byte encoding scheme that represents every Unicode code point using 4 bytes or fewer. The 21 payload bits of the 4-byte form could address about 2 million values, but Unicode caps the range at U+10FFFF, which leaves a little over 1.1 million code points. The first 128 characters are encoded the same as 7-bit ASCII, with 0 as the most-significant bit. All other characters are encoded using multiple bytes, each with 1 as the most-significant bit: the first byte starts with 110, 1110, or 11110, and every following (continuation) byte starts with 10. The bit pattern in the first byte of a multi-byte sequence tells us how many bytes are needed to represent the character.
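To make the lead-byte rule concrete, here is a minimal sketch that maps a first byte to the length of its sequence. The function name utf8_sequence_length() is made up for illustration and is not part of the function shown in this post; it only inspects the lead byte and does not validate the continuation bytes.

```php
<?php
// Hypothetical helper (an assumption for illustration, not from the post):
// given the integer value of a lead byte, return how many bytes the
// UTF-8 sequence occupies, or 0 if the byte cannot start a sequence.
function utf8_sequence_length(int $byte): int
{
    if (($byte & 0x80) === 0x00) { return 1; } // 0xxxxxxx (ASCII)
    if (($byte & 0xE0) === 0xC0) { return 2; } // 110xxxxx
    if (($byte & 0xF0) === 0xE0) { return 3; } // 1110xxxx
    if (($byte & 0xF8) === 0xF0) { return 4; } // 11110xxx
    return 0; // continuation byte (10xxxxxx) or invalid lead byte
}

// Byte escapes are used so the result does not depend on the file's encoding.
echo utf8_sequence_length(ord("A")), "\n";                // 1
echo utf8_sequence_length(ord("\xC3\xA9")), "\n";         // 2 (é)
echo utf8_sequence_length(ord("\xE2\x82\xAC")), "\n";     // 3 (€)
echo utf8_sequence_length(ord("\xF0\x9F\x98\x80")), "\n"; // 4 (😀)
```

Note that ord() only looks at the first byte of the string it is given, which is exactly the byte this check needs.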
Here’s what the function looks like:
    function ufgetc($fp)
    {
        // mask values for first byte's bit patterns
        static $mask = [
            192, // 110xxxxx
            224, // 1110xxxx
            240  // 11110xxx
        ];

        // read first byte
        $ch = fgetc($fp);
        if ($ch === false) {
            // return false on EOF
            return false;
        }

        // single-byte character
        if ((ord($ch) & $mask[0]) != $mask[0]) {
            return $ch;
        }

        // multi-byte character: each mask the first byte matches
        // means one more byte to read
        $buf = $ch;
        for ($i = 0; $i < count($mask); $i++) {
            if ((ord($ch) & $mask[$i]) != $mask[$i]) {
                break;
            }
            $buf .= fgetc($fp);
        }

        return $buf;
    }

PHP’s fgetc() reads in 8 bits at a time just like its counterpart in C, but each byte comes back as a single-character string in PHP’s type system, so we need ord() to get the byte’s integer value for the mask check to succeed.
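The string-versus-integer point is easy to see in isolation. The sketch below is a standalone illustration, separate from the function itself; it assumes an in-memory php://memory stream so no file on disk is needed, and the variable names are arbitrary.

```php
<?php
// Write a two-byte UTF-8 character ("é", bytes 0xC3 0xA9) to an
// in-memory stream, then rewind so it can be read back.
$fp = fopen('php://memory', 'w+');
fwrite($fp, "\xC3\xA9");
rewind($fp);

$first = fgetc($fp);  // a one-byte string, not an integer
$value = ord($first); // 195 (0xC3), the byte's integer value

// Both top bits set means this byte starts a multi-byte sequence.
$isLead = ($value & 0xC0) === 0xC0;

var_dump(is_string($first), strlen($first), $value, $isLead);
fclose($fp);
```

A plain fgetc() call stops after the first byte here, leaving 0xA9 unread in the stream; picking up that second byte is the job of the loop in the function above.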
Hi Timothy, thanks for sharing the code, saved me a lot of time!
Regards, Lars