Thursday, August 6, 2015

A Unicode fgetc() in PHP

In preparation for a presentation I'm giving at this month's Syracuse PHP Users Group meeting, I needed to read Unicode characters in PHP one at a time. Unicode is still second-class in PHP; PHP 6 failed, and we have to fall back to extensions like mbstring and/or libraries like Portable UTF-8. Even with those, I didn't see a Unicode-capable fgetc(), so I wrote my own.

Years ago, I wrote a post describing how to read Unicode characters in C, so the logic was already familiar. As a refresher, UTF-8 is a variable-width encoding that represents each character using four bytes or fewer. The four-byte form carries 21 payload bits, enough for over 2 million values, although Unicode caps the range at U+10FFFF (about 1.1 million code points). The first 128 characters are encoded the same as 7-bit ASCII, with 0 as the most-significant bit. All other characters are encoded using multiple bytes, each with 1 as the most-significant bit. The bit pattern in the first byte of a multi-byte sequence (110xxxxx, 1110xxxx, or 11110xxx) tells us how many bytes are needed to represent the character, and every following byte starts with 10.
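
To make the bit patterns concrete, here is a minimal sketch that maps a first byte to its sequence length. The helper name utf8_sequence_length() is my own invention for illustration and is not used by the function in this post; the sample strings are written as byte escapes so the example does not depend on the file's own encoding.

```php
<?php
// Hypothetical helper (illustration only): given the integer value of the
// first byte of a UTF-8 sequence, report how many bytes the sequence uses.
function utf8_sequence_length(int $lead): int
{
    if (($lead & 0x80) === 0x00) {
        return 1; // 0xxxxxxx: plain ASCII
    }
    if (($lead & 0xE0) === 0xC0) {
        return 2; // 110xxxxx
    }
    if (($lead & 0xF0) === 0xE0) {
        return 3; // 1110xxxx
    }
    if (($lead & 0xF8) === 0xF0) {
        return 4; // 11110xxx
    }
    return 0; // 10xxxxxx continuation byte, or not a valid first byte
}

$samples = [
    "A",                // U+0041
    "\xC3\xA9",         // U+00E9, e with acute accent
    "\xE2\x82\xAC",     // U+20AC, euro sign
    "\xF0\x9F\x98\x80", // U+1F600, an emoji
];

foreach ($samples as $char) {
    // ord() looks only at the first byte of the string
    printf("0x%02X starts a %d-byte sequence\n", ord($char), utf8_sequence_length(ord($char)));
}
```

Each check masks off the payload bits and compares what remains against the expected prefix, which is the same idea the mask table in the function below relies on.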

Here's what the function looks like:

function ufgetc($fp)
{
    // mask values for first byte's bit patterns
    static $mask = [
        192, // 110xxxxx
        224, // 1110xxxx
        240  // 11110xxx
    ];

    // read first byte
    $ch = fgetc($fp);
    if ($ch === false) {
        // return false on EOF
        return false;
    }

    // single-byte character
    if ((ord($ch) & $mask[0]) != $mask[0]) {
        return $ch;
    }

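    // multi-byte character: the first byte's high bits give the length
    // (110xxxxx = 2 bytes, 1110xxxx = 3 bytes, 11110xxx = 4 bytes), so
    // read one more byte for each mask the first byte matches
    $buf = $ch;
    $lead = ord($ch);
    foreach ($mask as $m) {
        if (($lead & $m) != $m) {
            break;
        }
        $next = fgetc($fp);
        if ($next === false) {
            // EOF in the middle of a sequence; return what we have
            break;
        }
        $buf .= $next;
    }
    return $buf;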
}

PHP's fgetc() reads 8 bits at a time, just like its counterpart in C, but PHP's type system represents that byte as a single-character string. We therefore call ord() to get the byte's integer value before applying the bitwise mask check.
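
As a rough usage sketch, the loop below reads a short UTF-8 string back one character at a time. It repeats a compact copy of ufgetc() so the example runs on its own; the read_all_chars() helper and the php://memory test stream are assumptions of mine, not something from the presentation.

```php
<?php
// Compact copy of ufgetc() so this sketch is self-contained.
function ufgetc($fp)
{
    static $mask = [192, 224, 240];

    $ch = fgetc($fp);
    if ($ch === false) {
        return false;
    }
    if ((ord($ch) & $mask[0]) != $mask[0]) {
        return $ch;
    }

    $buf = $ch;
    for ($i = 0; $i < count($mask); $i++) {
        if ((ord($ch) & $mask[$i]) != $mask[$i]) {
            break;
        }
        $buf .= fgetc($fp);
    }
    return $buf;
}

// Hypothetical helper: collect every character from a stream into an array.
function read_all_chars($fp)
{
    $chars = [];
    // Compare strictly against false: the character "0" is falsy in PHP
    // and would otherwise end the loop early.
    while (($ch = ufgetc($fp)) !== false) {
        $chars[] = $ch;
    }
    return $chars;
}

// An in-memory stream stands in for a real file here.
$fp = fopen('php://memory', 'w+');
fwrite($fp, "na\xC3\xAFve \xE2\x82\xAC5"); // "naive" with a diaeresis, then a euro sign and 5
rewind($fp);

$chars = read_all_chars($fp);
fclose($fp);

echo count($chars), " characters from 11 bytes\n";
```

The strict !== false comparison matters for the same reason it does with plain fgetc(): a stream containing the digit 0 would otherwise look like end of file.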