Saturday, June 4, 2011

Reading Unicode (UTF-8) in C

In working on scanner code for Kiwi I did a bit of reading up on Unicode. It's not really as difficult as one might think parsing UTF-8 character by character in C. In the end I opted to use ICU so I could take advantage of its character class functions instead of rolling my own, but the by-hand method I thought was still worth sharing.

Functions like getc() read in a byte from an input stream. ASCII was the predominant encoding scheme and encoded characters in 7-8 bits, so reading a byte was effectively the same as reading a character. But you can only represent 255 characters using 8 bits, far too little to represent all the characters of the world's languages. The most common Unicode scheme is UTF-8, is a multi-byte encoding scheme capable of representing over 2 million characters using 4 bytes or less.

The 128 characters of 7-bit ASCII encoding scheme are encoded the same, the most-significant bit is always 0. Other characters can be encoded as multiple bytes but the most-significant bit in such cases is always 1. The bit pattern in the first byte of such multi-byte characters too that make it easy to determine exactly how many bytes are used for the character.


The first bit serves as a flag. If set, the character is encoded using multiple bytes, and if not, the character is encoded using just one. Subsequent bits indicate the number of bytes used. Each byte in a multi-byte encoded character will aways start with the pattern "10".

With that knowledge, one can write a wrapper that reads characters in from a UTF-8 encoded source one at a time. getc() reads a byte at a time, and a quick peek at the significant bits determines if additional bytes must be read to fetch the entire character.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define U_MAX_BYTES 4

unsigned short u_getc(FILE *stream, char *bytes) {
    /* mask values for bit pattern of first byte in multi-byte
     UTF-8 sequences: 
       192 - 110xxxxx - for U+0080 to U+07FF 
       224 - 1110xxxx - for U+0800 to U+FFFF 
       240 - 11110xxx - for U+010000 to U+1FFFFF */
    static unsigned short mask[] = {192, 224, 240}; 

    unsigned short i, j; 

    /* initialize buffer */
    memset(bytes, 0, U_MAX_BYTES + 1); 

    /* read first byte into buffer */
    bytes[0] = getc(stream);
    if (bytes[0] == EOF) { 
        return 0; 
    } 

    /* check how many more bytes need to be read for
     character */
    i = 0;
    if ((bytes[0] & mask[0]) == mask[0]) i++;
    if ((bytes[0] & mask[1]) == mask[1]) i++;
    if ((bytes[0] & mask[2]) == mask[2]) i++;

    /* read subsequent character bytes */
    j = 0;
    while (j < i) {
        j++;
        bytes[j] = getc(stream);
    }

    /* return number of bytes read into buffer */
    return i + 1;
}
The wrapper function u_getc() is passed an open file handle and a buffer to store the character. It returns the number of bytes read into the buffer.
int main(int argc, char *argv[]) {
    /* allocating +1 for null gives ability to print character
     as string */
    char *bytes = (char*)calloc(U_MAX_BYTES + 1, sizeof(char)); 

    /* read and print until end of file */
    while (u_getc(stdin, bytes)) { 
        printf("%s\n", bytes); 
    }

    return 0; 
}
Grouping characters into tokens such as operators and numeric literals is easy because UTF-8 is backward compatible with 7-bit ASCII, which allows performing character tests like byte[0] == '+' and isdigit(byte[0]). But it was the need to test multi-byte sequences that ultimately lead me to use ICU. Writing and testing a function that properly identifies membership in the set of 14,725+ possible characters that could appear in an identifier name wasn't a tempting prospect.

8 comments:

  1. Why are you writing a language in C. You should write it in a real language, like C#.

    ReplyDelete
  2. That's very clever! People sure were smart with their bits before they were so cheap and easy. Although, why not score the extra two bits of subsequent bytes of a multi-byte encoded set? Like for a U+0080 U+07FF one, why not get 13 bits instead of 11?
    110xxxxx - xxxxxxxx

    ReplyDelete
  3. The extra bit has to do with error correcting. If you're reading a corrupt stream and see one that starts with 10x that you're not expecting, you know something is wrong. My code sample doesn't really account for that... shame on me!

    ReplyDelete
  4. Interesting ... my roommate and I were talking the other night about how many gates for correctness checking there are due to the world of computation being fickle at the start. Backups of backups of backups, correctness and completeness audits, the list goes on. We agreed that things are stable enough now to relax some of that, but we can also thank it for the stability we now enjoy.

    ReplyDelete
  5. It's great and finally got solution of my problem.
    Thanks.

    ReplyDelete
  6. Very educational, thank you for posting this!
    Note that you should call memset with size U_MAX_BYTES+1 though.

    ReplyDelete
    Replies
    1. I didn't think it was necessary because only max bytes is written to it (the memory at max+1 should never be written to and hence remain null from the calloc call). But as I think about it more, +1 for the extra assurance it really is null is so trivial to do that you're probably right. I've updated the code... thanks!

      Delete