Skip to main content

Reading Unicode (UTF-8) in C

In working on scanner code for Kiwi I did a bit of reading up on Unicode. It's not really as difficult as one might think parsing UTF-8 character by character in C. In the end I opted to use ICU so I could take advantage of its character class functions instead of rolling my own, but the by-hand method I thought was still worth sharing.

Functions like getc() read in a byte from an input stream. ASCII was the predominant encoding scheme and encoded characters in 7-8 bits, so reading a byte was effectively the same as reading a character. But you can only represent 255 characters using 8 bits, far too little to represent all the characters of the world's languages. The most common Unicode scheme is UTF-8, is a multi-byte encoding scheme capable of representing over 2 million characters using 4 bytes or less.

The 128 characters of 7-bit ASCII encoding scheme are encoded the same, the most-significant bit is always 0. Other characters can be encoded as multiple bytes but the most-significant bit in such cases is always 1. The bit pattern in the first byte of such multi-byte characters too that make it easy to determine exactly how many bytes are used for the character.

The first bit serves as a flag. If set, the character is encoded using multiple bytes, and if not, the character is encoded using just one. Subsequent bits indicate the number of bytes used. Each byte in a multi-byte encoded character will always start with the pattern "10".

With that knowledge, one can write a wrapper that reads characters in from a UTF-8 encoded source one at a time. getc() reads a byte at a time, and a quick peek at the significant bits determines if additional bytes must be read to fetch the entire character.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define U_MAX_BYTES 4

unsigned short u_getc(FILE *stream, char *bytes) {
    /* mask values for bit pattern of first byte in multi-byte
     UTF-8 sequences: 
       192 - 110xxxxx - for U+0080 to U+07FF 
       224 - 1110xxxx - for U+0800 to U+FFFF 
       240 - 11110xxx - for U+010000 to U+1FFFFF */
    static unsigned short mask[] = {192, 224, 240}; 

    unsigned short i, j; 

    /* initialize buffer */
    memset(bytes, 0, U_MAX_BYTES + 1); 

    /* read first byte into buffer */
    bytes[0] = getc(stream);
    if (bytes[0] == EOF) { 
        return 0; 
    } 

    /* check how many more bytes need to be read for
     character */
    i = 0;
    if ((bytes[0] & mask[0]) == mask[0]) i++;
    if ((bytes[0] & mask[1]) == mask[1]) i++;
    if ((bytes[0] & mask[2]) == mask[2]) i++;

    /* read subsequent character bytes */
    j = 0;
    while (j < i) {
        j++;
        bytes[j] = getc(stream);
    }

    /* return number of bytes read into buffer */
    return i + 1;
}

The wrapper function u_getc() is passed an open file handle and a buffer to store the character. It returns the number of bytes read into the buffer.

int main(int argc, char *argv[]) {
    /* allocating +1 for null gives ability to print character
     as string */
    char *bytes = (char*)calloc(U_MAX_BYTES + 1, sizeof(char)); 

    /* read and print until end of file */
    while (u_getc(stdin, bytes)) { 
        printf("%s\n", bytes); 
    }

    return 0; 
}

Grouping characters into tokens such as operators and numeric literals is easy because UTF-8 is backward compatible with 7-bit ASCII, which allows performing character tests like byte[0] == '+' and isdigit(byte[0]). But it was the need to test multi-byte sequences that ultimately lead me to use ICU. Writing and testing a function that properly identifies membership in the set of 14,725+ possible characters that could appear in an identifier name wasn't a tempting prospect.

Comments

  1. Why are you writing a language in C. You should write it in a real language, like C#.

    ReplyDelete
  2. That's very clever! People sure were smart with their bits before they were so cheap and easy. Although, why not score the extra two bits of subsequent bytes of a multi-byte encoded set? Like for a U+0080 U+07FF one, why not get 13 bits instead of 11?
    110xxxxx - xxxxxxxx

    ReplyDelete
  3. The extra bit has to do with error correcting. If you're reading a corrupt stream and see one that starts with 10x that you're not expecting, you know something is wrong. My code sample doesn't really account for that... shame on me!

    ReplyDelete
  4. Interesting ... my roommate and I were talking the other night about how many gates for correctness checking there are due to the world of computation being fickle at the start. Backups of backups of backups, correctness and completeness audits, the list goes on. We agreed that things are stable enough now to relax some of that, but we can also thank it for the stability we now enjoy.

    ReplyDelete
  5. It's great and finally got solution of my problem.
    Thanks.

    ReplyDelete
  6. Very educational, thank you for posting this!
    Note that you should call memset with size U_MAX_BYTES+1 though.

    ReplyDelete
    Replies
    1. I didn't think it was necessary because only max bytes is written to it (the memory at max+1 should never be written to and hence remain null from the calloc call). But as I think about it more, +1 for the extra assurance it really is null is so trivial to do that you're probably right. I've updated the code... thanks!

      Delete
  7. Casting the result of calloc() and friends is unnecessary and may hide a bug. Please fix the code.

    ReplyDelete
  8. @Giedrius Statkevičius: why?

    ReplyDelete

Post a Comment

Popular posts from this blog

Writing a Minimal PSR-0 Autoloader

An excellent overview of autoloading in PHP and the PSR-0 standard was written by Hari K T over at PHPMaster.com , and it's definitely worth the read. But maybe you don't like some of the bloated, heavier autoloader offerings provided by various PHP frameworks, or maybe you just like to roll your own solutions. Is it possible to roll your own minimal loader and still be compliant? First, let's look at what PSR-0 mandates, taken directly from the standards document on GitHub : A fully-qualified namespace and class must have the following structure \<Vendor Name>\(<Namespace>\)*<Class Name> Each namespace must have a top-level namespace ("Vendor Name"). Each namespace can have as many sub-namespaces as it wishes. Each namespace separator is converted to a DIRECTORY_SEPARATOR when loading from the file system. Each "_" character in the CLASS NAME is converted to a DIRECTORY_SEPARATOR . The "_" character has no special ...

Composing Music with PHP

I’m not an expert on probability theory, artificial intelligence, and machine learning. And even my Music 201 class from years ago has been long forgotten. But if you’ll indulge me for the next 10 minutes, I think you’ll find that even just a little knowledge can yield impressive results if creatively woven together. I’d like to share with you how to teach PHP to compose music. Here’s an example: You’re looking at a melody generated by PHP. It’s not the most memorable, but it’s not unpleasant either. And surprisingly, the code to generate such sequences is rather brief. So what’s going on? The script calculates a probability map of melodic intervals and applies a Markov process to generate a new sequence. In friendlier terms, musical data is analyzed by a script to learn which intervals make up pleasing melodies. It then creates a new composition by selecting pitches based on the possibilities it’s observed. . Standing on Shoulders Composition doesn’t happen in a vacuum. Bach wa...

Learning Prolog

I'm not quite sure exactly I was searching for, but somehow I serendipitously stumbled upon the site learnprolognow.org a few months ago. It's the home for an introductory Prolog programming course. Logic programming offers an interesting way to think about your problems; I've been doing so much procedural and object-oriented programming in the past decade that it really took effort to think at a higher level! I found the most interesting features to be definite clause grammars (DCG), and unification. Difference lists are very powerful and Prolog's DCG syntax makes it easy to work with them. Specifying a grammar such as: s(s(NP,VP)) --> np(NP,X,Y,subject), vp(VP,X,Y). np(np(DET,NBAR,PP),X,Y,_) --> det(DET,X), nbar(NBAR,X,Y), pp(PP). np(np(DET,NBAR),X,Y,_) --> det(DET,X), nbar(NBAR,X,Y). np(np(PRO),X,Y,Z) --> pro(PRO,X,Y,Z). vp(vp(V),X,Y) --> v(V,X,Y). vp(vp(V,NP),X,Y) --> v(V,X,Y), np(NP,_,_,object). nbar(nbar(JP),X,3) --> jp(JP,X). pp(pp(PREP,N...