Reading Unicode (UTF-8) in C

While working on scanner code for Kiwi I did a bit of reading up on Unicode. Parsing UTF-8 character by character in C isn't really as difficult as one might think. In the end I opted to use ICU so I could take advantage of its character class functions instead of rolling my own, but I thought the by-hand method was still worth sharing.

Functions like getc() read a byte from an input stream. ASCII was long the predominant encoding scheme, and it encoded characters in 7-8 bits, so reading a byte was effectively the same as reading a character. But you can only represent 256 characters using 8 bits, far too few to represent all the characters of the world's languages. The most common Unicode encoding scheme is UTF-8, a multi-byte scheme capable of representing over 2 million characters using 4 bytes or fewer.

The 128 characters of the 7-bit ASCII encoding scheme are encoded the same in UTF-8; their most-significant bit is always 0. Other characters are encoded as multiple bytes, and in such cases the most-significant bit of each byte is always 1. The bit pattern in the first byte of a multi-byte character also makes it easy to determine exactly how many bytes the character occupies.

The first bit serves as a flag. If it's set, the character is encoded using multiple bytes; if not, the character is encoded in just one. In the first byte of a multi-byte character, the count of leading 1 bits indicates the total number of bytes used. Each subsequent byte of a multi-byte encoded character always starts with the pattern "10".
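
As a quick illustration (this snippet is just for demonstration and isn't part of the scanner), the code point U+00E9 (é) needs 11 bits, so it's packed into two bytes following the 110xxxxx 10xxxxxx layout:

#include <stdio.h>

int main(void) {
    /* U+00E9 (é) fits in 11 bits, so it's encoded as two bytes
     following the layout 110xxxxx 10xxxxxx */
    unsigned int cp = 0x00E9;
    unsigned char first = 0xC0 | (cp >> 6);    /* 110 + high bits  -> 0xC3 */
    unsigned char second = 0x80 | (cp & 0x3F); /* 10 + low 6 bits  -> 0xA9 */

    printf("%02X %02X\n", first, second);      /* prints "C3 A9" */
    return 0;
}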

With that knowledge, one can write a wrapper that reads characters from a UTF-8 encoded source one at a time. getc() reads a byte at a time, and a quick peek at the most-significant bits determines whether additional bytes must be read to fetch the entire character.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define U_MAX_BYTES 4

unsigned short u_getc(FILE *stream, char *bytes) {
    /* mask values for bit pattern of first byte in multi-byte
     UTF-8 sequences: 
       192 - 110xxxxx - for U+0080 to U+07FF 
       224 - 1110xxxx - for U+0800 to U+FFFF 
       240 - 11110xxx - for U+010000 to U+1FFFFF */
    static unsigned short mask[] = {192, 224, 240}; 

    unsigned short i, j;
    int c;

    /* initialize buffer */
    memset(bytes, 0, U_MAX_BYTES + 1); 

    /* read first byte; getc() returns an int so EOF can be
     distinguished from a valid byte value */
    c = getc(stream);
    if (c == EOF) {
        return 0;
    }
    bytes[0] = (char)c;

    /* check how many more bytes need to be read for
     character */
    i = 0;
    if ((bytes[0] & mask[0]) == mask[0]) i++;
    if ((bytes[0] & mask[1]) == mask[1]) i++;
    if ((bytes[0] & mask[2]) == mask[2]) i++;

    /* read subsequent character bytes */
    j = 0;
    while (j < i) {
        j++;
        bytes[j] = getc(stream);
    }

    /* return number of bytes read into buffer */
    return i + 1;
}

The wrapper function u_getc() is passed an open file handle and a buffer to store the character. It returns the number of bytes read into the buffer, or 0 at end of file.

int main(int argc, char *argv[]) {
    /* allocating +1 for null gives ability to print character
     as string */
    char *bytes = (char*)calloc(U_MAX_BYTES + 1, sizeof(char)); 

    /* read and print until end of file */
    while (u_getc(stdin, bytes)) { 
        printf("%s\n", bytes); 
    }

    return 0; 
}

Grouping characters into tokens such as operators and numeric literals is easy because UTF-8 is backward compatible with 7-bit ASCII, which allows performing character tests like bytes[0] == '+' and isdigit(bytes[0]). But it was the need to test multi-byte sequences that ultimately led me to use ICU. Writing and testing a function that properly identifies membership in the set of 14,725+ possible characters that could appear in an identifier name wasn't a tempting prospect.
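
To make that concrete, here's a rough sketch (the classify() helper is hypothetical, not something from the Kiwi scanner) of how single-byte characters returned by u_getc() can be tested with plain ASCII comparisons, while anything longer would be handed off to a library like ICU:

#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: classify a character read by u_getc().
 ASCII tests work on the first byte because UTF-8 is backward
 compatible with 7-bit ASCII; multi-byte characters are left to
 a library such as ICU. */
void classify(const char *bytes, unsigned short length) {
    if (length == 1 && bytes[0] == '+')
        printf("'%s' is an operator\n", bytes);
    else if (length == 1 && isdigit((unsigned char)bytes[0]))
        printf("'%s' is a digit\n", bytes);
    else if (length == 1)
        printf("'%s' is some other ASCII character\n", bytes);
    else
        printf("'%s' is a multi-byte character, defer to ICU\n", bytes);
}

int main(void) {
    classify("+", 1);
    classify("7", 1);
    classify("\xC3\xA9", 2); /* é */
    return 0;
}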

Comments

  1. Why are you writing a language in C? You should write it in a real language, like C#.

  2. That's very clever! People sure were smart with their bits before they were so cheap and easy. Although, why not score the extra two bits of subsequent bytes of a multi-byte encoded set? Like for a U+0080 to U+07FF one, why not get 13 bits instead of 11?
    110xxxxx - xxxxxxxx

  3. The extra bits have to do with error correcting. If you're reading a corrupt stream and see a byte that starts with 10 when you're not expecting one, you know something is wrong. My code sample doesn't really account for that... shame on me!

  4. Interesting ... my roommate and I were talking the other night about how many gates for correctness checking there are due to the world of computation being fickle at the start. Backups of backups of backups, correctness and completeness audits, the list goes on. We agreed that things are stable enough now to relax some of that, but we can also thank it for the stability we now enjoy.

  5. It's great, and I finally got a solution to my problem.
    Thanks.

  6. Very educational, thank you for posting this!
    Note that you should call memset with size U_MAX_BYTES+1 though.

    Replies
    1. I didn't think it was necessary because only max bytes is written to it (the memory at max+1 should never be written to and hence remain null from the calloc call). But as I think about it more, +1 for the extra assurance it really is null is so trivial to do that you're probably right. I've updated the code... thanks!

  7. Casting the result of calloc() and friends is unnecessary and may hide a bug. Please fix the code.

  8. @Giedrius Statkevičius: why?

