Skip to main content

Reading Unicode (UTF-8) in C

In working on scanner code for Kiwi I did a bit of reading up on Unicode. It's not really as difficult as one might think parsing UTF-8 character by character in C. In the end I opted to use ICU so I could take advantage of its character class functions instead of rolling my own, but the by-hand method I thought was still worth sharing.

Functions like getc() read in a byte from an input stream. ASCII was the predominant encoding scheme and encoded characters in 7-8 bits, so reading a byte was effectively the same as reading a character. But you can only represent 255 characters using 8 bits, far too little to represent all the characters of the world's languages. The most common Unicode scheme is UTF-8, is a multi-byte encoding scheme capable of representing over 2 million characters using 4 bytes or less.

The 128 characters of 7-bit ASCII encoding scheme are encoded the same, the most-significant bit is always 0. Other characters can be encoded as multiple bytes but the most-significant bit in such cases is always 1. The bit pattern in the first byte of such multi-byte characters too that make it easy to determine exactly how many bytes are used for the character.

The first bit serves as a flag. If set, the character is encoded using multiple bytes, and if not, the character is encoded using just one. Subsequent bits indicate the number of bytes used. Each byte in a multi-byte encoded character will always start with the pattern "10".

With that knowledge, one can write a wrapper that reads characters in from a UTF-8 encoded source one at a time. getc() reads a byte at a time, and a quick peek at the significant bits determines if additional bytes must be read to fetch the entire character.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define U_MAX_BYTES 4

unsigned short u_getc(FILE *stream, char *bytes) {
    /* mask values for bit pattern of first byte in multi-byte
     UTF-8 sequences: 
       192 - 110xxxxx - for U+0080 to U+07FF 
       224 - 1110xxxx - for U+0800 to U+FFFF 
       240 - 11110xxx - for U+010000 to U+1FFFFF */
    static unsigned short mask[] = {192, 224, 240}; 

    unsigned short i, j; 

    /* initialize buffer */
    memset(bytes, 0, U_MAX_BYTES + 1); 

    /* read first byte into buffer */
    bytes[0] = getc(stream);
    if (bytes[0] == EOF) { 
        return 0; 
    } 

    /* check how many more bytes need to be read for
     character */
    i = 0;
    if ((bytes[0] & mask[0]) == mask[0]) i++;
    if ((bytes[0] & mask[1]) == mask[1]) i++;
    if ((bytes[0] & mask[2]) == mask[2]) i++;

    /* read subsequent character bytes */
    j = 0;
    while (j < i) {
        j++;
        bytes[j] = getc(stream);
    }

    /* return number of bytes read into buffer */
    return i + 1;
}

The wrapper function u_getc() is passed an open file handle and a buffer to store the character. It returns the number of bytes read into the buffer.

int main(int argc, char *argv[]) {
    /* allocating +1 for null gives ability to print character
     as string */
    char *bytes = (char*)calloc(U_MAX_BYTES + 1, sizeof(char)); 

    /* read and print until end of file */
    while (u_getc(stdin, bytes)) { 
        printf("%s\n", bytes); 
    }

    return 0; 
}

Grouping characters into tokens such as operators and numeric literals is easy because UTF-8 is backward compatible with 7-bit ASCII, which allows performing character tests like byte[0] == '+' and isdigit(byte[0]). But it was the need to test multi-byte sequences that ultimately lead me to use ICU. Writing and testing a function that properly identifies membership in the set of 14,725+ possible characters that could appear in an identifier name wasn't a tempting prospect.

Comments

  1. Why are you writing a language in C. You should write it in a real language, like C#.

    ReplyDelete
  2. That's very clever! People sure were smart with their bits before they were so cheap and easy. Although, why not score the extra two bits of subsequent bytes of a multi-byte encoded set? Like for a U+0080 U+07FF one, why not get 13 bits instead of 11?
    110xxxxx - xxxxxxxx

    ReplyDelete
  3. The extra bit has to do with error correcting. If you're reading a corrupt stream and see one that starts with 10x that you're not expecting, you know something is wrong. My code sample doesn't really account for that... shame on me!

    ReplyDelete
  4. Interesting ... my roommate and I were talking the other night about how many gates for correctness checking there are due to the world of computation being fickle at the start. Backups of backups of backups, correctness and completeness audits, the list goes on. We agreed that things are stable enough now to relax some of that, but we can also thank it for the stability we now enjoy.

    ReplyDelete
  5. It's great and finally got solution of my problem.
    Thanks.

    ReplyDelete
  6. Very educational, thank you for posting this!
    Note that you should call memset with size U_MAX_BYTES+1 though.

    ReplyDelete
    Replies
    1. I didn't think it was necessary because only max bytes is written to it (the memory at max+1 should never be written to and hence remain null from the calloc call). But as I think about it more, +1 for the extra assurance it really is null is so trivial to do that you're probably right. I've updated the code... thanks!

      Delete
  7. Casting the result of calloc() and friends is unnecessary and may hide a bug. Please fix the code.

    ReplyDelete
  8. @Giedrius Statkevičius: why?

    ReplyDelete

Post a Comment

Popular posts from this blog

Geolocation Search

Services that allow users to identify nearby points of interest continue to grow in popularity. I'm sure we're all familiar with social websites that let you search for the profiles of people near a postal code, or mobile applications that use geolocation to identify Thai restaurants within walking distance. It's surprisingly simple to implement such functionality, and in this post I will discuss how to do so.

The first step is to obtain the latitude and longitude coordinates of any locations you want to make searchable. In the restaurant scenario, you'd want the latitude and longitude of each eatery. In the social website scenario, you'd want to obtain a list of postal codes with their centroid latitude and longitude.

In general, postal code-based geolocation is a bad idea; their boundaries rarely form simple polygons, the area they cover vary in size, and are subject to change based on the whims of the postal service. But many times we find ourselves stuck on a c…

Composing Music with PHP

I’m not an expert on probability theory, artificial intelligence, and machine learning. And even my Music 201 class from years ago has been long forgotten. But if you’ll indulge me for the next 10 minutes, I think you’ll find that even just a little knowledge can yield impressive results if creatively woven together. I’d like to share with you how to teach PHP to compose music. Here’s an example: You’re looking at a melody generated by PHP. It’s not the most memorable, but it’s not unpleasant either. And surprisingly, the code to generate such sequences is rather brief. So what’s going on? The script calculates a probability map of melodic intervals and applies a Markov process to generate a new sequence. In friendlier terms, musical data is analyzed by a script to learn which intervals make up pleasing melodies. It then creates a new composition by selecting pitches based on the possibilities it’s observed. . Standing on ShouldersComposition doesn’t happen in a vacuum. Bach was f…

Creepy JavaScript Tracking

I recently began allergy shots so my new Monday morning routine includes me sitting in a doctor's office for 30 minutes (I must wait after receiving the shots and be checked by a nurse to make sure there was no reaction). With nothing else better to do while I waited last week, I started playing around with some JavaScript. This is what I came up with:
<html> <head> <title>Test</title> <script type="text/javascript"> window.onload = function () { var mX = 0,  mY = 0, sX = 0,  sY = 0, queue = [], interval = 200, recIntv = null, playIntv = null, b = document.body, de = document.documentElement, cursor = document.getElementById("cursor"), record = document.getElementById("record"), play = document.getElementById("play"); window.onmousemove = function (e) { e = e || window.event; if (e.pageX || e.pageY) { …