Thursday, August 6, 2015

A Unicode fgetc() in PHP

In preparation for a presentation I'm giving at this month's Syracuse PHP Users Group meeting, I found the need to read in Unicode characters in PHP one at a time. Unicode is still second-class in PHP; PHP6 failed and we have to fallback to extensions like the mbstring extension and/or libraries like Portable UTF-8. And even with those, I didn't see a unicode-capable fgetc() so I wrote my own.

Years ago, I wrote a post describing how to read Unicode characters in C, so the logic was already familiar. As a refresher, UTF-8 is a multi-byte encoding scheme capable of representing over 2 million characters using 4 bytes or less. The first 128 characters are encoded the same as 7-bit ASCII with 0 as the most-significant bit. The other characters are encoded using multiple bytes, each byte with 1 as the most-significant bit. The bit pattern in the first byte of a multi-byte sequence tells us how many bytes are needed to represent the character.

Here's what the function looks like:

function ufgetc($fp)
    // mask values for first byte's bit patterns
    static $mask = [
        192, // 110xxxxx
        224, // 1110xxxx
        240  // 11110xxx

    // read first byte
    $ch = fgetc($fp);
    if ($ch === false) {
        // return false on EOF
        return false;

    // single-byte character
    if ((ord($ch) & $mask[0]) != $mask[0]) {
        return $ch;

    // multi-byte character
    $buf = $ch;
    for ($i = 0; $i < count($mask); $i++) {
        if ((ord($ch) & $mask[$i]) != $mask[$i]) {
        $buf .= fgetc($fp);
    return $buf;
PHP's fgetc() reads in 8 bits at a time just like it's counterpart in C, but these bytes are represented as a single-character string in PHP's type system so we need to use the byte's integer value for the mask check to succeed.

Sunday, July 26, 2015

Some Go Irks and Quirks

Now that Jump Start MySQL is published, I'm taking advantage of the spare time I have on my hands while it lasts. I've helped organize the Syracuse PHP Users Group, reconnected with some old friends, and gave some love to Kiwi, my forever-project programming language. Moreover, I decided to rewrite Kiwi using Go as it's one of those languages I found interesting but never had a reason to use in any serious fashion. And now that I've got some real experience with it, while I still find myself impressed by some of Go's features, some things have become really annoying.

I still really like Go's datatyping; it's static, but it feels dynamic because the compiler is smart enough to deduce a value's type. If you write your code well then you'll rarely see a type name outside of a function signature or struct or interface definition. It's nice to have type safety without the verbosity (yes I'm looking at you, PHP7).

I wish := behaved slightly different, though. Instead of always an allocation, it'd be nice if it could also perform basic assignments. Then we could write code like this:

foo, bar := baz()
foo.x, fizz := quux()
But as it is now, the best we can do is:
foo, bar := baz()
var fizz MyType
foo.x, fizz = quux()
If there's a go-ism that works around this that you know of, feel free to let me know.

The dangling comma in a list, but only when its closing brace is on a new line, is also irritating. No, it's not a formatting issue; gofmt won't enforce one brace placement over the other. Rather, the presence or lack of a comma is a parsing error. We can write:

And we can write:
But we can't write:
Perhaps it was because I was writing my own parser at the time that this bothered me. It should be trivial to accommodate the desired pattern, especially since structs and interface definitions are brace-delimited and don't use commas at all.

Go elides some traditional constructs, for example for handles for, foreach, and while loops, so why make and new still exist side-by-side, even when Rob Pike proposed merging them, leaves me scratching my head. &Foo{} is equivalent to new(Foo), so if there's no need for while then there's no need for new.

I recognize these gripes are largely syntactic, but the syntax of a language is its API. Programmers are immersed in it every day and it can have an effect on how we think about things.

Surprisingly though, and perhaps this is my biggest complaint, the tooling around Go is still immature. In the 6+ years after its release there is still no killer IDE. Code coverage can only be generated for one package at a time, not and entire project. It's possible to script coverage for project-wide results but that's just a hack. Debugging with GDB is brutal and I could not get Delve to work for me.

None of these irks will stop me from using Go in the future if I have the opportunity, but I'd like to suggest Go at work as the go to language (pun intended) for some of the work we do now in C. I can probably make some good technical arguments to sway our old-time C programmers, yet convincing management and the programmers fresh out of college to use Go without viable tooling is going to be a hard sell.

Wednesday, December 3, 2014

PHP Frameworks Don't Save Time

Experience has shown me frameworks can be useful for maintaining structure in large code base developed by multiple teams. Every developer has different abilities and a framework enforces structure and consistency throughout the code. But I've not experienced saving any substantial amount of time on a PHP project because of a framework.

The other day someone posted in the PHP subreddit asking for advice. He was about to begin work on a small project and wanted to know whether he should use a framework, and if so then which framework would be appropriate. I should have known better than to offer my two cents but I did anyway.

Slim + NotORM + Twig is nice. If it's a simple project, you probably don't need much more than that. I'm not a fan of frameworks in the slightest but I do enjoy the aforementioned combination. They're lightweight and stay out of my way, allowing me to write my functionality.

Another redditor picked up on my distaste for frameworks and asked:

So you're okay with being slower than someone with your same basic skill set? Serious question...

A serious question deserves a serious answer and so I replied, attempting to explain developer skill sets are not always the same and the differences in how we each might approach a problem has a greater effect of development time. If you like you can read my original response in the post's comments thread. Otherwise, here's a more refined presentation of my argument.

With regard to skill set, I'm a PHP programmer who has been coding in pure PHP for the better part of 13 years. I have an intimate relationship with the language and can probably write PHP code in my sleep. But as soon as a framework is introduced, I'm faced with a learning curve. Frustration inevitably ensues because simple things suddenly seem difficult, either because I'm unfamiliar with the new API and have to follow the framework's particular philosophy.

Many of my peers use frameworks, both co-workers and friends in the community. They've taken the time to learn the ins and outs of a given framework and probably can code in their sleep with it just as I do with pure PHP. But what happens when the need arises to go outside the bounds of the framework and they need to write something raw? That's when they confront their learning curve and have to dig into PHP's documentation.

We obviously don't all share the same basic skill set. Yes, we're all working in PHP, but my peers are experienced with a framework and I'm experienced with the nuances of the language itself. They're as fast writing their framework-based code as I am writing PHP; they're as slow writing pure PHP code as I am working with a framework.

But even if everything was equal on the skill side of the equation, there's still a human variable. Sharing exactly the same skills as someone else doesn't mean you'll share the same way of thinking about things or the same approach to solving a problem. Remember, there's more to programming than writing code; a large amount of time is spent simply on thinking about how to solve a problem. I can spend 6 hours planning and 2 hours coding, and a coworker can spend 7 hours planing and 1 hour coding, and although the coworker was technically faster at writing code, neither of us was actually more or less productive than the other. We both put in the same amount of time to the problem.

It's also noted how horribly fragmented the PHP ecosystem is. The world of a PHP programmer is not like the world of Python programmer where the community has largely settled on Django, or the world of a C# programmer where there's the .NET framework. Knowledge of Django and .NET is transferable across most Python and C# projects. But with PHP, a developer can learn ZF2, another developer can learn Yii, another may learn Larvel, and still another would learn Symfony... and little of the knowledge and experience they gain is transferable if the next project doesn't use their preferred framework. We face a potential learning curve before we even make our first keystroke on any project, and that takes time.

Promoting framework adoption is fine but I simply don't believe the time element is the proper argument for it. I probably wasn't as clear as I could have been in my initial response, so hopefully this clarifies things. Feel free to use the comments if I'm still just spewing senseless babble!

PS: Thanks to the kind redditor who felt my blathering response was worth Reddit Gold. You rock!

Thursday, May 1, 2014

New Writers Guide now on GitHub

Writing can be a fun and rewarding way to share your knowledge, experience, and opinions with others. Unfortunately, it can also be intimidating or frustrating for some people. When I was managing editor for SitePoint's PHPMaster property, I prepared a guide to help alleviate some of the frustration and self-doubt that new writers (and even experienced writers) might experience.

The guide wasn't something commissioned by SitePoint; I wrote it on my own for my authors. And though it's been about eight months since PHPMaster was absorbed into the main SitePoint site and I stepped down as managing editor, people continue to ask me about it. So, I've decided to make the guide publicly available.

The New Writers Guide offers advice for finding inspiration, structuring an article's content, growing one's self-confidence, and overcoming other challenges that programming writers may face. Hopefully it'll continue to help people write awesome articles and realize the many benefits of writing in their life.

You can find a copy of the guide on GitHub at

Friday, April 25, 2014

Ajax File Uploads with JavaScript's File API

Developers have been using Ajax techniques for years to create dynamic web forms, but handling file uploads using Ajax was always problematic. The crux of the problem was security – it's not a good idea to allow arbitrary code access to any file it wants on a user's system so JavaScript was intentionally restricted in how it could interact with things like file input elements. Uploading a file with JavaScript was essentially a standard form submission that targeted a hidden iframe. It felt dirty but it got the job done.

The W3C began work on standardizing a File API for JavaScript sometime between 2006 and 2009 and we're now at the point with browser support where developers can take advantage of it. Developers supporting web apps on IE8 and 9 still need to use iframes, but those of us targeting newer browsers can finally take a pure JavaScript approach to file uploads. And as more users migrate from IE8/9, the iframe approach will eventually be left in the dustbin.

The interesting things defined by the W3C's File API are:

  • Blob – an object to represent a sequence of bytes and is consumed by FileReader. Its size property lists the size of the sequence in bytes and its type property is a lower-case MIME-type string if such information is available.
  • File – an object that extends Blob and offers additional properties to make the file's metadata available. Its name property holds the filename (no path information) and lastModifiedDate holds a Date object instance set to when the file was last modified.
  • FileReader – an object that reads the byte sequence of a Blob or File object.
  • FileList – a property given to file input elements which essentially is a list of File objects.

The API is designed so that byte sequences are loaded asynchronously by default. This makes sense since there are several things that can cause the read process to take a while to complete: it might be a large file, the file might be on a mounted network share, etc. Reading files asynchronously ensures the main execution thread is free and the browser doesn't lock up.

So what does a basic upload look like using the API? At a high level, the steps are:

  1. Provide a file input for the user.
  2. When the user sets a file, retrieve its File object from the input's files property.
  3. Create a FileReader instance and register a callback for its onload event. This callback will have access to the read data.
  4. Initiate the read process with the FileReader methods readAsText() or readAsDataURL().

I like to use readAsDataURL() to initiate the read process, especially for binary files like images and PDFs, since the data will be base64 encoded. The ASCII URI string can then be safely sent to the server just like any other string.

I also recommend using POST for the HTTP method; yes, the encoded contents as a data URI which can be used in a GET parameter, but doing so increases the risk of getting an HTTP/414 error because of the resulting size of the request. Base64 encodes binary content to safe ASCII which increases the data's size by roughly 130%.

 <input id="fileInput" type="file" />

document.getElementById("fileInput").onchange = function () {
    // retrieve File from input
    var file = this.files[0];

    // set FileReader's onload event
    var reader = new FileReader();
    reader.onload = function () {
        // the results of the read is available with the FileReader's
        // result property when the callback is executed
        var fileContent = this.result;

        // send fileContent to server via Ajax request
        // ...
    // initiate reading

Handling the upload once it reaches the server is different than working with traditional file uploads in PHP since the file comes into the system as “normal” user input. That is, you won't be using the $_FILES superglobal or functions like move_uploaded_file(). Instead the content will be available straight from $_POST.

The data URI format is defined by RFC 2397 looks like the following:


You're free to existing libraries to parse the URI or parse it yourself. The media type is optional. If present, the value is a MIME type string. If it's missing, the default value text/plain;charset=US-ASCII should be assumed. If ;base64 is present then the data is base64 encoded.

// parse out file data
list($front, $data) = explode(',', $dataUri, 2);
if (stristr($front, ';base64') !== false) {
    $data = base64_decode($data);

// test whether the file is a valid image
try {
    $image = new \Imagick();
catch (\ImagickException $e) {
    header('HTTP/1.0 400 Bad Request');

// do something with $image
// ...

Posting a file as data URI protects you from some of the security vulnerabilities that are typically inherent when dealing with files. Data URIs don't account for filenames, for instance, so you're safe from directory traversal attacks by maliciously named files. Still, you should treat the URI as you would any other piece of user-supplied data. Your application will obviously dictate how you filter and validate the file.

A secondary concern is the possibility of a malicious person using large file posts as a vector for a denial of service attack. The traditional upload approaches must mitigate this risk, and an Ajax approach must do so as well. Make certain you review the memory_limit and post_max_size entries in your php.ini, and keep in mind the tradeoff between size and ASCII-safety when using base64 encoding.

This isn't the first post on the Internet to deal with Ajax file uploads or JavaScript's File API, but many of them provide little beyond code samples. Hopefully I've remedied the situation by providing a succinct overview of the API's important objects/interfaces and discussing how receiving the file is different using this approach. If there's something I've neglected, feel free to leave a comment!

Thursday, February 20, 2014

Fixing "MySQL server has gone away" Errors in C

I ran across an old question on Stack Overflow the other day in which a user was having issues maintaining his connection to MySQL from C. I left a brief answer there for anyone else who might stumble across the same problem in the future, but I felt it was worth expanding on a bit more.

The error "MySQL server has gone away" means the client's connection to the MySQL server was lost. This could be because of many reasons; perhaps MySQL isn't running, perhaps there's network problems, or perhaps there was no activity after a certain amount of time and the server closed the connection. Detailed information on the error is available in the MySQL documentation.

It's possible for the client to attempt to re-connect to the server when it's "gone away" although it won't try to by default. To enable the reconnecting behavior, you need to set the MYSQL_OPT_RECONNECT option to 1 using the mysql_options() function. It should be set after mysql_init() is called and before calling mysql_real_connect(). This should solve the problem if the connection was closed by the server because of a time-out.

The MySQL documentation that discusses the reconnect behavior points out that only one re-connect attempt will be made, which means the query can still fail if the server is stopped or inaccessible. I ran across this problem myself while writing a daemon in C that would periodically pull data from MySQL. The daemon was polling at set intervals far less than the time-out period, so any such errors were the result of an unreachable or stopped server. I simply jumped execution to just prior to my work loop's sleep() call and the daemon would periodically try to re-connect until the server came back up.

#define DBHOSTNAME localhost
#define DBHOSTNAME dbuser

MYSQL *db = mysql_init(NULL);
if (db == NULL) {
    fprintf(stderr, "Insufficient memory to allocate MYSQL object.");

/* enable re-connect behavior */
my_bool reconnect = 1;
int success = mysql_options(db, MYSQL_OPT_RECONNECT, &reconnect);
assert(success == 0);

    0, NULL, 0) == NULL) {
    fprintf(stderr, "Connection attempt failed: %s\n", mysql_error(db));

for (;;) {
    success = mysql_query(db, "<MYSQL QUERY HERE>");
    if (success != 0) {
        /* The error is most likely "gone away" since the query is
         * hard-coded, doesn't return much data, and the result is
         * managed properly. */
        fprintf(stderr, "Unable to query: %s\n", mysql_error(db));
        goto SLEEP;

    /* call mysql_use_result() and do something with data */


Thursday, February 13, 2014

Generating C Code and Compiling from STDIN

Lately I've been exploring some syslog configurations and needed to generate some log messages to verify they were routed correctly. Of course doing so programmatically would provide an easy and repeatable method to generate a batch of fresh log messages whenever I needed, but because of the number of facilities and priorities defined by the syslog protocol, it made sense to write a code generator to iterate the different permutations.

The following Lua script generates boilerplate C code for each of the 64 messages needed to test LOG_LOCAL 0-7 with all priorities. I chose generating the code in this manner over writing a nested facilities/priorities loop directly in C so I could easily include a textual representation of the facility and priority constants in the log message (this seemed like a cleaner solution to me than having to maintain a mapping of constants to char* strings as well). And why Lua? Well, it seemed a better idea than M4. :)

#! /usr/bin/env lua

local facilities = {

local priorities = {

#include <stdlib.h>
#include <syslog.h>
#include <libgen.h>

int main(int argc, char *argv[])
    char *appName = basename(argv[0]);

for _, facility in pairs(facilities) do 
    for _, priority in pairs(priorities) do
    openlog(appName, LOG_CONS|LOG_NDELAY|LOG_PID, %s);
    syslog(%s, "Test %s.%s message.\n");
            facility, priority, facility, priority

    return EXIT_SUCCESS;

Running the script will output the desired C code, which looks like this:

#include <stdlib.h>
#include <syslog.h>
#include <libgen.h>

int main(int argc, char *argv[])
    char *appName = basename(argv[0]);

    openlog(appName, LOG_CONS|LOG_NDELAY|LOG_PID, LOG_LOCAL0);
    syslog(LOG_DEBUG, "Test LOG_DEBUG message.\n");

    openlog(appName, LOG_CONS|LOG_NDELAY|LOG_PID, LOG_LOCAL0);
    syslog(LOG_INFO, "Test LOG_INFO message.\n");

    openlog(appName, LOG_CONS|LOG_NDELAY|LOG_PID, LOG_LOCAL0);
    syslog(LOG_NOTICE, "Test LOG_NOTICE message.\n");

If I wanted to inspect or tweak the generated code, I could pipe the script's output to a file before compiling it:

./gen-syslog-tests.lua > syslog-tests.c
gcc -o syslog-tests syslog-tests.c

But if I just wanted the compiled binary and had no need to modify the code, it seems inelegant to write things out to a file. Here's where I learned it's possible for gcc to compile code piped in on STDIN.

./gen-syslog-tests.lua | gcc -o syslog-tests -xc -

The two things of note are: gcc can't deduce the programming language from the file extension (since there is no file) so the -x flag is necessary to identify the language, and - is used as the file name (a convention commonly used to indicate reading from STDIN as a file).