Thursday, June 30, 2011

Smalltalk Challenge: Post 6 - Morphic

As the Dynabook/Smalltalk environment was the first to introduce the windowed user interface, it's no surprise that the Model View Controller (MVC) pattern also made its first appearance in Smalltalk. In Smalltalk, the term "MVC" refers to both the architecture pattern that separates code responsibilities into model, view, and controller objects, and the user interface framework used to develop visual and interactive elements. The MVC framework manages objects in the environment using the MVC pattern.
  • Model objects are responsible for maintaining the behavior and state of the element.
  • View objects are responsible for the representation or appearance of the element within the world.
  • Controller objects are responsible for accepting user input and passing messages to the model and view objects.
But because of the complexity and limitations of the MVC architecture, Squeak has replaced the MVC framework with Morphic, a direct-manipulation user interface toolkit. Unlike MVC, Morphic didn't originate with Smalltalk. Rather, it was written by Randy Smith and John Maloney for Self, another OO programming language influenced by Smalltalk, and ported to Squeak by Maloney who is also one of the core Squeak developers.

All interactive elements that exist within the environment in Morphic are subclasses of Morph, and the majority of code that would normally be required by MVC is encapsulated in the Morph class. For example:
Morph new openInWorld.
will open a blue rectangle that you can move about the screen, resize, rotate, and more. When a custom morph extends Morph, it inherits all of its parent's functionality and the programmer is freed to focus on customizing the morph instead of writing boiler-plate code. As Maloney wrote in An Introduction to Morphic, "In morphic, a little code goes a long way, and there is hardly any wasted effort."

I actually took advantage of Morphic in my Kember Identity search code, though it's not something one might think of as requiring a manipulatable object. The step message is intended for creating animations and morphs that update themselves dynamically over time (adding liveness to the interface). Morphic repeatedly sends step messages to each morph which may respond by executing their step method (the rate at which step messages are sent to an object can be specified by responding to the stepTime message, but by default I believe it's set for every few milliseconds or so). The method actions run concurrently with whatever else is happening in the rest of the environment. This builtin threading was exactly what I needed to ensure the rest of the system remained responsive to the user while the code was searching for a hash that exhibits the Kember Identity.

I subclassed the TextMorph class and supplied the following step method:
Kember » step
    found | (currHash = stopHash)
        ifFalse: [self contents: 'Testing ' , currHash.
            found := self test: currHash.
                found ifTrue: [self contents: 'Found '
                        , currHash , '!'].
            currHash := self nextHash: currHash.
            currHash = stopHash
                ifTrue: [self contents: 'KI was not found.']].
The contents: method is inherited from TextMorph to set the displayed text string, and the rest are instance variables and methods specific to my Kember class.

Morphs can respond to several other messages sent by the framework, too. Responding to the drawOn:message allows a morph to customize its appearance, or how it "draws" itself in the environment. Responding to mouseDown:, mouseUp:, mouseMove:, and keyStroke: events lets the morph interact with the user's mouse and keyboard activities.

Well that's 6 posts down now and only 4 more to go before I complete my challenge!

Wednesday, June 29, 2011

Smalltalk Challenge: Post 5 - Talking to Objects

Everything I've read about Smalltalk makes an effort to stress objects are capable of doing three things: maintaining state, receiving messages, and sending messages. The understanding that an object can maintains its own state is common across many languages, but message passing may be unfamiliar to programmers working with other languages (though message passing does play an important role in Objective-C). Alan Kay has said that message passing is the most important concept to understand in Smalltalk, and that objects have been over-emphasized.

Smalltalk's object model follows the classical (class-based) style; the programmer writes a class which is then used by the runtime as a blueprint for instantiating an object. But unlike C++ or Java, the programmer never works with objects by invoking their methods directly in Smalltalk. Instead, messages are sent to the object which receives a message, looks up the appropriate method to perform, and then executes the method. This is more than a conceptual difference; it's a concrete implementation detail as explained by the following quote from Wikipedia's Objective-C article:
In a Simula-style language [like C++, Java, etc.], the method name is in most cases bound to a section of code in the target class by the compiler. In Smalltalk and Objective-C, the target of a message is resolved at runtime, with the receiving object itself interpreting the message.
Perhaps a suitable way of looking at it is that all of the properties and methods of an object are private. The only way to interact with the object is by sending it the appropriate messages.

There are three types of messages in Smalltalk: unary messages that take no arguments, binary messages which have one argument, and keyword messages that can take one or more arguments.

Unary messages appear after the object instance they're sent to. The cr message is an example of a unary message that can be sent to a Transcript object, which responds by displaying a newline character.
Transcript cr.
Binary messages appear with an argument. For example, the binary message = can be sent to foo with the argument 42, and foo responds by comparing its value with the argument and returning a Boolean object set to whether the values were equal.
foo = 42.
Keyword messages end with a colon and can take multiple arguments. The at:put: message is an example of a keyword message which can be sent to an IntegerArray object. The array responds by placing the value 42 at index 5.
anIntegerArray at: 5 put: 42.
Keyword messages are a single message. That is, at:put is not an at: message followed by a put: message. In Java, it might be written like anIntegerArray.atPut(5, 42).

In statements that contain multiple messages, the order of precedence is from left to right with unary messages being sent first, then binary messages, and finally keyword messages, though the order can be influenced by parenthesis. Recall the example 2 + 1 / 4 from Post 3. The expression contains two binary operators, + and /. Following the precedence rules, 2 is sent the message + with the argument 1 and then the message / with the argument 4. Contrast this with how the statement would be evaluated in Java following the mathematical order of operations. To force that execution in Smalltalk, the expression would need to be parenthesized as 2 + (1 / 4).

It seems the biggest benefit of the approach is that its late/dynamic binding affords the flexibility of duck typing in a compiled language... but now that dynamic languages are widely used, and something like an inferred typing system can give the "feel" of dynamic programming in a strongly-typed language, I wonder if this is only historically important now. There is a runtime performance hit since method lookups happens at execution instead of compilation, and some Smalltalk implementations have explored method caching to help mitigate the impact, but it seems there must be a whole class of compile-time optimizations that cannot be performed by the compiler that may could affect performance as well.

I asked about the advantages and drawbacks of message passing on the PiLuD mailing list. Some of the respondents made pretty good points, and I recommend reading the exchange if you're interested in hearing some other opinions from a language-design perspective.

Tuesday, June 28, 2011

Smalltalk Challenge: Post 4 - Porting the Kember Identity

There are a few things I find myself tripping up over even after spending some time writing "meaningful" Smalltalk code, like using single quotes to delimit strings (double quotes are used for comments) and remembering the order in which different messages are sent, but the more code I write the easier it is to remember such things. After only a few hours, Smalltalk is still something new and unfamiliar.

The first programs I wrote when looking into Go were solutions to the first two Project Euler problems and a port of the Kember Identity search program. I decided to skip the Euler problems this time and go straight to the Kember Identity port.

The Kember program ultimately boils down to generating and checking MD5 hashes. I didn't find any helpful cryptography related objects or methods in the default image, so I searched Google and eventually found Ron Teitelbaum's Cryptography/Team package. Squeak uses a package management system called Monticello to load code into the image, so getting and installing the package was pretty easy. I copied and pasted the package repository's connection information into the Monticello Browser and loaded Rob Withers' contribution, Cryptography-rww.15.mcz.

Once the package was loaded, I was able to obtain hashes with MD5 » hashStream:, but the returned object was a ByteArray and I needed to interpret it as a 32-character hexadecimal string. At first I took this approach to convert the array to a hex string:
Kember » md5: aString
    "return 32-char MD5 hash of the given text" 
    | hash str |
    hash := MD5 new
                hashStream: (ReadStream on: aString).
    str := ''.
    1 to: hash size do: [:i | 
        str := str, ((hash at: i) radix: 16)].
    ↑ str.
… only to find out later that objects of the ByteArray class were modified by the cryptography package to accept a hex message and will do the conversion for me. Oops! All the bit twiddling I had done could easily be replaced with:
    ↑ (MD5 new hashStream: (ReadStream on: aString)) hex.
Converting between hash representations wasn't the only part of the program I initially over-programmed. I was also doing long-form addition to obtain the next hash value in the sequence when all I really needed was a little bit of type juggling and string padding:
Kember » nextHash: aHashStr
    "increment the MD5 hash"
    | hexHash zeroes |
    zeroes = '00000000000000000000000000000000'.
    hexHash := ((ByteArray fromHexString: aHashStr)
        asInteger + 1) asByteArray hex.
    hexHash size < 32
        ifTrue: [↑ (zeroes copyFrom: 1 to 32 - hexHash size)
            , hexHash].
    ↑ hexHash.
I guess it just proves the saying is true, "learning the libraries is the 20% of learning a new language that takes 80% of the time and effort." If anything, at least I can take comfort in knowing I'm not the first person to over-program a solution while learning a new language. For those that want to check out my Kember code, I've set up my own repository and uploaded a Monticello package to SqueakSource, and a file dump to Github (I suggest looking at it in raw mode if you go to Github because their Markdown chokes and truncates the pretty-print view).

Monday, June 27, 2011

Smalltalk Challenge: Post 3 - Awkward

After spending a short time in the Smalltalk (Squeak) environment, it's easy to understand how other existing languages at the time were not suited for realizing the full potential of the Dynabook. Working in a visual environment is a far-cry from working at a mainframe terminal. It's ironic though that some of the same issues that plague OO now are the same that held Smalltalk back in the 80's... it consumed a substantial amount of memory and the performance was not always optimal. For me, one of the issues is the awkwardness of Smalltalk's "everything is an object" philosophy.

I believe the "everything is a ___" mindset causes problems, regardless of what fills in the blank. In TCL, everything is a string. In LISP, everything is a list. In Smalltalk, everything is an object. The fact of the matter is not everything in the world is homogeneous. Some things are objects, others are numbers, and yet others are actions. Forcing everything into the same conceptual model forces the programmer to perform cognitive acrobatics.

Understanding the semantics behind a simple statement such as Transcript show: 'Hello World!' is easy; a Transcript window object is sent the message show:, which then responds by displaying the argument. But the semantics behind conditional code execution isn't so intuitive. Smalltalk doesn't have a dedicated if-construct, instead conditional execution is achieved by passing messages to Boolean objects, like so:
x > 0
    ifTrue: [Transcript show: 'Positive']
    ifFalse: [Transcript show 'Negative'].
The object represented by the variable x is sent the > message with the argument 0. x then compares its value with that of the argument and returns a Boolean object in response. If the value of x is a positive number then the resulting truth value of the returned object is true, otherwise the value is false. The ifTrue:ifFalse: message is sent to the implicit Boolean object providing two code block (which are also objects!) as arguments. All basic language constructs such as if-statements and while loops are abstracted as objects! While it does produce an appreciable amount of elegance and flexibility in the language, this thinking takes some getting used to.

There's a lot going on behind the scenes with an expression like 2 + 1 / 4 as well. A language like C will follow the rules of precedence stated by the mathematical order of operations. First 1 is divided by 4 which yields .25, and then 2 is added to .25 resulting in the final value 2.25. In Smalltalk, everything is an object which the programmer interacts with by sending messages. The literals 2, 1, and 4 are actually instances of SmallInteger objects. The + message is sent to 2 with an argument of 1, which responds by adding the value of the argument to its own value resulting in a value of 3. Then, the / message with the argument 4 is sent. The object responds by dividing its value by 4 which results in a final value of .75 for the expression.

Some things can easily be conceptualized as objects, such as windows, streams, and data structures, while it can be difficult to think of others as objects, such as text, integers, colors, and truth values. Over-personification and the perspective of the grammatical predicate/direct object performing actions is backwards of how we speak and how we think.

As awkward as the "everything is a" mindset it, I have to respect Kay for it. He intentionally adopted it so he and his team would be forced to think outside the box. Kay admits "we were actually trying for a qualitative shift in belief structures-- a new Kuhnian paradigm in the same spirit as the invention of the printing press-- and thus took highly extreme positions which almost forced these new styles to be invented." I can respect his effort and appreciate the elegant design it resulted in, but I can't expect an 8-year old to understand passing code blocks as arguments and why the order of operations is different on the computer than in the classroom.

Sunday, June 26, 2011

Smalltalk Challenge: Post 2 - Switching to Squeak

A quick survey of Smalltalk's keywords, constructs, and syntax isn't enough to understand how the language approaches programming. There is only a handful of keywords, no control structures, and the syntax is quirky in spots but is hardly exotic. To really get an appreciation, it's helpful to place the language in its historical context. Smalltalk wasn't intended to be a general purpose systems programming language like C, rather it was designed to allow ordinary people to take full advantage of their computers and to be easy enough for children to learn. The intended environment was immersive, interactive, and most-importantly visual. Smalltalk was more than a programming language, it was an entire operating environment.

Xerox was working on the Dynabook at its Palo Alto Research Center in the early 1970's. The Dynabook was to have many features we now see in personal laptops and tablet devices (such as touch screens, mice, windowed display managers, etc.). Alan Kay argued that existing languages were ill-suited and a new programming language was necessary for the end-user to realize the full potential of the system. He drew inspiration from the work of Jean Piaget and others in developmental psychology and constructionist learning, which set the direction of Smalltalk. The Dynabook project never came to fruition, but Smalltalk lives on, and both had a lasting influence on computing that we still see today.

Given this background, I suspected something like Squeak would really be better suited for learning Smalltalk as it was intended since Squeak is the modern implementation of the language and environment that was developed at Palo Alto. I asked Josh if it was okay to switch from GNU Smalltalk to Squeak. He agreed-- admitting that he only proposed GNU Smalltalk because it's what he already had on his system from Portage. I checked apt and was amazed to find Squeak in Ubuntu's repositories (with bets on the MotU pulling it without telling anyone within the next two releases). After running sudo apt-get install squeak-vm and starting Squeak, I was staring at a pretty gaudy looking graphical environment.

I took the liberty of running a "Hello World" program. The System Browser window is at the top, a Workspace window is in the middle, the Transcript window is at the bottom, and the Objects Library window is to the right. I entered the statement Transcript show: 'Hello World!' in the Workspace, highlighted it, and chose "do it" from the popup menu.

Every element is an object which you interact with by sending it messages. In the above example for instance, I interacted with the Transcript window by sending it the show: message and the argument 'Hello World!'. The window responded by displaying the text.

I did a screen recording of another example to help give you a better idea what interaction in the graphical environment can look like. I placed a Star object on the world (the main "desktop" area) and used its Inspection window to send the messages borderWidth: and rotationDegrees:. The object responded by redrawing itself with a thicker border and different rotation.

Have you been challenged to learn a new language, or have you been learning/working with Smalltalk? Feel free to leave your comments about it below.

Saturday, June 25, 2011

Smalltalk Challenge: Post 1 - Installing GNU Smalltalk

That's right... I'm the one who challenged my coworker Josh to open his mind beyond Java by spending time with a new language and blogging about it. l At first I challenged him to learn Oz, a language that combines the imperative, object-oriented, functional, logic, constraint, distributed, and concurrent programming paradigms (whew!). Unfortunately, apparently the 64-bit version of Mozart on Gentoo is broken at the moment and he didn't want to, as he put it, "build random shit on his box." So I proposed OCaml as an alternative. While it may not combine seven programming paradigms, I'm positive functional programming will be enough to show him there's more to life than what Java offers.

In return, for him to accept my challenge I had to agree to spend some time learning a language of his choosing too, and he set the number of required blog posts to 10. That's not too bad in my opinion since I enjoy looking at different programming languages anyway, but 10 posts seems a bit excessive (I would have been happy with 3). His first suggestion was Shakespeare. Really? I made him pick a more serious language, which turned out to be GNU Smalltalk. Yes, let the record show I rejected his first proposal, but he also rejected mine... and also let the record show that I'm not afraid to use a compiler.

My first attempt to install Smalltalk was through apt-get since I'm using an Ubuntu system right now, and of course I met up with the usual Ubuntu package bullshit...
sudo apt-get install smalltalk
E: Unable to locate package smalltalk

sudo apt-get install gnu-smalltalk
Package gnu-smalltalk is not available, but is referred to by
another package. This may mean that the package is missing,
has been obsoleted, or is only available from another source

E: Package 'gnu-smalltalk' has no installation candidate
It was the SpiderMonkey fiasco all over again. Apparently GNU Smalltalk was removed from Ubuntu back in Lucid Lynx. I had to go to smalltalk.gnu.org instead and download the source. In just a few minutes I had a working binary using the usual configure, make, make install mantra.

There was one minor snag during compilation, though. Initially libtool reported that it couldn't find the name of the link library for libc.la. A quick Google search showed all I had to do was remove libc.la and run make again.

Now it's time to learn some Smalltalk!

Update (6/29/11): I was able to compile GNU Smalltalk without experiencing the libc.la issue by running autoreconf -vi before configure.

Saturday, June 4, 2011

Reading Unicode (UTF-8) in C

In working on scanner code for Kiwi I did a bit of reading up on Unicode. It's not really as difficult as one might think parsing UTF-8 character by character in C. In the end I opted to use ICU so I could take advantage of its character class functions instead of rolling my own, but the by-hand method I thought was still worth sharing.

Functions like getc() read in a byte from an input stream. ASCII was the predominant encoding scheme and encoded characters in 7-8 bits, so reading a byte was effectively the same as reading a character. But you can only represent 255 characters using 8 bits, far too little to represent all the characters of the world's languages. The most common Unicode scheme is UTF-8, is a multi-byte encoding scheme capable of representing over 2 million characters using 4 bytes or less.

The 128 characters of 7-bit ASCII encoding scheme are encoded the same, the most-significant bit is always 0. Other characters can be encoded as multiple bytes but the most-significant bit in such cases is always 1. The bit pattern in the first byte of such multi-byte characters too that make it easy to determine exactly how many bytes are used for the character.


The first bit serves as a flag. If set, the character is encoded using multiple bytes, and if not, the character is encoded using just one. Subsequent bits indicate the number of bytes used. Each byte in a multi-byte encoded character will aways start with the pattern "10".

With that knowledge, one can write a wrapper that reads characters in from a UTF-8 encoded source one at a time. getc() reads a byte at a time, and a quick peek at the significant bits determines if additional bytes must be read to fetch the entire character.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define U_MAX_BYTES 4

unsigned short u_getc(FILE *stream, char *bytes) {
    /* mask values for bit pattern of first byte in multi-byte
     UTF-8 sequences: 
       192 - 110xxxxx - for U+0080 to U+07FF 
       224 - 1110xxxx - for U+0800 to U+FFFF 
       240 - 11110xxx - for U+010000 to U+1FFFFF */
    static unsigned short mask[] = {192, 224, 240}; 

    unsigned short i, j; 

    /* initialize buffer */
    memset(bytes, 0, U_MAX_BYTES + 1); 

    /* read first byte into buffer */
    bytes[0] = getc(stream);
    if (bytes[0] == EOF) { 
        return 0; 
    } 

    /* check how many more bytes need to be read for
     character */
    i = 0;
    if ((bytes[0] & mask[0]) == mask[0]) i++;
    if ((bytes[0] & mask[1]) == mask[1]) i++;
    if ((bytes[0] & mask[2]) == mask[2]) i++;

    /* read subsequent character bytes */
    j = 0;
    while (j < i) {
        j++;
        bytes[j] = getc(stream);
    }

    /* return number of bytes read into buffer */
    return i + 1;
}
The wrapper function u_getc() is passed an open file handle and a buffer to store the character. It returns the number of bytes read into the buffer.
int main(int argc, char *argv[]) {
    /* allocating +1 for null gives ability to print character
     as string */
    char *bytes = (char*)calloc(U_MAX_BYTES + 1, sizeof(char)); 

    /* read and print until end of file */
    while (u_getc(stdin, bytes)) { 
        printf("%s\n", bytes); 
    }

    return 0; 
}
Grouping characters into tokens such as operators and numeric literals is easy because UTF-8 is backward compatible with 7-bit ASCII, which allows performing character tests like byte[0] == '+' and isdigit(byte[0]). But it was the need to test multi-byte sequences that ultimately lead me to use ICU. Writing and testing a function that properly identifies membership in the set of 14,725+ possible characters that could appear in an identifier name wasn't a tempting prospect.