Saturday, October 28, 2006

How poopy is YOUR code?

After numerous years at the top, object-oriented still reigns as the number-one programming buzzword (this claim is based on a wide-ranging, highly scientific, double-blind study of my opinion on the subject). I find this interesting because, in my observation, programmers rarely use OOP. They may use OO languages, but your typical chunk of code is rarely terribly OOP-y. Despite this, there has been an obsession with OOP over the last 20 years that has possibly obscured more significant techniques.

Even the experts do not agree on what OOP truly is, but it is most
often associated with the following:
  • Subclassing
  • Interfaces
  • Polymorphism
  • Encapsulation
All of these things are programming techniques, rather than inherently desirable qualities. So why do we spend all our time writing stuff like ProtocolFactory.Instance().Create(new AddressAdapter(anAddress, aPort))? In a word, maintainability. In case you haven't heard, maintainable code is defined as "code I like". "Your code isn't maintainable," should you ever hear it, usually means, "I don't like your code but I'm not sure why exactly and/or I don't feel like explaining myself." (If someone is really unhappy with your code, rather than say it's unmaintainable, they'll say that it "doesn't scale".) Cynics would point out, here, that 20 years of struggling to use OO techniques correctly has been motivated solely by something this subjective. Of course, I don't agree with these cynics.

In my experience, OO has been useful for a few things. For one thing, it's a way to see if a programmer can understand abstract concepts such as subclassing, not only because they will periodically stumble across a use for them, but also because it provides a sense of how well they'll be able to learn all the other tools that they'll need to do their job. I.e., being able to learn to use programming techniques, in general, is a useful skill in and of itself. As with anything, though, simply using tools doesn't necessarily mean they're being used correctly or appropriately. In fact, in programming, emphasis on techniques has not been accompanied by a serious focus on why those techniques should be used. See, that problem is considered too subjective. It's easier to give the same tired example of an Animal class hierarchy than it is to explain why polymorphism should be used to make good software in some real-life case.

Fortunately, I don't think it has to be that hard or subjective. So, enough theorizing for now; let's get down to a real example. What is the biggest problem with the following code?
// Socket.h

struct Socket {
    ...mumble...
};

Socket CreateSocket();

void Open(Socket s, Address addr, Port port);

// Returns number of bytes actually read.
int Read(Socket s, byte[] outBuffer);

// Returns number of bytes actually written.
int Write(Socket s, byte[] buffer);

void Close(Socket s);
(Ok, no points to the wiseacre who said, "It's written in Blub.")

One could point out that it's not very OOP-y. It doesn't use encapsulation (at least not officially), in that the Socket struct is just a plain old struct. It has no class definition, and doesn't implement an interface. It also doesn't make use of polymorphism! What if I want to be able to pass this Socket to something that expects a Stream?
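
For concreteness, that last complaint is asking for something like the following hypothetical Stream interface, sketched here as hand-rolled polymorphism with function pointers (none of these names appear in the API above):

// A hypothetical Stream "interface": a struct of function pointers, which
// is roughly how polymorphism gets done by hand in C.
struct Stream {
    void *impl;                                  // the underlying object, e.g. a Socket
    int (*Read)(void *impl, byte[] outBuffer);   // same contracts as the Socket functions
    int (*Write)(void *impl, byte[] buffer);
};

// An adapter so that anything expecting a Stream can be handed a Socket.
Stream AsStream(Socket s);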

These might be valid concerns in one situation or another, but remain, nonetheless, mostly mechanical issues. The big problem is that this code is difficult to use correctly. Worst of all, it's difficult to debug. For example, you can write this:
// More pseudo-C
Socket s = CreateSocket();
Write(s, myBuffer);
In this case Open() was never called, but to figure out that this is a problem requires diagnosing a mysterious failure at run-time, which may not be easy to track down to its source. And while the above is a rather trivial case that you'd likely notice simply by looking at it, there are more insidious and realistic examples, such as:
void Handshake(Socket s)
{
    const byte[] header = { 0xBE, 0xEF, 0xBE, 0xEF };
    if (Write(s, header) != header.length) {
        ...mumble...
    }
}

// The following is in a file/library far, far away
...
void Mumble(...)
{
    ...
    Socket s1 = CreateSocket();
    Open(s1, "www.google.com", 80);
    Socket s2 = CreateSocket();
    Open(s1, "localhost", 2341); // OOPS!
    Handshake(s1);
    Handshake(s2); // BOOM!
    ...
}
This could easily happen, just a slip of the finger while typing. Since this code is written in C, the results will be catastrophic and hard to diagnose. This code is like a city sidewalk filled with open manholes; you can avoid them, but you have to watch where you step. And did I mention you're in one of those large, cold, unfriendly cities where your fellow pedestrians are unlikely to warn you of your imminent demise?

This does actually relate to OOP. The problem is that programmers are taught all about how to write OO code, and how doing so will improve the maintainability of their code. And by "taught", I don't just mean "taken a class or two". I mean: have it pounded into your head in school, spend years as a professional being mentored by senior OO "architects", and only then finally kind of understand how to use it properly, some of the time. Most engineers wouldn't consider using a non-OO language, even if it had amazing features ... the hype is that major.

So what, then, about all that code programmers write before their 10-year OO apprenticeship is complete? Is it just doomed to suck? (Well, obviously I don't think so because, if I didn't, I wouldn't be writing this essay.) Of course not, as long as they apply techniques other than OO. These techniques are out there but aren't as widely discussed. Going back to the example, here's an improved version.
// Socket.h

enum MagicNumbers {
    OpenSocket   = 0xE474BEEF,
    ClosedSocket = 0xBEEF4A11,
};

struct Socket {
    ...
    enum MagicNumbers magicNumber;
};

void Open(Socket s, Address addr, Port port)
{
    ...
    s.magicNumber = OpenSocket;
}

// Returns number of bytes actually read.
int Read(Socket s, byte[] outBuffer)
{
    Assert(s.magicNumber != ClosedSocket, "you already Closed that Socket!");
    Assert(s.magicNumber == OpenSocket, "using never-Opened or corrupt Socket!");
    ...
}

// Returns number of bytes actually written.
int Write(Socket s, byte[] buffer)
{
    ...same checks as Read...
}

void Close(Socket s)
{
    Assert(s.magicNumber == OpenSocket, "attempt to Close an unopened Socket!");
    s.magicNumber = ClosedSocket;
    ...
}
The improvement here has little to do with any specific programming technique (in fact, there are better ways to implement the change). It's more a matter of empathy; in this case, for the programmer who might have to use your code. The author of this code actually thought through what kinds of mistakes another programmer might make, and strove to make the computer tell the programmer what they did wrong.
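
For what it's worth, one of those better ways, just as a sketch (these signatures are made up, not part of the API above): make Socket opaque, so the only way to get one is from Open(); then "write before open" isn't even expressible, and the run-time check is only needed for use-after-Close.

// Socket.h, take three (a hypothetical sketch)

typedef struct Socket Socket;   // definition hidden away in Socket.c

// The only way to get a Socket is to Open one. Returns NULL on failure.
Socket *Open(Address addr, Port port);

int Read(Socket *s, byte[] outBuffer);
int Write(Socket *s, byte[] buffer);

// Close frees the Socket; the magic-number trick is still handy for
// catching anyone who holds on to the pointer afterwards.
void Close(Socket *s);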

In my experience the best code, like the best user interfaces, seems to magically anticipate what you want (or need) to do next. Yet it's discussed infrequently relative to OO. Maybe what's missing is a buzzword. So let's make one up, Programming fOr Others, or POO for short.

One good example of POO in action is sqlite. One of the many relational database technologies out there, this one distinguishes itself by just working. I downloaded it and ran it without any configuration, and it just kind of did what I expected: it gave me a SQL prompt. You'd think this would be obvious, but setting up most databases is far from trivial, involving setting up users, passwords, config files, etc. It's as if the sqlite programmers actually read Futurist Programming Notes. Sqlite doesn't restrict its poopy behavior to startup. When you create a table, for example, you don't have to specify the type of each column. It just lets you put in whatever the heck you want. Of course, you may specify the type if you want to, but the point is that the creators of this fine piece of software actually realized that you just might not care.
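
For a taste of what that no-fuss experience looks like from code, here's a minimal sketch against sqlite's C API (the file and table names are made up):

#include <sqlite3.h>
#include <stdio.h>

int main(void)
{
    sqlite3 *db;

    // No server process, no users, no config files: opening a file is the
    // entire setup step.
    if (sqlite3_open("scratch.db", &db) != SQLITE_OK) {
        fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        return 1;
    }

    // Column types are optional; sqlite happily stores whatever you hand it.
    sqlite3_exec(db, "CREATE TABLE notes(id, body);", NULL, NULL, NULL);
    sqlite3_exec(db, "INSERT INTO notes VALUES(1, 'hello');", NULL, NULL, NULL);

    sqlite3_close(db);
    return 0;
}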

By contrast, there are numerous software packages and APIs that aren't, frankly, very POO. For example, take your typical, basic IO API:
int read(File f, char *buffer, int bytesToRead);
int write(File f, char *buffer, int bytesToWrite);
This code has a major (albeit subtle) issue: the values chosen for bytesToRead and bytesToWrite may have serious performance implications. But how do you even know that in the first place? And once you do, what do you do about it? You have little choice but to conduct a series of laborious experiments to figure out the best buffer size to use in each particular case. Maybe on some machines it's the size of a page, maybe on others a different size. And of course it might change with the next revision of your operating system, etc.
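
In practice, every caller ends up baking in a guess, something like this hypothetical copy loop, where 8192 is whatever number happened to win a benchmark on one machine, one day:

// A hypothetical caller. The 8192 is a guess, nothing more.
#define COPY_BUFFER_SIZE 8192

void CopyAll(File src, File dst)
{
    char buffer[COPY_BUFFER_SIZE];
    int n;
    while ((n = read(src, buffer, COPY_BUFFER_SIZE)) > 0) {
        write(dst, buffer, n);
    }
}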

At this point you may be wondering, "Gee, so it's a little hard to use, what's the big deal?" Well, imagine 5 people were standing around a room talking, and one of them wanted to know what 3.1451515 * 92.003 was. Now I haven't told you one critical piece of information, which is that there is, in addition to the 5 people already mentioned, a computer in the corner of the room. Given all this, what would you expect the people to do:
  • Bust out some paper and pencils and try to remember how to do long multiplication by hand, or,
  • Use the computer?
This wasn't a hard question to answer. Yet, for some reason, in almost the exact same situation (ok, not the exact same situation, but closer than most people would like to admit), programmers tend to leave a problem (such as figuring out optimal buffer sizes) to (human) programmers rather than letting the computer figure it out.
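
A more POO-friendly IO API might look something like this sketch (every name in it is hypothetical); the point is that buffer sizing becomes the library's problem, not yours:

// The library owns the buffering strategy, so it can tune transfer sizes
// per platform, per file system, or per file, and the caller never knows.
BufferedFile *OpenBuffered(const char *path, const char *mode);

// Read or write as much or as little as you like; the library decides how
// big the underlying transfers should actually be.
int BufferedRead(BufferedFile *f, char *out, int bytesWanted);
int BufferedWrite(BufferedFile *f, const char *data, int bytesToWrite);

// Or skip the loop entirely for the common case.
void CopyFile(const char *srcPath, const char *dstPath);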
