Thursday, May 26, 2016

Please Don't Write Bruby

This doc is about how to write Ruby code instead of writing hybrid yuck Bruby code. Ruby code, is, well, Ruby code; i.e., code written in Ruby and not some other language. I strongly believe that once you have chosen to write code in Ruby, you should try to keep writing code in Ruby. Specifically, please don't write Bruby, which is an unholy mishmash of Bash and Ruby. Bruby is hard to read, flaky, slower (usually), and always harder to reason about. It's even harder to write. There is almost no reason to shell out to Bash from Ruby.

Here's a canonical example of Bruby code:

output = `somecommand | egrep -v '^snord[123]' | sort | uniq`
if $? == 0
  # ...do stuff with 'output' such as:
  foo output
end

There are a number of problems with this code. For one thing, it's unnecessary to use Bash in the first place. Ruby is just as good at grepping, regular expressions, sorting and uniqifying. The more you mix different programming languages, the more the reader of the code has to switch gears mentally. As a frequent code-reviewer, I implore you: please don't make your readers switch gears mentally (unless there's a really good reason), it causes brain damage after a while. Also, Bash pipelines easily hide bugs and mask failures. By default, Bash ignores errors from an "interior" command in a pipeline, as the following code illustrates:

def lower_nuclear_plant_control_rods
  # The following line will *not* raise an exception, even though it
  # will barf.
  rods = `cat /etc/plants/springfield/control_rods | sort --revesre | uniq`
  # 'rods' is now an empty string and $? will be 0, indicating that 
  # everything is ok, when it isn't.
  if $? != 0
    # This will never happen. Millions of lives will be lost in the
    # ensuing calamity.
    page_operator "Springfield Ops Center", :sev_1, "Meltdown Imminent"
  else
    rods.split.each do |rod|
      lower rod
    end
  end
end

Despite the subtly bad arguments to sort, the above code won't raise an exception, and will fail to lower the control rods and also fail to notify the operators (because $? will be 0). Thus the town of Springfield will be wiped off the map. Do you want the same type of errors to accidentally blow away all of your production databases from an errant invocation of some admin script?

One way to fix the above would be to add the pipefail option into the string of Bash code:

rods = `set -o pipefail; cat /etc/plants/springfield/control_rods | sort --revesre | uniq`

But that's easy to forget, difficult to mechanistically catch, and ugly to read. The better solution is to remember that you can write Ruby in Ruby:

def lower_nuclear_plant_control_rods
  File.readlines('/etc/plants/springfield/control_rods').sort.reverse.uniq do |rod|
    lower rod
  end
rescue
  page_operator "Springfield Ops Center", :sev_1, "Meltdown Imminent"
  raise
end

The above code is shorter, easier to read (no switching mental gears) and is guaranteed to raise exceptions if something is wrong. The specific changes are:

cat
Ruby's good at opening files, just use File.readlines if you want to read a file line-by-line.
sort
Ruby Enumerables have sort built in.
--reverse
Use Enumerable's built-in reverse method.
uniq
Use Enumerable's uniq method.
Error-handling
Any errors in the invocation (for example misspelling reverse as revesre) will result in an exception being raised).

Tips

The rest of this document contains a series of tips to help you write Ruby instead of Bruby.

Tip 1: Google for Pure Ruby Bash Equivalents

Whenever you're tempted to shell out in a Ruby script, stop, flagellate yourself 23 times with a birch branch, and then Google for an alternative in Pure Ruby. For example, imagine your script must create a directory, and not fail if the directory already exists, and also create any intermediary directories. But you don't want to write that code yourself because it's yucky and complicated and you know you'll fail to handle some edge condition properly. If you're familiar with Unix, your first inclination might be to write Bruby:

system "mkdir -p '#{ROOT_DIR}'"
if $? != 0
  raise "Unable to create directory #{ROOT_DIR}"
end

The above is clunky. It's also easy to forget to check the value of $?, in which case your code will silently continue even though the directory was not created. Fortunately, this is easy to fix. Just Google for it (after flagellating yourself). You will see that Ruby has a built-in version of mkdir -p.

require 'fileutils'
FileUtils.mkdir_p ROOT_DIR

This version is shorter and easier to read, and, most importantly, raises an exception if anything went wrong:

irb(main):666:0> FileUtils.mkdir_p "/bogon/adfadf"
Errno::EACCES: Permission denied @ dir_s_mkdir - /bogon
 from /usr/local/Cellar/ruby/2.3.0/lib/ruby/2.3.0/fileutils.rb:253:in `mkdir'
 from /usr/local/Cellar/ruby/2.3.0/lib/ruby/2.3.0/fileutils.rb:253:in `fu_mkdir'
 from /usr/local/Cellar/ruby/2.3.0/lib/ruby/2.3.0/fileutils.rb:227:in `block (2 levels) in mkdir_p'
 from /usr/local/Cellar/ruby/2.3.0/lib/ruby/2.3.0/fileutils.rb:225:in `reverse_each'
 from /usr/local/Cellar/ruby/2.3.0/lib/ruby/2.3.0/fileutils.rb:225:in `block in mkdir_p'
 from /usr/local/Cellar/ruby/2.3.0/lib/ruby/2.3.0/fileutils.rb:211:in `each'
 from /usr/local/Cellar/ruby/2.3.0/lib/ruby/2.3.0/fileutils.rb:211:in `mkdir_p'
 from (irb):666
 from /usr/local/bin/irb:11:in `<main>'

This obnoxious error might be obnoxious, but it also might save your life.

Tip 2: Keep unavoidable Bash usage to a minimum

Sometimes it does make sense to shell out. For example, it is hard to find a Ruby equivalent of the command-line dig DNS utility. In these situations, don't throw out the baby with the bathwater; keep the Bash to a minimum. For example, in Bruby, you would write:

dig_command = "dig +qr www.springfield.us any -x 127.0.0.1 isc.org ns +noqr"
mail_server = `#{dig_command} | egrep -w MX | awk '{print $6}'`.chomp

Whereas in Ruby, by contrast, you should use Bash only to run dig, everything else (such as processing the standard output of dig) should be done in Ruby. The built-in Open3 module makes this straightforward in many cases:

require 'open3'
output, error, status = Open3.capture3 dig_command

Open3.capture3 returns a stream containing the standard output of running dig_command, as well as the status of running the command. The next tip covers how to actually replicate the rest of the above pipeline that populated the mail_server variable.

Tip 3: Use Enumerable to replace Bash pipelines

A distinguishing feature of Bash is its ability to chain commands together using pipes, as illustrated in the previous tip:

mail_server = `#{dig_command} | egrep -w MX | awk '{print $6}'`.chomp

Most of the time you can replicate pipelines using Ruby's built-in Enumerable module. Almost everything you think might be an Enumerable actually is an Enumerable: arrays, strings, open files, etc. In particular, if you're trying to convert Bruby to Ruby, you can use methods like IO.popen, or better yet the methods in Open3, to get an Enumerable (or else a string, which can be converted into one with split) over the standard output of that process. From there, you can take advantage of methods such as Enumerable.grep (which, for example, seamlessly handles regular expressions).

In those cases where Enumerable itself doesn't immediately solve the problem, you have the Ruby programming language itself at your beck and call. For example, many of the features of Awk can be found directly in Ruby (if you did enough programming language research you'd probably dig up some indirect connection between the two languages, but that shall be left as an exercise for the reader).

Here are all the equivalents of each part of the above Bash pipeline.

Bruby Ruby
`#{dig_command}` Open3.capture3(dig_command)
egrep -w MX split("\n").grep(/\WMX\W/)
awk '{print $6}' split[5]

Putting it all together:

require 'open3'
dig_command = "dig +qr www.springfield.us any -x 127.0.0.1 isc.org ns +noqr"
o, e, s = Open3.capture3 dig_command
mail_server = if s.success?
                o.split("\n").grep(/\WMX\W/)[0].split[5]
              else
                raise "Failed: #{dig_command}\n#{e}"
              end

Here are some other common mappings from Bash to Enumerable methods:

Bash Enumerable
head -n 2 take(2)
tail -n 2 reverse_each.take(2)
sort sort
uniq to_a.uniq
grep grep
grep -v grep_v

Thursday, April 14, 2016

A Guide to Naming Variables

Names Considered Useful

Software is written for people to understand; variable names should be chosen accordingly. People need to comb through your code and understand its intent in order to extend or fix it. Too often, variable names waste space and hinder comprehension. Even well-intentioned engineers often choose names that are, at best, only superficially useful. This document is meant to help engineers choose good variable names. It artificially focuses on code reviews because they expose most of the issues with bad variable names. There are, of course, other reasons to choose good variable names (such as improving code maintenance). This document is also a work-in-progress, please send me any constructive feedback you might have on how to improve it.

Why Name Variables?

The primary reason to give variables meaningful names is so that a human can understand them. Code written strictly for a computer could just as well have meaningless, auto-generated names1:

int f1(int a1, Collection<Integer> a2)
{
  int a5 = 0;
  for (int a3 = 0; a3 < a2.size() && a3 < a1; a3++) {
    int a6 = a2.get(a3);
    if (a6 >= 0) {
      System.out.println(a6 + " ");
    } else {
      a5++;
    }
  }
  System.out.println("\n");
  return a5;
}

All engineers would recognize the above code is needlessly difficult to understand, as it violates two common guidelines: 1) don't abbreviate, and 2) give meaningful names. Perhaps surprisingly, these guidelines can be counter-productive. Abbreviation isn't always bad, as will be discussed later. And meaningful is vague and subject to interpretation. Some engineers think it means that names should always be verbose (such as MultiDictionaryLanguageProcessorOutput). Others find the prospect of coming up with truly meaningful names daunting, and give up before putting in much effort. Thus, even when trying to follow the above two rules, a coder might write:

int processElements(int numResults, Collection<Integer> collection)
{
  int result = 0;
  for (int count = 0; count < collection.size() && count < numResults; count++) {
    int num = collection.get(count);
    if (num >= 0) {
      System.out.println(num + " ");
    } else {
      result++;
    }
  }
  System.out.println("\n");
  return result;
}

Reviewers could, with effort, understand the above code more easily than the first example. The variable names are accurate and readable. But they're unhelpful and waste space, because:

processElements
most code "processes" things (after all, code runs on a "processor"), so process is seven wasted characters that mean nothing more that "compute". Elements isn't much better. While suggestive that the function is going to operate on the collection, that much was already obvious. There's even a bug in the code that this name doesn't help the reader spot.
numResults
most code produces "results" (eventually); so, as with process, Results is seven wasted characters. The full variable name, numResults is suggestive that it might be intended to limit the amount of output, but is vague enough to impose a mental tax on the reader.
collection
wastes space; it's obvious that it's a collection because the previous tokens were Collection<Integer>.
num
simply recapitulates the type of the object (int)
result, count
are coding cliches; as with numResults they waste space and are so generic they don't help the reader understand the code.

However, keep in mind the true purpose of variable names: the reader is trying to understand the code, which requires both of the following:

  1. What was the coder's intent?
  2. What does the code actually do?

To see how the longer variable names that this example used are actually a mental tax on the reader, here's a re-write of the function showing what meaning a reader would actually glean from those names:

int doSomethingWithCollectionElements(int numberOfResults, 
                                      Collection<Integer> integerCollection)
{
  int resultToReturn = 0;
  for (int variableThatCountsUp = 0; 
       variableThatCountsUp < integerCollection.size() 
         && variableThatCountsUp < numberOfResults; 
       variableThatCountsUp++) {
    int integerFromCollection = integerCollection.get(count);
    if (integerFromCollection >= 0) {
      System.out.println(integerFromCollection + " ");
    } else {
      resultToReturn++;
    }
  }
  System.out.println("\n");
  return resultToReturn;
}

The naming changes have almost made the code worse than the auto-generated names, which, at least, were short. This rewrite shows that coder's intent is still mysterious, and there are now more characters for the reader to scan. Code reviewers review a lot of code; poor names make a hard job even harder. How do we make code reviewing less taxing?

On Code Reviews

There are two taxes on code reviewers' mental endurance: distance and boilerplate. Distance, in the case of variables, refers to how far away a reviewer has to scan, visually, in order to remind themselves what a variable does. Reviewers lack the context that coders had in mind when they wrote the code; reviewers must reconstruct that context on the fly. Reviewers need to do this quickly; it isn't worth spending as much time reviewing code as it took to write it2. Good variable names eliminate the problem of distance because they remind the reviewer of their purpose. That way they don't have to scan back to an earlier part of the code.

The other tax is boilerplate. Code is often doing something complicated; it was written by someone else; reviewers are often context-switching from their own code; they review a lot of code, every day, and may have been reviewing code for many years. Given all this, reviewers struggle to maintain focus during code reviews. Thus, every useless character drains the effectiveness of code reviewing. In any one small example, it's not a big deal for code to be unclear. Code reviewers can figure out what almost any code does, given enough time and energy (perhaps with some follow-up questions to the coder). But they can't afford to do that over and over again, year in and year out. It's death by 1,000 cuts.

A Good Example

So, to communicate intent to the code reviewer, with a minimum of characters, the coder could rewrite the code as follows:

int printFirstNPositive(int n, Collection<Integer> c)
{
  int skipped = 0;
  for (int i = 0; i < c.size() && i < n; i++) {
    int maybePositive = c.get(i);
    if (maybePositive >= 0) {
      System.out.println(maybePositive + " ");
    } else {
      skipped++;
    }
  }
  System.out.println("\n");
  return skipped;
}

Let's analyze each variable name change to see why they make the code easier to read and understand:

printFirstNPositive
unlike processElements, it's now clear what the coder intended this function to do (and there's a fighting chance of noticing a bug)
n
obvious given the name of the function, no need for a more complicated name
c
collection wasn't worth the mental tax it imposed, so at least trim it by 9 characters to save the reader the mental tax of scanning boilerplate characters; since the function is short, and there's only one collection involved, it's easy to remember that c is a collection of integers
skipped
unlike results, now self-documents (without a comment) what the return value is supposed to be. Since this is a short function, and the declaration of skipped as an int is plain to see, calling it numSkipped would have just wasted 3 characters
i
iterating through a for loop using i is a well-established idiom that everyone instantly understands. Give that count was useless anyway, i is preferable since it saves 4 characters
maybePositive
num just meant the same thing int did, whereas maybePositive is hard to misunderstand and may help one spot a bug

It's also easier, now, to see there are two bugs in the code. In the original version of the code, it wasn't clear if that the coder intended to only print positive integers. Now the reader can notice that there's a bug, because zero isn't positive (so n should be greater than 0, not greater-than-or-equals). (There should also be unit tests). Furthermore, because the first argument is now called maxToPrint (as opposed to, say, maxToConsider), it's clear the function won't always print enough elements if there are any non-positive integers in the collection. Rewriting the function correctly is left as an exercise for the reader.

Naming Tenets (Unless You Know Better Ones)

  • As coders our job is to communicate to human readers, not computers.
  • Don't make me think. Names should communicate the coder's intent so the reader doesn't have to try to figure it out.
  • Code reviews are essential but mentally taxing. Boilerplate must be minimized, because it drains reviewers' ability to concentrate on the code.
  • We prefer good names over comments but can't replace all comments.

Cookbook

To live up to these tenets, here are some practical guidelines to use when writing code.

Don't Put the Type in the Name

Putting the type of the variable in the name of the variable imposes a mental tax on the reader (more boilerplate to scan over) and is often a poor substitute for thinking of a better name. Modern editors like Eclipse are also good at surfacing the type of a variable easily, making it redundant to add the type into the name itself. This practice also invites being wrong; I have seen code like this:

Set<Host> hostList = hostSvc.getHosts(zone);

The most common mistakes are to append Str or String to the name, or to include the type of collection in the name. Here are some suggestions:

Bad Name(s) Good Name(s)
hostList, hostSet hosts, validHosts
hostInputStream rawHostData
hostStr, hostString hostText, hostJson, hostKey
valueString firstName, lowercasedSKU
intPort portNumber

More generally:

  • Pluralize the variable name instead of including the name of a collection type
  • If you're tempted to add a scalar type (int, String, Char) into your variable name, you should either:
    • Explain better what the variable is
    • Explain what transformation you did to derive the new variable (lowercased?)

Use Teutonic Names Most of The Time

Most names should be Teutonic, following the spirit of languages like Norwegian, rather than the elliptical vagueness of Romance languages like English. Norwegian has more words like tannlege (literally "tooth doctor") and sykehus (literally "sick house"), and fewer words like dentist and hospital (which don't break down into other English words, and are thus confusing unless you already know their meaning). You should strive to name your variables in the Teutonic spirit: straightforward to understand with minimal prior knowledge.

Another way to think about Teutonic naming is to be as specific as possible without being incorrect. For example, if a function is hard-coded to only check for CPU overload, then name it overloadedCPUFinder, not unhealthyHostFinder. While it may be used to find unhealthy hosts, unhealthyHostFinder makes it sound more generic that it actually is.

// GOOD
Set<Host> overloadedCPUs = hostStatsTable.search(overCPUUtilizationThreshold);
// BAD
Set<Host> matches = hostStatsTable.search(criteria);

// GOOD
List<String> lowercasedNames = people.apply(Transformers.toLowerCase());
// BAD
List<String> newstrs = people.apply(Transformers.toLowerCase());

// GOOD
Set<Host> findUnhealthyHosts(Set<Host> droplets) {  }
// BAD
Set<Host> processHosts(Set<Host> hosts) {  }

The exceptions to the Teutonic naming convention are covered later on in this section: idioms and short variable names.

It's also worth noting that this section isn't suggesting to never use generic names. Code that is doing something truly generic should have a generic name. For example, transform in the below example is valid because it's part of a generic string manipulation library:

class StringTransformer {
  String transform(String input, TransformerChain additionalTransformers);
}

Move Simple Comments Into Variable Names

As illustrated earlier, variable names cannot (and should not) replace all comments. But if a short comment can fit into the variable name, that's probably where it should go. This is because:

  • It's less visual clutter for the code reviewer to wade through (comments are a mental tax, so should provide true value)
  • If a variable is used a long distance from the comment, the code reviewer doesn't have to break their focus and scroll back to the comment to understand the variable

For example,

// BAD
String name; // First and last name
// GOOD
String fullName;

// BAD
int port; // TCP port number
// GOOD
int tcpPort;

// BAD
// This is derived from the JSON header
String authCode;
// GOOD
String jsonHeaderAuthCode;

Avoid Over-used Cliches

In addition to not being Teutonic, the following variable names have been so horribly abused over the years that they should never be used, ever.

  • val, value
  • result, res, retval
  • tmp, temp
  • count
  • str

The moratorium on these names extends to variations that just add in a type name (not a good idea anyway), such as tempString, or intStr, etc.

Use Idioms Where Meaning is Obvious

Unlike the cliches, there are some idioms that are so widely-understood and unabused that they're safe to use, even if by the letter of they law they're too cryptic. Some examples are (these are Java/C specific examples, but the same principles apply to all languages):

  • use of i, j and k in straightforward for loops
  • use of n for a limit/quantity when it's obvious what it would do
  • use of e for an Exception in a catch clause
// OK
for (int i = 0; i < hosts.size(); i++) { }

// OK
String repeat(String s, int n);

Warning: idioms should only be used in cases where it's obvious what they mean.

May Use Short Names Over Short Distances When Obvious

Short names, even one-letter names, are preferable in certain circumstances. When reviewers see a long name, they tend to think they should pay attention to them, and then if the name turns out to be completely useless, they just wasted time. A short name can convey the idea that the only useful thing to know about the variable is its type. So it is okay to use short variable names (one- or two-letter), when both of the following are true:

  1. The distance between declaration and use isn't great (say within 5 lines, so the declaration is within the reader's field of vision)
  2. There isn't a better name for the variable than its type
  3. The reader doesn't have too many other things to remember at that point in the code (remember, studies show people can remember about 7 things).

Here's an example:

void updateJVMs(HostService s, Filter obsoleteVersions)
{
  // 'hs' is only used once, and is "obviously" a HostService
  List<Host> hs = s.fetchHostsWithPackage(obsoleteVersions);
  // 'hs' is used only within field of vision of reader
  Workflow w = startUpdateWorkflow(hs);
  try {
    w.wait();
  } finally {
    w.close();
  }
}

You could also write:

void updateJVMs(HostSevice hostService, Filter obsoleteVersions)
{
  List<Host> hosts = hostService.fetchHostsWithPackage(obsoleteVersions);
  Workflow updateWorkflow = startUpdateWorkflow(hosts);
  try {
    updateWorkflow.wait();
  } finally {
    updateWorkflow.close();
  }
}

But this takes up more space with no real gain; the variables are all used so close to their source that the reader has no trouble keeping track of what they mean. Furthermore, the long name updateWorkflow signals to the reviewer that there's something significant about the name. The reviewer then loses mental energy determining that the name is just boilerplate. Not a big deal in this one case, but remember, code reviewing is all about death by 1,000 cuts.

Remove Thoughtless One-Time Variables

One-time variables (OTVs), also known as garbage variables, are those variables used passively for intermediate results passed between functions. They sometimes serve a valid purpose (see the next cookbook items), but are often left in thoughtlessly, and just clutter up the codebase. In the following code fragment, the coder has just made the reader's life more difficult:

List<Host> results = hostService.fetchMatchingHosts(hostFilter);
return dropletFinder(results);

Instead, coders should make a pass through their code and simplify to:

return dropletFinder(hostService.fetchMatchingHosts(hostFilter));

Use Short OTVs to Break Long Lines

Sometimes you need an OTV in order to break up a really long line:

List<Host> hs = hostService.fetchMatchingHosts(hostFilter);
return DropletFinderFactoryFactory().getFactory().getFinder(region).dropletFinder(hs);

This is ok, although since the distance is so short, you can give the OTV and one- or two-letter name (hs) to save visual clutter.

Use Short OTVs to Break up Complicated Expressions

It can be difficult to read code like this:

return hostUpdater.updateJVM(hostService.fetchMatchingHosts(hostFilter),
                             JVMServiceFactory.factory(region).getApprovedVersions(),
                             RegionFactory.factory(region).getZones());

So it's ok to write:

List<Host> hs = hostService.fetchMatchingHosts(hostFilter);
Set<JVMVersions> js = JVMServiceFactory.factory(region).getApprovedVersions(),
List<AvailabilityZone> zs = RegionFactory(region).getZones();
return hostUpdater.updateJVM(hs, js, zs);

Again, because the distances are all short and the meanings are all obvious, short names can be used.

Use Longer OTVs to Explain Confusing Code

Sometimes you need OTVs simply to explain what code is doing. For example, sometimes you have to use poorly-named code that someone else (ahem) wrote, so you can't fix the name. But you can use a meaningful OTV to explain what's going on. For example,

List<Column> pivotedColumns = olapCube.transformAlpha();
updateDisplay(pivotedColumns);

Note in this case you should not use a short OTV:

List<Column> ps = olapCube.transformAlpha();
updateDisplay(ps);

That just adds too much visual clutter without helping to explain the code sufficiently.

Updates

26-May-2016
fixed some of the issues brought up in the comments - thanks for the feedback!

Footnotes:

1
This code example is used for illustrative purposes only and is not meant to be an example of well-designed code, variable names notwithstanding.
2
There are obvious exceptions such as nuclear power plant control rod algorithms, distributed locking protocols, and so forth.

Avoiding fallback in distributed systems

As previously mentioned , I was recently able to contribute to the Amazon Builders' Library . I'd also like to share another post t...