A Nickel's Worth: 2008

Wednesday, May 21, 2008

Ulrich Drepper's Memory Paper and CL

I recently came across Ulrich Drepper's excellent paper, What Every Programmer Should Know About Memory. There is a lot of fascinating stuff in there about an important class of things you sometimes need to do to achieve excellent performance. While the paper concentrates on C, I was wondering if some of the same effects could be observed in a high-level language like CL (for the record, I think CL is both high- and low-level, but whatever...) I did a quick experiment which suggests that at least one of Ulrich's optimizations works in CL. [Note: the following is not an attempt to produce mathematically correct results, as Ulrich did; I just wanted to see if the order in which memory was accessed seemed to matter.]:

(declaim (optimize (speed 3) (safety 0)))

(defun matrix-multiply-fast? ()
  (let ((m1 (make-array '(1000 1000) :element-type 'fixnum))
        (m2 (make-array '(1000 1000) :element-type 'fixnum))
        (m3 (make-array '(1000 1000) :element-type 'fixnum)))
    (loop repeat 100
       do
       (loop for i upto 1000
          do
          (loop for j upto 1000
             do
             (setf (aref m3 i j) (* (aref m1 i j) (aref m2 i j))))))))

(defun matrix-multiply-slow? ()
  (let ((m1 (make-array '(1000 1000) :element-type 'fixnum))
        (m2 (make-array '(1000 1000) :element-type 'fixnum))
        (m3 (make-array '(1000 1000) :element-type 'fixnum)))
    (loop repeat 100
       do
       (loop for i upto 1000
          do
          (loop for j upto 1000
             do
             (setf (aref m3 i j) (* (aref m1 i j) (aref m2 j i))))))))

Note the only difference between the two functions is the order in which the elements of the 2nd array are accessed. After compiling the above file I did the following from the CL prompt:

CL-USER> (time (matrix-multiply-fast?))
Evaluation took:
  2.776 seconds of real time
  2.508157 seconds of user run time
  0.048003 seconds of system run time
  [Run times include 0.028 seconds GC run time.]
  0 calls to %EVAL
  0 page faults and
  12,000,160 bytes consed.

CL-USER> (time (matrix-multiply-slow?))
Evaluation took:
  3.926 seconds of real time
  3.632227 seconds of user run time
  0.016001 seconds of system run time
  0 calls to %EVAL
  0 page faults and
  12,008,280 bytes consed.

Assuming the above is all correct, it sort-of duplicates Ulrich's result where accessing an array in a column-major order is slower (see figure 6.1 for an illustration). Now, I have to admit that, for lack of time, I didn't read Ulrich's paper in great depth and kind of lazily jumped to section six. I'd be curious to know if this sort of optimization is already well-known in the CL world (and/or for other "high-level" programming languages). Some casual googling turned up an excellent paper about optimizing CL via type annotations, but I didn't see anyone directly addressing Ulrich's points. I'd be curious if any readers of this blog know more.

Sunday, March 23, 2008

Optimizing CL

This blog is about mechanically optimizing CL code. We will not discuss optimizing algorithms; instead we'll transform the code in rote ways to improve performance. To illustrate these techniques, we'll re-implement Perl's core Text::Soundex function in CL. Let's start with the Perl code, straight from the 5.8.8 distribution.

$soundex_nocode = undef;

sub soundex
{
 local (@s, $f, $fc, $_) = @_;

 push @s, '' unless @s;        # handle no args as a single empty string

 foreach (@s)
 {
   $_ = uc $_;
   tr/A-Z//cd;

   if ($_ eq '')
   {
     $_ = $soundex_nocode;
   }
   else
   {
     ($f) = /^(.)/;
     tr/AEHIOUWYBFPVCGJKQSXZDTLMNR/00000000111122222222334556/;
     ($fc) = /^(.)/;
     s/^$fc+//;
     tr///cs;
     tr/0//d;
     $_ = $f . $_ . '000';
     s/^(.{4}).*/$1/;
   }
 }

 wantarray ? @s : shift @s;
}

We do not implement the auxiliary behavior of the Perl version, namely we do not optionally accept a list of strings, nor do we support overriding the NIL return case. Both would be easy to implement but would be tangential to this blog entry. We port the rest of the functionality as faithfully as possible, though, so that the Perl can serve as a useful performance benchmark.

Perl's soundex uses regular expressions and string substitution operators heavily, some having analogues in CL and some not. For example, CL lacks Perl's tr/// operator, so we implement a crude version:

(defparameter *ascii-table* (let ((table (make-array '(256) :element-type 'character)))
                             (loop
                                for i below 256
                                do (setf (aref table i) (code-char i)))
                             table))

(defun tr (string from-table to-table)
 "Crude version of Perl's tr/// operator."
 (let ((table (copy-seq *ascii-table*)))
   (loop
      for from-char across from-table
      and to-char across to-table
      do (setf (aref table (char-code from-char)) to-char))
   (map 'string
        #'(lambda (c) (aref table (char-code c)))
        string)))

Our TR supports only the limited case needed by SOUNDEX (i.e., mapping one set of characters to another). The Perl version can do more, such as removing letters that don't appear in the first set, and removing duplicates. In fact, the Perl soundex relies on that ability to remove duplicates, so we implement a function to do that as well.

(defun uniq! (seq)
 (cond
   ((> (length seq) 1)
    (do* ((cur 0)
          (cur-elt (elt seq cur) (elt seq cur))
          (next 1 (1+ next)))
         ((>= next (length seq)) (subseq seq 0 (1+ cur)))
      (let ((next-char (elt seq next)))
        (unless (eql cur-elt next-char)
          (incf cur)
          (setf (elt seq cur) next-char)))))
   (t seq)))

UNIQ! coalesces adjacent duplicate items into one item. CL's built-in DELETE-DUPLICATES doesn't work because it coalesces all duplicates, not just adjacent ones.

These two utility functions make porting the rest of the soundex easy. The following shows the CL equivalents of each important line of soundex:

Perl	What It Does	CL Equivalent
$_ = uc $_;	Uppercase the string	(string-upcase ...)
tr/A-Z//cd;	Remove any non-upper-alpha characters	(remove-if-not 'alpha-char-p string)
($f) = /^(.)/;	Gets the first character of the string	(char s 0)
tr/AE.../00.../;	Map letters to digits values	(tr s "AE..." "00...")
s/^$fc+//;	Remove leading copies of the character in $fc	(string-left-trim (vector fc) s2)
tr///cs;	Remove adjacent duplicates	(uniq! ...)
tr/0//d;	Delete any '0' characters	(delete #\0 ...)
$_ = $f . $_ . '000';	Concatenate, plus ensure length of at least 4	(concatenate 'string ...)
s/^(.{4}).*/$1/;	Strip off all but the first 4 characters	(subseq ... 0 4)

Here is the actual code:

(defun soundex (string)
 (let ((s (string-upcase (remove-if-not 'alpha-char-p string))))
   (when (plusp (length s))
     (let ((f (char s 0)))
       (let* ((s2 (tr s "AEHIOUWYBFPVCGJKQSXZDTLMNR" "00000000111122222222334556"))
              (fc (char s2 0)))
         (setf s2 (delete #\0
                          (uniq! (string-left-trim (vector fc) s2))))
         (subseq (concatenate 'string (vector f) s2 "000") 0 4))))))

Now let's see if it works. Using SLIME, type C-c C-k to compile the file. Then try SOUNDEX at the CL prompt:

CL-USER> (soundex "supercalifrag")
"S162"

If you try the Perl version, you'll find it returns the same thing (I tried other test cases, as well, but they are not relevant to this blog).

At this point you might be feeling rather proud of yourself, after all you ported that Perl code pretty quickly, right? And I bet it even performs better already; after all Perl is interpreted and CL is compiled! Let's verify that assumption using the TIME macro built in to CL:

CL-USER> (time (dotimes (i 100000) (soundex "supercalifrag")))
Evaluation took:
 2.644 seconds of real time
 2.640165 seconds of user run time
 0.012001 seconds of system run time
 [Run times include 0.076 seconds GC run time.]
 0 calls to %EVAL
 0 page faults and
 207,987,480 bytes consed.

Now let's compare that to the Perl code's performance:

$ time ./soundex-bench.pl 100000
real    0m1.069s
user    0m1.064s
sys     0m0.008s

D'oh! The Perl code kicked our ass! It's more than twice as fast as our CL! How could this have happened to us? Maybe we forgot to turn on some optimizations? We add:

(declaim (optimize (speed 3) (safety 0)))

to the top of our file and recompile. Now let's see the results:

CL-USER> (time (dotimes (i 100000) (soundex "supercalifrag")))
Evaluation took:
 2.061 seconds of real time
 2.040128 seconds of user run time
 0.020001 seconds of system run time
 [Run times include 0.1 seconds GC run time.]
 0 calls to %EVAL
 0 page faults and
 183,988,752 bytes consed.

Ok, that's a little better, but come on, it is still approximately twice as slow as the Perl! And look at how much memory is allocated ("consed") in the course of doing only 100,000 calls: that's approximately 175 megabytes (admittedly not all at once, but still, that's just plain embarrassing!)

Now before diving into optimization, let us review a good approach to optimizing CL.

Measure first.
Avoid guessing!
Fix your algorithm(s) first (not shown in this blog entry).
Fix memory consumption next.
Then go after CPU consumption, primarily by adding type information.

Remember, this blog is not about algorithmic optimizations; we will pretend (for the sake of illustration only!) that you've already ruled out the need, in order to focus on mechanical optimization.

Following step one from the above strategy, start by profiling. Define a function, MANY-SOUNDEX that performs our TIME loop from above. Also define a function PROFILE-SOUNDEX that employs SBCL's sb-profile package to profile the functions involved in implementing SOUNDEX, including some built-ins that it calls. To profile a function, pass its name to SB-PROFILE:PROFILE. After exercising the code, call SB-PROFILE:REPORT, which prints timing information. Call SB-PROFILE:RESET between runs unless you want the results to accumulate (we don't, in this case).

(defun many-soundex ()
 (time
  (dotimes (i 100000)
    (soundex "supercalifrag"))))

(defun profile-soundex ()
 ;; Don't accumulate results between runs.
  (sb-profile:reset)
 ;; Calling this every time through in case any of the user-defined
  ;; functions was recompiled.
  (sb-profile:profile soundex soundex-tr uniq!
                     concatenate make-string subseq
                     string-left-trim delete-if-not
                     tr delete string-upcase nsubseq)
 (many-soundex)
 (sb-profile:report))

If you notice some unfamiliar functions there, don't worry, they're going to be defined later; for the purposes of this blog I just want to define this function once, even though in real life I modified it many times.

Let's see what it produces:

CL-USER> (profile-soundex)
 3.713 seconds of real time
 3.160197 seconds of user run time
 0.548034 seconds of system run time
 [Run times include 0.08 seconds GC run time.]
 0 calls to %EVAL
 0 page faults and
 200,022,056 bytes consed.
 seconds  |    consed   |  calls  |  sec/call  |  name 
-----------------------------------------------------------
    1.394 | 140,115,928 | 100,000 |   0.000014 | TR
    0.618 |  26,230,456 | 100,000 |   0.000006 | CONCATENATE
    0.210 |     661,840 | 100,000 |   0.000002 | REMOVE-IF-NOT
    0.170 |   3,864,248 | 100,000 |   0.000002 | DELETE
    0.164 |   9,865,784 | 100,000 |   0.000002 | SOUNDEX
    0.131 |           0 | 100,000 |   0.000001 | UNIQ!
    0.130 |   9,274,680 | 100,000 |   0.000001 | STRING-UPCASE
    0.058 |  10,012,888 | 100,003 |   0.000001 | SUBSEQ
    0.000 |           0 |       7 |   0.000000 | STRING-LEFT-TRIM
-----------------------------------------------------------
    2.878 | 200,025,824 | 800,010 |            | Total

This shows that we probably should optimize TR for memory consumption first. What's wrong with TR? Every call to TR copies *ASCII-TABLE* into the TABLE local variable, which will later be used to map each character to a (potentially) different one. The inefficiency is that TABLE only varies based on the 2nd and 3rd arguments (FROM-TABLE and TO-TABLE). Since SOUNDEX always passes in the same thing for those two arguments every time, it is wasteful to continually recreate the same TABLE. To fix it use a closure that "closes over" a single instance of TABLE:

(defun make-tr-fn (from-table to-table)
 (let ((table (copy-seq *ascii-table*)))
   (loop
      for from-char across from-table
      and to-char across to-table
      do (setf (aref table (char-code from-char)) to-char))
   (lambda (string)
     (declare ((simple-array character) string))
     (map-into string
               #'(lambda (c) (aref table (char-code c)))
               string))))

(defparameter *soundex-tr-fn* (make-tr-fn "AEHIOUWYBFPVCGJKQSXZDTLMNR" "00000000111122222222334556"))

(defun soundex-tr (string)
 (funcall *soundex-tr-fn* string))

This is an example of the sort of rote transformations you often need to do in order to speed up performance.

Separate out the part of the function that varies dynamically into a LAMBDA.
Return the LAMBDA instead of the original value so that it can be reused over and over again.

This is something you often do in Java, too, where the pattern goes:

Create a class.
Have the constructor do the stuff that only needs to happen once (equivalent to the part outside of the LAMBDA.)
Create a method that does the dynamic part.

The main difference the CL and Java is that the Java would be more verbose.

You may notice another difference between MAKE-TR-FN and TR; it's now calling MAP-INTO instead of MAP, which avoids making a copy of the result, thus reducing the memory consumption further. Now this particular optimization does change the semantics of the function to modify its argument. In this case it's ok because we've already made a copy of the string passed to SOUNDEX, but do not use this technique without thinking through the consequences. Also, the DECLARE statement fixes a minor compiler warning caused by the change. DECLARE is discussed later.

When originally doing this work, I re-ran PROFILE-SOUNDEX after making each change, to see if it helped. That is not shown here in order to save space, but obviously you'd want to do the same thing. Let's now move on to the next biggest offender in the list, CONCATENATE. It's called towards the end of SOUNDEX in order to ensure that the return value is at least 4 characters (right-padded with '0' characters). This was a direct port of the Perl code (or at least as direct as you can get without using CL-PPCRE), and it's inefficient. Since the return value is always a 4-character string, we need only allocate a fixed-size string (using MAKE-STRING), prefilled with '0', and then copy in the (up to) 4 characters we need. We no longer need SUBSEQ which allocates a copy of its result (note: most of the memory consumed by SUBSEQ comes from it being called by UNIQ!).

In fact, since SUBSEQ is next on our list of memory hogs, let's find a way to fix it. When UNIQ! calls it, it's operating on a so-called "garbage" sequence (S2 from SOUNDEX); i.e., an intermediate result that is not used outside of the bowels of SOUNDEX. As such, we could modify it, so it is a shame to use SUBSEQ on it within UNIQ!. CL has a loosely-followed convention that "destructive" (non-copying) version of functions begin with the letter N. There is no built-in NSUBSEQ but a quick Google search finds one that works for our purposes:

;; From http://darcs.informatimago.com/lisp/common-lisp/utility.lisp
(defun nsubseq (sequence start &optional (end nil))
 "
RETURN:  When the SEQUENCE is a vector, the SEQUENCE itself, or a displaced
        array to the SEQUENCE.
        When the SEQUENCE is a list, it may destroy the list and reuse the
        cons cells to make the subsequence.
"
 (if (vectorp sequence)
     (if (and (zerop start) (or (null end) (= end (length sequence))))
         sequence
         (make-array (- (if end
                            (min end (length sequence))
                            (length sequence))
                        start)
                     :element-type (array-element-type sequence)
                     :displaced-to sequence
                     :displaced-index-offset start))
     (let ((result (nthcdr start sequence)))
       (when end
         (setf (cdr (nthcdr (- end start -1) sequence)) nil))
       result)))

We'll call this from UNIQ! and that should take care of that 10MB of allocations. The last bit of memory we're going to take care of is that allocated by STRING-UPCASE and REMOVE-IF-NOT. Now we have to be careful here because we want one copy of the argument to SOUNDEX; it would be rude to transform the caller's argument unexpectedly. So we actually relying on either STRING-UPCASE or REMOVE-IF-NOT making a copy of STRING. As such, we have to pick one to optimize. Since REMOVE-IF-NOT has a destructive equivalent, DELETE-IF-NOT, we will use that instead. Unfortunately, it's not as simple as just dropping in DELETE-IF-NOT. If you read the CLHS entry you'll discover that STRING-UPCASE is allowed to return the same string it was passed in (i.e., not a copy) if it doesn't need to change it (e.g., all the letters are already uppercase). We account for this by calling REMOVE-IF-NOT in the case where STRING-UPCASE does this by seeing if the return value is EQ (same object) to the original string. You can see this in the new code listing, which has certain key optimizations highlighted for your convenience:

(defparameter *ascii-table* (let ((table (make-array '(256) :element-type 'character)))
                             (loop
                                for i below 256
                                do (setf (aref table i) (code-char i)))
                             table))

(defun make-tr-fn (from-table to-table)
 (let ((table (copy-seq *ascii-table*)))
   (loop
      for from-char across from-table
      and to-char across to-table
      do (setf (aref table (char-code from-char)) to-char))
   (lambda (string)
     (declare ((simple-array character) string))
     (map-into string
               #'(lambda (c) (aref table (char-code c)))
               string))))

(defparameter *soundex-tr-fn* (make-tr-fn "AEHIOUWYBFPVCGJKQSXZDTLMNR" "00000000111122222222334556"))

(defun soundex-tr (string)
 (funcall *soundex-tr-fn* string))

(defun uniq! (seq)
 (cond
   ((> (length seq) 1)
    (do* ((cur 0)
          (cur-elt (elt seq cur) (elt seq cur))
          (next 1 (1+ next)))
         ((>= next (length seq)) (nsubseq seq 0 (1+ cur)))
      (let ((next-char (elt seq next)))
        (unless (eql cur-elt next-char)
          (incf cur)
          (setf (elt seq cur) next-char)))))
   (t seq)))

(defun soundex (string)
 (let ((s (let ((maybe-a-copy (string-upcase string)))
            ;; STRING-UPCASE can return original string if no changes
             ;; were needed.
             (if (eq maybe-a-copy string)
                ;; REMOVE-IF-NOT makes a copy
                 (remove-if-not 'alpha-char-p maybe-a-copy)
                ;; DELETE-IF-NOT doesn't
                 (delete-if-not 'alpha-char-p maybe-a-copy)))))
   (when (plusp (length s))
     (let ((f (char s 0)))
       (let* ((s2 (soundex-tr s))
              (fc (char s2 0))
              (result (make-string 4 :initial-element #\0)))
         (setf s2 (delete #\0 (uniq! (string-left-trim (vector fc) s2))))
         (setf (char result 0) f)
         (loop
            for i from 0 below (min (length s2) 4)
            do (setf (char result (1+ i)) (char s2 i)))
         result)))))

As mentioned above, in the course of actually doing this work I ran PROFILE-SOUNDEX many times, but I don't show the intermediate results here in order to save space. Now that we've completed a major chunk of work, however, let's see how we're doing:

CL-USER> (many-soundex)
Evaluation took:
 0.464 seconds of real time
 0.460029 seconds of user run time
 0.0 seconds of system run time
 [Run times include 0.008 seconds GC run time.]
 0 calls to %EVAL
 0 page faults and
 22,402,176 bytes consed.

We run MANY-SOUNDEX here in a fresh instance of SBCL to remove any injected profiling code or other stuff that might affect the results. It also has speed optimizations turned on. As you can see, the benefit of simply removing memory allocations is significant. The new code is a whopping 4.4 times faster! It's also now more than twice as fast as the Perl code.

We could continue to to whittle away at the memory allocation, as you can see we're still at around 21MB of memory allocated. However, since we have to copy the argument to SOUNDEX no matter what, 21MB is only about twice the minimum (if you look at the profiling results, we can't do better than the memory allocated by STRING-UPCASE). To keep this blog entry to a reasonable length we shall deem this acceptable.

On to CPU optimization. SBCL automatically detects certain performance problems when you up SPEED to 3. Fix these first, as it makes little sense to manually search out problems when the compiler has already found some. Using SLIME makes it easy:

Put

(declaim (optimize (speed 3) (debug 0) (safety 0)))

at the top of the file.

Type C-c C-k to compile the file.
Hit M-n and M-p to get SLIME to highlight the next (or previous) compiler warning.

I won't go through all of the compiler warnings I got when I did this, but instead will highlight some of them.

(defun soundex-tr (string)
 (funcall *soundex-tr-fn* string))
; note: unable to
;   optimize away possible call to FDEFINITION at runtime
; due to type uncertainty:
;   The first argument is a (OR FUNCTION SYMBOL), not a FUNCTION.

This example seems a little strange at first, if, say, you're used to primitive type systems such as Java's. CL types can be defined in sophisticated ways that deserve a blog entry of their own. In this particular case, the compiler has inferred the type of *SOUNDEX-TR-FN* to be (OR FUNCTION SYMBOL), meaning it isn't sure if it could sometimes be null (figuring that out would require "global" analysis to ensure that *SOUNDEX-TR-FN* is never modified by any function anywhere, which is still, apparently, beyond the scope of most compilers). We can fix the warning with a THE expression. THE is one of CL's ways of adding manual static typing (which you may be familiar with from more primitive languages such as Java) into your code. Although CL implementations such as SBCL perform type inference, the compiler occasionally needs your help:

(funcall (the function *soundex-tr-fn*) ...)

Since we're sure that *SOUNDEX-TR-FN* is effectively constant (it's all private code under our own control, after all), it is safe to add in this THE.

Another interesting set of warnings comes from NSUBSEQ:

(defun nsubseq (sequence start &optional (end nil))
 (if (vectorp sequence)
     (if (and (zerop start) (or (null end) (= end (length sequence))))
         sequence
         (make-array (- (if end
                            (min end (length sequence))
                            (length sequence))
                        start)
                     :element-type (array-element-type sequence)
                     :displaced-to sequence
                     :displaced-index-offset start))
     (let ((result (nthcdr start sequence)))
       (when end
         (setf (cdr (nthcdr (- end start -1) sequence)) nil))
       result))); in: DEFUN NSUBSEQ
;     (ZEROP START)
; ==>
;   (= START 0)
;
; note: unable to
;   open-code FLOAT to RATIONAL comparison
; due to type uncertainty:
;   The first argument is a NUMBER, not a FLOAT.
;
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a NUMBER, not a (COMPLEX SINGLE-FLOAT).
;
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a NUMBER, not a (COMPLEX DOUBLE-FLOAT).
...blah blah blah...

Those warnings go on for a while. What they all boil down to, though, is that the compiler doesn't know the type of START or END. Now here is a key point. Should you find yourself in this sort of situation, always remember what the great computer scientist Igor Stravinsky once said, "Lesser artists borrow, great artists steal." In other words, don't try to figure out the type declarations; steal them from SBCL. Using SLIME makes this easy. Just type M-. and then type SUBSEQ, because obviously it should take the same parameters. This will take you to the source code (you should always have the SBCL source code handy) for SUBSEQ:

(defun subseq (sequence start &optional end)
 #!+sb-doc
 "Return a copy of a subsequence of SEQUENCE starting with element number
  START and continuing to the end of SEQUENCE or the optional END."
 (seq-dispatch sequence
   (list-subseq* sequence start end)
   (vector-subseq* sequence start end)
   (sb!sequence:subseq sequence start end)))

Since we only care about the vector case here, move the cursor to VECTOR-SUBSEQ* and hit M-. again, to see how it declares its arguments:

(defun vector-subseq* (sequence start end)
 (declare (type vector sequence))
 (declare (type index start)
          (type (or null index) end))
 ;; blah blah blah

Ah! So what is the type INDEX? Use M-. once more and you'll discover this:

(def!type index () `(integer 0 (,sb!xc:array-dimension-limit)))

This makes sense. The type obviously has to be an integer, cannot be negative, and cannot be more than the maximum array length allowed by your CL implementation. This particular type definition is internal to SBCL and cannot be used directly, but we create our own. Improved NSUBSEQ looks like:

(deftype index () `(integer 0 ,array-dimension-limit))

;; From http://darcs.informatimago.com/lisp/common-lisp/utility.lisp
(defun nsubseq (sequence start &optional (end nil))
 (if (vectorp sequence)
     (locally
         (declare (index start)
                  ((or null index) end))
       (if (and (zerop start) (or (null end) (= end (length sequence))))
           sequence
           (make-array (- (if end
                              (min end (length sequence))
                              (length sequence))
                          start)
                       :element-type (array-element-type sequence)
                       :displaced-to sequence
                       :displaced-index-offset start)))
     (let ((result (nthcdr start sequence)))
       (when end
         (setf (cdr (nthcdr (- end start -1) sequence)) nil))
       result)))

Note we use LOCALLY only when the argument is actually a vector; we can't make the same assumptions about other types of sequences. Plus we don't care about the list case (in fact a compiler warning for that case remains, but since we're not using lists in this example we'll skip the fix for it).

Let's look at optimizing UNIQ!. When we compile it, we get a number of warnings such as:

(defun uniq! (seq)
 (cond
   ((> (length seq) 1)
    (do* ((cur 0)
          (cur-elt (elt seq cur) (elt seq cur))
          (next 1 (1+ next)))
         ((>= next (length seq)) (nsubseq seq 0 (1+ cur)))
      (let ((next-char (elt seq next)))
        (unless (eql cur-elt next-char)
          (incf cur)
          (setf (elt seq cur) next-char)))))
   (t seq))); in: DEFUN UNIQ!
;     (LENGTH SEQ)
;
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a SEQUENCE, not a (SIMPLE-ARRAY * (*)).
;
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a SEQUENCE, not a VECTOR.

;     (ELT SEQ CUR)
;
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a SEQUENCE, not a (SIMPLE-ARRAY * (*)).
...blah blah blah...

Here the compiler is telling us that it could make some optimizations if it knew for sure that the SEQ were a SIMPLE-ARRAY. What the heck is a simple array? Clicking link (or using C-c C-d h in SLIME), shows:

The type of an array that is not displaced to another array, has no fill pointer, and is not
expressly adjustable is a subtype of type simple-array. The concept of a simple array exists to
allow the implementation to use a specialized representation and to allow the user to declare that
certain values will always be simple arrays.

In case you're wondering, "displaced" means an array slice (as in Perl or Python), and the other two concepts refer to various ways an array can be (or appear to be) extensible. Is it safe for UNIQ! to assume it is receiving a simple array? Yes; its input comes from STRING-LEFT-TRIM. Although STRING-LEFT-TRIM is allowed to return its input when there are no changes to be made, we know that it will always make a change because we always remove at least the first character of S2, and thus its return value will always be simple. So let's add a function declaration for UNIQ!:

(declaim (ftype (function (simple-array) string) uniq!))

This tells the compiler that UNIQ!'s single argument is a SIMPLE-ARRAY and that it returns a STRING. The reason we say it returns a STRING (instead of a "simple" sequence type) is that UNIQ! uses NSUBSEQ, which does return an array slice. Unfortunately, we still get compiler warnings after making this change (albeit fewer):

             (cur-elt (elt seq cur) (elt seq cur)); note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a (SIMPLE-ARRAY * (*)), not a SIMPLE-STRING.
...blah blah blah...

This is telling us that it could generate better code if it knew the sequence was a string. ELT is a generic accessor that works on any sequence type; presumably if the compiler knows the sequence type is a string it can take advantage of the fact that each element is the same size. Anyway, this is easy to fix. SOUNDEX only deals in strings, so we can safely change the declaration from SIMPLE-ARRAY to SIMPLE-STRING. Sure enough, this makes the compiler warnings disappear.

The remaining changes made were generally of the same nature as the above so are not covered here in detail, but you can see them in the following complete code listing that highlights all the CPU-related changes:

(declaim (optimize (speed 3) (debug 0) (safety 0)))

(defparameter *ascii-table* (let ((table (make-array '(256) :element-type 'character)))
                             (loop
                                for i below 256
                                do (setf (aref table i) (code-char i)))
                             table))

(defun make-tr-fn (from-table to-table)
  (declare (simple-string from-table to-table)
          (simple-array *ascii-table*))
 (let ((table (the (simple-array character) (copy-seq *ascii-table*))))
    (loop
      for from-char across from-table
      and to-char across to-table
      do (setf (aref table (char-code from-char)) to-char))
   (lambda (string)
     (declare ((simple-array character) string))
     (map-into string
               #'(lambda (c) (aref table (char-code c)))
               string))))

(defparameter *soundex-tr-fn* (make-tr-fn "AEHIOUWYBFPVCGJKQSXZDTLMNR" "00000000111122222222334556"))

(declaim (ftype (function (simple-string) simple-string) soundex-tr))
(defun soundex-tr (string)
  (funcall (the function *soundex-tr-fn*) string))

(deftype index () `(integer 0 ,array-dimension-limit))

;; From http://darcs.informatimago.com/lisp/common-lisp/utility.lisp
(defun nsubseq (sequence start &optional (end nil))
 "
RETURN:  When the SEQUENCE is a vector, the SEQUENCE itself, or a displaced
        array to the SEQUENCE.
        When the SEQUENCE is a list, it may destroy the list and reuse the
        cons cells to make the subsequence.
"
 (if (vectorp sequence)
      (locally
         (declare (index start)
                  ((or null index) end))
       (if (and (zerop start) (or (null end) (= end (length sequence))))
           sequence
           (make-array (- (if end
                              (min end (length sequence))
                              (length sequence))
                          start)
                       :element-type (array-element-type sequence)
                       :displaced-to sequence
                       :displaced-index-offset start)))
      (let ((result (nthcdr start sequence)))
       (when end
         (setf (cdr (nthcdr (- end start -1) sequence)) nil))
       result)))

(declaim (ftype (function (simple-string) string) uniq!))
(defun uniq! (seq)
  (let ((seq-len (length seq)))
   (cond
     ((> seq-len 1)
      (do* ((cur 0)
            (cur-elt (elt seq cur) (elt seq cur))
            (next 1 (1+ next)))
           ((>= next seq-len) (nsubseq seq 0 (1+ cur)))
        (let ((next-char (elt seq next)))
          (unless (eql cur-elt next-char)
            (incf cur)
            (setf (elt seq cur) next-char)))))
     (t seq))))
(defun soundex (string)
  (let ((s (the simple-string
            (let ((maybe-a-copy (string-upcase string)))
              ;; STRING-UPCASE can return original string if no changes
              ;; were needed.
              (if (eq maybe-a-copy string)
                  ;; REMOVE-IF-NOT makes a copy
                  (remove-if-not 'alpha-char-p maybe-a-copy)
                  ;; DELETE-IF-NOT doesn't
                  (delete-if-not 'alpha-char-p maybe-a-copy))))))
    (when (plusp (length s))
     (let ((f (char s 0)))
       (let* ((s2 (soundex-tr s))
              (fc (char s2 0))
              (result (make-string 4 :initial-element #\0)))
          (setf s2 (the string (delete #\0 (uniq! (string-left-trim (vector fc) s2)))))
          (setf (char result 0) f)
          (let ((end (min (length s2) 4)))
           (loop
              for i from 0 below end
              do (setf (char result (1+ i)) (char s2 i))))
          result)))))

Here's the effect of the CPU optimizations:

CL-USER> (many-soundex)
Evaluation took:
 0.357 seconds of real time
 0.360022 seconds of user run time
 0.0 seconds of system run time
 [Run times include 0.008 seconds GC run time.]
 0 calls to %EVAL
 0 page faults and
 22,406,384 bytes consed.

Although the results vary from run to run, 0.357 seconds is a pretty typical on my hardware. This is approximately a 23% reduction in time compared to before. Not nearly as big of a gain, which is why one should optimize memory first. Still, 23% is nothing to sneeze at for a relatively small number of changes, most of which were pointed out to us by the compiler! If you're interested in learning more about optimizing CL, I recommend checking out SBCL's SB-SPROF package, which is more sophisticated than the SB-PROFILE package used here.

Tuesday, February 12, 2008

Scripting in CL

CL has been overlooked as a scripting language, even though scripting in CL is faster than it is in traditional scripting languages. CL's interactivity and macro system make it ideal for quick-and-dirty hacking.

For one thing, its syntax is so simple that almost anything can be typed in at the CL command-line:

CL-USER> 23
23

That's a valid CL "program". It returns 23. So say that you forgot how part of the language works; resolve the problem by typing some test code at the command-line without breaking your concentration. For example, say you forget how division works in CL; you may have forgotten whether CL has support for rationals. Don't look it up, just try it out immediately:

CL-USER> (/ 23 45)
23/45

Sure enough, it doesn't return a float! This interactivity isn't just useful for exploring (or remembering) basic language features. It's also how you script in CL.

For example, here is how to write a CL version of the popular Unix utility grep, which you might then use as the basis for a log analyzer or some other script. Start by working interactively on a small part of the problem, such as how to see if a string contains a regexp. Of course our goal is to pick lines out of a file, but let's ignore the file part for now, we have to walk before we run. First define some sample data:

CL-USER> (defparameter *lines* "hi there
my name is bob
I live in Kalamazoo")

Now google for how to use regular expressions in CL. All roads lead to CL-PPCRE. To install it, do this:

CL-USER> (require :asdf-install) ; if you haven't already
CL-USER> (asdf-install:install :cl-ppcre) ; if you don't have it already

To use it, do this:

CL-USER> (require :cl-ppcre)
CL-USER> (use-package :cl-ppcre) ; to save typing cl-ppcre: a lot

Next use SLIME's autocomplete to figure out what functions are available:

CL-USER> (cl-ppcre:<TAB>
Flags: boundp fboundp generic-function class macro special-operator package

Completion:                        Flags:  Score:
---------------------------------- ------- --------
cl-ppcre:                          b------    98.44
cl-ppcre-test:                     b------    90.11
cl-ppcre:*allow-named-registers*   b------     0.98
cl-ppcre:*allow-quoting*           b------     0.98
cl-ppcre:*regex-char-code-limit*   b------     0.98
cl-ppcre:*use-bmh-matchers*        b------     0.98
cl-ppcre:all-matches               -f-----     0.98
cl-ppcre:all-matches-as-strings    -f-----     0.98
cl-ppcre:create-scanner            -fg----     0.98
cl-ppcre:define-parse-tree-synonym -f--m--     0.98
cl-ppcre:do-matches                -f--m--     0.98
cl-ppcre:do-matches-as-strings     -f--m--     0.98
cl-ppcre:do-register-groups        -f--m--     0.98
cl-ppcre:do-scans                  -f--m--     0.98
cl-ppcre:parse-tree-synonym        -f-----     0.98
cl-ppcre:ppcre-error               ---c---     0.98
...etc...

Although this technique may seem haphazard, in practice one finds what one wants in about a minute or two (at most), typically without losing one's concentration. Most of the functions can be trivially filtered out simply based on their names (e.g., PARSE-TREE-SYNONYM is obviously not what I'm looking for). Of the rest, you can either quickly check their documentation (C-c C-d d if you're using SLIME) or else, heck, just try them out immediately. In this particular case I settled on SCAN.

CL-USER> (scan "Kalamazoo" *line*)
34
43
#()
#()

It doesn't matter what those return values mean (probably the offset of the match is in there somewhere), the key point is that SCAN would have returned NIL if there hadn't been a match. Or so we hope. We had better verify that (quickly; don't look at the documentation, just try it!):

CL-USER> (scan "snord" *line*)
NIL

Ok, now how do we read in a line from a file? Well, first we have to open the file:

CL-USER> (open "/etc/motd")
#<SB-SYS:FD-STREAM for "file /etc/motd" {A78C2E9}>

That looks promising. A little more research on Google shows that it's best to use WITH-OPEN-FILE because that will automatically close the file (even if an exception is thrown).

CL-USER> (with-open-file (stream "/etc/motd"))
NIL

How do you read a line from a file? Turns out there's a function READ-LINE:

CL-USER> (with-open-file (stream "/etc/motd") (read-line stream nil nil))
"Linux D830 2.6.22-14-generic #1 SMP Fri Feb 1 04:59:50 UTC 2008 i686"

Ok, so we can read lines from a file, now what? Well, let's play around with the LOOP macro to print out all the lines of the file (don't worry about the regexp matching just yet):

CL-USER> (defparameter *motd* (open "/etc/motd"))
*MOTD*
CL-USER> (loop for line = (read-line *motd* nil nil)
      while line
      do (format t "~A~%" line))
Linux D830 2.6.22-14-generic #1 SMP Fri Feb 1 04:59:50 UTC 2008 i686

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

Ok, that worked. Now let's scan for all the lines that contain the word "Ubuntu".

CL-USER> (defparameter *motd* (open "/etc/motd"))
*MOTD*
CL-USER> (loop for line = (read-line *motd* nil nil)
             while line
             do (when (scan "Ubuntu" line)
                  (format t "~A~%" line)))
The programs included with the Ubuntu system are free software;
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by

Ok, that looks good. Now let's write a function with everything we just did:

(defun grep (regexp filename)
  (with-open-file (stream filename)
                  (loop for line = (read-line stream nil nil)
                        while line
                        do (when (scan regexp line)
                             (format t "~A~%" line)))))
CL-USER> (grep "Ubuntu" "/etc/motd")
The programs included with the Ubuntu system are free software;
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by

At this point you've seen what a typical CL scripting session is like, working from the bottom up.

Now the above was a pretty simplistic example, and it used (apart from CL-PPCRE) good old fashioned CL, which hasn't changed since the mid-1980s. Since then, scripting languages like Perl have been invented by smart people like Larry Wall, having many built-in features that are missing from stock CL. Why suffer without them? It would seem that you have to either choose between a general-purpose, compiled language like CL or purpose-built, interpreted Perl. Fortunately, this is not the case! I've found that each piece of Perl functionality I've wanted has taken on the order of 45 minutes to 1 hour to implement using CL's powerful DEFMACRO facility.

Let's take, for example, Perl's magic <> operator. In Perl, you can write:

#!/usr/bin/env perl
# file: "foo.pl"

for $line (<>) {
    print $line;
}

Which will slurp up all the filenames on the command-line and let you iterate over each line. E.g.,

% foo.pl /etc/motd /etc/resolv.conf

would print out those files, sort of like cat. Wouldn't it be nice to have a similar feature in CL? Let's start by taking our GREP function and adding the ability to customize the action taken on each line:

(defun do-lines (fn filename)
  (with-open-file (stream filename)
    (loop for line = (read-line stream nil nil)
       while line
       do (when (scan regexp line)
            (funcall fn line)))))

DO-LINES is almost exactly the same as GREP, except that it just calls a function on each line of the file rather than assume grep-like behavior. To verify that we can implement GREP in terms of DO-LINES we try the following at the command-line:

CL-USER> (do-lines (lambda (line)
                 (when (scan "Ubuntu" line) (format t "~A~%" line))) "/etc/motd")
The programs included with the Ubuntu system are free software;
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by

Isn't that great? Baaaarf. Not really, eh? Kind of verbose compared to the Perl code, and it doesn't even handle multiple files like the Perl code does!

What we'd really like to be able to do is something CL-esque yet approximately equal to the Perl code in brevity. This sounds like a good use for a CL macro. Let's start by just fantasizing about what we'd like:

(with-lines-from-files (line "/etc/motd" "/etc/resolv.conf") 
  (format t "~A~%" line)) ; just a simple example, could put anything here

Well, before going further, we need to enhance DO-LINES to take more than one file:

(defun do-all-lines (fn &rest filenames)
  (dolist (filename-or-list filenames)
    ;; For convenience we allow you also to pass in lists of filenames
    (dolist (filename (typecase filename-or-list
                        (cons filename-or-list)
                        (t (list filename-or-list))))
      (with-open-file (stream filename)
        (loop
           for line = (read-line stream nil nil)
           while line
           do (funcall fn line))))))

Now we just need to write a macro so that you don't have to write all that crazy LAMBDA stuff:

(defmacro with-lines-from-files ((var &rest filenames) &body body)
  `(do-all-lines (lambda (,var) ,@body) ,@filenames))

Armed with this macro, we can re-write GREP with way fewer lines (and the ability to grep across multiple files):

(defun grep (regexp &rest filenames)
  (with-lines-from-files (line filenames)
    (when (scan regexp line)
      (format t "~A~%" line))))
CL-USER> (grep "Ubuntu|name" "/etc/motd" "/etc/resolv.conf")
The programs included with the Ubuntu system are free software;
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
nameserver 68.87.69.146
nameserver 68.87.85.98

If it weren't for macros, we'd be stuck implementing grep with the clunky use of LAMBDA:

(defun clunky-grep (regexp &rest filenames)
  (do-all-lines (lambda (line) 
                  (when (scan regexp line) 
                    (format t "~A~%" line)))
    filenames))

The LAMBDA-free version of GREP is much easier to read, it almost reads like English: "with lines from the files, when lines scan the regexp, format the line". Whereas the LAMBDA version reads "with lines, LAMBDA (huh!!!?), when lines scan the regexp, format the line, across the filenames". No contest!

Similarly, other Perl operations can be easily implemented with macros:

;; The following macro is mostly like Perl's foreach, in the sense
;; that you can pass in as many references to sequences or "scalars"
;; as you want and it will iterate over them and allow you to modify
;; them.  Unlike the Perl code, it sets the variable IT to each
;; element rather than $_.  Also, you have to just pass in the hash
;; table directly, not a flattened list of hash keys.
(defmacro perl-foreach ((&rest refs) &body body)
  (let* ((gensyms (loop repeat (length refs) collect (gensym))))
    (list*
     'let
     (mapcar #'list gensyms refs)
     (loop
        for ref in refs
        and indirect-ref in gensyms
        collect
        `(typecase ,indirect-ref
           (hash-table 
            (maphash #'(lambda (key value)
                         (declare (ignore value))
                         (symbol-macrolet ((it (gethash key ,indirect-ref)))
                           ,@body))
                     ,indirect-ref))
           ((and (or vector list) (not string))
            (map-into ,indirect-ref
                      #'(lambda (it)
                          ,@body
                          it)
                      ,indirect-ref))
           (t 
            (symbol-macrolet ((it ,ref))
              ,@body)))))))

;; trim whitespace in the scalar, the list, the array, and all the
;; values in the hash
(perl-foreach (scalar my-list my-array my-hash)
  (setf it (regex-replace "^\\s+" it ""))
  (setf it (regex-replace "\\s+$" it "")))

Or:

(defmacro perl-splice (sequence-place &optional (offset 0) length replacement-sequence)
  (let* ((seq (gensym "SEQUENCE-PLACE-"))
         (off-arg (gensym "OFFSET-ARG-"))
         (off (gensym "OFFSET-"))
         (len (gensym "LENGTH-"))
         (end (gensym "END-"))
         (rep (gensym "REPLACEMENT-SEQUENCE-"))
         (left-part (list `(subseq ,seq 0 ,off)))
         (right-part (when length
                       (list `(subseq ,seq ,end)))))
    `(let* ((,seq ,sequence-place)
            (,off-arg ,offset)
            (,off (if (minusp ,off-arg)
                      (+ (length ,seq) ,off-arg)
                      ,off-arg))
            (,len ,length)
            (,end (when ,len
                    (if (minusp ,len)
                        (+ (length ,seq) ,len)
                        (+ ,off ,len))))
            (,rep ,replacement-sequence))
       (prog1 (subseq ,seq ,off ,end)
         (when (or ,rep (not (eql ,off ,end)))
           (setf ,sequence-place (concatenate (typecase ,seq
                                                (cons 'list)
                                              (t 'vector))
                                            ,@left-part
                                            ,rep
                                            ,@right-part)))))))

;; Now the syntax is almost exactly the same.
(setf front (perl-splice my-array 0 n))
(setf end (perl-splice my-array 0 (- n)))