A Nickel's Worth: Ulrich Drepper's Memory Paper and CL

I recently came across Ulrich Drepper's excellent paper, What Every Programmer Should Know About Memory. There is a lot of fascinating stuff in there about an important class of things you sometimes need to do to achieve excellent performance. While the paper concentrates on C, I was wondering if some of the same effects could be observed in a high-level language like CL (for the record, I think CL is both high- and low-level, but whatever...) I did a quick experiment which suggests that at least one of Ulrich's optimizations works in CL. [Note: the following is not an attempt to produce mathematically correct results, as Ulrich did; I just wanted to see if the order in which memory was accessed seemed to matter.]:

(declaim (optimize (speed 3) (safety 0)))

(defun matrix-multiply-fast? ()
  (let ((m1 (make-array '(1000 1000) :element-type 'fixnum))
        (m2 (make-array '(1000 1000) :element-type 'fixnum))
        (m3 (make-array '(1000 1000) :element-type 'fixnum)))
    (loop repeat 100
       do
       (loop for i upto 1000
          do
          (loop for j upto 1000
             do
             (setf (aref m3 i j) (* (aref m1 i j) (aref m2 i j))))))))

(defun matrix-multiply-slow? ()
  (let ((m1 (make-array '(1000 1000) :element-type 'fixnum))
        (m2 (make-array '(1000 1000) :element-type 'fixnum))
        (m3 (make-array '(1000 1000) :element-type 'fixnum)))
    (loop repeat 100
       do
       (loop for i upto 1000
          do
          (loop for j upto 1000
             do
             (setf (aref m3 i j) (* (aref m1 i j) (aref m2 j i))))))))

Note the only difference between the two functions is the order in which the elements of the 2nd array are accessed. After compiling the above file I did the following from the CL prompt:

CL-USER> (time (matrix-multiply-fast?))
Evaluation took:
  2.776 seconds of real time
  2.508157 seconds of user run time
  0.048003 seconds of system run time
  [Run times include 0.028 seconds GC run time.]
  0 calls to %EVAL
  0 page faults and
  12,000,160 bytes consed.

CL-USER> (time (matrix-multiply-slow?))
Evaluation took:
  3.926 seconds of real time
  3.632227 seconds of user run time
  0.016001 seconds of system run time
  0 calls to %EVAL
  0 page faults and
  12,008,280 bytes consed.

Assuming the above is all correct, it sort-of duplicates Ulrich's result where accessing an array in a column-major order is slower (see figure 6.1 for an illustration). Now, I have to admit that, for lack of time, I didn't read Ulrich's paper in great depth and kind of lazily jumped to section six. I'd be curious to know if this sort of optimization is already well-known in the CL world (and/or for other "high-level" programming languages). Some casual googling turned up an excellent paper about optimizing CL via type annotations, but I didn't see anyone directly addressing Ulrich's points. I'd be curious if any readers of this blog know more.

1 comment:

Steve Knight said...: Although I haven't read Ulrich's paper I did buy and read a book (now out of print) about high performance computing that makes the same point about memory access. I also used some of the other techniques for loop unrolling in the book with some success.

In fact looking at the table of contents there's a lot of good stuff in there. I'm guessing that a lot of it would apply to CL too ...; Fri Jul 11, 02:22:00 AM PDT

A Nickel's Worth

Wednesday, May 21, 2008

Ulrich Drepper's Memory Paper and CL

1 comment:

Avoiding fallback in distributed systems