Wednesday, May 21, 2008

Ulrich Drepper's Memory Paper and CL

I recently came across Ulrich Drepper's excellent paper, What Every Programmer Should Know About Memory. There is a lot of fascinating stuff in there about an important class of things you sometimes need to do to achieve excellent performance. While the paper concentrates on C, I was wondering if some of the same effects could be observed in a high-level language like CL (for the record, I think CL is both high- and low-level, but whatever...) I did a quick experiment which suggests that at least one of Ulrich's optimizations works in CL. [Note: the following is not an attempt to produce mathematically correct results, as Ulrich did; I just wanted to see if the order in which memory was accessed seemed to matter.]:
(declaim (optimize (speed 3) (safety 0)))

(defun matrix-multiply-fast? ()
(let ((m1 (make-array '(1000 1000) :element-type 'fixnum))
(m2 (make-array '(1000 1000) :element-type 'fixnum))
(m3 (make-array '(1000 1000) :element-type 'fixnum)))
(loop repeat 100
do
(loop for i upto 1000
do
(loop for j upto 1000
do
(setf (aref m3 i j) (* (aref m1 i j) (aref m2 i j))))))))

(defun matrix-multiply-slow? ()
(let ((m1 (make-array '(1000 1000) :element-type 'fixnum))
(m2 (make-array '(1000 1000) :element-type 'fixnum))
(m3 (make-array '(1000 1000) :element-type 'fixnum)))
(loop repeat 100
do
(loop for i upto 1000
do
(loop for j upto 1000
do
(setf (aref m3 i j) (* (aref m1 i j) (aref m2 j i))))))))
Note the only difference between the two functions is the order in which the elements of the 2nd array are accessed. After compiling the above file I did the following from the CL prompt:
CL-USER> (time (matrix-multiply-fast?))
Evaluation took:
2.776 seconds of real time
2.508157 seconds of user run time
0.048003 seconds of system run time
[Run times include 0.028 seconds GC run time.]
0 calls to %EVAL
0 page faults and
12,000,160 bytes consed.

CL-USER> (time (matrix-multiply-slow?))
Evaluation took:
3.926 seconds of real time
3.632227 seconds of user run time
0.016001 seconds of system run time
0 calls to %EVAL
0 page faults and
12,008,280 bytes consed.
Assuming the above is all correct, it sort-of duplicates Ulrich's result where accessing an array in a column-major order is slower (see figure 6.1 for an illustration). Now, I have to admit that, for lack of time, I didn't read Ulrich's paper in great depth and kind of lazily jumped to section six. I'd be curious to know if this sort of optimization is already well-known in the CL world (and/or for other "high-level" programming languages). Some casual googling turned up an excellent paper about optimizing CL via type annotations, but I didn't see anyone directly addressing Ulrich's points. I'd be curious if any readers of this blog know more.

1 comment:

Steve Knight said...

Although I haven't read Ulrich's paper I did buy and read a book (now out of print) about high performance computing that makes the same point about memory access. I also used some of the other techniques for loop unrolling in the book with some success.

In fact looking at the table of contents there's a lot of good stuff in there. I'm guessing that a lot of it would apply to CL too ...

Avoiding fallback in distributed systems

As previously mentioned , I was recently able to contribute to the Amazon Builders' Library . I'd also like to share another post t...