stereopsis : fast p3 memory access

Fast P3 Memory Access

Michael Herf

Sree found this wonderful article on SGI's site:
Optimizing CPU to Memory Accesses on the SGI Visual Workstations 320 and 540

Of course, since the Visual Workstations are just P3s, these techniques are useful to a lot of people. The SGI article describes a technique that alleviates alignment problems, wildly improves caching, and uses all the write-combining features of the P3. It's so fast, it's almost magic.

As a teaser, here are the numbers from my P3/600:

Windows memcpy: 174 MB/sec
SGI Example 4: 310 MB/sec

Inspired, I wrote a "memset"-like routine, which has a similar effect.

Windows memset: 342 MB/sec
My memfill: 646 MB/sec
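
For a rough idea of what a memset-like routine built on these ideas can look like, here is a minimal sketch in C. It is not the memfill from the project: the name memfill_sketch and the structure are assumptions made for illustration. It walks byte by byte up to a 16-byte boundary, fills the aligned middle with SSE non-temporal stores (_mm_stream_ps, which go through the write-combining buffers instead of the cache), and finishes the tail with plain byte stores. On a compiler without SSE support it quietly falls back to the byte loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Use SSE streaming stores when the compiler targets them. */
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define MEMFILL_HAVE_SSE 1
#endif

/* Hypothetical name; fills n bytes at dst with the low byte of value. */
void memfill_sketch(void *dst, int value, size_t n)
{
    unsigned char *p = (unsigned char *)dst;
    unsigned char byte = (unsigned char)value;

    /* Head: plain byte stores until p sits on a 16-byte boundary. */
    while (n > 0 && ((uintptr_t)p & 15u) != 0) {
        *p++ = byte;
        --n;
    }

#ifdef MEMFILL_HAVE_SSE
    if (n >= 16) {
        unsigned char pattern[16];
        __m128 v;

        /* Build a 16-byte register holding the fill byte in every lane. */
        memset(pattern, byte, sizeof pattern);
        memcpy(&v, pattern, sizeof v);

        /* Middle: aligned non-temporal stores that bypass the cache. */
        while (n >= 16) {
            _mm_stream_ps((float *)p, v);
            p += 16;
            n -= 16;
        }

        /* Make the streamed data visible before ordinary stores follow. */
        _mm_sfence();
    }
#endif

    /* Tail (or everything, when SSE is unavailable): plain byte stores. */
    while (n > 0) {
        *p++ = byte;
        --n;
    }
}
```

Call it like memset, for example memfill_sketch(buffer, 0, size). Streaming stores only pay off on large buffers that you won't read back right away; for small fills that stay in cache, the ordinary memset is likely the better choice.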

This stuff is so generally useful that I thought I'd post the project. You'll need the Visual C++ 6.0 Processor Pack to get at the SSE stuff (or VTune's compiler, if you have that). The Processor Pack also comes with a free copy of MASM, which is nice.

Here's memcpy.zip [24k].

If you have an Athlon, I'd love to hear how this code runs there. Or if your machine is just amazingly fast, let me know that too. Here's the complete output for my P3/600, with a P3V4X motherboard:

SGI ex1: 246.866850ms = 129.624532mb/sec
SGI ex2: 156.955982ms = 203.878818mb/sec
SGI ex3: 159.718903ms = 200.351990mb/sec
SGI ex4: 102.957219ms = 310.808705mb/sec

memcpy 183.452366ms = 174.432201mb/sec

memfill 50.355841ms = 635.477418mb/sec
memfill 50.456692ms = 634.207251mb/sec
memfill 49.485340ms = 646.656166mb/sec
memfill 49.908857ms = 641.168759mb/sec

memset 93.590386ms = 341.915459mb/sec

(I repeat the same memfill test several times to check run-to-run variability.)
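
If you want to compare your own timings against these, the conversion is just size divided by time. Multiplying the time by the rate in any row above gives about 32, so each test appears to touch roughly 32 MB; that figure is inferred from the output, not taken from the project source. A small sketch of the arithmetic, with a made-up function name:

```c
/* Hypothetical helper: converts a timing into a throughput figure.
   megabytes is the amount of data moved, milliseconds the time taken. */
double mb_per_sec(double megabytes, double milliseconds)
{
    return megabytes / (milliseconds / 1000.0);
}
```

For example, mb_per_sec(32.0, 246.866850) comes out at about 129.62, which matches the SGI ex1 line.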