What's the big (64-bit) deal, anyway?

Discuss how polywell fusion works; share theoretical questions and answers.

Moderators: tonybarry, MSimon

drmike
Posts: 825
Joined: Sat Jul 14, 2007 11:54 pm
Contact:

Post by drmike »

I run Linux at home. It is so much easier to code for external peripherals it isn't funny! The only problem I've had with my particular version is that it chokes on malloc, but works fine if I just declare 500 MB of static RAM.

I think the model we use can be tuned to fit the machines we have at hand. If we go a million steps in time over a 400x400x400 cube of space and things look ok, it is a pretty good clue on how to build an experiment. If it blows up, you know how to *not* build an experiment!
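
For scale (my arithmetic, not drmike's), here is what that grid costs in memory — a single double-precision field array on a 400x400x400 grid is about half a gigabyte, which lines up neatly with the 500 MB static allocation mentioned above:

```python
# Back-of-envelope memory for one field quantity on a 400^3 grid,
# stored as double-precision (8-byte) values.
cells = 400 ** 3                    # 64,000,000 grid points
bytes_per_value = 8                 # one float64 per point
one_array = cells * bytes_per_value # bytes for a single field array

print(one_array, "bytes")           # 512,000,000 bytes
print(round(one_array / 2**20), "MiB per field array")
```

Several field quantities (charge density, three field components, potential) multiply that by a handful, so the grid size is already dictated by desktop RAM.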

The first nuclear reactors were built with slide rules and 10 digit accuracy lookup tables. I think the toys we have on our desks are quite sufficient to build a fusion reactor.

scareduck
Posts: 552
Joined: Wed Oct 17, 2007 5:03 am

Post by scareduck »

drmike wrote:The first nuclear reactors were built with slide rules and 10 digit accuracy lookup tables. I think the toys we have on our desks are quite sufficient to build a fusion reactor.
True that. But how much better the toys available for just a bit more!

Our hardware guy tells me that we're paying ~$6k for a dual-core, 4-processor Xeon Dell with 16-20 GB of RAM.

I read your white paper on the calculations, BTW. It's exercising brain cells I haven't used in almost 20 years. Good stuff.

scareduck
Posts: 552
Joined: Wed Oct 17, 2007 5:03 am

Post by scareduck »

Clearspeed has a floating-point accelerator that will turn your desktop into a floating-point monster.

http://www.wired.com/science/discoverie ... 3/10/60791

That five-year-old Wired story is obsolete... Clearspeed's latest product, the e620, puts 80 GFLOPS on your desktop.

http://www.clearspeed.com/docs/resource ... _05_07.pdf

Too bad they can't seem to get their product into the channel. IBM is the only one carrying it, and they seem to want $15k a copy.

MSimon
Posts: 14335
Joined: Mon Jul 16, 2007 7:37 pm
Location: Rockford, Illinois
Contact:

Post by MSimon »

Dr Mike,

Yeah, the current driver situation in Winders machines sucks.

When I have my druthers I like DOS. Well tested. Simple. And drivers are a real piece of cake.

Writing drivers that interface with C is butt ugly though. Still better than Winders' 13 levels of indirection.
Engineering is the art of making what you want from what you can get at a profit.

drmike
Posts: 825
Joined: Sat Jul 14, 2007 11:54 pm
Contact:

Post by drmike »

I remember looking up Clearspeed. I drooled a lot! Glad to hear they are still alive. But if push comes to shove, you just get a bunch of FPGAs and build your own compute engine that is dedicated to the problem at hand. It just costs a lot :D

I've been looking at the math some more, and I like the idea of going with integrals versus differentials. I think in the end, the difference between charged species can be dealt with more easily if we integrate first, then compute differences to get forces, then operate on particles. In a purely differential equation set, small differences really are a big problem.
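
A 1-D cartoon of that integrate-first idea (my illustration, not drmike's actual formulation): get the field as a running integral of the charge density via Gauss's law, then take particle forces from it, instead of differencing a numerically solved potential, where nearly equal large numbers cancel badly:

```python
import numpy as np

# 1-D slab: field from Gauss's law as a running integral of charge density.
eps0 = 8.854e-12                         # vacuum permittivity (F/m)
x = np.linspace(0.0, 1.0, 1001)          # positions (m)
rho = np.full_like(x, 1e-9)              # toy uniform charge density (C/m^3)

# Integral form: E(x) = (1/eps0) * integral_0^x rho(x') dx'
# (trapezoid rule, accumulated with a cumulative sum)
dx = x[1] - x[0]
E = np.concatenate(([0.0], np.cumsum(0.5 * (rho[1:] + rho[:-1]) * dx))) / eps0

# Forces on test particles then come straight from the integrated field.
q = 1.602e-19                            # test-particle charge (C)
force = q * E
```

For the uniform density here the field grows linearly, as it should; the point is that the subtraction happens only once, at the force stage, after the smoothing effect of the integral.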

It will be fun to see what kinds of toys we can put the models on!

MSimon
Posts: 14335
Joined: Mon Jul 16, 2007 7:37 pm
Location: Rockford, Illinois
Contact:

Post by MSimon »

If you decide to go the FPGA route let me know. I have some experience along those lines.

We can build a FORTH engine to handle the CPU type stuff needed along with a custom ALU or three to do the math.

We could even make it look like an x86 machine, since that is stack-oriented.

We would just make the stacks deeper for convenience.
Engineering is the art of making what you want from what you can get at a profit.

hanelyp
Posts: 2261
Joined: Fri Oct 26, 2007 8:50 pm

Post by hanelyp »

If someone wants a challenge, look at writing the simulator to run on a modern video card. http://en.wikipedia.org/wiki/GPGPU The fairly simple code running on a massively parallel dataset of vectors might be a good fit.
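
To give a flavor of why it might fit: a particle push is the same few arithmetic operations applied independently to every particle, which is exactly the data-parallel shape a GPU wants. A toy sketch (mine, with numpy vectorization standing in for the GPU kernel, and made-up field values):

```python
import numpy as np

# Toy data-parallel particle push: one identical update per particle,
# with no dependencies between particles -- the workload shape GPUs like.
n = 100_000
rng = np.random.default_rng(1)
pos = rng.uniform(-1.0, 1.0, (n, 3))    # particle positions (m)
vel = np.zeros((n, 3))                  # particle velocities (m/s)
q_over_m = -1.76e11                     # electron charge-to-mass ratio (C/kg)
dt = 1e-12                              # time step (s)

def push(pos, vel, E):
    """One leapfrog-style step: kick by the local field, then drift."""
    vel = vel + q_over_m * E * dt       # kick, all particles at once
    pos = pos + vel * dt                # drift, all particles at once
    return pos, vel

E = np.tile([0.0, 0.0, 1e3], (n, 1))    # uniform 1 kV/m field for the toy
pos, vel = push(pos, vel, E)
```

On a GPU each particle's update would be one thread; here the vectorized array ops play that role.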

drmike
Posts: 825
Joined: Sat Jul 14, 2007 11:54 pm
Contact:

Post by drmike »

Yeah, that sounds like a great idea! Here is the abstract from one of the references:
Abstract

In visualization and computer graphics it has been shown that the numerical solution of
PDE problems can be obtained much faster on graphics processors (GPUs) than on CPUs.
However, GPUs are restricted to single precision floating point arithmetics which is insufficient
for most technical scientific computations. Since we do not expect double precision
support natively in graphics hardware in the medium-term, we demonstrate how to accelerate
double precision iterative solvers for Finite Element simulations with current GPUs by
applying a mixed precision defect correction approach. Our prototypical algorithm already
runs more than two times faster than a highly tuned pure CPU solver while maintaining the
same accuracy. We present a series of tests and discuss multiple optimization options.

scareduck
Posts: 552
Joined: Wed Oct 17, 2007 5:03 am

Post by scareduck »

Acceleware seems to be building whole systems bundled with third-party software bolted on:

http://www.acceleware.com/about/overview_LoLC8h.cfm

They claim with the latest NVIDIA card, the Tesla, they can get near to 1 TFLOP:

http://www.tgdaily.com/content/view/34656/135/

Here's the NVIDIA page for the Tesla:

http://www.nvidia.com/object/tesla_comp ... tions.html

The product spec says they're still only doing single-precision floating-point, but they plan on adding a 64-bit version Real Soon Now:

http://www.nvidia.com/docs/IO/43395/Com ... _Dec07.pdf

The chipset comes with a C SDK, apparently something nobody else had thought of before.

Some interesting discussions here:

http://www.gpgpu.org

scareduck
Posts: 552
Joined: Wed Oct 17, 2007 5:03 am

Post by scareduck »

drmike, here's the core of their approach from the paper you linked to:
We present a mixed precision defect correction algorithm for the iterative solution of linear equation systems. The core idea of the algorithm is to split the solution process into a computationally intensive but less precise inner iteration running in 32 bit on the GPU and a computationally simple but precise outer correction loop running in 64 bit on the CPU. Our approach can be easily implemented on top of an existing GPU-based single precision iterative solver in applications where higher precision is necessary. The algorithm requires two input parameters, ε_inner and ε_outer, as stopping criteria for the inner and outer solver respectively. Let A denote the (sparse) coefficient matrix, b the right hand side, x the initial guess for the solution and α a scaling factor. Subscript 32 indicates single precision vectors stored in GPU memory and 64 indicates double precision vectors stored in CPU memory.
So I guess it's easier to get the second 32-bit float representing the second half of the calculation than it is to get the first half. (I had to punt a bit to get some of this to display, as it doesn't seem to want to let me input HTML entities. Bother.)
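
The inner/outer split they describe is easy to sketch. A minimal version (my sketch, with numpy's float32 standing in for the GPU side, and a direct solve standing in for their tuned iterative inner solver):

```python
import numpy as np

def defect_correction_solve(A, b, eps_outer=1e-12, max_outer=50):
    """Mixed-precision defect correction: float32 inner solves (the 'GPU'),
    float64 residuals and accumulation (the 'CPU')."""
    A32 = A.astype(np.float32)          # coefficient matrix, single precision
    x = np.zeros_like(b)                # double-precision running solution
    for _ in range(max_outer):
        r = b - A @ x                   # defect, computed in 64-bit
        if np.linalg.norm(r) <= eps_outer * np.linalg.norm(b):
            break
        d32 = np.linalg.solve(A32, r.astype(np.float32))  # 32-bit correction
        x += d32.astype(np.float64)     # accumulate in 64-bit
    return x

# A small, well-conditioned test system:
rng = np.random.default_rng(0)
G = rng.standard_normal((50, 50))
A = G @ G.T + 50 * np.eye(50)           # symmetric positive definite
b = rng.standard_normal(50)
x = defect_correction_solve(A, b)
```

Each outer pass computes the defect in 64-bit, solves for a correction in 32-bit, and folds it back in 64-bit, so the cheap hardware does the heavy lifting while the answer still converges to double accuracy.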

drmike
Posts: 825
Joined: Sat Jul 14, 2007 11:54 pm
Contact:

Post by drmike »

NICE!

It's worth comparing some simple calculations with single and double precision. The GPU's job is really to create pretty pictures, and maybe the visualization part of the task is complicated enough that that's all we need it for. But that Acceleware looks really cool.

As I grind through the math I'll keep all these comments floating by; they might help guide my thought process about how to go about solving the problem.

MSimon
Posts: 14335
Joined: Mon Jul 16, 2007 7:37 pm
Location: Rockford, Illinois
Contact:

Post by MSimon »

It might be useful to do multiplies on the accelerator and adds/subtracts on the main processor.
Engineering is the art of making what you want from what you can get at a profit.

scareduck
Posts: 552
Joined: Wed Oct 17, 2007 5:03 am

Post by scareduck »

It's somewhat interesting that the authors took the approach of adding precision one step at a time, not unlike what you were suggesting elsewhere, MSimon. They thought that further speedups could be made by doing a first iteration in half precision.

scareduck
Posts: 552
Joined: Wed Oct 17, 2007 5:03 am

Post by scareduck »

Lost in the announcement regarding the Mac Air notebook was the new Mac Pro deskside computer:

http://www.apple.com/macpro/

Dual Intel Xeon Harpertown (four cores each) @ 3.2 GHz, up to 32 GB RAM (ships with 2 GB, but Apple always overcharges for RAM), $2799 a copy.

drmike
Posts: 825
Joined: Sat Jul 14, 2007 11:54 pm
Contact:

Post by drmike »

Interesting! It seems Apple finally woke up. That's a reasonable cost for what you get, and you can even program it! Check out their developer stuff:
Open Source

It'd be fun to make all the fans hum on that puppy. 32GB of *RAM*!!

Can't wait to see what comes out next week. :D

Post Reply