Will wrote: He's been complaining the process is too slow, however.
It's as slow as it needs to be. It takes a little under a tenth of a second to calculate the Lorentz and Coulomb forces for every one of the 14k*14k particle pairs. That works out to about 3,082,813,440 (3 billion) Lorentz+Coulomb force calculations per second. At that rate, in the time it takes to complete one full particle-pair calculation, adjusting for relativistic effects and whatnot, with all the memory transfers included, a top-of-the-line CPU would have barely finished a single bitwise OR operation (assuming a bitwise OR takes only 1 clock cycle), and that's before counting any memory transfers on the CPU side.
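For anyone who wants to check the per-pair timing, here is a rough sketch of the arithmetic. The pair rate is the figure quoted above; reading "14k" as 14 * 1024 particles and using a 3 GHz CPU clock are my assumptions, not numbers from the post.

```python
# Assumption: "14k" means 14 * 1024 particles (the post only says "14k").
n_particles = 14 * 1024
pairs_per_step = n_particles * n_particles  # every ordered pair, as in 14k*14k

# Pair rate quoted in the post.
pairs_per_second = 3_082_813_440

# Time for one full pass over all pairs, and time spent on a single pair.
step_seconds = pairs_per_step / pairs_per_second
ns_per_pair = 1e9 / pairs_per_second

# Assumption: a 3 GHz CPU that retires one bitwise OR per clock cycle.
cpu_hz = 3.0e9
ns_per_cpu_cycle = 1e9 / cpu_hz

print(f"pairs per step: {pairs_per_step:,}")
print(f"time per step:  {step_seconds:.4f} s")
print(f"time per pair:  {ns_per_pair:.3f} ns")
print(f"one CPU cycle:  {ns_per_cpu_cycle:.3f} ns")
```

Under those assumptions one pair takes roughly 0.32 ns, a bit less than a single 3 GHz clock cycle, and a full pass takes about a fifteenth of a second, which fits "a little under a tenth of a second".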
EDIT: I just counted it up: there are 82 floating-point operations in each pairwise calculation (counting reciprocal square root as 2). So that comes out to 252,790,702,080 floating-point operations per second, or about 253 gigaflops. At 900 MHz (its current core clock rate) my GPU is capable of a little over a teraflop, so that's about 1/4 of its theoretical peak performance. Not bad, but it means I'm either not at full occupancy, I/O limited, or both. Considering the code, I'm probably using too many registers, and that would be difficult to cut back. That's bad news for earlier GPUs, which have fewer registers per thread. (The newer ones are balanced better, imho.)
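The throughput figure can be reproduced with a few lines. This is only a sketch of the arithmetic: the 82 flops per pair and the pair rate come from the posts above, while the peak of exactly 1 teraflop is an assumed round number standing in for "a little over a teraflop".

```python
# Figures taken from the posts above.
flops_per_pair = 82               # reciprocal square root counted as 2
pairs_per_second = 3_082_813_440  # quoted pair rate

# Assumption: peak rounded down to exactly 1 teraflop; the real peak is
# "a little over" that, so the true fraction is slightly lower.
peak_flops = 1.0e12

achieved_flops = flops_per_pair * pairs_per_second
fraction_of_peak = achieved_flops / peak_flops

print(f"achieved: {achieved_flops:,} flops/s")
print(f"          {achieved_flops / 1e9:.0f} gigaflops")
print(f"fraction of assumed peak: {fraction_of_peak:.2f}")
```

With the peak rounded down like this the fraction comes out at about 0.25, which matches the "about 1/4" estimate.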
EDIT: Without the per-particle-pair Lorentz force and its relativistic correction, that comes out to only 26 flops per pair. So presumably (assuming it's compute-limited) that would go about 3 times as fast.
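The "about 3 times" estimate is just the ratio of the two flop counts. A quick sketch, assuming the kernel really is compute-limited so that run time scales with flops per pair (the projected pair rate is an extrapolation, not a measurement):

```python
# Flop counts per pair, taken from the posts above.
flops_full = 82          # Lorentz + Coulomb with relativistic correction
flops_coulomb_only = 26  # without the per-pair Lorentz force and correction

# Pair rate quoted for the full calculation.
pairs_per_second_full = 3_082_813_440

# Assumption: purely compute-limited, so time per pair is proportional
# to the number of flops per pair.
speedup = flops_full / flops_coulomb_only
projected_pairs_per_second = pairs_per_second_full * speedup

print(f"estimated speedup: {speedup:.2f}x")
print(f"projected pair rate: {projected_pairs_per_second:,.0f} pairs/s")
```

If the kernel is actually limited by occupancy or memory traffic rather than arithmetic, the real gain would be smaller than this ratio suggests.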