> According to a Prof. of mine[1] loading/storing from/to SIMD registers is in
> fact slow and should be avoided.

That is an assumption I would consider dangerous to make.

I once benchmarked copying a 64MB chunk of memory from one location
to another, comparing standard 32-bit registers against SIMD registers.
The standard registers clocked in at about 2GB/sec, while I hit almost
12GB/sec using 4x SIMD registers, since each one loads 128 bits at once.
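
For illustration, here is a rough sketch of the two copy loops being compared. This is not the code from that benchmark; the function names copy_u32 and copy_sse are made up for this example, and it assumes an x86 compiler with SSE2 and a length that is a multiple of 64 bytes.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics: __m128i, _mm_loadu_si128, ... */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical baseline: move one 32-bit word per iteration. */
static void copy_u32(uint32_t *dst, const uint32_t *src, size_t n_bytes)
{
    assert(n_bytes % 4 == 0);
    for (size_t i = 0; i < n_bytes / 4; i++)
        dst[i] = src[i];
}

/* Hypothetical SIMD version: move 64 bytes per iteration through four
 * 128-bit registers. Unaligned loads/stores are used so the sketch
 * works for any pointer; aligned variants exist for aligned buffers. */
static void copy_sse(void *dst, const void *src, size_t n_bytes)
{
    assert(n_bytes % 64 == 0);
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;

    for (size_t i = 0; i < n_bytes / 16; i += 4) {
        __m128i a = _mm_loadu_si128(s + i);
        __m128i b = _mm_loadu_si128(s + i + 1);
        __m128i c = _mm_loadu_si128(s + i + 2);
        __m128i e = _mm_loadu_si128(s + i + 3);
        _mm_storeu_si128(d + i,     a);
        _mm_storeu_si128(d + i + 1, b);
        _mm_storeu_si128(d + i + 2, c);
        _mm_storeu_si128(d + i + 3, e);
    }
}
```

Keep in mind that an optimizing compiler may vectorize the 32-bit loop on its own or replace it with a memcpy call, so any timing comparison needs to check the generated assembly.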

Now, what your prof probably meant was that loading a 128-bit register
is slower than loading a 32-bit register, so you shouldn't use SSE unless
you are actually doing SIMD work. In this case we do want SIMD, since we
are working with 3-4 32-bit floats that just "happen" to fit perfectly
in a 128-bit SIMD register.
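
As a minimal sketch of that fit, assuming an x86 compiler with SSE: the vec4 type and vec4_add function below are hypothetical names for this example, and a 3-component vector would simply leave the fourth lane unused.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, ... */

/* Hypothetical vector type: four 32-bit floats, 128 bits in total. */
typedef struct { float v[4]; } vec4;

/* Add all four components with a single SIMD add instead of four
 * scalar adds. Unaligned load/store keeps the sketch safe for any
 * placement of the struct. */
static vec4 vec4_add(vec4 a, vec4 b)
{
    vec4 r;
    __m128 va = _mm_loadu_ps(a.v);   /* load x, y, z, w at once */
    __m128 vb = _mm_loadu_ps(b.v);
    _mm_storeu_ps(r.v, _mm_add_ps(va, vb));
    return r;
}
```

Called as vec4_add(a, b), this does one load per operand, one add, and one store for the whole vector, which is the case where paying for a 128-bit load makes sense.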


