If you work in an industry where detailed reports are generated for users, you are probably aware of the importance of maintaining bit-exactness. If a customer takes your software, runs the same pipeline on another system, and sees minor differences in the output, they may rightly be upset. This can be especially devious when dealing with floating-point output.
IEEE 754 floating-point addition is not associative. That is, (a+b)+c ≠ a+(b+c) in general. This is easy to show in Python.
>>> (.1+.2)+.3
0.6000000000000001
>>> .1+(.2+.3)
0.6
To make matters worse, when dealing with millions or even billions of iterations, the floating-point errors can accrue, yielding vastly different outputs. The issue also arises when hand-vectorizing a function that accumulates into a float register. Let's take a straightforward example that sums an array of floats. For this example, inputs = { .1, .2, .3, .4, .5, .6, .7, .8 }.
float ScalarSum(const std::vector<float> & inputs)
{
    float sum = 0.0f;
    // plain left-to-right accumulation
    for(size_t i = 0; i < inputs.size(); ++i)
    {
        sum += inputs[i];
    }
    return sum;
}
float VectorSum(const std::vector<float> & inputs)
{
    // four running partial sums, one per SSE lane
    // (assumes inputs.size() is a multiple of 4)
    __m128 sum4x = _mm_setzero_ps();
    for(size_t i = 0; i < inputs.size(); i += 4)
    {
        sum4x = _mm_add_ps(sum4x, _mm_loadu_ps(inputs.data() + i));
    }
    // collapse the four lanes into a single float
    return simd::Hsum_4xPS(sum4x);
}
The two functions produce slightly different results when run on the same input array.
scalar=3.5999999
vector=3.6000001
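The difference comes down to the order of the additions. The vector loop keeps four independent partial sums, one per lane, and only combines them at the end. Written out for this eight-element input (and assuming the horizontal sum adds the four lanes left to right), the two groupings are:

scalar: ((((((.1 + .2) + .3) + .4) + .5) + .6) + .7) + .8
vector: (((.1 + .5) + (.2 + .6)) + (.3 + .7)) + (.4 + .8)

To make the scalar reference match, we can reproduce that lane-wise grouping in plain scalar code: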
float ScalarBitExact(const std::vector<float> & inputs)
{
    // four scalar accumulators, mirroring the four SSE lanes
    std::array<float, 4> sum4x = {};
    for(size_t i = 0; i < inputs.size(); i += 4)
    {
        for(size_t j = 0; j < 4; ++j)
        {
            sum4x[j] += inputs[i + j];
        }
    }
    // make sure the accumulation order matches your hsum!
    return std::accumulate(sum4x.begin(), sum4x.end(), 0.0f);
}
We work in strides of 4, implementing the grouping above in plain scalar code. This produces exactly the same result as the vectorized version.
scalar bit exact=3.6000001
vector =3.6000001
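For reference, a small driver along these lines (not part of the original snippets) reproduces the comparison:

#include <cstdio>
#include <vector>

int main()
{
    const std::vector<float> inputs = { .1f, .2f, .3f, .4f, .5f, .6f, .7f, .8f };
    std::printf("scalar bit exact=%.7f\n", ScalarBitExact(inputs));
    std::printf("vector          =%.7f\n", VectorSum(inputs));
}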
Your users will be satisfied, and you don't have to throw out your super-fast implementation just because you couldn't get the results to match bit for bit.