Wednesday, March 8, 2023

Notes on scalar/vectorized bit-exactness

If you work in an industry where detailed reports are generated for users, you are probably aware of the importance of maintaining bit-exactness. If a customer takes your software, runs the same pipeline on another system, and sees minor differences in the output, they may rightly be upset. This can be especially devious when dealing with floating-point output.

IEEE 754 floating-point addition is not associative. That is, (a+b)+c ≠ a+(b+c) in general. This is easily shown in Python:

>>> (.1+.2)+.3
0.6000000000000001
>>> .1+(.2+.3)
0.6

To make matters worse, when dealing with millions or even billions of iterations, the floating-point errors can accrue, yielding vastly different outputs, as the short demonstration below shows.
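As a quick standalone illustration (mine, not from the original pipeline): summing the same small value ten million times with a float accumulator drifts far from what a double accumulator computes.

#include <cstdio>

int main()
{
  // The mathematically expected sum is roughly 1,000,000. Once the
  // float accumulator grows large, each added 0.1f loses low-order
  // bits to rounding, so it drifts noticeably; the double stays close.
  float  f = 0.0f;
  double d = 0.0;
  for (int i = 0; i < 10000000; ++i)
  {
    f += 0.1f;
    d += 0.1f;
  }
  std::printf("float accumulator : %f\n", f);
  std::printf("double accumulator: %f\n", d);
  return 0;
}

The issue also arises when hand-vectorizing a function that accumulates into a float register. Let's take a straightforward example that sums an array of floats. For this example, inputs = { .1, .2, .3, .4, .5, .6, .7, .8 }.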

#include <vector>

float ScalarSum(const std::vector<float> & inputs)
{
  float sum = 0.0f;
  for (size_t i = 0; i < inputs.size(); ++i)
  {
    sum += inputs[i];
  }
  return sum;
}


We can transform this into a vectorized implementation. We will use SSE intrinsics and operate on the inputs in strides of 4 (assume for simplicity that the length is a multiple of 4).

#include <immintrin.h>
#include <vector>

float VectorSum(const std::vector<float> & inputs)
{
  __m128 sum4x = _mm_setzero_ps();
  for (size_t i = 0; i < inputs.size(); i += 4)
  {
    sum4x = _mm_add_ps(sum4x, _mm_loadu_ps(inputs.data() + i));
  }
  // Horizontal sum of the four lanes.
  return simd::Hsum_4xPS(sum4x);
}
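The horizontal sum simd::Hsum_4xPS is a helper whose implementation is not shown in the post; a minimal sketch, assuming the lanes are added strictly left to right, might look like this:

#include <immintrin.h>

namespace simd {

// Sketch of a horizontal sum (the real Hsum_4xPS may differ). The
// lanes are added in order: ((lane0 + lane1) + lane2) + lane3.
inline float Hsum_4xPS(__m128 v)
{
  alignas(16) float lane[4];
  _mm_store_ps(lane, v); // requires 16-byte alignment, hence alignas
  return ((lane[0] + lane[1]) + lane[2]) + lane[3];
}

} // namespace simd

The lane order matters: whatever order the horizontal sum uses is exactly the order the scalar emulation below has to reproduce.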


The two functions produce slightly different outputs when run on the same input array.


scalar=3.5999999
vector=3.6000001


The difference is subtle and is caused by the rounding errors discussed earlier. The scalar version computes the left-to-right sum (((((((.1+.2)+.3)+.4)+.5)+.6)+.7)+.8, while the vectorized code accumulates four lane-wise partial sums and then combines them: (.1+.5)+(.2+.6)+(.3+.7)+(.4+.8). More generally, given a stride S and a length N (a multiple of S), the scalar code evaluates

$$\begin{equation} \sum_{k=0}^{N-1} a_k \end{equation}$$

in order, while a vectorized accumulation can be written as:

$$\begin{equation} \sum_{j=0}^{S-1} \; \sum_{i=0}^{N/S-1} a_{iS+j} \end{equation}$$

where lane j accumulates the inner sum over i before the lanes are combined by the horizontal sum.

We may not care about the difference, as the inferences drawn from either output are functionally the same, but the discrepancy serves to confuse the user. Say the customer runs the pipeline on an ARM machine where we have not yet implemented the function with NEON intrinsics, so the scalar fallback runs there; the scalar code must therefore be bit-exact with the SIMD code. There are algorithms that mitigate numerical error, such as Kahan summation, but they only guarantee a bounded error, and we need bit-exactness.
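For reference, a minimal sketch of Kahan summation (mine, not from the original code) shows why it is not enough here:

#include <vector>

// Kahan (compensated) summation: a running compensation term captures
// the low-order bits lost in each addition. This bounds the error but
// does not reproduce the SIMD result bit for bit. Note that aggressive
// compiler flags like -ffast-math can optimize the compensation away.
float KahanSum(const std::vector<float> & inputs)
{
  float sum = 0.0f;
  float c   = 0.0f; // compensation for lost low-order bits
  for (float x : inputs)
  {
    float y = x - c;
    float t = sum + y;
    c   = (t - sum) - y; // recover what was lost when computing t
    sum = t;
  }
  return sum;
}

The key here is instead to emulate the vectorized order of operations in scalar.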

#include <array>
#include <numeric>
#include <vector>

float ScalarBitExact(const std::vector<float> & inputs)
{
  // Four scalar accumulators, one per SIMD lane.
  std::array<float, 4> sum4x = {};
  for (size_t i = 0; i < inputs.size(); i += 4)
  {
    for (size_t j = 0; j < 4; ++j)
    {
      sum4x[j] += inputs[i + j];
    }
  }
  // Make sure the accumulation order matches your hsum!
  return std::accumulate(sum4x.begin(), sum4x.end(), 0.0f);
}


We work in strides of 4, implementing the above equation in scalar: each element of sum4x plays the role of a SIMD lane. This produces exactly the same result.


scalar bit exact=3.6000001
vector          =3.6000001
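For completeness, a small driver along these lines (a sketch; the exact print formatting is assumed) reproduces the comparison:

#include <cstdio>
#include <vector>

int main()
{
  const std::vector<float> inputs = { .1f, .2f, .3f, .4f, .5f, .6f, .7f, .8f };
  std::printf("scalar          =%.7f\n", ScalarSum(inputs));
  std::printf("vector          =%.7f\n", VectorSum(inputs));
  std::printf("scalar bit exact=%.7f\n", ScalarBitExact(inputs));
  return 0;
}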


Users will be satisfied, and you don't need to throw out your fast implementation just because you couldn't get the results to match.

