Understanding and Preventing Overflow

翻译 ALLEN ⋅ 于 2019-07-13 17:00:23 ⋅ 38 阅读

这是一篇协同翻译


感恩节快乐!也许吃太多火鸡在你脑中还记忆犹新。如果是这样,这将是讨论溢出的好时机。

在浮点运算的世界中,溢出是可能的,但不是特别常见。当数字变得太大时,就会溢出;IEEE双精度浮点数支持2^1024以下的范围,如果超过这个范围,就会出现问题:

for k in [10, 100, 1000, 1020, 1023, 1023.9, 1023.9999, 1024]:
    try:
        print "2^%.4f = %g" % (k, 2.0 ** k)
    except OverflowError, e:
        print "2^%.4f ---> %s" % (k,e)
2^10.0000 = 1024
2^100.0000 = 1.26765e+30
2^1000.0000 = 1.07151e+301
2^1020.0000 = 1.12356e+307
2^1023.0000 = 8.98847e+307
2^1023.9000 = 1.67731e+308
2^1023.9999 = 1.79757e+308
2^1024.0000 ---> (34, 'Result too large')

2^1024 是一个非常非常大的数字。你可能永远不会需要像2^1024这样的数字,除非你在用组合学。为了方便比较,这里有一些物理量的数值区间:

宇宙中的质子数估计为 10^80≈2^266
宇宙大小与普朗克长度之比约为 46×10^9 光年/ 1.62×10^-35 m≈2^204
宇宙年龄与普朗克时间之比约为138 亿年 / 5.39×10^-44 秒 ≈2^202
因此,双精度数字应该是相当安全的。


单精度浮点数能到略小于2^128,在巨大的物理计算中容易溢出。更可能的是,由于精度而不是数值区间,不会使用它们。

在嵌入式系统世界中,我们通常没有奢侈去使用浮点计算。它无论在 CPU 执行时间,还是在钱上都需要额外的开销。我们只能使用整数算术。(实际上我们中的一些人喜欢整数运算!)

所以本文的其余部分是关于整数算术中的溢出。接下来的事情可能会让你吃惊。


溢出源

常见的疑点是:加法和乘法

什么会导致整数数学溢出?常见的疑点为加法和乘法:

import numpy as np

a = np.int16(32767)
b = np.int16(2)
print "32767 + 2 = %s" % (a+b)
print "32767 * 2 = %s" % (a*b)
32767 + 2 = -32767
32767 * 2 = -2
-c:3: RuntimeWarning: overflow encountered in short_scalars
-c:4: RuntimeWarning: overflow encountered in short_scalars

每个整数格式都有一个可表示的范围:
有符号 16 位整数支持范围 [ -32768, 32767 ]
无符号 16 位整数支持范围 [ 0, 65535 ]
有符号 32 位整数支持范围 [ -2147483648, 21474833647 ]
无符号 32 位整数支持范围 [ 0, 4294967295]
如果超出这些范围,即使是暂时的,你也需要非常小心。大多数环境都能很好地处理溢出,并提供“舍入”或“模”行为(在有符号 16 位环境中为32767+1=-32768),其中,超出范围的进位将消失,剩下的是与精确结果对应的低阶位。


If you're programming in C with a compiler that meets the C99 standard, you may be surprised to learn that unsigned integer math is guaranteed to have modulo behavior under overflow conditions, but the behavior of overflow with signed integer math is undefined.

To would-be language lawyers, the relevant sections in C99 are:

Section 6.5 item 5:

If an exceptional condition occurs during the evaluation of an expression (that is, if the result is not mathematically defined or not in the range of representable values for its type), the behavior is undefined.

Section 6.2.5 item 9:

A computation involving unsigned operands can never overflow, because a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting type.

(In practice, modern compilers such as gcc and clang tend to use modulo behavior anyway for basic arithmetic calculations, but be careful, as sometimes compiler optimizations will treat comparisons or conditional expressions incorrectly even though the basic arithmetic is "correct". Symptoms of this behavior can be really tricky to recognize; the LLVM website has some explanation of this, along with other bugbears of undefined behavior. If you want to be safe, use the -fwrapv compiler flag, which both gcc and clang support; this guarantees modulo behavior with all aspects of signed arithmetic.)


The historical reason for this little quirk in the C standard, is that in the olden days some processors used ones-complement representation for signed numbers, in which case arithmetic for signed and unsigned numbers is slightly different. Nowadays twos-complement is the norm, and the implementations of addition, subtraction, and multiplication are the same for signed and unsigned when using base word lengths for the results. But one of the main principles of C, for better or for worse, is to allow machine-specific behavior to maintain efficient code, so the standard was written with no guarantees for the results of signed arithmetic overflow. Languages like Java, on the other hand, have machine-independent definitions of arithmetic. Any arithmetic operation in Java will give you the same answer no matter which processor you are running it on. The price of this guarantee is that on some processors, extra instructions will be necessary.

By the way, if you're using C for integer math, be sure to #include <stdint.h> and use the typedefs it includes, such as int16_t, uint16_t, int32_t, and uint32_t. These are portable, whereas the number of bits in a short or an int or a long may vary with processor architecture.

If you are using the fixed-point features of MATLAB, beware that the default behavior of integer overflow is to saturate at the limits of the integer range. This avoids some of the problems of overflow, but it may not give you the results you expect, if you're used to languages that provide wraparound semantics.


PREVENTING OVERFLOW

Just because we can use -fwrapv in gcc or clang, and guarantee wraparound behavior, doesn't make it the correct behavior for an application program.

If I am controlling a valve, and I want the output to increase, and I keep adding 1 to a process control variable, so that I get 32765, 32766, 32767, -32768, -32767, etc. this creates a jump discontinuity which is BAD. The only machine-independent method of preventing this is to stay out of overflow. One approach is to upcast to a larger type size, check for overflow, and saturate the results:

#include <stdint.h>

int16_t add_and_saturate(int16_t x, int16_t y)
{
    int32_t z = (int32_t)x + y;
    if (z > INT16_MAX)
    {
        z = INT16_MAX;
    }
    else if (z < INT16_MIN)
    {
        z = INT16_MIN;
    }
    return (int16_t)z;
}

You could also do something like this to stay within the range a 16-bit type, but it gets a little tricky, uses more operations, and I'm not 100% confident I've got my code correct:

#include <stdint.h>

int16_t add_and_saturate(int16_t x, int16_t y)
{
   int16_t z = x+y;
   if ((y > 0) && (x > (INT16_MAX - y)))
   {
       z = INT16_MAX;
   }
   else if ((y < 0) && (x < (INT16_MIN - y)))
   {
       z = INT16_MIN;
   }
   return z;
}

Handling multiplication is similar, and is really only practical using the type-widening approach.


减法?

下面的 C 代码有个 bug,你能明白为什么吗?

int16_t calcPI(int16_t command, int16_t measurement)
{
   int16_t error = command - measurement;
   /* other stuff */
}

问题是减法。 如果 command = 32767 并且 measurement = -32768,那么误差的“实际”值是 65535 。如果 command = -32768 and measurement = 32767, 那么误差的“实际”值是 -65535。 刚和加法一样,在减法中两个 k-位数字操作的结果是 (k+1)-位数字。有几个方法可以避免不正确的结果。

首先,我们可以使用32-位中间计算值:

int32_t error = (int32_t)command - measurement;
这有个缺陷,就是加法操作的越多,输出值的范围就越大。但是只要中间计算值不陷入溢出,就没事。

其次,我们可以把误差值规范到 int16_t 的区间,所以上下限为-32768 到 +32767。这有个缺陷,就是极小边缘值和极大边缘值没什么区别。 (比如: command = 32767 和 measurement = 0 得出 error = 32767,但是 measurement = -32768 也是这个结果。) 实际上, 我们会想在整个区间保持线性关系(线性增长或线性缩小)。

第三,我们可以修改这个计算操作,使得计算结果在上下限范围内:

int16_t error = ((int32_t)command - measurement) / 2;
这个减少了净增益,但是避免了溢出和饱和。


除法?

除法是算术运算中最难看的东西。幸运的是,溢出情况并不太复杂。

如果您使用的是 C ,那么这里的除法意味着取一个n位数字并除以另一个n位数字,那么只有一个例子可以导致溢出,稍后我们将讨论这个问题。(它不被零除。如果在不首先排除d=0的情况下对n/d进行除法,那么你会得到任何计算结果。)

一些处理器具有内置的除法功能,编译器通过直接映射到处理器的除法特性所拥有的“内置”或“固有”功能来支持这些运算。在这里,您可以做一些事情,比如用一个 32 位整数除以一个16位整数,得到一个 16 位的结果,但是您需要确保避免用一个会产生一个不符合16位结果的商的数字来做除法。例如:180000/2=90000,90000 超出了 16 位数字(有符号或无符号)的界限。


SHIFTS?

Don't forget your shift operators << and >>. These won't cause overflow, but in C, don't forget the pesky undefined and implementation-specific behaviors. Here the relevant section from the C99 standard is section 6.5.7 paragraphs 3-5:

3 The integer promotions are performed on each of the operands. The type of the result is that of the promoted left operand. If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.

4 The result of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are filled with zeros. If E1 has an unsigned type, the value of the result is E1 × 2^E2, reduced modulo one more than the maximum value representable in the result type. If E1 has a signed type and nonnegative value, and E1 × 2^E2 is representable in the result type, then that is the resulting value; otherwise, the behavior is undefined.

5 The result of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a nonnegative value, the value of the result is the integral part of the quotient of E1 / 2^E2. If E1 has a signed type and a negative value, the resulting value is implementation-defined.

That's right! If you have a signed integer E1 containing a negative number, and you shift right, the results are implementation defined. The "sensible" thing to do is an arithmetic shift right, putting 1 bits into the most significant bits to handle sign-extension. But the C standard allows the compiler to produce a logical shift right, putting 0 bits into the most significant bits, and that's probably not what you wanted.

More recent languages like Java and Javascript give you two shift right operators: the regular >> operator for computing arithmetic shift right, and the >>> operator for computing logical shift right to produce an unsigned binary integer with zeros shifted into the most significant bits.


IS THAT IT? WHEW!

No, that's not it! What else is there? Well, there are the increment and decrement operators ++ and --, but with -fwrapv they should work with wraparound semantics as expected.

The rest of the overflow items fall into what might be called Pathological Cases. These all involve asymmetry between INT_MIN and INT_MAX that causes unwanted aliasing.

We talked about divide earlier, and the one integer divide overflow is when you take -32768 and divide by -1.

# Let's divide -32768 / -1
a=np.int16(-32768)
b=np.int16(-1)
a/b
-32768

D'OH! We should have gotten +32768 but that just won't fit into a 16-bit signed integer, so it aliases to the int16_t bit-equivalent value -32768.

The same thing happens with unary minus:

# Let's negate -32768 
-a
-32768

Then there are the fixed-point multiplication flaws along the same line. Let's say you're doing Q15 math, where integers represent numbers of the form $\frac{k}{2^{15}}$. C code looks like this:

int16_t a = ...;
int16_t b = ...;
int16_t c = ((int32_t)a * b) >> 15;

Does this work properly without overflow? Well, it does, except for one single case. $-32768 \times -32768 = 2^{30}$, and if we shift that right by 15 we get $2^{15} = 32768$. But in a signed 16-bit integer, 32768 aliases to -32768. Ouch!


HOW DO WE DEAL WITH THIS?

Q15 multiplication

Well, the Q15 multiplication is a tough one. If you are absolutely certain you won't ever evaluate -32768 × -32768, go ahead and use the usual C code as is. One example is PI control loops where the gains are always nonnegative. Or if you know the range of one of the numbers doesn't include -32768 then you're fine.

Alternatively, if you're willing to have an extra shift right, you can do this:

int16_t a = ...;
int16_t b = ...;
int16_t c = ((int32_t)a * b) >> 16;

The shift-right-by-16 operation is usually faster in embedded systems than a shift by 15 by one instruction cycle, since it often maps to a "grab the high word" operation in assembly, which you usually get for free when storing memory. This operation represents $a \times b \times \frac{1}{2}$ if a and b are Q15 fixed-point integers, or $a \times b$
if one of the values is Q15 and the other is Q16.

Other than that, you need some way of saturating the intermediate product so it is safe to shift right:

int16_t a = ...;
int16_t b = ...;
int16_t c = sat30((int32_t)a * b) >> 15;

where sat30(x) limits the results so that they are within the range $-2^{30} + k_1$ to $2^{30} - 1 - k_2$, where k1 and k2 are any numbers between 0 and 32767. (Why this ambiguity? Because the least significant 15 bits don't matter, and certain processors may have quirks that allow you to execute code faster by choosing different values of k. For example, the literal value of $2^{30} - 1$ may not be possible to load in one instruction, whereas the literal value of $2^{30} - 32768$= 32767 << 15 may be possible to load in one instruction.)


Unary minus

The unary minus case is somewhat similar. We have a few alternatives:

If we are absolutely sure that the input cannot be -32768, we are fine leaving it as is.

Otherwise, we have to test for -32768 and convert to +32767.

Alternatively, in C we can use the bitwise complement operator ~, because ~x = -x-1 for signed integers: ~-32768 = 32767, ~0 = -1, ~-1 = 0, ~32767 = -32768. This adjustment by 1 for all inputs may not be acceptable in some applications, but in others it is fine, representing a small offset, and the complement operation is often a fast single-instruction operation.

The same idea applies to absolute value; instead of

define ABS(x) ((x) < 0 ? (-x) : (x))

we can use the bitwise complement operator:

define ABS_Q15(x) ((x) < 0 ? (~x) : (x))

Alternatively, we'd have to special-case the test for -32768.

Just a reminder: #define macros are invisible to the compiler (because they happen in the preprocessor) and are also unsafe to use with arguments that have side effects e.g. ABS(++x) which would get incremented three times.

回复数量: 0
    暂无评论~~
    • 请注意单词拼写,以及中英文排版,参考此页
    • 支持 Markdown 格式, **粗体**、~~删除线~~、`单行代码`, 更多语法请见这里 Markdown 语法
    • 支持表情,使用方法请见 Emoji 自动补全来咯,可用的 Emoji 请见 :metal: :point_right: Emoji 列表 :star: :sparkles:
    • 上传图片, 支持拖拽和剪切板黏贴上传, 格式限制 - jpg, png, gif
    • 发布框支持本地存储功能,会在内容变更时保存,「提交」按钮点击时清空
      请勿发布不友善或者负能量的内容。与人为善,比聪明更重要!
    Ctrl+Enter