Floating-Point Representation |
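Numbers in scientific notation consist of three parts: a mantissa (the significant digits), a base (the number that is raised to a power), and an exponent (the power to which the base is raised).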
Floating-point representation is a form of scientific notation, except that on most operating systems the base is not 10, but is either 2 or 16. The following table summarizes various representations of floating-point numbers that are stored in 8 bytes.
Summary of Floating-Point Numbers Stored in 8 Bytes

Representation | Base | Exponent Bits | Maximum Mantissa Bits
---|---|---|---
IBM mainframe | 16 | 7 | 56
OpenVMS VAX | 2 | 8 | 56
IEEE | 2 | 11 | 52
SAS allows for truncated floating-point numbers via the LENGTH statement, which reduces the number of mantissa bits. For more information on the effects of truncated lengths, see Storing Numbers with Less Precision.
In most situations, the way that SAS stores numeric values does not affect you as a user. However, floating-point representation can account for anomalies you might notice in SAS program behavior. The following sections identify the types of problems that can occur in various operating environments and how you can anticipate and avoid them.
Floating-Point Representation on IBM Mainframes |
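The IBM mainframe representation has the following layout:

    SEEEEEEE MMMMMMMM MMMMMMMM MMMMMMMM
     byte 1   byte 2   byte 3   byte 4

    MMMMMMMM MMMMMMMM MMMMMMMM MMMMMMMM
     byte 5   byte 6   byte 7   byte 8

This representation corresponds to 8 bytes of data, with each character being 1 bit: S is the sign bit, the seven E characters in byte 1 hold the exponent (stored with a bias of 64), and the M characters in bytes 2 through 8 are the 56 bits of the mantissa.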
Each bit in the mantissa represents a fraction whose numerator is 1 and whose denominator is a power of 2. For example, the leftmost bit in byte 2 represents $\frac{1}{2}$, the next bit represents $\frac{1}{4}$, and so on. In other words, the mantissa is the sum of a series of fractions such as $\frac{1}{2}$, $\frac{1}{4}$, $\frac{1}{8}$, and so on. Therefore, for any floating-point number to be represented exactly, you must be able to express it as such a sum multiplied by a power of the base. For example, 100 is represented as the following expression:

$$\left(\frac{1}{4} + \frac{1}{8} + \frac{1}{64}\right) \times 16^{2}$$
To illustrate how the above expression is obtained, two examples follow. The first example is in base 10. The value 100 is represented as follows:
100.
The period in this number is the radix point. The mantissa must be less than 1; therefore, you normalize this value by shifting the radix point three places to the left, which produces the following value:
.100
Because the radix point is shifted three places to the left, 3 is the exponent:
$.100 \times 10^{3} = 100$
The second example is in base 16. In hexadecimal notation, 100 (base 10) is written as follows:
64.
Shifting the radix point two places to the left produces the following value:
.64
Shifting the radix point also produces an exponent of 2, as in:
$.64 \times 16^{2}$
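The binary value of the hexadecimal mantissa .64 is .01100100, which can be represented in the following expression:

$$\left(\frac{1}{2}\right)^{2} + \left(\frac{1}{2}\right)^{3} + \left(\frac{1}{2}\right)^{6} = \frac{1}{4} + \frac{1}{8} + \frac{1}{64} = .390625$$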
In this example, the exponent is 2. To represent the exponent, you add the bias of 64 to the exponent. The hexadecimal representation of the resulting value, 66, is 42. The binary representation is as follows:
01000010 01100100 00000000 00000000 00000000 00000000 00000000 00000000
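The encoding steps above can be sketched in Python, which is used here only for illustration and is not SAS code. The helper name `ibm_hex_float` is hypothetical; the sketch truncates the mantissa rather than rounding it, and it does not check the exponent range.

```python
from fractions import Fraction


def ibm_hex_float(value):
    """Return the 8-byte IBM hexadecimal floating-point encoding of value.

    Hypothetical helper for illustration: the mantissa is truncated (not
    rounded) to 56 bits, and the exponent range is not checked.
    """
    x = Fraction(value)
    if x == 0:
        return bytes(8)
    sign = 0x80 if x < 0 else 0x00
    x = abs(x)
    exponent = 0
    # Normalize so that 1/16 <= x < 1, counting the base-16 shifts.
    while x >= 1:
        x /= 16
        exponent += 1
    while x < Fraction(1, 16):
        x *= 16
        exponent -= 1
    mantissa = int(x * 16**14)            # 14 hexadecimal digits = 56 bits
    first_byte = sign | (exponent + 64)   # sign bit plus biased exponent
    return bytes([first_byte]) + mantissa.to_bytes(7, "big")


encoded = ibm_hex_float(100)
print(" ".join(f"{b:02X}" for b in encoded))       # 42 64 00 00 00 00 00 00
print(" ".join(f"{b:08b}" for b in encoded[:2]))   # 01000010 01100100
```

Exact rational arithmetic (`Fraction`) is used so that the normalization loop introduces no rounding of its own.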
Floating-Point Representation on OpenVMS |
On OpenVMS, SAS stores numeric values in the D-floating format, which has the following scheme:
    MMMMMMMM MMMMMMMM MMMMMMMM MMMMMMMM
     byte 8   byte 7   byte 6   byte 5

    MMMMMMMM MMMMMMMM SEEEEEEE EMMMMMMM
     byte 4   byte 3   byte 2   byte 1
In D-floating format, the exponent is 8 bits instead of 7, but uses base 2 instead of base 16 and a bias of 128, which means the magnitude of the D-floating format is not as great as the magnitude of the IBM representation. The mantissa of the D-floating format is, physically, 55 bits. However, all floating-point values under OpenVMS are normalized, which means it is guaranteed that the high-order bit will always be 1. Because of this guarantee, there is no need to physically represent the high-order bit in the mantissa; therefore, the high-order bit is hidden.
For example, the decimal value 100 represented in binary is as follows:
01100100.
This value can be normalized by shifting the radix point as follows:
0.1100100
Because the radix point was shifted seven places to the left, the exponent is 7. Adding the bias of 128 gives 135. Represented in binary, the biased exponent is as follows:
10000111
To represent the mantissa, remove the hidden high-order bit from the normalized fraction .1100100, which leaves the following:
.100100
You can combine the sign (0), the exponent, and the mantissa to produce the D-floating format:
    MMMMMMMM MMMMMMMM MMMMMMMM MMMMMMMM
    00000000 00000000 00000000 00000000

    MMMMMMMM MMMMMMMM SEEEEEEE EMMMMMMM
    00000000 00000000 01000011 11001000
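The D-floating steps for the value 100 can be retraced with a short Python sketch (illustrative only, not SAS code). The helper name `d_floating_fields` is hypothetical, it handles positive values only, and it shows the logical sign, exponent, and mantissa bit string rather than the byte order in memory.

```python
import math


def d_floating_fields(value):
    """Split a positive value into D-floating sign, exponent, and mantissa fields.

    Hypothetical helper for illustration: zero, negative values, and
    exponent range checks are left out.
    """
    # frexp returns value = fraction * 2**exponent with 0.5 <= fraction < 1
    fraction, exponent = math.frexp(value)
    biased_exponent = exponent + 128
    # Drop the hidden high-order bit (worth 1/2) and keep the next 55 bits.
    stored_mantissa = int((fraction - 0.5) * 2**56)
    return 0, biased_exponent, stored_mantissa


sign, exponent, mantissa = d_floating_fields(100.0)
bits = f"{sign:01b}{exponent:08b}{mantissa:055b}"
print(exponent)    # 135
print(bits[:16])   # 0100001111001000
```

The first 16 bits printed correspond to bytes 2 and 1 in the layout above (01000011 11001000).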
Floating-Point Representation Using the IEEE Standard |
The value 1 is represented in the IEEE standard as follows (hexadecimal bytes):

    3F F0 00 00 00 00 00 00    (most operating systems)
    00 00 00 00 00 00 F0 3F    (OS/2)
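As a quick check of the two byte orders, the following Python sketch (illustrative only, not SAS code) packs the value 1 as an IEEE 8-byte number in both orders and then splits the bits into the sign, the 11-bit exponent, and the 52-bit mantissa. The exponent bias of 1023 is a property of the IEEE double format and is not stated in this section.

```python
import struct

big_endian = struct.pack(">d", 1.0)
little_endian = struct.pack("<d", 1.0)
print(" ".join(f"{b:02X}" for b in big_endian))      # 3F F0 00 00 00 00 00 00
print(" ".join(f"{b:02X}" for b in little_endian))   # 00 00 00 00 00 00 F0 3F

# Split the 64 bits into sign (1 bit), exponent (11 bits), mantissa (52 bits).
bits = int.from_bytes(big_endian, "big")
sign = bits >> 63
biased_exponent = (bits >> 52) & 0x7FF
mantissa = bits & ((1 << 52) - 1)
print(sign, biased_exponent - 1023, mantissa)        # 0 0 0
```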
Precision Versus Magnitude |
In Floating-Point Representation, you can see that the number of exponent bits and mantissa bits varies. The more bits that are reserved for the mantissa, the more precise the number; the more bits that are reserved for the exponent, the greater the magnitude the number can have.
Whether precision or magnitude is more important depends on the characteristics of your data. For example, if you are working with physics applications, very large numbers may be needed, and magnitude is probably more important. However, if you are working with banking applications, where every digit is important but the number of digits is not great, then precision is more important. Most often, SAS applications need a moderate amount of both precision and magnitude, which is sufficiently provided by floating-point representation.
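The trade-off can be made concrete with a rough Python estimate (illustrative only, not SAS code) that uses the base, exponent bits, and mantissa bits from the summary table. It assumes that each format stores its exponent with a bias of about half the exponent range, so the largest exponent is 2**(exponent bits - 1) - 1. The digit counts are upper bounds, and the magnitudes are order-of-magnitude estimates only.

```python
import math

# (base, exponent bits, mantissa bits) copied from the summary table
formats = {
    "IBM mainframe": (16, 7, 56),
    "OpenVMS VAX": (2, 8, 56),
    "IEEE": (2, 11, 52),
}

estimates = {}
for name, (base, exponent_bits, mantissa_bits) in formats.items():
    # Each mantissa bit carries log10(2) decimal digits of precision.
    decimal_digits = mantissa_bits * math.log10(2)
    # Assumption: the bias leaves this as the largest usable exponent.
    largest_exponent = 2 ** (exponent_bits - 1) - 1
    decimal_magnitude = largest_exponent * math.log10(base)
    estimates[name] = (decimal_digits, decimal_magnitude)
    print(f"{name:14} about {decimal_digits:.1f} digits, "
          f"largest value near 1E{decimal_magnitude:.0f}")
```

The IBM format reaches a large magnitude with only 7 exponent bits because its base is 16, while the IEEE format gives up a few mantissa bits in exchange for a much wider exponent range.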
Computational Considerations of Fractions |
Consider the IBM mainframe representation of .1:
40 19 99 99 99 99 99 99
Notice the repeating 9 digits: like 1/3 in decimal (.3333 ...), the value .1 has no finite representation in base 2 or base 16. This lack of precision is aggravated by arithmetic operations. Consider what would happen if you added the decimal representation of 1/3 several times. When you add .33333 ... to .99999 ... , the theoretical answer is 1.33333 ... 2, but in practice, this answer is not possible. The sums become imprecise as the values continue.
Likewise, the same process happens when the following DATA step is executed:
    data _null_;
       do i=-1 to 1 by .1;
          if i=0 then put 'AT ZERO';
       end;
    run;
The AT ZERO message in the DATA step is never printed because the accumulation of the imprecise number introduces enough error that the exact value of 0 is never encountered. The number is close, but never exactly 0. This problem is easily resolved by explicitly rounding with each iteration, as the following statements illustrate:
    data _null_;
       i=-1;
       do while(i<=1);
          i=round(i+.1,.001);
          if i=0 then put 'AT ZERO';
       end;
    run;
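The same effect can be reproduced outside SAS. The following Python sketch (illustrative only, using IEEE 8-byte numbers) mirrors the two DATA steps above; the variable names `hit_zero_plain` and `hit_zero_rounded` are introduced here for the example.

```python
# Accumulate 0.1 starting at -1, as the first DATA step does.
i = -1.0
hit_zero_plain = False
for _ in range(21):
    if i == 0:
        hit_zero_plain = True
    i += 0.1

# Same idea, but round to three decimal places after every addition,
# as the second DATA step does.
i = -1.0
hit_zero_rounded = False
while i <= 1:
    i = round(i + 0.1, 3)
    if i == 0:
        hit_zero_rounded = True

print(hit_zero_plain, hit_zero_rounded)   # False True
```

Rounding at every iteration keeps the accumulated error from growing, so the running value lands exactly on 0.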
Numeric Comparison Considerations |
As discussed in Computational Considerations of Fractions, imprecision can cause problems with computations. Imprecision can also cause problems with comparisons. Consider the following example in which the PUT statement is not executed:
    data _null_;
       x=1/3;
       if x=.33333 then put 'MATCH';
    run;
However, if you add the ROUND function, as in the following example, the PUT statement is executed:
    data _null_;
       x=1/3;
       if round(x,.00001)=.33333 then put 'MATCH';
    run;
In general, if you are doing comparisons with fractional values, it is good practice to use the ROUND function.
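The comparison behaves the same way in other languages that use binary floating point. A minimal Python sketch (illustrative only, not SAS code) follows. The last line uses a tolerance comparison with `math.isclose`, which is a different technique from the rounding that this section recommends and is shown only for contrast.

```python
import math

x = 1 / 3
plain_match = (x == 0.33333)
rounded_match = (round(x, 5) == 0.33333)
print(plain_match)     # False
print(rounded_match)   # True

# Alternative not covered in this section: compare within a tolerance.
print(math.isclose(x, 0.33333, abs_tol=1e-5))   # True
```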
Storing Numbers with Less Precision |
As discussed in Floating-Point Representation, the SAS System allows for numeric values to be stored on disk with less than full precision. Use the LENGTH statement to dictate the number of bytes that are used to store the floating-point number. Use the LENGTH statement carefully to avoid significant data loss.
For example, the IBM mainframe representation uses 8 bytes for full precision, but you can store as few as 2 bytes on disk. The value 1 is represented as 41 10 00 00 00 00 00 00 in 8 bytes. In 2 bytes, it would be truncated to 41 10. You still have the full range of magnitude because the exponent remains intact; there are simply fewer digits involved. A decrease in the number of digits means either fewer digits to the right of the decimal place or fewer digits to the left of the decimal place before trailing zeroes must be used.
For example, consider the number 1234567890, which would be .1234567890 multiplied by 10 to the 10th power (in base 10). If you have only five digits of precision, the number becomes 1234600000 (rounding up). Note that this is the case regardless of the power of 10 that is used (.12346, 12.346, .0000012346, and so on).
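The decimal analogy can be checked with a small Python sketch (illustrative only). The helper name `keep_significant_digits` is hypothetical, and it rounds in decimal, whereas a shortened SAS length drops binary or hexadecimal digits.

```python
def keep_significant_digits(value, digits=5):
    """Round value to the given number of significant decimal digits."""
    return float(f"{value:.{digits}g}")


for value in (1234567890, 12.34567890, 0.000001234567890):
    print(keep_significant_digits(value))
```

The three results keep the same five leading digits (12346) whatever the power of 10 is.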
The only reason to truncate length by using the LENGTH statement is to save disk space. All values are expanded to full size to perform computations in DATA and PROC steps. In addition, you must be careful in your choice of lengths, as the previous discussion shows.
Consider a length of 2 bytes on an IBM mainframe system. This value allows for 1 byte to store the exponent and sign, and 1 byte for the mantissa. The largest value that can be stored in 1 byte is 255. Therefore, if the exponent is 0 (meaning 16 to the 0th power, or 1 multiplied by the mantissa), then the largest integer that can be stored with complete certainty is 255. However, some larger integers can be stored because they are multiples of 16. For example, consider the 8-byte representation of the numbers 256 to 272 in the following table:
Value | Sign/Exp | Mantissa 1 | Mantissa 2-7 | Considerations
---|---|---|---|---
256 | 43 | 10 | 000000000000 | trailing zeros; multiple of 16
257 | 43 | 10 | 100000000000 | extra byte needed
258 | 43 | 10 | 200000000000 |
259 | 43 | 10 | 300000000000 |
... | ... | ... | ... |
271 | 43 | 10 | F00000000000 |
272 | 43 | 11 | 000000000000 | trailing zeros; multiple of 16
The numbers from 257 to 271 cannot be stored exactly in the first 2 bytes; a third byte is needed to store the number precisely. As a result, the following code produces misleading results:
    data temp;
       length x 2;
       x=257;
       y1=x+1;
    run;

    data _null_;
       set temp;
       if x=257 then put 'FOUND';
       y2=x+1;
    run;
The PUT statement is never executed because the value of X is actually 256 (the value 257 truncated to 2 bytes). Recall that 256 is stored in 2 bytes as 4310, but 257 is also stored in 2 bytes as 4310, with the third byte of 10 truncated.
You receive no warning that the value of 257 is truncated in the first DATA step. Note, however, that Y1 has the value 258 because the values of X are kept in full, 8-byte floating-point representation in the program data vector. The value is only truncated when stored in a SAS data set. Y2 has the value 257, because X is truncated before the number is read into the program data vector.
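The truncation can be traced byte by byte with a Python sketch (illustrative only, not SAS code). It decodes the IBM mainframe bytes for 257 taken from the table above, with and without truncation to 2 bytes. The helper names `ibm_decode` and `stored_with_length` are hypothetical.

```python
from fractions import Fraction


def ibm_decode(raw):
    """Decode 8 bytes of IBM hexadecimal floating point into an exact value."""
    sign = -1 if raw[0] & 0x80 else 1
    exponent = (raw[0] & 0x7F) - 64                     # remove the bias of 64
    mantissa = Fraction(int.from_bytes(raw[1:], "big"), 16**14)
    return sign * mantissa * Fraction(16) ** exponent


def stored_with_length(raw, length):
    """Keep the first `length` bytes, pad with zeros, and decode the result."""
    return ibm_decode(raw[:length].ljust(8, b"\x00"))


full_257 = bytes.fromhex("4310100000000000")            # 257 in 8 bytes
x_in_pdv = ibm_decode(full_257)                         # first DATA step
x_on_disk = stored_with_length(full_257, 2)             # stored with LENGTH 2
print(x_in_pdv, x_in_pdv + 1)     # 257 258  (X and Y1)
print(x_on_disk, x_on_disk + 1)   # 256 257  (X as read back, and Y2)
```

With a length of 3, the third byte (10) survives and the value 257 is read back unchanged.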
Truncating Numbers and Making Comparisons |
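The TRUNC function truncates a number to a requested length and then expands it back to full length, which duplicates the effect of storing a number with less than full length and then reading it. For example, if the variable in the following assignment is stored with a length of 3:

    x=1/3;

then the following comparison is not true:

    if x=1/3 then ...;

However, adding the TRUNC function makes the comparison true, as in the following:

    if x=trunc(1/3,3) then ...;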
Determining How Many Bytes Are Needed to Store a Number Accurately |
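The following program uses the TRUNC function to find the minimum number of bytes (MINLEN) needed to store each value in the data set NUMBERS exactly:

    data numbers;
       input value;
       datalines;
    269
    270
    271
    272
    ;

    data temp;
       set numbers;
       x=value;
       /* shorten the length until truncation changes the value */
       do L=8 to 1 by -1;
          if x NE trunc(x,L) then
             do;
                minlen=L+1;
                output;
                return;
             end;
       end;
    run;

    proc print noobs;
       var value minlen;
    run;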
The following output shows the results from this code.
Using the TRUNC Function
    The SAS System

    VALUE    MINLEN
     269        3
     270        3
     271        3
     272        2
Note that the minimum length required for the value 271 is greater than the minimum required for the value 272. This fact illustrates that it is possible for the largest number in a range of numbers to require fewer bytes of storage than a smaller number. If precision is needed for all numbers in a range, you should obtain the minimum length for all the numbers, not just the largest one.
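The search for a minimum length can be imitated in Python (illustrative only, not SAS code). The sketch works on the IEEE 8-byte encoding in big-endian order rather than on the IBM mainframe encoding, and the helper names `trunc` and `min_length` are hypothetical stand-ins for the TRUNC function and the DATA step loop. For these four values the byte counts agree with the output above, but that need not hold for other values or other representations.

```python
import struct


def trunc(value, length):
    """Keep the first `length` bytes of the 8-byte encoding and zero the rest."""
    raw = struct.pack(">d", float(value))
    return struct.unpack(">d", raw[:length].ljust(8, b"\x00"))[0]


def min_length(value):
    """Smallest number of leading bytes that still reproduces value exactly."""
    for length in range(1, 9):
        if trunc(value, length) == value:
            return length
    return 8


for value in (269, 270, 271, 272):
    print(value, min_length(value))
```

The value 272 needs fewer bytes than 271 because its encoding ends in zero bytes sooner, not because it is smaller.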
Double-Precision Versus Single-Precision Floating-Point Numbers |
The RBw.d informat might truncate double-precision floating-point numbers if the w value is less than the size of the double-precision floating-point number (8 on all the operating systems discussed in this section). Therefore, the RB8. informat corresponds to a full 8-byte floating point. The RB4. informat corresponds to an 8-byte floating point truncated to 4 bytes, exactly the same as a LENGTH 4 in the DATA step.
An 8-byte floating point that is truncated to 4 bytes might not be the same as float in a C program. In the C language, an 8-byte floating-point number is called a double. In FORTRAN, it is a REAL*8. In IBM's PL/I, it is a FLOAT BINARY(53). A 4-byte floating-point number is called a float in the C language, REAL*4 in FORTRAN, and FLOAT BINARY(21) in IBM's PL/I.
On the IBM mainframes and OpenVMS VAX, a single-precision floating-point number is exactly the same as a double-precision number truncated to 4 bytes. On operating systems that use the IEEE standard, this is not the case; a single-precision floating-point number uses a different number of bits for its exponent and uses a different bias, so that reading in values using the RB4. informat does not produce the expected results.
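The IEEE mismatch is easy to see in the bytes. The following Python sketch (illustrative only, not SAS code) compares a true 4-byte single-precision encoding of 1 with the first 4 bytes of the 8-byte encoding, and shows what each becomes when it is interpreted as the other kind of number.

```python
import struct

double_bytes = struct.pack(">d", 1.0)   # 3F F0 00 00 00 00 00 00
single_bytes = struct.pack(">f", 1.0)   # 3F 80 00 00

# A true single read as if it were a double truncated to 4 bytes.
as_truncated_double = struct.unpack(">d", single_bytes.ljust(8, b"\x00"))[0]
print(as_truncated_double)              # 0.0078125

# A double truncated to 4 bytes read as if it were a true single.
truncated = double_bytes[:4]
as_single = struct.unpack(">f", truncated)[0]
print(as_single)                        # 1.875
```

Neither interpretation returns 1, because the single-precision format uses a different number of exponent bits and a different bias.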
Transferring Data between Operating Systems |
The table Summary of Floating-Point Numbers Stored in 8 Bytes shows the base and the number of exponent and mantissa bits for each representation. Because the maximum values that can be stored differ among operating environments, there might be problems in transferring your floating-point data from one machine to another.
Consider, for example, transporting data between an IBM mainframe and a PC. The IBM mainframe has a range limit of approximately .54E-78 to .72E76 (and their negative equivalents and 0) for its floating-point numbers. Other machines, such as the PC, have wider limits (the PC has an upper limit of approximately 1E308). Therefore, if you are transferring numbers in the magnitude of 1E100 from a PC to a mainframe, you lose that magnitude. During data transfer, the number is set to the minimum or maximum allowable on that operating system, so 1E100 on a PC is converted to a value that is approximately .72E76 on an IBM mainframe.
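The loss of magnitude can be sketched in Python (illustrative only; this is not how SAS performs the transfer). The constants `IBM_MAX` and `IBM_MIN` are approximations derived from a base of 16 and a 7-bit exponent with a bias of 64, and the helper name `clamp_to_ibm_range` is hypothetical.

```python
IBM_MAX = 16.0 ** 63     # about .72E76, the approximate upper limit
IBM_MIN = 16.0 ** -65    # about .54E-78, the approximate lower limit


def clamp_to_ibm_range(value):
    """Force the magnitude of value into the approximate IBM mainframe range."""
    if value == 0:
        return 0.0
    magnitude = min(max(abs(value), IBM_MIN), IBM_MAX)
    return magnitude if value > 0 else -magnitude


print(clamp_to_ibm_range(1e100))   # magnitude is lost; result is IBM_MAX
print(clamp_to_ibm_range(123.0))   # 123.0, already within range
```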
Copyright 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.