* One of the common misunderstandings among computer users is a certain faith in the infallibility of numerical computations. That is, if a computer user multiplies, say:

3 * (1/3)

-- the result that would be expected would be exactly 1. It is a shock to this faith to find out that, on closer inspection, the result might prove to be something more like 0.99999999923475. This seems to clearly indicate a bug in the system, and it's a bigger shock to find out that, no, that's the way it's supposed to work. This document explains such issues in detail.

[1] BITS, BYTES, NYBBLES, & UNSIGNED INTEGERS

[2] OCTAL & HEX NUMBERS

[3] SIGNED INTEGERS & 2'S COMPLEMENT

[4] FLOATING-POINT NUMBERS

[5] ENCODING TEXT: ASCII & STRINGS

[6] COMMENTS: BINARY CONVERSION, BINARY MATH, BCD, INDEFINITE PRECISION

[7] COMMENTS, SOURCES, & REVISION HISTORY

* Most computer users understand the concept of a "bit", or in computer terms, a 1 or 0 value encoded by the setting of a switch. It's not much more difficult to see that two bits can be used to represent four unique states:

00 01 10 11

Three bits can be used to represent eight unique states:

000 001 010 011 100 101 110 111

Every bit that is added doubles the number of states that can be represented. Most computers operate on information 8 bits, or some multiple of 8 bits, like 16, 32, or 64 bits, at a time. This collection of bits is more-or-less fundamental, and has been given the odd name of the "byte". A computer's processor accepts data a byte or multiples of a byte at a time. Memories are organized to store data a byte or multiples of a byte per each addressable location.

Actually, in some cases 4 bits is a convenient number of bits to deal with, and this collection of bits is called, somewhat painfully, the "nybble". In this document, we will refer to "nybbles" often, but please remember that in reality the term "byte" is common, while the term "nybble" is not. A nybble can encode 16 different values, such as the numbers 0 to 15. Any arbitrary sequence of bits could be used in principle, but in practice the most natural way is as follows:

0000 = decimal 0 1000 = decimal 8 0001 = decimal 1 1001 = decimal 9 0010 = decimal 2 1010 = decimal 10 0011 = decimal 3 1011 = decimal 11 0100 = decimal 4 1100 = decimal 12 0101 = decimal 5 1101 = decimal 13 0110 = decimal 6 1110 = decimal 14 0111 = decimal 7 1111 = decimal 15

This is natural because it follows our instinctive way of considering a normal decimal number. For example, given the decimal number:

7531

-- we automatically interpret this as:

7000 + 500 + 30 + 1 7 * 1000 + 5 * 100 + 3 * 10 + 1 * 1

-- or, using powers-of-10 notation:

7 * 10^3 + 5 * 10^2 + 3 * 10^1 + 1 * 10^0

Remember that any number to the "0th" power is 1.

Each digit in the number represents a value from 0 to 9, which is ten
different possible values, and that's why it's called a decimal or "base-10"
number. Each digit also has a "weight" of a power of ten proportional to its
position. This sounds complicated, but it's not; it's exactly what we take
for granted when we look at a number. We know the weighted value of each
digit without even having to *think* about it.

Similarly, in the bit-oriented number encoding scheme explained above, the value 13 is encoded as:

1101

Each bit can only have a value of 1 or 0, which is two values, making this a "binary", or "base-2" number. Accordingly, the "positional weighting" is as follows:

1101 = 1 x 2^3 + 1 x 2^2 + 0 x 2^1 + 1 x 2^0 = 1 x 8 + 1 x 4 + 0 x 2 + 1 x 1 = 13 decimal

Notice the values of powers of 2 used here: 1, 2, 4, 8. People who get into the guts of computers generally get to know the powers of 2 up to the 16th power by heart, not because they memorize them, but because they use them a great deal:

2^0 = 1 2^8 = 256 2^1 = 2 2^9 = 512 2^2 = 4 2^10 = 1,024 2^3 = 8 2^11 = 2,048 2^4 = 16 2^12 = 4,096 2^5 = 32 2^13 = 8,192 2^6 = 64 2^14 = 16,384 2^7 = 128 2^15 = 32,768 2^16 = 65,536

Another thing to remember is that, aping the metric system, the value 2^10 = 1,024 is referred to as "kilo", or simply "K", so any higher powers of 2 are often conveniently referred to as multiples of that value:

2^11 = 2 K = 2,048 2^12 = 4 K = 4,096 2^13 = 8 K = 8,192 2^14 = 16 K = 16,384 2^15 = 32 K = 32,768 2^16 = 64 K = 65,536

Similarly, the value 2^20 = 1,024 x 1,024 = 1,048,576 is referred to as a "meg", or simply "M":

2^21 = 2 M 2^22 = 4 M

-- and the value 2^30 is referred to as a "gig", or simply "G". Use of these prefixes can get a bit confusing since in some documents it can be unclear if, say, "kilo" actually means 1,024 or if it means 1,000. Usually in computer discussions it means 1,024, but some writers are sloppy on this point. In any case, we'll see these prefixes often as we continue.

There is a subtlety in this discussion. If we use 16 bits, we can have 65,536 different values, but the values are from 0 to 65,535. People start counting at one, machines start counting from zero -- think of the odometer on a car for an example. This small and mildly confusing fact even trips up the professionals every now and then.

* Anyway, this defines a simple way to count with bits, but it has a few restrictions:

- Arithmetic can only be performed within the bounds of the number of bits
available. That is, if we're working with 16 bits at a time, we can't
perform arithmetic that gives a result of 65,536 or more, or we get an
error called a "numeric overflow". The formal term is that we are working
with "finite precision" values.

- There's no way to represent fractions with this scheme. We can only work
with non-fractional "integer" quantities.

- There's no way to represent negative numbers with this scheme. All the numbers are "unsigned".

Despite these limitations, such "unsigned integer" numbers are very useful in computers, mostly for simply counting things up one-by-one. They're very easy for the computer to manipulate. Generally computers use 16-bit and 32-bit unsigned integers, normally referred to as "integer" and "long integer" (or just "long") numbers. An "integer" allows for numbers from 0 to 65,535, while a "long" allows for integer numbers from 0 to 4,294,967,296.

* Let's take a side-trip to discuss representation of binary numbers. Computer professionals often need to write out binary quantities, but in practice writing out a binary number like:

1001 0011 0101 0001

-- is a pain, and inclined to error as well. Generally computer professionals write binary quantities in a base-8 ("octal") or, much more commonly in modern times, a base-16 ("hexadecimal" or "hex") number format.

This is another thing that sounds tricky but is actually simple -- if it wasn't, there wouldn't be any point in doing it. In our normal decimal system, we have 10 digits (0 through 9) and count up as follows:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 ...

In an octal system, we only have 8 digits (0 through 7) and we count up through the same sequence of numbers as follows:

0 1 2 3 4 5 6 7 10 11 12 13 14 15 16 17 20 21 22 23 24 25 26 ...

That is, an octal "10" is the same as a decimal "8", an octal "20" is a decimal 16, and so on.

In a hex system, we have 16 digits (0 through 9 followed, by convention, with a through f) and we count up through the sequence as follows:

0 1 2 3 4 5 6 7 8 9 a b c d e f 10 11 12 13 14 15 16 ...

That is, a hex "10" is the same as a decimal "16" and a hex "20" is the same as a decimal "32".

Each of these number systems are positional systems, but while decimal weights are powers of 10, the octal weights are powers of 8 and the hex weights are powers of 16. For example:

octal 756 = 7 * 8^2 + 5 * 8^1 + 6 * 8^0 = 7 * 64 + 5 * 8 + 6 * 1 = 448 + 40 + 6 = decimal 494 hex 3b2 = 3 * 16^2 + 11 * 16^1 + 2 * 16^0 = 3 * 256 + 11 * 16 + 2 * 1 = 768 + 176 + 2 = decimal 946

OK, if this is unclear, don't worry about it too much. The real point is that an octal digit has a perfect correspondence to a 3-bit binary value number:

000 = octal 0 001 = octal 1 010 = octal 2 011 = octal 3 100 = octal 4 101 = octal 5 110 = octal 6 111 = octal 7

Similarly, a hex digit has a perfect correspondence to a 4-bit binary number:

0000 = hex 0 1000 = hex 8 0001 = hex 1 1001 = hex 9 0010 = hex 2 1010 = hex a 0011 = hex 3 1011 = hex b 0100 = hex 4 1100 = hex c 0101 = hex 5 1101 = hex d 0110 = hex 6 1110 = hex e 0111 = hex 7 1111 = hex f

So it is easy to convert a long binary number, such as 1001001101010001, to octal:

1 001 001 101 010 001 binary = 111521 octal

-- and easier still to convert that number to hex:

1001 0011 0101 0001 = 9351 hex

-- but it takes a lot of figuring to convert it to decimal (37,713 decimal). Octal and hex provide a convenient way to represent binary "machine" quantities.

* After defining unsigned binary numbers, the next step is to modify this scheme to provide negative numbers, or "signed integers". The simplest thing to do would be to reserve one bit to indicate the sign of a number. This "sign bit" would probably be the leftmost bit, though it could be the rightmost for all it matters. If the sign bit is 0, the number is positive, and if the sign bit is 1, the number is negative.

This works, and it is used in a few obscure applications, but although it's the obvious solution from a human point of view, it makes life hard for machines. For example, this scheme gives a positive and a negative value for zero! A human might shrug at that, but it gives a machine headaches.

The more natural way, from the machine point of view, is to split the range of binary numbers for a given number of bits in half and use the top half to represent negative numbers. For example, 4-bit data give eight positive and eight negative values:

0000 = decimal 0 0001 = decimal 1 0010 = decimal 2 0011 = decimal 3 0100 = decimal 4 0101 = decimal 5 0110 = decimal 6 0111 = decimal 7 1000 = decimal -8 1001 = decimal -7 1010 = decimal -6 1011 = decimal -5 1100 = decimal -4 1101 = decimal -3 1110 = decimal -2 1111 = decimal -1

Now we have a "signed integer" number system, using a scheme known as, for reasons unimportant here, "two's complement". With a 16-bit signed integer number, we can encode numbers from -32,768 to 32,767. With a 32-bit signed integer number, we can encode numbers from -2,147,483,648 to 2,147,482,647.

This has some similarities to the sign-bit scheme in that a negative number has its topmost bit set to "1", but the two concepts are different. In sign-magnitude numbers, a "-5" is:

1101

-- while in two's complement numbers, it is:

1011

-- which in sign-magnitude numbers is "-3". Why two's complement is simpler for machines to work with is explained later.

* So now we can represent unsigned and signed integers as binary quantities. Remember that these are just two ways of interpreting a pattern of bits. If a computer has a binary value in memory of, say:

1101

-- this could be interpreted as a decimal "13" or a decimal "-3".

* While both unsigned and signed integers are used in digital systems, even a 32-bit integer isn't big enough to handle all the range of numbers a calculator can handle, and we still haven't addressed the problem of fractions. To handle fractions and obtain greater range we have to abandon signed integers and go to a "floating-point" format.

In the decimal system, we are familiar with floating-point numbers of the form:

1.1030402 x 10^5 = 1.1030402 x 100000 = 110304.02

-- or, more compactly:

1.1030402E5

-- which means "1.103402 times 1 followed by 5 zeroes". We have a certain numeric value (1.1030402) known as a "mantissa", multiplied by a power of 10 (E5, meaning 10^5 or 100,000), known as an "exponent".

If we have a negative exponent, that means the number is multiplied by 1 that many places to the right of the decimal point. For example:

2.3434E-6 = 2.3434 x 10^-6 = 2.3434 x 0.000001 = 0.0000023434

The advantage of this scheme is that by using the exponent we can get a much wider range of numbers, even if the number of digits in the mantissa, or the "numeric precision", is much smaller than the range.

Similar binary floating-point formats can be defined for computers. There are a number of such schemes, but one of the most popular has been defined by IEEE (Institute of Electrical & Electronic Engineers, a US professional and standards organization). The IEEE 754 specification defines 64 bit floating-point format with:

- An 11-bit binary exponent, using "excess-1023" format. Excess-1023 means
the exponent appears as a unsigned binary integer from 0 to 2047, and we
need to subtract 1023 from it to get the actual signed value.

- A 52-bit mantissa, also an unsigned binary number, defining a fractional
value with a leading implied "1".

- A sign bit, giving the sign of the mantissa.

Let's see what this format looks like by showing how such a number would be stored in eight bytes of memory:

byte 0: S x10 x9 x8 x7 x6 x5 x4 byte 1: x3 x2 x1 x0 m51 m50 m49 m48 byte 2: m47 m46 m45 m44 m43 m42 m41 m40 byte 3: m39 m38 m37 m36 m35 m34 m33 m32 byte 4: m31 m30 m29 m28 m27 m26 m25 m24 byte 5: m23 m22 m21 m20 m19 m18 m17 m16 byte 6: m15 m14 m13 m12 m11 m10 m9 m8 byte 7: m7 m6 m5 m4 m3 m2 m1 m0

-- where "S" denotes the sign bit, "x" denotes an exponent bit, and "m" denotes a mantissa bit. Once the bits here have been extracted, they are converted with the computation:

<sign> * (1 + <fractional_mantissa>) * 2^(<exponent> - 1023>)

This scheme provides numbers valid out to 15 decimal digits, with the following range of numbers:

____________________________________________________________ maximum minimum ____________________________________________________________ positive 1.797693134862231E+308 4.940656458412465E-324 negative -4.940656458412465E-324 -1.797693134862231E+308 ____________________________________________________________

The spec also defines several special values that are not defined numbers, and are known as "NANs", for "Not A Number". These are used by programs to designate overflow errors and the like. The only reason for most users to know about NANs is to avoid confusion when an error message pops up that says: "Not A Number!"

Some programs also use 32-bit floating-point numbers. The most common scheme uses a 23-bit mantissa with a sign bit, plus an 8-bit exponent in "excess-127" format, giving 7 valid decimal digits.

byte 0: S x7 x6 x5 x4 x3 x2 x1 byte 1: x0 m22 m21 m20 m19 m18 m17 m16 byte 2: m15 m14 m13 m12 m11 m10 m9 m8 byte 3: m7 m6 m5 m4 m3 m2 m1 m0

The bits are converted to a numeric value with the computation:

<sign> * (1 + <fractional_mantissa>) * 2^(<exponent> - 127)

-- leading to the following range of numbers:

________________________________________ maximum minimum ________________________________________ positive 3.402823E+38 2.802597E-45 negative -2.802597E-45 -3.402823E+38 ________________________________________

Such floating-point numbers are known as "reals" or "floats" in general, but with a number of inconsistent variations, depending on context:

- A 32-bit float value is sometimes called a "real32" or a "single",
meaning "single-precision floating-point value".

- A 64-bit float is sometimes called a "real64" or a "double", meaning "double-precision floating-point value".

The term "real" without any elaboration generally means a 64-bit value, while the term "float" generally means a 32-bit value.

* Once again, remember that bits are bits. If we have eight bytes stored in computer memory, it might be a 64-bit real, two 32-bit reals, or four signed or unsigned integers, or some other kind of data that fits into eight bytes. The only difference is how the computer interprets them. If the computer stored four unsigned integers and then read them back from memory as a 64-bit real, it almost always would be a perfectly valid real number, though it would be junk data.

* So now our computer can handle positive and negative numbers with fractional parts. However, even with floating-point numbers, we run into some of the same problems that we did with integers, and then some:

- As with integers, we only have a finite range of values to deal with.
Granted, it's a much
*bigger*range of values than even a 32-bit integer, but if we keep multiplying numbers we'll eventually get one bigger than the real value can hold and have a "numeric overflow".

If we keep dividing we'll eventually get one with a negative exponent too big for the real value to hold and have a "numeric underflow". Remember that a negative exponent gives the number of places to the*right*of the decimal point and means a really*small*number.

The maximum real value is sometimes called "machine infinity", since that's the biggest value the computer can wrap its little silicon brain around.

- A related problem is that we have only limited "precision" as well. That
is, we can only represent 15 decimal digits with a 64-bit real. If the
result of a multiply or a divide has more digits than that, they're just
dropped and the computer doesn't inform the user of an error.

That means that if we add a very small number to a very large one, the result is just the large one. The small number was too small to even show up in 15 or 16 digits of resolution, and the computer effectively discards it. Calculations that give really insane answers when they normally work may be suffering from a range mismatch between the values involved in the calculation. It may be possible to "scale" the values to get more accurate results.

Limited precision also means that if we do floating-point computations, there's likely to be a small error in the result since some lower digits have been dropped. This effect is unnoticeable in most cases, but if we do some math analysis that requires lots of computations, the errors tend to build up and can throw off the results, sometimes wildly. The faction of people who use computers for doing math understand these errors very well, and have methods for minimizing the effects of such errors, as well as for estimating how big the errors are.

By the way, this "precision" problem is not the same as the "range" problem at the top of this list. The range issue deals with the maximum size of the exponent, while the precision issue deals with the number of digits that can fit into the mantissa.

- Another more obscure error that creeps in with floating-point numbers is
the fact that the mantissa is expressed as
*binary*fraction that doesn't necessarily perfectly match a decimal fraction.

That is, if we want to do a computation on a decimal fraction that is a neat sum of reciprocal powers of two, such as 0.75, the binary number that represents this fraction will be 0.11, or 1/2 + 1/4, and all will be fine.

Unfortunately, in many cases we can't get a sum of these "reciprocal powers of 2" that precisely matches a specific decimal fraction, and the results of computations will be very slightly off, way down in the very small parts of a fraction. For example, the decimal fraction "0.1" is equivalent to an infinitely repeating binary fraction: 0.000110011 ...

If this seems unclear, once again, don't worry too much. The point here is that a computer isn't magic, it's a machine and is subject to certain rules and constraints. Although many people place a naive faith in the answers computers give, even under the best of circumstances these machines have a certain unavoidable inexactness built into them.

* So now we have several means for using bits to encode numbers. But what about text? How can a computer store names, addresses, letters to the folks?

Well, recall that bits are bits, and there's no reason that a set of bits can't be used to represent a character like "A" or "?" or "z" or whatever. Since most computers work on data a byte at a time, it is convenient to use a single byte to represent a single character. For example, we could assign the bit pattern:

0100 0110 (hex 46)

-- to the letter "F", for example. The computer sends such "character codes" to its display to print the characters that make up the text we see.

There is a traditional binary encoding for western text characters, known as the "American Standard Code for Information Interchange (ASCII)". The following table shows the characters represented by ASCII, with each character followed by its value in decimal ("d"), hex ("h"), and octal ("o"):

ASCII Table ______________________________________________________________________ ch ctl d h o ch d h o ch d h o ch d h o ______________________________________________________________________ NUL ^@ 0 0 0 sp 32 20 40 @ 64 40 100 ' 96 60 140 SOH ^A 1 1 1 ! 33 21 41 A 65 41 101 a 97 61 141 STX ^B 2 2 2 " 34 22 42 B 66 42 102 b 98 62 142 ETX ^C 3 3 3 # 35 23 43 C 67 43 103 c 99 63 143 EOT ^D 4 4 4 $ 36 24 44 D 68 44 104 d 100 64 144 ENQ ^E 5 5 5 % 37 25 45 E 69 45 105 e 101 65 145 ACK ^F 6 6 6 & 38 26 46 F 70 46 106 f 102 66 146 BEL ^G 7 7 7 ` 39 27 47 G 71 47 107 g 103 67 147 BS ^H 8 8 10 ( 40 28 50 H 72 48 110 h 104 68 150 HT ^I 9 9 11 ) 41 29 51 I 73 49 111 i 105 69 151 LF ^J 10 a 12 * 42 2a 52 J 74 4a 112 j 106 6a 152 VT ^K 11 b 13 _ 43 2b 53 K 75 4b 113 k 107 6b 153 FF ^L 12 c 14 , 44 2c 54 L 76 4c 114 l 108 6c 154 CR ^M 13 d 15 _ 45 2d 55 M 77 4d 115 m 109 6d 155 SO ^N 14 e 16 . 46 2e 56 N 78 4e 116 n 110 6e 156 SI ^O 15 f 17 / 47 2f 57 O 79 4f 117 o 111 6f 157 DLE ^P 16 10 20 0 48 30 60 P 80 50 120 p 112 70 160 DC1 ^Q 17 11 21 1 49 31 61 Q 81 51 121 q 113 71 161 DC2 ^R 18 12 22 2 50 32 62 R 82 52 122 r 114 72 162 DC3 ^S 19 13 23 3 51 33 63 S 83 53 123 s 115 73 163 DC4 ^T 20 14 24 4 52 34 64 T 84 54 124 t 116 74 164 NAK ^U 21 15 25 5 53 35 65 U 85 55 125 u 117 75 165 SYN ^V 22 16 26 6 54 36 66 V 86 56 126 v 118 76 166 ETB ^W 23 17 27 7 55 37 67 W 87 57 127 w 119 77 167 CAN ^X 24 18 30 8 56 38 70 X 88 58 130 x 120 78 170 EM ^Y 25 19 31 9 57 39 71 Y 89 59 131 y 121 79 171 SUB ^Z 26 1a 32 : 58 3a 72 Z 90 5a 132 z 122 7a 172 ESC ^[ 27 1b 33 ; 59 3b 73 [ 91 5b 133 { 123 7b 173 FS ^\ 28 1c 34 < 60 3c 74 \ 92 5c 134 | 124 7c 174 GS ^] 29 1d 35 = 61 3d 75 ] 93 5d 135 } 125 7d 175 RS ^^ 30 1e 36 > 62 3e 76 ^ 94 5e 136 ~ 126 7e 176 US ^_ 31 1f 37 ? 63 3f 77 _ 95 5f 137 DEL 127 7f 177 ______________________________________________________________________

The strange characters listed in the leftmost column, such as "FF" and "BS", do not correspond to text characters. Instead, they correspond to "control" characters that, when sent to a printer or display device, execute various control functions. For example, "FF" is a "form feed" or printer page eject, "BS" is a backspace, , and "BEL" in principle causes a beep ("bell").

In a text editor, they'll just be shown as a little block or a blank space or (once upon a time) little smiling faces, musical notes, and other bizarre items. In some application programs, they can be typed in by holding down the CTRL key and pressing an appropriate code. For example, pressing CTRL and entering "G" gives CTRL-G, or "^G" in the table above, the BEL character.

The ASCII table above only defines 128 characters, which implies that ASCII characters only need 7 bits. However, since most computers store information in terms of bytes, normally there will be one character stored to a byte. This extra bit allows a second set of 128 characters, an "extended" character set, to be defined beyond the 128 defined by ASCII. In practice, there are a number of different extended character sets, providing such features as math symbols, cute little line-pattern building block characters for building forms, and extension characters for non-English languages. The fact that there are a number of different definitions for the 128 extended characters can lead to confusion on occasion.

This table serves to emphasize one of the main ideas of this document: bits are bits. In this case, we have bits representing characters. We can describe the particular code for a particular character in decimal, octal, or hexadecimal, but it's still the same code. The value that is expressed, whether it is in decimal, octal, or hex, is simply the same pattern of bits.

Of course, we normally want to use many characters at once to display sentences and the like, such as:

Tiger, tiger burning bright!

This is simply represented as a sequence of ASCII codes, represented in hex below:

54 69 67 65 72 2c 20 74 69 67 65 72 20 62 75 ...

Computers store such "strings" of ASCII characters as "arrays" of consecutive memory locations. Some applications also include a binary value as part of the string to show many characters are stored in it. More commonly, applications use a special character, usually a NULL (the character with ASCII code 0), as a "terminator" to indicate the end of the string. Most of the time users will not need to worry about these details, since the application takes care of them automatically -- though users who are writing programs that manipulate characters and strings will have to understand how they are implemented.

* Now let's consider a particularly confusing issue for the newcomer: the fact that we can represent a number in ASCII as a string, for example:

1.537E3

When a computer displays this value, it actually sends the following ASCII codes, represented in hex, to the display:

31 ["1"] 2e ["."] 35 ["5"] 33 ["3"] 37 ["7"] 45 ["E"] 33 ["3"]

The confusion arises because the computer could store the value 1.537E3 as, say, a 32-bit real, in which we get a pattern of 4 bytes that make up the exponent and mantissa and all that. To display the 32-bit real, the computer has to translate it to the ASCII string just shown above, as an "ASCII numeric representation", or just "ASCII number". If it just displayed the 32-bit binary real number directly, we'd get four "garbage" characters.

But, now to get really confusing, suppose we wanted to view the bits of a 32-bit real directly, bypassing conversion to the ASCII string value. Then the computer would display something like:

10110011 10100000 00110110 11011111

The trick is that to display these values, the computer uses the ASCII characters for "1", "0", and " " (space character), with hex values as follows:

31 30 31 31 30 30 31 31 20 31 30 31 30 30 ...

It could also display the bits as an octal or hex ASCII value. Sometimes users will say they are dealing with "hex numbers"; what they usually mean is that they are manipulating binary values that are presented by hex numbers in ASCII.

Confused? Don't feel too bad, even the experienced get subtly confused with this issue sometimes. The essential point is that the values the computer works on are just sets of bits. To actually see the values requires an ASCII representation of them. Or to put it simply: machines work with bits and bytes, humans work with ASCII, and there has to be translation between the two to communicate.

* Eight bits is clearly not enough to allow representation of, say, Japanese characters, since their basic set is a little over 2,000 different characters, and the full set several times that. The need to encode all the world's character sets led to a master character encoding scheme known as "Unicode" which defines character codes for Western, Asian, Indic, Hebrew, and other character sets, including even Egyptian hieroglyphics. There are several implementations of Unicode, the best-known being the "Unicode Transformation Format (UTF)", with different variants depending on bit length -- including "UTF-8", "UTF-16", and "UTF-32".

* For readers who have managed to wade this far without becoming bewildered and want a more complete understanding, a few more details may be considered.

The first detail is the translation of binary numbers into decimal numbers and the reverse. It's easy to do. There are formal arithmetic methods, "algorithms", for performing such conversions, but most people who do such things often have a fancy calculator to do the work for them, and those who don't do it often won't remember how to do anything but the brute-force method. The brute-force method can be performed simply on an ordinary pocket calculator.

In almost all cases where people want to do this, they're converting an unsigned integer decimal value to a hex value that corresponds to a binary number. They almost never convert a signed or floating-point value. The main reason to convert to binary is just to get a specific pattern of bits, commonly for interfacing to computer hardware, since computer professionals may need to send various patterns of "on" and "off" bits to control a certain device.

Usually the binary value is not more than 16 bits long, meaning the unsigned binary equivalent of the bit pattern is no larger than 65,535. Anyone who can remember the powers of 2, or just have them written down handy, can perform the conversion very easily.

Suppose we have a decimal number like, say, 46,535, that we want to convert to a 16-bit binary pattern. All we have to do is work our way down the list of powers of two and follow a few simple rules.

First, take the highest power of 2 in the table, or 32,768, and check to see if it is larger than the decimal number or not. It isn't, write down a "1":

1

-- and then subtract 32,768 from 46,535 on a calculator. This gives 13,767.
Go down to the next power of two, or 16,384, and compare. This *is* larger
than 13,767, so write down a "0":

10

Compare 13,767 to the next lower power of 2, which is 8192. This isn't larger than 13,767, so write down a "1":

101

-- and subtract 8192 from 13,767 to get 5,575. Repeat this procedure until we have all 16 binary digits. In summary, the conversion looks like this:

46,535 greater than or equal to 32,768? Yes, subtract, write: 1 13,767 greater than or equal to 16,384? No, write: 0 13,767 greater than or equal to 8,192? Yes, subtract, write: 1 5,575 greater than or equal to 4,096? Yes, subtract, write: 1 1,479 greater than or equal to 2,048? No, write: 0 1,479 greater than or equal to 1,024? Yes, subtract, write: 1 455 greater than or equal to 512? No, write: 0 455 greater than or equal to 256? Yes, subtract, write: 1 199 greater than or equal to 128? Yes, subtract, write: 1 71 greater than or equal to 64? Yes, subtract, write: 1 7 greater than or equal to 32? No, write: 0 7 greater than or equal to 16? No, write: 0 7 greater than or equal to 8? No, write: 0 7 greater than or equal to 4? Yes, subtract, write: 1 3 greater than or equal to 2? Yes, subtract, write: 1 1 greater than or equal to 1? Yes, subtract, write: 1

This gives:

46,535 decimal = 1011 0101 1100 0111 binary = b5c7 hex

Of course, converting back is a simple multiplication exercise. Doing it in hex is easier:

b5c7 hex = 11 x 16^3 + 5 x 16^2 + 12 x 16 + 7 x 1 = 11 x 4096 + 5 x 256 + 12 x 16 + 7 = 45,056 + 1,280 + 192 + 7 = 46,535

Again, this is a clumsy means of converting between the number bases, but since most people don't do this often, or at all, it pays to have a technique that is at least easy to remember. Those who end up doing it often will find some better way of doing it than by hand.

* The next interesting topic is the operations that can be performed on binary numbers. We'll consider signed and unsigned integers first.

The most basic operations on binary values are the simple logical operations AND, OR, XOR ("exclusive OR"), and NOT. Performing them on some sample byte data gives:

1111 0000 1111 0000 1111 0000 OR 1010 1010 AND 1010 1010 XOR 1010 1010 NOT 1010 1010 ____________ _____________ _____________ _____________ 1111 1010 1010 0000 0101 1010 0101 0101

The rules for these operations are as follows:

AND: The result is 1 if both values are 1, otherwise it's 0. OR: The result is 1 if either value is 1, otherwise it's 0. XOR: The result is 1 if only one value is 1, otherwise it's 0. NOT: The result is 1 if the value is 0 and 0 if the value is 1.

A NOT operation is also called "inverting". A computer uses these operations a great deal, either for checking and setting or "twiddling" bits in its hardware, or as building blocks for more complicated operations, such as addition or subtraction.

* Binary addition is a more interesting operation. As discussed in the formal rules for adding a decimal number, we add the numbers one digit at a time, and if the result is ten or more, we perform a "carry" to add to the next digit. For example, if we perform the addition:

374 + 452

-- we first add:

4 + 2 = 6

-- then:

7 + 5 = 12

-- which means that we have to "carry" the "1" to the next addition:

( 1 + ) 3 + 4 = 8

-- giving the result: 374 + 452 = 826.

Performing additions in binary are essentially the same, except that the number of possible results of an addition of two bits is minimal:

0 + 0 = 0 0 + 1 = 1 1 + 0 = 1 1 + 1 = 10

In the last case, this implies a "carry" to the next digit in the binary number. So performing a binary addition on two bytes would look like this:

0011 1010 58 + 0001 1101 + 29 ___________ ____ 0101 0111 87 CC C

The bits on which a carry occurred are marked with a "C". The equivalent decimal addition is shown to the right.

Assuming that we are adding unsigned integers, notice what happens if we add:

1000 1001 137 + 0111 1011 + 123 _____________ ___________ (1) 0000 0100 ( 256 + ) 4 ? CCCC C CC

The result, equivalent to a decimal 260, is beyond the range of an 8-bit unsigned value (maximum of 255) and won't fit into 8 bits, so all we get is a value of 4, since the "carry-out" bit is lost. A "numeric overflow" has occurred.

* OK, now to get really tricky. Remember how we defined signed integers as two's complement values? That is, we chop the range of binary integer values in half and assign the high half to negative values. In the case of 4-bit values:

0000 0001 ... 0110 0111 1000 1001 ... 1110 1111 0 1 ... 6 7 -8 -7 ... -2 -1

Now we can discuss exactly why this scheme makes life easier for the computer. Two's complement arithmetic has some interesting properties, the most significant being that it makes subtraction the same as addition.

To see how this works, pretend that we have the binary values above written on a strip of stiff paper taped at the ends into a loop, with each binary value written along the loop, and the loop joined together between the "1111" and "0000" values.

Now further consider that we have a little slider on the loop that we can move in the direction of increasing values, but not backwards, sort of like a slide rule that looks like a ring, with 16 binary values on it.

If we wished to add, say, 2 to 4, we would just move the slider up two values from 4, and we would get 6. Now let's see what happens if we want to subtract 1 from 3. This is the same as adding -1 to 3, and since a -1 in two's complement is 1111, or the same as decimal 15, we move the slider up 15 values from 3. This takes us all the way around the ring to ... 2.

This is a bizarre way to subtract 1 from a value, or so it seems from a human point of view -- but from the machine point of view, it's simple. Try a few other simple additions and subtractions to see how this works. The big problem is that overflow conditions have become much trickier. Consider the following examples:

0111 0100 116 1000 1110 -112 + 0001 0101 + 21 + 1001 0001 -111 ___________ _____ ______________ ____________ 1000 1001 -119 ( = 137) ? (1) 0001 1111 ( 256 + ) 21 ?

In the case on the left, we add two positive numbers and end up with a negative one. In the case on the right, we add two negative numbers and end up with a positive one. Both are absurd results. Again, the results have exceeded the range of values that can be represented in the coding system we have used. One nice thing is that we can't get into trouble adding a negative number to a positive one, since obviously the result is going to be within the range of allowed values.

* As an aside, to convert a positive binary value to a two's complement value, all we have to do is invert (NOT) all the bits and add 1. For example, to convert a binary 7 to -7:

0000 0111 -> 1111 1000 + 1 -> 1111 1001

Another approach is to simply scan from right to left through the binary number, copy down each digit up to and including the first "1" we encounter, then invert all the bits to the left of that "1".

* This covers addition and multiplication, but what about multiplication and division? Well, take the binary value:

0010 1001 = 41

Suppose we move, or "shift", all the bits one place to the left, and shove a zero in on the right. We get:

0101 0010 = 82

Surprise -- shifting a binary quantity left one place multiplies it by two. Suppose we shift it to the right, and shove a zero in on the left:

0001 0100 = 20 (losing a 1 that was shifted off the right side)

This is equivalent to dividing by 2. The bit shifted off the right side is the "remainder" that's left over from division.

This implies that we can implement multiplication by performing a combination of shifts and adds. For example, to multiply a binary value by 5, we would shift the value right by two bits, multiplying it by 4, and then add the original value. Similarly, division can be implemented by a shift and subtract procedure. This scheme allows simple hardware to perform these operations.

One interesting feature of this scheme are the complications two's complement introduces. Suppose we have a two's complement number such as:

1001 1010 = -102

If we shift this left, the result is obviously invalid, and there's no way to fix it. Multiplying -102 by 2 would give -204, and that would exceed the range of values available in signed 8-bit arithmetic (-128 to 127). But see what happens if we shift right:

0100 1101 = 77

We want to get -51, but since we lost the uppermost "1" bit, the value becomes abruptly positive. In the case of signed arithmetic, we have to use a slightly different shift method, one that shoves a "1" into the upper bit instead of a "0", and gives the right result:

1100 1101 = -51

This is called an "arithmetic shift". The other method that shoves in a "0" is known in contrast as a "logical shift". This is another example that shows how well two's complement works from the hardware level, since this wouldn't work with sign-magnitude numbers.

There is also a "rotate" operation that is a variation on a "shift", in which the bit shifted out of one end of the binary value is recirculated into the other. This is of no practical concern in this discussion, it's just mentioned for the sake of completeness.

* So that covers the basic operations for signed and unsigned integers. What about floating-point values?

The operations are basically the same. The only difference in addition and subtraction is that we have two quantities that have both a mantissa and an exponent, and they both have to have the same exponent to allow us to add or subtract. For example, using decimal math:

3.13E2 + 2.7E3 = 0.313E3 + 2.73E3 = 3.043E3

The same principle applies to binary floating-point math: we shift the mantissa of one value left or right and adjust the exponent until both values have the same exponent, and the perform the add, or subtract as the case may be.

A sharp reader will realize this is why performing calculations between sets of floating-point quantities with greatly different ranges of values is tricky and hazardous. If the two sets of values are adjusted to have the same exponents, a process known as "normalization", the mantissa of the much smaller number may be shifted so far down that it doesn't even show up in the result of an addition or subtraction.

Also recalling high-school math, to multiply two floating-point values, we simply multiply the mantissas and add the exponents:

3.13E2 * 2.7E3 = (3.13 * 2.7)E(2 + 3) = 8.45E5

To divide, we divide the mantissas and subtract the exponents. The same applies to binary floating-point quantities. Note that it's much easier to do this when both binary mantissas have the same assumed decimal point, but that will be true by the very definition of the format of IEEE 754 floating-point numbers.

* Finally, what about higher functions, like square roots and sines and logs and so on?

There are two approaches. First, there are standard algorithms that mathematicians have long known that use the basic add-subtract-multiply-divide operations in some combination and sequence to generate as good an approximation to such values as we would like. For example, sines and cosines can be approximated with a "power series", which is a sum of specified powers of the value to be converted divided by specific constants that follow a certain rule.

Second, we can use something equivalent to the "look-up tables" used in math texts in the days before calculators that give values of sine, cosine, and so on, where readers would obtain the nearest values for, say, sine for a given value and then "interpolate" between them to get a good approximation of the value as wanted. (Such tables can be found in musty old textbooks.) In the case of the computer, it can store a table of sines or the like, look up the values, and then interpolate between them to give the value required.

* That covers the basics of bits and bytes and math for a computer. However, just to confuse things a little bit, while computers normally use binary floating-point numbers, calculators normally don't.

Recall that there is a slight error in translating from decimal to binary and back again. While this problem is easily handled by someone familiar with it, calculators in general have to be simple to operate and it would be better if this particular problem didn't arise, particularly in financial calculations where minor discrepancies might lead to a tiresome audit.

As a result, calculators generally really perform their computations in decimal, using a scheme known as "binary-coded decimal (BCD)". This scheme uses groups of four bits to encode the individual digits in a decimal number:

0000 = decimal 0 0001 = decimal 1 ... 1000 = decimal 8 1001 = decimal 9 1010 = ILLEGAL 1011 = ILLEGAL ... 1110 = ILLEGAL 1111 = ILLEGAL

With four bits, 10 BCD values can be represented. With 8, 100 BCD values can be handled:

0001 0000 = decimal 10 0001 0001 = decimal 11 0001 0010 = decimal 12 0001 0011 = decimal 13 ...

BCD obviously wastes bits. For example, 16 bits can be used to handle 65,536 binary integers, but only 10,000 BCD integers. However, as mentioned, BCD is not troubled by small errors due to translation between decimal and binary, though there are still resolution and range problems.

BCD can be used to implement integer or floating-point number schemes. Of course -- it's just decimal digits represented by bits. It can be used in computer programs as well, but except for calculators, most users won't normally come in contact with it. It makes the arithmetic substantially slower and so is generally only used in applications, such as financial calculations, where minor errors are unacceptable.

* One final comment on computer math: there are math software packages that offer "indefinite precision" math, or that is, math taken out to a precision defined by the user. What these packages do is define ways of encoding numbers that can change in the number of bits they use to store values, as well as ways of performing math on them. They're very slow compared to normal computer computations, and most applications actually do fine with 64-bit floating-point math.

* I wrote this article because floating-point issues are a common source of confusion. I can sympathize with this, because it took a long time for me to understand the issue as well as I do, and in fact writing this article helped me clear up my own thinking on the matter. I'm afraid I can't cite sources on this document, however, since I picked up the information in bits and pieces from various manuals, documents, and my own background in computing.

* As concerns copyrights and permissions for this document, all illustrations and images credited to me are public domain. I reserve all rights to my writings. However, if anyone does want to make use of my writings, just contact me, and we can chat about it. I'm lenient in giving permissions, usually on the basis of being properly credited.

* Revision history:

v1.0 / 12 jan 97 / Derived from older documents. v1.1 / 24 jul 98 / Minor tweaks. v1.2 / 04 dec 98 / Review & polish. v1.0.3 / 01 dec 01 / Review & polish. v1.0.4 / 01 dec 02 / Fixed typos. v1.0.5 / 01 may 03 / Fixed typos. v1.0.6 / 01 may 05 / Review & polish. v1.0.7 / 01 sep 06 / Review & polish. v1.0.8 / 01 aug 08 / Review & polish. v1.0.9 / 01 jul 10 / Review & polish. v1.1.0 / 01 sep 11 / Review & polish. v1.1.1 / 01 sep 13 / Review & polish. v1.1.2 / 01 aug 15 / Review & polish. v1.1.3 / 01 jul 17 / Review & polish.