Adding two binary numbers works in the same way as adding decimal numbers. Let us first consider the sum \(a+b\) of two bits \(a\) and \(b\text{.}\) In general, the sum of two bits is two bits long, because the addition can cause a carry. Bit 0 of the sum is given by the xor of both bits: \(a\bxor b\text{.}\) The carry \(c\) of that position is 1 if and only if \(a=b=1\text{,}\) i.e. \(a\band b=1\text{.}\)
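As an illustrative sketch (the function name is ours, not part of the text), this one-bit addition can be written directly in Python:

```python
def half_adder(a, b):
    """Add the single bits a and b; return (sum bit, carry)."""
    s = a ^ b      # bit 0 of the sum: a xor b
    c = a & b      # carry: 1 if and only if a = b = 1
    return s, c

# all four input combinations
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", half_adder(a, b))
```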
When adding entire bit strings, not just single bits, the carry at a position must be propagated into the next higher position. So, if \(c_i\) denotes the carry generated at position \(i-1\text{,}\) the sum bit at position \(i\) is given as \(s_i=a_i\bxor b_i\bxor c_i\text{.}\)
The carry \(c_i\) into position \(i\) may also influence the carry \(c_{i+1}\) out of position \(i\text{.}\) For example, if \(a_i=b_i=c_i=1\text{,}\) then \(c_{i+1}=1\text{.}\) The carry generated at position \(i\) is defined as \(c_{i+1}=(a_i\band b_i)\bxor(a_i\band c_i)\bxor(b_i\band c_i)\text{,}\) i.e. \(c_{i+1}=1\) if and only if at least two of the three bits \(a_i\text{,}\) \(b_i\text{,}\) \(c_i\) are \(1\text{.}\)
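As a sketch (the function name is ours), a full adder combining the sum and carry formulas for one position:

```python
def full_adder(a, b, c_in):
    """Add the bits a and b plus an incoming carry.
    Returns (sum bit, carry out), i.e. (s_i, c_{i+1})."""
    s = a ^ b ^ c_in
    # carry out: 1 iff at least two of the three input bits are 1
    c_out = (a & b) | (a & c_in) | (b & c_in)
    return s, c_out

print(full_adder(1, 1, 1))  # (1, 1): sum bit 1, carry out 1
```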
Now, consider two bit strings \(a\) and \(b\) of equal length \(n\text{.}\) When adding \(a\) and \(b\text{,}\) we assume that no carry flows into the least-significant position, i.e. \(c_0=0\text{.}\)
Definition 1.3.3. Addition of bit strings.
\(a+b\defeq c_ns_{n-1}\dots s_0\)
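Definition 1.3.3 can be sketched in Python as follows (a ripple-carry computation over bit strings written most significant bit first; the function name is ours):

```python
def ripple_add(a, b, c0=0):
    """Add two equal-length bit strings (most significant bit first).
    Returns the (n+1)-bit string c_n s_{n-1} ... s_0 of Definition 1.3.3."""
    assert len(a) == len(b)
    carry = c0
    bits = []
    for ai, bi in zip(reversed(a), reversed(b)):   # least significant first
        ai, bi = int(ai), int(bi)
        s = ai ^ bi ^ carry
        carry = (ai & bi) | (ai & carry) | (bi & carry)
        bits.append(str(s))
    bits.append(str(carry))                        # c_n becomes the top bit
    return "".join(reversed(bits))

print(ripple_add("0101", "0011"))  # 5 + 3 = 8, i.e. '01000'
```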
Remark 1.3.4. Efficiency and practicability of our adder.
While Lemma 1.3.5 shows that the addition algorithm we defined is sound, it has a significant practical disadvantage. Our algorithm is a so-called ripple-carry adder, which means that the carry “ripples” through the positions: In order to compute \(s_i\) you must already have computed \(c_i\text{,}\) which in turn requires \(c_{i-1}\text{,}\) and so forth. This makes the depth of the actual circuit of ands, ors, and xors that implements the adder linear in the number of positions. Since the circuit depth significantly influences the frequency at which you can clock the computer, and therefore its speed, practical implementations use different adders, so-called carry-lookahead adders, that have more gates but are flatter and can therefore be clocked faster.
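To illustrate the generate/propagate idea behind carry-lookahead adders, here is a sketch (names ours; note that Python evaluates the recurrence sequentially, whereas hardware expands it so all carries are computed in parallel):

```python
def lookahead_carries(a_bits, b_bits, c0=0):
    """Carries of an addition from generate/propagate signals:
    g_i = a_i & b_i  (position i generates a carry),
    p_i = a_i ^ b_i  (position i propagates an incoming carry),
    c_{i+1} = g_i | (p_i & c_i).
    In hardware this recurrence is expanded so that every c_{i+1}
    depends only on the g's, the p's and c_0, yielding a flat circuit."""
    carries = [c0]
    for a, b in zip(a_bits, b_bits):     # least significant bit first
        g, p = a & b, a ^ b
        carries.append(g | (p & carries[-1]))
    return carries

# carries of 3 + 5 on four bits (bit lists least significant first)
print(lookahead_carries([1, 1, 0, 0], [1, 0, 1, 0]))  # [0, 1, 1, 1, 0]
```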
The following lemma shows that this definition is sound, i.e. the bit operations by which we defined the addition really produce a bit string that represents the natural number of the sum of the natural numbers that are represented by the individual bit strings.
Lemma 1.3.5. Addition of bit strings is sound.
Let \(a\) and \(b\) be two bit strings of length \(n\text{.}\) Then, \(\buintp{a+b}=\buintp{a}+\buintp{b}+c_0\text{.}\)
Note that the sum contains \(n+1\) bits. If the carry bit out of the last position is \(0\text{,}\) Lemma 1.3.5 tells us that the lower \(n\) bits are accurate, i.e. \(\buintp{\bseq sn}=\buintp{a}+\buintp{b}\text{.}\) If the carry out of the last position is \(1\text{,}\) the lower \(n\) bits are not accurate and the sum is greater than or equal to \(2^n\text{.}\) In this case an overflow happened and we would need all \(n+1\) bits to represent the result accurately. Since computers internally operate on bit strings of fixed size, there is no space to store this additional bit directly in place with the result. If we are only interested in the \(n\) least significant bits when adding, we briefly write \(a\mathop{+_n}b\defeq\bseq sn\) to indicate that we ignore the carry out of the most significant position.
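On natural numbers, dropping the carry out of the top position is just reduction modulo \(2^n\); a small sketch (function name ours):

```python
def add_n(a, b, n):
    """n-bit addition: keep only the n least significant bits of a + b,
    i.e. compute (a + b) mod 2**n, discarding the carry out of the top."""
    return (a + b) % (1 << n)

print(add_n(200, 100, 8))  # 300 mod 256 = 44: an overflow occurred
print(add_n(5, 3, 8))      # 8: no overflow, the result is exact
```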
Let us now turn to subtraction. Instead of inventing a new algorithm for subtracting, we are going to use a trick to express subtraction by addition. This trick hinges on the fact that we are only interested in subtracting words of \(n\) bits and not bit strings of arbitrary length. Let us first understand the overall idea using an example in base 10 arithmetic:
Assume we want to subtract base 10 numbers of length 3, and more concretely suppose we want to compute \(3-2\text{.}\)
What essentially happens here is that, instead of subtracting \(2\text{,}\) we added \(998=1000-2\) and considered the result modulo \(1000\text{,}\) which is \(1\text{,}\) the desired result. This of course only works because we limit the digit sequences to a fixed length beforehand (here 3).
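The decimal trick can be checked in one line of Python:

```python
# Instead of computing 3 - 2 directly, add 998 = 1000 - 2 and
# reduce modulo 1000, i.e. keep only the lowest three digits:
result = (3 + (1000 - 2)) % 1000
print(result)  # 1, the same as 3 - 2
```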
We use the same trick to subtract two bit strings of length \(n\) and therefore define subtraction to fulfill the following specification:
So, instead of subtracting \(\buintp b\) from \(\buintp a\text{,}\) we are adding \((2^n-\buintp b)\) to \(\buintp a\text{.}\) Let us first observe that the result \(\buintp a+(2^n-\buintp b)\) is, analogously to the example above, equivalent to \(\buintp a-\buintp b\) modulo \(2^n\text{.}\) This means that the lower \(n\) bits of \(\buintp a+(2^n-\buintp b)\) and the lower \(n\) bits of the true difference are equal, which is exactly what we wanted. The number \(2^n-\buintp b\) is called the two's complement of \(\buintp b\text{.}\)
We will see in the proof of Lemma 1.1.6 that we can easily compute the two's complement of \(\buintp b\) by flipping all bits of \(b\) and adding \(1\text{.}\) The latter can easily be achieved by setting the carry into the least significant position (\(c_0\)) to \(1\text{.}\) Additionally, we see that if \(\buintp a\ge\buintp b\) (i.e. the result of the subtraction is defined), \(\buintp{a-b}\) will be greater than or equal to \(2^n\text{.}\) Hence, the carry out of the most significant bit is set if the result of the subtraction is valid and is clear when the subtraction overflowed.
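The flip-and-add-one computation can be sketched as follows (function name ours):

```python
def twos_complement(b, n):
    """Two's complement of an n-bit value: flip all n bits, then add 1.
    The result equals 2**n - b for 0 < b < 2**n (and 0 for b = 0)."""
    flipped = b ^ ((1 << n) - 1)   # bitwise negation restricted to n bits
    return (flipped + 1) % (1 << n)

print(twos_complement(2, 8))  # 254 == 2**8 - 2
```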
Definition 1.3.6. Subtraction.
\(a-b\defeq a+\bneg b\text{ with }c_0=1\) and \(a\mathop{-_n}b\defeq a\mathop{+_n}\bneg b\) with \(c_0=1\)
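A sketch of Definition 1.3.6 on natural numbers (function name ours), returning both the \(n\)-bit result and the carry out of the most significant position:

```python
def sub_n(a, b, n):
    """n-bit subtraction via the adder: a plus the flipped bits of b,
    with the incoming carry c_0 set to 1.
    Returns (lower n bits, carry out of the most significant position)."""
    mask = (1 << n) - 1
    total = a + (b ^ mask) + 1      # a + negated b + c_0
    return total & mask, total >> n

print(sub_n(3, 2, 8))   # (1, 1): carry set, the result is valid
print(sub_n(2, 3, 8))   # (255, 0): carry clear, the subtraction overflowed
```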
Definition 1.3.6 already suggests how an adder has to be extended to support subtraction: the right operand of the ALU must be optionally negatable, and the carry into the least significant position must be presettable to either \(0\) or \(1\text{.}\)
Many processors provide the carry out of the last position in a special register, the so-called flag register. It is then often just called the carry bit because all the intermediate carries are inaccessible to the programmer. Some architectures negate the carry bit when subtracting, actually making it a borrow bit. For example, on an x86 machine the carry bit is clear when computing \(3-2\text{,}\) whereas an ARM processor would set the carry in that case.
Pro comment: This is why on x86 the instruction that subtracts and reads \(c_0\) from the flags is called sbb (subtract with borrow), while it is called sbc (subtract with carry) on ARM processors. Such instructions are used when implementing arithmetic on bit strings that are longer than the word size of the processor.
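The difference between the two flag conventions can be sketched in a few lines (function and flag names ours; the subtraction is simulated as addition of the flipped bits with carry-in 1, as in the text):

```python
def flags_after_sub(a, b, n):
    """Carry out of the adder when computing a - b as a + ~b + 1,
    together with the x86-style borrow flag (the inverted carry)."""
    total = a + (b ^ ((1 << n) - 1)) + 1
    carry = total >> n
    return {"ARM C": carry, "x86 CF": 1 - carry}

print(flags_after_sub(3, 2, 8))  # ARM sets C, x86 clears CF
print(flags_after_sub(2, 3, 8))  # the overflowing case: flags inverted
```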