Without considering the question of access, however, we can show, or at least very strongly suggest, that information processing is inevitably accompanied by a certain minimum amount of heat generation. In a general way this is not surprising. Computing, like all processes proceeding at a finite rate, must involve some dissipation. Our arguments, however, are more basic than this, and show that there is a minimum heat generation, independent of the rate of the process. Naturally the amount of heat generation involved is many orders of magnitude smaller than the heat dissipation in any practically conceivable device. The relevant point, however, is that the dissipation has a real function and is not just an unnecessary nuisance. The much larger amounts of dissipation in practical devices may be serving the same function.
Our conclusion about dissipation can be anticipated in several ways, and our major contribution will be a tightening of the concepts involved, in a fashion which will give some insight into the physical requirements for logical devices. The simplest way of anticipating our conclusion is to note that a binary device must have at least one degree of freedom associated with the information. Classically a degree of freedom is associated with kT of thermal energy. Any switching signals passing between devices must therefore have this much energy to override the noise. This argument does not make it clear that the signal energy must actually be dissipated. An alternative way of anticipating our conclusions is to refer to the arguments by Brillouin and earlier authors, as summarized by Brillouin in his book, Science and Information Theory,1 to the effect that the measurement process requires a dissipation of the order of kT. The computing process, where the setting of various elements depends upon the setting of other elements at previous times, is closely akin to a measurement. It is difficult, however, to argue out this connection in a more exact fashion. Furthermore, the arguments concerning the measurement process are based on the analysis of specific models (as will some of our arguments about computing), and the specific models involved in the measurement analysis are rather far from the kind of mechanisms involved in data processing. In fact the arguments dealing with the measurement process do not define measurement very well, and avoid the very essential question: When is a system A coupled to a system B performing a measurement? The mere fact that two physical systems arc coupled does not in itself require dissipation.
Our main argument will be a refinement of the following line of thought. A simple binary device consists of a particle in a bistable potential well shown in Fig. 1. Let us arbitrarily label the particle in the left-hand well as the ZERO state. When the particle is in the right-hand well, the device is in the ONE state. Now consider the operation RESTORE TO ONE, which leaves the particle in the ONE state, regardless of its initial location. If we are told that the particle is in the ONE state, then it is easy to leave it in the ONE state, without spending energy. If on the other hand we are told that the particle is in the ZERO state, we can apply a force to it, which will push it over the barrier, and then, when it has passed the maximum, we can apply a retarding force, so that when the particle arrives at ONE, it will have no excess kinetic energy, and we will not have expended any energy in the whole process, since we extracted energy from the particle in its downhill motion. Thus at first sight it seems possible to RESTORE To ONE without any expenditure of energy. Note, however, that in order to avoid energy expenditure we have used two different routines, depending on the initial state of the device. This is not how a computer operates. In moat instances a computer pushes information around in a manner that is independent of the exact data which are being handled, and is only a function of the physical circuit connections.
Can we then construct a single time-varying force, F(f), which when applied to the conservative system of Fig. 1 will cause the particle to end up in the ONE state, if it was initially in either the ONE state or the ZERO state? Since the system is conservative, its whole history can be reversed in time, and we will still have a system satisfying the laws of motion. In the time-reversed system we then have the possibility that for a single initial condition (position in the ONE state, zero velocity) we can end up in at least two places: the ZERO state or the ONE state. This, however, is impossible. The laws of mechanics are completely deterministic and a trajectory is determined by an initial position and velocity. (An initially unstable position can, in a sense, constitute an exception. We can roll away from the unstable point in one of at least two directions. Our initial point ONE is, however, a point of stable equilibrium.) Reverting to the original direction of time development, we see then that it is not possible to invent a single F(f) which causes the particle to arrive at ONE regardless of its initial state.
If, however, we permit the potential well to be lossy, this becomes
easy. A very strong positive initial force applied slowly enough so that
the damping prevents oscillations will push the particle to the right,
past ONE, regardless of the particle's initial state. Then if the force
is taken away slowly enough, 90 that the damping has a chance to prevent
appreciable oscillations, the particle is bound to arrive at ONE. This
example also illustrates a point argued elsewhere2
in more detail: While a heavily overdamped system is obviously undesirable,
since it is made sluggish, an extremely underdamped one is also not desirable
for switching, since then the system may bounce back into the wrong state
if the switching force is applied and removed too quickly.
X is a generalized coordinate representing quantity which is switched.
Figure 2. Potential well in which ZERO and ONE state are not
separated by barrier.
Information is preserved because random motion is slow
The second class of devices consists of structures which are in a steady
(time invariant) state, but in a dissipative one, while holding on to information.
Electronic flip-flop circuits, relays, and tunnel diodes are in this class.
The latter, whose characteristic with load line is shown in Fig.
3, typifies the behavior. Two stable points of operation are separated
by an unstable position, just as for the device in Fig.
1. It is noteworthy that this class has no known representatives analogous
to Fig. 2. All the active bistable devices (latches)
have built-in means for restoration to the desired state. The similarity
between Fig. 3 and the device of Fig.
1 becomes more conspicuous if we represent the bistable well of Fig.
1 by a diagram plotting force against distance. This is shown in Fig.
4. The line F = 0 intersects the curve
in three positions, much like the load line (or a line of constant current),
in Fig. 3. This analogy leads us to expect that in
the case of the dissipative device there will be transitions from the desired
state, to the other stable state, resulting from thermal agitation or quantum
mechanical tunneling, much like for the dissipationless case, and as has
been discussed for the latter in detail by Swanson.5
The dissipative device, such as the single tunnel diode, will in general
be an analog, strictly speaking, to an unsymmetrical potential well, rather
than the symmetrical well shown in Fig. 1. We can therefore
expect that of the two possible states for the negative resistance device
only one is really stable, the other is metastable. An assembly of bistable
tunnel diodes left alone for a sufficiently long period would eventually
almost all arrive at the same state of absolute stability.
Figure 3. Negative resistance characteristic (solid line) with a load line (dashed).
ZERO and ONE are stable states, U is unstable.
Figure 4. Force versus distance for the bistable well of Fig.
1.
ZERO and ONE are the stable states, U
the unstable one.
In general when using such latching devices in computing circuits one tries hard to make the dissipation in the two allowed states small, by pushing these states as closely as possible to the voltage or current axis. If one were successful in eliminating this dissipation almost completely during the steady state, the device would become a member of our first class. Our intuitive expectation is, therefore, that in the steady state dissipative device the dissipation per switching event is at least as high as in the devices of the first class, and that this dissipation per switching event is supplemented by the steady state dissipation.
The third and remaining class is a "catch-all"; namely, those devices where time variation is essential to the recognition of information. This includes delay lines, and also carrier schemes, such as the phase-bistable system of von Neumann.6 The latter affords us a very nice illustration of the need for dissipative effects; most other members of this third class seem too complex to permit discussion in simple physical terms.
In the von Neumann scheme, which we shall not attempt to describe here
in complete detail, one uses a "pump" signal of frequency
The von Neumann system depends largely on a coupling scheme called majority
logic, in which one couples to three subharmonic oscillators and uses
the sum of their oscillations to synchronize a subharmonic oscillator whose
pump will cause it to build up at a later time than the initial three.
Each of the three signals which are added together can have one of two
possible phases. At most two of the signals can cancel, one will always
survive, and thus there will always be a phase determined for the build-up
of the next oscillation. The synchronization signal can, therefore, have
two possible magnitudes. If all three of the inputs agree we get a synchronization
signal three times as big as in the case where only two inputs have a given
phase. If the subharmonic circuit is lossless the subsequent build-up will
then result in two different amplitudes, depending on the size of the initial
synchronization signal. This, however, will interfere with the basic operation
scheme at the next stage, where we will want to combine outputs of the
three oscillators again, and will want all three to be of equal amplitude.
We thus see that the absence of the losses gives us an output amplitude
from each oscillator which is too dependent on inputs at an earlier stage.
While perhaps the deviation from the desired amplitudes might still be
tolerable after one cycle, these deviations could build up, through a period
of several machine cycles. The losses, therefore, are needed so that the
unnecessary details of a signal's history will be obliterated. The losses
are essential for the standardization of signals, a function which in past
theoretical discussions has perhaps not received adequate recognition,
but has been very explicitly described in a recent paper by A. W. Lo.7
We shall think of a computer as a distinctly finite array of N binary elements which can hold information, without dissipation. We will take our machine to be synchronous, so that there is a well-defined machine cycle and at the end of each cycle, the N elements are a complicated function of their state at the beginning of each cycle.
Our arguments for logical irreversibility will proceed on three distinct levels. The first-level argument consists simply in the assertion that present machines do depend largely on logically irreversible steps, and that therefore any machine which copies the logical organization of present machines will exhibit logical irreversibility, and therefore by the argument of the next Section, also physical irreversibility.
The second level of our argument considers a particular class of computers, namely those using logical functions of only two variables. After a machine cycle each of our N binary elements is a function of the state of at most two of the binary elements before the machine cycle. Now assume that the computer is logically reversible. Then the machine cycle maps the 2N possible initial states of the machine onto the same space of 2N states, rather than just a subspace thereof. In the 2N possible states each bit has a ONE and a ZERO appearing with equal frequency. Hence the reversible computer can utilize only those truth functions whose truth table exhibits equal numbers of ONES and ZEROS. The admissible truth functions then are the identity and negation, the EXCLUSIVE OR and its negation. These, however, are not a complete set8 and do not permit a synthesis of all other truth functions.
In the third level of our argument we permit more general devices. Consider, for example, a particular three-input, three-output device, i.e. a small special purpose computer with three bit positions. Let p, q, and r be the variable before the machine cycle. The particular truth function under consideration is the one which replaces r by (p • q) if r = 0, and replaces r by ~(p • q) if r = 1. The variables p and q are left unchanged during the machine cycle. We can consider r as giving us a choice of program, and p, q as the variables on which the selected program operates. This is a logically reversible device, its output always defines its inputs uniquely. Nevertheless it is capable of performing an operation such as AND which is not, in itself reversible. The computer, however, saves enough of the input information so that it supplements the desired result to allow reversibility. It is interesting to note, however, that we did not "save" the program; we can only deduce what it was.
Now consider a more general purpose computer, which usually has to go through many machine cycles to carry out a program. At first sight it may seem that logical reversibility is simply obtained by saving the input in some corner of the machine. We shall, however, label a machine as being logically reversible, if and only if all its individual steps are logically reversible. This means that every single time a truth function of two variables is evaluated we must save some additional information about the quantities being operated on, whether we need it or not. Erasure, which is equivalent to RESTORE TO ONE, discussed in the Introduction, is not permitted. We will, therefore, in a long program clutter up our machine bit positions with unnecessary information about intermediate results. Furthermore if we wish to use the reversible function of three variables, which was just discussed, as an AND, then we must supply the initial programming and separate ZERO for every AND operation which is subsequently required, since the "bias" which programs the device is not saved, when the AND is performed. The machine must therefore have a great deal of extra capacity to store both the extra "bias" bits and the extra outputs. Can it be given adequate capacity to make all intermediate steps reversible? If our machine is capable, as machines are generally understood to be, of a non-terminating program, then it is clear that the capacity for preserving all the information about all the intermediate steps cannot be there.
Let us, however, not take quite such an easy way out. Perhaps, it is just possible to devise a machine, useful in the normal sense, but not capable of embarking on a non-terminating program. Let us take such a machine as it normally comes, involving logically irreversible truth functions. An irreversible truth function can be made into a reversible one, as we have illustrated, by "embedding" it in a truth function of a large number of variables. The larger truth function, however, requires extra inputs to bias it, and extra outputs to hold the information which provides the reversibility. What we now contend is that this larger machine, while it is reversible, is not a useful computing machine in the normally accepted sense of the word.
First of all, in order to provide space for the extra inputs and outputs,
the embedding requires knowledge of the number of times each of the operations
of the original (irreversible) machine will be required. The usefulness
of computer systems, however, from the fact that it is more than just a
table look-up device; it can do many programs which are not anticipated
in full detail by the designer. Our enlarged machine must have a number
of bit positions, for every embedded device of the order of the number
of program steps and requires a number of switching events during program
loading comparable to the number that occur during the program itself.
The setting of bias during program loading, which would typically consist
of restoring a long row of bits to say ZERO, is just the type of nonreversible
logical operation we are trying to avoid. Our unwieldy machine has therefore
avoided the irreversible operations during the running of the program,
only at the expense of added comparable irreversibility during the loading
of the program.
Imagine first a situation in which the RESTORE operation has already been carried out on each member of an assembly of such bits. This is somewhat equivalent to an assembly of spins, all aligned with the positive z-axis. In thermal equilibrium the bits (or spins) have two equally favored positions. Our specially prepared collections show much more order, and therefore a lower temperature and entropy than is characteristic of the equilibrium state. In the adiabatic demagnetization method we use such a prepared spin state, and as the spins become disoriented they take up entropy from the surroundings and thereby cool off the lattice in which the spins are embedded. An assembly of ordered bits would act similarly. As the assembly thermalizes and forgets its initial state the environment would be cooled off. Note that the important point here is not that all bits in the assembly initially agree with each other, but only that there is a single, well-defined initial state for the collection of bits. The well-defined initial state corresponds, by the usual statistical mechanical definition of entropy, S = k loge W, to zero entropy. The degrees of freedom associated with information can, through thermal relaxation, go to any of one of 2N states (for N bits in the assembly) and therefore the entropy can increase by kN loge 2 as the initial information becomes thermalized.
Note that our argument here does not necessarily depend upon connections, frequently made in other writings, between entropy and information. We simply think of each bit as being located in a physical system, with perhaps a great many degrees of freedom, in addition to the relevant one. However, for each possible physical state which will be interpreted as a ZERO, there is a very similar possible physical state in which the physical system represents a ONE. Hence a system which is a ONE state has only half as many physical states available to it as a system which can be in a ONE or ZERO state. (We shall ignore in this Section and in the subsequent considerations the case in which the ONE and ZERO are represented by states with different entropy. This case requires arguments of considerably greater complexity but leads to similar physical conclusions.)
In carrying out the RESTORE TO ONE operation we are doing the opposite of the thermalization. We start with each bit in one of two states and end up with a well-defined state. Let us view this operation in some detail.
Consider a statistical ensemble of bits in thermal equilibrium. If these are all reset to ONE, the number of states covered in the ensemble has been cut in half. The entropy therefore has been reduced by k loge2 = 0.6931 k per bit. The entropy of a closed system, e.g., a computer with its own batteries, cannot decrease; hence this entropy must appear elsewhere as a heating effect, supplying 0.6931 kT per restored bit to the surroundings. This is, of course, a minimum heating effect, and our method of reasoning gives no guarantee that this minimum is in fact achievable.
Our reset operation, in the preceding discussion, was applied to a thermal equilibrium ensemble. In actuality we would like to know what happens in a particular computing circuit which will work on information which has not yet been thermalized, but at any one time consists of a well-defined ZERO or a well-defined ONE. Take first the case where, as time goes on, the reset operation is applied to a random chain of ONES and ZEROS. We can, in the usual fashion, take the statistical ensemble, equivalent to a time average and therefore conclude that the dissipation per reset operation is the same for the time-wise succession as for the thermalized ensemble.
A computer, however, is seldom likely to operate on random data. One of the two bit possibilities may occur more often than the other, or even if the frequencies are equal, there may be a correlation between successive bits. In other words the digits which are reset may not carry the maximum possible information. Consider the extreme case, where the inputs are all ONE, and there is no need to carry out the operation. Clearly then no entropy changes occur and no heat dissipation is involved. Alternatively if the initial states are all ZERO they also carry no information, and no entropy change is involved in resetting them all to ONE. Note, however, that the reset operation which sufficed when the inputs were all ONE (doing nothing) will not suffice when the inputs are all ZERO. When the initial states are ZERO, and we wish to go to ONE, this is analogous to a phase transformation between two phases in equilibrium, and can, presumably, be done reversibly and without an entropy increase in the universe, but only by a procedure specifically designed for that task. We thus see that when the initial states do not have their fullest possible diversity, the necessary entropy increase in the RESET operation can be reduced, but only by taking advantage of our knowledge about the inputs, and tailoring the reset operation accordingly.
The generalization to other logically irreversible operations is apparent, and will be illustrated by only one additional example. Consider a very small special-purpose computer, with three binary elements p, q, and r. A machine cycle replaces p by r, replaces q by r and replaces r by p • q There are eight possible initial states, and in thermal equilibrium they will occur with equal probability. How much entropy reduction will occur in a machine cycle? The initial and final machine states are shown in Fig. 5. States a and b occur with a probability of 1/8 each: states g and d have a probability of occurrence of 3/8 each. The initial entropy was
|
|
|
|
The difference Si - Sf = 1.18 k. The minimum dissipation, if the initial state has no useful information, is therefore 1.18 kT.
Figure 5. Three input - three output device which maps eight
possible states onto only four different states
The question arises whether the entropy is really reduced by the logically
irreversible operation. If we really map the possible initial ZERO states
and the possible initial ONE states into the same space, i.e., the space
of ONE states, there can be no question involved. But, perhaps, after we
have performed the operation there can be some small remaining difference
between the systems which were originally in the ONE state already and
those that had to be switched into it. There is no harm in such differences
persisting for some time, but as we saw in the discussion of the dissipationless
subharmonic oscillator, we cannot tolerate a cumulative process, in which
differences between various possible ONE states become larger and larger
according to their detailed past histories. Hence the physical "many into
one" mapping, which is the source of the entropy change, need not happen
in full detail during the machine cycle which performed the logical function.
But it must eventually take place, and this is all that is relevant for
the heat generation argument.
|
|
||
|
|
(5.1)
|
We can view Eqs. (5.1) as representing a linear transformation on (nA, nB), which yields (dnA/ dt, dnB / dt). What are the characteristic values of the transformation? They are:
|
|
The eigenvalue l1 = 0 corresponds to a time-independent well population. This is the equilibrium distribution
|
|
The remaining negative eigenvalue must then be associated with deviations from equilibrium, and exp(-l2t) gives the rate at which these deviations disappear. The relaxation time t is therefore in terms of quantity U0, which is the average of UA and UB
|
|
(5.2)
|
The quantity U0 in Eq. (5.2) cancels out, therefore the validity of Eq. (5.2) does not depend on the definition of U0. Letting D = ½(UA - UB), Eq. (5.2) then becomes
|
|
(5.3)
|
To first order in the switching force which causes UA and UB to differ, (U - U0) will remain unaffected, and therefore Eq. (5.3) can be written
|
|
(5.4)
|
Note hat D
is one-half the energy which will be dissipated in the switching process.
The thermal probability distribution within each well will be about the
same before and after switching, the only difference is that the final
well is 2 D lower
than the initial well. The energy difference is dissipated and corresponds
to one-half the hysteresis loop area energy loss generally associated with
switching. Equation (5.4) therefore confirms the
empirically well-known fact that increases in switching speed can only
be accomplished at the expense of increased dissipation per switching event.
Equation (5.4) is, however, true only for a special
model and has no really general significance. To show this consider an
alternative model. Let us assume that information is stored by the position
of a particle along a lone, and that
|
|
(5.5)
|
|
|
(5.6)
|
|
|
(5.7)
|
|
|
|
(5.8)
|
Which show the same direction of variation as ts with D as in the case with the barrier, but do not involve an exponential variation with D/kT. If all other considerations are ignored it is clear that the energy bistable element of Eq. (5.4) is much preferred to the diffusion stabilized element of Eq. (5.8).
The above examples give us some insight into the need for energy dissipation,
not directly provided by the arguments involving entropy consideration.
In the RESTORE TO ONE operation we want the system to settle into the ONE
state regardless of its initial state. We do this by lowering the energy
of the ONE state relative to the ZERO state. The particle will then go
to this lowest state, and on the way dissipate any excess energy it may
have had in its initial state.
A third source of error consists of the fact that even if the system is allowed to relax completely during switching there would still be a fraction of the ensemble of the order exp(-2D/kT) left in the unfavored initial state. (Assuming D >> kT) For the purpose of the subsequent discussion let us call this Boltzmann error. We shall show that no matter how the design compromise between the first two kinds of errors is made, Boltzmann error will never be dominant. We shall compare the errors in a rough fashion, without becoming involved in an enumeration of the various possible exact histories of information.
To carry out this analysis, we shall overestimate Boltzmann error by
assuming that switching has occurred in every machine cycle in the history
of every bit. It is this upper bound on the Boltzmann error which will
be shown to be negligible, when compared to other errors. The Boltzmann
error probability, per switching event is exp(-2D/kT).
During the same switching time bits which arc not being switched are decaying
away at the rate exp(-t/t0).
In the switching time Ts,
therefore, unswitched bits have a probability Ts/t0
of losing their information. If the Boltzmann error is to be dominant
|
|
(6.1)
|
|
|
(6.2)
|
|
|
(6.3)
|
Now consider the relaxation to the switched state. The error incurred due to incomplete relaxation is exp(-Ts/ts)), which according to Eq. (6.3) satisfies
|
|
(6.4)
|
The right-hand side of this inequality has as its argument 1/2 exp(-D/kT) which is less than 1/2. Therefore the right-hand side is large compared to exp(-2D/ kT), the Boltzmann error, whose exponent is certainly larger than unity. We have thus shown that if the Boltzmann error dominates over the information decay, it must in turn be dominated by the incomplete relaxation during switching.
A somewhat alternate way of arguing the same point consists in showing that the accumulated Boltzmann error, due to the maximum number of switching events permitted by Eq. (5.4), is small compared to unity.
Consider now, instead, the diffusion stabilized element of Eq. (5.8). For it, we can find instead of Eq. (6.4) the relationship
|
|
(6.5)
|
When we attempt to consider a more realistic machine model, in which
switching forces are applied to coupled devices, as is done for example
in diodeless magnetic core logic/ it becomes difficult to maintain analytically
a clean-cut breakdown of error types, as we have done here. Nevertheless
we believe that there is still a somewhat similar separation which is manifested.
Acknowledgment:
Some of these questions were first posed by E. R. Piore a number of years ago. In its early stages2,5 this project was carried forward primarily by the late John Swanson. Conversations with Gordon Lasher were essential to the development of the ideas presented in the paper.
References