
Information Theory

About two hundred and twenty five years ago, an invention was made which led to a new field of engineering. The invention was described by a Scottish economist, Adam Smith, in The Wealth of Nations:

``In the first steam engines, a boy was constantly employed to open and shut the communication between the boiler and the cylinder, according as the piston either ascended or descended. One of these boys, who loved to play with his companions, observed that by tying a string from the handle of the valve which opened this communication to another part of the machine, the valve would open and shut without his assistance, leaving him at liberty to divert himself with his playfellows.''

What had this boy -- whom we may regard as an early systems engineer -- actually invented? A piece of string scarcely qualifies as a mechanical invention. What is its function? The string is transmitting information. This is the one material needed for the production process that was not identified until late in the day. To make a steam engine capable of withstanding high pressures you require the steels produced by the Bessemer process, but you also require something else -- information -- the information that specifies what the engine will look like, and what steps are required to make it look that way.

This invisible material started to become visible in the 1930s: the rise of telecommunications, the invention of the triode valve (later superseded by the transistor), and the arrival of the first computers made us explicitly aware of information as something that could be created, transformed and transmitted. The discipline of information science arose, allowing us to measure a quantity of information as easily as we could measure a weight of steel.

Measuring Information

The founder of information theory was a man called Claude Shannon, initially trained at MIT and subsequently working at Bell Labs, the industrial research arm of the Bell Telephone company. In `A Mathematical Theory of Communication', published in 1948, he describes a method for measuring the information content of a signal.

So how do we measure it? One approach is to think of information as something that reduces uncertainty. If I don't know what's going to happen next, and someone tells me, the information removes my uncertainty. So I can say that the information received is the same size as the uncertainty it removed. This would be helpful, if I had a way of measuring the size of an uncertainty.

If I'm uncertain what's going to happen next, then, at least as far as I'm concerned, there are several possibilities open. And if all the possibilities are equally likely, then the more possibilities there are, the more uncertain I am. To be more quantitative: suppose I am in a situation where any one of N possible things could happen next, and the probability that thing i will happen is p(i). What I'd like is some function f(p(1),p(2),...,p(N)) that would give a numerical value to my uncertainty -- and hence a numerical value to the information that would resolve my uncertainty.

To conform to the common-sense meaning of `information', the function f should have the following characteristics:

1. It should be continuous: a small change in the p(i) shouldn't make a big change in my uncertainty.

2. If all the outcomes of an event are equally likely, then I become more uncertain as the number of outcomes increases.

3. The total information conveyed by learning the outcome of an event should be the same whether I learn it by a series of clues or by a single revelation.

Curiously enough, given these conditions, there is essentially only one function that fits (unique up to a constant factor, which merely fixes the unit of measurement):

I = - \sum_{i=1}^{N} p(i) \log_2 p(i).

If the p(i) are all equal, this can be simplified to I = \log_2 N.
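
For concreteness, here is a small Python sketch of the formula (my own illustration, with an invented function name, not anything taken from Shannon's paper):

import math

def uncertainty(probs):
    """Shannon uncertainty, in bits, of a list of outcome probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin has two equally likely outcomes: log2(2) = 1 bit.
print(uncertainty([0.5, 0.5]))    # 1.0

# A heavily biased coin is less uncertain.
print(uncertainty([0.9, 0.1]))    # about 0.47 bits

The sum skips outcomes with zero probability, since they contribute nothing to the uncertainty.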

Before we can do anything with this expression, we need to clarify just what we mean by `probability'.

There are two schools of thought among probability theorists: the frequentist and subjectivist schools. They have very different definitions:

Considering an event which has N possible outcomes, a frequentist who says ``Outcome i has probability p(i)'' means that if we repeat the event M times, the outcome i will be recorded a fraction p of the time, where p approaches p(i) as M approaches infinity. A subjectivist who makes the same remark, on the other hand, will mean that a rational person will have a degree of belief p(i) that outcome i will occur; this degree of belief might be measured by the odds that a person is prepared to wager on the outcome.

Now consider the case where you're going to toss a coin and I'm going to guess the result. Unknown to me, you have a trick coin, and before I came into the room you had already tossed it an infinite number of times, verifying that it came up heads each time. So according to the frequentist definition, p(head)=1, p(tail)=0 and the numerical value of my uncertainty is therefore zero, that is, I am not missing any information. In fact, however, I don't know it's a trick coin, so I am quite uncertain as to the outcome. This shows that the interpretation to be used in calculating information content must be the subjectivist interpretation.

We can illustrate this by asking how much information is conveyed when I tell you the answer to the question ``What integer between 0 and 9 inclusive am I about to show?''. Application of the above formula shows that this information is worth \log_2 10 \approx 3.32 bits.
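
As a quick check (my own aside, not part of the original lecture), this is just the equal-probability case with N = 10:

import math
print(math.log2(10))    # 3.3219... bits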

Let's see how this can be applied to more interesting situations. Suppose that, instead of a mystery number, I'm going to show you a mystery picture. What is its information content?

To fix our ideas, let us suppose that the picture is taken from a computer screen, that the computer screen is made up of 1,000 pixels by 1,000 pixels, and that the screen is black and white only. How many possible pictures can this screen display?

The total number of possible screens is rather large: about 2 to the power of a million, or roughly 10 to the power 300,000. For comparison, the number of electrons in the universe is thought to be around 10 to the eightieth.
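
A rough calculation, assuming one bit per black-or-white pixel (the figures are the ones used above):

import math

pixels = 1_000 * 1_000    # one million pixels
bits = pixels             # one bit per black-or-white pixel
print(f"2**{bits} is about 10**{bits * math.log10(2):.0f}")   # roughly 10**301030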

Conveying Information

A related though distinct topic is: how can we most effectively convey information? This problem lies more within psychology than within mathematics.

Many engineering methods, finite-element analysis for example, can generate very large vectors of numbers, representing perhaps the temperature at each point on a structure. The human body has a particular organ, the eye, which is uniquely powerful at absorbing large amounts of data once they have been turned into pictures.

Data representing the distribution of variables over the surface of objects can be turned into pictures in a natural way. With a little more effort, it is possible to turn other datasets into comprehensible pictures. One method of doing this is bitmapping. A computer screen is made up of a large number of pixels, and the colour of each pixel may be set by 16, 24 or 36 bits. Addressing individual pixels is known as bitmapping.
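
The following Python sketch shows the idea; the 100-by-100 grid and the `temperature' field with a hot spot are invented purely for illustration:

import numpy as np
import matplotlib.pyplot as plt

# An invented dataset: a 100 x 100 grid of temperatures with a hot spot.
x, y = np.meshgrid(np.linspace(-1, 1, 100), np.linspace(-1, 1, 100))
temperature = np.exp(-(x**2 + y**2) / 0.1)

# Bitmapping: each array element sets the colour of one pixel.
plt.imshow(temperature, cmap="hot")
plt.colorbar(label="temperature (arbitrary units)")
plt.savefig("temperature_bitmap.png")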

Recall our discussion of neural nets in Lecture 3. Here's a simpler example of a neural net, designed to implement an XOR gate. We could design such a net ourselves, but we could also train it, by giving it a long series of training examples and modifying the weights by back-propagation until the desired performance is achieved.

The process of training is not well-understood; neural net researchers don't know what starting configuration will lead to the most rapid learning. We have tried to investigate this by choosing a range of values for the initial weights, then displaying the results as a bitmap. This approach is limited, since we can only choose two connections to study at a time, but the results are suggestive.
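
The sketch below gives the flavour of such an experiment; it is not the code used to produce the lecture's bitmaps, and the network size (2-2-1), learning rate, number of epochs and range of starting weights are all assumptions of mine. It trains a small sigmoid network on XOR by back-propagation, sweeps two of the starting weights over a grid, and displays the final error as a bitmap.

import numpy as np
import matplotlib.pyplot as plt

# XOR truth table: inputs and targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_xor(w_a, w_b, epochs=3000, lr=2.0, seed=0):
    """Train a 2-2-1 sigmoid net by back-propagation and return the final
    mean error; the two arguments override two of the starting weights."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.5, (2, 2))    # input -> hidden weights
    b1 = np.zeros(2)
    W2 = rng.normal(0, 0.5, (2, 1))    # hidden -> output weights
    b2 = np.zeros(1)
    W1[0, 0], W1[1, 1] = w_a, w_b      # the two connections we are studying

    for _ in range(epochs):
        h = sigmoid(X @ W1 + b1)            # forward pass
        y = sigmoid(h @ W2 + b2)
        dy = (y - T) * y * (1 - y)          # output-layer delta
        dh = (dy @ W2.T) * h * (1 - h)      # hidden-layer delta
        W2 -= lr * h.T @ dy
        b2 -= lr * dy.sum(axis=0)
        W1 -= lr * X.T @ dh
        b1 -= lr * dh.sum(axis=0)
    return np.abs(y - T).mean()

# Sweep the two chosen starting weights and display the result as a bitmap.
values = np.linspace(-4, 4, 25)
errors = np.array([[train_xor(a, b) for a in values] for b in values])
plt.imshow(errors, extent=[-4, 4, -4, 4], origin="lower", cmap="gray")
plt.xlabel("initial weight a")
plt.ylabel("initial weight b")
plt.title("Final XOR error after training")
plt.savefig("xor_weight_bitmap.png")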

Edward Tufte is an expert in graphic design, and gives many examples of the effective use of pictures to convey information. One particularly ingenious innovation is Chernoff's use of cartoon faces to represent combinations of several variables: the shape of the eyes may represent personal income, the shape of the nose the subject's age, and so on. This takes advantage of the fact that humans are `hard-wired' to be good at recognising faces; thus, it is relatively easy to see that a particular combination of variables doesn't fit into its neighbourhood in a graph.

Compressing Information

We begin by asking, ``What is a random sequence?'' In general, it is a sequence where, however many members of the sequence we've seen, we have no idea what the next one will be. The opposite of randomness is redundancy: if a message is redundant, it can be shortened without loss of information.

Generating a random sequence is more difficult than one might think; simply writing down numbers with no particular plan in mind is not a good method. We tend to avoid apparent patterns, such as repeated digits, even though a truly random sequence will have repeated digits every tenth place, on average. There are algorithms for generating pseudo-random sequences on a computer; some indication of the depth of this problem can be gathered from the fact that it occupies pages 1-173 of the second volume of Donald Knuth's densely-written textbook, The Art of Computer Programming. The only foolproof way of generating a truly random stream that I know of is measuring the intervals between successive nuclear decays in a radioactive isotope.
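
By way of illustration only, here is the simplest family of such algorithms, a linear congruential generator; the constants are a commonly quoted pair (from Numerical Recipes), not necessarily the ones Knuth would recommend:

def lcg(seed, a=1664525, c=1013904223, m=2**32):
    """A linear congruential pseudo-random generator: x -> (a*x + c) mod m."""
    state = seed
    while True:
        state = (a * state + c) % m
        yield state / m           # scale to a float in [0, 1)

gen = lcg(seed=42)
print([round(next(gen), 3) for _ in range(5)])

Such a generator is entirely deterministic: start it with the same seed and it produces the same `random' sequence every time, which is precisely why it is only pseudo-random.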

Testing a candidate sequence for randomness is also a non-trivial problem. If we can consistently predict the next element with a success rate greater than chance, the sequence is definitely not random. But if we have so far failed to make such a successful prediction, the most we can say is that we haven't detected the pattern yet.
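
As a toy example of such a test (far weaker than the batteries Knuth describes, and invented here only to make the point), a frequency test checks whether ones and zeroes occur in roughly equal numbers; failing it badly rules randomness out, but passing it proves nothing:

import math

def frequency_test(bits):
    """z-score of the count of ones; |z| > 3 is strong evidence of bias."""
    n = len(bits)
    ones = sum(bits)
    return (ones - n / 2) / math.sqrt(n / 4)

# A blatantly patterned sequence still passes this particular test.
print(frequency_test([1, 0] * 500))    # 0.0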

Why would we want to generate a random sequence anyway? One use for a random sequence is to create an unbreakable code. Such a code can be created using a one-time pad: the message to be encrypted is translated into a sequence of ones and zeroes, and this sequence is then combined with a random sequence of the same length using the exclusive-or (`XOR') operation. The result looks like a random sequence itself, and can be decoded only by XOR-ing it with a second copy of the original random sequence. The drawback to this encoding scheme is that both sender and receiver of the coded message must have identical copies of the random sequence used, and these sequences can only be used once (hence `one-time pad').
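
A minimal sketch of the scheme in Python (the pad here comes from the standard secrets module, standing in for a genuinely random source such as the radioactive one mentioned above):

import secrets

def xor_bytes(data, pad):
    """Combine two equal-length byte strings with XOR."""
    return bytes(d ^ p for d, p in zip(data, pad))

message = b"ATTACK AT DAWN"
pad = secrets.token_bytes(len(message))   # one-time pad, as long as the message

ciphertext = xor_bytes(message, pad)      # indistinguishable from noise
recovered = xor_bytes(ciphertext, pad)    # XOR-ing with the same pad decodes it
print(recovered)                          # b'ATTACK AT DAWN'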

If the opposite of randomness is predictability, or redundancy, then most English-language sentences are quite redundant: we can remove a fraction of the letters and what's left will still be comprehensible. Consider these two patterns:

[Figure order.eps: two grids of cells, the left an orderly solid block of black squares, the right a chaotic scattering of black and white squares.]

We all recognise the left pattern as orderly and the right pattern as chaotic. Now consider describing these patterns to someone over a phone line. We could describe the left pattern easily: ``3 by 12 block of black, corners at A1 and C12'', whereas the right pattern requires more information to specify: ``One black at A1, one at E1, ...''. This suggests a definition: a pattern displays order if it can be described concisely. Or, more quantitatively, a pattern displays order if it can be specified using less information than the information formula would lead us to expect.
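
One simple way of making `can be described concisely' concrete is run-length encoding, used here purely as an illustration: an orderly pattern collapses to a handful of runs, while a chaotic one hardly shrinks at all.

import random
from itertools import groupby

def run_length_encode(cells):
    """Describe a sequence of cells as (value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(cells)]

orderly = "B" * 36                                         # the solid block, read cell by cell
random.seed(1)
chaotic = "".join(random.choice("BW") for _ in range(36))  # an arbitrary scattering

print(run_length_encode(orderly))        # a single run: [('B', 36)]
print(len(run_length_encode(chaotic)))   # many runs: little or no saving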

Any kind of symmetry will reduce the information needed to describe a pattern. If an image has a vertical axis of symmetry, for example, we can describe the left half of the image and say, ``The right is a reflection of the left''. Indeed, if any part of the pattern contains any clue about any other part, this reduces the information necessary to describe the pattern.

A message in English therefore has a lower information content than an arbitrary string of letters. If we know the message is in English, reading the letter `q' leaves us quite certain that the next letter will be `u'. Indeed, given almost any passage of English text, we can remove one letter in every three and still have a fair chance of reading the resultant message. And the more orderly the English message is, the lower its information content.

For example, which is more orderly, a telephone directory or a novel? I would say the novel: if you rip a page out of the novel, you can deduce some or all of what happened on that page from what comes before and after. But there's no way of deducing the contents of the missing page of the phone book. So we should conclude that the novel has the lower information content. This seems reasonable; we can write a decent one-page summary of most novels, but we can't write a useful summary of a phone book.

By the same reasoning, a poem has lower information content than a prose passage of the same length, because the metre and rhyme scheme provide clues which would make it easier for us to fill in a lost word. Very structured poems, such as sonnets, limericks and haikus, have the lowest information content of all.

We should distinguish between `having high information content' and `being interesting' or `being significant'. The phone book qualifies for the first of these, but for neither of the others. In fact, the qualities that make a book interesting all seem to be ones that reduce information content: an intricate plot and consistent characterisation, for example, reduce the number of ways in which the book can develop.

Just as English sentences can be compressed significantly, so can pictures of the natural world. The researcher Michael Barnsley at Georgia Tech has shown that images of the natural world can also be reduced to a simple pattern, then re-created by repeatedly plotting this pattern on different scales and at different orientations. [Barnsley, p.40; p. 92; plates 8.43-9.8.14]. An example is the picture of a fern shown during the lecture, which was generated by iterative mapping of a simple generating pattern. Barnsley has remarked that there's very little room to store information in a fern spore, and that this suggested to him that the description of the complex form of a full-grown fern must be encodable in a very compact way. Using Barnsley's technique, natural scenes can be compressed many times over. This tells us that those scenes possess an intrinsic order. And this should not surprise us, because the natural world is ordered. This order is what makes possible the laws of science.
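
As an illustration of the kind of iteration involved (using the widely published coefficients for Barnsley's fern, which may differ in detail from those in his book), a whole plant emerges from four affine maps applied at random:

import random
import matplotlib.pyplot as plt

# Four affine maps (x, y) -> (a*x + b*y + e, c*x + d*y + f), each chosen
# with the given probability; these are the commonly quoted fern coefficients.
maps = [
    ( 0.00,  0.00,  0.00, 0.16, 0.0, 0.00, 0.01),   # stem
    ( 0.85,  0.04, -0.04, 0.85, 0.0, 1.60, 0.85),   # successively smaller copies
    ( 0.20, -0.26,  0.23, 0.22, 0.0, 1.60, 0.07),   # left-hand leaflet
    (-0.15,  0.28,  0.26, 0.24, 0.0, 0.44, 0.07),   # right-hand leaflet
]
weights = [m[6] for m in maps]

x, y = 0.0, 0.0
xs, ys = [], []
for _ in range(50_000):
    a, b, c, d, e, f, _p = random.choices(maps, weights=weights)[0]
    x, y = a * x + b * y + e, c * x + d * y + f
    xs.append(x)
    ys.append(y)

plt.scatter(xs, ys, s=0.1, color="green")
plt.axis("off")
plt.savefig("fern.png", dpi=200)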

Speech can be compressed very effectively; almost eight-fold compression is achievable with no perceptible loss of quality, and recognisable speech can survive a further eight-fold compression. Compressing music is more difficult; lossless compression can reduce the size of music files by 20-30%, whereas lossy compression, such as mp3, can achieve up to eightfold compression.

Movies can be compressed by a factor of 15 to 20. The most popular form of compression, mpeg, takes advantage of the facts that (i) most pixels are the same colour as their neighbours and (ii) most pixels stay the same colour for extended periods of time. This allows the moving picture to be divided into chunks that extend through both space and time, each of which can be described compactly.
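
A toy frame-differencing scheme captures the flavour of point (ii), though it is only a sketch and real mpeg coders are far more elaborate (motion compensation, transform coding and so on): store the first frame in full, then only the pixels that change.

import numpy as np

def encode(frames):
    """Keep the first frame whole; for later frames, keep only changed pixels."""
    diffs = []
    for prev, cur in zip(frames, frames[1:]):
        changed = np.argwhere(prev != cur)
        diffs.append([(tuple(ij), cur[tuple(ij)]) for ij in changed])
    return frames[0], diffs

def decode(first, diffs):
    frames = [first.copy()]
    for d in diffs:
        frame = frames[-1].copy()
        for (i, j), value in d:
            frame[i, j] = value
        frames.append(frame)
    return frames

# A toy clip: one bright pixel drifting across a dark 8 x 8 frame.
clip = []
for t in range(5):
    frame = np.zeros((8, 8), dtype=np.uint8)
    frame[4, t] = 255
    clip.append(frame)

first, diffs = encode(clip)
assert all((a == b).all() for a, b in zip(clip, decode(first, diffs)))
print([len(d) for d in diffs])    # only two changed pixels per frame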

It is possible to create uncompressible pictures; for example, a picture in which each pixel's value is chosen at random cannot be compressed without loss of information content. On the other hand, such pictures don't look like anything -- because there's no order of any kind in the image, we can't make any sense of what we see.

To sum up: in this lecture I have shown that information can be measured, and I have introduced a method, bitmapping, for maximising the information portrayed on a graph. Using this method, we have examined the training of neural nets. We have discussed the information content of images and noted that the randomness of a signal is directly correlated with its information content: the more random a signal, the more information it contains and the less it can be compressed.







John Jones
Mon Oct 21 08:32:47 PDT 2007