## Monday, May 4, 2009

### Formalizing "Shorter Programs"

In a separate discussion, my friend Carl said:
With respect to Occam assigning exponentially diminishing probability to special miracles, I tend to think of this in terms of the broad set of hypotheses to be considered: if my probabilities are to sum up to 1 I can't coherently assign all my 'the world is a lie' probability mass to whatever hypothesis has been brought to my attention in the last five minutes. The code of a short program can be contained in astronomically many ways within a larger program. An indifference principle gets you going from there.
By "short program," Carl is presumably referring to something like the minimum description length (MDL) principle for explaining our observations. I'm curious to know how exactly he's envisioning its application, though.

For purposes of illustration, consider this fictional scenario. My friend Joe calls me in the evening with a worried tone in his voice. He says, "I've got something to tell you. I was just brushing my teeth, when I heard a voice. It said, 'Joe, I have an important message. You need to write a special number on your toothpaste tube, or else your toothpaste will cease to work properly. That number is 2835023981. Do as I say and all will be fine.' Then the voice disappeared."

Now, there are lots of hypotheses we can imagine here. In particular, I'll consider 10^10 + 1 of them:

(0) My friend imagined the whole thing. Writing a number on the toothpaste tube will accomplish nothing.
(1) In fact, my friend needs to write a number on the tube, but that number is not the one he was told, but rather 0000000001.
(2) My friend needs to write not the number he was told, but 0000000002.
...
(2835023981) My friend needs to write the number he was told, 2835023981.
...
(10^10) My friend needs to write not the number he was told, but 1000000000.

Obviously, there are lots more hypotheses to consider -- e.g., that my friend needs to write a number bigger than 10^10, that it needs to include decimals, that he needs to write it on his forehead instead of his toothpaste tube, that he needs to eat green cheese instead of writing a number, and so on. But just these 10^10 + 1 hypotheses give a sense of the literally exponential number of potential complicated scenarios.

How does MDL evaluate each of these hypotheses? One suggestion I can imagine is as follows.

(0) This hypothesis just involves ordinary physics -- things like Maxwell's equations, or perhaps rules of string theory -- plus maybe some physical constants. Given those initial conditions, it would be possible in principle to compute the entire history of the universe, including the evolution of humans, the birth of my friend, and my reception of his phone call. (If the "data" to be explained here are my personal observations, then perhaps the program would also have to specify who I am. It could then compute the pattern of perceptual inputs I receive throughout my lifetime, including the auditory waves from the speaker of my phone with my friend's voice.)
(1) This hypothesis involves mostly ordinary physics, including all of the same information as before. However, it includes an extra specification that, contrary to ordinary physical law, my friend should start getting cavities unless he writes 0000000001 on his toothpaste.
(2) Ditto as above, except with 0000000002.
...
(10^10) Ditto as above, except with 1000000000.

I think this illustrates what Carl meant about a shorter program being contained within astronomically many longer programs.

However, there's a problem here. It may be that computing program (0) would allow one successfully to determine my pattern of observations, including my friend's delusion of hearing a voice and the specific sequence of neuron firings that caused him to pick the number 2835023981. But I don't have the computing power or time to test whether that's the case. For all I know, the laws of physics could predict that my friend would imagine he had to write the number -17.6 on his mirror instead. For practical purposes, the level of abstraction here is too fine-grained to be useful for ordinary humans. It's like trying to predict the stock market by modeling quark-level interactions in traders' brains.

So we move to a higher-level model, perhaps psychological. For example,

(0) The human brain is prone to certain kinds of imagined experiences. In order to explain all sorts of psychological phenomena throughout history, this hypothesis has a probability distribution over types of malfunctions that tend to produce weird sensations. Joe's experience corresponds to malfunction #611 combined with #28, plus a specific association with toothpaste and the number 2835023981.

Even this explanation assumes a more sophisticated model of psychology than we currently possess, but I think it gets at the idea of trying to explain the observation using fewer bits than just restating the entire account of what happened. In contrast, hypothesis (2835023981) still has to model most of human psychology, but it also includes the stipulation that "There is indeed an exception to ordinary laws of dental hygiene that will give Joe cavities unless he writes 2835023981 on his toothpaste, and moreover, this information will be communicated to Joe by a pattern of sound waves in his bathroom."

Of course, the other hypotheses seem even worse in description length. For instance, hypothesis (5928342301) has to model most of human psychology and then specify, "There is an exception to ordinary laws of dental hygiene that will give Joe cavities unless he writes 5928342301 on his toothpaste. Moreover, a pattern of sound waves in Joe's bathroom will give him a false message, telling him that the number is actually 2835023981." Here, we're encoding two ten-digit numbers instead of one, plus some extra linguistic information.

Carl, is this roughly the kind of reasoning that you had in mind? What should we do about the fact that, in practice, I don't have a good enough theory about the distribution of human mental abnormalities to say that Joe's experience corresponds to malfunctions #611 and #28? My actual description of his experience -- for instance, an email message I might send to you, written to be as short as possible but still understandable -- would require almost as many extra bits as hypothesis (5928342301) does, no?

Finally, here's a slightly tangential question that I'm also curious about. Basic Solomonoff induction, were it computable, would give us a prior distribution over finite or infinite binary strings. How would we transform our experiences of the world, like Joe's phone call, into binary strings in order to apply these prior probabilities? Or would we apply Solomonoff induction in a way that doesn't require predicting digits of binary strings?

1. It seems like your examples apply a double-standard by neglecting functional counterparts of (0) but not those of (1), in part because of the use of compressed verbal descriptions that understate the complexity of the miracles and the simplicity of a universe with just a few simple mathematical physical laws.

A simple Game of Life style cellular automata setup can generate Turing-complete computers, life, etc with a short program. There are cosmically many such programs that are more or less functionally interchangeable.

A program to assign specific qualia or specific toothpaste miracles to particular structures in the cellular automata world would require a number of bits for each quale (to specify the qualia, to specify the processes to elicit the qualia, and something like an embryonic AI to apply the compressed data) or miracle and would.

Also, the evidence that the delusional person gives number x strengthens both the naturalistic hypotheses that predict x and the wackier ones. If previously both types distributed probability mass evenly over the possible numbers the guy could give, the balance of credence won't change.

2. "It seems like your examples apply a double-standard by neglecting functional counterparts of (0) but not those of (1)"

I'm not quite sure what you mean here. Is the suggestion that we could come up with lots of variants of (0) that would generate each of the corresponding numbers 1, ..., 10^10?

"and something like an embryonic AI to apply the compressed data"

I'm not sure I understand what this means....

"Also, the evidence that the delusional person gives number x strengthens both the naturalistic hypotheses that predict x and the wackier ones. If previously both types distributed probability mass evenly over the possible numbers the guy could give, the balance of credence won't change."

A good point. In the same way, someone can't make a supernatural claim more believable by including extra details in the story (unless, say, those details are highly accurate predictions of future events).

---

Is it not the case that in practice we have to work at a higher level than cellular automata? We don't actually know what kinds of psychological phenomena they predict. We can have faith that there exist automata that do explain our observations, but no one has actually checked that. It would be nice if I could tell you what happened to my friend just by saying, "Run cellular automaton #1892 for 8,293,110 steps." But no one actually knows if there's an automaton like that, and until one is demonstrated, we can't assume that it must exist. (In a similar way, we can have faith that science will one day explain seemingly paranormal phenomena. But until those explanations are actually produced, we can't claim that the mysteries have been solved.)

3. "But until those explanations are actually produced, we can't claim that the mysteries have been solved.)"

Yes, there's still subjective uncertainty, but so what? If I roll some fair dice, I'll get a particular result that I cant predict exactly. Do we have an important unsolved mystery in that case, unless alternative (non-fair dice, supernatural intervention, etc) hypotheses quite disproportionately favored the actual result?

4. "Do we have an important unsolved mystery in that case, unless alternative (non-fair dice, supernatural intervention, etc) hypotheses quite disproportionately favored the actual result?"

Good point.

Still, a supernatural-toothpaste theory does assign higher likelihood to the observations. I guess your claim is that the bits saved on explaining the data are more than outweighed by the extra bits needed to describe the supernatural theory. In a similar way, a supernatural-die-comes-up-1 theory has 6 times the likelihood of a random-die theory, but it presumably takes far more than lg 6 bits to describe.

5. A quick follow-up to my previous comment. The whole question of whether to posit a paranormal explanation is really just an instance of the model selection / overfitting problem from statistics: Do you introduce additional parameters to better fit the data? In the die example, the extra parameter says, "This die is such that it comes up 1 when rolled at such and such place and time."

(Of course, all inference is just a special case of model selection, but the analogy is particularly clear here.)