With respect to Occam assigning exponentially diminishing probability to special miracles, I tend to think of this in terms of the broad set of hypotheses to be considered: if my probabilities are to sum up to 1 I can't coherently assign all my 'the world is a lie' probability mass to whatever hypothesis has been brought to my attention in the last five minutes. The code of a short program can be contained in astronomically many ways within a larger program. An indifference principle gets you going from there.
By "short program," Carl is presumably referring to something like the minimum description length (MDL) principle for explaining our observations. I'm curious to know how exactly he's envisioning its application, though.
For purposes of illustration, consider this fictional scenario. My friend Joe calls me in the evening with a worried tone in his voice. He says, "I've got something to tell you. I was just brushing my teeth, when I heard a voice. It said, 'Joe, I have an important message. You need to write a special number on your toothpaste tube, or else your toothpaste will cease to work properly. That number is 2835023981. Do as I say and all will be fine.' Then the voice disappeared."
Now, there are lots of hypotheses we can imagine here. In particular, I'll consider 10^10 + 1 of them:
(0) My friend imagined the whole thing. Writing a number on the toothpaste tube will accomplish nothing.
(1) In fact, my friend needs to write a number on the tube, but that number is not the one he was told, but rather 0000000001.
(2) My friend needs to write not the number he was told, but 0000000002.
...
(2835023981) My friend needs to write the number he was told, 2835023981.
...
(10^10) My friend needs to write not the number he was told, but 10000000000.
Obviously, there are lots more hypotheses to consider -- e.g., that my friend needs to write a number bigger than 10^10, that it needs to include decimals, that he needs to write it on his forehead instead of his toothpaste tube, that he needs to eat green cheese instead of writing a number, and so on. But just these 10^10 + 1 hypotheses give a sense of the literally exponential number of potential complicated scenarios.
How does MDL evaluate each of these hypotheses? One suggestion I can imagine is as follows.
(0) This hypothesis just involves ordinary physics -- things like Maxwell's equations, or perhaps rules of string theory -- plus maybe some physical constants and the initial conditions of the universe. Given those, it would be possible in principle to compute the entire history of the universe, including the evolution of humans, the birth of my friend, and my reception of his phone call. (If the "data" to be explained here are my personal observations, then perhaps the program would also have to specify who I am. It could then compute the pattern of perceptual inputs I receive throughout my lifetime, including the auditory waves from the speaker of my phone with my friend's voice.)
(1) This hypothesis involves mostly ordinary physics, including all of the same information as before. However, it includes an extra specification that, contrary to ordinary physical law, my friend should start getting cavities unless he writes 0000000001 on his toothpaste.
(2) Ditto as above, except with 0000000002.
...
(10^10) Ditto as above, except with 10000000000.
I think this illustrates what Carl meant about a shorter program being contained within astronomically many longer programs.
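To make the indifference step concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that some fixed amount of prior mass goes to the whole family "ordinary physics plus a toothpaste exception" and that this mass is split evenly over hypotheses (1) through (10^10). The figure of 1/1000 and the names `NUM_SPECIAL`, `family_mass`, and `per_hypothesis_mass` are hypothetical, and the sketch ignores every hypothesis outside the list as well as any evidential weight of the voice itself.

```python
from fractions import Fraction

# Hypotheses (1) through (10^10): one per candidate number.
NUM_SPECIAL = 10**10


def per_hypothesis_mass(family_mass: Fraction, n: int = NUM_SPECIAL) -> Fraction:
    """Split family_mass evenly over n hypotheses (an indifference principle)."""
    return family_mass / n


# Hypothetical figure: total prior credence that *some* number must be written.
family_mass = Fraction(1, 1000)

# Prior mass on any one specific number, such as 2835023981.
mass_each = per_hypothesis_mass(family_mass)

# What is left for hypothesis (0), if the list were exhaustive.
mass_ordinary = 1 - family_mass

print("mass on each numbered hypothesis:", float(mass_each))
print("mass on hypothesis (0):", float(mass_ordinary))
```

However generous the family-wide figure is, the share that lands on the one number Joe happened to hear is smaller by a factor of 10^10, which is the sense in which the probabilities cannot all pile onto the most recently mentioned hypothesis.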
However, there's a problem here. It may be that running program (0) would correctly reproduce my pattern of observations, including my friend's delusion of hearing a voice and the specific sequence of neuron firings that caused him to pick the number 2835023981. But I don't have the computing power or time to test whether that's the case. For all I know, the laws of physics could predict that my friend would imagine he had to write the number -17.6 on his mirror instead. For practical purposes, the level of abstraction here is too fine-grained to be useful for ordinary humans. It's like trying to predict the stock market by modeling quark-level interactions in traders' brains.
So we move to a higher-level model, perhaps psychological. For example,
(0) The human brain is prone to certain kinds of imagined experiences. In order to explain all sorts of psychological phenomena throughout history, this hypothesis has a probability distribution over types of malfunctions that tend to produce weird sensations. Joe's experience corresponds to malfunction #611 combined with #28, plus a specific association with toothpaste and the number 2835023981.
Even this explanation assumes a more sophisticated model of psychology than we currently possess, but I think it gets at the idea of trying to explain the observation using fewer bits than just restating the entire account of what happened. In contrast, hypothesis (2835023981) still has to model most of human psychology, but it also includes the stipulation that "There is indeed an exception to ordinary laws of dental hygiene that will give Joe cavities unless he writes 2835023981 on his toothpaste, and moreover, this information will be communicated to Joe by a pattern of sound waves in his bathroom."
Of course, the other hypotheses seem even worse in description length. For instance, hypothesis (5928342301) has to model most of human psychology and then specify, "There is an exception to ordinary laws of dental hygiene that will give Joe cavities unless he writes 5928342301 on his toothpaste. Moreover, a pattern of sound waves in Joe's bathroom will give him a false message, telling him that the number is actually 2835023981." Here, we're encoding two ten-digit numbers instead of one, plus some extra linguistic information.
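The "two numbers instead of one" comparison can be put in rough numbers. The sketch below assumes a naive encoding in which each ten-digit number costs 10 * log2(10) bits, and a prior weight proportional to 2^(-bits). Both are assumptions made for illustration, not a claim about how a real MDL coding scheme would work, and the function names are hypothetical. The extra linguistic information is left as a parameter that defaults to zero, since nothing here says how many bits it would take.

```python
import math

DIGITS = 10
# Bits needed to pick out one ten-digit number under a naive uniform code.
BITS_PER_NUMBER = DIGITS * math.log2(10)


def extra_bits(num_numbers: int, other_bits: float = 0.0) -> float:
    """Bits added on top of the shared psychology model (assumed encoding)."""
    return num_numbers * BITS_PER_NUMBER + other_bits


def prior_odds(bits_a: float, bits_b: float) -> float:
    """Odds of A relative to B when prior weight is proportional to 2**(-bits)."""
    return 2.0 ** (bits_b - bits_a)


one_number = extra_bits(1)   # hypothesis (2835023981): the voice told the truth
two_numbers = extra_bits(2)  # hypothesis (5928342301): true number plus false message

print("bits per ten-digit number:", round(BITS_PER_NUMBER, 2))
print("odds of two-number vs one-number hypothesis:", prior_odds(two_numbers, one_number))
```

Under these assumptions, the second ten-digit number alone costs about 33 bits, so the two-number hypothesis starts out roughly 10^10 times less probable than the one-number hypothesis, before counting any of the extra wording.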
Carl, is this roughly the kind of reasoning that you had in mind? What should we do about the fact that, in practice, I don't have a good enough theory about the distribution of human mental abnormalities to say that Joe's experience corresponds to malfunctions #611 and #28? My actual description of his experience -- for instance, an email message I might send to you, written to be as short as possible but still understandable -- would require almost as many extra bits as hypothesis (5928342301) does, no?
Finally, here's a slightly tangential question that I'm also curious about. Basic Solomonoff induction, were it computable, would give us a prior distribution over finite or infinite binary strings. How would we transform our experiences of the world, like Joe's phone call, into binary strings in order to apply these prior probabilities? Or would we apply Solomonoff induction in a way that doesn't require predicting digits of binary strings?
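As one very crude illustration of the first option, a verbal report of an experience can be turned into a binary string simply by serializing the text. The sketch below uses UTF-8 as the encoding; that choice and the function names are assumptions made for the example. It does not settle which encoding of raw experience would be appropriate, since different encodings give different strings and therefore different description lengths.

```python
def text_to_bits(text: str) -> str:
    """Encode text as a string of '0'/'1' characters via its UTF-8 bytes."""
    return "".join(f"{byte:08b}" for byte in text.encode("utf-8"))


def bits_to_text(bits: str) -> str:
    """Invert text_to_bits: regroup into bytes and decode as UTF-8."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


report = "Joe heard a voice telling him to write 2835023981 on his toothpaste."
bits = text_to_bits(report)

print(len(bits), "bits")
print(bits[:32], "...")
```

A sequence predictor could then be asked for the probability of the next bit given the bits so far, though a typed summary of a phone call is of course far removed from the stream of perceptual input the question is really about.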