Monday, May 4, 2009

Formalizing "Shorter Programs"

In a separate discussion, my friend Carl said:
With respect to Occam assigning exponentially diminishing probability to special miracles, I tend to think of this in terms of the broad set of hypotheses to be considered: if my probabilities are to sum up to 1 I can't coherently assign all my 'the world is a lie' probability mass to whatever hypothesis has been brought to my attention in the last five minutes. The code of a short program can be contained in astronomically many ways within a larger program. An indifference principle gets you going from there.
By "short program," Carl is presumably referring to something like the minimum description length (MDL) principle for explaining our observations. I'm curious to know how exactly he's envisioning its application, though.

For purposes of illustration, consider this fictional scenario. My friend Joe calls me in the evening with a worried tone in his voice. He says, "I've got something to tell you. I was just brushing my teeth, when I heard a voice. It said, 'Joe, I have an important message. You need to write a special number on your toothpaste tube, or else your toothpaste will cease to work properly. That number is 2835023981. Do as I say and all will be fine.' Then the voice disappeared."

Now, there are lots of hypotheses we can imagine here. In particular, I'll consider 10^10 + 1 of them:

(0) My friend imagined the whole thing. Writing a number on the toothpaste tube will accomplish nothing.
(1) In fact, my friend needs to write a number on the tube, but that number is not the one he was told, but rather 0000000001.
(2) My friend needs to write not the number he was told, but 0000000002.
(2835023981) My friend needs to write the number he was told, 2835023981.
(10^10) My friend needs to write not the number he was told, but 1000000000.

Obviously, there are lots more hypotheses to consider -- e.g., that my friend needs to write a number bigger than 10^10, that it needs to include decimals, that he needs to write it on his forehead instead of his toothpaste tube, that he needs to eat green cheese instead of writing a number, and so on. But just these 10^10 + 1 hypotheses give a sense of the literally exponential number of potential complicated scenarios.

How does MDL evaluate each of these hypotheses? One suggestion I can imagine is as follows.

(0) This hypothesis just involves ordinary physics -- things like Maxwell's equations, or perhaps rules of string theory -- plus maybe some physical constants. Given those initial conditions, it would be possible in principle to compute the entire history of the universe, including the evolution of humans, the birth of my friend, and my reception of his phone call. (If the "data" to be explained here are my personal observations, then perhaps the program would also have to specify who I am. It could then compute the pattern of perceptual inputs I receive throughout my lifetime, including the auditory waves from the speaker of my phone with my friend's voice.)
(1) This hypothesis involves mostly ordinary physics, including all of the same information as before. However, it includes an extra specification that, contrary to ordinary physical law, my friend should start getting cavities unless he writes 0000000001 on his toothpaste.
(2) Ditto as above, except with 0000000002.
(10^10) Ditto as above, except with 1000000000.

I think this illustrates what Carl meant about a shorter program being contained within astronomically many longer programs.

However, there's a problem here. It may be that computing program (0) would allow one successfully to determine my pattern of observations, including my friend's delusion of hearing a voice and the specific sequence of neuron firings that caused him to pick the number 2835023981. But I don't have the computing power or time to test whether that's the case. For all I know, the laws of physics could predict that my friend would imagine he had to write the number -17.6 on his mirror instead. For practical purposes, the level of abstraction here is too fine-grained to be useful for ordinary humans. It's like trying to predict the stock market by modeling quark-level interactions in traders' brains.

So we move to a higher-level model, perhaps psychological. For example,

(0) The human brain is prone to certain kinds of imagined experiences. In order to explain all sorts of psychological phenomena throughout history, this hypothesis has a probability distribution over types of malfunctions that tend to produce weird sensations. Joe's experience corresponds to malfunction #611 combined with #28, plus a specific association with toothpaste and the number 2835023981.

Even this explanation assumes a more sophisticated model of psychology than we currently possess, but I think it gets at the idea of trying to explain the observation using fewer bits than just restating the entire account of what happened. In contrast, hypothesis (2835023981) still has to model most of human psychology, but it also includes the stipulation that "There is indeed an exception to ordinary laws of dental hygiene that will give Joe cavities unless he writes 2835023981 on his toothpaste, and moreover, this information will be communicated to Joe by a pattern of sound waves in his bathroom."

Of course, the other hypotheses seem even worse in description length. For instance, hypothesis (5928342301) has to model most of human psychology and then specify, "There is an exception to ordinary laws of dental hygiene that will give Joe cavities unless he writes 5928342301 on his toothpaste. Moreover, a pattern of sound waves in Joe's bathroom will give him a false message, telling him that the number is actually 2835023981." Here, we're encoding two ten-digit numbers instead of one, plus some extra linguistic information.

Carl, is this roughly the kind of reasoning that you had in mind? What should we do about the fact that, in practice, I don't have a good enough theory about the distribution of human mental abnormalities to say that Joe's experience corresponds to malfunctions #611 and #28? My actual description of his experience -- for instance, an email message I might send to you, written to be as short as possible but still understandable -- would require almost as many extra bits as hypothesis (5928342301) does, no?

Finally, here's a slightly tangential question that I'm also curious about. Basic Solomonoff induction, were it computable, would give us a prior distribution over finite or infinite binary strings. How would we transform our experiences of the world, like Joe's phone call, into binary strings in order to apply these prior probabilities? Or would we apply Solomonoff induction in a way that doesn't require predicting digits of binary strings?

Saturday, May 2, 2009

Normal Beliefs: An Insanity Defense?

I am a primate running patchwork cognitive algorithms on relatively fragile wetware. We know that brain devices fail at relatively high rates. 19% of the US population has a mental illness of some sort, with a small fraction of these cases involving serious insanity or delusion. In addition, some people simply lack certain normal abilities, such as the 7% of males who are colorblind.

I and many of my associates have extraordinarily strange beliefs. Many of these are weird facts -- e.g., that an exact copy of me exists within a radius of 10^(10^29) meters. But others are logical conclusions (e.g., that libertarian free will is incoherent) and methodological notions (e.g., that Occam's razor makes the parallelism solution to the mind-body problem astronomically improbable). These latter kinds of beliefs theoretically involve certainty or near certainty.

But given my understanding of the frailty of human beliefs in general -- to say nothing of the tempting possibility that correct knowledge is out of the question, or that all of these statements are entirely meaningless -- should I assign nonzero probability to the possibility that I'm wrong about these conceptual matters?

One answer is to say "no": We all start with assumptions, and I'm making the assumptions that I'm making. This is my attitude toward things like Bayes' theorem and Occam's razor. In the same way that my impulse to prevent suffering is ultimately something that I want to do, "just because," so my faith in math and Bayesian epistemology is simply something the collection of atoms in my brain has chosen to have, and that's that. (I wonder: Is there any sense in which it would be possible to assign probability less than 1 to the Bayesian framework itself? Prima facie, this would be simply incoherent.)

But what about other, less foundational conclusions, like the incoherence of libertarian free will? It's not obvious to me that the negation of this conclusion would contradict my epistemological framework, since my position on the issue may stem from lack of imagination (I can't conceive of anything other than determinism or random behavior) rather than clear logical contradiction. On this point itself I'm uncertain -- maybe libertarian free will is logically impossible. But I'm not smart enough to be sure. And even if I felt sure, I very well might be mistaken, or even -- as suggested in the first paragraph -- completely insane.

Can probability be used to capture uncertainties of this type? In practice, the answer is clearly yes. I've done enough math homework problems to know that my probability of making an algebra mistake is not only nonzero but fairly high. And it's not incoherent to reason about errors of this type. For instance, if I do a utility calculation involving a complex algebraic formula, I may be uncertain as to whether I've made a sign error, in which case the answer would be negated. It's perfectly reasonable for me to assign, say, 90% probability to having done the computation correctly and 10% to having made the sign error and then multiply these by their corresponding utility-values-if-correct-computation. There's no mystery here: I'm just assigning probabilities over the conceptually unproblematic hypotheses "Brian got the right answer" vs. "Brian made a sign error."

In practice, of course, it's rarely useful to apply this sort of reasoning, because the number of wrong math answers is, needless to say, infinite. (Still, it might be useful to study the distribution of correct and incorrect answers that occur in practice. This reminds me of the suggestion by a friend that mathematicians might study the rates at which conjectures of certain types turn out to be true, in order to better estimate probabilities of theorems they can't actually prove. Indeed, statistical techniques have been used within the domain of automated theorem proving.) When someone objects to a rationalist's conclusion about such and such on the grounds that "Your cognitive algorithm might be flawed," the rationalist can usually reply, "Well, maybe, sure. But what am I going to do about it? Which element of the huge space of alternatives am I going to pick instead?"

Perhaps one answer to that question could be "Beliefs that fellow humans, running their own cognitive algorithms, have arrived at." After all, those people are primates trying to make sense of their environment just like you are, and it doesn't seem inconceivable that not only are you wrong but they're actually right. This would seem to suggest some degree of philosophical majoritarianism. Obviously we need to weight different people's beliefs according to the probability that their cognitive algorithms are sound, but we should keep in mind the fact that those weights are themselves circular.

How concerned should we be that, say, people who believe in parallelism of mind and body are actually correct?