Bayesianism — preposterous mumbo jumbo
July 23, 2014Leave a commentGo to comments
from Lars Syll
Neoclassical economics nowadays usually assumes that agents that have to make choices under conditions of uncertainty behave according to Bayesian rules (preferably the ones axiomatized by Ramsey (1931), de Finetti (1937) or Savage (1954)) – that is, they maximize expected utility with respect to some subjective probability measure that is continually updated according to Bayes theorem. If not, they are supposed to be irrational, and ultimately – via some “Dutch book” or “money pump” argument – susceptible to being ruined by some clever “bookie”.
Bayesianism reduces questions of rationality to questions of internal consistency (coherence) of beliefs, but – even granted this questionable reductionism – do rational agents really have to be Bayesian? As I have been arguing elsewhere (e. g.here and here) there is no strong warrant for believing so.
In many of the situations that are relevant to economics one could argue that there is simply not enough of adequate and relevant information to ground beliefs of a probabilistic kind, and that in those situations it is not really possible, in any relevant way, to represent an individual’s beliefs in a single probability measure.
The view that Bayesian decision theory is only genuinely valid in a small world was asserted very firmly by Leonard Savage when laying down the principles of the theory in his path-breaking Foundations of Statistics. He makes the distinction between small and large worlds in a folksy way by quoting the proverbs ”Look before you leap” and ”Cross that bridge when you come to it”. You are in a small world if it is feasible always to look before you leap. You are in a large world if there are some bridges that you cannot cross before you come to them.
As Savage comments, when proverbs conflict, it is proverbially true that there is some truth in both—that they apply in different contexts. He then argues that some decision situations are best modeled in terms of a small world, but others are not. He explicitly rejects the idea that all worlds can be treated as small as both ”ridiculous” and ”preposterous” … Frank Knight draws a similar distinction between making decision under risk or uncertainty …
Bayesianism is understood [here] to be the philosophical principle that Bayesian methods are always appropriate in all decision problems, regardless of whether the relevant set of states in the relevant world is large or small. For example, the world in which financial economics is set is obviously large in Savage’s sense, but the suggestion that there might be something questionable about the standard use of Bayesian updating in financial models is commonly greeted with incredulity or laughter.
Someone who acts as if Bayesianism were correct will be said to be a Bayesianite. It is important to distinguish a Bayesian like myself—someone convinced by Savage’s arguments that Bayesian decision theory makes sense in small worlds—from a Bayesianite. In particular, a Bayesian need not join the more extreme Bayesianites in proceeding as though:
• All worlds are small.
• Rationality endows agents with prior probabilities.
• Rational learning consists simply in using Bayes’ rule to convert a set of prior
probabilities into posterior probabilities after registering some new data.Bayesianites are often understandably reluctant to make an explicit commitment to these principles when they are stated so baldly, because it then becomes evi-dent that they are implicitly claiming that David Hume was wrong to argue that the principle of scientific induction cannot be justified by rational argument …
Bayesianites believe that the subjective probabilities of Bayesian decision theory can be reinterpreted as logical probabilities without any hassle. Its adherents therefore hold that Bayes’ rule is the solution to the problem of scientific induction. No support for such a view is to be found in Savage’s theory—nor in the earlier theories of Ramsey, de Finetti, or von Neumann and Morgenstern. Savage’s theory is entirely and exclusively a consistency theory. It says nothing about how decision-makers come to have the beliefs ascribed to them; it asserts only that, if the decisions taken are consistent (in a sense made precise by a list of axioms), then they act as though maximizing expected utility relative to a subjective probability distribution …
A reasonable decision-maker will presumably wish to avoid inconsistencies. A Bayesianite therefore assumes that it is enough to assign prior beliefs to as decisionmaker, and then forget the problem of where beliefs come from. Consistency then forces any new data that may appear to be incorporated into the system via Bayesian updating. That is, a posterior distribution is obtained from the prior distribution using Bayes’ rule.
The naiveté of this approach doesn’t consist in using Bayes’ rule, whose validity as a piece of algebra isn’t in question. It lies in supposing that the problem of where the priors came from can be quietly shelved.
Savage did argue that his descriptive theory of rational decision-making could be of practical assistance in helping decision-makers form their beliefs, but he didn’t argue that the decision-maker’s problem was simply that of selecting a prior from a limited stock of standard distributions with little or nothing in the way of soulsearching. His position was rather that one comes to a decision problem with a whole set of subjective beliefs derived from one’s previous experience that may or may not be consistent …
But why should we wish to adjust our gut-feelings using Savage’s methodology? In particular, why should a rational decision-maker wish to be consistent? After all, scientists aren’t consistent, on the grounds that it isn’t clever to be consistently wrong. When surprised by data that shows current theories to be in error, they seek new theories that are inconsistent with the old theories. Consistency, from this point of view, is only a virtue if the possibility of being surprised can somehow be eliminated. This is the reason for distinguishing between large and small worlds. Only in the latter is consistency an unqualified virtue.
Say you have come to learn (based on own experience and tons of data) that the probability of you becoming unemployed in the US is 10%. Having moved to another country (where you have no own experience and no data) you have no information on unemployment and a fortiori nothing to help you construct any probability estimate on. A Bayesian would, however, argue that you would have to assign probabilities to the mutually exclusive alternative outcomes and that these have to add up to 1, if you are rational. That is, in this case – and based on symmetry – a rational individual would have to assign probability 10% to becoming unemployed and 90% of becoming employed.
That feels intuitively wrong though, and I guess most people would agree. Bayesianism cannot distinguish between symmetry-based probabilities from information and symmetry-based probabilities from an absence of information. In these kinds of situations most of us would rather say that it is simply irrational to be a Bayesian and better instead to admit that we “simply do not know” or that we feel ambiguous and undecided. Arbitrary an ungrounded probability claims are more irrational than being undecided in face of genuine uncertainty, so if there is not sufficient information to ground a probability distribution it is better to acknowledge that simpliciter, rather than pretending to possess a certitude that we simply do not possess.
I think this critique of Bayesianism is in accordance with the views of Keynes’ A Treatise on Probability (1921) and General Theory (1937). According to Keynes we live in a world permeated by unmeasurable uncertainty – not quantifiable stochastic risk – which often forces us to make decisions based on anything but rational expectations. Sometimes we “simply do not know.” Keynes would not have accepted the view of Bayesian economists, according to whom expectations “tend to be distributed, for the same information set, about the prediction of the theory.” Keynes, rather, thinks that we base our expectations on the confidence or “weight” we put on different events and alternatives. To Keynes expectations are a question of weighing probabilities by “degrees of belief”, beliefs that have preciously little to do with the kind of stochastic probabilistic calculations made by the rational agents modeled by Bayesian economists.
The bias toward the superficial and the response to extraneous influences on research are both examples of real harm done in contemporary social science by a roughly Bayesian paradigm of statistical inference as the epitome of empirical argument. For instance the dominant attitude toward the sources of black-white differential in United States unemployment rates (routinely the rates are in a two to one ratio) is “phenomenological.” The employment differences are traced to correlates in education, locale, occupational structure, and family background. The attitude toward further, underlying causes of those correlations is agnostic … Yet on reflection, common sense dictates that racist attitudes and institutional racism must play an important causal role. People do have beliefs that blacks are inferior in intelligence and morality, and they are surely influenced by these beliefs in hiring decisions … Thus, an overemphasis on Bayesian success in statistical inference discourages the elaboration of a type of account of racial disadavantages that almost certainly provides a large part of their explanation
Read my lips — statistical significance is NOT a substitute for doing real science!
July 25, 2014Leave a commentGo to comments
from Lars Syll
Noah Smith has a post up today telling us that his Bayesian Superman wasn’t intended to be a knock on Bayesianism and that he thinks Frequentism is a bit underrated these days:
Frequentist hypothesis testing has come under sustained and vigorous attack in recent years … But there are a couple of good things about Frequentist hypothesis testing that I haven’t seen many people discuss. Both of these have to do not with the formal method itself, but with social conventions associated with the practice …
Why do I like these social conventions? Two reasons. First, I think they cut down a lot on scientific noise. “Statistical significance” is sort of a first-pass filter that tells you which results are interesting and which ones aren’t. Without that automated filter, the entire job of distinguishing interesting results from uninteresting ones falls to the reviewers of a paper, who have to read through the paper much more carefully than if they can just scan for those little asterisks of “significance”.
Hmm …
A non-trivial part of teaching statistics is made up of teaching students to perform significance testing. A problem I have noticed repeatedly over the years, however, is that no matter how careful you try to be in explicating what the probabilities generated by these statistical tests – p-values – really are, still most students misinterpret them. And a lot of researchers obviously also fall pray to the same mistakes:
Are women three times more likely to wear red or pink when they are most fertile? No, probably not. But here’s how hardworking researchers, prestigious scientific journals, and gullible journalists have been fooled into believing so.
The paper I’ll be talking about appeared online this month in Psychological Science, the flagship journal of the Association for Psychological Science, which represents the serious, research-focused (as opposed to therapeutic) end of the psychology profession.
“Women Are More Likely to Wear Red or Pink at Peak Fertility,” by Alec Beall and Jessica Tracy, is based on two samples: a self-selected sample of 100 women from the Internet, and 24 undergraduates at the University of British Columbia. Here’s the claim: “Building on evidence that men are sexually attracted to women wearing or surrounded by red, we tested whether women show a behavioral tendency toward wearing reddish clothing when at peak fertility. … Women at high conception risk were more than three times more likely to wear a red or pink shirt than were women at low conception risk. … Our results thus suggest that red and pink adornment in women is reliably associated with fertility and that female ovulation, long assumed to be hidden, is associated with a salient visual cue.”
Pretty exciting, huh? It’s (literally) sexy as well as being statistically significant. And the difference is by a factor of three—that seems like a big deal.
Really, though, this paper provides essentially no evidence about the researchers’ hypotheses …
The way these studies fool people is that they are reduced to sound bites: Fertile women are three times more likely to wear red! But when you look more closely, you see that there were many, many possible comparisons in the study that could have been reported, with each of these having a plausible-sounding scientific explanation had it appeared as statistically significant in the data.
The standard in research practice is to report a result as “statistically significant” if its p-value is less than 0.05; that is, if there is less than a 1-in-20 chance that the observed pattern in the data would have occurred if there were really nothing going on in the population. But of course if you are running 20 or more comparisons (perhaps implicitly, via choices involved in including or excluding data, setting thresholds, and so on), it is not a surprise at all if some of them happen to reach this threshold.
The headline result, that women were three times as likely to be wearing red or pink during peak fertility, occurred in two different samples, which looks impressive. But it’s not really impressive at all! Rather, it’s exactly the sort of thing you should expect to see if you have a small data set and virtually unlimited freedom to play around with the data, and with the additional selection effect that you submit your results to the journal only if you see some catchy pattern. …
Statistics textbooks do warn against multiple comparisons, but there is a tendency for researchers to consider any given comparison alone without considering it as one of an ensemble of potentially relevant responses to a research question. And then it is natural for sympathetic journal editors to publish a striking result without getting hung up on what might be viewed as nitpicking technicalities. Each person in this research chain is making a decision that seems scientifically reasonable, but the result is a sort of machine for producing and publicizing random patterns.
There’s a larger statistical point to be made here, which is that as long as studies are conducted as fishing expeditions, with a willingness to look hard for patterns and report any comparisons that happen to be statistically significant, we will see lots of dramatic claims based on data patterns that don’t represent anything real in the general population. Again, this fishing can be done implicitly, without the researchers even realizing that they are making a series of choices enabling them to over-interpret patterns in their data.
Indeed. If anything, this underlines how important it is not to equate science with statistical calculation. All science entail human judgement, and using statistical models doesn’t relieve us of that necessity. Working with misspecified models, the scientific value of significance testing is actually zero – even though you’re making valid statistical inferences! Statistical models and concomitant significance tests are no substitutes for doing real science. Or as a noted German philosopher once famously wrote:
There is no royal road to science, and only those who do not dread the fatiguing climb of its steep paths have a chance of gaining its luminous summits.
Statistical significance doesn’t say that something is important or true. Since there already are far better and more relevant testing that can be done (see e. g. here and here)- it is high time to consider what should be the proper function of what has now really become a statistical fetish. Given that it anyway is very unlikely than any population parameter is exactly zero, and that contrary to assumption most samples in social science and economics are not random or having the right distributional shape – why continue to press students and researchers to do null hypothesis significance testing, testing that relies on a weird backward logic that students and researchers usually don’t understand?
Suppose that we as educational reformers have a hypothesis that implementing a voucher system would raise the mean test results with 100 points (null hypothesis). Instead, when sampling, it turns out it only raises it with 75 points and has a standard error (telling us how much the mean varies from one sample to another) of 20.
Does this imply that the data do not disconfirm the hypothesis? Given the usual normality assumptions on sampling distributions the one-tailed p-value is approximately 0.11. Thus, approximately 11% of the time we would expect a score this low or lower if we were sampling from this voucher system population. That means – using the ordinary 5% significance-level — we would not reject the null hypothesis although the test has shown that it is “likely” that the hypothesis is false.
In its standard form, a significance test is not the kind of “severe test” that we are looking for in our search for being able to confirm or disconfirm empirical scientific hypothesis. This is problematic for many reasons, one being that there is a strong tendency to accept the null hypothesis since they can’t be rejected at the standard 5% significance level. In their standard form, significance tests bias against new hypothesis by making it hard to disconfirm the null hypothesis.
And as shown over and over again when it is applied, people have a tendency to read “not disconfirmed” as “probably confirmed.” But looking at our example, standard scientific methodology tells us that since there is only 11% probability that pure sampling error could account for the observed difference between the data and the null hypothesis, it would be more “reasonable” to conclude that we have a case of disconfirmation. Especially if we perform many independent tests of our hypothesis and they all give about the same result as our reported one, I guess most researchers would count the hypothesis as even more disconfirmed.
And, most importantly, of course we should never forget that the underlying parameters we use when performing significance tests are model constructions. Our p-value of 0.11 means next to nothing if the model is wrong. As David Freedman writes in Statistical Models and Causal Inference:
I believe model validation to be a central issue. Of course, many of my colleagues will be found to disagree. For them, fitting models to data, computing standard errors, and performing significance tests is “informative,” even though the basic statistical assumptions (linearity, independence of errors, etc.) cannot be validated. This position seems indefensible, nor are the consequences trivial. Perhaps it is time to reconsider.