iSoul: In the beginning is reality

Tag Archives: Probability And Statistics

Approaching the unknown

We have some knowledge, but it is not complete knowledge, nor even arguably near complete. So what should we do about the areas where knowledge is lacking? We should certainly continue to investigate. But what do we say in the meantime? What can we justify saying about the unknown side of partial knowledge?

There are three basic approaches to the unknown: (a) assume as little as possible about the unknown and project that onto the unknown; (b) assume the unknown is exactly like the known and project the known onto the unknown; or (c) assume the unknown is like what is common or typical with what is known and project that onto the unknown.

Approach (a) uses the principle of indifference, maximum entropy, and a modest estimate of the value of what is known to the unknown. It takes a very cautious, anything-can-happen approach as the safest way to go.

Approach (b) uses the principle of the uniformity of nature, minimum entropy, and a confident estimate of the value of what is known to the unknown. It takes an intrepid, assertive approach as the most knowledgeable way to go.

Approach (c) uses the law of large numbers, the central limit theorem, the normal distribution, averages, and a moderate estimate of the value of what is known to the unknown. It takes a middle way between overly cautious and overly confident approaches as the best way to go.

The three approaches are not mutually exclusive. All three may use the law of large numbers, the normal distribution, and averages. They all may sometimes use the principle of indifference or the uniformity of nature. So calling these three different approaches is a generalization about the direction that each one takes, knowing that their paths may cross or converge on occasion.

It is also more accurate to say there is a spectrum of approaches, with approaches (a) and (b) at the extremes and approach (c) in the middle. This corresponds to a spectrum of distributions with extremes of low and high variability and the normal distribution in the middle.

This suggests a statistic of a distribution that varies from, say, -1 at the extreme of low variability to +1 at the extreme of high variability, and that is zero for the normal distribution. So it would be a measure of normality, too. The inverse of the variability or standard deviation might do.

Compare the three approaches with an input-output exercise (a short computational sketch follows the list):

  1. Given input 0 with output 10, what is output for input 1?
    1. Could be anything
    2. The same as for input 0, namely, 10
    3. The mean of the outputs, namely, 10
  2. Also given input 1 with output 12, what is output for input 2?
    1. Still could be anything
    2. The linear extrapolation of the two points (10+2n), namely, 14
    3. The mean of the outputs, namely, 11
  3. Also given input 2 with output 18, what is output for input 3?
    1. Still could be anything
    2. The quadratic extrapolation of the three points (10+2n^2), namely, 28
    3. The mean of the outputs, namely, 40/3
  4. Now start over but with the additional information that the outputs are integers from 1 to 100.
    1. The values 1 to 100 are equally likely
    2. The values 1 to 100 are equally likely
    3. The values 1 to 100 are equally likely
  5. Given input 0 with output 0, what is output for input 1?
    1. Bayesian updating
    2. The same as for input 0, namely, 0
    3. The mean of the outputs, namely, 0
  6. Also given input 1 with output 5, what is output for input 2?
    1. Bayesian updating
    2. The linear extrapolation of the two points (5n), namely, 10
    3. The mean of the outputs, namely, 2.5, so 2 or 3 are equally likely
  7. Also given input 2 with output 9, what is output for input 3?
    1. Bayesian updating
    2. Since there are limits, extrapolate a logistic curve (-15 + 30*(2^n)/(1 + 2^n)), namely, about 12
    3. The mean of the outputs, namely, 14/3, rounded to 5
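
Here is a minimal Python sketch of steps 1-3 of the exercise, reading approach (b) as fitting the lowest-degree polynomial through all known points and approach (c) as taking the mean of the known outputs; the helper names are mine, not part of the exercise.

```python
# Sketch of the projections for the exercise above (helper names are mine).
# Approach (a) predicts nothing beyond "anything in the allowed range";
# approach (b) extrapolates the simplest curve through all known points;
# approach (c) predicts the mean of the outputs seen so far.

import numpy as np

def project_b(inputs, outputs, next_input):
    """Fit the lowest-degree polynomial through all known points and extrapolate."""
    degree = len(inputs) - 1
    coefficients = np.polyfit(inputs, outputs, degree)
    return np.polyval(coefficients, next_input)

def project_c(outputs):
    """Predict the mean of the outputs seen so far."""
    return float(np.mean(outputs))

# Step 3 of the exercise: inputs 0, 1, 2 with outputs 10, 12, 18; predict input 3.
inputs, outputs = [0, 1, 2], [10, 12, 18]
print(project_b(inputs, outputs, 3))  # 28.0, the quadratic 10 + 2n^2 at n = 3
print(project_c(outputs))             # 13.33..., i.e. 40/3
```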

2008

Uniqueness and uniformity

If everything were completely unique, we would have no way of identifying them as to what they are. If everything were completely identical, or uniform, we would have no way of distinguishing them. We conclude that the world is somewhere in between: everything is a combination of the unique and the uniform.

If all events were completely independent, or unrelated, we would have no way of identifying them as to what they are. If all events were completely identical, we would have no way of distinguishing them. We conclude that all events are a combination of the independent and the identical.

So it is not possible to have two completely unique or identical individuals. Nor is it possible to have two completely unrelated or identical events.

In statistics, we assume the least about events we don’t know about: we assume they are independent and make the least possible inference. We assume we know nothing other than what we are given in the data. We take multiple trials and use the law of large numbers to infer safe conclusions. Or we adopt a maximum entropy prior distribution as a minimal assumption.
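
As a rough illustration (my example, not part of the original text), the following sketch contrasts a maximum entropy prior over six die faces, which assumes the least, with empirical frequencies from many independent trials, which the law of large numbers lets stand in for the unknown distribution.

```python
# Sketch (my example): a uniform maximum entropy prior over six outcomes,
# then empirical frequencies from many independent trials of a process
# whose bias is hidden from the analyst.

import random
from collections import Counter

outcomes = [1, 2, 3, 4, 5, 6]
max_entropy_prior = {o: 1 / len(outcomes) for o in outcomes}  # assume the least

random.seed(0)
hidden_weights = [1, 1, 1, 1, 2, 3]  # the unknown bias of the process
trials = random.choices(outcomes, weights=hidden_weights, k=10_000)

counts = Counter(trials)
empirical = {o: counts[o] / len(trials) for o in outcomes}

print(max_entropy_prior)  # each outcome 1/6, about 0.167
print(empirical)          # approaches the hidden bias as the number of trials grows
```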

In natural science, we assume the most about things we don’t know about. This is based on an assumption of the uniformity of nature. The natural world that we don’t know is like the natural world we do know about. We assume that what we don’t know about is the same as what we do know about. That is, we assume everything we know is all we need to know – until we know more. Then we revise and make the same assumption.

If we begin natural science with no prior knowledge and pick up a rock, we conclude that everything is rock. If we then step in a puddle, we conclude that everything is a rock or a puddle. If we let go of the rock and it falls to the ground, we conclude that all rocks fall to the ground just like that rock.

In history, the less we assume about events we don’t know, the better. Events are assumed to be unique though somehow related to other events. Through historical study we infer the relation of events. So history is more like statistics than natural science.

Natural history takes the approach of natural science toward studying the past. It assumes that all events in the past are like events in the present. So the past and the present are alike and history is the repetition of similar events. This is an anti-historical approach to history because it ignores or downplays the uniqueness of events.

2008

Science and statistics

Here is a story about two statisticians and two scientists. They are given a problem: what are the frequencies of the letters in English texts? The junior statistician has no research budget whereas the senior statistician has a modest research budget. Similarly, the junior scientist has no research budget but the senior scientist has a large research budget.

The junior statistician has no budget to collect frequency data and, being a careful statistician, makes no assumptions about what is unknown. So the conclusion is made that the frequency of each letter is 1/26th. A note is added that if funds were available, a better estimate could be produced.

The senior statistician has a modest budget and so arranges to collect a random sample of English texts. Since English is an international language, a sample of countries is selected at random. Written English goes back about 500 years so a sample of years is selected at random. A list of genres is made and a sample of genres is selected at random. Then libraries and other sources are contacted to collect sample texts. They are scanned and analyzed for their letter frequencies. The letter frequencies of the sample are used as the unbiased estimate of the population frequencies. Statistical measures of uncertainty are also presented.

The junior scientist has no budget to collect data but happens to own a CD with classic scientific texts. With a little programming, the letter frequencies on the CD are determined. These frequencies are presented as the frequencies of all English texts. No measures of uncertainty are included. It is simply assumed that English texts are uniform so any sample is as good as another. However, a caveat is made that the conclusion is subject to revision based on future data collection.
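
The junior scientist's "little programming" might look something like the following sketch, which counts letter frequencies in a body of text; the sample string and the function name are placeholders of mine, not part of the story.

```python
# Sketch of counting letter frequencies in a body of English text.
# The sample string below is a placeholder; in the story the input would be
# the texts read from the CD.

from collections import Counter
import string

def letter_frequencies(text):
    """Return the relative frequencies of the letters a-z in the given text."""
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    counts = Counter(letters)
    return {letter: counts[letter] / len(letters) for letter in string.ascii_lowercase}

sample = "It is a truth universally acknowledged that a single man in possession of a good fortune must be in want of a wife."
frequencies = letter_frequencies(sample)
print(sorted(frequencies.items(), key=lambda item: item[1], reverse=True)[:5])
# Compare with the junior statistician's uniform estimate of 1/26, about 0.038.
```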

The senior scientist has a large research budget from a government grant. Arrangements are made to collect a massive amount of data from electronic sources such as the Internet and several large libraries. The written texts are scanned and combined with the electronic sources into a large database and then the letter frequencies are determined. These frequencies are announced as the letter frequencies for all English texts. No measure of uncertainty is included. It is not mentioned that future data collection could lead to a revised conclusion.

The senior scientist collects a prize for successfully completing the project. The others are forgotten.

Who had the best approach? Why aren’t scientists and statisticians more alike?

2008

Evidence of Absence

Evidence of Absence: Completeness of Evidential Datasets

Elliott Sober presents a likelihood argument about the motto “Absence of evidence is not evidence of absence” (Sober 2009).  He states the Law of Likelihood this way:

The Law of Likelihood. Evidence E favors hypothesis H1 over hypothesis H2 precisely when Pr(E│H1) > Pr(E│H2). And the degree to which E favors H1 over H2 is measured by the likelihood ratio Pr(E│H1)/Pr(E│H2).

He argues that the likelihood ratio is more useful than the difference, but it has two other problems: it is not defined if Pr(E│H2) = 0, and its range is unbounded when the numerator exceeds the denominator but compressed into the interval (0, 1) when the denominator exceeds the numerator. While these are more practical than theoretical objections, they may be eliminated by the following, which we shall call the likelihood ratio index:

Likelihood Ratio Index = log₂((1 + Pr(E│H1)) / (1 + Pr(E│H2))).

This index ranges from -1 to +1 and equals zero when the two probabilities are equal.
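
A minimal sketch of the two quantities, using the base-2 logarithm as defined above so that the index runs from -1 to +1:

```python
# Sketch of Sober's likelihood ratio and the proposed likelihood ratio index.
# The base-2 logarithm makes the index run from -1 to +1.

from math import log2

def likelihood_ratio(p_e_given_h1, p_e_given_h2):
    """Sober's ratio Pr(E|H1)/Pr(E|H2); undefined when Pr(E|H2) = 0."""
    return p_e_given_h1 / p_e_given_h2

def likelihood_ratio_index(p_e_given_h1, p_e_given_h2):
    """log2((1 + Pr(E|H1)) / (1 + Pr(E|H2))), defined for all probabilities."""
    return log2((1 + p_e_given_h1) / (1 + p_e_given_h2))

print(likelihood_ratio_index(1.0, 0.0))  # +1.0, maximal favoring of H1
print(likelihood_ratio_index(0.0, 1.0))  # -1.0, maximal favoring of H2
print(likelihood_ratio_index(0.5, 0.5))  #  0.0, no favoring either way
```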

Sober uses the following scenario about the fossil record to illustrate his analysis:

Suppose you are wondering whether two species that you now observe, X and Y, have a common ancestor. … Suppose you observe that there is a fossil whose trait values are intermediate between those exhibited by X and Y. How does the discovery of this fossil intermediate affect the question of whether X and Y have a common ancestor?

His analysis is summed up in two tables, where CA is Common Ancestry and SA is Separate Ancestry.

Figure 5. Either X and Y have a common ancestor (CA) or they do not (SA). Cells represent probabilities of the form Pr(± intermediate│±CA). Gradualism is assumed.
                                   CA     SA
There existed an intermediate.      1      q
There did not.                      0    1-q

Concerning this figure, he notes:

If there is an intermediate form, this favors CA over SA, and the strength of this favoring is represented by the ratio 1/q. This ratio has a value greater than unity if q<1. On the other hand, if there is no intermediate, this infinitely favors SA over CA, since (1-q)/0 = ∞ (again assuming that q<1). The non-existence of an intermediate form would have a far more profound evidential impact than the existence of an intermediate.

Then he makes an assumption:

(SO)    a = Pr(we have observed an intermediate │CA & there exists an intermediate)
          = Pr(we have observed an intermediate │SA & there exists an intermediate).

This leads to the next figure:

Figure 6. Either X and Y have a common ancestor (CA) or they do not (SA). Cells represent probabilities of the form Pr(± we have observed an intermediate│±CA). Gradualism is assumed.
                                      CA      SA
We have observed an intermediate.      a      qa
We have not.                         1-a    1-qa

He concludes:

As long as there is some chance that we’ll observe an intermediate if one exists, and there is some chance that intermediates will not exist if the separate ancestry hypothesis is true, the failure to observe a fossil intermediate favors SA over CA. In this broad circumstance, absence of evidence (for a common ancestor) is evidence that there was no such thing. The motto – that absence of evidence isn’t evidence of absence — is wrong.

And also:

Suppose you look for intermediates and fail to find them. This outcome isn’t equally probable under the two hypotheses if a>0 and q< 1. Entries in each column must sum to unity in Figure 6 just as they must in Figure 5. When the two parameters fall in this rather inclusive value range, failing to observe an intermediate is evidence against the CA hypothesis, contrary to the motto. What is true, without exaggeration, is that for many values of the parameters, not observing an intermediate provides negligible evidence favoring SA, compared with the much stronger evidence that observing an intermediate provides in favor of CA.
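
To make the favoring relations in Figures 5 and 6 concrete, here is a small numerical sketch; the particular values of q and a are illustrative assumptions of mine, not Sober's.

```python
# Sketch of the ratios implied by Figures 5 and 6.
# q = Pr(an intermediate existed | SA); a = Pr(we observe one | it exists).
# The values below are illustrative assumptions, not Sober's.

q, a = 0.5, 0.1

# Figure 5: existence of an intermediate.
exists_ratio = 1 / q  # an intermediate existed: favors CA by a factor of 2.0
# If no intermediate existed, the ratio (1 - q)/0 is infinite: SA is infinitely favored.

# Figure 6: observation of an intermediate.
observed_ratio = a / (q * a)                # still 1/q = 2.0 in favor of CA
not_observed_ratio = (1 - a) / (1 - q * a)  # 0.9 / 0.95, about 0.947

print(exists_ratio, observed_ratio, not_observed_ratio)
# The last ratio is only slightly below 1: failing to observe an intermediate
# favors SA, but for these values only weakly, as Sober's last sentence notes.
```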

The point argued here is that the last sentence of Sober's passage above should change in some cases, including the case of intermediate fossils.

First of all, the main question about fossils is not the existence of intermediates but the existence of long sequences of fossils.  The view of “evolutionary stasis” (species fixity) was dying out in the nineteenth century and is extinct today.  There is no controversy about the idea that fossils can be put into short sequences that span varieties, species, or genera.  The controversy is about long sequences of fossils that span families, orders, classes, phyla, kingdoms, and domains, reaching to common (universal) ancestry.  Evolutionists are committed to the existence of long sequences and creationists are committed to their non-existence.

However, this does not affect the form of the analysis.  We could revise the scenario this way:

Suppose you are wondering whether two short fossil sequences that you now observe, X and Y, have a connecting sequence of fossils. To bring evidence to bear on this question, you might look at the similarities and differences (both phenotypic and genetic) that characterize the two short fossil sequences. But the traits of a third object might be relevant as well. Suppose you observe that there is a fossil (or short fossil sequence) whose trait values are intermediate between those exhibited by X and Y. How does the discovery of this intermediate fossil (or short fossil sequence) affect the question of whether X and Y have a connecting sequence of fossils?

How does this affect the motto?  Consider the example Sober relates in a footnote:

The administration of George W. Bush justified its 2003 invasion of Iraq by saying that there was evidence that Iraq possessed “weapons of mass destruction.” After the invasion, when none turned up, Donald Rumsfeld, who then was Bush’s Secretary of Defense, addressed the doubters by invoking the motto; see http://en.wikiquote.org/wiki/Donald_Rumsfeld.

The Washington Post reported on January 12, 2005 (http://www.washingtonpost.com/wp-dyn/articles/A2129-2005Jan11.html):

The hunt for biological, chemical and nuclear weapons in Iraq has come to an end nearly two years after President Bush ordered U.S. troops to disarm Saddam Hussein. The top CIA weapons hunter is home, and analysts are back at Langley.

The Post article quoted an unnamed intelligence official who said:

“We’ve talked to so many people that someone would have said something. We received nothing that contradicts the picture we’ve put forward. It’s possible there is a supply someplace, but what is much more likely is that [as time goes by] we will find a greater substantiation of the picture that we’ve already put forward.”

This is how a search ends in failure: (1) there is an absence of evidence of what was sought, and (2) there is a reasonable expectation that what would be found in the future would merely substantiate what has already been discovered.

Knowledge management consists of various efforts to gather and mine information of value.  A database of knowledge on a certain subject, called a knowledge base, is complete if closed-world reasoning applies to it, which means whatever it does not know to be true must be false.  In other words, all relevant knowledge about that subject is in the knowledge base.  This might be true for example of a company’s employee information database; if someone’s name is not in the database, they are not an employee of the company.
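
A minimal sketch of closed-world reasoning over such an employee database; the names and the helper function are hypothetical.

```python
# Sketch of closed-world reasoning: whatever the knowledge base does not
# list as true is treated as false. The names and helper are hypothetical.

employees = {"Alice Jones", "Bob Smith", "Carol Diaz"}  # the complete knowledge base

def is_employee(name, knowledge_base):
    """Under the closed-world assumption, absence from the base means 'no'."""
    return name in knowledge_base

print(is_employee("Bob Smith", employees))  # True: listed in the base
print(is_employee("Dan Brown", employees))  # False: not listed, so not an employee
```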

Let us call a knowledge base “effectively complete” precisely when all knowledge about the subject is either in the knowledge base or similar to what is in the knowledge base.  The quotation from the intelligence official above presents an instance of an effectively complete knowledge base because the expectation is that as time goes by “we will find a greater substantiation of the picture that we’ve already put forward.”

Because of the common assumption of the uniformity of nature, it is common in science to treat a knowledge base as what we are calling effectively complete. Essentially this means that the knowledge at hand is treated as all the knowledge required. So, for example, one can examine a database about the properties of copper and draw conclusions about all the copper in the universe.

Something similar may be said about the fossil record.  After more than two hundred years of fossil hunting, large databases of fossils are available for research [footnote].  Many interesting fossils have been discovered but there is still a complete absence of long fossil sequences.  At this point, there is no substantial reason to expect that long fossil sequences will ever be found.  In this sense at least, the knowledge base called the fossil record is effectively complete.

If a database is effectively complete, then the closed-world premise may be invoked: what is not currently known to be true is false, or what is not currently known to exist does not exist. In this sense, the absence of evidence is evidence of absence. What's not in the database of evidence must not exist; what's not known in the database of facts must be false.

Reference

Elliott Sober, Absence of evidence and evidence of absence: Evidential transitivity in connection with fossils, fishing, fine-tuning, and firing squads. Philosophical Studies 143 (1): 63–90 (2009)

November 2013