# Science and statistics

Here is a story about two statisticians and two scientists. They are given a problem: what are the frequencies of the letters in English texts? The junior statistician has no research budget whereas the senior statistician has a modest research budget. Similarly, the junior scientist has no research budget but the senior scientist has a large research budget.

The junior statistician has no budget to collect frequency data and, being a careful statistician, makes no assumptions about what is unknown. The conclusion is therefore that every letter is equally frequent, with a frequency of 1/26. A note is added that a better estimate could be produced if funds were available.

The senior statistician has a modest budget and so arranges to collect a random sample of English texts. Since English is an international language, a sample of countries is selected at random. Modern written English goes back about 500 years, so a sample of years is selected at random. A list of genres is made and a sample of genres is selected at random. Then libraries and other sources are contacted to collect sample texts, which are scanned and analyzed for their letter frequencies. The letter frequencies of the sample are used as unbiased estimates of the population frequencies. Statistical measures of uncertainty are also presented.
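As a rough illustration of what "an estimate plus a measure of uncertainty" could look like, here is a minimal Python sketch. It is a simplification, not the senior statistician's actual design: it treats each sampled text as one equally weighted sampling unit and ignores the stratification by country, year, and genre. The function names and the three stand-in texts are hypothetical, and each text is assumed to contain at least one letter.

```python
import math
import string
from collections import Counter


def letter_frequencies(text):
    """Return the relative frequency of each letter a-z in one text."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    total = sum(counts.values())  # assumes the text has at least one letter
    return {letter: counts[letter] / total for letter in string.ascii_lowercase}


def estimate_with_standard_error(texts, letter):
    """Mean per-text frequency of a letter and its standard error across texts."""
    freqs = [letter_frequencies(t)[letter] for t in texts]
    n = len(freqs)  # needs at least two texts for a variance estimate
    mean = sum(freqs) / n
    variance = sum((f - mean) ** 2 for f in freqs) / (n - 1)
    return mean, math.sqrt(variance / n)


# Hypothetical stand-ins for the randomly sampled texts.
sample_texts = [
    "The quick brown fox jumps over the lazy dog.",
    "It was the best of times, it was the worst of times.",
    "To be, or not to be, that is the question.",
]

mean_e, se_e = estimate_with_standard_error(sample_texts, "e")
print(f"e: {mean_e:.3f} +/- {se_e:.3f}")
```

With only three short texts the standard error is large, which is the point: the size of the sample shows up in the reported uncertainty.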

The junior scientist has no budget to collect data but happens to own a CD of classic scientific texts. With a little programming, the letter frequencies on the CD are determined. These frequencies are presented as the frequencies of all English texts. No measures of uncertainty are included. It is simply assumed that English texts are homogeneous, so any sample is as good as any other. However, a caveat is made that the conclusion is subject to revision based on future data collection.
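The "little programming" might amount to something like the following sketch, which pools every text into one count and reports a single set of frequencies with no uncertainty attached. The function name and the placeholder strings standing in for the CD's contents are hypothetical.

```python
import string
from collections import Counter


def pooled_letter_frequencies(texts):
    """Pool all texts into one letter count and return relative frequencies."""
    counts = Counter()
    for text in texts:
        counts.update(ch for ch in text.lower() if ch in string.ascii_lowercase)
    total = sum(counts.values())  # assumes at least one letter overall
    return {letter: counts[letter] / total for letter in string.ascii_lowercase}


# Hypothetical placeholders for the texts on the CD.
cd_texts = [
    "Every body perseveres in its state of rest, or of uniform motion.",
    "Natural selection acts only by the preservation of small modifications.",
]

frequencies = pooled_letter_frequencies(cd_texts)
for letter, freq in sorted(frequencies.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{letter}: {freq:.3f}")
```

Nothing in this calculation reflects how the texts were chosen, so the output looks the same whether the corpus is representative or not.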

The senior scientist has a large research budget from a government grant. Arrangements are made to collect a massive amount of data from electronic sources such as the Internet and from several large libraries. The printed texts are scanned and combined with the electronic sources into a large database, and then the letter frequencies are determined. These frequencies are announced as the letter frequencies for all English texts. No measure of uncertainty is included. It is not mentioned that future data collection could lead to a revised conclusion.

The senior scientist collects a prize for successfully completing the project. The others are forgotten.

Who had the best approach? Why aren’t scientists and statisticians more alike?

2008