Statistics

From dKosopedia


Introduction

This page contains a basic introduction to statistics, with a slight emphasis on application to politics. This is an adaptation of a series of diaries by dkos member plf515; links to the original diaries (and discussions therein) are given at the end of this page.

Measures of Central Tendency

There are various ways to classify variables. One useful way is to distinguish between continuous and categorical data. Data are continuous if they can (at least in theory) take on any value in a range. Data are categorical if they can take on only certain values. For example, weight, income, age, and IQ are continuous. Political party, hair color, and marital status are categorical.

When you have continuous data, two things that you often want to know are "What values are likely?" and "How spread out are the values?" Here, we will look at the first question, which, in statisticians' language, is called central tendency. The most common measure of central tendency is the mean, which is often called the average. The other commonly quoted measure of central tendency is the median. We'll look at those two and a couple of others.

The mean is probably familiar. Add up the numbers, divide by how many numbers there are, and you've got it. So, for example, if the IQs of the people in your family are

155 (that would be you)
135 (your sister)
70 (her wingnut husband)

then the average is (155 + 135 + 70)/ 3 = 120

The median is the number that splits the data into two equal halves, with half being higher, and half lower (there are slightly more technical definitions, but this will do for our purposes).

Two other, less commonly used measures are the mode and the trimmed mean. The mode is the most common value, and the trimmed mean is the mean after you throw out some extreme values (typically the highest 10% and the lowest 10%).
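The four measures above can be sketched with Python's standard library. The numbers are the made-up family IQs from above, plus an invented income list; the trimmed-mean helper is a hand-rolled illustration, not a library function:

```python
from statistics import mean, median, mode

iqs = [155, 135, 70]
print(mean(iqs))     # (155 + 135 + 70) / 3 = 120
print(median(iqs))   # the middle value: 135

# A made-up income list where one value is most common and one is extreme.
incomes = [100] * 8 + [50_000, 1_000_000]
print(mode(incomes))  # the most common value: 100

def trimmed_mean(data, proportion=0.10):
    """Mean after dropping the top and bottom `proportion` of the values."""
    data = sorted(data)
    k = int(len(data) * proportion)
    return mean(data[k:len(data) - k] if k else data)

print(trimmed_mean(incomes))  # 6337.5: the millionaire no longer dominates
```

Note how the trimmed mean ignores the millionaire entirely, while the plain mean of that income list would be dragged up above $100,000.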

When do you want each? When do you want to use none of them?

There are some situations where no measure works well. The most common is when the data are multimodal. That means that the data have common values that are separated by some uncommon values. For example, if you had a bunch of athletes from different sports (basketball players, football players, and jockeys), and were interested in their weights, then no measure of central tendency would be good.

But, more often, you want some measure of central tendency, and have to decide which one.

The mean is a bad choice if the data are skewed, which means that there are some extreme values. One common example of this is income. Some people make a whole lot more than the average person, but no one makes that much less. For instance, if the average income in the USA is $30,000 per year (I made that up), then some people make millions more than that, but no one can make more than $30,000 less. When the data are skewed, the median and the trimmed mean are good choices. (You don't see the trimmed mean much, but it can be very useful.)

The mode is sometimes also a good choice. Suppose, for example, you are reporting on a country where nearly everyone is a peasant making almost nothing, there are a few multibillionaires making a lot, and a few more people are in the middle. Like this:

Income                        Number of people
$100 per year                 1,000,000
$1,000 to $100,000 per year   10,000
More than $100,000 per year   500

then the mean would be distorted by the few people making huge amounts, and the median would be distorted by the people making a middle amount; the mode would be $100 per year, and that would be a good representation of the income.

Another thing that often goes wrong with the mean is averaging things that can't be averaged. The most common is averaging percentages. This is a bad idea. Here is an explanation of why (with made-up data). Suppose the vote in some political race is as follows:

State         Democrat  Republican
Calif         60%       40%
NY            65%       35%
South Dakota  35%       65%
Alaska        40%       60%

(other states' data too). If one averages the percentages, one gets 50% each, but that isn't right. A percentage is a form of a fraction, and you have to add the numerators and denominators and then form a new percentage; that is, add up the NUMBER voting Dem and Repub, and then get the percentage from the total.
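Here is the same point in code. The vote totals attached to each state are invented; only the percentages match the table above:

```python
# Why averaging percentages goes wrong: states have very different sizes.
states = {
    "Calif":        {"dem": 6_000_000, "rep": 4_000_000},  # 60% / 40%
    "NY":           {"dem": 3_250_000, "rep": 1_750_000},  # 65% / 35%
    "South Dakota": {"dem":   140_000, "rep":   260_000},  # 35% / 65%
    "Alaska":       {"dem":   120_000, "rep":   180_000},  # 40% / 60%
}

# Wrong: average the state percentages, ignoring how big each state is.
pcts = [s["dem"] / (s["dem"] + s["rep"]) for s in states.values()]
naive = sum(pcts) / len(pcts)

# Right: add up the raw vote counts, then form a new percentage.
dem = sum(s["dem"] for s in states.values())
total = sum(s["dem"] + s["rep"] for s in states.values())
correct = dem / total

print(f"naive average of percentages: {naive:.1%}")  # 50.0%
print(f"true Democratic share:        {correct:.1%}")  # 60.6%
```

The naive average treats Alaska as if it mattered as much as California; adding numerators and denominators weights each state by its actual number of voters.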

Measures of Spread

Statistics can be divided into two big areas: Descriptive statistics and inferential statistics. Descriptive statistics is about describing data, and inferential statistics is about making inferences from a sample to a population. Suppose, for instance, you were interested in the average income of adults in the USA. You can't get the information on the whole population, so you take a sample. (We'll get into ways to do this in a later diary). When you try to say things about the whole population based on your sample, that's inferential statistics. When you are just talking about your sample, that's descriptive statistics.

Sometimes, though, you do have the whole population. If you wanted to find the average SAT score in a class of students, you could ask everyone. Then you don't need to infer anything.

(By the way, don't get used to these terms being sensible. Statisticians often use familiar words in unfamiliar ways; in particular, when statisticians use the words significance, power, random, and confidence, they don't mean exactly what they do in everyday discourse. Don't blame me, I didn't make up the terms.)

OK, enough background. Let's say you've collected the data on whatever it is you are interested in. There are often several things you are interested in. You are interested in what a typical person is like, and for this, the measures of central tendency are good. You can think of this as ways to formalize the idea of a best guess. But you are also interested in how good that guess is. For that, you need a measure of spread. There are several popular ones. By far the most common is the standard deviation. Others are the variance, range, and the interquartile range.

The standard deviation of a sample is obtained by

  1. Finding the mean
  2. Subtracting the mean from each value in your sample
  3. Squaring each of these differences
  4. Adding the results of step 3
  5. Dividing by n (many texts divide by n - 1 instead; the difference is small for large samples)
  6. Taking the square root of step 5

In equation form, this is s = √( Σ(xᵢ − x̄)² / n )

For the variance, just leave out step 6 (don't take the square root).
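The six steps can be written out literally in Python (following the text in dividing by n; many textbooks divide by n − 1 for a sample). The IQs are the made-up family from earlier:

```python
import math

def std_dev(data):
    m = sum(data) / len(data)            # 1. find the mean
    devs = [x - m for x in data]         # 2. subtract the mean from each value
    squares = [d * d for d in devs]      # 3. square each of these
    total = sum(squares)                 # 4. add them up
    variance = total / len(data)         # 5. divide by n (stop here for the variance)
    return math.sqrt(variance)           # 6. take the square root

iqs = [155, 135, 70]
print(std_dev(iqs))   # roughly 36.29
```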

The range is just the lowest value to the highest (it's usually given as both numbers). The interquartile range requires first dividing the data into quartiles, which essentially means putting the values in order and finding the points that cut off the bottom quarter (the first quartile), the middle (the second quartile, which is the same as the median), and the top quarter (the third quartile). The interquartile range is the range from the first quartile to the third (if you remember percentiles, the first quartile is the same as the 25th percentile and the third quartile is the 75th).
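A sketch of the range and interquartile range, using the "median of each half" convention for quartiles (one of several conventions in use; the data are made up):

```python
def median(xs):
    """Middle value, or the average of the two middle values."""
    xs = sorted(xs)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def iqr(xs):
    """Return (first quartile, third quartile, interquartile range)."""
    xs = sorted(xs)
    half = len(xs) // 2
    q1 = median(xs[:half])    # 25th percentile: median of the lower half
    q3 = median(xs[-half:])   # 75th percentile: median of the upper half
    return q1, q3, q3 - q1

data = [1, 3, 5, 7, 9, 11, 13, 15]
print(min(data), max(data))   # the range: 1 to 15
print(iqr(data))              # (4.0, 12.0, 8.0)
```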

Enough math. Those who want more formal definitions and examples can, of course, see Wikipedia or some such.

When is each of these good? Or bad?

Well, the standard deviation is usually good for the cases where the mean is a good measure of central tendency (see the Measures of Central Tendency section above). The variance is not used much in everyday reporting; it's mostly used for further statistical work. The range is almost always useful and easy to interpret. The interquartile range ought to be used a lot more, because, once you understand it, it's easy to interpret, and it gives a good sense of the spread.

I've had requests for examples of when the SD is better, and when the IQR or range is better. Briefly, if you think the mean is a good measure of central tendency, then usually the SD is a good measure of spread. If you use the median, then you often want the IQR and range in addition to (or even instead of) the SD. And, if there is no good measure of central tendency, there is likely to be no good measure of spread.

Some concrete examples: If you wanted to know the average IQ of Kossacks, then (presuming you could get a good sample, which I will talk about in another diary) the mean would be a good measure of central tendency, and the SD a good measure of spread. IQ is normally distributed (we'll get to that in another diary, too; actually, there is evidence that IQ isn't *exactly* normally distributed, but it's close). On the other hand, if you wanted to know about the income of Kossacks, then the median would be a good measure of central tendency, and, while the SD wouldn't exactly be WRONG, I would want to look at the IQR and range as well. Finally, if you wanted to look at the heights and weights of professional athletes (as a whole group), then *no* measure of central tendency would be really good, nor would any measure of spread, because the group is composed of people who are too different from one another.

MAGIC Criteria

This section is based on the absolutely wonderful book

Statistics as Principled Argument

by Robert Abelson. It's an easy read, and I urge those interested in this stuff to go buy a copy.

Abelson lists five criteria by which to judge a statistical argument. He calls them the MAGIC criteria:

  1. Magnitude: How big is the effect?
  2. Articulation: How precisely stated is it?
  3. Generality: How widely does it apply?
  4. Interestingness: How interesting is it?
  5. Credibility: How believable is it?

We can tell how big an effect is through various measures of effect size. We will get into some of these in later diaries, but some of the common ones are correlation coefficients, the difference between two means, and regression coefficients. Big effects are impressive. Small effects are not. How big is big depends on context, and on what we already know. If we find, for example, that a new diet plan lets people lose (on average) 10 pounds in a month, that's pretty big. 10 ounces in a month is pretty small. But if it was a diet tested on rats, 10 ounces might be a lot.
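As an illustration, here is one of the effect-size measures named above, the correlation coefficient, computed from its definition on made-up data (hours studied versus test score, invented for this sketch):

```python
import math

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by the two spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

hours = [1, 2, 3, 4, 5]          # made-up hours of study
score = [52, 55, 61, 64, 68]     # made-up test scores
print(round(correlation(hours, score), 3))   # 0.994: a very large effect
```

A correlation near 1 (or −1) is a big effect; near 0 is small. As the text says, though, how big counts as "big" depends on context.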

Articulation is measured in what Abelson calls Ticks and Buts. A 'tick' is a statement, and a 'but' is an exception. The more ticks the better, the fewer buts the better. There are also blobs, which are masses of undifferentiated results. Blobs are, as you might have guessed, bad.

Generality refers to how general an effect is. Does it apply to all humans everywhere? That would be very general. Or does it apply only to people who have posted 50 or more diaries on dailyKos? That would be pretty specific. Usually, more general effects are of greater value than more specific ones, but you should be sure that the study states how general it is.

Interestingness is very hard to measure precisely, but one way is to ask how different the reported effect size is from what we thought it would be. For example, I once read a study that showed that Black people, on average, earn less than White people. Upsetting, but not interesting. I knew that already, and the size of the difference was large (which I thought it would be) but not huge (which I also knew, because, after all, even the average White person doesn't earn all that much). But then it went on to say that, while Black men earned a lot less than White men (more than I thought the difference would be), Black women and White women earned almost the same (that's really interesting! I would have thought that Black women earned much less than White women).

Finally, credibility. The harder a result is to believe, the more stringent you have to be about the evidence supporting it.

Polls and Surveys

A poll or survey is an attempt to judge the opinion of a population on some question. One class of question is who will win an election. The population is everyone who is eligible to vote, but we are really interested in those who actually vote - if a person has no intention of voting, then it doesn't matter who that person would vote for. Of course, we can't ask everyone in the population how they plan to vote - even in the most local election, there are just too many people. It also turns out that it is better to spend money and effort getting the questions right and getting the sampling plan right, rather than trying to get a really large sample. Sample size makes surprisingly little difference, which is why you will see national surveys with sample sizes of only about 1,000 or so. Once we have results from our sample, we use that to make inferences about the whole population - after all, we are not interested in how our small sample will vote (or what they think of an issue) we want to know who will win the election and whether a candidate is doing better with a particular group, and so on.

There are three aspects to getting accurate results from a poll:

  1. Getting the sample right
  2. Getting the questions right
  3. Analyzing the data correctly

None of these three is simple, but we'll get a little into each. Recognize, though, that entire books have been written about each of these, and this is only one diary!

A sampling plan requires two things: A sampling frame (which is just a list of everyone in the population) and a plan to draw the sample. In the famous Literary Digest poll (which predicted FDR would lose in 1936) the big problem was the sampling frame. They used a list of people with phone numbers or cars. This leaves out a lot of people, and counts some twice (even now, but this was much more so back in 1936). Not only does it leave people out, but it leaves them out in a way that is biased (it oversamples wealthy people, and wealthy people, then as now, were more likely to vote Republican). Nowadays, people get much better sampling frames, but I do not know the details of how they do this.

Once you get a sampling frame, the next step is to draw a sample. The simplest is what is called simple random sampling. In the purest form of this, each person in the frame is given a number, and you use a random number generator to choose people, one after another, until you have enough people (we'll cover how many are needed in a minute). Almost no one does this; what they do instead is to randomly order the people, and then choose the 1st, 11th, 21st and so on; or 1st, 6th, 11th and so on. Technically, this isn't simple random sampling, but it's close enough for our purposes. There are more complex schemes, too: In stratified sampling, you divide the sampling frame into groups based on some characteristic, and then sample randomly from each group. In clustered sampling, you randomly choose particular clusters (usually census blocks or some other geographic unit) and sample randomly within clusters. These each have advantages and disadvantages, but I am not going to get into details. The important thing is to choose a sampling plan with care. (If there is interest, I can devote a later diary to this topic).
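The three sampling plans described above can be sketched with Python's random module on a toy sampling frame (the frame, sample sizes, and strata are all made up):

```python
import random

random.seed(42)
frame = [f"person_{i}" for i in range(1000)]   # a toy sampling frame

# Simple random sampling: draw n people at random, without replacement.
srs = random.sample(frame, 100)

# The "every k-th person" shortcut: randomly order the frame, then take
# every 10th entry (technically systematic sampling, as the text notes).
shuffled = frame[:]
random.shuffle(shuffled)
systematic = shuffled[::10]

# Stratified sampling: divide the frame into groups, sample within each.
strata = {"east": frame[:400], "west": frame[400:]}
stratified = [p for group in strata.values() for p in random.sample(group, 50)]

print(len(srs), len(systematic), len(stratified))   # 100 100 100
```

Clustered sampling would look like the stratified version, except that you would randomly pick whole groups first and then sample only within the chosen groups.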

You also need to know how many people you need in your sample. This depends on three things: the type of sampling scheme, the approximate percentages of people who will choose each alternative, and the desired accuracy. Here, I'll just deal with simple random sampling. It turns out that it's easier to estimate percentages that are near 50%, and hard to estimate percentages close to 0 or 100. Usually in polls we are, in fact, most interested in numbers around 50% - it would be nice to know if Clinton will beat her Republican opponent by 80-20 or more, but what we're really interested in is close races. Also, it doesn't get really hard to estimate proportions until they are quite close to 0 or 100. Accuracy, in polls, is expressed in several ways, all of which are mathematically equivalent. By far the most common is the "margin of error". Unfortunately, this could mean a couple of different things, but what most poll reporters seem to mean by it is what's called a confidence interval. So, if you see a result like "Bush's popularity is 31%, with a margin of error of 3%", that means that the 95% confidence interval for his popularity is 28% to 34%, which means that we can be 95% sure that his popularity is in that range (note, for the quant people reading this: I know that isn't the technically correct way to put it, but this is an introduction). We'll get into this a bit more when we talk about analysis, because it turns out that these estimates aren't that great. In any case, as you might expect, more people means a more accurate result. While the details are really quite complex, the general relationship is that accuracy goes up as the square root of sample size. In other words, if you use 4 times as many people, the results will be twice as accurate. Nine times as many people, and the results will be 3 times as accurate.
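The square-root relationship can be checked with the usual 95% margin-of-error formula for a simple random sample (a textbook approximation, not the exact method any particular pollster uses):

```python
import math

def margin_of_error(p, n):
    """Approximate 95% margin of error for a proportion p estimated from n people."""
    return 1.96 * math.sqrt(p * (1 - p) / n)

# A proportion near 50% with n = 1,000 gives the familiar "plus or minus 3%".
print(round(margin_of_error(0.50, 1000), 3))   # 0.031

# Four times the sample halves the margin; nine times cuts it to a third.
print(round(margin_of_error(0.50, 4000), 3))   # 0.015
print(round(margin_of_error(0.50, 9000), 3))   # 0.01
```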

Next, the questions: It's easy to get wrong results. This can be done on purpose or accidentally. I'll talk about each. If you do it on purpose, then you are doing what seems to be called a push-poll. There is a great example of this in the wonderful book (and BBC television series) Yes Minister, in which the civil servants run rings around a Minister. (If you haven't seen it, you should, it's hysterical). The Minister says he wants to do a poll about mandatory military service for young people. The civil servants ask what results he wants. He says he wants the truth. They say "But which truth?" He looks confused. They demonstrate (I'm paraphrasing).

CS: Do you think you think young people have too much time on their hands?
M: Yes.
CS: Do you think it would be good for people to give something back to their country?
M: Yes.
CS: Do you think young people could use more discipline in their lives?
M: Yes.
CS: Are you in favor of national service?
M: Yes.

Then they start again

CS: Do you think the state has the right to employ people against their will?
M: No.
CS: Do you think it's a good idea to train people in the use of weapons?
M: No.
CS: Do you think the world needs to be more military?
M: No.
CS: Are you in favor of national service?
M: No.

Sometimes, though, you do it by accident, as when two seminary students are trying to decide whether it's permissible to smoke and pray simultaneously. They can't figure it out, so they decide to ask their superiors. A week later they meet again. The first student says, "I asked, and it turns out it's fine." The other says, "How odd, my superior told me it was forbidden. What did you ask?" "I asked if it was OK to pray while smoking." "That explains it! I asked if it was OK to smoke while praying!"

But there are much more subtle effects. For example, a poll once found that Hillary Rodham Clinton had lower popularity than Hillary Clinton. Another poll found that people express more racist attitudes when the person asking the questions has a southern accent (a fascinating example of multiple biases). There's no way to be sure that you've not tapped into some bias, but here are some tips:

  1. Change the order of the choices (e.g., ask some people "Do you plan to vote for Bush or Kerry?" and others "Do you plan to vote for Kerry or Bush?").
  2. Train the interviewers to ask about the choices in a neutral tone of voice.
  3. Don't say who hired the pollsters. (I once got a call where the person said "This is Bella Abzug for Mayor headquarters; whom do you plan to support for mayor?")
  4. Don't ask double-barreled questions (e.g., "Do you think that the troops should be brought home and given help adjusting to their home life?").
  5. Don't ask double-negative questions.
  6. If you ask about something that might be complex, give details and encourage questions (e.g., not "About how much money did you make last year?" but "About how much money did you make last year, before taxes, and counting all your income including odd jobs, capital gains, tips, and social security or other payments from the government?").

Finally, analysis: Here I will just talk about the case where there are two choices. E.g., "Do you plan to vote for Bush or Kerry?" It turns out that this is very difficult to estimate. The standard formulas don't work very well (if quant people are interested in details, let me know). The best way to estimate these numbers is via the bootstrap. If people are interested in details, let me know, but the bootstrap is essentially a computer-intensive method of analyzing this sort of data that makes very few assumptions about what the structure of the data is. (note to the pedant patrol: Yes IS, because it's the structure I am referring to, not the data).
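A minimal sketch of the bootstrap idea for a two-choice poll: resample the observed answers with replacement many times, and use the spread of the resampled percentages to measure uncertainty. The poll numbers are made up, and real bootstrap analyses involve more care than this:

```python
import random

random.seed(0)
answers = [1] * 520 + [0] * 480   # 1 = candidate A, 0 = candidate B; 52% observed

def bootstrap_ci(data, reps=2000, level=0.95):
    """Percentile bootstrap confidence interval for a proportion."""
    estimates = sorted(
        sum(random.choices(data, k=len(data))) / len(data)  # one resample
        for _ in range(reps)
    )
    lo = estimates[int(reps * (1 - level) / 2)]
    hi = estimates[int(reps * (1 + level) / 2) - 1]
    return lo, hi

lo, hi = bootstrap_ci(answers)
print(f"{lo:.1%} to {hi:.1%}")   # roughly 49% to 55%
```

Notice that nothing here assumes a normal distribution or any formula for the standard error; the only assumption is that the sample resembles the population, which is why the bootstrap makes so few assumptions about the structure of the data.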

Graphics

Statistical graphics, properly used, let us envision great quantities of information and gain insights into relationships among variables that would be difficult, if not impossible, to get from words and formulas alone. Improperly used, however, they obfuscate or even distort the truth, or waste paper on data that could be better summarized in a table or in text. You can look at a talk I gave at Yale University here: My Yale talk

There are a huge number of statistical graphics. Some are beautiful and elegant, conveying great quantities of information. Others are less so. And some are just awful, distorting the data or simply not presenting it well.

A few people have dominated the field of graphics. The two biggest names are Edward Tufte, who has a website here: Tufte website, and William Cleveland, whose homepage is here: Cleveland homepage. A third person who is less well-known but deserves more recognition is Michael Friendly, whose site is here: Friendly homepage

Tufte deals with the general presentation of information; Cleveland is more focused on statistics per se. Tufte bases his rules on his formidable intuition; Cleveland has done actual experiments on graphical perception. Friendly and Cleveland both offer programs to create the graphs they recommend, which makes them very valuable to data analysts like me.

OK, enough chitchat. Here are some principles of graphical design, taken from here and there, especially the works of Tufte and Cleveland; go buy their books! A good graph will

  1. Show the data
  2. Induce the reader to think about the substance
  3. Not distort the data
  4. Present many numbers in a small space
  5. Make large data sets coherent
  6. Encourage the eye to look at different parts of the data
  7. Reveal the data at several levels of detail
  8. Serve a clear purpose
  9. Engender a clear vision of the data
  10. Help the viewer understand the data

Next, some of my own observations:

Before deciding on a graph, we should think about how many variables are involved, and whether they are continuous or categorical.

If there is one variable and it is categorical, then one common choice is a Pie Chart. Avoid them. They often distort the data. Experiments have shown that a) People are very bad at judging angles and b) Rotating the pie changes people's perceptions, as does choice of color. A much better choice is the dot chart (The actual dot chart is fairly far down the page). But if there are only a few categories, a table or text may be better.

If there is one variable, and it is continuous, a very common choice is the Histogram; here is what William Cleveland says about histograms:

The histogram is a widely used graphical method that is at least a century old. But maturity and ubiquity do not guarantee the efficacy of a tool. The histogram is a poor method for comparing groups of univariate measurement.

One relatively straightforward better choice is the box plot. Another good choice is the density plot (see my Yale talk). If you have two variables and they are both categorical, a good choice is the mosaic plot (see Michael Friendly's website); if one is continuous and one is categorical, you can use side-by-side boxplots (see my Yale talk). If both are continuous, the traditional choice is the scatterplot.

Hypothesis Testing

The first thing we have to talk about is the different schools of statistics. There are at least three: The frequentist, the Bayesian and the decision theoretic.

The rest of this section will be about frequentist statistics. In this school of statistics, we first set up what is known as the null hypothesis, which is usually (but not always) an assertion that nothing is going on, e.g.

Income does not vary between Blacks and Whites

or

There is no relationship between IQ and political party.

It has to be stated in a form that can be tested by an experiment or by observational data, such that it is possible to calculate the probability of a result given that the null hypothesis is true. For example, if the first null hypothesis were true and we measured income for a random sample of Blacks and Whites, statistical theory shows that the difference in sample means would be normally distributed with a mean of 0 and a standard deviation that depends on the sample size and the variability of income in the population.

(The normal distribution is that bell shaped curve you've seen everywhere; not all bell-shaped curves are normal, but I am trying to keep this non-technical).

Next, we calculate a test statistic. What that test statistic should be depends on the null we are testing. For the first null hypothesis, it would likely be a t-statistic. For the second, it would be the parameter estimates of a logistic regression equation. There are many, many possibilities.

We can then compare our test statistic to its distribution under the null hypothesis. If our test statistic is unlikely under the null, then we reject the null hypothesis. Otherwise, we fail to reject it (note that we never accept the null hypothesis). How unlikely? Well, we can set any level we like. In the social sciences, 5% has become standard, despite the fact that Ronald Fisher, who did more than anyone to invent this stuff, said that no sane researcher would use the same value every time.
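The recipe can be illustrated with a permutation version of the first test (made-up income data, in thousands; a t-test would be the standard choice, but shuffling the group labels shows the "distribution under the null" idea directly):

```python
import random

random.seed(1)
group_a = [30, 32, 35, 40, 41, 46, 50, 55]   # made-up incomes, group A
group_b = [22, 25, 26, 28, 30, 31, 33, 36]   # made-up incomes, group B

def mean(xs):
    return sum(xs) / len(xs)

# The test statistic: the observed difference of means.
observed = mean(group_a) - mean(group_b)

# Under the null hypothesis, the group labels are arbitrary, so we can
# build the null distribution by shuffling them many times.
pooled = group_a + group_b
reps = 5000
extreme = 0
for _ in range(reps):
    random.shuffle(pooled)
    diff = mean(pooled[:8]) - mean(pooled[8:])
    if abs(diff) >= abs(observed):
        extreme += 1

p_value = extreme / reps
reject = p_value < 0.05          # the conventional 5% cutoff
print(observed, p_value, reject)
```

If the shuffled differences almost never reach the observed 12.25, the observed statistic is unlikely under the null, and we reject it at the 5% level.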

There are then four things that could happen:

  A. The null hypothesis is true, and we fail to reject it (a correct decision)
  B. The null hypothesis is true, but we reject it
  C. The null hypothesis is false, and we reject it (a correct decision)
  D. The null hypothesis is false, but we fail to reject it

We can't know whether we are correct, but we can figure out the probability of our being wrong.

B in the list above is known as a type 1 error. This is like saying something is happening when nothing really is.

D is known as a type 2 error. This is like saying nothing is happening when something really is.

The type 1 error rate is closely related to the significance level and the p-value.

1 minus the probability of D is known as power.

Now, all this is controversial. Jacob Cohen called this procedure Null Hypothesis Significance Testing (NHST) and said that only his wife's calming influence stopped him from calling it Statistical Hypothesis Inference Testing. Paul Meehl said that significance testing is what has prevented psychology from becoming a science, and my favorite grad school professor said that when he saw a lot of p values in a paper, he figured the authors were peeing all over. It is generally accepted that it is better to estimate effect sizes and give their confidence limits.

Original Diaries


This page was last modified 13:47, 21 June 2006 by dKosopedia user Dmsilev. Content is available under the terms of the GNU Free Documentation License.

