Reading research – 8 classic red flags

Ten years ago, few trainers had access to research studies.  These days, with Google University, we have moved into the era of research wars.  It is a battle of quantity, novelty and link littering.  Unfortunately, few seem to be reading past the abstract soundbites to see if the study in question is any good.  Even more problematic is the lure of pop psychology magazines with sexy titles, articles that probably misinform more than educate.

Every professor and textbook on the subject of research sends a consistent message.  Read research with a critical mind.  Not all studies are well executed.  Peer-reviewed academic papers are no exception.  Sometimes journals will publish research with poor design to inspire further research.  Looking at study design is scientific, not sacrilegious.

As readers of studies, we can take steps to improve our research reading abilities.  We can face our own biases directly.  Do we militantly tear apart research that goes against our point of view, while offering leniency to findings that feel warm and fuzzy?  More importantly, do we know how to read and analyze research?

I won’t pretend to be a research expert.  Rather, over the next series of blogs I will be highlighting what I have learned from Experimental Psychology by Myers and Hansen.  It is a worthwhile investment for anyone wanting to flex his or her mental muscles.

One core concept of research is internal validity.  As the name implies, we need to assess if a study is valid on the inside.  External validity, by contrast, would look at whether results apply to “the real world.”  Lesson number one is that internal validity should not be sacrificed for external validity.  If a study is not valid on the inside, there is nothing of substance to apply to the “real world.”

Campbell identified eight “Classic Threats to Internal Validity.”  They apply to research involving experimentation.  This includes true experiments and quasi-experiments.  True experiments have strict parameters or rules.  Quasi-experiments are not bad, just different.  In both, researchers manipulate a variable and then measure the effect it has on another variable.

IV isolation

For example, we might want to know if training method A is faster than training method B.  We divide a number of dogs into two groups and compare results of those two methods.  The type of training is the variable being manipulated.  We call this the independent variable (IV).  The goal of experimentation is to isolate the independent variable, to ensure that no other factor is interfering or confounding the results.

Revisiting our dog-training example, let’s say that group A tested on a Monday and group B tested on Tuesday.  If Monday was sunny and Tuesday was stormy, any claim that treatment A was better is highly suspect.  Stormy weather could have agitated the dogs in group B.  The independent variable was not adequately isolated.  The study would not have internal validity.

The following itemizes Campbell’s Classic Threats to Internal Validity and provides examples.  One step we can take toward understanding research is to understand how these threaten validity.

Our thunderstorm example above is a history threat.  Dogs in group B had a shared history during the experiment that differed from dogs in group A.  Training methods varied.  However, so did weather.  No one can say for sure which training method was faster because the weather interfered.  History threats can be subtle.  Another example would be if one group receives an orientation while the other does not.  Orientation can prime one group, giving them a head start.  It would also be a history threat.

Maturation threat reflects internal changes.  An obvious example might be age.  Behaviour can change as puppies mature.  Maturation can also mean the maturation of knowledge.  Students handling dogs during experiments will have gained knowledge throughout the term.  It would not be wise to test group A with new students at the start of term and group B at the end of term.  Increased knowledge by the end of term can mean that students guess the hypothesis or influence results.

Subjects rarely get the same test results when re-tested.  Practice leads to improvement, even without treatment of any kind.  Suppose we take a group of anxious dogs and test their heart rate.  Heart rates can drop simply because the dog habituates and becomes more comfortable.  A second round of testing should show habituation.  It is not enough to ask if a dog improved; we need to know if the dog improved more than dogs that did not receive any treatment.  Otherwise, we have a testing threat.
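For readers who like to see the numbers, here is a small simulation sketch of a testing threat (the heart-rate figures are hypothetical, not real data): every dog habituates on re-test, so the treated group “improves” even though the treatment does nothing.  Only the comparison with an untreated control group reveals that.

```python
import random

random.seed(1)

def heart_rate_drop(treated):
    # Hypothetical numbers: every dog habituates by roughly 8 bpm on
    # re-test, regardless of treatment.
    habituation = 8 + random.gauss(0, 2)
    treatment_effect = 0.0  # in this sketch the treatment does nothing
    return habituation + (treatment_effect if treated else 0)

treated = [heart_rate_drop(True) for _ in range(30)]
control = [heart_rate_drop(False) for _ in range(30)]

mean = lambda xs: sum(xs) / len(xs)
print(f"treated group improved by {mean(treated):.1f} bpm")
print(f"control group improved by {mean(control):.1f} bpm")
# Both groups improve by about the same amount: the "improvement" is
# habituation (a testing threat), not the treatment.
```

Without the control group, the treated group’s drop in heart rate would look like a success.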

Measuring results is not without potential pitfalls.  Instrumentation threats involve data collection problems.  Equipment can fail or be unreliable.  Scales used to score results need to be set correctly.  Assume we want to know if dogs become anxious at the dog park.  Imagine if the measurement options are:  highly anxious; moderately anxious; mildly anxious and fully relaxed.  Answers obviously will be weighted toward the “anxious” side of the scale.  Unless a dog is fully relaxed, it is by default labelled as anxious. Had moderately relaxed and slightly relaxed been offered as choices, an entirely different picture may have emerged.
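The lopsided scale above can be demonstrated with a quick sketch (the numeric cut-offs are made up for illustration): feed evenly spread arousal readings into a scale with three “anxious” labels and only one “relaxed” label, and most dogs come out labelled anxious by default.

```python
from collections import Counter

def lopsided_scale(reading):
    # Three of the four labels are "anxious", mirroring the example
    # in the text; cut-offs are hypothetical.
    if reading >= 75: return "highly anxious"
    if reading >= 50: return "moderately anxious"
    if reading >= 25: return "mildly anxious"
    return "fully relaxed"

def balanced_scale(reading):
    # A scale with evenly weighted options, for comparison.
    if reading >= 67: return "anxious"
    if reading >= 33: return "neutral"
    return "relaxed"

readings = list(range(0, 100, 5))   # evenly spread arousal readings

labels = [lopsided_scale(r) for r in readings]
anxious = sum(1 for label in labels if "anxious" in label)
print(f"lopsided scale: {anxious}/{len(readings)} dogs labelled anxious")
print("balanced scale:", Counter(balanced_scale(r) for r in readings))
```

The same dogs, measured on two different scales, tell two very different stories.  That is an instrumentation threat.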

Random selection between groups is important.  This process helps balance the characteristics between groups.  When groups are not random by design or chance, this is a selection threat.  Assume that we want to know which training technique obtains a faster recall.  Group A dogs are mostly short hounds and toy breeds.  Group B has mostly large dogs with a smattering of Border Collies and Whippets.  Under those conditions, we could not claim that group B training produced faster recalls.  To avoid accidental selection threats, random selection and balancing offer an even comparison between groups.  Researcher choice is not random.
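A minimal sketch of random assignment (with a made-up roster of dogs) shows the mechanic: shuffle the full roster by chance, then split it, so no researcher preference decides which dog lands in which group.

```python
import random

random.seed(42)

# Hypothetical roster: breed size stands in for any trait that could
# confound a recall experiment.
dogs = ([("hound", "small")] * 10 + [("toy breed", "small")] * 10
        + [("border collie", "large")] * 10 + [("whippet", "large")] * 10)

random.shuffle(dogs)                   # chance, not researcher choice
group_a, group_b = dogs[:20], dogs[20:]

count_small = lambda group: sum(1 for _, size in group if size == "small")
print("small dogs in group A:", count_small(group_a))
print("small dogs in group B:", count_small(group_b))
```

A single shuffle will rarely split the traits perfectly evenly, which is why researchers often also balance groups deliberately on known characteristics; but the choice of who goes where is never left to the researcher’s preference.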

Mortality is the dropout rate, and it should be reported in an experiment.  When large numbers drop out of an experiment, it’s a red flag.  According to the text, “Often it means that the treatment is frightening, painful, or distressing.  If the dropout rate is very high, it can mean that the treatment is sufficiently obnoxious that typical subjects would choose to leave, and the ones who remain could be unusual in some respect.”  Assume we are testing a protocol to help reactive dogs.  Many drop out.  Those who remain seem to improve.  The obvious question is whether those who left were distressed or had deteriorated so much that they did not return.  That is critical information.

The seventh threat comes with a big word:  Statistical Regression.  Extreme test results are unreliable.  Think back to grade school I.Q. tests.  Scoring low could mean you had the flu.  If an experiment uses subjects with extreme characteristics, we can expect some of that to level out on its own.  Testing a new anxiety treatment on highly anxious dogs can appear to work.  That apparent improvement may be nothing more than statistical regression.  As with a testing threat, it is not enough to ask if an animal improved.  We need to ensure that improvement happened because of the treatment.
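Regression to the mean is easy to simulate.  In this sketch (all numbers hypothetical), every dog’s true anxiety is identical; we select the extreme scorers on a noisy first test and re-test them with no treatment at all, and they “improve” on their own.

```python
import random

random.seed(0)

def anxiety_score(true_level):
    # An observed score is the dog's true anxiety plus day-to-day noise.
    return true_level + random.gauss(0, 10)

# Hypothetical population: every dog's true anxiety is 50.
first = [anxiety_score(50) for _ in range(1000)]

# Select the most extreme scorers, as a study of "highly anxious" dogs might.
extreme = [i for i, score in enumerate(first) if score > 65]

# Re-test the extreme group with no treatment at all.
second = [anxiety_score(50) for _ in extreme]

mean = lambda xs: sum(xs) / len(xs)
print(f"first test, extreme group:  {mean([first[i] for i in extreme]):.1f}")
print(f"re-test, no treatment:      {mean(second):.1f}")
# The extreme group drifts back toward 50 on its own: regression to
# the mean, with no treatment involved.
```

Any treatment tested only on that extreme group would have claimed the drop as its own effect.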

Finally, we come to selection interaction threats.  It’s the one-two punch of threats.  It happens when a selection threat combines with another threat.  Returning to our experiment that asks which dog training method is faster, suppose we non-randomly select dogs from two training schools.  Immediately, that creates a selection threat.  Now suppose school A has a trick team.  Students at this school are motivated to join the team.  The second training school does not offer tricks sessions.  That creates a history threat.  Trick dogs would have a wide array of skills to draw on – to guess the right answer instead of learning it via the training method tested.  Selection threat combines in this case with a history threat to create one hot mess.

(Image: Campbell’s classic threats to internal validity)

Campbell’s Classic Threats are the tip of the iceberg in terms of red flags.  They can make it seem as though no research can hold up to the standard.  Still, following a defined process for evaluating research is a far sight better than pointing to the number of subjects and chanting “the n value is too low.”  It may not be possible to control for every bump in the road.  Experimental Psychology states, “Control as many variables as possible.  If a variable can be held constant in an experiment, it makes sense to do so even when any impact of that variable on the results may be doubtful.”

Knowing the threats to internal validity is only useful if you start using them to read studies more carefully.  It might be tempting to annihilate an experiment you dislike.  Perhaps a more interesting exercise would be to review an experiment you love and have shared.  Challenge your bias.  Look at the design and the various threats to internal validity.  Did you find any?

(Parts 2, 3 and more to come… about those n values, non-experimental research and other topics.)