Reading Research: Does Size Matter?

Following up on Reading Research part one, where I reviewed key aspects of the book Experimental Psychology by Myers and Hansen, I thought it would be interesting to tackle the question, “Does size matter?” Of course, by size, I mean the sample size in a research study – often referred to as “n.”

Many introductory books make the point that small sample sizes are a red flag. Small samples might provide flawed information. A small group might be composed of unique or unusual individuals – subjects who do not reflect the majority of the population.

Yet Skinner, undoubtedly one of the most well-known names in psychology, did experiments with only a few subjects. His work is held in high regard for being tightly controlled, and much of it has held up over many decades.


That presents quite a discrepancy to resolve.  Large samples are purportedly good, yet Skinner’s exceptionally controlled research used small samples.  Assuming that both premises are true, an explanation must exist.

Generally, large samples are beneficial.  One reason is that large samples are more likely to reflect the whole of the population.  There is another reason illustrated by the following example.

Let us pretend that we want to know if one dog training technique is better than another. We randomly divide the dogs into two groups. Group A learns a task with our technique. Group B learns with a different technique. We train each group of dogs and compare the results. This research design is called a two-independent-groups design.

Statistics are used to analyze the data. A standard t-test is a probable choice for this study. Each statistical test makes assumptions in its calculations. Standard t-tests assume that the data we are collecting form a normal curve (bell curve). Without enough participants, there isn’t enough data to make a solid, fully formed bell curve. As the diagram below shows, without a fully formed curve, it is impossible to tell whether the curves are similar to or different from one another. As a rule of thumb, standard t-tests require at least twenty subjects in each group – but thirty is better.
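To make that concrete, here is a minimal sketch of how such a two-group comparison might be analyzed, assuming Python with numpy and scipy; the “trials to criterion” scores, group sizes and seed are invented for illustration.

```python
# A toy two-independent-groups comparison. The scores below are
# invented "trials to criterion" numbers, not real data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

# Hypothetical data: lower scores mean the dog learned faster.
group_a = rng.normal(loc=20, scale=4, size=30)  # our technique
group_b = rng.normal(loc=23, scale=4, size=30)  # the other technique

# The standard independent-samples t-test assumes roughly normal data,
# which is why each group needs enough subjects to form a bell curve.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```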

[Figure: normal curve]

There are other forms of research. Some researchers prefer to focus on details. These details can be lost when data is pooled or averaged. Instead of blending the results of many subjects, “small n” researchers focus intently on the individual responses of a few.

Such an approach can offer key insights. For example, if we measured how dogs learn new skills, blended results might create a gently sloping curve. Individual results could reveal a jagged process – breakthroughs and setbacks.
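Here is a toy illustration of that pooling effect, using made-up per-session scores for three hypothetical dogs:

```python
# Made-up "successes per session" for three hypothetical dogs.
dogs = {
    "dog_1": [0, 0, 0, 9, 9, 9],   # sudden breakthrough
    "dog_2": [0, 2, 4, 3, 8, 9],   # breakthroughs and setbacks
    "dog_3": [0, 1, 3, 5, 7, 9],   # steady improvement
}

sessions = len(dogs["dog_1"])
average = [sum(scores[i] for scores in dogs.values()) / len(dogs)
           for i in range(sessions)]

# The pooled average climbs smoothly even though two of the three
# individual dogs did not learn in a smooth, gradual way.
print("pooled average per session:", [round(a, 1) for a in average])
```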

[Figure: average graph]

There are a number of small n designs, including ABA, multiple baseline, changing criterion, and discrete trial designs.

Here are a couple of examples to show how some of these designs work.

In an ABA design, the subject acts as both the experimental and control group. Assume that we want to test a new anxiety treatment. A baseline is measured during phase one (the first A in ABA). Treatment is then given during the second phase (B). Finally, treatment is discontinued (A). The ABA design allows us to see if the treatment has an effect. We can also see if results disappear when treatment stops. There are many variations of the ABA design, such as ABABA, ABACADA and so on. The reversals allow researchers to see whether the order of treatments, rather than the treatments themselves, is having an impact.
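As a sketch of what ABA data might look like, here are invented daily anxiety scores for a single subject across the three phases:

```python
# Invented daily anxiety scores (higher = more anxious) for one dog
# across the three phases of an ABA design.
import statistics

phases = {
    "A1 (baseline)": [8, 9, 8, 9, 8],   # before treatment
    "B (treatment)": [5, 4, 4, 3, 4],   # treatment in effect
    "A2 (reversal)": [7, 8, 8, 9, 8],   # treatment withdrawn
}

# A drop during B that rebounds in A2 suggests the treatment itself,
# not the passage of time, is responsible for the change.
for phase, scores in phases.items():
    print(f"{phase}: mean anxiety = {statistics.mean(scores):.1f}")
```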

Discrete trials are common in conditioning experiments. For example, we might want to know if dogs discriminate sound better with one ear versus the other. In other words, we want to know if dogs are left or right “eared.” Dogs learn to discriminate a tone. Probe tones are presented to the left or right ear. The dog’s responses – how quickly it discriminates with each ear – are measured and compared. A response is measured over many treatment conditions; in this case, hearing is measured across a number of manipulations. Humans who participated in a similar experiment each performed over 2000 trials. The sample might be small, but the volume of data is massive. It requires meticulous record keeping and data analysis.
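As a sketch of that record keeping, imagine per-trial records like the ones below; the ears, reaction times and outcomes are all invented:

```python
# Invented per-trial records: (probe ear, reaction time in s, correct?).
from collections import defaultdict

trials = [
    ("left", 0.42, True), ("right", 0.55, True), ("left", 0.40, False),
    ("right", 0.60, True), ("left", 0.38, True), ("right", 0.58, False),
    # ...a real subject might contribute thousands of such records.
]

correct_rts = defaultdict(list)
for ear, rt, correct in trials:
    if correct:
        correct_rts[ear].append(rt)

# Compare mean reaction time for correct discriminations on each ear.
for ear, times in correct_rts.items():
    mean_rt = sum(times) / len(times)
    print(f"{ear} ear: {len(times)} correct trials, mean RT = {mean_rt:.2f} s")
```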

The question should not be “does size matter?”  That is an overly simplistic question.

Of course size matters – but bigger is not always better when it comes to sample sizes. What matters is whether the size of the sample suits the type of study and the statistical analysis used.

The various types of research are like tools. A hammer is no better or worse than a screwdriver, but using a hammer to drive a screw is fraught with problems. It is similar with studies. Different types of research serve different purposes – they need to be used correctly. Keep looking at sample sizes, but also look to see whether the sample matches the type of research. It can be helpful to grab a few studies, look up the sample size, and look up the type of study. Start becoming familiar with the jargon.

I would highly recommend Experimental Psychology to anyone wanting a deeper understanding. My blogs highlight just a few small sections. The book is well worth the investment.

Part one on reading research:  Internal Validity can be found here.
Part three – coming soon.

Reading research – 8 classic red flags

Ten years ago, few trainers had access to research studies. These days, with Google University, we have moved into the era of research wars. It is a battle of quantity, novelty and link littering. Unfortunately, few seem to read past the abstract soundbites to see if the study in question is any good. Even more problematic is the lure of pop psychology magazines with sexy titles – articles that probably misinform more than educate.

Every professor and textbook on the subject of research sends a consistent message.  Read research with a critical mind.  Not all studies are well executed.  Peer reviewed academic papers are no exception.  Sometimes journals will publish research with poor design to inspire further research.  Looking at study design is scientific, not sacrilegious.

As readers of studies, we can take steps to improve our research reading abilities. We can face our own biases directly. Do we militantly tear apart research that goes against our point of view, while offering leniency to findings that feel warm and fuzzy? More importantly, do we know how to read and analyze research?

I won’t pretend to be a research expert.  Rather, over the next series of blogs I will be highlighting what I have learned from Experimental Psychology by Myers and Hansen.  It is a worthwhile investment for anyone wanting to flex his or her mental muscles.

One core concept of research is internal validity.  As the name implies, we need to assess if a study is valid on the inside.  External validity, by contrast, would look at whether results apply to “the real world.”  Lesson number one is that internal validity should not be sacrificed for external validity.  If a study is not valid on the inside, there is nothing of substance to apply to the “real world.”

Campbell identified eight “Classic Threats to Internal Validity.” They apply to research involving experimentation, which includes both true experiments and quasi-experiments. True experiments have strict parameters or rules. Quasi-experiments are not bad, just different. In both, researchers manipulate a variable and then measure the effect it has on another variable.

[Figure: IV isolation]

For example, we might want to know if training method A is faster than training method B.  We divide a number of dogs into two groups and compare results of those two methods.  The type of training is the variable being manipulated.  We call this the independent variable (IV).  The goal of experimentation is to isolate the independent variable, to ensure that no other factor is interfering or confounding the results.

Revisiting our dog-training example, let’s say that group A tested on a Monday and group B tested on Tuesday.  If Monday was sunny and Tuesday was stormy, any claim that treatment A was better is highly suspect.  Stormy weather could have agitated the dogs in group B.  The independent variable was not adequately isolated.  The study would not have internal validity.

The following itemizes Campbell’s Classic Threats to Internal Validity and provides examples.  One step we can take toward understanding research is to understand how these threaten validity.

Our thunderstorm example above is a history threat. Dogs in group B had a shared history during the experiment that differed from dogs in group A. Training methods varied. However, so did the weather. No one can say for sure which training method was faster because the weather interfered. History threats can be subtle. Another example would be if one group receives an orientation while the other does not. Orientation can prime one group, giving them a head start. That, too, would be a history threat.

A maturation threat reflects internal changes. An obvious example is age. Behaviour can change as puppies mature. Maturation can also mean the maturation of knowledge. Students handling dogs during experiments will have gained knowledge throughout the term. It would not be wise to test group A with new students at the start of term and group B at the end of term. Increased knowledge by the end of term can mean that students guess the hypothesis or influence results.

Subjects rarely get the same test results when re-tested. Practice leads to improvement, even without treatment of any kind. Suppose we take a group of anxious dogs and test their heart rates. Heart rates can drop simply because the dogs habituate and become more comfortable. A second round of testing should show habituation. It is not enough to ask if a dog improved; we need to know if the dog improved more than dogs that did not receive any treatment. Otherwise, we have a testing threat.
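A minimal sketch of that comparison, assuming invented before-and-after heart rates for a treated group and an untreated control group:

```python
# Invented heart rates (bpm) before and after, for treated dogs and
# for untreated controls that only habituate to the testing itself.
treated_before, treated_after = [120, 118, 125, 122], [104, 103, 111, 108]
control_before, control_after = [121, 119, 124, 123], [113, 111, 117, 115]

def mean_drop(before, after):
    return sum(b - a for b, a in zip(before, after)) / len(before)

treated = mean_drop(treated_before, treated_after)
control = mean_drop(control_before, control_after)

# The control group's drop reflects habituation alone; only the drop
# beyond that is plausibly attributable to the treatment.
print(f"treated drop: {treated:.1f} bpm, control drop: {control:.1f} bpm")
print(f"drop beyond habituation: {treated - control:.1f} bpm")
```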

Measuring results is not without potential pitfalls. Instrumentation threats involve data collection problems. Equipment can fail or be unreliable. Scales used to score results need to be set correctly. Assume we want to know if dogs become anxious at the dog park. Imagine if the measurement options are: highly anxious, moderately anxious, mildly anxious and fully relaxed. Answers will obviously be weighted toward the “anxious” side of the scale. Unless a dog is fully relaxed, it is by default labelled as anxious. Had moderately relaxed and slightly relaxed been offered as choices, an entirely different picture might have emerged.

Random assignment to groups is important. This process helps balance the characteristics between groups. When groups are not random, whether by design or by chance, this is a selection threat. Assume that we want to know which training technique obtains a faster recall. Group A dogs are mostly short hounds and toy breeds. Group B has mostly large dogs with a smattering of Border Collies and Whippets. Under those conditions, we could not claim that group B’s training produced faster recalls. To avoid accidental selection threats, random selection and balancing offer an even comparison between groups. Researcher choice is not random.
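A minimal sketch of random assignment, using a hypothetical pool of dogs; shuffling before splitting lets chance, rather than researcher choice, balance breed and size across groups:

```python
import random

# Hypothetical pool of dogs; names and breeds are invented.
dogs = ["Beagle-1", "Whippet-1", "Collie-1", "Pug-1",
        "Beagle-2", "Whippet-2", "Collie-2", "Pug-2"]

random.seed(7)        # fixed seed only so the example is repeatable
random.shuffle(dogs)  # random order removes researcher choice

half = len(dogs) // 2
group_a, group_b = dogs[:half], dogs[half:]
print("Group A:", group_a)
print("Group B:", group_b)
```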

Mortality – the dropout rate – should be reported in an experiment. When large numbers drop out of an experiment, it’s a red flag. According to the text, “Often it means that the treatment is frightening, painful, or distressing. If the dropout rate is very high, it can mean that the treatment is sufficiently obnoxious that typical subjects would choose to leave, and the ones who remain could be unusual in some respect.” Assume we are testing a protocol to help reactive dogs. Many drop out. Those who remain seem to improve. The obvious question is whether those who left were distressed or deteriorated so much that they did not return. That is critical information.

The seventh threat comes with a big name: statistical regression. Extreme test results are unreliable. Think back to grade school I.Q. tests. Scoring low could mean you had the flu. If an experiment uses subjects with extreme characteristics, we can expect some of that extremity to level out on its own. Testing a new anxiety treatment on highly anxious dogs can appear to work simply because extreme scores tend to drift back toward average on retest. As with a testing threat, it is not enough to ask if an animal improved. We need to ensure that improvement happened because of the treatment.
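A toy simulation of that levelling-out effect: anxiety scores are modelled as a stable trait plus day-to-day noise, no treatment is applied, yet the most extreme dogs still appear to “improve” on retest.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

true_level = rng.normal(50, 10, size=500)         # stable trait per dog
test1 = true_level + rng.normal(0, 10, size=500)  # noisy first test
test2 = true_level + rng.normal(0, 10, size=500)  # noisy retest, no treatment

extreme = test1 > 70  # select only the "highly anxious" dogs
print(f"extreme dogs, test 1 mean: {test1[extreme].mean():.1f}")
print(f"extreme dogs, test 2 mean: {test2[extreme].mean():.1f}")  # closer to 50
```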

Finally, we come to selection interaction threats – the one-two punch of threats. A selection interaction happens when a selection threat combines with another threat. Returning to our experiment that asks which dog training method is faster, suppose we non-randomly select dogs from two training schools. Immediately, that creates a selection threat. Now suppose school A has a trick team. Students at this school are motivated to join the team. The second training school does not offer trick sessions. That creates a history threat. Trick dogs would have a wide array of skills to draw on – they could guess the right answer instead of learning it via the training method being tested. The selection threat combines in this case with a history threat to create one hot mess.

[Figure: classic threats to validity]

Campbell’s Classic Threats are the tip of the iceberg in terms of red flags. They can make it seem as though no research could hold up to scrutiny. Still, following a defined process for evaluating research is a far sight better than pointing to the number of subjects and chanting “the n value is too low.” It may not be possible to control for every bump in the road. Experimental Psychology states, “Control as many variables as possible. If a variable can be held constant in an experiment, it makes sense to do so even when any impact of that variable on the results may be doubtful.”

Knowing the threats to internal validity is only useful if you start using them to read studies more carefully. It might be tempting to annihilate an experiment you dislike. Perhaps a more interesting exercise would be to review an experiment you love and have shared. Challenge your bias. Look at the design and the various threats to internal validity. Did you find any?

(Parts 2, 3 and more… about those n values, non-experimental research and more.)