Reading Research: Does Size Matter?

Following up on Reading Research part one, where I reviewed key aspects of the book Experimental Psychology by Myers and Hansen, I thought it would be interesting to tackle the question, “Does size matter?”  Of course, by size, I mean the sample size in a research study – often referred to as “n.”

Many introductory books make the point that small sample sizes are a red flag.  Small samples might provide flawed information.  A small group might be made up of unique or unusual individuals – subjects who do not reflect the majority of the population.

Yes, Skinner, undoubtedly one of the most well-known names in psychology, did experiments with only a few subjects.  His work is held in high regard for being tightly controlled, and much of it has held up over many decades.


That presents quite a discrepancy to resolve.  Large samples are purportedly good, yet Skinner’s exceptionally controlled research used small samples.  Assuming that both premises are true, an explanation must exist.

Generally, large samples are beneficial.  One reason is that large samples are more likely to reflect the whole of the population.  There is another reason illustrated by the following example.

Let us pretend that we want to know if one dog training technique is better than another.  We randomly divide the dogs into two groups.  Group A learns a task with our technique.  Group B learns with a different technique.  We train each group of dogs and compare the results.  This research design is called a two independent groups design.

Statistics are used to analyze the data.  A standard t-test is a likely choice for this study.  Each statistical test makes assumptions in its calculations.  Standard t-tests assume that the data we are collecting forms a normal curve (bell curve).  Without enough participants, there isn’t enough data to make a solid, fully formed bell curve.  As the diagram below shows, without a fully formed curve, it is impossible to tell whether the curves are similar to or different from one another.  Standard t-tests generally require at least twenty subjects in each group – but thirty is better.

[Figure: normal curve]
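
To make the comparison concrete, here is a minimal sketch of the two-group analysis described above, written in Python with the scipy library.  Every number in it – the group sizes, the scores and the gap between techniques – is invented purely for illustration.

```python
# A minimal sketch of the two independent groups comparison described above.
# All numbers are made up purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Hypothetical "days to learn the task" for each training technique,
# thirty dogs per group (the rough minimum suggested above).
group_a = rng.normal(loc=12.0, scale=3.0, size=30)  # our technique
group_b = rng.normal(loc=14.0, scale=3.0, size=30)  # the other technique

# A standard (independent samples) t-test assumes roughly normal data
# in each group, which is why very small samples are a problem for it.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

With groups this size, the test has a fair chance of detecting a real difference; run the same comparison with five dogs per group and the same gap can easily disappear into the noise.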

There are other forms of research.  Some researchers prefer to focus on details.  These details can be lost when data is pooled or averaged.  Instead of blending the results of many subjects, “small n” researchers focus intently on the individual responses of a few.

Such an approach can offer key insights.  For example, if we measured how dogs learn new skills, blended results might create a gently sloping curve.  Individual results could reveal a jagged process – breakthroughs and setbacks.

[Figure: average graph]

There are a number of small n designs, including ABA designs, multiple baseline designs, changing criterion designs and discrete trial designs.

Here are a couple of examples to show how some of these designs work.

In an ABA design, the subject acts as both the experimental and control group.  Assume that we want to test a new anxiety treatment.  A baseline is measured during phase one (the first A in ABA).  Treatment is then given during the second phase (B).  Finally, treatment is discontinued (A).  An ABA design allows us to see if the treatment has an effect.  We can also see if results disappear when treatment stops.  There are many variations of the ABA design, such as ABABA, ABACADA and so on.  The reversals allow researchers to see whether the order of the treatments, rather than the treatments themselves, is having an impact.
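
To show what that looks like on paper, here is a toy sketch of ABA data in Python.  The subject, the anxiety scores and the phase lengths are all hypothetical.

```python
# A toy sketch of recording and summarizing ABA-design data.
# Scores and phase lengths are hypothetical, purely for illustration.
from statistics import mean

# Daily anxiety scores for one subject (higher = more anxious).
phases = {
    "A1 (baseline)":          [8, 7, 9, 8, 8],
    "B  (treatment)":         [6, 5, 4, 4, 3],
    "A2 (treatment removed)": [5, 6, 7, 8, 8],
}

for phase, scores in phases.items():
    print(f"{phase}: mean anxiety = {mean(scores):.1f}")

# If anxiety drops during B and climbs back toward baseline in A2,
# that reversal suggests the treatment, not time alone, drove the change.
```

The single subject serves as their own comparison: the B phase is judged against the A phases on either side of it.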

Discrete trials are common in conditioning experiments.  For example, we might want to know if dogs discriminate sound better with one ear than the other.  In other words, we want to know if dogs are left or right “eared.”  Dogs learn to discriminate a tone.  Probe tones are presented to the left or right ear.  The dog’s responses – how quickly it discriminates with each ear – are measured.  Comparisons are made.  A response is measured over many treatment conditions.  In this case, hearing is measured across a number of manipulations.  Humans who participated in a similar experiment each performed over 2000 trials.  The sample might be small, but the volume of data is massive.  It requires meticulous record keeping and data analysis.
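
As a rough illustration of the record keeping involved, here is a toy sketch of how discrete-trial responses might be logged and tallied.  The trials and latencies are invented; a real data set would run to thousands of rows.

```python
# A toy sketch of tallying discrete-trial responses by ear.
# The trial records and latencies are invented purely for illustration.
from statistics import mean

# Each trial: which ear received the probe tone, and the response latency in seconds.
trials = [
    ("left", 0.42), ("right", 0.51), ("left", 0.39),
    ("right", 0.48), ("left", 0.44), ("right", 0.53),
    # ...a real experiment would log thousands of these.
]

by_ear = {"left": [], "right": []}
for ear, latency in trials:
    by_ear[ear].append(latency)

for ear, latencies in by_ear.items():
    print(f"{ear} ear: {len(latencies)} trials, mean latency {mean(latencies):.2f} s")
```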

The question should not be “does size matter?”  That is an overly simplistic question.

Of course, size matters.  But bigger is not always better when it comes to sample sizes.  What matters is whether the size of the sample works with the type of study and the statistical analysis used.

The various types of research are like tools.  A hammer is no better or worse than a screwdriver.  Using a hammer to drive a screw is fraught with problems.  It is similar with studies.  Different types of research serve different purposes – they need to be used correctly.  Keep looking at sample sizes.  Also, look to see if that sample matches the type of research.  It can be helpful to grab a few studies, look up the sample size and look up the type of study.  Start becoming familiar with the jargon.

I would highly recommend Experimental Psychology to anyone wanting a deeper understanding.  My blog posts only highlight a few small sections.  The book is well worth the investment.

Part one on reading research:  Internal Validity can be found here.
Part three – coming soon.

New Isn’t Better. Better is Better.

Protocols created by pet professionals have been around for as long as I can remember. By protocol, I mean a system or recipe that provides a step-by-step instructional guide, presented as something that works better than, or differently from, the generally accepted practices described in standardized scientific terms.

I have used protocols in the past and I think the desire to find that shiny, new, better protocol comes from a passionate desire to help more dogs. No matter how good we are as trainers, there is that client that struggles. We want to help.

Thus is born the argument that we need more tools in the training toolbox. Don’t get me wrong: I am completely in favour of better, and I am in favour of learning about new things. New does not mean better. New does not mean it will stand the test of time. New does not mean that it belongs in my training toolbox.  Perhaps it belongs in my “interesting new factoid” box.  Evidence supporting some cognitive ability in dogs does not mean it will successfully translate into something useful for dog training. Not even all of Skinner’s work held up upon review. That is okay. That is how science works.

This is why the idea that “there is a study” is not sufficient to say that new is better. It is like the people who replaced their butter consumption with margarine. Trans fat is not better, even though it was newer than butter.


Will a specific study or protocol hold up over time? I do not know. I only know that before me is a human being, a client with a dog. It is my job to:

  • Choose a strategy that has evidence of being effective.
  • Ascertain that the strategy is suitable for that particular problem.
  • Understand, mitigate, avoid and warn about risks and costs.
  • Factor in the client’s capabilities and safety measures.

That is a tall order to fill. Human psychology has wrestled with this problem. It resulted in a series of task force reports by Chambless and Ollendick on evidence-based practice. Evidence-based practice recognizes that new treatments may become available, and that we need to balance the potential for new, effective treatments against the scientific evidence at hand. As a result, many psychological associations list ratings of treatment options.

For example, the Society of Clinical Psychology lists the treatments for panic disorder as:

  • Cognitive Behavioral Therapy (strong research support)
  • Applied Relaxation (modest research support)
  • Psychoanalytic Treatment (modest research support/controversial)

The terms “strong research” and “modest research” reflect specific criteria that explain the amount and type of research supporting that treatment.  An untested treatment plan may or may not work. Its omission from the list is an honest way of communicating that we just do not know.

Should controversy exist, this is also noted, creating a transparent system.  Reviewing, revisiting and questioning evidence does not constitute a personal attack. As the task force explains:

“Experts are not infallible. All humans are prone to errors and biases. Some of these stem from cognitive strategies and heuristics that are generally adaptive and efficient. Others stem from emotional reactions, which generally guide adaptive behavior as well but can also lead to biased or motivated reasoning.”

Criteria leading to a “strong research support” (well-established) designation are stringent. According to Chambless and Ollendick’s criteria:


I. At least two good between-group design experiments demonstrating efficacy in one or more of the following ways:

    A. Superior (statistically significantly so) to pill or psychological placebo or to another treatment.

    B. Equivalent to an already established treatment in experiments with adequate sample sizes.

OR

II. A large series of single-case design experiments (N > 9) demonstrating efficacy. These experiments must have:

    A. Used good experimental designs, and
    B. Compared the intervention to another treatment as in IA.

Further criteria for both I and II:

III. Experiments must be conducted with treatment manuals.
IV. Characteristics of the client samples must be clearly specified.
V. Effects must have been demonstrated by at least two different investigators or investigating teams.

(Bolding added by me to highlight the many requirements.)


You do not need to be a researcher in order to see that this is well beyond, “There is a new study – here is my new protocol.”  Well-established treatments have multiple, reputable studies with multiple researchers and teams that review and debate the merits of that evidence.  Even those listed as “modest research support” go well beyond one study and an idea.

How do we choose what is the best therapy for a particular client?  The task force suggests,

“…evidence should be considered in formulating a treatment plan, and a cogent rationale should be articulated for any course of treatment recommended.”

In dog training circles, protocols are marketed differently than the above.  Clients and trainers alike are told that we “need more tools, or dogs will die.”  This insinuates that nothing but more protocols can save lives, overlooking that this is not the issue at hand.  The choice is not between new protocols and death. Our choice lies between therapies with a strong body of evidence and others with little to none.

More choices and more protocols create an ethical dilemma. We do not know if shiny, new things are better than placebo, nor do we know if they carry risks. We are working without the safety net that testing provides.

We also create an opportunity cost. We abandon the well-established treatments in favour of the unknown. There is a finite amount of time, money and resources in a client’s life.  Attention to the new takes time and attention away from a strategy that has a strong track record of working.

Even if we could mash methods together and offer multiple strategies, it is unlikely that anyone has tested or reviewed whether the methods are complementary. Do the effects of our shiny, new protocol trigger blocking effects in the tried and true? Without testing, this presents yet another concerning unknown. It is entirely possible that we are setting the client up to fail.

Out of the plethora of shiny new protocols, perhaps some will stand the test of time.  We remain in the dark until rigorous testing happens.

We, as dog trainers, have no right to override or skip testing or review. Our experiences and anecdotes are not superior to the tenets of scientific processes. Nothing gives us the right to let our ego grow to the point where we believe we can create a protocol – skip testing – sell it to clients at will – without disclosure – while taking payment for that service.

Until shiny, new studies and protocols become tested and reliable, we have choices to make for the individual client before us. If we choose to go the route of shiny and new, then at the very least clients deserve to know that they are signing up for something experimental. They also have a right to know that a supported treatment is available to them elsewhere.


To be quite blunt, while we dabble in the new and untested, we are asking our clients to be our guinea pigs.

New does not mean better. Better is better. We will know we have better when we have proof that it’s better.  In the meantime, perhaps our focus is better spent becoming better at that which already meets “well established” treatment guidelines.

Define “Works”

Some say that statistics lie.  Personally, I’ve never met a dishonest statistic.  I have, however, met dishonest people who misuse statistics – and people who misread them.

Unless you want to be the person who is spending $50.00 a month on the latest supplement or therapy program, take some time to learn about controls.  Let’s face it, if all those miracle treatments worked, people wouldn’t have drawers full of fads that did not work.  Same goes for dog training miracle treatments.

Let’s have some fun with statistics by creating some really basic fake study results.

Here is our study:

Scientists told a bunch of people to eat strawberries three times a day.  They wanted to know if strawberries help people with an upset stomach feel better.

You look in a magazine and see that strawberries offer clear benefits to people with upset stomachs.  Within the article is a diagram – a graph.  It looks something like this:

[Figure: strawberries vs. doing nothing]
Impressive-looking results.  Forty percent of the strawberry eaters felt better, compared to people who did nothing to change their diet.  Oddly, some people who did nothing felt better too.  Maybe they had the flu and it cleared up on its own.  It certainly looks like we should be eating strawberries with every meal.

Another researcher repeated the same study with one change.  They found similar results, so we have the beginning of a possible trend.

[Figure: strawberries vs. junk food]

Forty percent of people felt better eating strawberries.  This time the researcher compared junk food eaters to the newly named Strawberry Miracle Diet.

Seriously, how is that revolutionary?  Comparing real food to junk and candy is an absurd idea.  We know junk food is bad.  Comparing to a bad control obviously makes the strawberries look good.  Overall, it doesn’t prove a “Miracle” strawberry diet works.

Researcher number three is skeptical and their study has yet another chart.

[Figure: strawberries vs. placebo]
This researcher did something interesting.  They created a placebo group.  This means that half of the people ate strawberries and the other half were given a sugar pill.  A placebo is a treatment with NO medical effect.  Comparing strawberries to a treatment with no benefit helps weed out the placebo effect – when people truly think something is helping but it’s not.  Yes, our quirky human brains are funny that way.

Researcher number three has provided clear evidence that strawberries are no better than sugar pills – a treatment known to do NOTHING.  Suddenly the Strawberry Miracle Diet looks a bit hinky.

Some might say, “Who cares, as long as people feel better?”  In certain cases, certainly this is true.  For example, people in severe pain might be given a placebo if they have reached the maximum dosage on pain medication.

Exceptions aside, an insidious danger lurks.  One more chart to illustrate what could potentially happen.

[Figure: comparing a variety of fruits]
Another researcher compared a variety of fruits to each other.  Most fruits made 40% of people feel better – except pineapple.  Pineapple stands out, with almost 80% of people feeling better.  That is a significant difference.

Who knows why pineapple is superior.  We need more research.  Perhaps it’s all the digestive enzymes.  Pineapple is better than placebo.  Pineapple is MUCH better than strawberries.  Pineapple actually seems to offer some REAL benefit.  As real as a fake study can be.
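
For the curious, here is how such a comparison could be checked with a simple test of proportions in Python.  The counts below are as fake as the studies themselves, chosen only to mirror the percentages above.

```python
# A minimal sketch of comparing "felt better" rates against a placebo group.
# All counts are invented purely for illustration.
from scipy.stats import chi2_contingency

def p_value_vs_placebo(treated_better, treated_total, placebo_better, placebo_total):
    # 2x2 table: rows are groups, columns are felt better / did not.
    table = [
        [treated_better, treated_total - treated_better],
        [placebo_better, placebo_total - placebo_better],
    ]
    chi2, p, dof, expected = chi2_contingency(table)
    return p

# Strawberries: about 40% felt better, the same as the sugar-pill group.
p_straw = p_value_vs_placebo(40, 100, 40, 100)
# Pineapple: about 80% felt better, well above the placebo group.
p_pine = p_value_vs_placebo(80, 100, 40, 100)

print(f"strawberry vs placebo: p = {p_straw:.2g}")
print(f"pineapple vs placebo:  p = {p_pine:.2g}")
```

The first comparison gives no reason to think strawberries beat a sugar pill; the second is the kind of result that would justify the “we need more research” excitement about pineapple.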

For numbers to be meaningful, you need a point of reference for comparison.  For example, a horse is big compared to a mouse.  A horse is small compared to an elephant.  The horse’s size does not change.  However, you only get a real indication of size if you compare to other things.

Similarly, treatments can seem to work.  Only by thoughtfully comparing them to solid reference points can we understand what the numbers really mean.

This is a sticky topic because, with our irrational minds, placebos “work.”  You could probably test bracelets made from the hair off a monkey’s butt and some people would swear they “worked.”  Their testimonials will convince other people to wear monkey butt hair bracelets too.

The real tragedy of eating strawberries isn’t that strawberries are bad, immoral or evil.  The tragedy is that they take people away from pineapples – treatments that work.

You could substitute supplements, holistic remedies, conventional medicine, anxiety treatments or any other “cure” on the planet.  This applies to dog training treatments as much as it does to “Miracle Diets.”  In a world where dog training results are evaluated with owner surveys, you cannot completely avoid the placebo effect.

The fact that something “works” doesn’t mean it works better than placebo.  If you want to know if a protocol, therapy or product works, you need to spend as much, if not more, time looking at the control groups.