Novel program research: standards and background

PEER staff
7/3/2018

Overview

This is a primer on research standards.

It has three parts.

  • Intuitive (but insufficient) evaluation
  • Standards of research quality
  • Implementation of quality research

Part one

Intuitive (but insufficient) evaluation:

  • Anecdote
  • Simple numerical comparison

Part two

Standards of research quality:

  • The statutory definition of 'evidence'
  • The Maryland Scientific Methods scale

Part three

Implementation of quality research:

  • Pre-registration
  • Reproducibility

"Evidence"

“Evidence-based practice” is kind of like “the right outcome” of a trial.

Nobody's against it – but people have very different opinions of what it is!

But they're alike in another way: Not all opinions are created equal.

Anecdotes vs. evidence

“Our program definitely works. Just look at Timmy! He went through it, and turned his whole life around!”

This is not, by itself, evidence of a program's effectiveness.

Why?

Here are Timmy and a couple of his friends.


But here's the rest of Timmy's program.


The point

With enough participants and normal conditions, you're effectively guaranteed to have successes even in a bad program.

In other words:

Anecdotes aren't good evidence!
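
To see why, here is a minimal sketch (Python, with made-up numbers) of a program that has no effect at all, yet still produces success stories, simply because some participants would have succeeded on their own:

  import random

  random.seed(1)

  n_participants = 500
  baseline_success_rate = 0.20   # hypothetical rate of success with NO program

  successes = sum(random.random() < baseline_success_rate
                  for _ in range(n_participants))

  print(f"{successes} of {n_participants} participants 'turned their lives around'")
  # With hundreds of participants, there will always be a Timmy to point to,
  # even though the program itself contributed nothing.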

What if we get more numbers?

Imagine two programs doing the same thing.

  • One of them achieves 40 units of effect on average.
  • The other achieves 80 units of effect on average.

We normally believe that the 80-unit program is better.

Here's what we probably imagine...


But here's what could be happening!


So the takeaway:

Numeric comparisons – X is bigger than Y – require statistical context.

Without it, they cannot serve as good evidence.
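
To make "statistical context" concrete, here is a minimal sketch (Python, with entirely hypothetical numbers) in which two programs really do average about 40 and 80 units, yet the gap may not be distinguishable from noise because outcomes vary widely and the samples are small:

  import random, statistics

  random.seed(2)

  # Hypothetical outcomes: 10 participants per program, very spread-out results.
  program_a = [random.gauss(40, 60) for _ in range(10)]
  program_b = [random.gauss(80, 60) for _ in range(10)]

  mean_a, mean_b = statistics.mean(program_a), statistics.mean(program_b)
  se = (statistics.variance(program_a) / 10 + statistics.variance(program_b) / 10) ** 0.5
  t = (mean_b - mean_a) / se

  print(f"mean A = {mean_a:.0f}, mean B = {mean_b:.0f}, t = {t:.2f}")
  # The two averages mean little on their own: whether the gap is real depends
  # on the spread of outcomes and the number of participants, summarized here by t.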

So what can?

MS law provides helpful guidelines.

MISS. CODE ANN. §27-103-159 gives some relevant definitions, including:

“Evidence-based program” shall mean a program or practice that has had multiple site random controlled trials across heterogeneous populations demonstrating that the program or practice is effective for the population.

Let's break that down.

An evidence-based program has had:

  • Multiple-site
  • random controlled
  • trials (plural)
  • across heterogeneous populations
  • demonstrating effectiveness.

Why all those requirements?

Each of these requirements answers a common-sense question we should ask of any program:

  • Effectiveness: Does the program do something?
  • Trials: Is the observed effect distinguishable from noise and error?
  • Random controlled: Is the observed effect because of the program?
  • Multiple-site, heterogeneous populations: Does the observed effect generalize to us?

(We also want to know whether a program is cost-effective compared to available options, but that's another story)

Why all those requirements?

The importance of effectiveness and generalizability should be pretty clear.

“Distinguishing from noise and error” is just a matter of providing the statistical context that simple numerical comparison does not.

But why randomized, controlled trials?

RCTs and causation

The short answer: RCTs are our best method of establishing that A causes B.

Imagine you’re a researcher for a shoe company; you’re testing a running shoe that is supposed to shave time off a runner's sprint.

So you set up a test: Runners in your shoes versus runners in some different shoe.

Shoe trials

After statistical analysis, we find that the group in your shoe crossed the finish line significantly before the other group.

But wait: You had your group running 100m, while the comparison group ran 200m!

This comparison wasn’t fair; even if the results are good, we can't say they were because of the shoe.

Statistical control and fairness

This is the essence of controlling for confounding variables: basic fairness in comparisons.

(Statistical) control = making sure everybody has the same starting line before comparing them.

Statistical control continued

There are several ways to control for confounding variables. For instance:

  • Simple physical setup of the trial
    • Don’t use different length tracks
  • Various mathematical methods
    • Multiply short-track group time by two

(Obviously this last method is just for the sake of the example and would not be appropriate in a real evaluation; a toy sketch of it follows below.)
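
Here is that toy sketch of the shoe example (Python, with made-up times; the doubling adjustment is purely illustrative, as noted above):

  import statistics

  your_shoe_100m  = [11.0, 11.2, 10.9]   # seconds on a 100 m track
  other_shoe_200m = [22.0, 22.4, 21.8]   # seconds on a 200 m track

  raw_gap = statistics.mean(other_shoe_200m) - statistics.mean(your_shoe_100m)
  print(f"Unadjusted gap: {raw_gap:.1f} s in your shoe's favor")

  # Crude 'mathematical control': put both groups on the 200 m scale
  # by doubling the short-track group's times.
  adjusted_yours = [t * 2 for t in your_shoe_100m]
  adjusted_gap = statistics.mean(other_shoe_200m) - statistics.mean(adjusted_yours)
  print(f"Adjusted gap: {adjusted_gap:.1f} s -- the apparent advantage disappears")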

Statistical control continued

These methods of control can be very sophisticated. But there's a problem:

  • You have to know that a confounding variable exists in order to control for it.
  • And it's impossible to know ahead of time what all the confounding variables are.

A relevant quote

“… the golden rule of causal analysis: No causal claim can be established by a purely statistical method, be it propensity scores, regression, stratification, or any other distribution-based design.”

-Judea Pearl, “Causality,” p. 350

RCTs and causation

Well-conducted random assignment ensures that, on average, every possible confounding variable – measured or unmeasured – is distributed evenly across conditions: no trait is systematically correlated with group membership.

Which means the groups, overall, start from the same line and run the same race….

Which lets us conclude that if they finish at different times, it's because of the program.
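
A minimal sketch of what randomization buys us (Python; the "prior motivation" trait and all numbers are hypothetical): an unmeasured trait ends up spread almost evenly across the two groups without anyone controlling for it.

  import random, statistics

  random.seed(3)

  # 1,000 hypothetical participants, each with an unmeasured trait
  # ("prior motivation") that could otherwise confound the comparison.
  motivation = [random.gauss(50, 10) for _ in range(1000)]

  random.shuffle(motivation)               # random assignment to conditions
  treatment, control = motivation[:500], motivation[500:]

  print("mean motivation, treatment group:", round(statistics.mean(treatment), 1))
  print("mean motivation, control group:  ", round(statistics.mean(control), 1))
  # The group means land close together -- and the same balancing happens for
  # every other trait, measured or not, which is what licenses the causal claim.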

Back to the question: Why RCTs?

Compared with nonrandomized evaluation, randomized controlled trials are:

  • Epistemically preferable
    • they enable causal inferences
  • Practically preferable
    • the math for control and testing is far simpler
  • Legally preferable
    • no program tested solely via anything less than an RCT can ever meet the Mississippi standard of evidence!

So to summarize:

The MS standard for evidence-based practice is the gold standard, and research quality drops off dramatically as these requirements are relaxed:

  • In medicine: 50-80% of positive results in initial clinical trials are overturned by subsequent RCTs (Ioannidis (2005), Zia et al. (2005))
  • In business: 80-90% of new products and strategies tested under RCTs by Google and Microsoft have found no significant effects (Manzi (2012))
  • In education: 91% of rigorous RCTs conducted by the Institute for Education Sciences showed weak or no positive effects (CEBP (2013))

But sometimes, the perfect is the enemy of the good.

Gold is rare. What if we don't have any and still need to act?

MISS. CODE ANN. §27-103-159 provides some loose definitions of less rigorous alternatives:

  • “Research-based program” shall mean a program or practice that has some research demonstrating effectiveness, but that does not yet meet the standard of evidence-based practices.
  • “Promising practices” shall mean a practice that presents, based upon preliminary information, potential for becoming a research-based or evidence-based program or practice.

But these definitions are very loose!

So to make things easier...

We've adopted an existing scale to rate research below the MS standard of evidence:

The Maryland Scientific Methods scale.

The MSM scale?

Described by Farrington et al. (2002) in Evidence-Based Crime Prevention.

It's a five-point ordinal scale – 1 is the worst, 5 is the best!

It rates our general ability to draw conclusions from the study.

  • Or said another way: it rates what threats to our desired conclusions are ruled out.

The MSM scale (and threats at each level)

  1. Simple descriptive association
    • threats: causal direction, confounders
  2. Pre-post testing
    • threats: confounders
  3. Control group
    • threats: nonequivalence of groups
  4. Control group plus high-quality statistical controls
    • threats: inadequate control
  5. Randomized control group
    • threats: inappropriate implementation and analysis

The MSM scale

It's not safe to make inferences from any trial below level 3.

So that's where we've drawn our line for “high-quality research”…

(Although you should always want the gold standard if possible – cf. the earlier slide on the percentage of preliminary studies overturned by rigorous RCTs.)

But it takes more than just this.

Everything said so far assumes that the research is well-conducted.

There is a crisis of reproducibility in science, especially social science!

Some have gone so far as to suggest that most published research is false.

This problem affects randomized and nonrandomized studies alike.

How does the problem happen?

There are several ways: selective reporting of favorable results, analytic choices made after seeing the data, stopping data collection as soon as a desired result appears, and more.

And this isn't an exhaustive list! Practices like these can make even wildly implausible results seem scientifically justified.

How do we fix the problem?

Here's where pre-registration and reproducibility come in.

A simplified overview of the process:

  • You complete a research plan
  • You submit the plan to the state (pre-registration)
  • You do your research
  • You write up your research and provide the state with documentation (reproducibility)

Pre-Registration: The basics

An excellent conceptual overview of the process is here.

It's strongly recommended reading even if you skip the articles already mentioned!

The report that you pre-register should conform to one of two existing, internationally accepted standards: CONSORT or TREND.

Pre-Registration: CONSORT standard

CONSORT is for randomized trials.

The most important CONSORT material for your purposes is the CONSORT checklist of items to report.

The CONSORT website is extremely helpful, and has other checklists and documents that may be useful to you!

Pre-Registration: TREND standard

TREND is for non-randomized trials.

The most important TREND material for your purposes is the TREND checklist of items to report.

TREND is designed to work with CONSORT, so the earlier website will be helpful here as well.

Pre-Registration: Before research

At the initial submission phase, you will write up a report that includes every item on your checklist except those under 'results' and 'discussion'.

This initial writeup must be completed before any aspect of research begins – including even assigning subjects to conditions.

Writeups completed after any phase of research begins cannot meet the standards of this section.

Reproducibility: After research

After the research is done, you will finish your report, including 'results' and 'discussion', and resubmit.

Note that adequate answers to many of the checklist elements will require fairly technical decisions. This checklist is not a substitute for skilled research staff – it just makes their jobs easier and their results more trustworthy!

Reproducibility

When submitting your final research paper:

  • Submit all raw data
    • Raw data should be time-stamped from original entry and unaltered
    • Analytic spreadsheets (if any) and data storage spreadsheets should be distinct
  • Submit all analytic code
    • As much analysis as possible should be done via code
    • Code should be literate
    • Anything and everything not achieved through code should be written up exhaustively

Reproducibility

The goal: The reader should be able to take

  • Your raw data (unaltered from original entry)

Apply

  • Your code (understandable to a non-coding English speaker)

And get

  • Your results

With no further manipulation necessary!
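
Here is a minimal sketch of what that pipeline can look like in practice (Python; the file name "raw_outcomes.csv" and the "outcome" column are hypothetical, not a prescribed template):

  import csv
  import statistics

  def load_outcomes(path):
      """Read the raw, time-stamped data file without altering it."""
      with open(path, newline="") as f:
          return [float(row["outcome"]) for row in csv.DictReader(f)]

  def summarize(outcomes):
      """Every reported number is computed here, in code, from the raw data."""
      return {"n": len(outcomes),
              "mean": statistics.mean(outcomes),
              "sd": statistics.stdev(outcomes)}

  if __name__ == "__main__":
      # Anyone with the unaltered raw file and this script can regenerate the results.
      print(summarize(load_outcomes("raw_outcomes.csv")))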

References without direct links

Coalition for Evidence-Based Policy (2013). Randomized Controlled Trials Commissioned by the Institute of Education Sciences Since 2002: How Many Found Positive Versus Weak or No Effects. Retrieved from http://coalition4evidence.org/wp-content/uploads/2013/06/IES-Commissioned-RCTs-positive-vs-weak-or-null-findings-7-2013.pdf

Farrington, D.P., Gottfredson, D.C., Sherman, L.W., & Welsh, B.C. (2002). The Maryland Scientific Methods Scale. In Farrington, D.P., MacKenzie, D.L., Sherman, L.W., & Welsh, B.C. (Eds.), Evidence-Based Crime Prevention (pp. 13-21). London: Routledge.

Ioannidis, J.P.A. (2005). Contradicted and Initially Stronger Effects in Highly Cited Clinical Research. Journal of the American Medical Association, 294(2), 218-228.

Manzi, J. (2012). Uncontrolled: The Surprising Payoff of Trial-and-Error for Business, Politics, and Society. New York: Perseus Books Group.

Pearl, J. (2009). Causality (2nd ed.). Cambridge: Cambridge University Press.

Zia, M. I., Siu, L. L., Pond, G. R., & Chen, E. X. (2005). Comparison of Outcomes of Phase II Studies and Subsequent Randomized Control Studies Using Identical Chemotherapeutic Regimens. Journal of Clinical Oncology, 23(28), 6982-6991.