My sense has been that the PER community still implements subpar standards of research reporting, which limits our ability to carry out meaningful meta-analysis. I’m not an expert, but I’m assuming that scores with standard deviations / standard errors would be necessary for a meta-analysis, right? So I’m curious. I’m going to take a quick look at some recent papers that report FCI scores as a major part of their study, and see what kind of information the authors provide. Here’s how I’ll break it down.

Raw-(ish) Data:

N = number of students

Pre = FCI pre-score either as raw score out of 30 or a percentage (with or without standard deviation / standard error of mean)

Post = FCI post-score either as a raw score out of 30 or a percentage (with or without standard deviation / standard error of mean)
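Since some papers report spread as a standard deviation and others as a standard error of the mean, it helps to remember that the two are interchangeable only when N is also reported. Here is a minimal sketch of the conversion; the function names and the numbers are my own, made up for illustration, and not taken from any of the papers below.

```python
import math

def se_from_sd(sd, n):
    """Standard error of the mean from a standard deviation and sample size."""
    return sd / math.sqrt(n)

def sd_from_se(se, n):
    """Standard deviation recovered from a standard error of the mean."""
    return se * math.sqrt(n)

# Made-up example: an SD of 6.0 FCI points in a class of 100 students.
se = se_from_sd(6.0, 100)
print(se)                    # standard error of the mean
print(sd_from_se(se, 100))   # converts back to the original SD
```

Without N, a reader cannot go from one to the other, which is part of why N is on the raw data list.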

Calculated Data:

g = normalized gain with or without error bars / confidence intervals

<G> = average normalized gain with or without error bars / confidence intervals

Gain = Post minus Pre (with or without standard deviation / standard error of mean)

APost = ANCOVA adjusted post score (with or without standard error of mean)

d = Cohen’s d, a measure of effect size (with or without confidence intervals)
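To make the calculated quantities concrete, here is a rough sketch of how they are commonly computed. A few assumptions on my part: I take g to be the normalized gain computed from class-average percentages, (post - pre) / (100 - pre); I take <G> to be the mean of the individual students' normalized gains; and I compute Cohen’s d with a pooled standard deviation, treating pre and post as two groups. Individual papers may use different conventions, and the function names and numbers are made up for illustration.

```python
import math

def gain(pre, post):
    """Raw gain: post minus pre (same units as the inputs)."""
    return post - pre

def normalized_gain(pre_pct, post_pct):
    """Normalized gain from percentage scores: fraction of possible gain achieved."""
    if pre_pct >= 100:
        raise ValueError("normalized gain is undefined for a pre-score of 100%")
    return (post_pct - pre_pct) / (100.0 - pre_pct)

def mean_student_gain(pairs):
    """Average of per-student normalized gains, given (pre_pct, post_pct) pairs."""
    gains = [normalized_gain(pre, post) for pre, post in pairs]
    return sum(gains) / len(gains)

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d using a pooled standard deviation (one common convention)."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean2 - mean1) / math.sqrt(pooled_var)

# Made-up class: pre 40%, post 70%, SD 20 percentage points, N = 100 each time.
print(gain(40.0, 70.0))                           # 30.0
print(normalized_gain(40.0, 70.0))                # 0.5
print(cohens_d(40.0, 20.0, 100, 70.0, 20.0, 100)) # 1.5
```

Notice that the pooled-SD version of d needs means, standard deviations, and N, which is exactly the raw data listed above; g alone does not let anyone reconstruct it.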

I’m leaving out statistical transparency such as t-statistics or p-values, or other measures from ANOVA, and I’m sure there are others, such as accompanying data about gender, under-represented minorities, ACT scores, declared major, etc.

Anyway, here we go:

1. Thacker, Dulli, Pattillo, and West (2014), “Lessons from large-scale assessment: Results from conceptual inventories”

Raw Data: N

Accompanying Data: None

Calculated Data: g with standard error of the mean (mostly must be read from graphs)

2. Lasry, Charles and Whittaker, “When teacher-centered instructors are assigned to student-centered classrooms”

Raw Data: N, Pre with standard deviation

Accompanying Data: None

Calculated Data: g with standard error of mean (must be read from graphs), APost with standard error

Raw Data: N

Accompanying Data: Gender, major, ACT

Calculated Data: g with standard error of mean (must be read from graphs)

Raw Data: N, Pre (with standard deviation), Post (with standard deviation)

Accompanying Data: Others related to study, CLASS, for example

Calculated Data: g with standard error of mean

5. Crouch and Mazur, “Peer Instruction: Ten years of experience and results”

Raw Data: N, Pre (without standard deviation), Post (without standard deviation)

Calculated Data: g (without standard deviation), d (without confidence intervals)

Raw Data: N, Pre (with SD), Post (with SD)

Accompanying Data: Gender, race, etc.

Calculated Data: Gain (with SD), d (with CI)

Raw Data: N, Pre (with SE), Post (with SE)

Accompanying Data: Gender, majority/minority

Calculated Data: Gain (with SE), d (with CI)

So, what do I see?

Of my quick grab of 7 recent papers, only 3 meet the criteria for reporting the minimum raw data that I would think are necessary to perform meta-analyses. Not coincidentally, two of these three papers are from the same research group. Also, probably not coincidentally, all three papers include data both in graphs and tables and include error bars or confidence intervals. They also consistently reported measures related to any statistical analyses performed.

Four of the papers did not fully report raw data. One of the four gave almost all the raw information needed, but reported ANCOVA adjusted post scores rather than raw post scores. Even here the pre-score data is buried, and the APost and g scores can mostly only be gleaned from graphs. Two of the papers did not give raw data about pre or post. They reported normalized gain with error bars shown, but the values could only be read from a graph. These two papers did some statistical analyses, but didn’t report them fully. The last of the four reported pre and post scores but didn’t include standard errors or standard deviations. The authors carried out some statistical analysis as well, but did not report it meaningfully or include confidence intervals.

I don’t intend this post to point the finger at anyone, but rather to point out how inconsistent we are. Responsibility is community-wide: authors, reviewers, and editors. My sense looking at these papers, even the ones that didn’t fully report data, is that this is much better than what was historically done in our field. Statistical tests were largely performed, but not necessarily reported in full. Standard errors were often reported, but often had to be read from small graphs.

There’s probably a lot someone could dig into with this, but it’s probably not going to be me.