Statistical Methods for Replicability Assessment
Large-scale replication studies like the Reproducibility Project: Psychology (RP:P) provide invaluable systematic data on scientific replicability, but most analyses and interpretations of the data fail to agree on the definition of "replicability" and to disentangle the inexorable consequences of known selection bias from competing explanations. We discuss three concrete definitions of replicability based on (1) whether published findings about the signs of effects are mostly correct, (2) how effective replication studies are in reproducing whatever true effect size was present in the original experiment, and (3) whether true effect sizes tend to diminish in replication. We apply techniques from multiple testing and post-selection inference to develop new methods that answer these questions while explicitly accounting for selection bias. Re-analyzing the RP:P data, we estimate that 22 out of 68 (32%) directional claims were false (upper confidence bound 47%); by comparison, among claims significant at the stricter significance threshold 0.005, we estimate that only 2.2 out of 33 (7%) were false (upper confidence bound 18%). We also construct confidence intervals for the difference in effect size between original and replication studies and, after adjusting for multiplicity, identify five (11%) pairs whose effect sizes declined significantly in replication. We estimate that the effect size declined by at least 20% in the replication study relative to the original study in 16 of the 46 (35%) pairs (lower confidence bound 11%), making no parametric assumptions about the true effect sizes.
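To make the idea of "explicitly accounting for selection bias" concrete, here is a minimal Python sketch (not the authors' code) of one standard post-selection ingredient: a confidence interval for a single study's true effect, on the z-score scale, that conditions on the study having been reported only because its z-statistic cleared a two-sided significance cutoff. The function names, the 1.96 cutoff, and the toy z-value are illustrative assumptions, not values taken from the paper.

```python
from scipy.stats import norm
from scipy.optimize import brentq


def conditional_cdf(z, mu, cutoff):
    """P(Z <= z | |Z| > cutoff) for Z ~ N(mu, 1): the CDF of a z-statistic
    under a truncated-normal model describing survival of a two-sided
    significance filter (a simple model of publication selection)."""
    denom = norm.cdf(-cutoff, loc=mu) + norm.sf(cutoff, loc=mu)
    if z <= -cutoff:
        num = norm.cdf(z, loc=mu)
    else:  # selected z-statistics satisfy |z| > cutoff, so here z > cutoff
        num = norm.cdf(-cutoff, loc=mu) + norm.cdf(z, loc=mu) - norm.cdf(cutoff, loc=mu)
    return num / denom


def selection_adjusted_ci(z_obs, cutoff=1.96, alpha=0.05, width=10.0):
    """Equal-tailed interval for mu obtained by inverting the truncated-normal
    family; its CDF is monotone decreasing in mu, so each endpoint is the
    unique root of a monotone function and brentq applies."""
    lo = brentq(lambda mu: conditional_cdf(z_obs, mu, cutoff) - (1 - alpha / 2),
                z_obs - width, z_obs + width)
    hi = brentq(lambda mu: conditional_cdf(z_obs, mu, cutoff) - alpha / 2,
                z_obs - width, z_obs + width)
    return lo, hi


if __name__ == "__main__":
    z_obs = 2.3  # a hypothetical "just significant" published z-statistic
    naive = (z_obs - 1.96, z_obs + 1.96)
    adjusted = selection_adjusted_ci(z_obs)
    print(f"naive 95% CI for mu:   ({naive[0]:.2f}, {naive[1]:.2f})")
    print(f"selection-adjusted CI: ({adjusted[0]:.2f}, {adjusted[1]:.2f})")
    # The adjusted interval stretches back toward (and past) zero, reflecting
    # that barely significant results are exaggerated on average once only
    # significant effects are selected for publication.
```

Under these assumptions, the naive interval for a z-statistic of 2.3 excludes zero, while the selection-adjusted interval does not, which is the qualitative phenomenon the selection-aware re-analysis of the RP:P data is designed to handle.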