Researcher degrees of freedom in statistical software contribute to unreliable results: A comparison of nonparametric analyses conducted in SPSS, SAS, Stata, and R.

Cooper B Hodges, Bryant M Stone, Paula K Johnson, James H Carter, Chelsea K Sawyers, Patricia R Roby, Hannah M Lindsey
Author Information
  1. Cooper B Hodges: Department of Neurology, University of Utah School of Medicine, Salt Lake City, UT, USA.
  2. Bryant M Stone: Department of Psychology, Southern Illinois University, Carbondale, 1125 Lincoln Drive, Carbondale, IL, 62901, USA. Bryant.Stone@siu.edu.
  3. Paula K Johnson: Department of Neurology, University of Utah School of Medicine, Salt Lake City, UT, USA.
  4. James H Carter: Department of Psychology, Stanford University, Stanford, CA, USA.
  5. Chelsea K Sawyers: Department of Psychiatry, Virginia Commonwealth University, Richmond, VA, USA.
  6. Patricia R Roby: Center for Injury Research and Prevention, Children's Hospital of Philadelphia, Philadelphia, PA, USA.
  7. Hannah M Lindsey: Department of Neurology, University of Utah School of Medicine, Salt Lake City, UT, USA.

Abstract

Researcher degrees of freedom can affect the results of hypothesis tests and, consequently, the conclusions drawn from the data. Previous research has documented variability in the accuracy, speed, and documentation of output across statistical software packages. In the current investigation, we conducted Pearson's chi-square test of independence, Spearman's rank-ordered correlation, Kruskal-Wallis one-way analysis of variance, Wilcoxon Mann-Whitney U rank-sum tests, and Wilcoxon signed-rank tests, along with estimates of skewness and kurtosis, on large, medium, and small samples of real and simulated data in SPSS, SAS, Stata, and R, and compared the results with those obtained through hand calculation using the raw computational formulas. We found multiple inconsistencies in the results across statistical packages, attributable to algorithmic variation, computational error, and differences in statistical output. The most notable inconsistencies arose from algorithmic variations in the computation of Pearson's chi-square test on 2 × 2 tables, where the p-values reported by different software packages differed by .005 to .162, largely as a function of sample size. We discuss how such inconsistencies may influence the conclusions drawn from statistical analyses depending on the software used, and we urge researchers to analyze their data in multiple packages to check for inconsistencies and to report details of the statistical procedure used for data analysis.
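
To illustrate how an algorithmic default can shift a p-value on a 2 × 2 table, the following Python sketch computes Pearson's chi-square with and without a continuity correction (Yates). The abstract does not state which algorithmic variations were responsible, so the choice of the continuity correction, the function name `chi2_2x2`, and the cell counts are assumptions made for illustration only; they are not the authors' data or method.

```python
import math


def chi2_2x2(a, b, c, d, yates=False):
    """Pearson chi-square test for the 2 x 2 table [[a, b], [c, d]].

    Returns (statistic, p_value). With yates=True, the continuity
    correction subtracts n/2 from |ad - bc| (floored at zero).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(0.0, diff - n / 2)
    stat = n * diff ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, the upper-tail chi-square probability
    # equals erfc(sqrt(stat / 2)), so no external library is needed.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical cell counts (illustration only, not from the study).
table = (10, 4, 5, 11)

stat_plain, p_plain = chi2_2x2(*table, yates=False)
stat_yates, p_yates = chi2_2x2(*table, yates=True)

print(f"uncorrected: chi2 = {stat_plain:.4f}, p = {p_plain:.4f}")
print(f"corrected:   chi2 = {stat_yates:.4f}, p = {p_yates:.4f}")
print(f"difference in p: {p_yates - p_plain:.4f}")
```

In this small hypothetical table the two p-values fall on opposite sides of .05, so a package that applies the correction by default and one that does not would suggest different conclusions from the same data. Reporting which variant was used removes that ambiguity.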

MeSH Term

Humans
Software
Sample Size
Research Design
Chi-Square Distribution
Correlation of Data
