Research papers throw around “statistically significant” constantly, and most people reading them don’t really get what it means. The concept matters because it determines which findings get taken seriously and which ones researchers dismiss as artefacts of the data collection and analysis process. You don’t need a statistics degree to understand it, but the way academics explain it doesn’t help.
What Statistical Significance Actually Means
Statistical significance measures how likely it is that what you observed could have happened even if there were no true effect at all, in other words, that the result is just the variability inherent in any measurement or data-gathering exercise. In null hypothesis significance testing (NHST), the cutoff adopted in many disciplines and by many regulators is a p-value below 0.05 before a researcher can declare with confidence that the data are unlikely to have been generated under the tested null hypothesis. A p-value of 0.05 means that if there were truly no effect, results at least this extreme would show up only 5 percent of the time.
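To make that concrete, here is a minimal sketch of how a p-value typically gets computed in practice, using a two-sample t-test in Python. The group names and values are made up purely for illustration.

```python
# Minimal sketch: a two-sample t-test on made-up data.
from scipy import stats

control = [82, 75, 91, 68, 77, 84, 79, 88, 73, 80]
treated = [85, 90, 78, 95, 88, 83, 92, 87, 81, 89]

# Null hypothesis: both groups share the same mean.
t_stat, p_value = stats.ttest_ind(treated, control)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A p-value below 0.05 would conventionally be called "significant":
# data this extreme would be unusual if there were truly no difference.
```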
That 0.05 threshold is pretty arbitrary when you think about it. It became standard decades ago and everyone stuck with it, but there’s nothing magical about 5 percent versus 6 percent or 4 percent. Some disciplines use far more stringent cutoffs, particle physics’ five-sigma standard for declaring a discovery being the best-known example.
A common confusion with p-values: a p-value of 0.03 does not mean there’s a 97 percent chance the hypothesis is correct. It means that if there were actually no effect, you’d see results this extreme or more extreme only 3 percent of the time. Those sound similar, but they are logically completely different statements. Even researchers who should know better mix this up in their papers.
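One way to internalize the difference is to simulate many experiments in which the null hypothesis is true by construction, and count how often a “significant” result still appears. A rough sketch, with all numbers invented:

```python
# Sketch: simulate experiments where the null is TRUE (no effect at all)
# and count how often the p-value still dips below 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
false_positives = 0
n_experiments = 10_000

for _ in range(n_experiments):
    a = rng.normal(0, 1, 30)   # both groups drawn from the same distribution,
    b = rng.normal(0, 1, 30)   # so any "effect" is pure noise
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

print(false_positives / n_experiments)  # ~0.05: the p-value describes the data
                                        # given the null, not the probability
                                        # that a hypothesis is true
```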
Sample Size Messes Things Up
Sample size affects significance in ways that constantly throw people off. Small studies need bigger effects to reach significance because there’s more uncertainty; large studies can find tiny, meaningless differences and call them significant just because they had enough data. That creates problems when you’re trying to read research and figure out whether a finding actually matters.
Say a study has 10,000 participants and finds drinking an extra glass of water correlates with 0.5 percent better test scores. That could hit statistical significance but who actually cares about half a percent? The paper claims significance though, journalists write headlines, everyone acts like it’s a breakthrough. Small studies have the opposite issue where they might miss real effects because they didn’t get enough people.
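You can see both sides of this with a quick simulation. The sketch below uses invented numbers and a deliberately tiny effect of 0.05 standard deviations, then runs the same comparison at two sample sizes:

```python
# Sketch: the same tiny effect tested at two sample sizes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
tiny_effect = 0.05  # a shift most people would never notice

for n in (50, 10_000):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(tiny_effect, 1.0, n)
    p = stats.ttest_ind(treated, control).pvalue
    print(f"n = {n:>6}: p = {p:.4f}, "
          f"mean difference = {treated.mean() - control.mean():.3f}")
# With n = 50 the difference is usually "not significant"; with n = 10,000 the
# same negligible difference usually is. Significance says little about importance.
```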
Researchers need to decide on the significance level and set a sample size before collecting data (or adopt a method designed for sequential analysis), otherwise it’s easy to look at what the data shows and then pick an alpha that makes the result significant. Equally bad is so-called “peeking”, where you compute a p-value after every few new observations and stop as soon as it crosses the threshold, which violates the logic behind standard significance calculations. That invalidates the result, but it happens anyway in poorly conducted studies.
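Here is a rough simulation of why peeking is a problem: even with no true effect, stopping the first time the p-value dips below 0.05 produces far more “significant” results than the nominal 5 percent. Everything in it is made up for illustration.

```python
# Sketch: "peeking" — checking the p-value every 10 observations and stopping
# the moment it drops below 0.05 — even though there is no true effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_experiments = 2_000
max_n, batch = 200, 10
hits = 0

for _ in range(n_experiments):
    a = rng.normal(0, 1, max_n)
    b = rng.normal(0, 1, max_n)  # null is true: no real difference
    for n in range(batch * 2, max_n + 1, batch):
        if stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05:
            hits += 1
            break

print(hits / n_experiments)  # noticeably above 0.05 — repeated peeking inflates
                             # the false positive rate of a fixed 0.05 threshold
```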
The Replication Crisis Exposed Problems
The American Statistical Association put out a statement warning against misuse of p-values, saying the p-value was never supposed to be a substitute for scientific reasoning. Psychology and the social sciences went through a replication crisis recently: lots of famous studies couldn’t be reproduced when other researchers repeated them, which was embarrassing. Many significant findings turned out to be just noise that happened to cross the p-value threshold through peeking, multiple testing, or other misapplications of statistical methodology. These practices are collectively known as p-hacking: running analysis after analysis until something shows significance. Test enough variables or slice the data enough ways and you’ll find significance somewhere. It isn’t always deliberate fraud; sometimes researchers think they’re just exploring the data properly. It still inflates how many significant findings appear in journals.
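A small simulation makes the multiple-testing point concrete. The sketch below invents 20 outcome variables that have no real relationship to a random grouping and tests each one:

```python
# Sketch: test 20 unrelated outcome variables against a random grouping.
# None of them has a real relationship, yet something often comes up "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
group = rng.integers(0, 2, 100)          # random split into two groups
outcomes = rng.normal(0, 1, (100, 20))   # 20 outcomes, all pure noise

p_values = [
    stats.ttest_ind(outcomes[group == 0, i], outcomes[group == 1, i]).pvalue
    for i in range(20)
]
print(min(p_values))                         # frequently below 0.05 by luck alone
print(sum(p < 0.05 for p in p_values), "of 20 tests 'significant'")
# With 20 independent tests at alpha = 0.05, the chance of at least one
# false positive is about 1 - 0.95**20, roughly 64 percent.
```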
Publication bias compounds everything. Journals want to publish significant findings, not studies where nothing happened. Failed replications often don’t get published, so the literature ends up full of false positives. Significant studies get attention; null results disappear.
What to Look For When Reading Research
One thing you can do is calculate statistical significance from the raw data yourself when it’s available. Replicating the p-value calculation also forces you to look at the effect size, the magnitude of the effect, which is what tells you whether it matters in practice. Beyond effect size, look at the sample size, confidence intervals, correlation coefficients, and anything else that puts the p-values in context. Confidence intervals are particularly useful: a wide interval means high uncertainty, while a tight interval that excludes zero shows the results are not only significant but also pin down the true effect reasonably well. Guides to choosing and applying the right statistical test are easy to find if you want to go deeper.
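As a sketch of what that looks like in practice, here is a Python snippet that computes the difference in means, Cohen’s d as an effect size, and a 95% confidence interval. The data values are invented for illustration.

```python
# Sketch: look past the p-value — effect size and a 95% confidence interval
# for the difference in means, using a standard pooled-variance t approach.
import numpy as np
from scipy import stats

control = np.array([82, 75, 91, 68, 77, 84, 79, 88, 73, 80], dtype=float)
treated = np.array([85, 90, 78, 95, 88, 83, 92, 87, 81, 89], dtype=float)

diff = treated.mean() - control.mean()
n1, n2 = len(control), len(treated)
pooled_var = ((n1 - 1) * control.var(ddof=1) + (n2 - 1) * treated.var(ddof=1)) / (n1 + n2 - 2)
cohens_d = diff / np.sqrt(pooled_var)         # effect size in standard-deviation units

se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))  # standard error of the difference
t_crit = stats.t.ppf(0.975, n1 + n2 - 2)      # critical value for a 95% interval
print(f"difference = {diff:.1f} points, Cohen's d = {cohens_d:.2f}")
print(f"95% CI: ({diff - t_crit * se:.1f}, {diff + t_crit * se:.1f})")
# A wide interval means high uncertainty; a narrow one pins the effect down.
```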
Another thing you can do is check whether the researchers pre-registered their hypotheses and analysis plans before starting. Pre-registration guards against p-hacking because the researchers commit to an approach before seeing what the data looks like, which makes the results more trustworthy. Studies without pre-registration deserve more skepticism, especially in fields that already have replication issues.
Multiple studies showing the same thing beat one significant result every time. A single study proves less than people think, even when the significance looks impressive on paper, partly because of external validity (whether the outcome generalizes beyond that one sample and setting). Meta-analyses that combine multiple studies give better evidence than individual papers, and secondary evidence from systematic reviews helps offset the false discovery rate that comes with the 0.05 threshold.
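For intuition, here is a minimal sketch of the inverse-variance weighting behind a fixed-effect meta-analysis. The study estimates and standard errors are invented purely to show the arithmetic.

```python
# Sketch: pool several study estimates, weighting each by the inverse of its variance.
import numpy as np

effects = np.array([0.30, 0.12, 0.25, 0.08])      # per-study effect estimates (invented)
std_errors = np.array([0.15, 0.10, 0.20, 0.06])   # per-study standard errors (invented)

weights = 1.0 / std_errors**2                      # more precise studies count more
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))
print(f"pooled effect = {pooled:.3f} ± {1.96 * pooled_se:.3f} (95% CI half-width)")
```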
Conclusion
Even when statistical significance is calculated properly, you still need good methodology, appropriate sample sizes, and honest reporting. Statistical significance is one piece of the picture, not the whole thing. Reading research means looking past the p-values to understand what actually happened in the study and whether the findings are likely to hold up. Significance gets the headlines, but it doesn’t automatically mean the research matters or will replicate when someone else tries it.