
Data dredging

From WikiMD's Wellness Encyclopedia

[Image: an example of a spurious correlation — spelling bee results and spider-related deaths]
[Image: a study presenting chocolate with high cocoa content as a weight-loss accelerator]

Data dredging, also known as data fishing, data snooping, or p-hacking, is a practice in statistics and data analysis where large volumes of data are searched to find statistically significant patterns or correlations without a prior hypothesis. The term implies a misuse of data analysis techniques and is considered a methodological issue, as it increases the likelihood of finding spurious results. Data dredging is often criticized because it violates the principle of hypothesis testing that is central to the scientific method and statistical inference.

Overview

Data dredging occurs when an analyst iteratively searches a dataset for patterns or relationships without a specific hypothesis in mind. This approach contrasts with traditional statistical methods, where a hypothesis is formulated before data analysis begins. In the context of research, data dredging can lead to the publication of misleading findings, as the probability of finding at least one statistically significant result by chance alone increases with the number of analyses performed.
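This inflation is easy to quantify. Assuming m independent tests each run at significance level α, the probability of at least one false positive — the family-wise error rate — is 1 − (1 − α)^m. A minimal sketch (the function name is illustrative):

```python
def family_wise_error_rate(m, alpha=0.05):
    """Probability of at least one spurious 'significant' result
    across m independent tests at significance level alpha."""
    return 1 - (1 - alpha) ** m

for m in (1, 10, 20, 100):
    print(m, round(family_wise_error_rate(m), 3))
# → 1 0.05
#   10 0.401
#   20 0.642
#   100 0.994
```

At 100 tests, a chance "discovery" is all but guaranteed even when no real effect exists.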

Techniques and Examples

Common techniques of data dredging include:

  • Extensive use of data mining tools without a predefined hypothesis.
  • Performing multiple hypothesis tests on the same dataset.
  • Selectively reporting results that are statistically significant while ignoring those that are not (a practice known as "cherry-picking").
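The second technique can be simulated directly: test many pure-noise variables and count how many come out "significant". The sketch below (parameters are illustrative) uses a z-test for a nonzero mean with known variance, so it needs only the standard library:

```python
import math
import random

random.seed(42)  # fixed seed so the run is reproducible

def dredge(num_variables=200, n=50, z_crit=1.96):
    """Test each pure-noise variable for a nonzero mean and
    count the false 'discoveries' at the two-sided 5% level."""
    hits = 0
    for _ in range(num_variables):
        sample = [random.gauss(0, 1) for _ in range(n)]
        z = (sum(sample) / n) * math.sqrt(n)  # z-statistic, sigma known to be 1
        if abs(z) > z_crit:
            hits += 1
    return hits

print(dredge())  # roughly 5% of 200 variables, despite zero real effects
```

Reporting only those few "hits" while discarding the rest is exactly the cherry-picking described above.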

An example of data dredging is analyzing a dataset from a medical trial by testing numerous combinations of variables until a statistically significant result is found. This result is then presented as if it were the original hypothesis, ignoring the multiple comparisons problem and the increased risk of a Type I error (falsely rejecting a true null hypothesis).
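The number of implicit comparisons in such a search grows quickly: with k variables there are k·(k−1)/2 distinct pairs to test, and at α = 0.05 each carries a 5% false-positive risk. A small illustration (the function name is hypothetical):

```python
from math import comb

def expected_false_positives(num_variables, alpha=0.05):
    """Number of pairwise tests among num_variables variables, and the
    expected count of chance 'significant' results among them."""
    num_tests = comb(num_variables, 2)  # all distinct pairs
    return num_tests, num_tests * alpha

tests, expected = expected_false_positives(20)
print(tests, expected)  # 190 pairwise tests, about 9.5 spurious hits expected
```

Even a modest trial with 20 measured variables can thus be expected to yield several "findings" from noise alone.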

Consequences

The primary consequence of data dredging is the generation of false positives, leading to conclusions that may not be replicable in subsequent studies. This undermines the reliability of scientific research and can contribute to the replication crisis in some fields. Additionally, data dredging can waste resources on follow-up studies designed to investigate findings that are actually artifacts of the dredging process rather than genuine discoveries.

Mitigation Strategies

To mitigate the effects of data dredging, researchers and analysts can adopt several strategies:

  • Pre-registration of studies and hypotheses to commit to a specific analysis plan before examining the data.
  • Correction for multiple comparisons, using statistical techniques such as the Bonferroni correction or procedures that control the False Discovery Rate (FDR), to adjust the significance threshold based on the number of tests performed.
  • Transparency in reporting all analyses conducted, including those that did not yield significant results, to provide a complete picture of the research conducted.
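The two corrections mentioned above can be sketched in a few lines of plain Python. Bonferroni simply divides the threshold by the number of tests (controlling the family-wise error rate), while the Benjamini–Hochberg step-up procedure controls the FDR and is less conservative:

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values below alpha / m (controls family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value clears its BH threshold, rank * alpha / m
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k:
            reject[i] = True
    return reject

ps = [0.001, 0.012, 0.021, 0.030, 0.20]
print(bonferroni(ps))          # → [True, False, False, False, False]
print(benjamini_hochberg(ps))  # → [True, True, True, True, False]
```

With five tests, Bonferroni demands p ≤ 0.01 and keeps only one result, while FDR control retains four — illustrating the trade-off between the two approaches.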

See Also