Good–Turing frequency estimation

From WikiMD's Wellness Encyclopedia

Good–Turing frequency estimation is a statistical technique used to predict the probability of encountering an unseen event in a sample, based on the frequencies of events already observed. This method is particularly useful in the field of natural language processing, linguistics, and bioinformatics, where it is often necessary to estimate the distribution of rare or unseen phenomena. The technique was introduced by I.J. Good and Alan Turing during World War II, initially as a cryptanalytic tool, but has since found widespread application in various scientific disciplines.

Overview[edit | edit source]

The Good–Turing frequency estimation revises the observed frequencies of events in a sample to better estimate the true frequencies in the entire population. The core idea is to reduce the frequency of observed events slightly and allocate that frequency to unseen events. This adjustment is based on the observation that if an event has never been seen in a sample, it does not mean that its true frequency in the population is zero.

Formulation[edit | edit source]

The formula for the Good–Turing estimate for the frequency of unseen events is given by:

\[ P_0 = \frac{N_1}{N} \]

where \(P_0\) is the probability of encountering an unseen event, \(N_1\) is the number of events that occur exactly once in the sample, and \(N\) is the total number of events observed.

For events that have been seen, the adjusted frequency \(f^*\) is calculated as:

\[ f^* = (f+1) \frac{N_{f+1}}{N_f} \]

where \(f\) is the original frequency of the event, \(N_f\) is the number of events that occur exactly \(f\) times in the sample, and \(N_{f+1}\) is the number of events that occur exactly \(f+1\) times.

Applications[edit | edit source]

Good–Turing frequency estimation has been applied in various fields for different purposes. In natural language processing, it is used for smoothing language models and handling out-of-vocabulary words. In ecology, it helps estimate the number of unseen species in a habitat. In bioinformatics, it aids in predicting the diversity of genetic sequences in a sample.

Limitations[edit | edit source]

While the Good–Turing method provides a way to estimate unseen events, it has limitations. The accuracy of the estimates can be affected by the size of the sample and the distribution of event frequencies. For very rare events, the method may still overestimate the true frequency. Additionally, the method assumes that the sample is representative of the population, which may not always be the case.

See Also[edit | edit source]

References[edit | edit source]

Good–Turing frequency estimation Resources
Wikipedia
WikiMD
Navigation: Wellness - Encyclopedia - Health topics - Disease Index‏‎ - Drugs - World Directory - Gray's Anatomy - Keto diet - Recipes

Search WikiMD

Ad.Tired of being Overweight? Try W8MD's physician weight loss program.
Semaglutide (Ozempic / Wegovy and Tirzepatide (Mounjaro / Zepbound) available.
Advertise on WikiMD

WikiMD's Wellness Encyclopedia

Let Food Be Thy Medicine
Medicine Thy Food - Hippocrates

WikiMD is not a substitute for professional medical advice. See full disclaimer.
Credits:Most images are courtesy of Wikimedia commons, and templates Wikipedia, licensed under CC BY SA or similar.

Contributors: Prab R. Tumpati, MD