Good–Turing frequency estimation
Good–Turing frequency estimation is a statistical technique for estimating the probability of encountering a previously unseen event, based on the frequencies of the events already observed in a sample. It is particularly useful in fields such as natural language processing, linguistics, and bioinformatics, where the distribution of rare or unseen phenomena often has to be estimated. The technique was introduced by I. J. Good and Alan Turing during World War II, initially as a cryptanalytic tool, and has since been applied in many scientific disciplines.
Overview
Good–Turing frequency estimation revises the observed frequencies of events in a sample to better estimate their true frequencies in the whole population. The core idea is to discount the frequencies of observed events slightly and reallocate the freed probability mass to events that were not observed. The adjustment rests on the observation that an event missing from a sample does not necessarily have a true frequency of zero in the population.
Formulation
The Good–Turing estimate of the total probability of all unseen events is:
\[ P_0 = \frac{N_1}{N} \]
where \(P_0\) is the probability that the next observation is an event not yet seen, \(N_1\) is the number of distinct events that occur exactly once in the sample, and \(N\) is the total number of observations in the sample (the sum of all event frequencies).
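As a minimal sketch of this formula, the Python snippet below computes \(P_0\) from a list of observations. The function name `unseen_mass` and the toy word list are illustrative assumptions, not part of the original description.

```python
from collections import Counter

def unseen_mass(sample):
    """Good-Turing estimate P_0 = N_1 / N for the total probability of unseen events."""
    counts = Counter(sample)            # event -> observed frequency
    n_total = sum(counts.values())      # N: total number of observations
    if n_total == 0:
        raise ValueError("sample must be non-empty")
    # N_1: number of distinct events seen exactly once
    n_once = sum(1 for c in counts.values() if c == 1)
    return n_once / n_total

# Toy sample (hypothetical): "the" occurs 3 times, five other words once each.
words = "the cat sat on the mat the end".split()
print(unseen_mass(words))  # 5 / 8 = 0.625
```

Here five of the eight observations are singletons, so the estimate assigns a probability of 0.625 to seeing a new word next.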
For events that have been seen, the adjusted frequency \(f^*\) is calculated as:
\[ f^* = (f+1) \frac{N_{f+1}}{N_f} \]
where \(f\) is the observed frequency of the event, \(N_f\) is the number of distinct events that occur exactly \(f\) times in the sample, and \(N_{f+1}\) is the number of distinct events that occur exactly \(f+1\) times. The estimated probability of an event observed \(f\) times is then \(f^*/N\).
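The adjusted-frequency formula can be sketched in Python as follows. This is an illustration of the raw formula only, with no smoothing of the \(N_f\) values; the function name `good_turing_counts` and the toy sample are assumptions made for the example.

```python
from collections import Counter

def good_turing_counts(sample):
    """Map each observed frequency f to its adjusted frequency f* = (f+1) * N_{f+1} / N_f."""
    counts = Counter(sample)                  # event -> observed frequency f
    freq_of_freq = Counter(counts.values())   # f -> N_f (number of distinct events seen f times)
    adjusted = {}
    for f, n_f in freq_of_freq.items():
        n_next = freq_of_freq.get(f + 1, 0)   # N_{f+1}, zero if no event occurs f+1 times
        adjusted[f] = (f + 1) * n_next / n_f
    return adjusted

# Toy sample (hypothetical): a, b, c occur once; d, e twice; f three times.
sample = list("abcddeefff")
adjusted = good_turing_counts(sample)
for f in sorted(adjusted):
    print(f, adjusted[f])
```

With \(N_1 = 3\), \(N_2 = 2\), \(N_3 = 1\) and \(N = 10\), the adjusted frequencies are \(4/3\) for \(f = 1\), \(1.5\) for \(f = 2\), and \(0\) for \(f = 3\). Summing \(N_f \cdot f^*/N\) over the seen frequencies gives 0.7, and adding \(P_0 = N_1/N = 0.3\) brings the total to 1. The adjusted frequency of 0 for the most frequent event arises because no event occurs four times, so \(N_4 = 0\) in the raw formula.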
Applications
Good–Turing frequency estimation has been applied in various fields for different purposes. In natural language processing, it is used for smoothing language models and handling out-of-vocabulary words. In ecology, it helps estimate the number of unseen species in a habitat. In bioinformatics, it aids in predicting the diversity of genetic sequences in a sample.
Limitations
While the Good–Turing method provides a way to estimate unseen events, it has limitations. The accuracy of the estimates can be affected by the size of the sample and the distribution of event frequencies. For very rare events, the method may still overestimate the true frequency. Additionally, the method assumes that the sample is representative of the population, which may not always be the case.
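One way the distribution of event frequencies affects the estimates follows directly from the adjusted-frequency formula: whenever no event occurs exactly \(f+1\) times, \(N_{f+1} = 0\) and the adjusted frequency for events seen \(f\) times becomes zero. The sketch below lists the frequencies affected in a given sample; the function name `unreliable_frequencies` and the toy data are assumptions made for illustration. Practical variants such as Simple Good–Turing address this by smoothing the \(N_f\) values before applying the formula, which this sketch does not attempt.

```python
from collections import Counter

def unreliable_frequencies(sample):
    """Return the observed frequencies f for which N_{f+1} = 0,
    i.e. where the unsmoothed adjusted frequency f* is zero."""
    freq_of_freq = Counter(Counter(sample).values())  # f -> N_f
    return sorted(f for f in freq_of_freq if freq_of_freq.get(f + 1, 0) == 0)

# Toy sample (hypothetical): a occurs 6 times, b twice, c once.
sample = list("aaaaaabbc")
print(unreliable_frequencies(sample))  # [2, 6]
```

In this sample only \(f = 1\) has a usable raw estimate, since nothing occurs 3 or 7 times. The largest observed frequency is always affected, and gaps become more common as \(f\) grows.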
Contributors: Prab R. Tumpati, MD