Using zipf frequencies as a representativeness measure in statistical active learning of natural language

Çobanoğlu, Onur
Active learning has proven to be a successful strategy for rapidly developing corpora used in the statistical induction of natural language. The vast majority of studies in this field have concentrated on finding and testing informativeness measures for samples, whereas representativeness measures have not been thoroughly studied. In this thesis, we introduce a novel representativeness measure that is based on Zipf's law, and is therefore model-independent, and that is validated both theoretically and empirically. Experiments conducted on the WSJ corpus with a wide-coverage parser show that our representativeness measure leads to better performance than previously introduced representativeness measures when used with most of the known informativeness measures.


O. Çobanoğlu, “Using zipf frequencies as a representativeness measure in statistical active learning of natural language,” M.S. - Master of Science, Middle East Technical University, 2008.