
Lecture 10: Language Modeling



  • Download the Project Gutenberg text for "Alice in Wonderland". 
  • Compute uni-, bi-, and trigram statistics for the entire text.  Remove everything other than sequences of alphabetic characters.  Just treat the text as one long sequence of words and don't worry about the first few and last few words.
  • It's up to you how you represent your models (as a class, as a concrete data structure consisting of tuples/lists, etc.)
  • Write a function or method sample_string(model,n) that generates text samples consisting of n words from your models according to the uni-, bi-, and trigram statistics.
  • Write a function or method string_cost(model, string) that computes the log probability of a given string under each model (returning -infinity for strings containing n-grams that don't occur in the text).
  • Write a function or method per_word_perplexity(model, string) that computes the per-word perplexity of a new string under a given model.
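One possible way to organize the steps above is sketched below in Python. It is a minimal illustration and not a reference solution. Several choices are assumptions rather than part of the assignment: the helper names tokenize and train_ngram, the dictionary representation of a model, lowercasing the text, passing the string as a second argument to string_cost and per_word_perplexity, and using unsmoothed maximum-likelihood estimates. The first order-1 words of a string are used only as context and are not scored.

```python
import math
import random
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Keep only runs of alphabetic characters (lowercasing is an assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def train_ngram(words, order):
    """Count, for each (order-1)-word context, how often each next word follows."""
    counts = defaultdict(Counter)
    for i in range(len(words) - order + 1):
        context = tuple(words[i:i + order - 1])
        counts[context][words[i + order - 1]] += 1
    # Hypothetical model representation: a plain dict.
    return {"order": order, "counts": dict(counts)}


def sample_string(model, n, rng=random):
    """Generate n words by repeatedly sampling the next word given its context."""
    order, counts = model["order"], model["counts"]
    contexts = sorted(counts)
    out = list(rng.choice(contexts))  # seed with a context seen in the text
    while len(out) < n:
        context = tuple(out[len(out) - order + 1:])
        followers = counts.get(context)
        if followers is None:
            # Dead end (context seen only at the very end of the text):
            # restart from a random known context.
            out.extend(rng.choice(contexts))
            continue
        words, weights = zip(*followers.items())
        out.append(rng.choices(words, weights=weights)[0])
    return " ".join(out[:n])


def string_cost(model, string):
    """Log probability of the string; -inf if any n-gram was never observed."""
    order, counts = model["order"], model["counts"]
    words = tokenize(string)
    logp = 0.0
    for i in range(order - 1, len(words)):
        context = tuple(words[i - order + 1:i])
        followers = counts.get(context, {})
        count = followers.get(words[i], 0)
        if count == 0:
            return -math.inf
        logp += math.log(count / sum(followers.values()))
    return logp


def per_word_perplexity(model, string):
    """exp of the negative average log probability per scored word."""
    scored = len(tokenize(string)) - model["order"] + 1
    if scored <= 0:
        raise ValueError("string is too short for this model order")
    logp = string_cost(model, string)
    if logp == -math.inf:
        return math.inf
    return math.exp(-logp / scored)
```

To try it on the book, read the downloaded text from a local file (the file name is up to you), pass it through tokenize, and build the three models with train_ngram(words, 1), train_ngram(words, 2), and train_ngram(words, 3). Because the estimates are unsmoothed, most new strings will receive -infinity under the trigram model.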

Thomas M. Breuel,
Jun 23, 2010, 8:29 AM