Wednesday, May 5, 2010

The Brown Corpus

Speech recognizers make educated guesses at what is being said. They play the odds. For example, the phrase “serve as the inspiration,” is ten times more likely than “serve as the installation,” which sounds similar. Such statistical models become more precise given more data. Helpfully, the digital word supply leapt from essentially zero to about a million words in the 1980s when a body of literary text called the Brown Corpus became available. Millions turned to billions as the Internet grew in the 1990s. Inevitably, Google published a trillion-word corpus in 2006. Speech recognition accuracy, borne aloft by exponential trends in text and transistors, rose skyward. But it couldn’t reach human heights. —Robert Fortner, "Rest in Peas: The Unrecognized Death of Speech Recognition"


(Via Jenny D)
