Ars Technica has a fascinating report on a major data-mining project involving the Congressional Record. The project uses an unsupervised learning program to automatically assign each of several million speeches to one of forty (automatically generated) categories. Using this method, the authors show some interesting variation in agenda salience over time, which suggests that the method is sensitive to fluctuations in the agenda. The model could be ported relatively easily to other languages, including Italian, as long as one has a program to 'stem' the words (reduce each to its root form) so as to populate the vocabulary matrix.
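To make the stemming step a little more concrete, here is a minimal Python sketch of how stemmed words might populate a vocabulary matrix (one row per speech, one column per stem). It is only an illustration and not the project's actual code: the suffix list, the `naive_stem` and `build_matrix` helpers, and the two sample speeches are all made up for this example, and a real project would swap in a proper language-specific stemmer (for Italian, something like a Snowball stemmer).

```python
import re
from collections import Counter

# Hypothetical suffix list, for illustration only. A real stemmer
# (Porter, Snowball, ...) applies far more careful rules.
SUFFIXES = ("ations", "ation", "ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters only."""
    return re.findall(r"[a-z]+", text.lower())


def build_matrix(speeches):
    """Return (vocab, matrix), where matrix[i][j] counts stem j in speech i."""
    counts = [Counter(naive_stem(t) for t in tokenize(s)) for s in speeches]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


# Made-up sample speeches, not taken from the Congressional Record.
speeches = [
    "The Senate debated taxes and taxing powers.",
    "Senators debating defense spending.",
]
vocab, matrix = build_matrix(speeches)

for row in matrix:
    print(dict(zip(vocab, row)))
```

The point of stemming here is that "taxes" and "taxing" land in the same column, so the matrix stays smaller and related word forms count together. Porting to another language would mostly mean replacing `naive_stem` (and probably the tokenizer, to handle accented letters).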
Technorati Tags: data+mining, US, Senate, Hansard, agenda, salience