With more than a billion words in its corpus, Hansard contains multitudes. It is a massive record of every word spoken in Canada’s parliamentary debates since 1880, and thanks to a team of researchers at the University of Toronto Scarborough, Hansard is now fully searchable online as far back as 1901.
The new version is part of LiPaD: The Linked Parliamentary Data Project, a spinoff of another research question, according to principal investigator Christopher Cochrane, associate professor of political science at UTSC. In 2013, Dr. Cochrane, along with two PhD students, two postdoctoral fellows and Graeme Hirst, professor of computer science at UTSC, began research involving Hansard to see whether computers could detect the ideology and emotions of speakers based on their word patterns. At the time, the digitized version of Hansard mostly comprised scanned hard-copy transcripts saved as PDFs.
“We realized the data was not up to a level that would allow us to carry out the research, that we would have to undertake the pretty arduous task of converting essentially pictures of pages into machine-readable text,” Dr. Cochrane says. Now that the team has completed that conversion using optical character recognition software, he hopes the new tool will allow researchers and the public to study Hansard more effectively.
“This is a really rich repository of information about Canadian history,” he says. “People who are interested in this kind of work are going to be able to explore these and other data and make interesting connections we couldn’t even begin to fathom.” While only English proceedings are currently available through LiPaD, the team is interested in making the French text available as well.
The LiPaD website features an advanced search function with filters for politician and party. It also allows people to navigate a timeline by decade and download data sets for analysis. The site currently gets about 1,300 unique visitors per month, according to Dr. Cochrane.