Not long after Dalhousie University launched its Institute for Big Data Analytics last year, the new research unit struck up an innovative partnership with the Department of Foreign Affairs, Trade and Development and an Ottawa-based firm, GSTS. The goal, says Stan Matwin, a computer scientist and the Dalhousie institute’s director, is to sift through massive amounts of satellite data on ship movements as a way of analyzing typical and atypical trajectories of large freighters and other sea-bound vessels. Since satellites gather about four million readings per day, says Dr. Matwin, “This, by definition, is a big-data problem.”
If scientists can use the data to develop models for normal ship movements along a particular ocean route, they will be in a better position to identify ships that are travelling erratically – due to inclement weather, for example, or for more nefarious reasons like piracy. Coast guard agencies can use the information to deploy security vessels or assist ships navigating into busy ports of call, explains Dr. Matwin. It’s an exciting opportunity, he says, for the institute’s data scientists and graduate students to apply sophisticated technical solutions to real-world problems.
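To make the idea concrete, here is a minimal sketch of how a model of "normal" movement might flag an erratic ship. It is not the institute's method. It assumes each historical track has been reduced to a ship's sideways offset from the route, in kilometres, at a few fixed waypoints, and it applies a plain standard-deviation threshold. The function names, the threshold and the sample numbers are all invented for illustration.

```python
from statistics import mean, stdev


def build_route_model(tracks):
    """For each waypoint, return (mean, stdev) of the historical offsets in km."""
    # zip(*tracks) groups the readings taken at the same waypoint together.
    return [(mean(column), stdev(column)) for column in zip(*tracks)]


def is_atypical(track, model, threshold=3.0):
    """Flag a track if any reading is more than `threshold` stdevs from the mean."""
    return any(
        abs(offset - mu) > threshold * sigma
        for offset, (mu, sigma) in zip(track, model)
    )


# Hypothetical offsets (km) for four past voyages at three waypoints.
historical_tracks = [
    [0.0, 1.0, 2.0],
    [0.5, 1.5, 2.5],
    [-0.5, 0.5, 1.5],
    [0.2, 1.2, 2.2],
]
model = build_route_model(historical_tracks)

typical_ship = [0.1, 1.1, 2.1]
erratic_ship = [0.1, 6.0, 2.1]  # strays far off the route at the second waypoint

print(is_atypical(typical_ship, model))
print(is_atypical(erratic_ship, model))
```

A real system would work with millions of readings and far richer models, but the principle is the same: learn what typical looks like, then measure how far a new trajectory departs from it.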
Such undertakings reveal a sharp uptick in interest – among both academics and their students – in big-data research at universities in Canada and around the world. Computer science faculties have been teaching and researching very technical topics related to database management, data mining and machine learning for many years, but the potential of big data and its applications goes well beyond these bounds.
Indeed, with dramatic increases in computing power and the exponential growth in the amount of data being collected, many universities are connecting sophisticated big-data analytics with applications in areas such as business, health care and public policy. In some institutions, like the University of Toronto, these connections are emerging organically through interdisciplinary research teams. In others – Dalhousie, as well as Ryerson University, Simon Fraser University and the University of Calgary – administrators are responding to mounting student and industry demand by establishing specialized big-data departments, courses, degrees and external partnership arrangements. Some of these institutions have nominated candidates for Canada Research Chairs in big-data analytics research; Dalhousie’s Dr. Matwin holds one of the first such posts.
With all this interest, big-data analytics appears to have reached a tipping point – or an “inflection point,” as Tamer Özsu puts it. Dr. Özsu, a University of Waterloo professor of computer science, compares the surge in interest to the explosion in genomics research in the early part of this century.
Funding agencies are taking note. In the United States, says Dr. Özsu, the Obama administration has made big-data research a priority. There’s no comparable program here, but Canada’s three major research granting agencies (the Natural Sciences and Engineering Research Council, the Social Sciences and Humanities Research Council and the Canadian Institutes of Health Research), together with the Canada Foundation for Innovation, released a consultation document asking for feedback on the components of a granting program that would support research into the management of big data.
As the document notes, “The focus of data analysis is rapidly shifting to embrace not simply technical development but also new ways of thinking about social, economic and cultural expression and behaviour. Indeed, innovative information and communications technologies are enabling the transformation of the fabric of society itself, as data becomes the new currency for research, education, government and commerce.”
Despite the field’s potential, there is no generally agreed-upon definition of big data, a catch-all phrase that seems to apply to a very broad range of information. Wikipedia defines it as “a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.”
Real-world application
Examples of big data include the torrent of GPS signals emitted by cell phones, the transaction records that accumulate in the servers of companies with busy e-commerce sites, and the enormous number of keystrokes logged on workplace computers, perhaps to monitor employee performance. As Wikipedia notes, managing and understanding data sets that contain so many types of information represents an entirely different sort of analysis from more traditional research approaches.
In fact, standard statistical tools may not generate meaningful predictions because samples that appear to be large by conventional research standards may represent only a tiny slice of the overall data set. Programmers may be able to gather and analyze tens of thousands of tweets on Twitter, for example, but these may account for a mere fraction of the total, thus limiting generalizations about the data.
Because of this, says Dr. Özsu of U of Waterloo, data experts focus on the four Vs – volume, velocity, variety and validity – when they work with big data. Volume refers to the amount of data; velocity to the speed of data processing; variety to the number of types of data; and validity (or sometimes “veracity”) to the uncertainty of data.
By definition, then, there is a great deal of data in many different formats, and the programming tools must be capable of analyzing them quickly and accurately. In some cases, the information may be extremely heterogeneous – a vast soup that can include snippets of text and images and all sorts of background noise.
Beginning to explain the patterns involves techniques for categorizing the types of data and tools for “cleaning” databases of extraneous information. Periklis Andritsos, an assistant professor at the University of Toronto’s faculty of information, says one tool he finds useful measures connections between data points that aren’t numerical in nature – for example, the frequency with which certain words or names come up in relation to other search phrases. These analytical methods can be used in a wide range of contexts. “Everywhere you see data,” he says, “you see opportunities for these applications.”
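As a rough illustration of that kind of non-numerical measurement, the sketch below counts how often pairs of words turn up together in the same search phrase. It is not Dr. Andritsos's tool; the function name `cooccurrence_counts` and the sample phrases are made up for this example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(phrases):
    """Count how often each unordered pair of distinct words shares a phrase."""
    counts = Counter()
    for phrase in phrases:
        # Lower-case and de-duplicate so each pair counts once per phrase;
        # sorting gives every pair a single, predictable key order.
        words = sorted(set(phrase.lower().split()))
        for pair in combinations(words, 2):
            counts[pair] += 1
    return counts


# Hypothetical search phrases, invented for illustration.
phrases = [
    "halifax ship tracking",
    "ship tracking satellite",
    "halifax weather",
]
counts = cooccurrence_counts(phrases)

# The most frequent pair hints at which words are most strongly connected.
print(counts.most_common(1))
```

Pairs that co-occur far more often than chance would suggest are candidates for a meaningful connection, which is the sort of signal that can then be explored across much larger collections of text.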
The world of finance is one important area. Dennis Kira, a professor of supply chain and business technology management at Concordia University’s John Molson School of Business, says different applications exist for fields like credit-card fraud detection, equity trading and forensic accounting, with systems designed to sift through billions of transactions and look for anomalies. “That’s why the banks are really gung-ho,” says Dr. Kira, who has taught a data-mining course for five years for finance, marketing and management students. “It’s like looking for a needle in a haystack,” but with the new applications “you know it when you see it.”
Not surprisingly, most universities that have moved to establish big-data programs have done so with industry partnerships. Besides the marine project, Dalhousie in 2011 established a $7-million relationship between its computer science department and Boeing Co. to research aviation safety by mining and assessing the reams of data produced by every aircraft. The institute has also embarked on an environmental monitoring project with the World Wildlife Fund.
Ryerson University has partnered with OMERS Ventures and the Ontario Centres of Excellence to launch a technology-oriented accelerator known as One Eleven, says Mohamed Lachemi, the university’s vice-president, academic. Based in a Google Canada facility in downtown Toronto, it will provide space and facilities to entrepreneurs with start-up ventures related to big data. Besides hiring two research chairs and establishing a master’s program in big data, Ryerson plans to seek senate approval to offer a certificate in big-data analytics through the Chang School of Continuing Studies.
Dr. Lachemi recently found himself discussing the potential of using big-data analytical techniques with an administrator at a Toronto hospital. “They don’t necessarily have the infrastructure to do this,” he says. “If we create a platform with people from all different disciplines around the table, we can address problems in a better way.”
Meanwhile, research teams at U of T, the University of California, Berkeley and New York University’s Centre for Urban Science and Progress, or CUSP, are looking at using repositories of “urban informatics” – from 311 service calls to real-time traffic-sensor signals to energy-consumption levels of buildings – to develop models that help decision makers deploy resources more efficiently and make large-scale infrastructure investments. Research of this kind may help answer the question of how to improve urban quality of life.
CUSP will offer graduate degrees in urban data science and will sponsor research that uses the city itself as both a lab and a source of raw material. One team at the centre will deploy sensitive sound-detection equipment around Manhattan to develop topographical maps of the city’s infamous noise levels. Aristides Patrinos, CUSP’s deputy director for research, says the analysis could be used to develop strategies to mitigate noise pollution in residential areas and around schools. “These [measures] are not huge leaps of faith,” he says.
Many students are drawn to the field of big data because it offers them a chance to get involved in an emerging knowledge-intensive profession. When it launched its master’s program on big data, Simon Fraser University cited a 2011 McKinsey Global Institute study that projected huge demand for people with skills in big-data management and analysis. By 2018, the study said, there could be a shortage of 140,000 to 190,000 workers in a field the Globe and Mail described as the “fastest-growing job market you’ve never heard of.”
Dr. Matwin agrees that many doctoral students and postdoctoral fellows drawn to Dalhousie’s Institute for Big Data Analytics are looking to equip themselves with skills that will be in high demand in the future. To that end, Dalhousie and Université de Montréal are developing proposals for a master’s program in big-text data as well as an undergraduate computer science degree with a specialization in big data. SFU, meanwhile, recently announced that its school of computing science is offering a new, four-semester professional master’s program in big data starting this fall.
Universities are marketing these programs as interdisciplinary in nature. Dalhousie’s big-data students can work on applications in business or medicine. In U of T’s information faculty, the students are about evenly split between those who want to learn the technical elements of big data and those who are interested in its potential applications, says Dr. Andritsos. These students, from fields such as engineering, architecture and the social sciences, “are excellent at figuring out what stories the data is telling.” Some, he says, want to pursue academic research and some want to get involved with start-up companies.
What’s clear is that the big-data skills that students acquire through these programs are increasingly valuable outside academia. Dr. Matwin says the practical experience Dalhousie’s new institute can provide is extremely important for the growing number of private- and public-sector jobs that rely on the analytical skills the students are learning. “We are working to meet the demand,” he says.
John Lorinc is a Toronto-based journalist who frequently writes about urban issues.