The age of Big Data is upon us. Fuelled by an incendiary mix of overblown claims and dire warnings, the public debate over the handling and exploitation of digital information on an astronomically large scale has been framed in stark terms: on one side are transformative forces that could immeasurably improve the human condition; on the other, powers so subversive and toxic that a catastrophic erosion of fundamental liberties looks inevitable.
The tension between these opposites has marooned the discussion of Big Data. It is stuck somewhere between Bletchley Park – the former Government Communications Headquarters (GCHQ) location where the godfather of the computational universe, Alan Turing, primed today's Big Data explosion during the Second World War – and the satirical tomfoolery of South Park, which recently portrayed the living core of all data as an incarcerated Father Christmas cruelly wired up to a machine by the US's National Security Agency (NSA).
We know from Edward Snowden's widely publicised whistle-blowing revelations that the NSA – in collusion with GCHQ – lifted vast amounts of data from Google and Yahoo, under the once-top-secret codename, Muscular. At the same time, we're told that the potential for beneficial insights mined from anonymous, adequately protected data is enormous.
Big Data helps us find things we "might like" to buy on Amazon, for example, but it has also left us vulnerable to surveillance by state and other agencies. Companies such as Google and Facebook are essentially Big Data businesses, whose staggering profitability stems from the application of data analysis to advertising: these "free" services are paid for by personal data surrendered automatically with every click.
In finance, meanwhile, optimists foresee a theoretical end to all stock-market crashes, thanks to insights derived from huge-scale data-crunching, while others predict an automated, algorithmic road to ruin. Similarly, the cost and efficiency of healthcare provision are set to be radically transformed for the better with access to massive amounts of data – likewise the development of new drugs and treatments. But what about the mining of medical data without patient consent? So the debate goes on.
One aspect of Big Data, however, is beyond question: it is indeed very big, and it's getting bigger by the millisecond. An IBM report in September estimated that 2.5 quintillion bytes of data are created every day (that's 25 followed by 17 zeros, or roughly 10 million laptop hard drives) and that 90 per cent of the world's data has been generated in the past two years: everything from geo-tagged phone texts and tweets to credit-card transactions and uploaded videos. By 2020, it's thought that the number of bytes will be 57 times greater than all the grains of sand on the world's beaches.
So what's actually going on at the coalface of Big Data, a code-centric world of striping, load-balancing, clustering and massively parallel processing? What do the analysts working with Big Data say it's going to do for us?
"You get a fuller picture of the phenomenon you're interested in, with more dimensions, and that lets you derive greater insights," says Big Data pioneer Doug Cutting, chief architect at enterprise software company Cloudera and founder of the popular open-source Big Data tool Hadoop. Cutting's work on internet search technology for Yahoo during the mid-2000s provided the ideal proving ground for combining vastly increased computing power with huge and diverse datasets. "And from that we've seen a new style of computing emerge."
The revolutionary effects of this new approach cannot be overstated, especially within the scientific community. For Brad Voytek, professor of computational cognitive science and neuroscience at the University of California San Diego, and "data evangelist" for app-based taxi service Uber, Big Data has had a profound effect on the traditional scientific method. "You can sweep through huge amounts of data and come up with new observations," he says. "That's where the power of Big Data comes in. It's automating the observation process. It's making everything easier but in a way that few people yet understand. It's going to dramatically speed up the scientific process and people have been doing some really cool stuff with it."
Michael Schmidt, founder and chief executive of American "machine-learning" start-up Nutonian, established a Big Data landmark when, in partnership with robotics engineer Hod Lipson at Cornell University, New York, he created Eureqa – a piece of software that deduced Newton's Second Law of Motion by analysing data from the chaotic movements of a double pendulum. What took Newton years, the Eureqa algorithm accomplished in a matter of hours. With Nutonian, Schmidt is now opening up that Big Data technology beyond the college lab.
"We want to accelerate the process that scientists go through, to help you discover very deep principles from data," he says. "We want to explain how things work." The range of Eureqa's uses couldn't be more striking, from the construction of better warplanes to helping save the lives of infants. Schmidt is currently working with the United States Air Force, analysing the strength of advanced super-alloys used in engine components. "They are really interested in anticipating failures – knowing when things are going to break, explode or stop working. We were able to show them the most important things that go into a failure of a particular engine part, at a finer resolution than ever before."
Eureqa has also been used to help discover the optimal moment to remove breathing tubes from prematurely born babies. "It's really critical when you remove that tube, and allow the child to start breathing on its own," says Schmidt. "Premature babies are hooked up to every monitoring device you can imagine and we were able to take that data and winnow it down to a few of these key metrics that drive the future health of the babies. Which is pretty neat."
Harnessed to Big Data, this kind of analysis becomes the work of hours and minutes. "Traditionally you could spend years before you could conclude on a result. What's changed is that we have these huge datasets. You can rapidly accelerate the entire discovery process."
While the benefits of this revolutionary increase in analytical speed are clear, Big Data is often inseparable from its source and context, especially in the public realm, where ethical concerns are paramount. Justin Keen, professor of health politics at the University of Leeds, co-authored a June 2013 paper published in Policy & Internet, the journal of the Oxford Internet Institute. In it, he addressed issues of privacy and access in relation to Big Health Data. "The potential for much greater exploitation of data held by government departments in England and all around the world is real," he says. "We just haven't got proper governance arrangements at the moment – we don't know what rules should govern what NHS data gets published, and in what sort of format."
Early in 2013, Health Secretary Jeremy Hunt set the goal of a paperless NHS by April 2018, in line with programmes including care.data, which links patient data across different parts of the NHS. It is hoped that the resulting increase in preventative treatments, coupled with improvements in health management, will save billions and improve the quality of healthcare. The sticking point is patient confidentiality.
"I'm very happy to see that in the past month or two, senior civil servants have actually put the brakes on," says Keen. "Releases of data through care.data and other channels are actually going to be slowed down until we've got these governance arrangements right. But we're not going to get the releases of data that advocates are hoping for as early as they might have hoped for it."
Despite this slow-down, the Big Data community appears to be echoing Keen's note of caution. "From my perspective as a person who works in data, of course I want as much as I can get, because the more data you've got, the more interesting things you can do with it," says Francine Bennett, chief executive and co-founder of London-based Big Data specialists, Mastodon C, which mined available data to co-create the CDEC Open Health Data Platform, a showcase for insights generated by Big Health Data. "However, as a person who's knowledgeable about data – and as a citizen of the UK whose health data is in these systems – I know that it could be enormously damaging to privacy to release things which shouldn't be released. It's hard to put the genie back in the bottle. I'm keen for it to be done in a measured way."
Gil Elbaz, founder and chief executive of open-data platform Factual, began his career as a database engineer in Silicon Valley in the 1990s before co-founding Applied Semantics, acquired by Google in 2003 for $102m. Applied Semantics developed AdSense, the technology that matches online advertisements to the pages being browsed and the person browsing them. "The approach we took to the contextual targeting of ads was all rooted in processing huge amounts of data," says Elbaz, whose Factual company website affirms his core belief in "making data accessible".
"We take data privacy very seriously, and if somebody's data is theirs, they should have the right to keep it private. That being said, there are significant opportunities where data shouldn't be kept fully private, because it's to society's benefit for it to be open," he says, citing David Cameron's October 2013 announcement, at the Open Government Partnership summit, of a public register of business ownership. "Data at Factual is primarily business data," says Elbaz. "These businesses want it to be available."
Even where the privacy question is not an issue, Elbaz is concerned that information can get trapped in hard-to-reach databases. "Too often today data is not accessible. For example, why is it that software can't automatically check – given the age of a patient and any drug – whether a dosage is healthy or lethal? Why can't it be flagged? The reason is that there is no open API (Application Programming Interface, or app-creation tool) that has drugs and dosage ranges. It does not exist. Is there a database? Yes. But it'll take a long time even to find the right person to buy that data from. To me, this is insane."
So where's it all leading us? For some, the ultimate goal of Big Data has been defined as a kind of supreme foresight: an ability to predict what people want before they know they want it. Elbaz takes a more functional view. "My holy grail is that if any piece of software needs access to information, it can find that access at a reasonable cost," he says. "To me, it is not crazy rocket science – it's the basic fabric of how a global information system should work."
For Schmidt, the quest for enlightenment has only just begun. "A lot of promises have been made for Big Data in the hope that it has this enormous value, and we're starting to chip away a little at that, but there's still so much to be done."
Doug Cutting, however, has little interest in the notion that Big Data will supply some kind of predictive super-power. "I'm an engineer. I focus down on the plumbing. I think I have a more concrete imagination about what is possible. I don't believe it's possible to have an oracle that can predict what I'll be interested in doing tomorrow. Moreover, I find surprises invigorating; I'd hate to lose spontaneity in the world."
However, he adds, certain kinds of things can be done better. "To me, the holy grail is removing limitations and being able to achieve the interconnectedness that we want; to be able to take advantage of all the data and do all the things we imagine are possible. I don't think we want to get there overnight as a society. We need to embrace these things and understand what we want to happen and what we don't want to happen – build the right societal, legal and business structures. We need to evolve."
Three eye-catching Big Data ventures
1. Open Data Institute
Aim: free data for all
Co-founded by Sir Tim Berners-Lee, the inventor of the World Wide Web, to encourage the exploitation of freely available data – aka "open data" – the not-for-profit Open Data Institute has positioned itself as both a catalyst for data innovation and a global hub for data expertise. Based in Shoreditch, east London, the ODI oversees a network of collaborative international "nodes", including Dubai and Buenos Aires, and has incubated a growing bunch of Big Data start-ups – for example, Mastodon C (see main feature), which identified potential NHS savings of about £200m by crunching data relating to branded and generic drugs; and Placr, which analyses real-time transportation and timetable information to improve daily travel. theodi.org
2. The Human Brain Project
Aim: to reveal the workings of human consciousness
Flush with €1bn in funding, the Human Brain Project is a 10-year quest to reveal the hidden workings of consciousness. The scale of this task is so immense – the brain has around 100 trillion neural connections – that many still doubt it can be achieved, but Switzerland-based project leader Henry Markram believes his collaborative Big Data approach, using statistical simulations and vast supercomputing power across "swarms" of researchers, might do the trick. One aspect of the plan involves mining a huge amount of available data on mental disorders from public hospitals as well as pharmaceutical company databases; algorithms will then isolate revealing patterns and connections. In a decade's time, the neural picture should be much clearer. humanbrainproject.eu
3. IBM's Computational Creativity
Aim: to make computers 'creative'
Following a line of computer evolution that runs from Deep Blue (which beat Garry Kasparov at chess in 1997) through Watson (which beat human opponents on the US quiz show Jeopardy! in 2011), IBM has continued its ingenious manipulation of huge datasets with a system designed to generate creativity. Big-data analytics techniques have been deployed by IBM's Thomas J Watson Research Center to create new food recipes – what you might call technouvelle cuisine – mined from sources including Wikipedia and Fenaroli's Handbook of Flavor Ingredients, then tweaked with an algorithm designed to add creativity to matched ingredients. The results (from Vietnamese apple kebab to Cuban lobster bouillabaisse) have impressed human chefs. research.ibm.com