Big Data Challenges: Volume, Variety, Velocity & Veracity
By David Hunt
The field of statistics has undergone a radical change since Alyson Wilson began her career as a mathematical statistician with the National Institutes of Health two decades ago.
“We used to do very carefully designed experiments. Every record was full of rich data, very carefully controlled,” says Wilson, now a statistician at NC State University. “Today, we often have to deal with ‘found data.’ When you’re dealing with every Web page written in Japanese or every Wikipedia entry in Polish, the issues are different than when you’re dealing with 30 carefully controlled experiments on one topic.”
The data-processing tools of the digital age give researchers — as well as business managers, public officials and other decision-makers — a staggering volume of data to manage and analyze. Coping with this flood of “big data” is often likened to drinking from a fire hose.
If only it were that simple.
The Internet, one of the fastest-growing channels of digital information, is like a data-driven Niagara Falls, surging with an endless, churning, unstoppable flood of bits and bytes. Trying to pinpoint and analyze a particular piece of information on the Web is like trying to pick out a specific drop of water as it rushes over the falls.
PennyStocks, an online investment company, estimates that well over a million gigabytes of data are transferred over the Internet every minute. Computer chip maker Intel says it would take nearly five years to watch all the video footage that streams over the Internet in just one second.
Of course, the Internet isn’t the only tool that generates swarms of data. Hospitals and medical offices create electronic medical records for millions of patients, utility companies track power usage by millions of homes and businesses, and retailers maintain sales and inventory records on millions of products.
The huge and growing volume of data available for analysis is just one challenge for scientists, Wilson says. Researchers in the field face a set of often-overlapping difficulties known as the four Vs: volume, variety, velocity and veracity. Information overload, a term coined by social scientist Bertram Gross 50 years ago and popularized by futurist Alvin Toffler in his 1970 bestseller Future Shock, is no longer an impending threat — it’s business as usual. The world is simply flooded with too much data moving much too quickly for any one person to collect, verify and understand.
That’s where math comes in.
Wilson is an expert in working with heterogeneous data to understand complex problems — the variety aspect of big data. To understand the challenges, let’s take a look at a real-world example that Wilson described in a 2013 article in the journal Technometrics.
A company developing a new missile for the U.S. military needs to assess the reliability of numerous systems, subsystems and components, including the frame, warhead, propulsion system, targeting system and guidance system.
To do that, companies typically conduct a series of pass/fail tests on individual parts and then apply statistical models to estimate their likely future performance. The analysis is complicated because the components are interrelated, so that a failure in one area affects others downstream. But in almost every case, scientists face the same problem: analyzing a variety of data from different sources in a complex system.
A promising new approach, devised by Wilson and researcher Jiqiang Guo at the Virginia Tech Social and Decision Analytics Laboratory, gives scientists a way to combine component-level data into a unified model for a whole system. The sophisticated mathematical methodology, based on a probability theorem called Bayes’ law, allows researchers to more accurately estimate reliability for an entire product.
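To make the idea of combining component-level pass/fail data concrete, here is a minimal Python sketch. It is not the Wilson and Guo methodology, which the article does not detail. It assumes a much simpler setup: each component gets a uniform Beta(1, 1) prior updated by Bayes' law, components fail independently, and the system works only if every component works. The article notes that real components are interrelated, which this sketch ignores. The component names and test counts are hypothetical.

```python
import random


def posterior_params(passes, fails, prior_a=1.0, prior_b=1.0):
    """Beta posterior parameters for a pass probability after pass/fail tests.

    With a Beta(prior_a, prior_b) prior, Bayes' law gives a
    Beta(prior_a + passes, prior_b + fails) posterior.
    """
    return prior_a + passes, prior_b + fails


def system_reliability_draws(test_results, n_draws=20000, seed=0):
    """Monte Carlo draws of system reliability.

    Assumption: independent components in series, so system reliability
    is the product of the component reliabilities.
    """
    rng = random.Random(seed)
    params = [posterior_params(p, f) for p, f in test_results.values()]
    draws = []
    for _ in range(n_draws):
        reliability = 1.0
        for a, b in params:
            reliability *= rng.betavariate(a, b)
        draws.append(reliability)
    return draws


# Hypothetical (passes, fails) counts; not taken from the article.
tests = {"propulsion": (48, 2), "guidance": (29, 1), "warhead": (60, 0)}

draws = sorted(system_reliability_draws(tests))
mean = sum(draws) / len(draws)
low = draws[int(0.05 * len(draws))]
high = draws[int(0.95 * len(draws))]
print(f"mean system reliability: {mean:.3f}")
print(f"90% interval: {low:.3f} to {high:.3f}")
```

The point of the sketch is that the output is a distribution for the whole system, not a single number per part, so the uncertainty from small test counts on any one component carries through to the system-level estimate.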
Data falls broadly into two categories, says computer scientist Randy Avent: structured and unstructured. You know what structured data is if you’ve ever entered numbers into a spreadsheet. It’s the type of information that fits neatly in a column or row, like a baseball team’s scores for the season or the price of each item in your grocery cart. Unstructured data includes a lot of the digital information that streams over the Internet, such as text, video, sound and images.
Avent, president of Florida Polytechnic University, says making sense of huge amounts of heterogeneous data requires scientists to work across disciplines.
“We used to value very deep disciplinary advances,” he says. “But a lot of the problems we’re working on today can’t be solved by a single discipline.”
Avent’s academic career mirrors that way of thinking, combining studies in the life sciences, mathematics and engineering. At NC State, where he served as associate vice chancellor for research development for three years before moving to Florida Polytechnic, Avent spearheaded a pioneering effort to create an interdisciplinary cluster of researchers in data-driven science.
That faculty cluster, now led by Wilson, brings together top researchers in computer science, mathematics and statistics to confront the challenges of big data and train the next generation of analytics scientists. With Wilson as principal investigator, it’s also leading a nationwide collaboration funded by a $60 million grant from the NSA — the National Security Agency — to make groundbreaking advances in the science of data analysis.
The Laboratory for Analytic Sciences, or LAS, housed on NC State’s ultramodern Centennial Campus, is one of the largest sponsored-research projects in the university’s history, a testament to the importance the federal government places on solving the challenges of big data.
Big Data, High Security
Forrest Allen, program manager at LAS, says NC State was the logical choice to lead the multidisciplinary center because of its culture of collaboration. Although private universities such as the Massachusetts Institute of Technology and Carnegie Mellon may be better known, NC State is quickly gaining a reputation with Washington policymakers as the go-to place for public-private partnerships.
In fact, the university has become a hub for hubs. In addition to the analytic science lab, the university is also leading multidisciplinary centers combating the spread of nuclear weapons, redesigning the nation’s electric power grid, developing wearable health-monitoring devices and helping farmers cope with the effects of a changing climate.
Allen, who spent a decade in Washington, D.C., as a congressional staffer and deputy assistant secretary in the Department of Energy, says NC State delivers both real-world experience and broad expertise. Instead of being strong in just a few scientific disciplines such as engineering and computer science, he says, NC State excels across a broad array of disciplines, including industrial design, genetics, the humanities and social sciences, business management, statistics and clinical sciences.
“We have world-class computer scientists, but we also have historians and English professors and psychologists on the same campus,” he says. “NC State has been able to integrate those faculty into an interdisciplinary environment that we really think is the special sauce of LAS.”
The new lab is off to a strong start, Allen says. The NSA has 20 staff members working on campus to support the effort, and that number will eventually increase to 50. Faculty members from seven of NC State’s 10 colleges are already working on projects at LAS. A three-day campus event introducing the lab to faculty attracted more than 150 faculty members eager to discuss research ideas with agency officials.
“Everybody on campus does analysis,” Allen says. “Business faculty do financial analysis, veterinarians do medical analysis, biologists do genetic analysis. The NSA just happens to do intelligence analysis. There’s a lot of commonality.”
Allen says researchers at LAS are approaching the big data challenge from several perspectives, ranging from scientific innovations that speed data processing to social science models that seek to understand human behavior. Much of the work involves trying to improve the signal-to-noise ratio inherent in complex data — cutting through the mass of irrelevant information and pinpointing critical material.
“With as many sensors as we have, we don’t have a giant vacuum cleaner sucking up every signal that the world produces,” he says. “We think the answers are out there in the data. We just have to find the right way to sort through it all.”
Big Data and Bioinformatics
It’s easy to understand the importance of harnessing big data analysis in the service of national security. After all, we all want to help the good guys and stop the bad guys. But what if the bad guy is a parasite and the good guy is a protein?
Welcome to the world of bioinformatics — one of the hottest fields applying big data analytics to real-world problems.
Researchers in bioinformatics develop tools to help make sense of the vast, complex and diverse data sets generated by studies in biological and medical science. Much of that data takes the form of strings of DNA that contain all of an organism’s genetic information. In humans, that amounts to about three billion base pairs of the compounds adenine, guanine, cytosine and thymine — the rungs on DNA’s twisted ladder. The sequence of these bases provides the instructions needed for an organism to develop, survive and reproduce.
In his lab on Centennial Campus, plant pathologist David Bird leads the university’s new bioinformatics cluster, bringing together top researchers in genetics, statistics, computer science and biology. Meeting the challenges of big data has sparked “a philosophical change in the way we do science,” Bird says.
For one thing, researchers now have the computing power and the statistical tools to take on tasks that would have been impossible a decade ago, like running 10 billion tests on a string of DNA.
DNA sequencing gives researchers a window on the inner workings of the genome, providing a wealth of data that explains how proteins are made, identifies the mutations associated with cancer risks, and shows how parasites interact with their hosts at the cellular level, among other insights.
For most of his professional life, Bird has studied nematodes, among the most abundant animals on the planet. Grab a handful of soil and you’re likely holding thousands of the microscopic creatures, commonly called roundworms.
With more than 30,000 known species — and perhaps a million more yet to be identified — nematodes have adapted to a wide range of habitats, from freshwater lakes to tropical forests, not to mention the human intestine.
“Half the world’s human population is infected with nematodes,” Bird says. “We have to get our hands on controlling them.”
Of the thousands of species, a few are truly nasty, causing ailments such as trichinosis, a disease that affects more than 10 million people, mostly in the developing world.
Plants suffer from nematode infections as well. Parasitic nematodes attack most cultivated plants, including food staples, as well as many common varieties of vegetables, fruit trees and ornamentals, costing the world’s growers an estimated $100 billion a year in lost crops. Unfortunately, nematodes are remarkably efficient killers.
Take root-knot nematodes, for example. The tiny parasitic animals hatch in the soil, make their way to a plant and then dig into its roots. There they secrete proteins that instruct the plant to develop a specialized feeding site for the parasite. After feeding on the plant for several weeks, the adult female lays around 1,000 eggs. During this time, the plant appears oblivious to the feeding nematodes, and it is only when another stress comes along — such as a few weeks without rain — that the plant shows symptoms.
While crop rotation is a common pest management technique, it doesn’t usually work with nematodes because they’re perfectly at home in a wide variety of plants. And they’re patient pests; some species can survive in a desiccated, anhydrobiotic state for more than a century, waiting for a host.
As a microbiologist, Bird thinks the best way to combat the parasite is to understand and short-circuit the chemical processes that allow it to interact with its host. That’s where big data can help. After all, there isn’t one single factor that makes a plant vulnerable to nematodes. Researchers are just beginning to decode the genetic mechanisms that trigger the parasitic proteins responsible for hijacking the host plant during invasion, inducing the formation of feeding sites, manipulating the host’s metabolism for the nutritional benefit of the nematodes and suppressing the host’s defense responses.
“Evolution has done this big 2-billion-year experiment,” Bird says. “We have to look at all that natural variation to infer how the parasite works.”
In Our Genes
“Some people use the analogy of looking for a needle in a haystack,” says statistician Fred Wright. “But that’s not what we’re doing. We’re actually looking for lots of needles in many, many haystacks.”
Wright, a member of NC State’s new bioinformatics faculty cluster and director of the Bioinformatics Research Center, is studying genetic variations in people with cystic fibrosis, an inherited disorder that causes severe damage to the lungs and digestive system.
“Even with modern medicine, some people with cystic fibrosis die at 15, and some live to 50,” Wright says. “It’s that variation that we’re trying to understand. What is it in the constitution of their DNA that allows some people to survive so long?”
To answer that question, Wright and collaborators at the UNC-Chapel Hill Cystic Fibrosis Center are conducting complex genetic profiling on thousands of cystic fibrosis sufferers — a data-crunching challenge that scientists have only recently been able to address thanks to fast, powerful computers. The raw data come from microarrays, a technology for probing millions of genetic markers and measuring thousands of genes simultaneously to determine which are switched on and which are switched off. By comparing the genetic profiles of different people, the researchers are learning how the disease progresses.
“If we find variations that correlate to reduced lung function, then it becomes a matter of working with medical geneticists to understand how the genes may be interacting or mediating the immune system to cause the lung to become inflamed,” he explains. “The eventual hope is that there might be a drug target that could help fix the problem.”
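The search for variations that correlate with lung function can be sketched in a few lines of Python. This is an illustration only, not the analysis Wright's team runs: the marker names, genotype codes (0, 1 or 2 copies of a variant) and lung-function scores below are made up, and a plain Pearson correlation stands in for whatever statistical models the researchers actually use.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a marker that never varies carries no signal
    return cov / math.sqrt(var_x * var_y)


def rank_markers(genotypes, lung_function):
    """Rank markers by the strength of their correlation with the outcome."""
    scores = {m: pearson(g, lung_function) for m, g in genotypes.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical data: one lung-function score per patient, and for each
# marker the number of variant copies (0, 1 or 2) each patient carries.
lung_function = [95, 88, 70, 62, 91, 55]
genotypes = {
    "marker_A": [0, 0, 1, 2, 0, 2],
    "marker_B": [1, 0, 1, 0, 1, 0],
    "marker_C": [2, 2, 2, 2, 2, 2],
}

ranked = rank_markers(genotypes, lung_function)
for marker, r in ranked:
    print(f"{marker}: r = {r:+.2f}")
```

With millions of markers instead of three, some will correlate strongly by chance alone, so a real study has to correct for the number of tests it runs. That is the "many haystacks" problem Wright describes.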
If it seems strange that a statistician is leading a medical research project, it’s time to update your thinking. In the age of big data, health care solutions are as likely to come from analytics as from traditional clinical trials.
“There’s been a change in the last decade,” says Bird, head of the bioinformatics cluster. “Statisticians are no longer just service people that you go to for help with your experiment. They’re now leading the discipline.”
Among those leaders is David Reif, a researcher trained as both a statistician and a geneticist who joined NC State in 2013 after seven years at the U.S. Environmental Protection Agency, where he worked as a scientist in the agency’s National Center for Computational Toxicology. Reif is an expert on the toxic effects of chemicals.
From the pesticides used to protect crops to the pressurized fluid injected into shale formations to extract natural gas and petroleum, toxic chemicals pose a growing risk to people and the environment. Finding the genetic connection between toxins and diseases is crucially important — and enormously difficult.
“Why do people respond differently to the same environmental toxins?” Reif asks. “If two people drink the same tap water, why does one person get sick while the other does not?”
The answer may be found in the genetic variations Reif studies in the lab. But even with all the computing power of a major research university at his fingertips to crunch vast amounts of data and churn out volumes of reports, Reif notes that computers don’t perform the most important function in science: thinking.
“The computer doesn’t solve the problem without instructions on where and how to look,” Reif says. “But it’s great at performing a simple task umpteen billion times without getting bored.”
Once computers have done their job of highlighting promising associations, Reif begins the challenging work of interpreting the data. The genetic pathways that lead from toxic exposure to physical illness are rarely marked with clear signposts. But that doesn’t mean researchers have reached the end of the road when it comes to big data.
Ph.D. student Tim Michaelis opens his laptop and pulls up a database he built for a Fortune 500 chemical company in Research Triangle Park using IBM Content Analytics software. To help the company find new clients, Michaelis programmed the database to scour the data on millions of Web pages.
Text on a Web page is, of course, unstructured data — the kind of information that isn’t easily sorted, quantified or interpreted. But students like Michaelis need to learn to work with it, especially considering its exponential rate of growth. IBM estimates that 90 percent of the world’s data was created in the last two years, and 80 percent of it is unstructured.
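One common first step in working with unstructured text is to turn it into something that fits in rows and columns. The Python sketch below shows the general idea using only the standard library. It is not how IBM Content Analytics works, and the page names, page text and search terms are hypothetical.

```python
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text, split it into words, and count each word."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def to_rows(pages, terms):
    """Turn free text into structured (page, term, count) rows."""
    rows = []
    for name, text in pages.items():
        counts = term_counts(text)
        for term in terms:
            rows.append((name, term, counts[term]))
    return rows


# Hypothetical page text; a real project would fetch and clean Web pages.
pages = {
    "page_1": "Polymer coatings for packaging. New polymer blends announced.",
    "page_2": "Quarterly results and hiring news.",
}

rows = to_rows(pages, ["polymer", "coatings"])
for row in rows:
    print(row)
```

Each row now looks like spreadsheet data and can be sorted, filtered or summed, though deciding which terms are worth counting still depends on the question being asked.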
As he reviews the data, Michaelis talks through the challenges of handling terabytes of information.
“I come back again and again to the questions, ‘How do you know what you know? Is the information good or bad?’ You can’t compare it to anything unless you started with a question — your hypothesis,” he says. “It comes back to basic science.”
Michaelis, who earned a master’s degree in global innovation management at NC State last year, is working as a research associate in the Poole College of Management while he completes a doctorate in psychology. The college’s faculty and students have worked with business partners for more than three decades to lay the foundations of innovation management as an emerging field of science. For companies struggling to survive in today’s hypercompetitive global markets, it’s nothing short of a revolution.
But the real power of innovation management is more than just the ability to read every word on every website pertaining to a business or industry.
“If you have data but no idea how to apply it, then you just have numbers on a page,” Michaelis says. “You have to be able to interpret data, to give it meaning.”
From a student’s perspective, the program offers a new educational paradigm in which questions are as important as answers.
“The old way was: The professor is going to tell me the answer, I’m going to write it down and memorize it, and then I’ll get paid a bunch of money because I know that thing,” Michaelis says. “The new way is questioning everything, searching for an answer rather than just receiving one.”
Think and Do
If you guessed that a new way of teaching requires a new breed of teacher, you’re on the right track. Take Michael Kowolenko, an assistant teaching professor in the Poole College of Management and the first teacher to offer a class on data-driven decision-making at NC State. With a Ph.D. in immunology from Northeastern and two decades of experience in both research and operations at an international pharmaceutical firm, Kowolenko understands the real-world challenges of succeeding in the information age.
“I’m a technical guy who made it in business,” he says.
At NC State, Kowolenko works with teams of graduate students on industry-sponsored research projects in fields as diverse as engineering, computer science and human resource management. Whatever the field, each project starts with a question.
“You can throw a lot of analytics at anything, but if you don’t know what problem you’re trying to solve or what question you’re trying to answer, then you’re just throwing good money after bad,” he says. “Your goal has to be to convert information to knowledge. Knowledge is what gives you the ability to make a decision.”
Business executives who have only dealt with structured data — financial statements and inventory reports, for example — may have a hard time putting confidence in unstructured data such as marketing intelligence and Web search results.
“Unstructured text analytics is really messy,” Kowolenko says. “A lot of times you see statisticians and analysts twist in their seats a little bit because our methodologies are so different.”
To overcome their reluctance, he has a simple pitch: “The people who do this best, win.”
Past and Future
At the National Institute of Statistical Sciences in Research Triangle Park, Director Alan Karr was thinking about the future, and not just because he was leaving the agency after 20 years to take a job at nearby RTI International.
“There’s no going back on the world’s capacity for collecting data,” he says. “Soon even your car will have its own Internet address.”
As the challenges of big data evolve and grow, Karr says, North Carolina’s Research Triangle is the natural place to confront them. After all, the region pioneered many of the tools and processes that helped launch the data analytics industry and have fueled its explosive growth.
At his new office at RTI, Karr works in a building named for the late Gertrude Cox, an expert in experimental research design who came to Raleigh in 1940 to join the faculty at NC State. She is credited with helping to establish the statistics departments at NC State and UNC-Chapel Hill, as well as the statistical division at RTI — a legacy felt to this day.
The region is also home to SAS Institute, a data analytics powerhouse founded in 1976 by Jim Goodnight, an NC State alumnus and onetime faculty member. The company — the world’s largest privately held software firm — recently signed a master research agreement with the university to sponsor projects in a wide range of disciplines.
The convergence of major universities, nonprofits and industry leaders such as SAS makes the Triangle a hotbed of collaborative research.
The revolution sparked by the relentless growth of complex data has, in fact, touched virtually every industry and profession — including academia — making the practice of working across disciplines the new normal in labs and research institutes. This new focus on collaboration can’t come too soon, says Karr.
“What’s really behind the challenges of big data isn’t just that data are complex,” he says. “It’s that data are complex in ways that we aren’t used to.”
Or, to quote the futurist Alvin Toffler, “The future always comes too fast and in the wrong order.”