For cloud-based applications that handle big data, the 3Vs of big data -- volume, velocity and variety -- must now be supplemented by a fourth V: veracity. That may be especially true when handling data that belongs to others.
Speaking during a session at the Big Data Innovation conference in Boston, Mike O'Rourke, vice president of product development for IBM's Cloud Data Services business unit, said all four factors, plus the under-discussed issue of data ownership, play a crucial role in driving business agility. Much of the data used in modern cloud-based applications is derived from outside sources, necessitating cleansing before use.
"It's vital for development teams to be agile so they can react quickly and provide fast turnaround for application updates," he said. "That means when dealing with big data, you have to think differently." While O'Rourke didn't go so far as to cite Forrest Gump's "life is like a box of chocolates" axiom, he did make clear that when applications collect, process and store big data, you never know what they're going to get. Consequently, the design of cloud-based applications must be resilient enough to ensure uninterrupted operation regardless of the data they encounter.
Explaining the four V's, O'Rourke said the first, volume, is paramount as the amount of data with which cloud and mobile applications need to interact, both transactional and streaming, soars from terabytes to petabytes.
Variety is the many forms, both structured and unstructured, that applications must be capable of handling. Data coming from video is unstructured, yet applications must be aware of its content, O'Rourke said. "If you're a broadcaster, you don't want to have a large auto accident with a bunch of people killed in a movie and then have an advertisement come up that says go buy a Chevrolet. The two don't mix."
Velocity, or data in motion, is becoming increasingly important, especially as volume skyrockets. O'Rourke said, "We have so much data moving so fast with the Internet of Things -- through sensors, through social data -- your application must be capable of making decisions in real time." The concept of collecting data for later analysis or processing is obsolete, he added.
Veracity refers to the need for applications to exhibit flexibility as they handle data of varying reliability, a factor he dubbed data uncertainty. As examples, O'Rourke cited data from fitness devices and phones that might suddenly become unavailable due to a dead battery or lost communications. "Is your application prepared to deal with that interruption?"
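One way an application might tolerate that kind of interruption is to treat missing and stale readings as normal cases rather than errors. The Python sketch below is illustrative only and is not drawn from O'Rourke's talk: the `Reading` type, the `latest_usable` and `describe` helpers, the heart-rate example and the 60-second freshness window are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Reading:
    """One report from a device (hypothetical shape, for illustration)."""
    timestamp: float          # seconds since some reference point
    value: Optional[float]    # None when the device failed to report


def latest_usable(readings: Sequence[Reading], now: float,
                  max_age: float = 60.0) -> Optional[float]:
    """Return the newest non-missing value no older than max_age seconds.

    Returns None when every reading is missing or too old, so callers
    must handle the "no data" case explicitly.
    """
    for reading in sorted(readings, key=lambda r: r.timestamp, reverse=True):
        if reading.value is None:
            continue  # the device reported nothing; try an older reading
        if now - reading.timestamp > max_age:
            break     # this and all older readings are stale
        return reading.value
    return None


def describe(readings: Sequence[Reading], now: float) -> str:
    """Produce a status message that degrades gracefully on missing data."""
    value = latest_usable(readings, now)
    if value is None:
        return "no recent data; device may be offline"
    return f"heart rate: {value:.0f} bpm"
```

In this sketch, a dead battery or dropped connection shows up either as a `None` value or as a gap in timestamps, and the caller gets a usable fallback message instead of an exception.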
Traditionally, businesses collect and process their own data -- be it retail transactions, factory floor process control or insurance policy premium and benefit tracking. It's different in the cloud, according to O'Rourke. "When building applications or applying analytics, the chances are, whatever company you're with, you do not own most of that data."
As an example, he cited a group of university engineering students interning for a summer at IBM. In an application they built for New York City to pinpoint unsafe roadway locations, the students leveraged public data on motor vehicle accidents, weather, sunrise and sunset, humidity, whether the roads are wet, and geospatial information on road signage and roadway lane markings. All the data sets were in the public domain.
In a matter of weeks, the group created an application that identifies specific road locations where repairs or redesign are needed, places where drivers could benefit from earlier warning signage, and intersections where signal changes are needed.
"Because you don't own most of the data you work with, the best advice I can offer to any developer is that it's necessary to clean it, attach to it, and store it before you can look at and analyze it," O'Rourke said.