Cloud and big data are surely two of the most-discussed topics when CIOs and IT strategists meet, so it's no surprise that the union of these technologies would be doubly interesting. From the start, Apache Hadoop has been the focus of cloud-big-data thinking, but there are plenty of refinements to consider in Hadoop planning, and many big data cloud applications aren't suitable for Hadoop. Planners for cloud and big data should ask what the big data storage paradigm will be and whether it matches Hadoop's capabilities, optimize their database planning for cloud access, and track changes in data storage or access policies that could signal the need for a change in strategy.
Hadoop is an open source implementation of a Google concept called MapReduce, paired with a distributed file system and designed to support the storage and querying of data distributed across multiple network-connected compute clusters. The basic notion is to allow a single query to find and collect results from all the cluster members, a model that clearly suits Google's search workload.
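To make the "single query, many cluster members" idea concrete, here is a minimal sketch of the map, shuffle and reduce steps in plain Python. It does not use Hadoop or its API; the node names and log records are hypothetical, and a real job would run the map step on each cluster member in parallel rather than in a loop.

```python
from collections import defaultdict

# Hypothetical data: each list stands in for the records held by one cluster member.
PARTITIONS = {
    "node-a": ["error disk", "ok", "error net"],
    "node-b": ["ok", "ok", "error disk"],
}


def map_phase(records):
    """Run where the data lives: emit a (word, 1) pair for every word seen."""
    return [(word, 1) for line in records for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key so each key can be reduced in one place."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def run_job(partitions):
    """One logical query: map on every partition, then shuffle and reduce."""
    pairs = []
    for records in partitions.values():
        pairs.extend(map_phase(records))
    return reduce_phase(shuffle(pairs))


RESULT = run_job(PARTITIONS)
print(RESULT)
```

The point of the pattern is that only the small emitted pairs travel between members; the raw records stay where they were collected.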
The key value proposition for Hadoop is distributed data that's subject to collective inquiry. Most enterprises today collect information in centralized databases and also create separate "abstractions" or aggregations of this data to support more efficient access. Many vendors, including IBM, recognize this trend and don't lead their cloud big-data initiatives with the assumption that Hadoop is the technology of choice. CIOs also agree that it's rarely wise to use Hadoop on centralized data, or to distribute data in the cloud simply to be Hadoop compatible.
Hadoop is ideal where data is naturally separated, not just within a data center but across multiple data centers. If that's not the case for your data, then Hadoop isn't likely to be the best option -- even if you're moving applications to the cloud. Here are other points to consider in terms of data storage:
- Do you routinely query distributed data as though it were centralized? If your data access tends to be directed toward specific data clusters, providing for overall query capability may have limited value.
- Are any or all of your query applications highly performance sensitive? Hadoop querying is not as fast as other options for big data. This is particularly true if you're using Hadoop's optional SQL capability.
- Do you create aggregate databases with summary data to support high-level analytics? If so, these databases will likely combine data from multiple data clusters and reduce your need to look at the cluster data directly. However, Hadoop might be helpful here to support the aggregation of information.
The "ideal" Hadoop environment is one where large data volumes are collected locally and used locally, but must also be accessed by analytic applications that dive all the way to the raw data rather than work on summary-level information. If this isn't your situation, other options may be better. Hadoop is good at confining mass data access to the clusters themselves, but you can accomplish something similar by sending queries to local RDBMS appliances or processes at each location and then "joining" the results. Another strategy for avoiding cross-cloud-boundary data access is to create summary databases for analytics that don't require real-time information, and are specialized and small enough to be hosted in the cloud at modest cost or even moved into the cloud ad hoc as needed.
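The alternative of querying a local RDBMS at each location and then "joining" the results can be sketched as a scatter-gather query. This is an illustration only: in-memory SQLite databases stand in for the per-site systems, and the site names, table and sensor readings are made up. Each site returns a partial aggregate (sum and count), and only those partials are merged centrally.

```python
import sqlite3


def make_site(rows):
    """Create an in-memory database standing in for one location's local RDBMS."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    return conn


# Hypothetical sites and data.
SITES = {
    "plant-east": make_site([("temp", 20.0), ("temp", 22.0), ("flow", 5.0)]),
    "plant-west": make_site([("temp", 30.0), ("flow", 7.0), ("flow", 9.0)]),
}

# Each site computes a partial aggregate locally; raw rows never leave the site.
LOCAL_QUERY = "SELECT sensor, SUM(value), COUNT(*) FROM readings GROUP BY sensor"


def scatter_gather(sites):
    """Send the same query to every site and merge the partial results."""
    totals = {}
    for conn in sites.values():
        for sensor, partial_sum, partial_count in conn.execute(LOCAL_QUERY):
            acc = totals.setdefault(sensor, [0.0, 0])
            acc[0] += partial_sum
            acc[1] += partial_count
    # A global average must be built from sums and counts, not from local averages.
    return {sensor: s / n for sensor, (s, n) in totals.items()}


AVERAGES = scatter_gather(SITES)
print(AVERAGES)
```

Returning sums and counts rather than per-site averages is a deliberate choice: averaging the averages would give the wrong answer whenever sites hold different numbers of rows.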
The second point in planning for cloud and big data is to remember that true cloud applications are very different from legacy applications, and this must be the primary design consideration. Your cloud usage, present and planned, will have a major effect on your big data design, enough to create major problems if you make the wrong choice.
Users vary significantly in how they plan to use the cloud. Some expect to host everything there, some to share or hybridize, and some to use the cloud for failover or cloudbursting. Data access is a part of every application, and with the cloud the primary issue is to avoid passing large volumes of data across the cloud boundary. Store data in the cloud or on-premises and try to pass query results and summarized databases, not large quantities of raw data.
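As a rough illustration of why summaries should cross the cloud boundary instead of raw data, the sketch below builds a hypothetical set of order records, reduces them to one row per day, and compares the serialized size of each. The record layout and volumes are invented, and JSON size is only a stand-in for real transfer cost.

```python
import json
from collections import defaultdict


def make_raw(n_per_day=1000, days=3):
    """Generate hypothetical raw order records (one dict per order)."""
    return [
        {"day": d, "order_id": d * n_per_day + i, "amount": 10 + (i % 5)}
        for d in range(days)
        for i in range(n_per_day)
    ]


def summarize(records):
    """Collapse raw orders into one summary row per day."""
    acc = defaultdict(lambda: {"orders": 0, "revenue": 0})
    for record in records:
        row = acc[record["day"]]
        row["orders"] += 1
        row["revenue"] += record["amount"]
    return [{"day": day, **totals} for day, totals in sorted(acc.items())]


def payload_bytes(obj):
    """Approximate the transfer size as the length of the JSON encoding."""
    return len(json.dumps(obj).encode("utf-8"))


RAW = make_raw()
SUMMARY = summarize(RAW)
RAW_BYTES = payload_bytes(RAW)
SUMMARY_BYTES = payload_bytes(SUMMARY)
print(f"raw: {RAW_BYTES} bytes, summary: {SUMMARY_BYTES} bytes")
```

The summary is only useful if the analytics on the other side of the boundary can work at that level of detail, which is the question the planning process has to answer first.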
The final issue with big data for the cloud is the increased risk that the combination presents. Cloud computing is evolving, and so is big data. We are evolving application design to optimize cloud utility, and at the same time transforming our notion of data and databases with mass-collection applications like the Internet of Things. The changes this combination could bring to applications will affect both cloud and big data plans.
It's early in the cloud cycle (we've realized only about 4% of the total opportunity), but already we can see that the cloud of the future will have a more dynamic relationship with legacy or data center applications than simple failover or cloudbursting implies. These relationships will almost certainly force application architects to build databases that are more distributed, which suggests that either a Hadoop approach or a distributed RDBMS with distributed query processing will be needed. Looking ahead to which of these approaches is best and making the right decision now could save a lot of money.
Many people believe that the cloud means Hadoop for big data, but that isn't necessarily true. Almost any DBMS can be adapted to cloud deployment if policies for data and query distribution, and for data abstraction and summarization, are applied effectively. Look carefully at the nature of the data, and in particular whether mass access can work with summarized versions of detail databases. Make the choice that will support the majority of your data flows, with particular focus on how those flows cross cloud boundaries. A little attention in early planning will pay big dividends.