When it comes to implementing a strategy for big data in the cloud, the good news is you have a lot of choices.
But that's also the bad news. A recent report from Forrester Research stressed that although big data cloud offerings are robust, they also are potentially confusing and may require that enterprises be more flexible and piecemeal in their approach than is typical. The report's conclusion: One size does not fit all in the cloud computing space.
The three largest public cloud platforms -- Amazon Web Services (AWS), Microsoft's Azure and Google -- offer a wide array of big data services, but each is distinctly different. Because every company's big data needs -- and skill sets -- are also different, evaluating all the platforms to ensure you get the right mix of services is important, advised Kirk Borne, a data scientist and professor of astrophysics and computational science at George Mason University and a well-known expert on big data.
"Any big data activity has to start with the question you're trying to answer," Borne said. "You need to understand the business case before you take the plunge and make sure you choose the right services from what's available."
Here is a look at big data offerings from the three major cloud platforms, as well as from Altiscale, a fourth, newer company that offers Hadoop as a service in its own cloud.
AWS: Options and opportunities
In every decision to move to the cloud, a platform's "ecosystem" -- the services, partners, experts and systems integrators -- plays an important role. And, according to Ashish Thusoo, co-founder and CEO of Qubole, a big-data-as-a-service provider on AWS, Google and Azure, the AWS ecosystem is larger and better developed than that of any other cloud platform. The AWS ecosystem makes the platform very compelling -- and comfortable -- for enterprise clients wanting to move big data into the cloud, he said.
Part of that appeal is the wide variety of services available. Amazon's suite of big data services includes Elastic MapReduce, or EMR, for Hadoop; Kinesis for data streaming; Redshift for cluster-based data warehouses; Amazon RDS for Aurora and MySQL (among others); DynamoDB for NoSQL; the Simple Storage Service, or S3, options; and the brand-new Amazon Machine Learning.
"AWS can offer historical reports and dashboards on the past, streaming and analytics on present data, and now future predictive modeling tools," said Mike Gualtieri, an analyst with Forrester Research and co-author of a Forrester big data research report. He said AWS' Redshift is a particularly tantalizing draw for enterprise customers tired of slow reporting in traditional database environments. "Redshift is the fastest-growing AWS service, and it makes sense to move your data there to run all your analytics," he said. "It's a really logical use of the cloud."
Google BigQuery: A developer's dream
Google's big data platform -- BigQuery -- is designed for streaming data and continuous analytics. The platform has a predictive data API, a number of other Google-specific APIs and standard Java offerings. "The thing about Google is it's very developer oriented -- much more so than the other platforms," Gualtieri said. "In Google, you have to take their proprietary technology and APIs and be smart enough to figure it out." The Google platform offers Hadoop as a big data option, but Gualtieri warned that companies will need in-house expertise if they want to run Hadoop on Google. "It's there, but you'd better know how to provision it yourself, down to the command-line level," he said.
But for some companies, Qubole's Thusoo said, Google is the perfect fit, particularly if price and performance matter the most. "We've done benchmarks on Google for price and performance, and Google takes the lead there," he said. "The price/performance thing is typically very important for startups, so Google is a great choice for them."
Azure: The power of Hadoop
The Azure platform's big data offering -- HDInsight, along with SQL database and storage -- is designed to work seamlessly with Microsoft's popular Excel spreadsheet. And that's a huge selling point with customers because, according to the Forrester report, working with big data where it is located (also known as data gravity) makes the process much easier. "I really think Azure has the edge in the hybrid space," Thusoo said, "in large part because Microsoft can leverage its on-premises presence."
HDInsight is powered by Apache Hadoop, which is also a draw, Forrester's Gualtieri said. "Microsoft has a number of different tools including machine learning and predictive analytics," he said. "And for anyone who wants to do big data analytics with Hadoop, Azure is a very good option because it's easy to provision and has a powerful control panel."
Altiscale: All big data all the time
Three-year-old Altiscale was started to provide Hadoop as a service in its own cloud. The company's founder, Raymie Stata, was CTO at Yahoo and developed Hadoop as a service for the Web giant before launching Altiscale. "From its very essence Altiscale was designed to be unique," Altiscale COO Mike Maciag explained. Users can get "generic" big data services from the big three cloud providers, but Altiscale's cloud is custom designed from the hardware up to run big data more efficiently, Maciag said. "The big cloud providers offer good compute-intensive functionality for a lot of 'North/South' processing, but big data is more massively parallel processing, meaning it is 'East/West' traffic. Altiscale was purpose-built for Hadoop to avoid the noisy neighbor problem."