Big data, one of the most popular IT topics, is often little more than a catchword, with few actionable steps or useful tools associated with it. But developers need development frameworks and big data tools. To find the big data tools you need, start with a database and usage model for information storage, explore the broad big data development environments, and identify tools that support your critical missions.
Although associating big data with Hadoop is common, database architects know that most companies already have commitments to other database technologies. In fact, Hadoop is the newest of three common database models and is often used for unstructured data. Cloudera and Hortonworks (the latter supported by Microsoft and Teradata) offer popular Hadoop distributions. The older two database models are SQL (MySQL, PostgreSQL) and NoSQL (MongoDB or Cassandra, for example). Typically, the former stores table-structured data; the latter is used increasingly for transactional data that is difficult to fit into a pure tabular model.
A subtle truth about big data is that its very bigness all but assures that all three of these database models will play some role. A big data developer's first task is to ensure that every use of big data is supported on whichever of the three broad database models is in place or is ideal for a given storage application. One way to do this is to look at databases not in terms of how they're organized but in how they're used, and to work to harmonize tools across all the database models in use or contemplated.
Harmonize database usage with the database model
Normally, database usage is categorized as being either analytics or real-time transactional. The purpose of the former is to support business decisions, normally in the form of planning; the latter is intended to sustain normal business activity by supporting and recording work. The first step in big data development is to harmonize whichever of these two usage categories is important with the database model in use or planned.
Tabular-structured data is by far the easiest to analyze, in no small part because such data is also easily summarized to produce higher-level abstractions of a large database. All popular databases, whatever their model, can be given an SQL overlay using a tool (Lingual for Hadoop, for example) to support traditional tabular SQL analytics and aggregation. Analytics tools like Jaspersoft can build an analytic reporting layer on top of database models using software-as-a-service techniques.
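The kind of tabular summarization an SQL overlay exposes can be sketched with Python's built-in sqlite3 module. The table and column names here are illustrative, not from any particular product; the point is how GROUP BY collapses detail rows into a higher-level abstraction.

```python
# A minimal sketch of SQL analytic aggregation over tabular data,
# using Python's built-in sqlite3. Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 250.0), ("west", 75.0)],
)

# Summarize detail rows into one abstraction row per region.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 350.0), ('west', 75.0)]
```

An overlay like Lingual lets essentially this same query run against data stored in Hadoop rather than a relational engine.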
Transactional integration may be more complicated because transactional or real-time processing imposes timing constraints on database access that may be difficult to accommodate, depending on the underlying database model. Apache Mahout seeks to support machine learning, data classification and filtering by expanding on Hadoop's basic MapReduce capabilities. Splice Machine implements a real-time SQL overlay on Hadoop.
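The map/shuffle/reduce pattern that Hadoop generalizes to cluster scale can be shown in miniature. This toy word count is a sketch of the pattern only, not Hadoop or Mahout code:

```python
# A toy, single-process sketch of the map/reduce pattern that Hadoop's
# MapReduce runs at cluster scale. All function names are illustrative.
from collections import defaultdict

def map_phase(records):
    # Map: emit (key, 1) pairs, here keyed by word.
    for record in records:
        for word in record.split():
            yield word, 1

def reduce_phase(pairs):
    # Shuffle and reduce: group pairs by key, then aggregate each group.
    groups = defaultdict(int)
    for key, value in pairs:
        groups[key] += value
    return dict(groups)

counts = reduce_phase(map_phase(["big data", "big tools"]))
print(counts)  # {'big': 2, 'data': 1, 'tools': 1}
```

Libraries like Mahout build classification and filtering algorithms out of exactly these distributed map and reduce steps.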
Determine the appropriate development environment
Next, you need to determine the development environment itself. You have three broad options: develop applications directly from big data, build application environments by filtering and preprocessing, or use analytics tools rather than programming.
Python is probably the most popular big data programming tool because it balances sophistication and ease of use, but software companies like IBM, Microsoft and Oracle all offer development tools that support all the popular programming languages. Try to match language choices to the skill level of the developers expected to work with big data. Tools like Talend may be helpful in creating application connections.
Where data must be processed and filtered or database abstraction is to be used, it's important to develop a database model and derivation diagram, and to impose summarization schedules to control the extent to which database subsets are kept up to date. As noted, tools like Mahout or the native MapReduce may be useful in this approach.
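A summarization schedule of the kind described above can be sketched as a freshness window on a derived dataset: the summary is recomputed from the detail data only when it is older than the window. This is a hypothetical illustration, not a named tool's API.

```python
# Hypothetical sketch of a summarization schedule: a derived summary is
# refreshed from detail data only when older than a freshness window,
# which bounds how up to date each database subset is kept.
import time

class ScheduledSummary:
    def __init__(self, source, summarize, max_age_seconds):
        self.source = source          # detail data the summary derives from
        self.summarize = summarize    # aggregation function
        self.max_age = max_age_seconds
        self._value = None
        self._refreshed_at = None

    def get(self, now=None):
        now = time.time() if now is None else now
        stale = (self._refreshed_at is None
                 or now - self._refreshed_at > self.max_age)
        if stale:
            self._value = self.summarize(self.source)
            self._refreshed_at = now
        return self._value

detail = [3, 1, 4, 1, 5]
summary = ScheduledSummary(detail, sum, max_age_seconds=3600)
print(summary.get(now=0))     # 14 (computed from detail)
detail.append(9)
print(summary.get(now=60))    # 14 (still within the freshness window)
print(summary.get(now=7200))  # 23 (window expired, refreshed)
```

Tuning the window per subset is how a derivation diagram's schedules trade query cost against currency.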
Analytics tools offer the easiest approach to big data development. Most users look for analytics tools from their primary software company, including IBM, Microsoft and Oracle, but it also pays to look to specialists like Pivotal, Cloudera, MapR, SAP and Teradata. Jaspersoft and Pentaho can supplement basic database model tools, and the latter is also helpful as a data integration tool. If there are no expectations for real-time transaction support, consider a specialty analytics tool from Actian, HP Vertica, Infobright, MariaDB or Kognitio. Recently, however, some players in this group have exited the market, so take care to validate the financial viability of any choice you make.
Selecting big data tools or utilities
Where massive big data changes are contemplated, users report that HP and SAP are both highly adaptable in their big data approaches, and both support all three models of database deployment. This agility may be helpful when you are trying to harmonize a number of disparate database approaches (e.g., after a major application change or corporate merger).
The final point in big data development is the selection of big data tools or utilities to support the database and application environment you're creating. A wide range of tools are available to analyze data for errors, deduplicate information, add in "data appends" to classify companies, build governance and compliance practices, and manage version control and currency of summary and filtered data. Given that these tools are often specialized by database model or even implementation, the important thing is to consider the full scope of your big data efforts, including but not limited to development and analytics. When in doubt, it's best to pick the strategy that has the most flexibility because it's difficult to predict just what direction your big data will take, particularly in the early stages.
It's also important to review the intersection between big data and the cloud. Users are increasingly interested in cloud tools like those from Amazon. In fact, Amazon's "big data lifecycle stages" of Collect > Stream > Store > RDBMS | Data Warehouse | NoSQL > Analytics > Archive are a good place to start any assessment of big data where cloud computing is likely to be involved. Even if you don't see an immediate cloud connection, it's almost inevitable that big data applications and the cloud will merge into a common set of practices. Best be ready for it.
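Those lifecycle stages can be pictured as a simple pipeline. The sketch below is illustrative only, not AWS code, and every name in it is hypothetical; it just shows collected events streaming into a working store, feeding analytics, and finally moving to an archive.

```python
# An illustrative pipeline modeled on the Collect > Stream > Store >
# Analytics > Archive lifecycle stages. All names are hypothetical.
store, archive = [], []

def collect(raw_events):
    # Collect/Stream: normalize incoming events one at a time.
    for event in raw_events:
        yield {"value": event}

def persist(events):
    # Store: land each event in the working store
    # (an RDBMS, data warehouse or NoSQL store in practice).
    for event in events:
        store.append(event)

def analyze():
    # Analytics: summarize what the working store holds.
    return sum(e["value"] for e in store)

def archive_stage():
    # Archive: move aged data out of the working store.
    archive.extend(store)
    store.clear()

persist(collect([10, 20, 30]))
print(analyze())                 # 60
archive_stage()
print(len(archive), len(store))  # 3 0
```

Mapping your own applications onto these stages is a quick way to see which cloud services (or on-premises equivalents) each stage would require.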