Data in the enterprise has always been a patchwork affair. Back before the cloud ever came along to make our lives that much more interesting, there were the competing, conflicting and contrasting data sources that lived in our own local machines: an SAP database here, an Oracle database there and a score of other bits and pieces scattered around.
Now with the rise of the cloud, the new impulse is to move everything offshore into a data center you can reach from anywhere and which in the right hands can be far more than just a passive data store. That’s the theory, anyway. There are still plenty of companies with tons of data who have it split between a cloud and more conventional services, or who still rely on the latter for everything important. Sometimes it’s cheaper that way; sometimes it’s easier; and sometimes it’s just a matter of not fixing what isn’t obviously broken.
Nevertheless, as the cloud becomes more a centerpiece of IT and less an exotic add-on to it, the problems of integrating data held on-premises with data hosted natively in the cloud become less theoretical and more practical. The data you have “out there” and the data you have “in here” aren’t just stored in different places; they may be subject to completely different data-integrity rules, and may not be easily reconciled into a single stream without major heavy lifting. Worse, it may not be possible to do that once and be done with it—such cloud data integration may have to be done as an ongoing process, the better to complement the way your whole organization works.
In this piece I’ve outlined a few of the key ways cloud data integration and on-premises data can be reconciled, each suited to its own purpose and with its own drawbacks.
The repository approach
Here, the idea is to take data from disparate sources and unify them all via a single repository, by using a product like IBM’s InfoSphere, Oracle’s Data Service Integrator, or SAP’s BusinessObjects Data Federator. These systems create a kind of meta-warehouse into which data is pulled from disparate sources and updated in real time. There’s even a certain amount of cross-compatibility between these different systems: InfoSphere, for instance, can import SAP BusinessObjects models and reports. The big upside to this approach is centrality: it becomes that much easier to create organization-wide dashboards compiled from many different sources.
On the other hand, replacing access to several disparate repositories of data with one big one doesn’t always solve the underlying problems; sometimes it just creates a new management issue (not to mention a retraining one) in place of the old integration issues. Another drawback is the work involved in building the new repository, though that much is at least partly a given: a certain amount of work will always be needed to federate cloud and local data. In short, this sort of thing works best if the ultimate goal is to have everyone in your organization eating from the same buffet table, so to speak, one which happens to be the new repository product.
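For a rough sense of what the repository approach does under the hood, here is a minimal Python sketch: rows from two imagined sources (a cloud CRM export and a local ERP table) are normalized into one shared schema so a single dashboard could query them together. Every source name, field and mapping below is invented for illustration, not any vendor’s actual API.

```python
# Hypothetical sample rows from two differently-shaped sources.
CLOUD_CRM_ROWS = [
    {"AccountName": "Acme", "AnnualRevenue": 1_200_000},
]
LOCAL_ERP_ROWS = [
    {"customer": "Beta Corp", "revenue_usd": 450_000},
]

def to_unified(row, mapping, origin):
    """Rename source-specific fields to the repository's shared schema."""
    unified = {target: row[source] for source, target in mapping.items()}
    unified["origin"] = origin  # keep provenance for auditing
    return unified

def build_repository():
    """Pull both sources into one 'meta-warehouse' with a common schema."""
    repo = []
    repo += [to_unified(r, {"AccountName": "name", "AnnualRevenue": "revenue"},
                        "cloud_crm") for r in CLOUD_CRM_ROWS]
    repo += [to_unified(r, {"customer": "name", "revenue_usd": "revenue"},
                        "local_erp") for r in LOCAL_ERP_ROWS]
    return repo

repository = build_repository()
print(sorted(row["name"] for row in repository))  # one view across both sources
```

The real products do this continuously and at scale, of course; the point of the sketch is only that the centrality comes from forcing every source into one schema up front.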
The toolkit approach
A second approach is not to use a specific product to try to reconcile local and remote data, but rather to give the enterprise a set of tools it can use to unify data selectively and specifically. These tools can then be re-used in multiple contexts or shared with other organizations. Instead of funneling everything into a single dashboard, the toolkit approach is about allowing custom connections to be forged between specific data repositories. That way, those used to a given data source—Salesforce, for instance—can continue to use that data source while getting data fed in from somewhere else.
That said, all this comes at a fair cost: the data integration has to be built from scratch. “Toolkit” means just that: you’re given the pieces and an IDE to work with them, but no assumptions are made about what kinds of data sources you’re integrating. There may be connectors for common data sources, whether SaaS products like Salesforce or more general platform sources like Oracle or SQL Server, but how they are coupled with each other is entirely up to you. Informatica’s Integration Cloud and Cloud Connector Toolkit are good examples of such systems.
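To make the contrast with the repository approach concrete, here is a minimal Python sketch of the toolkit idea: small connectors expose a common fetch interface, and the coupling between them is forged by you, case by case. The connector classes, fields and join key are hypothetical stand-ins, not drawn from Informatica’s actual SDK.

```python
from abc import ABC, abstractmethod

class Connector(ABC):
    """Minimal connector interface: every source just yields dict records."""
    @abstractmethod
    def fetch(self):
        ...

class SaaSCrmConnector(Connector):      # stands in for a SaaS source
    def fetch(self):
        yield {"id": 1, "name": "Acme", "owner": "pat"}

class LocalDbConnector(Connector):      # stands in for an on-premises table
    def fetch(self):
        yield {"id": 1, "credit_limit": 50_000}

def join_on(key, left, right):
    """Couple two connectors on a shared key -- the 'forged connection'."""
    right_index = {rec[key]: rec for rec in right.fetch()}
    for rec in left.fetch():
        merged = dict(rec)
        merged.update(right_index.get(rec[key], {}))
        yield merged

merged = list(join_on("id", SaaSCrmConnector(), LocalDbConnector()))
print(merged)
```

Note that nothing here imposes a central schema: the SaaS users keep their familiar records and simply see extra fields fed in from the local side, which is the selling point of the toolkit style.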
The process-based approach
A third possible method for unifying cloud and on-premises data was born of the need for different businesses to solve common integration problems without divulging too much about the data they’re using. Talend’s Unified Platform is one incarnation of this approach.
Much of what Talend is geared for is complementing internal business processes and allowing the methods behind those processes to be re-used. For instance, users can create custom data-matching and -validation algorithms, which can then be shared with other users and re-used in any number of contexts. Also among Talend’s components is a module called Data Quality, which compares various data sources against each other to see what they have in common and how they might be federated or synchronized automatically.
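As a toy illustration of what a custom data-matching algorithm might look like (not Talend’s actual implementation), the sketch below pairs up likely duplicate customer names from two sources using simple string similarity. The sample names and the 0.8 threshold are arbitrary choices for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_records(cloud_names, local_names, threshold=0.8):
    """Pair names from the two sources whose similarity clears the threshold."""
    matches = []
    for c in cloud_names:
        for l in local_names:
            score = similarity(c, l)
            if score >= threshold:
                matches.append((c, l, round(score, 2)))
    return matches

pairs = match_records(["Acme Inc.", "Globex"], ["ACME Inc", "Initech"])
print(pairs)  # "Acme Inc." pairs with "ACME Inc"; the rest fall below threshold
```

A production matcher would also weigh addresses, tax IDs and so on, but the re-usable unit is the same: a scoring rule plus a threshold that can be contributed back and applied in other contexts.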
Finally, Talend includes extensions for working with Hadoop, an increasingly important open source technology used for elegantly processing large amounts of data in the cloud. Like some of the cloud data integration tools described here, Hadoop requires a certain amount of heavy lifting to be used properly—it’s not something that you can simply install and run—but yields great benefits when used well.
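To make that heavy lifting a little more concrete: Hadoop’s Streaming interface lets the map and reduce steps be written as plain scripts that read lines from stdin and emit tab-separated key/value pairs, with Hadoop sorting the mapper output by key in between. The sketch below mimics that contract in-process with a word count; it runs standalone rather than under Hadoop, and the sample input is invented.

```python
from itertools import groupby

def mapper(lines):
    """Map step: emit 'word<TAB>1' for every word, as Hadoop Streaming expects."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_pairs):
    """Reduce step: sum counts per word from key-sorted mapper output."""
    split = (pair.split("\t") for pair in sorted_pairs)
    for word, group in groupby(split, key=lambda kv: kv[0]):
        yield word, sum(int(count) for _, count in group)

# sorted() stands in for Hadoop's shuffle-and-sort phase.
shuffled = sorted(mapper(["big data big cloud"]))
print(dict(reducer(shuffled)))  # {'big': 2, 'cloud': 1, 'data': 1}
```

The cluster setup, job submission and data distribution are where the real effort goes; the programming model itself, as the sketch suggests, is fairly small.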
Cloud data integration conclusions
As more businesses are incubated in the cloud rather than migrating there, cloud environments will more often serve as the starting point for business data rather than one of its endpoints or targets. That said, the number of existing businesses that keep their data locally, including in their own locally built clouds, is not going to shrink as quickly as cloud vendors would like, for a whole variety of reasons. Elegant data federation between local and cloud-based data may accelerate that process, but I suspect the greatest acceleration will come when individual products allow for a full panoply of approaches under one roof: the dashboard, the toolkit and the process- or method-based approach, all in one.
This was first published in May 2012