EDGEWATER EXPERT’S CORNER: The Pros and Cons of Exposing Data Warehouse Content via Object Models

So you’re the one who’s responsible for your company’s enterprise reporting environment. Over the years, you have succeeded in building out a very stable yet constantly expanding and diversifying data warehouse, a solid end-user reporting platform, great analytics, and flashy corporate dashboards. You’ve done all the “heavy lifting” associated with integrating data from literally dozens of source systems into a single cohesive environment that has become the go-to source for any reporting need.

Within your EDW, there are mashup entities that exist nowhere else in the corporate domain, and now you are informed that some of the warehouse content you have created will be needed as source data for a new customer service site your company is building.

So what options do you have to accommodate this? The two most common approaches that come to mind are: a) generating extracts to feed to the subscribing application on a scheduled basis; or b) giving the application development team direct access to the EDW tables and views. Both methods have no shortage of pros and cons.

  • Extract Generation – Have the application development team identify the data they want up front and, as a post-process to your nightly ETL run cycles, dump the data out to the file system, leaving consumption up to the subscribing apps.
    Pros:
      • A dedicated extract is a single daily/nightly operation that will not impact other subscribers to the warehouse.
      • Application developers will not be generating ad hoc queries that could negatively impact performance for other subscribing users’ reporting operations and analytics activity.
    Cons:
      • You’re uncomfortable publishing secure content to a downstream application environment that may not have the same stringent user-level security measures in place as the EDW has.
      • Generating extracts containing large amounts of content may not be the most efficient method for delivering needed information to subscribing applications.
      • Nightly dumps or extracts will only contain EDW data that was available at the time the extracts were generated and will not contain the near-real-time content that is constantly being fed to the EDW – and that users will likely expect.
  • Direct Access – Give the subscribing application developers access to exposed EDW content directly so they can query tables and views for the content they want as they need it.

 

    Pros:
      • It’s up to the application development team to get what they need, how they need it and when they need it.
      • More efficient than nightly extracts as the downstream applications will only pull data as needed.
      • Near-real-time warehouse content will be available for timely consumption by the applications.
    Cons:
      • You’re uncomfortable exposing secure content to application developers that may not have the same stringent user-level security measures in place as the EDW has.
      • Application developers will be generating ad hoc queries that could negatively impact performance for other subscribing users’ reporting operations and analytics activity.

 

While both of the above options have merits, they also have a number of inherent limitations – with data security being at the top of the list. Neither of these approaches enforces the database-level security that is already implemented explicitly in the EDW – side-stepping this existing capability will force application developers to either reinvent that wheel or implement some broader, but generally less stringent, application-level security model.

There is another option, though, one we seldom consider as warehouse developers. How about exposing an object model that represents specific EDW content consistently and explicitly to any subscribing applications? You may need to put on your OLTP hat for this one, but hear me out.

The subscribing application development team would be responsible for identifying the specific objects (collections) they wish to consume and would access these objects through a secured procedural interface. On the surface, this approach may sound like you and your team will get stuck writing a bunch of very specific custom procedures, but if you take a step back and think it through, the reality is that your team can create an exposed catalog of rather generic procedures, all requiring input parameters, including user tokens – so the EDW security model remains in charge of exactly which data is returned to which users on each retrieval.
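
To make the idea concrete, here is a minimal, hypothetical sketch of what a subscribing application’s call to one of these generic “Get” procedures could look like from Java/JDBC. The procedure name (edw_api.get_customer_orders), its parameters, and the column name are assumptions for illustration only, not a prescribed interface.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class EdwGetExample {

    /** Calls a hypothetical generic Get procedure and returns one page of order ids. */
    static List<String> getCustomerOrderIds(Connection edwConnection, String userToken,
                                            String customerId, int page, int rowsPerPage)
            throws Exception {
        List<String> orderIds = new ArrayList<>();
        try (CallableStatement call = edwConnection.prepareCall(
                "{call edw_api.get_customer_orders(?, ?, ?, ?)}")) {
            call.setString(1, userToken);   // EDW security model decides what this user may see
            call.setString(2, customerId);  // content control: required filter criteria
            call.setInt(3, page);           // page number
            call.setInt(4, rowsPerPage);    // rows per page, to keep network traffic in check

            // Depending on the RDBMS and driver, rows may instead arrive via a
            // ref-cursor output parameter rather than a returned result set.
            if (call.execute()) {
                try (ResultSet rs = call.getResultSet()) {
                    while (rs.next()) {
                        orderIds.add(rs.getString("order_id"));
                    }
                }
            }
        }
        return orderIds;
    }
}
```

The important point is that the EDW team publishes and controls every procedure in this catalog; the subscribing application never sends its own SQL.
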

The benefits of this approach are numerous, including:

  • Data Security – All requests leverage the existing EDW security model via a user token parameter for every “Get” method.
  • Data Latency – Data delivered through this interface is as current as the EDW itself, so there are none of the latency issues that come with extracted data sets.
  • Predefined Get Methods – No ad hoc or application-based SQL being sent to the EDW. Only procedures generated and/or approved by the EDW team will be hitting the database.
  • Content Control – Only the content that is requested is delivered. All Get methods returning non-static data will require input parameter values for any required filtering criteria – all requests can be validated.
  • Data Page Control – Subscribing applications will not only be responsible for identifying what rows they want via input parameters, but also for how many rows per page to return, keeping network traffic in check.
  • EDW Transaction Logging – An EDW transaction log can be implemented with autonomous logging that records every incoming request, the accompanying input parameters, the number of rows returned and the duration it took for the transaction to run. This can aid performance tuning for the actual request behaviors from subscribing applications.
  • Object Reuse – Creation of a generic exposed object catalog will allow other applications to leverage the same consistent set of objects providing continuity of data and interface across all subscribing applications.
  • Nested and N Object Retrieval – Creation of single Get methods that can return multiple and/or nested objects in a single database call (see the sketch following this list).
  • Physical Database Objects – All consumable objects are physically instantiated in the database as user-defined types based on native database data types or other user-defined types.
  • Backend Compatibility – It makes no difference what type of shop you are (Oracle, Microsoft, IBM, PostgreSQL or some other mainstream RDBMS); conceptually, the approach is the same.
  • Application Compatibility – This approach is compatible with both Java and .NET IDEs, as well as other application development platforms.
  • Reduced Data Duplication – Because data is directly published to subscribing applications, there is no need for subscribers to store that detail content in their transactional database, just key value references.
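
As a rough illustration of the nested/N object retrieval point above, the sketch below shows a single hypothetical call that returns an order collection and its order-line collection together, which the caller stitches into nested objects. The procedure, column names, and the use of multiple result sets are assumptions; an Oracle shop, for example, might instead return two ref-cursor output parameters.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class NestedGetExample {

    // Java records (Java 16+) used for brevity as the consumable object shapes.
    record OrderLine(String lineId, String sku, int qty) {}
    record Order(String orderId, List<OrderLine> lines) {}

    /** One database call returns both the parent and child collections. */
    static List<Order> getOrdersWithLines(Connection conn, String userToken,
                                          String customerId) throws Exception {
        List<Order> orders = new ArrayList<>();
        try (CallableStatement call = conn.prepareCall(
                "{call edw_api.get_orders_with_lines(?, ?)}")) {
            call.setString(1, userToken);
            call.setString(2, customerId);

            // First result set: the parent Order rows.
            if (call.execute()) {
                try (ResultSet rs = call.getResultSet()) {
                    while (rs.next()) {
                        orders.add(new Order(rs.getString("order_id"), new ArrayList<>()));
                    }
                }
            }
            // Second result set: the child OrderLine rows, attached by order_id.
            if (call.getMoreResults()) {
                try (ResultSet rs = call.getResultSet()) {
                    while (rs.next()) {
                        String orderId = rs.getString("order_id");
                        OrderLine line = new OrderLine(rs.getString("line_id"),
                                rs.getString("sku"), rs.getInt("qty"));
                        for (Order o : orders) {
                            if (o.orderId().equals(orderId)) {
                                o.lines().add(line);
                                break;
                            }
                        }
                    }
                }
            }
        }
        return orders;
    }
}
```
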

There are also a few cons that need to be weighed when considering this path:

  • EDW Table Locks – The warehouse ETL needs to be constructed so that tables publishing to the object model are not exclusively locked during load operations; this avoids brown-out situations for subscribing applications.
  • Persistent Surrogate Keys – EDW tables that are publishing data to subscribing applications via the object model will need to have persistent surrogate primary keys so that subscribing applications can locally store key values obtained from the publisher and leverage the same key values in future operations.
  • Application Connection/Session Pooling – Each application connection (session) to the EDW will need to be established as an EDW user so that security persists through to the object model, which means open connections cannot be pooled across users (see the sketch following this list).
  • Reduced Data Duplication – This is a double-edged sword in this context because subscribing applications will not be storing all EDW content locally; as a result, there may be limitations to the reporting operations of subscribing applications. However, the subscribing applications can also be downstream publishers of data to the same EDW and can report from there. Additionally, at the risk of convoluting this particular point, “set” methods can also be created that would allow the subscribing application(s) to publish relevant content directly back to the EDW, eliminating the need for batch loading back from the subscribing application(s). Probably a topic for another day, but I wanted to put it out there.
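
To illustrate the session pooling point above, here is a small hypothetical sketch of a per-user EDW session in Java/JDBC. The URL and credential handling are assumptions, and some RDBMS vendors offer proxy or impersonation sessions that can soften this constraint.

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class PerUserEdwSession {
    // Each end user gets a dedicated EDW session under their own credentials,
    // rather than a connection borrowed from a shared, generically
    // authenticated pool, so the warehouse's security model sees the real user.
    static Connection openSessionFor(String edwUser, String edwPassword) throws Exception {
        return DriverManager.getConnection(
                "jdbc:oracle:thin:@//edw-host:1521/EDW", edwUser, edwPassword);
    }
}
```
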

 

So, does that sound like something that you may just want to explore? For more information on this or any of our offerings, please do not hesitate to reach out to us at makewaves@edgewater.com. Thanks!

Cloud Computing Trends: Thinking Ahead (Part 3)

In the first part of this series we discussed the definition of cloud computing and its various flavors. The second part focused on the offerings from three major players: Microsoft, Amazon, and Google. This third and final part discusses the issues and concerns related to the cloud, as well as possible future directions.

A company may someday decide to bring its application back in-house due to data security or cost-related concerns. An ideal solution would allow the creation of a “private in-house cloud,” much as some product/ASP companies offer the option of running a licensed version in-house or as a hosted service. A major rewrite of existing applications in order to run in a cloud is probably also a non-starter for most organizations.

Monitoring and diagnosing applications in the cloud is another concern. Developers must be able to diagnose and debug in the cloud itself, not just in a simulation on a local desktop. Anyone who has spent enough time in the trenches coding and supporting complex applications knows that trying to diagnose complex intermittent problems in a production environment by debugging on a simulated desktop environment is going to be an uphill battle, to say the least. A credible and sophisticated mechanism is needed to support complex applications running in the cloud.

Data and meta-data ownership and security may also give companies dealing with sensitive information pause. The laws and the technology are still playing catch-up on some thorny issues around data collection, distribution rights, liability, and so on.

If cloud computing is to truly fulfill its promise, the technology has to evolve and the major players have to ensure that a cloud can be treated like a commodity, allowing applications to move seamlessly between clouds without requiring a major overhaul of the code. At least some of the major players in cloud computing today don’t have a good history of allowing cross-vendor compatibility and are unlikely to jump on this bandwagon anytime soon. They will likely fight any effort or trend to commoditize cloud computing. However, based on the history of other platform paradigm shifts, they would be fighting against market forces and the desires of their clients. Similar situations in the past have created opportunities for other vendors and startups to offer solutions that bypass the entrenched interests and deliver what the market is looking for. It is not too hard to imagine an offering or a service that can abstract away the actual cloud running the application.

New design patterns and techniques may also emerge to make the transition from one cloud vendor to another easier. Not too long ago this role was played by design patterns like the DAO (data access object) and various OR (object-relational) layers, which reduced database vendor lock-in. A similar trend could evolve in cloud-based applications.
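
As a purely hypothetical sketch of that kind of abstraction (the interface and class names below are invented for illustration, not an existing library), application code would depend on a small vendor-neutral interface, with one adapter per cloud provider:

```java
/**
 * A minimal, vendor-neutral storage abstraction in the spirit of the DAO
 * pattern: application code programs against BlobStore, and switching cloud
 * providers means swapping the adapter, not rewriting the application.
 */
public interface BlobStore {
    void put(String container, String key, byte[] data);
    byte[] get(String container, String key);
}

/** One adapter per vendor; each delegates to that vendor's SDK or REST API. */
class AmazonS3BlobStore implements BlobStore {
    @Override
    public void put(String container, String key, byte[] data) {
        // delegate to the AWS SDK here
    }

    @Override
    public byte[] get(String container, String key) {
        // delegate to the AWS SDK here
        return new byte[0];
    }
}
```

The same idea could apply to queues, compute provisioning, and identity; the point is simply that the cloud-specific calls live behind a seam the application never touches directly.
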

All of the above is not meant to condemn cloud computing as an immature technology that is not ready for prime time. Rather, the discussion is meant to arm the organization with the potential pitfalls of a leading-edge technology that can still be a great asset under the right circumstances. Even today’s offerings fit the classic definition of a disruptive technology. Any organization that is creating a new application or overhauling an existing one must seriously consider architecting the application for the cloud. The benefits of instant scalability and “pay for only what you use” are too significant to ignore, especially for small to mid-size companies. Not having to tie up your cash in servers and infrastructure alone warrants serious consideration, and not having to worry about setting up a data center that can handle the load in case your application goes viral is liberating, to say the least.

Any application with seasonal demand can also greatly benefit. If you are an online retailer, the load on your website probably surges to several times its average volume during the holiday shopping season. Having to buy servers to handle that peak load, servers that then sit idle for the rest of the year, ties up capital that could have been used to grow the business.

Cloud computing in its current maturity may not make sense for every enterprise to pursue. However, you should get a solid understanding of what cloud computing has to offer and adjust the way you approach IT today. This will position you to more cost-effectively capitalize on what it has to offer today and tomorrow.