EDGEWATER EXPERT’S CORNER: Diving into the Deeper End of SQL – Part 1

SQL is something of a funny language insofar as nearly every developer I have ever met believes they are “fluent” in it, but the fact of the matter is that most developers just wade around in the shallows and never really dive into the deep end. Instead, from time to time we get pushed into the deep end, learning additional bits and pieces of the language simply to keep from drowning.

The real challenge here is that there are several dialects of SQL and multiple SQL-based procedural languages (e.g., PL/SQL, T-SQL, Watcom-SQL, PL/pgSQL, NZPLSQL), and not everything you learn in one dialect is implemented the same way in the others. The ANSI/ISO SQL standard was created in 1986 with the objective of SQL interoperability across RDBMS products. Unfortunately, since the inception of that standard, and through every subsequent revision (eight in all), no database vendor has adhered strictly to it. Individual vendors instead choose to add their own extensions to the language to provide additional functionality. Some of these extensions come full circle and get folded into later versions of the standard; others remain product specific.

Something of a long-winded introduction, but necessary for what I want to discuss. Over the coming months I will be posting some write-ups on the deeper end of SQL, discussing topics aimed at expanding our SQL vocabularies. Today, I want to talk about window functions. These were introduced as part of the 2003 revision to the ANSI/ISO SQL standard. Window functions are probably one of the most powerful extensions to the SQL language ever introduced, and most developers – yes, even the ones who consider themselves fluent in SQL – have never even heard of them. The short definition of a window function is a function that performs a calculation or aggregation across a set of rows within a partition of a dataset that have something in common. Something of a lackluster definition, you say? I agree, but before you click away, take a peek at the examples below and I am sure you’ll find something useful.

For starters, I would like to explain what a “window” of data is. Simply put, a window of data is a group of rows in a table or query that share common, partitionable attributes. In the table below, I have highlighted 5 distinct windows of data, partitioned by department. In general, data windows can be created from virtually any foreign key that repeats in a dataset, or from any other repeating value. [Image]
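
A quick way to see these windows in action is to attach an aggregate to each row without collapsing the rows. Here is a minimal sketch, assuming a classic EMP-style table with hypothetical ENAME, DEPTNO and SAL columns:

SELECT ename,
       deptno,
       sal,
       COUNT(*) OVER (PARTITION BY deptno) AS dept_headcount,
       SUM(sal) OVER (PARTITION BY deptno) AS dept_salary_total
FROM   emp;

Every row keeps its identity, but the COUNT and SUM are calculated across all rows of that row’s department window.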

Example 1: Ranked List Function – In this example, using the RANK function, I will create a ranked list of employees in each department by salary. Probably not the most exciting example, but think about alternate methods of doing the same thing without the RANK function, and the simple query below gets really ugly, really quickly. [Image]
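
A minimal sketch of such a query, again assuming the hypothetical EMP table from above:

SELECT deptno,
       ename,
       sal,
       RANK() OVER (PARTITION BY deptno ORDER BY sal DESC) AS salary_rank
FROM   emp;

Note that RANK leaves gaps after ties: two employees tied at rank 2 are followed by rank 4.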

Example 2: Dense Ranked List Function – Similar to the RANK function, but the DENSE_RANK value is the same for members of the window having the same salary value, and no gaps are left in the ranking sequence. [Image]
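
The only change from the previous sketch is the function name:

SELECT deptno,
       ename,
       sal,
       DENSE_RANK() OVER (PARTITION BY deptno ORDER BY sal DESC) AS salary_rank
FROM   emp;

Here, tied employees still share a rank, but the sequence continues without gaps (1, 2, 2, 3 rather than 1, 2, 2, 4).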

Example 3: FIRST and LAST Functions – Using the FIRST and LAST functions, we can easily get the MIN and MAX salary values for the department window and include them with our ranked list. Yup, you are sitting on one row in the window and looking back to the first row and forward to the last row of the same window, all at the same time! No cursors needed! [Image]
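
In Oracle, one way this capability surfaces is through the FIRST_VALUE and LAST_VALUE analytic functions. A sketch using the same hypothetical EMP table; note that LAST_VALUE needs an explicit windowing clause because the default frame stops at the current row:

SELECT deptno,
       ename,
       sal,
       RANK() OVER (PARTITION BY deptno ORDER BY sal DESC) AS salary_rank,
       FIRST_VALUE(sal) OVER (PARTITION BY deptno ORDER BY sal DESC) AS max_dept_sal,
       LAST_VALUE(sal)  OVER (PARTITION BY deptno ORDER BY sal DESC
                              ROWS BETWEEN UNBOUNDED PRECEDING
                                       AND UNBOUNDED FOLLOWING)     AS min_dept_sal
FROM   emp;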

Example 4: LEAD and LAG Functions – These two are without a doubt a couple of the most amazing functions you will ever use. The LAG function allows us to be sitting on one row in a data window and then look back at any previous row in the window of data. Conversely, the LEAD function allows us to be sitting on one row in a data window and then look forward at any upcoming row in the window of data.

Syntax:

LAG (value_expression [,offset] [,default]) OVER ([query_partition_clause] order_by_clause)

LEAD (value_expression [,offset] [,default]) OVER ([query_partition_clause] order_by_clause)

In the illustration below, from within the context of the data window, I am looking up at the previous record and down at the next record and presenting that data as part of the current record. To look further ahead or behind in the same data window, simply change the value of the offset parameter. Prior to the introduction of these functions, mimicking the same functionality without a cursor was essentially impossible; now, with a single simple line of code, I can look up or down at other records from the current record. Just too darn cool! [Image]
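
A minimal sketch of both functions against the hypothetical EMP table, looking one row back and one row ahead within each department ordered by salary:

SELECT deptno,
       ename,
       sal,
       LAG(ename, 1, 'N/A')  OVER (PARTITION BY deptno ORDER BY sal DESC) AS prev_employee,
       LEAD(ename, 1, 'N/A') OVER (PARTITION BY deptno ORDER BY sal DESC) AS next_employee
FROM   emp;

Changing the offset from 1 to 2 would look two rows back or ahead instead, and the third argument supplies the default value returned at the edges of the window.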

Example 5: LEAD and LAG Functions – Just another example of what you can do with the LEAD and LAG functions to get you thinking. In this example, our billing system has a customer credit limit table where, for each customer, a single record is active and historical data is preserved in inactive records. We want to add this table to our data warehouse as a type-2 dimension, which means we need to end-date and key all the records as part of the process. We could write a cursor and loop through the records multiple times to calculate the end dates and then post them to the data warehouse… or, using the LEAD function, we can calculate each record’s end date from the create date of the next record in the window (see the sketch below the illustrations). The two illustrations depict the data in the source (billing system) and then in the target data warehouse table. All of this with just a dozen lines of SQL using window functions – how many lines of code would this take without them?

Data in source billing system. [Image]

Transformed data for load to data warehouse as T-2 dimension. [Image]
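
A minimal sketch of the transformation, assuming a hypothetical CUSTOMER_CREDIT_LIMIT source table with CUSTOMER_ID, CREDIT_LIMIT and CREATE_DATE columns: each record is end-dated by the create date of the next record for the same customer, and the newest record gets a far-future end date.

SELECT customer_id,
       credit_limit,
       create_date AS effective_date,
       LEAD(create_date, 1, DATE '9999-12-31')
           OVER (PARTITION BY customer_id ORDER BY create_date) AS end_date,
       CASE WHEN LEAD(create_date)
                 OVER (PARTITION BY customer_id ORDER BY create_date) IS NULL
            THEN 'Y' ELSE 'N'
       END AS active_flag
FROM   customer_credit_limit;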

Example 6: LISTAGG Function – The LISTAGG function allows us to return the values of a column from multiple rows as a single delimited column, aka a “multi-valued field” – remember PICK or Revelation? [Image]
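
A minimal sketch producing one delimited list of employee names per department:

SELECT deptno,
       LISTAGG(ename, ', ') WITHIN GROUP (ORDER BY ename) AS dept_employees
FROM   emp
GROUP  BY deptno;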

One closing note: all of the examples shown were in Oracle, but equivalent functionality also exists in Microsoft SQL Server, IBM DB2, Netezza and PostgreSQL.

So what do you think? Ready to dive into the deep end and try some of this? At Edgewater Consulting, we have over 25 years of successful database and data warehouse implementations behind us, so if you’re still wading in the kiddie pool – or worse yet, swimming with the sharks – give us a call and we can provide you with a complimentary consultation with one of our database experts. To learn more about our consulting services, download our new Digital Transformation Guide.

EDGEWATER EXPERT’S CORNER: The Pros and Cons of Exposing Data Warehouse Content via Object Models

So you’re the one that’s responsible for your company’s enterprise reporting environment. Over the years, you have succeeded in building out a very stable and yet constantly expanding and diversifying data warehouse, a solid end-user reporting platform, great analytics and flashy corporate dashboards. You’ve done all the “heavy lifting” associated with integrating data from literally dozens of source systems into a single cohesive environment that has become the go-to source for any reporting needs.

Within your EDW, there are mashup entities that exist nowhere else in the corporate domain and now you are informed that some of the warehouse content you have created will be needed as source data for a new customer service site your company is creating.

So what options do you have to accommodate this? The two most common approaches that come to mind are: a) generating extracts to feed to the subscribing application on a scheduled basis; or b) just give the application development team direct access to the EDW tables and views. Both methods have no shortage of pros and cons.

  • Extract Generation – Have the application development team identify the data they want up front and, as a post-process to your nightly ETL run cycles, dump the data to the OS, leaving consumption up to the subscribing apps.

Pros:
  • A dedicated extract is a single daily/nightly operation that will not impact other subscribers to the warehouse.
  • Application developers will not be generating ad hoc queries that could negatively impact performance for other subscribing users’ reporting operations and analytics activity.

Cons:
  • You’re uncomfortable publishing secure content to a downstream application environment that may not have the same stringent user-level security measures in place as the EDW has.
  • Generating extracts containing large amounts of content may not be the most efficient method for delivering needed information to subscribing applications.
  • Nightly dumps or extracts will only contain EDW data that was available at the time the extracts were generated and will not contain the near-real-time content that is constantly being fed to the EDW – and that users will likely expect.
  • Direct Access – Give the subscribing application developers access to exposed EDW content directly so they can query tables and views for the content they want as they need it.

Pros:
  • It’s up to the application development team to get what they need, how they need it and when they need it.
  • More efficient than nightly extracts, as the downstream applications will only pull data as needed.
  • Near-real-time warehouse content will be available for timely consumption by the applications.

Cons:
  • You’re uncomfortable exposing secure content to application developers that may not have the same stringent user-level security measures in place as the EDW has.
  • Application developers will be generating ad hoc queries that could negatively impact performance for other subscribing users’ reporting operations and analytics activity.

While both of the above options have merits, they also have a number of inherent limitations – with data security being at the top of the list. Neither of these approaches enforces the database-level security that is already implemented explicitly in the EDW – side-stepping this existing capability will force application developers to either reinvent that wheel or implement some broader, but generally less stringent, application-level security model.

There is another option, though, one we seldom consider as warehouse developers. How about exposing an object model that represents specific EDW content consistently and explicitly to any subscribing applications? You may need to put on your OLTP hat for this one, but hear me out.

The subscribing application development team would be responsible for identifying the specific objects (collections) they wish to consume and would access these objects through a secured procedural interface. On the surface, this approach may sound like you and your team will get stuck writing a bunch of very specific custom procedures, but if you take a step back and think it through, the reality is that your team can create an exposed catalog of rather generic procedures, all requiring input parameters, including user tokens – so the EDW security model remains in charge of exactly which data is returned to which users on each retrieval.
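
To make that concrete, here is a minimal Oracle sketch of what one entry in such a catalog might look like; every name in it (CUSTOMER_OBJ, CUSTOMER_TAB, EDW_API, GET_CUSTOMERS and its parameters) is hypothetical:

CREATE OR REPLACE TYPE customer_obj AS OBJECT (
  customer_id   NUMBER,
  customer_name VARCHAR2(100),
  credit_limit  NUMBER
);
/
CREATE OR REPLACE TYPE customer_tab AS TABLE OF customer_obj;
/
CREATE OR REPLACE PACKAGE edw_api AS
  -- p_user_token keeps the EDW security model in charge of which
  -- rows are returned to which caller on every retrieval.
  FUNCTION get_customers (p_user_token IN VARCHAR2,
                          p_region     IN VARCHAR2,
                          p_page_size  IN PLS_INTEGER DEFAULT 100,
                          p_page       IN PLS_INTEGER DEFAULT 1)
    RETURN customer_tab PIPELINED;
END edw_api;
/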

The benefits of this approach are numerous, including:

  • Data Security – All requests leverage the existing EDW security model via a user token parameter for every “Get” method.
  • Data Latency – Data being delivered by this interface is as current as it is in the EDW so there are no latency issues as would be expected with extracted data sets.
  • Predefined Get Methods – No ad hoc or application-based SQL being sent to the EDW. Only procedures generated and/or approved by the EDW team will be hitting the database.
  • Content Control – Only the content that is requested is delivered. All Get methods returning non-static data will require input parameter values for any required filtering criteria – all requests can be validated.
  • Data Page Control – Subscribing applications will not only be responsible for identifying what rows they want via input parameters, but also how many rows per page, to keep network traffic in check (see the sketch after this list).
  • EDW Transaction Logging – An EDW transaction log can be implemented with autonomous logging that records every incoming request, the accompanying input parameters, the number of rows returned and the duration it took for the transaction to run. This can aid performance tuning for the actual request behaviors from subscribing applications.
  • Object Reuse – Creation of a generic exposed object catalog will allow other applications to leverage the same consistent set of objects providing continuity of data and interface across all subscribing applications.
  • Nested and N Object Retrieval – Creation of single Get methods that can return multiple and/or nested objects in a single database call.
  • Physical Database Objects – All consumable objects are physically instantiated in the database as user-defined types based on native database data types or other user-defined types.
  • Backend Compatibility – It makes no difference what type of shop you are (Oracle, Microsoft, IBM, PostgreSQL or some other mainstream RDBMS); conceptually, the approach is the same.
  • Application Compatibility – This approach is compatible with both Java and .NET IDE’s, as well as other application development platforms.
  • Reduced Data Duplication – Because data is directly published to subscribing applications, there is no need for subscribers to store that detail content in their transactional database, just key value references.
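
As a companion to the package specification sketched earlier, here is what the body of that hypothetical GET_CUSTOMERS method might look like, including the page control mentioned above. The OFFSET/FETCH syntax assumes Oracle 12c or later, the CUSTOMER_DIM table is hypothetical, and the token-validation and logging calls are site-specific, so they are elided to a comment:

CREATE OR REPLACE PACKAGE BODY edw_api AS
  FUNCTION get_customers (p_user_token IN VARCHAR2,
                          p_region     IN VARCHAR2,
                          p_page_size  IN PLS_INTEGER DEFAULT 100,
                          p_page       IN PLS_INTEGER DEFAULT 1)
    RETURN customer_tab PIPELINED
  IS
  BEGIN
    -- Validate p_user_token against the EDW security model and write
    -- the request and its parameters to the EDW transaction log here.
    FOR r IN (SELECT customer_id, customer_name, credit_limit
              FROM   customer_dim
              WHERE  region = p_region
              ORDER  BY customer_id
              OFFSET (p_page - 1) * p_page_size ROWS
              FETCH NEXT p_page_size ROWS ONLY)
    LOOP
      PIPE ROW (customer_obj(r.customer_id, r.customer_name, r.credit_limit));
    END LOOP;
    RETURN;
  END get_customers;
END edw_api;
/

A subscribing application would then retrieve a page of data with an ordinary query, for example: SELECT * FROM TABLE(edw_api.get_customers(:token, 'EAST', 100, 1));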

There are also a few cons that need to be weighed when considering this path:

  • EDW Table Locks – The warehouse ETL needs to be constructed so that tables publishing to the object model are not exclusively locked during load operations; this avoids brown-out situations for subscribing applications.
  • Persistent Surrogate Keys – EDW tables that are publishing data to subscribing applications via the object model will need to have persistent surrogate primary keys so that subscribing applications can locally store key values obtained from the publisher and leverage the same key values in future operations.
  • Application Connection/Session Pooling – Each application connection (session) to the EDW will need to be established based on an EDW user for security to persist to the object model, so no pooling of open connections.
  • Reduced Data Duplication – This is a double-edged sword in this context, because subscribing applications will not be storing all EDW content locally. As a result, there may be limitations to the reporting operations of subscribing applications. However, the subscribing applications can also be downstream publishers of data to the same EDW and can report from there. I would also point out that “Set” methods can be created which would allow the subscribing application(s) to publish relevant content directly back to the EDW, eliminating the need for batch loading back to the EDW from subscribing application(s). Probably a topic for another day, but I wanted to put it out there.

So, does that sound like something that you may just want to explore? For more information on this or any of our offerings, please do not hesitate to reach out to us at makewaves@edgewater.com. Thanks!