Surprises in SQL – State-of-the-art options in the standard query language

New Voice

© Lead Image © Olly, fotolia.com

© Lead Image © Olly, fotolia.com

Article from Issue 183/2016
Author(s):

In recent times, many developers have come to view SQL as inflexible and limited, but this classical database language has some tricks and special features that many users don't know about.

State-of-the-art SQL can do much more than you might think. Despite its popular image as a fairly limited database tool, SQL is no longer restricted to the relational data model but can also handle nested objects and structured documents, features more commonly associated with later technologies like NoSQL. Of course, it all depends on what you call SQL. Not all vendors implement all the features of the various SQL standards that have appeared through the years. In this article, I take you on a tour of some of interesting tricks available through standards-based SQL.

92 and 99

The SQL:92 standard is the starting point for the complete, classical SQL database system we think of today. SQL:92 was already the second major version of the SQL standard, and it achieved a certain level of completeness as an embodiment of the classic relational model. However, developers knew even back in 1992 that the relational model is not ideal for all data.

The third major version of the SQL standard in 1999 brought an end to the plain vanilla relational SQL. All signs pointed to object-oriented programming. The standard featured the concept of the object-relational database, but a couple of years too late as it turned out. Object-relational mappers (ORMs) had already begun to build a bridge between object-oriented programming and the relational data model.

SQL:1999 also introduced some other new features, such as loops. Even though SQL:1999 broke with many traditions, it still remained a declarative language. In other words, it is impossible by definition to tell a database how to execute a query. As long as the results are okay, the database has full freedom. This declarative nature made it difficult for the standard to address loops, because a loop defines the solution's approach. The trick SQL used in SQL:1999 was to define constructs that can only be executed as loops. Programmers can then use these constructs as loops, which should not really exist in a declarative language.

As an example, consider the problem of converting the PHP and SQL code from Listing 1 into pure SQL. The example starts by loading a list of categories into a PHP array and then runs a further query for each category; the query returns the three most popular products from each category.

Listing 1

PHP SQL Pseudocode

 

Performance-conscious users could object at this point that placing database queries in loops of an imperative language – PHP in this case – could cause trouble. A join is typically a better option to make sure that the database returns all the required data at the same time. In this case, there is a problem: the example cannot be handled with a simple join. The examples from Listing 2 show that, no matter how you twist and turn, the result is always the three most popular products all told – and not the three most popular per category.

Listing 2

Two Attempted Joins

 

The problem is that LIMIT does not act in a category-specific way. To achieve the desired results, you need to run LIMIT in a subquery that is restricted to a specific category (Listing 3). However, this is no longer valid SQL; subqueries in the FROM clause cannot access data external to the subquery.

Listing 3

Invalid Subquery

 

The WHERE clause that uses k.category to access a table external to the subquery is thus invalid. Or at least, that was the case in SQL:92. SQL:1999 supports this kind of access if the user precedes the subquery with the new LATERAL keyword (see Listing 4).

Listing 4

With LATERAL

 

This query is equivalent to the PHP code in Listing 1, except that the database executes the loop itself and the latencies between the application and the database are thus avoided. Figure 1 shows the similarities between the Foreach loop and LATERAL.

Figure 1: PHP pseudocode and SQL LATERAL loops compared.

Another advantage of using SQL for everything is that the user can process the results with SQL downstream. For example, you could sort the overall results differently using an ORDER BY clause or write the results directly to a caching table with INSERT INTOSELECT … . The latter case avoids transporting data from the database to the application and then back to the database.

The benefit of being able to process the results with SQL farther down the line applies to any SQL query, of course. As a general rule, users should not assume that SQL databases are just storage bins. Data processing is often easier with SQL than with other programming languages. The results are typically more correct and the performance better. This approach only fails when faced with massively parallel access, such as experienced by Google and Facebook. If you are that big, however, you will definitely have the means to build a proprietary solution, and until you get there, the flexibility that SQL offers is often the better approach.

SQL:1999 introduced a second construct that can also be used like a loop: WITH RECURSIVE. The details of how this works are fairly complex and well beyond the scope of this article, but Figure  2 shows the basics.

The WITH RECURSIVE variant has three benefits:

  • It is better supported by today's databases than LATERAL – for example, it is also supported by SQLite (see Figure  3).
  • Data can pass from one iteration to the next.
  • It is possible to formulate a dynamic termination condition.

The disadvantage is that the loop body cannot be transferred one-to-one to SQL: It must be merged with the part after UNION ALL.

WITH RECURSIVE supports some important use cases. The query shown in Figure 2 is a row generator; it simply returns 10 numbered rows – very practical for generating test data. A more important use case is traversing graphs, such as finding the shortest connection between two persons on a social network.

Figure 2: A WITH RECURSIVE loop in SQL.

Rapid Steps – SQL:2003

After ditching relational-only thought patterns in SQL:1999, in only took four years for the next major revision of the SQL standard. The focus of SQL:2003 was on two points: XML and analytical functions.

XML support is interesting because SQL databases became document stores, as one would call them today. In SQL:2003, users can store XML both as text and as validated documents that can be processed using SQL and XQuery.

Although some databases support the XML extension today, XML was unable to assert itself in web development. Its competitor JSON was simply too attractive. All popular SQL databases have introduced features for handling JSON documents in recent years. These extensions are purely proprietary – each database offers a different feature set.

But the second SQL:2003 focus – analytic functions – has asserted itself. In particular, the window functions are supported by many databases today, and they vastly simplify data preparation.

A window function lets the database programmers use aggregate functions, such as SUM or COUNT, without GROUP BY. Of course, you still need to define the rows to use for the aggregate, but if you don't use GROUP BY, because rows are grouped, you can now use the new OVER clause directly after the aggregate function.

The following query clarifies the effect: it adds an additional column to a query (SELECT *) that has the number of rows in the overall result when done:

SELECT *
 , COUNT(*) OVER ()
 FROM [...]

It makes no difference how you proceed: JOIN, WHERE, HAVING, GROUP BY, ORDER BY  – everything is possible. The COUNT(*) OVER() window does not have any side effects on the rest of the query. The empty bracket in the OVER clause means the COUNT function runs against all the rows of the result.

The window function essentially says: Count all rows! The result of this query is thus an additional column containing the number of rows in the result. To clarify things once again, the row count comes with each row and is thus returned multiple times; however, this does not mean it is determined multiple times!

Of course, there is no need to return the same results umpteen times with the window function. You can use the OVER clause to define the rows against which to run the function. The most important tool for defining the rows is PARTITION BY:

SELECT *
 , COUNT(*) OVER (PARTITION BY category)
 FROM [...]

You can read the preceding expression as: Count the rows with the same value in the category column!. In other words, PARTITION BY delimits the rows just like GROUP BY, but it does not group the rows like GROUP BY does; instead, it simply specifies the rows to which the window function is applied.

If you have multiple rows of the same category, you will receive the same result in each case of these rows. Remember that a window function does not have any side effects on the remaining results of the query. In particular, this means that multiple window functions can be used in a single query without them influencing one another. As an example, if you need both the number of rows in the overall results, and the number of rows per category, you can use the relevant above examples together in a query.

The rows that are visible for a window function can be further delimited if the rows are sorted. That allows OVER clauses to mean something like, all rows before or three rows before to rows after. "Before" and "after" refer to a sort order, which is freely definable using ORDER BY in the OVER clause. This means you can, say, compute a subtotal (Sum of all rows before) or a moving average (Average of three rows before to three rows after) Listing 5); value and time are the column names in this example.

Listing 5

ORDER BY with OVER()

 

The formulation of the BETWEEN range follows similar rules to BETWEEN in the WHERE clause: The start must be somewhere before the finish, and the specified values themselves are part of the range. The AVG function in the example thus generates the average value of up to seven rows: the three rows before, the current row itself, and the next three rows. If you don't have enough rows to fill the window, then it is smaller – this would be the case in the first row, for example, which cannot be preceded by three rows.

You can define the window size by row, but also by value range. For example, you can define windows that cover three days before and three days after  – no matter how many rows these days cover (Listing 6) – just define the BETWEEN range with RANGE instead of ROWS.

Listing 6

With Value Ranges

 

In addition to the well-known aggregate functions, SQL:2003 also introduced ranking functions that strictly require an OVER clause with ORDER BY.

Not Just Aggregate Functions

These ranking functions in particular are ROW_NUMBER, RANK, and DENSE_RANK. As the name would suggest, ROW _NUMBER lets you enumerate rows. RANK and DENSE_RANK are less intuitive. Both return a rank as per the ORDER BY clause. RANK and DENSE_RANK differ from ROW_NUMBER in that ex aequo placements (ties) with RANK and DENSE_RANK take the same rank, as is typical in sports. In other words, you could have two first-place contenders.

RANK and DENSE_RANK differ in the question of how to rank the next contender. RANK omits a placement in this case  – thus returning 3; DENSE_RANK uses placements without gaps – that is, 2 in this example.

SQL:2011 introduced more window functions for access to individual rows of a sorted data window. LAG and LEAD let the user access the previous and next row. FIRST_VALUE, LAST_VALUE, and NTH_VALUE provide access to the first, last, or nth row of the window. The example in Listing 7 illustrates this function by means of a competition.

Listing 7

LAG and FIRST

 

The first column determines the placement using RANK. This step is followed by the name and the number of points achieved. The last two columns show the gap to the competitor just ahead and to the winner. The query uses LAG to access the score of the competitor in front and FIRST_VALUE to see how many points the winner has. The number of points achieved by the current candidate is deducted (-points) to give you the required gaps.

A word of warning on this example: this OVER clause uses ORDER BY without using BETWEEN to delimit the window. In this case, a default clause takes effect: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This default is surprising for two reasons. First, the subsequent rows are excluded from the window. Second, this exclusion is based on RANGE not on ROWS. In other words, you still have competitors with the same score in the window.

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy Linux Magazine

SINGLE ISSUES
 
SUBSCRIPTIONS
 
TABLET & SMARTPHONE APPS
Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

  • MySQL Workbench 5.1

    A small database is easy to plan on paper, but the structure quickly becomes more complex as you add more elements. MySQL Workbench can help you keep the tables arranged.

  • Data Management

    Open source database management systems offer greater flexibility and lower costs while avoiding vendor lock-in. Finding the right one depends on your project's needs.

  • Patterns in the Archive

    To help him check his Google Drive files with three different pattern matchers, Mike builds a command-line tool in Go to maintain a meta cache.

  • Workspace: phpMyFAQ

    Transform information into knowledge by deploying a phpMyFAQ-based database of frequently asked questions.

  • Top Ten Tops

    The famous Unix admin utility known as Top has many imitators. We take a look at some of the top Top tools.

comments powered by Disqus
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters

Support Our Work

Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.

Learn More

News