Surprises in SQL – State-of-the-art options in the standard query language
New Voice
In recent times, many developers have come to view SQL as inflexible and limited, but this classical database language has some tricks and special features that many users don't know about.
State-of-the-art SQL can do much more than you might think. Despite its popular image as a fairly limited database tool, SQL is no longer restricted to the relational data model but can also handle nested objects and structured documents, features more commonly associated with later technologies like NoSQL. Of course, it all depends on what you call SQL. Not all vendors implement all the features of the various SQL standards that have appeared through the years. In this article, I take you on a tour of some of the interesting tricks available through standards-based SQL.
92 and 99
The SQL:92 standard is the starting point for the complete, classical SQL database system we think of today. SQL:92 was already the second major version of the SQL standard, and it achieved a certain level of completeness as an embodiment of the classic relational model. However, developers knew even back in 1992 that the relational model is not ideal for all data.
The third major version of the SQL standard in 1999 brought an end to the plain vanilla relational SQL. All signs pointed to object-oriented programming. The standard featured the concept of the object-relational database, but a couple of years too late as it turned out. Object-relational mappers (ORMs) had already begun to build a bridge between object-oriented programming and the relational data model.
SQL:1999 also introduced some other new features, such as loops. Even though SQL:1999 broke with many traditions, it still remained a declarative language. In other words, it is impossible by definition to tell a database how to execute a query. As long as the results are okay, the database has full freedom. This declarative nature made it difficult for the standard to address loops, because a loop defines the solution's approach. The trick SQL used in SQL:1999 was to define constructs that can only be executed as loops. Programmers can then use these constructs as loops, which should not really exist in a declarative language.
As an example, consider the problem of converting the PHP and SQL code from Listing 1 into pure SQL. The example starts by loading a list of categories into a PHP array and then runs a further query for each category; the query returns the three most popular products from each category.
Listing 1
PHP SQL Pseudocode
Performance-conscious users could object at this point that placing database queries in loops of an imperative language – PHP in this case – could cause trouble. A join is typically a better option to make sure that the database returns all the required data at the same time. In this case, there is a problem: the example cannot be handled with a simple join. The examples from Listing 2 show that, no matter how you twist and turn, the result is always the three most popular products all told – and not the three most popular per category.
Listing 2
Two Attempted Joins
The problem is that LIMIT does not act in a category-specific way. To achieve the desired results, you need to run LIMIT in a subquery that is restricted to a specific category (Listing 3). However, this is no longer valid SQL; subqueries in the FROM clause cannot access data external to the subquery.
Listing 3
Invalid Subquery
The WHERE clause that uses k.category to access a table external to the subquery is thus invalid. Or at least, that was the case in SQL:92. SQL:1999 supports this kind of access if the user precedes the subquery with the new LATERAL keyword (see Listing 4).
Listing 4
With LATERAL
This query is equivalent to the PHP code in Listing 1, except that the database executes the loop itself and the latencies between the application and the database are thus avoided. Figure 1 shows the similarities between the Foreach loop and LATERAL.
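To make the comparison concrete, here is a minimal sketch in Python with the built-in sqlite3 module. It is not the original listing: the products table, its columns, and the sample rows are hypothetical. As far as I know, SQLite does not accept LATERAL, so the single-statement variant swaps in a correlated subquery in the WHERE clause, which SQLite does accept; the LATERAL form appears only as a comment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: four products in each of two categories.
conn.executescript("""
    CREATE TABLE products (name TEXT, category TEXT, sales INTEGER);
    INSERT INTO products VALUES
        ('a1', 'audio', 50), ('a2', 'audio', 40),
        ('a3', 'audio', 30), ('a4', 'audio', 20),
        ('b1', 'books', 90), ('b2', 'books', 80),
        ('b3', 'books', 70), ('b4', 'books', 60);
""")

# Variant 1: the loop lives in the application, one query per category.
looped = []
categories = conn.execute(
    "SELECT DISTINCT category FROM products ORDER BY category").fetchall()
for (category,) in categories:
    looped += conn.execute(
        "SELECT category, name, sales FROM products "
        "WHERE category = ? ORDER BY sales DESC LIMIT 3",
        (category,)).fetchall()

# Variant 2: a single statement; the database does the per-category work.
# Stand-in for LATERAL: a correlated subquery in the WHERE clause.
single = conn.execute("""
    SELECT p.category, p.name, p.sales
      FROM products p
     WHERE p.name IN (SELECT t.name
                        FROM products t
                       WHERE t.category = p.category
                       ORDER BY t.sales DESC
                       LIMIT 3)
     ORDER BY p.category, p.sales DESC
""").fetchall()

# The naive attempt: LIMIT applies to the whole result, not per category.
naive = conn.execute(
    "SELECT category, name, sales FROM products "
    "ORDER BY sales DESC LIMIT 3").fetchall()

# For comparison only (not executed here): on a database that supports
# LATERAL, the same idea could be written roughly as
#   SELECT c.category, t.name, t.sales
#     FROM (SELECT DISTINCT category FROM products) c
#    CROSS JOIN LATERAL (SELECT name, sales FROM products p
#                         WHERE p.category = c.category
#                         ORDER BY sales DESC LIMIT 3) t

print(looped == single)  # both variants return the same six rows
print(naive)             # three rows, all from one category
```

The application loop needs one round trip per category plus one for the category list; the single statement needs exactly one.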
Another advantage of using SQL for everything is that the user can process the results with SQL downstream. For example, you could sort the overall results differently using an ORDER BY clause or write the results directly to a caching table with INSERT INTO … SELECT …. The latter case avoids transporting data from the database to the application and then back to the database.
The benefit of being able to process the results with SQL farther down the line applies to any SQL query, of course. As a general rule, users should not assume that SQL databases are just storage bins. Data processing is often easier with SQL than with other programming languages. The results are typically more correct and the performance better. This approach only fails when faced with massively parallel access, such as experienced by Google and Facebook. If you are that big, however, you will definitely have the means to build a proprietary solution, and until you get there, the flexibility that SQL offers is often the better approach.
SQL:1999 introduced a second construct that can also be used like a loop: WITH RECURSIVE. The details of how this works are fairly complex and well beyond the scope of this article, but Figure 2 shows the basics.
The WITH RECURSIVE variant has three benefits:
- It is better supported by today's databases than LATERAL – for example, it is also supported by SQLite (see Figure 3).
- Data can pass from one iteration to the next.
- It is possible to formulate a dynamic termination condition.
The disadvantage is that the loop body cannot be transferred one-to-one to SQL: It must be merged with the part after UNION ALL.
WITH RECURSIVE supports some important use cases. The query shown in Figure 2 is a row generator; it simply returns 10 numbered rows – very practical for generating test data. A more important use case is traversing graphs, such as finding the shortest connection between two persons on a social network.
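A row generator of this kind can be sketched as follows, again run through Python's sqlite3 module. This is my own minimal version, not necessarily the query from the figure; the names counter and n are arbitrary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
rows = conn.execute("""
    WITH RECURSIVE counter (n) AS (
        SELECT 1              -- starting row, evaluated once
        UNION ALL
        SELECT n + 1          -- "loop body": sees n from the previous iteration
          FROM counter
         WHERE n < 10         -- termination condition
    )
    SELECT n FROM counter ORDER BY n
""").fetchall()

numbers = [n for (n,) in rows]
print(numbers)
```

The part before UNION ALL seeds the loop, and the part after it plays the role of the loop body and the termination test at the same time, which is why an imperative loop cannot be copied over one-to-one.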
Rapid Steps – SQL:2003
After ditching relational-only thought patterns in SQL:1999, it only took four years for the next major revision of the SQL standard. The focus of SQL:2003 was on two points: XML and analytical functions.
XML support is interesting because SQL databases became document stores, as one would call them today. In SQL:2003, users can store XML both as text and as validated documents that can be processed using SQL and XQuery.
Although some databases support the XML extension today, XML was unable to assert itself in web development. Its competitor JSON was simply too attractive. All popular SQL databases have introduced features for handling JSON documents in recent years. These extensions are purely proprietary – each database offers a different feature set.
But the second SQL:2003 focus – analytic functions – has asserted itself. In particular, the window functions are supported by many databases today, and they vastly simplify data preparation.
A window function lets database programmers use aggregate functions, such as SUM or COUNT, without GROUP BY. Of course, you still need to define the rows to use for the aggregate, but instead of GROUP BY, which collapses the rows into groups, you can now use the new OVER clause directly after the aggregate function.
The following query clarifies the effect: it adds a column to a query (SELECT *) that contains the number of rows in the overall result:
SELECT * , COUNT(*) OVER () FROM [...]
It makes no difference how you proceed: JOIN, WHERE, HAVING, GROUP BY, ORDER BY – everything is possible. The COUNT(*) OVER () window does not have any side effects on the rest of the query. The empty parentheses in the OVER clause mean the COUNT function runs against all the rows of the result.
The window function essentially says: Count all rows! The result of this query is thus an additional column containing the number of rows in the result. To clarify things once again, the row count comes with each row and is thus returned multiple times; however, this does not mean it is determined multiple times!
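A small runnable sketch of this behavior, using Python's sqlite3 module (window functions need SQLite 3.25 or later). The products table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data.
conn.executescript("""
    CREATE TABLE products (name TEXT, category TEXT);
    INSERT INTO products VALUES
        ('a1', 'audio'), ('a2', 'audio'), ('b1', 'books');
""")

# Every row keeps its own columns and gains the overall row count.
all_rows = conn.execute("""
    SELECT name, category, COUNT(*) OVER () AS total_rows
      FROM products
     ORDER BY name
""").fetchall()

# The count refers to the result of the query, so a WHERE clause changes it.
audio_only = conn.execute("""
    SELECT name, COUNT(*) OVER () AS total_rows
      FROM products
     WHERE category = 'audio'
     ORDER BY name
""").fetchall()

print(all_rows)
print(audio_only)
```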
Of course, there is no need to return the same results umpteen times with the window function. You can use the OVER clause to define the rows against which to run the function. The most important tool for defining the rows is PARTITION BY:
SELECT * , COUNT(*) OVER (PARTITION BY category) FROM [...]
You can read the preceding expression as: Count the rows with the same value in the category column! In other words, PARTITION BY delimits the rows just like GROUP BY, but it does not group the rows like GROUP BY does; instead, it simply specifies the rows to which the window function is applied.
If you have multiple rows of the same category, you will receive the same result for each of these rows. Remember that a window function does not have any side effects on the remaining results of the query. In particular, this means that multiple window functions can be used in a single query without them influencing one another. As an example, if you need both the number of rows in the overall results and the number of rows per category, you can combine the two examples above in a single query.
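Combining both counts in one query might look like the following sketch (Python with sqlite3; table and data are hypothetical).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data.
conn.executescript("""
    CREATE TABLE products (name TEXT, category TEXT);
    INSERT INTO products VALUES
        ('a1', 'audio'), ('a2', 'audio'), ('b1', 'books');
""")

rows = conn.execute("""
    SELECT name,
           category,
           COUNT(*) OVER (PARTITION BY category) AS per_category,
           COUNT(*) OVER ()                      AS total_rows
      FROM products
     ORDER BY name
""").fetchall()

for row in rows:
    print(row)
```

Both audio rows report a per-category count of 2, the single books row reports 1, and all three rows report the same total of 3; neither window function affects the other.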
The rows that are visible for a window function can be further delimited if the rows are sorted. That allows OVER clauses to mean something like "all rows before" or "three rows before to three rows after." "Before" and "after" refer to a sort order, which is freely definable using ORDER BY in the OVER clause. This means you can, say, compute a running total (sum of all rows before) or a moving average (average of three rows before to three rows after; see Listing 5); value and time are the column names in this example.
Listing 5
ORDER BY with OVER()
The formulation of the BETWEEN range follows similar rules to BETWEEN in the WHERE clause: The start must be somewhere before the finish, and the specified values themselves are part of the range. The AVG function in the example thus generates the average value of up to seven rows: the three rows before, the current row itself, and the next three rows. If you don't have enough rows to fill the window, then it is smaller – this would be the case in the first row, for example, which cannot be preceded by three rows.
You can define the window size by row, but also by value range. For example, you can define windows that cover three days before and three days after – no matter how many rows these days cover (Listing 6) – just define the BETWEEN range with RANGE instead of ROWS.
Listing 6
With Value Ranges
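As a rough illustration of a value-based window, the following sketch runs in SQLite through Python (RANGE with an offset needs SQLite 3.28 or later). SQLite has no interval type, so the hypothetical day column holds plain integer day numbers; databases with real date types typically express the offset as an interval instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: two rows on day 1, then gaps before days 6 and 9.
conn.executescript("""
    CREATE TABLE readings (day INTEGER, value INTEGER);
    INSERT INTO readings VALUES (1, 10), (1, 20), (2, 30), (6, 40), (9, 50);
""")

rows = conn.execute("""
    SELECT day,
           value,
           COUNT(*)   OVER (ORDER BY day
                            RANGE BETWEEN 3 PRECEDING
                                      AND 3 FOLLOWING) AS rows_in_window,
           AVG(value) OVER (ORDER BY day
                            RANGE BETWEEN 3 PRECEDING
                                      AND 3 FOLLOWING) AS avg_value
      FROM readings
     ORDER BY day, value
""").fetchall()

for row in rows:
    print(row)
```

The window for day 1 covers days -2 through 4 and therefore contains three rows, whereas the window for day 6 covers days 3 through 9 and contains only two.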
In addition to the well-known aggregate functions, SQL:2003 also introduced ranking functions that strictly require an OVER clause with ORDER BY.
Not Just Aggregate Functions
These ranking functions in particular are ROW_NUMBER, RANK, and DENSE_RANK. As the name would suggest, ROW_NUMBER lets you enumerate rows. RANK and DENSE_RANK are less intuitive. Both return a rank as per the ORDER BY clause. RANK and DENSE_RANK differ from ROW_NUMBER in that ex aequo placements (ties) with RANK and DENSE_RANK take the same rank, as is typical in sports. In other words, you could have two first-place contenders.
RANK and DENSE_RANK differ in the question of how to rank the next contender. RANK omits a placement in this case – thus returning 3; DENSE_RANK uses placements without gaps – that is, 2 in this example.
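The difference shows up clearly with a tie for first place. In this sketch (Python with sqlite3), the results table and the four competitors are hypothetical; ROW_NUMBER gets name as an extra sort key only so that its output is deterministic for the tied rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: Ann and Bob share first place.
conn.executescript("""
    CREATE TABLE results (name TEXT, points INTEGER);
    INSERT INTO results VALUES
        ('Ann', 90), ('Bob', 90), ('Cid', 80), ('Dee', 70);
""")

rows = conn.execute("""
    SELECT name,
           points,
           ROW_NUMBER() OVER (ORDER BY points DESC, name) AS row_no,
           RANK()       OVER (ORDER BY points DESC)       AS rnk,
           DENSE_RANK() OVER (ORDER BY points DESC)       AS dense_rnk
      FROM results
     ORDER BY points DESC, name
""").fetchall()

for row in rows:
    print(row)
```

Cid, the first competitor after the tie, gets row number 3, rank 3, and dense rank 2.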
SQL:2011 introduced more window functions for access to individual rows of a sorted data window. LAG and LEAD let the user access the previous and next row. FIRST_VALUE, LAST_VALUE, and NTH_VALUE provide access to the first, last, or nth row of the window. The example in Listing 7 illustrates these functions by means of a competition.
Listing 7
LAG and FIRST
The first column determines the placement using RANK. This step is followed by the name and the number of points achieved. The last two columns show the gap to the competitor just ahead and to the winner. The query uses LAG to access the score of the competitor in front and FIRST_VALUE to see how many points the winner has. The number of points achieved by the current candidate is deducted (-points) to give you the required gaps.
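A query along these lines can be sketched as follows (Python with sqlite3). It is a reconstruction from the description, not the original listing; the results table, the column aliases, and the three competitors are hypothetical, and the scores are deliberately distinct.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data with distinct scores.
conn.executescript("""
    CREATE TABLE results (name TEXT, points INTEGER);
    INSERT INTO results VALUES ('Ann', 95), ('Bob', 90), ('Cid', 80);
""")

rows = conn.execute("""
    SELECT RANK() OVER w                       AS place,
           name,
           points,
           LAG(points) OVER w - points         AS gap_to_next,
           FIRST_VALUE(points) OVER w - points AS gap_to_winner
      FROM results
    WINDOW w AS (ORDER BY points DESC)
     ORDER BY place
""").fetchall()

for row in rows:
    print(row)
```

The winner has no row in front, so LAG returns NULL there and the gap to the competitor ahead is NULL as well (None in Python).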
A word of warning on this example: this OVER clause uses ORDER BY without using BETWEEN to delimit the window. In this case, a default clause takes effect: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This default is surprising for two reasons. First, the subsequent rows are excluded from the window. Second, this exclusion is based on RANGE, not on ROWS. In other words, you still have competitors with the same score in the window.