Creating more readable regular expressions with Simple Regex Language
Clear-Sighted
Regular expressions are a powerful tool, but they can also be very hard to digest. The Simple Regex Language lets you write regular expressions in natural language.
Regular expressions are a fundamental feature of Linux – and many other modern operating systems. A regular expression is a search term with special placeholders representing several possible characters at the same time. The concept of a regular expression is an extension of the idea behind the "wildcard" character used in many GUI search tools, but the power and subtlety of regular expressions far exceeds what you can do with a simple wildcard.
For example, suppose you want to search the system.log file for errors, but you don't know whether the term Error will appear with an initial cap or all lowercase (Error or error). You could use a regular expression as part of the grep command:
grep -e '[eE]rror' system.log
The expression [eE] means: There is either a lowercase e or an uppercase E.
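The same character class works in most regex engines. As a rough sketch, Python's standard re module handles [eE]rror much like grep does; the log lines below are invented purely for illustration:

```python
import re

# Hypothetical log lines, invented for this illustration.
lines = [
    "Error: disk full",
    "kernel: I/O error on sda",
    "ERROR: out of memory",
    "all systems nominal",
]

# [eE] is a character class: exactly one character, either "e" or "E".
pattern = re.compile(r"[eE]rror")

# Keep every line in which the pattern occurs somewhere.
matching = [line for line in lines if pattern.search(line)]
print(matching)
```

Note that the all-caps ERROR line is not selected, because the character class only covers the first letter.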
A quick check for capitalization is easy to read and interpret, but some regular expressions are much more exotic. Who is able to say right away what text the following expression describes:
/^(?:\w|[\.\-\+])+(?:@)(?:[a-z]|[0-9]|[\.\-])+(?:\.)[a-z]{2,}$/i
Once you derive an expression like this, it can be a powerful tool for a script or a string search tool like grep. However, for the human who created the expression, and for the other humans who come along later and want to read it, decoding a regular expression can be a time-consuming endeavor. What is more, a small error that creeps into the expression can be difficult to spot, even though it could have a significant effect on the search result. An error in a complex regular expression could even become the basis for malicious code and an Internet attack.
The fledgling Simple Regex Language (SRL, [1]) from developer Karim Geiger aims to address the problem of incomprehensibility in regular expressions. Geiger started SRL as a bit of fun in fall 2016, and since then, other developers have helped to implement SRL in various programming languages.
SRL allows you to write regular expressions in natural English. In the previous logfile example, the two words Error and error start with either E or e. In SRL, you could say:
one of "eE"
and follow it with the character string rror:
one of "eE" literally "rror"
This line forms a complete expression in SRL. SRL keywords are not case sensitive, so LITERALLY is the same as literally. However, for literal strings, uppercase and lowercase matter: literally "Error" therefore means something different from literally "error".
In SRL, the developer can frame strings – rror in the example – with single or double quotes. You have the option of separating the individual components of the complete expression with a comma or a line break. Adding a separator does not change the logic but simply improves the legibility:
one of "eE", literally "rror"
The example expression matches all text passages where the character string error or Error appears. Hence the word Terrorism would also be a valid match.
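The substring behavior is easy to reproduce. The following sketch assumes the hand-written regex [eE]rror as a counterpart to the SRL expression and uses Python's re module to show where the match falls inside the longer word:

```python
import re

pattern = re.compile(r"[eE]rror")

# search() looks for the pattern anywhere in the text, so a hit inside
# a longer word counts as a match.
hit = pattern.search("Terrorism")

matched_text = hit.group(0)   # the part of the word that matched
matched_at = hit.start()      # zero-based position of the match
print(matched_text, matched_at)
```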
Empty Words
Whitespace characters on either side of the expression ensure that only the separate word matches:
whitespace one of "eE" literally "rror" whitespace
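As a sketch of what this means in practice, the following Python snippet assumes the hand-written counterpart \s[eE]rror\s (the regex that SRL itself generates may be grouped differently) and tries it on three made-up sample strings:

```python
import re

# Assumed regex counterpart of the SRL line above: \s is "whitespace".
pattern = re.compile(r"\s[eE]rror\s")

# Hypothetical sample texts.
samples = ["Terrorism rises", "an error occurred", "error at startup"]

# Map each sample to True/False depending on whether the pattern occurs.
results = {text: bool(pattern.search(text)) for text in samples}
print(results)
```

The third sample shows a side effect: because whitespace is required on both sides, the word is not found when it sits at the very start of the text.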
The word error is usually at the beginning of a line in logfiles. If you are only interested in these lines, you just need to write:
begin with one of "eE" literally "rror"
The tested text now needs to start with Error or error. However, the expression only works if the program treats each line of the file as a separate text to be tested (similar to grep).
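The difference between testing a whole text and testing line by line can be sketched in Python. The pattern ^[eE]rror is an assumed hand-written counterpart of the SRL expression, and the log excerpt is invented:

```python
import re

# Hypothetical two-line log excerpt.
log_text = "service started\nerror: connection refused\n"

starts_with_error = re.compile(r"^[eE]rror")

# Applied to the whole text, ^ only anchors at the very beginning.
whole_text_hit = starts_with_error.search(log_text) is not None

# Applied line by line, as grep does, the second line matches.
line_hits = [ln for ln in log_text.splitlines() if starts_with_error.search(ln)]

print(whole_text_hit, line_hits)
```

In Python, the re.MULTILINE flag would be another way to let ^ match at the start of every line.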
Some logfiles mark errors with the abbreviation EE, which you could include in the expression with:
begin with any of (literally "EE", (one of "eE" literally "rror"))
As with traditional regular expressions, parentheses group subexpressions. The term any of serves as a logical Or. In the example, the expression looks for lines beginning either with the character string EE or with Error or error. The comma is cosmetic.
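For comparison, here is a minimal Python sketch of the same logic written by hand as ^(?:EE|[eE]rror). This is an assumption about an equivalent pattern, not necessarily the exact regex SRL generates, and the sample lines are made up:

```python
import re

# Assumed hand-written counterpart: ^ anchors at the start, | separates
# the alternatives, and (?:...) groups them without capturing.
pattern = re.compile(r"^(?:EE|[eE]rror)")

# Hypothetical log lines.
samples = [
    "EE driver failed",
    "Error 42",
    "error 42",
    "no error here",
    "ee lowercase",
]

matches = [s for s in samples if pattern.search(s)]
print(matches)
```

The literal EE stays case sensitive, so the lowercase ee line is rejected, just as the anchored pattern rejects the line where error appears later on.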
When the Post Rings
Sometimes characters should be repeated several times. For example, in the abbreviation EE, there are exactly two Es in succession. In SRL, you could say: literally "E" exactly 2 times. Instead of exactly 2 times, you could also write twice.
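In classic regex notation, "exactly 2 times" is usually written with a counted quantifier. The following Python sketch assumes E{2} as the counterpart and uses fullmatch() so that the entire string has to consist of the repetition:

```python
import re

# {2} repeats the preceding element exactly twice.
exactly_two = re.compile(r"E{2}")

# fullmatch() requires the whole string to match, not just a part of it.
two_ok = exactly_two.fullmatch("EE") is not None
one_ok = exactly_two.fullmatch("E") is not None
three_ok = exactly_two.fullmatch("EEE") is not None

print(two_ok, one_ok, three_ok)
```

With search() instead of fullmatch(), the string EEE would also produce a hit, because it contains two Es in succession.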
In the following expression:
begin with any of (any character, one of ".-+") once or more
the expression any character stands for any letter between A and Z, a digit between 0 and 9, or an underscore _. Uppercase and lowercase are of no importance. The permitted characters can be repeated as often as desired; however, there must be at least one character. The entry once or more ensures a minimum of one character.
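This part of the expression can be sketched in Python, assuming \w as the regex counterpart of any character. One caveat: Python's \w also accepts non-ASCII letters by default, so the sketch adds the re.ASCII flag to match the description above. The sample strings are invented:

```python
import re

# re.ASCII restricts \w to [A-Za-z0-9_]; the + quantifier demands at
# least one permitted character.
local_part = re.compile(r"^(?:\w|[.\-+])+", re.ASCII)

# Hypothetical inputs; match() returns None if nothing matches at the start.
full = local_part.match("first.last+tag")
mixed = local_part.match("john_doe-99")
empty = local_part.match("@nobody")

print(full.group(0), mixed.group(0), empty)
```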
If the string you are looking for is an email address, you'll also need to ensure the presence of the @ character: literally "@". The domain name behind it may, in turn, be made up of several letters or numbers and the special characters . and -:
any of (letter, digit, one of ".-") once or more
The any character expression does not work for the domain name because domain names prohibit the underscore _. The letter and digit expressions specify letters and numerals without additional characters. The top-level domain, which starts with a period, forms the end:
literally "."
At least two more letters follow:
letter at least 2 times must end
The developer specifies that uppercase and lowercase are irrelevant by explicitly adding case insensitive.
Listing 1 shows the whole expression. The expression deliberately keeps the email address test simple; for example, the standard allows other special characters in front of the @. A domain name must also always end with a letter or a number, a rule the expression does not enforce.
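Listing 1
Checking an Email Address
begin with any of (any character, one of ".-+") once or more,
literally "@",
any of (letter, digit, one of ".-") once or more,
literally ".",
letter at least 2 times,
must end, case insensitive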
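To try the email check outside of SRL, the following Python sketch uses the regular expression from the beginning of this article, written without any whitespace between its groups. The function name looks_like_email and the sample addresses are hypothetical, and the re.ASCII flag is an added assumption so that \w stays limited to letters, digits, and the underscore:

```python
import re

# The article's email pattern; the trailing /i becomes re.IGNORECASE.
EMAIL = re.compile(
    r"^(?:\w|[.\-+])+(?:@)(?:[a-z]|[0-9]|[.\-])+(?:\.)[a-z]{2,}$",
    re.IGNORECASE | re.ASCII,
)


def looks_like_email(text: str) -> bool:
    """Return True if text passes the deliberately simple check."""
    return EMAIL.search(text) is not None


# Hypothetical sample addresses.
for candidate in [
    "Jane.Doe+news@Example.org",
    "user@localhost",
    "user@example.c",
    "user name@example.com",
    "user@foo-.com",
]:
    print(candidate, looks_like_email(candidate))
```

The last sample is accepted even though the part before the top-level domain ends with a hyphen, which illustrates how simple the check is.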
Testing, Testing, 1, 2, 3
You can test your SRL expression directly at the SRL project website under the menu item Build [2]. Just enter the SRL expression under Your SRL Query, type a test text under Test Input, and have it checked via Run Query (Figure 1). At the bottom of the page, developers immediately find out whether the test text matches the SRL expression. In addition, the page supplies the corresponding regular expression for comparison.
Figure 2 shows the expression for Listing 1 as an example – which, by the way, is identical to the cryptic regular expression at the beginning of this article. If the tester places a check mark in front of Save Query (to the right of Test Input), the server keeps track of all entries. The tester can use the URL at the bottom of the page to access the page with the SRL expression at any time. It remains unclear where the stored data will reside, so testers should not use sensitive data with Test Input.