Charting expletives from the Linux Kernel Mailing List
Climate Study
Kernel amateurs are best advised to read summaries of the heated discussions on the Linux Kernel Mailing List (LKML) before they delve in. We analyze 2.5 million postings to study the density of cursing.
Every now and then, a message reaches social media that Linux boss Linus Torvalds has flipped out once again and dressed down kernel colleagues with rude words. Some Linux enthusiasts look on this with amusement, enjoying the tirades of the great dictator over a cool drink after work; others see the harsh nature of the language as representing an intimidating boys' club culture that privileges insiders.
The issue of language on the kernel list has been in the foreground for the last few years. In 2013, Intel developer Sarah Sharp led an effort to improve civility among kernel developers [1], and Red Hat's Lennart Poettering has also spoken up for more politeness and less abusive language [2].
In 2015, Linus responded to criticism by posting a Code of Conflict [3] that affirms the need for civility in the code review process, instructing developers to contact the Linux Foundation's Technical Advisory Board if they feel the process is threatening or abusive, and ending with a directive to not let things get personal:
As a reviewer of code, please strive to keep things civil and focused on the technical issues involved. We are all humans, and frustrations can be high on both sides of the process. Try to keep in mind the immortal words of Bill and Ted, "Be excellent to each other."
Whether you favor the harsh language of some on the kernel list, or whether you still see room for reform, you might have noticed that most of the discussion centers around anecdotes and opinions – no one ever seems to quantify it.
We decided to work through this phenomenon mathematically. For the dataset, we used 2.5 million LKML posts, which were first fed into a MySQL database, and then beaten with Perl and R scripts and presented graphically.
Figure 1 shows how the LKML has grown, measured by the number of posts per year over 20 years from 1996 to the present day, with the partial count for 2016 projected proportionally to a full year. The almost linear increase, from 20,000 posts in 1996 to an estimated figure exceeding 270,000 for the current year of 2016, is evidence of the natural growth of the project and its uninterrupted popularity.
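The proportional projection for an incomplete year comes down to simple scaling by elapsed days. The following Python lines are a minimal sketch of that arithmetic, not the script used for Figure 1; the function name and the sample input are assumptions made for illustration.

```python
from datetime import date


def project_full_year(posts_so_far: int, cutoff: date) -> int:
    """Scale a partial-year post count up to a full-year estimate.

    Assumes posts arrive at a roughly constant daily rate within the year.
    """
    year_start = date(cutoff.year, 1, 1)
    year_end = date(cutoff.year + 1, 1, 1)
    elapsed_days = (cutoff - year_start).days
    if elapsed_days <= 0:
        raise ValueError("cutoff must be later than January 1")
    total_days = (year_end - year_start).days  # 365 or 366
    return round(posts_so_far * total_days / elapsed_days)


# Illustrative input only; this is not a count taken from the LKML dataset.
print(project_full_year(100_000, date(2016, 7, 1)))
```

Counting days rather than months handles leap years such as 2016 without special cases.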
Long Tail
What about the members? Do most of the posts come from a few extra-active highfliers, with the rest forming a long tail of Linux hobbyists who only write once or twice a year? An R script reads the metadata re-exported from MySQL into CSV format and prints the graphic in Figure 2.
It turns out that a few top posters over the decades have fired off more than 30,000 emails; a few dozen members, Torvalds himself among them, more than 10,000; and then around another 100 have exceeded 5,000. As expected, the curve levels off on its right side.
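A distribution like this can be tallied in a few lines. The sketch below uses Python rather than the R script mentioned above and assumes the input is simply one author name per post; the function names are hypothetical.

```python
from collections import Counter


def posts_per_author(authors):
    """Count posts per author from an iterable with one author name per post."""
    return Counter(authors)


def authors_above(counts, thresholds):
    """For each threshold, count how many authors wrote more posts than that."""
    return {t: sum(1 for n in counts.values() if n > t) for t in thresholds}


def rank_curve(counts):
    """Post counts sorted from most to least active author (the long-tail curve)."""
    return sorted(counts.values(), reverse=True)


# Tiny made-up sample; a real run would read the author column of the CSV export.
sample = ["alice"] * 5 + ["bob"] * 3 + ["carol"]
counts = posts_per_author(sample)
print(authors_above(counts, [2, 4]))
print(rank_curve(counts))
```

Plotting the values returned by rank_curve() against their position gives the falling curve described above.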
Expletives
Before analyzing civility on the LKML, it is necessary to clarify when exactly a word counts as a swear word. Clearly, what is considered profane depends strongly on the cultural environment. One possible approach is offered by the gold standard prevailing in the US: the "Seven Words You Can Never Say on Television" compiled by the comedian George Carlin in 1972, referencing words that no publicly aired television or radio station in the US could send into the ether without first masking them with an annoying 1 kHz tone [4] (subscription channels like HBO are the exception).
You can probably guess most of the seven words, which, predictably, center on sex acts, body parts, and bodily functions, but if you have any questions, search for the "seven dirty words" on Wikipedia [5]. If you do not know them all, you are very welcome to use an online dictionary on your own for clarification, but please only do this with your browser set to "incognito" mode.
The CPAN Perl module Regexp::Common is available to determine whether a text includes one of the vulgarities; it searches for them at lightning speed with regular expressions using the profanity key. The filter, however, will not find coded phrasings or blanked-out words such as f*ck; the regular expressions would have to be expanded for this.
But it also finds words that sound offensive to European ears. While an American might think nothing of the expression "a bunch of crap," except perhaps to find it funny depending on the context, Her Britannic Majesty might not be amused at high tea.
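To illustrate how such a filter could be extended to catch blanked-out spellings, here is a minimal sketch in Python. It swaps in Python's re module for Regexp::Common, and the two-word list is a deliberately short, hypothetical stand-in, not the list the Perl module ships with.

```python
import re

# Hypothetical, deliberately short word list for illustration only.
WORDS = ["crap", "fuck"]

# Characters commonly used to blank out a letter (an assumption, not a standard).
MASK = r"[*#@$%!_-]"


def masked_pattern(word):
    """Match the word itself or a copy with one interior letter masked."""
    variants = [re.escape(word)]
    for i in range(1, len(word) - 1):
        variants.append(re.escape(word[:i]) + MASK + re.escape(word[i + 1:]))
    return "(?:" + "|".join(variants) + ")"


# The lookarounds keep the filter from firing inside longer words.
PROFANITY = re.compile(
    r"(?<![A-Za-z])(?:"
    + "|".join(masked_pattern(w) for w in WORDS)
    + r")(?![A-Za-z])",
    re.IGNORECASE,
)


def contains_profanity(text):
    """Return True if the text contains a listed word or a masked variant."""
    return PROFANITY.search(text) is not None


print(contains_profanity("a bunch of crap"))
print(contains_profanity("f*ck"))
print(contains_profanity("scrape the data"))
```

The sketch only handles a single masked interior letter and ignores inflected forms; a production filter would need a longer list and more variants.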
If you use regexes to trawl through the historic contributions to the LKML by Linus Torvalds, the filter jumps to July 1996 for the first instance [6]. The member Aaron Tiensivu had written, under the title "Not a Bible Thumper," that the most amazing profanities were concealed in the kernel code (Figure 3). The discussion took its course until Torvalds exercised his authority and stated that, although he was opposed to political correctness, he also didn't see a point in being intentionally rude for no reason, adding ambiguously, "The reason the active kernel messages should be nice is that while I hate politically correct, I do not believe in being actively offensive either except when I _want_ to offend somebody. And there is no point in offending the occasional user."
More recently, Torvalds has also not shied away from arguing with a coarse tone that, if used against work colleagues in an American company, probably would have seen the HR department called to the scene immediately. At the end of 2012, he berated a maintainer who had not, in his opinion, understood the first rule of kernel maintenance: "We do not break userspace." He told the maintainer to "shut the fuck up"; a kernel change that causes problems for a userland program would always be a bug in the kernel (Figure 4).
What has been the historical development of profanities on the LKML? Figure 5 shows two peaks, in 2000 and 2008, each with around 1,200 expletive emails, and a strongly falling trend over the last decade. Because the number of postings per year keeps increasing at the same time, the share of potty-mouth posts is dropping even faster than the raw count. However, the figure for 2016 only includes postings up to July, so the adjusted full-year figure would probably be around the 2015 level.
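To separate the trend in language from the growth of the list, the expletive count can be divided by the total number of posts in each year. The following Python sketch shows that normalization; the function name is an assumption, and the numbers in the usage example are invented placeholders, not values read from Figure 5.

```python
def rate_per_thousand(flagged_by_year, total_by_year):
    """Return expletive posts per 1,000 posts for each year with traffic."""
    rates = {}
    for year, total in sorted(total_by_year.items()):
        if total > 0:
            # Years without any flagged posts count as zero.
            rates[year] = 1000 * flagged_by_year.get(year, 0) / total
    return rates


# Invented placeholder counts, purely to show the calculation.
flagged = {2000: 50, 2001: 40}
totals = {2000: 1_000, 2001: 2_000}
for year, rate in rate_per_thousand(flagged, totals).items():
    print(year, f"{rate:.1f} per 1,000 posts")
```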
Who uses the most swear words? Listing 1 shows how many expletive-laden posts each of the ten biggest boors sent out. At the top is the dictator himself. The list includes a number of non-native speakers; in my experience, non-natives often fling around expletives in English with little sensitivity to disguise their limited vocabulary. That said, the top 10 also includes some native English speakers.
Listing 1
Top Swearers
01 Linus Torvalds ........ 1308
02 Alexander Viro ........ 759
03 Peter Zijlstra ........ 548
04 Rik van Riel .......... 397
05 Thomas Gleixner ....... 324
06 Alan Cox .............. 322
07 Andrew Morton ......... 278
08 Ingo Molnar ........... 250
09 Christoph Hellwig ..... 243
10 Benjamin Herrenschmidt 180
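A ranking like Listing 1 can be produced by counting, per author, the posts that a profanity test flags. The Python lines below are a rough sketch of that step, not the Perl script behind the listing; the function names, the (author, body) input format, and the sample data are all assumptions.

```python
from collections import Counter


def top_posters(posts, is_flagged, n=10):
    """Count flagged posts per author and return the n most frequent.

    posts is an iterable of (author, body) pairs; is_flagged is any
    function that returns True for a body containing an expletive.
    """
    tally = Counter(author for author, body in posts if is_flagged(body))
    return tally.most_common(n)


def format_listing(ranked):
    """Render (author, count) pairs with a rank and dot leaders."""
    lines = []
    for rank, (author, count) in enumerate(ranked, start=1):
        lines.append(f"{rank:02d} {author + ' ':.<24}{count:>5}")
    return "\n".join(lines)


# Made-up sample data; "bad" stands in for a post that trips the filter.
sample = [("A", "bad"), ("B", "ok"), ("A", "bad"), ("B", "bad")]
print(format_listing(top_posters(sample, lambda body: body == "bad")))
```

Note that this counts posts, not individual words, so a message with several expletives still adds only one to its author's total.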
What range of words do the maintainers use during their stressful work? Nothing out of the ordinary, as you can see from the pie chart in Figure 6: The list fits pretty closely with the usual repertoire of the American construction worker. The clear favorite is the word "crap."
Conclusion
When used in moderation, a strong word can definitely prevent any possible misunderstandings. Linus has said his use of language is intended to keep developers alert and doing their best work – to fix the problems first before sending problematic code up the development tree. On the other hand, Linux bills itself as a meritocracy, and if worthy and potentially productive programmers are choosing not to participate because they are put off by intimidating and sometimes abusive language, the result is a loss for Linux.
Of course, the study described in this article does not attempt to uncover intimidation or abuse but is only searching for the presence of words. As Sarah Sharp points out in a 2013 kernel list post summarizing her position [7], it is possible to use obscenities in a way that is not personally abusive. Saying "If you give a flying fuck about diversity, you should avoid verbal abuse" is not the same as saying "SHUT THE FUCK UP."
Still, real numbers offer real insights into the use of language on the kernel list, and the fact that foul language is on a downward trend should be of some comfort to those who argue for better word choice.
Infos
- [1] Sarah Sharp post on civility: https://lkml.org/lkml/2013/7/15/329
- [2] Lennart Poettering post on civility: https://plus.google.com/app/basic/stream/z13rdjryqyn1xlt3522sxpugoz3gujbhh04
- [3] Linux Code of Conflict: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=b0bc65729070b9cbdbb53ff042984a3c545a0e34
- [4] Bleep censor: https://en.wikipedia.org/wiki/Bleep_censor
- [5] Seven Dirty Words: https://en.wikipedia.org/wiki/Seven_dirty_words
- [6] Linus Torvalds, "Re: Not a bible thumper. . .": https://lkml.org/lkml/1996/7/20/1
- [7] Sarah Sharp's summary: https://lkml.org/lkml/2013/7/19/634
- [8] Listings for this article: ftp://www.linux-magazine.com/pub/listings/magazine/192/Perl