Search for processes by start time
Ghost Hunter
How do you find a process running on a Linux system by start time? The question sounds trivial, but the answer is trickier than it first appears.
As the maintainer of a computing cluster [1], Frank also provides his users with commercial software for calculations based on the fair-use principle. A limited number of license keys are available for this software (e.g., 10 keys for the MATLAB [2] simulation software).
Some of these calculations can take up to a week. When a calculation is finished and the process terminates, the license key is automatically returned to the pool of free keys and can be grabbed by another user. However, if users forget to end their processes, no more keys can be handed out, because they have all been allocated. To prevent this, the admins want to automatically search for processes that are older than 10 days. If they find a process matching this criterion, they can check with the users to clarify what should happen to the process.
The Linux kernel manages processes and makes information relating to them available to the user in the /proc filesystem. At the command line, ps is the reliable interface to process management. Unfortunately, ps has dozens of options, and its output is often not very clear either. This can be remedied with a little shell code or possibly a scripting language. This article compares several potential solutions using Bash, Python, and Perl scripts, and the Go programming language.
Our goal is to find a solution that detects processes that are still running and were launched at least 10 days ago and then output the results in a list that is sorted in descending chronological order. The output will also include the user's login name or user ID, the PID, the executed program, and the time when the respective process began. If possible, we want to use only on-board tools. For the solutions based on Bash, you will need the ancient procps 3.3.0 release or newer (earlier versions lack some of the features used here).
Bash Variant 1
The first obvious solution is based on the ps command in combination with awk, date, sed, and sort. ps supports an optional output field, lstart, which outputs a process's start time (and date) in a uniform, long format. Additionally, the option -h must be used to completely suppress the headers in the ps output.
While finding and implementing the solution (Listing 1) was quick, parsing the output from ps is not trivial, which makes the script relatively unreadable as well as quite long. We encountered the following problems with this solution:
- You have to set the LC_TIME environment variable to make sure that localized month names do not suddenly appear (env LC_TIME=C).
- The day of the month is padded with an extra space for single-digit numbers. To sort correctly, you have to replace this space with a zero using a sed substitution (lines 5 and 6, Listing 1).
- The start date contains the months as names instead of numbers; you have to convert them to digits. This can be done with sed, as shown in lines 7 through 18.
- The order of the date components is not suitable for sorting (first month, then day, then time, and finally the year). awk changes the order of these four components.
- The same reordering is needed for filtering from a certain date, since awk can then compare the dates as strings with <.
- The script uses date to generate the appropriate comparison date right at the outset, since date can also calculate dates from relative specifications. The specification "10 days ago from now" is returned by calling:
date -d 'now -10 days'
date can also format the output very flexibly.
- If you do not specify any parameters when calling the script, it shows all processes older than 10 days.
- All numeric fields must be explicitly specified in sort; otherwise, sort will only consider the first field as numeric.
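The comparison date from the list above can be sketched as a small shell function. This is a minimal sketch that assumes GNU date (for the -d option); the function name cutoff_date is made up for illustration and does not appear in the listings.

```shell
#!/bin/sh
# cutoff_date is a hypothetical helper name; it assumes GNU date (-d option).
# Prints the point in time N days ago (default: 10) with the fields ordered
# year, month, day, time, so that two such strings compare correctly as text.
cutoff_date() {
    date -d "now -${1:-10} days" '+%Y %m %d %T'
}

# Usage: cutoff_date 3
```

Because every component is zero-padded and the most significant component comes first, a plain string comparison is enough to decide which of two dates is older.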
Listing 1
First Bash Attempt
01 #!/bin/sh
02 if [ -n "$1" ]; then limit=$1; else limit=10; fi
03 date="$(date '+%Y %m %d %T' -d "now -$limit days")"
04 env LC_TIME=C ps -eaxho pid,lstart,user,cmd | \
05 sed -e 's/^ *//;
06 s/ \([1-9]\) / 0\1 /;
07 s/Jan/01/;
08 s/Feb/02/;
09 s/Mar/03/;
10 s/Apr/04/;
11 s/May/05/;
12 s/Jun/06/;
13 s/Jul/07/;
14 s/Aug/08/;
15 s/Sep/09/;
16 s/Oct/10/;
17 s/Nov/11/;
18 s/Dec/12/' | \
19 awk '$6" "$3" "$4" "$5" "$1 < "'"$date"'" {print $6" "$3" "$4" "$5" "$1" "$7" "$8}' | \
20 sort -k1,1n -k2,2n -k3,3n -k4,4 -k5,5n
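As an aside, the day padding and month conversion handled by sed in lines 5 through 18 of Listing 1 could also be folded into a single awk call. The following is only a sketch of that idea, not the script used in this article; the function name to_sortable is hypothetical, and it expects the same pid,lstart,user,cmd field order that ps delivers in Listing 1.

```shell
#!/bin/sh
# to_sortable is a hypothetical helper. It reads lines in the field order
# "pid weekday month day time year user cmd" (ps -o pid,lstart,user,cmd)
# and prints "YYYY MM DD HH:MM:SS PID USER CMD" (first word of cmd only).
to_sortable() {
    awk '{
        # The month abbreviations sit at positions 1, 4, 7, ... of the
        # lookup string, so (position + 2) / 3 yields the month number.
        mon = (index("JanFebMarAprMayJunJulAugSepOctNovDec", $3) + 2) / 3
        # %02d zero-pads both the month and single-digit days.
        printf "%s %02d %02d %s %s %s %s\n", $6, mon, $4, $5, $1, $7, $8
    }'
}

# Usage (LC_TIME=C keeps the month names in English):
#   env LC_TIME=C ps -eaxho pid,lstart,user,cmd | to_sortable
```

The lookup string trades the 12 substitutions for one index() call; the filtering and sorting steps would still follow as in Listing 1.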
The output from Listing 1 without the -k options for sort looks like Listing 2 on a computer that was last booted on April 3, 2020.
Listing 2
Output from
$ ./list-processes1.sh | head
2020 04 03 22:32:34 1 root init
2020 04 03 22:32:34 10 root [ksoftirqd/0]
2020 04 03 22:32:34 104 root [kintegrityd]
2020 04 03 22:32:34 105 root [kblockd]
2020 04 03 22:32:34 106 root [blkcg_punt_bio]
2020 04 03 22:32:34 11 root [rcu_sched]
2020 04 03 22:32:34 12 root [migration/0]
2020 04 03 22:32:34 13 root [cpuhp/0]
2020 04 03 22:32:34 14 root [cpuhp/1]
2020 04 03 22:32:34 15 root [migration/1]
In Listing 2, you can immediately see that the sequence of the processes cannot be correct: PID 106, for example, is listed before PID 11. This is because the time stamps in the lstart field are only accurate to the second, not to the micro- or nanosecond, so many processes share the same start time. Sorting the output numerically by process number as the final key solves this problem for the most part. You have to specify all fields up to and including the process number in the sort call, as shown in line 20 of Listing 1. The output now looks like Listing 3.
Listing 3
Sorted Output
$ ./list-processes1.sh | head
2020 04 03 22:32:34 1 root init
2020 04 03 22:32:34 2 root [kthreadd]
2020 04 03 22:32:34 3 root [rcu_gp]
2020 04 03 22:32:34 4 root [rcu_par_gp]
2020 04 03 22:32:34 6 root [kworker/0:0H-kblockd]
2020 04 03 22:32:34 9 root [mm_percpu_wq]
2020 04 03 22:32:34 10 root [ksoftirqd/0]
2020 04 03 22:32:34 11 root [rcu_sched]
2020 04 03 22:32:34 12 root [migration/0]
2020 04 03 22:32:34 13 root [cpuhp/0]
Now the script only fails if so many processes are started within a single second that the process numbers wrap around and are reassigned starting from the beginning. For a long time, the default limit for this was 32,768 process IDs (PIDs), but Linux systems can now also cope with far larger values; the current limit can be read from /proc/sys/kernel/pid_max.
Bash Variant 2
An in-depth study of the ps man page reveals other fields that are useful for the task at hand, such as the etimes output field. etimes tells you the number of seconds since the process was started, reducing the complexity considerably because you no longer have to parse month names or re-sort fields. This shrinks the command so it can be written in one line. Listing 4 returns all processes that are more than two days old.
Listing 4
Compact Bash Variant
$ ps -eaxho etimes,pid,user,cmd | sort -nr | awk '$1 > 2*24*60*60 {print}' | head
227081 106 root [blkcg_punt_bio]
227081 105 root [kblockd]
227081 104 root [kintegrityd]
227081 57 root [khugepaged]
227081 56 root [ksmd]
227081 55 root [kcompactd0]
227081 54 root [writeback]
227081 53 root [oom_reaper]
227081 52 root [khungtaskd]
227081 51 root [kauditd]
However, this variant also works with an accuracy of one second. Since the code sorts backwards, this is even more noticeable, because PID 1 does not appear at the beginning of the list. This can be fixed by giving sort two separate keys so that, if the process age is identical, the PID is used as the sort criterion in ascending order. This is ensured by the key specifications -k1,1nr -k2,2n (Listing 5): The first key covers only field 1 and sorts numerically in reverse order; the second covers only field 2 and sorts numerically in ascending order.
Listing 5
Improved Compact Bash Variant
$ ps -eaxho etimes,pid,user,cmd | sort -k1,1nr -k2,2n | awk '$1 > 2*24*60*60 {print}' | head
226597 1 root init [2]
226597 2 root [kthreadd]
226597 3 root [rcu_gp]
226597 4 root [rcu_par_gp]
226597 6 root [kworker/0:0H-kblockd]
226597 9 root [mm_percpu_wq]
226597 10 root [ksoftirqd/0]
226597 11 root [rcu_sched]
226597 12 root [migration/0]
226597 13 root [cpuhp/0]
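The effect of a descending age key combined with an ascending PID key can be tried out on a few made-up lines without calling ps. In this sketch, the function name sort_by_age_then_pid and the sample data are purely illustrative; the keys are written in the explicit start,end form (-k1,1 and -k2,2) so that each key is limited to exactly one field.

```shell
#!/bin/sh
# sort_by_age_then_pid is a hypothetical helper name.
# Key 1: field 1 only, numeric, reversed (oldest process first).
# Key 2: field 2 only, numeric, ascending (lowest PID first on a tie).
sort_by_age_then_pid() {
    sort -k1,1nr -k2,2n
}

# Usage with made-up "etimes pid user cmd" lines:
#   printf '100 10 root a\n100 2 root b\n200 5 root c\n' | sort_by_age_then_pid
```

With the sample input from the usage comment, the line with age 200 comes first, followed by the two lines with age 100 in the order PID 2, PID 10.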
The previous call spells out the calculation of seconds in awk: 2*24*60*60 corresponds to two days of 24 hours, each with 60 minutes of 60 seconds. Alternatively, the value can be written directly as 172800.
To parameterize the script, it is convenient to use 86400, the number of seconds per day. Listing 6 expects the number of days as a parameter and multiplies the value passed in by 86,400.
Listing 6
Number of Days as a Parameter
#!/bin/sh
if [ -n "$1" ]; then
  limit=$1;
else
  limit=10;
fi
ps -eaxho etimes,pid,user,cmd | sort -k1,1nr -k2,2n | awk '$1 > '"$limit"'*86400 {print}'
If you call the script without a parameter, it uses 10 as the default (10 days).
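The etimes variant prints the age of a process but not the time at which it began, which was one of the stated goals. The start time can be reconstructed by subtracting the age from the current time. The following is a minimal sketch that assumes GNU date (for the -d @SECONDS syntax); the function name start_time and its optional second argument are assumptions made for illustration.

```shell
#!/bin/sh
# start_time is a hypothetical helper; it assumes GNU date.
# $1: process age in seconds, as printed by the etimes field
# $2: current time in seconds since the epoch (optional, default: now)
start_time() {
    etimes=$1
    now=${2:-$(date +%s)}
    date -d "@$((now - etimes))" '+%Y-%m-%d %T'
}

# Usage: prefix every line of the ps output with the calculated start time.
#   ps -eaxho etimes,pid,user,cmd | while read -r age pid user cmd; do
#       printf '%s %s %s %s\n' "$(start_time "$age")" "$pid" "$user" "$cmd"
#   done
```

Like etimes itself, the result is only accurate to the second, and calling date once per process is slow on systems with many processes.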
Bash Variant 3
The lack of fractions of a second prompted us to make a third attempt. Instead of the ps command, entries from the /proc filesystem are used as the basis here.
The required information is found in field number 22 (starttime) of the /proc/<pid>/stat file. It tells you how many clock ticks after the Linux kernel started up the process was launched. Interpreting the clock ticks is tricky; it relies on the assumption of a clock rate of 100Hz (i.e., 100 ticks per second [3]):
$ getconf CLK_TCK
100
Not all distributions adhere to this: Some use 250 or 1000Hz internally instead. However, they always outwardly report 100Hz. We could not clarify why this is the case. On Debian GNU/Linux, the two values are identical: 100Hz.
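Converting clock ticks into seconds is a simple division by the value that getconf reports. As a minimal sketch (the function name ticks_to_seconds and its optional second argument are made up for illustration):

```shell
#!/bin/sh
# ticks_to_seconds is a hypothetical helper name.
# $1: number of clock ticks
# $2: ticks per second (optional, default: the value of getconf CLK_TCK)
ticks_to_seconds() {
    awk -v ticks="$1" -v hz="${2:-$(getconf CLK_TCK)}" \
        'BEGIN { printf "%.2f\n", ticks / hz }'
}

# Usage: ticks_to_seconds 22708153
```

At 100 ticks per second, two decimal places are enough to represent every tick exactly.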
Like the previous shell scripts, the one in Listing 7 first reads a parameter and, if no time span was specified, assumes 10 days as the default. Then awk is called twice. The first call reads field 22 from the stat file of the script's own shell process (whose PID the shell provides in $$); because the script has only just started, this value approximates the current time in clock ticks since the computer booted. The second call reads fields 1 and 22 (the PID and the start time in clock ticks) for all processes and compares each start time against that value.
Listing 7
Bash Script with Clock Ticks
#!/bin/sh
if [ -n "$1" ]; then
  limit=$1;
else
  limit=10;
fi
now=$(awk '{print $22}' /proc/$$/stat)
awk '$22 < '$now'-(100*86400*'$limit') {printf "Sec. since boot: %.2f - PID: %i\n", $22/100, $1}' /proc/[1-9]*/stat | sort -n -k4 -k7
Then awk reads the stat files of all running processes; this is done by specifying:
/proc/[1-9]*/stat
The number of clock ticks per second (100) and seconds per day (86,400) are hardwired values here for simplicity's sake.
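If you would rather not hardwire the tick rate, the threshold can be computed from the value that getconf reports. The following is only a sketch; the function name cutoff_ticks and its argument order are assumptions, not part of Listing 7.

```shell
#!/bin/sh
# cutoff_ticks is a hypothetical helper name.
# $1: current time in clock ticks since boot
# $2: age limit in days
# $3: ticks per second (optional, default: the value of getconf CLK_TCK)
# Prints the tick count before which a process must have started
# in order to be older than the limit.
cutoff_ticks() {
    now=$1
    days=$2
    hz=${3:-$(getconf CLK_TCK)}
    echo $((now - hz * 86400 * days))
}

# Usage: cutoff_ticks "$now" "$limit"
```

A negative result simply means that the machine has been up for less time than the limit, so no process can match.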
Since we wanted the floating-point output to look nice, printf restricts it to just two decimal places; the clock ticks are no more accurate than this anyway. sort then numerically sorts the two relevant columns. The first numeric column lists the number of seconds since boot, while the second lists the process ID.
The solution comes quite close to our objective, but cannot display the usernames for the processes. In addition, some processes that were definitely started long after the system booted (for example, the Tor Browser) unexpectedly appear as if they were started zero seconds after the system booted. A likely cause is that field 2 of the stat file, the command name in parentheses, can itself contain spaces; awk then counts the fields one or more positions off, and $22 lands on a field that is normally 0. The init process, on the other hand, did not start until 468 clock ticks or 4.68 seconds after startup. In the test case, this was probably because the hard disk encryption password had to be entered first.
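One thing to watch out for when splitting /proc/<pid>/stat on whitespace is the second field: The command name is enclosed in parentheses and may contain spaces, which shifts all subsequent field numbers in awk. A more robust approach is to remove the parenthesized name before counting fields. The following is a sketch under that assumption; the function name stat_starttime and the sample line in the usage comment are made up for illustration.

```shell
#!/bin/sh
# stat_starttime is a hypothetical helper name.
# Reads lines in /proc/<pid>/stat format on stdin and prints "PID STARTTIME".
stat_starttime() {
    # The greedy .* matches up to the last ") " in the line, so the whole
    # command name is removed, including any spaces or parentheses in it.
    # With field 2 gone, the original field 22 (starttime) becomes field 21.
    sed 's/^\([0-9]*\) (.*) /\1 /' | awk '{ print $1, $21 }'
}

# Usage on a live system:
#   cat /proc/[1-9]*/stat 2>/dev/null | stat_starttime
# Usage with a made-up line whose command name contains a space:
#   echo '4242 (Web Content) S 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 123456 23' | stat_starttime
```

For the made-up line, a plain awk '{print $22}' would return the value 21 from the neighboring field, whereas the sketch returns the start time 123456.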
Removing awk from the code and specifying the matching fields 22 and 1 directly as parameters of the sort command makes everything a bit easier. Unfortunately, the result is unreadable output with a huge volume of data.
Annoyingly, the time data is still too imprecise to do without a final sort by PID. In theory, the data should be more precise than in the previous versions, because clock ticks provide more precise information than whole seconds. However, the problem of inaccuracy in case of a PID overflow obviously still exists. All in all, the variants with ps seem to be the better approach.