Tracking down hard disk space consumption with Go

Programming Snapshot – Disk Space Analyzer

© Lead Image © alphaspirit, 123RF.com

© Lead Image © alphaspirit, 123RF.com

Article from Issue 276/2023
Author(s):

This month, Mike Schilli writes a quick terminal UI in Go to help track down space hogs on his hard disk.

When installing a new hard disk, users think they are in seventh heaven – it seems almost impossible to ever fill up the drive. Given today's huge capacities, though, virtually everyone becomes a thoughtless space waster sooner or later. As the disk fills up, space needs to be freed up to keep things going. Most of the time, the culprit is huge files no longer in use that are still hanging around in forgotten directories and causing overcrowding. The most effective strategy is often a simple one of tracking down the biggest space wasters and then deleting the unneeded files.

The spacetree Go program helps with this housecleaning by pointing to the most storage-intensive paths starting from a root directory and provides their disk space usage in bytes. As Figure 1 shows, the Go binary, which you call with the directory to be analyzed, switches the terminal to graphics mode and displays the storage paths as an interactive tree. You can expand the green elements in the tree with a mouse click or by pressing the Enter key to determine which folders contain the fattest files.

Figure 1: The Go terminal user interface (UI) with storage-intensive Git directories.

Biggest Fish

In Figure 1, the Go spacetree binary takes a look at the ~/git/ directory and discovers some 14GB of data there. Using the arrow keys, their vi equivalents J and K, or even the mouse if you fancy, you can navigate to the directories highlighted in green and expand them by pressing the Enter key. Figure 2 shows that, for example, the articles/ directory, which contains the articles from 26 years of this column, occupies a total of almost 4GB.

Figure 2: Expanding the directories shows you the biggest fish.

Most of it, some 850MB, resides in the internal Git directory, .git, which I can't touch without disrupting its inner workings. However, for the biggest disposable space waster, check out the subdirectory with the wifi network checker from the October 2023 issue: This directory still has the screencast video in it! By deleting it, I've saved another 850MB. On top of that, the March 2019 progressbar article contains the unused foo and foo.bak files that really need to be removed.

Smarter with Heap

But how does the Go program traverse the deeply nested file structure to count the drive bytes used by each file and print a short list of the largest files? It finds the 10 biggest files from a huge number of files, which can easily be millons on a heavily used system. The simple sounding solution of dropping all the entries into an array, sorting the array in descending order by file size, and then outputting the first 10 entries is not an option. Even a highly effective sorting algorithm would take quite some time to process a million entries, not to mention the RAM wasted by such a huge array. After all, only 10 entries are of interest at the end of the day.

The so-called "min heap" proves to be an efficient data structure for problems like this. The nodes of this binary tree always satisfy the rule that the entries below a given node always have larger values than the node itself. In Figure 3, the node with the value 3 is at the top of the tree root and is smaller than all other nodes in the tree. Note, though, that the rule applies only to directly or indirectly connected nodes. In Figure 3, for example, it is perfectly okay for the first element of the second row from the top, which belongs to the left subtree, to contain a value of 8 while the third row of the right subtree contains a value of 7. They're not in a direct or indirect parent/child relationship, so the rule doesn't apply.

Figure 3: The min heap data structure and its array representation.

Minimal Heap

Software programs represent the tree as a simple array which, as shown in Figure 3, creates subsets of one, two, four, eight, and so on that border on each other seamlessly, each subset representing one horizontal row within the tree. It is now relatively easy to find and remove the smallest entry in the tree, because it is always at the top, in the root element of the tree. Removing the node leaves an empty spot and disrupts the tree order, but the standardized heap algorithm can reorder the tree so that another node takes the place of the removed one. This causes subsequent nodes to bubble up, until the tree again satisfies the heap rules.

This makes the data structure ideal for an actively maintained list of the N largest files found so far. If the tree initially contains less than N nodes, all newly found files are moved into it without checking. On the other hand, if the tree grows to N+1 nodes when a new file is inserted, it needs to be shrunk back to N nodes. This works best if the algorithm removes the root node with the smallest file and lets the remaining nodes move up by repairing the tree.

Figure 4 shows the output of the test program in Listing 1. It first fills the heap with the paths to eight files including their sizes in bytes. However, it limits the maximum size of the heap to six entries. The result is the six largest files from the entire collection, sorted in descending order by file size. As expected, the abc and ghi entries with byte counts of 123 and 124 are dropped as the smallest files.

Listing 1

heap-main.go

01 package main
02 import (
03   "fmt"
04 )
05 func main() {
06   entries := []FsEntry{
07     {path: "abc", used: 123},
08     {path: "def", used: 456},
09     {path: "ghi", used: 124},
10     {path: "jkl", used: 457},
11     {path: "mno", used: 125},
12     {path: "pqr", used: 458},
13     {path: "stu", used: 126},
14     {path: "vwx", used: 999},
15   }
16
17   h := NewTopN(6)
18   for _, e := range entries {
19     h.Add(e)
20   }
21
22   h.Iterate(func(e FsEntry) {
23     fmt.Printf("%s %d\n", e.path, e.used)
24   })
25 }
Figure 4: Output of the test program for the min heap.

Listing 1 uses NewTopN(6) in line 17 to create a min heap with a maximum of six nodes. The for loop starting in line 18 uses Add() in each round to place the next entry from the entries array slice onto the min heap. Finally, Iterate() works its way through the winning entries in the data structure starting in line 22, from the file with the highest byte count to the smallest of the top six.

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy Linux Magazine

SINGLE ISSUES
 
SUBSCRIPTIONS
 
TABLET & SMARTPHONE APPS
Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

  • Ready to Rumble

    A Go program writes a downloaded ISO file to a bootable USB stick. To prevent it from accidentally overwriting the hard disk, Mike Schilli provides it with a user interface and security checks.

  • Making History

    In the history log, the Bash shell records all commands typed by the user. Mike Schilli extracts this data with Go for a statistical analysis of his typing behavior.

  • Perl: Lyrical Logfiles

    Instead of just monitoring incoming requests in your web server's logfile, a sound server makes them audible and lets you listen to the tune of users surfing the site.

  • Motion Sensor

    Inotify lets applications subscribe to change notifications in the filesystem. Mike Schilli uses the cross-platform fsnotify library to instruct a Go program to detect what's happening.

  • Programming Snapshot – Go

    To find files quickly in the deeply nested subdirectories of his home directory, Mike whips up a Go program to index file metadata in an SQLite database.

comments powered by Disqus
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters

Support Our Work

Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.

Learn More

News