The Power of the Unix Philosophy

Thu Feb 08 2024 | 15 min read
The Unix philosophy describes a mindset for creating software tools that do one thing well and work harmoniously with one another. It eschews monolithic designs in favor of small, self-contained utilities that are easily composable, allowing for great computational flexibility. It originated with Ken Thompson and his colleagues at Bell Labs and is based on their experience writing the Unix operating system. In their seminal work The UNIX Programming Environment, Brian Kernighan and Rob Pike describe it as follows:
"Although that philosophy can't be written down in a single sentence, at its heart is the idea that the power of a system comes more from the relationships among programs than from the programs themselves. Many Unix programs do trivial things in isolation, but, combined with other programs, become general and useful tools."

Many of the ideas of the Unix philosophy stem from practical considerations of the time: programs had to be small and self-contained due to extreme resource constraints in hardware. Interfaces were often limited to physical teletype machines that operated in simple text streams, and computers were often shared resources that had to serve multiple users, further limiting resource availability. In such an environment, extraneous features and bloated monoliths were often not practical.

Core Tenets

The Unix philosophy can be summarized by the following core tenets:
  1. Programs should do one thing and do it well
  2. Programs should be designed to work well together
  3. Programs should handle plain text streams
  4. Everything* is a file
*in practice this is not quite always true

Let's break these down individually.

1. Programs should do one thing and do it well

As mentioned at the start, the Unix philosophy rejects monolithic programs that try to do everything, opting instead for a large number of small programs that each do one specific task. Such programs are often called software tools, drawing comparisons to physical tools that excel at performing a single task; after all, hammers are excellent for driving nails but terrible for driving screws, and combining a hammer and a screwdriver will result in a tool that is well-suited for neither task. Software designed to perform a single task has the added benefit of being much easier to debug and maintain, which increases longevity.

2. Programs should be designed to work well together

In order for the true power of the Unix philosophy to shine, it is not sufficient that programs be simple and single-minded, as real-world computing tasks often require more than what any individual such program can achieve. It is therefore imperative that these programs compose harmoniously in order to achieve ad hoc computation. In most Unix environments this is best exemplified by the use of the pipe | that allows for the output of one program to be seamlessly passed as the input to another, but more generally it means conforming to common interfaces like plain text streams and files to ensure maximal interoperability.
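
As a quick illustration (a sketch assuming a typical GNU/Linux environment and a hypothetical input file named input.txt), the following pipeline chains several single-purpose tools to list the most frequent words in a text file:

tr -cs 'A-Za-z' '\n' < input.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -n -r | head

Each program in the chain does one small job: splitting the text into words, lowercasing, sorting, counting, and ranking. Together they form a word-frequency counter that none of them provides on its own.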

3. Programs should handle plain text streams

The early Unix pioneers described text streams as the universal interface and elected to use plain text for nearly all data and file formats. This has a number of benefits over binary or proprietary formats: plain text is readable and editable by humans, it can be inspected and transformed with standard tools such as grep, sed, and diff, and it remains portable across programs and systems instead of being tied to whichever application understands a particular binary layout.
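
To see why this matters, consider that much of a Unix system's own data is plain text. For example (an illustrative one-liner, assuming a standard /etc/passwd), the following lists every account name on the system in alphabetical order using nothing but generic text tools:

cut -d : -f 1 /etc/passwd | sort

Because the file is just colon-delimited text, cut can pull out the first field and sort can order it; no special-purpose user-management tool is required.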

4. Everything is a file

When people say that everything is a file in Unix, what they really mean is that everything has a file-like interface. This vastly simplifies program interfaces, allowing programs to remain small and focused: Unix programs often accept data only through stdin or via files. These same programs can then be used to read and write data from physical devices, over networks, and so on. Examples of Unix file interfaces include device files under /dev, named pipes, sockets, and the process information exposed under /proc.

Unix shells have excellent support for handling files and file-like interfaces. For example, most shells allow redirection of input and output using < and >, respectively. This allows file contents to serve as inputs and outputs, and because everything in Unix is a file, this can be quite powerful. You can even wrap arbitrary command output in a file-like interface via process substitution, as in this Bash example:

diff <(ls dir1) <(ls dir2)

In this example, we use the <() syntax to wrap the result of each ls command as though it were a file, and pass these two "files" to diff for comparison. The result is that we can compare two directories to see if they contain the same files by name at their top level.
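
The file abstraction reaches beyond ordinary files on disk. As a small sketch (assuming a Linux system that exposes /dev/urandom), the same text-stream tools can read straight from a device:

head -c 16 /dev/urandom | od -An -t x1

Here head reads 16 bytes from the kernel's random-number device exactly as it would from a regular file, and od renders them as hexadecimal.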

Motivating Examples

The Unix philosophy is easiest to appreciate once you have seen it solve real problems. This section therefore offers some motivating examples that show how it can be used to handle practical, everyday tasks. Each task is completed using only a small set of basic command-line utilities: ps, sort, head, find, xargs, md5sum, uniq, du, diff, and grep.

Display the top ten memory-consuming processes

ps aux | sort -b -r -n -k 4 | head

Here we use ps to list all processes (a), including those without a controlling terminal (x), and display additional information like CPU and memory usage (u). We then sort this list numerically (-n), ignoring leading blanks (-b), in reverse order (-r), by memory usage, which corresponds to the fourth column of each line (-k 4). We finally send the sorted list to head to get the first ten lines, the default number of lines printed by head.
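
A nice consequence of building the task out of small pieces is that variations are trivial; for instance (same assumptions as above), changing the sort key to the third column ranks processes by CPU usage instead:

ps aux | sort -b -r -n -k 3 | head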

Find duplicate files (by contents) in current directory

find . -type f | xargs -d "\n" md5sum | sort | uniq -w 32 -D

First we get a list of all files in the current directory (find . -type f), then we compute the MD5 hash of each newline-delimited file in the list (xargs -d "\n" md5sum) and sort the resulting list of hashes and filenames lexicographically (sort). We then use uniq to compare adjacent entries based only on their first 32 characters, the MD5 hash (-w 32), and print all the duplicate lines (-D).
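
One caveat: because the pipeline above is newline-delimited, it can be confused by the rare filename that itself contains a newline. GNU find and xargs can pass NUL-delimited names instead; a sketch of the same pipeline with that change:

find . -type f -print0 | xargs -0 md5sum | sort | uniq -w 32 -D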

Find the five largest files in your home directory

find ~ -type f | xargs -d "\n" du -h | sort -h -r -k 1 | head -n 5

First we get a list of all files in the home directory (find ~ -type f), then we get the disk usage of each newline-delimited file in the list (xargs -d "\n" du -h). We pass the -h flag to du in order to output the sizes in human-readable form (3K, 40M, 3G, etc.). We then sort this list by human-readable size (sort -h) in reverse order (-r), keyed on the first field of each line, the disk usage (-k 1), before printing the first five lines (head -n 5).

Recursively compare two directory trees

diff <(cd dir1 && find . | sort) <(cd dir2 && find . | sort)

We saw an example earlier of using diff to compare two directories, but that example only compares the top-level items in each directory. This one compares each directory structure recursively to determine whether they contain the same files and directories by name (not by contents, unlike the earlier duplicate-files example).

Like the earlier example, we make use of process substitution to create two "files" to send to diff for comparison, where each "file" is a list of sorted (sort) paths for each file and directory present in the target directory (cd dir && find .). Note that we cd into each target directory first in each process substitution command: this prevents the target directory names from appearing in the sorted list of paths, which would cause every entry to be reported as a difference if the two target directories have different names.

Find all five-letter English words that end in "out"

grep '^..out$' /usr/share/dict/words

This might only be useful for cheating at Wordle, but I decided to include it as an example of a task that a user can complete locally instead of delegating it to an online search engine. It uses a regular expression to search the /usr/share/dict/words text file of English words (shipped with most Linux distributions) for lines matching the start of a line, any two characters, "out", and then the end of the line.
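
The same word list lends itself to other quick queries; as one more illustrative example, grep -c prints a count of matching lines rather than the matches themselves, so the following reports how many five-character entries the file contains:

grep -c '^.....$' /usr/share/dict/words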

Final Thoughts

At its heart, the Unix philosophy is about simplicity, interoperability, and separation of concerns; about small, simple programs that do their job well and adhere to simple and common interfaces; about unlocking the true potential of a system by using the myriad different combinations of utilities to effect arbitrary computation. It's a way of thinking about software and system design that maximizes flexibility and potential utility, and is as timeless as it is effective. While the original Unix itself may have come and gone, its legacy still prevails to this day, allowing for yesterday's programs to work seamlessly with today's in order to solve tomorrow's problems.

Further Reading