The Power of the Unix Philosophy
"Although that philosophy can't be written down in a single sentence, at its heart is the idea that the power of a system comes more from the relationships among programs than from the programs themselves. Many Unix programs do trivial things in isolation, but, combined with other programs, become general and useful tools."
Many of the ideas of the Unix philosophy stem from practical considerations of the time: programs had to be small and self-contained due to extreme resource constraints in hardware. Interfaces were often limited to physical teletype machines that operated in simple text streams, and computers were often shared resources that had to serve multiple users, further limiting resource availability. In such an environment, extraneous features and bloated monoliths were often not practical.
Core Tenets
The Unix philosophy can be summarized by the following core tenets:

- Programs should do one thing and do it well
- Programs should be designed to work well together
- Programs should handle plain text streams
- Everything* is a file
Let's break these down individually.
1. Programs should do one thing and do it well
As mentioned at the start, the Unix philosophy rejects monolithic programs that try to do everything, opting instead for a large number of small programs that each do one specific task. Such programs are often called software tools, drawing a comparison to physical tools that excel at a single task: hammers are excellent for driving nails but terrible for driving screws, and combining a hammer and a screwdriver yields a tool well suited to neither job. Software designed to perform a single task has the added benefit of being much easier to debug and maintain, which increases its longevity.
2. Programs should be designed to work well together

For the true power of the Unix philosophy to shine, it is not enough that programs be simple and single-minded, as real-world computing tasks often require more than any individual program can achieve. It is therefore imperative that these programs compose harmoniously to perform ad hoc computation. In most Unix environments this is best exemplified by the pipe `|`, which allows the output of one program to be seamlessly passed as the input to another, but more generally it means conforming to common interfaces like plain text streams and files to ensure maximal interoperability.
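As a small illustration (a sketch with a hypothetical `essay.txt`; any text file works), the following pipeline chains four single-purpose tools to answer a question none of them could answer alone:

```
# Count the unique words in a text file: tr splits the text into
# one word per line, sort groups repeated words together, uniq
# collapses the groups, and wc counts what remains
tr -s ' ' '\n' < essay.txt | sort | uniq | wc -l
```

Each stage neither knows nor cares what produced its input or what will consume its output; the plain text stream is the only contract between them.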
3. Programs should handle plain text streams
The early Unix pioneers described text streams as the universal interface, and elected to use plain text for nearly all data and file formats. This has a number of benefits over binary or proprietary formats:

- Plain text is simple, which means it's easier to write programs to use it effectively
- Plain text is human-readable, which can make a big difference if you need to e.g. inspect the output of an intermediate command while debugging a long chain of piped programs
- Plain text is portable, which means that your programs are much more likely to work on other architectures or operating systems with fewer modifications
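The human-readability point is easy to demonstrate (a minimal sketch; `/etc/passwd` is chosen only because it ships on virtually every Unix system): since every stage of a pipeline emits plain text, you can truncate the pipeline anywhere and inspect the intermediate output directly:

```
# Full pipeline: count how many users use each login shell
cut -d: -f7 /etc/passwd | sort | uniq -c

# Debugging: stop after the first stage and eyeball its output
cut -d: -f7 /etc/passwd | head
```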
4. Everything is a file
When people say that everything is a file in Unix, what they really mean is that everything has a file-like interface in Unix. This allows for vastly simplified interfaces to programs, allowing them to remain small and focused: Unix programs often only accept data through stdin or via files. These same programs can then be used to read/write data from/to physical devices, over networks, etc. Examples of Unix file interfaces include:

- Devices: block devices (e.g. hard drives) and character devices (e.g. serial output from a microcontroller)
- Network sockets: TCP/UDP sockets allow inter-process communication (IPC) between two remote hosts
- Processes: the `/proc` filesystem exposes files and directories for every running process in the system, allowing the system to be queried live using simple file-based tools
- Named pipes: similar to regular `|` pipes, named pipes allow for local IPC between processes, though named pipes can outlive the processes in question
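To make the file-like interfaces above concrete, here is a small sketch that drives two of them with ordinary file tools (paths assume a typical Linux system):

```
# Processes: read the name and resident memory of the current
# shell straight out of its /proc entry, as if it were a file
grep -E 'Name|VmRSS' /proc/$$/status

# Named pipes: create a pipe, then move data between two
# unrelated commands using nothing but file redirection
mkfifo /tmp/demo.fifo
ls / > /tmp/demo.fifo &   # the writer blocks until a reader opens the pipe
wc -l < /tmp/demo.fifo    # the reader counts the lines
rm /tmp/demo.fifo
```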
Unix shells also support redirecting stdin and stdout to and from files via `<` and `>`, respectively. This allows using file contents as inputs/outputs, and because everything in Unix is a file, this can be quite powerful. You can even wrap arbitrary command output in a file-like interface via process substitution, like this Bash example:
```
diff <(ls dir1) <(ls dir2)
```
Here we use Bash's `<()` syntax to wrap the output of each `ls` command as though it were a file, and pass these two "files" to `diff` for comparison. The result is that we can compare two directories to see if they contain the same files by name at their top level.
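Process substitution composes with any program that expects filename arguments, not just `diff`. As a further sketch (with hypothetical word lists `a.txt` and `b.txt`), `comm` demands sorted input, which we can supply on the fly:

```
# Print only the lines common to both files: -1 and -2 suppress
# the columns of lines unique to the first and second file
comm -12 <(sort a.txt) <(sort b.txt)
```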
Motivating Examples
People who are not accustomed to the Unix philosophy are often skeptical of it until they see its power firsthand. This section therefore offers some motivating examples demonstrating how the Unix philosophy can be used to solve practical everyday problems. Each task will be completed using the following limited set of standard Unix command-line tools:
- `find`: find files by name and type
- `grep`: search file contents
- `cd`: change directories
- `head`: output the first n lines of a file
- `diff`: compare files line by line
- `du`: determine the disk usage of a file
- `sort`: sort lines in a file
- `uniq`: report or omit duplicate lines in a file
- `xargs`: run commands for every item in stdin
- `ps`: get information on currently-running processes
- `md5sum`: hash file contents
  - Note that MD5 is not suited for cryptography and is easily defeated by modern commodity hardware. However, using it to hash local files in order to compare them for equality is fine.
Display the top ten memory-consuming processes
```
ps aux | sort -b -r -n -k 4 | head
```
We use `ps` to list processes from all users (`a`), including those without a controlling terminal (`x`), and to display additional information like CPU and memory usage (`u`). We then sort this list numerically (`-n`, so that e.g. a memory usage of 9.9 is not ranked above 10.0 as it would be in a lexicographic sort), ignoring leading blanks (`-b`), in reverse order (`-r`), by memory usage, which corresponds to the fourth column of each line (`-k 4`). We finally send this sorted list to `head` to get the first ten lines, the default number of lines used by `head`.
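A variation on the same pipeline (not part of the original task, just a sketch): since the sort key is only a column number, ranking by CPU consumption instead means pointing `-k` at the third column:

```
# Top ten CPU-consuming processes: %CPU is column 3 of ps aux
ps aux | sort -b -r -n -k 3 | head
```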
Find duplicate files (by contents) in current directory
```
find . -type f | xargs -d "\n" md5sum | sort | uniq -w 32 -D
```
First we list all files under the current directory (`find . -type f`), then we compute the MD5 hash of each newline-delimited file in the list (`xargs -d "\n" md5sum`) and sort the resulting list of hashes followed by filenames lexicographically (`sort`). We then filter this list with `uniq`, comparing entries based only on their first 32 characters, which represent the MD5 hash (`-w 32`), and printing all duplicate lines (`-D`).
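If you would rather avoid MD5 altogether, the same pipeline works with a stronger hash; the only change needed is bumping the `-w` width to match SHA-256's 64-character hex digest:

```
# Same idea with SHA-256 instead of MD5
find . -type f | xargs -d "\n" sha256sum | sort | uniq -w 64 -D
```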
Find the five largest files in your home directory
```
find ~ -type f | xargs -d "\n" du -h | sort -h -r -k 1 | head -n 5
```
We first list all files in the home directory (`find ~ -type f`), then we get the disk usage of each newline-delimited file in the list (`xargs -d "\n" du -h`). We pass the `-h` flag to `du` in order to output the sizes in human-readable form (3K, 40M, 3G, etc.). We then sort this list in reverse order (`-r`), comparing the human-readable sizes (`-h`) in the first part of each line, which corresponds to the disk usage (`-k 1`), before printing the first five lines (`head -n 5`).
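As an aside (a sketch assuming GNU `du` and `sort`), `du` can also walk the tree itself, which avoids spawning `find` and `xargs`; note that unlike the version above, this also reports directories, not just files:

```
# -a makes du report every file, not just directory totals
du -ah ~ | sort -h -r | head -n 5
```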
Recursively compare two directory trees
```
diff <(cd dir1 && find . | sort) <(cd dir2 && find . | sort)
```
Earlier we used `diff` to compare two directories, but that example only compared the top-level items in each directory. This one compares each directory structure recursively to determine whether the two trees contain the same files and directories by name (not by contents, unlike our earlier duplicate files example).

Like the earlier example, we make use of process substitution to create two "files" to send to `diff` for comparison, where each "file" is a sorted (`sort`) list of the paths of every file and directory present in the target directory (`cd dir && find .`). Note that we `cd` into each target directory first in each process substitution command: this prevents the target directory names from appearing in the sorted lists of paths, which would otherwise cause every entry to be considered a diff whenever the two target directories have different names.
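For completeness, GNU `diff` can also recurse on its own with `-r` (adding `-q` reports only which files differ rather than how). The process substitution version above remains useful when you want to compare names only and ignore contents:

```
# Built-in recursion: compares file contents too, not just names
diff -qr dir1 dir2
```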
Find all five-letter English words that end in "out"
```
grep '^..out$' /usr/share/dict/words
```
Here we search the `/usr/share/dict/words` text file of English words (shipped with most Linux distributions) for lines matching the start of a line (`^`), any two characters (`..`), the literal string "out", and finally the end of a line (`$`).
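The same dictionary-as-plain-text trick generalizes. For example (a sketch against the same file), dropping the leading anchor and adding `-c` counts every word that ends in "out", regardless of length:

```
# Count all words ending in "out"
grep -c 'out$' /usr/share/dict/words
```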