October 2018

  updated: September 2019

coreutils brought to you by the GNU project

This is a long-term project to decode all of the GNU coreutils in version 8.3.

This resource is for novice programmers exploring the design of command-line utilities. It is best used as a companion that provides useful background while you read the source code of a utility you are interested in. This is not a user guide; please see the applicable man pages for instructions on using these utilities.

Status: Complete!

  • Phase 1 [complete] – Each utility has a dedicated page discussing the namespace and execution overview.
  • Phase 2 [complete] – Expanded discussion of important design decisions and algorithms. Tracing utility lineage from both UNIX and early Coreutils. Porting content to something more collaborative. Enhancing the source walkthrough into something more useful. Creating a source code evolution visualizer.
  • Phase indefinite – Line by line code walkthrough for each utility will be accomplished over a long period. GitHub repo available to gather line-by-line notes. This segment was deferred due to consistent feedback that readers were more interested in high-level discussion.

The GNU Core Utilities

I’ll link the utility pages here at the top. Click the command name for the detailed page decoding that utility. The discussion, source code, and walkthroughs are available on each page. Bolded utilities have been expanded as part of phase 2. Enjoy!

Helpful background for code reading

The GNU coreutils has its foibles. Many of these utilities are approaching 30 years old and include revisions by many people over the years. Here are some things to keep in mind when reading the code:

  • Tiny programs – These utilities are small, (mostly) single-source-file programs designed to do one thing and do it well. They are not designed for long life or to scale beyond their role. Consequently, we see designs often considered ‘bad practice’ such as:
    • Many globals
    • Liberal use of macros
    • goto statements
    • Long functions with nested switches/loops
  • Know POSIX – Start with the Utility Syntax Guidelines. In general, POSIX supports interoperability by defining appropriate inputs and outputs, but leaves the ‘work’ to the implementation. While the GNU coreutils may not strictly conform to POSIX, many of its ideas are entrenched: permission bits, uids/gids, environment variables, exit status, and about 3718 more pages of trivia.
  • Outside help – Portability is a complex problem, and coreutils relies on extra help from a related project: gnulib. Almost every utility includes gnulib functions, which are designed to solve common problems consistently across various systems. No need to reinvent the wheel.
  • Launched from a shell – The core utilities expect support from a shell such as bash, zsh, ksh, and others. The shell forks (or clones) itself and executes the utility in the child process, passes the arguments, sets up the environment, redirects I/O via pipes, and retains the exit value.
  • Three families – GNU coreutils were originally three distinct packages for shell, text, and file utilities. Utilities within the same type share many of the same design patterns.
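To make the ‘launched from a shell’ point concrete, here is a minimal sketch of what a shell does for a single command, assuming a POSIX system. The helper name run_command is hypothetical, and a real shell also handles pipes, redirection, job control, and signals, none of which appear here.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: launch argv[0] with the NULL-terminated argument
   vector ARGV, roughly the way a shell would, and return the child's
   exit status.  Returns -1 if the child could not be created or did
   not exit normally (for example, it was killed by a signal).  */
int
run_command (char *const argv[])
{
  pid_t pid = fork ();          /* duplicate the calling process */
  if (pid < 0)
    return -1;

  if (pid == 0)
    {
      /* Child: replace this process image with the utility.  The
         arguments are passed along and the environment is inherited.  */
      execvp (argv[0], argv);
      _exit (127);              /* reached only if exec failed; shells
                                   conventionally report 127 here */
    }

  /* Parent (the "shell"): wait for the utility and retain its exit value.  */
  int status;
  if (waitpid (pid, &status, 0) < 0)
    return -1;
  return WIFEXITED (status) ? WEXITSTATUS (status) : -1;
}
```

Calling run_command with an argument vector such as { "ls", "-l", NULL } would run ls in a child process and hand back whatever exit status ls chose.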

Basic design

Most CLI utilities look something close to this:

[Figure: General CLI procedure]

The key ideas:

  • A setup phase for flags, options, localization, etc.
  • An argument parsing phase that reads input to set execution parameters
  • A processing/execution phase that prepares input for one or more syscalls
  • Many opportunities to check constraints and fail out of execution
    • Distinct exit statuses hint at where the problem occurred
    • EXIT_FAILURE is general and commonly used
  • Providing feedback after failed execution

This is the framework I’ll use to organize the decoding of each utility. We’ll see that each utility has its own variant of this idea, ranging from a few lines to thousands of lines. I’d categorize the variants in three groups: trivial, wrappers, and full utilities.

Trivial utilities
Trivial utilities have a unique setup phase that defines a macro in a couple of lines. The file then ‘includes’ the source of another utility, where the macro forces a specific control flow. Examples include arch, dir, and vdir.
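A rough sketch of the macro-plus-include pattern just described follows. Every name in it (shout.c, greet.c, GREET_MODE, greeting) is hypothetical and nothing is copied from coreutils. Because the example has to fit in one file, the shared source is pasted inline where a real wrapper would have an #include line.

```c
/* --- Hypothetical wrapper file "shout.c": this is the whole 'utility'. --- */
#define GREET_MODE GREET_LOUD
/* In a two-file layout the next line would be:
     #include "greet.c"
   The shared source is pasted below instead so the sketch is one file.  */

/* --- Hypothetical shared source "greet.c". --- */
#define GREET_NORMAL 0
#define GREET_LOUD 1

#ifndef GREET_MODE
# define GREET_MODE GREET_NORMAL   /* default when compiled on its own */
#endif

/* The macro chosen by the wrapper selects the control flow at compile time.  */
const char *
greeting (void)
{
#if GREET_MODE == GREET_LOUD
  return "HELLO";
#else
  return "hello";
#endif
}
```

Compiled through the wrapper, greeting takes the loud branch; compiled from the shared source alone, the #ifndef default applies and it takes the normal branch.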

Wrapper utilities
Wrappers perform setup and parse command-line options, which are passed directly as arguments to a syscall. The result of the syscall is the result of the utility. These utilities do little processing on their own. Examples include link, whoami, hostid, logname, and more.
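As an illustration of the wrapper shape, here is a minimal sketch in the spirit of link, assuming a POSIX system. The function name link_wrapper and the message wording are hypothetical; only the link(2) call itself is the real interface.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical core of a link-style wrapper: the two operands go
   straight to the link(2) syscall, and its result decides the exit
   status.  No other processing takes place.  */
int
link_wrapper (const char *existing, const char *new_name)
{
  if (link (existing, new_name) != 0)
    {
      /* Feedback after failed execution: report why the syscall failed.  */
      fprintf (stderr, "cannot create link %s to %s: %s\n",
               new_name, existing, strerror (errno));
      return EXIT_FAILURE;
    }
  return EXIT_SUCCESS;
}
```

Almost all of the logic lives in the kernel; the wrapper only translates the syscall's result into an exit status and a message.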

Full utilities
The diagram above shows the design of a full utility: a setup phase, an option/argument parsing phase, and execution. Execution means processing input data, and it may invoke many syscalls along the way to handle more data until complete. Most utilities fall into this category.
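The ‘many syscalls until complete’ part of a full utility usually looks like a read loop. The sketch below counts newline bytes on a file descriptor, assuming POSIX read(2). The function name count_lines and the buffer size are illustrative choices, not taken from any coreutils source.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical execution phase: count newline bytes read from FD.
   Returns the count, or -1 if read(2) reports an error.  */
long
count_lines (int fd)
{
  char buf[4096];
  long lines = 0;

  for (;;)
    {
      ssize_t n = read (fd, buf, sizeof buf);   /* one syscall per chunk */
      if (n == 0)
        break;                                  /* end of input */
      if (n < 0)
        {
          if (errno == EINTR)
            continue;                           /* interrupted: retry */
          return -1;                            /* real error: fail out */
        }

      /* Process the chunk before asking the kernel for more data.  */
      for (ssize_t i = 0; i < n; i++)
        if (buf[i] == '\n')
          lines++;
    }
  return lines;
}
```

Passing STDIN_FILENO as the descriptor would count the lines arriving on standard input, which is the typical case when the utility sits in a pipeline.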


Digging deeper

Let’s go through the most common ideas shared across many of the utilities. Knowing these concepts beforehand should speed up code reading.

Utility Initialization

All utilities have a short initialization procedure near the beginning of main():

  initialize_main (&argc, &argv);
  set_program_name (argv[0]);
  setlocale (LC_ALL, "");
  bindtextdomain (PACKAGE, LOCALEDIR);
  textdomain (PACKAGE);

  atexit (close_stdout);

This preamble solves a few administrative issues, the most important of which are internationalization and assigning the exit action. I’ll go through each of these lines below. These lines don’t impact the specific action of a utility.
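For readers without gnulib at hand, the two most important steps can be approximated with standard C and POSIX calls alone. The sketch below is a simplified stand-in, not gnulib's close_stdout: the names close_stream_checked, close_stdout_at_exit, and init_utility are hypothetical, and the gettext calls (bindtextdomain, textdomain) are left out entirely.

```c
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Simplified stand-in for an exit-time output check: flush and close
   STREAM, returning 0 on success and -1 if an earlier write failed or
   the final flush/close fails.  */
int
close_stream_checked (FILE *stream)
{
  int had_error = ferror (stream) != 0;
  int close_failed = fclose (stream) != 0;
  return (had_error || close_failed) ? -1 : 0;
}

/* Exit action: make sure buffered output really reached stdout.  */
static void
close_stdout_at_exit (void)
{
  if (close_stream_checked (stdout) != 0)
    {
      fputs ("write error\n", stderr);
      _exit (EXIT_FAILURE);     /* override the pending exit status */
    }
}

/* Hypothetical preamble: adopt the user's locale and register the
   exit action.  Returns 0 on success, nonzero if atexit fails.  */
int
init_utility (void)
{
  setlocale (LC_ALL, "");       /* locale comes from LANG / LC_* variables */
  return atexit (close_stdout_at_exit);
}
```

The point of the exit action is that a utility whose output silently failed to be written (a full disk, for instance) should not report success.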

Parsing with Getopt

Ever wonder why command line utilities have had the same look