$Id: turing.xml,v 1.58 2002/11/09 03:53:26 stevegt Exp $
Steve Traugott, TerraLuna, LLC — http://www.stevegt.com
Lance Brown, National Institute of
Environmental Health Sciences — lance@bearcircle.net
Originally
accepted for publication in the proceedings of the
USENIX Large Installation System Administration
conference, Philadelphia, PA Nov 3-8, 2002. Copyright
2002 Stephen Gordon Traugott, All Rights Reserved
Abstract
Hosts in a well-architected enterprise infrastructure
are self-administered; they perform their own maintenance
and upgrades. By definition, self-administered hosts
execute self-modifying code. They do not behave according
to simple state machine rules, but can incorporate complex
feedback loops and evolutionary recursion.
The implications of this behavior are of immediate
concern to the reliability, security, and ownership costs
of enterprise computing. In retrospect, it appears that
the same concerns also apply to manually-administered
machines, in which administrators use tools that execute
in the context of the target disk to change the contents
of the same disk. The self-modifying behavior of both
manual and automatic administration techniques helps
explain the difficulty and expense of maintaining high
availability and security in conventionally-administered
infrastructures.
The practice of infrastructure architecture tool design
exists to bring order to this self-referential chaos.
Conventional systems administration can be greatly
improved upon through discipline, culture, and adoption of
practices better fitted to enterprise needs. Creating a
low-cost maintenance strategy largely remains an art.
What can we do to put this art into the hands of
relatively junior administrators? We think that part of
the answer includes adopting a well-proven strategy for
maintenance tools, based in part upon the theoretical
properties of computing.
In this paper, we equate self-administered hosts to
Turing machines in order to help build a theoretical
foundation for understanding this behavior. We discuss
some tools that provide mechanisms for reliably managing
self-administered hosts, using deterministic ordering
techniques.
Based on our findings, it appears that no tool, written
in any language, can predictably administer an enterprise
infrastructure without maintaining a deterministic,
repeatable order of changes on each host. The runtime
environment for any tool always executes in the context of
the target operating system; changes can affect the
behavior of the tool itself, creating circular
dependencies. The behavior of these changes may be
difficult to predict in advance, so testing is necessary
to validate changed hosts. Once changes have been
validated in testing they must be replicated in production
in the same order in which they were tested, due to these
same circular dependencies.
The least-cost method of managing multiple hosts also
appears to be deterministic ordering. All other known
management methods seem to include either more testing or
higher risk for each host managed.
This paper is a living document; revisions and
discussion can be found at Infrastructures.Org, a project
of TerraLuna, LLC.
1 Foreword
…by Steve Traugott
In 1998, Joel Huddleston and I suggested that an
entire enterprise infrastructure could be managed as one
large “enterprise virtual machine” (EVM)
[bootstrap]. That paper briefly described parts of
a management toolset, later named ISconf [isconf]. This
toolset, based on relatively simple makefiles and shell
scripts, did not seem extraordinary at the time. At one
point in the paper, we said that we would likely use
cfengine [cfengine] the next time around — I had been
following Mark Burgess’ progress since 1994.
That 1998 paper spawned a web site and community at
Infrastructures.Org. This community in turn helped
launch the Infrastructure Architecture (IA) career
field. In the intervening years, we’ve seen the
Infrastructures.Org community grow from a few dozen to a
few hundred people, and the IA field blossom from
obscurity into a major marketing campaign by a leading
systems vendor.
Since 1998, Joel and I have both attempted to use
other tools, including cfengine version 1. I’ve also
tried to write tools from scratch again several times,
with mixed success. We have repeatedly found indications
that our 1998 toolset was better optimized than we had
originally thought. It appears that in some ways Joel
and I, and the rest of our group at the Bank, were
lucky; our toolset protected us from many of the
pitfalls that are lying in wait for IAs.
One of these pitfalls appears to be the lack of
deterministic ordering; I never realized how important
ordering was until I tried to use other tools that don’t
support it.
left without the ability to concisely describe the order
of changes to be made on a machine, I’ve seen a marked
decrease in my ability to predict the behavior of those
changes, and a large increase in my own time spent
monitoring, troubleshooting, and coding for exceptions.
These experiences have shown me that loss of order seems
to result in lower production reliability and higher
labor cost.
The ordered behavior of ISconf was more by accident
than design. I needed a quick way to get a grip on 300
machines. I cobbled a prototype together on my HP100LX
palmtop one March ’94 morning, during the 35-minute
train ride into Manhattan. I used ‘make’ as the state
engine because it’s available on most UNIX machines.
The deterministic behavior ‘make’ uses when iterating
over prerequisite lists is something I didn’t think of
as important at the time — I was more concerned with
observing known dependencies than creating repeatable
order.
Using that toolset and the EVM mindset, we were able
to repeatedly respond to the chaotic international
banking mergers and acquisitions of the mid-90’s. This
response included building and rebuilding some of the
largest trading floors in the world, launching on
schedule each time, often with as little as a few
months’ notice, each launch cleaner than the last. We
knew at the time that these projects were difficult;
after trying other tool combinations for more recent
projects I think I have a better appreciation for just
how difficult they were. The phrase “throwing a truck
through the eye of a needle” has crossed my mind more
than once. I don’t think we even knew the needle was
there.
At the invitation of Mark Burgess, I joined his LISA
2001 [lisa] cfengine workshop to discuss what we’d found
so far, with possible targets for the cfengine 2.0
feature set. The ordering requirement seemed to need
more work; I found ordering surprisingly difficult to
justify to an audience practiced in the use of
convergent tools, where ordering is often considered a
constraint to be specifically avoided [couch]
[eika-sandnes]. Later that week, Lance Brown and
I were discussing this over dinner, and he hit on the
idea of comparing a UNIX machine to a Turing machine.
The result is this paper.
Based on the symptoms we have seen when comparing
ISconf to other tools, I suspect that ordering is a
keystone principle in automated systems administration.
Lance and I, with a lot of help from others, will
attempt to offer a theoretical basis for this suspicion.
We encourage others to attempt to refute or support this
work at will; I think systems administration may be
about to find its computer science roots. We have also
already accumulated a large FAQ for this paper — we’ll
put that on the website. Discussion of this paper, as
well as related topics, is encouraged on the
infrastructures mailing list at
http://Infrastructures.Org.
2 Why Order Matters
There seem to be (at least) several major reasons why the
order of changes made to machines is important in the
administration of an enterprise infrastructure:
A “circular dependency” or control-loop problem exists
when an administrative tool executes code that modifies
the tool or the tool’s own foundations (the underlying
host). Automated administration tool designers cannot
assume that the users of their tool will always
understand the complex behavior of these circular
dependencies. In most cases we will never know what
dependencies end users might create.
See sections (8.40), (8.46).
A test infrastructure is needed to test the behavior
of changes before rolling them to production. No tool
or language can remove this need, because no testing is
capable of validating a change in any conditions other
than those tested. This test infrastructure is useless
unless there is a way to ensure that production machines
will be built and modified in the same way as the test
machines. See section (6), ‘The Need for Testing’.
It appears that a tool that produces deterministic
order of changes is cheaper to use than one that permits
more flexible ordering. The unpredictable behavior
resulting from unordered changes to disk is more costly
to validate than the predictable behavior produced by
deterministic ordering. See section (8.58).
Because cost is a significant driver in the
decision-making process of most IT organizations, we
will discuss this point more in section (3).
Local staff must be able to use administrative tools
after a cost-effective (i.e. cheap and quick) turnover
phase. While senior infrastructure architects may be
well-versed in avoiding the pitfalls of unordered
change, we cannot be on the permanent staff of every IT
shop on the globe. In order to ensure continued health
of machines after rollout of our tools, the tools
themselves need to have some reasonable default behavior
that is safe if the user lacks this theoretical
knowledge. See section (8.54).
This business requirement must be addressed by tool
developers. In our own practice, we have been able to
successfully turn over enterprise infrastructures to
permanent staff many times over the last several years.
Turnover training in our case is relatively simple,
because our toolsets have always implemented ordered
change by default. Without this default behavior, we
would have also needed to attempt to teach advanced
techniques needed for dealing with unordered behavior,
such as inspection of code in vendor-supplied binary
packages. See section (7.2.2), ‘Right Packages, Wrong Order’.
3 A Prediction
“Order Matters” when we care about both quality and cost
while maintaining an enterprise infrastructure. If the
ideas described in this paper are correct, then we can
make the following prediction:
The least-cost way to ensure that the
behavior of any two hosts will remain completely
identical is to always implement the same changes
in the same order on both hosts.
This sounds very simple, almost intuitive, and for
many people it is. But to our knowledge, isconf
[isconf] is the only generally-available tool which
specifically supports administering hosts this way.
There seems to be no prior art describing this
principle, and in our own experience we have yet to see
it specified in any operational procedure. It is
trivially easy to demonstrate in practice, but has at
times been surprisingly hard to support in conversation,
due to the complexity of theory required for a proof.
Note that this prediction does not apply only to those
situations when you want to maintain two or more
identical hosts. It applies to any computer-using
organization that needs cost-effective, reliable
operation. This includes those that have many unique
production hosts. See section (6), ‘The Need for Testing’. Section
(4.3) discusses this further, including
single-host rebuilds after a security breach.
This prediction also applies to disaster recovery (DR)
or business continuity planning. Any credible DR
procedure includes some method of rebuilding lost hosts,
often with new hardware, in a new location.
Restoring from backups is one way to do this, but making
complete backups of multiple hosts is redundant — the
same operating system components must be backed up for
each host, when all we really need are the user data and
host build procedures (how many copies of
/bin/ls do we really need on tape?). It is
usually more efficient to have a means to quickly and
correctly rebuild each host from scratch. A tool that
maintains an ordered record of changes made after
install is one way to do this.
This prediction is particularly important for those
organizations using what we call self-administered
hosts. These are hosts that run an automated
configuration or administration tool in the context of
their own operating environment. Commercial tools in
this category include Tivoli, Opsware, and CenterRun
[tivoli] [opsware] [centerrun]. Open-source tools
include cfengine, lcfg, pikt, and our own isconf
[cfengine] [lcfg] [pikt] [isconf]. We will discuss the
fitness of some of these tools later — not all appear
fully suited to the task.
This prediction applies to those organizations which
still use an older practice called “cloning” to create
and manage hosts. In cloning, an administrator or tool
copies a disk image from one machine to another, then
makes the changes needed to make the host unique (at
minimum, IP address and hostname). After these initial
changes, the administrator will often make further
changes over the life of the machine. These changes may
be required for additional functionality or security,
but are too minor to justify re-cloning. Unless order
is observed, identical changes made to multiple hosts
are not guaranteed to behave in a predictable way
(8.47). The procedure needed for
properly maintaining cloned machines is not
substantially different from that described in
section (7.1).
This prediction, stated more formally in
section (8.58), seems to apply to UNIX, Windows,
and any other general-purpose computer with a rewritable
disk and modern operating system. More generally, it
seems to apply to any von Neumann machine with
rewritable nonvolatile storage.
4 Management Methods
All computer systems management methods can be
classified into one of three categories: divergent,
convergent, and congruent.
4.1 Divergence
Divergence (figure 4.1.1) generally implies bad
management. Experience shows us that virtually all
enterprise infrastructures are still divergent today.
Divergence is characterized by the configuration of
live hosts drifting away from any desired or assumed
baseline disk content.
Figure 4.1.1: Divergence
One quick way to tell if a shop is divergent is to
ask how changes are made on production hosts, how
those same changes are incorporated into the baseline
build for new or replacement hosts, and how they are
made on hosts that were down at the time the change
was first deployed. If you get different answers,
then the shop is divergent.
The symptoms of divergence include unpredictable
host behavior, unscheduled downtime, unexpected
package and patch installation failure, unclosed
security vulnerabilities, significant time spent
“firefighting”, and high troubleshooting and
maintenance costs.
The causes of divergence are generally those operations
that create non-reproducible change.
Divergence can be caused by ad-hoc manual changes,
changes implemented by two independent automatic
agents on the same host, and other unordered changes.
Scripts which drive rdist, rsync, ssh, scp,
[rdist] [rsync] [ssh] or other change
agents as a push operation [bootstrap] are also a
common source of divergence.
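As a minimal sketch of how this happens (our own
illustration; the hostlist file and the pushed file are
hypothetical), consider an ad-hoc push loop:

  #!/bin/sh
  # Hypothetical ad-hoc push: copy a changed file to every
  # host we can reach right now.  Hosts that are down,
  # firewalled, or added to the list later never receive
  # the change, and no durable record of the push is kept,
  # so hosts quietly drift apart.
  for host in $(cat hostlist); do
      scp /master/etc/ntp.conf "$host":/etc/ntp.conf ||
          echo "skipped $host" >&2
  done

The loop completes even when some hosts are skipped, and
nothing ensures that a host rebuilt next year will ever
see this change.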
4.2 Convergence
Convergence (figure 4.2.1) is the process most
senior systems administrators first begin when
presented with a divergent infrastructure. They tend
to start by manually synchronizing some critical files
across the diverged machines, then they figure out a
way to do that automatically. Convergence is
characterized by the configuration of live hosts
moving towards an ideal baseline. By definition, all
converging infrastructures are still diverged to some
degree. (If an infrastructure maintains full
compliance with a fully descriptive baseline, then it
is congruent according to our definition, not convergent.
See section (4.3), ‘Congruence’.)
Figure 4.2.1: Convergence
The baseline description in a converging
infrastructure is characteristically an incomplete
description of machine state. You can quickly detect
convergence in a shop by asking how many files are
currently under management control. If an approximate
answer is readily available and is on the order of a
few hundred files or less, then the shop is likely
converging legacy machines on a file-by-file basis.
A convergence tool is an excellent means of bringing
some semblance of order to a chaotic infrastructure.
Convergent tools typically work by sampling a small
subset of the disk — via a checksum of one or more
files, for example — and taking some action in
response to what they find. The samples and actions
are often defined in a declarative or descriptive
language that is optimized for this use. This
emulates and preempts the firefighting behavior of a
reactive human systems administrator — “see a
problem, fix it”. Automating this process provides
great economies of scale and speed over doing the same
thing manually.
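A minimal sketch of this sample-and-correct behavior,
assuming a hypothetical baseline copy under
/var/lib/baseline (this is our own illustration, far
simpler than what cfengine actually provides):

  #!/bin/sh
  # Convergent check: sample one managed file and act only
  # if it deviates from the desired state.  Unmanaged files
  # are never examined, which is why convergence alone
  # cannot prove that two hosts are identical.
  desired=/var/lib/baseline/etc/resolv.conf
  live=/etc/resolv.conf
  if ! cmp -s "$desired" "$live"; then
      cp "$desired" "$live"               # corrective action
      logger -t converge "repaired $live"
  fi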
Convergence is a feature of Mark Burgess’ Computer
Immunology principles [immunology]. His cfengine is
in our opinion the best tool for this job [cfengine].
Simple file replication tools [sup] [cvsup] [rsync]
provide a rudimentary convergence function, but
without the other action semantics and fine-grained
control that cfengine provides.
Because convergence typically includes an
intentional process of managing a specific subset of
files, there will always be unmanaged files on each
host. Whether current differences between unmanaged
files will have an impact on future changes is
undecidable, because at any point in time we do not
know the entire set of future changes, or what files
they will depend on.
It appears that a central problem with convergent
administration of an initially divergent
infrastructure is that there is no documentation or
knowledge as to when convergence is complete. One must
treat the whole infrastructure as if the convergence
is incomplete, whether it is or not. So without more
information, an attempt to converge formerly divergent
hosts to an ideal configuration is a never-ending
process. By contrast, an infrastructure based upon
first loading a known baseline configuration on all
hosts, and limited to purely orthogonal and
non-interacting sets of changes, implements congruence
(4.3). Unfortunately, this is not the
way most shops use convergent tools such as cfengine.
The symptoms of a convergent infrastructure include
a need to test all changes on all production hosts, in
order to detect failures caused by remaining unforeseen
differences between hosts. These failures can impact
production availability. The deployment process
includes iterative adjustment of the configuration
tools in response to newly discovered differences,
which can cause unexpected delays when rolling out new
packages or changes. There may be a higher incidence
of failures when deploying changes to older hosts.
There may be difficulty eliminating some of the last
vestiges of the ad-hoc methods mentioned in section
(4.1). Continued use of ad-hoc and
manual methods virtually ensures that convergence
cannot complete.
With all of these faults, convergence still
provides much lower overall maintenance costs and
better reliability than what is available in a
divergent infrastructure. Convergence features also
provide more adaptive self-healing ability than pure
congruence, due to a convergence tool’s ability to
detect when deviations from baseline have occurred.
Congruent infrastructures rely on monitoring to detect
deviations, and generally call for a rebuild when they
have occurred. We discuss the security reasons for
this in section (4.3).
We have found apparent limits to how far convergence
alone can go. We know of no previously divergent
infrastructure that, through convergence alone, has
reached congruence (4.3). This makes
sense; convergence is a process of eliminating
differences on an as-needed basis; the managed disk
content will generally be a smaller set than the
unmanaged content. In order to prove congruence, we
would need to sample all bits on each disk, ignore
those that are user data, determine which of the
remaining bits are relevant to the operation of the
machine, and compare those with the baseline.
In our experience, it is not enough to prove via
testing that two hosts currently exhibit the same
behavior while ignoring bit differences on disk; we
care not only about current behavior, but future
behavior as well. Bit differences that are currently
deemed not functional, or even those that truly have
not been exercised in the operation of the machine,
may still affect the viability of future change
directives. If we cannot predict the viability of
future change actions, we cannot predict the future
viability of the machine.
Deciding what bit differences are “functional” is
often open to individual interpretation. For
instance, do we care about the order of lines and
comments in /etc/inetd.conf? We might strip out
comments and reorder lines without affecting the
current operation of the machine; this might seem like
a non-functional change until, two years from now, the
missing comments impair our future ability to correctly
understand the infrastructure when designing a new
change. This
example would seem to indicate that even
non-machine-readable bit differences can be meaningful
when attempting to prove congruence.
Unless we can prove congruence, we cannot validate
the fitness of a machine without thorough testing, due
to the uncertainties described in section
(8.25). In order to be valid,
this testing must be performed on each production
host, due to the factors described in section
(8.47). This testing itself requires
either removing the host from production use or
exposing untested code to users. Without this
validation, we cannot trust the machine in
mission-critical operation.
4.3 Congruence
Congruence (figure 4.3.1) is the practice of
maintaining production hosts in complete compliance
with a fully descriptive baseline (7.1).
Congruence is defined in terms of disk state rather
than behavior, because disk state can be fully
described, while behavior cannot (8.59).
Figure 4.3.1: Congruence
By definition, divergence from baseline disk state
in a congruent environment is symptomatic of a failure
of code, administrative procedures, or security. In
any of these three cases, we may not be able to assume
that we know exactly which disk content was damaged.
It is usually safe to handle all three cases as a
security breach: correct the root cause, then rebuild.
You can detect congruence in a shop by asking how
the oldest, most complex machine in the infrastructure
would be rebuilt if destroyed. If years of sysadmin
work can be replayed in an hour, unattended, without
resorting to backups, and only user data need be
restored from tape, then host management is likely
congruent.
Rebuilds in a congruent infrastructure are
completely unattended and generally faster than in any
other; anywhere from 10 minutes for a simple
workstation to 2 hours for a node in a complex
high-availability server cluster (most of that two
hours is spent in blocking sleeps while meeting
barrier conditions with other nodes).
Symptoms of a congruent infrastructure include
rapid, predictable, “fire-and-forget” deployments and
changes. Disaster recovery and production sites can
be easily maintained or rebuilt on demand in a
bit-for-bit identical state. Changes are not tested
for the first time in production, and there are no
unforeseen differences between hosts. Unscheduled
production downtime is reduced to that caused by
hardware and application problems; firefighting
activities drop considerably. Old and new hosts are
equally predictable and maintainable, and there are
fewer host classes to maintain. There are no ad-hoc
or manual changes. We have found that congruence
makes cost of ownership much lower, and reliability
much higher, than any other method.
Our own experience and calculations show that the
return-on-investment (ROI) of converting from
divergence to congruence is less than 8 months for
most organizations. See (figure 4.3.2).
This graph assumes an existing divergent
infrastructure of 300 hosts, 2%/month growth rate,
followed by adoption of congruent automation
techniques. Typical observed values were used for
other input parameters. Automation tool rollout began
at the 6-month mark in this graph, causing temporarily
higher costs; the return on this investment arrives five
months later, where the manual and automatic lines cross
over at the 11-month mark. Following crossover, we
see a rapidly increasing cost savings, continuing over
the life of the infrastructure. While this graph is
calculated, the results agree with actual enterprise
environments that we have converted. There is a CGI
generator for this graph at Infrastructures.Org, where
you can experiment with your own parameters.
Figure 4.3.2: Cumulative costs for fully automated (congruent)
versus manual administration.
Congruence allows us to validate a change on one
host in a class, in an expendable test environment,
then deploy that change to production without risk of
failure. Note that this is useful even when (or
especially when) there may be only one production host
in that class.
A congruence tool typically works by maintaining a
journal of all changes to be made to each machine,
including the initial image installation. The journal
entries for a class of machine drive all changes on
all machines in that class. The tool keeps a lifetime
record, on the machine’s local disk, of all changes
that have been made on a given machine. In the case
of loss of a machine, all changes made can be
recreated on a new machine by “replaying” the same
journal; likewise for creating multiple, identical
hosts. The journal is usually specified in a
declarative language that is optimized for expressing
ordered sets and subsets. This allows subclassing and
easy reuse of code to create new host types. See section (7.1), ‘Describing Disk State’.
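A minimal sketch of such a journal replay, assuming
hypothetical file locations (this toy loop is in the
spirit of isconf, not its actual implementation):

  #!/bin/sh
  # Replay the ordered journal for this host's class.  Each
  # step is a change script, named so that lexical order
  # matches the order in which the steps were first tested;
  # the local record lists steps already applied on this
  # machine, so the loop is safe to re-run and will bring a
  # freshly installed host up to date.
  journal=/var/is/journal.d    # 010-baseline, 020-ssh, ...
  record=/var/is/applied
  touch "$record"
  for step in $(ls "$journal" | sort); do
      grep -qx "$step" "$record" && continue  # already done
      sh "$journal/$step" || exit 1  # stop; never skip a step
      echo "$step" >> "$record"
  done

Stopping on failure, rather than skipping ahead, is what
preserves the deterministic order that the rest of this
paper argues for.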
There are few tools that are capable of the ordered
lifetime journaling required for congruent behavior.
Our own isconf (7.3.1) is the only
specifically congruent tool we know of in production
use, though cfengine, with some care and extra coding,
appears to be usable for administration of congruent
environments. We discuss this in more detail in
section (7.3.2).
We recognize that congruence may be the only
acceptable technique for managing life-critical
systems infrastructures, including those that:
- Influence the results of human-subject health and
  medicine experiments
- Provide command, control, communications, and
  intelligence (C3I) for battlefield and weapons
  systems environments
- Support command and telemetry systems for manned
  aerospace vehicles, including spacecraft and
  national airspace air traffic control
Our personal experience shows that awareness of
the risks of conventional host management techniques
has not yet penetrated many of these organizations.
This is cause for concern.
5 Ordered Thinking
We have found that designers of automated systems
administration tools can benefit from a certain mindset:
Think like a kernel developer, not an application
programmer.
A good multitasking operating system is designed to
isolate applications (and their bugs) from each other
and from the kernel, and produce the illusion of
independent execution. Systems administration is all
about making sure that users continue to see that
illusion.
Modern languages, compilers, and operating systems are
designed to isolate applications programmers from “the
bare hardware” and the low-level machine code, and
enable object-oriented, declarative, and other
high-level abstractions. But it is important to
remember that the central processing unit(s) on a
general-purpose computer only accepts machine-code
instructions, and these instructions are coded in a
procedural language. High-level languages are
convenient abstractions, but are dependent on several
layers of code to deliver machine language instructions
to the CPU.
In reality, on any computer there is only one program;
it starts running when the machine finishes power-on
self test (POST), and stops when you kill the power.
This program is machine language code, dynamically
linked at runtime, calling in fragments of code from all
over the disk. These “fragments” of code are what we
conventionally think of as applications, shared
libraries, device drivers, scripts, commands,
administrative tools, and the kernel itself — all of
the components that make up the machine’s operating
environment.
None of these fragments can run standalone on the bare
hardware — they all depend on others. We cannot
analyze the behavior of any application-layer tool as if
it were a standalone program. Even kernel startup
depends on the bootloader, and in some operating systems
the kernel runtime characteristics can be influenced by
one or more configuration files found elsewhere on disk.
This perspective is opposite from that of an
application programmer. An application programmer
“sees” the system as an axiomatic underlying support
infrastructure, with the application in control, and the
kernel and shared libraries providing resources. A
kernel developer, though, is on the other side of the
syscall interface; from this perspective, an application
is something you load, schedule, confine, and kill if
necessary.
On a UNIX machine, systems administration tools are
generally ordinary applications that run as root. This
means that they, too, are at the mercy of the kernel.
The kernel controls them, not the other way around. And
yet, we depend on automated systems administration tools
to control, modify, and occasionally replace not only
that kernel, but any and all other disk content. This
presents us with the potential for a circular dependency
chain.
A common misconception is that “there is some
high-level tool language that will avoid the need to
maintain strict ordering of changes on a UNIX machine”.
This belief requires that the underlying runtime layers
obey axiomatic and immutable behavioral laws. When
using automated administration tools we cannot consider
the underlying layers to be axiomatic; the
administration tool itself perturbs those underlying
layers. See section (7.2.3), ‘Circular
Dependencies’.
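A two-line example illustrates the problem (the package
name and script path are invented; any tool that upgrades
its own interpreter or shell faces the same issue):

  # Step A replaces the interpreter that step B runs under.
  pkg_add perl-5.8.0.tgz          # step A: new /usr/bin/perl
  perl /usr/local/adm/rotate.pl   # step B: depends on step A

Whether step B behaves as it did in testing depends
entirely on whether step A has already run; the only way
to know is to execute the two steps in the same order in
which they were tested.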
Inspection of high-level code alone is not enough.
Without considering the entire system and its resulting
machine language code, we cannot prove correctness. For
example:
print "hello\n";
This looks like a trivial-enough Perl program; it
“obviously” should work. But what if the Perl
interpreter is broken? In other words, a conclusion of
“simple enough to easily prove” can only be made by
analyzing low-level machine language code, and the means
by which it is produced.
“Order Matters” because we need to ensure that the
machine-language instructions resulting from a set of
change actions will execute in the correct order, with
the correct operands. Unless we can prove program
correctness at this low level, we cannot prove the
correctness of any program. It does no good to prove
correctness of a higher-level program when we do not
know the correctness of the lower runtime layers. If
the high-level program can modify those underlying
layers, then the behavior of the program can change with
each modification. Ordering of those modifications
appears to be important to our ability to predict the
behavior of the high-level program. (Put simply, it is
important to ensure that you can step off of the tree
limb before you cut through it.)
6 The Need for Testing
Just as we urge tool designers to think like kernel
developers (5), we urge systems administrators
to think like operating systems vendors — because they
are. Systems administration is actually systems
modification; the administrator replaces binaries
and alters configuration files, creating a combination
which the operating system vendor has never tested.
Since many of these modifications are specific to a
single site or even a single machine, it is unreasonable
to assume that the vendor has done the requisite
testing. The systems administrator must perform the
role of systems vendor, testing each unique combination
— before the users do.
Due to modern society’s reliance on computers, it is
unethical (and just plain bad business practice) for an
operating system vendor to release untested operating
systems without at least noting them as such. Better
system vendors undertake a rigorous and exhaustive
series of unit, system, regression, application, stress,
and performance testing on each build before release,
knowing full well that no amount of testing is ever
enough (8.9). They do this in their own labs;
it would make little sense to plan to do this testing on
customers’ production machines.
And yet, IT shops today habitually have no dedicated
testing environment for validating changed operating
systems. They deploy changes directly to production
without prior testing. Our own experience and informal
surveys show that greater than 95% of shops still do
business this way. It is no wonder that reliability,
security, and high availability are still major issues
in IT.
We urge systems administrators to create and use
dedicated testing environments, to refrain from inflicting
changes on users without prior testing, and to consider
themselves the operating systems vendors that they really
are. We urge
IT management organizations to understand and support
administrators in these efforts; the return on
investment is in the form of lower labor costs and much
higher user satisfaction. See section (8.42).
Availability of a test environment enables the
deployment of automated systems administration tools,
bringing major cost savings. See
(figure 4.3.2).
A test environment is useless until we have a means to
replicate the changes we made in testing onto production
machines. “Order matters” when we do this replication;
an earlier change will often affect the outcome of a
later change. This means that changes made to a test
machine must later be “replayed” in the same order on
the machine’s production counterpart. See section (8.45).
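A minimal sketch of this test-then-production replay,
assuming a hypothetical journal repository and an
apply-journal helper that wraps the replay loop sketched
in section (4.3):

  #!/bin/sh
  # Run on each host (pull, not push).  A new change step
  # is appended to the class journal and exercised on the
  # test host first; only then do production hosts of the
  # same class replay it, receiving the same changes in the
  # same order in which they were validated.
  class=$(cat /var/is/class)      # e.g. "webserver"
  rsync -a "repo::journals/$class/" /var/is/journal.d/
  /usr/local/sbin/apply-journal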
Testing costs can be greatly reduced by limiting the
number of unique builds produced; this holds true for
both vendors and administrators. This calls for careful
management of changes and host classes in an IT
environment, with an intent of limiting proliferation of
classes. See section (8.41).
Note that use of open-source operating systems does
not remove the need for local testing of local
modifications. In any reasonably complex
infrastructure, there will always be local configuration
and non-packaged binary modifications which the
community cannot have previously exercised. We prefer
open source, but we do not expect it to relieve us of our
responsibilities.
7 Ordering HOWTO
Automated systems administration is very
straightforward. There is only one way for a user-side
administrative tool to change the contents of disk in a
running UNIX