The Desired State of Things

It’s 2am. Something is broken in production. An engineer SSHes into the box, finds the problem (a config file with the wrong value), fixes it, restarts the service, watches the metrics recover. Crisis averted. Everyone goes back to sleep.

The Ansible playbook never learns about the fix. The next scheduled run either overwrites it or, more likely, doesn’t run at all because nobody wants to roll the dice on a Friday. Six months later, someone re-provisions the server from the same playbook and the old bug is back. Nobody connects the dots for another two weeks.

This story is so common it barely registers as a problem anymore. We’ve accepted it as the cost of doing business. But it shouldn’t be, and it didn’t use to be.

The Golden Age

Configuration management used to understand something fundamental: configuration is not an event. It’s a process. A continuous one.

Mark Burgess began this work in 1993 with CFEngine and later formalized it as Promise Theory in the mid-2000s. The idea was deceptively simple: an agent runs on every host, continuously comparing what is to what should be, and correcting any divergence it finds. Not when someone remembers to run a deployment. Not on a Tuesday. Always.

Puppet picked this up in 2005 and made it accessible. You wrote a manifest describing the desired state of your infrastructure (packages installed, files present, services running) and a catalog compiler turned it into a dependency graph. An agent on every node pulled that catalog every thirty minutes and enforced it. If someone changed /etc/nginx/nginx.conf by hand, the next agent run put it back. If a package got removed, the agent reinstalled it. The system healed itself.
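
To make the "dependency graph" idea concrete, here is a minimal sketch in Python rather than Puppet's own language. The resource names and the `catalog` structure are invented for illustration and are not Puppet's actual catalog format; the point is only that declared dependencies determine the order in which resources are enforced.

```python
from graphlib import TopologicalSorter

# Hypothetical catalog: each resource maps to the resources it depends on.
catalog = {
    "package:nginx": set(),
    "file:/etc/nginx/nginx.conf": {"package:nginx"},
    "service:nginx": {"package:nginx", "file:/etc/nginx/nginx.conf"},
}


def apply_order(resources):
    """Return resources in an order that satisfies every dependency."""
    return list(TopologicalSorter(resources).static_order())


order = apply_order(catalog)
for resource in order:
    print(resource)
```

The package comes first, then the config file, then the service that needs both. The author declares relationships, not steps.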

Chef arrived in 2009 and added programmability. Ruby DSL, cookbooks, a richer execution model. SaltStack followed in 2011 with ZeroMQ-based transport and event-driven architecture: faster, more reactive.

These tools had real problems. Puppet’s DSL was its own language you had to learn. Chef required genuine Ruby knowledge. Salt’s documentation was perpetually six months behind the code. The barrier to entry was high, the learning curve steep, and the operational overhead of running a central server with PKI and agent infrastructure was nontrivial.

But they understood the core principle. They ran continuously. They detected drift. They corrected it autonomously. The system’s job was to make reality match the declaration, not just once, but for every instant going forward.
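
The detect-and-correct cycle is small enough to sketch. This is an illustrative Python toy, not how any of these agents is implemented: `reconcile` and the `desired` mapping are hypothetical names, and it only handles file contents in a throwaway directory.

```python
import tempfile
from pathlib import Path


def reconcile(desired: dict[Path, str]) -> list[Path]:
    """Compare each file to its declared content; rewrite any that diverged."""
    corrected = []
    for path, content in desired.items():
        actual = path.read_text() if path.exists() else None
        if actual != content:
            path.write_text(content)
            corrected.append(path)
    return corrected


# Demonstration in a temporary directory (assumed layout, for illustration only).
workdir = Path(tempfile.mkdtemp())
conf = workdir / "app.conf"
desired = {conf: "max_connections = 100\n"}

first = reconcile(desired)                # file is missing, so it gets created
conf.write_text("max_connections = 5\n")  # simulate a manual edit on the box
second = reconcile(desired)               # drift is detected and corrected
third = reconcile(desired)                # already converged, nothing to do
```

An agent is essentially this function called on a timer, forever. The important property is that the third call does nothing: a converged system stays quiet.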

The Simplicity Trade-off

Then Ansible arrived in 2012, and the game changed.

No agents. No PKI. No central server. Just SSH, a control node, and YAML files. The barrier to entry dropped from weeks to hours. If you could write a YAML list and SSH into a machine, you could automate with Ansible. It was, genuinely, a revolution in accessibility.

Ansible deserved to win on simplicity. The tools it displaced had made themselves too hard to adopt. In infrastructure tooling, a tool people actually use beats a tool that’s theoretically superior but sits undeployed. Puppet and Chef knew this was happening and couldn’t stop it. Their architectural choices made simplification difficult without abandoning what made them powerful.

But Ansible made a trade-off that gets less attention than it should. In exchange for that simplicity, it gave up continuous enforcement.

Ansible is a push-based, run-once tool. It connects to hosts, executes tasks in sequence, and disconnects. It has no agent. It gathers facts about hosts at runtime, and Ansible Automation Platform (formerly Tower) can cache those facts and track job history between runs. But facts describe what is, not what should be. There is no compiled desired state model, no catalog that can be compared against reality to detect drift between executions. You know the system was compliant at the moment the playbook last ran. You don’t know whether it still is. The only way to find out is to run the playbook again.

Red Hat’s own documentation advises users to “think declaratively first.” And some modules genuinely are declarative. The yum module will install a package if it’s missing and do nothing if it’s present. But the execution model is fundamentally procedural. Tasks run in order. The command and shell modules execute arbitrary commands with no idempotency guarantees. There is no reconciliation loop. There is no convergence.
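
The difference between an idempotent operation and an arbitrary command is easy to show outside Ansible. The sketch below is plain Python, not Ansible module code; `append_line` and `ensure_line` are hypothetical helpers standing in for a raw shell append and a declarative "make sure this line exists" task.

```python
import tempfile
from pathlib import Path


def append_line(path: Path, line: str) -> None:
    """Non-idempotent: every run adds another copy, like `echo ... >> file`."""
    with path.open("a") as f:
        f.write(line + "\n")


def ensure_line(path: Path, line: str) -> bool:
    """Idempotent: add the line only if absent; return True when a change was made."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False
    with path.open("a") as f:
        f.write(line + "\n")
    return True


workdir = Path(tempfile.mkdtemp())
a, b = workdir / "a.conf", workdir / "b.conf"
setting = "net.ipv4.ip_forward = 1"

for _ in range(3):
    append_line(a, setting)                          # three runs, three copies
changes = [ensure_line(b, setting) for _ in range(3)]  # only the first run changes anything
```

Running the first helper repeatedly keeps changing the system. Running the second converges after one pass, which is the property a reconciliation loop depends on.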

This matters more than the community generally acknowledges. Ansible occupies an awkward middle ground: it’s marketed as a desired state tool but doesn’t actually enforce desired state. It enforces desired state at the moment you run it, and then it walks away.

The other costs accrued over time. Twenty-two levels of variable precedence, a number that comes straight from the documentation and is not an exaggeration. Jinja2 templating that starts simple and becomes its own programming language once you need conditionals and loops. Fork-per-task execution that crawls on large fleets. A Python dependency chain that, on some distributions, requires its own virtual environment just to install.

None of this stopped Ansible from becoming the default. It holds roughly 32% of the configuration management market today, second only to Terraform (which solves a different problem). It became the default not because it was the best at desired state (it isn’t, and doesn’t claim to be) but because it was the easiest to start with. And in infrastructure tooling, easy to start with has a way of becoming impossible to leave.

The Acquisition Graveyard

While Ansible grew, the tools that did understand continuous reconciliation were being quietly dismantled.

Chef was acquired by Progress Software in October 2020. It continues to exist in the way that enterprise software continues to exist: maintained, supported, invoiced, but no longer the subject of anyone’s enthusiasm. Its Ruby DSL never achieved the adoption that YAML-based tools managed, and Progress has done little to change that trajectory.

SaltStack was acquired by VMware the same year, then passed to Broadcom when Broadcom acquired VMware in November 2023. The aftermath has been grim. Commits to the Salt repository dropped sharply after June 2024. Broadcom migrated the package repository with one week’s notice, deleted all packages older than Salt v3006, and (in a move that crystallized the community’s frustration) dropped IPv6 support from the new package mirror. Tom Hatch, Salt’s creator, left Broadcom but committed to staying with the open source project. Community mirrors appeared to fill the gaps. Commercial end-of-support is October 2028. The prognosis is not good.

Puppet is the most painful story. Perforce acquired Puppet in May 2022 after Puppet abandoned its IPO plans. For two years, the community watched and waited. Then, in November 2024, Perforce made its move. New binaries and packages would be published only to private repositories. Community contributors would have to agree to an end-user license agreement. Usage beyond 25 nodes would require a commercial license. Official compiled binaries would no longer be freely available.

The community’s response was immediate and precise. Antoine Beaupre, a long-time contributor, said something that bears repeating: “We’re not forking Puppet; Perforce is forking Puppet. What Perforce is doing right now is taking the open source code that we have collaboratively used, debugged, written, collaborated, stared at and deployed on thousands of machines, and closing access to it to paying customers.”

The OpenVox fork launched its first release on January 21, 2025. It’s functionally equivalent to Puppet 8.11 and maintains full backward compatibility. Gene Liverman and the Vox Pupuli community are doing serious work. But a fork of a tool in decline is still a fork of a tool in decline. The innovation stopped years ago. The fork preserves what existed; it doesn’t advance it.

The pattern is consistent. Every major configuration management tool that understood continuous reconciliation (Puppet, Chef, Salt) has been acquired by a company that did not build it, does not love it, and is extracting value from an installed base rather than investing in the future. CFEngine, the intellectual ancestor of all of them, holds less than 1% market share. The tools that got the architecture right lost the market, and the companies that bought them are letting them die.

The Kubernetes Paradox

Here’s what makes this situation strange rather than merely unfortunate.

While traditional configuration management was being hollowed out by acquisitions, the container world was perfecting exactly the pattern those tools pioneered.

Kubernetes implements a reconciliation loop that is, architecturally, the same idea Burgess described with CFEngine: observe actual state, compare it to desired state, take corrective action, repeat. A controller watches for divergence and fixes it. Not when someone remembers to. Not on a schedule. Continuously.

If a pod dies, it comes back. If a deployment’s replica count drifts, it reconverges. If someone manually deletes a service, the controller recreates it. We take this for granted now. It would be absurd to suggest that Kubernetes should only enforce desired state when an operator remembers to run kubectl apply.
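
Here is roughly what that looks like, reduced to a toy. This is not Kubernetes code and uses no Kubernetes client; the `Cluster` class is an in-memory stand-in for actual state, and `reconcile` is a hypothetical single pass of a replica controller.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Cluster:
    """In-memory stand-in for actual state; not a real Kubernetes client."""
    pods: set[str] = field(default_factory=set)
    _ids: count = field(default_factory=count)

    def start_pod(self) -> None:
        self.pods.add(f"pod-{next(self._ids)}")

    def stop_pod(self) -> None:
        self.pods.pop()


def reconcile(cluster: Cluster, desired_replicas: int) -> None:
    """One pass: observe the pod count, compare to desired, act on the difference."""
    while len(cluster.pods) < desired_replicas:
        cluster.start_pod()
    while len(cluster.pods) > desired_replicas:
        cluster.stop_pod()


cluster = Cluster()
reconcile(cluster, 3)   # initial convergence: three pods started
cluster.pods.pop()      # a pod dies
reconcile(cluster, 3)   # the next pass brings the count back to three
```

Notice that `reconcile` never asks what happened. It only looks at the current count and the desired count, which is why the same pass handles a crash, a manual deletion, or a scale-down.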

The operator pattern extended this further. Encode human expertise (how to back up a database, how to scale a cluster, how to rotate certificates) into software that runs 24/7. The operator watches, detects drift, corrects it, and reports. It’s automation that doesn’t sleep.

Crossplane and Kratix are pushing this pattern outward to cloud resources, databases, queues, and SaaS services. ArgoCD and Flux brought it to GitOps. The principle is the same everywhere: declare what should be true, and let a controller make it true, continuously.

But step outside the cluster boundary, onto the host, the VM, the bare metal server that runs your workloads, and you’re back to 2014. SSH in, run a playbook, hope nothing has drifted since last time. No controller. No reconciliation loop. No self-healing.

We solved desired state for containers. We solved it for cloud resources. For the operating system, the thing that actually runs your workloads, we’ve somehow gone backwards.

The Numbers

This isn’t an abstract concern. Configuration drift is expensive and getting more so.

Misconfigurations are a leading cause of cloud security incidents, consistently appearing in the top three attack vectors year after year. IBM’s 2024 Cost of a Data Breach Report puts the average breach cost at $4.88 million, with public cloud breaches averaging $5.17 million, and the average time to even identify a breach at 194 days. That’s more than six months of exposure before anyone notices.

The configuration drift management market was valued at $1.2 billion in 2024. Projections put it at $4.5 billion by 2033. That’s a market growing at roughly 16% annually, driven by a problem that existing tools don’t solve well.
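
The growth figure can be checked from the two endpoints. This sketch assumes the standard compound annual growth rate formula over the nine years from 2024 to 2033; the function name `cagr` is just a label for this example.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values over a number of years."""
    return (end_value / start_value) ** (1 / years) - 1


# Market size in billions of dollars: $1.2B in 2024, projected $4.5B in 2033.
rate = cagr(1.2, 4.5, 2033 - 2024)
print(f"{rate:.1%}")
```

The result lands just under 16%, which matches the rounded figure.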

The trajectory here is not hard to read. If we can build systems that continuously scan container workloads against desired state and autonomously correct drift, there is no reason the same pattern shouldn’t apply to hosts. Autonomous drift remediation will become a standard expectation, not a premium feature. The direction is clear. The tooling to get there is not.

What About NixOS?

Someone always asks about NixOS, and it deserves a fair answer.

NixOS is the most intellectually honest attempt at a fully declarative operating system. The entire system (packages, services, users, kernel configuration) is described in a single configuration.nix file. Atomic upgrades. Instant rollbacks. Reproducible builds. If desired state at the OS level is the goal, NixOS is the purest expression of it.

It’s also, in practice, a mountain. Not a learning curve. A mountain. The Nix language is its own paradigm. The store model produces 30GB+ system footprints. Compilation times on modest hardware can stretch to hours. Flakes, the package management layer the community has been building toward, remained “experimental” for years while being recommended everywhere. The documentation assumes familiarity with concepts that take months to build.

More importantly, NixOS solves the problem by replacing the entire OS paradigm. Most organizations can’t do that. They have RHEL. They have Ubuntu LTS. They have compliance requirements that mandate specific distributions. They have thousands of servers running workloads that were built for a traditional Linux filesystem hierarchy. They need desired state enforcement for the infrastructure they actually have, not the infrastructure they’d have if they could start from scratch.

The answer can’t be “rewrite your operating system.” It has to meet infrastructure where it is.

The Gap

So here we are. The tools that understood continuous reconciliation are dead or dying. The tool that dominates doesn’t actually do it. The pattern that works has been proven in containers but not applied to hosts. NixOS works but requires you to change everything else first. And the market for solving this problem is growing at 16% a year.

This isn’t a technology problem. We know how to build reconciliation loops. We’ve done it for containers. We’ve done it for cloud resources. We’ve done it for databases and queues and DNS records and TLS certificates. The pattern is well-understood, battle-tested, and running in production at planetary scale.

The gap is at the host level. The operating system. The packages, files, services, users, permissions, and system settings that constitute the actual running state of a machine. The thing every other layer sits on top of.

Somebody should probably do something about that. So we did.

This is the first in a series. The next post will introduce a tool we’ve been building.
