When you're new to sys-adminry, a lot of it can be overwhelming, from the abundance of choices, to the specialized vocabulary. I'm writing a few posts about various aspects of sys-adminry/operations/devops/SRE to take some of the mystery out of it. This post is on the topic of configuration management.
Introduction
I've been lucky enough to see the rise of configuration management or "Infrastructure as Code" in my career. When I began as a part-time Jr. Sys-Admin in 1999, and a full-time sys-admin in 2001, most other system administrators I knew were manually configuring hosts. When a new server came in, they'd SSH into it and make various changes. If you were lucky, they'd semi-automated this process with a series of shell scripts, or Perl commands. More often than not, they simply kept this knowledge in their heads. Some of them even called it "job security", since if it wasn't written down, they couldn't be replaced.
I saw being a sys-admin as a lifetime career, so I immersed myself in understanding theory and best practices. Those included the new idea of "Configuration as Code" or, as we'd now call it, "Infrastructure as Code". This is the idea that we can deploy our servers much like computer programs are run: through repeatable automation. Doing this would give us many benefits, including:
- Being able to stand up a new system quickly
- Being able to recover from a disaster quickly
- Reducing the needs and complexity of backups
- Allowing for faster changes across the infrastructure
- Reducing the chances of human error
- Allowing a sys-admin to see all the parts involved in configuring a component of the system
- Seeing changes over time
- Preventing a sys-admin from changing things without going through the official process
In recent years the field of sys-adminry has diminished due to the rise of "serverless" deployments, but there are still times when it makes sense to have sys-admins. Someone has to build out those clouds, after all, and these skills are transferable to small deployments, as well as home labs.
While the abstraction level that most people work on has changed, the core problem that we were solving over twenty years ago still persists.
In this post, I'll discuss a few aspects of configuration management systems: agent vs agentless designs, DSLs vs general purpose programming languages, the use of ready-made configuration, and migration between configuration management systems.
Before that, let's dive into exactly what configuration management systems do and common patterns between them.
What does a Configuration Management system do?
Once you have your OS installed, you need to take some actions on it to make it do what you want it to do. That might mean changing config files, adding software, adding users, changing permissions, etc. That's what a configuration management system does: it handles the configuration on your system, including all of those tasks, and several more.
Most configuration management systems are both declarative and idempotent. Declarative means that instead of saying what to do, like you would in a procedural programming language or shell script, you say what you want there to be. For example, if you want a directory called foo to exist, with owner alice, you'd declare that you want the directory foo to exist, and alice to be the owner, and the configuration management system would determine what steps needed to be taken to make that happen, whether that's creating the directory, or changing permissions on the existing directory.
Idempotent means that it's possible to run the same command again and again without unintended side effects. In the sample of our directory above, if the directory is already there and has the right permissions, it doesn't need to be created again. That's idempotent, at least for our purposes of configuration management.
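To make both ideas concrete, here's a minimal Python sketch. It is not how any particular tool implements this; the function name ensure_directory and its return value are made up for illustration, and it manages the directory's mode rather than its owner, since changing ownership usually requires root.

```python
import os
import stat
import tempfile


def ensure_directory(path, mode=0o755):
    """Make `path` a directory with `mode`; return the list of changes made.

    Hypothetical helper for illustration only.
    """
    changes = []
    # Declarative: describe the end state, then act only on the differences.
    if not os.path.isdir(path):
        os.makedirs(path)
        changes.append("created")
    current_mode = stat.S_IMODE(os.stat(path).st_mode)
    if current_mode != mode:
        os.chmod(path, mode)
        changes.append("mode")
    return changes


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "foo")
    print(ensure_directory(target))  # first run has work to do
    print(ensure_directory(target))  # second run finds nothing to change
```

The second call returning an empty list is the idempotence: the desired state already holds, so nothing happens.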
Agents vs Agentless
I've used various configuration management systems. In my professional life, I used CFEngine and Puppet the most, but I've also used Chef, bcfg2, Ansible, and most recently PyInfra.
One of the differences between these systems is Agent vs Agentless, which is essentially the difference between a "pull" model and a "push" model.
In an agent-based system, a host that's under configuration management will contact the configuration management server and ask for the current configuration. The host then makes whatever changes are necessary to bring itself in line. These systems are called agent-based because most of them run some kind of local daemon that checks in with the server. I don't know this as fact, but the term agent may come from CFEngine, as it uses the program "cf-agent" to make the changes on the system.
Agentless systems work by having the configuration management system "push" configuration out to the hosts, usually through the use of SSH.
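The difference is mostly about who starts the conversation. Here's a toy Python sketch of the two models, with hosts reduced to in-memory dictionaries. The ConfigServer and Host classes and the push function are hypothetical names for this example, and a real agentless tool would connect over SSH rather than call a method.

```python
class ConfigServer:
    """Holds the desired configuration (stand-in for a real server)."""

    def __init__(self, desired):
        self.desired = dict(desired)


class Host:
    """A managed machine, reduced to a name and a dict of settings."""

    def __init__(self, name):
        self.name = name
        self.state = {}

    def apply(self, desired):
        """Converge local state on `desired`; return the keys that changed."""
        changed = [k for k, v in desired.items() if self.state.get(k) != v]
        self.state.update(desired)
        return changed

    def check_in(self, server):
        """Pull model: the agent on the host asks the server for its config."""
        return self.apply(server.desired)


def push(hosts, desired):
    """Push model: the admin's machine reaches out to each host in turn."""
    return {host.name: host.apply(desired) for host in hosts}
```

In the pull model each host calls check_in on its own schedule; in the push model nothing happens until someone runs push.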
Determining which you should use depends on the needs of your environment.
Push-based tools tend to be easier to set up. You don't need a central server, and essentially any host that has the configuration management system on it and has access to the other hosts can act as the configuration management host. Agentless systems also often allow you to run one-off commands across your infrastructure. This isn't a common task, but being able to do this through the configuration management system is nice.
In my experience, agentless configuration management begins to have challenges when the number of hosts scales, or when you need a more formal environment.
Once you reach a certain number of machines, an agentless system will need to run push sessions in parallel, adding complexity. You also run the risk of two sysadmins running the configuration management system on a host at the same time. The solution that's sometimes offered is a server that will handle the configuration management push for you, but then you might as well be using a pull-based system.
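To give a feel for that added complexity, here's a rough Python sketch of a parallel push with a guard against overlapping runs on the same host. All of the names are hypothetical, and the lock only works inside a single process. Two sysadmins on two different machines would need a lock on the target host itself or in some shared service, which is exactly the kind of thing that pulls you toward a central server.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_host_locks = {}
_registry_lock = threading.Lock()


def _lock_for(host):
    """Return the lock guarding `host`, creating it on first use."""
    with _registry_lock:
        return _host_locks.setdefault(host, threading.Lock())


def push_one(host, apply):
    """Run `apply(host)` unless another run on the same host is in progress."""
    lock = _lock_for(host)
    if not lock.acquire(blocking=False):
        return (host, "skipped: run already in progress")
    try:
        return (host, apply(host))
    finally:
        lock.release()


def push_all(hosts, apply, workers=10):
    """Push to many hosts at once using a bounded pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda h: push_one(h, apply), hosts))
```

Here apply is whatever function actually configures a single host, and workers caps how many sessions run at once.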
Another issue that we encountered is that having a pull-based system provided several benefits when it came down to formality. In one environment, some less-than-cooperative colleagues would shut down the configuration management system when they made changes. We would notice which hosts had not been updating normally in our monitoring system and be able to address the problem quickly. If we had used a push-based system, this might have been harder to detect. Additionally, since our configuration management was managed through a version control system, there was an element of formality in making changes. Changes had to be checked in before they could apply. Without that mechanism in place, I think we would have run into sysadmins who didn't check things into version control, either through forgetfulness or deliberately. Either way, centralizing the version control system made this easier, and a pull-based system made this simpler to implement.
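The monitoring check that caught those hosts can be very simple. Here's a sketch in Python, assuming you already record the last time each host checked in; the function name and the two-hour threshold are my own choices for this example, not something from a specific tool.

```python
from datetime import datetime, timedelta


def stale_hosts(last_checkin, now, max_age=timedelta(hours=2)):
    """Return hosts whose last check-in is older than `max_age`, sorted by name."""
    return sorted(
        host for host, seen in last_checkin.items() if now - seen > max_age
    )


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    seen = {
        "web1": now - timedelta(minutes=30),
        "db1": now - timedelta(hours=5),
    }
    print(stale_hosts(seen, now))
```

Any host that shows up in the result has either had its agent stopped or has some other problem worth looking at.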
My rule of thumb is that if you have a minimal team of three or under, or your host count is under 200, a push-based system is probably better, but for anything larger, I'd be more inclined to use a pull, or Agent-based configuration management system.
DSLs vs General Purpose Programming Languages
Expressing what to change on a server can be complicated, and different configuration management systems tackle the problem differently. Some, like CFEngine and Puppet, make up their own domain-specific language. Others, like Ansible, use a Data Serialization Language (also DSL), while still others, like Chef and PyInfra, use a general purpose programming language, such as Ruby or Python. Which choice is right for you? Let's explore it.
The benefits of using a domain-specific language are that it is language neutral, and often these domain-specific languages are simple to use and expressive for the tasks they're made for. The downside is that sysadmins must learn the domain-specific language.
Ansible uses YAML as its configuration language, which means that it retains some of the benefits of a domain-specific language, but because YAML is well known and understood, it's easier for someone to write code that emits YAML. This ease of parsing and emitting is not just a theoretical benefit. When I've worked with tools like CFEngine and Puppet, we were often running pre-run hooks that generated either config files or snippets automatically, so being able to generate them easily can be a real benefit. For example, if some of your configuration data comes from another team, and they store it in a database, you can use a hook to pull in that database information and put it in your configuration.
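A hook like that can be quite small. The following Python sketch reads users from a database and emits a YAML snippet. The table layout, the users_to_yaml name, and the output keys are all assumptions for this example. It uses only the standard library and leans on the fact that a JSON-quoted string is also a valid YAML double-quoted string for ordinary text; in practice you'd probably reach for a YAML library instead.

```python
import json
import sqlite3


def users_to_yaml(conn):
    """Emit a YAML list of users from a (hypothetical) `users` table."""
    rows = conn.execute("SELECT name, shell FROM users ORDER BY name").fetchall()
    lines = ["users:"]
    for name, shell in rows:
        # json.dumps quotes and escapes each value safely.
        lines.append(f"  - name: {json.dumps(name)}")
        lines.append(f"    shell: {json.dumps(shell)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, shell TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("bob", "/bin/sh"), ("alice", "/bin/bash")],
    )
    print(users_to_yaml(conn), end="")
```

The in-memory database stands in for the other team's data store; the hook would write the result to a file that the configuration management run then picks up.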
On the other hand, while YAML is one of the more friendly data serialization formats, it's still harder to read than either a domain specific language like CFEngine or Puppet, or a general purpose computing language.
The third option is to express the configuration in a standard programming language. Chef does this with Ruby, and PyInfra does this with Python. This has many benefits. General purpose programming languages are very expressive, which can eliminate repetitive or verbose configuration, or the need for pre-processing, since the query to an external system can be done at run-time.
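As a sketch of what that expressiveness buys you, here's some plain Python that declares one home directory per user with a loop instead of a repeated stanza. The Directory record and both functions are hypothetical and are not PyInfra's or Chef's actual API; fetch_users stands in for a run-time query to an external system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Directory:
    """A declared directory resource (hypothetical, not a real tool's API)."""
    path: str
    owner: str
    mode: int = 0o750


def fetch_users():
    """Stand-in for a run-time query to an external system such as a database."""
    return {"alice", "bob", "carol"}


def home_directories(users):
    """Declare one home directory per user, in a stable order."""
    return [Directory(path=f"/home/{user}", owner=user) for user in sorted(users)]


if __name__ == "__main__":
    for resource in home_directories(fetch_users()):
        print(resource)
```

Adding a user upstream adds a directory here with no change to the configuration code, and no pre-processing step.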
Reasons not to use a general purpose programming language are that the expressive power of programming comes with a cost. If your staff is unfamiliar with the programming language the configuration management system uses, they'll have to learn it. If the staffer is unfamiliar with any programming language, this would be an additional barrier. Worse still are "clever programmers" who may generate configuration code that is complicated, or "too clever", leading to unreadability, challenges predicting what the result will be, or even the possibility of procedural code sneaking into the configuration management system. The result of a "clever programmer" may be code that is difficult to read, is brittle, or has unintended consequences when run. On the flip side, code written by someone who is unfamiliar with the target language, or programming in general, can come with its own problems, including inconsistency, inefficiency, or simply "being strange".
Despite its challenges, I feel a configuration management system that uses a general purpose language is usually a good choice. The issue of staff needing to learn a programming language is offset by the fact that staffers would otherwise have to learn a DSL, or store configuration in a complex system like Ansible's YAML, which may as well be its own DSL. Regarding programming, for the most part the kind of programming that a sys-admin will need to know to work with one of these systems is limited, and I believe most if not all system administrators should have some programming familiarity. The issue of code quality is significant, and a general purpose language may lead to shooting oneself in the foot, but this is a tradeoff between the possibility of problems due to bad practices and problems due to software design.
That said, it may be worthwhile to not introduce this possibility into your configuration stack if your staff is large or you don't have the staff resources to check over the code that's being checked in.
Ready Made Configuration
Several configuration management systems have ready-to-use configuration that you can download and use. Chef has cookbooks (bundles of recipes) shared through Chef Supermarket, and Ansible has roles on Ansible Galaxy. The question is whether these ready-made bundles are something you should use.
Ansible Galaxy stands out from other configuration management systems in that some of the roles available on Ansible Galaxy are provided by the vendors themselves. That said, my advice is to use these to study how they work, but generally avoid pre-made configuration.
Ready-made configuration seems like an attractive offer to a new sys-admin, and an even more attractive offer to management, who see this as a way to do more with less expense, but the fact is that every organization's needs will differ and there's no way that someone else can know what's right for your environment. Moreover, using someone else's configuration could lead to the sysadmins not understanding how the system is configured. Unlike importing a library in a computer program, because of the breadth of complexity, downloading configuration from a third party is more akin to copying and pasting code without looking at it first. If you're going to use one of these systems, I strongly recommend at the very least reading the upstream configuration first, if not re-making it for your environment from scratch.
Migrating Configuration Management Systems
Much of the fear in setting up a configuration management system is the fear of how difficult it will be to change the decision once you start. I'm here to assuage those fears; once you're using a configuration management system, migrating to another is easy.
Using a configuration management system at all involves deeply understanding your system and environment, which makes migrating to another system relatively easy. My suggestion is not to take the work on all at once, but to take the migration in stages, starting with only new machines. As you bring up new hosts, you will eventually migrate all the roles. At this point, you can either migrate older machines, or simply allow them to fade out of the infrastructure over time.
Conclusion
If you're curious about configuration management, I hope that I've made the task of choosing a configuration management system a little easier.
If you have any questions, be sure to send them to me!