IT automation
IT automation, in its broadest sense, refers to the processes and software that help manage the IT infrastructure (servers, networking, and storage). We are currently witnessing a massive adoption of such processes and software.
The history of IT automation
At the beginning of IT history, there were very few servers, and a lot of people were needed to keep them working properly, usually more than one person per machine. Over the years, servers became more reliable and easier to manage, so it became possible for a single system administrator to manage multiple servers. In that period, administrators manually installed software, manually upgraded it, and manually changed configuration files. This was obviously a labor-intensive and error-prone process, so many administrators started implementing scripts and other means to make their lives easier. Those scripts were usually pretty complex, and they did not scale very well.
In the early years of this century, data centers started to grow quickly to meet companies' needs. Virtualization helped keep prices low, and since many of these services were web services, many servers were very similar to one another. At this point, new tools were needed to replace the scripts used before: configuration management tools.
CFEngine was one of the first tools to demonstrate configuration management capabilities, back in the 1990s; more recently, Puppet, Chef, Salt, and Ansible have appeared.
Advantages of IT automation
People often wonder if IT automation really brings enough advantages, considering that implementing it has some direct and indirect costs. The main advantages of IT automation are:
- Ability to provision machines quickly
- Ability to recreate a machine from scratch in minutes
- Ability to track any change performed on the infrastructure
For these reasons, it's possible to reduce the cost of managing the IT infrastructure by reducing the repetitive operations often performed by system administrators.
Disadvantages of IT automation
As with any other technology, IT automation comes with some disadvantages. From my point of view, these are the biggest:
- Automating all of the small tasks that were once used to train new system administrators
- If an error is made, it will be propagated everywhere
The consequence of the first is that new ways to train junior system administrators will need to be implemented.
Limiting the possible damage of error propagation
The second one is trickier. There are many ways to limit this kind of damage, but none of them will prevent it completely. The following mitigation options are available:
- Always have backups: Backups will not prevent you from nuking your machine; they will only make the restore process possible.
- Always test your infrastructure code (playbooks/roles) in a non-production environment: Companies have developed different pipelines to deploy code, and those usually include environments such as dev, test, staging, and production. Use the same pipeline to test your infrastructure code. If a buggy application reaches the production environment, it could be a problem. If a buggy playbook reaches the production environment, it could be catastrophic.
- Always peer-review your infrastructure code: Some companies have already introduced peer reviews for application code, but very few have done so for infrastructure code. As I was saying in the previous point, I think infrastructure code is way more critical than application code, so you should always peer-review your infrastructure code, whether you do it for your application code or not.
- Enable SELinux: SELinux is a kernel security module that is available on all Linux distributions (it is installed by default on Fedora, Red Hat Enterprise Linux, CentOS, Scientific Linux, and Unbreakable Linux). It allows you to limit the powers of users and processes in a very granular way. I suggest using SELinux instead of other similar modules (such as AppArmor) because it is able to handle more situations and permissions. SELinux will prevent a huge amount of damage because, if correctly configured, it will prevent many dangerous commands from being executed.
- Run the playbooks from a limited account: Even though user and privilege-escalation schemes have been in UNIX code for more than 40 years, it seems that not many companies use them. Using a limited user for all your playbooks, and escalating privileges only for the commands that need them, will help prevent you from nuking a machine while trying to clean an application's temporary folder (the privilege-escalation sketch after this list shows the pattern).
- Use horizontal privilege escalation: The sudo command is well known, but it is often used in its most dangerous form. The sudo command supports a '-u' parameter that allows you to specify the user you want to impersonate. If you have to change a file that is owned by another user, please do not escalate to root to do so; escalate to that user instead. In Ansible, you can use the become_user parameter to achieve this (see the privilege-escalation sketch after this list).
- When possible, don't run a playbook on all your machines at the same time: Staged deployments can help you detect a problem before it's too late. Many problems are not detectable in dev, test, staging, and QA environments; the majority of them are related to load, which is hard to emulate properly in those non-production environments. A new configuration you have just added to your Apache HTTPd or MySQL servers could be perfectly OK from a syntax point of view, but disastrous for your specific application under your production load. A staged deployment will allow you to test the new configuration on your actual load without risking downtime if something is wrong (a rolling-update sketch follows this list).
- Avoid guessing commands and modifiers: A lot of system administrators will try to remember the right parameter, and guess if they don't remember it exactly. I've done it too, a lot of times, but this is very risky. Checking the man page or the online documentation will usually take you less than two minutes, and often, by reading the manual, you'll find interesting notes you did not know. Guessing modifiers is dangerous because you could be fooled by a non-standard modifier (that is, -v is not the verbose mode for grep, and -h is not the help flag for the MySQL CLI).
- Avoid error-prone commands: Not all commands have been created equal. Some commands are (way) more dangerous than others. If you can assume a cat command is safe, you have to assume that a dd command is dangerous, since dd performs copies and conversions of files and volumes. I've seen people using dd in scripts to transform DOS files to UNIX format (instead of dos2unix), and many other very dangerous examples. Please avoid such commands, because they could result in a huge disaster if something goes wrong.
- Avoid unnecessary modifiers: If you need to delete a simple file, use rm ${file}, not rm -rf ${file}. The latter is often used by people who have learned that "to be sure, always use rm -rf", because at some point in their past they had to delete a folder. Dropping the unnecessary flags will prevent you from deleting an entire folder if the ${file} variable is set wrongly.
- Always check what could happen if a variable is not set: If you want to delete the contents of a folder and you use the rm -rf ${folder}/* command, you are looking for trouble. If the ${folder} variable is not set for some reason, the shell will read an rm -rf /* command, which is deadly (considering that rm -rf / will not work on the majority of current OSes because it requires the --no-preserve-root option, while rm -rf /* will work as expected). I'm using this specific command as an example because I have seen such situations: the variable was pulled from a database which, due to some maintenance work, was down, and an empty string was assigned to the variable. What happened next is probably easy to guess. If you cannot avoid using variables in dangerous places, at least check that they are not empty before using them (a variable-guard sketch follows this list). This will not save you from every problem, but it may catch some of the most common ones.
- Double-check your redirections: Redirections (along with pipes) are the most powerful elements of Linux shells. They can also be very dangerous: a cat /dev/random > /dev/sda command can destroy a disk, even though the cat command is usually overlooked because it's rarely dangerous on its own. Always double-check all commands that include a redirection.
- Use specific modules wherever possible: In this list I've used shell commands because many people will try to use Ansible as if it's just a way to distribute them: it's not. Ansible provides a lot of modules, and we'll see them throughout this book. They will help you create more readable, portable, and safe playbooks (a short example follows this list).
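To make the privilege-related advice concrete, here is a minimal sketch of a play that connects as an unprivileged user and escalates only where needed. The deploy and appuser account names, the webservers group, and the file paths are hypothetical, chosen only for illustration:

    - hosts: webservers
      remote_user: deploy            # a limited account, not root
      become: false                  # no privilege escalation by default
      tasks:
        - name: Update a file owned by the application user
          copy:
            src: app.conf
            dest: /home/appuser/app.conf
          become: true
          become_user: appuser       # horizontal escalation: impersonate appuser, not root

        - name: Restart the web server (this genuinely needs root)
          service:
            name: httpd
            state: restarted
          become: true               # escalate to root only for this one task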
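For staged deployments, Ansible offers the serial keyword, which rolls a play out to a few hosts at a time instead of hitting the whole group at once. A minimal sketch, where the group name, stage sizes, and template are only examples:

    - hosts: webservers
      serial:                        # roll out in stages: 1 host, then 10%, then the rest
        - 1
        - 10%
        - 100%
      max_fail_percentage: 0         # abort the rollout as soon as any host fails
      tasks:
        - name: Deploy the new Apache configuration
          template:
            src: httpd.conf.j2
            dest: /etc/httpd/conf/httpd.conf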
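The same guard-your-variables idea applies inside playbooks: fail fast if a variable that feeds a destructive operation is undefined or empty. A sketch using the assert module, where folder_to_clean is a hypothetical variable name:

    - hosts: all
      tasks:
        - name: Refuse to continue if folder_to_clean is unset, empty, or /
          assert:
            that:
              - folder_to_clean is defined
              - folder_to_clean | length > 0
              - folder_to_clean != '/'
            fail_msg: 'folder_to_clean is not set safely, aborting'

        - name: Remove the folder only after the guard has passed
          file:
            path: '{{ folder_to_clean }}'
            state: absent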
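And as a small example of preferring modules over raw shell commands, the file module can replace an rm invocation: it removes exactly the path it is given, does nothing if the path is already absent, and reports whether anything changed. The path here is hypothetical:

    - hosts: all
      tasks:
        # instead of something like: shell: rm /tmp/application.tmp
        - name: Remove a single temporary file
          file:
            path: /tmp/application.tmp
            state: absent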
Types of IT automation
There are a lot of ways to classify IT automation systems, but by far the most important is related to how the configurations are propagated. Based on this, we can distinguish between agent-based systems and agent-less systems.
Agent-based systems
Agent-based systems have two different components: a server, and a client called the agent.
There is only one server, which contains all of the configuration for your whole environment, while there are as many agents as there are machines in the environment.
Note
In some cases, more than one server could be present to ensure high availability, but treat it as if it's a single server, since they will all be configured in the same way.
Periodically, the client contacts the server to check whether a new configuration for its machine is available. If a new configuration is present, the client downloads and applies it.
Agent-less systems
In agent-less systems, no specific agent is present. Agent-less systems do not always respect the server/client paradigm, since it's possible to have multiple servers, and even the same number of servers and clients. Communications are initiated by the server, which contacts the client(s) using standard protocols (usually SSH for Linux machines and PowerShell remoting for Windows machines), as the sketch below shows.
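As an illustration, this is roughly what the agent-less model looks like with Ansible: the server only needs an inventory of machine names and SSH access to them, with no agent installed on the clients. The host names are hypothetical:

    # inventory file on the server, listing the managed machines:
    #   ws01.example.com
    #   ws02.example.com

    - hosts: all
      tasks:
        - name: Check that the client is reachable over SSH
          ping:                      # connects via SSH and verifies Ansible can run there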
Agent-based versus Agent-less systems
Aside from the differences outlined above, other contrasting factors arise as a consequence of them.
From a security standpoint, an agent-based system can be less secure. Since all machines have to be able to initiate a connection to the server, the server could be attacked more easily than in the agent-less case, where the management machine usually sits behind a firewall that does not accept any incoming connections.
From a performance point of view, agent-based systems run the risk of the server becoming saturated, so the roll-out could be slower. It also needs to be considered that, in a pure agent-based system, it is not possible to force-push an update immediately to a set of machines; you have to wait until those machines check in. For this reason, multiple agent-based systems have implemented out-of-band ways to provide such a feature. Tools such as Chef and Puppet are agent-based, but they can also run without a centralized server to scale to a large number of machines; these modes are commonly called Serverless Chef and Masterless Puppet, respectively.
An agent-less system is easier to integrate into an existing infrastructure, since the clients will see it as a normal SSH connection, and therefore no additional configuration is needed.