In the summer of 2012, I had the great oportunity to clean up our hosting infrastructure. Instead of running many differently configured VMs, mostly one per customer, we started building a real redundant infrastructure with two really beefy physical database machines (yay) and quite many (22) virtual machines for caching, web app servers, file servers and so on.

All components are fully redundant, every box can fail without anybody really needing to do anything (one exception is the database - that’s also redundant, but we fail over manually due to the huge cost in time to failback).

Of course you don’t manage ~20 machines manually any more: Aside of the fact that it would be really painful to do for those that have to be configured in an identical way (the app servers come to mind), you also want to be able to quickly bring a new box online which means you don’t have time to manually go through the hassle of configuring it.

So, In the summer of 2012, when we started working on this, we decided to go with puppet. We also considered Chef but their server was really complicated to set up and install and there was zero incentive for them to improve because that would, after all, disincentivse people from becoming customers of their hosted solutions (the joys of open-core).

Puppet is also commerically backed, but everything they do is available as open source and their approach for the central server is much more «batteries included» than what Chef has provided.

And finally, after playing around a bit with both Chef and puppet, we noticed that puppet was way more bitchy and less tolerant of quick hacks around issues which felt like a good thing for people dabbling with HA configuration of a multi machine cluster for the first time.

Fast forward one year: Last autumn I found out about ansible (linking to their github page - their website reads like a competition in buzzword-bingo) and after reading their documentation, I immediately was convinced:

  • No need to install an agent on managed machines
  • Trivial to bootstrap machines (due to above point)
  • Contributors don’t need to sign a CLA (thank you so much, ansibleworks!)
  • No need to manually define dependencies of tasks: Tasks are run requentially
  • Built-in support for cowsay by default
  • Many often-used modules included by default, no hunting for, say, a sysctl module on github
  • Very nice support for rolling updates
  • Also providing a means to quickly do one-off tasks
  • Very easy to make configuration entries based on the host inventory (which requires puppetdb and an external database in the case of puppet)

Because ansible connects to each machine individually via SSH, running it against a full cluster of machines is going to take a bit longer than with puppet, but our cluster is small, so that wasn’t that much of a deterrent.

So last Sunday evening I started working on porting our configuration over from puppet to Ansible and after getting used to the YAML syntax of the playbooks, I made very quick progress.

progress

Again, I’d like to point out the excellent, built-in, on-by-default support for cowsay as one of the killer-features that made me seriously consider starting the porting effort.

Unfortunately though, after a very promising start, I had to come to the conclusion that we will be sticking with puppet for the time being because there’s one single feature that Ansible doesn’t have and that I really, really want a configuration management system to have:

I’ts not possible in Ansible to tell it to keep a directory clean of files not managed by Ansible in some way

There are, of course, workarounds, but they come at a price too high for me to be willing to pay.

  • You could first clean a directory completely using a shell command, but this will lead to ansible detecting a change to that folder every time it runs which will cause server restarts, even when they are not needed.

  • You could do something like this stack overflow question but this has the disadvantage that it forces you into a configuration file specific playbook design instead of a role specific one.

What I mean is that using the second workaround, you can only have one playbook touching that folder. But imagine for example a case where you want to work with /etc/sysctl.d: A generic role would put some stuff there, but then your firewall role might put more stuff there (to enable ip forwarding) and your database role might want to add other stuff (like tweaking shmmax and shmall, though that’s thankfully not needed any more in current Postgres releases).

So suddenly your /etc/sysctl.d role needs to know about firewalls and databases which totally violates the really nice separation of concerns between roles. Instead of having a firewall and a database role both doing something to /etc/sysctl.d, you know need a sysctl-role which does different things depending on what other roles a machine has.

Or, of course, you just don’t care that stray files never get removed, but honestly: Do you really want to live with the fact that your /etc/sysctl.d, or worse, /etc/sudoers.d can contain files not managed by ansible and likely not intended to be there? Both sysctl.d and sudoers.d are more than capable of doing immense damage to your boxes and this sneakily behind the watching eye of your configuration management system?

For me that’s inacceptable.

So despite all the nice advantages (like cowsay), this one feature is something that I really need and can’t have right now and which, thus, forces me to stay away from Ansible for now.

It’s a shame.

Some people tell me that implementing my feature would require puppet’s feature of building a full state of a machine before doing anything (which is error- prone and frustrating for users at times), but that’s not really true.

If ansible modules could talk to each other - maybe loosly coupled by firing some events as they do stuff, you could just name the task that makes sure the directory exists first and then have that task register some kind of event handler to be notified as other tasks touch the directory.

Then, at the end, remove everything you didn’t get an event for.

Yes. This would probably (I don’t know how Ansible is implemented internally) mess with the decouplling of modules a bit, but it would be so far removed from re-implementing puppet.

Which is why I’m posting this here - maybe, just maybe, somebody reads my plight and can bring up a discussion and maybe even a solution for this. Trust me: I’d so much rather use Ansible than puppet, it’s crazy, but I also want to make sure that no stray file in /etc/sysctl.d will bring down a machine.

Yeah. This is probably the most words I’ve ever used for a feature request, but this one is really, really important for me which is why I’m so passionate about this. Ansible got so f’ing much right. It’s such a shame to still be left unable to really use it.

Is this a case of xkcd1172? Maybe, but to me, my request seems reasonable. It’s not? Enlighten me! It is? Great! Let’s work on fixing this.



blog comments powered by Disqus