Accidentally creating a server in the wrong datacenter

By

Yesterday I was working on upgrading the servers that power Wikimedia's Docker registry (see T272550). Since these are virtual machines, I was just creating new ones and going to delete the old ones later (because VMs are cattle, not pets).

We have a handy script to create new VMs, so I ran the following command:

legoktm@cumin1001:~$ sudo cookbook sre.ganeti.makevm --vcpus 2 --memory 4 --disk 20 codfw_B registry2004.eqiad.wmnet

In this command codfw_B refers to the datacenter and row to create the VM in, and registry2004.eqiad.wmnet is the requested fully qualified domain name (FQDN).

If you're familiar with Wikimedia's datacenters, you'll notice that I created the VM in codfw (Dallas) and gave it a name as if it were in eqiad (Virginia). Oops. I only noticed right as the script finished creation. (Also the 2XXX numbering is for codfw. 1XXX servers are in eqiad.)

Normally we have a decommissioning script for deleting VMs, but when I tried running it, it failed because the VM hadn't fully been set up in Puppet yet!

Then I tried just adding it to puppet and continuing enough of the VM setup that I could delete it, except our CI correctly rejected my attempt to do so because the name was wrong! I was stuck with a half-created VM that I couldn't use nor delete.

After a quick break (it was frustrating), I read through the decom script to see if I could just do the steps manually, and realized the error was probably just a bug, so I submitted a one-line fix to allow me to delete the VM. Once it was merged and deployed, I was able to delete the VM, and actually create what I wanted to: registry2004.codfw.wmnet.

Really, we should have been able to catch this when I entered in the command, since I specified the datacenter even before the FQDN. After some discussion in Phabricator, I submitted a patch to prevent such a mismatch. Now, the operator just needs to specify the hostname, registry2004, and it will build the FQDN using the datacenter and networking configuration. Plus it'll prompt for user confirmation that it was built correctly. (For servers that use numbers afterwards, it'll check those too.)

Once this is deployed, it should be impossible for someone to repeat my mistake. Hopefully.