Accidentally creating a server in the wrong datacenter

Yesterday I was working on upgrading the servers that power Wikimedia's Docker registry (see T272550). Since these are virtual machines, I was just creating new ones and planning to delete the old ones later (because VMs are cattle, not pets).

We have a handy script to create new VMs, so I ran the following command:

legoktm@cumin1001:~$ sudo cookbook sre.ganeti.makevm --vcpus 2 --memory 4 --disk 20 codfw_B registry2004.eqiad.wmnet

In this command codfw_B refers to the datacenter and row to create the VM in, and registry2004.eqiad.wmnet is the requested fully qualified domain name (FQDN).

If you're familiar with Wikimedia's datacenters, you'll notice that I created the VM in codfw (Dallas) and gave it a name as if it were in eqiad (Virginia). Oops. I only noticed right as the script finished creation. (Also the 2XXX numbering is for codfw. 1XXX servers are in eqiad.)

Normally we have a decommissioning script for deleting VMs, but when I tried running it, it failed because the VM hadn't fully been set up in Puppet yet!

Then I tried just adding it to Puppet and continuing enough of the VM setup that I could delete it, except our CI correctly rejected my attempt to do so because the name was wrong! I was stuck with a half-created VM that I could neither use nor delete.

After a quick break (it was frustrating), I read through the decom script to see if I could just do the steps manually, and realized the error was probably just a bug, so I submitted a one-line fix to allow me to delete the VM. Once it was merged and deployed, I was able to delete the VM, and actually create what I wanted to: registry2004.codfw.wmnet.

Really, we should have been able to catch this when I entered the command, since I specified the datacenter even before the FQDN. After some discussion in Phabricator, I submitted a patch to prevent such a mismatch. Now, the operator just needs to specify the hostname, registry2004, and the script will build the FQDN using the datacenter and networking configuration, then prompt the operator to confirm it was built correctly. (For hostnames that end in the datacenter-specific numbering, it'll check that the number matches too.)
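
The idea is roughly the following. This is just a minimal sketch in Python, not the actual makevm cookbook code; the mappings, function name, and prompt are my own assumptions (and the real command also takes a row, like codfw_B, which the sketch ignores):

import re

# Sketch only; the real sre.ganeti.makevm cookbook differs.
# Assumed internal domain and numbering prefix per datacenter
# (1XXX hosts are eqiad, 2XXX hosts are codfw).
DATACENTER_DOMAINS = {"eqiad": "eqiad.wmnet", "codfw": "codfw.wmnet"}
DATACENTER_PREFIXES = {"eqiad": "1", "codfw": "2"}


def build_fqdn(hostname: str, datacenter: str) -> str:
    """Build the FQDN from a bare hostname and the chosen datacenter."""
    # If the hostname ends in digits, they must match the datacenter's range.
    match = re.search(r"(\d+)$", hostname)
    if match and not match.group(1).startswith(DATACENTER_PREFIXES[datacenter]):
        raise ValueError(f"{hostname} does not look like a {datacenter} host")
    fqdn = f"{hostname}.{DATACENTER_DOMAINS[datacenter]}"
    # Ask the operator to confirm the generated name before doing anything.
    if input(f"Create {fqdn}? [y/N] ").strip().lower() != "y":
        raise SystemExit("Aborted by operator")
    return fqdn


# build_fqdn("registry2004", "codfw") -> "registry2004.codfw.wmnet"
# build_fqdn("registry2004", "eqiad") -> raises ValueError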

Once this is deployed, it should be impossible for someone to repeat my mistake. Hopefully.


That time I broke Wikipedia, but only for vandals

As one of the top contributors to MediaWiki, the free wiki engine that powers Wikipedia, I'm often asked how I got started. To celebrate Wikipedia's 20th birthday, here's that unfortunate story.

In late 2012, I was a bored college student who was spending most of his time editing Wikipedia. I reverted a lot of vandalism, and eventually began developing anti-vandalism IRC bots to allow patrollers like myself to respond to vandalism even faster than before.

I had filed a bug report asking for the events from our anti-abuse "edit filter" to be broadcast to the realtime IRC recent changes feed (at the time, the only way to get a continuous, live feed of edits). A few months later, no one had implemented it and I was annoyed.

After complaining to a few people about this, they suggested I fix it myself. The code is all open source and I knew how to program; what could go wrong?

It's at this point I should've told someone I didn't actually know PHP; I knew plenty of Python and had just learned Java in my intro to computer science class.

I really had no clue what I was doing, but I submitted a patch that kind of looked right. I asked my friend Ori to review it, and he promptly approved the change and deployed it on the real servers that power Wikipedia.

My very broken patch

I was pretty excited: my first-ever patch had been merged and deployed! The millions of people who visited Wikipedia every day would get served a page that included my code.

I then went to go test the change and it did. not. work. I made a test edit that I knew would trigger a filter, except instead of getting a notification from the realtime feed, I saw the Wikimedia Error screen.

In fact, for about 30 minutes any wannabe vandal (and a few innocent users) who triggered a filter would see the error page:

Old Wikimedia error page
This really wasn't a sustainable way to stop vandalism

I immediately told Ori that it was broken and his reaction was along the lines of: "You didn't test it??" He had assumed I knew what I was doing and tested my code before submitting it...oops. He very quickly fixed the issue for me, and then started teaching me how to properly test my patches.

The one-line fix

He introduced me to MediaWiki-Vagrant, a then-new project to automate setting up a development instance, which is now used by a majority of MediaWiki developers (I was user #2).

There were a lot of things that went wrong in this story; several safeguards should have caught this failure before it ended up on our servers. We didn't have any automated testing or static analysis to point out that my patch was obviously flawed. We didn't do a staged rollout to a few users first before exposing all of Wikipedia to it.

This incident has stuck in my head ever since and I'm pretty confident it couldn't happen today because we've implemented those safeguards. I've spent a lot of time developing better static analysis tools (MediaWiki-CodeSniffer and phan especially) and building infrastructure to help us improve test coverage. We have proper canary deploys now, so these obvious errors should never make it to a full deployment.

It really sucked knowing that my patch had broken Wikipedia, but at the same time it was invigorating. Getting my code onto one of the biggest websites in the world was actually pretty straightforward and within reach. If I learned a bit more PHP and actually tested my code first, I could fix bugs on my own instead of waiting for someone else to do it.

I think this mentality really represents one of my favorite parts about Wikipedia: if something is broken, just fix it.


Starting a new job

Last week I officially joined the Site Reliability Engineering team at the Wikimedia Foundation. I'll be working with the Service Operations team, which "...takes care of public and 'user-visible' services."

I'm glad to be back at the WMF; I had originally started working there in 2013 but recently took a break to finish school. SRE will be my ninth distinct team at the WMF, and I'm looking forward to even more adventures.

As part of transitioning into my new role, I have unsubscribed myself from most MediaWiki bug mail and Gerrit notifications. Once I get more situated I'll put out a more detailed request for new maintainers for the components that need them. I'll continue taking care of maintenance as needed until then.

P.S.: I created a new userbox about Rust on mediawiki.org.


PGP key consolidation

Note: A signed version of this announcement can be found at https://legoktm.com/w/index.php?title=PGP/2020-12-14_key_consolidation.

I am consolidating my PGP keys to simplify key management. Previously I had a separate key for my wikimedia.org email; I have revoked that key and added that email as an identity to my main key.

I have revoked the key 6E33A1A67F4E2DF046736A0E766632234B56D2EC (legoktm at wikimedia dot org). I have pushed the revocation to the SKS Keyservers and additionally published it at https://legoktm.com/w/index.php?title=PGP/2020-12-14_revocation.

My main key, FA1E9F9A41E7F43502CA5D6352FC8E7BEDB7FCA2, now has a legoktm at wikimedia dot org identity. An updated version can be fetched from keys.openpgp.org, the SKS Keyservers, or https://legoktm.com/w/index.php?title=PGP. It should also be included in the next Debian keyring update. I took this opportunity to extend the expiry for another two years to 2022-12-14.


Legoktm, B.S.

Me, wearing my cap and gown, in front of the San Jose State University sign
Photo by Jesus Tellitud and Blue Nguyen
The Trustees of the California State University
on recommendation of the faculty of the
College of Humanities and the Arts
have conferred upon

Kunal Mehta

the degree of

Bachelor of Science

Journalism

So this makes me a scientist now, right? I used to joke that I was putting in all this work for a piece of paper, but now I'm actually very proud of this piece of paper.

I want to thank all of my family, who really kept me going and supported me no matter what.

Thanks to my professors and teachers at De Anza and San Jose State for giving me the opportunity and platform to explore and grow my love for journalism. I'm proud to be an alum of La Voz, Update News and the Spartan Daily.

Of course, the real treasure was all the friends I made along the way. But seriously, I'm so glad I met all of you, and I will treasure our relationships.

Thanks to my colleagues at the Wikimedia Foundation and the Freedom of the Press Foundation for furthering my professional development, assistance with networking, and just constant support. Also for indulging my sticker addiction.

To the IRC cabal: thanks for being the group of people I can turn to for help, no matter what time of day nor location. And for all the huggles.

I hope to celebrate in person with you all ... soon.

P.S. Here are some more photos.


Inside Scoop - Spartan Daily is a Pacemaker finalist!

Inside Scoop is a column about the operation of the Spartan Daily, San Jose State's student newspaper.

Spartan Daily is a Pacemaker finalist

It feels like forever ago, but at the end of the Spring 2019 semester, I was chosen as the next executive editor of the Spartan Daily. I definitely felt a lot of pressure to keep up the high-quality newspaper that my predecessors put out three days a week. But at the same time, I had that inner drive of "What's next?" and "How do we get even better than before?"

A few days after my selection, Marci, our design chief, and I went to Peanuts (the de facto Spartan Daily hangout restaurant) to discuss the "Pacemaker". I knew an ACP Pacemaker is pretty much the top award in college journalism (they're unofficially known as the "Pulitzer Prizes of college journalism"), but I didn't realize what exactly it took to win one. Luckily, Marci had some experience in this area: she had already won two Pacemakers as part of The Advocate during community college.

She walked me through ACP associate director Gary Lundgren's 2018 "The Pacemaker" presentation, which started off with examples from past winners. To be honest, it was intimidating. Most of those papers looked significantly better than ours and it didn't seem possible to get on their level, to the point that I didn't even make it one of my goals (and I had some pretty lofty goals).

There were some basic tips in Gary's presentation, which became the foundation of how I wanted to improve the Spartan Daily:

  1. Tell human stories.
  2. Storytelling results from verbal and visual planning.
  3. Engage your readers.
  4. Take readers behind the scenes.
  5. Use a variety of story formats.
  6. Headlines require layers of information.
  7. Storytelling images and video add realism.
  8. In-depth reporting packages have impact.
  9. Content packaging matters.
  10. Tell stories across multiple platforms.
  11. Simplicity is always the best strategy.
  12. Listen to your readers.

Trying to address all of these really required redesigning the whole story flow, from coming up with the idea, to writing a story and finally publishing it. It became a philosophical shift as we tried to take the guidance to heart.

Previously, story ideas were mostly pitched by staff writers, which allowed for a greater diversity of story ideas. But it came with a downside: staff writers often favored writing about the same topics, and there was rarely any coordination when writers did work on similar stories. And I don't think continually letting staff writers pick what they write about really fulfilled what Gary recommended.

So instead we flipped it around and had editors come up with the majority of story ideas. Aside from the benefit that editors usually came up with higher-quality ideas, it allowed for our ideas to be more cohesive, focusing on what each editor wanted in their section.

Building on that, we picked the topics for our "special" issues and content ahead of time so we could spend more time planning them out, hopefully leading to a higher-quality result. I'm pretty proud of our women's soccer special and the Fighting 'fake news' special.

Then we streamlined the editing process. Normally, stories go through three primary rounds of edits. "First edits" are done by the section editor, who has been working closely with the writer already. "Second edits" would be done by the executive or managing editor, and finally a copy editor would do "copy edits".

As a writer, I had often seen this process become repetitive and frustrating when one round of edits tried to undo what a previous one did. To avoid that, each round of edits was given specific things to look for and fix. First edits ensured the 5 W's (who, what, where, why, when) were answered, at least three sources were used, and that the story actually matched what was initially pitched. Second edits would then go into the finer details: improving the lead, improving word choices, eliminating repetition and so on. And of course copy edits would ensure compliance with AP Style and do basic fact checking. Editors were empowered to send stuff back if they came across something that should've already been fixed by that point.

And then there's all the little stuff we did to make our content crisp: we switched to ragged right for columns (differentiating opinions and news stories), used label heads (a headline with no active verb like "Period problems") for our feature stories and enforced having borders on all of our photos. We also added more regular, dependable content so we weren't rushing at the last minute to fill all the space. These aptly named "space fillers" included the crime blotter, campus images, and weekly columns.

I think we also made significant improvements in recognizing individual accomplishment while still ensuring our collective success (and failure) was treated as team successes and failures. Each week the editors selected a "Staff Writer of the Week" to highlight someone who went above and beyond. We also eliminated grading penalties that punished individual writers for making mistakes like spelling a name wrong when in reality, such a mistake could only be made if there were significant failures in the editing process as well.

But at the same time, some initiatives to strengthen our content by adding longer student profiles or featuring classes kind of faltered. At first we didn't really have a large enough staff to have writers work on much longer pieces, but once people had settled in, I really didn't push it that much. The few long-form profiles we did were good starts, and with more experience writing and editing them, they only would have gotten better.

And some changes that I tried to implement at the very beginning like weekly doubletruck features and editorials left an understandably sour taste in some editors' mouths after some early failures as we just weren't ready to handle that workload yet. Again, I didn't really push it after that.

Part of that was because, in addition to all those changes, I was really trying to follow the recommendation of simplicity. We favored less, higher-quality content rather than trying to do more and sacrificing quality. So instead of putting out a 20-page paper that was reaching to fill space, we put out 8- and 12-pagers that aimed to be more cohesive. Most of our newspapers were 6-8 pages because that seemed to be our natural limit given the quality we were striving for.

The other significant shift was planning things out ahead of time as much as possible, along with contingency plans, giving us plenty of flexibility when things inevitably didn't turn out as expected. It sounds so basic, but it really proved key for all of our special content. And having a plan meant we could jump on a plane to Colorado with less than 24 hours' notice, but that's a story for another time...

So with respect to all the ideas that I dropped early on, I don't know whether we would have been able to pull it all off or we would've doomed ourselves if we continued down that extremely ambitious path.

I think it was especially fortunate that the next executive editor, Chelsea, was my news editor and had endured the ups and downs of all of these changes. I'd like to think she also saw the value in them, often refining or continuing them throughout her semester.

All that said, when the list of 2020 Newspaper Pacemaker finalists was published, I scrolled through the list, seeing all of our typical competitors: The Daily Californian (UC Berkeley), Daily Bruin (UCLA) and Daily Trojan (USC). "Sigh, we're going to lose to all of them again," I thought.

And then I scrolled a little farther and saw the familiar face of Lindsey Graham staring back at me, from my column about the EARN IT Act.

Spartan Daily is a Pacemaker finalist

Unbelievable.

As far as I know, this is the first time the Spartan Daily has been a Pacemaker finalist. I have no expectations that we'll be named a winner at the award ceremony later today; it just really means a lot that we were even named a finalist.

As I wrote last time, "This is probably one of the most team-based awards that I've had my individual name on." This wouldn't have been possible without Victoria, my managing editor, and the rest of the editors and staff writers. And I'm always grateful for the support and, well, advice from our advisors, Richard Craig and Mike Corpos.

And then the Spartan Daily has four individual finalists too, all in design categories!

Thanks to Marci for reviewing and editing this post before publication.