Celebrating 2 years of MediaWiki codesearch


It's been a little over 2 years since I announced MediaWiki codesearch, a fully free software tool that lets people make regex searches across all the MediaWiki-related code in Gerrit and much more. While I expected it to be useful to others, I didn't anticipate how popular it would become.

My goal was to replace the usage of the proprietary GitHub search that many MediaWiki developers were using due to lack of a free software alternative, but doing so meant that it needed to be a superior product. One of the biggest complaints about searching via GitHub was that it pulled in a lot of extraneous repositories, making it hard to search just MediaWiki extensions or skins.

codesearch is based on hound, a code search engine written in Go, originally maintained by Etsy. It took me all of 10 minutes to get an initial prototype working using the upstream Docker image, but I ran into an issue pretty quickly: the repository selector didn't scale to our then-500+ Git repositories (now we're at more like 900!). So it wouldn't really be possible to just search extensions easily.

After searching around for other upstream code search engines and not having much luck finding things I liked, I went back to hound and tried running multiple instances at once instead, which more or less worked. I wrote a small ~50 line Python proxy to wrap around the different hound instances and provide a unified UI. The proxy was sketchy enough that I wrote "Please don't hurt me." in the commit message!
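To give a rough idea of what such a proxy has to do, here is a minimal sketch of the routing step: pick the hound instance for a search profile and build the upstream request URL. This is not the actual codesearch proxy. The profile names and ports are hypothetical, and the `/api/v1/search` path with `q` and `repos` parameters is my assumption about hound's JSON API.

```python
from urllib.parse import urlencode

# Hypothetical mapping of search profile -> local hound instance.
# The real profiles and ports are not given in this post.
BACKENDS = {
    "core": "http://localhost:6080",
    "extensions": "http://localhost:6081",
    "skins": "http://localhost:6082",
}


def upstream_url(profile: str, query: str) -> str:
    """Build the URL a proxy would forward a search to.

    Assumes each hound instance exposes /api/v1/search and accepts
    `q` (the regex) and `repos` ("*" meaning every repository).
    """
    try:
        base = BACKENDS[profile]
    except KeyError:
        raise ValueError(f"unknown search profile: {profile!r}") from None
    # urlencode percent-escapes the regex so it survives as a query string.
    params = urlencode({"q": query, "repos": "*"})
    return f"{base}/api/v1/search?{params}"
```

Each instance only indexes the repositories for its own profile, so the proxy never needs hound's repository selector to scale past one profile's worth of repositories.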

But it seems to have held up over time, surprisingly well. I attribute that to having systemd manage everything and the fact that hound is abandoned/unmaintained/dead upstream, creating a very stable platform, for better or worse. We've worked around most of the upstream bugs so I usually pretend it's a feature. But if it doesn't get adopted sometime this year I expect we'll create our own fork or adopt someone else's.

I recently used the anniversary to work on puppetizing codesearch so there would be even less manual maintenance work in the future. Shoutout to Daniel Zahn (mutante) for all of his help in reviewing, fixing up and merging all the puppet patches. All of the package installation, systemd units and cron jobs are now declared in puppet - it's really straightforward.

For those interested, I've documented the architecture of codesearch, and started writing more comprehensive docs on how to add a new search profile and how to add a new instance.

Here's to the next two years of MediaWiki codesearch.


Automatically rewriting links to use HTTPS on Wikimedia sites

In March 2018, Facebook began automatically rewriting links to use HTTPS using the HSTS preload list. Now all Wikimedia sites (including Wikipedia) do the same.

If you're not familiar with it, the HSTS preload list tells browsers (and other clients) that the website should only be visited over HTTPS, not the insecure HTTP.

However, not all browsers/clients support HSTS and users stuck on old versions might have outdated versions of the list.

Following Facebook's lead, we first looked into the usefulness of adding such functionality to Wikimedia sites. My analysis from July 2018 indicated that 2.87% of links on the English Wikipedia would be rewritten to use HTTPS. I repeated the analysis in July 2019 for the German Wikipedia, which indicated 2.66% of links would be rewritten.

I developed the SecureLinkFixer MediaWiki extension (source code) to do that in July 2018. We bundle a copy of the HSTS preload list (in PHP), and then add a hook to rewrite the link if it's on the list when the page is rendered.
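The rewrite itself boils down to a host lookup. The sketch below is in Python rather than the extension's PHP, and it is an illustration under stated assumptions, not SecureLinkFixer's actual code: the `PRELOAD` excerpt, the function names and the decision to skip URLs with an explicit port are all hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical excerpt of a bundled preload list:
# host -> whether the entry also covers subdomains.
PRELOAD = {
    "example.org": True,
    "static.example.net": False,
}


def is_preloaded(host: str) -> bool:
    """Return True if host, or a parent that includes subdomains, is listed."""
    host = host.lower().rstrip(".")
    if host in PRELOAD:
        return True
    labels = host.split(".")
    # Walk up the parent domains: a.b.example.org -> b.example.org -> ...
    for i in range(1, len(labels)):
        if PRELOAD.get(".".join(labels[i:])):
            return True
    return False


def secure_link(url: str) -> str:
    """Rewrite http:// to https:// when the host is on the preload list."""
    parts = urlsplit(url)
    if parts.scheme != "http" or not parts.hostname:
        return url
    try:
        has_port = parts.port is not None
    except ValueError:  # malformed port, leave the link untouched
        return url
    if has_port:
        # Assumption: leave explicit ports alone rather than guess.
        return url
    if not is_preloaded(parts.hostname):
        return url
    return parts._replace(scheme="https").geturl()
```

For example, `secure_link("http://www.example.org/")` is rewritten because `example.org` covers its subdomains, while `http://sub.static.example.net/` is left as-is.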

The HSTS preload list is pulled from mozilla-central (warning: giant page) weekly, and committed into the SecureLinkFixer repository. That update is deployed roughly every week to Wikimedia sites, where it'll take at worst a month to get through all of the caching layers.

(In the process we (thanks Michael) found a bug with Mozilla not updating the HSTS list...yay!)

By the end of July 2019 the extension was deployed to all Wikimedia sites - the delay was mostly because I didn't have time to follow up on it during the school year. Since then things have looked relatively stable from a performance perspective.

Thank you to Ori & Cyde for bringing up the idea and Reedy, Krinkle, James F & ashley for their reviews.


Inside Scoop - Week 4: One month later

Inside Scoop is a weekly column about the operation of the Spartan Daily, San Jose State's student newspaper.

It's been a month and I'm tired. Three newspapers again this week: Tuesday ("the shut it down issue"), Wednesday ("the 9/11 issue") and Thursday ("the Frog Dorm issue").

Tuesday: out at 1:31a.m.

We had late breaking news that there was a potential security flaw in the self-checkout machines used at various campus dining stores. It took a while for us to figure out exactly what the exploit was, reproduce it and report it to the correct people.

But since I spent a decent amount of time doing that, I wasn't doing the normal stuff that I do (helping with headlines, cutlines, reviewing pages). Some people ended up waiting on me, at which point I realized how much of a SPOF I had become. For some things, it's important that I'm the person who makes the decision, but for a lot of the production night questions and what not, there's no need for people to be blocked on me, especially when I'm doing other (also important) stuff.

Wednesday: out at 12:59a.m.

Goal accomplished: we got out before 1a.m. (just barely). The main thing that held us back was lack of planning around the 9/11 story and art for the front page. I had some photos of the memorials from when aismallard and I went around to Ground Zero, one of which we were able to use.

The story was a bit messy/unorganized, but that was mostly because we (editors) didn't give good story direction, and we opened it up a bit too late to give good feedback so the author could change it. So we had to do that ourselves.

We also started doing opinion pages and content as ragged right to distinguish it from the rest of the paper. So far it's gotten a good critical reception from our advisors, but we still have some implementation issues, notably consistency.

Thursday: out at 12:45a.m.

All of the English pages were done by 12:25a.m., it just took the Spanish page a bit longer.

The front story about frogs in the dorms was really fun to read and edit, but I think we missed the better story angle. Instead of talking about the individual impacts, we should have first talked about the social and community aspects of "The Frog Dorm," which we left towards the end of the story. Our goal was to do this weekly, but I'm not really sure if it's possible to top this one.

We didn't have good photos for the university scholar series, but I went to the event and watched how one of my reporters went about reporting it. That was probably one of the most valuable things to see, since I now know exactly how to help him (and hopefully others) going forward.

Also, putting the meme of the week right below an editorial about lacking mental health resources and suicide was pretty dumb from a layout perspective. Oops.

The Spanish page turned out nicely, hopefully it happens on a regular basis. And finishes earlier.


Inside Scoop - Week 3: Ready to Repeat

Inside Scoop is a weekly column about the operation of the Spartan Daily, San Jose State's student newspaper.


It was a short week, but we put out some goodish newspapers: Wednesday ("the Taco Bell issue") and Thursday ("the women's soccer issue").

Wednesday: out at 1:12a.m.

Taco Bell is back in the Student Union, and honestly, for some students, it's going to be the biggest news story of the semester. We received some feedback that we should have featured some more serious/hard news issues (the CSU legal aid story on the right column for example) rather than Taco Bell. I think we did a decent job of balancing the coverage to ensure we weren't just advertising Taco Bell: a decent amount of the story discussed people saying they preferred the previous Mexican restaurant, describing it as more authentic.

We also brought back the crime blotter, which should be a regular feature going forwards. Aside from being a good space filler, it's also given us some interesting leads on other stories to look into.

Thursday: out at 1:23a.m.

One of the projects that we had discussed doing over the summer was a special sports preview, inspired by other college publications like the Daily Orange. We picked women's soccer early on because they've been on fire lately, and won the Mountain West championship last year. I'm going to write a separate post about the process we went through to put it together, but to keep it short, I'm rather pleased with how it came out, given the constraints and challenges we faced. And we're not the only ones who think that!

But because much of my attention that night was focused on finishing the soccer special, I don't think the rest of the paper was as high quality as it could have been. Some of our stories were off target, and didn't match the original pitch or even the headline. One thing I noticed last semester is that our Opinion "counterpoints" features need to be written in response to each other to make for a good debate. These two were written independently, and totally ignore each other. That's not a problem with any of the writers, but with how we're handling the packaging.


Inside Scoop - Week 2: Long nights

Inside Scoop is a weekly column about the operation of the Spartan Daily, San Jose State's student newspaper. Yes, I'm late this week :(

We put out three papers this week: Tuesday ("the pride issue"), Wednesday ("the football team wants to win issue"), and Thursday ("the parking issue").

Tuesday: out at 1:30a.m.

We mostly tried too hard on this issue. We tried to do a cutout on the front page plus a text-wrap illustration, both of which required redesigning the front page later in the evening. Those might have worked later on in the semester, but hurt us time-wise. Also, there's not a single picture of a student on the front page.

I wrote a review of Taylor Swift's Lover that I was rather proud of. Just took me until like 3a.m. to get over my writer's block.

I also helped shoot the women's soccer game the previous Thursday. At the time I was pretty demotivated that our photos did not turn out that well, but they weren't that bad. Some were pretty good.

More importantly, I got to school at 6:30a.m. to deliver the newspaper to the various newsstands around campus. We hadn't hired our carriers yet, so some of us were still filling in on delivery duty. This gave me a decent amount of insight into how people actually consume our newspaper. We usually put teasers for some of the inside content on the bottom of the page, but then they're not visible in our newsstands. So now we're trying to keep them above the fold.

Wednesday: out at 1:13a.m.

Starting to get a bit faster, but the quality was definitely a bit lower. We messed up on a lot of small design things, like text being too close to lines, jumps being misaligned with columns, and the front page cutline having two people labeled as "(right)".

I don't think we did a great job with the front-page textbook story, mostly because we didn't talk to any professors. It's something we could/should continue to look into, but the importance to students will die down a bit because they're no longer actively purchasing them.

Also, we spelled a name wrong, and that really sucks.

Thursday: out at 1:51a.m.

Yeah, we cut it a little too close to the 2a.m. print deadline. The last four pages to be finished (1, 2, 4-5) were definitely rushed and had some major issues. Mostly it was a lack of planning, and leaving a lot of the design and layout elements until the last minute. We had some paper sketches and dummies, but we really should have thrown stuff into InDesign a lot earlier.

Even though we were doing a doubletruck/spread, the 4 different elements (campus voices, ParkStash story, ALPRs, and infographic) just seem like they were thrown onto the page with no consideration for creating a coherent design. The news packaging was good, but the layout didn't really represent that.

We really killed it in the Opinion section. A fantastic editorial cartoon that overshadowed the editorial, a column about pending abortion care legislation, perspective on gaming and violence, and some memes.

The "Spartan meme of the week" replaces the former "Spartunes" feature, really as an attempt to boost reader engagement. Spartunes involved editors picking a song that fit some theme, and then putting them all in a shared Spotify playlist. The main problem was that the only interesting part about Spartunes was which editor picked which song - and it's only interesting if you know the editors themselves. And given that most people don't, it's not very interesting.

In comparison, people seem to enjoy the memes of the week, and we've already had at least one student submit memes for consideration. That's more reader engagement than Spartunes can claim.

I also delivered the newspaper again this morning. It was fun, and I hope to never do it again. I'm just not a morning person.


Inside Scoop - Week 1: Welcome Back

Inside Scoop is a weekly column about the operation of the Spartan Daily, San Jose State's student newspaper.

It's been a minute, but school started on Wednesday, and we had to put out another issue (10MB PDF). It turned out pretty nicely I think. We started working on it a few weeks beforehand, assigning stories, and then beginning to meet up and lay out the pages starting a week beforehand (at least for the people who were already in town).

It was a true collaborative effort, with basically everyone on the editorial board contributing to the paper in one way or another, something I was really proud of. All the stories were written by editors, which was nice from the editing point of view (since they were well written to begin with), but it caused a few problems from the layout/production side, since the editors also had to focus on editing their own stories rather than just focusing on the pages.

For the first issue, we tried to make a "Welcome Guide" of sorts, partially to attract new students as readers. I don't think it really worked for the latter part, but we have a lot of work to do in that area. We're working on moving newsstands around to be in more convenient locations for readers to pick up, as well as looking into doing some focus testing to see what students are looking for.

The other main issue we've been dealing with is the rollout of our new website on https://sjsunews.com/. The architecture is bizarre to say the least - it's an AngularJS frontend calling out to a Drupal backend, where we enter content. I'm not really sure why that architecture was picked versus just creating a Drupal skin, but I also entered the process pretty late. I usually don't disclose my software background to people, to avoid all of the "oh can you help with this computer issue" type questions or getting the responsibility of the website thrust upon me, but in this case I wish I had earlier. ¯\_(ツ)_/¯

There are a decent number of features missing but those are being worked on, I hope. Functionally, it's a pretty big regression from our other outdated website, but the design will be nice once everything is finally rolled out.

As a sidenote, I also figured out why the old website was so slow to post articles - it used to take 10-15 minutes for them to show up. The old website was effectively a static site generator, and one of the sidebar items was "Recent articles". So whenever a new article was published, it would have to update every single generated page...which was over 30,000 of them. Yeah, not surprised that it took 10 minutes to publish a story.
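As a back-of-the-envelope check on those numbers, the snippet below works out how long the generator would have spent per page if a full rebuild of 30,000 pages took 10-15 minutes. The function name is made up for illustration, and it assumes pages were regenerated one after another at a roughly constant rate, which the post doesn't confirm.

```python
def seconds_per_page(total_minutes: float, pages: int) -> float:
    """Average time spent regenerating one page during a full rebuild."""
    return total_minutes * 60 / pages


PAGES = 30_000  # "over 30,000" generated pages, per the post

for minutes in (10, 15):
    per_page_ms = seconds_per_page(minutes, PAGES) * 1000
    print(f"{minutes} min rebuild: about {per_page_ms:.0f} ms per page")
```

So each individual page was cheap to generate, at a few tens of milliseconds. The slowness came from a shared sidebar forcing every page to be rebuilt on every publish.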

All of the editors this semester are going to be writing a column or blog or some regular feature thing. So this is going to be mine - an inside look at the production of the Spartan Daily.