Building a less terrible URL shortener

The demise of goo.gl is a good opportunity to write about how we built a less terrible URL shortener for Wikimedia projects: w.wiki. (I actually started writing this blog post in 2016 and never got back to it, oops.)

URL shorteners are generally a bad idea for a few main reasons:

  1. They obfuscate the actual link destination, making it harder to figure out where a link will take you.
  2. If they disappear or are shut down, the link is broken, even if the destination is fully functional.
  3. They often collect extra tracking/analytics information.

But there are also legitimate reasons to want to shorten a URL, including use in printed media where it's easier for people to type a shorter URL. Or circumstances where there are restrictive character limits like tweets and IRC topics. The latter often affects non-ASCII languages even more when limits are measured in bytes instead of Unicode characters.

At the end of the day, there was still considerable demand for a URL shortener, so we figured we could provide one that was well, less terrible. Following a RfC, we adopted Tim's proposal, and a plan to avoid the aforementioned flaws:

  1. Limit shortening to Wikimedia-controlled domains, so you have a general sense of where you'd end up. (Other generic URL shorteners are banned on Wikimedia sites because they bypass our domain-based spam blocking.)
  2. Proactively provide dumps as a guarantee that if the service ever disappeared, people could still map URLs to their targets. You can find them on dumps.wikimedia.org and they're mirrored to the Internet Archive.
  3. Intentionally avoid any extra tracking and metrics collection. It is still included in Wikimedia's general webrequest logs, but there is no dedicated, extra tracking for short URLs besides what every request gets.

Anyone can create short URLs for any approved domain, subject to some rate limits and anti-abuse mechanisms via a special page or the API.

All of this is open source and usable by any MediaWiki wiki by installing the UrlShortener extension. (Since this launched, additional functionality was added to use multiple character sets and generate QR codes.)

The dumps are nice for other purposes too, I use them to provide basic statistics on how many URLs have been shortened.

I still tend to have a mildly negative opinion about people using our URL shortner, but hey, it could be worse, at least they're not using goo.gl.


Side quest: creating a "main" tool

I like Simon Willison's framing of using large language models (aka LLMs, aka "AI") to enable side quests of things you wouldn't normally do.

Could I have done this without LLM assistance? Yes, but not nearly as quickly. And this was not a task on my critical path for the day—it was a sidequest at best and honestly more of a distraction.

So, yesterday's side quest: writing a tool that checks out the default branch of a Git repository, regardless of what it's named.

Context: most of my work these days happens on GitHub, which involves creating PRs off the main branch, which means I'm frequently going back to it, via git checkout main and then usually a git pull to fast-forward the branch.

But just to make things a little more interesting, the SecureDrop server Git repository's main branch is named develop, which entirely screws with muscle memory and autocomplete. Not to mention all the older projects that still use a master branch.

For a while now I've wanted a tool that just checks out the main branch, regardless of what it's actually named, and optionally pulls it and stashes pending changes.

I asked Claude 3.5 Sonnet for exactly that:

I want a Rust program named "main" that primarily checks out the main branch of a Git repository (or master if it's called that).

I want to invoke it as:

  • main - just checkout the main branch
  • main stash - stash changes, then checkout main, then pop the stash
  • main pull - checkout main and then git pull
  • main stash pull or main pull stash - stash changes, checkout main, then pull, then pop the stash

It was mostly there, except it hardcoded the main and master branches intead of looking it up via Git. I asked:

Is there a smarter way to determine the main branch? What if it's called something other than main or master?

And it adjusted to checking git symbolic-ref refs/remotes/origin/HEAD, which I didn't know about.

I cleaned up the argument handling a little bit, added --version and published it on Salsa.

It took me about 5-10 minutes for this whole process, which according to xkcd is an efficiency positive (saves 1 second, but I do it ~5 times a day) over 5 years.

It probably would've taken me 2-3x as long without using an LLM, but honestly, I'm not sure I would've ever overcome the laziness to write something so small.

Anyways, so far I haven't really gotten around to writing about my experiences and feelings about LLMs yet, so here's literally the smallest piece of work to kick that off.


Making it easier for Toolforge tools to surface replag

A number of tools hosted on Toolforge rely on the replicated MediaWiki databases, dubbed "Wiki Replicas".

Every so often these servers have replication lag, which affects the data returned as well as the performance of the queries. And when this happens, users get confused and start reporting bugs that aren't solvable.

This actually used to be way worse during the Toolserver era (sometimes replag would be on the scale of months!), and users were well educated to the potential problems. Most tools would display a banner if there was lag and there were even bots that would update an on-wiki template every hour.

A lot of these practices have been lost since the move to Toolforge since replag has been basically zero the whole time. Now that more database maintenance is happening (yay), replag is happening slightly more often.

So to make it easier for tool authors to display replag status to users with a minimal amount of effort, I've developed a new tool: replag-embed.toolforge.org

It provides an iframe that automatically displays a small banner if there's more than 30 seconds of lag and nothing otherwise.

As an example, as I write this, the current replag for commons.wikimedia.org looks like:

The replica database (s4) is currently lagged by 1762.9987 seconds (00:29:22), you may see outdated results or slowness. See the replag tool for more details.

Of course, you can use CSS to style it differently if you'd like.

I've integrated this into my Wiki streaks tool, where the banner appears/disappears depending on what wiki you select and whether it's lagged. The actual code required to do this was pretty simple.

replag-embed is written in Rust of course, (source code) and leverages in-memory caching to quickly serve responses.

Currently I'd consider this tool to be beta quality - I think it is promising and ready for other people to give it a try, but know there are probably some kinks that need to be worked out.

The Phabricator task tracking this work is T321640; comments there would be appreciated if you try it out.


Running the ArchiveTeam Warrior under Podman

I'm finally back on an unlimited internet connection, so I've started running the ArchiveTeam Warrior once again.

The Warrior is a software application for archiving websites in a crowdsourced manner, especially when there's a time crunch when a website announces that it's closing or planning to delete things. Currently the default project is to archive public Telegram channels.

Historically the Warrior was distributed as a VirtualBox appliance, which was a bit annoying to run headlessly and was unnecessarily resource intensive because it required full virtualization. But they now have a containerized version that is pretty trivial to set up.

Relatedly, I've recently been playing with Podman's "Quadlet" functionality, which I really, really like. Instead of needing to create a systemd service to wrap running a container, you can specify what you want to run in a basically systemd-native way:

[Unit]
Description=warrior

[Container]
Image=atdr.meo.ws/archiveteam/warrior-dockerfile

PublishPort=8001:8001

Environment=DOWNLOADER=<your name>
Environment=SELECTED_PROJECT=auto
Environment=CONCURRENT_ITEMS=4

AutoUpdate=registry

[Service]
Restart=on-failure
RestartSec=30
# Extend Timeout to allow time to pull the image
TimeoutStartSec=180

[Install]
# Start by default on boot
WantedBy=multi-user.target default.target

I substituted in my username and dropped this into ~/.config/containers/systemd/warrior.container, ran systemctl --user daemon-reload and systemctl --user start warrior and it immediately started archiving! Visiting localhost:8001 should bring up the web interface.

You can then run systemctl --user cat warrior to see what the generated .service file looks like.

The AutoUpdate=registry line tells podman-auto-update to automatically fetch image updates and restart the running container. You'll likely need to enable/start the timer for this, with systemctl --user enable podman-auto-update.timer.

The one thing I haven't figured out yet is gracefully shutting down, which is important to avoid losing unfinished data. I suspect the Restart=always is harmful here, since I do want to explicitly shutdown in some cases.

P.S. I also have a infrequently updated Free bandwidth wiki page that contains other suggestions for how to use your internet connection for good.

Update (2024-07-14): I changed the restart options to Restart=on-failure and RestartSec=30, which fixes the issue with restarting immediately after a graceful shutdown and correctly restarting if it starts up before networking is ready.


Basic anti-abuse monitoring for Mastodon

Back in February, Mastodon and the connected Fediverse faced a spam attack caused by unattended instances with an open signup policy. Bots quickly registered accounts and then sent spammy messages that were relayed through the Fediverse.

It was annoying and the normal moderation tool of limiting or blocking entire instances wasn't effective since the attackers were coming from a wide set of places. Since then people have developed shared blocklists that you can subscribe to, but that has its own downsides.

So here's the tool I developed that we used for wikis.world: masto-monitor.

The workflow is straightfoward:

  • Poll the federated timeline for all public posts
  • Check them against a manually curated list of patterns
  • If there's a match, report it using the API

This allows us to have an automated process checking all incoming posts while still enabling humans to make any moderation decisions.

The code itself is pretty straightforward that it doesn't really merit much explanation. The matching logic is very very basic, it just looks for substring matches. I think the approach is worth developing further, allowing people to write more expressive rules/filters that trigger automated reports.

But, I'm not planning to do so myself since we don't currently have a need, so people are welcome to fork it to enhance it.


#wikimedia-rust Matrix to IRC bridge is back

tl;dr: You can now chat in #wikimedia-rust:matrix.org and reach folks on IRC

Nearly a year ago, the official Libera.Chat <-> Matrix bridge was shut down. There's a lot that went wrong in the technical and social operation of the bridge that the Libera.Chat staff have helpfully documented, but from a community management perspective the bridge, when it worked, was fantastic.

But, we now have a bridge back in place! Bridgebot is a deployment of matterbridge, which is primarily used in Wikimedia spaces for bridging IRC and Telegram, but it also speaks Matrix reasonably well.

Thanks to Bryan Davis for starting/maintaining the bridgebot project and Lucas Werkmeister for deploying the change; we now have a bot that relays comments between IRC and Matrix.

12:12:50 <wm-bb> [matrix] <legoktm> ooh, I think the Matrix bridging is working now
12:12:58 <legoktm> o/

And if you look at the view.matrix.org logs, those messages are there too.

So if you're interested in or working on Wikimedia-related things that are in Rust, please join us, in either IRC or Matrix or both :)